The Big R-Book. Philippe J. S. De Brouwer
is more than one variable, it is useful to understand the interdependencies between the variables. For example, when measuring the size of people's hands and their height, one can expect that people with larger hands are on average taller than people with smaller hands: hand size and height are positively correlated.
The basic measure for linear interdependence is the covariance, defined as

cov(X, Y) = E[(X − E[X]) (Y − E[Y])]
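As a quick sanity check, the sample covariance can be computed directly from this definition and compared to R's cov() function (a sketch with two small made-up vectors):

# Sample covariance computed from its definition, compared to cov().
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)
sum((x - mean(x)) * (y - mean(y))) / (n - 1)
## [1] 1.5
cov(x, y)
## [1] 1.5

Note that R uses the sample covariance (dividing by n − 1, not n).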
8.3.1 The Pearson Correlation
An important metric for a linear relationship is the Pearson correlation coefficient ρ.
Definition: Pearson Correlation Coefficient

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)
cor(mtcars$hp, mtcars$wt)
## [1] 0.6587479
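Since the Pearson coefficient is simply the covariance scaled by the two standard deviations, we can verify the definition by computing it by hand:

# Pearson correlation is the covariance divided by the product
# of the two standard deviations.
cov(mtcars$hp, mtcars$wt) / (sd(mtcars$hp) * sd(mtcars$wt))
## [1] 0.6587479
cor(mtcars$hp, mtcars$wt)
## [1] 0.6587479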
Of course, R also provides functions that return the covariance matrix, as well as functions that convert one into the other.
d <- data.frame(mpg = mtcars$mpg, wt = mtcars$wt, hp = mtcars$hp)
# Note that we can feed a whole data frame into these functions.
var(d)
##             mpg        wt         hp
## mpg   36.324103 -5.116685 -320.73206
## wt    -5.116685  0.957379   44.19266
## hp  -320.732056 44.192661 4700.86694
cov(d)
##             mpg        wt         hp
## mpg   36.324103 -5.116685 -320.73206
## wt    -5.116685  0.957379   44.19266
## hp  -320.732056 44.192661 4700.86694
cor(d)
##            mpg         wt         hp
## mpg  1.0000000 -0.8676594 -0.7761684
## wt  -0.8676594  1.0000000  0.6587479
## hp  -0.7761684  0.6587479  1.0000000
cov2cor(cov(d))
##            mpg         wt         hp
## mpg  1.0000000 -0.8676594 -0.7761684
## wt  -0.8676594  1.0000000  0.6587479
## hp  -0.7761684  0.6587479  1.0000000
8.3.2 The Spearman Correlation
The correlation measure, as defined in the previous section, actually tests for a linear relationship. This means that even a strong non-linear relationship can go undetected.
x <- c(-10:10)
df <- data.frame(x = x, x_sq = x^2, x_abs = abs(x), x_exp = exp(x))
cor(df)
##              x      x_sq     x_abs     x_exp
## x     1.000000 0.0000000 0.0000000 0.5271730
## x_sq  0.000000 1.0000000 0.9671773 0.5491490
## x_abs 0.000000 0.9671773 1.0000000 0.4663645
## x_exp 0.527173 0.5491490 0.4663645 1.0000000
The correlation between x and x^2 is zero, and the correlation between x and exp(x) is a meagre 0.527173.
The Spearman correlation is the Pearson correlation applied to the ranks of the data. It is one if an increase in the variable X is always accompanied by an increase in the variable Y.
cor(rank(df$x), rank(df$x_exp))
## [1] 1
The Spearman correlation checks for a relationship that can be more general than a linear one: it will be one whenever Y increases as X increases.
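Instead of ranking the data explicitly, cor() can compute the Spearman correlation directly via its method argument:

# The method argument of cor() avoids calling rank() explicitly.
cor(df$x, df$x_exp, method = "spearman")
## [1] 1
# It also works on a whole data frame:
cor(df, method = "spearman")

The default is method = "pearson"; "kendall" is also available.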
Consider the vectors
1. x = c(1, 2, 33, 44) and y = c(22, 23, 100, 200),
2. x = c(1:10) and y = 2 * x,
3. x = c(1:10) and y = exp(x).
Plot y as a function of x. What is their Pearson correlation? What is their Spearman correlation? How do you explain the difference?
Not even the Spearman correlation will discover all types of dependencies. Consider the example above with x^2.
x <- c(-10:10)
cor(rank(x), rank(x^2))
## [1] 0
8.3.3 Chi-square Tests
The chi-square test is a statistical method to determine whether two categorical variables are significantly associated. Both variables should come from the same population, and they should be categorical, such as "Yes/No," "Male/Female," or "Red/Amber/Green."
For example, we can build a dataset with observations of people's ice-cream buying patterns and try to relate the gender of a person to the flavour of ice-cream they prefer. If an association is found, we can plan an appropriate stock of flavours based on the gender distribution of the people visiting.
Chi-Square test in R
The function chisq.test() is used as follows:

chisq.test(data)

where data is a table containing the counts of each combination of the two variables (a contingency table).
For example, we can use the mtcars dataset from the datasets package, which is loaded by default when R is initialised.
# We use the dataset mtcars (part of the datasets package).
df <- data.frame(mtcars$cyl, mtcars$am)
chisq.test(df)
## Warning in chisq.test(df): Chi-squared approximation may be incorrect
##
##  Pearson's Chi-squared test
##
## data:  df
## X-squared = 25.077, df = 31, p-value = 0.7643
The chi-square test reports a p-value: the probability of observing an association at least as strong as the one in the data if the two variables were actually independent. By convention, a p-value below 0.05 is considered significant. In this example, the p-value is higher than 0.05, so no significant association is found.
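A word of caution: chisq.test() expects a table of counts. Passing the raw data frame above makes R treat the 32 rows as a 32 x 2 contingency table, which explains both the warning and the odd df = 31. A sketch of the more conventional usage builds the contingency table first with table():

# Build a contingency table of counts, then test for independence.
tbl <- table(mtcars$cyl, mtcars$am)
tbl
##      0  1
##   4  3  8
##   6  4  3
##   8 12  2
chisq.test(tbl)  # df = (3 - 1) * (2 - 1) = 2

On this table the p-value is well below 0.05, suggesting that the number of cylinders and the transmission type are not independent (the approximation warning may still appear, because some expected counts are small).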
8.4 Distributions
R is a statistical language, and most work in R involves statistics. Therefore, we introduce the reader to how statistical distributions are implemented in R and how they can be used.
The names of the functions related to statistical distributions in R consist of two parts: the first letter indicates the type of function (see below) and the remainder is the (abbreviated) name of the distribution.
d: the pdf (probability density function)
p: the cdf (cumulative distribution function)
q: the quantile function (the inverse of the cdf)
r: the random number generator
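For example, for the normal distribution (abbreviated norm) the four prefixes yield dnorm(), pnorm(), qnorm(), and rnorm():

# The four distribution functions for the standard normal.
dnorm(0)       # pdf at 0
## [1] 0.3989423
pnorm(1.96)    # cdf: P(X <= 1.96)
## [1] 0.9750021
qnorm(0.975)   # quantile function: the inverse of pnorm
## [1] 1.959964
set.seed(42)   # fix the seed so the random draws are reproducible
rnorm(3)       # three random draws

Note that qnorm(pnorm(x)) returns x: the quantile function and the cdf are each other's inverse.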