The Big R-Book. Philippe J. S. De Brouwer
is more than one variable, it is useful to understand the interdependencies between the variables. For example, when measuring the size of people's hands and their height, one can expect that people with larger hands are on average taller than people with smaller hands: hand size and height are positively correlated.
The basic measure for linear interdependence is the covariance, defined as

cov(X, Y) = E[(X − E[X]) (Y − E[Y])]
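As a quick sanity check, the sample covariance can be computed directly from this definition and compared to R's cov() function (a sketch with two small made-up vectors):

# Sample covariance computed from its definition, compared to cov().
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)
sum((x - mean(x)) * (y - mean(y))) / (n - 1)
## [1] 1.5
cov(x, y)
## [1] 1.5

Note that R uses the sample covariance (dividing by n − 1, not n).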
8.3.1 The Pearson Correlation
An important metric for a linear relationship is the Pearson correlation coefficient ρ.
Definition: Pearson Correlation Coefficient

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)
cor(mtcars$hp, mtcars$wt)
## [1] 0.6587479
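Since the Pearson coefficient is simply the covariance scaled by the two standard deviations, we can verify the definition by computing it by hand:

# Pearson correlation is the covariance divided by the product
# of the two standard deviations.
cov(mtcars$hp, mtcars$wt) / (sd(mtcars$hp) * sd(mtcars$wt))
## [1] 0.6587479
cor(mtcars$hp, mtcars$wt)
## [1] 0.6587479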
Of course, R also provides functions that return the covariance matrix, as well as functions that convert one into the other.
d <- data.frame(mpg = mtcars$mpg, wt = mtcars$wt, hp = mtcars$hp)
# Note that we can feed a whole data frame into these functions.
var(d)
##             mpg        wt         hp
## mpg   36.324103 -5.116685 -320.73206
## wt    -5.116685  0.957379   44.19266
## hp  -320.732056 44.192661 4700.86694
cov(d)
##             mpg        wt         hp
## mpg   36.324103 -5.116685 -320.73206
## wt    -5.116685  0.957379   44.19266
## hp  -320.732056 44.192661 4700.86694
cor(d)
##            mpg         wt         hp
## mpg  1.0000000 -0.8676594 -0.7761684
## wt  -0.8676594  1.0000000  0.6587479
## hp  -0.7761684  0.6587479  1.0000000
cov2cor(cov(d))
##            mpg         wt         hp
## mpg  1.0000000 -0.8676594 -0.7761684
## wt  -0.8676594  1.0000000  0.6587479
## hp  -0.7761684  0.6587479  1.0000000
8.3.2 The Spearman Correlation
The correlation measure, as defined in the previous section, actually tests for a linear relationship. This means that even a strong non-linear relationship can go undetected.
x <- c(-10:10)
df <- data.frame(x = x, x_sq = x^2, x_abs = abs(x), x_exp = exp(x))
cor(df)
##              x      x_sq     x_abs     x_exp
## x     1.000000 0.0000000 0.0000000 0.5271730
## x_sq  0.000000 1.0000000 0.9671773 0.5491490
## x_abs 0.000000 0.9671773 1.0000000 0.4663645
## x_exp 0.527173 0.5491490 0.4663645 1.0000000
The correlation between x and x^2 is zero, and the correlation between x and exp(x) is a meagre 0.527173.
The Spearman correlation is the Pearson correlation applied to the ranks of the data. It is one if an increase in the variable X is always accompanied by an increase in the variable Y.
cor(rank(df$x), rank(df$x_exp))
## [1] 1
The Spearman correlation checks for a relationship that can be more general than a linear one: it will be one whenever Y increases as X increases.
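Instead of ranking the data explicitly, cor() can compute the Spearman correlation directly via its method argument:

# The method argument of cor() avoids calling rank() explicitly.
cor(df$x, df$x_exp, method = "spearman")
## [1] 1
# It also works on a whole data frame:
cor(df, method = "spearman")

The default is method = "pearson"; "kendall" is also available.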
Consider the vectors
1. x = c(1, 2, 33, 44) and y = c(22, 23, 100, 200),
2. x = c(1:10) and y = 2 * x,
3. x = c(1:10) and y = exp(x).
Plot y as a function of x. What is their Pearson correlation? What is their Spearman correlation? How do you explain the difference?
Not even the Spearman correlation will discover all types of dependencies. Consider the example above with x^2.
x <- c(-10:10)
cor(rank(x), rank(x^2))
## [1] 0
8.3.3 Chi-square Tests
The chi-square test is a statistical method to determine whether two categorical variables are significantly associated. Both variables should come from the same population, and they should be categorical, such as "Yes/No," "Male/Female," or "Red/Amber/Green."
For example, we can build a dataset with observations of people's ice-cream buying patterns and try to relate the gender of a person to the flavour of ice-cream they prefer. If an association is found, we can plan an appropriate stock of flavours based on the gender distribution of the people visiting.
Chi-Square test in R
The function chisq.test() is used as follows:

chisq.test(data)

where data is a table containing the counts of each combination of the two variables (a contingency table).
For example, we can use the mtcars dataset from the datasets package, which is loaded by default when R is initialised.
# We use the dataset mtcars (part of the datasets package).
df <- data.frame(mtcars$cyl, mtcars$am)
chisq.test(df)
## Warning in chisq.test(df): Chi-squared approximation may be incorrect
##
##  Pearson's Chi-squared test
##
## data:  df
## X-squared = 25.077, df = 31, p-value = 0.7643
The chi-square test reports a p-value: the probability of observing an association at least as strong as the one in the data if the two variables were actually independent. By convention, a p-value below 0.05 is considered significant. In this example, the p-value is higher than 0.05, so no significant association is found.
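A word of caution: chisq.test() expects a table of counts. Passing the raw data frame above makes R treat the 32 rows as a 32 x 2 contingency table, which explains both the warning and the odd df = 31. A sketch of the more conventional usage builds the contingency table first with table():

# Build a contingency table of counts, then test for independence.
tbl <- table(mtcars$cyl, mtcars$am)
tbl
##      0  1
##   4  3  8
##   6  4  3
##   8 12  2
chisq.test(tbl)  # df = (3 - 1) * (2 - 1) = 2

On this table the p-value is well below 0.05, suggesting that the number of cylinders and the transmission type are not independent (the approximation warning may still appear, because some expected counts are small).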
8.4 Distributions
R is a statistical language, and most work in R involves statistics. Therefore, we introduce the reader to how statistical distributions are implemented in R and how they can be used.
The names of the functions related to statistical distributions in R consist of two parts: the first letter indicates the type of function (see below) and the remainder is the (abbreviated) name of the distribution.
d: the pdf (probability density function)
p: the cdf (cumulative distribution function)
q: the quantile function (the inverse of the cdf)
r: the random number generator
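For example, for the normal distribution (abbreviated norm) the four prefixes yield dnorm(), pnorm(), qnorm(), and rnorm():

# The four distribution functions for the standard normal.
dnorm(0)       # pdf at 0
## [1] 0.3989423
pnorm(1.96)    # cdf: P(X <= 1.96)
## [1] 0.9750021
qnorm(0.975)   # quantile function: the inverse of pnorm
## [1] 1.959964
set.seed(42)   # fix the seed so the random draws are reproducible
rnorm(3)       # three random draws

Note that qnorm(pnorm(x)) returns x: the quantile function and the cdf are each other's inverse.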