The Big R-Book. Philippe J. S. De Brouwer

When there is more than one variable, it is useful to understand what the interdependencies of the variables are. For example, when measuring the size of people's hands and their height, one can expect that people with larger hands are on average taller than people with smaller hands. Hand size and height are positively correlated.

      The basic measure for linear interdependence is covariance, defined as

$$\operatorname{cov}(X, Y) = E\big[(X - E[X])\,(Y - E[Y])\big]$$
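      The sample version of the covariance can be computed by hand and compared with R's built-in cov(). The following is a minimal sketch: the variable names are illustrative, and it assumes the n - 1 denominator of the sample estimator, which is what cov() uses.

```r
# Sample covariance of horsepower and weight in mtcars, computed by hand.
x <- mtcars$hp
y <- mtcars$wt
n <- length(x)

# Sum of the products of deviations from the means, divided by n - 1.
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Should agree with the built-in function up to floating-point error.
all.equal(cov_manual, cov(x, y))
```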

      8.3.1 The Pearson Correlation

      An important metric for a linear relationship is the Pearson correlation coefficient ρ.


       Definition: Pearson Correlation Coefficient

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

      cor(mtcars$hp, mtcars$wt)
      ## [1] 0.6587479
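      The Pearson coefficient is the covariance scaled by the two standard deviations. As a small sketch with illustrative variable names, the value returned by cor() can be reproduced from cov() and sd():

```r
# Pearson correlation rebuilt from covariance and standard deviations.
x <- mtcars$hp
y <- mtcars$wt

rho_manual <- cov(x, y) / (sd(x) * sd(y))

# Should match cor() up to floating-point error.
all.equal(rho_manual, cor(x, y))
```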


      Of course, we also have functions that provide the covariance matrix and functions that convert one into the other.

      d <- data.frame(mpg = mtcars$mpg, wt = mtcars$wt, hp = mtcars$hp)

      # Note that we can feed a whole data frame to these functions.
      var(d)
      ##             mpg        wt         hp
      ## mpg   36.324103 -5.116685 -320.73206
      ## wt    -5.116685  0.957379   44.19266
      ## hp  -320.732056 44.192661 4700.86694

      cov(d)
      ##             mpg        wt         hp
      ## mpg   36.324103 -5.116685 -320.73206
      ## wt    -5.116685  0.957379   44.19266
      ## hp  -320.732056 44.192661 4700.86694

      cor(d)
      ##            mpg         wt         hp
      ## mpg  1.0000000 -0.8676594 -0.7761684
      ## wt  -0.8676594  1.0000000  0.6587479
      ## hp  -0.7761684  0.6587479  1.0000000


      cov2cor(cov(d))
      ##            mpg         wt         hp
      ## mpg  1.0000000 -0.8676594 -0.7761684
      ## wt  -0.8676594  1.0000000  0.6587479
      ## hp  -0.7761684  0.6587479  1.0000000
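      To see what this conversion amounts to, each covariance can be divided by the product of the two corresponding standard deviations. The sketch below does this by hand. The names S, s and R_manual are illustrative, and this is not necessarily how cov2cor() is implemented internally.

```r
# Convert a covariance matrix to a correlation matrix by hand.
d <- data.frame(mpg = mtcars$mpg, wt = mtcars$wt, hp = mtcars$hp)

S <- cov(d)          # covariance matrix
s <- sqrt(diag(S))   # standard deviations (square roots of the variances)

# Divide entry (i, j) by sd_i * sd_j.
R_manual <- S / outer(s, s)

# Should agree with both cov2cor() and cor().
all.equal(R_manual, cov2cor(S))
all.equal(R_manual, cor(d))
```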


      8.3.2 The Spearman Correlation

      x  <- c(-10:10)
      df <- data.frame(x = x, x_sq = x^2, x_abs = abs(x), x_exp = exp(x))
      cor(df)
      ##              x      x_sq     x_abs     x_exp
      ## x     1.000000 0.0000000 0.0000000 0.5271730
      ## x_sq  0.000000 1.0000000 0.9671773 0.5491490
      ## x_abs 0.000000 0.9671773 1.0000000 0.4663645
      ## x_exp 0.527173 0.5491490 0.4663645 1.0000000

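      The correlation between x and x^2 is zero, and the correlation between x and exp(x) is a meagre 0.527173, even though both are fully determined by x. The Pearson correlation only captures linear relationships.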


      The Spearman correlation is the Pearson correlation applied to the ranks of the data. It is one if an increase in the variable X is always accompanied by an increase in the variable Y.

      cor(rank(df$x), rank(df$x_exp))
      ## [1] 1

      The Spearman correlation checks for a relationship that can be more general than a linear one: it equals one whenever Y increases every time X increases, whatever the shape of the curve.
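      Instead of ranking the data by hand, the method argument of cor() can be set to "spearman". The short sketch below, with illustrative variable names, checks that both routes give the same result for x and exp(x):

```r
# Spearman correlation: built-in method versus Pearson on the ranks.
x <- -10:10
y <- exp(x)

rho_builtin <- cor(x, y, method = "spearman")
rho_ranks   <- cor(rank(x), rank(y))

# Both approaches should agree.
all.equal(rho_builtin, rho_ranks)
```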

      Question #10

      Consider the vectors

      1. x = c(1, 2, 33, 44) and y = c(22, 23, 100, 200),

      2. x = c(1 : 10) and y = 2 * x,

      3. x = c(1 : 10) and y = exp(x).

      Plot y as a function of x. What is their Pearson correlation? What is their Spearman correlation? How do you explain the difference?

      Warning – Correlation is more specific than relation

      Not even the Spearman correlation will discover all types of dependencies. Consider the example above with x^2.

      x <- c(-10:10)
      cor(rank(x), rank(x^2))
      ## [1] 0

      8.3.3 Chi-square Tests


      For example, we can build a dataset with observations on people's ice-cream buying patterns and try to relate the gender of a person to the flavour of ice-cream they prefer. If an association is found, we can plan an appropriate stock of flavours from the number of visitors of each gender.

      Chi-Square test in R

      Usage of chisq.test()

      chisq.test(data)

      where data is a table (a contingency table) containing the counts for the variables.

      For example, we can use the mtcars dataset, which is part of the datasets package and is normally available as soon as R starts.

      # We use the dataset mtcars from the datasets package.
      # Note: chisq.test() treats this data frame as a matrix of counts.
      df <- data.frame(mtcars$cyl, mtcars$am)
      chisq.test(df)
      ## Warning in chisq.test(df): Chi-squared approximation may be incorrect
      ##
      ##  Pearson's Chi-squared test
      ##
      ## data:  df
      ## X-squared = 25.077, df = 31, p-value = 0.7643


      The chi-square test reports a p-value: the probability, assuming the two variables are independent, of observing a test statistic at least as extreme as the one computed. By convention, a p-value below 0.05 is taken as evidence of a significant association. In this example, the p-value is higher than 0.05, so the test finds no significant association. The warning also signals that the chi-squared approximation may be unreliable here.
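      The usage description above asks for a table of counts. As a hedged alternative to passing the raw data frame, the sketch below first cross-tabulates the two variables with table() and then tests that contingency table. This is not the book's own code: counts and res are illustrative names, and suppressWarnings() is used only to silence the small-sample warning.

```r
# Cross-tabulate cylinders against transmission type (counts per cell).
counts <- table(cyl = mtcars$cyl, am = mtcars$am)
counts

# Chi-square test of independence on the contingency table.
# Some expected counts are small, so R warns about the approximation.
res <- suppressWarnings(chisq.test(counts))

res$parameter  # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value    # p-value of the test
res$expected   # expected counts under independence
```

      With three cylinder classes and two transmission types, the test has (3 - 1) * (2 - 1) = 2 degrees of freedom, rather than the 31 obtained when every row of the data frame is treated as a separate category.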

      The names of the functions related to statistical distributions in R are composed of two parts: the first letter indicates the type of function (listed below) and the remainder is the name of the distribution.

       d: The pdf (probability density function)

       p: The cdf (cumulative distribution function)

       q: The quantile function

       r: The random number generator.
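      As an illustration of this naming scheme, the sketch below uses the normal distribution, whose name in R is norm, so the four functions are dnorm(), pnorm(), qnorm() and rnorm(). The variable names and the seed are arbitrary choices.

```r
# d: density of the standard normal distribution at 0
dens <- dnorm(0)

# p: cumulative probability P(X <= 1.96)
prob <- pnorm(1.96)

# q: quantile function, the inverse of the cdf
quant <- qnorm(prob)

# r: random numbers; the seed makes the draws reproducible
set.seed(1)
draws <- rnorm(3)

dens; prob; quant; draws
```

      The same prefixes apply to the other distributions, for example dbinom() and pbinom() for the binomial or runif() for the uniform distribution.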
