Data Science in Theory and Practice. Maria Cristina Mariani

The sample correlation coefficient is a measure of the linear association between two variables and does not depend on the units of measurement: when the coefficient is constructed, the units of measurement cancel out. The sample correlation matrix is analogous to the covariance matrix, with correlations in place of covariances:

\[
\mathbf{R} = (r_{ik}) =
\begin{pmatrix}
1 & r_{12} & \cdots & r_{1p} \\
r_{21} & 1 & \cdots & r_{2p} \\
\vdots & \vdots & & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{pmatrix}. \tag{3.8}
\]

      The population correlation matrix similar to (3.8) is defined as follows:

\[
\mathbf{P} = (\rho_{ik}) =
\begin{pmatrix}
1 & \rho_{12} & \cdots & \rho_{1p} \\
\rho_{21} & 1 & \cdots & \rho_{2p} \\
\vdots & \vdots & & \vdots \\
\rho_{p1} & \rho_{p2} & \cdots & 1
\end{pmatrix}, \tag{3.9}
\]

      where

\[
\rho_{ik} = \frac{\sigma_{ik}}{\sqrt{\sigma_{ii}}\,\sqrt{\sigma_{kk}}}.
\]
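      The definition of $\rho_{ik}$ above can be applied entry by entry to any covariance matrix. The following is a minimal sketch in Python with NumPy, not the book's own code; the function name `cov_to_corr` and the 2 x 2 matrix `sigma` are hypothetical and chosen only for illustration.

```python
import numpy as np

def cov_to_corr(sigma):
    """Convert a covariance matrix into the corresponding correlation matrix."""
    sigma = np.asarray(sigma, dtype=float)
    std = np.sqrt(np.diag(sigma))        # sqrt(sigma_ii) for each variable
    # Entry (i, k) becomes sigma_ik / (sqrt(sigma_ii) * sqrt(sigma_kk)).
    return sigma / np.outer(std, std)

# Hypothetical covariance matrix (an assumption, not data from the text).
sigma = np.array([[4.0, 2.0],
                  [2.0, 9.0]])
rho = cov_to_corr(sigma)
print(rho)
```

      For this matrix the off-diagonal entry is $2 / (2 \cdot 3) = 1/3$, and the diagonal entries are 1, as in (3.9).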

      1. The value of the sample correlation $r_{ik}$ must lie between $-1$ and $+1$, inclusive. A value of $+1$ indicates a perfect positive linear relationship, and a value of $-1$ indicates a perfect inverse linear relationship.

      2. The sample correlation measures the strength of the linear association between two variables. If $r_{ik}$ equals zero, there is no linear association between the components. Otherwise, the sign of $r_{ik}$ indicates the direction of the association. If $r_{ik}$ is positive, then as one variable gets larger, the other tends to get larger. If $r_{ik}$ is negative, then as one variable gets larger, the other tends to get smaller (often called an "inverse" correlation). A larger value of $|r_{ik}|$ implies a stronger linear association. In the extreme case $r_{ik} = -1$, the two variables move in exactly opposite directions: if one variable increases, the other decreases in exact linear proportion (and vice versa).
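      These properties can be checked numerically. The sketch below uses made-up measurements (the arrays `dollars` and `items` are assumptions, not data from the text) to illustrate that rescaling the units leaves the correlation unchanged, that the value stays within $[-1, 1]$, and that an exact decreasing linear relationship gives $-1$.

```python
import numpy as np

# Hypothetical measurements, chosen only for illustration.
dollars = np.array([12.0, 30.0, 25.0, 41.0, 18.0])
items = np.array([1.0, 3.0, 2.0, 4.0, 2.0])

# np.corrcoef returns the 2 x 2 correlation matrix; entry [0, 1] is r_12.
r_dollars = np.corrcoef(dollars, items)[0, 1]

# Same data expressed in cents: the units cancel, so r is unchanged.
r_cents = np.corrcoef(100.0 * dollars, items)[0, 1]

# An exact decreasing linear function of `dollars` gives r = -1.
r_perfect_inverse = np.corrcoef(dollars, -2.0 * dollars + 5.0)[0, 1]

print(r_dollars, r_cents, r_perfect_inverse)
```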

      Example 3.4 Consider the following data matrix introduced in Example 3.1:

\[
\mathbf{X} =
\begin{pmatrix}
48 & 3 \\
22 & 1 \\
50 & 2
\end{pmatrix}.
\]

      Each receipt yields a pair of measurements: total dollar sales and number of movies sold. We find the sample correlation matrix $\mathbf{R}$ as follows:

\[
\begin{aligned}
r_{12} &= \frac{s_{12}}{\sqrt{s_{11}}\,\sqrt{s_{22}}} \\
       &= \frac{13}{\sqrt{244}\,\sqrt{1}} \approx 0.8322, \\
r_{21} &= r_{12}.
\end{aligned}
\]

      Therefore,

\[
\mathbf{R} =
\begin{pmatrix}
1 & 0.832 \\
0.832 & 1
\end{pmatrix}.
\]

      In this example, we observe that the variables $x_1$ and $x_2$ are highly positively correlated, since $r_{12} = 0.832$. This implies that as dollar sales ($x_1$) increase, the number of movies sold ($x_2$) also tends to increase.
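      The arithmetic of Example 3.4 can be reproduced with NumPy. This is a sketch under one stated assumption: the sample covariances use the divisor $n - 1$ (NumPy's default), which is the choice that gives $s_{11} = 244$, $s_{22} = 1$, and $s_{12} = 13$ for this data matrix.

```python
import numpy as np

# Data matrix from Example 3.4: one row per receipt,
# columns are (dollar sales, number of movies sold).
X = np.array([[48.0, 3.0],
              [22.0, 1.0],
              [50.0, 2.0]])

# Sample covariance matrix; rowvar=False treats columns as variables.
# The default divisor is n - 1 (assumed to match the text).
S = np.cov(X, rowvar=False)

# r_12 = s_12 / (sqrt(s_11) * sqrt(s_22)), computed by hand from S.
r12 = S[0, 1] / (np.sqrt(S[0, 0]) * np.sqrt(S[1, 1]))

# The full sample correlation matrix R, computed directly.
R = np.corrcoef(X, rowvar=False)

print(S)
print(np.round(R, 3))
```

      The hand-computed `r12` and the off-diagonal entry of `R` agree, and both round to 0.832.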

      Most often, we are interested in linear combinations of the variables $x_1, x_2, \ldots, x_p$. In this section, we investigate the means, variances, and covariances of linear combinations.

\[
z = a_1 x_1 + a_2 x_2 + \cdots + a_p x_p = \mathbf{a}^T \mathbf{X}, \tag{3.10}
\]

      where $\mathbf{a}^T = (a_1, a_2, \ldots, a_p)$. If the same coefficient vector $\mathbf{a}$ is applied to each $\mathbf{x}_i$ in a sample, we have

\[
z_i = a_1 x_{i1} + a_2 x_{i2} + \cdots + a_p x_{ip} = \mathbf{a}^T \mathbf{x}_i, \quad i = 1, 2, \ldots, n. \tag{3.11}
\]

      For example, if $i = 1$, we have

\[
\begin{aligned}
z_1 &= \mathbf{a}^T \mathbf{x}_1 \\
    &= (a_1, a_2, \ldots, a_p)
\begin{pmatrix}
x_{11} \\
x_{12} \\
\vdots \\
x_{1p}
\end{pmatrix}.
\end{aligned}
\]
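      Applying (3.11) to every observation at once amounts to a single matrix-vector product. The sketch below reuses the data matrix of Example 3.4; the coefficient vector `a` is hypothetical and chosen only for illustration.

```python
import numpy as np

# Data matrix from Example 3.4 (rows are the observations x_i).
X = np.array([[48.0, 3.0],
              [22.0, 1.0],
              [50.0, 2.0]])

# Hypothetical coefficient vector a = (a_1, a_2); an assumption, not from the text.
a = np.array([2.0, -1.0])

# z_i = a^T x_i for i = 1, ..., n, computed for all rows in one product.
z = X @ a

# The first value on its own, matching the i = 1 case written out above.
z1 = a @ X[0]

print(z)   # one value of z per receipt
print(z1)
```

      With these coefficients, $z_1 = 2 \cdot 48 - 1 \cdot 3 = 93$, and the remaining values follow in the same way from the other rows.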

      3.6.1 Linear Combinations of Sample Means

      The sample mean of