section A and Mary is a student in section B. On the final exam for the course, John receives a raw score of 80 out of 100 (i.e., 80%). Mary, on the other hand, earns a score of 70 out of 100 (i.e., 70%). At first glance, it may appear that John was more successful on his final exam. However, scores, considered absolutely, do not allow us a comparison of each student's score relative to their class distributions. For instance, if the mean in John's class was equal to 85% with a standard deviation of 2, this means that John's z‐score is:

z = (80 − 85)/2 = −2.5
Suppose that in Mary's class, the mean was equal to 65%, also with a standard deviation of 2. Mary's z‐score is thus:

z = (70 − 65)/2 = +2.5
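One quick way to check this arithmetic is at the R console:

> (80 - 85)/2   # John's z-score
[1] -2.5
> (70 - 65)/2   # Mary's z-score
[1] 2.5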
As we can see, relative to their particular distributions, Mary greatly outperformed John. Assuming each distribution is approximately normal, the density under the curve for a normal distribution with mean 0 and standard deviation of 1 at a score of 2.5 is:
> dnorm(2.5, 0, 1)
[1] 0.017528
where dnorm is the density under the curve at 2.5. This is the value of f(x) at the score of 2.5. What then is the probability of scoring 2.5 or greater? To get the cumulative density up to 2.5, we compute:
> pnorm(2.5, 0, 1)
[1] 0.9937903
The given area is represented in Figure 2.2. The area we are interested in is that at or above 2.5 (the area where the arrow is pointing). Since we know the area under the normal density is equal to 1, we can subtract pnorm(2.5, 0, 1) from 1:
> 1 - pnorm(2.5, 0, 1)
[1] 0.006209665
Figure 2.2 Shaded area under the standard normal distribution at a z‐score of up to 2.5 standard deviations.
We can see then that the percentage of students scoring higher than Mary is approximately 0.6% (i.e., multiply the proportion by 100). What proportion of students scored better than John in his class? Recall that his z‐score was equal to −2.5. Because we know the normal distribution is symmetric, we already know the area lying below −2.5 is the same as that lying above 2.5. This means that approximately 99.38% of students scored higher than John. Hence, we see that Mary drastically outperformed her colleague when we consider their scores relative to their classes. Be careful to note that in drawing these conclusions, we had to assume each score (John's and Mary's) came from a normal distribution. The mere fact that we transformed their raw scores to z‐scores in no way normalizes their raw distributions. Standardization standardizes, but it does not normalize.
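The symmetry argument above can also be confirmed numerically by computing the area above −2.5 directly:

> 1 - pnorm(-2.5, 0, 1)
[1] 0.9937903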
One can also easily verify that approximately 68% of cases in a normal distribution lie within −1 and +1 standard deviations, while approximately 95% of cases lie within −2 and +2 standard deviations.
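For instance, one quick way to verify these figures is with pnorm:

> pnorm(1, 0, 1) - pnorm(-1, 0, 1)
[1] 0.6826895
> pnorm(2, 0, 1) - pnorm(-2, 0, 1)
[1] 0.9544997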
2.1.1 Plotting Normal Distributions
We can plot normal densities in R by simply requesting the lower and upper limit on the abscissa:
> x <- seq(from = -3, to = +3, length.out = 100)
> plot(x, dnorm(x))
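One slight variation is to draw the density as a smooth curve rather than as points and, if desired, overlay a second normal density with a larger standard deviation for comparison:

> plot(x, dnorm(x), type = "l")
> lines(x, dnorm(x, mean = 0, sd = 1.5), lty = 2)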
Distributions (and densities) of a single variable typically go by the name of univariate distributions to distinguish them from distributions of two (bivariate) or more variables (multivariate).
For example, we consider some of Galton's data on parent and child heights (the heights of the children were measured when they were adults, not actual toddlers). Some of Galton's data appears below, retrieved from the HistData package (Friendly, 2014) in R:
> install.packages("HistData")
> library(HistData)
> attach(Galton)
> Galton
   parent child
1    70.5  61.7
2    68.5  61.7
3    65.5  61.7
4    64.5  61.7
5    64.0  61.7
6    67.5  62.2
7    67.5  62.2
8    67.5  62.2
9    66.5  62.2
10   66.5  62.2
We first install the package using the install.packages function. The library statement loads the package HistData into R's search path. From there, we attach the Galton data to insert the object (dataframe) into the search list. We generate a histogram of parent height:
> hist(parent, main = "Histogram of Parent Height")
One can overlay a normal density over an empirical plot to show how closely observed data match that of a theoretical normal distribution, as was done by Fisher in 1925 displaying a distribution of the heights of 1375 women (see Figure 2.3, taken from Classics in the History of Psychology1). R.A. Fisher is usually regarded as the father of modern statistics and among his greatest contributions was the publication of Statistical Methods for Research Workers in 1925 in which he discussed such topics as tests of significance, correlation coefficients, and the analysis of variance.
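As a rough sketch of the same idea, one could overlay a normal density on the Galton parent heights by plotting the histogram on the density scale and adding a curve evaluated at the sample mean and standard deviation:

> hist(parent, freq = FALSE, main = "Histogram of Parent Height")
> curve(dnorm(x, mean = mean(parent), sd = sd(parent)), add = TRUE)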
We can see that the normal density serves as a close, and very convenient, approximation to empirical data. Indeed, the normal density has figured prominently in the history of statistics, largely because it serves as a useful model for many phenomena, and also because it provides a very convenient starting point for much work in theoretical statistics. Oftentimes the assumption of normality will be invoked in a derivation because it makes the problem simpler and easier to solve.
2.1.2 Binomial Distributions
The binomial distribution is given by:

p(r) = \binom{n}{r} p^r (1 − p)^{n − r}
where,
p(r) is the probability of observing r occurrences out of n possible occurrences,2
p is the probability of a “success” on any given trial, and
1 − p is the probability of a failure on any given trial, often simply referred to by “q” (i.e., q = 1 − p).
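As a simple illustration, suppose we flip a fair coin (p = 0.5) n = 5 times and want the probability of observing r = 2 heads. This can be computed directly from the formula, or with R's dbinom function:

> choose(5, 2) * 0.5^2 * (1 - 0.5)^(5 - 2)
[1] 0.3125
> dbinom(2, size = 5, prob = 0.5)
[1] 0.3125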
The binomial setting provides an ideal context to demonstrate the essentials of hypothesis‐testing logic, as we will soon see. In a binomial setting, the following conditions must hold:
The variable under study must be binary in nature. That is, the outcome of the experiment can result in only one category or another; the outcome categories are mutually exclusive. For instance, the flipping of a coin has this characteristic, because the coin can either come up “head” or “tail” and nothing else (yes, we are ruling out the possibility that it lands on its side, and I think it is safe to do so).
The probability of a “success” on each trial remains constant (or stationary) from trial to trial. For example, if the probability of head is equal to 0.5 on our first flip, we assume it is also equal to 0.5 on the second, third, fourth flips, and so on.
Each trial is independent of each other trial. That is, the fact that we get a head on our first