Statistics. David W. Scott
but not its brain weight.) Moreover, the relationship appears to be linear. In this re‐expressed scatter diagram, the two or three outliers identified in the first plot are no longer outliers.
1.2.2 Space Shuttle Flight 25
The 25th launch in the Space Shuttle program was scheduled for 22 January 1986 but was postponed, for various reasons, each day until 28 January. The temperature had dropped to 28°F overnight, and it was 36°F when the launch was attempted at 11:38 a.m. During the first 90 s, several O‐rings on the solid rocket boosters failed, leading to a catastrophic explosion and the loss of all seven crew members. Scientists knew that previous shuttle flights had occasionally experienced one or two O‐ring failures, but a launch had never been attempted at freezing temperatures. Varying opinions about the safety of the launch were provided to the launch director, who eventually decided to proceed. One of the data analyses is reproduced in the first row of Figure 1.5.
Figure 1.4 Scatter diagrams of the raw and log‐transformed body and brain weights of 62 land mammals.
Figure 1.5 Analysis of the number of O‐ring failures for the first 24 Space Shuttle launches; see text.
In the heading of the scatter diagram in the first frame, we see a list of the 7 (of the first 24) shuttle flights that experienced 1 or 2 O‐ring failures. Two failures were observed at the lowest temperature of 53°F.
In the second frame, we have jittered the data by adding a little uniform noise. This reveals that there were two data points superimposed at
However, in a re‐analysis of these data, we have included the shuttle flights that experienced no O‐ring failures. Now the final frame suggests that two or more O‐ring failures are quite likely at 28–36°F.
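The jittering step described above can be sketched in a few lines. This is a minimal illustration rather than the code used to draw Figure 1.5: the function name `jitter`, the noise half‐width of 0.5, and the temperature values are all assumptions chosen for the example, and the values are not the shuttle data.

```python
import random

def jitter(values, width=0.5, seed=0):
    """Return a copy of `values` with uniform noise in [-width, +width] added.

    `width` and `seed` are illustrative defaults, not values from the text.
    """
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return [v + rng.uniform(-width, width) for v in values]

# Hypothetical temperatures (deg F) containing a tie; NOT the shuttle data.
temps = [53, 57, 70, 70, 75]
jittered = jitter(temps, width=0.5)

# The two tied values now plot at slightly different positions.
for original, moved in zip(temps, jittered):
    print(f"{original:5.1f} -> {moved:6.3f}")
```

The noise should be small relative to the spacing of the data, so that superimposed points separate visually without changing the impression of where they lie.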
1.2.3 Pearson's Father–Son Height Data Revisited
We have explored the two variables in this dataset individually, but there is an obvious question of how accurately a son's height can be predicted knowing his father's height. In the first frame of Figure 1.6, we display a scatter diagram of the
In the top right frame, we have placed a red dot at the location of the average heights of the fathers and sons. We have also drawn a straight line fit using the intuitive equation
Galton (1886) was one of the first to note that many scatter diagrams arising in nature have an appearance similar to that in Figure 1.6. He noted that the shape appeared elliptical, so he superimposed elliptical contours over the scatter diagram. The bottom left frame in Figure 1.6 shows three (nested) ellipses for these data. Recall that a general ellipse has five parameters: two for the center of the ellipse; two for the horizontal and vertical scales; and a fifth called the eccentricity. Galton focused on this fifth parameter, and the correlation coefficient was the result. Ironically, this parameter is often referred to today as Pearson's correlation coefficient.
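To make the five‐parameter description concrete, the following sketch computes five sample quantities that play these roles for paired data: two means (center), two standard deviations (scales), and the sample correlation coefficient. The function name `ellipse_summary` and the data are hypothetical, and treating the correlation as the fifth parameter follows the discussion above rather than any particular plotting routine.

```python
from math import sqrt

def ellipse_summary(xs, ys):
    """Return (mean_x, mean_y, sd_x, sd_y, r) for paired samples xs, ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sxx / (n - 1))  # sample standard deviations
    sd_y = sqrt(syy / (n - 1))
    r = sxy / sqrt(sxx * syy)   # sample correlation coefficient
    return mean_x, mean_y, sd_x, sd_y, r

# Hypothetical, perfectly linear data (not the father-son heights).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
mean_x, mean_y, sd_x, sd_y, r = ellipse_summary(xs, ys)
print(mean_x, mean_y, r)
```

For a scatter as diffuse as the father–son heights, we would expect `r` to fall well below 1, which corresponds to a rounder ellipse.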
Figure 1.6 Father–son height data collected by Karl Pearson.
In the final frame, we take advantage of the large sample size to try to understand whether the prediction (as weak as it may be) might be linear or nonlinear. For each integer value of the rounded fathers' heights, we compute a three‐point summary of the corresponding sons' heights. The red dots are the arithmetic averages of the sons' heights. The vertical lines display the (conditional) interquartile range (IQR). The final two red dots on each end are based on only a few points, so the IQR cannot be computed. These four red dots are drawn at a smaller size to indicate that even the averages are not very reliable.
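The conditional three‐point summary just described can be sketched as follows. This is one possible implementation under stated assumptions, not the code behind Figure 1.6: the function name `conditional_summary`, the threshold `min_count` below which the quartiles are skipped, and the small dataset are all hypothetical. Note also that Python's `round` sends exact halves to the nearest even integer, and that `statistics.quantiles` uses its default "exclusive" method here; other software may define quartiles slightly differently.

```python
from collections import defaultdict
from statistics import mean, quantiles

def conditional_summary(fathers, sons, min_count=4):
    """Group sons' heights by rounded father's height.

    Returns {rounded_height: (q1, average, q3)}; q1 and q3 are None when a
    group has fewer than `min_count` points (an illustrative threshold).
    """
    groups = defaultdict(list)
    for f, s in zip(fathers, sons):
        groups[round(f)].append(s)

    summary = {}
    for height in sorted(groups):
        ys = groups[height]
        if len(ys) >= min_count:
            q1, _, q3 = quantiles(ys, n=4)  # quartile cut points
            summary[height] = (q1, mean(ys), q3)
        else:
            summary[height] = (None, mean(ys), None)  # average only
    return summary

# Hypothetical heights in inches; NOT Pearson's data.
fathers = [64.8, 65.1, 65.3, 64.9, 70.2, 66.0]
sons = [66.0, 67.0, 65.0, 68.0, 71.0, 67.5]
summary = conditional_summary(fathers, sons)
for height, (q1, avg, q3) in summary.items():
    print(height, q1, avg, q3)
```

Plotting the averages against the rounded heights, with a vertical segment from `q1` to `q3` where available, reproduces the kind of display described in the final frame.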
We see that these summary points clearly suggest a linear rather than a nonlinear fit. We also see that the two blue reference lines from the second frame, namely
1.2.4 Discussion
These rather substantial examples illustrate the search for structure in distribution and prediction problems, as well as practical problems and cures that may be encountered. A