Probability with R. Jane M. Horgan
of the importance of using graphical displays to provide insight into the data. The example is that of Anscombe (1973), who provides four data sets, given in Table 3.3 and often referred to as the Anscombe Quartet. Each data set consists of two variables on which there are 11 observations.
TABLE 3.3 The Anscombe Quartet
Data Set 1 | | Data Set 2 | | Data Set 3 | | Data Set 4 | |
x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
14 | 9.96 | 14 | 8.10 | 14 | 8.84 | 8 | 7.04 |
6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
4 | 4.26 | 4 | 3.10 | 4 | 5.39 | 19 | 12.50 |
12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |
5 | 5.68 | 5 | 4.74 | 5 | 5.73 | 8 | 6.89 |
First, read the data into separate vectors.
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y1 <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
and so on for x2, y2, x3, y3, x4, and y4. Then, for convenience, group the data into data frames as follows:
dataset1 <- data.frame(x1, y1)
dataset2 <- data.frame(x2, y2)
dataset3 <- data.frame(x3, y3)
dataset4 <- data.frame(x4, y4)
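As a cross-check on the typed values, R also ships the quartet as the built-in data frame anscombe (in the datasets package), with columns x1 to x4 and y1 to y4. The sketch below assumes that built-in copy is available and builds the same four data frames without retyping Table 3.3.

```r
# Assumption: the built-in 'anscombe' data frame (datasets package) is available.
data(anscombe)

# Pair each x column with its y column, giving the same four
# two-column data frames as the data.frame() calls above.
dataset1 <- anscombe[, c("x1", "y1")]
dataset2 <- anscombe[, c("x2", "y2")]
dataset3 <- anscombe[, c("x3", "y3")]
dataset4 <- anscombe[, c("x4", "y4")]

# Each data frame should hold 11 observations on 2 variables.
str(dataset1)
```

Comparing these columns with vectors typed by hand is a quick way to catch a mistyped value before any statistics are computed.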
When presented with data such as these, it is usual to obtain summary statistics. Let us do this using R.
To obtain the means of the variables in each data set, write
sapply(dataset1, mean)
      x1       y1
9.000000 7.500909
sapply(dataset2, mean)
      x2       y2
9.000000 7.500909
sapply(dataset3, mean)
 x3  y3
9.0 7.5
sapply(dataset4, mean)
      x4       y4
9.000000 7.500909
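The means for the x variables are the same in all four data sets, and the means for the y variables are practically identical.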
Let us look at the standard deviations.
sapply(dataset1, sd)
      x1       y1
3.316625 2.031568
sapply(dataset2, sd)
      x2       y2
3.316625 2.031657
sapply(dataset3, sd)
      x3       y3
3.316625 2.030424
sapply(dataset4, sd)
      x4       y4
3.316625 2.030579
The standard deviations, as you can see, are also practically identical for the four data sets.
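The two summaries can also be collected in a single table, which makes the agreement easier to see at a glance. This is a minimal sketch: it assumes the built-in anscombe data frame as the data source, and summarise_set and summary_table are names introduced here for illustration only.

```r
# Assumption: the built-in 'anscombe' data frame holds the same values as Table 3.3.
data(anscombe)

# For data set i, summarise columns x<i> and y<i>.
summarise_set <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), sd_x = sd(x), sd_y = sd(y))
}

# One row per data set, one column per statistic.
summary_table <- t(sapply(1:4, summarise_set))
rownames(summary_table) <- paste("Data set", 1:4)

round(summary_table, 2)
```

Rounded to two decimal places, the four rows come out the same, which is exactly why the summary statistics alone cannot distinguish the data sets.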
Calculating the mean and standard deviation is the usual way to summarize data. With these data, if this was all that we did, we would conclude naively that the four data sets are “equivalent,” since that is what the statistics say. But what do the statistics not say?
Investigating further, using graphical displays, gives a different picture. Pairwise plots would be the obvious exploratory technique to use with paired data.
par(mfrow = c(2, 2))
plot(x1, y1, xlim = c(0, 20), ylim = c(0, 13))
plot(x2, y2, xlim = c(0, 20), ylim = c(0, 13))
plot(x3, y3, xlim = c(0, 20), ylim = c(0, 13))
plot(x4, y4, xlim = c(0, 20), ylim = c(0, 13))
gives Fig. 3.20. Notice again the use of xlim and ylim to ensure that the scales on the axes are the same in the four plots, in order that a valid comparison can be made.
Figure 3.20 Plots of Four Data Sets with Same Means and Standard Deviations
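The same four panels can also be drawn in a loop, with the common limits stored once so that no panel can end up on a different scale. This is a sketch only: it assumes the built-in anscombe data frame, and x_limits, y_limits, and old_par are illustrative names, with the limits set to the values used above.

```r
# Assumption: the built-in 'anscombe' data frame holds the same values as Table 3.3.
data(anscombe)

# Common axis limits, defined once and reused for every panel.
x_limits <- c(0, 20)
y_limits <- c(0, 13)

# Arrange the plots in a 2 x 2 grid, remembering the previous settings.
old_par <- par(mfrow = c(2, 2))

for (i in 1:4) {
  x_name <- paste0("x", i)
  y_name <- paste0("y", i)
  plot(anscombe[[x_name]], anscombe[[y_name]],
       xlim = x_limits, ylim = y_limits,
       xlab = x_name, ylab = y_name)
}

# Restore the previous graphical parameters.
par(old_par)
```

Keeping the limits in two variables means that a change of scale is made in one place and applies to all four plots.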
Examining Fig. 3.20, we see that there are very great differences in the data sets:
1 Data set 1 is linear with some scatter;
2 Data set 2 is quadratic;
3 Data set 3 has an outlier. If the outlier were removed the data would be linear;
4 Data