Probability with R. Jane M. Horgan

of the importance of using graphical displays to provide insight into the data. The example is that of Anscombe (1973), who provides four data sets, given in Table 3.3 and often referred to as the Anscombe Quartet. Each data set consists of two variables on which there are 11 observations.

TABLE 3.3 The Anscombe Quartet

Data Set 1		Data Set 2		Data Set 3		Data Set 4
x1	y1	x2	y2	x3	y3	x4	y4
10	8.04	10	9.14	10	7.46	8	6.58
8	6.95	8	8.14	8	6.77	8	5.76
13	7.58	13	8.74	13	12.74	8	7.71
9	8.81	9	8.77	9	7.11	8	8.84
11	8.33	11	9.26	11	7.81	8	8.47
14	9.96	14	8.10	14	8.84	8	7.04
6	7.24	6	6.13	6	6.08	8	5.25
4	4.26	4	3.10	4	5.39	19	12.50
12	10.84	12	9.13	12	8.15	8	5.56
7	4.82	7	7.26	7	6.42	8	7.91
5	5.68	5	4.74	5	5.73	8	6.89

First, read the data into separate vectors.

x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y1 <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)

and so on for x2, y2, x3, y3, x4, and y4. Then, for convenience, group the data into data frames as follows:

dataset1 <- data.frame(x1, y1)
dataset2 <- data.frame(x2, y2)
dataset3 <- data.frame(x3, y3)
dataset4 <- data.frame(x4, y4)
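The "and so on" step can be written out in full from Table 3.3. The following is a sketch that enters all eight vectors and builds the four data frames. The final comparison with R's built-in anscombe data frame is an addition that is not in the original text; it is included only as a guard against typing errors.

```r
# Data Set 1
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y1 <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
# Data Set 2
x2 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y2 <- c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74)
# Data Set 3
x3 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y3 <- c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)
# Data Set 4
x4 <- c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)
y4 <- c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)

# Group each pair of vectors into its own data frame
dataset1 <- data.frame(x1, y1)
dataset2 <- data.frame(x2, y2)
dataset3 <- data.frame(x3, y3)
dataset4 <- data.frame(x4, y4)

# Optional check (not in the original): compare the typed values with
# the copy of the quartet that ships with R as 'anscombe'
cols <- c("x1", "y1", "x2", "y2", "x3", "y3", "x4", "y4")
typed <- c(x1, y1, x2, y2, x3, y3, x4, y4)
matches_builtin <- isTRUE(all.equal(typed, unlist(anscombe[cols], use.names = FALSE)))
print(matches_builtin)
```

If the check prints FALSE, one of the typed values differs from the built-in copy and the table should be reread.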

When presented with data such as these, it is usual to obtain summary statistics. Let us do this using R.

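To obtain the means of the variables in each data set, apply mean to each column with sapply (in current versions of R, calling mean directly on a data frame returns NA with a warning):

sapply(dataset1, mean)
      x1       y1
9.000000 7.500909
sapply(dataset2, mean)
      x2       y2
9.000000 7.500909
sapply(dataset3, mean)
 x3  y3
9.0 7.5
sapply(dataset4, mean)
      x4       y4
9.000000 7.500909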

The means of the x variables, as you can see, are practically identical, as are the means of the y variables.

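Let us look at the standard deviations. As with mean, sd no longer accepts a data frame in current versions of R, so apply it to each column.

sapply(dataset1, sd)
      x1       y1
3.316625 2.031568
sapply(dataset2, sd)
      x2       y2
3.316625 2.031657
sapply(dataset3, sd)
      x3       y3
3.316625 2.030424
sapply(dataset4, sd)
      x4       y4
3.316625 2.030579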

The standard deviations, as you can see, are also practically identical for the four x variables, and also for the four y variables.
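The eight means and eight standard deviations can also be collected into a single table for side-by-side comparison. This is a minimal sketch, assuming that R's built-in anscombe data frame (columns x1 to x4 and y1 to y4, holding the same values as Table 3.3) stands in for the hand-entered vectors; the name stats is illustrative.

```r
# Assumption: the built-in 'anscombe' data frame replaces the typed vectors.
# sapply applies the function to each column and returns a named vector.
stats <- rbind(mean = sapply(anscombe, mean),
               sd   = sapply(anscombe, sd))

# One row per statistic, one column per variable
print(round(stats, 3))
```

Reading along each row makes the agreement across the four data sets easy to see at a glance.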

Calculating the mean and standard deviation is the usual way to summarize data. With these data, if this was all that we did, we would conclude naively that the four data sets are “equivalent,” since that is what the statistics say. But what do the statistics not say?

Investigating further, using graphical displays, gives a different picture. Pairwise plots would be the obvious exploratory technique to use with paired data.

par(mfrow = c(2, 2))
plot(x1, y1, xlim = c(0, 20), ylim = c(0, 13))
plot(x2, y2, xlim = c(0, 20), ylim = c(0, 13))
plot(x3, y3, xlim = c(0, 20), ylim = c(0, 13))
plot(x4, y4, xlim = c(0, 20), ylim = c(0, 13))

gives Fig. 3.20. Notice again the use of xlim and ylim to ensure that the scales on the axes are the same in the four plots, in order that a valid comparison can be made.
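The same four panels can be drawn in a loop, so that the common axis limits are written only once. This is a sketch rather than the author's code: it assumes the built-in anscombe data frame in place of the typed vectors, and the panel titles are an addition.

```r
# Assumption: the built-in 'anscombe' data frame replaces the typed vectors.
old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid; keep the previous setting

for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  # Identical xlim and ylim in every panel keep the scales comparable
  plot(x, y, xlim = c(0, 20), ylim = c(0, 13),
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Data Set", i))
}

par(old_par)                      # restore the previous layout
```

Saving and restoring the result of par means that later plots are not forced into the 2 x 2 grid.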

Figure 3.20 Plots of Four Data Sets with Same Means and Standard Deviations

Examining Fig. 3.20, we see that there are very great differences in the data sets:

1 Data set 1 is linear with some scatter;

2 Data set 2 is quadratic;

3 Data set 3 has an outlier. If the outlier were removed, the data would be linear;

4 Data
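The visual impressions of data sets 2 and 3 can be checked numerically. The sketch below goes beyond the original text: it assumes the built-in anscombe data frame in place of the typed vectors, uses lm to fit a quadratic to data set 2 and a straight line to data set 3, and treats the point with the largest y3 as the outlier (an assumption based on Table 3.3, where it is the value 12.74).

```r
# Assumption: the built-in 'anscombe' data frame replaces the typed vectors.

# Data set 2: a quadratic in x2 should fit almost perfectly
fit2 <- lm(y2 ~ x2 + I(x2^2), data = anscombe)
r2_quadratic <- summary(fit2)$r.squared

# Data set 3: drop the single outlier (assumed to be the largest y3),
# then fit a straight line to the remaining ten points
keep <- anscombe$y3 < max(anscombe$y3)
fit3 <- lm(y3 ~ x3, data = anscombe[keep, ])
r2_without_outlier <- summary(fit3)$r.squared

# For comparison: the straight-line fit with the outlier left in
r2_with_outlier <- summary(lm(y3 ~ x3, data = anscombe))$r.squared

print(c(set2_quadratic       = r2_quadratic,
        set3_without_outlier = r2_without_outlier,
        set3_with_outlier    = r2_with_outlier))
```

Values of R-squared close to 1 for the first two fits support the descriptions of data sets 2 and 3, while the lower value for the last fit shows how much a single point can distort a straight-line summary.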
