2.2 Measures of Dispersion
Measures of dispersion, as the name suggests, estimate the spread or variation in a data set. There are many ways of measuring spread, and we consider some of the most common.
Range: The simplest measure of spread of data is the range, which is the difference between the maximum and the minimum values.
rangedown <- max(downtime) - min(downtime)
rangedown
[1] 51
tells us that the range in the downtime data is 51 minutes.
rangearch1 <- max(arch1, na.rm = T) - min(arch1, na.rm = T)
rangearch1
[1] 97
gives the range of the marks awarded in Architecture in Semester 1.
The R function range may also be used.
range(arch1, na.rm = TRUE)
[1]   3 100
which gives the minimum (3) and the maximum (100) of the marks obtained in Architecture in Semester 1.
Note that, since arch1 contains missing values, the declaration na.rm = T, or equivalently na.rm = TRUE, needs to be used.
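Incidentally, the two values returned by range can be collapsed into the single spread figure by differencing them with diff, which should reproduce the 97 obtained above:
# difference of the maximum and minimum returned by range
diff(range(arch1, na.rm = TRUE))
[1] 97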
To get the range for all the examination subjects in results, we use the function sapply.
sapply(results[2:5], range, na.rm = TRUE)
gives the minimum and maximum of each subject.
     arch1 prog1 arch2 prog2
[1,]     3    12     6     5
[2,]   100    98    98    97
Standard deviation: The standard deviation (sd) measures how much the data values deviate from their average. It is the square root of the average squared deviations from the mean. A small standard deviation implies that most values are near the mean. A large standard deviation indicates that values are widely spread above and below the mean.
In R
sd(downtime)
yields
[1] 14.27164
Recall that we calculated the mean to be 25.04 minutes. We might loosely describe the downtime as being “25 minutes on average give or take 14 minutes.”
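As a check on this definition, the value returned by sd can be reproduced from first principles; note that sd uses the divisor n - 1 (the sample standard deviation):
# square root of the summed squared deviations, divided by n - 1
sqrt(sum((downtime - mean(downtime))^2) / (length(downtime) - 1))
[1] 14.27164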
For the data in results,
sapply(results[2:5], sd, na.rm = TRUE)
gives the standard deviation of each examination subject in results:
   arch1    prog1    arch2    prog2 
24.37469 23.24012 21.99061 27.08082
Quantiles: The quantiles divide the data into proportions, usually into quarters called quartiles, tenths called deciles, and percentages called percentiles. The default calculation in R is quartiles.
quantile(downtime)
gives
  0%  25%  50%  75% 100% 
 0.0 16.0 25.0 31.5 51.0
The first quartile (16.0) is the value that breaks the data so that 25% is below this value and 75% is above.
The second quartile (25.0) is the value that breaks the data so that 50% is below and 50% is above (notice that the 2nd quartile is the median).
The third quartile (31.5) is the value that breaks the data so that 75% is below and 25% is above.
We could say that 25% of the computer systems in the laboratory experienced less than 16 minutes of downtime, another 25% of them were down for between 16 and 25 minutes, and so on.
Interquartile range: The difference between the first and third quartiles is called the interquartile range and is sometimes used as a rough estimate of the standard deviation. In downtime it is 31.5 - 16.0 = 15.5, not too far away from 14.27, which we calculated to be the standard deviation.
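The same figure can be obtained directly with the IQR function in R, which should return the difference between the quartiles shown above:
# interquartile range of the downtime data
IQR(downtime)
[1] 15.5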
Deciles: Deciles divide the data into tenths. To get the deciles in R, first define the required break points
deciles <- seq(0, 1, 0.1)
The function seq creates a vector consisting of an equidistant series of numbers. In this case, seq assigns values in [0, 1] in intervals of 0.1 to the vector called deciles. Writing in R
deciles
shows what the vector contains
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Adding this extra argument to the quantile function
quantile(downtime, deciles)
yields
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 0.0  4.0 12.8 19.8 22.6 25.0 29.2 30.0 34.8 44.8 51.0
Interpreting this output, we could say that 90% of the computer systems in the laboratory experienced less than 45 minutes of downtime.
Similarly, for the percentiles, use
percentiles <- seq(0, 1, 0.01)
as an argument in the quantile function, and write
quantile(downtime, percentiles)
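A single quantile can also be requested by passing one probability directly; for example, the 90th percentile quoted above:
# the 90th percentile of downtime on its own
quantile(downtime, 0.9)
 90% 
44.8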
2.3 Overall Summary Statistics
A quicker way of summarizing the data is to use the summary function.
summary(downtime)
returns
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   16.00   25.00   25.04   31.50   51.00
which are the minimum, the first quartile, the median, the mean, the third quartile, and the maximum, respectively.
For arch1, we might write
summary(arch1)
which gives
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   3.00   46.75   68.50   63.57   83.25  100.00    3.00
An entire data frame may be summarized by using the summary command. Let us do this for the data frame results. First, it is wise to make a declaration about the categorical variable gender.
gender <- factor(gender)
designates the variable gender as a factor, and ensures that it is treated as such in the summary function.
summary(results)
 gender      arch1            prog1           arch2           prog2      
 f: 19   Min.   :  3.00   Min.   :12.00   Min.   : 6.00   Min.   : 5.00
 m:100   1st Qu.: 46.75   1st Qu.:40.00   1st Qu.:40.00   1st Qu.:30.00
         Median : 68.50   Median :64.00   Median :48.00   Median :57.00
         Mean   : 63.57   Mean   :59.02   Mean   :51.97   Mean   :53.78
         3rd Qu.: 83.25   3rd Qu.:78.00   3rd Qu.:61.00   3rd Qu.:76.50
         Max.   :100.00   Max.   :98.00   Max.   :98.00   Max.   :97.00
         NA's   :  3.00   NA's   : 2.00   NA's   : 4.00   NA's   : 8.00
Notice how the display for gender is different from that for the other variables; we are simply given the frequency for each gender.
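Should only these frequencies be required, they can be obtained on their own (once gender has been declared a factor as above) with the table function:
# frequency of each level of the factor gender
table(gender)
gender
  f   m 
 19 100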
2.4 Programming in R
One of the great benefits of R is that it is possible to write your own programs and use them as functions in your analysis. Programming is extremely simple in R because of the way it handles vectors and data frames. To illustrate, let us write a program to calculate the mean of downtime. The formula for the mean of a variable