to understand what this quantity really represents. However, the standard deviation is the most popular of all measures of dispersion. Why is that?
One important reason is that the standard deviation has a large number of interesting mathematical properties. The other important reason is that the standard deviation actually has a straightforward interpretation, very much along the lines given earlier for the mean deviation. However, we will go into that a little later in the book.
A final remark about the variance. Although the variance is an average, the total sum of squares is divided not by the number of observations as an average should be, but by the number of observations minus 1, that is, by n − 1.
It does no harm if we use symbols to explain the calculations. The formula for the calculation of the variance of an attribute x is

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

where ∑ (capital “S” in the Greek alphabet) stands for summation and $\bar{x}$ stands for the sample mean of x. Naturally, the formula for the standard deviation is

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
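As an illustration, here is a minimal Python sketch of these calculations on a small set of values; the data are made up purely for illustration.

```python
# Sample variance and standard deviation, using the n - 1 divisor.
# The data values below are made up purely for illustration.
values = [4.0, 7.0, 6.0, 5.0, 8.0]

n = len(values)
mean = sum(values) / n                                  # sample mean
sum_of_squares = sum((x - mean) ** 2 for x in values)   # sum of squared differences
variance = sum_of_squares / (n - 1)                     # divide by n - 1
standard_deviation = variance ** 0.5                    # square root of the variance

print(mean, variance, standard_deviation)
```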
1.8 The n − 1 Divisor
The reason why we use the n − 1 divisor instead of the n divisor for the sum of squares when we calculate the variance and the standard deviation is that, when we present those quantities, we are implicitly trying to give an estimate of their value in the population. Now, since we use the data from our sample to calculate the variance, the resulting value will, on average, be smaller than the value of the variance in the population. We say that our result is biased toward a smaller value. What is the explanation for that bias?
Remember that the variance is the average of the squared differences between the individual values and the mean. If we calculated the variance by subtracting the individual values from the true mean (the population mean), the result would be unbiased. This is not what we do, though. We subtract the individual values from the sample mean. Since the sample mean is, of all possible values, the one closest to the values in the dataset (it is the value that minimizes the sum of squared differences), the individual values are on the whole closer to the sample mean than to the population mean. Therefore, the value of the sample variance tends to be smaller than the value of the population variance. The variance is a good measure of dispersion of the values observed in a sample, but it is biased as a measure of dispersion of the values in the population from which the sample was taken. However, this bias is easily corrected if the sum of squares is divided by the number of observations minus 1.
This book is written at two levels of depth: the main text is strictly non‐mathematical and directed to those readers who just want to know the rationale of biostatistical concepts and methods in order to be able to understand and critically evaluate the scientific literature; the text boxes intermingled in the main text present the mathematical formulae and the description of the procedures, supported by working examples, of every statistical calculation and test presented in the main text. The former readers may skip the text boxes without loss of continuity in the presentation of the topics.
It is an easy mathematical exercise to demonstrate that dividing the sum of squares by n − 1 instead of n provides an adequate correction of that bias. However, the same thing can be shown by the small experiment illustrated in Figures 1.12 and 1.13.
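Before turning to that experiment, and for readers who want to see the algebra, here is a brief sketch of that exercise, using μ and σ² for the population mean and variance (symbols not used in the main text) and assuming independent observations. Writing each difference as $(x_i - \bar{x}) = (x_i - \mu) - (\bar{x} - \mu)$ and expanding the square gives

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i - \mu)^2 - n(\bar{x} - \mu)^2$$

On average, $\sum_{i=1}^{n}(x_i - \mu)^2$ equals $n\sigma^2$ and $(\bar{x} - \mu)^2$ equals $\sigma^2/n$, so the sum of squares about the sample mean is on average

$$n\sigma^2 - n\cdot\frac{\sigma^2}{n} = (n - 1)\sigma^2$$

and dividing it by n − 1 therefore gives, on average, exactly the population variance σ².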
Figure 1.12 The n divisor of the sum of squares.
Figure 1.13 The n − 1 divisor of the sum of squares.
Using a computer’s random number generator, we obtained random samples of a variable with variance equal to 1. This is the population variance of that variable. Starting with samples of size 2, we obtained 10 000 random samples and computed their sample variances using the n divisor. Next, we computed the average of those 10 000 sample variances and retained the result. We then repeated the procedure with samples of size 3, 4, 5, and so on up to 100.
The plot of the averaged value of sample variances against sample size is represented by the solid line in Figure 1.12. It can clearly be seen that, regardless of the sample size, the variance computed with the n divisor is on average less than the population variance, and the deviation from the true variance increases as the sample size decreases.
Now let us repeat the procedure, exactly as before, but this time using the n − 1 divisor. The plot of the average sample variance against sample size is shown in Figure 1.13. The solid line is now exactly over 1, the value of the population variance, for all sample sizes.
This experiment clearly illustrates that, contrary to the sample variance using the n divisor, the sample variance using the n − 1 divisor is an unbiased estimator of the population variance.
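A minimal Python sketch of this experiment is given below; it assumes a standard normal variable (which has population variance 1) and, to keep it quick, uses only a few representative sample sizes rather than every size from 2 to 100.

```python
import numpy as np

# Average sample variance versus sample size, for a variable with
# population variance 1, using either the n divisor (ddof=0) or the
# n - 1 divisor (ddof=1), averaged over 10 000 random samples.
rng = np.random.default_rng(0)
n_replicates = 10_000

for n in (2, 3, 5, 10, 30, 100):
    samples = rng.standard_normal((n_replicates, n))
    avg_var_n = samples.var(axis=1, ddof=0).mean()    # n divisor: too small on average
    avg_var_n1 = samples.var(axis=1, ddof=1).mean()   # n - 1 divisor: close to 1 on average
    print(f"n = {n:3d}   n divisor: {avg_var_n:.3f}   n - 1 divisor: {avg_var_n1:.3f}")
```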
1.9 Degrees of Freedom
Degrees of freedom is a central notion in statistics that applies to all problems of estimation of quantities in populations from the observations made on samples. The number of degrees of freedom is the number of values used in the calculation of a quantity that are free to vary. The general rule for finding the number of degrees of freedom of any statistic that estimates a quantity in the population is to count the number of independent values used in the calculation and subtract the number of population quantities that were replaced by sample quantities during the calculation.
In the calculation of the variance, instead of summing the squared differences of each value from the population mean, we summed the squared differences from the sample mean. We therefore replaced a population quantity by a sample quantity and, because of that, we lost one degree of freedom. Consequently, the number of degrees of freedom of a sample variance is n − 1.
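One way of seeing why a degree of freedom is lost is that the differences between the individual values and the sample mean always add up to zero, so once n − 1 of those differences are known the last one is completely determined. The following tiny Python check illustrates this with made-up values:

```python
# Made-up values, purely for illustration.
values = [4.0, 7.0, 6.0, 5.0, 8.0]
mean = sum(values) / len(values)

deviations = [x - mean for x in values]
print(sum(deviations))       # adds up to 0 (apart from rounding error)

# Knowing all but the last deviation determines the last one:
last = -sum(deviations[:-1])
print(last, deviations[-1])  # the two values match
```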
1.10 Variance of Binary Variables
As a binary variable is a numeric variable, in addition to calculating a mean, which in binary variables is called a proportion, we can also calculate a variance. The computation is the same as for interval variables: the differences of each observation from the mean are squared, then summed up and divided by the number of observations. With binary variables there is no need to correct the denominator, and the sum of squares is divided by n.
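As a quick illustration, here is a minimal Python sketch of this direct computation on a small, made-up set of 0/1 values:

```python
# Variance of a binary (0/1) variable, computed directly.
# The values below are made up purely for illustration.
values = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

n = len(values)
proportion = sum(values) / n                                # the mean of a binary variable
sum_of_squares = sum((x - proportion) ** 2 for x in values)
variance = sum_of_squares / n                               # n divisor, no correction needed
print(proportion, variance)                                 # 0.3 and 0.21 for these values
```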
An easier method of calculating the variance of a binary attribute is simply by multiplying the proportion with the attribute by the proportion without the attribute. If we represent the proportion with the attribute by p, then the variance will be p(1 − p). For example, if in a sample of 110 people there were 9 with diabetes, the proportion with diabetes is 9/110 = 0.082 and the variance of diabetes is 0.082 × (1 − 0.082) = 0.075. Therefore, there is a fixed relationship between a proportion and its variance, with the variance increasing for proportions between 0 and 0.5 and decreasing thereafter, as shown in