Biostatistics Decoded. A. Gouveia Oliveira
identical variable x by a constant, the sample size n. Therefore, according to the properties of variances, the variance of each identical variable xi/n will be equal to the population variance σ² divided by the square of the sample size, that is, σ²/n². Sample means result from adding together all the xi/n. Consequently, the variance of the sample mean is equal to the sum of the variances of all the observations, that is, equal to n times the population variance divided by the square of the sample size:
$$\operatorname{var}(\bar{x}) = n \times \frac{\sigma^2}{n^2}$$
This is equivalent to σ²/n, that is, the variance of the sample means is equal to the population variance divided by the sample size. Therefore, the standard deviation of the sample means (the standard error of the mean) equals the population standard deviation divided by the square root of the sample size.
One must not forget that these properties of means and variances only apply in the case of independent variables. Therefore, the results presented above will also only be valid if the sample consists of mutually independent observations. On the other hand, these results have nothing to do with the central limit theorem and, therefore, there are no restrictions related to the normality of the distribution or to the sample size. Actually, whatever the distribution of the attribute and the sample size might be, the mean of the sample means will always be the same as the population mean, and the standard error will always be the same as the population standard deviation divided by the square root of the sample size, provided that the observations are independent. The problem is that, in the case of small samples from an attribute with unknown distribution, we cannot assume that the sample means will have a normal distribution. Therefore, knowledge of the mean and of the standard error will not be sufficient to completely characterize the distribution of sample means.
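A small simulation can illustrate that these results do not depend on normality. The sketch below is only an illustration under assumptions that are not in the text: the attribute is taken to follow an exponential distribution (strongly skewed, with mean 1 and standard deviation 1), the sample size is 5, and the function name and seed are arbitrary choices.

```python
import random
import statistics

def simulate_sample_means(n, n_samples=20000, seed=1):
    """Return the means of n_samples independent samples of size n.

    Assumption for illustration: the attribute follows an exponential
    distribution with population mean 1 and standard deviation 1.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(n_samples)
    ]

n = 5
means = simulate_sample_means(n)

mean_of_means = statistics.fmean(means)   # compare with the population mean, 1
observed_se = statistics.pstdev(means)    # standard deviation of the sample means
expected_se = 1.0 / n ** 0.5              # sigma / sqrt(n)

print(f"mean of sample means:    {mean_of_means:.3f}")
print(f"observed standard error: {observed_se:.3f}")
print(f"sigma / sqrt(n):         {expected_se:.3f}")
```

With a sample size this small and a distribution this skewed, the mean and the standard error of the simulated sample means should still agree closely with the population mean and with σ/√n, even though a histogram of the sample means would not look normal.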
1.20 Distribution of Sample Proportions
So far we have discussed the distribution of sample means of interval variables. What happens with sample means of binary variables when we take samples of a given size n from the same population?
We will repeat the experiment that was done for interval variables but now we will generate a random binary variable with probability 0.30 and take many samples of size 60. Of course, we will observe variation of the sample proportions as we did with sample means, as shown in Figure 1.31. As before, let us plot the sample proportions to see if there is a pattern for the distribution of their values.
The resulting graph is different from the one we obtained with sample means of interval variables. It is clearly not symmetrical and, above all, the probability distribution is not continuous but discrete. Although it resembles the normal distribution, this is a different theoretical probability distribution, called the binomial distribution.
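The experiment can be sketched in a few lines of code. This is a minimal illustration, not the procedure used to produce the figure: the number of samples (10 000), the seed, and the function name are assumptions, while the probability 0.30 and the sample size 60 come from the text.

```python
import random
from collections import Counter

def simulate_sample_proportions(p=0.30, n=60, n_samples=10000, seed=1):
    """Return the proportion of 1s in each of n_samples samples of size n
    drawn from a binary variable that takes the value 1 with probability p."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_samples):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions

n = 60
proportions = simulate_sample_proportions(p=0.30, n=n)

# A sample proportion can only take the values 0/60, 1/60, ..., 60/60,
# so we tabulate the number of samples observed at each possible value.
counts = Counter(round(prop * n) for prop in proportions)
for k in sorted(counts):
    bar = "#" * (counts[k] // 25)   # crude text histogram, one mark per 25 samples
    print(f"{k:2d}/{n} = {k / n:.3f}  {bar}")
```

The text histogram makes the discreteness visible: only multiples of 1/60 occur, with nothing in between.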
We shall now see how to create a binomial distribution. Imagine samples of four random observations on a binary attribute such as sex, which we know to be equally divided between males and females in the population. Each sample may have from 0 to 4 females, so the proportion of females in a sample is 0, 25, 50, 75, or 100%.
It is a simple matter to calculate the frequency with which each of these results will appear. We write down all possible combinations of males and females that can be obtained in samples of four, and count in how many cases there are 0, 1, 2, 3, and 4 females. In this example, there are 16 possible outcomes. There is only one way of having 0 females, so the theoretical relative frequency of this outcome is once out of 16 outcomes, or 6.25%. There are four possible ways out of 16 of having 25% of females, which is when the first, or the second, or the third, or the fourth sampled individual is a female. Hence, the relative frequency of this outcome, at least theoretically, is 25%. There are six possible ways of having 50% of females, so the relative frequency of this outcome is 37.5%. There are four possible ways of having 75% of females, so the frequency of this result is 25%. Finally, there is only one possible way of having 100% of females, and the relative frequency of this result is 6.25%.
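The enumeration just described can also be done mechanically. The following sketch lists every possible sequence of males and females in a sample of four and counts the number of females in each; the labels "M" and "F" and the variable names are illustrative choices only.

```python
from collections import Counter
from itertools import product

# All possible ordered samples of four individuals, each either male or female.
outcomes = list(product("MF", repeat=4))

# Count how many of the outcomes contain 0, 1, 2, 3, or 4 females.
females_per_outcome = Counter(outcome.count("F") for outcome in outcomes)

for k in range(5):
    ways = females_per_outcome[k]
    print(
        f"{k} females ({k / 4:.0%} of the sample): "
        f"{ways} of {len(outcomes)} outcomes, "
        f"relative frequency {ways / len(outcomes):.2%}"
    )
```

The counts are 1, 4, 6, 4, and 1 out of 16, in agreement with the relative frequencies obtained by hand. This works only because males and females are equally likely, so that all 16 outcomes have the same probability.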
Figure 1.31 Illustration of the phenomenon of sampling variation. Above, pie charts show the observed proportions in random samples of size n of a binary variable. The graph below shows the distribution of sample proportions of a large number of random samples.
These results are presented in the graph in Figure 1.32, which displays all possible proportions of females in samples of four and their relative frequency. Since, as we saw before, the proportion of females in a sample corresponds to the mean of that attribute, the graph is nothing more than the probability distribution of the sample proportions, the binomial distribution. All random binary attributes, like the proportion of patients with asthma in a sample, or the proportion of responses to a treatment, follow the binomial distribution.
Therefore, with interval attributes we know the probability distribution of sample means only when the sample sizes are large or the attribute has a normal distribution. By contrast, with binary attributes we always know what the probability distribution of sample proportions is: it is the binomial distribution.
The calculation of the frequency of all possible results by the method outlined above can be very tedious for larger sample sizes, because there are so many possible results. It is also complicated for attributes whose values, unlike the above example, do not have equal probability. Fortunately, there is a formula for the binomial distribution that allows us to calculate the frequencies for any sample size and for any probability of the attribute values. The formula is:
$$P(k) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}\,(1-\pi)^{n-k}$$
where n is the sample size, k is the number of observations in the sample having the attribute, and π is the probability of the attribute in the population.
Figure 1.32 Probability distribution of a proportion: the binomial distribution.
We can use the formula to make the above calculations. For example, to calculate the probability of having k = 3 women in a sample of n = 4 observations, assuming that the proportion of women in the population is π = 0.5:
$$P(3) = \frac{4!}{3!\,(4-3)!} \times 0.5^{3} \times (1-0.5)^{4-3} = 4 \times 0.125 \times 0.5 = 0.25$$
as before.
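The same calculation can be written as a short function. This is a sketch that assumes the usual form of the binomial formula, the number of ways of choosing k observations out of n multiplied by π^k (1 − π)^(n − k); the function and variable names are illustrative.

```python
from math import comb

def binomial_probability(k, n, pi):
    """Probability of exactly k observations with the attribute in a sample
    of n independent observations, when the population probability is pi."""
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

# Probability of 3 women in a sample of 4 when the population proportion is 0.5.
p_three_women = binomial_probability(3, 4, 0.5)
print(p_three_women)  # 0.25

# The whole distribution for samples of four.
distribution = [binomial_probability(k, 4, 0.5) for k in range(5)]
print(distribution)
```

Unlike the enumeration of outcomes, the function applies equally well when the two values of the attribute do not have equal probability, for example `binomial_probability(18, 60, 0.30)`.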
Since the means of binary attributes in random samples follow a probability distribution, we can calculate the mean and the variance of sample proportions in the same way as we did with interval‐scaled attributes. If we view a sample proportion as the sum of single observations from binary variables with identical distribution, each divided by the sample size, then the properties of means allow us to conclude that the mean of the distribution of sample proportions is equal to the population proportion of the attribute.
By the same reasoning, we conclude that the variance of sample proportions must be the population variance of a binary attribute (the product of the probabilities of its two values), divided by the sample size. If we call π the probability of an attribute having the value 1 (or, if we prefer, the proportion of the population having the attribute) and n the sample size, the variance of sample proportions is, therefore
$$\operatorname{var}(p) = \frac{\pi(1-\pi)}{n}$$
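The mean π and the variance π(1 − π)/n of sample proportions can be checked numerically against the binomial distribution itself. The sketch below assumes the usual binomial formula and, purely as an example, reuses the values π = 0.30 and n = 60 from the earlier experiment; the function name is hypothetical.

```python
from math import comb

def sample_proportion_moments(n, pi):
    """Exact mean and variance of the sample proportion k/n, computed by
    summing over the binomial probabilities of k = 0, 1, ..., n."""
    probs = [comb(n, k) * pi ** k * (1 - pi) ** (n - k) for k in range(n + 1)]
    mean = sum((k / n) * prob for k, prob in enumerate(probs))
    variance = sum((k / n - mean) ** 2 * prob for k, prob in enumerate(probs))
    return mean, variance

n, pi = 60, 0.30
mean_p, var_p = sample_proportion_moments(n, pi)

print(f"mean of sample proportions:     {mean_p:.6f}  (pi = {pi})")
print(f"variance of sample proportions: {var_p:.6f}")
print(f"pi * (1 - pi) / n:              {pi * (1 - pi) / n:.6f}")
```

The two variance figures should agree up to floating-point rounding, whatever values of n and π are chosen.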
The standard