Biostatistics Decoded. A. Gouveia Oliveira


is called the probability density function). This means that in normally distributed attributes we can completely describe their distribution by using only the mean and the variance (or equivalently the standard deviation). This is the reason why the mean and the variance are called the parameters of the normal distribution, and what makes these two summary measures so important. It also means that if two normally distributed variables have the same variance, then the shape of their distribution will be the same; if they have the same mean, their position on the horizontal axis will be the same.

Second property. The sum or difference of normally distributed independent variables will result in a new variable with a normal distribution. According to the properties of means and variances, the mean of the new variable will be, respectively, the sum or difference of the means of the two variables, and its variance will be the sum of the variances of the two variables (Figure 1.26).


Figure 1.26 Properties of the normal distribution.

Third property. Adding a constant to, or subtracting a constant from, a normally distributed variable will result in a new variable with a normal distribution. According to the properties of means and variances, the constant will be added to or subtracted from the mean, and the variance will not change (Figure 1.26).

Fourth property. The multiplication, or division, of the values of a normally distributed variable by a constant will result in a new variable with a normal distribution. Because of the properties of means and variances, its mean will be multiplied, or divided, by that constant and its variance will be multiplied, or divided, by the square of that constant (Figure 1.26).
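The second, third, and fourth properties can be checked with a short sketch. It assumes Python's standard `statistics.NormalDist` class, whose arithmetic operators treat two distributions as independent; the variable names and the parameter values (means 10 and 4, standard deviations 3 and 4) are chosen only for illustration and do not come from the text.

```python
from statistics import NormalDist

# Two independent, normally distributed variables (illustrative parameters).
x = NormalDist(mu=10, sigma=3)   # variance 9
y = NormalDist(mu=4, sigma=4)    # variance 16

# Second property: the means add or subtract, the variances always add.
total = x + y      # mean 14, variance 25
diff = x - y       # mean 6, variance 25

# Third property: a constant shifts the mean and leaves the variance unchanged.
shifted = x + 5    # mean 15, variance 9

# Fourth property: the constant scales the mean, its square scales the variance.
scaled = x * 2     # mean 20, variance 36
halved = x / 2     # mean 5, variance 2.25

for name, dist in [("x + y", total), ("x - y", diff), ("x + 5", shifted),
                   ("x * 2", scaled), ("x / 2", halved)]:
    print(f"{name:6s} mean = {dist.mean:6.2f}   variance = {dist.variance:6.2f}")
```

Note that the variance of the difference is 25, not 9 − 16: variances of independent variables add even when the variables themselves are subtracted.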

Fifth property. In all normally distributed variables, irrespective of their means and variances, about two-thirds (roughly 68%) of the observations have a value lying in the interval from the mean minus one standard deviation to the mean plus one standard deviation (Figure 1.27). Similarly, approximately 95% of the observations have a value lying in the interval from the mean minus two standard deviations to the mean plus two standard deviations. The relative frequency of the observations with values between the mean minus three standard deviations and the mean plus three standard deviations is about 99.7%, and so on. Therefore, one very important property of the normal distribution is that there is a fixed relationship between the standard deviation and the proportion of values within an interval on either side of the mean defined by a number of standard deviations. This means that if we know that an attribute, for example, height, has a normal distribution with a population mean of 170 cm and a standard deviation of 20 cm, then we also know that the height of about 68% of the population is between 150 and 190 cm, and the height of about 95% of the population is between 130 and 210 cm.

Recall what was said earlier, when we first discussed the standard deviation: that its interpretation was easy but not evident at that time. Now we can see how to interpret this measure of dispersion. In normally distributed attributes, the standard deviation and the mean define intervals corresponding to a fixed proportion of the observations. This is why summary statistics are sometimes presented in the form of mean ± standard deviation (e.g. 170 ± 20).
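The proportions attached to one, two, and three standard deviations can be computed rather than memorized. The sketch below assumes Python's standard `statistics.NormalDist` and uses the height example from the text (mean 170 cm, standard deviation 20 cm); the helper name `proportion_within` is ours, introduced only for this illustration.

```python
from statistics import NormalDist

# The height example from the text: mean 170 cm, standard deviation 20 cm.
height = NormalDist(mu=170, sigma=20)

def proportion_within(dist, k):
    """Proportion of values within k standard deviations of the mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    # Area under the normal curve between low and high.
    return dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    low = height.mean - k * height.stdev
    high = height.mean + k * height.stdev
    share = proportion_within(height, k)
    print(f"{low:.0f}-{high:.0f} cm (mean ± {k} SD): {share:.1%}")
```

The computed values are close to 68.3%, 95.4%, and 99.7%. Changing `mu` or `sigma` moves the intervals but leaves the three proportions unchanged, which is exactly the fixed relationship described above.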


Figure 1.27 Relationship between the area under the normal curve and the standard deviation.

1.17 Probability Distribution of Sample Means

The reason for the pattern of variation of sample means observed in Section 1.13 can easily be understood. We know that a mean is calculated by summing a number of observations on a variable and dividing the result by the number of observations. Usually, we look at the values of an attribute as observations from a single variable. However, we could also view each single value as an observation from a distinct variable, with all variables having an identical distribution. For example, suppose we have a sample of size 100. We can think that we have 100 independent observations from a single random variable, or we can think that we have single observations on 100 variables, all of them with identical distribution. This is illustrated in Figure 1.28. What do we have there, one observation on a single variable – the value of a throw of six dice – or one observation on each of six identically distributed variables – the value of the throw of one die? Either way we look at it we are right.

So what would be the consequences of that change of perspective? With this point of view, a sample mean would correspond to the sum of a large number of observations from variables with identical distribution, each observation being divided by a constant amount which is the sample size. Under these circumstances, the central limit theorem applies and, therefore, we must conclude that the sample means have a normal distribution, regardless of the distribution of the attribute being studied.
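A small simulation can make this conclusion tangible. The sketch below is only an illustration under assumptions of our own: the attribute follows an exponential distribution with mean 1 (strongly skewed, clearly not normal), each sample has 100 observations, and 5000 samples are drawn. If the sample means are close to normally distributed, about 68% of them should lie within one standard deviation of their own mean and about 95% within two.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the run is repeatable

SAMPLE_SIZE = 100   # observations per sample (assumed "large")
N_SAMPLES = 5000    # number of samples drawn

# Each observation comes from an exponential distribution with mean 1.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

centre = statistics.fmean(sample_means)   # mean of the sample means
spread = statistics.stdev(sample_means)   # standard deviation of the sample means

def share_within(k):
    """Share of sample means within k standard deviations of their mean."""
    inside = sum(abs(m - centre) <= k * spread for m in sample_means)
    return inside / len(sample_means)

print(f"mean of the sample means: {centre:.3f}")
print(f"within 1 SD: {share_within(1):.1%}")
print(f"within 2 SD: {share_within(2):.1%}")
```

The exact figures depend on the seed, so none are quoted here; the point is that the shares come out near the normal-curve proportions even though the individual observations are far from normal.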

Because the normal distribution of sample means is a consequence of the central limit theorem, certain restrictions apply. According to the theorem, this result is valid only under two conditions. First, there must be a large number of variables. Second, the variables must be mutually independent. Transposing these restrictions to the case of sample means, this implies that a normal distribution can be expected only if there is a large number of observations, and if the observations are mutually independent.

In the case of small samples, however, the means will also have a normal distribution provided the attribute has a normal distribution. This is not because of the central limit theorem, but because of the properties of the normal distribution. If the means are sums of observations on identical normally distributed variables, then the sample means have a normal distribution whatever the number of observations, that is, the sample size.
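The contrast between the two situations can be sketched with very small samples. The code below is a rough illustration, not a formal test of normality: it draws 20 000 samples of size 3 from a normal attribute and from a skewed (exponential) attribute, and compares the skewness of the resulting sample means. The helper names `skewness` and `means_of_samples`, the sample size, and the choice of distributions are our assumptions. A skewness near 0 is consistent with the symmetric shape of a normal distribution; a clearly positive value is not.

```python
import random
import statistics

random.seed(2)  # fixed seed so that the run is repeatable

def skewness(values):
    """Moment-based skewness: about 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5

def means_of_samples(draw, sample_size, n_samples=20000):
    """Draw n_samples samples of the given size and return their means."""
    return [
        statistics.fmean([draw() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

SMALL = 3  # a deliberately small sample size

# Attribute with a normal distribution: means of small samples stay symmetric.
normal_means = means_of_samples(lambda: random.gauss(0.0, 1.0), SMALL)
# Skewed attribute: with so few observations the means remain visibly skewed.
skewed_means = means_of_samples(lambda: random.expovariate(1.0), SMALL)

skew_normal = skewness(normal_means)
skew_skewed = skewness(skewed_means)
print(f"skewness of means, normal attribute: {skew_normal:.2f}")
print(f"skewness of means, skewed attribute: {skew_skewed:.2f}")
```

Increasing `SMALL` shrinks the skewness in the second case, which is the central limit theorem at work; in the first case it is already close to zero at any sample size.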


Figure 1.28 The total obtained from the throw of six dice may be seen as the sum of observations on six identically distributed variables.

1.18 The Standard Error of the Mean

We now know that the means of large samples may be defined as observations from a random variable with normal distribution. We also know that the normal distribution is completely characterized by its mean and variance. The next step in the investigation of sampling distributions, therefore, must be to find out whether the mean and variance of the distribution of sample means can be determined.

We can conduct an experiment simulating a sampling procedure. With the help of the random number generator of a computer, we can create a random variable with normal distribution with mean 0 and variance 1. Incidentally, this is called a standard normal variable. Then, we obtain a large number of random samples of size 4 and calculate the means of those samples. Next, we calculate the mean and standard deviation of the sample means. We repeat the procedure with samples of size 9, 16, and 25. The results of the experiment are shown in Figure
