Biostatistics Decoded. A. Gouveia Oliveira
the population mean has value close to the value of the sample mean. Those are the reasons why we cannot assume that the value of the population mean is the same as the sample mean. So an important conclusion is that one must never, ever draw conclusions about a population based on the value of sample means. Sample means only describe the sample, never the population.
There is something else in the results of this experiment that draws attention: the curve representing the distribution of sample means has a shape like a hat, or a bell, where the frequency of the individual values is highest near the middle and declines smoothly from there and at the same rate on either side. The interesting thing is that we have seen a curve like this before. Of course, we could never obtain a curve like this because there is no way we can take an infinite number of samples. Therefore, that curve is theoretical and is a mathematical function that describes the probability of obtaining the values that a random variable can take and, thus, is called a probability distribution.
Figure 1.23 presents several histograms showing the frequency distribution of several commonly assessed clinical laboratory variables measured in interval scales, obtained from a sample of over 400 patients with hypertension. Notice not only that all distributions are approximately symmetrical about the mean, but also that the very shape of the histograms is strikingly similar.
Actually, if we went around taking some kind of interval‐based measurements (e.g. length, weight, concentration) from samples of any type of biological materials and plotted them in a histogram, we would find this shape almost everywhere. This pattern is so repetitive that it has been compared to familiar shapes, like bells or Napoleon hats.
In other circumstances, outside the world of mathematics, people would say that we have here some kind of natural phenomenon. It seems as if some law, of physics or whatever, dictates the rules that variation must follow. This would imply that the variation we observe in everyday life is not chaotic in nature, but actually ruled by some universal law. If this were true, and if we knew what that law says, perhaps we could understand why, and especially how, variation appears.
Figure 1.23 Frequency distributions of some biological variables.
So, what would be the nature of that law, and is it already known? Yes, it is, and it is actually very easy to understand how it works. Let us conduct a little experiment to see if we can create something whose values have a bell‐shaped distribution.
1.14 The Normal Distribution
Consider some attribute that may take only two values, say 1 and 2, and suppose those values occur with equal frequency. Technically speaking, we have a random variable taking the values 1 and 2 with equal probability; this is the probability distribution of that variable (see Figure 1.24, upper part). Consider also, say, four variables that behave exactly like this one, that is, with the same probability distribution. Now let us create a fifth variable that is the sum of all four variables. Can we predict what the probability distribution of this variable will be?
We can, and the result is also presented in Figure 1.24. We simply write down all the possible combinations of values of the four equal variables and see in each case what the value of the fifth variable is. If all four variables have value 1, then the fifth variable will have value 4. If three variables have value 1 and one has value 2, then the fifth variable will have value 5. This may occur in four different ways – either the first variable had the value 2, or the second, or the third, or the fourth. If two variables have the value 1 and two have the value 2, then the sum will be 6, and this may occur in six different ways. If one variable has value 1 and three have value 2, then the result will be 7 and this may occur in four different ways. Finally, if all four variables have value 2, the result will be 8 and this can occur in only one way.
Figure 1.24 The origin of the normal distribution.
So, of the 16 different possible ways or combinations, in one the value of the fifth variable is 4, in four it is 5, in six it is 6, in four it is 7, and in one it is 8. If we now graph the relative frequency of each of these results, we obtain the graph shown in the lower part of Figure 1.24. This is the graph of the probability distribution of the fifth variable. Do you recognize the bell shape?
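The counting argument above can be checked by brute force. The following is a minimal sketch in Python, not part of the original text, that lists every combination of the four two-valued variables and tallies the sums; the names `values`, `n_variables`, and `counts` are illustrative choices.

```python
from collections import Counter
from itertools import product

# Each of the four variables takes the value 1 or 2 with equal probability.
values = (1, 2)
n_variables = 4

# Enumerate all 2**4 = 16 equally likely combinations and tally their sums.
counts = Counter(sum(combo) for combo in product(values, repeat=n_variables))
total = sum(counts.values())

# Print the number of ways and the relative frequency of each possible sum.
for s in sorted(counts):
    print(f"sum = {s}: {counts[s]} way(s), probability {counts[s] / total:.4f}")
```

The tallies reproduce the 1, 4, 6, 4, 1 pattern described in the text, with relative frequencies 1/16, 4/16, 6/16, 4/16, and 1/16.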
If we repeat the experiment with not four, but a much larger number of variables, the variable that results from adding all those variables will have not just five different values, but many more. Consequently, the graph will be smoother and more bell‐shaped. The same will happen if we add variables taking more than two values.
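To see this smoothing without listing every combination by hand, the exact distribution of the sum can be built up one variable at a time. This is a sketch under stated assumptions: `sum_distribution` and `text_histogram` are hypothetical helpers written for this illustration, and the choices of 20 two-valued variables and 5 six-valued variables are arbitrary.

```python
def sum_distribution(values, n_variables):
    """Exact distribution of the sum of n independent variables, each
    equally likely to take any of the given values."""
    p = 1.0 / len(values)
    dist = {0: 1.0}  # before adding any variable, the sum is 0 with certainty
    for _ in range(n_variables):
        new_dist = {}
        for partial_sum, prob in dist.items():
            for v in values:
                key = partial_sum + v
                new_dist[key] = new_dist.get(key, 0.0) + prob * p
        dist = new_dist
    return dist


def text_histogram(dist, width=50):
    """Print one bar per possible sum, scaled so the tallest bar has `width` marks."""
    peak = max(dist.values())
    for s in sorted(dist):
        bar = "#" * round(width * dist[s] / peak)
        print(f"{s:4d} {dist[s]:.4f} {bar}")


four = sum_distribution((1, 2), 4)                # the experiment in the text
twenty = sum_distribution((1, 2), 20)             # many more two-valued variables
dice = sum_distribution((1, 2, 3, 4, 5, 6), 5)    # variables with six values

text_histogram(twenty)
```

With 20 variables the sum takes 21 different values instead of 5, and the printed bars already trace a recognizable bell. Passing `dice` to `text_histogram` shows the same behavior for variables taking more than two values.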
If we add a very large number of variables, the resulting variable will take a very large number of values and, in the limit, as the number of variables grows without bound, the graph of its probability distribution becomes a perfectly smooth curve. This curve is called the normal curve. It is also called the Gaussian curve, after the German mathematician Carl Friedrich Gauss, who described it.
1.15 The Central Limit Theorem
What was presented in the previous section is an illustration of the central limit theorem. Put simply, this theorem states that the sum of a large number of independent variables with identical distribution has approximately a normal distribution, whatever the distribution of the individual variables, provided that their variance is finite. The central limit theorem plays a major role in statistical theory, and the following experiment illustrates how the theorem operates.
With a computer, we generated random numbers between 0 and 1, obtaining observations from two continuous variables with the same distribution. The variables had a uniform probability distribution, which is a probability distribution where all values occur with exactly the same probability.
Then, we created a new variable by adding the values of those two variables and plotted a histogram of the frequency distribution of the new variable. The procedure was repeated with three, four, and five identical uniform variables. The frequency distributions of the resulting variables are presented in Figure 1.25.
Figure 1.25 Frequency distribution of sums of identical variables with uniform distribution.
Notice that the more variables we add together, the more the shape of the frequency distribution approaches the normal curve. The fit is already fair for the sum of four variables. This result is a consequence of the central limit theorem.
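An experiment of this kind can be reproduced along the following lines. This is only a sketch, not the procedure actually used for Figure 1.25: the number of observations, the random seed, and the helper names `simulate_sums` and `summarize` are assumptions, and a single uniform variable (k = 1) is included as a baseline. Instead of drawing histograms, the sketch compares the mean and variance of each sum with their theoretical values (k/2 and k/12 for the sum of k uniform variables on 0 to 1) and reports the share of observations within one standard deviation of the mean, which is about 0.683 for a normal distribution.

```python
import math
import random

random.seed(12345)        # arbitrary seed, only to make the sketch reproducible
N_OBSERVATIONS = 50_000   # assumed sample size; the text does not state one


def simulate_sums(k, n=N_OBSERVATIONS):
    """Return n observations of the sum of k independent uniform(0, 1) variables."""
    return [sum(random.random() for _ in range(k)) for _ in range(n)]


def summarize(data):
    """Return the mean, the variance, and the share of observations lying
    within one standard deviation of the mean."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    sd = math.sqrt(variance)
    within = sum(1 for x in data if abs(x - mean) <= sd) / n
    return mean, variance, within


results = {}
for k in (1, 2, 3, 4, 5):
    results[k] = summarize(simulate_sums(k))
    mean, variance, within = results[k]
    print(f"k={k}: mean={mean:.3f} (theory {k / 2:.3f}), "
          f"variance={variance:.3f} (theory {k / 12:.3f}), "
          f"within 1 SD={within:.3f}")
```

For a single uniform variable the share within one standard deviation is noticeably below the normal value, and it moves toward that value as more variables are added, which is the numerical counterpart of the histograms becoming more bell‐shaped.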
1.16 Properties of the Normal Distribution
The normal distribution has many interesting properties, but we will present just a few of them. They are very simple to understand and, occasionally, we will have to call on them further on in this book.
First property. The normal curve is a function solely of the mean and the variance. In other words, given only a mean and a variance of a normal distribution, we can find all the values of the distribution and plot its curve using the equation of the normal curve (technically,