Biostatistics Decoded. A. Gouveia Oliveira

population, the average weight and its variation, and in order to estimate the variation it is required to have at least two observations.

      In summary, for the modern approach to sampling to be valid, sampling must be done at random. The representativeness of a sample is primarily determined by the sampling method used, not by the sample size. Sample size determines only the precision of the population estimates obtained with the sample.

      Now, if sample size has no relationship to representativeness, does this mean that sample size has no influence at all on the validity of the estimates? No, it does not. Sample size is of importance to validity because large sample sizes offer protection against accidental errors during sample selection and data collection, which might have an impact on our estimates. Examples of such errors are selecting an individual who does not actually belong to the population under study, measurement errors, transcription errors, and missing values.

An illustration of inference with interval attributes II.

      We have eliminated a lot of subjectivity by putting the notion of sample representativeness within a convenient framework. Now we must try to eliminate the remaining subjectivity in two other statements. First, we need to find a way to determine, objectively and reliably, the limits for population proportions and averages that are consistent with the samples. Second, we need to be more specific when we say that we are confident about those limits. Terms like confident, very confident, or quite confident lack objectivity, so it would be very useful if we could express quantitatively our degree of confidence in the estimates. In order to do that, as we have seen, we need a measure of the variation of the values of an attribute.

      The first problem could be addressed by using the difference between the maximum and minimum values, a quantity commonly called the range. The range, however, is unstable: because it is determined entirely by the two most extreme observations, a single outlier can change it dramatically.

      The problem of instability can be minimized if, instead of using the minimum and maximum to describe the dispersion of values, we use two other measures of location, the lower and upper quartiles. The lower quartile (also called the 25th percentile) is the value below which one‐quarter, or 25%, of all the values in the dataset lie. The upper quartile (or 75th percentile) is the value below which three‐quarters, or 75%, of all the values in the dataset lie (note, incidentally, that the median is the same as the 50th percentile). The advantage of the quartiles over the extreme limits is that they are more stable: the addition of one or two extreme values to the dataset will probably not change the quartiles.
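As a concrete sketch of the quartiles described above, using hypothetical data and Python's standard-library statistics module:

```python
# Quartiles as percentiles: a minimal sketch with hypothetical data.
import statistics

values = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12]

# quantiles(..., n=4) cuts the data at the 25th, 50th and 75th percentiles;
# the "inclusive" method interpolates between observed values.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(q1, median, q3)  # lower quartile, median (50th percentile), upper quartile
```

For this dataset the lower quartile is 4.25, the median 6.5, and the upper quartile 8.75.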

An illustration of the measures of dispersion derived from measures of location.

      However, we still have the problem of having to deal with two values, which is certainly not as practical, or as easy to remember and to reason with, as a single value. One way around this could be to use the difference between the upper quartile and the lower quartile to describe the dispersion. This is called the interquartile range, but the interpretation of this value is not straightforward: it is not amenable to mathematical treatment and therefore it is not a very popular measure, except perhaps in epidemiology.
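A short sketch with hypothetical values shows why the interquartile range is more stable than the full range when an extreme value is present:

```python
# Interquartile range vs full range: a minimal sketch with hypothetical data
# containing one extreme value (100).
import statistics

values = [2, 4, 4, 5, 6, 7, 8, 9, 10, 100]

q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1                           # interquartile range: Q3 minus Q1
full_range = max(values) - min(values)  # range: maximum minus minimum

print(iqr, full_range)  # the outlier inflates the range but barely moves the IQR
```

Here the single extreme value pushes the range up to 98, while the interquartile range stays at 4.5.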

      So what are we looking for in a measure of dispersion? The ideal measure should have the following properties: unicity, that is, it should be a single value; stability, that is, it should not change much if more observations are added; interpretability, that is, its value should be meaningful and easy to understand.

      Let us now consider other measures of dispersion. Another possible measure could be the average of the deviations of all individual values about the mean or, in other words, the average of the differences between each value and the mean of the distribution. This would be an interesting measure, being both a single value and easy to interpret, since it is an average. Unfortunately, it would not work because the differences from the mean are negative in those values smaller than the mean and positive in those values greater than the mean. The positive and negative differences cancel exactly: by the very definition of the mean, the sum of the deviations about the mean is always zero, regardless of the magnitude of the dispersion.
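The cancellation can be verified in a couple of lines of plain Python, with hypothetical data:

```python
# Signed deviations from the mean always cancel out, so their average
# cannot measure dispersion. Hypothetical data, plain Python.
values = [1, 3, 5, 7, 9]
mean = sum(values) / len(values)        # mean = 5.0

signed_deviations = [x - mean for x in values]
print(sum(signed_deviations) / len(values))  # → 0.0, whatever the spread
```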

      Actually, what we want is the average of the size of the differences between the individual values and the mean. We do not really care about the direction (or sign) of those differences. Therefore, we could use instead the average of the absolute value of the differences between each value and the mean. This quantity is called the absolute mean deviation. It satisfies the desired properties of a summary measure: single value, stability, and interpretability. The mean deviation is easy to interpret because it is an average, and people are used to dealing with averages. If we were told that the mean of some patient attribute is 256 mmol/l and the mean deviation is 32 mmol/l, we could immediately figure out that about half the values were in the interval 224–288 mmol/l, that is, 256 − 32 to 256 + 32.
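A minimal sketch of the absolute mean deviation; the values below are invented so that the mean comes out at 256, echoing the mmol/l example in the text:

```python
# Absolute mean deviation: the average of the absolute differences
# between each value and the mean. Hypothetical data with mean 256.
values = [250, 220, 290, 264, 256]
mean = sum(values) / len(values)                       # 256.0

mad = sum(abs(x - mean) for x in values) / len(values)
print(mad)  # → 16.8
```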

      There is a small problem, however. The mean deviation uses absolute values, and absolute values are quantities that are difficult to manipulate mathematically. Actually, they pose so many problems that it is standard mathematical practice to square a value when one wants the sign removed. Let us apply that method to the mean deviation. Instead of using the absolute value of the differences about the mean, let us square those differences and average the results. We will get a quantity that is also a measure of dispersion. This quantity is called the variance. The way to compute the variance is, therefore, first to find the mean, then subtract the mean from each value, square each difference, and add up all those squared differences. The resulting quantity is called the sum of squares about the mean, or just the sum of squares. Finally, we divide the sum of squares by the number of observations to get the variance.
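The computation just described, translated step by step (hypothetical data; note that dividing by the number of observations gives the population variance, whereas the sample variance conventionally divides by n − 1):

```python
# Variance, computed exactly as described: mean, squared deviations,
# sum of squares, then divide by the number of observations.
# Hypothetical data; n divisor (population variance).
values = [250, 220, 290, 264, 256]
n = len(values)
mean = sum(values) / n                                 # 256.0

sum_of_squares = sum((x - mean) ** 2 for x in values)  # sum of squares about the mean
variance = sum_of_squares / n
print(variance)  # → 510.4
```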

      Because the differences are squared, the variance is expressed in the square of the attribute’s units, something strange like mmol²/l². This is not a problem when we use the variance for calculations, but in presentations it would be rather odd to report results in squared units. To put things right we convert these awkward units back into the original units by taking the square root of the variance. The result is also a measure of dispersion and is called the standard deviation.
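The final step can be sketched as follows (hypothetical data again):

```python
# Standard deviation: the square root of the variance, which restores
# the attribute's original units (e.g. mmol/l instead of mmol²/l²).
# Hypothetical data.
import math

values = [250, 220, 290, 264, 256]
n = len(values)
mean = sum(values) / n
variance = sum((x - mean) ** 2 for x in values) / n  # 510.4, in squared units

std_dev = math.sqrt(variance)                        # back in the original units
print(round(std_dev, 2))
```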

      As a measure of dispersion, the standard deviation is single valued and stable, but what can be said about its interpretability? Let us see: the standard deviation is the square root of the average of the squared differences between individual

