Biostatistics Decoded. A. Gouveia Oliveira
of several nominal attributes in a single table.
When we look at a table, such as the ones shown in Figure 1.18, we are evaluating how the individual values are distributed in our sample. Such a display of data is called a frequency distribution.
Tabulations of data with absolute and relative frequencies are the best way of presenting binary and categorical data. Tables are a very compact means of data presentation, and tabulation does not involve any significant loss of information. When presenting the results of a research study, the usual practice is to present all nominal variables in a single table, as illustrated in Figure 1.19.
In the table, females and self‐medicated are binary attributes. With binary variables it is convenient to present the absolute and relative frequency of just one of the attribute values, because presenting frequencies for both values would be redundant. Education is a categorical variable, but may also be considered an ordinal variable. We know it is categorical because the percentages total 100.0%. Self‐referred diagnosis is a multi‐valued attribute and cardiovascular, endocrine, osteoarticular, and neurologic disease are four binary variables. We know that because the percentages do not sum to 100%, meaning that each subject may have more than one disease.
We can use tables for ordinal and interval data as well, provided the number of different values is not too large. In those tables, we present the values in ascending order and write down the absolute and relative frequencies, as we did with binary and categorical data. For each value we can also add the cumulative frequency, that is, the percentage of observations in the dataset that are equal to or smaller than that value. If the number of values is large, then it is probably better to group the values into larger intervals, as in Figure 1.20, but this will lead to some loss of information. A more convenient way of summarizing interval attributes, and ordinal attributes that have many different values, is by using descriptive statistics.
Figure 1.20 Tabulation of ordinal and interval data.
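The tabulation just described, with absolute, relative, and cumulative frequencies, can be sketched in a few lines of Python. This is only an illustration: the function name frequency_table and the scores in the example are hypothetical and are not taken from the figures.

```python
from collections import Counter


def frequency_table(values):
    """Return one row per distinct value, in ascending order:
    (value, absolute frequency, relative frequency %, cumulative frequency %).
    """
    counts = Counter(values)
    n = len(values)
    rows = []
    running = 0  # number of observations equal to or smaller than the current value
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value], 100 * counts[value] / n, 100 * running / n))
    return rows


# Hypothetical ordinal scores, made up for illustration only.
scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

print("value    n      %   cum %")
for value, absolute, relative, cumulative in frequency_table(scores):
    print(f"{value:>5} {absolute:>4} {relative:6.1f} {cumulative:7.1f}")
```

The cumulative percentage of the last row is always 100, which is a quick check that no observation was left out of the table.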
The following are some general rules to guide the description of study samples:
Keep in mind that the idea of using summary statistics is to display the data in an easy‐to‐grasp format while losing as little information as possible.
Begin by understanding what scale of measurement was used with each attribute.
If the scale is binary or categorical, the appropriate method is tabulation, and both the absolute and relative frequencies should always be displayed.
If the scale is ordinal, the mean and standard deviation should not be presented, because arithmetic operations are not allowed with ordinal scales; instead, present the median and one or more of the other measures of dispersion, such as the limits, the range, or the interquartile range.
If the scale is interval, the mean and the standard deviation should be presented unless the distribution is very asymmetrical about the mean. In this case, the median and the limits may provide a better description of the data.
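The rules above can be summarized as a small decision procedure. The sketch below is one possible reading of them in Python, not a definitive implementation: the function name describe, the scale labels, and the asymmetrical flag are assumptions introduced here, and the decision of whether a distribution is very asymmetrical is left to the analyst.

```python
import statistics
from collections import Counter


def describe(values, scale, asymmetrical=False):
    """Summarize a list of values according to its scale of measurement."""
    if scale in ("binary", "categorical"):
        # Tabulation: absolute and relative frequency of each value.
        n = len(values)
        return {v: (c, 100 * c / n) for v, c in Counter(values).items()}
    if scale == "ordinal" or (scale == "interval" and asymmetrical):
        # Median and limits; no arithmetic on the values is needed
        # beyond ordering them.
        return {"median": statistics.median(values),
                "min": min(values),
                "max": max(values)}
    if scale == "interval":
        # Mean and sample standard deviation.
        return {"mean": statistics.mean(values),
                "sd": statistics.stdev(values)}
    raise ValueError(f"unknown scale: {scale!r}")


# Hypothetical data, made up for illustration only.
print(describe(["F", "M", "F", "F"], "binary"))
print(describe([1, 2, 2, 3, 5], "ordinal"))
print(describe([61.0, 64.5, 70.2, 58.3, 66.1], "interval"))
```

The limits were chosen here as the measure of dispersion for ordinal data only to keep the example short; the range or the interquartile range would serve equally well.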
Figure 1.21 shows a typical presentation of a table describing the information obtained from a sample of patients with benign prostate hyperplasia. For some attributes, the values are presented separated by a ± sign. This is a common way of displaying the mean and the standard deviation, but a footnote should state what those values represent. The attributes age, PV, PSA, Qmax, and PVR are interval variables, while IPSS, QoL, and IIEF are ordinal variables. We know that because the former have units and the latter do not. We know that AUR is a binary variable because it is displayed as a single value. Adverse events is a multi‐valued attribute and its values are binary variables, and we know that because the percentages do not sum to 100%.
Figure 1.21 Table with summary statistics describing the information collected from a sample.
1.13 Sampling Variation
Why does data analysis not stop after descriptive statistics are obtained from the data? After all, we have obtained population estimates of means and proportions, which is the information we were looking for. That is probably the thinking of many people, who believe that they have to present a statistical analysis only because otherwise they will not get their paper published.
Actually, sample means and proportions are even called point estimates of a population mean or proportion, because they are unbiased estimators of those quantities. However, that does not mean that the sample mean or sample proportion has a value close to the population mean or the population proportion. We can verify that with a simple experiment.
Let us consider for now only sample means from interval variables. With the random number generator of the computer we can create an interval variable and obtain a number of random samples of identical size n of that variable. Figure 1.22 shows some of the results of that experiment, displaying the plots of consecutive samples from an interval variable, with a horizontal line representing the sample means. It is quite clear that the sample means have a different value every time a sample is taken from a population. This phenomenon is called sampling variation.
Figure 1.22 Illustration of the phenomenon of sampling variation. Above are shown plots of the values in random samples of size n of an interval variable. Horizontal lines represent the sample means. Below is shown a histogram of the distribution of sample means of a large number of random samples. Superimposed is the theoretical curve that would be obtained if an infinite number of sample means were obtained.
Now let us continue taking random samples of size n from that variable. The means will keep coming up with a different value every time. So we plot the means in a histogram to see if there is some discernible pattern in the frequency distribution of sample means. What we will eventually find is the histogram shown in Figure 1.22. If we could take an infinite number of samples and plot the sample means, we would end up with a graph with the shape of the curve in the same figure.
What we learn from that experiment is that the means of interval attributes of independent samples of a given size n, obtained from the same population, are subject to random variation. Therefore, sample means are random variables, that is, they are variables because they can take many different values, and they are random because the values they take are determined by chance.
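The experiment is easy to reproduce. The following Python sketch is one way of doing it, under assumptions that are not part of the text: the population is taken to be normally distributed with mean 100 and standard deviation 15, the sample size is 25, and the function name sample_means is hypothetical. Any other interval variable could be used instead.

```python
import random
import statistics


def sample_means(population_mean, population_sd, n, n_samples, seed=0):
    """Draw n_samples random samples of size n and return their means.

    The population is assumed, for illustration, to be normally distributed.
    """
    rng = random.Random(seed)  # fixed seed so the experiment can be repeated
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(population_mean, population_sd) for _ in range(n)]
        means.append(statistics.mean(sample))
    return means


means = sample_means(population_mean=100, population_sd=15, n=25, n_samples=2000)

# The first few sample means differ from one another and from the population mean.
print([round(m, 1) for m in means[:5]])

# Summary of the distribution of the sample means.
print("mean of sample means:", round(statistics.mean(means), 2))
print("SD of sample means:  ", round(statistics.stdev(means), 2))
print("smallest and largest:", round(min(means), 1), round(max(means), 1))
```

Plotting the list means in a histogram, with any plotting tool, should give a picture similar to the lower panel of Figure 1.22.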
We also learn, by looking at the graph in Figure 1.22, that sample means can have very different values and we can never