Biostatistics Decoded. A. Gouveia Oliveira

that we can assume that we are measuring a continuous attribute with a somewhat faulty instrument whose measurement error varies slightly across the range of values. It is as if we were measuring lengths with a tape measure whose marks had been erased in some sections, so that we have to take approximate readings there. In such a case, the attribute would appear to have been measured on an ordinal scale when it has actually been measured on an interval scale. This is why data obtained with some clinical questionnaires are often presented and analyzed as if they were interval data.

      So far, we have been considering attributes measured in interval or ordinal scales. However, we are often interested in attributes that may be characterized only by their presence or absence (e.g. family history of asthma) or that classify subjects into two groups (e.g. males and females, death and survival).

      As we saw in Section 1.2, attributes taking only two values are called binary attributes. They represent the most elementary type of measurement and, therefore, convey the smallest amount of information. It is useful to think of binary attributes as attributes that may be on or off, because then the above distinction is not necessary. For example, we may think of the “sex” attribute simply as “male sex,” and of its values as yes and no. Similarly, the outcome could be thought of as only “survival,” with values yes and no. This is the same as for the family history of asthma, which also has the values yes and no.

      We could convey the same information as yes/no by using the numerical system. Therefore, we could give the attribute the value 1 to mean that it was present, and 0 to mean that it was absent. This is much more appropriate, because now we can think of binary variables not as categories, but as numerical variables that happen to take only two possible values, 0 and 1.

      Furthermore, observations from binary variables are commonly presented as relative frequencies, for example, 37% females or 14% with a family history of asthma. If we adopt the 0/1 values for binary variables, those proportions are nothing more than the means of a variable with values 0 and 1. If males have value 0 and females 1, then in a sample of 200 subjects with 74 females the values of the attribute sex sum to 74, which, divided by 200 (the sample size), gives 0.37, or 37%.
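      The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical, and the data are simply the 200 subjects with 74 females from the example, coded 1 for female and 0 for male.

```python
# Hypothetical sample from the example in the text:
# 200 subjects, 74 females (coded 1) and 126 males (coded 0).
sex = [1] * 74 + [0] * 126

sample_size = len(sex)       # 200 subjects
total = sum(sex)             # summing the 0/1 values counts the females
mean = total / sample_size   # the mean of a 0/1 variable is a proportion

print(f"sum = {total}, n = {sample_size}, mean = {mean:.2f}")
```

      The mean of the 0/1 variable and the proportion of females are the same number, which is why binary variables can be handled as numerical variables.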

      Sampling is such a central issue in biostatistics that an entire chapter of this book is devoted to it. This is necessary for two main reasons: first, because understanding statistical methods requires a clear understanding of sampling phenomena; second, because most people misunderstand the purpose of sampling.

      Sampling is a relatively recent addition to statistics. For almost two centuries, statistical science was concerned only with censuses, the study of entire populations. Nearly a century ago, however, people realized that populations could be studied more easily, more quickly, and more economically if observations were taken from only a small part of the population, a sample, instead of from the whole population. The basic idea was that, provided a sufficient number of observations were made, the patterns of interest in the population would be reproduced in the sample. The measurements made in the sample would then mirror the measurements in the population.

Figure: The classical view of the purpose of sampling.

Figure: The relationship between representativeness and sample size in the classical view of sampling.

      Some people might say that the sample size should be in proportion to the total population. If so, this would mean that an investigation on the prevalence of, say, chronic heart failure in Norway would require a much smaller sample than the same investigation in Germany. This makes little sense. Now suppose we want to investigate patients with chronic heart failure. Would a sample of 100 patients with chronic heart failure be representative? What about 400 patients? Or do we need 1000 patients? In each case, the sample size is always an almost insignificant fraction of the whole population.

      If it does not make much sense to think that the ideal sample size is a certain proportion of the population (even more so because in many situations the population size is not even known), would a representative sample then be one that contains all the patterns that exist in the population? If so, how many people would we have to sample to make sure that all possible patterns in the population also exist in the sample? For example, some findings typical of chronic heart failure, like an S3‐gallop and alveolar edema, are present in only 2% or 3% of patients. Taking 2% for each and assuming they are independent, the combination of these two findings should exist in only 0.02 × 0.02 = 0.0004 of patients, or 1 out of 2500. Does this mean that no study of chronic heart failure with fewer than 2500 patients should be considered representative? And what should we do when the structure of the population is unknown?
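      The "1 out of 2500" figure follows from multiplying the two prevalences. The short Python sketch below reproduces that calculation with exact fractions. It assumes a prevalence of 2% for each finding (the lower of the two values quoted) and independence between the findings, as the text does. The function name is hypothetical and introduced only for illustration.

```python
from fractions import Fraction


def joint_prevalence(p_first, p_second):
    """Prevalence of two findings occurring together, assuming independence."""
    return p_first * p_second


# Assumed prevalences: 2% for each finding.
p_s3_gallop = Fraction(2, 100)
p_alveolar_edema = Fraction(2, 100)

p_both = joint_prevalence(p_s3_gallop, p_alveolar_edema)
patients_per_case = 1 / p_both   # patients needed, on average, to see one case

print(f"joint prevalence = {p_both} (1 out of {patients_per_case})")

# With 3% for each finding the combination is somewhat more common.
p_both_at_3_percent = joint_prevalence(Fraction(3, 100), Fraction(3, 100))
print(f"at 3% each: {p_both_at_3_percent}")
```

      At 2% each, the combination is expected in 1 out of 2500 patients. At 3% each it is 9 out of 10 000, or roughly 1 out of 1100, so the figure in the text corresponds to the 2% end of the quoted range.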

      The problem of lack of objectivity in defining sample representativeness can be circumvented if we adopt a different reasoning when dealing with samples. Let us accept that we have no means of knowing what the population structure truly is, and all we can possibly have is a sample of the population. Then, a realistic procedure would be to look at the sample and, by inspecting its structure, formulate a hypothesis about the structure of the population. The structure of the sample constrains the hypothesis to be consistent with the observations.

