Business Experiments with R. B. D. McCullough
some do not. It should be apparent that our data do not constitute a random sample from the population, but a selected sample from the population. In the present case, there is no data on those who were denied credit, some of whom would have defaulted and some of whom would not have defaulted. This is a general problem called sample selection bias, and it plagues observational data; in such a case, the sample is not representative of the population. Let us be more explicit. The population of credit applicants includes four types of persons:
1 Non‐defaulters who get credit.
2 Non‐defaulters who are denied credit.
3 Defaulters who get credit.
4 Defaulters who are denied credit.
Meanwhile, since the sample consists only of persons who have been granted credit, the sample consists only of categories (1) and (3) above and does not look like the population. Therefore, inferences from the sample do not extrapolate to the population, and using such a model would likely be a mistake.
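The consequence of observing only categories (1) and (3) can be seen in a small simulation. The following R sketch is illustrative only: the creditworthiness score, the default probabilities, and the approval cutoff are all made-up assumptions, not features of any real credit data set.

```r
# Hypothetical simulation; every number below is an illustrative assumption.
set.seed(1)
n <- 100000

# A made-up creditworthiness score; higher means less likely to default.
score <- rnorm(n)

# Assumed relationship: default probability falls as the score rises.
p_default <- plogis(-1 - 1.5 * score)
default <- rbinom(n, size = 1, prob = p_default)

# Assumed lending rule: credit is granted only when the score exceeds a cutoff.
granted <- score > 0

# Cross-tabulate the four types of applicants in the population.
print(table(default = default, granted = granted))

pop_rate    <- mean(default)           # all four categories
sample_rate <- mean(default[granted])  # categories (1) and (3) only

cat("Default rate in population:     ", round(pop_rate, 3), "\n")
cat("Default rate in observed sample:", round(sample_rate, 3), "\n")
```

Under these assumptions the default rate among those granted credit understates the default rate in the full applicant population, because the lender has already screened out many likely defaulters.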
This is a very important problem that bedevils the credit industry, and this problem even has a name: “reject inference,” which is how to conduct inference when there is no data on persons whose credit applications were rejected. Very sophisticated statistical machinery, far beyond the level of this book, has been unleashed on this problem, with only a modicum of success. Indeed, some credit card companies deliberately conduct designed experiments and issue credit to persons who otherwise would have been rejected in order to collect data from categories (2) and (4) so that they may extrapolate their results to the population. However, this is a very expensive solution to the problem, so these types of experiments are rarely performed. The credit card industry largely makes do with sophisticated statistical analyses to answer causal questions.
Finally, there is a modeling problem in the data to which we must draw attention. Suppose there was no reject inference problem, and we knew all the lurking variables. We might run a regression (perhaps a logistic regression, for those of you who know that method) with “default” on the loan as the dependent variable and “age” as one of the independent variables. Suppose further that the coefficient on age is positive and statistically significant. What does this mean?
Such a regression coefficient would not agree with what creditors generally know about the relationship between age and default probability. Based on years of empirical data, they know that young people tend to default more than older people. The nonlinearity in this relationship cannot be captured by a linear regression. What does this nonlinearity mean for our linear regression model? We now have a very substantial modeling problem, and we will get different answers depending on how we model the effect of age on defaults (Is it linear or nonlinear; if nonlinear, what type of nonlinearity?). Does a modeler really want results to be dependent on the choice of how the regression model is built? This is just part of what makes drawing causal inferences from observational data so fraught with danger. When we run experiments, we don't have to worry about any of these things.
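To see how much the answer can depend on the functional form chosen for age, consider the following R sketch. It uses simulated data, and the assumed shape of the true relationship (default risk falling steeply for young borrowers and then leveling off) is a hypothetical choice made only for illustration. It fits one logistic regression that is linear in age and one that adds a squared term, then compares their predictions.

```r
# Hypothetical simulation; the "true" age effect below is an assumption.
set.seed(2)
n <- 20000
age <- round(runif(n, 18, 80))

# Assumed truth: default risk falls steeply with age at first, then levels off.
p_true <- 0.05 + 0.30 * exp(-(age - 18) / 10)
default <- rbinom(n, size = 1, prob = p_true)

# Two candidate models that differ only in how age enters.
fit_linear <- glm(default ~ age, family = binomial)
fit_quad   <- glm(default ~ age + I(age^2), family = binomial)

# Compare predicted default probabilities at a few ages.
ages <- data.frame(age = c(20, 40, 60, 80))
pred <- cbind(
  ages,
  truth     = 0.05 + 0.30 * exp(-(ages$age - 18) / 10),
  linear    = predict(fit_linear, newdata = ages, type = "response"),
  quadratic = predict(fit_quad, newdata = ages, type = "response")
)
print(round(pred, 3))

# A lower AIC indicates the better-fitting specification.
print(AIC(fit_linear, fit_quad))
```

Neither specification matches the assumed truth exactly, and the two give different predicted probabilities at the same age, which is the kind of model dependence described above.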
Exercises
1 1.2.1 Consider the five examples of lurking variables. We already described an experiment and a manipulated variable for the fire damage example. Come up with experiments and manipulated variables to expose the falsity of the observational conclusion for the bombing and drownings examples (we don't know the lurking variable for the other two).
2 1.2.2 For the following cases of observational data, articulate the precise nature of the sample selection.
(a) We wish to determine the effect of education on income by running a regression. (Think of the minimum wage.)
(b) Mutual funds that have been in business longer tend to have higher returns than newer mutual funds. We “know” this because we collected observational data on mutual funds, regressed return on number of years in business, and found that funds with more years had higher returns. (What happened to mutual funds with low returns?)
3 1.2.3 In the text, we asked, “If it is really the case that persons with higher credit limits are less likely to default, can we decrease the default rate simply by giving everybody a higher credit limit?” Definitely not! Why not? What is the lurking variable?
4 Using the data set credit.csv, create the variable “percentage default by age” and plot it against age. The relationship between the two is definitely nonlinear. Describe the nonlinearity and a reason for it.
5 We know that regression results are biased when data have been selected due to a condition on the dependent variable. What happens when the data are selected due to a condition on the independent variable? Subset the data on such a condition and run the regression. Is the slope estimate biased? What about the intercept? What can you conclude?
1.3 Case: Salk Polio Vaccine Trials
By the 1950s, polio had killed hundreds of thousands worldwide and infected tens of thousands per year in the United States. Many victims who did not die were condemned to spend the rest of their lives in “iron lungs” since they were unable to breathe on their own. To say nothing of the misery wrought by the disease, the effect on the community was devastating: parents kept their children indoors all summer, playgrounds were vacant, and one sick student in a class was reason for many healthy students to stay home from school. Jonas Salk developed a vaccine in 1952, and in 1954 two separate field trials were conducted, involving nearly 2 million children and 300 000 volunteers in the United States, Canada, and Finland. These constituted the largest clinical trials in history.
The first trial monitored nearly a million first, second, and third graders. The second graders were given permission slips for their parents to sign, and those children who got parental consent were vaccinated by injection. These children were compared with the non‐vaccinated first and third graders. The observant reader will have noted already that the sample is not random, and generalizing the results to the population would be problematic. Summaries are presented in Table 1.3. It appears that the vaccine worked; children who got vaccinated had a lower infection rate than children who did not get vaccinated.
Table 1.3 Results of first Salk vaccine field trial.
Group | Size | Rate per 100 000 |
---|---|---|
Grade 2 vaccinated | 225 000 | 25 |
Grades 1 and 3 | 725 000 | 54 |
Grade 2 no consent | 125 000 | 44 |
Notice, however, that the number of second graders who did not receive permission was more than half the number who did get permission (125 000 versus 225 000). If families who gave consent were similar to families who did not give consent, then we can assume that the difference in polio rates between the grade 2 vaccinated students and the grades 1 and 3 unvaccinated students is a good estimate of the overall effect of the vaccine at reducing polio. But are the families similar? Further analysis of the trial data showed that, on average, families who gave consent had higher incomes than families who did not give consent. What else might we deduce from this income difference, with respect to educational differences? Can we infer something about how often the children visited the family doctor and other such variables that are correlated with income and might affect the children's susceptibility to polio? How might this affect the results of the trial?
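The comparisons in Table 1.3 can be made concrete with a little arithmetic. The following R sketch uses only the group sizes and rates from the table; the column names are our own, and the case counts are approximations implied by the rounded rates rather than figures reported by the trial.

```r
# Sizes and rates copied from Table 1.3; column names are illustrative.
trial <- data.frame(
  group = c("Grade 2 vaccinated", "Grades 1 and 3", "Grade 2 no consent"),
  size  = c(225000, 725000, 125000),
  rate  = c(25, 54, 44)  # polio cases per 100 000
)

# Approximate case counts implied by the rounded rates.
trial$approx_cases <- trial$size * trial$rate / 100000

# Rate of each group relative to the vaccinated second graders.
trial$ratio_to_vaccinated <- trial$rate / trial$rate[1]

print(trial)

# Share of second graders whose parents gave consent.
consent_share <- trial$size[1] / (trial$size[1] + trial$size[3])
cat("Consent share in grade 2:", round(consent_share, 3), "\n")
```

Note that the two unvaccinated groups have different rates (54 versus 44 per 100 000). That difference is one hint that the groups being compared may not be alike in ways that matter for polio.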
A more subtle problem arises because everybody involved in the experiment, including the physicians who monitored the children for polio, knew who had been vaccinated and who had not. Specifically, the experiment was not double‐blinded. In a blinded experiment (or single‐blind experiment), only the experimenter knows whether the subject gets the treatment or the placebo; the subject does not know. In a double‐blind experiment, neither the