Business Experiments with R. B. D. McCullough

1 Cloud cover. Planes couldn't fly in the clouds and had to fly above the clouds. If the weather was cloudy, the enemy wouldn't bother to send up fighters, and accuracy was terrible because in that era, bombing depended on sighting landmarks on the ground.

2 There is a third variable in the background – the seriousness of the fire – that is responsible for the observed relationship. More serious fires require more firefighters and also cause more damage.

3 The lurking variable that causes both ice cream sales and an increase in drowning deaths is season of the year, i.e. summer.

4 Of course, persons who eat five fruits and veggies per day are different from those who do not. How, precisely, they are different we do not know. Just because there is a lurking variable does not mean that we can identify it.

5 The women who chose HRT were different from other women in ways for which the observational study could not control. Again, just because we can deduce the existence of a lurking variable, it does not follow that we can say what the variable is.

The above “experiments” (the word is in quotes because they really aren't experiments) are actually just observational data masquerading as experiments. The way to see this is to perform a thought experiment: think about manipulating one of the variables as it would be manipulated in a true experiment. In the fire example above, imagine that firefighters had responded to a fire and we then ordered 100 more firefighters to show up. Would we expect there to be more damage simply because more firefighters were present? Of course not. As will be seen, designed experiments eliminate the effect of the lurking variables.
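The thought experiment can be mimicked with a small simulation. The following is a minimal sketch, not an analysis of real fire data: the 1 to 10 severity scale, the coefficients, the noise levels, and the seed are all hypothetical, and the model assumes that crew size has no effect at all on damage. It compares the correlation seen in observational data, where serious fires draw more firefighters, with the correlation seen when crew size is assigned at random.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(42)  # arbitrary seed, for reproducibility only
n = 5000

# Lurking variable: seriousness of each fire (hypothetical 1-10 scale).
severity = [rng.uniform(1, 10) for _ in range(n)]

# Assumption: damage depends on severity only, not on crew size.
damage = [10 * s + rng.gauss(0, 10) for s in severity]

# Observational data: more serious fires draw more firefighters.
sent_observed = [5 * s + rng.gauss(0, 5) for s in severity]

# Designed experiment: crew size is assigned at random, ignoring severity.
sent_random = [rng.uniform(5, 50) for _ in range(n)]

corr_observational = pearson(sent_observed, damage)
corr_experimental = pearson(sent_random, damage)

print(f"observational correlation: {corr_observational:.2f}")
print(f"experimental correlation:  {corr_experimental:.2f}")
```

Under these assumptions the observational correlation is strongly positive even though firefighters cause no damage in the model, while random assignment breaks the link between crew size and severity and the correlation falls to roughly zero.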

Many authors conflate the concepts of “lurking variable” and “confounding variable,” treating them as one and the same, but this is a mistake. Though both make it difficult for the analyst to interpret results, they do so through different mechanisms. A lurking variable affects observational data, while a confounding variable affects experimental data. In this chapter we only encounter lurking variables. In later chapters we will encounter confounding. The “Learning More” section for this chapter describes the differences in detail.

1.2.2 Sample Selection Bias

Sample selection bias plagues nonexperimental, i.e. observational, data. Its effects are especially pernicious when selection is based on the dependent variable (the effects are not so bad when selection is based on an independent variable). To motivate this important idea, we generated some linear data with a zero intercept and a slope of unity, where $X$ takes on values from 10 to 20. If we fit a regression line of $Y$ on $X$, we get estimates of the intercept and the slope, with standard errors in parentheses [fitted equation not reproduced here]. The $t$-statistics to test the null hypotheses that the coefficients equal zero are the coefficients divided by the standard errors [values not reproduced here]. Using 2 as a rough cutoff for a 5% significance level, we observe that the intercept is not significantly different from zero while the slope is significantly different from zero.
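The calculation described above can be sketched in a few lines. This is an illustration on freshly simulated data, not a reproduction of the book's numbers: the sample size of 1000, the error standard deviation of 2, the uniform distribution of $X$, and the seed are assumptions, and `ols` is a hypothetical helper written for this sketch. Only the zero intercept, the unit slope, and the 10 to 20 range of $X$ come from the text.

```python
import math
import random


def ols(x, y):
    """Least-squares fit of y on x; returns coefficients and standard errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)
    se_slope = math.sqrt(s2 / sxx)
    se_intercept = math.sqrt(s2 * (1 / n + x_bar ** 2 / sxx))
    return intercept, slope, se_intercept, se_slope


rng = random.Random(1)  # arbitrary seed, for reproducibility only
n = 1000                # assumed sample size

# True model: intercept 0, slope 1, X between 10 and 20.
x = [rng.uniform(10, 20) for _ in range(n)]
y = [xi + rng.gauss(0, 2) for xi in x]  # assumed error sd of 2

b0, b1, se0, se1 = ols(x, y)

# t-statistic for H0: coefficient = 0 is the coefficient over its standard error.
t0 = b0 / se0
t1 = b1 / se1

print(f"intercept = {b0:.3f} ({se0:.3f}), t = {t0:.2f}")
print(f"slope     = {b1:.3f} ({se1:.3f}), t = {t1:.2f}")
```

With data generated this way, the slope's $t$-statistic is far above the rough cutoff of 2, while the intercept's $t$-statistic will typically, though not always, fall below it.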

Try it!

Use the data in the file SampleSelection.csv to repeat the above analysis by running the regression for the full sample and again only for those observations for which $Y > 15$.

These results are consistent with the true intercept of zero and the true slope of unity. Suppose that we only get to see observations for which $Y > 15$. Then the results are different [fitted equation not reproduced here]. These results are not consistent with the truth: the intercept is significantly above 0, and the slope is significantly below unity. Figure 1.3 depicts the situation. Note that for observations close to the cutoff $Y = 15$, for any value of $X$, observations with positive errors will be included in the sample and observations with negative errors will be dropped from the sample. Thus, in the selected sample the error is no longer independent of the observed data: observations that lie close to but above the cutoff are quite likely to have a positive error. This violates one of the assumptions for linear regression to be unbiased. Any time that selection into the sample depends on the value of the dependent variable, sample selection bias is a problem.

Figure 1.3 Sample selection bias. Dashed line for observations with $Y > 15$, solid line for all observations, and horizontal dotted line at $Y = 15$.
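The effect of selecting on the dependent variable can be reproduced on simulated data. The sketch below is illustrative only and does not use SampleSelection.csv: the sample size, the error standard deviation of 2, the uniform distribution of $X$, and the seed are assumptions, and `fit_line` is a hypothetical helper. It fits one line to all observations and another to only those with $Y > 15$, then looks at the errors of the surviving low-$X$ observations.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


rng = random.Random(7)  # arbitrary seed, for reproducibility only
n = 2000                # assumed sample size

# True model: intercept 0, slope 1, X between 10 and 20.
x = [rng.uniform(10, 20) for _ in range(n)]
y = [xi + rng.gauss(0, 2) for xi in x]  # assumed error sd of 2

# Regression on every observation (the solid line in the figure).
full_intercept, full_slope = fit_line(x, y)

# Regression on the selected sample only, Y > 15 (the dashed line).
kept = [(xi, yi) for xi, yi in zip(x, y) if yi > 15]
sel_intercept, sel_slope = fit_line([p[0] for p in kept], [p[1] for p in kept])

# A kept observation with X < 13 must have an error above 2 to clear the cutoff.
low_x_errors = [yi - xi for xi, yi in kept if xi < 13]
mean_low_x_error = sum(low_x_errors) / len(low_x_errors)

print(f"all data: intercept {full_intercept:.2f}, slope {full_slope:.2f}")
print(f"Y > 15:   intercept {sel_intercept:.2f}, slope {sel_slope:.2f}")
print(f"mean error of kept observations with X < 13: {mean_low_x_error:.2f}")
```

In this setup the selected-sample slope falls well below unity and the intercept rises well above zero, because the low-$X$ observations that survive the cutoff are exactly those with large positive errors.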

If we have an umbrella problem and we only want to make a prediction of $Y$ for some value of $X$ when $Y > 15$, then there is no difficulty. We can get good predictions of $Y$ for values of $Y$ larger than 15. If we have a rain dance problem and we need a good estimate of the true slope, then clearly we cannot use the regression to determine the effect that a change in $X$ has upon $Y$, because the only regression we can run (the dashed line) has a biased slope (see the solid line for the true slope).
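The contrast between the two uses of the selected-sample regression can be checked on simulated data. This is a sketch under assumed settings (sample sizes, an error standard deviation of 2, uniformly distributed $X$, and the seed are all hypothetical, as are the helpers `fit_line`, `simulate`, and `mean_error`). It fits a line on observations with $Y > 15$, then measures the average prediction error on fresh data, once for new observations that also satisfy $Y > 15$ and once for all new observations.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def simulate(rng, n):
    """Draw (x, y) pairs from the true model: intercept 0, slope 1."""
    pairs = []
    for _ in range(n):
        xi = rng.uniform(10, 20)
        pairs.append((xi, xi + rng.gauss(0, 2)))  # assumed error sd of 2
    return pairs


rng = random.Random(3)  # arbitrary seed, for reproducibility only

# Fit on the only data we get to see: observations with Y > 15.
train = [(x, y) for x, y in simulate(rng, 4000) if y > 15]
b0, b1 = fit_line([x for x, _ in train], [y for _, y in train])

fresh = simulate(rng, 4000)


def mean_error(pairs):
    """Average of (predicted - actual) over the given observations."""
    return sum(b0 + b1 * x - y for x, y in pairs) / len(pairs)


# Prediction within the selected population.
error_selected = mean_error([(x, y) for x, y in fresh if y > 15])
# The same line applied to the whole population.
error_everyone = mean_error(fresh)

print(f"fitted slope on selected data: {b1:.2f} (true slope is 1)")
print(f"mean prediction error, Y > 15 only: {error_selected:.2f}")
print(f"mean prediction error, everyone:    {error_everyone:.2f}")
```

Under these assumptions the fitted line predicts well on average within the selected population, yet its slope is far from the true value of 1 and its predictions overshoot once it is applied to observations that were never eligible for the sample.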

Now let us return to the credit example, and suppose that we had lots of variables that included all possible lurking variables. Suppose we knew the model so that there was no garden of forking paths problem. Now, could we really get causal answers out of these data? Suppose we brought in a statistical expert on getting causal results from observational data. Could he do it? The answer is “no.”

Aside from the garden of forking paths, there is a more serious problem with these data, and it is rather subtle. Let us consider where our data came from. People apply for credit. Some are granted credit, while others are not.
