Business Experiments with R
B. D. McCullough
analyze your own business experiments. I hope you enjoy learning from Bruce as much as I did.
Elea McDonnell Feit
Philadelphia, PA
About the Companion Website
This book is accompanied by a companion website:
www.wiley.com/go/mccullough/businessexperimentswithr
The website includes datasets.
1 Why Experiment?
We can learn from data, but what we can learn depends on the way the data were generated. When nature generates the data and all variables are free to operate as they will, it is very easy to learn about correlations. Very often, though, it is not enough to know that price decreases are correlated with increases in quantity sold; we need to know by how much the quantity sold will increase when price is decreased by a specific amount, and correlations are not up to this task. To learn about causation, we need to restrict the way some variables can act, and this can only be done with an experiment. We don't need sophisticated statistical methods to conduct experiments; the statistics learned in a first non‐calculus‐based statistics course is often more than sufficient to conduct a wide variety of useful experiments. Before conducting any experiments, we need to be very precise about the reasons that observational data are not up to the task of yielding causal insights.
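A small simulation can make this concrete. The sketch below (not from the book; the variable names and numbers are invented for illustration) builds a world in which an unobserved "wealth" variable drives both newspaper circulation and life expectancy, with no causal path between the two. The observed correlation is nonetheless strong, which is exactly why correlation alone cannot establish causation.

```r
# Hypothetical simulation: a lurking variable ("wealth") drives both
# newspaper circulation and life expectancy. Neither causes the other,
# yet the two are strongly correlated.
set.seed(1)
wealth     <- rnorm(500)                       # unobserved confounder
newspapers <- 2 * wealth + rnorm(500)          # wealth -> newspapers
lifeexp    <- 70 + 3 * wealth + rnorm(500)     # wealth -> life expectancy

cor(newspapers, lifeexp)   # strong positive correlation, despite no causal link
```

Cutting the wealth-to-newspapers arrow (e.g. setting `newspapers <- rnorm(500)`) drives the correlation to zero, which is what an experiment does by design: it breaks the links between the treatment and any lurking variables.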
After reading this chapter, students should:
Distinguish between observational and experimental data.
Understand that observational data analysis identifies correlation, but cannot prove causality.
Know why it is difficult to establish causality with observational data.
Understand that an experiment is a systematic effort to collect exactly the data you need to inform a decision.
Explain the four key steps in any experiment.
State the “Big Three” criteria for causality.
Identify the conditions that make experiments feasible and cost effective.
Give examples of how experiments can be used to inform specific business decisions.
Understand the difference between a tactical experiment designed to inform a business decision and an experiment designed to test a scientific theory.
1.1 Case: Life Expectancy and Newspapers
Suppose we are interested in determining the reasons that some countries have long life expectancies while others do not. We might begin by examining the relationship between life expectancy and other variables for various countries. The left panel of Figure 1.1 shows a scatterplot of the average life expectancy for several countries versus the number of newspapers per 1000 persons in each country. The data are in WorldBankData.csv. Comparing the fitted linear regression line to the data makes the curvature obvious, so linear regression is not appropriate. In the usual fashion, linearity is induced by applying the natural logarithm transformation to the independent variable, as shown in the right panel. The logged data are still not completely linear, but are much more linear than the original data.
Figure 1.1 Life expectancy vs. newspapers per 1000 (left) and log(newspapers per 1000) (right) for several countries.
Software Details
Reproduce the above graphs using the data file WorldBankData.csv
…
Below is code for both graphs. Note that the second graph uses a new variable, the natural logarithm of newspapersper1000.
df <- read.csv("WorldBankData.csv")   # "df" is the data frame
plot(df$newspapersper1000, df$lifeexp, xlab = "Newspapers per 1000",
     ylab = "Life Expectancy", pch = 19, cex.axis = 1.5, cex.lab = 1.15)
abline(lm(lifeexp ~ newspapersper1000, data = df), lty = 2)
lines(lowess(df$newspapersper1000, df$lifeexp))
plot(log(df$newspapersper1000), df$lifeexp, xlab = "log(Newspapers per 1000)",
     ylab = "Life Expectancy", pch = 19, cex.axis = 1.5, cex.lab = 1.15)
abline(lm(lifeexp ~ log(newspapersper1000), data = df), lty = 2)
lines(lowess(log(df$newspapersper1000), df$lifeexp))
To analyze these data, we can run a regression of life expectancy (LE) in years against the natural logarithm of the number of newspapers per 1000 persons (LN) for a large number of countries in a given year. The results are
(1.1)
where standard errors are in parentheses; both coefficients have very high t‐statistics and are highly significant.
Try it!
Run the above simple regression. You should get the same coefficients and standard errors.
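The regression itself is one line of R. The real exercise uses WorldBankData.csv from the companion website; since that file is not reproduced here, the sketch below simulates stand‐in data (the intercept, slope, and noise level are invented for illustration) so the mechanics can be seen end to end.

```r
# Sketch of the simple regression LE ~ log(newspapers per 1000).
# The real analysis uses WorldBankData.csv; here we simulate stand-in
# data so the snippet runs on its own.
set.seed(2)
newspapersper1000 <- exp(rnorm(100, mean = 4, sd = 1))       # hypothetical
lifeexp <- 50 + 5 * log(newspapersper1000) + rnorm(100, sd = 3)
df <- data.frame(lifeexp, newspapersper1000)

fit1 <- lm(lifeexp ~ log(newspapersper1000), data = df)
summary(fit1)$coefficients   # estimates, standard errors, t values, p values
```

With the real data, `summary(fit1)` reports the coefficients and the standard errors shown in parentheses in Equation (1.1).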
A better analysis would add more variables to the regression to “control” for other factors. So, let's try adding other variables that we expect to drive life expectancy: LHB (natural logarithm of the number of hospital beds per 1000 in the country), LP (natural logarithm of the number of physicians per 1000 in the country), IS (an index of improvements in sanitation), and IW (an index of improvements in water supply). Since we don't believe that newspapers cause longer life expectancy, we would expect that once we include these variables in the regression, the coefficient on LN will be reduced. The results are
(1.2)
The coefficient on LN has not gone to zero; in fact, it hasn't changed much. The coefficients on all but one of the other variables that we know affect life expectancy are insignificant. What are we to make of this?
Try it!
Run the above multiple regression. You should get the same coefficients and standard errors. Be sure you understand why the variables LN and LHB are “significant” while the others are not.
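Extending the model to include the controls is a matter of adding terms to the formula. As above, the sketch below uses simulated stand‐in data (the coefficients and the independence of the regressors are assumptions for illustration; in the real data LN, LHB, LP, IS, and IW come from WorldBankData.csv and are correlated with one another).

```r
# Sketch of the multiple regression with controls, on simulated data.
set.seed(3)
n   <- 200
LN  <- rnorm(n)    # log newspapers per 1000
LHB <- rnorm(n)    # log hospital beds per 1000
LP  <- rnorm(n)    # log physicians per 1000
IS  <- rnorm(n)    # sanitation index
IW  <- rnorm(n)    # water-supply index
LE  <- 60 + 4 * LN + 2 * LHB + rnorm(n, sd = 3)   # hypothetical truth

fit2 <- lm(LE ~ LN + LHB + LP + IS + IW)
summary(fit2)$coefficients   # which controls are "significant"?
```

In this simulated world only LN and LHB truly matter, so their t‐statistics are large while those of LP, IS, and IW are not; in the real data the same pattern appears even though we do not believe newspapers cause longevity, which is the puzzle the text goes on to discuss.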
In reality, life expectancy is affected by a large set of variables in a complex way, and the natural logarithm of newspapers is a good proxy for these other variables. If we have some beliefs about which variables are more likely to be the true causes of an increase in life expectancy, we might be able to build a model that we think represents the cause and effect relationships. If we really want to find the causal effect of newspapers, we might also try more sophisticated methods that involve