Business Experiments with R. B. D. McCullough

to give credit to people who are likely to default, and if we do give credit, we don't want to give more than the person can repay.

default   Female          Male
0         14 349 (79%)    9 015 (76%)
1          3 763 (21%)    2 873 (24%)
Total     18 112          11 888

default   Married         Single          Other
0         10 453 (77%)    12 623 (79%)    288 (76%)
1          3 206 (23%)     3 341 (21%)     89 (24%)
Total     13 659          15 964          377

      Try it!

      We encourage you to replicate the analysis in this chapter using the data in the file credit.csv . Computing crosstabs can be done in a spreadsheet using pivot tables. Most statistical tools also have a cross‐tabulation function.

      df <- read.csv("credit.csv", header=TRUE)
      # Table 1.1
      table1 <- table(df$default, df$sex)  # to get the counts
      table1                               # to print out the table
      prop.table(table1, 2)                # to get column proportions
      prop.table(table1, 1)                # to get row proportions
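If credit.csv is not at hand, the same column proportions can be checked directly from the published counts. The following is a minimal sketch that types in the counts from the default-by-sex table above. The labels "Female" and "Male" are taken from the table header and are an assumption about how the sex column is coded in the file.

```r
# Counts copied from the default-by-sex table (rows: default 0/1, columns: sex).
# matrix() fills column by column, so the first two numbers are the Female column.
table1 <- as.table(matrix(c(14349, 3763, 9015, 2873), nrow = 2,
                          dimnames = list(default = c("0", "1"),
                                          sex = c("Female", "Male"))))

col_props <- prop.table(table1, 2)   # proportions within each column (each sex)
round(100 * col_props)               # percentages, rounded to whole numbers

addmargins(table1, 1)                # append the column totals as a "Sum" row
```

The rounded percentages agree with those printed in the table (79% and 21% for females, 76% and 24% for males), and the "Sum" row matches the totals 18 112 and 11 888.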

Box plots depicting that persons who do not default have higher credit limits than persons who default, while age appears to have no association with default status.

      If it is really the case that persons with higher credit limits are less likely to default, can we decrease the default rate simply by giving everybody a higher credit limit?

      Software Details

      To reproduce Figure 1.2, load the data file credit.csv and run:

      boxplot(limit ~ default, xlab="default", ylab="credit limit", data=df)

      We have thus far looked at how the four variables are associated with default, individually. How might we examine the effects of all the variables at one time in order to answer the two fundamental questions?

equation

      Marital status (married, single, or divorced/widowed) will be represented by two dummy variables, images and images:

equation

      For a married person, images and images, for a person who is divorced/widowed images and images, while for a single person images and images.
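Dummy coding of a three-level variable can be sketched in R as follows. This is an illustration only: the variable name marital, the level labels, and the choice of "single" as the baseline category are assumptions, and the column names produced by R are not the symbols used in the text.

```r
# One person of each marital status (hypothetical toy data).
# The first level listed becomes the baseline; "single" is an assumption here.
marital <- factor(c("married", "single", "other"),
                  levels = c("single", "married", "other"))

# model.matrix() expands the factor into an intercept plus two 0/1 columns,
# one for each non-baseline level.
X <- model.matrix(~ marital)
X
```

Each row of X has at most one of the two dummy columns equal to 1, and the baseline category is the row in which both are 0. Three categories therefore need only two dummy variables.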

      1.2.1 Lurking Variables

      It is not uncommon for an analyst to draw mistaken conclusions from observational data because of lurking variables.
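A small simulation can make the idea concrete. The sketch below uses entirely made-up data, not credit.csv: a hypothetical lurking variable called income drives both the credit limit and the chance of default, while the limit itself has no effect on default at all. The variable names, the coefficients, and the use of logistic regression are assumptions chosen for illustration.

```r
set.seed(1)
n <- 10000

# Hypothetical lurking variable (standardized income).
income <- rnorm(n)

# Credit limit depends on income plus noise.
limit <- 50 + 20 * income + rnorm(n, sd = 10)

# Default depends on income only; limit does not enter this formula.
p_default <- plogis(-1.5 - 1.0 * income)
default <- rbinom(n, 1, p_default)

# Model that ignores the lurking variable.
naive <- glm(default ~ limit, family = binomial)

# Model that includes the lurking variable.
adjusted <- glm(default ~ limit + income, family = binomial)

coef(naive)["limit"]      # negative: higher limits appear to reduce default
coef(adjusted)["limit"]   # much closer to zero once income is accounted for
```

In the simulated world, raising everyone's limit would leave the default rate unchanged, because default was generated from income alone. The observational association between limit and default is real and predictive, but it is not causal.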

      1 During WWII, an analysis of the accuracy of strategic bombing runs showed that Allied bombers were more accurate at lower altitudes than at higher altitudes (this makes sense). The analysis also showed that Allied bombers were more accurate when opposed by enemy fighters than when enemy fighters were not present. Explain.

      2 A scatterplot shows a strong relationship between the number of firefighters at a fire and the dollar amount of the damage caused by the fire. While this relationship may be predictive, it is not causal: it is not true that if fewer firefighters are sent to a fire, the dollar amount of the damage will decrease. What is the missing causal variable?

      3 On a daily basis in a coastal town, there is a positive relationship between ice cream sales and drowning deaths. What is the missing causal variable?

      4 The observational data repeatedly say that persons who eat five fruits and veggies per day have a lower cancer rate than those who don't eat fruits and veggies. The experimental results find no difference in cancer rates. Explain the discrepancy.

      5 A large, expensive observational study by the National Institutes of Health concluded that hormone replacement therapy (HRT) prevents heart disease in postmenopausal women. Consequently many women were placed on HRT. Later, an experiment showed that HRT does not prevent heart disease in postmenopausal women. Explain the discrepancy.

      The resolutions of the above

