Handbook of Regression Analysis With Applications in R. Samprit Chatterjee
2.3 Methodology
2.3.1 MODEL SELECTION
We saw in Section 2.2.1 that hypothesis tests can be used to compare models. Unfortunately, there are several reasons why such tests are not adequate for the task of choosing among a set of candidate models for the appropriate model to use.
In addition to the effects of correlated predictors on
Even ignoring these issues, hypothesis tests don't necessarily address the question a data analyst is most interested in. With a large enough sample, almost any estimated slope will be significantly different from zero, but that doesn't mean that the predictor provides additional useful predictive power. Similarly, in small samples, important effects might not be statistically significant at typical levels simply because of insufficient data. That is, there is a clear distinction between statistical significance and practical importance.
In this section we discuss a strategy for determining a “best” model (or more correctly, a set of “best” models) among a larger class of candidate models, using objective measures designed to reflect a predictive point of view. As a first step, it is good to explicitly identify what should not be done. In recent years, it has become commonplace for databases to be constructed with hundreds (or thousands) of variables and hundreds of thousands (or millions) of observations. It is tempting to avoid issues related to choosing the potential set of candidate models by considering all of the variables as potential predictors in a regression model, limited only by available computing power. This would be a mistake. If too large a set of possible predictors is considered, it is very likely that variables will be identified as important just due to random chance. Since they do not reflect real relationships in the population, models based on them will predict poorly in the future, and interpretations of slope coefficients will just be mistaken explanations of what is actually random behavior. This sort of overfitting is known as “data dredging” and is among the most serious dangers when analyzing data.
The set of possible models should ideally be chosen before seeing any data based on as thorough an understanding of the underlying random process as possible. Potential predictors should be justifiable on theoretical grounds if at all possible. This is by necessity at least somewhat subjective, but good basic principles exist. Potential models to consider should be based on the scientific literature and previous relevant experiments. In particular, if a model simply doesn't “make sense,” it shouldn't be considered among the possible candidates. That does not mean that modifications and extensions of models that are suggested by the analysis should be ignored (indeed, this is the subject of the next three chapters), but an attempt to keep models grounded in what is already understood about the underlying process is always a good idea.
What do we mean by the (or a) “best” model? As was stated on page 4, there is no “true” model, since any model is only a representation of reality (or equivalently, the true model is too complex to be modeled usefully). Since the goal is not to find the “true” model, but rather to find a model or set of models that best balances fit and simplicity, any strategy used to guide model selection should be consistent with this principle. The goal is to provide a good predictive model that also provides useful descriptions of the process being studied from estimated parameters.
Once a potential set of predictors is chosen, most statistical packages include the capability to produce summary statistics for all possible regression models using those predictors. Such algorithms (often called best subsets algorithms) do not actually look at all possible models, but rather list statistics for only the models with strongest fits for each number of predictors in the model. Such a listing can then be used to determine a set of potential “best” models to consider more closely. The most common algorithm, described in Furnival and Wilson (1974), is based on branch and bound optimization, and while it is much less computationally intensive than examining all possible models, it still has a practical feasible limit of roughly
Note that model comparisons are only sensible when based on the same data set. Most statistical packages drop any observations that have missing data in any of the variables in the model. If a data set has missing values scattered over different predictors, the set of observations with complete data will change depending on which variables are in the model being examined, and model comparison measures will not be comparable. One way around this is to only use observations with complete data for all variables under consideration, but this can result in discarding a good deal of available information for any particular model.
Tools like best subsets by their very nature are likely to be more effective when there are a relatively small number of useful predictors that have relatively strong effects, as opposed to a relatively large number of predictors that have relatively weak effects. The strict present/absent choice for a predictor is consistent with true relationships with either zero or distinctly nonzero slopes, as opposed to many slopes that are each nonzero but also not far from zero.
2.3.2 EXAMPLE — ESTIMATING HOME PRICES (CONTINUED)
Consider again the home price data examined in Section 1.4. We repeat the regression output from the model based on all of the predictors below:
Coefficients: Estimate Std.Error t value Pr(>|t|) VIF (Intercept) -7.149e+06 3.820e+06 -1.871 0.065043 . Bedrooms -1.229e+04 9.347e+03 -1.315 0.192361 1.262 Bathrooms 5.170e+04 1.309e+04 3.948 0.000171 1.420 *** Living.area 6.590e+01 1.598e+01 4.124 9.22e-05 1.661 *** Lot.size -8.971e-01 4.194e+00 -0.214 0.831197 1.074 Year.built 3.761e+03 1.963e+03 1.916 0.058981 1.242 . Property.tax 1.476e+00 2.832e+00 0.521 0.603734 1.300 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 47380 on 78 degrees of freedom Multiple R-squared: 0.5065, Adjusted R-squared: 0.4685 F-statistic: 13.34 on 6 and 78 DF, p-value: 2.416e-10
This is identical to the output given earlier, except that variance inflation factor (