(Hurvich and Tsai, 1989). Equation (2.3) shows that (especially for small samples) models with fewer parameters will be more strongly preferred when minimizing $AIC_C$ than when minimizing $AIC$, providing stronger protection against overfitting. In large samples, the two criteria are virtually identical, but in small samples, or when considering models with a large number of parameters, $AIC_C$ is the better choice. This suggests the following model selection rule:
1 Choose the model that minimizes $AIC_C$. In case of tied values, the simplest model (smallest $p$) would be chosen. In these data, this rule implies choosing , although the $AIC_C$ value for is virtually identical to that of . Note that the overall level of the $AIC_C$ values is not meaningful, and should not be compared to $C_p$ values or values for other data sets; it is only the value for a model for a given data set relative to the values of others for that data set that matters.
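The rule can be sketched in a few lines of Python. This is a minimal illustration rather than the computation behind the output discussed here. It assumes the common least squares forms $AIC = n \log(RSS/n) + 2\nu$ (up to an additive constant) and $AIC_C = AIC + 2\nu(\nu+1)/(n-\nu-1)$, with $\nu = p + 2$ estimated parameters (intercept, $p$ slopes, and the error variance). The function names, the tie tolerance `tol`, and the residual sums of squares in the usage lines are all hypothetical.

```python
import math

def aic(rss, n, p):
    """AIC for a least squares fit with p predictors (assumed form, up to a constant)."""
    nu = p + 2  # intercept, p slopes, and the error variance
    return n * math.log(rss / n) + 2 * nu

def aicc(rss, n, p):
    """Small-sample corrected AIC (assumed Hurvich-Tsai form)."""
    nu = p + 2
    if n - nu - 1 <= 0:
        raise ValueError("sample too small for the correction")
    return aic(rss, n, p) + 2 * nu * (nu + 1) / (n - nu - 1)

def choose_by_aicc(candidates, n, tol=1e-8):
    """candidates: list of (p, rss) pairs, one per candidate model.

    Returns the p of the model with minimal AICc; among models whose
    AICc is within tol of the minimum, the smallest p is returned.
    """
    scored = [(aicc(rss, n, p), p) for p, rss in candidates]
    best = min(score for score, _ in scored)
    return min(p for score, p in scored if score - best <= tol)

# Hypothetical residual sums of squares for the best model of each size.
candidates = [(1, 200.0), (2, 120.0), (3, 100.0), (4, 99.5)]
for p, rss in candidates:
    print(p, round(aic(rss, 50, p), 2), round(aicc(rss, 50, p), 2))
print("chosen p:", choose_by_aicc(candidates, n=50))
```

Because the correction term grows with $\nu$ and shrinks with $n$, the gap between the two criteria is widest for large models fit to small samples, which is where the extra protection against overfitting matters.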
$AIC$, $AIC_C$, and $C_p$ have the desirable property that they are efficient model selection criteria. This means that in the (realistic) situation where the set of candidate models does not include the “true” model (that is, a good model is just viewed as a useful approximation to reality), as the sample gets larger the error obtained in making predictions using the model chosen using these criteria becomes indistinguishable from the error obtained using the best possible model among all candidate models. That is, in this large‐sample predictive sense, it is as if the best approximation was known to the data analyst. Another well‐known criterion, the Bayesian Information Criterion $BIC$ [which substitutes $\log(n) \times p$ for $2p$ in (2.2)], does not have this property, but is instead a consistent criterion. Such a criterion has the property that if the “true” model is in fact among the candidate models, the criterion will select that model with probability approaching $1$ as the sample size increases. Thus, $BIC$ is a more natural criterion to use if the goal is to identify the “true” predictors with nonzero slopes (which of course presumes that there are such things as “true” predictors in a “true” model). $BIC$ will generally choose simpler models than $AIC$ because of its stronger penalty ($\log(n) > 2$ for $n \geq 8$), and a version $BIC_C$ that adjusts $BIC$ as in (2.3) leads to even simpler models. This supports the notion that from a predictive point of view including a few unnecessary predictors (overfitting) is far less damaging than is omitting necessary predictors (underfitting).
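The relative strength of the two penalties can be checked numerically. The short sketch below assumes the standard per-parameter penalties, 2 for $AIC$ and the natural log of the sample size for $BIC$; the function name is hypothetical.

```python
import math

def penalty_per_parameter(n):
    """Return (AIC penalty, BIC penalty) per estimated parameter for sample size n."""
    return 2.0, math.log(n)

for n in (5, 7, 8, 20, 100, 1000):
    aic_pen, bic_pen = penalty_per_parameter(n)
    stronger = "BIC" if bic_pen > aic_pen else "AIC"
    print(f"n={n:5d}  AIC penalty={aic_pen:.3f}  BIC penalty={bic_pen:.3f}  stronger: {stronger}")
```

The $BIC$ penalty overtakes the $AIC$ penalty between $n = 7$ and $n = 8$ and keeps growing with the sample size, which is why $BIC$ tends to settle on smaller models.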
A final way of comparing models is from a directly predictive point of view. Since a rough 95% prediction interval is $\hat{y} \pm 2\hat{\sigma}$, a useful model from a predictive point of view is one with small $\hat{\sigma}$, suggesting choosing a model that has small $\hat{\sigma}$ while still being as simple as possible. That is,
1 Increase the number of predictors until $\hat{\sigma}$ levels off. For these data ($S$ in the output refers to $\hat{\sigma}$), this implies choosing or .
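One way to make "levels off" concrete is sketched below. It assumes the standard error of the estimate is $\hat{\sigma} = \sqrt{RSS/(n-p-1)}$ and uses the rough 95% prediction interval $\hat{y} \pm 2\hat{\sigma}$. The relative tolerance `rel_tol`, the function names, and the residual sums of squares in the usage lines are hypothetical; in practice, judging where the values level off is a matter of inspection rather than a fixed cutoff.

```python
import math

def sigma_hat(rss, n, p):
    """Standard error of the estimate for a least squares fit with p predictors."""
    return math.sqrt(rss / (n - p - 1))

def rough_prediction_interval(y_hat, s):
    """Rough 95% prediction interval: y_hat plus or minus 2 * sigma_hat."""
    return y_hat - 2 * s, y_hat + 2 * s

def level_off_size(candidates, n, rel_tol=0.01):
    """candidates: (p, rss) for the best model of each size, sorted by increasing p.

    Returns the first p at which adding one more predictor reduces
    sigma_hat by less than rel_tol (as a fraction of its current value).
    """
    sizes = [(p, sigma_hat(rss, n, p)) for p, rss in candidates]
    for (p, s), (_, s_next) in zip(sizes, sizes[1:]):
        if (s - s_next) / s < rel_tol:
            return p
    return sizes[-1][0]  # never leveled off within the candidates given

# Hypothetical residual sums of squares for the best model of each size.
candidates = [(1, 200.0), (2, 120.0), (3, 100.0), (4, 99.5)]
p_star = level_off_size(candidates, n=50)
s_star = sigma_hat(dict(candidates)[p_star], 50, p_star)
print("chosen p:", p_star)
print("rough interval around a prediction of 10:", rough_prediction_interval(10.0, s_star))
```

Note that $\hat{\sigma}$ can increase when a useless predictor is added, since the loss of a degree of freedom can outweigh a tiny drop in the residual sum of squares; the sketch treats that case as having leveled off.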
Taken together, all of these rules imply that the appropriate models to consider are those with two, three, or four predictors. Typically, the strongest model of each size (which will have highest $R^2$, highest $R_a^2$, lowest $C_p$, lowest $AIC_C$, and lowest $\hat{\sigma}$, so there is no controversy as to which one is strongest) is examined. The output on pages 31–32 provides summaries for the top three models of each size, in case there are reasons to examine a second‐ or third‐best model (if, for example, a predictor in the best model is difficult or expensive to measure), but here we focus on the best model of each size. First, here is output for the best four‐predictor model.
The $t$‐statistic for number of bedrooms suggests very little evidence that it adds anything useful given the other predictors in the model, so we consider now the best three‐predictor model. This happens to be the best four‐predictor model with the one statistically insignificant predictor omitted, but this does not have to be the case.
Each of the predictors is statistically significant at a level, and this model recovers virtually all of the available fit (, while that using all six predictors is ), so this seems to be a reasonable