Handbook of Regression Analysis With Applications in R. Samprit Chatterjee

      … that helps address this problem is the corrected $AIC$, $AIC_C$, given in (2.3).

      1 Choose the model that minimizes $AIC_C$. In case of tied values, the simplest model (smallest $p$) would be chosen. In these data, this rule implies choosing $p = 3$, although the $AIC_C$ value for $p = 4$ is virtually identical to that of $p = 3$. Note that the overall level of the $AIC_C$ values is not meaningful, and should not be compared to $C_p$ values or $AIC_C$ values for other data sets; it is only the value for a model for a given data set relative to the values of others for that data set that matters.
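
      As a minimal sketch of this rule, the following R code computes $AIC_C$ for candidate models of increasing size and picks the minimizer. The data are simulated, and the names `dat`, `x1` to `x4`, `formulas`, and `aicc` are hypothetical. The correction term is assumed to take the usual small-sample form $2(p+2)(p+3)/(n-p-3)$. Values returned by `AIC()` in R can differ from other textbook versions of $AIC$ by a constant that depends only on the sample size, which is one more reason to compare values only within a single data set.

```r
# Simulated data (hypothetical): only x1 and x2 truly affect y.
set.seed(1)
n <- 85
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# One candidate model per number of predictors (a nested sequence, for simplicity).
formulas <- list(y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + x3, y ~ x1 + x2 + x3 + x4)
fits <- lapply(formulas, function(f) lm(f, data = dat))

# Corrected AIC: AIC plus an assumed small-sample correction term.
aicc <- function(fit) {
  n <- nobs(fit)
  p <- length(coef(fit)) - 1   # number of predictors, excluding the intercept
  AIC(fit) + 2 * (p + 2) * (p + 3) / (n - p - 3)
}

tab <- data.frame(
  p    = sapply(fits, function(f) length(coef(f)) - 1),
  AIC  = sapply(fits, AIC),
  AICc = sapply(fits, aicc)
)
print(tab)

# which.min() returns the first minimum, so ties go to the simplest model.
best_p <- tab$p[which.min(tab$AICc)]
best_p
```

      Because the rows are ordered by increasing $p$, taking the first minimum implements the tie-breaking rule stated above.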

      $AIC$, $AIC_C$, and $C_p$ have the desirable property that they are efficient model selection criteria. This means that in the (realistic) situation where the set of candidate models does not include the “true” model (that is, a good model is just viewed as a useful approximation to reality), as the sample gets larger the error obtained in making predictions using the model chosen using these criteria becomes indistinguishable from the error obtained using the best possible model among all candidate models. That is, in this large‐sample predictive sense, it is as if the best approximation were known to the data analyst. Another well‐known criterion, the Bayesian Information Criterion $BIC$ [which substitutes $\log(n) \times p$ for $2p$ in (2.2)], does not have this property, but is instead a consistent criterion. Such a criterion has the property that if the “true” model is in fact among the candidate models, the criterion will select that model with probability approaching $1$ as the sample size increases. Thus, $BIC$ is a more natural criterion to use if the goal is to identify the “true” predictors with nonzero slopes (which of course presumes that there are such things as “true” predictors in a “true” model). $BIC$ will generally choose simpler models than $AIC$ because of its stronger penalty ($\log(n) > 2$ for $n \geq 8$), and a version $BIC_C$ that adjusts $BIC$ as in (2.3) leads to even simpler models. This supports the notion that from a predictive point of view including a few unnecessary predictors (overfitting) is far less damaging than is omitting necessary predictors (underfitting).
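
      The difference in penalties can be checked numerically. The sketch below uses simulated data with hypothetical names (`dat`, `x1` to `x4`) and the built-in `AIC()` and `BIC()` functions. R applies the penalty to every estimated parameter (slopes, intercept, and error variance, so $p + 2$ in all), which means its values can differ from a hand calculation by a constant that is the same for all models fitted to one data set; the ranking of models is unaffected.

```r
# Smallest sample size at which the per-parameter BIC penalty exceeds that of AIC.
n_grid <- 2:20
n_min <- min(n_grid[log(n_grid) > 2])
n_min   # 8, since exp(2) is about 7.39

# Simulated data (hypothetical): x1 matters, x2 matters weakly, x3 and x4 not at all.
set.seed(2)
n <- 85
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + 0.3 * dat$x2 + rnorm(n)

formulas <- list(y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + x3, y ~ x1 + x2 + x3 + x4)
fits <- lapply(formulas, function(f) lm(f, data = dat))

p   <- sapply(fits, function(f) length(coef(f)) - 1)
aic <- sapply(fits, AIC)
bic <- sapply(fits, BIC)

# The two criteria differ only through the penalty on the p + 2 parameters.
all.equal(bic - aic, (log(n) - 2) * (p + 2))

# Model size chosen by each criterion; BIC's choice is never larger here.
p_aic <- p[which.min(aic)]
p_bic <- p[which.min(bic)]
c(AIC = p_aic, BIC = p_bic)
```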

      A final way of comparing models is from a directly predictive point of view. Since a rough $95\%$ prediction interval is $\hat{y} \pm 2\hat{\sigma}$, a useful model from a predictive point of view is one with small $\hat{\sigma}$, suggesting choosing a model that has small $\hat{\sigma}$ while still being as simple as possible. That is,

      1 Increase the number of predictors until $\hat{\sigma}$ levels off. For these data ($S$ in the output refers to $\hat{\sigma}$), this implies choosing $p = 3$ or $p = 4$.
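
      A short R sketch of this predictive view follows, again on simulated data with hypothetical names (`dat`, `x1` to `x4`, `new_obs`). It tabulates the residual standard error $\hat{\sigma}$ by number of predictors, then compares the rough interval $\hat{y} \pm 2\hat{\sigma}$ with the exact $95\%$ prediction interval from `predict()` for one new observation placed near the center of the data.

```r
# Simulated data (hypothetical): three predictors matter, x4 does not.
set.seed(3)
n <- 85
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + 0.5 * dat$x3 + rnorm(n, sd = 2)

formulas <- list(y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + x3, y ~ x1 + x2 + x3 + x4)
fits <- lapply(formulas, function(f) lm(f, data = dat))

# Residual standard error for each model size; look for where it levels off.
sig <- sapply(fits, function(f) summary(f)$sigma)
print(data.frame(p = 1:4, sigma_hat = sig))

# Exact 95% prediction interval from the three-predictor model.
fit3 <- fits[[3]]
new_obs <- data.frame(x1 = 0, x2 = 0, x3 = 0)
exact <- predict(fit3, newdata = new_obs, interval = "prediction", level = 0.95)

# Rough interval: fitted value plus or minus two residual standard errors.
rough <- c(fit = exact[1, "fit"],
           lwr = exact[1, "fit"] - 2 * sig[3],
           upr = exact[1, "fit"] + 2 * sig[3])

print(exact)
print(rough)
```

      The rough interval ignores the uncertainty in the estimated coefficients, so it is closest to the exact one for new observations that are not far from the bulk of the data.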

      Coefficients:
                    Estimate  Std.Error  t value  Pr(>|t|)    VIF
      (Intercept) -6.852e+06  3.701e+06   -1.852    0.0678         .
      Bedrooms    -1.207e+04  9.212e+03   -1.310    0.1940  1.252
      Bathrooms    5.303e+04  1.275e+04    4.160  7.94e-05  1.374  ***
      Living.area  6.828e+01  1.460e+01    4.676  1.17e-05  1.417  ***
      Year.built   3.608e+03  1.898e+03    1.901    0.0609  1.187  .
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 46890 on 80 degrees of freedom
      Multiple R-squared: 0.5044, Adjusted R-squared: 0.4796
      F-statistic: 20.35 on 4 and 80 DF, p-value: 1.356e-11

      The $t$‐statistic for number of bedrooms suggests very little evidence that it adds anything useful given the other predictors in the model, so we consider now the best three‐predictor model. This happens to be the best four‐predictor model with the one statistically insignificant predictor omitted, but this does not have to be the case.

      Coefficients:
                    Estimate  Std.Error  t value  Pr(>|t|)    VIF
      (Intercept) -7.653e+06  3.666e+06   -2.087  0.039988         *
      Bathrooms    5.223e+04  1.279e+04    4.084  0.000103  1.371  ***
      Living.area  6.097e+01  1.355e+01    4.498  2.26e-05  1.210  ***
      Year.built   4.001e+03  1.883e+03    2.125  0.036632  1.158  *
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 47090 on 81 degrees of freedom
      Multiple R-squared: 0.4937, Adjusted R-squared: 0.475
      F-statistic: 26.33 on 3 and 81 DF, p-value: 5.489e-12
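
      The VIF column in the output is not part of the standard `summary()` display for an `lm` fit, so it has to be computed separately. One way to do so in base R is sketched below, under the assumption that VIF denotes the usual variance inflation factor $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors. The data are simulated and the names `dat`, `x1` to `x3`, and `predictors` are hypothetical; this is not necessarily how the output above was produced.

```r
# Simulated data (hypothetical): x2 is correlated with x1, x3 is independent.
set.seed(4)
n <- 85
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
x3 <- rnorm(n)
dat <- data.frame(x1, x2, x3)
dat$y <- 1 + x1 + x2 + x3 + rnorm(n)

predictors <- c("x1", "x2", "x3")

# VIF for each predictor from its auxiliary regression on the other predictors.
vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux <- lm(reformulate(others, response = v), data = dat)
  1 / (1 - summary(aux)$r.squared)
})

# Cross-check: the VIFs are the diagonal of the inverse correlation matrix.
vif_check <- diag(solve(cor(dat[predictors])))

# Append the VIFs to the coefficient table (the intercept row has no VIF).
fit <- lm(y ~ x1 + x2 + x3, data = dat)
print(cbind(coef(summary(fit))[-1, ], VIF = vif))
```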
