Handbook of Regression Analysis With Applications in R. Samprit Chatterjee
(average) sale price for all homes of that type in the area, so that they can give a justifiable interval estimate that conveys the precision of the estimate of the true expected value of the house; in that case a confidence interval for the fitted value is desired.
The validity of all of these results depends on whether the assumptions hold. Figure 1.5 gives a scatter plot of the residuals versus the fitted values and a normal plot of the residuals for this model fit. There is no apparent pattern in the plot of residuals versus fitted values, and the ordered residuals form a roughly straight line in the normal plot, so there are no apparent violations of assumptions here. The plots of residuals versus each of the predictors (Figure 1.6) also do not show any apparent patterns, other than the houses with unusual values of living area and year built in the respective plots. It would be reasonable to omit these observations to see if they have had an effect on the regression, but we will postpone discussion of that to Chapter 3, where diagnostics for unusual observations are discussed in greater detail.
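The quantities behind the two plots in Figure 1.5 can be sketched in a few lines. The following is only an illustration, not the analysis of the home price data: it assumes a single predictor, uses small made-up `x` and `y` values, and relies only on the Python standard library. The helper names `fit_line` and `residual_diagnostics`, and the plotting positions (i - 0.5)/n used for the normal plot, are assumptions introduced here.

```python
from statistics import NormalDist, mean

def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def residual_diagnostics(x, y):
    """Return fitted values, residuals, and normal plot coordinates."""
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    n = len(residuals)
    # Theoretical standard normal quantiles at positions (i - 0.5) / n.
    normal_q = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    # Pair each theoretical quantile with the corresponding ordered residual.
    normal_plot = list(zip(normal_q, sorted(residuals)))
    return fitted, residuals, normal_plot

# Hypothetical illustrative data (not the home price data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

fitted, residuals, normal_plot = residual_diagnostics(x, y)

# Plot (a) would use the pairs (fitted, residuals); plot (b) uses normal_plot.
for f, r in zip(fitted, residuals):
    print(f"fitted = {f:7.3f}   residual = {r:7.3f}")
```

If the assumptions hold, the pairs of fitted values and residuals should show no structure, and the points in `normal_plot` should fall close to a straight line.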
An obvious consideration at this point is that the models discussed here appear to be overspecified; that is, they include variables that do not apparently add to the predictive power of the model. As was noted earlier, this suggests the consideration of model building, in which a more appropriate (simplified) model is chosen; this is discussed in Chapter 2.
FIGURE 1.5: Residual plots for the home price data. (a) Plot of residuals versus fitted values. (b) Normal plot of the residuals.
FIGURE 1.6: Scatter plots of residuals versus each predictor for the home price data.
1.5 Summary
In this chapter we have laid out the basic structure of the linear regression model, including the assumptions that justify the use of least squares estimation. The three main goals of regression noted at the beginning of the chapter provide a framework for an organization of the topics covered.
1 Modeling the relationship between $x$ and $y$: the least squares estimates $\hat{\beta}$ summarize the expected change in $y$ for a given change in an $x$, accounting for all of the variables in the model; the standard error of the estimate $\hat{\sigma}$ estimates the standard deviation of the errors; $R^2$ and $R_a^2$ estimate the proportion of variability in $y$ accounted for by $x$; and the confidence interval for a fitted value provides a measure of the precision in estimating the expected target for a given set of predictor values.
2 Prediction of the target variable: substituting specified values of $x$ into the fitted regression model gives an estimate of the value of the target for a new observation; the rough prediction interval provides a quick measure of the limits of the ability to predict a new observation; and the exact prediction interval provides a more precise measure of those limits.
3 Testing of hypotheses: the $F$-test provides a test of the statistical significance of the overall relationship; the $t$-test for each slope coefficient testing whether the true value is zero provides a test of whether the variable provides additional predictive power given the other variables; and the $t$-tests can be generalized to test other hypotheses of interest about the coefficients as well.
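The quantities named in the three goals above can be computed directly for the simplest case of one predictor. The sketch below is an illustration under stated assumptions rather than the chapter's own analysis: the data are made up, the function name `simple_regression_summary` is hypothetical, and the critical value `t_crit` is passed in by the caller because the Python standard library has no t distribution (2.447 is the 97.5% point of a t distribution on 6 degrees of freedom, which matches n = 8 observations and one predictor). The rough prediction interval is taken to be the fitted value plus or minus two standard errors of the estimate.

```python
from math import sqrt
from statistics import mean

def simple_regression_summary(x, y, x0, t_crit):
    """Summary quantities for a one-predictor least squares fit.

    x0 is the predictor value at which intervals are wanted; t_crit is the
    t critical value on n - 2 degrees of freedom, supplied by the caller.
    """
    n, p = len(x), 1
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

    # Goal 1: least squares estimates and measures of fit.
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - ybar) ** 2 for yi in y)
    sigma_hat = sqrt(rss / (n - p - 1))  # standard error of the estimate
    r2 = 1 - rss / tss
    r2_adj = 1 - (rss / (n - p - 1)) / (tss / (n - 1))

    # Goal 3: t-statistic for the slope and overall F-statistic.
    se_b1 = sigma_hat / sqrt(sxx)
    t_stat = b1 / se_b1
    f_stat = ((tss - rss) / p) / (rss / (n - p - 1))

    # Goals 1 and 2: intervals at x0.
    y0 = b0 + b1 * x0
    h0 = 1 / n + (x0 - xbar) ** 2 / sxx
    ci_half = t_crit * sigma_hat * sqrt(h0)      # fitted value (mean response)
    pi_half = t_crit * sigma_hat * sqrt(1 + h0)  # new observation (exact)
    return {
        "b0": b0, "b1": b1, "sigma_hat": sigma_hat,
        "r2": r2, "r2_adj": r2_adj, "t_stat": t_stat, "f_stat": f_stat,
        "y0": y0,
        "fitted_ci": (y0 - ci_half, y0 + ci_half),
        "rough_pi": (y0 - 2 * sigma_hat, y0 + 2 * sigma_hat),
        "exact_pi": (y0 - pi_half, y0 + pi_half),
    }

# Hypothetical illustrative data (not the home price data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

s = simple_regression_summary(x, y, x0=4.5, t_crit=2.447)
for name, value in s.items():
    print(name, value)
```

With a single predictor the F-statistic equals the square of the slope's t-statistic, and the confidence interval for the fitted value is always narrower than the exact prediction interval at the same `x0`, since the latter also accounts for the variability of a new observation's error.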
Since all of these methods depend on the assumptions holding, a fundamental part of any regression analysis is to check those assumptions. The residual plots discussed in this chapter are a key part of that process, and other diagnostics and tests will be discussed in future chapters that provide additional support for that task.
KEY TERMS
Autocorrelation: Correlation between adjacent observations in a (time) series. In the regression context it is autocorrelation of the errors that is a violation of assumptions.
Coefficient of determination (