Handbook of Regression Analysis With Applications in R. Samprit Chatterjee

of the errors, equaling the square root of the residual mean square.

      2.1 Introduction

      2.2 Concepts and Background Material
          2.2.1 Using Hypothesis Tests to Compare Models
          2.2.2 Collinearity

      2.3 Methodology
          2.3.1 Model Selection
          2.3.2 Example: Estimating Home Prices (continued)

      2.4 Indicator Variables and Modeling Interactions
          2.4.1 Example: Electronic Voting and the 2004 Presidential Election

      2.5 Summary

      

      All of the discussion in Chapter 1 is based on the premise that the only model being considered is the one currently being fit. This is not a good data analysis strategy, for several reasons.

      1 Including unnecessary predictors in the model (what is sometimes called overfitting) complicates descriptions of the process. Using such models tends to lead to poorer predictions because of the additional unnecessary noise. Further, a more complex representation of the true regression relationship is less likely to remain stable enough to be useful for future prediction than is a simpler one.

      2 Omitting important effects (underfitting) reduces predictive power, biases estimates of effects for included predictors, and results in less understanding of the process being studied.

      3 Violations of assumptions should be addressed, so that least squares estimation is justified.

      The last of these reasons is the subject of later chapters, while the first two are discussed in this chapter. This operation of choosing among different candidate models so as to avoid overfitting and underfitting is called model selection.
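      The bias caused by underfitting, noted in the list above, can be made concrete with a small sketch. The data below are made up and noise-free, and the names (x1, x2, centered_cross_product) are hypothetical; the example is written in plain Python for illustration only. The target depends on two correlated predictors, and leaving one out shifts the estimated slope of the other away from its true value.

```python
# Made-up, noise-free data: y depends on BOTH x1 and x2,
# and x1 and x2 are positively correlated.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]  # true slopes: 2 and 3


def centered_cross_product(u, v):
    """Sum of (u_i - mean(u)) * (v_i - mean(v))."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


s11 = centered_cross_product(x1, x1)
s22 = centered_cross_product(x2, x2)
s12 = centered_cross_product(x1, x2)
s1y = centered_cross_product(x1, y)
s2y = centered_cross_product(x2, y)

# Underfit model: least squares regression of y on x1 only.
slope_underfit = s1y / s11

# Full model: least squares slopes for y on x1 and x2,
# obtained by solving the 2x2 centered normal equations.
det = s11 * s22 - s12 ** 2
slope_x1_full = (s22 * s1y - s12 * s2y) / det
slope_x2_full = (s11 * s2y - s12 * s1y) / det

print(f"x1 slope with x2 omitted: {slope_underfit:.3f}")
print(f"x1 slope in full model:   {slope_x1_full:.3f}")
print(f"x2 slope in full model:   {slope_x2_full:.3f}")
```

      In this constructed example the full model recovers the slopes 2 and 3, while the underfit model attributes part of the effect of x2 to x1 and reports a slope of roughly 4.49. With real, noisy data the same mechanism operates, although the size of the distortion depends on how strongly the omitted and included predictors are related.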

      First, we discuss the uses of hypothesis testing for model selection. Various hypothesis tests address relevant model selection questions, but there are also reasons why they are not sufficient for these purposes. Part of the difficulty is the effect of correlations among the predictors; high correlation among the predictors (collinearity) is a particularly challenging situation.

      A useful way of thinking about the tradeoffs of overfitting versus underfitting is as a contrast between strength of fit and simplicity. The principle of parsimony states that a model should be as simple as possible while still accounting for the important relationships in the data. Thus, a sensible way of comparing models is using measures that explicitly reflect this tradeoff; such measures are discussed in Section 2.3.1.
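      As a rough illustration of a measure that trades fit against simplicity, the sketch below uses the adjusted R-squared, one commonly used measure of this kind. Its use here is an assumption for illustration, since the measures of Section 2.3.1 are not shown in this excerpt. The R-squared values, sample size, and predictor counts are hypothetical, and the function name is made up.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for a model with p predictors (plus an intercept)
    fit to n observations: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)


# Hypothetical numbers, for illustration only.
n = 30
simpler_model = adjusted_r_squared(0.800, n, p=2)  # 2 predictors
larger_model = adjusted_r_squared(0.805, n, p=5)   # 3 extra predictors

print(f"simpler model: {simpler_model:.4f}")
print(f"larger model:  {larger_model:.4f}")
```

      Here the three extra predictors raise the raw R-squared only slightly, and the penalty for the added complexity leaves the larger model with the lower adjusted value, so the simpler model would be preferred on this criterion.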

      The chapter concludes with a discussion of techniques designed to address the existence of well‐defined subgroups in the data. In this situation, it is often the case that the effect of a predictor on the target variable differs across the groups, and ways of building models to handle this are discussed in Section 2.4.

      2.2.1 USING HYPOTHESIS TESTS TO COMPARE MODELS

      Determining whether individual regression coefficients are statistically significant (as discussed in Section 1.3.3) is an obvious first step in deciding whether a model is overspecified. A predictor that does not add significantly to model fit should have an estimated slope coefficient that is not significantly different from 0, and is thus identified by a small t‐statistic. So, for example, in the analysis of home prices in Section 1.4, the regression output on page 17 suggests removing number of bedrooms, lot size, and property taxes from the model, as all three have insignificant t‐values.
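      The calculation behind such a statistic can be sketched for the simplest case of a single predictor. This is an assumption made to keep the arithmetic short; the home price model has several predictors and would need matrix calculations. The data are made up, and the code is plain Python for illustration only. The statistic is the estimated slope divided by its standard error, where the standard error is built from the residual mean square.

```python
import math

# Made-up data in which y has almost no linear relationship with x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 3, 6, 4, 5, 3, 6, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates for the simple regression of y on x.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual mean square: residual sum of squares over n - 2 degrees of freedom.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
residual_mean_square = sum(e ** 2 for e in residuals) / (n - 2)

# Standard error of the slope and the resulting t-statistic.
slope_std_error = math.sqrt(residual_mean_square / s_xx)
t_statistic = slope / slope_std_error

print(f"slope = {slope:.4f}, t = {t_statistic:.3f}")
```

      For these made-up values the t‐statistic is about 0.43 on 6 degrees of freedom, far from conventional significance thresholds, so the predictor would be a candidate for removal.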

      The t‐tests and F‐test of Section 1.3.3 are special cases of a general formulation that is useful for comparing certain classes of models. It might be the case that a simpler version of a candidate model (a subset model) is adequate to fit the data. For example, consider taking a sample of college students and determining their college grade point average (GPA), Scholastic Aptitude Test (SAT) evidence‐based reading and writing score (Reading), and SAT math score (Math). The full regression model to fit to these data is

\[ \mathrm{GPA}_i = \beta_0 + \beta_1 \mathrm{Reading}_i + \beta_2 \mathrm{Math}_i + \varepsilon_i. \]

      Instead of considering reading and math scores separately, we could consider whether GPA can be predicted by one variable: total

