Handbook of Regression Analysis With Applications in R. Samprit Chatterjee

useful guide for data analysts, and will help contribute to effective analyses. We would like to thank our students and colleagues for their encouragement and support. We hope we have provided them with a book of which they would approve. We would like to thank Steve Quigley, Jackie Palmieri, and Amy Hendrickson for their help in bringing this manuscript to print. We would also like to thank our families for their love and support.

      SAMPRIT CHATTERJEE

      Brooksville, Maine

      JEFFREY S. SIMONOFF

      New York, New York

      August, 2012

PART ONE The Multiple Linear Regression Model

      1.1 Introduction

      1.2 Concepts and Background Material
          1.2.1 The Linear Regression Model
          1.2.2 Estimation Using Least Squares
          1.2.3 Assumptions

      1.3 Methodology
          1.3.1 Interpreting Regression Coefficients
          1.3.2 Measuring the Strength of the Regression Relationship
          1.3.3 Hypothesis Tests and Confidence Intervals for β
          1.3.4 Fitted Values and Predictions
          1.3.5 Checking Assumptions Using Residual Plots

      1.4 Example: Estimating Home Prices

      1.5 Summary

      This is a book about regression modeling, but when we refer to regression models, what do we mean? The regression framework can be characterized in the following way:

      1 We have one particular variable that we are interested in understanding or modeling, such as sales of a particular product, sale price of a home, or voting preference of a particular voter. This variable is called the target, response, or dependent variable, and is usually represented by $y$.

      2 We have a set of other variables that we think might be useful in predicting or modeling the target variable (the price of the product, the competitor's price, and so on; or the lot size, number of bedrooms, number of bathrooms of the home, and so on; or the gender, age, income, party membership of the voter, and so on). These are called the predicting, or independent, variables, and are usually represented by $x_1$, $x_2$, etc.

      Typically, a regression analysis is used for one (or more) of three purposes:

      1 modeling the relationship between $x$ and $y$;

      2 prediction of the target variable (forecasting);

      3 and testing of hypotheses.

      In this chapter, we introduce the basic multiple linear regression model, and discuss how this model can be used for these three purposes. Specifically, we discuss the interpretations of the estimates of different regression parameters, the assumptions underlying the model, measures of the strength of the relationship between the target and predictor variables, the construction of tests of hypotheses and intervals related to regression parameters, and the checking of assumptions using diagnostic plots.

      1.2.1 THE LINEAR REGRESSION MODEL

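      The data consist of $n$ observations, which are sets of observed values $\{x_{1i}, x_{2i}, \ldots, x_{pi}, y_i\}$ that represent a random sample from a larger population. It is assumed that these observations satisfy a linear relationship,

      $$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i,$$

      where the $\beta$ coefficients are unknown parameters, and the $\varepsilon_i$ are random error terms. By a linear model, it is meant that the model is linear in the parameters; a quadratic model,

      $$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i,$$

      paradoxically enough, is a linear model, since $x$ and $x^2$ are just versions of $x_1$ and $x_2$.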
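
      The remark that a quadratic model is still a linear model can be illustrated with a minimal sketch. It is written in Python using only the standard library, and it assumes least squares estimation through the normal equations, which this section has not yet introduced. The function names (`design_matrix`, `solve_linear_system`, `least_squares`) and the data are hypothetical and chosen only for illustration. The squared term enters the design matrix as one more predictor column, so the same linear machinery estimates all three coefficients.

```python
# Sketch: a quadratic model is "linear" because it is linear in the coefficients.
# Assumption: coefficients are estimated by least squares, solving the
# normal equations (X'X) b = X'y. All names and data here are hypothetical.

def design_matrix(x_values):
    """Return rows [1, x, x^2]; x^2 is treated as just another predictor."""
    return [[1.0, x, x * x] for x in x_values]


def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def least_squares(X, y):
    """Least squares coefficients from the normal equations."""
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated from y = 1 + 2x + 0.5x^2.
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [1.0 + 2.0 * xi + 0.5 * xi ** 2 for xi in x]

beta = least_squares(design_matrix(x), y)
print([round(b, 6) for b in beta])
```

      Because the data contain no error term, the estimates should match the generating coefficients up to floating-point rounding. With real data the random errors would make the estimates differ from the underlying parameters.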

      It is important to recognize that this, or any statistical model, is not viewed as a true representation of reality; rather, the goal is that the model be a useful representation of reality. A model can be used to explore the relationships between variables and make accurate forecasts based on those relationships even if it is not the “truth.” Further, any statistical model is only temporary, representing a provisional version of views about the random process being studied. Models can, and should, change, based on analysis using the current model, selection among several candidate models, the acquisition of new data, new understanding of the underlying random process, and so on. Further, it is often the case that there are several different models that are reasonable representations of reality. Having said this, we will sometimes refer to the “true” model, but this should be understood as referring to the underlying form of the currently hypothesized representation of the regression relationship.
