The recommended method for understanding the material and completing the problems is to use statistical software rather than a statistical calculator. It may be possible to apply many of the methods discussed using spreadsheet software (such as Microsoft Excel), although some of the graphical methods may be difficult to implement and statistical software will generally be easier to use. Although a statistical calculator is not recommended for use with this book, a traditional calculator capable of basic arithmetic (including logarithmic and exponential transformations) will be invaluable.
What other resources are recommended? Good supplementary textbooks (some at a more advanced level) include Chatterjee and Hadi (2013), Dielman (2004), Draper and Smith (1998), Fox (2015), Gelman et al. (2020), Kutner et al. (2004), Mendenhall and Sincich (2020), Montgomery et al. (2021), Ryan (2008), and Weisberg (2013).
About the Companion Website
This book is accompanied by a companion website for Instructors and Students:
www.wiley.com/go/pardoe/AppliedRegressionModeling3e
Datasets used for examples
R code
Presentation slides
Statistical software packages
Chapter 6 – Case studies
Chapter 7 – Extensions
Appendix A – Computer software help
Appendix B – Critical values for t-distributions
Appendix C – Notation and formulas
Appendix D – Mathematics refresher
Appendix E – Multiple Linear Regression Using Matrices
Appendix F – Answers for selected problems
Instructor's manual
Chapter 1 Foundations
This chapter provides a brief refresher of the main statistical ideas that are a useful foundation for the main focus of this book, regression analysis, covered in subsequent chapters. For more detailed discussion of this material, consult a good introductory statistics textbook such as Freedman et al. (2007) or Moore et al. (2018). To simplify matters at this stage, we consider univariate data, that is, datasets consisting of measurements of a single variable from a sample of observations. By contrast, regression analysis concerns multivariate data where there are two or more variables measured from a sample of observations. Nevertheless, the statistical ideas for univariate data carry over readily to this more complex situation, so it helps to start out as simply as possible and make things more complicated only as needed.
After reading this chapter you should be able to:
Summarize univariate data graphically and numerically.
Calculate and interpret a confidence interval for a univariate population mean.
Conduct and draw conclusions from a hypothesis test for a univariate population mean using both the rejection region and p‐value methods.
Calculate and interpret a prediction interval for an individual univariate value.
1.1 Identifying and Summarizing Data
One way to think about statistics is as a collection of methods for using data to understand a problem quantitatively—we saw many examples of this in the introduction. This book is concerned primarily with analyzing data to obtain information that can be used to help make decisions in real‐world contexts.
The process of framing a problem in such a way that it is amenable to quantitative analysis is clearly an important step in the decision‐making process, but this lies outside the scope of this book. Similarly, while data collection is also a necessary task—often the most time‐consuming part of any analysis—we assume from this point on that we have already obtained data relevant to the problem at hand. We will return to the issue of the manner in which these data have been collected—namely, whether we can consider the sample data to be representative of some larger population that we wish to make statistical inferences for—in Section 1.3.
For now, we consider identifying and summarizing the data at hand. For example, suppose that we have moved to a new city and wish to buy a home. In deciding on a suitable home, we would probably consider a variety of factors, such as size, location, amenities, and price. For the sake of illustration, we focus on price and, in particular, see if we can understand the way in which sale prices vary in a specific housing market. This example will run through the rest of the chapter, and, while few people would ever obsess over this problem to quite this degree in real life, it provides a useful, intuitive application for the statistical ideas that we use in more complex problems in the rest of the book.
For this example, identifying the data is straightforward: the units of observation are a random sample of 30 single-family homes from this housing market, and the variable measured for each home is its sale price.
The particular sample in the HOMES1 data file is random because the 30 homes have been selected randomly somehow from the population of all single‐family homes in this housing market. For example, consider a list of homes currently for sale, which are considered to be representative of this population. A random number generator—commonly available in spreadsheet or statistical software—can be used to pick out 30 of these. Alternative selection methods may or may not lead to a random sample. For example, picking the first 30 homes on the list would not lead to a random sample if the list were ordered by the size of the sale price.
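As a small illustration, a random selection of this kind might be carried out in R roughly as follows; the file name current_listings.csv and the object names are hypothetical and not part of the HOMES1 data.

# Read a hypothetical list of homes currently for sale in this market
listings <- read.csv("current_listings.csv")
# Fix the random number generator seed so the selection can be reproduced
set.seed(123)
# Randomly pick 30 of the listed homes (rows) without replacement
selected_homes <- listings[sample(nrow(listings), size = 30), ]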
We can simply list small datasets such as this. The sale prices (in thousands of dollars) for this sample are:
155.5  195.0  197.0  207.0  214.9  230.0  239.5  242.0  252.5  255.0
259.9  259.9  269.9  270.0  274.9  283.0  285.0  285.0
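Such a listing can also be produced directly in statistical software. For example, here is a minimal sketch in R, assuming (hypothetically) that the HOMES1 data have been saved in a file named homes1.csv with the sale prices stored in a variable named Price.

# Read the HOMES1 data (file and variable names here are assumptions)
homes1 <- read.csv("homes1.csv")
# List the sale prices (in thousands of dollars) in ascending order
sort(homes1$Price)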