Applied Univariate, Bivariate, and Multivariate Statistics. Daniel J. Denis
as we can assume multivariate normality, we have some idea of how such linear combinations will be distributed.
2.26 MODELS IN MATRIX FORM
Throughout the book, our general approach is to first present models in their simplest possible form using only scalars. We then gently introduce the reader to the corresponding matrix counterparts and extensions. Matrices are required for such models in order to accommodate numerous variables and dimensions. Matrix algebra is the vehicle by which multivariate analysis is communicated, though most of the concepts of statistics can be communicated using simpler scalar algebra. Knowing matrix algebra for its own sake will not necessarily equate to understanding statistical concepts. Indeed, hiding behind the mathematics of statistics are the philosophically “sticky” issues that mathematics or statistics cannot, on their own at least, claim to solve. These are often the problems confronted by researchers and scientists in their empirical pursuits and attempts to draw conclusions from data. For instance, what is the nature of a “correct” model? Do latent variables exist, or are they only a consequence of generating linear combinations? The nature of a latent variable is not necessarily contingent on the linear algebra that seeks to define it. Such questions are largely philosophical, and if they interest you, you are strongly encouraged to familiarize yourself with the philosophy of statistics and mathematics (you may not always find answers to your questions, but you will appreciate their complexity, which is beyond our current study here). For a gentle introduction to the philosophy of statistics, see Lindley (2001).
As an example of how matrices will be used to develop more complete and general models, consider the multivariate general linear model in matrix form:

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E} \qquad (2.7)$$
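where Y is an n × m matrix of n observations on m response variables, X is the model or “design” matrix whose columns contain k regressors, including the intercept term, B is a matrix of regression coefficients, and E is a matrix of errors. Many statistical models can be incorporated into the framework of (2.7). As a relatively easy application of this general model, consider the simple linear regression model (featured in Chapter 7) in matrix form:

$$\begin{bmatrix} y_{i=1} \\ y_{i=2} \\ \vdots \\ y_{i=n} \end{bmatrix} = \begin{bmatrix} 1 & x_{i=1} \\ 1 & x_{i=2} \\ \vdots & \vdots \\ 1 & x_{i=n} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$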
where $y_{i=1}$ to $y_{i=n}$ are observed measurements on some dependent variable, X is the model matrix containing a constant of 1 in the first column to represent the common intercept term (i.e., “common” implying there is one intercept that represents all observations in our data), $x_{i=1}$ to $x_{i=n}$ are observed values on a predictor variable, α is the fixed intercept parameter, β is the slope parameter, which we also assume to be fixed, and ε is a vector of errors $\varepsilon_1$ to $\varepsilon_n$ (we use ε here instead of E).
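Suppose now we want to add a second response variable. Because of the generality of (2.7), this can be easily accommodated:

$$\begin{bmatrix} y_{i=1,1} & y_{i=1,2} \\ y_{i=2,1} & y_{i=2,2} \\ \vdots & \vdots \\ y_{i=n,1} & y_{i=n,2} \end{bmatrix} = \begin{bmatrix} 1 & x_{i=1} \\ 1 & x_{i=2} \\ \vdots & \vdots \\ 1 & x_{i=n} \end{bmatrix} \begin{bmatrix} \alpha_1 & \alpha_2 \\ \beta_1 & \beta_2 \end{bmatrix} + \begin{bmatrix} \varepsilon_{1,1} & \varepsilon_{1,2} \\ \varepsilon_{2,1} & \varepsilon_{2,2} \\ \vdots & \vdots \\ \varepsilon_{n,1} & \varepsilon_{n,2} \end{bmatrix}$$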
where now, a second response variable is represented in Y by a second column. That is, $y_{i=1,2}$ corresponds to individual 1 on response variable 2, $y_{i=2,2}$ is individual 2 on response variable 2, etc. We will at times refer to matrix representations throughout the book.
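The matrix form with two response variables can be made concrete with a short R sketch. It uses R's built-in iris data purely as an illustrative choice, and it estimates B by ordinary least squares, which is an assumption here since estimation has not yet been discussed. The object names Y, X, B, and E simply mirror the symbols in (2.7).

```r
# Two responses and one predictor from R's built-in iris data (illustrative choice).
Y <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])     # n x m, with m = 2
X <- cbind(Intercept = 1, Petal.Length = iris$Petal.Length)  # n x k, with k = 2

# Least-squares estimate of B (assumed estimator, not taken from the text).
B <- solve(crossprod(X), crossprod(X, Y))  # k x m matrix of coefficients
E <- Y - X %*% B                           # n x m matrix of residuals

# Dimensions of each piece of Y = XB + E.
dim(Y); dim(X); dim(B); dim(E)

# The decomposition holds by construction.
all.equal(Y, X %*% B + E, check.attributes = FALSE)

# lm() accepts a matrix response and should give the same coefficients.
fit <- lm(Y ~ Petal.Length, data = iris)
coef(fit)
```

Each column of B holds the intercept and slope for one response, so adding a further response variable only adds a column to Y, B, and E while X stays the same.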
2.27 GRAPHICAL APPROACHES
Performing inferential tests to help draw conclusions about population parameters is useful, but ultimately the findings of a statistical analysis should make their way into a graph or other visualization. Data visualization is a field in itself, and with the advent of modern computing power, possibilities exist today that could only be dreamt of in the past. Simple visualizations such as histograms, boxplots, and scatterplots can be useful in depicting findings but also in helping to verify assumptions that underlie the statistical model one is using. For example, since many tests of normality and equality of variances (and covariances) are relatively sensitive to the types of data to which they are applied, oftentimes researchers will generate simple plots in order to detect potential gross violations of such assumptions. We feature such techniques throughout the book.
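As a minimal sketch of such a visual check, the following R code draws a histogram and a normal Q-Q plot for a single variable. The choice of setosa sepal lengths from the built-in iris data is only for illustration, and these two plots are common choices rather than the only ones available.

```r
# Sepal length of one species from the built-in iris data (illustrative choice).
x <- iris$Sepal.Length[iris$Species == "setosa"]

# Histogram: look for gross skew or more than one mode.
h <- hist(x, main = "Setosa sepal length", xlab = "Sepal length")

# Normal Q-Q plot: points lying close to the line suggest approximate normality.
qq <- qqnorm(x)
qqline(x)
```

Plots like these will not confirm normality, but they can make a gross violation obvious at a glance.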
For graphical displays meant to communicate findings (rather than test assumptions), Friendly (2000) puts the field into context:
Designing good graphics is surely an art, but as surely, it is one that ought to be informed by science … In this view, an effective graphical display, like good writing, requires an understanding of its purpose – what aspects of the data are to be communicated to the viewer. In writing, we communicate most effectively when we know our audience and tailor the message appropriately. (p. 8)
In high‐dimensional space, the challenge of graphical approaches is to summarize data into lower dimensions, while still retaining most of the information in the original data. We feature some such plots in later chapters. For a thorough account of data visualization, see datavis.ca (Friendly, 2020). For sophisticated graphics using R, consult Wickham (2009).
For now, it is useful to briefly review some basic plots with which the reader is likely already familiar.
2.27.1 Box‐and‐Whisker Plots
The boxplot was a contribution of John Tukey (1977) in the spirit of what is called exploratory data analysis, or “EDA,” which encouraged scientists to spend more of their energy on descriptive techniques instead of focusing exclusively on confirmatory statistical tests. Boxplots of parent heights from Galton's data appear below:
> attach(Galton)
> boxplot(parent)
> library(lattice)
> bwplot(parent)
The boxplot provides what is generally known as a five‐number summary of a distribution. We can obtain most of the numbers we need with the summary function in R:
> summary(parent)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  64.00   67.50   68.50   68.31   69.50   73.00
Recall that the median is the point in the ordered data that divides the data set into two equal parts. The location of the median is computed by (n + 1)/2. In Galton's data, there are 928 observations, and so the median is located at the 464.5th point (i.e., (928 + 1)/2) in the ordered data set, that is, midway between the 464th and 465th ordered values. For parent, this value is equal to 68.50. The first and third quartiles represent the 25th and 75th percentiles and are 67.50 and 69.50, respectively. We can also compute the range as
> range(parent)
[1] 64 73
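The (n + 1)/2 rule for locating the median can be checked by hand in R. The sketch below uses a small hypothetical vector x rather than Galton's data so that it runs without any extra packages. With an even number of observations, the location falls between two ordered values, and the median is taken as their average.

```r
# Hypothetical toy data with an even number of observations.
x <- c(70, 64, 68, 73, 67, 69, 66, 71)
n <- length(x)
sorted <- sort(x)

loc <- (n + 1) / 2              # location of the median in the ordered data
lower <- sorted[floor(loc)]     # ordered value just below the location
upper <- sorted[ceiling(loc)]   # ordered value just above the location
med <- (lower + upper) / 2      # average of the two middle values

loc
med
med == median(x)                # compare against R's built-in median()

# For Galton's 928 observations, the same rule gives the location 464.5.
(928 + 1) / 2
```

When n is odd, floor(loc) and ceiling(loc) coincide, so the same code returns the single middle value.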
We can also generate boxplots by category. Throughout the book, we use Fisher's iris data (Fisher, 1936) in which flower characteristics such as sepal and petal length are categorized by species of flower. We plot sepal length by species:
> library(lattice)
> attach(iris)
> bwplot(Sepal.Length ~ Species)
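A numerical companion to the grouped boxplot can be produced by applying summary within each species. The sketch below is one possible way to do this in base R. It refers to the columns through iris$ and the data argument instead of attach(), which is a stylistic choice and not the approach used above.

```r
# Summary statistics of sepal length within each species.
by_species <- tapply(iris$Sepal.Length, iris$Species, summary)
by_species

# Medians only, corresponding to the center line of each box.
medians <- tapply(iris$Sepal.Length, iris$Species, median)
medians

# Base-graphics counterpart of the lattice call, using data = instead of attach().
boxplot(Sepal.Length ~ Species, data = iris, ylab = "Sepal length")
```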
Data points falling beyond the whiskers of the plots