A q × q matrix B is invertible if there is a q × q matrix B⁻¹ such that BB⁻¹ = B⁻¹B = I, where I is the q × q identity matrix (with 1's down the leading diagonal and 0's elsewhere). If the columns of an arbitrary matrix, say A, are linearly independent, then it can be shown that the inverse of the matrix A'A exists.
In MLR, when n ≥ m and when the columns of X are linearly independent so that the matrix (X'X)⁻¹ exists, the coefficients in β can be estimated in a unique fashion as:
β̂ = (X'X)⁻¹X'Y
The hat above β in the notation β̂ indicates that this vector contains numerical estimates of the unknown coefficients in β. If there are fewer observations than columns in X, that is, if n < m, then there are infinitely many solutions for β in Equation (2.5).
As an example, think of trying to fit two observations with a matrix X that has three columns. Then, geometrically, the expression Xβ in Equation (2.5) defines a hyperplane which, given that m = 3 in this case, is simply a plane. But there are infinitely many planes that pass through any two given points, and there is no way to determine which of these infinitely many solutions would best predict new observations.
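To make the calculation concrete, here is a minimal JSL sketch of this estimate using a small, made-up design matrix and response (the numbers are illustrative only, not data from the book):

  // Minimal sketch: MLR estimates via the normal equations (illustrative values only)
  X = [1 0.0, 1 0.5, 1 1.0, 1 1.5];        // n = 4 rows, m = 2 columns: intercept and one predictor
  Y = [1.1, 2.0, 2.9, 4.2];                // n = 4 responses
  betaHat = Inverse( X` * X ) * X` * Y;    // unique because the columns of X are linearly independent
  Show( betaHat );

When n < m, or when one column of X is a linear combination of the others, X'X has no inverse and this calculation breaks down, which is exactly the situation described above.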
Underfitting and Overfitting: A Simulation
To better understand the issues behind model fitting, let’s run the script PolyRegr.jsl by clicking on the correct link in the master journal.
The script randomly generates Y values for eleven points with X values plotted horizontally and equally spaced over the interval 0 to 1 (Figure 2.2). The points exhibit some curvature. The script uses MLR to predict Y from X, using various polynomial models. (Note that your points will differ from ours because of the randomness.)
When the slider in the bottom left corner is set at the far left, the order of the polynomial model is one. In other words, we are fitting the data with a line. In this case, the design matrix X has two columns, the first containing all 1s and the second containing the horizontal coordinates of the plotted points. The linear fit ignores the seemingly obvious pattern in the data; it is underfitting the data. This is evidenced by the residuals, whose magnitudes are illustrated using vertical blue lines. The RMSE (root mean square error) is calculated by squaring each residual, summing these squares, dividing the sum by the number of observations minus the number of estimated coefficients (one for the intercept plus one for each predictor), and then taking the square root.
As we shift the slider to the right, we are adding higher-order polynomial terms to the model. This is equivalent to adding additional columns to the design matrix. The additional polynomial terms provide a more flexible model that is better able to capture the important characteristics, or the structure, of the data.
Figure 2.2: Illustration of Underfitting and Overfitting, with Order = 1, 2, 3, and 10
However, we get to a point where we go beyond modeling the structure of the data and begin to model the noise in the data. Note that, as we increase the order of the polynomial, thereby adding more terms to the model, the RMSE progressively decreases. A polynomial of order 10, obtained by setting the slider all the way to the right, provides a perfect fit to the data and gives RMSE = 0 (bottom right plot in Figure 2.2). However, this model does not generalize to new data, because it has modeled both the structure and the noise, and by definition the noise is random and unpredictable. Our model has overfit the data.
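The following JSL is a rough sketch of this kind of experiment, not PolyRegr.jsl itself; the generating curve, noise level, and default order are made-up choices. It builds a polynomial design matrix of a chosen order, fits it by MLR, and computes the RMSE:

  // Minimal sketch in the spirit of PolyRegr.jsl (illustrative, not the book's script)
  order = 3;                               // polynomial order; try values from 1 up to 10
  n = 11;
  x = (0 :: 10)` / 10;                     // 11 equally spaced x values on [0, 1]
  y = 2 + 3 * x - 4 * (x :* x) + J( n, 1, Random Normal( 0, 0.2 ) );   // curved structure plus noise
  X = J( n, 1, 1 );                        // design matrix starts with a column of 1s
  xp = J( n, 1, 1 );
  For( k = 1, k <= order, k++,
      xp = xp :* x;                        // next power of x
      X = X || xp;                         // add it as a new column
  );
  betaHat = Inverse( X` * X ) * X` * y;    // MLR fit
  resid = y - X * betaHat;
  RMSE = Sqrt( (resid` * resid)[1] / (n - (order + 1)) );
  Show( RMSE );

Increasing order adds columns to the design matrix X, exactly as described above.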
In fitting models, we must strike a balance between modeling the intrinsic structure of the data and modeling the noise in the data. One strategy for reaching this goal is the use of cross-validation, which we shall discuss in the section “Choosing the Number of Factors” in Chapter 4. You can close the report produced by PolyRegr.jsl at this point.
The Effect of Correlation among Predictors: A Simulation
In MLR, correlation among the predictors is called multicollinearity. We explore the effect of multicollinearity on estimates of the regression coefficients by running the script Multicollinearity.jsl. Do this by clicking on the correct link in the master journal. The script produces the launch window shown in Figure 2.3.
Figure 2.3: Multicollinearity Simulation Launch Window
The launch window enables you to set conditions to simulate data from a known model:
• You can set the values of the three regression coefficients: Beta0 (constant); Beta1 (X1 coefficient); and Beta2 (X2 coefficient). Because there are three regression parameters, you are defining a plane that models the mean of the response, Y. In symbols,
E[Y] = β0 + β1X1 + β2X2
where the notation E[Y] represents the expected value of Y.
• The noise that is applied to Y is generated from a normal distribution with mean 0 and with the standard deviation that you set as Sigma of Random Noise under Other Parameters. In symbols, this means that ε in the expression
Y = β0 + β1X1 + β2X2 + ε
has a normal distribution with mean 0 and standard deviation equal to the value you set.
• You can specify the correlation between the values of X1 and X2 using the slider for Correlation of X1 and X2 under Other Parameters. X1 and X2 values will be generated for each simulation from a multivariate normal distribution with the specified correlation (a simplified sketch of this kind of simulation appears after this list).
• In the Size of Simulation panel, you can specify the Number of Points to be generated for each simulation, as well as the Number of Simulations to run.
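The JSL below is a simplified sketch of how one such simulated data set might be generated and fit; it is not the code of Multicollinearity.jsl, and the coefficient values, noise standard deviation, correlation, and number of points are illustrative stand-ins for the launch window settings. The correlated predictors are built from independent normal draws, one standard way to achieve a target correlation:

  // Simplified sketch of one simulated data set (illustrative settings, not the script itself)
  n = 20;                                   // Number of Points
  rho = 0.92;                               // Correlation of X1 and X2
  sigma = 1;                                // Sigma of Random Noise
  beta0 = 1; beta1 = 2; beta2 = 3;          // true coefficients set in the launch window
  X1 = J( n, 1, Random Normal( 0, 1 ) );
  X2 = rho * X1 + Sqrt( 1 - rho ^ 2 ) * J( n, 1, Random Normal( 0, 1 ) );   // correlation rho with X1
  Y = beta0 + beta1 * X1 + beta2 * X2 + J( n, 1, Random Normal( 0, sigma ) );
  X = J( n, 1, 1 ) || X1 || X2;             // design matrix with intercept column
  betaHat = Inverse( X` * X ) * X` * Y;     // estimates of beta0, beta1, beta2
  Show( betaHat );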
Once you have set values for the simulation using the slider bars, generate results by clicking Simulate. Depending on your screen size, you can view multiple results simultaneously without closing the launch window.
Let’s first run a simulation with the initial settings. Then run a second simulation after moving the Correlation of X1 and X2 slider to a large, positive value. (We have selected 0.92.) Your reports will be similar to those shown in Figure 2.4.
Figure 2.4: Comparison of Design Settings, Low and High Predictor Correlation
The two graphs reflect the differences in the settings of the X variables for the two correlation scenarios. In the first, the points are evenly distributed in a circular pattern. In the second, the points are condensed into a narrow elliptical pattern. These patterns show the geometry of the design matrix for each scenario.
In the high correlation scenario, note that high values of X1 tend to be associated with high values of X2, and that low values of X1 tend to be associated with low values of X2. This is exactly what is expected for positive correlation. (For a definition of the correlation coefficient between observations, select Help > Books > Multivariate Methods and search for “Pearson Product-Moment Correlation”).
The true and estimated coefficient values are shown at the bottom of each plot. Because our model was not deterministic—the Y values were generated so that their means are linear functions of X1 and X2, but the actual values are affected by noise—the estimated coefficients are just that, estimates, and as such, they reflect uncertainty: a new set of simulated data gives a new set of estimates.
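One way to see this uncertainty directly is to repeat the simulation many times and look at the spread of the estimates. The JSL below is a minimal sketch of that idea under the same illustrative settings as before (it is not part of the script); comparing the spread for a low and a high value of rho is one way to explore the effect of the correlation on the estimates:

  // Sketch: repeat the simulation and collect the estimates of the X1 coefficient
  nsim = 500;                               // number of simulated data sets (illustrative)
  n = 20; rho = 0.92; sigma = 1;            // illustrative settings, as above
  beta0 = 1; beta1 = 2; beta2 = 3;
  b1 = J( nsim, 1, 0 );                     // will hold the estimate of beta1 from each simulation
  For( s = 1, s <= nsim, s++,
      X1 = J( n, 1, Random Normal( 0, 1 ) );
      X2 = rho * X1 + Sqrt( 1 - rho ^ 2 ) * J( n, 1, Random Normal( 0, 1 ) );
      Y = beta0 + beta1 * X1 + beta2 * X2 + J( n, 1, Random Normal( 0, sigma ) );
      X = J( n, 1, 1 ) || X1 || X2;
      betaHat = Inverse( X` * X ) * X` * Y;
      b1[s] = betaHat[2];                   // the estimate of the X1 coefficient
  );
  Show( Mean( b1 ), Std Dev( b1 ) );        // compare these for low and high values of rho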