of the X1, X2 plane and downward in the lower right of the X1, X2 plane. Specifically, the relationship is given by Y = –X1 + .75X2.
The next two plots, Principal Components and PLS Weights, are obtained using simulated values for X1 and X2. But Y is computed directly using the relationship shown in the Contour Plot.
The Principal Components plot shows the two principal components. The direction of the first component, PC1, captures as much variation as possible in the values of X1 and X2 regardless of the value of Y. In fact, PC1 is essentially perpendicular to the direction of increase in Y, as shown in the contour plot. PC1 ignores any variation in Y. The second component, PC2, captures residual variation, again ignoring variation in Y.
The PLS Weights plot shows the directions of the two PLS factors, or latent variables. Note that PLS1 is rotated relative to PC1. PLS1 attempts to explain variation in X1 and X2 while also explaining some of the variation in Y. You can see that, while PC1 is oriented in a direction that gives no information about Y, PLS1 is rotated slightly toward the direction of increase (or decrease) for Y.
This simulation illustrates the fact that PLS tries to balance the requirements of dimensionality reduction in the X space with the need to explain variation in the response. You can close the report produced by the script PLS_PCA.jsl now.
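If you would like to see the same geometry outside of JMP, here is a minimal sketch in Python with numpy (it is not the PLS_PCA.jsl script itself). The sample size, random seed, and the correlation between X1 and X2 are assumptions, since the simulation settings are not given in the text; only the relationship Y = –X1 + .75X2 comes from the Contour Plot. The sketch computes the first principal component direction from X alone, and the first PLS weight direction, which for a single response is proportional to X'Y.

```python
import numpy as np

# Illustrative only (not PLS_PCA.jsl): the correlation structure of X1 and X2
# below is assumed; only Y = -X1 + 0.75*X2 comes from the text.
rng = np.random.default_rng(123)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # correlated predictors (assumption)
X = np.column_stack([x1, x2])
y = -x1 + 0.75 * x2                        # relationship from the Contour Plot

Xc = X - X.mean(axis=0)                    # center the predictors
yc = y - y.mean()

# PC1: direction of maximum variation in X, computed without reference to Y
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]

# First PLS weight direction: for one response it is proportional to X'y,
# so it is pulled toward the direction in which Y changes
w1 = Xc.T @ yc
w1 = w1 / np.linalg.norm(w1)

print("PC1 direction :", pc1)
print("PLS1 direction:", w1)
```

Comparing the two printed directions shows the rotation of PLS1 relative to PC1 that the plots illustrate.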
PLS Scores and Loadings
Some Technical Background
Extracting Factors
Before considering some examples that illustrate more of the basic PLS concepts, let’s introduce some of the technical background that underpins PLS. As you know by now, a main goal of PLS is to predict one or more responses from a collection of predictors. This is done by extracting linear combinations of the predictors that are variously referred to as latent variables, components, or factors. We use the term factor exclusively from now on to be consistent with JMP usage.
We assume that all variables are at least centered. Also keep in mind that there are various versions of PLS algorithms. We mentioned earlier that JMP provides two approaches: NIPALS and SIMPLS. The following discussion describes PLS in general terms, but to be completely precise, one needs to refer to the specific algorithm in use.
With this caveat, let’s consider the calculations associated with the first PLS factor. Suppose that X is an n x m matrix whose columns are the m predictors and that Y is an n x k matrix whose columns are the k responses. The first PLS factor is defined by an m x 1 weight vector, w1, whose elements reflect the covariance between the predictors in X and the responses in Y. The jth entry of w1 is the weight associated with the jth predictor. The vector w1 defines a linear combination of the variables in X that, subject to norm restrictions, has the largest possible covariance with a linear combination of the variables in Y. This linear combination is the first PLS factor.
The weight vector w1 is applied to each observation (row) of X. The resulting n weighted linear combinations of the predictor values are called X scores, denoted by the vector t1. In other words, the X scores are the entries of the vector t1 = Xw1. Note that the score vector, t1, is n x 1; each observation is given an X score on the first factor. Think of the vector w1 as defining a linear transformation mapping the m predictors to a one-dimensional subspace. With this interpretation, Xw1 represents the mapping of the data to this one-dimensional subspace.
Technically, t1 is a linear combination of the variables in X that has maximum covariance with a linear combination of the variables in Y, subject to normalizing constraints. That is, there is a vector c1 with the property that the covariance between t1 = Xw1 and u1 = Yc1 is a maximum. The vector c1 is a Y weight vector, also called a loading vector. The elements of the vector u1 are the Y scores. So, for the first factor, we would expect the X scores and the Y scores to be strongly correlated.
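As a sketch of this linear algebra (in Python with numpy, rather than JMP), the weight vectors w1 and c1 can be obtained as the leading left and right singular vectors of X'Y; the scores then follow directly. The function name is hypothetical and the code is meant only to illustrate the definitions above, not to reproduce JMP's internal computations.

```python
import numpy as np

def first_pls_factor(X, Y):
    """First PLS weight vectors and scores (X and Y assumed centered/scaled).

    The unit-length weight vectors w1 (for X) and c1 (for Y) that maximize
    the covariance between Xw1 and Yc1 are the leading left and right
    singular vectors of X'Y.
    """
    U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    w1 = U[:, 0]        # m x 1 X weight vector
    c1 = Vt[0, :]       # k x 1 Y weight (loading) vector
    t1 = X @ w1         # n x 1 X scores: one score per observation
    u1 = Y @ c1         # n x 1 Y scores
    return w1, c1, t1, u1
```

With these definitions, the X scores t1 and the Y scores u1 should be strongly correlated, as noted above.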
To obtain subsequent factors, we use all factors available to that point to predict both X and Y. In the NIPALS algorithm, the process of obtaining a new weight vector and defining new X scores is applied to the residuals from the predictive models for X and Y. (We say that X and Y are deflated and the process itself is called deflation.) This ensures that subsequent factors are independent of (orthogonal to) all previously extracted factors. In the SIMPLS algorithm, the deflation process is applied to the cross-product matrix. (For complete information, see Appendix 1.)
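The deflation idea can be sketched with a simplified NIPALS-style loop. Two caveats: the iterative inner loop that NIPALS uses to find each weight vector converges to the dominant singular vector of the deflated cross-product matrix, which is computed here directly by SVD, and this is not JMP's implementation.

```python
import numpy as np

def nipals_pls_sketch(X, Y, a):
    """Extract `a` PLS factors by deflation (X, Y assumed centered/scaled).

    Simplified NIPALS-style loop: each weight vector is the dominant
    singular vector of the current (deflated) cross-product matrix, and
    X and Y are then deflated by what the new factor explains.
    """
    W, T, P, Q = [], [], [], []
    Xr, Yr = X.copy(), Y.copy()             # residual (deflated) matrices
    for _ in range(a):
        u_svd, _, _ = np.linalg.svd(Xr.T @ Yr, full_matrices=False)
        w = u_svd[:, 0]                     # weight vector for this factor
        t = Xr @ w                          # X scores for this factor
        p = Xr.T @ t / (t @ t)              # X loadings: regress Xr on t
        q = Yr.T @ t / (t @ t)              # Y loadings: regress Yr on t
        Xr = Xr - np.outer(t, p)            # deflate X
        Yr = Yr - np.outer(t, q)            # deflate Y
        W.append(w); T.append(t); P.append(p); Q.append(q)
    return tuple(np.column_stack(M) for M in (W, T, P, Q))
```

Because each new weight vector is computed from the deflated matrices, the resulting score vectors are orthogonal to those already extracted.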
Models in Terms of X Scores
Suppose that a factors are extracted. Then there are:
• a weight vectors, w1, w2,...,wa
• a X-score vectors, t1, t2,...,ta
• a Y-score vectors, u1, u2,...,ua
We can now define three matrices: W is the m x a matrix whose columns consist of the weight vectors; T and U are the n x a matrices whose columns consist of the X-score and Y-score vectors, respectively. In NIPALS, the Y scores, ui, are regressed on the X scores, ti, in an inner relation regression fit.
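One common way to write this inner relation is a through-the-origin regression of each Y-score vector on the corresponding X-score vector (through the origin because the scores are centered). A small numpy sketch, with a hypothetical function name:

```python
import numpy as np

def inner_relation_slopes(T, U):
    """Inner relation of NIPALS: for each factor i, regress the Y scores u_i
    on the X scores t_i through the origin (the scores are centered).
    Returns the a slopes b_i = (t_i' u_i) / (t_i' t_i)."""
    return np.sum(T * U, axis=0) / np.sum(T * T, axis=0)
```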
Recall that the matrix Y contains k responses, so that Y is n x k. Let’s also assume that X and Y are both centered and scaled. For both NIPALS and SIMPLS, predictive models for both Y and X can be given in terms of a regression on the scores, T. Although we won’t go into the details at this point, we introduce notation for these predictive models:
(4.1) X̂ = TP'
      Ŷ = TQ'
where P is m x a and Q is k x a. The matrix P is called the X loading matrix, and its columns are the scaled X loadings. The matrix Q is sometimes called the Y loading matrix. In NIPALS, its columns are proportional to the Y loading vectors. In SIMPLS, when Y contains more than one response, its representation in terms of loading vectors is more complex. Together with the scores in T, each loading matrix reproduces the part of X or Y that is explained by the extracted factors. (See Appendix 1.) Each column is associated with a specific factor. For example, the ith column of P is associated with the ith extracted factor. The jth element of the ith column of P reflects the strength of the relationship between the jth predictor and the ith extracted factor. The columns of Q are interpreted similarly.
To facilitate the task of determining how much a predictor or response variable contributes to a factor, the loadings are usually scaled so that each loading vector has length one. This makes it easy to compare loadings across factors and across the variables in X and Y.
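Here is a sketch of how equation (4.1) can be computed, under the assumption that P and Q are obtained by least squares regression of the (centered and scaled) X and Y on the score matrix T; the last two lines rescale each loading vector to unit length, as described above. JMP's stored loadings may be normalized differently depending on the algorithm, so treat this as an illustration only.

```python
import numpy as np

def loadings_and_fits(X, Y, T):
    """Equation (4.1) as least squares: regress X and Y on the scores T,
    then form the fitted values X_hat = T P' and Y_hat = T Q'."""
    TtT_inv = np.linalg.inv(T.T @ T)
    P = X.T @ T @ TtT_inv                 # m x a X loading matrix
    Q = Y.T @ T @ TtT_inv                 # k x a Y loading matrix
    X_hat = T @ P.T                       # part of X explained by the factors
    Y_hat = T @ Q.T                       # part of Y explained by the factors
    # rescale each loading vector (column) to length one for interpretation
    P_scaled = P / np.linalg.norm(P, axis=0)
    Q_scaled = Q / np.linalg.norm(Q, axis=0)
    return P, Q, X_hat, Y_hat, P_scaled, Q_scaled
```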
Model in Terms of Xs
Let’s continue to assume that the variables in the matrices X and Y are centered and scaled. We can consider the Ys to be related directly to the Xs in terms of a theoretical model as follows:
Y = Xβ + εY.
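For the NIPALS factorization, a standard identity in the PLS literature expresses the coefficient matrix implied by the fitted model in terms of the weight and loading matrices: since T = XW(P'W)^-1 and Ŷ = TQ', the implied estimate is β̂ = W(P'W)^-1Q'. A numpy sketch, shown as an illustration rather than as JMP's computation:

```python
import numpy as np

def pls_coefficient_matrix(W, P, Q):
    """Coefficient matrix implied by the fitted PLS model (NIPALS form).

    Using the identity T = X W (P'W)^{-1} together with Y_hat = T Q'
    gives beta_hat = W (P'W)^{-1} Q', an m x k matrix.
    Sketch only; see Appendix 1 for the algorithm details.
    """
    return W @ np.linalg.inv(P.T @ W) @ Q.T
```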