information for three of the methods covered. The upper figure illustrates that the two input blocks share some information (C1 and C2), but also have substantial distinct components and noise (see Chapter 2), here contained in the X (as the darker blue and darker yellow). The lower three figures show how different methods handle the common information. For MB-PLS, no initial separation is attempted since the data blocks are concatenated before analysis starts. For SO-PLS, the common predictive information is handled as part of the X1 block before the distinct part of the X2 block is modelled. The extra predictive information in X2 corresponds to the additional variability, as will be discussed in the SO-PLS section. For PO-PLS, the common information is explicitly separated from the distinct parts before regression.
Figure 7.2 Illustration of link between concatenated X blocks and the response, Y, through the MB-PLS super-scores, T.
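As a concrete, unofficial illustration of the linking structure in Figure 7.2, the sketch below block-scales two input matrices, concatenates them, and extracts MB-PLS-style super-scores with an ordinary PLS fit. The function name, the Frobenius-norm block scaling, and the random example data are assumptions for illustration, not the book's implementation.

```python
# Minimal sketch (not the book's code) of the MB-PLS idea in Figure 7.2:
# block-scale each X_m, concatenate, and fit one PLS model; the scores of that
# model act as the super-scores T linking the blocks to Y.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def mbpls_super_scores(blocks, Y, n_components=2):
    # Frobenius-norm block scaling (one common choice) so no block dominates.
    scaled = [X / np.linalg.norm(X) for X in blocks]
    X_concat = np.hstack(scaled)                  # concatenated input blocks
    pls = PLSRegression(n_components=n_components).fit(X_concat, Y)
    T = pls.transform(X_concat)                   # super-scores T
    return pls, T

# Toy example standing in for two spectral blocks and one response
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(20, 50)), rng.normal(size=(20, 30))
Y = rng.normal(size=(20, 1))
model, T = mbpls_super_scores([X1, X2], Y)
```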
Figure 7.3 Cross-validated explained variance for various choices of number of components for single- and two-response modelling with MB-PLS.
Figure 7.4 Super-weights (w) for the first and second component from MB-PLS on Raman data predicting the PUFA sample response. Block-splitting indicated by vertical dotted lines.
Figure 7.5 Block-weights (wm) for the first and second component from MB-PLS on Raman data predicting the PUFA sample response. Block-splitting indicated by vertical dotted lines.
Figure 7.6 Block-scores (tm, for left, middle, and right Raman block, respectively) for the first and second component from MB-PLS on Raman data predicting the PUFA sample response. Colours of the samples indicate the PUFA concentration as % in fat (PUFA fat) and size indicates % in sample (PUFA sample). The two percentages given in each axis label are the cross-validated explained variance for PUFA sample weighted by relative block contributions and the calibrated explained variance for the block (Xm), respectively.
Figure 7.7 Classification by regression. A dummy matrix (here with three classes, c for class) is constructed according to which group the different objects belong to. Then this dummy matrix is related to the input blocks in the standard way described above.
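A hedged sketch of the classification-by-regression idea in Figure 7.7: the class labels are expanded to a dummy (one-hot) matrix, that matrix is regressed on the input data (here a single concatenated X) with PLS, and new samples are assigned to the class with the largest predicted value. The function names and the use of a concatenated X are assumptions for illustration.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def fit_plsda(X, labels, n_components=2):
    classes = np.unique(labels)
    # I x C dummy matrix: one column per class, 1 where the object belongs to it
    Y_dummy = (np.asarray(labels)[:, None] == classes[None, :]).astype(float)
    pls = PLSRegression(n_components=n_components).fit(X, Y_dummy)
    return pls, classes

def predict_class(pls, classes, X_new):
    Y_hat = pls.predict(X_new)                 # continuous prediction per class column
    return classes[np.argmax(Y_hat, axis=1)]   # assign the class with the largest value
```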
Figure 7.8 AUROC values of different classification tasks. Source: (Deng et al., 2020). Reproduced with permission from ACS Publications.
Figure 7.9 Super-scores (called global scores here) and block-scores for the sparse MB-PLS model of the piglet metabolomics data. Source: (Karaman et al., 2015). Reproduced with permission from Springer.
Figure 7.10 Linking structure of SO-PLS. Scores for both X1 and the orthogonalised version of X2 are combined in a standard LS regression model with Y as the dependent block.
Figure 7.11 The SO-PLS method iterates between PLS regression and orthogonalisation, deflating the input block and responses in every cycle. This is illustrated using three input blocks X1, X2, and X3. The upper figure represents the first PLS regression of Y onto X1. Then the residuals from this step, obtained by orthogonalisation, go to the next step (figure in the middle), where the same PLS procedure is repeated. The same continues for the last block X3 in the lower part of the figure. In each step, loadings, scores, and weights are available.
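The sketch below is a minimal, unofficial illustration of one SO-PLS cycle as drawn in Figure 7.11, restricted to two blocks: fit PLS of Y on X1, deflate the response, orthogonalise X2 with respect to the X1 scores, and fit the next PLS model on what remains. The helper names and component numbers are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def orthogonalise(M, T):
    # Remove from M everything explained by the scores T: M - T (T^+ M)
    return M - T @ np.linalg.lstsq(T, M, rcond=None)[0]

def so_pls_two_blocks(X1, X2, Y, a1=2, a2=2):
    Y = np.asarray(Y).reshape(len(Y), -1)      # ensure a 2-D response
    pls1 = PLSRegression(n_components=a1).fit(X1, Y)
    T1 = pls1.transform(X1)                    # scores of the first block
    Y_res = Y - pls1.predict(X1)               # deflate the response
    X2_orth = orthogonalise(X2, T1)            # part of X2 not already in T1
    pls2 = PLSRegression(n_components=a2).fit(X2_orth, Y_res)
    return pls1, pls2
```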
Figure 7.12 The CVANOVA is used for comparing cross-validated residuals for different prediction methods/models or for different numbers of blocks in the models (for instance in SO-PLS). The squares or the absolute values of the cross-validated prediction residuals, Dik, are compared using a two-way ANOVA model. The figure below the model represents the data set used. The indices i and k denote the two effects: sample and method. The I samples for each method/model (equal to three in the example) are the same, so a standard two-way ANOVA is used. Note that the error variance in the ANOVA model for the three methods is not necessarily the same, so this must be considered a pragmatic approach.
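Below is a small sketch of the CVANOVA computation described for Figure 7.12, assuming the cross-validated residuals are already available: the absolute residuals Dik are arranged with sample and method as the two factors and compared with a two-way ANOVA. The dictionary layout, the function name, and the random placeholder residuals are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def cvanova(residuals):
    # residuals: dict mapping method/model name -> array of I cross-validated residuals
    rows = [{"sample": i, "method": k, "d": abs(r)}
            for k, res in residuals.items() for i, r in enumerate(res)]
    df = pd.DataFrame(rows)
    model = smf.ols("d ~ C(sample) + C(method)", data=df).fit()
    return anova_lm(model)   # F-test for the sample and method effects

rng = np.random.default_rng(1)
table = cvanova({"X1": rng.normal(size=30), "X1+X2": 0.8 * rng.normal(size=30)})
print(table)
```

A pairwise follow-up such as Tukey's test (as used for Figure 7.20) could be run on the same long-format data, for instance with statsmodels' pairwise_tukeyhsd.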
Figure 7.13 Måge plot showing cross-validated explained variance for all combinations of components for the four input blocks (up to six components in total) for the wine data (the digits for each combination correspond to the order A, B, C, D, as described above). The different combinations of components are visualised by four numbers separated by a dot. The panel to the lower right is a magnified view of the most important region (2, 3, and 4 components) for selecting the number of components. Coloured lines show prediction ability (Q2, see cross-validation in Section 2.7.5) for the different input blocks, A, B, C, and D, used independently.
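As an unofficial sketch of how the x-axis of a Måge plot such as Figure 7.13 is populated, the snippet below enumerates every per-block component combination whose total stays within the chosen maximum and builds the dotted labels; each combination would then be cross-validated and its explained variance plotted against the total number of components. The function name and defaults are assumptions.

```python
from itertools import product

def component_combinations(n_blocks=4, max_total=6):
    # All tuples (a_1, ..., a_M) of per-block component numbers with sum <= max_total
    return [combo for combo in product(range(max_total + 1), repeat=n_blocks)
            if sum(combo) <= max_total]

# Dotted labels as in the Måge plot, e.g. (1, 0, 2, 1) -> "1.0.2.1"
labels = [".".join(map(str, c)) for c in component_combinations()]
```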
Figure 7.14 PCP plots for wine data. The upper two plots are the score and loading plots for the predicted Y; the other three are the projected input X-variables from the blocks B, C, and D. Block A is not present since it is not needed for prediction. The sizes of the points for the Y scores follow the scale of the ‘overall quality’ (small to large), while colour follows the scale of ‘typical’ (blue, through green to yellow).
Figure 7.15 Måge plot showing cross-validated explained variance for all combinations of components from the three blocks with a maximum of 10 components in total. The three coloured lines indicate pure block models, and the inset is a magnified view around maximum explained variance.
Figure 7.16 Block-wise scores (Tm) with 4+3+3 components for left, middle, and right block, respectively (two first components for each block shown). Dot sizes show the percentage PUFA in sample (small = 0%, large = 12%), while colour shows the percentage PUFA in fat (see colour-bar on the left).
Figure 7.17 Block-wise (projected) loadings with 4+3+3 components for left, middle, and right block, respectively (two first for each block shown). Dotted vertical lines indicate transition between blocks. Note the larger noise level for components six and nine.
Figure 7.18 Block-wise loadings from restricted SO-PLS model with 4+3+3 components for left, middle, and right block, respectively (two first for each block shown). Dotted vertical lines indicate transition between blocks.
Figure 7.19 Måge plot for restricted SO-PLS showing cross-validated explained variance for all combinations of components from the three blocks with a maximum of 10 components in total. The three coloured lines indicate pure block models, and the inset is a magnified view around maximum explained variance.
Figure 7.20 CV-ANOVA results based on the cross-validated SO-PLS models fitted on the Raman data. The circles represent the average absolute values of the difference between measured and predicted response, Dik = |yik − ŷik| (from cross-validation), obtained as new blocks are incorporated. The four ticks on the x-axis represent the different models, from the simplest (intercept, predicting using the average response value) to the most complex, containing all three blocks (‘X left’, ‘X middle’, and ‘X right’). The vertical lines indicate (random) error regions for the models obtained. Overlap of lines means no significant difference according to Tukey’s pair-wise test (Studentised range) obtained from the CV-ANOVA model. This shows that ‘X middle’ adds significantly to the predictive ability, while ‘X right’ has a negligible contribution.
Figure 7.21 Loadings from Principal Components of Predictions applied to the 5+4+0 component solutions of SO-PLS on Raman data.
Figure 7.22 RMSEP for fish data with interactions. The standard SO-PLS procedure is used with the order of blocks described in the text. The three curves correspond to different numbers of components for the interaction part. The symbol * in the original figure (see reference) between the blocks is the same interaction operator as described by the ∘ above. Source: (Næs et al., 2011b). Reproduced with permission from John Wiley and Sons.
Figure 7.23 Regression coefficients for the interactions for the fish data with 4+2+2 components for blocks X1, X2 and the interaction block X3. Regression coefficients are obtained by back-transforming the components