a similar way as shown right after Algorithm 7.3. The full regression vector for the interaction block (with 24 terms, see above) is split into four parts according to the four levels of the two design factors (see description of coding above). Each of the levels of the design factors has its own line in the figure. As can be seen, there are only two lines for each design factor, corresponding to the way the design matrix was handled (see explanation at the beginning of the example). The numbers on the x-axis represent wavelengths in the NIR region. Lines close to 0 are factor combinations which do not contribute to interaction. Source: Næs et al. (2011a). Reproduced with permission from Wiley.
Figure 7.24 SO-PLS results using candy and assessor variables (dummy variables) as X and candy attribute assessments as Y. Component numbers in parentheses indicate how many components were extracted in the other block before the current block.
Figure 7.25 Illustration of the idea behind PO-PLS for three input blocks, to be read from left to right. The first step is data compression of each block separately (giving scores T1, T2 and T3) before a GCA is run to obtain common components. Then each block is orthogonalised (both the Xm and Y) with respect to the common components, and PLS regression is used for each of the blocks separately to obtain block-wise distinct scores. The F in the figure is the orthogonalised Y. The common and block-wise scores are finally combined in a joint regression model. Note that the different T blocks can have different numbers of columns.
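As a reading aid for the orthogonalisation step in this caption, a minimal numpy sketch of projecting a block onto the orthogonal complement of the common scores; the function name and shapes are illustrative assumptions, not the book's implementation:

```python
import numpy as np

def orthogonalise(M, Tc):
    """Remove the common-score subspace Tc from a block M
    (cf. the third step in Figure 7.25), assuming Tc holds the
    common components as columns:
    M_orth = M - Tc (Tc' Tc)^+ Tc' M.
    """
    proj = Tc @ np.linalg.pinv(Tc.T @ Tc) @ Tc.T  # projector onto span(Tc)
    return M - proj @ M
```

The same projection, applied to Y, would give the orthogonalised F referred to in the caption.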
Figure 7.26 PO-PLS calibrated/fitted and validated explained variance when applied to three-block Raman with PUFA responses.
Figure 7.27 PO-PLS calibrated explained variance when applied to three-block Raman with PUFA responses.
Figure 7.28 PO-PLS common scores when applied to three-block Raman with PUFA responses. The plot to the left is for the first component from X1,2,3 versus X1,2 and the one to the right is for the first component from X1,2,3 versus X1,3. Size and colour of the points follow the amount of PUFA % in sample and PUFA % in fat, respectively (see also the numbers presented in the text for the axes). The percentages reported in the axis labels are calibrated explained variance for the two responses, corresponding to the numbers in Figure 7.26.
Figure 7.29 PO-PLS common loadings when applied to three-block Raman with PUFA responses.
Figure 7.30 PO-PLS distinct loadings when applied to three-block Raman with PUFA responses.
Figure 7.31 ROSA component selection searches among candidate scores (tm) from all blocks for the one that minimises the distance to the residual response Y. After deflation with the winning score (Ynew = Y − trq′r = Y − trt′rY) the process is repeated until a desired number of components has been extracted. Zeros in weights are shown in white for an arbitrary selection of blocks, here blocks 2, 1, 3, 1. Loadings, P, and weights, W (see text), span all blocks.
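A minimal numpy sketch of the deflation step stated in this caption, assuming the winning score tr is (or is normalised to) unit norm; the function name is illustrative, not from the book:

```python
import numpy as np

def rosa_deflate(Y, t):
    """One ROSA deflation step (cf. Figure 7.31): remove the winning
    score's contribution from the residual response. With unit-norm t,
    q = Y' t and Y_new = Y - t q' = Y - t t' Y.
    """
    t = t / np.linalg.norm(t)   # ensure unit norm so t' t = 1
    q = Y.T @ t                 # response loadings for this component
    return Y - np.outer(t, q)   # deflated residual response
```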
Figure 7.32 Cross-validated explained variance when ROSA is applied to three-block Raman with PUFA in sample and in fat on the left and both PUFA responses simultaneously on the right.
Figure 7.33 ROSA weights (first five components) when applied to three-block Raman with the PUFA sample response.
Figure 7.34 Summary of cross-validated candidate scores from blocks. Top: residual RMSECV (root mean square error of cross-validation) for each candidate component. Bottom: correlation between candidate scores and the score from the block that was selected. White dots show which block was selected for each component.
Figure 7.35 The decision paths for ‘Common and distinct components’ (implicitly handled, additional contribution from block, or explicitly handled) and ‘Choosing components’ (single choice, for each block, or more complex) coincide, as do ‘Invariance to block scaling’ (block scaling affects decomposition or not) and ‘# components’ (same number for all blocks or individual choice). When traversing the tree from left or right, we therefore need to follow either a green or a blue path through the ellipsoids, e.g., starting from ‘# components’ leads to choices ‘Different’ or ‘Same’. More in-depth explanations of the concepts are found in the text above.
Figure 8.1 Figures (a)–(c) represent an L-structure/skeleton and Figure (d) a domino structure. See also notation and discussion of skeletons in Chapter 3. The grey background in (b) and (c) indicates that some methods analyse the two modes sequentially. Different topologies, i.e., different ways of linking the blocks, associated with this skeleton will be discussed for each particular method.
Figure 8.2 Conceptual illustration of common information shared by the three blocks. The green colour represents the common column space of X1 and X2, and the red the common row space of X1 and X3. The orange in the upper corner of X1 represents the joint commonness of the two spaces. The blue is the distinct parts of the blocks. This illustration is conceptual; there is as yet no mathematical definition of the commonness between row spaces and column spaces simultaneously.
Figure 8.3 Topologies for four different methods. The first three ((a), (b), (c)) are based on analysing the two modes in sequence. (a) PLS used for both modes (this section). (b) Correlation-first approach (Section 8.5.4). (c) Using unlabelled data in calibration (Section 8.5.2). The topology in (d) will be discussed in Section 8.3. We refer to the main text for more detailed descriptions. The dimensions of the blocks are X1 (I × N), X2 (I × J), and X3 (K × N). The topology in (a) corresponds to external preference mapping, which will be given main attention here.
Figure 8.4 Scheme for information flow in preference mapping with segmentation of consumers.
Figure 8.5 Preference mapping of dry fermented lamb sausages: (a) sensory PCA scores and loadings (from X2), and (b) consumer loadings presented for four segments determined by cluster analysis. Source: Helgesen et al. (1997). Reproduced with permission from Elsevier.
Figure 8.6 Results from consumer liking of cheese. Estimated effects of the design factors in Table 8.3. Source: Almli et al. (2011). Reproduced with permission from Elsevier.
Figure 8.7 Results from consumer liking of cheese. (a) Loadings from PCA of the residuals from ANOVA (using consumers as rows). Letters R/P in the loading plot refer to raw/pasteurised milk, and E/S refer to everyday/special occasions. (b) PCA scores from the same analysis with indication of the two consumer segments. Source: Almli et al. (2011). Reproduced with permission from Elsevier.
Figure 8.8 Relations between segments and consumer characteristics. Source: Almli et al. (2011). Reproduced with permission from Elsevier.
Figure 8.9 Topology for the extension. This is a combination of a regression situation along the horizontal axis and a path model situation along the vertical axis.
Figure 8.10 L-block scheme with weights w’s. The w’s are used for calculating scores for deflation.
Figure 8.11 Endo-L-PLS results for fruit liking study. Source: Martens et al. (2005). Reproduced with permission from Elsevier.
Figure 8.12 Classification CV-error as a function of the α value and the number of L-PLS components. Source: Sæbø et al. (2008b). Reproduced with permission from Elsevier.
Figure 8.13 (a) Data structure for labelled and unlabelled data. (b) Flow chart for how to utilise unlabelled data.
Figure 8.14 Tree for selecting methods with complex data structures.
Figure 9.1 General setup for fusing heterogeneous data using representation matrices. The variables in the blocks X1, X2