Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs
Table 5.4 Properties of methods for common and distinct components. The matrix D indicates a diagonal matrix with all positive elements on its diagonal.
Table 6.1 Overview of methods. Legend: U=unsupervised, S=supervised, C=complex, HOM=homogeneous data, HET=heterogeneous data, SEQ=sequential, SIM=simultaneous, MOD=model-based, ALG=algorithm-based, C=common, CD=common/distinct, CLD=common/local/distinct, LS=least squares, ML=maximum likelihood, ED=eigendecomposition, MC=maximising correlations/covariances. For abbreviations of the methods, see Section 1.11.
Table 7.1 Overview of methods. Legend: U=unsupervised, S=supervised, C=complex, HOM=homogeneous data, HET=heterogeneous data, SEQ=sequential, SIM=simultaneous, MOD=model-based, ALG=algorithm-based, C=common, CD=common/distinct, CLD=common/local/distinct, LS=least squares, ML=maximum likelihood, ED=eigendecomposition, MC=maximising correlations/covariances. For abbreviations of the methods, see Section 1.11.
Table 8.1 Overview of methods. Legend: U=unsupervised, S=supervised, C=complex, HOM=homogeneous data, HET=heterogeneous data, SEQ=sequential, SIM=simultaneous, MOD=model-based, ALG=algorithm-based, C=common, CD=common/distinct, CLD=common/local/distinct, LS=least squares, ML=maximum likelihood, ED=eigendecomposition, MC=maximising correlations/covariances. The green colour indicates that this method is discussed extensively in this chapter. The abbreviations for the methods represent the different sections and follow the same order. For abbreviations of the methods, see Section 1.11.
Table 8.2 Tabulation of consumer characteristics. A selection of two consumer attributes/characteristics, gender and lunch habits, is given. The numbers represent percentages in each of the categories for each of the segments (subgroups). The sums in each column for each consumer characteristic variable are equal to 100. The lunch variable reflects the frequency of use, with 1 representing the highest frequency and 5 ‘no answer’. Source: (Helgesen et al., 1997). Reproduced with permission from Elsevier.
Table 8.3 Consumer liking of cheese. Design of the conjoint experiment based on six design factors. Source: (Almli et al., 2011). Reproduced with permission from Elsevier.
Table 9.1 Overview of methods. Legend: U=unsupervised, S=supervised, C=complex, HOM=homogeneous data, HET=heterogeneous data, SEQ=sequential, SIM=simultaneous, MOD=model-based, ALG=algorithm-based, C=common, CD=common/distinct, CLD=common/local/distinct, LS=least squares, ML=maximum likelihood, ED=eigendecomposition, MC=maximising correlations/covariances. For abbreviations of the methods, see Section 1.11.
Table 10.1 Overview of methods. Legend: U=unsupervised, S=supervised, C=complex, HOM=homogeneous data, HET=heterogeneous data, SEQ=sequential, SIM=simultaneous, MOD=model-based, ALG=algorithm-based, C=common, CD=common/distinct, CLD=common/local/distinct, LS=least squares, ML=maximum likelihood, ED=eigendecomposition, MC=maximising correlations/covariances. The abbreviations for the methods follow the same order as the sections. For abbreviations (or descriptions) of the methods, see Section 1.11.
Table 10.2 Results of the single-block regression models. PCovR is Principal Covariates Regression, U-PLS is unfold-PLS, MCovR is multiway covariates regression. The 3,2,3 components for MCovR refer to the components for the three modes of Tucker3. For more explanation, see text.
Table 10.3 Results of the multiway multiblock models. MB-PLS is multiblock PLS, MWMBCovR is multiway multiblock covariates regression. For more explanation, see text.
Table 10.4 SO-PLS-PM results for wine data. The four columns of numbers correspond to the explained variances for the models for the endogenous blocks B, C, D, and E (the numbers in parentheses represent the number of components used). Source: (Romano et al., 2019). Reproduced with permission from Wiley.
Table 11.1 R packages on CRAN having one or more multiblock methods.
Table 11.2 MATLAB toolboxes and functions having one or more multiblock methods.
Table 11.3 Python packages having one or more multiblock methods.
Table 11.4 Commercial software having one or more multiblock methods.
1 Introduction
1.1 Scope of the Book
In many areas of the natural and life sciences, data sets are collected consisting of multiple blocks of data measured on the same or similar systems. Examples are abundant, e.g., in genomics it is becoming increasingly common to measure gene-expression, protein abundances and metabolite levels on the same biological system (Clish et al., 2004; Heijne et al., 2005; Kleemann et al., 2007; Curtis et al., 2012; Brink-Jensen et al., 2013; Franzosa et al., 2015). In sensory science, the interest is often in relations between the chemical and sensory properties of the samples involved as well as consumer liking of the same samples (Næs et al., 2010). In chemistry, sometimes different types of instruments are utilised to characterise different properties of the same set of samples (de Juan and Tauler, 2006). In cohort studies, it is increasingly popular to perform the same type of measurements in different cohorts to confirm results and perform meta-analyses. In the (bio-)chemical process industry, plant-wide measurements are available, collected by several sensors in the plant (Lopes et al., 2002). Clinical trials are often supported by auxiliary measurements such as gene-expression and cytokines to characterise immune responses (Coccia et al., 2018). Challenge tests to establish the health status of individuals usually contain multiple types of data collected for the same individuals as a function of time (Wopereis et al., 2009; Pellis et al., 2012; Kardinaal et al., 2015). All these examples show that simple data sets are becoming less common.
Unfortunately, there is no consensus yet about terminology regarding the structure of such data sets and the related research questions. In bioinformatics, the terms data fusion and data integration are often used, where the latter also distinguishes N- and P-integration (N means the same samples and P means the same variables) as well as horizontal and vertical integration. In psychometrics, the terms multiset and multigroup data analysis are used; in chemometrics, multiblock data analysis is in use; and in the computational sciences and machine learning, the terms multiview or multitable data analysis are used. We will encounter all these terms in this book but we will use the noun multiblock as much as possible.1
In Elaboration 1.1 we define the terms concerning data sets that we will use throughout this book. Sometimes, we will sidestep this to some extent to make connections between fields. At those places we will clarify exactly what we mean.
ELABORATION 1.1
Glossary of terms
Data set: The total collection of all data that is under consideration for a particular problem.
Data block: One block of data organised in a matrix (array) with rows and columns as a part of a data set.
Multiblock data set: The organisation of the data set in blocks of data.
Multiblock data analysis: The process of analysing the whole multiblock data set simultaneously using multiblock methods.
Object, Subject, Sample: Entity for which measurements are obtained. They can be random drawings from a population and/or they can come from an experimental design. The general term is a sample but if these samples pertain to human beings they may be called subjects. They constitute the row entries of a matrix.
Variable: A measured property of an entity collected in the columns of a matrix; this is called a feature in machine learning.
Measurement scale: The scale on which a variable is measured (ratio, interval, ordinal, or nominal-scaled).
Homogeneous versus heterogeneous data: If a data set contains blocks of data all measured on the same scale then this is called homogeneous data; if not, then the data are called heterogeneous. In most cases, homogeneous data will refer to blocks containing quantitative data (at least interval-scaled).
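As an informal illustration of the glossary, the terms can be mirrored in a small data structure. The sketch below is not taken from the book: the names DataBlock, is_homogeneous and share_samples are hypothetical, the block names and values are made-up random numbers, and assigning one measurement scale to a whole block is a simplifying assumption.

```python
from dataclasses import dataclass

import numpy as np

# Measurement scales named in the glossary.
SCALES = ("ratio", "interval", "ordinal", "nominal")


@dataclass
class DataBlock:
    """One data block: rows are samples, columns are variables."""
    name: str
    values: np.ndarray  # at least 2-D; first axis indexes the samples
    scale: str          # assumption: a single scale for the whole block

    def __post_init__(self):
        if self.values.ndim < 2:
            raise ValueError("a data block needs rows and columns")
        if self.scale not in SCALES:
            raise ValueError(f"unknown measurement scale: {self.scale}")


def is_homogeneous(blocks):
    """True if all blocks are measured on the same scale."""
    return len({b.scale for b in blocks}) == 1


def share_samples(blocks):
    """True if all blocks have the same number of rows (samples)."""
    return len({b.values.shape[0] for b in blocks}) == 1


# A toy multiblock data set: three blocks measured on the same 10 samples.
rng = np.random.default_rng(0)
chemistry = DataBlock("chemistry", rng.normal(size=(10, 5)), "ratio")
sensory = DataBlock("sensory", rng.normal(size=(10, 8)), "ratio")
liking = DataBlock("liking", rng.integers(1, 10, size=(10, 3)), "ordinal")

print(is_homogeneous([chemistry, sensory]))          # both ratio-scaled
print(is_homogeneous([chemistry, sensory, liking]))  # mixed scales
print(share_samples([chemistry, sensory, liking]))   # same row count
```

Comparing row counts is only a necessary condition for blocks to describe the same samples; in practice the sample identifiers would have to match as well.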
Elaboration 1.1 suggests a consistent vocabulary to be used in the book. However, the difference between variables and objects is not