Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs
that use dimension reduction methods to tackle multiblock problems. The basic idea of dimension reduction methods is to extract components or latent variables from data blocks, see Figure 1.2.
Figure 1.2 Idea of dimension reduction and components. The scores T summarise the relationships between samples; the loadings P summarise the relationships between variables. Sometimes weights W are used to define the scores.
In this figure, the matrix X (I × J) consists of J variables measured on I samples. The matrix W (J × R) of weights defines the scores XW = T (I × R), where R is much smaller than J. This is the dimension reduction (or data compression) part, and the idea is that T represents the samples in matrix X well for the purpose at hand. Likewise, the variables are represented in the loadings P (J × R), which can be connected to the scores in a least squares sense, e.g., in the model X = TP^t + E, where the superscript t denotes the transpose and E (I × J) holds the residuals. There are many alternative ways to compute the weights, scores, and loadings depending on the specific situation; these will be explained in subsequent chapters.
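The relations XW = T and X = TP^t + E can be made concrete with a small numerical sketch. The example below assumes one particular choice of weights, namely the first R right singular vectors of a column-centred X (as in principal component analysis); this is only one of the many alternatives mentioned above. The data are random placeholders and the sizes I, J and R are arbitrary.

```python
import numpy as np

# Hypothetical sizes: I samples, J variables, R components (R much smaller than J).
rng = np.random.default_rng(0)
I, J, R = 20, 8, 2

# Placeholder data; real applications would use measured values.
X = rng.normal(size=(I, J))
X = X - X.mean(axis=0)          # column-centre the variables

# Assumed choice of weights: first R right singular vectors (PCA-style).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:R].T                    # weights, J x R

# Scores: the compressed representation of the samples.
T = X @ W                       # I x R

# Loadings: least squares fit of X on the scores, P = X^t T (T^t T)^-1.
P = X.T @ T @ np.linalg.inv(T.T @ T)   # J x R

# Residuals of the model X = T P^t + E.
E = X - T @ P.T                 # I x J

print("scores:", T.shape, "loadings:", P.shape, "residuals:", E.shape)
```

With this particular choice of weights the loadings coincide with the weights, and the residuals are orthogonal to the scores. Other methods define W differently, in which case P and W generally differ.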
The idea of dimension reduction by using components or latent variables is very old and has proven to be a very powerful paradigm, with many applications in the natural, life, and social sciences. When considering multiple blocks of data, each block is summarised by its components, and the relationships between the blocks are then modelled by building relationships between those components. There are also many ways to build such relationships and we will discuss those in this book.
There are many reasons for and advantages of using dimension reduction methods:
The number of sources of variability in data blocks is usually (much) smaller than the number of measured variables.
Component-based methods are suitable for interpretation through the scores and loadings associated with the extracted components.
Underlying components and latent variables are appropriate for mental abstractions and interpretation.
Multivariate data analysis becomes numerically stable and statistically robust if the components are chosen in a suitable way.
Empirical validation of the models becomes manageable.
The effect of measurement noise is reduced.
Outliers can often be detected by visual inspection of the associated subspace projections provided by the extracted components.
1.3.4 Indirect Versus Direct Data
When discussing types of data, it is useful to distinguish between direct and indirect data. Direct data are always in the form of a matrix or table containing measurements of variables on a set of samples. Indirect data or derived data are always in the form of variables × variables or samples × samples matrices. Examples of such types of data are cross-products of matrices of direct data, covariances, distances and the like. The main focus in this book is on direct data, although we will discuss some indirect methods as well. There are three reasons for this focus. First, it limits the scope of the book. Secondly, analyses of direct data are usually easier to understand and interpret. Thirdly, in many applications of multiblock data analysis in the natural and life sciences, direct data are available. For a more formal description of this distinction, see Section 2.2.1.
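As a minimal illustration of the distinction, the sketch below derives several indirect data matrices from one small direct data matrix. The data are random placeholders, and the variable names are chosen for this illustration only.

```python
import numpy as np

# Direct data: 6 samples (rows) by 4 variables (columns); placeholder values.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
n_samples, n_variables = X.shape

# Indirect data of the variables x variables type.
cross_product = X.T @ X                          # 4 x 4 cross-product matrix
Xc = X - X.mean(axis=0)                          # column-centred copy of X
covariance = Xc.T @ Xc / (n_samples - 1)         # 4 x 4 covariance matrix

# Indirect data of the samples x samples type.
gram = X @ X.T                                   # 6 x 6 cross-product matrix
diff = X[:, None, :] - X[None, :, :]             # pairwise row differences
distances = np.sqrt((diff ** 2).sum(axis=2))     # 6 x 6 Euclidean distances

print(cross_product.shape, covariance.shape, gram.shape, distances.shape)
```

Note that the indirect matrices are square and symmetric, and that their size depends on only one of the two modes of the direct data.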
1.3.5 Heterogeneous Fusion
The final property of data we need to present is whether all blocks in the data set are measured on the same scale or not, i.e., if the data set is homogeneous or heterogeneous. These concepts are explained in more detail in Chapter 2 (Section 2.2.2). Briefly, if all blocks contain measurements on the same scale, e.g., they are all numerical or quantitative data, then the resulting problem will be called homogeneous fusion. If they are not of the same scale, e.g., a mixture of quantitative and binary measurements, then the problem is called heterogeneous fusion. We will discuss both of these in this book although most methods are made for homogeneous data.
1.4 Examples
This section contains some examples of multiblock data analysis problems in different fields of the natural and life sciences. It serves to give an idea about which types of questions are asked and which types of data sets are available. A full explanation of the methods used is given in the following chapters. These examples are only appetisers!
1.4.1 Metabolomics
Metabolomics is the part of the life sciences concerned with measuring and studying the behaviour of metabolites (small biochemical compounds) in biological systems. The field has grown considerably in the last 20 years, with dedicated conferences and journals. A large part of the applications concerns finding biomarkers for diseases, which translates into finding the metabolites that discriminate between groups of objects (e.g., control versus diseased subjects). Elaboration 1.3 shows some of the terms used in metabolomics research.
ELABORATION 1.3
Terms in metabolomics and proteomics
Biomarkers: Chemical compounds (e.g., metabolites) that mark a difference between conditions, e.g., between healthy and diseased persons.
GC-MS: Gas chromatography–mass spectrometry. A separation method coupled to a mass spectrometer used a lot in advanced chemical analyses of volatile compounds.
LC-MS: Liquid chromatography–mass spectrometry. A separation method coupled to a mass spectrometer used a lot in advanced chemical analyses for a large diversity of chemical compounds.
Metabolome: The set of all metabolites of a biological organism responsible for its metabolism.
NMR: Nuclear magnetic resonance. A fast chemical analysis method giving a fingerprint of a sample and concentrations of chemical compounds.
Proteomics: The study and measurements of proteins in biological organisms. Proteins are mostly enzymes catalysing metabolic reactions.
There are several multiblock data analysis challenges in metabolomics. It is increasingly popular to measure different sets of chemically related metabolites on the same samples using different instrumental protocols (Smilde et al., 2005b; Pellis et al., 2012; Kardinaal et al., 2015). These blocks of data (each block pertaining to one instrumental protocol) then need to be combined to arrive at a global view of metabolism. Metabolites can also be measured in different compartments, such as blood, urine, liver, muscle, and kidney (Fazelzadeh et al., 2016). This also generates multiblock data analysis problems. Metabolites are converted in biochemical reactions catalysed by enzymes (proteins). Hence, it is also worthwhile in some cases to measure proteins and combine those with metabolomics measurements (Wopereis et al., 2009). Plants are complex organisms with a rich variety of metabolites. The metabolism of plants is influenced by environmental conditions, such as temperature and light. Example 1.1 illustrates this.
Example 1.1: Metabolomics example: plant science data
This metabolomics example comes from a larger study in plant sciences (Caldana et al., 2011). The goal of the study was to investigate changes in metabolism and gene-expression of Arabidopsis related to growth under different light and temperature conditions. To this end, time-resolved experiments were performed. The design of the data set is shown in Figure 1.3. It is not a fully crossed design, but for each cell in the design gene-expression and metabolomics measurements were performed at 19 time points. We will only use the metabolomics measurements, which comprised around 65 identified metabolites, and use the part at 21 °C (the third line in the table below). This results in four blocks of data (21-D, 21-LL, 21-L and 21-HL), each consisting of 19 rows (time points) and 65 columns (measured metabolites). Hence, we only study the factors light and time (the factor temperature is kept constant).
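To make the resulting data layout tangible, the following sketch builds four placeholder blocks with the dimensions given above. The values are random numbers, not the actual measurements, and the two ways of concatenating the blocks are shown only as an assumption about how such blocks could be arranged; they are not necessarily the arrangement used in the analysis of this example.

```python
import numpy as np

# Dimensions taken from the text: 19 time points, 65 metabolites, 4 light conditions.
n_time, n_metabolites = 19, 65
conditions = ["21-D", "21-LL", "21-L", "21-HL"]

# Placeholder blocks filled with random numbers (NOT the real measurements).
rng = np.random.default_rng(2)
blocks = {name: rng.normal(size=(n_time, n_metabolites)) for name in conditions}

for name, block in blocks.items():
    print(name, block.shape)    # each block is 19 x 65

# The blocks have the same time points as rows and the same metabolites as
# columns, so they can be arranged in more than one way (illustrative only).
stacked_rows = np.vstack([blocks[name] for name in conditions])  # 76 x 65
stacked_cols = np.hstack([blocks[name] for name in conditions])  # 19 x 260
```

Stacking along the rows keeps the metabolites as the common mode, whereas stacking along the columns keeps the time points as the common mode.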