Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs
that clear (for examples, see Chapter 8 on complex relations). We will try, however, to remain as consistent as possible and give extra explanations of terms at the appropriate places. In the rest of this chapter we will delineate our potential audience. We will give some examples of why multiblock methods are necessary and give an overview of the types of problems encountered. Moreover, we will give some history and discuss briefly some fundamental concepts which we need in the rest of the book. We end by giving the notation which we will use in this book and a list of abbreviations.
1.2 Potential Audience
Our ambition is to serve different types of audiences. The first set of users consists of practitioners in the natural and life sciences, such as in bioinformatics, sensometrics, chemometrics, statistics, and machine learning. They will mainly be interested in how to perform multiblock data analysis and which method to use in which data analysis situation. They may benefit from reading the main text and studying the examples. The second set of users consists of method developers. They want to know what is already available and to spot niches for further development; apart from the main text and the examples, they may also be interested in the elaborations. The final set of users consists of computer scientists and software developers. They want to know which methods are worth building software for and may also study the algorithms.
We will try to serve all groups. This means that we will explain most of the methods in a rather detailed manner (especially in Parts II and III) and will also pay attention to validation and visualisation to encourage proper interpretation. At the end of the book in Chapter 11, we describe multiblock toolboxes and packages in R, MATLAB and Python and showcase the accompanying R package multiblock which includes many of the methods described in this book.
1.3 Types of Data and Analyses
1.3.1 Supervised and Unsupervised Analyses
In any multiblock data analysis, we first have to choose between the two main paradigms: unsupervised and supervised analysis. Unsupervised analysis refers to explorative analysis that looks for structure and connections in the data, either in a single data block or across data blocks. It typically uses dimension reduction, in which some criterion is maximised or minimised in combination with orthogonalisation, or clustering techniques. It is crucial that the roles of the blocks are exchangeable: we can change the order of the blocks without changing the solution.
Supervised analysis refers to predictive data analysis, where the emphasis is on a single block of data, Y (the dependent block or response), which is connected to one or more blocks of data, Xm (the independent blocks or predictors), through regression or classification. The role of the blocks is now important: some blocks are regarded as dependent and some as independent.
There are also more complex structures where the multiblock problem is a mixture of these two (see, e.g., the L-shape problem in Figure 1.7).
Figure 1.7 L-shape data of consumer liking studies.
1.3.2 High-, Mid- and Low-level Fusion
The data fusion literature (see, e.g., Mitchell (2012); Kedem et al. (2017); van Loon et al. (2020)) distinguishes between different ways of putting data blocks together. Here we will focus on one of these distinctions, namely between measurement-level fusion, feature-level fusion, and decision-level fusion. These are also known as low-level, mid-level (or medium/intermediate-level), and high-level fusion, respectively.
For supervised analyses, the three levels are illustrated in Figure 1.1. Low-level fusion means that the data blocks are simply concatenated and then analysed together as one single block. Mid-level fusion refers to first extracting features from each block before putting them together in a regression or classification model. High-level fusion means that predictions based on single input blocks are established before the results are combined; see Elaboration 1.2.
The same levels of fusion can be distinguished for unsupervised analyses. In low-level fusion, the data blocks are used as such in an unsupervised multiblock data analysis method without any pre-selection of variables. In mid-level fusion, the most important variables are first selected per block and then an unsupervised fusion method is applied. Such an approach is often taken in genomics, where several thousands of variables are measured and filtered before entering any kind of modelling (see Section 1.4.2 for examples). High-level fusion would entail an unsupervised analysis per data block followed by combining the results with visualisation tools; this approach is rarely used. In this book we will focus on low-level and mid-level fusion.
Figure 1.1 High-level, mid-level, and low-level fusion for two input blocks. The Z’s represent the combined information from the two blocks which is used for making the predictions. The upper figure represents high-level fusion, where the results from two separate analyses are combined. The figure in the middle is an illustration of mid-level fusion, where components from the two data blocks are combined before further analysis. The lower figure illustrates low-level fusion where the data blocks are simply combined into one data block before further analysis takes place.
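The three supervised fusion levels can be contrasted in a small sketch. It is only an illustration under stated assumptions: the blocks X1 and X2 and the response y are simulated, ordinary least squares stands in for the regression step, principal component scores stand in for the extracted features, and an unweighted mean stands in for the combination of predictions. None of these choices is prescribed by the text, and the fitted values are in-sample, not validated predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
# Two hypothetical input blocks measured on the same n samples, plus a response.
X1 = rng.normal(size=(n, 8))
X2 = rng.normal(size=(n, 5))
y = X1[:, 0] - 2.0 * X2[:, 1] + 0.1 * rng.normal(size=n)


def ols_fit_predict(Z, y):
    """Least-squares fit of y on Z with an intercept; returns fitted values."""
    A = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef


def pca_scores(X, k):
    """First k principal component scores of the column-centred block X."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]


# Low-level fusion: concatenate the blocks and analyse them as one block.
Z_low = np.hstack([X1, X2])
yhat_low = ols_fit_predict(Z_low, y)

# Mid-level fusion: extract features per block, then combine the features.
Z_mid = np.hstack([pca_scores(X1, 3), pca_scores(X2, 3)])
yhat_mid = ols_fit_predict(Z_mid, y)

# High-level fusion: predict from each block separately, then combine.
yhat_high = 0.5 * (ols_fit_predict(X1, y) + ols_fit_predict(X2, y))
```

In the low-level and mid-level cases the combined matrix plays the role of Z in Figure 1.1; in the high-level case the combination happens only at the level of the predictions.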
ELABORATION 1.2
High-level Supervised Fusion
High-level supervised fusion focuses on combining classification or prediction results for improved precision. Instead of using a method which takes the different data sets into account when building a predictor or classifier, high-level fusion takes the results of individual predictions and combines them in the best possible way. In other words, high-level fusion refers to combining results from already established prediction or classification methods.
A possible drawback of this strategy compared to low-level and feature-level fusion is that it does not provide further insight into how the different measurements relate to each other and how they can be combined in a good way in the prediction of the outcome. On the other hand, high-level fusion of prediction results for new samples does not generally require the individual predictors to be developed from the same samples. In other words, when two (or more) predictors are to be combined for a new sample, they do not need to come from the same data source. It is possible to simply plug in the new data and obtain predictions that can be combined as described below. In this sense it is more flexible than low- and feature-level fusion (Ballabio et al., 2019). It has been shown in Doeswijk et al. (2011) that fusing classifiers most often gives similar or improved prediction results compared to using only one of them. An overview of the use of high-level fusion (and other methods) can be found in Borràs et al. (2015).
A simple way of combining classifiers is to use voting, that is, counting how many of the classifiers agree on each class. Different voting schemes have been proposed in the literature. One of them is simple democratic majority voting, which means that the group/class that gets the highest number of votes is chosen. In the case of ties, the result is inconclusive. An alternative strategy is 75% voting, which means that 75% of the votes should be for the same class before a decision can be made.
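The two voting schemes can be sketched in a few lines. The function name `vote` and its `threshold` parameter are hypothetical, and the sketch assumes that "75% of the votes" means at least 75% of all votes cast for one sample.

```python
from collections import Counter


def vote(labels, threshold=None):
    """Combine the class labels given by several classifiers to one sample.

    threshold=None : simple majority voting; the class with the highest
                     number of votes wins, and a tie is inconclusive.
    threshold=0.75 : the winning class also needs at least 75% of all votes.
    Returns the chosen label, or None when no decision can be made.
    """
    if not labels:
        raise ValueError("at least one vote is required")
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    # A tie for first place is inconclusive under either scheme.
    if len(counts) > 1 and counts[1][1] == top_count:
        return None
    # Assumed reading of the threshold rule: share of all votes, inclusive.
    if threshold is not None and top_count / len(labels) < threshold:
        return None
    return top_label
```

For example, `vote(["A", "A", "B"])` selects class A, whereas `vote(["A", "A", "B"], threshold=0.75)` returns `None` because only two thirds of the votes agree.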
Fusing quantitative predictors is most easily done using averages or weighted averages with weights depending on the prediction error of the different predictions, as determined by, for instance, cross-validation. This strategy has similarities with so-called bagging (see, e.g., Freund (1995)). In machine learning, high-level supervised fusion is found in the sub-domain ‘ensemble learning’.
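A weighted average of this kind can be sketched as follows. The text only states that the weights depend on the prediction error; the inverse squared error used here is one common choice and is an assumption, as are the function name `fuse_predictions` and its arguments.

```python
def fuse_predictions(predictions, cv_errors):
    """Weighted average of several predictions for one new sample.

    predictions : one predicted value per individual predictor
    cv_errors   : cross-validated prediction error of each predictor
                  (for instance a root mean squared error), all positive
    Returns the fused prediction and the normalised weights.
    """
    if len(predictions) != len(cv_errors):
        raise ValueError("need exactly one error per prediction")
    if any(e <= 0 for e in cv_errors):
        raise ValueError("errors must be positive")
    # Assumed weighting: proportional to 1 / error**2, so that predictors
    # with smaller cross-validated error contribute more.
    raw = [1.0 / e ** 2 for e in cv_errors]
    total = sum(raw)
    weights = [w / total for w in raw]
    fused = sum(w * p for w, p in zip(weights, predictions))
    return fused, weights
```

With equal errors the result reduces to the plain average; a predictor with twice the error of another receives a quarter of its weight under this particular weighting.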
1.3.3 Dimension Reduction
There are many ways to perform multiblock data analysis. We restrict the focus of this book