Data Cleaning. Ihab F. Ilyas
techniques. Multi-class model-based techniques assume that the training data points contain labeled instances belonging to multiple normal classes [De Stefano et al. 2000]. Multiple classifiers are trained, where each classifier is supposed to distinguish between one of the normal classes and the rest of the classes. A test data point is declared as an outlier if it is not classified as belonging to any of the normal classes by any of the classifiers. In contrast, one-class model-based techniques assume that all the training data points belong to one normal class. Such techniques learn a discriminative boundary to distinguish between all the normal data points and the abnormal data points using a one-class classification algorithm, such as one-class support vector machines (SVMs) [Schölkopf et al. 2001, Ratsch et al. 2002]. Any test data point that does not fall within the learned boundary is declared an outlier.
Example 2.8 Figure 2.5(a) shows an example of a multi-class anomaly detection case where there are three normal classes. Three data points that do not belong to any of the three normal classes are flagged as outliers.
Figure 2.5(b) shows an example case of one-class anomaly detection. The classifier learns a boundary for the normal data points. Data points that lie outside the boundary are flagged as outliers.
Figure 2.5 Model-based outlier detection techniques [Chandola et al. 2009].
Given that there are many different classification algorithms, such as neural networks, Bayesian networks, SVMs, and decision trees, a variety of model-based outlier detection techniques have been developed. Basic neural networks have been used as a multi-class model-based outlier detection method [Taylor and Addison 2000, De Stefano et al. 2000]. Variants of the basic neural networks have been proposed that use different types of neural networks, such as replicator neural networks [Hawkins et al. 2002]. SVMs have been used as one-class model-based outlier detection techniques. Such techniques learn a region that contains the training data points [Ratsch et al. 2002]. Different kernels, such as Gaussian kernels and radial basis function kernels, can be used to learn complex regions. A test data point is declared as an outlier if it falls outside the boundary. Different variants of the basic SVMs have been proposed for outlier detection in audio signal data [Davy and Godsill 2002] and in temporal data [Ma and Perkins 2003]. Robust support vector machines have also been proposed to detect the boundary in the presence of outliers in the training data [Song et al. 2002]. Rule-based models such as decision trees have also been used for outlier detection [Joshi et al. 2001, Fan et al. 2004]. Each learned rule has an associated confidence value that is based on the number of training normal data points correctly classified by the rule and the total number of training normal data points covered by the rule. If a test point is not captured by any of the learned rules, it is declared an outlier.
2.5 Outlier Detection in High-Dimensional Data
Real datasets can be high dimensional; some may contain hundreds or even thousands of attributes (e.g., the large number of sensor readings in an airplane). Many outlier detection techniques lose their effectiveness when the number of dimensions (attributes) of the dataset is large; this effect is commonly known as the “curse of dimensionality.” For statistics-based outlier detection approaches, it becomes increasingly difficult to accurately estimate the multidimensional distribution of the data points [Scott 2008] as the number of dimensions increases. For distance-based approaches, the distances between data points approach zero and become meaningless [Beyer et al. 1999] with increasing dimensionality.
One popular category of techniques to deal with high-dimensional data is by using dimensionality reduction, which refers to the process of finding a low-dimensional representation of a high-dimensional dataset that preserves certain properties of the dataset, such as distances or similarities between data points. This topic has been well studied and surveyed in statistics, ML, and data mining [Cunningham and Ghahramani 2015, Fodor 2002, Ding et al. 2008]. Consider a set of m real-valued d-dimensional data points represented as an m × d matrix X, where each row corresponds to a data point xi ∈ ℝd. The goal of dimensionality reduction is to devise a transformation function T : ℝd → ℝk that maps each data point xi to a new data point in a k-dimensional space (k < d), such that the transformed data preserves some properties of interest. We denote by
Principal Component Analysis (PCA) [Pearson 1901, Hotelling 1933] is perhaps the most popular statistical procedure to perform dimensionality reduction. PCA uses orthogonal transformation to convert a set of data points of possibly correlated dimensions into a set of data points of linearly uncorrelated dimensions, called principal components. The transformation is done in such a way that the first component accounts for the largest possible variance in the data, and each succeeding component has the largest variance possible given that it is orthogonal to all preceding components. PCA can be done by eigenvalue decomposition of the covariance matrix or singular value decomposition (SVD) [Shlens 2014]. To reduce the complexity of computing PCA via SVD, progressive sampling strategies are used [Suri and Bailis 2017].
Dimensionality reduction techniques use all of the available dimensions of the original dataset and aim to find new datasets that preserve certain properties. In what follows, we focus on two additional categories of approaches that discover outliers using a subset of dimensions from the original dataset: subspace outlier detection techniques and contextual outlier detection techniques. Subspace outlier detection techniques select one or more subsets of attributes from all attributes and find outliers with respect to every selected subset of attributes, namely, a subspace [Aggarwal and Yu 2001, Zhang et al. 2004, Muller et al. 2008, Kriegel et al. 2009]. Contextual outlier detection techniques select from all attributes one or more pairs of subsets of attributes, where each pair of subsets consists of a subset of environmental attributes (also referred to as contextual attributes) and a subset of behavioral attributes (also referred to as indicator attributes or metric attributes). The environmental attributes are used to select subsets of tuples from all tuples, where each subset of tuples is referred to as a context. Outliers are detected within each context with respect to the behavioral attributes [Wei et al. 2003, Zhu et al. 2004, Angiulli et al. 2009, Tang et al. 2013, Liang and Parthasarathy 2016].
Example 2.9 Consider again the employee records in Table 2.1. We know from Example 2.4 that t5 is a multivariate outlier with respect to the income and tax attributes. Example 2.4 assumes that the income and tax attributes are given as input. However, in reality, we are often only given a table with many attributes, and we have no knowledge of which subsets of attributes would reveal interesting outliers.
Given Table 2.1, a subspace outlier detection algorithm would first identify the income and the tax attributes as a subspace, and then discover t5 as an outlier in that subspace.
A