Data Cleaning. Ihab F. Ilyas

Data Cleaning - Ihab F. Ilyas


Скачать книгу
of a data point. Kollios et al. [2003] defines an object O as outlier in a dataset I if at most k objects in I lie at distance at most D from O. Ramaswamy et al. [2000] defines outliers as the top few data elements whose distance to the kth nearest neighbor is greatest.

      The simplest category of algorithms [Knorr and Ng 1998, 1999, Ramaswamy et al. 2000] for finding distance-based outliers are those using nested loops to find the distance between every point and every other point, which has a quadratic complexity. Another category of algorithms uses spatial index structures such as KD-trees, R-trees, or X-trees to find the nearest neighbors of each point [Knorr and Ng 1998, 1999, Ramaswamy et al. 2000]. This category of algorithms works well for low-dimensional datasets, and has a complexity of O(n log n) if the index tree can allow a O(log n) lookup for finding a point’s nearest neighbor. However, the index structures become increasingly complex as the dimensionality increases. For example, Breunig et al. [2000] used an X-tree variant to do a nearest neighbor search; they found that the index performed well for fewer than 5 dimensions, and the performance dropped dramatically for just 10–20 dimensions. A third category of algorithms partitions the space into regions to enable a faster search of nearest neighbors. For each region, certain summary statistics such as the minimum bounding rectangle are stored. During the search procedure for finding nearest neighbors of a point, the point is compared with the summary statistics of a region to determine if it is possible to skip all the points in that region. Knorr and Ng [1998] found out that the partitioning-based approach has a complexity linear in O(n), but exponential in the number of dimensions.

      An efficient algorithm for finding distance-based outliers was devised by Bay and Schwabacher [2003]; it is an adaptation of the simple nested loop algorithm and has an expected running time of O(n). Algorithm 2.2 describes the procedure. The distance function between two points x and y is denoted as distance(x, y), such as Euclidean distance for numerical values and Hamming distance for categorical values. The function score(x, Y) denotes the nearest neighbor distances between x and its neighbors Y and is monotonically decreasing.

      Algorithm 2.2 partitions all the points I into a set of blocks I1, …, Ib, and compares each point in every block to every point in I. It keeps track of the top m outliers and the weakest outlier of the m outliers, namely, the point with the smallest score. The main idea in the nested loop is to keep track of the closest k neighbors found so far for each point x. When a point x achieves a score lower than the cutoff c, we can safely remove the point x since x can no longer be an outlier. As the loop continues, the algorithm finds more outliers, and the cutoff threshold c increases along with the pruning power. In the worst case, the algorithm still has a complexity of O(n2); however, the average case analysis reveals the algorithm has an expected complexity of O(n) [Bay and Schwabacher 2003].

      2.3.2 Local Distance-Based Outlier Detection

      Distance-based outlier detection methods such as DB(p, D) and its variants may miss certain kinds of outliers because every point is compared with every other point in the dataset.

      Example 2.7 Consider a two-dimensional dataset with data points shown in Figure 2.4. There are two clusters of points, C1 and C2, and points in C1 are sparse, whereas points in C2 are quite dense. Obviously, points o1 and o2 should be reported as outliers. However, distance-based methods would not flag o2 as an outlier before flagging any points in C1, since the distances between o2 are much closer to the majority of the data points, i.e., C2, than those points in C1.

      The local outlier factor (LOF) method [Breunig et al. 2000] scores data points based on the density of their neighboring data points, and can capture outlier points such as o2 in Figure 2.4.

      Given a positive integer k, the k-distance of an object p, denoted as k-distance(p), is defined as the distance distance(p, o) between p and another object o, such that: (1) there exist at least k objects other than p in the dataset whose distance to p is less than or equal to distance(p, o); and (2) there exist at most k − 1 objects other than p in the dataset whose distance to p is strictly less than distance(p, o). Given k-distance(p), the k-distance neighborhood of p, denoted as Nk(p), contains all objects other than p whose distance to p is less than or equal to k-distance(p). The reachability distance of an object p with respect to object o is defined as reach-distk(p, o) = max {k-distance(o), d(p, o)}. Intuitively, if an object p is too far away from o, then reach-distk(p, o) is the actual distance between p and o. On the other hand, if p and o are sufficiently close, then reach-distk(p, o) is replaced by the k-distance of o. In this way, the distance fluctuations of all objects close to o are reduced; the higher the k, the greater the smoothing effect.

Image

      Definition 2.1 The local reachability density of p is defined as:

image Image

      The local outlier factor of p is defined as:

image

      Intuitively, the local reachability density of an object p is the inverse of the average reachability distance based on the k nearest neighbors of p. The local outlier factor of p is the average of the ratio of the local reachability density of p and those points in the k-distance neighborhood of p. It has been shown [Breunig et al. 2000] that the LOF definition has many desirable properties; for example, LOFk(p) is approximately equal to 1 in a cluster (a region with homogeneous density around the point and its neighbors), LOFk(p) ≫ 1 if p is an outlier.

      There are many proposed variants of LOF. Jin et al. [2001] only mine for the top outliers, and thus do not need to compute the LOF scores for all data points by deriving bounds for the LOF scores. Tang et al. [2002] consider connectivity based outlier factors (COF) since LOF may make it hard to detect outliers in low density regions. Jin et al. [2006] take symmetric neighborhood into account by considering both the k-nearest neighborhood of p and its reverse k-nearest neighborhood. Papadimitriou et al. [2003] aim at getting rid of choosing the parameter k by considering all the possible ks.

      Model-based outlier detection techniques first learn a classifier model from a set of labeled data points, and then apply the trained classifier to a test data point to determine whether it is an outlier. Model-based approaches assume that a classifier can be trained to distinguish between the normal data points and the anomalous data points using the given feature space.

      Based on the labels available to train the classifier, model-based approaches can be further divided into two subcategories: multi-class model-based techniques


Скачать книгу