Data Cleaning. Ihab F. Ilyas
defined for the following hypothesis, where H0 is the null hypothesis and Ha is the alternative hypothesis:
H0 : there are no outliers in the dataset
Ha : there is exactly one outlier in the dataset.
Grubbs’ test statistic is defined as
Table 2.1 Employee records, including name, age, income, and tax
Example 2.1 Consider Table 2.1, including the name, age, income, and tax of employees. Domain knowledge suggests that the first value t1[age] and the last value t9[age] are outlying age values. We use Grubbs’ test with a significance level α = 0.05 to identify outliers in the age column.
The mean of the 9 age values is 136.78, and the standard deviation of the 9 age values is 323.92. Grubbs’ test statistic
Removing t9[age], we are left with 8 age values. The mean of the 8 age values is 28.88, and the standard deviation of the 8 age values is 12.62. Grubbs’ test statistic is
Removing t1[age], we are left with 7 age values. The mean of the 7 age values is 32.86, and the standard deviation of the 7 age values is 6.15. Grubbs’ test statistic
Previous discussion assumes that the data follows an approximately normal distribution. To assess this, several graphical techniques can be used, including the Normal Probability Plot, the Run Sequence Plot, the Histogram, the Box Plot, and the Lag Plot.
Iglewicz and Hoaglin provide an extensive discussion of the outlier tests previously given [Iglewicz and Hoaglin 1993]. Barnett and Lewis [1994] provide a book length treatment of the subject. They provide additional tests when data is not normally distributed.
2.2.3 Fitting Distribution: Parametric Approaches
The other type of statistics-based approach first fits a statistical distribution to describe the normal behavior of the given data points, and then applies a statistical inference procedure to determine if a certain data point belongs to the learned model. Data points that have a low probability according to the learned statistical model are declared as anomalous outliers. In this section, we discuss parametric approaches for fitting a distribution to the data.
Univariate
We first consider univariate outlier detection, for example, for a set of values x1, x2, …, xn that appear in one column of a relational table. Assuming the data follows a normal distribution, fitting the values under a normal distribution essentially means computing the mean μ and the standard deviation σ from the current data points x1, x2, …, xn. Given μ and σ, a simple way to identify outliers is to compute a z-score for every xi, which is defined as the number of standard deviations away xi is from the mean, namely z-score. Data values that have a z-score greater than a threshold, for example, of three, are declared to be outliers.
Since there might be outliers among x1, x2, … xn, the estimated μ and σ might be far off from their actual values, resulting in missing outliers in the data, as we show in Example 2.2.
Example 2.2 Consider again the age column in Table 2.1. The mean of the 9 age values is 136.78, and the standard deviation of the 9 age values is 323.92. The procedure that identifies values that are more than 2 standard deviations away from the mean as outliers would mark values that are not in the range of [136.78 − 2 * 323.92, 136.78 + 2 * 323.92] = [–511.06, 784.62]. The last value t9[age] is not in the range, and thus is correctly marked as an outlier. The first value t1[age], however, is in the range and is thus missed.
This effect is called masking [Hellerstein 2008]; that is, a single data point has severely shifted the mean and standard deviation so much as to mask other outliers. To mitigate the effect of masking, robust statistics are often employed, which can correctly capture important properties of the underlying distribution even in the face of many outliers in the data values. Intuitively, the breakdown point of an estimator is the proportion of incorrect data values (e.g., arbitrarily large or small values) an estimator can tolerate before giving an incorrect estimate. The mean and standard deviation have the lowest breakdown point: a single bad value can distort the mean completely.
Robust Univariate Statistics. We now introduce two robust statistics: the median and the median absolute deviation (MAD) that can replace mean and standard deviation, respectively. The median of a set of n data points is the data point for which half of the data points are smaller, and half are larger; in the case of an even number of data points, the median is the average of the middle two data points. The median, also known as the 50th percentile, is of critical importance in robust statistics with a breakdown point of 50%; as long as no more than half the data are outliers, the median will not give an arbitrarily bad result. The median absolute deviation (MAD) is defined as the median of the absolute deviations from the data’s median, namely, MAD = mediani(|xi − medianj (xj)|). Similar to the median, MAD is a more robust statistic than the standard deviation. In the calculation of the standard deviation, the distances from xi to the mean are squared, so large deviations, which often are caused by outliers, are weighted heavily, while in the calculation of MAD, the deviations of a small number of outliers are irrelevant because MAD