Data Cleaning. Ihab F. Ilyas
contain errors. The errors surfaced by the error detection step can be in various forms, such as outliers, violations, and duplicates. Finally, given the errors detected and the metadata that generate those errors, the error repair step produces data updates that are applied to the dirty dataset. Since there are many uncertainties in the data cleaning process, external sources such as knowledge bases and human experts are consulted whenever possible and feasible to ensure the accuracy of the cleaning workflow.
Example 1.1 Consider Table 1.1 containing employee records for a U.S. company. Every tuple specifies a person in a company with her id (GID), name (FN, LN), level (LVL), zip code (ZIP), state (ST), and salary (SAL). Suppose a domain expert supplies two data quality rules for this table. The first rule states that if two employees have the same zip code, they must be in the same state. The second rule states that among employees working in the same state, a senior employee cannot earn a smaller salary than a junior employee.
Table 1.1 An example employee table
Given these two data quality rules, the error detection step detects two violations. The first violation consists of four cells {t1[ZIP], t1[ST], t3[ZIP], t3[ST]}, which together violate the first data quality rule. The second violation consists of six cells {t1[ROLE], t1[ST], t1[SAL], t2[ROLE], t2[ST], t2[SAL]}, which together violate the second data quality rule. The data repair step takes the violations and produces an update that changes t1[ST] from “NM” to “NY”, and the new data now has no violation with respect to the two rules.
1.2 Book Scope
The aforementioned data cleaning workflow describes a general purpose data cleaning process, but there are different data cleaning topics that address one or multiple steps in the workflow. We cover some of the most common and practical cleaning topics in this book: outlier detection, data deduplication, data transformation, rule-based data cleaning, ML guided cleaning, and human involved data cleaning. We briefly explain these topics in the following subsections; we also highlight the book structure in Section 1.2.7.
1.2.1 Outlier Detection
Outlier detection refers to detecting “outlying” values. While an exact definition of an outlier depends on the application, there are some commonly used definitions, such as “an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [Hawkins 1980] and “an outlier observation is one that appears to deviate markedly from other members of the sample in which it occurs” [Barnett and Lewis 1994]. For example, for a company whose employees’ salaries are around $100,000, an employee with a salary of $10,000 can be considered to be an outlier.
Applications of outlier detection include network intrusion detection, financial fraud detection, and abnormal medical condition detection. As a concrete example, imagine a company that is interested in improving road safety by making drivers more aware of their driving habits. To achieve this, data is collected from many hundreds of thousands of vehicles equipped with sensors connected with smart-phones. The collected data includes phone model, trip length, battery drain, and so on. Outlier detection and explanation engines such as Macrobase [Bailis et al. 2017] can be used to analyze the collected data. For example, Macrobase can find that some trips showed abnormally short lengths, common in smartphones with Apple iOS 9.0 beta 1. This reason was reported to the engineers, who discovered that a buggy Bluetooth stack was introduced in OS 9.0 beta 1, preventing iOS devices from connecting to in-car sensors.
Outlier detection faces two main challenges. First, defining what is a normal data pattern and what is an outlier can be difficult as different data and applications differ in what is considered normal. Many different detection techniques have been proposed to define normal behavior. Second, many outlier detection techniques lose their effectiveness when the number of dimensions (attributes) of the dataset is large; this effect is commonly known as the curse of dimensionality.
1.2.2 Data Deduplication
Duplicate records occur for many reasons. Data deduplication, also known as duplicate detection, record linkage, record matching, or entity resolution, refers to the process of identifying tuples in one or more relations that refer to the same real-world entity. For example, a customer might be recorded multiple times in a customer database if the customer used different names when purchasing; a single item might be represented multiple times in an online shopping site; and duplicate records might appear after a data integration project because that record had different representations in original data sources. A data deduplication process usually involves many steps and choices, including designing similarity metrics to evaluate the similarity for a pair of records, training classifiers to determine whether a pair of records are duplicates, clustering all records to obtain clusters of records that represent the same real-world entity, consolidating clusters of records to unique representations, designing blocking or distributed strategies to scale up the deduplication process, and involving humans to decide whether a record pair are duplicates when machines are uncertain.
To address this, many open-source and commercial data deduplication tools have been built, such as Magellan [Konda et al. 2016] and Data Tamer [Stonebraker et al. 2013] (later commercialized as Tamr4). Fortune 500 companies use these tools to make sense of their large procurement data. Large companies often have multiple business units, and each unit buys many parts and products from many suppliers. These units might be buying the same part and product from the same supplier without each other’s knowledge, and therefore cannot get the best pricing. Furthermore, the same part and product might be named differently across different units, which complicates the deduplication process. A great deal of money can saved by identifying multiple records from the same supplier.
Achieving good precision and recall at the same time is difficult in data deduplication—declaring all pairs are duplicates achieves perfect recall but poor precision while declaring no pairs as duplicates achieves perfect precision, but poor recall. The problem is especially challenging given the myriad of design choices in designing a data deduplication workflow. Furthermore, data deduplication is inherently a combinatorial task that has quadratic complexity. For example, when done naively, comparing all pairs of only 1000 records requires 499,500 comparisons. In addition, grouping records that refer to the same real-world entity can be even harder. For example, correlation clustering used for grouping tuples that represent the same real-world entity is an NP-hard problem [Elsner and Schudy 2009].
1.2.3 Data Transformation
Data transformation refers to the task of transforming data from one format to another, for example, transforming phone numbers to a standard format by adding “-” in between digits. Data transformations can be seen as error repair activities and are used at various stages of data analytics. For example, before running a data integration project, transformations are often used to standardize data formats, enforce standard patterns, or trim long strings. Transformations are also used at the end of the ETL process, for example, to merge clusters of duplicate records.
Transform-Data-by-Example is an Excel add-in that finds the desired transformation easily for a given transformation task.5 Only a few examples of the desired output are needed, and Transform-Data-by-Example automatically finds relevant data transformation functions from a large collection that it has already indexed. This collection is acquired from a variety of sources, including Github source codes, Stackoverflow code snippets, and .Net libraries. Users can also extend the collection to cover domain-specific transformation capabilities by providing their own transformation functions.
Often, no ground truth is available to train or