Data Cleaning

Ihab F. Ilyas

University of Waterloo

Xu Chu

Georgia Institute of Technology

ACM Books #28

Copyright © 2019 by Association for Computing Machinery

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews—without the prior permission of the publisher.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which the Association for Computing Machinery is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

http://books.acm.org

ISBN: 978-1-4503-7152-0 hardcover

ISBN: 978-1-4503-7153-7 paperback

ISBN: 978-1-4503-7154-4 ePub

ISBN: 978-1-4503-7155-1 eBook

Series ISSN: 2374-6769 (print), 2374-6777 (electronic)

DOIs:

10.1145/3310205 Book

10.1145/3310205.3310206 Preface

10.1145/3310205.3310207 Chapter 1

10.1145/3310205.3310208 Chapter 2

10.1145/3310205.3310209 Chapter 3

10.1145/3310205.3310210 Chapter 4

10.1145/3310205.3310211 Chapter 5

10.1145/3310205.3310212 Chapter 6

10.1145/3310205.3310213 Chapter 7

10.1145/3310205.3310214 Chapter 8

10.1145/3310205.3310215 References/Index/Bios

A publication in the ACM Books series, #28

Editor in Chief: M. Tamer Özsu, University of Waterloo

This book was typeset in Arnhem Pro 10/14 and Flama using ZzTEX.

Cover photo: Jason Dorfman MIT / CSAIL

First Edition

10 9 8 7 6 5 4 3 2 1

To my family: Francis, Aida, Mirette, Andrew and Marina

To my wife Jianmei and my daughter Hannah

Contents

Figure and Table Credits

Chapter 1 Introduction

1.1 Data Cleaning Workflow

Chapter 2 Outlier Detection

2.1 A Taxonomy of Outlier Detection Methods

2.2 Statistics-Based Outlier Detection

2.3 Distance-Based Outlier Detection

2.4 Model-Based Outlier Detection

2.5 Outlier Detection in High-Dimensional Data

Chapter 3 Data Deduplication

3.1 Similarity Metrics

3.2 Predicting Duplicate Pairs

3.3 Clustering

3.4 Blocking for Deduplication

3.5 Distributed Data Deduplication

3.6 Record Fusion and Entity Consolidation

3.7 Human-Involved Data Deduplication

3.8 Data Deduplication Tools

3.9 Conclusion

Chapter 4 Data Transformation

4.1 Syntactic Data Transformations

4.2 Semantic Data Transformations

4.3 ETL Tools

4.4 Conclusion

Chapter 5 Data Quality Rule Definition and Discovery

5.1 Functional Dependencies

5.2 Conditional Functional Dependencies

5.3 Denial Constraints

5.4 Other Types of Constraints

5.5 Conclusion

Chapter 6 Rule-Based Data Cleaning

6.1 Violation Detection

6.2 Error Repair

6.3 Conclusion

Chapter 7 Machine Learning and Probabilistic Data Cleaning

7.1 Machine
