Data-Intensive Text Processing with MapReduce. Jimmy Lin

       3.5 Relational Joins

       3.5.1 Reduce-Side Join

       3.5.2 Map-Side Join

       3.5.3 Memory-Backed Join

       3.6 Summary

       4 Inverted Indexing for Text Retrieval

       4.1 Web Crawling

       4.2 Inverted Indexes

       4.3 Inverted Indexing: Baseline Implementation

       4.4 Inverted Indexing: Revised Implementation

       4.5 Index Compression

       4.5.1 Byte-Aligned and Word-Aligned Codes

       4.5.2 Bit-Aligned Codes

       4.5.3 Postings Compression

       4.6 What About Retrieval?

       4.7 Summary and Additional Readings

       5 Graph Algorithms

       5.1 Graph Representations

       5.2 Parallel Breadth-First Search

       5.3 PageRank

       5.4 Issues with Graph Processing

       5.5 Summary and Additional Readings

       6 EM Algorithms for Text Processing

       6.1 Expectation Maximization

       6.1.1 Maximum Likelihood Estimation

       6.1.2 A Latent Variable Marble Game

       6.1.3 MLE with Latent Variables

       6.1.4 Expectation Maximization

       6.1.5 An EM Example

       6.2 Hidden Markov Models

       6.2.1 Three Questions for Hidden Markov Models

       6.2.2 The Forward Algorithm

       6.2.3 The Viterbi Algorithm

       6.2.4 Parameter Estimation for HMMs

       6.2.5 Forward-Backward Training: Summary

       6.3 EM in MapReduce

       6.3.1 HMM Training in MapReduce

       6.4 Case Study: Word Alignment for Statistical Machine Translation

       6.4.1 Statistical Phrase-Based Translation

       6.4.2 Brief Digression: Language Modeling with MapReduce

       6.4.3 Word Alignment

       6.4.4 Experiments

       6.5 EM-Like Algorithms

       6.5.1 Gradient-Based Optimization and Log-Linear Models

       6.6 Summary and Additional Readings

       7 Closing Remarks

       7.1 Limitations of MapReduce

       7.2 Alternative Computing Paradigms

       7.3 MapReduce and Beyond

       Bibliography

       Authors’ Biographies

       Acknowledgments

      The first author is grateful to Esther and Kiri for their loving support. He dedicates this book to Joshua and Jacob, the new joys of his life.

      The second author would like to thank Herb for putting up with his disorderly living habits and Philip for being a very indulgent linguistics advisor.

      This work was made possible by the Google and IBM Academic Cloud Computing Initiative (ACCI) and the National Science Foundation’s Cluster Exploratory (CLuE) program, under award IIS-0836560, and also award IIS-0916043. Any opinions, findings, conclusions, or recommendations expressed in this book are those of the authors and do not necessarily reflect the views of the sponsors.

      We are grateful to Jeff Dean, Miles Osborne, Tom White, as well as numerous other individuals who have commented on earlier drafts of this book.

      Jimmy Lin and Chris Dyer

      May 2010

      CHAPTER 1

       Introduction

      MapReduce [45] is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers. It was originally developed by Google and built on well-known principles in parallel and distributed processing dating back several decades. MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, now an Apache project, whose development was led by Yahoo. Today, a vibrant software ecosystem has sprung up around Hadoop, with significant activity in both industry and academia.

      This book is about scalable approaches to processing large amounts of text with MapReduce. Given this focus, it makes sense to start with the most basic question: Why? There are many answers to this question, but we focus on two. First, “big data” is a fact of the world, and therefore an issue that real-world systems must grapple with. Second, across a wide range of text processing applications, more data translates into more effective algorithms, and thus it makes sense to take advantage of the plentiful amounts of data that surround us.

      Modern information societies are defined by vast repositories of data, both public and private. Therefore, any practical application must be able to scale up to datasets of interest. For many, this means scaling up to the web, or at least a non-trivial fraction thereof. Any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing web content must tackle large-data problems: “web-scale” processing is practically synonymous with data-intensive processing. This observation applies not only to well-established internet companies, but also to countless startups and niche players. Just think, how many companies do you know that start their pitch with “we’re going to harvest information on the web and…”?

      Another strong area of growth is the analysis of user behavior data. Any operator of a moderately successful website can record user activity and in a matter of weeks (or sooner) be drowning in a torrent of log data. In fact, logging user behavior generates so much data that many organizations simply can’t cope with the volume, and either turn the functionality off or throw away data after some time. This represents lost opportunities, as there is a broadly held belief that great value lies in insights derived from mining such data. Knowing what users look at, what they click on, how much time they spend on a web page, etc., leads to better business decisions and competitive advantages. Broadly, this is known as business intelligence, which encompasses a wide range of technologies including data warehousing, data mining, and analytics.

      How much data are we talking about? A few examples: Google grew from processing 100 terabytes of data a day with MapReduce in 2004 [45] to processing 20 petabytes a day with MapReduce in 2008 [46]. In April 2009, a blog post1

