Natural Language Processing for Social Media. Diana Inkpen
approaches are required for handling and processing the increasingly large amount of Twitter data (especially for real-time event detection). Other challenges are inherent to Twitter design and usage. These are mainly due to the shortness of the messages: the frequent use of (dynamically evolving) informal, irregular, and abbreviated words, the large number of spelling and grammatical errors, and the use of improper sentence structure and mixed languages. Such data sparseness, lack of context, and diversity of vocabulary make the traditional text analysis techniques less suitable for tweets [Metzler et al., 2007]. In addition, different events may enjoy different popularity among users, and can differ significantly in content, number of messages and participants, time periods, inherent structure, and causal relationships [Nallapati et al., 2004].
Across all forms of social media, subjectivity is an ever-present trait. While traditional news texts may strive to present an objective, neutral account of factual information, social media texts are much more subjective and opinion-laden. Whether or not the ultimate information need lies directly in opinion mining and sentiment analysis, subjective information plays a much greater role in semantic analysis of social texts.
Topic drift is much more prominent in social media than in other texts, both because of the conversational tone of social texts and the continuously streaming nature of social media. There are also entirely new dimensions to be explored, where new sources of information and types of features need to be assessed and exploited. While traditional texts can be seen as largely static and self-contained, the information presented in social media, such as online discussion forums, blogs, and Twitter posts, is highly dynamic and involves interaction among various participants. This can be seen as an additional source of complexity that may hamper traditional summarization approaches, but it is also an opportunity, making available additional context that can aid in summarization or making possible entirely new forms of summarization. For instance, Hu et al. [2007a] suggest summarizing a blog post by extracting representative sentences using information from user comments. Chua and Asur [2012] exploit temporal correlation in a stream of tweets to extract relevant tweets for event summarization. Lin et al. [2009] address summarization not of the content of posts or messages, but of the social network itself by extracting temporally representative users, actions, and concepts in Flickr data.
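To make the idea of comment-informed summarization concrete, the following is a minimal sketch and not the method of Hu et al. [2007a]: it simply scores each sentence of a post by how often its content words appear in the user comments and keeps the top-scoring sentences. The function names, the tiny stopword list, and the word-overlap scoring are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "is", "it", "on", "i", "my", "too", "be", "has", "to", "of", "and"}


def content_words(text):
    """Lowercase the text and return its alphabetic tokens, minus stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize_with_comments(post_sentences, comments, k=1):
    """Return the k post sentences whose words are most echoed in the comments.

    This is plain word-overlap scoring (an illustrative assumption),
    not a reimplementation of any published approach.
    """
    comment_counts = Counter(w for c in comments for w in content_words(c))

    def score(sentence):
        # Each distinct content word contributes its frequency in the comments.
        return sum(comment_counts[w] for w in set(content_words(sentence)))

    # Rank by descending score; ties are broken by original position.
    ranked = sorted(range(len(post_sentences)),
                    key=lambda i: (-score(post_sentences[i]), i))
    # Present the selected sentences in their original order.
    return [post_sentences[i] for i in sorted(ranked[:k])]


post = [
    "The new phone has a great camera.",
    "I bought it on Monday.",
    "Battery life is disappointing.",
]
comments = [
    "Agreed, the camera is amazing",
    "My camera takes great photos too",
    "battery could be better",
]
summary = summarize_with_comments(post, comments, k=1)
```

In this toy input the commenters mostly discuss the camera, so the first sentence is selected. The comments act as the additional context that a static, self-contained text would not provide.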
As noted above, standard NLP approaches applied to social media data are confronted with difficulties due to non-standard spelling, noise, limited sets of features, and errors. Therefore, some NLP techniques, including normalization, term expansion, improved feature selection, and noise reduction, have been proposed to improve clustering performance in Twitter news [Beverungen and Kalita, 2011]. Identifying proper names and language switches within a sentence requires rapid and accurate named entity recognition and language detection techniques. Recent research efforts focus on the analysis of language in social media for understanding social behavior and building socially aware systems. The goal is the analysis of language with implications for fields such as computational linguistics, sociolinguistics, and psycholinguistics. For example, Eisenstein [2013a] studied phonological variation and the factors involved when it is transcribed into social media text.
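As an illustration of what normalization and noise reduction can look like for tweets, here is a minimal sketch under stated assumptions: the abbreviation table, the placeholder tokens `<url>` and `<user>`, and the function name are hypothetical, and the rules are far simpler than those used in the cited work.

```python
import re

# Hypothetical lookup table; real normalization lexicons are much larger.
ABBREVIATIONS = {"u": "you", "gr8": "great", "2moro": "tomorrow", "pls": "please"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # a character repeated three or more times


def normalize_tweet(text):
    """Apply a few simple normalization rules to a single tweet."""
    # Replace URLs and user mentions with placeholder tokens (noise reduction).
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<user>", text)

    tokens = []
    for token in text.split():
        if token in ("<url>", "<user>"):
            tokens.append(token)
            continue
        token = token.lower()
        # Squeeze elongated words: "sooooo" becomes "soo".
        token = REPEAT_RE.sub(r"\1\1", token)
        # Drop the hashtag marker but keep the word itself.
        token = token.lstrip("#")
        # Expand known abbreviations; leave other tokens unchanged.
        tokens.append(ABBREVIATIONS.get(token, token))
    return " ".join(tokens)


normalized = normalize_tweet("Sooooo gr8 to see u @bob https://t.co/x")
```

The sketch splits on whitespace only, so punctuation attached to a word (for example "gr8!") would prevent the abbreviation lookup from matching. A tokenizer adapted to social media text would be needed in practice.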
Several workshops organized by the Association for Computational Linguistics (ACL) and special issues in scientific journals dedicated to semantic analysis in social media show how active this research field is. We enumerate some of them here (we also mentioned them in the Preface):
• The EACL 2014 Workshop on Language Analysis in Social Media (LASM 2014)
• The NAACL/HLT 2013 Workshop on Language Analysis in Social Media (LASM 2013)
• The EACL 2012 Workshop on Semantic Analysis in Social Media (SASM 2012)
• The NAACL/HLT 2012 Workshop on Language in Social Media (LSM 2012)
• The ACL/HLT 2011 Workshop on Language in Social Media (LSM 2011)
• The WWW 2015 Workshop on Making Sense of Microposts
• The WWW 2014 Workshop on Making Sense of Microposts
• The WWW 2013 Workshop on Making Sense of Microposts
• The WWW 2012 Workshop on Making Sense of Microposts
• The ESWC 2011 Workshop on Making Sense of Microposts
• The COLING 2014 Workshop on Natural Language Processing for Social Media (SocialNLP)
• The IJCNLP 2013 Workshop on Natural Language Processing for Social Media (SocialNLP)
In this book, we will cite many papers from conferences such as ACL, WWW, etc.; many workshop papers from the above-mentioned workshops and more; several books; and many journal papers from various relevant journals.
1.4 SEMANTIC ANALYSIS OF SOCIAL MEDIA
Our goal is to focus on innovative NLP applications (such as opinion mining, information extraction, summarization, and machine translation), tools, and methods that integrate appropriate linguistic information in various fields such as social media monitoring for healthcare, security and defense, business intelligence, and politics. The book contains six chapters.
• Chapter 1: This chapter highlights the need for applications that use social media messages and meta-data. We also discuss the difficulty of processing social media data vs. traditional texts such as news articles and scientific papers.
• Chapter 2: This chapter discusses existing linguistic pre-processing tools such as tokenizers, part-of-speech taggers, parsers, and named entity recognizers, with a focus on their adaptation to social media data. We briefly discuss evaluation measures for these tools.
• Chapter 3: This chapter is the heart of the book. It presents the methods used in applications for semantic analysis of social network texts, in conjunction with social media analytics as well as methods for information extraction and text classification. We focus on tasks such as geo-location detection, entity linking, opinion mining and sentiment analysis, emotion and mood analysis, event and topic detection, summarization, machine translation, and other tasks. These methods tend to pre-process the messages with some of the tools mentioned in Chapter 2 in order to extract the knowledge needed in the next processing levels. For each task, we discuss the evaluation metrics and any existing test datasets.
• Chapter 4: This chapter presents higher-level applications that use some of the methods from Chapter 3. We look at: healthcare applications, financial applications, predicting voting intentions, media monitoring, security and defense applications, NLP-based information visualization for social media, disaster response applications, NLP-based user modeling, and applications for entertainment.
• Chapter 5: This chapter discusses complementary aspects such as data collection and annotation in social media, privacy issues in social media, and spam detection in order to avoid spam in the collected datasets. We also describe some of the existing evaluation benchmarks that make available data collected and annotated for various tasks.
• Chapter 6: The last chapter summarizes the methods and applications described in the preceding chapters. We conclude with a discussion of the high potential for research, given the social media analysis needs of end-users.
As mentioned in the Preface, the intended audience of this book is researchers who are interested in developing tools and applications for automatic analysis of social media texts. We assume that readers have basic knowledge of natural language processing and machine learning. Nonetheless, we will try to define as many notions as we can, in order to facilitate understanding for beginners in these two areas. We also assume basic knowledge of computer science in general.
1.5 SUMMARY
In this chapter, we reviewed the structure of social network and social media data as the collection of textual information