An Introduction to Text Mining. Gabe Ignatow
buzzword. Naturally, there are questions that arise from the use of textual data sets as a way to learn about social groups and communities. There are, of course, advantages and disadvantages to each, and there are also ways to leverage both surveys and big data.
Surveys are the traditional mechanisms for gathering information on people, and there are entire fields that have developed around these data collection instruments. Surveys can collect clear, targeted information, and as such, the information obtained from surveys is significantly “cleaner” and easier to process than the information extracted from unstructured data sources. Surveys also have the advantage that they can be run in controlled settings, with complete information on the survey takers. These controlled settings can, however, also be a disadvantage. It has been argued, for instance, that survey research is often biased because of the typical places where surveys are run (for example, large student populations from Introduction to Psychology courses). Another challenge associated with surveys is that they exclude people who do not like to provide information, and there is an entire body of research around methodologies to remove such participation bias. Above all, the main difficulty associated with survey instruments is the fact that they are expensive to run, both in terms of time and in terms of financial costs.
The alternative to surveys that has been extensively explored in recent years is the extraction of information from unstructured sources. For instance, rather than surveying a group of people on whether they are optimistic or pessimistic and asking for their location as a way to create maps of “optimism,” one could achieve the same goal by collecting Twitter or blog data, extracting the location of the writers from their profiles, and using automatic text classification tools to infer their level of optimism (Ruan, Wilson, & Mihalcea, 2016). The main advantage of gathering information about people from such data sources is their “always on” property, which allows one to collect information continuously and inexpensively. These digital resources also eliminate some of the biases that come with survey instruments, but they nonetheless introduce other kinds of biases. For instance, most of these data-driven collections of information on people rely on social media or on crowdsourcing platforms such as Amazon Mechanical Turk, but these sources cover only a certain type of population: people who are open to posting on social media or participating in online crowdsourcing experiments. Even more important, another major difficulty associated with the use of unstructured data sources is the lack of exactness in the process of extracting information. This process often relies on automatic tools for text mining and classification, which, even if generally very good, are not perfect. This effect can, however, be counteracted with the use of large data quantities: Whereas the data that one can get from surveys are often limited by the number of participants (which in turn is limited by time and cost), that limit is much higher when it comes to the information that one can gather from digital data sources. Thus, if cleverly used, the richness of the information obtained from unstructured data can rival, if not exceed, that obtained with surveys.
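The "optimism map" workflow described above (collect posts, read the writer's location from the profile, classify the text, aggregate by location) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cue-word lists, the `classify` and `optimism_by_location` functions, and the sample posts are all hypothetical, and the keyword counter is a stand-in for a trained text classifier, not the method used by Ruan, Wilson, and Mihalcea (2016).

```python
from collections import defaultdict

# Hypothetical cue words; a real study would use a trained text classifier.
OPTIMISTIC = {"hope", "excited", "great", "better"}
PESSIMISTIC = {"worried", "worse", "hopeless", "afraid"}


def classify(text):
    """Return +1 (optimistic), -1 (pessimistic), or 0 (no signal) by counting cue words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in OPTIMISTIC for w in words) - sum(w in PESSIMISTIC for w in words)
    return (score > 0) - (score < 0)


def optimism_by_location(posts):
    """Average the per-post labels for each profile location."""
    labels = defaultdict(list)
    for post in posts:
        location = post.get("location")
        if location:  # skip writers whose profile gives no location
            labels[location].append(classify(post["text"]))
    return {loc: sum(vals) / len(vals) for loc, vals in labels.items()}


# Made-up posts, used only to exercise the functions.
posts = [
    {"location": "Ann Arbor", "text": "So excited, things are getting better!"},
    {"location": "Ann Arbor", "text": "Worried about the exam."},
    {"location": "Austin", "text": "Great day, full of hope."},
    {"location": None, "text": "Feeling hopeless."},
]
print(optimism_by_location(posts))
```

Note that the post without a location is dropped before aggregation, which is one concrete way the population bias discussed above enters the data: only writers who disclose a location are counted.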
Online Data Sources
Researchers often prefer to use ready-made data rather than, or often in addition to, constructing their own data sets using crawling and scraping tools. While many sources of data are in the public domain, some require access through a university subscription. For example, sources of news data include the websites of local and regional news outlets as well as private databases such as EBSCO, Factiva, and LexisNexis, which provide access to tens of thousands of global news sources, including blogs, television and radio transcripts, and traditional print news. One example of the use of such databases is a study of academic research on international entrepreneurship by the management researchers Jones, Coviello, and Tang (2011). Jones and colleagues used EBSCO and ABI/INFORM search tools to select their final data set of 323 journal articles on international entrepreneurship published between 1989 and 2009. They then used thematic analysis (see Chapter 11) to identify themes and subthemes in their data.
In addition to being able to access digitized news sources, researchers have access to writing produced by organizations, including political statements, organizational calendars, and event reports. These data include recent online writing as well as digitized historical archives. Unfortunately, many online data sources are not simple to access. Most news databases allow access to a few articles but generally do not allow access to their entire database, as the subscriptions universities pay for are based on the assumption that researchers want to read a few articles on a subject rather than use large numbers of articles as primary data. Yet despite these limitations, a large and growing number of digital text collections are available for text mining researchers to use (see Appendix A). Among the most useful of these collections is the Corpus of Contemporary American English (COCA; http://corpus.byu.edu/coca), the largest public access corpus of English. Created by Davies of Brigham Young University, the corpus contains more than 520 million words of text and is equally divided among spoken, fiction, popular magazine, newspaper, and academic texts. It includes 20 million words for each year from 1990 to 2015 and is updated regularly. The interface allows you to search for exact words or phrases, wildcards, lemmas, parts of speech, or any combination of these. COCA and related corpora are often used by social scientists as secondary data sources in order to compare word frequencies between their main data source and “standard” English (e.g., Baker et al., 2008).
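The comparison of word frequencies between a main data source and a reference corpus can be illustrated with a minimal Python sketch. It assumes both texts are already available as plain strings; the function names `per_million` and `frequency_ratio` and the two toy sentences are hypothetical, and the code does not use the COCA interface itself. Frequencies are normalized per million tokens so that corpora of different sizes can be compared.

```python
import re
from collections import Counter


def per_million(text):
    """Map each lowercase word to its frequency per million tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {word: count * 1_000_000 / len(tokens) for word, count in counts.items()}


def frequency_ratio(word, study_text, reference_text):
    """Ratio of a word's normalized frequency in the study text to the reference text.

    Returns None when the word does not occur in the reference text.
    """
    study_rate = per_million(study_text).get(word, 0.0)
    reference_rate = per_million(reference_text).get(word, 0.0)
    if reference_rate == 0.0:
        return None
    return study_rate / reference_rate


# Toy texts standing in for a study corpus and a reference corpus.
study_text = "the strike began and the strike spread quickly"
reference_text = "the meeting began and the strike ended quickly"
print(frequency_ratio("strike", study_text, reference_text))
```

A ratio well above 1 suggests the word is overrepresented in the study data relative to the reference corpus. With real corpora, a statistical keyness measure would normally replace the raw ratio.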
Another major source of digital data is represented by social media platforms, many of which provide their own application programming interfaces (APIs) for programmatic access to their data. The Twitter APIs (http://dev.twitter.com), for instance, allow one to access a small set of random tweets every day, or larger keyword-based collections of tweets (e.g., all the recent tweets with the hashtag #amused). If larger collections are necessary, they can be obtained through third-party vendors such as Gnip, which cover several social media sites and often partly curate the data. Twitter also provides limited demographic information on its users, such as location and self-maintained free-text profiles that sometimes include gender, age, industry, interests, and other attributes.
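Once a set of tweets has been collected, a keyword-based subset such as "all tweets with the hashtag #amused" can also be selected locally. The sketch below assumes the tweets are already stored as Python dictionaries with `text` and `location` fields; that record layout, the `has_hashtag` helper, and the sample tweets are hypothetical and do not reflect the format returned by the Twitter APIs.

```python
import re


def has_hashtag(text, tag):
    """True if `text` contains the hashtag `tag` (case-insensitive, whole tag only)."""
    pattern = r"(?<!\w)#" + re.escape(tag.lstrip("#")) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Made-up records; real API responses carry many more fields.
tweets = [
    {"text": "That talk left me #amused", "location": "Detroit"},
    {"text": "Not #amusedatall today", "location": "Boston"},
    {"text": "#Amused by my cat again", "location": ""},
]

selected = [t for t in tweets if has_hashtag(t["text"], "#amused")]
print([t["location"] for t in selected])
```

The lookahead in the pattern keeps longer hashtags such as #amusedatall from matching. The empty location in the last record is a reminder that profile fields are self-maintained and often missing.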
Blogs can also be accessed through an API. For instance, the Blogger platform offers programmatic access to the blogs and the profiles of the bloggers, which include a rich set of fields covering location, gender, age, industry, favorite books and movies, interests, and so on. Other blog sites, such as LiveJournal, also include additional information on the bloggers, for instance, their mood when writing a blog post.
Facebook is another very large platform for social media, although less available for public access. The main way for developers to access Facebook data is via its Graph API, but the access is nonetheless limited to the content of those profiles that are either publicly available or are “friends” (in Facebook terms) of the developers. An interesting data set for social science research is the myPersonality data set (available upon request from http://mypersonality.org): It was compiled using a Facebook application, and it includes the profiles and updates of a large number of Facebook users who have also completed a battery of psychological surveys (e.g., personality, values).
In addition, there are several other social media websites, with different target audiences, such as Instagram (where users upload mainly images they take), Pinterest (with “pins” of interesting things, covering a variety of domains from DIY to fashion to design and decoration), and many review platforms such as Amazon, Yelp, and others.
If you are interested in assembling your own data set, Chapter 6 provides an overview of software tools for scraping and crawling websites to collect your own data, and Chapter 5 provides instruction related to data selection and sampling.