An Introduction to Text Mining. Gabe Ignatow
through interpretive methods (e.g., archival research and ethnography).
Making Inferences
Social science research involves making inferences (drawing conclusions) about theories, about patterns in data, and about the individuals and communities that are, ultimately, the source of the data used in research. Inferential logic involves thinking about how and why it is warranted to make inferences from data. Based on their analysis of collected data, researchers use specific forms of logic to make inferences about relationships among social phenomena and between social phenomena and theoretical propositions and generalizations. In the early stages of a project, a researcher may not know the sorts of inferences they will make or conclusions they will draw. But in the end, they inevitably use one or more of the following forms of inferential logic, and it is beneficial for you as a researcher to be aware of these well-established forms as early as possible.
Inductive Logic
Inductive logic involves making inferences that take data as their starting point and then work upward to theoretical generalizations and propositions. Researchers begin by analyzing empirical data with their preferred tools and then allow general conclusions to emerge organically from their analysis (see Figure 4.1). The first ethical scenario in Chapter 3 is an example of a researcher relying exclusively on induction.
When qualitatively oriented researchers use inductive logic, they often position their research as grounded theory (Glaser & Strauss, 1967), while more quantitatively oriented researchers refer to data mining. Both grounded theory and data mining are used extensively in text mining research.
Figure 4.1 ∎ Inductive Logic
The use of inductive inference is attractive to social scientists for several reasons. First, it allows them to work with data sets and specialized tools quickly without having to invest time mastering abstruse philosophical debates and theories or setting up complex research designs. It also allows for great flexibility and adaptability, as analysts can allow their data to speak to them and adjust their conclusions accordingly rather than imposing a priori categories and concepts onto data in an artificial manner. And inductive research designs allow quantitatively oriented researchers, in particular, to immediately make use of the very large data sets and powerful software and programming languages that are at their disposal.
In its purer forms, induction has some serious drawbacks. First, it encourages analysts to begin research projects without first formulating a research question. Researchers simply assume that the project’s purpose will become evident during its analysis phase. But there is a very real risk that this simply will not happen, and the researcher will have invested significant time and resources in a directionless and perhaps purposeless project.
Another drawback of purely inductive research is that it can encourage researcher passivity with regard to mastering the research literatures in their areas of interest. Rather than mastering the work that has been done by others so as to identify gaps in knowledge, unsolved puzzles, or critical disagreements and then designing a study to address one or several of these, induction encourages researchers to skip straight to data collection and analysis and then work backward from their findings to the pertinent gaps in the literature, puzzles, and disagreements. In practice, this is often a high-risk strategy.
Although relying exclusively on inductive inferential logic is a risky and sometimes dangerous strategy, induction does end up playing a role in most text mining research projects. The complexity of natural language data demands that researchers allow their data to alter their theoretical models and frameworks rather than forcing data to conform to their preferred theories.
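As a small illustration of the data-first direction of inductive inference, the following Python sketch starts from a handful of responses and lets candidate themes surface from term counts, with no categories defined in advance. The responses, the stopword list, and the function names are all invented for illustration; this is a minimal sketch of the general idea and not the procedure used in any study discussed in this chapter.

```python
import re
from collections import Counter

# Hypothetical open-ended responses, invented for illustration only.
RESPONSES = [
    "I pick clothes that are practical and comfortable.",
    "Comfortable clothes matter more to me than fashion.",
    "I wear loose shirts to hide my stomach.",
    "Loose clothes hide the parts of my body I dislike.",
    "Practical clothes are all I need for work.",
]

# A tiny hand-picked stopword list (an assumption, not a standard resource).
STOPWORDS = {"i", "that", "are", "and", "to", "me", "than",
             "my", "the", "of", "all", "for", "more"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]


def candidate_themes(responses, min_docs=2):
    """Return (term, document count) pairs for terms found in at least
    min_docs responses, sorted by count (descending) and then alphabetically."""
    doc_freq = Counter()
    for response in responses:
        # Count each term once per response, however often it repeats.
        doc_freq.update(set(tokenize(response)))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, n) for term, n in ranked if n >= min_docs]


themes = candidate_themes(RESPONSES)
for term, n in themes:
    print(f"{term}: appears in {n} of {len(RESPONSES)} responses")
```

The recurring terms are only a starting point. Deciding that "loose" and "hide" together point to a theme of concealment is still an interpretive step the researcher has to take and defend.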
An example of a text mining study with a research design based on inductive logic is Frith and Gleeson’s (2004) thematic analysis of male undergraduate students’ responses to open-ended survey items related to clothing and body image. The undergraduates in the study were recruited through snowball sampling (see Chapter 5). In order to better understand how men’s feelings about their bodies influence their clothing practices, Frith and Gleeson analyzed written answers to four questions about clothing practices and body image and discovered four main themes relevant to their research question: men value practicality, men should not care how they look, clothes are used to conceal or reveal, and clothes are used to fit a cultural ideal.
A second example of an inductive text mining study is Jones, Coviello, and Tang’s (2011) study of the academic field of international entrepreneurship research. Jones, Coviello, and Tang constructed a corpus from 323 journal articles on international entrepreneurship published between 1989 and 2009 and then inductively synthesized and categorized themes and subthemes in their data.
Spotlight on the Research
An Inductive Approach to Media Framing
Bastin, G., & Bouchet-Valat, M. (2014). Media corpora, text mining, and the sociological imagination: A free software text mining approach to the framing of Julian Assange by three news agencies. Bulletin de Méthodologie Sociologique, 122, 5–25.
In this paper, the sociologists Bastin and Bouchet-Valat introduced R.TeMiS (http://rtemis.hypotheses.org), a free text mining software package designed for media framing analysis. Unique among R text mining tools, R.TeMiS features a graphical user interface (GUI) to help in the automation of corpus construction and management procedures based on the use of large media content databases and to facilitate the use of a range of statistical tools such as one- and two-way tables, time series, hierarchical clustering, correspondence analysis, and geographical mapping. Bastin and Bouchet-Valat presented a case study on the media framing of Julian Assange from January 2010 to December 2011 based on an analysis of a corpus of 667 news dispatches published in English by the international news agencies Agence France-Presse, Reuters, and the Associated Press. Bastin and Bouchet-Valat’s inductive approach to their data incorporates correspondence analysis (see Appendix G) as well as geographic tagging and mapping based on country names in the texts.
Specialized software used:
R.TeMiS
Deductive Logic
Deductive logic is the form of inferential logic most closely associated with the scientific method. Deductive research designs start with theoretical abstractions (see Figure 4.2), derive hypotheses from those theories, and then set up research projects that test the hypotheses on empirical data. The purest form of a deductive research design is the laboratory experiment, which in principle allows the researcher to control all variables except for those of theoretical interest and then to determine unequivocally whether hypotheses derived from a theory are supported or not.
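The deductive sequence described above (theory, then hypothesis, then test) can be sketched in Python as follows. The hypothesis is fixed before any counts are examined, and the data are used only to decide whether it is supported. The counts are hypothetical, and the two-proportion z-test is simply one common way to compare how often a coded category appears in two groups of texts. It is assumed here for illustration and is not attributed to any study discussed in this chapter.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns the z statistic and the two-sided p-value, using the
    pooled proportion to estimate the standard error.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothesis, stated in advance: texts from group A mention a given
# category more often than texts from group B.
# Hypothetical coded counts: 40 of 100 texts in A, 20 of 100 texts in B.
z, p_value = two_proportion_z_test(40, 100, 20, 100)

ALPHA = 0.05  # conventional significance threshold (an assumption)
supported = z > 0 and p_value < ALPHA
print(f"z = {z:.2f}, p = {p_value:.4f}, hypothesis supported: {supported}")
```

The direction of reasoning is the point of the sketch: the category and the expected difference come from theory, and the counts serve only to test them.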
Deductive inferential logic has been applied in many text mining studies. An early example is Hirschman’s (1987) study “People as Products,” which tested an established theory of resource exchange on male- and female-placed personal advertisements. In total, Hirschman derived 16 hypotheses from this theory and tested these hypotheses on a year’s worth of personal dating advertisements collected from New York and Washingtonian magazines. Hirschman selected at random 100 male-placed and 100 female-placed advertisements, as well as 20 additional advertisements that she used to establish content categories for the analysis. One male and one female coder coded the advertisements in terms of the categories derived from the 20 additional advertisements. The data were transformed to represent the proportionate weight