An Introduction to Text Mining. Gabe Ignatow

status, occupational status) for each sample, and the data were analyzed with a 2 × 2 analysis of variance (ANOVA) procedure. As is discussed in Appendix I, ANOVA is a collection of statistical models used to analyze variation between groups. In Hirschman’s 1987 study, gender of advertiser (male or female) and city (New York or Washington) were the factors analyzed in the ANOVA procedure, while Cunningham, Sagas, Sartore, Amsden, and Schellhase (2004) used ANOVAs to compare news coverage of women’s and men’s athletic teams.
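
A 2 × 2 ANOVA of this kind can be sketched in a few lines of Python. The scores below are invented purely for illustration and are not taken from Hirschman's study; the function name `two_by_two_anova` and the generic labels `factor_a` and `factor_b` are likewise assumptions. The sketch covers only a balanced design (equal numbers of cases per cell) and stops at the F ratios, since p-values require the F distribution from a statistics library.

```python
def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def two_by_two_anova(cells):
    """Return F ratios for a balanced 2 x 2 between-groups design.

    cells maps (level_of_factor_a, level_of_factor_b) to a list of scores,
    with the same number of scores (at least two) in every cell.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    if len(levels_a) != 2 or len(levels_b) != 2 or len(cells) != 4:
        raise ValueError("expected two levels per factor and four cells")
    n = len(next(iter(cells.values())))
    if n < 2 or any(len(scores) != n for scores in cells.values()):
        raise ValueError("every cell needs the same number of scores (>= 2)")

    grand = _mean(x for scores in cells.values() for x in scores)
    mean_a = {a: _mean(x for b in levels_b for x in cells[(a, b)]) for a in levels_a}
    mean_b = {b: _mean(x for a in levels_a for x in cells[(a, b)]) for b in levels_b}
    cell_mean = {key: _mean(scores) for key, scores in cells.items()}

    # Sums of squares for the two main effects, the interaction, and error.
    ss_a = 2 * n * sum((mean_a[a] - grand) ** 2 for a in levels_a)
    ss_b = 2 * n * sum((mean_b[b] - grand) ** 2 for b in levels_b)
    ss_ab = n * sum(
        (cell_mean[(a, b)] - mean_a[a] - mean_b[b] + grand) ** 2
        for a in levels_a
        for b in levels_b
    )
    ss_error = sum(
        (x - cell_mean[key]) ** 2 for key, scores in cells.items() for x in scores
    )

    # Each effect has 1 degree of freedom; error has 4 * (n - 1).
    ms_error = ss_error / (4 * (n - 1))
    if ms_error == 0:
        raise ValueError("no within-cell variation; F ratios are undefined")
    return {
        "factor_a": ss_a / ms_error,
        "factor_b": ss_b / ms_error,
        "interaction": ss_ab / ms_error,
    }


# Hypothetical scores: factor A is gender of advertiser, factor B is city.
scores = {
    ("male", "New York"): [2, 4],
    ("male", "Washington"): [4, 6],
    ("female", "New York"): [6, 8],
    ("female", "Washington"): [8, 10],
}
f_ratios = two_by_two_anova(scores)
```

With these made-up scores, the two main effects account for all of the between-cell variation and the interaction F ratio is zero, which is what a pattern of parallel cell means should produce.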

      Figure 4.2 ∎ Deductive Logic

      Management researchers Gibson and Zellmer-Bruhn’s 2001 study of concepts of teamwork across national organizational cultures is another example of the use of deductive inferential logic in a text mining project. This study’s goal was to test an established theory of the influence of national culture on employees’ attitudes. Gibson and Zellmer-Bruhn tested this theory on data from four organizations in four different countries (France, the Philippines, Puerto Rico, and the United States), conducting interviews that they transcribed to form their corpora. They used QSR NUD*IST (which subsequently evolved into NVivo; see Appendix D) and TACT (Popping, 1997) to organize qualitative coding of five frequently used teamwork metaphors (see Chapter 12), which were then used to create dependent variables for hypothesis testing using multinomial logit and logistic regression (multiple regression).

      Cunningham and colleagues’ (2004) analysis of coverage of women’s and men’s sports in the newsletter NCAA (National Collegiate Athletic Association) News is another example of a deductive research design. Cunningham and his colleagues tested theories of organizational resource dependence on data from 24 randomly selected issues of the NCAA News. One issue of the magazine was selected from each month of the year from the years 1999 and 2001 (see Chapter 5 on systematic sampling). From these issues, the authors chose to analyze only articles specifically focused on athletics, coaches, or their teams, excluding articles focused on committees, facilities and other topics (see Chapter 5 on relevance sampling). Two researchers independently coded each of 5,745 paragraphs in the sample for gender and for the paragraph’s location within the magazine and content. Reliability coefficients including Cohen’s kappa and the Pearson product-moment coefficient were calculated. As is discussed in Appendix I, reliability coefficients are used to measure the degree of agreement among raters. Interrater reliability is useful for determining if a particular scale is appropriate for measuring a particular variable. If the raters do not agree, either the scale is defective or the raters need retraining.
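
      To make the idea of a reliability coefficient concrete, the following is a minimal sketch of Cohen's kappa for two coders, using only the Python standard library. The ten paragraph codes are hypothetical and are not data from the NCAA News study; the function name `cohens_kappa` is an assumption made for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same code by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each code, multiply the two raters' marginal
    # proportions, then sum over codes.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_chance = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical gender codes assigned to ten paragraphs by two coders.
coder_1 = ["women", "men", "men", "women", "men", "men", "women", "men", "women", "men"]
coder_2 = ["women", "men", "men", "men", "men", "men", "women", "men", "women", "women"]
kappa = cohens_kappa(coder_1, coder_2)
```

      Here the coders agree on 8 of 10 paragraphs, but because agreement of 0.52 would be expected by chance given how often each coder used each code, kappa is roughly 0.58 rather than 0.80.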

      Cunningham and his colleagues also calculated word use frequencies, ANOVA, and chi-square statistics. The chi-square statistic, also discussed in Appendix I, is a very useful statistic in text mining research. It allows for comparisons of observed versus expected word frequencies across documents or groups of documents that may differ in size.
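
      The comparison of observed and expected word frequencies can be sketched as a Pearson chi-square test on a 2 × 2 table (the target word versus all other words, in each of two corpora). The counts below are hypothetical, the function name `chi_square_word` is an assumption, and no continuity correction is applied.

```python
def chi_square_word(count_a, total_a, count_b, total_b):
    """Pearson chi-square for one word's frequency in two corpora.

    count_a, count_b: occurrences of the word in corpus A and corpus B.
    total_a, total_b: total number of word tokens in each corpus.
    """
    observed = [
        [count_a, total_a - count_a],  # corpus A: target word, all other words
        [count_b, total_b - count_b],  # corpus B: target word, all other words
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a nonzero total")

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the word were used at the same rate in both corpora.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    return chi2


# Hypothetical counts: the word appears 30 times in a 1,000-word corpus
# and 20 times in a 2,000-word corpus.
chi2 = chi_square_word(30, 1000, 20, 2000)
```

      Because expected counts are scaled by corpus size, the two corpora need not be the same length. With one degree of freedom, a value above 3.841 is significant at the .05 level; the hypothetical counts above give a value of about 16.27.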

      The extreme complexity of user-generated textual data poses challenges for the use of deductive logic in social science research. One cannot perform laboratory experiments on the texts that result from interactions among members of large online communities, and it is difficult, and often unethical, to use manipulation to perform field experiments in online communities (see Chapter 3). Even researchers who are immersed in the relevant literatures in their field may not know precisely what they want to look for when they begin their analysis. For this reason, many researchers who work with text mining tools advocate for abductive inferential logic, a more forensic logic that is commonly used not only in social science research but also in natural science fields such as geology and astronomy, where experiments are rarely performed.

      Abductive Logic

      A weakness of both induction and deduction is that they do not provide guidance about how theories, whether grand theories, middle-range theories, or theoretical models, are discovered in the first place (Hoffman, 1999). The inferential logic that best accounts for theoretical innovation is abduction, also known, approximately, as “inference to the best explanation” (Lipton, 2003). Abduction differs from induction and deduction in that abduction involves an inference in which the conclusion is a hypothesis that can then be tested with a new or modified research design. The term was originally defined by the philosopher Peirce (1901), who claimed that for science to progress it was necessary for scientists to adopt hypotheses “as being suggested by the facts”:

      Accepting the conclusion that an explanation is needed when facts contrary to what we should expect emerge, it follows that the explanation must be such a proposition as would lead to the prediction of the observed facts, either as necessary consequences or at least as very probable under the circumstances. A hypothesis, then, has to be adopted, which is likely in itself, and renders the facts likely. This step of adopting a hypothesis as being suggested by the facts, is what I call abduction. (pp. 202–203)

      Abduction “seeks no algorithm, but is a heuristic for luckily finding new things and creating insights” (Bauer, Bicquelet, & Suerdem, 2014). Abductive logic does not replace deduction and induction but bridges them iteratively (through a process that repeats itself many times). It is a “forensic” form of reasoning in that it resembles the reasoning of detectives who interpret clues that permit a course of events to be reconstructed or of doctors who make inferences about the presence of illness based on patients’ symptoms.

      Figure 4.3 ∎ Abductive Logic

      A number of researchers who work with text mining and text analysis tools advocate for abduction as the optimal inferential logic for their research, including Bauer and colleagues (2014) and Ruiz Ruiz (2009). Bauer and colleagues (2014) argued that with abduction, text mining researchers need not do the following:

      . . . face a dilemma between the Scylla of deduction on the one hand, and Charybdis of induction on the other. We suggest abductive logic as the middle way out of this forced choice: the logic of inference to the most plausible explanation of the given evidence, considering less plausible alternatives. As it entails both machine inference and human intuition, it can maintain the human-machine-text trialogue.

      One of the main problems of abductive inference is how to formulate an abduction. Peirce was not especially clear on this point when he referred to a “flash of understanding” or when he attributed abductive capacity to an adaptive human need to explain surprising or unexpected facts. Although Peirce did not establish formal procedures to generate abductive inferences, he did propose criteria to distinguish between good and bad abduction. These include the need for abduction to propose truly new ideas or explanations, the need to derive empirically contrastable predictions from the hypotheses, and the need for the hypotheses to fit in with or give an adequate account of the social and historical context in which they emerge (Peirce, 1901).

      The sheer complexity of language as compared with the kinds of phenomena studied in the natural and physical sciences makes it difficult to implement text mining research designs that are entirely inductive or deductive. Even the most carefully planned text mining projects that are presented as though they were products of pure deductive reasoning generally result from long periods in which research teams use abductive logic to refine and reformulate their hypotheses and, at times, even their research questions.

      Abductive inferential logic is compatible with the use of any number of sophisticated research tools and is used in the early stages of many deductive research designs. One example is Ruiz

