An Introduction to Text Mining. Gabe Ignatow
IRBs. For example, the European Union has stringent privacy laws that may have been violated by the Facebook study. Adding to the difficulties for researchers attempting to follow privacy laws, it is unclear which jurisdiction’s data protection laws apply: those of the jurisdiction where research participants reside, where the researchers reside, where the IRB is located, where the server is located, where the data are analyzed, or some combination of these.
Because acquiring users’ textual data from online sources is a passive method of information gathering that generally involves no interaction with the individual about whom data are being collected, for the most part text mining research is not as ethically challenging as experiments and other methods that involve recruiting participants and that may involve deception. Nevertheless, universities’ IRBs are increasingly requiring participant consent (see the next section) in cases where users can reasonably expect that their online discussions will remain private. At the very least, in almost all cases social scientists are required to anonymize (use pseudonyms for) users’ user names and full names.
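One way to carry out the anonymization step described above is to replace each user name with a stable pseudonym before any analysis takes place. The following is a minimal sketch in Python rather than a prescribed procedure. It assumes that posts are stored as dictionaries with hypothetical "author" and "text" fields, and that the researcher keeps a secret key separate from the data set. The function names, field names, and the "user_" prefix are illustrative assumptions.

```python
import hashlib
import hmac


def pseudonym(name: str, secret_key: bytes, length: int = 10) -> str:
    """Return a stable pseudonym for a user name.

    A keyed hash (HMAC-SHA256) is used so that the same name always maps
    to the same pseudonym, while someone without the key cannot rebuild
    the mapping simply by hashing a list of candidate names.
    """
    # Assumption: names that differ only in case or surrounding
    # whitespace belong to the same user.
    normalized = name.strip().lower()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:length]


def pseudonymize_posts(posts, secret_key: bytes):
    """Return copies of the posts with the 'author' field replaced."""
    return [
        {**post, "author": pseudonym(post["author"], secret_key)}
        for post in posts
    ]


if __name__ == "__main__":
    # Hypothetical example data; a real key should be generated randomly
    # and stored apart from the research data.
    key = b"replace-with-a-randomly-generated-secret"
    posts = [
        {"author": "Alice_92", "text": "First post in the thread."},
        {"author": "bob.smith", "text": "A reply to the first post."},
        {"author": "Alice_92", "text": "A follow-up comment."},
    ]
    for post in pseudonymize_posts(posts, key):
        print(post["author"], "|", post["text"])
```

Because the pseudonyms are stable, posts by the same user can still be linked to one another in the analysis. Note that this sketch only replaces the author field. Names that appear inside the text of a post, and distinctive quotations that can be found with a search engine, would need to be handled separately.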
It has also been suggested that although publicly available online interactions exist within the public domain, site members may view their online interactions as private. Hair and Clark (2007) have warned researchers that members of online communities often have no expectation that their discussions are being observed and may not be accepting of being observed.
In order for text mining research using user-generated data to progress, researchers must make several determinations. First, they must use all available evidence to determine whether the data should be considered to be in the public or private domain. Second, if data are in the public domain, the researcher must determine whether users have a reasonable expectation of privacy. In order to make these determinations, researchers should note whether websites, apps, and other platforms require member registration and whether they include privacy policies that specify users’ privacy expectations.
Informed Consent
Informed consent refers to the process by which individuals explicitly agree to be participants in a research project based on a comprehensive understanding of what will be required of them. The Belmont Report (discussed previously) identified three elements of informed consent: information, comprehension, and voluntariness. The principle of respect for persons implies that participants should be presented with relevant information in a comprehensible format and then should voluntarily agree to participate. Participants who have given their informed consent are not expected to be informed of a study’s theories or hypotheses, but they are expected to be informed of what data the researcher will be collecting, what will happen to those data, and their right to withdraw from the research.
Informed consent is a core principle of human research ethics established in the aftermath of the Second World War. In important cases where the research question is deemed vital and consent is not possible (or would prevent a fair test), the consent requirement can be legally bypassed. But this is rare, and it is clear that the researchers in the Cornell–Facebook study failed to obtain consent from the hundreds of thousands of Facebook users who were subjected to the manipulation of their news feeds. Instead, the researchers took advantage of the fine print in Facebook’s data use policy to conduct an experiment without users’ informed consent. Even though the academic researchers collaborated with Facebook in designing the study, it appears that they obtained ethical approval only after the data collection was finished (Chambers, 2014).
Researchers have argued that informed consent is not required for research in online contexts in which the data can be considered to be in the public domain (Eysenbach & Till, 2001; Sudweeks & Rafaeli, 1996). And professional research associations occasionally deem informed consent unnecessary in cases where the scientific value of a research project can justify undisclosed observation. However, in cases where it cannot be legitimately argued that data are in the public domain or where data are in the public domain but are protected by copyright laws, participants’ informed consent to use such data must be sought.
Because the process of seeking informed consent is onerous and requires the creation and administration of an IRB-approved informed consent form, text mining researchers typically prefer to use data that are clearly in the public domain.
Manipulation
So far, we have assumed that the researcher is collecting unprompted user conversations (rather than prompted data, such as from interviews or questionnaires), but social scientists are beginning to collect users’ textual data after actively manipulating the online environment as a stimulus intended to assess reactions or responses. The Cornell–Facebook emotion study is an example of such research. Researchers could also prime users by, for example, introducing sexist, racist, or homophobic language into the online environment and then recording the responses of members of different communities. From an ethical standpoint, for this kind of experimental online research it is not sufficient for the researcher to anonymize participants’ names after the experiment has been conducted.
As always, the best practice is for university researchers who plan to manipulate the social media environment in any way to consult with their IRB and for researchers in industry to follow the regulations and guidelines of their respective professional associations.
Publishing Ethics
If you are thinking about graduate school or a career in research and teaching, you have many outlets available for publishing your research papers. You can publish your own work in specialized undergraduate research journals, present your work in undergraduate poster sessions at national and international academic conferences, and possibly upload your undergraduate honors thesis to your university’s digital research archives. You may also publish collaboratively with faculty members in research journals and conference proceedings, as a research assistant or perhaps occasionally as a coauthor. Whatever your specific goals, it is important to be aware of the many ethical pitfalls involved in scholarly publishing. In this section, we borrow liberally from research ethics scenarios (http://ethicist.aom.org/2013/02/ethics-in-research-scenarios-what-would-you-do) that were developed by management researchers Davis and Madsen in 2007. The scenarios presented below all represent ethics violations related to authorship and publishing, and they all represent patterns of behavior that occur quite often.
Scenario 1
In the first scenario, you have recently begun to work with a professor who is a productive scholar who has published in major journals for many years. But you have discovered that he has an unusual approach to research. He begins by gathering and analyzing data, which may include using a student’s data set, to see if the data have anything interesting to say. You have found that the professor often manipulates the data and changes the dependent variable to ensure a statistically significant result and increase the probability of a major publication.
Is this professor’s approach to research ethical? Why or why not?
Is there anything you could or should do as a student in this situation?
Scenario 2
In the second scenario, a professor has a long and impressive resume, but upon closer examination, you realize that many of her publications seem to be quite similar. One day you met with this professor and commented on her impressive body of work. She said that she never writes anything that doesn’t get as much ink and attention as possible. Among other things, she said that she may change the name of some of her papers to get them into conferences. She also claimed that she spends so much time gathering data that to be as productive as possible she must use the same data and theory in multiple published studies.
Can one plagiarize oneself?
How often can data be used ethically?
Can the same paper be submitted to a conference