Introduction to Corpus Linguistics. Sandrine Zufferey
interview with a researcher. We will also need to make sure that the corpus collected in this way includes approximately the same speaking time, or the same number of words, for men and women. This control over the linguistic context and the duration of interactions helps us to ensure that men and women have had roughly equal reasons to use words related to emotions and feelings, and as many opportunities to do so. Second, we would have to choose a list of words to search for in the corpus, representative of the vocabulary related to emotions, for example verbs such as to annoy, adjectives like furious or nouns like anger. Then, by comparing the number of times these words have been produced by the two groups and by validating the significance of the differences observed between the groups through statistical tests, we would be able to provide an answer to the research question. In this study, we have sought to reduce the number of confounding variables by controlling the context in which the statements are produced, as well as by restricting the vocabulary examined. It is precisely this limited and reductionist aspect that opponents of quantitative methods criticize, arguing that the constructed and unnatural context in which structured interviews take place does not reflect the richness of natural and spontaneous exchanges between speakers.
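The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular study: the word list, the two toy transcripts and the function names are invented for the example, and a Pearson chi-square test on a 2x2 table stands in for the statistical tests, which the text does not specify.

```python
import math
import re

# Hypothetical list of emotion words (assumption: a real study would use a
# much larger, motivated list and would lemmatize the corpus first).
EMOTION_WORDS = {"annoy", "annoyed", "furious", "anger", "angry", "sad", "happy"}


def count_emotion_words(text):
    """Return (emotion word count, total word count) for one transcript."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for token in tokens if token in EMOTION_WORDS)
    return hits, len(tokens)


def chi_square_2x2(hits_a, total_a, hits_b, total_b):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table.

    Rows are the two groups; columns are emotion words vs. all other words.
    Returns (statistic, p_value).
    """
    observed = [
        [hits_a, total_a - hits_a],
        [hits_b, total_b - hits_b],
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [hits_a + hits_b, grand_total - hits_a - hits_b]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Toy transcripts, invented for illustration only.
women = "I was furious and sad but then I felt happy again"
men = "I went to work and then I was a bit annoyed"

hits_w, total_w = count_emotion_words(women)
hits_m, total_m = count_emotion_words(men)
statistic, p_value = chi_square_2x2(hits_w, total_w, hits_m, total_m)
print(f"women: {hits_w}/{total_w}, men: {hits_m}/{total_m}")
print(f"chi-square = {statistic:.3f}, p = {p_value:.3f}")
```

With transcripts this short the test is not meaningful, since the expected counts are far too small. The sketch only shows the shape of the computation, which would be applied to corpora of comparable size for the two groups.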
The other major methodological paradigm includes so-called qualitative studies. The main objective of these studies is holistic: they aim to understand a phenomenon as a whole, in as much detail and as thoroughly as possible, but in a small number of people. Due to their nature, qualitative studies are interpretative. In linguistics, research paradigms involving a qualitative methodology typically resort to questionnaires with open questions, interviews, observations or introspective techniques, such as think-aloud protocols. For example, in order to study the differences in the way men and women express emotions, a qualitative methodology could involve asking a small number of speakers, for example three men and three women, to describe the way in which they express their emotions, either by talking freely with the experimenter or by talking to each other. The analysis would then require an in-depth study of some of the examples found interesting during the discussion.
One of the main criticisms aimed at qualitative methods is that they are very subjective in nature, insofar as they are largely based on the interpretations made by linguists and the subjective impressions of a few speakers. Thus, the specific cases they describe often cannot be generalized to a population, which, incidentally, is not the aim of such studies. Rather than the generalization of results, these studies rely on the possibility of transferring what has been learned from one particular situation to another with which it shares common traits. For example, an in-depth case study on the difficulties of expressing emotions in an aphasic patient may help to highlight similar difficulties existing in other patients with the same disorder.
To summarize, each of the two methodological paradigms introduced in this section has both advantages and disadvantages. Quantitative methods enable the generalization of results to the whole of a population, whereas qualitative methods offer a more detailed and nuanced picture of a real case. Recently, the complementarity between these approaches has started to be broadly accepted in research, and many studies now combine the two types of methodology in order to benefit from their advantages and limit their disadvantages.
For example, if we want to know whether learners of French as a foreign language at an advanced level are able to use collocations as native speakers do (collocations such as “prendre une décision” – to make a decision – or “pleuvoir à verse” – to pour with rain), we can search for occurrences of these expressions in text corpora produced by learners and compare the number of times they appear, and their frequency, with the corresponding figures in a corpus of similar texts produced by native speakers. By comparing these frequencies through statistical tests, we will know whether or not learners actually use these expressions as often as native speakers do. Even if we find a difference between the two groups, what this study will not tell us is why learners do not use these expressions as often as native speakers do, or which expressions they use instead. To find out, we can complement this study with a qualitative analysis, by observing, for example, which words other than the verb prendre often accompany the noun décision in French. If we observe that English-speaking learners, but not German-speaking learners, repeatedly use the verb faire (make) rather than prendre (take) with décision, we may conclude that these errors could come from a problem of transfer from their mother tongue and, more specifically, from the expression to make a decision in English.
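The follow-up step described above, looking at which verbs accompany décision in each learner group, could be sketched as follows. The sentences, the small table of verb forms and the function name are all invented for illustration; the sketch assumes a simple window of three tokens before the noun, whereas a real study would more likely rely on a lemmatized or parsed corpus.

```python
import re
from collections import Counter

# Hypothetical mapping from inflected verb forms to lemmas (assumption: a
# real study would use a lemmatizer or a lemmatized corpus).
VERB_LEMMAS = {
    "prendre": "prendre", "prends": "prendre", "prend": "prendre", "pris": "prendre",
    "faire": "faire", "fais": "faire", "fait": "faire",
}


def verbs_before_noun(sentences, noun="décision", window=3):
    """Count the verb lemmas found within `window` tokens before `noun`."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        for position, token in enumerate(tokens):
            if token != noun:
                continue
            for previous in tokens[max(0, position - window):position]:
                if previous in VERB_LEMMAS:
                    counts[VERB_LEMMAS[previous]] += 1
    return counts


# Toy learner sentences grouped by mother tongue, invented for illustration.
learner_sentences = {
    "English": ["J'ai fait une décision hier.", "Il faut faire une décision."],
    "German": ["J'ai pris une décision hier.", "Elle prend une décision."],
}

for mother_tongue, sentences in learner_sentences.items():
    print(mother_tongue, dict(verbs_before_noun(sentences)))
```

A surface match of this kind will produce false positives (fait can also be a noun, for instance), so the counts are best treated as a list of candidates to be read in context rather than as final figures.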
In summary, a corpus can be analyzed using a quantitative or qualitative methodology. While we acknowledge the use and importance of combining these two approaches, in the rest of the book we will focus on the quantitative approach to corpus linguistics, which poses its own theoretical and methodological challenges.
1.6. Differences between corpus linguistics and experimental linguistics
Corpus linguistics and experimental linguistics share very important methodological properties, since both are empirical in nature and both generally involve a quantitative rather than a qualitative approach. However, these two types of approaches differ on one very important point. On the one hand, corpus linguistics focuses on the observation of data as found in collections of texts, recordings, etc. On the other hand, experimental linguistics relies on the manipulation of one or more variables in order to study their effect on other variables.
Let us imagine once again that we are interested in the types of language errors produced by learners of French. By means of a corpus study, we will be able to identify all the types of errors produced and then quantify each of them: for example, 30 spelling mistakes, 12 lexicon errors, 20 syntax mistakes, etc., per 100 words. Then, by applying statistical tests, we will be able to determine whether one of the error categories is significantly more frequent than the others. We will also be able to compare the number of errors produced in each category by students of different levels and, thanks to statistical tests, determine whether students progress significantly faster in certain categories than in others. In contrast, what a corpus study will not help us to do is establish with certainty the factors influencing the number of errors. The corpus only shows us the result of the speakers’ production, but not what led to these results. In order to determine the factors that lead learners to make mistakes or not, we will need to resort to experimental methodology.
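The quantification step described above can be illustrated with a short Python sketch. The raw counts and the corpus size below are hypothetical, chosen only so that the rates match the 30, 12 and 20 errors per 100 words mentioned in the text. The chi-square goodness-of-fit test against equal frequencies is one possible choice of test, assumed here for the example.

```python
import math

# Hypothetical raw counts from an invented 1,000-word learner corpus; they
# correspond to the rates of 30, 12 and 20 errors per 100 words used in the text.
CORPUS_SIZE = 1000
error_counts = {"spelling": 300, "lexicon": 120, "syntax": 200}


def rate_per_100_words(count, corpus_size):
    """Normalize a raw count to a rate per 100 words."""
    return count * 100 / corpus_size


def chi_square_equal_frequencies(counts):
    """Chi-square goodness-of-fit test against equal expected frequencies.

    The closed-form p-value used here is valid only for exactly three
    categories (2 degrees of freedom).
    """
    if len(counts) != 3:
        raise ValueError("this sketch handles exactly three categories")
    expected = sum(counts) / len(counts)
    statistic = sum((observed - expected) ** 2 / expected for observed in counts)
    p_value = math.exp(-statistic / 2)  # chi-square survival function, 2 df
    return statistic, p_value


for category, count in error_counts.items():
    print(f"{category}: {rate_per_100_words(count, CORPUS_SIZE):.1f} per 100 words")

statistic, p_value = chi_square_equal_frequencies(list(error_counts.values()))
print(f"chi-square = {statistic:.2f}, p = {p_value:.3g}")
```

A significant result here only indicates that the three categories are not equally frequent; it does not say which category stands out, and it treats every error as an independent observation, which is itself an assumption about the data.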
When we conduct an experiment, the goal is to manipulate the possible causes and then to observe their effects. Going back to our example research question, we may wonder what makes some students produce more errors than others, and what makes the same student produce more errors in certain contexts than in others. As regards the difference between students, we may think that one possible cause is the level of general intelligence of each student, the assumption being that, overall, more intelligent students should produce fewer errors than less intelligent students. The level of intelligence thus constitutes the cause that we will manipulate in order to observe its effect on the number of errors produced. In order to measure the effect of the intelligence variable, we will first need to measure the students’ intelligence, for example by means of an IQ test. We will then use the result of this test to determine whether the students who have a higher IQ are also the ones who make the fewest language errors.
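One simple way to check whether higher IQ scores go together with fewer errors is a correlation coefficient. The sketch below computes Pearson's r by hand; the scores and error counts are invented for illustration, and the choice of Pearson's r is an assumption, since the text does not name a particular measure.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Invented data: one IQ score and one error count per student.
iq_scores = [95, 100, 105, 110, 120]
error_counts = [14, 12, 13, 9, 7]

r = pearson_r(iq_scores, error_counts)
print(f"r = {r:.2f}")
```

A coefficient close to -1 would be consistent with the hypothesis that students with a higher IQ make fewer errors. On its own, however, it describes an association between the two measures and does not show that one causes the other.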
In the case of the second research question, which seeks to determine why the same student makes more mistakes in certain contexts, we may assume that stress promotes the production of errors. In order to test this hypothesis, we will have to conduct an experiment in which half of the students are placed in a stressful situation, such as an examination or a test with a limited amount of time to complete the task, whereas the other half are placed in a low-stress situation, for example performing a task without any time constraint and without marked assessment. Then, we will compare the number of errors in the two groups so as to determine, by means of a statistical test, whether the students in the stressful situation make significantly more errors than the other students. In the two examples of studies that we have just discussed, the approach is the same: to identify a possible cause and to assess its effect through