Introduction to Corpus Linguistics. Sandrine Zufferey
manipulation. Conversely, a corpus study focuses on linguistic productions without manipulating the data before collecting them.
The study of linguistic productions in a corpus and the manipulation of experimental variables each have their advantages and disadvantages. On the one hand, corpus linguistics has the advantage of favoring the observation of natural data, that is, data which are not influenced by an experimental context. A corpus of journalistic texts includes real productions by journalists, which were not produced for the purpose of being observed. Likewise, a text produced by a learner is also natural, insofar as it is produced under its usual conditions, without any particular manipulation. In addition, the use of corpora favors the observation of a very large amount of linguistic data, whereas experiments are based on a limited number of linguistic items so that the task remains feasible for participants, who would not be able to read thousands of sentences in a laboratory, for example. Finally, once a corpus has been created, it can be used for numerous research questions without requiring any additional time or financial costs. Experiments, by contrast, require a significant amount of time for each new study and usually involve compensating participants financially for their cooperation.
Experimental studies also have definite advantages over corpus studies. The first advantage, mentioned above, is that experiments allow us to test the existence of a causal relationship between two variables, such as being stressed and producing more errors. Corpus studies do not make it possible to draw this type of conclusion. Second, while an experimental paradigm can be developed to test almost any kind of phenomenon, some linguistic phenomena are so rare that they may be absent from a corpus, or too poorly represented in it, to be examined in this way. For example, if we want to decide whether learners have mastered French idioms such as “mettre le feu aux poudres” (to stir up a hornet’s nest) or “avoir un poil dans la main” (to be extremely lazy) through a corpus study, we will have to look for them in a corpus of learners’ productions. Now, it is quite possible that these expressions are never found there, but this does not necessarily mean that the learners do not know how to use them. It only means that they did not have an opportunity to produce them in the corpus. Using experimental methodology, we will be able to test whether learners have mastered these expressions. For instance, we can ask them to read the expressions and then to choose, from among several definitions, the one corresponding to their meaning. Finally, experimental linguistics makes it possible to study the linguistic competence of speakers, through different language comprehension tasks which can be more or less explicit, such as the conscious evaluation of sentences, their intuitive reading, etc. Corpora, for their part, can only reflect the linguistic productions of speakers.
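The limitation described above, namely that the absence of an idiom from a corpus says nothing certain about what learners know, can be illustrated with a minimal Python sketch. The three learner sentences are invented for this illustration, and the helper name count_occurrences is an assumption rather than part of any existing corpus tool.

```python
import re

# Hypothetical toy learner corpus: invented sentences, for illustration only.
learner_corpus = [
    "Hier, je suis allé au marché avec ma sœur.",
    "Mon frère est très paresseux, il ne fait jamais ses devoirs.",
    "Cette décision a provoqué une grande dispute.",
]

idioms = ["mettre le feu aux poudres", "avoir un poil dans la main"]


def count_occurrences(texts, expression):
    """Count case-insensitive literal matches of `expression` across all texts."""
    pattern = re.compile(re.escape(expression), re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in texts)


# Both counts are zero here, although the second and third sentences express
# ideas close to the idioms: a zero count shows that the expressions were not
# produced, not that the learners do not know them.
counts = {idiom: count_occurrences(learner_corpus, idiom) for idiom in idioms}
print(counts)
```

A literal search of this kind would also miss inflected forms such as “a mis le feu aux poudres”, so a real study would probably search on lemmas or on a distinctive part of the expression. Even then, a count of zero would remain compatible with learners knowing the idiom.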
To conclude, corpus studies and experimental studies can often be used in a complementary way, and, when put together, they represent powerful tools for answering a good number of research questions.
1.7. Different types of corpora
As we will see in the following chapters, corpora represent linguistic samples of a very varied nature, and it is precisely this variety that makes it possible to answer diverse research questions in all fields of linguistics. In this last section, we will introduce a first classification of the types of existing corpora, in order to be able to refer back to it in the following chapters.
The first distinction we can make among existing corpora is the one between sample corpora and monitor corpora. Sample corpora are those in which data have been collected once and for all, and which no longer evolve thereafter. For this reason, they are also known as closed corpora in the specialized literature. The advantage of these corpora is that they have been designed to contain a set of texts representative of the language, or of the part of the language to be studied, with a balanced representation of the different text genres, for example. Thus, these corpora make it possible to draw conclusions which can be generalized. Their main defect, however, is that they age quickly and do not follow changes in the language. Therefore, sample corpora need to be compiled anew at regular intervals.
Monitor corpora, by contrast, are never finished and constantly continue to integrate new elements, which is why they are described as open corpora in the literature. A typical example of this type of data is a corpus that contains newspaper archives or parliamentary debates. Every year, the amount of available data increases. For this reason, it is difficult to maintain a perfect balance between the different parts of these corpora, and their representativeness cannot be fully guaranteed. We will return to the problem of representativeness in Chapter 6. In return, these corpora remain up to date. When they span a period of a few decades, they make it possible to observe the appearance of certain changes in language.
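The difficulty of keeping a monitor corpus balanced can be made concrete with a small Python sketch. It assumes a hypothetical monitor corpus made up of two components, press and parliamentary debates, and all the word counts are invented. The function names genre_shares and add_material are also assumptions introduced for this illustration.

```python
def genre_shares(word_counts):
    """Return each component's share of the total word count."""
    total = sum(word_counts.values())
    return {name: count / total for name, count in word_counts.items()}


def add_material(word_counts, new_material):
    """Return a new dict in which the added word counts are merged in."""
    merged = dict(word_counts)
    for name, count in new_material.items():
        merged[name] = merged.get(name, 0) + count
    return merged


# Invented figures: the corpus starts out perfectly balanced.
corpus_year_1 = {"press": 500_000, "parliament": 500_000}

# The two sources do not grow at the same rate (again, invented figures).
additions_year_2 = {"press": 300_000, "parliament": 100_000}
corpus_year_2 = add_material(corpus_year_1, additions_year_2)

shares_year_1 = genre_shares(corpus_year_1)
shares_year_2 = genre_shares(corpus_year_2)
print(shares_year_1)
print(shares_year_2)
```

After a single update, the press component already accounts for more than half of the corpus. A sample corpus avoids this drift because its composition is fixed once and for all, at the price of no longer following the language.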
The second major distinction to be made among existing corpora differentiates general language corpora from specialized language corpora. General language corpora aim to offer a panorama of the whole of a language at a given time. It is obviously impossible to collect a sample of the whole language, but in the same way that a general language dictionary aims to describe the common lexicon of a language, the general corpus seeks to offer a global image, including the main textual genres found in that language. These corpora are very valuable when it comes to studying a language as a whole, but they cannot offer precise answers about linguistic phenomena specific to certain means of communication, such as mobile texting, social media, medical reports, etc.
In order to study one of these areas specifically, it is preferable to resort to a specialized corpus. There are indeed corpora especially devoted to texting, social media, etc. In addition, general corpora include productions by adults who are native speakers of the language represented. Other corpora specialize in representing other population categories, such as monolingual children in the process of acquiring their mother tongue, bilingual children, foreign-language learners, or children with neuro-developmental disorders influencing language acquisition, such as autism and specific language impairment. Finally, by default, a general corpus includes examples of the variety considered as the standard of a language, or of one of its main varieties. For French, this is generally the French spoken in France and, more precisely, in the Parisian region. For English, general corpora can represent either British English or American English. Conversely, some corpora specialize in the productions of speakers of a certain language variety, such as the French of French-speaking Switzerland, Belgium, Canada, etc.
General or specialized language corpora can contain either written language or spoken language samples. For a long time, written language corpora were the norm, but the analysis of spoken language has developed considerably since the 2000s. Corpora of spoken language are typically smaller than written language ones, since they require manual transcription. Recording voices is easy, but carrying out searches directly on an audio file is difficult. At the same time, speech recognition software does not always produce fully reliable automatic transcriptions. It is for this reason that oral data must be transcribed manually, which often limits the size of spoken corpora. More recently, corpora of audio-visual recordings (also called “multimodal” corpora) have been created, in order to facilitate, for instance, the study of gestures and facial expressions as well as their role in communication. These corpora still pose many coding and interpretation challenges. Finally, let us point out that video corpora are also used for the study of sign language.
Another distinction that can be made regarding the types of existing corpora relates to the type of processing carried out on the linguistic data of the corpus. On the one hand, raw corpora contain nothing but language samples. This is the case for the majority of French corpora. On the other hand, annotated corpora contain specific linguistic information in addition to the language samples. The most common type of annotation is the assignment of a grammatical category to each word in the corpus, as we have already mentioned. More rarely, certain corpora contain a syntactic analysis of all of their sentences, as well as other types of information, such as an annotation of the discourse relations (cause, condition, etc.) that link the sentences within the texts of the corpus.
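The difference between a raw corpus and a corpus annotated with grammatical categories can be sketched in a few lines of Python. The example sentence is invented, the tags were assigned by hand for this illustration, and the labels (DET, NOUN, VERB, ADP) are modeled on the Universal Dependencies part-of-speech labels; an actual corpus may use a different tagset and storage format.

```python
# Raw corpus: nothing but the language sample itself.
raw_sentence = "Le chat dort sur le canapé"

# Annotated corpus: each word is paired with a grammatical category.
# Tags assigned by hand for illustration; the tagset is an assumption.
annotated_sentence = [
    ("Le", "DET"),
    ("chat", "NOUN"),
    ("dort", "VERB"),
    ("sur", "ADP"),
    ("le", "DET"),
    ("canapé", "NOUN"),
]


def words_with_tag(annotated, tag):
    """Return, in order, the words that carry the given grammatical category."""
    return [word for word, word_tag in annotated if word_tag == tag]


# A search by category is only possible on the annotated version.
nouns = words_with_tag(annotated_sentence, "NOUN")
print(nouns)
```

With the raw version, only searches on word forms are possible. With the annotated version, a query such as "all the nouns" becomes a simple filter, which is what makes this type of annotation so useful in practice.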