Introduction to Corpus Linguistics. Sandrine Zufferey

Introduction to Corpus Linguistics

The term corpus has a Latin origin and means “body”. A text corpus literally embodies a set of texts, a collection of a certain number of texts for study. For example, it is possible to collect a series of newspaper articles and make a corpus of them in order to study the specificities of the journalistic genre. In the field of language teaching, it is also possible to collect texts written by students having different levels, and to build a corpus of these writings in order to study the typical errors that students produce at different learning stages. A methodology using data from the outside world rather than using one’s own knowledge of the language is called an empirical methodology. Corpus linguistics can be defined as an empirical discipline par excellence, since it aims to draw conclusions based on the analysis of external data, rather than on the linguistic knowledge pertaining to researchers.

Working with corpus linguistics therefore implies being in contact with linguistic data in the form of texts, and also in the form of recordings, videos or any other sample containing language. Most of the time, these samples are collected in a computerized format, which makes it possible to study them more effectively than if they were on paper. Let us imagine, for example, we wish to know how many times and in what passages Flaubert evokes the feeling of love in his novel Madame Bovary. If we have a paper version of that book, finding these passages will be a long and tedious task, which will require going through the entire text. However, having a computerized version would make the task much easier. We simply need to look up for the terms love, in love or the verb to love in its different forms with the search function of the word processor so as to locate the appearances and easily count them. For most of the questions addressed by corpus linguistics, it would be impossible to search through a paper database, and that is why having computerized corpora becomes essential.

The problem of manual tracking and counting of occurrences is all the more acute since corpus linguistics is often based on large amounts of data which have not been drawn from a single book, in view of observing the multiple occurrences of a certain linguistic phenomenon and thus apprehending its specificities. For example, let us suppose that we wish to know whether Flaubert talks about love in his work. In this case, focusing solely on Madame Bovary would induce a bias, because this novel is not representative of the whole of his work. So, in order to be able to answer this question, it is necessary to go through the entirety of his novels, making the task even more complex to perform manually. Let us now imagine that this time we want to know whether the French authors of the 19th Century all deal with the question of love as much as Flaubert does. In this case, it would be impossible for us to look up the occurrence of terms related to love in all of the novels written by French authors in the 19th Century. In order to avoid this problem, it would be necessary to collect a sample of texts, representative of the works of this period. We will discuss this topic in Chapter 6, which is devoted to the methodological principles underlying the construction of a corpus. For the moment, the important point to bear in mind is that corpus linguistics often resorts to a quantitative methodology (see section 1.5) so as to be able to generalize the conclusions observed on the basis of a linguistic sample to the whole of the language, or belonging to a particular language register.

As we will see in the following chapters, corpus linguistics may be of use in all areas of linguistics, for instance in fundamental (see Chapter 2) or applied (see Chapter 3) linguistics. For example, it is crucial in lexicography, since it makes it possible to make an exhaustive inventory of a language’s lexicon. It also makes it easy to find examples of uses in different types of sources (literary, journalistic and others), while bringing to light the expressions in which a word is frequently used. In other words, it makes it possible to establish very useful phraseology elements for dictionaries. For example, it is useful to know what the word “knowledge” means, but it is just as important to know that this word is frequently used in phrases such as “acquire knowledge” or “having good knowledge of”, etc. Corpus linguistics is a particularly effective method for establishing the frequent contexts in which a word or an expression is used. But corpus linguistics is also used for conducting research in fundamental areas of linguistics such as the study of syntax, since it makes it possible to identify the types of syntactic structures used in different languages. For example, by making a corpus study, it is possible to determine in which textual genres the passive voice is most commonly used. Finally, thanks to the existence of a corpus of oral data, corpus linguistics also makes it possible to answer questions related to phonology and sociolinguistics. For instance, it makes it possible to establish the area of geographical distribution of certain pronunciation traits, such as differentiating the short /a/ form in the French word “patte” (paw), from the long /ɑ/ form in the word “pâte” (pastry). Answering these different questions requires the use of different types of corpora, as well as having available data regarding their contents. For example, in order to determine the geographical area of diffusion of a certain pronunciation trait, it is necessary to know where each speaker having contributed to the corpus came from. This type of information is called corpus metadata. We will review the main types of existing corpora at the end of this chapter, and discuss the issue of metadata in Chapter 6.

To sum up, in this section, we have defined corpus linguistics as an empirical discipline, which observes and analyzes quantitative language samples gathered in a computerized format. In the following sections, we will discuss in depth the different central points of the definition, indicated in bold, in order to better understand the theoretical and methodological anchoring of corpus linguistics.

1.2. Empiricism versus rationalism in linguistics

Corpus linguistics is an empirical discipline, which means that it uses data produced by speakers in order to study language. This methodology is opposed to the rationalist method, which functions by looking for answers by relying on one’s own linguistic knowledge, rather than looking for it in external data. Let us take an example. In order to determine whether the phrase “When do you think he will prepare which cake?” is grammatically correct or not, the use of empirical methodology would go through large corpora to find whether this syntactic structure is used by English speakers or not.

If sentences following such a syntactic structure never or almost never appear in the corpus, linguists might conclude that this sentence is only rarely used in English. Rationalist methodology, on the contrary, might respond to the same issue by relying on the intuitions of linguists. In this particular case, they might wonder whether they could produce such a sentence or not, whether it seems correct or incorrect depending on their knowledge of the language and might infer a grammaticality judgment from it. Grammaticality judgments are often classified into three types: correct, incorrect or marked, in the event that a sentence may seem possible, but sounds unnatural.

This example illustrates a fundamental difference between empirical and rationalist methodology. While the rationalist methodology leads to the formulation of categorical judgments, the empirical methodology provides a more refined answer to this question, since the observation of corpus data offers a precise indication of frequency, rather than a result in terms of absence or presence. This is one of the reasons why many linguists currently consider that the empirical methodology better matches a scientific approach (in the sense of confrontation against the facts) than a purely rationalist method for studying language.

Nonetheless, the choice between the use of empirical or rationalist methods is not limited to the field of linguistics. Certain scientific branches such as physics, chemistry, as well as sociology and history are essentially empirical disciplines. In fact, both physicists and historians base their insights on external data, which they collect in the

Скачать книгу