2
Theoretical overview of the evolution of corpus linguistics

Elena Tognini Bonelli

1. The origins of corpora

The starting point of corpus linguistics can be traced by considering the issue of observable data and how this has been handled in different periods and across different theoretical schools. Of necessity, historical linguistics has always been corpus-based, since by far the principal evidence of language change and evolution is found in collections of texts of different periods and locations (Johansson 1995: 22). Indeed, modern linguistics owes its impetus to the lively work of the historical linguists of the nineteenth century. It may therefore come as a surprise that, in a relatively short space of time, the discipline should shift its focus from that data-based approach to one based on intuition and introspection. But it is a fact that, in spite of its data-based origins, modern linguistics, after the historical-evolutionist period, moved away from the observation of data and, starting with Saussure, the object of linguistics was defined as the system, abstract par excellence, and not identifiable in single tokens.

Under the influence of the positivist and behaviourist trend, post-Bloomfieldian linguistics in the USA became concerned to account for the observable data, and there was little room for abstract speculation. With Chomsky, though, the pendulum swung back towards a rejection of observable data as the basis for linguistic statements. Chomsky’s position on observable data in general and corpus linguistics in particular is made clear in the following quotations:

Like most facts of interest and importance … information about the speaker-hearer’s competence … is neither presented for direct observation nor extractable from data by inductive procedures of any known sort.

(Chomsky 1965: 18)

Corpus Linguistics does not exist.

(Chomsky, in a 2000 interview with Bas Aarts; see Aarts 2001)

The contrast between this position and the theoretical assumptions of corpus linguistics is obvious: corpus linguistics represents a definite shift towards a linguistics of parole; the focus is on ‘performance’ rather than ‘competence’. The linguist aims to describe language use rather than identify linguistic universals. The quantitative element (frequency of occurrence) is considered very significant and, depending on the specific approach, is taken to determine the categories of description.

The idea of a corpus grew in the 1960s, deriving mainly from the tradition of lexicography (Francis 1992). Dr Johnson, the reference point for English lexicography, used for his examples sentences quoted from great scholars like Hume; but his focus was on the meaning of the words in use, and not on the ideas expressed in the sentences. Well-known writers were cited because they were authority figures in a prescriptive tradition. But as well as gathering the words of the great and famous, another tradition of scholarship that grew with modern linguistics was that of the field linguists, who spread all over the world, reaching ever more remote communities and building up records of the languages – usually spoken – that they found. Their informants were in the main quite ordinary people, and their conversations equally ordinary.

The modern corpus was mainly based on these prior methods of acquiring data for language study. Nevertheless, the thought of compiling a collection of texts which would provide sound evidence of the state of a language was new. Instead of capturing the great signs of culture, the early corpora had modest aims: to collect a good variety of language in use by fairly ordinary people in order to study better the grammar and vocabulary currently in use. As noted by Biber (this volume), an early example of corpus-based work is found in C. C. Fries’ grammars of written and spoken American English (1940 and 1952, respectively). By the end of the 1960s there existed a few small corpora, constructed on diverse principles.

The Survey of English Usage, led by Randolph Quirk from its inauguration in 1959, was an exception to the trend of the time. The Survey focused on the everyday linguistic interactions, spoken and written, of non-celebrities and accumulated a large database on file cards at University College London (see website). There were, however, no plans to computerise it until many years later.

2. The influence of technology in the development of corpora

It was not the linguistic climate but the technological one that stimulated the development of corpora. The electronic computer was on the horizon, and although the first computers were extremely difficult to work with, their great potential was correctly assessed from an early date. Computational work on texts began with Father Busa’s Index Thomisticus before 1950 (completed in 1978, see Busa 2000), continuing the scholarly tradition of making concordances to works of high status, but using the clerical potential of the computer. There is now a large library of electronic versions of literary, philosophical and religious texts, and the word corpus often stretches to cover some of these collections (see special corpora below).

The digitisation of a vast range of documents is of more recent origin. Starting with databases of legal and journalistic documents, the movement has grown in parallel with the access provided by the internet, and the world-wide web in particular.

The first electronic corpus of written language was the Brown Corpus, compiled in the 1960s at Brown University by Nelson Francis and Henry Kucera (Francis and Kucera 1964) and still very much in use (see website). This corpus contains a million words of American English from documents which had been published in the year 1961. Its design (see sample corpora below) became the standard for some years, and some thirty years later was repeated in the Frown Corpus (see diachronic corpora below).

Advances in technology also enabled the collection of spoken data through the invention of the tape recorder. Portable tape recorders were just coming on the market in the late 1950s, and speech could be played back again and again, and studied as a sound wave. A side-benefit of this activity was that speech events could be transcribed without using shorthand, and the first electronic corpus of spoken language was assembled at the University of Edinburgh in the years 1963–5 on Sinclair’s initiative (see Krishnamurthy 2004). It contained 166,000 words of informal conversations in English, recorded and transcribed. So the written and spoken corpora of the early 1960s were prepared over the same period (by Francis and Kucera on the one hand for the written language and by Sinclair on the other for the spoken), but the researchers were not initially aware of each other’s work.

The 1970s was a period of consolidation and modest spread to a number of languages and different types of corpus. Development was slow, mainly because of the state of the available technology. Computers were still calculating machines with small memories, and programming languages were not devised with the manipulation of character strings in mind. Nevertheless, this is the time when corpora in excess of one million words were assembled, when annotated corpora were first considered, and when a spoken corpus in a detailed phonological transcription appeared: the spoken section of the Survey of English Usage (see London–Lund website). All of these advances came from Scandinavia, where scholars such as Sture Allén, Knut Hofland, Stig Johansson and Jan Svartvik set the shape of mainstream corpus linguistics for a generation; they were not alone, however, and important corpus work was in progress on French, Hebrew and Frisian, among other projects. The first corpus of a special variety of a language was the Jiao Da English for Science and Technology (JDEST) corpus, compiled by Yang Huizhong in Shanghai around the same time (see JDEST corpus website).

Here again the timing was the result of technological advances. The invention of scanners improved access to the printed word enormously, and the growth of computer typesetting pushed the horizon out of sight. As Sinclair was fond of pointing out, by about 1990 linguistics had changed from a subject that was constrained by a scarcity of data to one that was confused by more data than the methodologies could cope with. Some may even claim that it has not yet come to terms with this abundance.

Certain classes of data are still scarce, and this is not likely to change in the very near future. Anything that is not available in electronic and alphanumeric form has still to undergo skilled and expensive processing at the input stage. The sound wave is still not amenable to automatic linguistic interpretation despite some successes in the field of speech recognition. Handwritten material has to be transcribed, and a lot of older printed material resists the best scanners. Meanwhile advances in graphics and the emergence of animated text and mixed media communication have set new descriptive goals for which linguistics was ill-prepared.

The large corpora of today often privilege material from an essentially unlimited source – journalism. This feature maintains the controversies about ‘balance’ and ‘representativeness’ which have been important issues since computer typesetting became almost universal. There is a clear risk that some features presented as characteristic of a language are actually characteristic mainly of its journalism. More recently the growth of electronic communication has given rise to several new and equally abundant sources, notably web pages, e-mail and blogging. All of these are uncharted territories whose communicative properties are, at the time of writing, largely unknown.

To conclude this section we could say that, in a rough-and-ready way, the relatively brief progress of electronic corpus building and availability can be seen as falling into three stages, or ‘generations’ (Tognini Bonelli and Sinclair 2006: 208):

(a) The first twenty years, c. 1960–80; learning how to build and maintain corpora of up to a million words; no material available in electronic form, so everything has to be transliterated on a keyboard.

(b) The second twenty years, 1980–2000; divisible into two decades:

(i) The eighties, the decade of the scanner, when even the early scanners make a target of twenty million words realistic.

(ii) The nineties, the First Serendipity, when text becomes available as the by-product of computer typesetting, adding another order of magnitude to the target size of corpora.

(c) The new millennium, and the Second Serendipity, when text that never existed as hard copy becomes available in unlimited quantities from the internet.

3. A quantitative and a qualitative revolution

The technological advances outlined above are tightly interwoven with the emergence of corpus linguistics as a discipline, and the progressive penetration of the computer into corpus linguistics work needs to be considered further. In the first stage the computer was seen simply as a tool: it was used to process, in real time, a quantity of information that could hardly have been envisaged a few years before. This is still the most impressive contribution of corpora to language research. But in changing the dimension of the evidence, the computer reached a second stage of penetration into linguistic work: not only was it providing an abundance of new evidence, it was by its nature affecting the methodological frame of enquiry by speeding it up, systematising it, and making it applicable in real time to ever larger amounts of data. So Leech (1992) saw that there was now a distinctive methodology associated with corpus work; while a corpus was little more than a big collection of evidence, one approached it in a different way from the perusal of separate texts. The means of retrieving information were becoming more and more sophisticated, and the results required more and more skilled interpretation.

Leech (ibid.) drew a clear distinction between corpus linguistics and such varieties as sociolinguistics and psycholinguistics, which, although manifestly hybrid, were regarded as disciplines in their own right; the corpus advanced the methodology but did not change the categorial map drawn by linguistic theory. For many linguists, that is the limit of the changes brought about by the addition of corpora to the computational toolkit. The computer as a tool was not expected to upturn the theoretical assumptions behind the original enquiries themselves, and so no such effect was expected. It was not even felt necessary to look out for such fundamental changes.

However, we can now show that a further development took place in the 1990s: a third stage of penetration. What started as a methodological enhancement accompanied by a quantitative explosion (I am referring here to the quantity of data processed thanks to the aid of the computer) has turned out to be a theoretical and qualitative revolution, in that it has offered insights into language that have shaken the underlying assumptions behind many well-established theoretical positions in the field.

Writing a little after Leech, Halliday foresaw the signs of a qualitative change in the results of the quantitative studies opened up by corpus research. He warned that not only language but semiotic systems in general would be affected by this new proximity of theory and data (Halliday and James 1993: 1–25). This is clearly a stage beyond methodology.

Others expressed similar points of view around this time; one theme concerned the effect of increasing the arithmetic power under the command of the linguist. Clear (1993) pointed out the connection between the use of computational, and consequently algorithmic and statistical, methods and the qualitative change in the resulting observations. Not only could language researchers speed up the process of analysis, they could carry out procedures which were simply not feasible before computers became available. The difference of scale led to a qualitative difference in the observations. It is strange to imagine that just more data and better counting could trigger philosophical repositionings, but that indeed is what has happened.

Saussure’s famous words, ‘c’est le point de vue qui crée l’objet’ (it is the viewpoint which creates the object), can be reinterpreted in the light of this turn of events: if the dimensions of the viewpoint change as they did, and with them the granularity of the research results, the object created is substantially different from before. What we have witnessed in the development of corpus linguistics as a discipline is that our chosen methodological standpoint has progressively determined both the object and the aim of the enquiry. In other words, in this instance, the methodology has ended up defining the domain of the discipline.

Given these premises, we should note a few points. Linguistic data are now available in such large quantities that patterns emerge that could not be seen before. In the debate centred around the issues arising in corpus linguistics there is a lot of talk of ‘the web as a corpus’ and of the explosion of information that affects corpus building. The change in the quality of evidence is now obvious to most scholars, and observations about instances of language use systematically affect statements about the language system in general. The problem for the linguist has shifted from accessing large enough quantities of data to elaborating a reliable methodology to describe and take into account this type of unprecedented evidence. This is what a scholar like Sinclair observed (Sinclair 1991: 1ff.), and much of his theoretical work on the definition of units of meaning for language description amply proves this point.

Halliday’s point about the converging paths of theory and data raises question marks around some of the most familiar dichotomies of modern linguistics, in particular ‘competence’ and ‘performance’. Although this separation between linguistic theory and linguistic evidence could be seen as a methodological convenience, it certainly cuts right across the descriptive framework which is necessary for deriving linguistic information from corpora. While before corpora came along such buffers were needed because no way could be envisaged of accounting directly for all the evidence, corpus work offers no reason or motivation for selecting some evidence and ignoring the rest. The theoretical statement derived from corpus evidence, especially nowadays when large corpora are at everybody’s disposal, has to start from new presuppositions. This point has been discussed in detail in Tognini Bonelli (2001) and was at the basis of the distinction between corpus-based and corpus-driven linguistics.

4. The theoretical shift from text-linguistics to corpus linguistics

There is another point that is worth noting. Given that a corpus is a collection of texts, the aim of corpus linguistics has rightly been seen as the analysis and description of language use, as realised in text(s). Corpus linguistics started from the same premises as text-linguistics, in that texts were assumed to be the main vehicle for the creation of meaning. The question that arose, however, was: could corpus evidence be evaluated in the same way as a text was evaluated? This issue is in no way resolved. Different scholars continue to approach it in different ways, and there are still those who advocate that, in order to understand and evaluate corpus data better, the analyst has to have direct and full access to the individual texts at any point in time. Most scholars, however, now accept that, in spite of the initial starting point which corpus and text share, the two approaches are fundamentally and qualitatively different from several points of view (summarised in Table 2.1).

Within the Firthian framework of a contextual theory of meaning, the text has been seen in a unique communicative context as a single, unified language event mediating between two (sets of) participants. The switch in focus from a text-linguistic perspective to a corpus-linguistic one has brought about a different approach. The corpus is not ‘just like a text, only more of it’. It brings together many different texts and therefore cannot be identified with a unique and coherent communicative event; the citations in a corpus – expandable from the Key Word in Context (KWIC) format to include any number of words – remain fragments of texts and lose out on the integrity of the text (for a detailed treatment of KWIC searches, see Tribble, this volume). The significant elements in a corpus become the patterns of repetition and patterns of co-selection. In other words, in corpus linguistics it is the frequency of occurrence that takes pride of place.

This difference entails a different ‘reading’ of a corpus compared to one of a text (Tognini Bonelli 2001): the text is to be read horizontally, from left to right in the case of English and other western languages, paying attention to the boundaries between larger units such as clauses, sentences and paragraphs, possible markers of the macrostructure. A corpus, on the other hand, examined at first in KWIC format with the node word aligned in the centre, is read vertically, scanning for the repeated patterns present in the context of the node.
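To make the two reading directions concrete, here is a minimal sketch of a KWIC concordancer; it is an illustration only, with invented names, and it assumes the corpus is a plain string that can be tokenised naively.

```python
import re

def kwic(text, node, width=4):
    """Print a Key Word in Context concordance: every occurrence
    of the node word, centre-aligned, with `width` words of co-text
    on either side."""
    tokens = re.findall(r"\w+", text.lower())
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            print(f"{left:>35} | {node} | {right}")

# Reading 'vertically': the node is fixed in the central column, so
# repeated patterns of co-selection to its left and right stand out.
kwic("He saw it with the naked eye. To the naked eye it looks flat.", "naked")
```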

Furthermore, the text has a function which is realised in a verbal context, but also extends to a situational and a wider cultural context. It is interpreted by looking at the functions it has as a communicative event. The corpus, on the other hand, does not have a unique function, apart from the one of being a sample of the language gathered for linguistic analysis; the parameters for corpus analysis are above all formal.

The type of information one draws from a text is interpreted as meaningful in relation to both verbal and non-verbal actions in the context in which they occur – and the consequences of such actions. The type of information gathered from a corpus is evaluated as meaningful in so far as it can be generalised to the language as a whole, but with no direct connection with a specific instance.

Using Saussurean terminology, one can conclude by saying that a text is an instance of parole, while the patterns shown up by corpus evidence yield insights into langue.

The series of contrasts between corpus and text outlined above has the purpose of differentiating two sources of evidence that may appear similar but entail very different analytical steps. A corpus contains text evidence and therefore, given a different methodological framework of analysis, it yields insights into the specific text as well. The corpus, in fact, is in a position to offer the analyst a privileged viewpoint on the evidence, since it has become possible to access simultaneously the individual instance, which can be read and expanded on the horizontal axis of the concordance, and the social practice retrievable in the repeated patterns of co-selection on the vertical axis of the concordance. Here, frequency of occurrence is indicative of frequency of use, and this gives a good basis for evaluating the profile of a specific word, structure or expression in relation to a norm. The horizontal axis also portrays, at the local level, the syntagmatic patterning, while the vertical axis yields the paradigmatic availability: the choice available to a speaker or a writer at a given point and within a certain language system.
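The vertical axis can also be quantified rather than merely scanned. The sketch below (same caveats as above: illustrative names, naive tokenisation) counts the word forms co-selected within a fixed span of the node and ranks them by frequency – a first step towards profiling a word against a norm.

```python
from collections import Counter
import re

def collocates(text, node, span=3):
    """Count word forms occurring within `span` tokens of the node -
    the repeated co-selections visible down the vertical axis."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts.most_common()
```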

5. Corpus typology

Here we briefly consider some of the most common types of corpora available at the time of writing and their different characteristics. There are now so many corpora for so many purposes that it is impossible to list them, and only a sketchy classification can be attempted (see Lee, this volume, who provides a detailed coverage of corpora available and an alternative typology into which they are organised). The typology offered here was originally proposed in the course of an EU project (see EAGLES website, also Tognini Bonelli and Sinclair 2006).

Sample corpora

Most corpora are ‘snapshots’ in time, and as such they are samples of a given language at a given moment. They are often referred to as sample corpora. The primary aim of a sample corpus is to present the normal linguistic features of a language or variety in approximately the proportions found in general use. The Brown Corpus, the original sample corpus, is still a very clear example of the way in which such a corpus is constructed. It is divided into Informative and Imaginative prose, then into fifteen subcategories of these, and finally into 500 samples each containing approximately 2,000 words. Although sample corpora may have been amassed over a period and deal with texts that cover twenty years or more, their aim is to present a state-of-a-language, with the time dimension frozen. Their main drawback is that they fall out of date rather quickly. Monitor corpora and diachronic corpora (see below) introduce the time dimension as a design factor in a corpus.

When a sample corpus claims to be a reasonably reliable repository of all the features of a language, it can be called a reference corpus. Nowadays it will have to be quite large – 100 million words is the typical size – and it will contain substantial amounts of all the main kinds of language that are found in a society. Language, spoken and written, public and private, informative and fictional, etc., will all be there.

Corpora for comparison

Two or more corpora can be designated comparable when they are built on the same design criteria and are of similar size (see chapter by Kenning, this volume, for extensive coverage of comparable corpora). Although the term was first used to designate a variety of multilingual corpora (see below), corpora which were designed to be compared with each other had already been compiled in the monolingual area.

Geographical

The first corpus designed to be comparable was the Lancaster–Oslo–Bergen (LOB) corpus. It was given the same design as the Brown Corpus, but selected from British English, whereas the Brown Corpus selected from American English. All the texts in both corpora were printed during the year 1961, and LOB was completed in 1978. The availability of this comparable resource led to a landmark publication in corpus linguistics – Hofland and Johansson (1982).

The Survey of English Usage (SEU) developed into a project to build a group of comparable corpora which would also facilitate research into language variety, this time on a global basis – the International Corpus of English (ICE). Launched in 1990, the plan was to explore the world-wide varieties of English by building corpora of local varieties in many regions, using a design based on the Survey’s holdings (see Lee, this volume). Each corpus contains only around a million words, but twenty varieties are under investigation, and six of the ICE corpora are now available (for details see ICE website).

Historical

These corpora are designed to be compared along a time dimension – see below.

Topic

Some students of discourse like to collect texts organised by topic – e.g. a number of newspaper reports of the same event. These tend, however, to be very small.

Contrastive

In principle any pair of corpora can be compared and contrasted. But sometimes a corpus is built with the specific purpose of making contrasts within itself and declares at the outset that certain of its major components are expected to contrast sharply. A contrastive corpus is thus a single corpus whose principal components have been chosen to facilitate the study of variety. Whereas sample corpora are mostly designed so that the various differences among varieties blend into a general picture of the patterns of usage, a contrastive corpus is designed to bring out the characteristic patterning of distinguishable varieties. Clearly this is a matter of degree. Just as any two corpora can be contrasted, any corpus with distinguishable components can be examined contrastively. But there are internal differences. Each principal component of a contrastive corpus is designed without reference to the others, so the comparability that is maintained in, say, Brown and LOB, is not found. An example of a contrastive corpus is that compiled by Biber, contrasting four varieties of modern English, which provides the evidence for a substantial grammar (Biber et al. 1999).
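At its simplest, examining a corpus contrastively means comparing frequencies across its principal components after normalising for their different sizes. A small sketch under that assumption (the figures and component names are invented, not drawn from any of the corpora mentioned here):

```python
def per_million(count, corpus_size):
    """Normalise a raw frequency to occurrences per million words,
    so that components of different sizes can be compared."""
    return count * 1_000_000 / corpus_size

# Hypothetical counts of one word in two components of a single corpus:
# 312 hits in 4.5m words of conversation, 95 hits in 5.2m words of news.
conv = per_million(312, 4_500_000)   # ~69 per million
news = per_million(95, 5_200_000)    # ~18 per million
print(f"conversation: {conv:.1f}/m, news: {news:.1f}/m, ratio {conv/news:.1f}")
```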

Special corpora

The term corpus has usually been associated with the study of language in and for itself and, as mentioned above, this was later narrowed to feature the ordinary, everyday language of normal people going about their daily business. Such corpora, it is assumed, will be a repository of evidence about a language or variety in its natural state; repeated patterns of occurrence in such corpora will provide evidence which is independent of the idiolect of one particular informant. If several apparently unconnected people are observed constructing patterns with strong similarities to each other, such patterns are likely to be patterns of langue, the shared area of meaning-creation in a speech community.

There are, however, collections of texts which do not provide this kind of evidence, but which are still referred to as types of corpora. The selection method, or the pool of texts from which the selection is made, is not designed to be representative of a language or variety. Many of these are important collections, and to mark both their importance and their difference from ‘ordinary language’ corpora they are called special corpora.

The works of Shakespeare, Goethe or Proust are not reliable exemplars of contemporary English, German or French; they are chosen not for their ordinariness but for their extraordinariness. It is the uniqueness of their language choices, and not the universality of them, that causes them to be collected.

Special corpora have been in use for many years, and the advent of computers offered a powerful platform for their study; as noted above, the first project to transfer language text into electronic form concerned the works of St Thomas Aquinas. One of the earliest corpora available to researchers was the Leuven Drama Corpus and Frequency List, made by L. K. Engels and colleagues at the University of Leuven in 1975 (see Goethals et al. 1990). This pioneering corpus consisted of sixty-one plays written between 1965 and 1972, in standard English and in ordinary spelling. It was the same size as Brown, one million words, and part of its rationale was to offer an imitation of spoken dialogue to complement the exclusively written genres of the Brown Corpus.

Corpora along the time dimension

Corpora which include the time dimension as a design feature are not very common, and are of two kinds. Diachronic corpora present ‘snapshots’ at intervals of time, usually spanning at least a generation, while monitor corpora are devised so that language change can be plotted as it occurs.

The first diachronic corpus was the Helsinki Corpus, which offers exemplars of English texts from c.750 to c.1700 (see website). Since the first sample corpora come from the 1960s, it is now feasible to replicate their design with current texts and make detailed comparisons. There is now the Frown corpus, of similar architecture to its parent Brown but with a time interval of thirty years, so that it records the American printed English of a generation later than the Brown. Frown is thus designed to be comparable; it and its sister corpus Flob were assembled by Christian Mair at the University of Freiburg (see website), aiming to be updated versions of the Brown and LOB corpora respectively, and encouraging four-way comparisons along the geographical and time dimensions.

These corpora are restricted to printed texts. For the spoken language the diachronic dimension is restricted to the period since the invention of sound recording techniques. At the time of writing, a recently completed venture is the Diachronic Corpus of Present-Day Spoken English (DCPSE), a corpus which presents two spoken corpora dated a generation apart in a format which facilitates comparison. The corpora are the London–Lund Corpus, which is the spoken element of the Survey of English Usage, recorded in the 1960s, and the ICE-GB Corpus, recorded in the 1990s. For details see the DCPSE website.

Monitor corpora were originally put forward by Sinclair (1982) (see also Clear 1987) as attempts to keep synchronic corpora up to date. Instead of discarding material that was to be replaced with more recent exemplars, it was more prudent to retain it with a time tag, and thus add a diachronic dimension to the corpus. Because of the rapidly growing size of corpora, it required the uniformity and the substantial dimensions of newspaper publishing to make monitor corpora feasible. After several years of development, they are still in a provisional state because their intrinsic importance is not as yet fully recognised. The first attempt was the Aviator project, which layered an annual ten million words of The Times newspaper and devised software that would detect innovations of various kinds (see AVIATOR website).
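The chapter gives no detail of the AVIATOR software, but the general logic of a monitor corpus can be sketched: each annual layer carries a time tag, and candidate innovations are forms that recur in the newest layer while being absent from the accumulated earlier ones. The threshold and names below are placeholders, not a description of the actual system.

```python
from collections import Counter

def candidate_innovations(old_layers, new_layer, min_freq=5):
    """Flag word forms that recur in the newest annual layer of a
    monitor corpus but never appear in the earlier layers.
    `old_layers` is a list of token lists (one per time-tagged layer);
    `new_layer` is the latest layer's token list; `min_freq` is an
    arbitrary recurrence threshold."""
    seen_before = Counter()
    for layer in old_layers:
        seen_before.update(layer)
    new_counts = Counter(new_layer)
    return sorted(w for w, f in new_counts.items()
                  if f >= min_freq and seen_before[w] == 0)
```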

Bilingual and multilingual corpora

The development of bilingual and multilingual corpora was initially tied to the perceived needs of mechanical translation, the original goal of computational linguistics (Kay 2003, see also chapters by Kübler and Aston, and Kenning, this volume). The Canadian Parliament was both bilingual and technologically aware, and from about 1980 it was possible to obtain the Proceedings in both English and French, in electronic form. The first parallel corpus (see website, Canadian Hansard) began to take shape.

Aligning the two language versions, sentence by sentence where possible, worked reasonably well for a wide range of bureaucratic, administrative and legal documents, and encouraged the building of suitable corpora, for example in the European Union. However, the procedure relied on one particular kind of close translation, where translation equivalence is virtually guaranteed for units as small as the sentence. Other kinds of translation work to less rigid criteria, and offer a ‘freer’ translation which is claimed to be less dependent on the source text and more faithful to the phraseological conventions of the target language (see Tognini Bonelli and Manca 2002, and chapters by Kübler and Aston, and Kenning, this volume).
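Sentence alignment for such close translations is commonly done with length-based methods in the spirit of Gale and Church: sentences of proportional length are assumed to correspond, and a dynamic programme chooses among 1:1, 1:0, 0:1, 2:1 and 1:2 matches. The sketch below is a deliberate simplification for illustration, not the procedure actually used on the Hansard material; all names and cost weights are invented.

```python
def align(src, tgt):
    """Align two lists of sentences by minimising a length-mismatch
    cost, allowing 1:1, 1:0, 0:1, 2:1 and 1:2 matches."""
    INF = float("inf")
    SKIP = 10.0  # penalty for leaving a sentence unmatched

    def cost(s_chunk, t_chunk):
        ls = sum(len(s) for s in s_chunk)
        lt = sum(len(t) for t in t_chunk)
        if ls == 0 or lt == 0:
            return SKIP
        # cost grows with deviation from proportional length,
        # plus a small penalty for non-1:1 matches
        return 10 * abs(ls - lt) / max(ls, lt) + 0.5 * (len(s_chunk) + len(t_chunk) - 2)

    n, m = len(src), len(tgt)
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj in moves:
                if i + di <= n and j + dj <= m:
                    c = best[i][j] + cost(src[i:i + di], tgt[j:j + dj])
                    if c < best[i + di][j + dj]:
                        best[i + di][j + dj] = c
                        back[i + di][j + dj] = (i, j)
    # recover the alignment path from the back-pointers
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return list(reversed(pairs))

# usage: pairs = align(english_sentences, french_sentences)
# each pair is (list of source sentences, list of target sentences)
```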

Multilingual corpora based on a relationship of translation among the constituent texts are called parallel corpora (see Kenning, this volume). One concern for researchers in the 1990s was that translated text might not be a good specimen of the target language in its normal usage, and this led to two developments.

The multilingual comparable corpus

Here the correspondence among participating languages is at the level of design rather than at the level of the choice of actual texts. As far as possible, given cultural differences, the same quantities of the same types of texts would be assembled. Usually all the texts are originals in the language they represent, to avoid any influence of the translation. Translation equivalence at the level of the sentence is thus not possible, but a more complex network can be built up to aid translation (see Sinclair 1996). The comparable corpus has a wide range of applications, and the PAROLE Corpus (see ELDA website) is an instance of its application to all the official languages of the European Union at the time (see Kenning, this volume).

The contrastive corpus

This refers to a corpus consisting of two sub-corpora, one of translated texts and the other of untranslated texts, in the same language. There is no translation relationship between individual texts, because this corpus is designed to pursue Baker’s hypothesis (Baker 1993: 245) that some linguistic features are characteristic of translated text, regardless of the source language. Such a corpus is therefore classified as contrastive. Formerly the general opinion was that where features of translated text differed from text that was independently composed in the same language, they would be explicable with reference to features of the source language (see also e.g. Kenny 1998: 52 for a classification within Translation Studies, and chapters by Kübler and Aston, and Kenning, this volume).

Other variations on parallel and comparable corpora include the English–Norwegian Parallel Corpus (a standard reference point around the turn of the millennium, which has since broadened out into the Oslo Multilingual Corpus) and the Corpus of Translated Finnish (CTF), a comparable corpus of ten million words altogether, consisting both of translations into Finnish in several genres and of comparable texts originally written in Finnish.

Normativeness

The first corpora were deliberate attempts to record the normal usage of members of a language-using community. There was an implication that most if not all of the authors and speakers would be native speakers, and probably fairly accomplished native speakers, of the language concerned. The Brown Corpus was called a ‘standard’ corpus, and by being restricted to printed documents it acquired all the features of standardisation that are provided in the printing and publishing process. As corpora became more easily available, their use in providing models for language learners was appreciated; this has been a controversial area but corpora are now widely accepted despite some doubts concerning their reliability as repositories of ‘correct’ sentences.

One of the first multi-million-word corpora, the Birmingham Collection of English Texts (Sinclair 1987), the precursor of the Bank of English (see website), specifically set out to be normative, with adult native speakers as the originators of the texts and advanced foreign learners as the recipients of the dictionaries, grammars and other publications that were produced from the corpus evidence.

Normativeness is not an easy feature to realise in spoken text (see below), but is built into the normal practices of legal and administrative transcription – court reporting, Hansard (the official transcript of the UK parliament’s proceedings) and the like. These, on the other hand, were among the first electronic representations of spoken language that became available. Normalisation is of course absent from corpora of dialectal material, spoken or written.

The possibility of corpus building has led to reappraisals of standards, targets and models for non-native users of languages and ex-colonial varieties, which had generally low status. Following political independence, there was no reason for the native speaker to reign supreme when corpora could be gathered of high-quality users of established local varieties. See, for example, the recent interest in English as a Lingua Franca (see ELFA website). Reference has already been made to the ICE project (see website), which may play a normative role in the future, though that is not seen as its principal function.

Non-native speaker corpora

Corpora in the context of language learning have found other uses, and one popular application is corpora of learner language. By gathering instances of the usage of learners and comparing these with normative model corpora, the language of learners can be explored in a much more profound way than the previous work on error analysis was able to do. The main work in Europe on learner corpora is at the University of Louvain in Belgium (see CECL website), but there are many other learner corpus projects around the world (see Nesselhauf 2004; see also Gilquin and Granger, this volume).

Spoken corpora

There are now many different kinds of collections of spoken language, whose electronic representations follow a host of different conventions (see Adolphs and Knight, this volume). One group, called speech corpora, is built for the detailed study of individual sounds and phonetic features; such corpora do not need to be continuous texts, nor do they need to be collected in natural situations (see Lee, this volume). Many are recorded in high-tech laboratories in order to maximise the detail on the recorded version. Speech corpora are thus part of the class of special corpora (see above), those which for one reason or another are not intended to be representative of the ordinary language in its characteristic use as communication.

Large modern reference corpora usually incorporate some spoken material – typically around 10 per cent – and this is in a simple orthographic transcription so that the query software will retrieve instances of a word or phrase in both spoken and written texts. But there is a growing number of spoken corpora which are adapting to the particular character of spoken language and offering new routes to study (see Lee, this volume).

Throughout the period of access to corpora, and despite the prominence of the written corpus from Brown onwards, there has been a high value placed on the recording of impromptu language, typically conversations involving a few people known to each other. While the more formal kinds of spoken language – lectures, speeches, etc. – seemed reasonably familiar as minor variations of the written norms, there was a novel quality about capturing informal interaction that opened up the study of discourse and made it absolutely clear that the received grammars were not constructed to cope with the meaningful patterns that emerged from the conversations; they were heavily biased in favour of written language of medium formality.

Renewed interest in the structure of the spoken language (Carter and McCarthy 2006; Stenström et al. 2002) has given us new and valuable corpora. The MICASE corpus of academic American English is freely available (see website), and the C-ORAL-ROM project has published a set of comparable corpora of spontaneous speech for the principal Romance languages, French, Italian, Portuguese and Spanish, about 1,200,000 words in all (Cresti and Moneglia 2005; see also website). A particularly valuable feature of this product is text-to-speech synchronisation (see Adolphs and Knight, this volume).

The internet accepts any type of text in any language, and some of the new varieties, such as e-mail, chatrooms and blogging, are mostly informal and subject to no conventions of usage. To some observers they seem to be close in style to ordinary spoken conversation, being a kind of written conversation and often conducted in real time. This has led to optimistic claims that the internet could overcome the technical hitches that prevent researchers from accessing large quantities of speech; however, the claim has not yet been validated, and speaking remains probably the most intricate of human activities.

Further reading

Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge University Press. (An excellent introduction to the use of corpora in Applied Linguistics with useful chapters on corpus data interpretation, and plenty of examples.)

Sinclair, J. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press. (Still the best way into an inductive methodology and corpus-driven linguistics.)

Tognini Bonelli, E. (2001) Corpus Linguistics at Work. Amsterdam and Philadelphia, PA: John Benjamins. (Following in the footsteps of the above, looking at theoretical, methodological and applied issues with worked examples from both English and Italian.)

Websites

David Lee’s website: http://www.devoted.to/corpora

The Corpora list website: http://gandalf.aksis.uib.no/corpora/

References

Aarts, B. (2001) ‘Corpus Linguistics, Chomsky and Fuzzy Tree Fragments’, in C. Mair and M. Hundt (eds) Corpus Linguistics and Linguistic Theory. Amsterdam: Rodopi, pp. 5–13. (Reprinted in W. Teubert and R. Krishnamurthy (eds) (2007) Corpus Linguistics: Critical Concepts in Linguistics, volume 1, London: Routledge, pp. 173–81.)

Baker, M. (1993) ‘Corpus Linguistics and Translation Studies’, in M. Baker, G. Francis and E. Tognini Bonelli (eds) Text and Technology. In Honour of John Sinclair. Amsterdam: John Benjamins, pp. 233–50.

——(ed.) (1998) Routledge Encyclopedia of Translation Studies. London and New York: Routledge.

Baker, M., Francis, G. and Tognini Bonelli, E. (1993) Text and Technology. In Honour of John Sinclair. Amsterdam: John Benjamins.

Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999) The Longman Grammar of Spoken and Written English. Harlow: Pearson Education.

Busa, R. (2000) ‘Half a Century of Literary Computing: Towards a “New” Philology’, available at www.uni-tuebingen.de/zdv/tustep/kolloq.html#50

Carter, R. A. and McCarthy, M. J. (2006) Cambridge Grammar of English: A Comprehensive Guide to Spoken and Written Grammar and Usage. Cambridge: Cambridge University Press.

Chomsky, N. (1965) Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.

Clear, J. (1987) ‘Trawling the Language: Monitor Corpora’, in M. Snell-Hornby (ed.) ZURILEX Proceedings. Tübingen: Francke.

——(1993) ‘From Firth Principles – Computational Tools for the Study of Collocation’, in M. Baker, G. Francis and E. Tognini Bonelli (eds) Text and Technology. In Honour of John Sinclair. Amsterdam: John Benjamins, pp. 271–92.

Cresti, E. and Moneglia, M. (2005) C-ORAL-ROM. Integrated Reference Corpora for Spoken Romance Languages. Amsterdam: John Benjamins.

Francis, W. (1992) ‘Language Corpora B.C.’, in J. Svartvik (ed.) Trends in Linguistics. Studies and Monographs 65. Berlin and New York: Mouton de Gruyter, pp. 17–32.

Francis, W. and Kucera, H. (1964) Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for Use with Digital Computers. Providence, RI: Brown University, Department of Linguistics (revised 1971; revised and amplified 1979).

Fries, C. (1952) The Structure of English: An Introduction to the Construction of Sentences. New York: Harcourt-Brace.

Fries, C. and Traver, A. (1940) English Word Lists: A Study of their Adaptability for Instruction. Washington, DC: American Council on Education.

Goethals, M., Engels, L. K. and Leenders, T. (1990) ‘Automated Analysis of the Vocabulary of English Texts and Generation of Practice Materials. From Main Frame to PC, the Way to the Teacher’s Electronic Desk’, in M. A. K. Halliday, J. Gibbons and H. Nicholas (eds) Learning, Keeping and Using Language. Selected Papers from the 8th World Congress of Applied Linguistics, Sydney, 16–21 August 1987, Vol. 2. Amsterdam: John Benjamins, pp. 231–68.

Halliday, M. A. K. (1992) ‘Language as System and Language as Instance: The Corpus as a Theoretical Construct’, in J. Svartvik (ed.) Directions in Corpus Linguistics: Proceedings of the Nobel Symposium 82. Berlin and New York: Mouton de Gruyter, pp. 1–25.

Halliday, M. A. K. and James, Z. (1993) ‘A Quantitative Study of Polarity and Primary Tense in the English Finite Clause’, in J. M. Sinclair, M. Hoey and J. Fox (eds) Techniques of Description: Spoken and Written Discourse. London: Routledge.

Hofland, K. and Johansson, S. (1982) Word Frequencies in British and American English. London: Longman.

Johansson, S. (1995) ‘Mens Sana in Corpore Sano: On the Role of Corpora in Linguistic Research’, The European English Messenger IV(2): 19–25.

Kay, M. (2003) ‘Introduction’, in R. Mitkov (ed.) The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press, pp. xvii–xx.

Kenny, D. (1998) ‘Corpora in Translation Studies’, in M. Baker (ed.) Routledge Encyclopedia of Translation Studies. London and New York: Routledge, pp. 50–3.

Krishnamurthy, R. (ed.) (2004) English Collocation Studies. London: Continuum (new edition of J. Sinclair, S. Jones and R. Daley (1970) English Lexical Studies).

Leech, G. (1992) ‘Corpora and Theories of Linguistic Performance’, in J. Svartvik (ed.) Directions in Corpus Linguistics: Proceedings of the Nobel Symposium 82. Berlin and New York: Mouton de Gruyter, pp. 125–48.

Nesselhauf, N. (2004) Collocations in a Learner Corpus. Amsterdam: John Benjamins.

Saussure, F. de (1922) Cours de Linguistique Générale. Paris: Payot.

Sinclair, J. (1982) ‘Reflections on Computer Corpora in Linguistic Research’, in S. Johansson (ed.) Computer Corpora in English Language Research. Bergen: Norwegian Computing Centre for the Humanities, pp. 1–6.

——(ed.) (1987) Looking Up. London: HarperCollins.

——(1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press.

——(1996) ‘Corpus to Corpus: A Study of Translation Equivalence’, special issue of International Journal of Lexicography 9(3): 179–96.

——(ed.) (2004) How to Use Corpora for Language Teaching. Amsterdam: John Benjamins.

Sinclair, J., Payne, J. and Pérez Hernandez, C. (eds) (1996) ‘Corpus to Corpus: A Study of Translation Equivalence’, special issue of International Journal of Lexicography 9(3): 179–276.

Stenström, A.-B., Andersen, G. and Hasund, K. (2002) Trends in Teenage Talk. Amsterdam: John Benjamins.

Svartvik, J. (ed.) (1992) Directions in Corpus Linguistics: Proceedings of the Nobel Symposium 82. Berlin and New York: Mouton de Gruyter.

Tognini Bonelli, E. (2001) Corpus Linguistics at Work. Amsterdam and Philadelphia: John Benjamins.

Tognini Bonelli, E. and Manca, E. (2002) ‘Welcoming Children, Pets and Guests: Towards Functional Equivalence in the Languages of “Agriturismo” and “Farmhouse Holidays”’, Textus XV(2): 317–34.

Tognini Bonelli, E. and Sinclair, J. (2006) ‘Corpora’, in K. Brown (ed.) Encyclopedia of Language and Linguistics, second edition. Amsterdam: Elsevier, pp. 206–19.