Elizabeth Walter
Lexicography is a discipline characterised by an unusual degree of overlap between the commercial and the academic. Collaboration between linguists and dictionary publishing houses has made possible the creation of huge, state-of-the-art corpora and highly sophisticated tools of analysis which have transformed lexicography and had a huge influence on linguistics as a whole.
Any lexicographer working with the English language today must owe a huge debt to that evangelist of corpus lexicography, John Sinclair. While the American Heritage Dictionary of 1969 was the first to make use of corpus information, Sinclair and his team at Birmingham University, UK (in collaboration with Collins publishers), went far further in marrying the theory of large-scale corpus analysis with almost every aspect of the practice of lexicography in their groundbreaking COBUILD (Collins Birmingham University International Language Database) dictionary of 1987.
At the time, their work was highly controversial. Not surprisingly, publishers balked at the idea that building costly large corpora was necessary to the business of writing dictionaries, and some traditional lexicographers were unwilling to concede that their own intuition might not be adequate to describe the language in general. Twenty years later, however, the debate has been decisively won: lexicographers in all the major publishing houses have access to large corpora of both written and spoken (transcribed) texts, and their methods have spread into other areas of publishing, most notably that of grammars and language teaching materials (see chapters by Hughes and McCarten, this volume).
Before the advent of computerised corpora, most lexicographers used a system of citations, or ‘reading and marking’, where evidence of usages was found in texts and recorded in some way, often stored on index cards. This kind of evidence is still used today, because for some types of information it can usefully complement corpus evidence (see Section 3). However, as Sinclair points out, ‘Especially in lexicography, there is a marked contrast between the data collected by computer and that collected by human readers exercising their judgment on what should or should not be selected for inclusion in a dictionary’ (Sinclair 1991: 4). When Boris Becker crashed out of the Wimbledon tennis championship in 1987, the UK tabloid newspaper The Sun said he had been spending too much time ‘bonking’ (a mild, informal term for having sexual intercourse). Challenged on this, Becker replied that he did not know what was meant, as the word was not in his dictionary. Since it had been in wide use for some time by then, one can only assume that lexicographers had made some sort of judgement that this was not a ‘real’ word and did not merit dictionary status.
Corpus lexicography does not allow of this sort of personal prejudice, nor does it allow the lexicographer simply to ignore tricky usages. Each corpus line must either be accounted for or discarded for a valid reason. It is much more difficult to ignore a meaning that is in front of you in black and white than one which comes into your head, but which you can easily persuade yourself is actually an aberration.
Using corpora forces the lexicographer to become much more objective. Without a corpus, it is easy and tempting (often through no wish to deceive but rather a conviction that one is simply presenting the truth) to come up with evidence that supports a preconceived notion. Example sentences, for instance, will be invented to support the definition that is written (see Section 3). With a corpus, the process is entirely reversed: we must start with the evidence and match the description to it.
Corpora, therefore, help us to analyse language with objectivity, help us to identify senses of words and their relative frequency, and give us the evidence we need to support what we say about language in our dictionaries.
When it comes to corpus size, the general consensus seems to be that big is beautiful, and the increasing availability of text in electronic form has made the collection of huge amounts of data much more viable. Oxford and Cambridge University Presses, for instance, now both boast corpora of over a billion words. To see what this means in practical terms, it is useful to look at some examples. Table 31.1 shows frequency per ten million words (rounded to the nearest whole number) according to the Cambridge International Corpus.
Table 31.1 shows that even in a corpus of one million words, ‘effort’ is likely to occur around 148 times. However, ‘effortless’ – a word which would strike most native speakers as not at all obscure – would occur only once or twice. If this sample is representative, even in a twenty-million-word corpus, it may only occur around twenty-six times, which is probably not enough to give reliable information on features such as its typical collocation patterns.
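The arithmetic behind these projections can be sketched in a few lines of Python. The per-ten-million rate of thirteen for ‘effortless’ is an assumption, chosen to be consistent with the counts quoted above rather than taken from Table 31.1 itself.

```python
# Sketch: scaling a per-ten-million-word frequency to other corpus sizes.
def expected_occurrences(freq_per_10m, corpus_size_words):
    """Expected hits for a word in a corpus, given its rate per 10M words."""
    return freq_per_10m * corpus_size_words / 10_000_000

# 'effortless' at an assumed rate of 13 per ten million words:
print(expected_occurrences(13, 1_000_000))    # once or twice in a 1M-word corpus
print(expected_occurrences(13, 20_000_000))   # around 26 in a 20M-word corpus
```

The linear scaling is, of course, only an expectation; rare words are also more vulnerable to clustering in a handful of texts, which is one reason the source of citations matters (see the discussion of random selection below).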
Of course, quantity of data is not the only important characteristic. With the internet, it is possible now to gather almost unlimited amounts of text, but without some sort of monitoring and selection of content there is the danger that a very distorted picture of the language could emerge (see Biber 1993; Crowdy 1993; McEnery and Wilson 1996; Biber et al. 1998; McCarthy 1998; and chapters by Reppen, Nelson, Koester and Adolphs and Knight, this volume).
It is usual, therefore, when building a corpus, to aim at some notion of a ‘representative’ sample of the language. Debate rages as to what is meant by ‘representative’ and, indeed, whether such a notion has any academic or practical validity at all (see McEnery et al. 2006). In practice, corpus builders in dictionary publishing houses try to aim for a wide range of genres. It is likely that a template of genres would be drawn up and target numbers for each genre collected.
There are four fundamental divisions of genre for lexicographic corpora:
1. Written or spoken data? Written data most usually comprises texts that are available electronically, such as newspapers and novels, though other text may be scanned or re-keyed if particular required genres are not available. Spoken data consists of transcripts of speech ranging from general conversation to TV shows. The third edition of the Longman Dictionary of Contemporary English was the first to highlight the use of spoken corpora, giving small graphs comparing the frequencies of some words and phrases in written and spoken data.
2. Regional language variety? For English, the corpus builder must decide which varieties (e.g. British, US, Australian, Indian) to include, and in what proportions. Many other languages would have similar considerations where they are spoken in various countries or regions of the world.
3. Synchronic versus diachronic data? Synchronic data is contemporary or recent material, used for general, contemporary dictionaries. Diachronic data covers historical data, and may include words and usages that are now considered archaic or have fallen out of use altogether. There is a third form of corpus used by some publishers and researchers, which is known as a ‘monitor corpus’. A monitor corpus aims to track language change by constantly adding new material and comparing it to a stable ‘reference corpus’.
4. Native speaker or non-native speaker language? Several publishers, especially those which produce learner dictionaries, have corpora of texts produced by non-native speakers. This data comes most typically from exam scripts or student course work, and is transcribed electronically (see Section 4).
Within those major divisions, an almost infinite number of sub-genres is possible, and their choice will be determined partly by the type of dictionaries being produced and partly by practical considerations of the time and cost involved in their collection. Some examples of these sub-genres might be: broadsheet newspapers, tabloid newspapers, classic novels, modern novels, emails, personal letters, transcripts of service encounters (business-to-customer contact, for instance interactions in shops), transcripts of business meetings, text on product packaging, text messages, transcripts of radio or TV shows.
For lexicographers, the content of the corpus they use will depend on the type of dictionary they are writing. For a general dictionary such as one for learners of English, they will use general sources such as newspapers, novels and general conversations. More specialised lexicography requires more specialised sources. Most corpora will be structured in such a way that it is possible to pull out sub-corpora chosen from a range of criteria, such as subject type, language variety, source genre, age of speaker, etc. Alternatively, special subject corpora may be built, and used either independently from the main, general corpus, or be added to it. For example, when working on a business dictionary, lexicographers might choose to use a corpus built entirely from business-related texts.
One type of corpus that has been used in bilingual lexicography is that of the ‘multilingual aligned corpus’, otherwise known as the ‘parallel corpus’. This may either be a corpus of translated texts, such as the proceedings of the Canadian parliament in French and English, or collections of texts in different languages that attempt to replicate the same distribution of text types. The advantage of the former type is the exact correspondence of texts, but in the latter, the language is more natural because it does not contain translation (see chapters by Kenning, and Kübler and Aston, this volume). The use of such corpora for lexicography has been primarily for technical lexicons, while work on translation studies and contrastive linguistics feeds into general bilingual dictionaries more indirectly (see Hallebeek 2000).
In order to make use of the corpus data available to lexicographers, sophisticated analysis tools have been developed. This sort of software is often referred to as a ‘corpus query system’ (CQS).
The simplest tools are concordancing packages (see the following sub-section for a discussion of concordancing), and there are several of these commercially available (see Tribble, this volume). However, most dictionary publishers will either use tools developed in-house, or buy in a package from an outside supplier. This is because lexicographers want not simply concordances, but much more detailed ways of analysing the enormous amounts of corpus data at their disposal. A package which is currently used by several dictionary publishers is the Sketch Engine, developed by Adam Kilgarriff and his company Lexical Computing Ltd (see Kilgarriff et al. 2004). This is a web-based system which can use corpora in any language provided it has an appropriate level of linguistic mark-up, and some of its features are discussed in Section 3.
Any corpus query system relies on extensive processing of the data. The corpus will be ‘part-of-speech tagged’, meaning that every word in the corpus is assigned a part of speech. This is done using computerised systems, and with varying degrees of accuracy (see Section 5). It is also of great benefit if some sort of parsing can be accomplished, to show the relationship of words in context to one another.
In the early days of corpus lexicography, it was common for lexicographers to be presented with a ream of computer printout containing KWIC (keyword in context) concordance lines for the stretch of words they were working on. Although this was a great advance in the provision of evidence for lexicographers, and did enable them to spot senses and patterns that had previously been unrecorded, the sheer mass of unanalysed data was overwhelming for any reasonably common word. If there are several thousand lines for a single search word, a pattern that crops up only once every couple of hundred lines may be statistically significant, but beyond the capabilities of human brain power to calculate.
Once lexicographers were able to have concordance lines on their own computer screens, tools were developed to allow some manipulation of the data. The simplest, yet for lexicographers extremely useful, example is left and right sorting, where the concordance lines are presented alphabetically according to the word to the right or the left of the search word. This sorting serves to highlight common collocations, such as ‘strenuous efforts’ and ‘make … efforts’ in Table 31.2.
Common syntactic and grammatical patterns are also shown, such as the use of the infinitive in the right-sorted lines in Table 31.3.
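The sorting itself is straightforward to sketch. The concordance lines below are invented for illustration, not those of Tables 31.2 and 31.3; each is represented as a (left context, node word, right context) triple.

```python
# Sketch of left- and right-sorting of KWIC concordance lines.
lines = [
    ("they made strenuous", "efforts", "to reach an agreement"),
    ("despite all our", "efforts", "the plan failed"),
    ("renewed diplomatic", "efforts", "to end the conflict"),
]

def right_sort(kwic):
    """Sort by the first word to the right of the node word."""
    return sorted(kwic, key=lambda l: l[2].split()[0].lower())

def left_sort(kwic):
    """Sort by the word immediately to the left of the node word."""
    return sorted(kwic, key=lambda l: l[0].split()[-1].lower())

# Right-sorted display groups lines sharing the same following word:
for left, node, right in right_sort(lines):
    print(f"{left:>25} {node} {right}")
```

Left-sorting the same lines would group the premodifiers (‘diplomatic’, ‘strenuous’), which is how collocating adjectives surface to the eye.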
There are many other requirements of corpus analysis tools, of which some of the most important are:
• Lemmatisation. This means matching any form of a word, with all its inflections, to a base form, usually the form that would be found as a dictionary headword. Thus, a corpus search on ‘find’ can return data for the forms ‘find’, ‘finds’, ‘finding’ and ‘found’. Some tools also allow the user to exclude a particular inflection if it is not wanted. As well as going straight to an amalgamated search of all the inflections of a word, it can often be useful to a lexicographer to note the distribution of the inflections. For instance, when looking at the verb ‘flood’, we note that the form ‘flooded’ is by far the most common. (Figures from the Cambridge International Corpus show the following frequencies per ten million: ‘flood’, twenty-six; ‘flooded’, seventy; ‘flooding’, twenty-six; ‘floods’, seven.) This is because the verb is so often used in passive forms, a fact which is useful to record, particularly in a dictionary for learners of English. A related issue, and one on which it is extremely useful for the lexicographer to have corpus guidance, is that of when an inflection takes on a life of its own, for instance when a form with -ing or -ed (e.g. ‘worrying’, ‘boiled’) achieves adjective status. Sinclair (1991: 8) goes so far as to say: ‘There is a good case for arguing that each distinct form is potentially a unique lexical unit, and that forms should only be conflated into lemmas when their environments show a certain amount and type of similarity.’
• Part-of-speech tagging. Where a form has more than one part of speech (e.g. ‘hand’ verb or ‘hand’ noun), lexicographers will usually want to select one of them. The lemmatisation must work for different parts of speech too: for instance, the lexicographer will want to treat ‘houses’ the plural noun separately from ‘houses’ the third person singular present tense verb.
• Case sensitivity. In some cases, lexicographers may want to exclude capitalised forms of a word, or indeed search only for capitalised forms, so as to distinguish between ‘Labour’ (the UK political party) and ‘labour’ (work), or between ‘Polish’ (from Poland) and ‘polish’ (substance for cleaning furniture).
• Random selection of lines. The smaller the corpus, the more it can be distorted by a single unusual text, and similarly, the smaller the section of corpus looked at, the more likely the same thing is to occur. Some corpus systems get round this by making a randomised search for a number of lines (typically something like 1,000) the default search. Others have this facility as an option. It is always important to be aware of the source of a citation, especially when it appears to show something unusual, and to check that multiple instances of the same thing are not simply repeated uses in the same text.
• Multi-word unit searches. A good corpus query system will allow searches on multi-word units, though the limitations of automatic parsing may make searching for items containing very common de-lexicalised words difficult in some systems, for instance the particles in phrasal verbs (see Greaves and Warren, this volume).
• Expanding context. The default context for viewing a citation line is usually the width of a computer screen. Sometimes this is not enough to understand the context of the citation, so it is useful to be able to expand the citation as much as necessary.
• Excluding words below a minimum number of occurrences. This is a useful, practical refinement for lexicographers. Especially when working on a new dictionary, where decisions on inclusion are to be made, a lexicographer may want to ask the corpus to list all words in an alphabetical stretch which have a minimum number of occurrences in a certain number of words (ten million is a useful benchmark).
• Filtering. Within the corpus, texts may be coded by a number of features, such as age/sex of speaker/writer, date of text, genre of text, name of source. The lexicographer may wish to filter text on any of these features. For instance, when using a learner corpus (see Section 4), it would be common to filter on the level of the student or on their native language.
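Several of the requirements listed above (lemmatised search, case sensitivity and random selection of lines) can be illustrated in a toy Python sketch. The lemma table and token list are invented for the purpose; a real corpus query system would of course work over indexed, tagged data rather than a raw token list.

```python
import random

# Toy lemma table: every inflection of 'find' maps to the base form.
LEMMAS = {"find": "find", "finds": "find", "finding": "find", "found": "find"}

tokens = ["Finding", "the", "answer", "was", "hard", ";", "we", "found",
          "nothing", "but", "she", "finds", "it", "easy"]

def search(tokens, lemma, case_sensitive=False, sample=None, seed=0):
    """Return token indices whose lemma matches, optionally randomly sampled."""
    hits = []
    for i, tok in enumerate(tokens):
        key = tok if case_sensitive else tok.lower()
        if LEMMAS.get(key) == lemma:
            hits.append(i)
    # Random selection of lines: draw a fixed-size sample from the hits.
    if sample is not None and len(hits) > sample:
        hits = sorted(random.Random(seed).sample(hits, sample))
    return hits

print(search(tokens, "find"))                       # all inflections
print(search(tokens, "find", case_sensitive=True))  # excludes 'Finding'
```

Note how the case-sensitive search drops the sentence-initial ‘Finding’; in real data the lexicographer would more often use this switch to separate pairs like ‘Polish’/‘polish’.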
The early period of corpus use was a salutary one for lexicographers when it became clear how many senses of words in existing dictionaries were not attested.
Of course, the number of senses of any particular word which are covered in a dictionary, and the degree to which senses are split, depends on the target audience of the dictionary. A comparison of any two dictionaries of similar length and target audience quickly makes it apparent that sense division is not an exact science, and the lexicographer’s own views come into the decisions. It is sometimes said that there are two types of semanticist, the ‘splitters’ and the ‘lumpers’: the former feel that even fine shades of meaning denote separate senses, while the latter are more inclined to identify fewer core senses and to see minor deviations as resonating from that core meaning rather than constituting separate senses.
To give a practical example, Table 31.4 shows KWIC concordance lines for the word ‘production’, taken from the Cambridge International Corpus. These lines are in the order that they appeared from the random selection of lines by the CQS.
If we look at the entry for ‘production’ in the Oxford Advanced Learner’s Dictionary, seventh edition (OALD), we can see that lines 1, 2 and 7 correspond to their sense 1: ‘the process of growing or making food, goods, or materials’. Line 8 is covered by sense 2: ‘the quantity of goods that is produced’, and line 4 by sense 4: ‘a film, a play or a broadcast that is prepared for the public; the act of preparing a film, play, etc.’
This leaves lines 3 and 5. The only remaining sense in OALD is sense 3: ‘the act or process of making sth naturally’. A ‘lumper’ might argue that lines 3 and 5 belong here, since the sound of a voice is produced ‘naturally’. A ‘splitter’ on the other hand would probably argue that there is will and effort involved in the production of words and the production of a particular tone. The OALD sense gives as its example: ‘drugs to stimulate the production of hormones’, implying that the sense covers processes which happen without volition, and thus does not adequately describe lines 3 and 5, both of which sound entirely plausible to a native speaker ear.
The availability of plentiful corpus lines is likely to encourage splitters, as it is easy to find uses which are not unambiguously covered in existing dictionaries. The lexicographer then has to make a judgement about the validity of those lines, considering for instance whether the style and quality of the surrounding text make it a reliable source of evidence, and also whether this usage is attested elsewhere in the corpus.
Use of corpora has particularly changed the treatment of very polysemous words (words with many different meanings, such as ‘get’, ‘far’ or ‘set’). These words often have the added complication that their meaning may be rather delexical. Indeed, we now see in many dictionaries a different style of definition evolving to deal with some of these senses, often following formulae such as ‘used to express …’, or ‘used when you …’. These meanings are often not the ones which first come to mind. However, even with corpus evidence, it is still an enormous challenge to distil the senses of a very polysemous word into a coherent set of senses for a dictionary. The lines in Table 31.5, taken from the BNC, illustrate the complexity of such a word. Even in these few lines, it is extremely difficult to say where separate senses are being illustrated.
Collocation is the co-occurrence of words with a frequency that is much higher than it would be by chance. For example, we talk about ‘heavy traffic’ or say that we ‘commit a crime’. All good dictionaries for learners of English (and many dictionaries for learners of other languages) contain information on collocation, because it is a key aspect of fluent English, and often quite idiomatic and difficult to predict. The tools used today for identifying collocation build on the work of Church and Hanks, who began looking at the possibilities of statistical analysis of the co-occurrences of words and patterns in large corpora in the 1980s (see Church and Hanks 1990). They recognised that computer power could be used to spot significant patterns that the human brain could not.
In Section 2, we saw how right- and left-sorting can help to show up common collocations. However, where the collocating word is not adjacent to the headword, the collocation would not be highlighted in this way. Each of the following lines, for instance, illustrates the collocation of ‘reach’ with ‘compromise’:
I thought we would never be able to reach a compromise.
Eventually a compromise was reached.
We are hoping to be able to reach some sort of compromise.
The CQS needs to be able to spot these combinations wherever they occur. Church and Hanks (1990) devised systems that looked not only at words that occur adjacently, but at the co-occurrence of words in the same chunk of text – in practice usually a sentence – irrespective of their position in that text. They referred to this pattern of co-occurrence as ‘Mutual Information’, and began to produce statistical information on collocation that had not been previously available.
As well as simple frequency of co-occurrence, a system based on Mutual Information will look at the frequency of both the words involved. In basic terms, the less frequent the words, the less likely it would be that they would co-occur by chance, and therefore the more significant their co-occurrence is deemed to be. It will also probably look at the stability of the position of the collocating word in relation to the headword. In general, the more fixed the position (for instance if the collocate occurs two places before the headword in 90 per cent of occurrences), the stronger the collocation. All these factors are used to provide a list of the most significant collocations.
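In its simplest form, the Mutual Information score of Church and Hanks (1990) compares the observed co-occurrence frequency with what chance alone would predict. A minimal sketch follows, with invented counts for ‘reach’ and ‘compromise’ in a ten-million-word corpus; a real implementation would also weigh positional stability, as described above.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """MI(x, y) = log2( P(x,y) / (P(x) * P(y)) ), with each probability
    estimated as frequency / N for a corpus of N running words."""
    return math.log2((f_xy * n) / (f_x * f_y))

N = 10_000_000
# Suppose 'reach' and 'compromise' co-occur in the same sentence 120 times:
mi = mutual_information(f_xy=120, f_x=2_000, f_y=600, n=N)
print(round(mi, 2))  # log2(1000), roughly 9.97
```

The formula makes the point in the text concrete: halving the individual frequencies f_x and f_y while keeping the co-occurrence count fixed raises the score, so rarer words that keep turning up together are rewarded.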
A more recent development of this sort of information is the Word Sketch, the product of Kilgarriff’s Sketch Engine (see Kilgarriff et al. 2004). Where this system differs from previous methods is in combining an analysis of both grammatical and collocational behaviour. A traditional corpus frequency list would make no distinction between the grammatical relationships between collocate and search word (or ‘node word’). So, for instance, a list for the search word ‘compromise’ might include the items ‘accept’, ‘reach’, ‘between’, ‘deal’, ‘solution’. There would be nothing to explain the grammatical or syntactic relationship between them and the node word.
A Word Sketch groups collocates according to their grammatical relation to the search word. Figure 31.1 shows a Word Sketch for the word ‘compromise’. The lists are divided into groups which show features such as verb use when ‘compromise’ is the object of the sentence (the long list on the left), adjectives that modify ‘compromise’, words that ‘compromise’ itself modifies, and groups of words that are used with particular prepositions (‘for’, ‘of’, ‘to’, etc.) before ‘compromise’.
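The grouping itself can be sketched simply: given (grammatical relation, collocate) pairs for a node word, a Word Sketch style display ranks collocates within each relation. The pairs below are invented for ‘compromise’; an actual system such as the Sketch Engine extracts them from a parsed, tagged corpus and ranks by statistical salience rather than raw counts.

```python
from collections import defaultdict

# Hypothetical (relation, collocate) pairs for the node word 'compromise'.
triples = [
    ("object_of", "reach"), ("object_of", "accept"), ("object_of", "reach"),
    ("adj_modifier", "acceptable"), ("adj_modifier", "reasonable"),
    ("pp_between", "parties"),
]

def word_sketch(triples):
    """Group collocates by grammatical relation, ranked by frequency."""
    groups = defaultdict(lambda: defaultdict(int))
    for rel, collocate in triples:
        groups[rel][collocate] += 1
    return {rel: sorted(colls, key=colls.get, reverse=True)
            for rel, colls in groups.items()}

print(word_sketch(triples)["object_of"])  # ['reach', 'accept']
```

The payoff over a flat collocate list is visible even at this scale: ‘reach’ is shown as a verb taking ‘compromise’ as its object, not merely as a nearby word.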
Word Sketches can be used by lexicographers for comparing near synonyms, by looking at which patterns and combinations the items have in common and which are more typical of, or even unique to, one word rather than the other. Word Sketches were first used for the Macmillan English Dictionary (Rundell 2002), but are now being used more widely, and for languages other than English.
Section 1 mentions the use of monitor corpora, which compare new material with a stable corpus. Without such a corpus, the basic method of searching for neologisms (new words) would be simply to look at all words occurring a certain number of times per ten million within a particular alphabetical stretch and to compare them to an existing wordlist. New senses of existing words (for example the computing uses of ‘wallpaper’, ‘attachment’ or ‘thread’) should be identified by the analysis of corpus lines.
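That basic wordlist comparison can be sketched as follows. The mini frequency table, the threshold and the candidate ‘walkathon’ are all invented for illustration; real neologism hunting runs over the full corpus wordlist against the dictionary headword list.

```python
# Sketch: flag words in an alphabetical stretch that clear a frequency
# threshold but are absent from the existing headword list.
freq_per_10m = {"wall": 310, "wallet": 45, "wallpaper": 12, "walkathon": 6}
headwords = {"wall", "wallet", "wallpaper"}

def neologism_candidates(freqs, known, min_freq, stretch="wal"):
    """Return unlisted words in the stretch at or above the threshold."""
    return sorted(w for w, f in freqs.items()
                  if w.startswith(stretch) and f >= min_freq and w not in known)

print(neologism_candidates(freq_per_10m, headwords, min_freq=5))
```

Note what this method cannot do: ‘wallpaper’ passes silently because the form is already listed, even if its computing sense is new, which is exactly why new senses still require human analysis of corpus lines.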
However, a corpus is not always the best way to spot neologisms for the simple mathematical reason that an item which has only recently entered the language will not occur as often as one that has been in the language for many years. For this reason, most dictionary publishers still use a manual system of trawling for neologisms in a range of sources, either using their own staff or subscribing to a new words collection service such as the Camlex New Words Service. Some publishers, most notably Oxford University Press, also draw on contributions from the public.
Information about the source of corpus material will be recorded, so that it is always possible to know information such as title, genre and date of each citation. Thus we can look, for example, at whether a sense of a word is more common in British or US English, and if we search by date, we can see if that distribution has changed, for instance if a US word has become more widely used in Britain. Similarly, we can see if a particular word is more commonly used in written or spoken language.
The Cambridge International Corpus tools provide a summary of this information before getting to the corpus lines themselves, so it is possible from that point either to get to all the lines for the word in question, or refine the search, for example to British, spoken sources. For the word ‘sidewalk’, for instance, we find the information in Table 31.6.
We can see that it is, unsurprisingly, more common in US English, and that although it is used relatively often in British written sources (perhaps in texts by US authors or texts about the US), it is never used in spoken British English.
The Sketch Engine enables the lexicographer to compare uses between different genres. For instance, a comparison of the word ‘wicked’ shows that in written texts it modifies words such as ‘stepmother’, ‘fairy’ and ‘witch’, and it is only in spoken contexts that we see the much more modern, positive sense of ‘very good’.
Identifying common grammatical and syntactic patterns is a key part of the lexicographer’s job, and an aspect which has become much more refined as the result of corpus evidence, particularly for those working on dictionaries for learners, where such information is an essential part of what the dictionary entry must offer. Using corpora for studying grammar is covered in depth in chapters by Conrad and Hughes (this volume), but we have seen in Section 3 how the Sketch Engine shows the main grammatical and syntactical patterns in which a word occurs, and also in Section 2 how concordances can highlight common patterns, such as the idiom ‘barking up the wrong tree’ occurring almost exclusively in continuous forms.
Using the corpus as a source of dictionary examples is still a rather controversial topic. John Sinclair was unequivocal in his position that only genuine corpus examples should be used, arguing that segments of language depend on the rest of the text around them, so it is artificial to construct a sentence in isolation.
It is easy to see the logic in his exasperated insistence that: ‘It is an absurd notion that invented examples can actually represent the language better than real ones’ (Sinclair 1991: 5). His answer to the complaints of lexicographers about the difficulty of finding suitable examples to take verbatim is that larger corpora are needed. As he says, ‘One does not study all of botany by making artificial flowers’ (Sinclair 1991: 6).
However, it cannot be denied that even with the large corpora available to us today, finding suitable examples can be quite a challenge, particularly when writing dictionaries for learners, where clarity is essential. Common problems of corpus lines include excessive length, culturally specific items and difficult surrounding vocabulary that distracts from understanding the word in question. Given that the lexicographer is likely to be looking not merely for plausible examples, but for ones which illustrate particular features of collocation and grammar, the temptation to creativity can be great indeed.
In her paper on the subject, Liz Potter, in general an advocate of the corpus-only approach, concedes: ‘Invented examples demonstrate the linguistic points the lexicographer wishes to convey, without any distraction or added difficulty such as may be introduced by using examples taken directly from real texts’ (Potter 1998: 358). She suggests that for dictionaries aimed at lower-level (pre-intermediate) students, a case can be made for simple, invented examples that reinforce the most important features of a word.
Most modern dictionaries tend to order senses by frequency, or at least claim to do so, although in fact, as John Sinclair has pointed out (Sinclair 1991: 36), the first meaning that comes to mind, and often the one that appears first in a dictionary, is frequently not the most common usage of a word. The corpus can help us to take a more objective approach to frequency, though at the time of writing there is no system available that will make automatic semantic distinctions, so the lexicographer has to do the necessary analysis of corpus lines.
It may be relatively simple to look at a single word and determine the order of frequency of its senses, but many modern dictionaries, particularly those aimed at learners of a language, attempt to provide relative frequency between headwords. For instance, many dictionaries have some sort of system for marking what they say are the x thousand most common words. Some show more than one level of frequency. It is interesting to note that most dictionaries that show frequency in this way do so only at headword level. To provide relative frequency for senses requires extensive coding of large numbers of corpus lines to give a frequency figure that can be measured per million or ten million, and thus compared with others. To date, only the Cambridge Advanced Learner’s Dictionary has attempted to do this. A major problem of this endeavour is that the frequency of a meaning will depend on what the lexicographer deems that meaning to be. If you are a ‘lumper’ and identify two meanings of a word, the frequency of both those meanings will be higher than if you are a ‘splitter’ and have identified three senses for that word.
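The apportioning of a headword's overall frequency across its coded senses is simple arithmetic, sketched below with invented sense codes, line counts and a per-ten-million figure.

```python
from collections import Counter

# Hypothetical sense codes assigned by a lexicographer to a sample of
# corpus lines for one headword.
coded_lines = ["prod1", "prod1", "prod4", "prod1", "prod2", "prod1", "prod4"]

def sense_frequencies(coded_lines, word_freq_per_10m):
    """Split the headword's per-10M frequency in proportion to the share
    of coded corpus lines assigned to each sense."""
    counts = Counter(coded_lines)
    total = len(coded_lines)
    return {sense: round(word_freq_per_10m * c / total)
            for sense, c in counts.items()}

print(sense_frequencies(coded_lines, word_freq_per_10m=210))
```

The lumper/splitter dependence described above falls straight out of the arithmetic: merging two of these senses simply sums their figures, producing a higher frequency for the merged sense.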
Another general problem of identifying frequency is that although the corpus gives us more reliable frequency information than introspection, it is clearly dependent on the make-up of its content. It is very easy for a corpus to become unbalanced through a preponderance of newspaper texts which are cheap and easy to collect and store, so that alone may exaggerate the importance of particular words such as sports terminology or words used in the reporting of court cases.
Learner corpora consist of language produced by learners of a language (see Gilquin and Granger, this volume). Typically, they would include transcriptions of written work or examinations, and will often be coded according to the type of error the learners make. Learner corpora are used by lexicographers working on learners’ dictionaries or bilingual dictionaries.
Concordancing learner corpus lines can enable lexicographers to spot errors that learners make with particular words. Uncoded learner data is of value in this respect, but error-coded data offers a much more focused and statistically valid analysis. Learner texts may be coded for many different sorts of errors. Such coding will be inserted by experienced teachers or examiners working to a highly prescribed system.
Using such codes, it is possible to look at the words which most frequently attract a particular kind of error. For instance, you could see which words were most commonly misspelled, or which uncountable nouns were most commonly used in an incorrect plural form.
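A query of this kind amounts to counting (error code, corrected form) pairs. A minimal sketch, with an entirely hypothetical set of coded records (the codes ‘SP’ for spelling and ‘NUN’ for uncountable-noun plural, and the helper `words_attracting`, are invented here):

```python
from collections import Counter

# Hypothetical error-coded learner data: each record pairs an error
# code with the corrected form of the word the learner was aiming at.
coded_errors = [
    ("SP", "accommodation"), ("SP", "necessary"), ("SP", "accommodation"),
    ("NUN", "information"), ("NUN", "advice"), ("NUN", "information"),
]

def words_attracting(code: str, data, n: int = 5):
    """Rank the corrected forms most frequently tagged with one error code."""
    return Counter(word for c, word in data if c == code).most_common(n)

ranked = words_attracting("NUN", coded_errors)
# [('information', 2), ('advice', 1)]
```

Counting against the corrected form, rather than the learner's raw spelling, is what makes the aggregation possible at all, a point taken up below.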
It is also possible to group individual lexical items according to the error that learners make with them. Having this information enables the lexicographer to include information in a dictionary entry to help a learner avoid that mistake, for instance by highlighting a grammar code, adding an example sentence, or even adding a specific note or panel to deal with a particular issue. Learner error notes are a common feature of learners’ dictionaries.
The other great advantage of error coding is that it usually includes the addition of a ‘corrected’ version. This is a great help in finding items, since it is much easier, for instance, to search on a correct spelling than to identify several incorrect spellings of a word.
Work on the use of learner corpora in dictionaries is sometimes carried out by publishers’ own teams and sometimes in collaboration with academic institutions, such as the work of Macmillan Dictionaries and the Université catholique de Louvain, Belgium, who developed the International Corpus of Learner English (see Granger 1993, 1994, 1996, 1998; Granger et al. 2002; Rundell and Granger 2007).
A learner corpus can also enable the lexicographer to look at specific types of learner. For instance, we can filter the corpus by level of proficiency, which enables us to tailor information so that it is appropriate for the target user of the dictionary.
As well as looking at how learners are using words, we can look at which words they use. That enables us, if appropriate for a particular dictionary, to give more usage information (e.g. grammar, collocation) in entries for words which learners are likely to try to use productively, as opposed to those they will consult for comprehension. Currently, the English Profile Project (see project website) is using learner corpus data to try to link lexical and grammatical use to Common European Framework levels, a project which will surely feed into dictionaries in future.
Another very useful feature is the ability to sort according to the native language of the learner. Obviously, this is of particular use when working on a dictionary aimed at a particular native speaker group, but it can also help with decisions on when to add a usage note in a general learners’ dictionary – the greater the number of native-language groups making a particular error, the more reason to include a note.
It can be interesting to compare native speaker and learner language. We can see, for instance, that words which are extremely common in native speaker texts are likely to be proportionally even more common in learner texts, because learners will know fewer of the more interesting synonyms that native speakers can call on. This can give us guidance on where to include thesaurus-type information in our dictionaries.
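One simple way to operationalise such a comparison is to take the ratio of a word's per-million frequency in a learner corpus to its per-million frequency in a native corpus. A sketch with invented figures; any word whose ratio exceeds 1 is over-used by learners and is a candidate for thesaurus-type treatment:

```python
# Hypothetical per-million frequencies in a native and a learner corpus.
native = {"good": 300, "excellent": 40, "superb": 8}
learner = {"good": 520, "excellent": 25, "superb": 2}

# A ratio > 1 flags proportional over-use by learners: a cue that the
# entry for that word might usefully offer richer synonyms.
overuse = {w: learner[w] / native[w] for w in native}
flagged = [w for w, r in sorted(overuse.items(), key=lambda kv: -kv[1])
           if r > 1]
# flagged == ['good']
```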
Some types of corpus material are much easier to gather than others and it is a common weakness of lexicographic corpora to have an over-reliance on newspaper text. Particularly for spoken corpora and learner corpora, where texts need to be transcribed, the costs of building corpora large enough to be of value are daunting, and there are challenging issues of copyright and ‘anonymising’ spoken data to be faced.
One new area of corpus creation which is being worked on at present is that of using the internet in a more sophisticated way to gather data. For example, the BootCaT method (see Baroni and Bernadini 2004) uses a process of selecting ‘seed terms’, sending queries with the seed terms to Google and collecting the pages that the Google hits point to. This is then a first-pass specialist corpus. The vocabulary in this corpus can be compared with a reference corpus and terms can be automatically extracted. The process can also be iterated with the new terms as seeds to give a ‘purer’ specialist corpus. The corpus size can be increased simply by sending more queries to the search engine. A service of this kind is available online from WebBootCaT (see Baroni et al. 2006).
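The control flow of that loop can be sketched schematically. In the sketch below, the search-engine call and the keyword-extraction step are stubbed out with stand-in functions, since the real method queries a search engine and extracts terms by comparison with a reference corpus:

```python
import random

def bootcat(seeds, search, extract_terms, iterations=2,
            tuples_per_round=3, tuple_size=2):
    """Schematic BootCaT loop (after Baroni and Bernardini 2004):
    random tuples of seeds become queries, the retrieved pages form a
    first-pass corpus, and terms extracted from it seed the next round."""
    corpus = []
    for _ in range(iterations):
        queries = [" ".join(random.sample(seeds, tuple_size))
                   for _ in range(tuples_per_round)]
        for q in queries:
            corpus.extend(search(q))       # pages the search hits point to
        seeds = extract_terms(corpus)      # 'purer' seeds for the next pass
    return corpus

# Stand-ins for the search and term-extraction steps, just to show the
# control flow; a real implementation would hit a search engine and
# compare frequencies against a reference corpus.
pages = lambda q: [f"page about {q}"]
terms = lambda corpus: ["volatility", "derivative", "hedge"]
corpus = bootcat(["swap", "futures", "option"], pages, terms)
```

As the text notes, corpus size scales simply with the number of queries sent per round (`tuples_per_round` here), and extra iterations purify the seed set.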
Although spoken corpora have led to many insights into the grammar of discourse and the vocabulary of everyday spoken language (see McCarthy and Carter 1997), their value to lexicographers working primarily at the level of a lexical unit is more doubtful. Most people are shocked when they first see a transcription of speech. We can hardly believe that we speak in such an incoherent way, stammering, repeating words, changing the subject in the middle of a sentence, interrupting one another or simply petering out. This sort of language is very difficult to analyse, and almost never suitable to be used directly as a dictionary example.
It could be argued that the transcription process itself robs speech of an important layer of meaning. Features such as intonation, speed and tone of voice can be crucial in distinguishing meaning, as can the context of an utterance, which, unlike in most written text, may not be explicitly stated at all. As Rosamund Moon has pointed out: ‘The range of conversational implicatures in such formulae as if you like or XX would like can scarcely be described in a linguistic monograph, let alone within the constrained space and vocabulary of explication in a dictionary’ (Moon 1998: 353).
It would of course be very useful to dictionary editors working on pronunciation to have spoken corpus examples in an audio form, but such systems are not yet available.
Most lexicographers would probably say that the biggest limitation is still inaccuracy of part-of-speech tagging and the difficulty of parsing. It can be frustrating to search for noun senses of ‘side’ and discover that a large percentage of the results are actually verb uses, and such problems can make it difficult to make objective statements about frequency. These problems are exacerbated in spoken data, where language conforms less to ‘normal’ rules of grammar and is therefore even more difficult to analyse automatically.
A related problem is that quite apart from the difficulty in distinguishing different parts of speech, there is the issue of agreeing what those parts of speech should be. If the lexicographer decides that ‘divorced’ is an adjective, but the CQS calls it a past participle of the verb ‘divorce’, the analysis will not be the one the lexicographer wants.
The tool that would revolutionise the lexicographer’s life is one which could distinguish senses, and of course this is the goal of computational linguists working in all manner of academic and commercial areas. Although many avenues such as looking at context, collocation and syntax are being pursued, we are still a long way from having a tool that would be useful for dictionary writers.
In Section 3, the difficulty of finding suitable corpus chunks to be used verbatim in dictionary examples was discussed. Lexical Computing Ltd is currently working on developing a tool that would automatically select corpus lines which conform to certain criteria likely to make them potential candidates as examples.
One of the biggest problems for a lexicographer can be to decide how to use the resources at their disposal most effectively within time constraints which are probably unimaginable to the academic linguist. Many of the CQS features discussed in this chapter have helped a great deal, but good lexicographers still need to have an excellent feel for the language in order to be able to use them accurately.
Very often, when presented with summarised information, it is necessary to drill down deeper into the corpus to understand fully what is happening in the language. For instance, a collocation search on the word ‘run’ will highlight the word ‘short’. The lexicographer needs the instinct either to guess or to know to check that this is because of the idiom ‘in the short run’ or the collocation ‘run short (of something)’, rather than just a literal combination, as in ‘I went for a short run.’ Similarly, they should be able to spot the likelihood of query results being influenced by a particular text in the corpus, leading to information that would be atypical outside a very limited context. For instance, a word may have a common technical collocation in a financial context, which is over-represented in the corpus because of a newspaper bias, but which would not be appropriate for a general dictionary, and the lexicographer must be alert to these possible distortions.
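The drill-down step for the ‘run’/‘short’ case could be mechanised as a simple classification of concordance lines. A sketch, assuming hypothetical lines and hand-written patterns for the idiom and the collocation (a real CQS would work over parsed, tagged data rather than regular expressions):

```python
import re

# Hypothetical concordance lines in which 'short' co-occurs with 'run'.
lines = [
    "prices will fall in the short run",
    "we are starting to run short of fuel",
    "I went for a short run before work",
]

# Patterns for the idiom and the collocation; anything else is treated
# as a literal combination of the two words.
patterns = [
    ("idiom 'in the short run'", re.compile(r"\bin the short run\b")),
    ("collocation 'run short (of)'", re.compile(r"\brun short\b")),
]

def classify(line):
    for label, pattern in patterns:
        if pattern.search(line):
            return label
    return "literal combination"

labels = [classify(line) for line in lines]
```

Even with such a filter, the lexicographer's judgement described above is still needed to decide which readings warrant dictionary treatment.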
Thankfully, even in this age of computational sophistication, the role of the lexicographer is far from redundant. Many dictionary publishers sell data to computational linguists, and there is a nice circularity in the fact that it is at least in part the nuggets of linguistic knowledge that lexicographers mine from the corpus which will lead to the development of even better corpus tools in the future.
Atkins, B. T. S. and Rundell, M. (2008) The Oxford Guide to Practical Lexicography. Oxford: Oxford University Press. (This is a very hands-on guide to dictionary writing and includes detailed discussion of the use of corpus evidence.)
Rundell, M. (2008) ‘The Corpus Revolution Revisited’, English Today 24(1): 23–7. (This paper is an update of the 1992 paper, ‘The Corpus Revolution’, by Rundell and Stock. It provides an interesting diachronic reflection on how the role of corpora has developed and evolved in that period.)
Sinclair, J. (1987). Looking Up: An Account of the COBUILD Project in Lexical Computing. London: Collins. (This details the groundbreaking COBUILD project.)
Baroni, M. and Bernardini, S. (2004) ‘BootCaT: Bootstrapping Corpora and Terms from the Web’, in Proceedings of LREC 2004, Lisbon: ELDA, pp. 1313–16.
Baroni, M., Kilgarriff, A., Pomikálek, J. and Rychlý, P. (2006) ‘WebBootCaT: Instant Domain-specific Corpora to Support Human Translators’, in Proceedings of EAMT 2006, Oslo, pp. 247–52.
Biber, D. (1993) ‘Representativeness in Corpus Design’, Literary and Linguistic Computing 8(4): 243–57.
Biber, D., Conrad, S. and Reppen, R. (1998) Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.
Camlex New Words Service, www.camlex.co.uk
Church, K. W. and Hanks, P. (1990) ‘Word Association Norms, Mutual Information, and Lexicography’, Computational Linguistics 16(1): 22–9.
Crowdy, S. (1993) ‘Spoken Corpus Design’, Literary and Linguistic Computing 8: 259–65.
English Profile Project, www.englishprofile.org
Granger, S. (1993) ‘The International Corpus of Learner English’, in J. Aarts, P. de Haan and N. Oostdijk (eds) English Language Corpora: Design, Analysis and Exploitation. Amsterdam: Rodopi, pp. 57–69.
——(1994) ‘The Learner Corpus: A Revolution in Applied Linguistics’, English Today 39: 25–9.
——(1996) ‘Learner English Around the World’, in S. Greenbaum (ed.) Comparing English World-wide. Oxford: Clarendon Press, pp. 13–24.
——(1998) ‘The Computerized Learner Corpus: A Versatile New Source of Data for SLA Research’, in S. Granger (ed.) Learner English on Computer. London: Longman, pp. 3–18.
Granger, S., Hung, J. and Petch-Tyson, S. (eds) (2002) Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins.
Hallebeek, J. (2000) ‘English Parallel Corpora and Applications’, Cuadernos de Filologia Inglesa, 9(1): 111–23.
Kilgarriff, A., Rychlý, P., Smrz, P. and Tugwell, D. (2004) ‘The Sketch Engine’, in Proceedings of Euralex 2004, Lorient, France, pp. 105–16.
McCarthy, M. J. (1998) Spoken Language and Applied Linguistics. Cambridge: Cambridge University Press.
McCarthy, M. J. and Carter, R. A. (1997) ‘Written and Spoken Vocabulary’, in N. Schmitt and M. J. McCarthy (eds) Vocabulary: Description, Acquisition, Pedagogy. Cambridge: Cambridge University Press, pp. 20–39.
McEnery, T. and Wilson, A. (1996) Corpus Linguistics. Edinburgh: Edinburgh University Press.
McEnery, T., Xiao, R. and Tono, Y. (2006) Corpus-based Language Studies. London: Routledge.
Moon, R. (1998) ‘On Using Spoken Data in Corpus Lexicography’, in T. Fontenelle, P. Hiligsmann, A. Michiel, A. Moulin and S. Theissen (eds) Euralex ‘98 Proceedings. Liège: University of Liège, pp. 347–55.
Potter, E. (1998) ‘Setting a Good Example: What Kind of Examples Best Serve the Users of Learners’ Dictionaries?’, in T. Fontenelle, P. Hiligsmann, A. Michiel, A. Moulin and S. Theissen (eds) Euralex ‘98 Proceedings. Liège: University of Liège, pp. 357–62.
Rundell, M. (ed.) (2002) Macmillan English Dictionary. London: Macmillan.
Rundell, M. and Granger, S. (2007) ‘From Corpora to Confidence’, English Teaching Professional 50: 15–18.
Sinclair, J. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press.