Rosamund Moon
For corpus linguists, it is difficult to see how anyone can learn much about lexis without using a corpus, or could fail to learn something from each new corpus search. Lexis can be researched simply through exploring individual lexical items and their behaviour, or by using corpus data to examine the lexicon as a whole or to test lexical theory. This chapter gives an introductory overview of some key aspects of English lexis, all of which can be straightforwardly investigated with corpora. Data is drawn from the 450-million-word Bank of English corpus (BoE), created by COBUILD at the University of Birmingham (71 per cent British English, 21 per cent North American, 8 per cent Australian; 86 per cent written, 14 per cent transcribed spoken data).
What do corpora tell us about the English lexicon? This kind of information is best obtained from a large reference corpus (ideally, at least fifty million words), since smaller and specialist corpora are likely to show skewings, with too few examples of rarer words. However, the obvious question – how many words are there in English? – is unanswerable, even with very large corpora. All corpora do is reveal which words are used in their constituent texts and how frequent they are.
A more useful question is how many words comprise the main vocabulary of a language, its central lexicon. Here word frequency lists provide a starting point, demonstrating that a comparatively small set of words accounts for a large proportion of text. Table 15.1 shows the cumulative proportions of BoE comprised by the N most frequent lemmas (base forms and inflections), and their approximate frequencies per million words of corpus text. Beyond this is a long tail of infrequent items, including hapaxes (words occurring once), of which many in BoE are names, numbers or errors.
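The idea of cumulative coverage can be illustrated with a minimal sketch. It assumes a corpus that has already been tokenised into a plain list of word forms, and it counts word forms rather than lemmas, so it simplifies the lemmatised counts reported for BoE. The function names and the toy sentence are invented for illustration and are not BoE data.

```python
from collections import Counter

def cumulative_coverage(tokens, cutoffs):
    """Map each N in cutoffs to the share of all tokens that is
    accounted for by the N most frequent word forms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    ranked = [freq for _, freq in counts.most_common()]
    return {n: sum(ranked[:n]) / total for n in cutoffs}

def per_million(tokens, word):
    """Frequency of one word form, normalised per million tokens."""
    return Counter(tokens)[word] / len(tokens) * 1_000_000

def hapaxes(tokens):
    """Word forms that occur exactly once (the 'long tail')."""
    counts = Counter(tokens)
    return sorted(w for w, c in counts.items() if c == 1)

# Invented toy text, standing in for a real corpus.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

for n, share in cumulative_coverage(tokens, [1, 3, 5]).items():
    print(f"top {n}: {share:.0%} of tokens")
print("hapaxes:", hapaxes(tokens))
```

On a real reference corpus the same calculation would be run over lemmas, which requires a lemmatiser and a decision about lemmatisation policy; both are outside this sketch.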
Unsurprisingly, most very high frequency items are grammatical words: BoE’s top ten lemmas are the, be, of, and, a, in, to (infinitive particle), have, to (preposition) and it. It is usually the lexical words, though, which seem more interesting: BoE’s top 100 includes the noun lemmas year, time, person/people, day, man, way, verbs say, go, make, get, take, know, see, come, think, give and adjectives new, good. These are all fairly general words, many associated with semi-grammatical functions – for example, deictic uses of year(s), causative and phrasal uses of make:
This kind of data is replicated in other reference corpora, though exact rank orderings of words and rates of occurrence vary according to corpus composition and lemmatisation policies. The effect of composition is evident when we consider distribution: even high frequency words may have quite different frequencies in different types of text. See Hunston (2002: 3ff) for discussion of word frequency; also O’Keeffe et al. (2007: 31ff) with respect to applied linguistics. Leech et al. (2001) list frequencies in the British National Corpus; Biber et al. (1999) distributions of words within grammatical categories across discourse types.
Lemmatised frequencies are useful, but equally important are the relative frequencies of individual inflections and what these suggest about usage. While it is unsurprising that in BoE peas is twice as frequent as pea, it seems more surprising that fact is eight times as frequent as facts, though corpus data quickly shows why fact is so common:
See Sinclair (1991: 37–79); Stubbs (2001: 27ff); Hunston (2002: 60–62) for discussion of forms, frequency and usage.
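A comparison of inflected forms, such as fact and facts, can be sketched as a simple ratio of counts, followed by a look at what immediately precedes the more frequent form. This is only an illustrative sketch: the function names and the toy sentence are assumptions, and the figures it produces have nothing to do with the BoE ratios cited above.

```python
from collections import Counter

def form_ratio(tokens, form_a, form_b):
    """How many times more frequent form_a is than form_b.
    Returns None when form_b does not occur at all."""
    counts = Counter(t.lower() for t in tokens)
    if counts[form_b] == 0:
        return None
    return counts[form_a] / counts[form_b]

def left_neighbours(tokens, form):
    """Count the word immediately before each occurrence of form."""
    lowered = [t.lower() for t in tokens]
    return Counter(
        lowered[i - 1] for i, t in enumerate(lowered) if t == form and i > 0
    )

# Invented toy text, not corpus data.
tokens = "In fact the fact is that facts matter but the fact remains".split()

print(form_ratio(tokens, "fact", "facts"))
print(left_neighbours(tokens, "fact").most_common())
```

Counting neighbours in this way is a crude stand-in for reading concordance lines, but it points in the same direction: a form's frequency is often explained by the phrases it occurs in.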
General corpora also provide information about derivation and compounding, helping establish which potential words are actually institutionalised. In BoE, for example, principal derivatives from the high-frequency root colour/color include colourful, colourless, discolour, colourant, colourist, colourise, colouration, uncoloured … ; also colourable, colourism, colouristic, recolour, etc. Formations such as colo(u)rous, miscolour are morphologically possible, but not found. Similarly with words formed with specific morphemes, for example the prefix hyper-: among the most frequent in BoE are hypertension, hyperactive, hyperinflation, hypermarket, hypertext, hyperventilate, hypersensitive. However, rare or hapax items often reveal more about creative processes of word formation, with hyphenated forms especially interesting: thus marginal items such as hyper-accurate, hyper-addictive, hyper-animated, hyper-assertive, hyper-babbling, hyper-blues … may suggest patterns or motivations for coinage. As for compounding, corpora show which formations recur and have specific meanings: for example, BoE has watercolour, colourway, colour-blind, colour-fast, colour-washed, colour-coded, hair colour (dye), adjectivals full-colour, two/four-colour (of printing), and so on.
It is much more interesting, of course, to look at words in their corpus contexts than in isolation. Here the interdependence of words becomes most obvious, showing how the phraseological patternings of words are critically important in relation to meaning as well as usage. A consistent finding in corpus studies has been the extent to which words occur as parts of phraseologies, whether collocational, structural, or both: for discussion, see Sinclair (1987; 2004: 24–48); Stubbs (2001: 80ff); O’Keeffe et al. (2007: 14–15, 59ff, 100ff). Phraseology is also covered in other chapters in this handbook (see chapters by Greaves and Warren and by Hunston), but it is too important with respect to lexis not to discuss here too.
What can we learn about a word from looking at its collocates, the words with which it co-occurs? We might expect these to be least interesting in the case of items with specific meanings, for example a natural-kind entity like aphid:
(Here and elsewhere, concordances are randomised BoE samples.) Co-texts show something about the semantics of aphid: even without knowing what aphids are, we deduce that they are insects or insect-like, small, parasitic on plants, and preyed on by other insects. This is reinforced in listings of aphid’s significant collocates: control, spider, insects, species, mealybugs, plants, thrips, pests, caterpillars, larvae, attack …
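Both a concordance display and a collocate listing can be approximated in a few lines. The sketch below assumes a tokenised corpus and a fixed window of words either side of the node; it ranks collocates by raw co-occurrence count, whereas listings of significant collocates rely on a statistical measure that is not specified here. The function names and the toy sentence are invented for illustration.

```python
from collections import Counter

def kwic(tokens, node, span=4):
    """Key-word-in-context lines: `span` words either side of the node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left:>30}  {tok}  {right}")
    return lines

def collocates(tokens, node, span=4):
    """Raw counts of words found within `span` words of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(w.lower() for w in window)
    return counts

# Invented toy text, not a BoE sample.
tokens = "ladybirds eat aphids on roses and aphids damage plants".split()

for line in kwic(tokens, "aphids", span=2):
    print(line)
print(collocates(tokens, "aphids", span=2).most_common(3))
```

In practice raw counts are dominated by grammatical words, which is one reason corpus tools weight collocates statistically before listing them.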
For many words, collocational and syntagmatic patterns are more noticeable, as even a few lines for the less concrete noun refuge show:
Phraseological patterns include seek/take refuge, refuge in/for/from, last refuge, place of refuge; the first line contains a recurrent collocation with world, with world representing a difficult, frightening or uncongenial situation.
Corpus data for verbs demonstrates recurrent or mandatory grammatical structures, typical subject and object realisations, and so on. Verbs of motion are typically followed by adverbials or prepositional phrases of direction or manner, even when metaphorical:
With other verbs, grammatical structure and kinds of collocate seem more restricted. For example, comply is typically preceded by something indicating coercion, necessity, willingness, etc. (incentive, must, force, fail(ure), hesitate), and/or is followed by with, itself typically followed by a noun phrase indicating a constraint (agreement, decision, law, obligation … ). The specific meaning of comply is inseparable from such a pattern:
See Sinclair (1987), Partington (1996: 15–28 and passim) and Hunston (2002: 39ff, 68ff and passim) for discussion of collocation and patterning: see also chapters by Evison, Hunston, Tribble, Greaves and Warren, this volume.
In many cases, recurrent collocates, especially when syntagmatically fixed, represent some kind of multi-word lexical item. In the following, of course, on course (to/for) and stay the course have holistic meanings:
In fact, BoE evidence for course is dominated by phrasal uses, particularly of course. Similarly, much of the corpus evidence for high-frequency verbs such as take and get consists of occurrences in phrasal verbs and other fixed expressions, while an important feature of data for nouns such as hand, head and heart is the way in which they are used metonymically and metaphorically, including in idioms.
It is difficult to generalise about multi-word items since the range is wide and behaviour varies greatly. Some individual phrasal verbs and compound nouns are themselves very frequent, as are fixed expressions like of course, as well (as), for example, in particular, take place, take part, sort of … , all with important grammatical or discourse functions. However, corpus studies of idioms, proverbs and other such items suggest that most are infrequent, tending to occur mainly in journalism and fiction. A number of claims have been made in the literature about the essential ambiguity of idioms, or their frozenness and syntactic defectiveness; these claims can be correlated with corpus data. Do literal counterparts to idioms actually occur? (only rarely). Are passivisable idioms actually used in the passive? (in BoE bury the hatchet is passive in about 5 per cent of instances, spill the beans in fewer than 1 per cent). While many idioms are indeed lexically frozen, many others are unstable. Thus corpus searches need to be flexible and recursive to find all idiom occurrences, including creative exploitations and interruptions: for example, keep something under one’s hat and … under one’s cap/turban/bonnet; or upset the proverbial/political/establishment applecart. Proverbs too are often shortened or varied: BoE has evidence for canonical what’s sauce for the goose is sauce for the gander, truncated what’s sauce for the goose, variants what is good for the goose (is good for the gander), and manipulations (liberal goose/conservative gander, etc.). The discourse contributions of such expressions, their evaluations and cohesiveness, become apparent from their corpus contexts:
I seem to recall that it was Ashdown who … almost upset the whole Alliance apple-cart. He did so by overturning the leadership on a major question of defence policy.
Sauce for the Czech goose should not mean arsenic for the Slovak gander.
See Moon (1998) for a corpus-based overview of English fixed expressions and idioms; see McCarthy (1998: 129–49), Moon (1998: 260–64, 300–5) and O’Keeffe et al. (2007: 80–99) for discussion of idioms in interaction.
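A flexible search for an unstable idiom can be sketched with a regular expression that tolerates inflection, spelling variation and inserted words. The pattern below is a hypothetical illustration for upset the applecart only: the limit of two inserted words, the spelling variants and the names used are assumptions, and it does not catch passives or other reorderings, which would need further patterns in a recursive search.

```python
import re

# upset / upsets / upsetting + the + up to two inserted words + applecart,
# allowing the spellings "applecart", "apple-cart" and "apple cart".
APPLECART = re.compile(
    r"\bupset(?:s|ting)?\s+the\s+"
    r"(?P<insert>(?:\w+\s+){0,2}?)"
    r"apple[- ]?cart\b",
    re.IGNORECASE,
)

def find_applecart(text):
    """For each hit, return the list of words inserted before 'applecart'."""
    return [m.group("insert").split() for m in APPLECART.finditer(text)]

samples = [
    "almost upset the whole Alliance apple-cart",
    "this could upset the proverbial applecart",
    "they upset the applecart",
    "he pushed the apple cart",
]
for s in samples:
    print(s, "->", find_applecart(s))
```

Collecting the inserted words, rather than merely counting hits, makes it easier to see which exploitations recur and which are one-off manipulations.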
By forcing us to consider words in context, corpus concordances make us aware of how far the meanings of words are derived from context – even raising the question of whether words have independent meanings at all. This interdependence of meaning and context is clearest in the case of semantically depleted words, such as much-discussed take:
Trying to abstract an essential meaning seems pointless: each use needs a different explanation of meaning or paraphrase. There are some fixed expressions (take part, take place, take something seriously); others are simply recurrent usages, and a proper sample would provide many more. What is the commonality of meaning, the meaning of ‘take-ness’? Another case is the noun heed:
We might feel that heed means ‘attention, notice’, but corpus evidence shows that it normally occurs in a restricted range of structures, including the curiously synonymous pay/give/take heed. If its usage is so restricted, how versatile can its meaning really be? Can meaning be separated from phraseology? We can pursue this point with the adjective heinous, which a dictionary might gloss as ‘odious, wicked’:
Heinous indicates unacceptability, but part of its meaning is supplied through collocation with crime and related words. While take and heinous represent very different kinds of word in terms of meaning, a corpus perspective suggests that both reflect the way in which meaning is a construction of context, not intrinsic to the word. It might be objected that there are plenty of other words in the lexicon which are semantically full, have specific denotations and are not collocationally restricted: words like new, walk, year or aphid, peas, hypertension. Any number of words could be substituted in the following concordances, changing meanings, but still maintaining syntactic and collocational well-formedness:
What can corpora tell us about the meanings of such words? One response is that corpora show us typical contexts in which we encounter a word – even if we first acquire them through ostension (walking, peas, aphids) – and that part of our knowledge of a word relates to its associations with other words: cf. Firth’s comment ‘One of the meanings of night is its collocability with dark, and of dark, of course, collocation with night’ (1957: 196).
Another aspect of word meaning to explore is polysemy: how many different senses or uses words have, and how these are distinguished in context. Collocates discriminate by indicating topic and semantic field, with different senses often associated with different phraseological patterns and structures. The noun race is relatively straightforward:
The primary sense distinction between ‘competitive activity’ and ‘ethnic grouping’ is clear, reinforced by words such as champ, won, title, victory on the one hand, relations, religion, gender, human on the other. Since race derives from two distinct roots, it is homonymic rather than polysemous; a more synchronic view, consistent with corpus approaches, is to see this in terms of a major semantic discontinuity and lack of collocational overlap.
The verb see also has multiple senses, with a primary distinction between vision and comprehension. In context, these are distinguished through the nature of what is seen – physical objects (bodies, trees, a horse, text) or entertainment, and ideas (contrast, one’s status, situations):
Adjective senses can be distinguished through types of noun collocate, the entity being described. For colourful, we might group together physical objects and places (cloth, gardens); people and their lives, sometimes with an implication of notoriety; and then more abstract things (lively descriptions, mixes of culture):
One analyst might say that these senses are substantially the same, a core idea with contextualised metaphorical applications. Another might find further nuances of meaning. There are arguments for and against fine-grainedness of sense distinction: what corpus data contributes is a way of having the argument in the first place. With take, an extreme example of a polysemous word, we have already seen that its meanings are tied up with phraseology. It would be impossible to investigate this robustly without a corpus. See Sinclair (2004: 24–48, 131–48) for discussion of meaning, ambiguity and phraseology; see Hunston (2002: 38–66) for further examples and analysis.
Figurative senses such as see ‘understand’ and colourful ‘lively’ are institutionalised: corpora also contain large quantities of more creatively figurative language, but it is not always easy to locate. Systematic studies of metaphor or metonymy may start by searching for a specific item, such as a set of verbs of vision/cognition, or a metaphor-rich word like heart. The following sample omits literal uses, and includes fixed expressions (break one’s heart, change of heart, take heart) and more generalised allusion to the heart as source of emotions:
The last line is arguably literal, but such uses typically also imply psychological states (one’s hands shake, one’s stomach churns … ).
Connotational meaning seems to depend on intuition and suggestion, but are there traces in corpus data? If we follow Firthian principles of meaning by collocation, then collocates may well provide answers. For example, a word such as claret denotes a kind of red wine while connoting a certain lifestyle:
Its collocates include bottle, glass, drink; general adjectives such as good, old, classic; wine-related items burgundy, bordeaux, champagne, port, cru, decanter; vintage, fruity, fine, full-bodied; foods duck, stilton. As a set, they seem to point to that intuited claret lifestyle. See Stubbs (1996: 85–6, 106ff) for discussion of connotation and further examples.
The recurrent lexicogrammatical patterning of words such as refuge, comply, storm also helps us explore connotation in relation to semantic prosodies (Louw 1993; Stubbs 2001: 198ff and passim; Hunston 2002: 141ff, whose term is ‘discourse prosody’). We can start by identifying canonical semantic structures:
(take) refuge from + UNDESIRABLE SITUATION + in + SAFE PLACE
someone (+ MODAL) + complies with + OFFICIAL INSTRUCTION
The interesting cases are those which deviate from these patterns, belittling, criticising or creating irony by subverting the canonical evaluation or downtoning a semantic feature:
Following on from connotation, prosodies and collocation is the use of corpus data to explore ideologically significant items. We might look, for example, at gender issues through he/she (cf. discussion of gendered pairings in Section 4) or ethnicity through race; or else look at ‘keywords’ in Raymond Williams’ terms (1983). One of Williams’ words is bourgeois; corpus examples point to something of what it now suggests:
This is reinforced in its collocates (society, culture, family/families, life, values; revolution, ideology, democracy, liberalism, hegemony, class), which in turn suggest discursive associations to be examined. Stubbs (1996: 157–95; 2001: 145–93) has extended discussion of culturally loaded items and keywords; Krishnamurthy (1996) looks at racial, tribal and ethnic; Baker (2006) provides many further case studies: see also O’Halloran (this volume).
Words fall into lexical sets, fitting into semantic fields, though these are not always easily identified from corpus data. However, where a corpus/subcorpus is limited to a particular field, especially if technical, we can learn something about the lexis with which that field is discussed. For example, in the BoE subcorpus of business-focused texts, almost 40 per cent of its top 100 words are lexical and topic-specific (fund(s), shares, securities, investment, company, market, business, income, value, sales … ). Similarly, collocate listings for words may reveal co-hyponyms: those for aphid include a range of insects and other small creatures, often sharing a semantic feature ‘parasitic’ (ants, blackfly, greenfly, caterpillars, lacewings, mealybugs, moths, slugs, snails, spiders, thrips, weevils … ). Another approach to exploring sets is through phraseological frames, looking at what kinds of word occur within a particular slot. For example, the noun slot in the phraseology (QUANTIFIER) + NOUN + ago is realised by not just the obvious set of time words (years, months, weeks, days, seasons … ) but also items considered as periods of time (generations, games, moons, vintages, albums … ). Cf. the way that comply with (see above) is typically followed by collocates which indicate a constraint: another kind of semantic set.
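Searching a phraseological frame for its slot fillers can be sketched as follows. This is a simplified illustration of the QUANTIFIER + NOUN + ago frame: it works on raw text without part-of-speech tagging, it requires a quantifier from a short hypothetical list (whereas the quantifier in the frame is optional), and the toy text is invented rather than drawn from a corpus.

```python
import re
from collections import Counter

# Hypothetical, deliberately short list of quantifiers.
QUANTIFIERS = r"(?:a|an|one|two|three|four|five|several|many|some|few|\d+)"

# QUANTIFIER + one word + "ago"; the middle word is captured as the slot.
FRAME = re.compile(
    rf"\b{QUANTIFIERS}\s+(?P<noun>[a-z]+)\s+ago\b",
    re.IGNORECASE,
)

def slot_fillers(text):
    """Count the words that fill the slot between quantifier and 'ago'."""
    return Counter(m.group("noun").lower() for m in FRAME.finditer(text))

# Invented toy text.
text = (
    "Two years ago we met. Several albums ago they changed style. "
    "Three years ago it rained."
)
print(slot_fillers(text).most_common())
```

Without tagging, the slot will also admit non-nouns in real data, so the output is a candidate list to be checked against concordance lines rather than a finished lexical set.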
It is often said that English has no perfect synonyms: that is, words which can be used interchangeably in any context. Corpora make it possible to test this by examining collocation, phraseological structure, genre, variety and frequency. Asylum seems to occupy a similar semantic space, ‘place of safety’, to refuge (roughly twice as frequent: see Section 2, ‘Collocation and patterning’, for concordances):
But even a few lines show differences: seek* seems to collocate with both, but take* only with refuge; asylum is more politicised. Another pairing is colourless and drab:
Drab, roughly twice as common as colourless, here refers to appearances, experiences and situations; colourless means ‘lacking pigment, colour’ or is applied metaphorically to uninteresting performances and people. While drab/colourless (and dull/dreary/lacklustre) are only partial synonyms, there are many more problematic pairs like wide/broad or begin/start where corpora can similarly help disentangle their different usages. See Partington (1996: 29–47) and Stubbs (2001: 35ff, 102ff) for discussion of examples.
Conventional antonyms can also be explored with corpus data, in particular to establish whether they share ranges of reference and phraseological patterns. Are, for example, morphological counterparts such as acceptable/unacceptable, important/unimportant and colourful/colourless also counterparts in usage? What about happy with its two antonyms, unhappy and sad? A third choice, not happy, may need to be considered too:
Of other types of oppositeness, one of the most obvious to investigate is gendered pairings, such as man/woman, boy/girl, husband/wife, where collocation often shows up gender stereotyping. For example, looking in the BoE books subcorpora at the pattern ADJECTIVE + husband/wife, it appears that collocationally a husband is abusive, unfaithful, wayward, hardworking, drunken, a wife good, perfect, battered, pregnant, beautiful … : there are interesting implications ideologically. For studies, see those by Pearce of man/woman (2008), Sigley and Holmes of boy/girl (2002), Baker of bachelor/spinster (2006: 95ff).
This final section looks briefly at lexis in the most distinctive of all discourse subtypes, spoken interaction, here drawing on BoE’s twenty-million-word subcorpus of British conversation and local radio broadcasts. One significant feature is its lexicon: smaller and more homogenised, with fewer types, just under 47,000 non-hapaxes. Correspondingly, its top 100 and 1,000 types comprise a larger proportion of the corpus, or to put it another way, are used proportionately more. Lexical items in its top 100 include mean, sort, thing, want, all more common than in written English; also discourse markers and phatics yeah, yes, right, well, okay, er, erm, mm, oh. Many are strongly patterned phraseologically; all have important pragmatic functions. See McCarthy and Carter (1997) for an overview of spoken lexis; O’Keeffe et al. (2007: passim) for extended discussion.
The following, extracted from BoE, shows some of the features of spoken lexis:
AA … and good shoes … and … and buy the things to me that are erm practical they need … th that … that something that might need to be in use constantly every day and therefore has to be you know a good thing. Erm and not to buy it on silly frivol you know … d+ … don’t spend your money on silly frivolities and
BB Mm
AA the things that you don’t need. I mean it’s okay to do that providing that you’ve already taken care of all the other needs.
Features include hesitations, repetitions and relexicalisations; the chunked nature of the language is also clear. Among more lexical words, know and mean occur in phraseologies with discourse functions; others have fairly general meanings (good, thing/things, need/needs) or occur in common collocations or phrases (in use, every day, spend money, take care). Only shoes, buy, practical, constantly, silly, frivolities have fuller meanings or seem more independent.
Predictably, if we compare data drawn from written and spoken subcorpora for a word like know, we find phraseological differences. From British broadsheet journalism:
and from the spoken subcorpus, where 94 per cent of occurrences are as the base form know:
Even in terms of collocates, there can be marked distinctions between speech and writing. If we take a structure such as ADVERB + happy to explore the ranges of submodifiers used, and compare realisations in BoE generally with those in the spoken subcorpus, there are interesting contrasts. Items such as very, quite, perfectly, really, reasonably are found in both subcorpora; in BoE overall, further significant collocates include blissfully, deliriously, ecstatically, fantastically, gloriously, idyllically, infectiously, insanely, irrationally, radiantly, serenely, supremely, unspeakably, wonderfully, etc. But the only such item to occur significantly in the spoken subcorpus is deliriously, suggesting a simpler vocabulary, characteristic of the spoken lexicon.
Perhaps the most important and complex aspect of lexis in spoken language is the way in which words and phraseologies fulfil pragmatic functions: meaning effectively has to be explored in terms of function, as in the case of know. While I know, I don’t know, you know are not semantically opaque, they contribute interactionally, indicating hesitation and uncertainty, appealing to shared knowledge or understanding, pre-empting contradiction. With thing, corpus data shows its functions as a proform or vagueness marker, or, in the formula the NOUN is, prefacing a reason, point or new information:
For words and phrases like these, corpora provide a means for exploring usage, co-textual patterning and positioning, and so on. Less directly, concordances of such words may suggest other items or phenomena to explore: patterns of repetition and relexicalisation, particular phraseologies, and other narrative devices, as in:
Discourse pragmatics can only be studied properly with full transcripts and intonational data, yet there is still much to learn about the spoken lexicon from corpora. It is also salutary to remember that assertions made about lexis in homogenised general corpora provide only part of the truth: there may be extraordinary dissimilarities within corpora, alongside the extraordinary similarities of lexical and phraseological patterning, and patterns of conventionalised usage.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999) Longman Grammar of Spoken and Written English. Harlow: Longman. (This provides useful data about the different kinds of word and phraseological structure associated with four discourse types – conversation, fiction, journalism and academic writing.)
Sinclair, J. (1966) ‘Beginning the Study of Lexis’, in C. E. Bazell, J. C. Catford, M. A. K. Halliday and R. H. Robins (eds) In Memory of J. R. Firth, London: Longman, pp. 410–30. (An early corpus linguistics text, this sets an agenda for corpus studies of lexis, and, significantly, predicts the extensiveness of phraseological patterning and the inseparability of phraseology and meaning.)
——(1987) ‘Collocation: A Progress Report’, in R. Steele and T. Threadgold (eds) Language Topics: Essays in Honour of Michael Halliday, II. Amsterdam: John Benjamins, pp. 319–31. (Reprinted in Sinclair 1991, pp. 109–21) (An important counterpart to Sinclair 1966, this draws on extensive corpus research into lexis, introducing the Idiom Principle, alongside the Open Choice Principle, to explain the role of collocation in determining lexical choice.)
Stubbs, M. (2001) Words and Phrases: Corpus Studies of Lexical Semantics. Oxford: Blackwell. (This includes a stimulating array of corpus studies of words and phrases, along with discussions of the relevant linguistic issues.)
Baker, P. (2006) Using Corpora in Discourse Analysis. London: Continuum.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999) Longman Grammar of Spoken and Written English. Harlow: Longman.
Firth, J. R. (1957) Papers in Linguistics 1934–1951. London: Oxford University Press.
Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Krishnamurthy, R. (1996) ‘Ethnic, Racial and Tribal: The Language of Racism?’ in C. Caldas-Coulthard and M. Coulthard (eds) Texts and Practices: Readings in Critical Discourse Analysis. London: Routledge, pp. 129–49.
Leech, G., Rayson, P. and Wilson, A. (2001) Word Frequencies in Written and Spoken English. Harlow: Longman.
Louw, W. (1993) ‘Irony in the Text or Insincerity in the Writer? – the Diagnostic Potential of Semantic Prosodies’, in M. Baker, G. Francis and E. Tognini Bonelli (eds) Text and Technology: in Honour of John Sinclair. Amsterdam: John Benjamins, pp. 157–76.
McCarthy, M. J. (1998) Spoken Language and Applied Linguistics. Cambridge: Cambridge University Press.
McCarthy, M. J., and Carter, R. A. (1997) ‘Written and Spoken Vocabulary’, in N. Schmitt and M. J. McCarthy (eds) Vocabulary: Description, Acquisition and Pedagogy. Cambridge: Cambridge University Press, pp. 20–39.
Moon, R. (1998) Fixed Expressions and Idioms in English: A Corpus-based Approach. Oxford: Oxford University Press.
O’Keeffe, A., McCarthy, M. J. and Carter, R. A. (2007) From Corpus to Classroom: Language Use and Language Teaching. Cambridge: Cambridge University Press.
Partington, A. (1996) Patterns and Meanings. Amsterdam: John Benjamins.
Pearce, M. (2008) ‘Investigating the Collocational Behaviour of MAN and WOMAN in the BNC using Sketch Engine’, Corpora 3(1): 1–29.
Sigley, R. and Holmes, J. (2002) ‘Looking at Girls in Corpora of English’, Journal of English Linguistics 30 (2): 138–57.
Sinclair, J. (1966) ‘Beginning the Study of Lexis’, in C. E. Bazell, J. C. Catford, M. A. K. Halliday, and R. H. Robins (eds) In Memory of J. R. Firth, London: Longman, pp. 410–30.
——(1987) ‘Collocation: a Progress Report’, in R. Steele and T. Threadgold (eds) Language Topics: Essays in Honour of Michael Halliday, II. Amsterdam: John Benjamins, pp. 319–31. Reprinted in J. Sinclair, Corpus, Concordance, Collocation. Oxford: Oxford University Press, pp. 109–21.
——(1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press.
——(2004) Trust the Text. London: Routledge.
Stubbs, M. (1996) Text and Corpus Analysis. Oxford: Blackwell.
——(2001) Words and Phrases: Corpus Studies of Lexical Semantics. Oxford: Blackwell.
Williams, R. (1983) Keywords: A Vocabulary of Language and Society (revised edition). London: Flamingo.