GERALD NELSON
Kennedy (1998: 1) provides succinct definitions of both the terms “corpus” and “corpus linguistics”:
In the language sciences a corpus is a body of written text or transcribed speech which can serve as a basis for linguistic analysis and description. Over the last three decades the compilation and analysis of corpora stored in computerised databases has led to a new scholarly enterprise known as corpus linguistics.
As Kennedy’s definition shows, the corpus‐based method of linguistic research is a very recent development, and the use of corpora in the study of world Englishes is more recent still. McEnery and Wilson (2001: 1) provide a useful account of what they call “early” corpus linguistics, by which they refer to a range of research projects undertaken from the 1950s to the 1970s, using entirely manual methods for compiling and analyzing large collections of text. Notable among these was the work of Randolph Quirk, who compiled the Survey of English Usage (SEU) corpus, beginning in 1959. SEU is a one‐million‐word corpus of British English, dating for the most part from the 1960s. From our perspective in the technologically sophisticated twenty‐first century, it is astonishing to recall that the SEU corpus was an entirely paper‐based corpus, with each instance of every word annotated on its own paper slip, and the slips stored in a vast array of metal filing cabinets (Peppé 1995). In contrast with this, corpus linguistics today exploits the ever‐increasing power of computer hardware and software, and with the aid of computer technology, linguists are compiling ever‐larger collections of text. Since the 1980s, the corpus‐based approach has become firmly established as a methodology for linguistic research.
The first electronic corpus of English is generally agreed to be the Brown corpus, which was compiled by Francis and Kučera at Brown University, Rhode Island, in 1963‐4. The compilers refer to the corpus as A Standard Corpus of Present‐Day Edited American English (Francis & Kučera 1979). It consists of just over one million words of printed English produced in the United States during the calendar year 1961. It includes 500 individual samples of 2,000 words each, selected from the range of text types shown in Table 29.1.
The Brown corpus continues to be very influential, especially in terms of the methodology of corpus design and compilation. For that reason, it is worth quoting the compilers at some length here:
Samples were chosen for their representative quality rather than for any subjectively determined excellence. The use of the word standard in the title of the Corpus does not in any way mean that it is put forward as “standard English”; it merely expresses the hope that this corpus will be used for comparative studies where it is important to use the same body of data. Since the preparation and input of data is a major bottleneck in computer work, the intent was to make available a carefully chosen and prepared body of material of considerable size in standardized format. The corpus may further prove to be standard in setting the pattern for the preparation and presentation of further bodies of data in English or in other languages.
(Francis & Kučera 1979)
The Brown corpus did, indeed, become a “standard” in the sense that the compilers express here. It set in motion a series of corpus‐based projects around the world, in which the researchers invariably looked to Brown as their model. The Lancaster‐Oslo/Bergen (LOB) corpus was begun in 1976 in order to provide a British English equivalent of the Brown corpus (Johansson, Leech, & Goodluck 1978). To this end, the compilers followed the design of Brown closely, selecting texts printed in Great Britain in 1961 and choosing the same number and size of samples from the same text categories. The objective was, of course, to ensure that the two corpora would be directly comparable with each other, so that they could be used as the basis for comparative studies across the two dominant varieties, American and British English (AmE and BrE).
Table 29.1 Composition of the Brown Corpus.
Informative prose (374 samples): Press: reportage; Press: editorial; Press: reviews; Religion; Skills and hobbies; Popular lore; Belles lettres, biography, memoir; Miscellaneous; Learned
Imaginative prose (126 samples): General fiction; Mystery and detective fiction; Science fiction; Adventure and western fiction; Romance and love story; Humor
In 1978, S. V. Shastri noted that previous studies of Indian English had been largely confined to aspects of the spoken variety (Bansal 1969), or to isolated topics in the language (Kachru 1965). Having worked at Lancaster University with Geoffrey Leech, one of the prime movers behind the LOB corpus, Shastri recognized that “a comprehensive description [of Indian English] will have to be based on a standard corpus” (Shastri 1986). To this end, Shastri compiled the Kolhapur corpus of written Indian English, using both Brown and LOB as his models. He declared his objectives in the following terms:
The present corpus of Indian Written English is comparable to the Brown and the LOB corpora. It is intended to serve as source material for comparative studies of American, British and Indian English which in its turn is expected to lead to a comprehensive description of Indian English.
(Shastri 1986)
However, unlike Brown and LOB, which sampled texts from 1961, the Kolhapur corpus takes 1978 as its sampling date. Part of the rationale behind this had to do with the perceived “Indianness” of postindependence Indian English. As Shastri explained:
it is felt that the value of the Indian corpus is immensely enhanced in general and in particular as a source for the description of Indian English…as the Indianness of Indian English is a post‐independence phenomenon and may have reached a discernible stage in the thirty years after Independence. It is argued in theory that in the same thirty years the American and British English may not have undergone such changes.
(Shastri 1986)
This is an interesting observation, and one which, consciously or unconsciously, informs descriptions of other postcolonial Englishes as well. What Shastri was consciously attempting to construct was a corpus of distinctively Indian English, as opposed to the variety used at the time of Independence. Whether the 30‐year gap to 1978 would be sufficient to allow the “Indianness” of Indian English to manifest itself is perhaps a moot point. The key issue here is that Shastri, following Kachru (1965), recognized Indian English as a distinct variety, and set about capturing it in the Kolhapur corpus.
The Australian Corpus of English (ACE) was compiled at Macquarie University, beginning in 1986. As with the Kolhapur corpus, the compilers were motivated primarily by a wish to differentiate between their own variety of English and the British and American varieties. For that reason, they followed the Brown and LOB models closely in terms of corpus design, though again there is a chronological gap: ACE samples texts from 1986.
At Victoria University of Wellington, New Zealand, researchers compiled the Wellington Corpus of Written New Zealand English (WWC; Bauer 1993). Once again, Brown and LOB were the models, though the compilers decided to use ACE as their model in terms of sampling date. In the Wellington corpus, the majority of samples date from 1986 or 1987. The corpus of written New Zealand English was followed in 1998 by the Wellington Corpus of Spoken New Zealand English (WSC), consisting of dialogues and monologues collected in the period 1988 to 1994 (Holmes, Vine, & Johnson 1998).
Beginning with the highly influential Brown corpus in the early 1960s, the enterprise of compiling English‐language corpora has continued in highly principled and systematic ways. As a result, linguists now have five “parallel” corpora of international written English at their disposal: Brown, LOB, Kolhapur, ACE, and Wellington. In 1990, a new project was initiated which would significantly expand this collection, and more importantly, greatly increase both the linguistic and the geographical coverage of available corpora.
The International Corpus of English (ICE) project was conceived in the late 1980s by Sidney Greenbaum, then director of the Survey of English Usage, University College London. The idea was first proposed in a brief notice in World Englishes (Greenbaum 1988), in which researchers were invited to collaborate on the compilation of parallel English corpora, specifically in countries where English is used as a first language, or as a second official language. The invitation was timely, and the response from linguists worldwide was both immediate and enthusiastic. At the time of writing, the ICE project involves research teams working in the following countries or regions:
Table 29.2 ICE countries/regions.
Compiled by author.
Australia; Bahamas; Canada; East Africa; Fiji; Ghana; Gibraltar; Great Britain; Hong Kong
India; Ireland; Jamaica; Malaysia; Malta; Namibia; New Zealand; Nigeria; Pakistan
Philippines; Scotland; Singapore; South Africa; Sri Lanka; Trinidad & Tobago; Uganda; United States
From its inception, ICE aimed to compile parallel corpora from two of Kachru’s Three Circles of English (Kachru 1985). The Inner Circle is represented by countries such as Great Britain, the United States, and Australia, and the Outer Circle is represented by countries such as India, Singapore, and the Philippines. Kachru’s third circle, the Expanding Circle, is represented in an ancillary project, the International Corpus of Learner English (ICLE), which is discussed below.
Each ICE team is compiling (or has already compiled) a one‐million‐word corpus of their own variety of English, produced by adults (aged 18 or over) in the period after 1989. While each national or regional corpus can exist independently as a valuable resource for the study of individual varieties, the real value of the corpora lies in their being exactly compatible with each other. This compatibility lies in every area of the corpus design and annotation (Nelson 1996a, 1996b). The design, in terms of text categories, is shown in Table 29.3. Each corpus consists of 500 samples of approximately 2,000 words each, to give a total of one million words. The first major division is between speech (300 samples) and writing (200 samples). Further subdivisions are made in a hierarchical fashion, with speech divided among dialogue (180 samples) and monologue (120 samples), and writing divided among nonprinted (50 samples) and printed (150 samples). The hierarchical subdivision continues to the fundamental level of the text categories, of which there are 15 in speech and 17 in writing.
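The arithmetic of this design can be checked with a short sketch. The following Python fragment is illustrative only: it encodes just the two upper levels of the hierarchy as stated above, and the names `ICE_DESIGN`, `samples_in`, `total_samples`, and `approximate_words` are hypothetical, not part of any ICE software.

```python
# Sample counts for the two upper levels of the ICE design, as given in the text.
# The 32 individual text categories are deliberately omitted from this sketch.
ICE_DESIGN = {
    "spoken": {"dialogue": 180, "monologue": 120},
    "written": {"nonprinted": 50, "printed": 150},
}
WORDS_PER_SAMPLE = 2_000  # approximate target size of each sample


def samples_in(division: str) -> int:
    """Total number of samples in one top-level division."""
    return sum(ICE_DESIGN[division].values())


def total_samples() -> int:
    """Total number of samples in a complete ICE corpus."""
    return sum(samples_in(division) for division in ICE_DESIGN)


def approximate_words(samples: int) -> int:
    """Approximate word count for a given number of samples."""
    return samples * WORDS_PER_SAMPLE


if __name__ == "__main__":
    for division in ICE_DESIGN:
        count = samples_in(division)
        print(division, count, approximate_words(count))
    print("total", total_samples(), approximate_words(total_samples()))
```

Because every ICE corpus shares these counts, a frequency drawn from one category in one corpus rests on roughly the same amount of text as the same category in any other.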
The overall design was arrived at following extensive discussion (Leitner & Hesselmann 1996; Schmied 1990). While it is informally based on the design of Brown and LOB, it also reflects some important differences. Most notably, it samples spoken English, and in a greater proportion than writing. Within the spoken component, by far the greatest contribution is from face‐to‐face conversations (90 samples, or 180,000 words). The ICE corpora, therefore, are distinctive in the emphasis they place on the spoken medium, and in particular on informal, conversational English. They are also distinctive in that they include only those text categories which are internationally applicable. So, for example, the corpus design (Table 29.3), unlike that of Brown and LOB, contains no religion category, because writing on this topic is not available (in English, at least) in all the participating countries. Similarly, the subdivision of fiction into romance, westerns, detective fiction, etc., has been dispensed with, since these subtypes simply do not apply internationally. The ICE corpora aim to be maximally representative of English in use in all the participating countries, and not in any one country.
Each of the ICE teams has had a slightly different experience in compiling their respective corpora, depending on local circumstances, and specifically on the status of English in the country concerned. Many of the teams have written informatively about these experiences, and their accounts provide valuable insights into the processes and rationales behind corpus building in the context of world Englishes. In 2010, a special issue of the ICAME Journal (ICE Age 2: ICE corpora of new Englishes in the making) was devoted to the challenges that researchers faced in compiling ICE corpora in several countries, including Fiji, the Bahamas, Trinidad and Tobago, Malta, Sri Lanka, and Nigeria. This special issue makes a very valuable contribution to our understanding of the dynamics of English and English‐language education in many parts of the world.
At the time of writing, a total of 13 ICE corpora are available for the purposes of nonprofit, academic research. These are given in Table 29.4.
All of the corpora are available in a basic lexical form, but most of them have also been annotated (“tagged”) for part‐of‐speech using the CLAWS7 tagset (Garside 1987), which was devised in the UK at Lancaster University’s UCREL Centre (University Centre for Computer Corpus Research on Language). The corpora have also been semantically tagged using the USAS tagger (the UCREL Semantic Analysis System; Rayson 2008). Both levels of annotation were carried out using the Wmatrix system, which allows online access to the UCREL taggers (Rayson 2009).
In terms of annotation, the British ICE corpus (ICE‐GB) is the most advanced of all the ICE corpora. Each word has been tagged for part‐of‐speech, and each sentence/utterance has been parsed at phrase and clause level. The syntactic structures are represented in the form of tree diagrams, of which there are over 83,000 in the corpus as a whole. The grammatical terminology used is based for the most part on the function/form approach found in A Comprehensive Grammar of the English Language (Quirk, Greenbaum, Leech, & Svartvik 1985). The annotation was carried out using software developed by the TOSCA research group at the University of Nijmegen, under the direction of Professor Jan Aarts (Van Halteren & Oostdijk 1993). The ICE‐GB corpus is distributed with its own retrieval software, ICECUP (the ICE Corpus Utility Program), which supports complex searches of the syntactic trees (Nelson, Wallis, & Aarts 2002). Details of availability of all the ICE corpora are given at the end of this chapter.
Table 29.3 Composition of the ICE Corpora.
SPOKEN TEXTS (300 samples) | WRITTEN TEXTS (200 samples) |
Table 29.4 ICE available corpora.
Canada; East Africa; Great Britain; Hong Kong; India; Ireland; Jamaica
New Zealand; Nigeria (written texts); Philippines; Singapore; Sri Lanka (written texts); USA (written texts)
The corpora described above have been used as the basis for a very large and varied body of research, and they continue to be used in this way. The results of this research are far too numerous to cite here, though special mention should perhaps be made of the pioneering work, Computational Analysis of Present‐Day American English (Kučera & Francis 1967), based on the Brown corpus. This large‐scale, quantitative study was later replicated using the LOB corpus of British English, in Frequency Analysis of English Vocabulary and Grammar (Johansson & Hofland 1989). For comprehensive bibliographies of corpus‐based studies of English, see Glauser, Schneider, and Görlach (1993), and Fallon (2004).
The Kolhapur corpus has proved to be an especially fruitful resource for investigators of world Englishes. For an account of early work based on the corpus, see Shastri (1988). Sayder (1989) contrasts the use of the subjunctive in Indian, British, and American English, and Leitner (1992) analyzes the verbs begin and start in Indian English, in comparison with both AmE and BrE. A particularly important contribution in this context is Schneider (2000), which analyzes a range of grammatical phenomena in the Kolhapur corpus, including the subjunctive, case marking of wh‐pronouns, pro‐form do, and the indefinite pronouns in ‐body and ‐one. Comparing his findings with those from Brown and LOB, Schneider concludes that “my empirical corpus investigations have shown that no fundamental, categorical difference between Indian English and any other of the national varieties was detected, but on the other hand, there is no full identity of patterns and preferences to be observed” (Schneider 2000: 133).
The ACE corpus of Australian English has provided data for a wide range of investigations, focusing on, for example, comparisons of Australian and British usage (Peters 1993a, 1993b), the influence of AmE and BrE on Australian verb morphology (Peters 1994), the language of Australian newspapers (Peters, Collins, Blair, and Brierley 1988), and the semantics of modal verbs (Collins 1988, 1991). The two Wellington corpora – of written New Zealand English (WWC) and of spoken New Zealand English (WSC) – have supported research into a wide range of topics, including gender‐based variation (Holmes 1993), relative pronouns (Sigley 1997), and the discourse of direct and indirect speech (Yang 1997).
Whether researchers examine aspects of Indian English, Australian English, or New Zealand English, a comparison is usually made – explicitly or implicitly – with AmE and/or BrE. This is to be expected, because of the traditional dominance of these two varieties, and, on a more practical level, because of the availability of the Brown and LOB corpora. The corpora in the International Corpus of English offer scope for much more inclusive studies of English worldwide, taking account not only of first‐language varieties, but of second‐language varieties as well. ICE‐GB has been extensively used in research, initially as a “snapshot” of BrE in the early 1990s, and later in comparative studies with other varieties. Most notably, ICE‐GB formed the most important data source for both the Oxford English Grammar (Greenbaum 1996) and the Oxford Modern English Grammar (Aarts 2011).
Studies of several ICE corpora were published in a special issue of World Englishes in 2004 (vol. 23, no. 2). The papers in that volume deal with such topics as particle verbs in world Englishes (Schneider 2004), emphasizer now in South African English (Jeffery & Van Rooy 2004), and article use in contact varieties (Sand 2004). In recent years, ICE corpora have been used as the basis for many monographs on specific aspects of English grammar and on individual varieties of English. These include studies of the English noun phrase (Keizer 2007), adjunct adverbials (Hasselgård 2010), English in the Caribbean (Deuber 2014), and the syntax of spoken Indian English (Lange 2012).
When the ICE project began in 1990, one million words was considered to be more or less the standard size for all corpora being used in academic research. This was partly because corpus compilers at that time were following closely in the footsteps of LOB and Brown but also because of the storage capacity of the computers that were then available. Since then, of course, the power of computers has grown exponentially, both in terms of storage capacity and of processing speed. Computers today can store and process enormous amounts of data, sometimes running to several billion words. Another post‐1990 development is also having a major effect on corpus compilation, namely the coming of the Internet. Since it first became widely available in the 1990s, the Internet has grown to an incalculable size, and this vast amount of text is more or less freely available on every desktop. In the light of this, corpus compilers have been attracted to the Internet as a source of corpus data (Hundt, Nesselhauf, & Biewer 2007), and some have even developed systems which treat the Internet as a corpus in its own right. An example of this is the WebCorp project, which was developed at the Research and Development Unit for English Studies, Birmingham City University, UK (Renouf & Kehoe 2013). Using the online WebCorp interface, researchers can produce concordances of words and phrases drawn from the entire Internet. Obviously, the system yields a quantity of results that has hitherto been unavailable using traditional corpora.
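For readers unfamiliar with the term, a concordance lists every occurrence of a search word together with a fixed span of context on either side. The following Python fragment is a minimal sketch of that general technique applied to a plain string. It is not a description of how WebCorp itself works, and the function name `concordance` and the `width` parameter are hypothetical.

```python
import re


def concordance(text: str, keyword: str, width: int = 30) -> list[str]:
    """Return one keyword-in-context line per whole-word match of keyword."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keyword falls in one column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


if __name__ == "__main__":
    sample = "A corpus is a body of text. Each corpus has its own design."
    for line in concordance(sample, "corpus", width=20):
        print(line)
```

The context here is measured in characters for simplicity; a span measured in words would serve equally well.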
In the field of world Englishes, perhaps the most interesting recent development has been the Global Web‐Based English Corpus (GloWbE, pronounced “globe”), which was compiled in 2013 by Mark Davies and his team at Brigham Young University (Davies & Fuchs 2015). The GloWbE corpus consists of 1.9 billion words drawn exclusively from the Internet. It samples webpages from a total of 20 Internet domains, and it includes, for example, 775 million words of American and British English, 148 million words of Australian English, and 45 million words of South African English. The corpus is freely available online via a very user‐friendly interface. Users can search the entire 1.9 billion words or restrict the search to specific varieties. The system also supports collocational studies and searches by part of speech.
The sheer size of the GloWbE corpus is impressive, and this alone will attract many researchers to use it as a research tool. On the other hand, the absence of any spoken data is a major limitation, and so too is the very narrow range of written text types that has been included. A corpus consisting only of webpages is unlikely to capture the full range of linguistic phenomena that exists in any one variety of English. More important, perhaps, there must be some doubt about the exact provenance of each corpus text. Texts have been assigned to English varieties on the basis of the domain name from which they were downloaded, so that texts from the “.hk” Internet domain, for example, have been classified as containing “Hong Kong English.” This is by no means a foolproof classification, since there is no guarantee that every text in the “.hk” domain was actually written by a native of Hong Kong. The same problem applies to all the other varieties and domains, so the validity of the classification of texts into English varieties remains in doubt (Nelson 2015). Despite this, the GloWbE corpus is a valuable resource, and it will be especially useful in areas of research where the quantity of data is crucially important. Among these is the study of vocabulary and collocations. At just one million words each, the ICE corpora were not designed for these types of studies, but the 1.9 billion words in GloWbE open up exciting possibilities for new research in these areas.
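The weakness of domain‐based classification is easy to see when the procedure is written out. The following Python fragment is a sketch of the kind of assignment described above, not the actual GloWbE pipeline. The mapping `DOMAIN_TO_VARIETY` and the function `variety_from_url` are hypothetical, and only the “.hk” entry is taken from the text.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping from country-code domain to variety label.
# Only "hk" comes from the chapter; the other entries are illustrative.
DOMAIN_TO_VARIETY = {
    "hk": "Hong Kong English",
    "au": "Australian English",
    "za": "South African English",
}


def variety_from_url(url: str) -> Optional[str]:
    """Assign a variety label using nothing but the final label of the host name."""
    host = urlparse(url).hostname
    if host is None:
        return None
    top_level = host.rsplit(".", 1)[-1].lower()
    return DOMAIN_TO_VARIETY.get(top_level)


if __name__ == "__main__":
    print(variety_from_url("https://www.example.com.hk/news"))
    print(variety_from_url("https://www.example.org/news"))
```

The function never inspects the text or its author, so a page written by a visitor, or copied from an overseas source, receives the same label as one written locally.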
Varieties of English from the Expanding Circle are catered for in the International Corpus of Learner English (ICLE), which is coordinated by Sylviane Granger, University of Louvain‐la‐Neuve, Belgium. The ICLE project samples learner (EFL) English from a wide range of mother‐tongue backgrounds, including French, German, Dutch, Spanish, Swedish, Finnish, Czech, Japanese, Chinese, Polish, and Russian (Granger 1996). The ICLE corpus has been extensively used in studies of learner English, focusing, for example, on phrasal verbs (Chen 2013), the use of well in spoken English (Aijmer 2011), and article errors in the writings of advanced Arabic learners (Crompton 2011).
The collection and analysis of corpora of world Englishes will undoubtedly continue into the future, and many new insights will be revealed. However, this effort should be coordinated and systematic. Collecting more and more data, from the Internet or from elsewhere, is not a worthwhile approach in itself. The collection of new data should focus on filling the gaps that currently exist in corpora of world Englishes. For example, in both the ICE corpora and GloWbE, English varieties from Africa are still heavily underrepresented. Mair (2006) points out that the roles of small corpora (such as ICE) and very large corpora (such as GloWbE) should be “complementary.” The much larger GloWbE corpus can be used as a “monitor” corpus to cross‐check and verify results drawn from ICE data.
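Any such cross‐check between corpora of very different sizes depends on normalized rather than raw frequencies. The following Python fragment is a minimal sketch of the usual per‐million‐words normalization. The function name `per_million` is hypothetical, and the counts in the usage example are invented purely for illustration.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Convert a raw frequency into occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be a positive number of words")
    return raw_count * 1_000_000 / corpus_size


if __name__ == "__main__":
    # Invented counts: one item in a one-million-word corpus
    # and in a 1.9-billion-word corpus.
    small = per_million(12, 1_000_000)
    large = per_million(22_800, 1_900_000_000)
    print(f"small corpus: {small:.1f} per million words")
    print(f"large corpus: {large:.1f} per million words")
```

Normalization makes the two figures comparable, but it does not remove the differences in text types and provenance between the corpora.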
At a more theoretical level, some consideration must also be given to the methodology of comparing corpora. A useful starting point for such an approach is provided by Kennedy (1996), who offers an outline of the topics that can be investigated using the ICE corpora once they have been fully annotated to the same level as ICE‐GB. Kennedy’s outline is summarized in Table 29.5. As this table shows, carefully annotated corpora offer exciting possibilities for future research. This is especially true since most of the topics listed have never been systematically studied in most of the ICE varieties, and, for the most part, no comparative studies have ever been carried out on these topics. Though it is far from exhaustive, Table 29.5 might be considered the starting point for a “prospectus” for future corpus‐based research into world Englishes.
The corpora mentioned in this chapter are available as follows:
Brown, LOB, Kolhapur, ACE, Wellington
These corpora are available on a single CD‐ROM from ICAME (International Computer Archive of Modern English), at the following address: The HIT Centre, Allegt. 27, N‐5007, Bergen, Norway. E‐mail: icame@hit.uib.no
Website: http://clu.uni.no/icame
The manuals of these corpora, cited in the references, are also available on the ICAME CD‐ROM, and online at http://clu.uni.no/icame/manuals/.
ICE Corpora
Most of the currently available ICE corpora can be downloaded under license from the ICE website: http://www.ice‐corpora.uzh.ch/en.html
Table 29.5 An outline of the units and structures from morpheme to discourse that can be investigated using the completed ICE corpora.
Adapted from Kennedy (1996: 223).
Word classes: adjectives, adverbs, determiners, nouns, prepositions, pronouns, verbs, etc.
Word morphology and functions: affixation, tense, number, etc.
Word types
Lemmas
Collocations
Phrases: noun phrases, prepositional phrases, verb phrases, etc.
Clause elements: subject, object, complement, adverbial, etc.
Clause patterns: SV, SVO, SVOA, existential there constructions, etc.
Clause processes and information packaging: extraposition, clefting, fronting, passivization, negation, etc.
Sentence types: declarative, interrogative (yes/no, wh‐), imperatives, etc.
Form and function: interrogative versus question, etc.
Clause types: subordinate clauses (nominal, relative, adverbial, comparative, etc.)
Clause relationships: coordination, subordination, hypotaxis, parataxis
Discourse particles
Cohesion
Varieties and variation: lexis, grammar, and discourse in different domains; speech and writing; sociolinguistic variation; register variation; regional variation
The British ICE corpus (ICE‐GB) is available from The Survey of English Usage, University College London, Gower St, London WC1E 6BT, UK. E‐mail: ucleseu@ucl.ac.uk.
Website: http://www.ucl.ac.uk/english‐usage/
The New Zealand ICE corpus (ICE‐NZ) is available from Dr Bernadette Vine, School of Linguistics & Applied Language Studies, Victoria University of Wellington, Wellington, New Zealand. E‐mail: bernadette.vine@vuw.ac.nz.
The Nigerian ICE corpus (ICE‐NG) (written texts) can be downloaded from
http://sourceforge.net/projects/ice‐nigeria
To obtain the Sri Lanka ICE corpus (ICE‐SL) (written texts), send an e‐mail to
ice‐sl@anglistik.uni‐giessen.de
Most of the ICE corpora have been tagged with the CLAWS7 tagset and the USAS semantic tagger. To obtain access to these, please contact the author at Department of English, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR.
E‐mail: geraldanelson@gmail.com
WebCorp
WebCorp is available online at http://www.webcorp.org.uk/live/
Global Web‐Based English Corpus (GloWbE)
GloWbE is available at https://corpus.byu.edu/glowbe/
International Corpus of Learner English (ICLE)
For details of the availability of ICLE corpora, contact Professor Sylviane Granger, Centre for English Corpus Linguistics (CECL), Collège Erasme, Place Blaise Pascal 1, Université Catholique de Louvain, 1348 Louvain‐la‐Neuve, Belgium. E‐mail: granger@lige.ucl.ac.be.
Website: https://uclouvain.be/en/research‐institutes/ilc/cecl