41
How to use corpus linguistics in forensic linguistics

Janet Cotterill

1. The notion of the forensic corpus and its potential use in authorship analysis

In an age of computerisation, the use of corpora in many types of forensic linguistic analysis is becoming increasingly commonplace. Indeed, there are certain areas, such as authorship, where corpus linguistics is seen as the way forward for the identification and elimination of candidate authors. This chapter explores the issues surrounding the use of corpus linguistics in forensic linguistic analysis, including both its potential advantages and some of the methodological challenges associated with its use.

Within the field of forensic linguistics, the idea of using corpora to analyse legal casework involving texts is a relatively new and cutting-edge concept. As a methodology, it derives from pre-existing work on biblical and Shakespearean authorship, both of which have drawn on corpora of various compositions. However, as O’Keeffe et al. (2007: 20) attest, ‘authorship and plagiarism are growing concerns within forensic linguistics, for which corpora can prove a useful instrument of investigation’. The terminology here is worth noting: corpus linguistics is referred to as an instrument, much as forensic linguists tend to refer to it as a tool or a resource, since no method of analysis, corpus-based or otherwise, can guarantee the identification or elimination of authors. This issue will be revisited at the end of the chapter.

One of the major difficulties within the forensic field does not arise in the same way for the constructors of corpora of more general and widely available texts, such as newspaper articles or casual conversation (e.g. the Bank of English, the Brown Corpus or the BNC). As O’Keeffe et al. (2007) point out, the construction of corpora and the notion of sampling can be crucial. As they put it, ‘any old collection of texts does not make a corpus’. Unfortunately, and frustratingly for the forensic linguist, any old collection of texts is precisely what is provided by the police or solicitors, who have trawled the home, office and computer of a suspect for whatever texts are available. They are unaware of genre/register differences, variations in text size and temporal factors, all of which may influence the potential of texts to be analysed.

There are a range of text types which are commonly found in forensic linguistic casework. These include:

• Threat letters

• Suicide notes

• Blackmail/extortion letters

• Terrorist/bomb threats

• Ransom demands

• E-mails

• Text messages

• Police and witness statements

• Plagiarised texts

Corpora, whether general or specialised, have the potential to be used at any stage of legal proceedings. At the investigative, information-gathering stage, linguists may be brought in to comment on questioned documents. Typically, psychologists have taken on this role and linguists are often on the periphery of this type of work, with psychologists producing profiles which may assist the police in locating a possible offender. Once the trial begins, questions of authorship may be argued for and against in court. Finally, a case may go to appeal, as is increasingly happening in the UK following the creation of the Criminal Cases Review Commission (CCRC) in 1997. As Olsson (2004: 3) states:

it is becoming increasingly common for linguists to be called in to assist legal counsel at the appeal stage, either because there may be some dispute about the wording, interpretation or authorship of a statement or confession made to police, or because a new interpretation of a forensic text (such as suicide or ransom note) may have become apparent since the conviction

However, there are a number of constraints which currently restrict the use of corpus linguistics in forensic contexts. Some of these are due to the technology available to the linguist to both collect and analyse the data – this is something which may be resolved in due course as the software evolves; others relate to the text types themselves, which is potentially more of a problematic issue.

All the text types listed above pose a number of challenges for the analyst. The first concerns their length. All are typically short pieces of writing, with very few extending beyond a page or two. Indeed, texts such as e-mails and, in particular, text messages frequently consist of fewer than ten words and as such represent an analytical dilemma for the consultant. With very little language to work on, it is difficult to establish authorship, in terms of either identification or exclusion of candidate authors.

This kind of work also involves the notions of idiolect and uniqueness of expression, both of which can very effectively draw on corpora, as we will see in the case studies discussed below; however, there are additional problems with work of this kind. First, the linguist has to deal with the thorny issue of genre. Work in general linguistics which has attempted to define the characteristics of particular genres (for example Biber 1988, 1995; Stubbs 1996) has so far resulted in only partial descriptions of the register characteristics associated with certain text types. This is perhaps one of the greatest challenges in this type of work.

The first text which comes to light, and which triggers the need for a forensic linguist, is usually a questioned text: that is, a piece of writing (or speech) which comes from an unknown source. In this case, the linguist is asked to do one of two things. They may be asked by the police to consult on the text in terms of its sociolinguistic or idiosyncratic features in an investigative role. If the text stands in isolation, with no additional texts such as a series of threat letters, or if there are no candidate authors, then this is the limit of the linguist’s input. Here, corpus linguistics cannot play a very useful role, since there is only a single text to be analysed. If, however, there are a number of texts and/or one or more candidate authors, then a more detailed analysis can be carried out.

Police officers and other legal professionals are not usually aware of issues such as genre and timespan and the effect that such variables may have on the nature of the text. Thus, the forensic linguist is often presented with a set of texts, a ‘corpus’, which may differ considerably in genre, date of production and context of writing. Even such variables as mode of production – handwritten versus word-processed texts, for example – may influence the type of language produced, as well as the opportunities to edit the text and produce revised versions. During police searches of suspects’ computers and personal belongings, texts such as diaries, personal and professional letters, text messages from mobile telephones, e-mails and even greetings cards may be presented to the forensic linguist for comparative analysis with the questioned text. What the police are aiming to do is to connect the questioned text, for example a bomb threat, with the candidate author, who is typically in custody. In addition, there are tight time constraints hanging over the analyst, since any suspect arrested and cautioned must usually be charged or released within seventy-two hours. The police also often present the analyst with texts in a text-by-text fashion, so that the corpus is disjointed and the goalposts are constantly being moved. This is because additional material may emerge from searches carried out later, and because computer technicians require time to undelete files on computers where a guilty suspect may have hurriedly attempted to delete potentially incriminating texts.

A recent case (Cotterill and Grant, in preparation), commonly regarded as the largest and most serious terrorist threat since the September 11th attacks in the USA, involved an Al-Qaeda plot which led to the conviction of eight individuals based predominantly on authorship analysis. Documents detailing a huge international terrorist plot involving major targets in the UK and US had been retrieved, including plans for large-scale explosions, biological and chemical warfare and potentially thousands of victims.

The challenge for the team of forensic linguists was the identification of the author or authors of these documents in order to prove a conspiracy charge beyond reasonable doubt. This meant that an analysis of the texts of all eight of the arrested individuals was necessary. The work was carried out in the most extreme circumstances, with only seventy-two hours remaining until the suspects had to be either charged or released. During this time, the team of forensic linguists were bombarded with newly retrieved and discovered texts, often incomplete and in hard-copy form, and therefore not lending themselves to corpus analysis as they had not yet been digitised. The whole analysis took over two years and involved many dozens of documents, all of which were eventually presented in digitised form and then became amenable to corpus-assisted analysis.

2. The use of pre-existing linguistic corpora in forensic linguistics

As well as the construction of somewhat opportunistic but bespoke case-specific corpora, as in the example discussed above, pre-existing large corpora of English are extremely useful to the forensic linguist in a variety of case types. This may be in an investigative role, where the linguist is asked to comment on idiolectal, dialectal or regional features of texts which may aid police officers in identifying and locating a potential perpetrator. It is also possible to use corpora to illustrate the common meaning of a term (the difference between a legal meaning and a lay person’s ‘plain’ understanding) in cases – usually civil – of disputed comprehensibility of texts. These include texts such as the patient information leaflets included in packs of medication, instructions and warnings. Failure to fully comprehend these types of texts can lead to frustration at best and injury or death at worst (for a more detailed analysis of this type of forensic casework see Cotterill, in preparation). In examining texts like this, having access to large corpora can be invaluable, since it allows the linguist to gauge the common usage of such terms, and hence the common understanding of them, without the need to conduct a large-scale survey, which is usually impossible in terms of time and financial resources.
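The kind of check described here is straightforward to carry out against any large reference corpus. The following is a minimal sketch in Python using NLTK’s copy of the Brown corpus as a stand-in reference corpus; the term ‘drowsy’, of the sort found in medication warnings, is an invented example rather than one drawn from actual casework.

```python
import nltk
from nltk.corpus import brown
from nltk.text import Text

nltk.download("brown", quiet=True)  # fetch the corpus on first run

words = brown.words()
text = Text(words)

# Normalised frequency: occurrences per million words, the usual
# measure for comparing counts across corpora of different sizes.
term = "drowsy"
hits = sum(1 for w in words if w.lower() == term)
print(f"{term}: {hits} hits, {hits / len(words) * 1_000_000:.1f} per million words")

# A KWIC (key word in context) concordance gives the analyst a
# quick sense of the term's typical contexts of use.
text.concordance(term, width=60, lines=10)
```

Frequency alone rarely settles a comprehensibility dispute; it is the concordance lines, showing the contexts in which ordinary writers use the term, that allow the linguist to argue for a ‘plain’ understanding.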

One of the first cases involving the use of corpora actually predates the creation of large computerised corpora, and even the field of corpus linguistics itself. It was described by Jan Svartvik, who investigated the case of Timothy Evans, the lodger of serial killer John Christie. Evans was wrongly convicted of the murder of his young daughter and was hanged in 1950 (Marston 2007). Following Christie’s own subsequent conviction and execution for murder, Evans was posthumously pardoned. Using a self-constructed mini-corpus, Svartvik (1968) demonstrated that certain incriminating sections of statements supposedly produced by Evans, and used as a confession at his trial, did not match the grammatical style of unchallenged parts of his statements. Had this evidence been available to the jury, the outcome of the trial might have been different.

Thirty years later, Coulthard and his colleagues investigated the re-opened case of Derek Bentley. Bentley, a teenager with learning disabilities, was accused and convicted of the murder of a policeman during a bungled robbery, and was hanged. The main evidence against Bentley was a statement which he was alleged to have produced following his arrest. Bentley’s claim was that the statement was in fact a composite document comprising not only his own words but also those of the police officers who, he claimed, had contributed incriminating sections to it. In Goffman’s (1981) terms, Bentley’s claim was that one or more corrupt police officers were in part the authors of the text, and certainly of those parts of it which incriminated him in the crime.

Coulthard (2000) presents Bentley’s statement in full and discusses those features which appear to support Bentley’s claim. Intuitively, a number of aspects of the text were identified which seemed to suggest that (a) it was unlikely that Bentley had produced this language and, significantly, (b) the register of the text was more indicative of ‘policespeak’ than of the language of a lay person (Fox 1993). There were unusually specific lexical items, such as references to a ‘shelter arrangement’ on the roof and a ‘brickwork entrance’ to the door, which are unlikely to have been produced by Bentley given his low level of verbal competence, as well as phraseological formulations more indicative of police language than that of a lay person.

The predominant feature which drew attention was the simple word ‘then’. In the Bentley case, the word ‘then’ occurs eleven times in a statement of 582 words. In typical narratives, it may not be remarkable that this word is so common (particularly in relatively ‘simple’ narratives, for example those produced by children), since an individual describing events may well use ‘then’ to describe a series of events. The significant factor was the positioning of the word. In most of the instances, ‘then’ occurred in a medial position between subject and verb, as in the following extracts taken from the statement:

I then caught a bus to Croydon … Chris then jumped over and I followed. Chris then climbed a drainpipe to the roof and I followed … The policeman then pushed me down the stairs.

Although linguists and lay people alike are often skilled at spotting these types of unusual features, when presenting evidence to a jury, the burden of proof means that it is necessary for the expert witness linguist to present evidence which is based on more than intuition and experience of language use. In this case, Fox (1993) constructed two small parallel but contrastive corpora, a method which Coulthard terms a ‘corpus assisted analysis of register’ (Coulthard and Johnson 2007: 178), the term acknowledging the need for human engagement with the text which generally precedes any corpus analysis and which flags up areas of interest to be pursued within corpora. One sub-corpus consisted of witness statements from the Bentley case and others; the other was made up of police officer statements. Coulthard reports the results of a comparative analysis of the word ‘then’, which he describes as ‘startling’ (2000: 273); see Table 41.1.

Comparatively, Bentley’s use of the temporal form of ‘then’, calculated at one occurrence per fifty-three words, means that his statement is far closer in frequency to the police officer language than to that of the lay witnesses. To further test his hypothesis, Coulthard used the spoken sub-corpus of the COBUILD Bank of English. The word ‘then’, in both its clause-initial and clause-medial uses, was found to occur once per 500 words, aligning general spoken usage more closely with the witness statements than with the policespeak. There is also an acknowledgement of this specialised register in the US context (see Philbin 1995). Perhaps even more significantly, in the COBUILD spoken data the string ‘then I’ was found to occur ten times more frequently than ‘I then’, which occurred only once per 165,000 words, compared with once per 194 words in Bentley’s alleged statement. Coulthard concludes that:

The structure ‘I then’ does appear to be a feature of policeman’s (written) register. More generally, it is in fact the structure Subject (+ Verb) followed by ‘then’ which is typical of policeman’s register – it occurs 26 times in the statements of the … officers and 7 times in Bentley’s own statement.

(Coulthard 2000: 274)

Coulthard goes on to speculate that ‘whatever else it was, his [Bentley’s] statement was not a verbatim record of the dictated monologue’ (ibid.). Partly as a result of this evidence, Bentley was posthumously pardoned and his conviction quashed forty-five years after he was hanged. The Bentley case is a powerful illustration of one of the ways in which forensic linguistics and corpus linguistics can come together to produce compelling evidence. For further discussions of the Bentley case and others, see Coulthard (1992, 1993, 1994, 1997 and 2004).
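The arithmetic behind such comparisons is simple but worth making explicit: a feature count is normalised against text length, so that a 582-word statement can be set against a multi-million-word corpus. The sketch below illustrates the calculation in Python with invented sample strings; these are not the case data, and the tokenisation is deliberately crude.

```python
import re

def per_n_words(pattern: str, text: str) -> float | None:
    """Return N such that the pattern occurs once per N words,
    or None if the pattern does not occur at all."""
    tokens = text.split()
    hits = len(re.findall(pattern, text))
    return len(tokens) / hits if hits else None

# Invented samples: a statement-like text and a conversational one.
statement = ("i then caught a bus to croydon chris then jumped over "
             "and i followed the policeman then pushed me down the stairs")
conversation = ("and then i went home and then i saw him again "
                "then i thought nothing more of it")

for label, text in [("statement", statement), ("conversation", conversation)]:
    for pattern in (r"\bi then\b", r"\bthen i\b"):
        rate = per_n_words(pattern, text)
        result = f"1 per {rate:.0f} words" if rate else "no occurrences"
        print(f"{label:12} {pattern!r:14} {result}")
```

Even on these toy strings, the clause-medial ‘I then’ surfaces only in the statement-like sample, which is exactly the asymmetry the Bentley analysis turned on.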

3. The use of specialised individual corpora to identify or eliminate candidate authors

An interesting case is reported by Eagleson (1994), which involved a ‘farewell letter’ apparently written by a woman who had left her husband, but who had in fact been murdered by him. The letter was compared with a sample of her previous writings and a similar corpus of her husband’s. Eagleson concluded that the letter had been written by the husband of the missing woman, and when presented with this compelling evidence the husband confessed both to having written it himself and to the murder. The features identified by Eagleson in both the disputed letter and the husband’s corpus of texts included marked themes, the deletion of prepositions and the misuse of apostrophes, as well as grammatical features such as the omission of present tense inflections and of the weak past tense ending -ed. Table 41.2 summarises Eagleson’s analysis: H represents the husband’s corpus, F the farewell letter of disputed authorship, and W the wife’s set of texts.

While the majority of forensic work does not yield such startlingly neat results, Table 41.2 gives an indication of the power of using corpora to analyse texts of this type. Had these personal comparative corpora not existed or not been analysed, it is unlikely that the killer would have confessed since he had denied any involvement in his wife’s disappearance until he was presented with the parallel corpora evidence.
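Analyses of this kind amount to tallying a set of hand-picked style markers across each mini-corpus and comparing the distributions. The following sketch shows the shape of such a tally in Python; the two markers and the three sample ‘corpora’ are invented stand-ins for Eagleson’s actual features and data.

```python
import re
from collections import Counter

# Invented surface markers of the kind Eagleson describes:
# apostrophe-less contractions and an unmarked weak past tense.
MARKERS = {
    "apostrophe omission": r"\b(dont|cant|didnt|wont)\b",
    "unmarked past tense": r"\bhappen\b",  # where 'happened' is expected
}

# Toy stand-ins for the husband's corpus (H), the disputed
# farewell letter (F) and the wife's corpus (W).
corpora = {
    "H": "i dont know why it happen like that i cant explain",
    "F": "dont look for me i cant stay it happen so quickly",
    "W": "I don't know why it happened. I can't explain.",
}

for name, text in corpora.items():
    counts = Counter({feature: len(re.findall(pattern, text.lower()))
                      for feature, pattern in MARKERS.items()})
    print(name, dict(counts))

# Expected pattern: H and F share both marker types while W shows
# neither, mirroring the alignment reported in Table 41.2.
```

The evidential force comes not from any single marker but from the consistency of the alignment: F patterns with H across every feature examined, and against W.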

It is not only the use of large corpora such as COBUILD or the British National Corpus which enables forensic linguists to contribute to casework. The internet is increasingly being used as a resource for linguistic analysis, although there are certain caveats which need to be applied to its use.

4. Using the web as a reference corpus

Caution is needed when using the internet as a reference corpus, in terms of both its construction and its content. Although the web could arguably be seen as the largest corpus in existence, it is also perhaps the most corrupted. There are no controls over what is posted, there are no ‘rules’ governing its content, and certain genres (for example, media texts) are somewhat overrepresented. Nevertheless, as a resource for ‘the language of the people’, it makes it possible to search billions of words in seconds and to gauge some sense of common meanings.

The following case is a striking example of a combination of individual intuition about idiolectal features of language and the use of the internet as a linguistic corpus, and it spans some thirty years. From 1978 to 1995, Theodore Kaczynski waged a campaign of terror across the US. A disenchanted ex-university professor of mathematics who had become a recluse in a log cabin in Montana, he sent sixteen letter bombs to a variety of targeted individuals and institutions. His attacks increased in the level of violence used and resulted in three fatalities and twenty-three injuries of varying severity. He was named the Unabomber by the FBI because of his choice of targets: universities and airlines. In 1995, he contacted The New York Times with an ultimatum: he would ‘desist from terrorism’ but only if either the Times or The Washington Post agreed to publish a 35,000-word manifesto he had written entitled Industrial Society and Its Future (the text of which is available online in numerous locations, including the CourtTV website). Foster (2001) discusses the language and style of the manifesto from a non-linguist’s perspective. One of the hopes of The Washington Post and The New York Times, who both agreed to publish the manifesto, was that someone might recognise its writing style. Holt (2000), an expert on Shakespearean authorship, notes that the manifesto contains a number of distinctive features indicative of the writer’s idiolectal style. In addition to consistent self-reference using the plural form ‘we’ and/or ‘FC’ (standing for Freedom Club), the author also capitalises whole words as a means of indicating emphasis. Holt also observed that, although the text displayed ‘irregular hyphenation’, it was otherwise error-free in terms of both spelling and grammar. This was despite the fact that the author had used an old-fashioned typewriter rather than a word processor with a spelling and grammar checker.

A short time after the manifesto was published, the FBI was contacted by an individual claiming that the document appeared to have been written by his long-estranged brother. In fact, it was his wife who had intuitively recognised stylistic features of the document which rang alarm bells. The manifesto contained certain unusual expressions, such as the writer’s self-description as ‘a cool-headed logician’. David Kaczynski, Theodore’s brother, discovered a set of letters dating back to the 1970s which appeared to contain ‘similar phrasing’ to that of the manifesto. This is a clear example of individual recognition of an idiosyncratic and distinctive writing style. Following a search of his property, Kaczynski was arrested and charged with manufacturing and sending the incendiary devices. Drawing on a set of additional documents seized from the property, the FBI’s analysts determined that there were multiple similarities between the manifesto and, in particular, a lengthy letter sent to a newspaper on a similar topic.

Following this process of individual analysis and discovery, a corpus approach was adopted. The FBI carried out an analysis of the web, which was significantly smaller in the mid-1990s. Using the web as a reference corpus, they searched for a set of twelve expressions which had been used in an analysis previously carried out for the defence; see Table 41.3.

When the twelve words and expressions were entered into the search engine, a total of approximately three million hits was returned containing one or more of them. This was an unsurprising result, but clearly an extremely disappointing one for the FBI’s analyst (confirmed in personal communication with the Special Agent concerned and published by him in Fitzgerald 2004). The query was then refined to include only those documents which contained all twelve of the search terms. This produced a remarkable result: only sixty-nine documents on the entire web contained all twelve of these words and phrases. Perhaps more remarkably, every hit was a version of Kaczynski’s 35,000-word manifesto. As Coulthard points out in his discussion of the methodology employed in the Unabomber case:

This was a massive rejection of the defence expert’s view of text creation as purely open choice [see Sinclair 1991], as well as a powerful example of the idiolectal habit of co-selection and an illustration of the consequent forensic possibilities that idiolectal co-selection affords for authorship attribution

(Coulthard 2004: 433)

Although the Unabomber case used web searches highly successfully, in terms of the resulting conviction, it does illustrate one note of caution regarding the use of the web as a reference corpus. Just as statistics can be represented and misrepresented depending on how they are calculated and used, so searches can produce very different results depending on both the search terms and the search engines employed, as Tomblin (2004) has pointed out.
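The logic of the refined query is simply the difference between disjunctive matching (documents containing any of the terms) and conjunctive matching (documents containing all of them). The sketch below shows that distinction over a toy document collection in Python; apart from ‘cool-headed logician’, which is quoted in the case, the terms and documents are invented, and the real analysis of course ran against mid-1990s web search engines rather than a local collection.

```python
# Invented placeholder terms plus one expression quoted in the case.
SEARCH_TERMS = ["cool-headed logician", "at any rate", "presumably", "moreover"]

documents = {
    "manifesto_mirror": ("at any rate, a cool-headed logician would "
                         "presumably agree; moreover ..."),
    "random_blog": "at any rate, the weather was fine",
    "maths_notes": "presumably the proof holds; moreover it generalises",
}

def matches(text: str, term: str) -> bool:
    return term in text.lower()

# Disjunctive query: the 'three million hits' step.
any_hits = [name for name, text in documents.items()
            if any(matches(text, t) for t in SEARCH_TERMS)]

# Conjunctive query: the 'sixty-nine documents' step.
all_hits = [name for name, text in documents.items()
            if all(matches(text, t) for t in SEARCH_TERMS)]

print("matching any term:", any_hits)   # all three documents
print("matching all terms:", all_hits)  # only the manifesto echo
```

The move from ‘any’ to ‘all’ is what converts a sea of irrelevant hits into a near-unique idiolectal fingerprint, and it is also where the choice of search terms exerts its greatest influence on the result.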

Another example of the use of the internet involves a man who spent twenty-five years in jail for the murder of a woman he claimed never to have met; he was convicted on the basis of a confession which he claimed never to have produced. The man claimed that the statement was a composite document consisting of questions asked by a police officer together with his responses, mostly in the form of positive or negative answers. This dialogue, the man alleged, had then been reconstructed as a monologic statement. If this were the case, the statement would be inadmissible for the purposes of the court.

Coulthard (2004) employed a corpus-assisted approach in an attempt to explore this claim. He presents the following pair of sentences, which occur in both the (disputed) monologic statement and the (disputed) interview record:

(i)

Statement: I asked her if I could carry her bags she said Yes

Interview: I asked her if I could carry her bags and she said yes

(ii)

Statement: I picked something up like an ornament

Interview: I picked something up like an ornament

On this occasion, Coulthard decided to access the internet as a corpus rather than other available corpora such as the Bank of English or the BNC on the grounds that it contained more general language usage than either of the other two.

In and of themselves, the utterances above do not at first sight seem remarkable. However, as Coulthard found, neither occurred at all in an internet search. He attributes this to Sinclair’s (1991) idiom principle, which relies on the notion of preconstructed chunks, also referred to as linguistic formulae (in Wray’s 2006 terms) and as lexical priming (Hoey 2005; Hoey et al. 2007), rather than to a principle of completely free choice in a slot-and-filler approach to language use.

The internet search clearly indicates a principle of diminishing returns as the string becomes longer, as is illustrated by the extract from Coulthard’s data in Table 41.4 (Coulthard 2004: 441).

Perhaps most striking is the next string, which grows from the mundane phrase ‘I asked’, through the equally mundane ‘I asked her if I could’, until it finally reaches the string found in the disputed statement, ‘I asked her if I could carry her bags’; see Table 41.5.

Coulthard concludes that clauses such as ‘I asked her’ seem to display characteristics of pre-formulated chunks of language, but as additional words are added the frequency of occurrence reduces dramatically until it reaches zero. Ultimately, Coulthard takes an optimistic view of data elicited in this way, stating that:

From evidence like this we can assert that even the sequence as short as 10 running words has a very high chance of being a unique occurrence. Indeed rarity scores like these begin to look like the probability scores that DNA experts proudly present in court.

(Coulthard and Johnson 2007: 198)
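The diminishing-returns pattern in Tables 41.4 and 41.5 can be reproduced against any reference corpus by counting progressively longer prefixes of a target string. The following sketch does this in Python, with NLTK’s Brown corpus standing in for the web; the matching is deliberately crude (a flat substring search over space-joined tokens), whereas Coulthard’s actual queries ran against search engines.

```python
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)  # fetch the corpus on first run

# Flatten the corpus to one lower-cased string for substring counting.
corpus = " " + " ".join(w.lower() for w in brown.words()) + " "

target = "i asked her if i could carry her bags"
words = target.split()

# Count each successively longer prefix of the target string.
# Frequencies fall sharply as the prefix grows, typically reaching
# zero well before the full string.
for i in range(1, len(words) + 1):
    prefix = " ".join(words[:i])
    hits = corpus.count(" " + prefix + " ")  # crude whole-word match
    print(f"{i:2d} words {hits:7d} hits  '{prefix}'")
```

The single word is abundant, the short phrase is common, and the full nine-word string vanishes: this is the curve that underpins the claim that even quite short sequences of running words tend towards uniqueness.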

5. Limitations of corpus linguistics in forensic contexts and future challenges

One of the problems with using corpora in forensic linguistic analysis concerns the presentation of any resulting evidence to lay juries and to legal professionals. Lawyers and judges are not usually familiar with the terminology or the methodology associated with corpus linguistics, and nor is the jury. Members of a jury are ‘lay’ in two significant senses: not only are they unaware of the law until instructed on it by the judge in his/her summing up (and, as research indicates, often even after this), but they are also usually linguistically naïve.

Corpus linguistics represents a particularly tricky area to explain to a group of lay jurors, since it involves an explanation not only of the results but also of the methodology. Paradoxically, jurors are notoriously bad at assimilating evidence which appears ‘scientific’, yet find it highly persuasive. Explanations of concepts such as concordancing, levels of frequency and even the idea of a corpus per se can be problematic. Many of the ‘givens’ which linguists take for granted must be explained in minute detail before any results are presented. Thus, the value of using corpora in one’s analysis must be weighed carefully against the difficulties which will inevitably arise in the courtroom.

Important work remains to be done, but it is now starting to be undertaken by scholars using quantitative computational methodologies (Chaski 2001, inter alia, is a good example of an analyst using such methods). Coulthard and Johnson (2007), and others, advocate the creation of a reference corpus of authentic police and witness statements (among other forensic genres), both to permit

statistically valid statements in court and to protect [oneself] against the suggestion by hostile cross-examiners that, as any reasonable person will agree, a corpus of general conversation is irrelevant for comparative and normative purposes, because the linguistic behaviour of witnesses and subjects must change when they are making statements under oath

(Coulthard and Johnson 2007: 89)

A number of researchers are attempting to construct specialist corpora of this type, including corpora of text messages, suicide notes and courtroom interaction. The value of such corpora cannot be overestimated: their use is increasing and is leading to more and more convictions based on this kind of evidence.

Added to this, perhaps the most pressing issue is the fact that many forensic linguists (at least initially) operate qualitatively and intuitively. This means that the evidence produced may not meet the burden of proof. In the USA, criteria known as the Daubert criteria have been established to prevent the presentation of evidence which is considered ‘unreliable’ by the courts. Solan and Tiersma (2004) discuss the Daubert criteria, which resulted from a case involving drugs alleged to cause birth defects when given to pregnant women (Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993)). Under these standards, new, more stringent requirements for determining the admissibility of evidence were implemented:

1 whether the theory offered has been tested;

2 whether it has been subjected to peer review and publication;

3 the known rate of error; and

4 whether the theory is generally accepted in the scientific community.

Within forensic linguistics, as within corpus linguistics, some of these criteria are difficult if not impossible to meet at present. This is particularly true of the third measure: although both fields have been subject to testing, review, publication and general acceptance within their respective communities, known error rates for their methods are rarely available.

Corpus linguistics is almost certainly the best placed of all the tools at the disposal of the forensic linguist for enabling linguistic evidence to be admitted in court since, aside from forensic phonetics, which operates with sound scientific and statistical principles and has a formal accreditation process, corpus linguistics is the most ‘scientific’ method employed by linguists. Certainly, with the development of large specialised corpora of various genres on the horizon, the future of forensic linguistics, particularly in terms of authorship analysis, seems to hinge on computerised analysis and the development of robust statistical measures, both of which suggest that corpus linguistics will have an increasingly important role to play in the detection of crime.

Further reading

Coulthard, R. M. (1994) ‘On the Use of Corpora in the Analysis of Forensic Texts’, Forensic Linguistics: The International Journal of Speech, Language and the Law 1(1): 27–43. (This article, which appeared in the very first issue of the journal, was and remains seminal to the use of corpus linguistics in forensic linguistics. It explores a number of cases where corpora were employed in order to assist in showing notions of idiolect and register in forensic linguistic casework.)

Fox, G. (1993) ‘A Comparison of “Policespeak” and “Normalspeak”: A Preliminary Study’, in J. Sinclair, M. Hoey and G. Fox (eds) Techniques of Description: Spoken and Written Discourse. London: Routledge, pp. 183–95. (This important work explores the concept of a police idiolect, crucial to many cases of alleged fabricated confessions and false statements. Employing corpus linguistic methods, Fox discusses conducting comparative analysis of corpora and the potential value of this approach to forensic linguists.)

Solan, L. and Tiersma, P. (2004) ‘Author Identification in American Courts’, Applied Linguistics 25(4): 448–65. (Solan and Tiersma provide a comprehensive outline of the use of corpora in authorship in the US legal context, which is very different to that of the UK. They also provide a discussion of some of the drawbacks of such evidence and the problems of communicating such evidence to jurors.)

Woolls, D. (2003) ‘Better Tools for the Trade and How to Use Them’, Forensic Linguistics: The International Journal of Speech, Language and the Law 10(1): 102–12. (In this article, Woolls illustrates the use of his computer software Copycatch, which uses corpora, either self-constructed small-scale and specialised ones or larger, more general pre-existing collections of texts, in cases of plagiarism, disputed authorship of police records and interviews, and confessions.)

References

Baker, P. (2007) Using Corpora in Discourse Analysis. London: Continuum.

Biber, D. (1988) Variation across Speech and Writing. Cambridge: Cambridge University Press.

——(1995) Dimensions of Register Variation. Cambridge: Cambridge University Press.

Biber, D., Conrad, S. and Reppen, R. (1998) Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.

Broeders, A. (1999) ‘Some Observations on the Use of Probability Scales in Forensic Identification’, Forensic Linguistics: The International Journal of Speech, Language and the Law 6(2): 228–41.

Chaski, C. (2001) ‘Empirical Evaluations of Language-based Author Identification Techniques’, Forensic Linguistics: The International Journal of Speech, Language and the Law 8(1): 1–65.

Cotterill, J. (2010, in preparation) ‘Keep Taking the Medicine: The Comprehensibility of Patient Information Leaflets’.

Cotterill, J. and Grant, T. (in preparation) ‘The Case of Dhiren Barot: Forensic Authorship Analysis and the Prevention of Terrorism’.

Coulthard, R. M. (1992) ‘Forensic Discourse Analysis’, in R. M. Coulthard (ed.) Advances in Spoken Discourse Analysis. London: Routledge, pp. 242–57.

——(1993) ‘Beginning the Study of Forensic Texts: Corpus, Concordance, Collocation’, in M. Hoey (ed.) Data, Description, Discourse. London: HarperCollins, pp. 86–97.

——(1994) ‘On the Use of Corpora in the Analysis of Forensic Texts’, Forensic Linguistics: The International Journal of Speech, Language and the Law 1(1): 27–43.

——(1996) ‘The Official Version: Audience Manipulation in Police Records of Interviews with Suspects’, in C. Caldas-Coulthard and R. M. Coulthard (eds) Texts and Practices: Readings in Critical Discourse Analysis. London: Routledge, pp. 166–78.

——(1997) ‘A Failed Appeal’, Forensic Linguistics: The International Journal of Speech, Language and the Law 4(2): 287–302.

——(2000) ‘Whose Text Is It? On the Linguistic Investigation of Authorship’, in S. Sarangi and R. M. Coulthard (eds) Discourse and Social Life. London: Longman, pp. 270–89.

——(2004) ‘Author Identification, Idiolect, and Linguistic Uniqueness’, Applied Linguistics 25(4): 431–47.

Coulthard, R. M. and Johnson, A. (2007) An Introduction to Forensic Linguistics: Language in Evidence. London: Routledge.

Eagleson, R. (1994) ‘Forensic Analysis of Personal Written Texts: A Case Study’, in J. Gibbons (ed.) Language and the Law. London: Longman, pp. 362–73.

Firth, J. R. (1957) Papers in Linguistics 1934–1951. Oxford: Oxford University Press.

Fitzgerald, J. R. (2004) ‘Using a Forensic Linguistic Approach to Track the Unabomber’, in J. H. Campbell and D. DeNevi (eds) Profilers. New York: Prometheus Books, pp. 193–222.

Foster, D. (2001) Author Unknown: On the Trail of Anonymous. New York: Holt.

Fox, G. (1993) ‘A Comparison of “Policespeak” and “Normalspeak”: A Preliminary Study’, in J. Sinclair, M. Hoey and G. Fox (eds) Techniques of Description: Spoken and Written Discourse. London: Routledge, pp. 183–95.

Gibbons, J. (ed.) (1994) Language and the Law. London: Longman.

Goffman, E. (1981) Forms of Talk. Philadelphia, PA: University of Pennsylvania Press.

Grant, T. (2007) ‘Quantifying Evidence in Forensic Authorship Analysis’, Forensic Linguistics: The International Journal of Speech, Language and the Law 14(1): 1–25.

Grant, T. and Baker, K. (2001) ‘Identifying Reliable, Valid Markers of Authorship: A Response to Chaski’, Forensic Linguistics: The International Journal of Speech, Language and the Law 8(1): 66–79.

Hoey, M. (2002) ‘Textual Colligation: A Special Kind of Lexical Priming’, in K. Aijmer and B. Altenberg (eds) Language and Computers, Advances in Corpus Linguistics. Papers from the 23rd International Conference on English Language Research on Computerized Corpora (ICAME 23), Göteborg, 22–26 May, pp. 171–94.

——(2005) Lexical Priming: A New Theory of Words and Language. London: Routledge.

Hoey, M., Mahlberg, M., Stubbs, M. and Teubert, W. (2007) Text, Discourse and Corpora: Theory and Analysis. London: Continuum.

Holt, H. (2000) ‘The Bard’s Fingerprints’, Lingua Franca: 29–39.

Howald, B. S. (2008) ‘Authorship Attribution under the Rules of Evidence: Empirical Approaches in a Layperson’s Legal System’, Forensic Linguistics: The International Journal of Speech, Language and the Law 15(2): 219–47.

Hunston, S. (2001) ‘Colligation, Lexis, Pattern, and Text’, in M. Scott and G. Thompson (eds) Patterns of Text: In Honour of Michael Hoey. Amsterdam: John Benjamins, pp. 13–34.

Jenkins, C. (2003) ‘Stuart Campbell Thought Technology Would Stop the Police Proving that He Murdered His Niece, Danielle Jones. Instead, It Proved His Downfall’, Police Review (Police Review Publishing Co. Ltd): 28–9.

Johnson, A. (1997) ‘Textual Kidnapping – A Case of Plagiarism among Three Student Texts’, Forensic Linguistics: The International Journal of Speech, Language and the Law 4(2): 210–25.

Kredens, K. (2001) ‘Towards a Corpus-Based Methodology of Forensic Authorship Attribution: A Comparative Study of Two Idiolects’, in B. Lewandowska-Tomaszczyk (ed.) PALC 2001: Practical Applications in Language Corpora. Frankfurt: Peter Lang, pp. 405–46.

——(2002) ‘Idiolect in Forensic Authorship Attribution’, in P. Stalmaszczyk (ed.) Folia Linguistica Anglica, Vol. 4. Lodz: Lodz University Press.

McMenamin, G. (2001) ‘Style Markers in Authorship Studies’, Forensic Linguistics: The International Journal of Speech, Language and the Law 8(2): 93–7.

——(2004) ‘Disputed Authorship in US Law’, Forensic Linguistics: The International Journal of Speech, Language and the Law 11(1): 73–82.

Marston, E. (2007) John Christie (Crime Archive). London: The National Archives.

O’Keeffe, A., McCarthy, M. J. and Carter, R. A. (2007) From Corpus to Classroom: Language Use and Language Teaching. Cambridge: Cambridge University Press.

Olsson, J. (2004) Forensic Linguistics: An Introduction to Language, Crime and the Law. London: Continuum.

Philbin, P. (1995) Cop Speak: The Lingo of Law Enforcement and Crime. London: Wiley.

Risinger, D. and Saks, M. J. (1996) ‘Science and Nonscience in the Courts: Daubert Meets Handwriting Identification Expertise’, Iowa Law Review 82: 21–74.

Sarangi, S. and Coulthard, R. M. (eds) (2000) Discourse and Social Life. London: Longman.

Scott, M. (1999) WordSmith Tools. Oxford: Oxford University Press.

Shuy, R. W. (2002) Linguistic Battles in Trademark Disputes. New York: Palgrave Macmillan.

——(2006) ‘From Spam to McDonald’s in the Trademark Wars’, Language Log, 13 October.

Sinclair, J. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Solan, L. and Tiersma, P. (2004) ‘Author Identification in American Courts’, Applied Linguistics 25(4): 448–65.

——(2005) Speaking of Crime: The Language of Criminal Justice. Chicago, IL: University of Chicago Press.

Stubbs, M. (1996) Text and Corpus Analysis. Oxford: Blackwell.

Svartvik, J. (1968) The Evans Statements: A Case for Forensic Linguistics. Gothenburg: University of Gothenburg Press.

Tiersma, P. and Solan, L. (2002) ‘The Linguist on the Witness Stand: Forensic Linguistics in American Courts’, Language 78: 221–39.

Tomblin, S. D. (2004) ‘Author Online: Evaluating the Use of the WWW in Cases of Forensic Authorship Analysis’, unpublished MA dissertation, Cardiff University.

——(2009) ‘Future Directions in Forensic Authorship Analysis: Evaluating Formulaicity as a Marker of Authorship’, unpublished and ongoing PhD thesis, Cardiff University.

Winter, E. (1996) ‘The Statistics of Analysing Very Short Texts in a Criminal Context’, in H. Kniffka (ed.) Recent Developments in Forensic Linguistics. Frankfurt am Main: Peter Lang, pp. 141–79.

Woolls, D. (2003) ‘Better Tools for the Trade and How to Use Them’, Forensic Linguistics: The International Journal of Speech, Language and the Law 10(1): 102–12.

Woolls, D. and Coulthard, R. M. (1998) ‘Tools for the Trade’, Forensic Linguistics: The International Journal of Speech, Language and the Law 5(1): 33–57.

Wray, A. (2006) Formulaic Language and the Lexicon. Cambridge: Cambridge University Press.