Paul Thompson
Writing a chapter about building audio-visual corpora is a challenge as this is an area of considerable growth in corpus linguistics, computational linguistics, behavioural sciences and language pedagogy, among others, and, by the time this chapter appears, it is likely that technological advances will have moved the field substantially further forward.
In broad terms, an audio-visual corpus is a corpus that consists of orthographic transcripts of spoken language communication events, and the audio and/or video recordings of the original events. Such a corpus is likely to have links in the transcripts, which makes it possible to locate the relevant parts of the audio-visual records. In a basic form the links could consist of indexical information included with the transcripts which would allow the researcher to find the section of the recording manually, but in a more sophisticated form such annotation, included in an electronic document, would allow the user to click on a button or activate a hyperlink within the electronic version of the transcript and automatically open the file in a media player at the exact point. A further, alternative type of audio-visual corpus is one in which existing audio-visual texts, such as films, poster advertisements or online news pages, are annotated for multimodal analysis (see Adolphs and Knight, this volume). Such annotations may be organised on a range of levels, coding features such as voice, music, other sounds, graphically represented words, hand gestures, facial gestures, location, and so on, and these codes can be organised in parallel rows or columns. Baldry and Thibault (2006), for example, present a framework for transcription and analysis of multimodal texts using television advertisements and websites as example texts, and their approach can be applied to collections of multimodal texts.
A specialised audio-visual corpus may therefore contain recordings of sets of spoken language events that are used for analysis of situated language behaviours in specialised settings – such as doctor–patient consultations, child–caregiver interactions or classroom task activities – or it may contain samples of certain categories of multimodal texts. The purpose of constructing an audio-visual corpus is to make it possible to identify relationships between the non-linguistic and linguistic features of human or textual interaction, or to allow access to information that supplements the plain orthographic transcription. In this chapter, the focus is primarily on corpora in which the transcripts are linked to the video or audio recordings, or in which the video data have been made searchable for certain coded features.
Some linguistic investigations are more heavily dependent on audio-visual information than others. A clear example of such dependence is the study of sign languages, which are gestural and for which the facility to record language performance on video (frontal view of the signer, with facial expression and hand gestures clearly presented) constitutes an excellent alternative to simple orthographic representation, or to a succession of still photographs, each portraying a single gesture. Such a project does, however, also present its own challenges, as the video data have to be searchable by some means. If one is to look within a sign language corpus for the representation of ‘a large red ball’, for example, one has to have either the means to enter the orthographic form ‘a large red ball’ (which would use non-sign language means to retrieve sign language representations), or some graphic means by which a sign language user could formulate a non-orthographic query capable of locating all examples of this concept within the corpus. The British Sign Language corpus and the corpus of German Sign Language data are two major projects building large-scale audio-visual corpus resources for the sign linguistics community.
There are many purposes for which linked transcript and video data can be used. In language teaching, the presentation of communicative events visually as well as orthographically can help the learner to relate language use to the contexts in which it occurs. An audio-visual corpus can be used in the same way as a multimedia language learning package, except that it also offers the user the opportunity to retrieve multiple examples of a phrase or a grammatical structure and hear/see those examples, one after another. The EU-funded SACODEYL project, for instance, exploits clips of commissioned video recordings of teenagers, from seven different European language groups, speaking about their interests, experiences, friends and families, and the SACODEYL website (see also Chambers, this volume) contains language learning activities which prompt students to watch clips from the videos and search for answers to set questions. At one level, the video provides language learners with good listening practice, with orthographic transcripts provided so that the learner can check his or her understanding, but, on another level, the learner can search the data to locate certain features. The SACODEYL data have been annotated so that one can search by topic, grammatical point and part of speech, among other features, and one can also do concordance searches. When the concordance lines appear, it is then possible to select any one line, click ‘Go to section’, and open the relevant wider section of the transcript. The learner can then choose to view that section of the video, online.
Another example of the use of an audio-visual corpus is in the investigation of language use in education. The Singapore Corpus of Research in Education (SCoRE) project at the Centre for Research in Pedagogy and Practice, National Institute of Education, Singapore, is collecting recordings of classroom interactions in a variety of subject areas (English, Mandarin, Malay, Tamil, Maths, Science) at different levels of education in Singapore. The corpus interface allows the user to search for words or phrases in the corpus and then choose to view a video clip (if available) or listen to an audio clip. The corpus data have also been annotated on a number of levels: it is consequently possible to search by part of speech, by semantic category or by syntactic, pragmatic or pedagogical features. The recordings have been divided into speaker turns, and for each turn there is a sound file. For any search, the user can choose to receive the results as turns (in other words, with each search word shown within the speaker’s full turn) rather than as KWIC concordances; if this option is chosen, the user is given the text of each turn in which the search items occur and also a link to the audio or video file. In addition to access to the audio-visual material, the interface also generates statistics on the frequency of occurrence of each feature in each file, both in raw terms and as a percentage of the entire file.
Constructing an audio-visual corpus involves providing the links between the transcript and the audio or video files. In the previous example, that of the SCoRE, the corpus developers have devoted an enormous amount of time, resources and expertise to preparing the corpus. Recording the data is in itself a major task, but after that the recordings have to be transcribed and speaker turns identified. Audio files for each turn are then created and given unique identifying names. The transcripts are annotated for the various features mentioned above, and the information stored in a searchable database. The interface then has to be built, trialled, revised and extended, exploiting existing technologies. Not every audio-visual corpus will have the same levels of multilayered annotation as the SCoRE but it has to be recognised that working with audio-visual corpora is a demanding enterprise.
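To make the turn-splitting stage more concrete, the sketch below (not the SCoRE team’s actual pipeline) assumes that the transcription stage has already produced start and end times in seconds for each speaker turn; it then cuts one clip per turn from the master recording using ffmpeg and writes a simple index file that a search interface could consult. The file names, turn contents and index format are illustrative assumptions.

```python
# Hypothetical sketch: cut one audio clip per speaker turn and build a simple index.
# Assumes ffmpeg is installed and that turn times (in seconds) come from the transcripts.
import csv
import subprocess
from pathlib import Path

turns = [
    # (turn_id, speaker, start_s, end_s, text) -- illustrative values only
    ("T0001", "Teacher", 0.0, 6.4, "right, open your books at page twelve"),
    ("T0002", "Pupil", 6.4, 9.1, "which page was that?"),
]

master = Path("lesson01.wav")
clip_dir = Path("clips")
clip_dir.mkdir(exist_ok=True)

with open("turn_index.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["turn_id", "speaker", "start", "end", "clip", "text"])
    for turn_id, speaker, start, end, text in turns:
        clip = clip_dir / f"{master.stem}_{turn_id}.mp3"
        # -ss and -to select the segment; encoding to mp3 keeps the clips small
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(master), "-ss", str(start), "-to", str(end), str(clip)],
            check=True,
        )
        writer.writerow([turn_id, speaker, start, end, clip.name, text])
```

The resulting index can then be loaded into whatever database or search layer the project uses, with each record pointing to its clip file.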
An alternative way to work with audio data is to use a popular concordance program such as WordSmith Tools (Scott 2008) with a corpus of transcripts and audio recordings (see chapters by Scott and Tribble, this volume). Such a solution might be more appropriate for end-users who are trained in the use of the particular computer program for corpus analysis work and who, on a specific investigation, require access to the audio files for closer analysis. In the case of a study of phraseology in seminar talk, for example, analysts may want to be able to do concordance searches in a corpus of seminar transcripts for a variety of lexical chunks. Within WordSmith Tools, provided the corpus has been prepared in advance and the program’s tag settings have been configured, the user can click in the Tag column of WordSmith Tools Concord to activate the audio player at the right point, and hear the intonational contours of the lexical chunks. To prepare the files, one needs to insert tags into the transcripts that refer to the audio recordings (the default audio file formats supported in WordSmith Tools are .mp3 and .wav but other formats can be accommodated). An example of the tagging is as follows, where the first tag is placed at the part of the transcript referred to and it identifies the .mp3 file that is to be played, while the closing tag indicates where the recording ends:
<soundfile name="ah02e001.mp3">on a double-sided sheet and once again i haven’t put a summary on this one but what i have put</soundfile>
The Help files for the program offer some guidance in this, but, again, it must be recognised that this is a time-consuming task, and there are several complexities involved. The files can be set up in such a way that it is possible to listen to small clips of the audio files, as in the above example, but this requires creating many small files, with a high degree of precision, from the original audio recording. The more fine-grained the detail, the more time-intensive the task, but without fine granularity the corpus may be too limited in its uses. Another point that needs to be taken into account is that annotation of the data in order to make it usable in WordSmith Tools would not necessarily make it usable in other applications. In other words, the corpus is then tied into a particular package, when a more useful solution would be to make it usable in a range of programs.
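As an illustration of how such tagging might be generated rather than typed by hand, the following sketch wraps transcript spans in tags of the kind shown above, assuming that the clip files have already been created and that the span of text corresponding to each clip is known. The clip names and the second span of text are invented for the example, and the exact tag syntax expected by WordSmith Tools should be checked against the program’s Help files.

```python
# Hypothetical sketch: wrap transcript spans in <soundfile> tags of the kind shown above.
# The clip names and the second span are invented; check the exact syntax WordSmith expects.
spans = [
    ("ah02e001.mp3", "on a double-sided sheet and once again i haven’t put a summary "
                     "on this one but what i have put"),
    ("ah02e002.mp3", "is a list of the main points to be covered today"),
]

tagged = [f'<soundfile name="{clip}">{text}</soundfile>' for clip, text in spans]

with open("lecture_ah02_tagged.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tagged) + "\n")
```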
So far we have proposed a number of reasons why a researcher might want to build an audio-visual corpus, and we have identified some of the ways in which a researcher might link audio and video inputs with orthographic transcripts. The purpose of the rest of this chapter is to give an overview of what the process of building an audio-visual corpus entails, from initial conception through project design to data collection, processing and finally the development of tools and interfaces for exploitation. Design criteria and data collection are discussed in Section 2, and transcription and annotation issues are reviewed in Section 3.
There are several tools available for the development of audio-visual corpora which make the job of linking points in the transcript to points in the video and audio files much easier. Some of these tools tie the developer into the proprietary system, while others use systems which have a higher degree of potential for interchangeability. A number of these tools will be discussed in Sections 3 and 4. As suggested in the first paragraph of this chapter, technology is advancing quickly, and it is dangerous to provide too much information on specific tools and platforms, so the discussion below will not attempt to be exhaustive. It is useful at this point to suggest that XML technologies offer flexibility (the ‘X’ in XML stands for ‘extensible’) and power, and that with researchers now starting to build better tools and interfaces for handling XML documents, it is probable that XML will become a standard for audio-visual corpora in the future. The final section of this chapter looks towards the future and speculates on what advances may be made in the coming years.
Corpus design is discussed in detail elsewhere in this volume (see Reppen, this volume, who discusses key considerations in building a corpus, and Koester, this volume, who deals with specialised corpora). Before collecting data through video recording, it is essential that appropriate ethical procedures are followed. Where the participants can clearly be identified through their physical features (on video) or acoustic features (through audio), they must be asked to provide informed consent, and the researchers need to decide in advance what the data are to be used for and to ensure that the data will be used only for the purposes stated. In some cases the data will be used only by the research team and it is therefore easier to preserve anonymity, but if the audio and video recordings are to be made public in any form (such as in conference presentations or on the internet), permissions must be obtained before the data are collected. It is advisable to consult a legal expert in cases where the video recordings are to be put into the public domain. It is also important to consider carefully, in advance of data collection, what possible uses the data may be put to, as it is difficult to return to all participants after the data collection in order to obtain consent retrospectively.
Data for such purposes are most likely to be collected in pre-determined locations, such as a room that has been set up specifically for the purposes of recording. For good audio recording it may be necessary to use more than a single microphone, and one solution for recording group conversations is to record each participant individually. Perez-Parent (2002) collected recordings of pupil–staff interactions in British primary school classrooms, in which each speaker was recorded on a mini-disc recorder with lapel microphone. The recordings were then brought together in a multichannel version, which allowed for much clearer distinction of each speaker’s contribution, but Perez-Parent notes that, counter to expectation, the mini-disc recorders functioned at slightly different speeds and therefore the recordings had to be further processed, with some ‘stretching’ of the files, in order to synchronise them.
Quality video recording in particular requires good lighting and camera work, as well as good equipment. Decisions about the camera angles to take and the lighting required will depend on the purposes of the project. In the Headtalk project conducted at the University of Nottingham, UK (see Adolphs and Knight, this volume), the focus is on the uses of head nodding and hand gestures in conversation. The team has developed techniques for the automatic analysis of the video data, which identifies head and hand movements and tracks the movements. To make the head and hand identifiable, however, it was necessary to ensure that people in the video recordings were seated, with face towards the camera, in a well-lit location and wearing long sleeves, so that it was possible to distinguish each hand clearly from the rest of the arm. One of the elements to be considered in preparing for good data collection, then, may be that of visual detail – what clothes the participants are wearing, what the background is, how well the speakers’ features stand out against that background, whether lighting is required to improve the visibility of key features, and so on.
Another example is the AMI (Augmented Multi-party Interaction) Meeting Corpus, which consists of 100 hours of meeting recordings. The data in the corpus are drawn from video material recorded with a number of cameras in a given smart room, set optimally to capture different shots of the participants, and from a range of audio captures on several microphones, which then have to be synchronised. The cameras are set to capture each participant’s facial and hand gestures (most participants are seated around the table and can only be seen from the midriff up) and the view provided is a fish-eye lens view, so that more peripheral information can be gathered. For analysis purposes, three or four camera angles can be placed in a row alongside each other on the screen, so that a more comprehensive perspective of the event can be captured. In addition to the individual view camera shots, there are also cameras set to capture the whole room, as well as output from a slide projector and an electronic whiteboard.
The Computers in the Human Interaction Loop (CHIL) project (Mostefa et al. 2007) went further by recording lectures and meetings in five different locations, each of which was a smart room. The main purpose of the project was to support the development and evaluation of multimodal technologies in the analysis of realistic human interaction, and the project added the variation of location as an extra challenge. In each location the minimum specifications for the data collection set-up included at least eighty-eight microphones capturing both close-talking and far-field acoustic data, four fixed cameras (one in each corner of the room), a fixed panoramic camera under the room ceiling and one active pan–tilt–zoom camera. The size of the project is impressive, with huge quantities of data collected and processed. While not immediately replicable, it provides a full and useful range of evaluations of the technologies for data capture and for semi-automatic to automatic analysis of the data.
The quality of data collected for an audio-visual corpus will depend not only on the positioning and number of recording devices but also on the equipment used, on the processes by which data are transferred, synchronised, saved and transformed, and on the skill of those who capture the data. It is not possible to examine these in detail here, but it should be noted that, generally speaking, it is advisable to capture data at the highest resolution, to make use of compression technologies at later stages, when smaller file sizes and faster transfer times are required, and to keep the high-resolution recordings as archive material.
One of the first decisions to be made is which transcription and spelling conventions to use. The choice will be determined to a large extent by the nature of the research and, in cases where a corpus is being developed as a resource to be placed in the public domain, by predictions of the range of potential uses for the corpus. For more detailed discussion of the issues, see Reppen, and Adolphs and Knight, this volume.
Consistency is a main concern whatever the system chosen. The team responsible for the transcription needs to set up a shared document specifying the conventions to be used. This document sets out the agreed conventions but is subject to addition and amendment as the team encounters problematic cases of spelling or coding that have to be decided on during the course of transcription. Where the members of the team are working in geographically dispersed locations, it is advisable to maintain this document through an online document-sharing facility or a discussion wiki.
Transcription and coding can be performed concurrently or sequentially: one approach is to produce an orthographic transcription quickly and then use this as the basis for one or many layers of annotation, while another is to insert the time stamps (and possibly other levels of annotation) at the same time as the transcription is made. If working with CLAN, for example, the tool developed for use with the Child Language Data Exchange System (CHILDES) database, one can insert the time stamps directly into the transcripts from within the program. Alternatively, working with Praat, the free research tool for the synthesis, analysis and manipulation of speech, the transcriber can work from the spectrogram to the orthographic (or other) transcription and link the two at whatever level is required (for example, phoneme, word or utterance).
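By way of illustration, the sketch below writes a minimal utterance-level Praat TextGrid from a set of timed transcript lines, assuming the alignment times are already known. The utterances and times are invented, and the exact field layout should be checked against Praat’s own documentation.

```python
# Hypothetical sketch: write an utterance-level Praat TextGrid from timed transcript lines.
utterances = [
    (0.0, 6.4, "right, open your books at page twelve"),
    (6.4, 9.1, "which page was that?"),
]
xmax = utterances[-1][1]

lines = [
    'File type = "ooTextFile"',
    'Object class = "TextGrid"',
    "",
    "xmin = 0",
    f"xmax = {xmax}",
    "tiers? <exists>",
    "size = 1",
    "item []:",
    "    item [1]:",
    '        class = "IntervalTier"',
    '        name = "utterance"',
    "        xmin = 0",
    f"        xmax = {xmax}",
    f"        intervals: size = {len(utterances)}",
]
for i, (start, end, text) in enumerate(utterances, start=1):
    lines += [
        f"        intervals [{i}]:",
        f"            xmin = {start}",
        f"            xmax = {end}",
        f'            text = "{text}"',
    ]

with open("lesson01.TextGrid", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```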
When working with video data, some transcribers prefer to work with the audio input first as they find the visual mode distracts their attention from the oral, while other transcribers have a preference for a bimodal view of the event, on the basis that paralinguistic, gestural and other features help them to make sense of the audio input.
For the transcriber who is used to the physical audiocassette transcription machine there is a simple program called Soundscriber, developed by Eric Breck for the MICASE corpus project. This program plays audio and video files and has a ‘walk’ facility which plays a chunk of the file (say, four seconds long) three times on a loop and then moves to the next chunk, with a slight overlap built in.
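The ‘walk’ behaviour can be pictured as a simple schedule of repeated, slightly overlapping playback windows. The sketch below merely computes such a schedule (it does not play any audio), with the chunk length, number of repetitions and overlap as assumed parameters rather than Soundscriber’s actual settings.

```python
# Hypothetical sketch of a 'walk' schedule: each chunk is heard several times,
# then the window advances, keeping a small overlap with the previous chunk.
def walk_schedule(total_s, chunk_s=4.0, repeats=3, overlap_s=0.5):
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        for _ in range(repeats):
            yield (start, end)          # a player would loop over this span
        if end >= total_s:
            break
        start = end - overlap_s         # back up slightly before moving on

for start, end in walk_schedule(12.0):
    print(f"play {start:.1f}s to {end:.1f}s")
```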
The programs mentioned in the previous paragraphs are primarily for the transcription of audio data. A comprehensive set of links to annotation tools, including transcription tools, can be found on the Linguistic Data Consortium website, where there is a separate page for annotation tools used for the mark-up of gesture. It is to the latter that we now turn, taking two examples of existing tools.
Michael Kipp has developed a video annotation and analysis tool called Anvil. This program presents multiple views: the video input in one window (or more, if there is more than one video input), the video controls in another window, a description of the gesture codes applied to the current stretch of the video and, below all this, the transcription within a multilevel representation similar to a musical score. As the video plays, the transcription lines move past too, and at any point in the transcription the researcher can see the multiple levels of annotation applied. This window can also present speech waveforms and Praat intensity diagrams (Anvil imports Praat files directly). In addition to Praat files, Anvil can also import Rhetorical Structure Theory (RST) files from the RST program made by O’Donnell (for coding clause relations). Anvil is an XML tool and the program produces XML files, which gives the potential for interchangeability.
A commercial program that can be used for the annotation of behavioural features of video data is Observer XT, which supports the creation of timestamped state and event codes, either in independent layers or in layers related in a hierarchical decomposition. As with Anvil, it is possible to open several windows together, to see the video, the timeline, the codes and the speech waveform alongside the transcription, and also to export to XML. An additional advantage of the program is that it offers the facility to conduct collaborative coding of the data, with members of a team working independently in space and time.
Two toolkits that are Open Source at the time of writing and which create XML files are the NITE Toolkit, used on the AMI project (see Carletta et al. 2003), and EXMARaLDA. The NITE Toolkit is a set of libraries and tools for the creation, analysis and browsing of annotated multimodal, text or spoken language corpora, and it can represent both timing and rich linguistic structure. It also contains libraries for developers and a number of end user tools. The EXMARaLDA project has developed a set of concepts and tools for the transcription and annotation of spoken language, and for the creation and analysis of spoken language corpora. The tools are Java programs which can be used for editing transcriptions in partitur (musical score) notation, and for merging the transcriptions with their corresponding recordings into corpora and enriching them with metadata (see also Adolphs and Knight, this volume). A demonstration of data that have been marked up using EXMARaLDA can be found on its website.
The discussion so far has concentrated on processes and technologies for annotating data without providing an example of what annotation frameworks might be used with multimodal data. Pastra (2008) introduces a framework based on Rhetorical Structure Theory which describes the semantic interplay between verbal and non-verbal communication. The framework, called COSMOROE, has three core relations (equivalence, complementarity and independence) to describe the relationship between verbal and non-verbal content, and for each core relation there are sub-types. This framework, it is claimed, provides a ‘language’ for investigating cross-channel dialectics, and is clearly developed for a purpose. The framework has been implemented on a corpus of TV travel programmes and the data were annotated using Anvil.
The annotation scheme for the AMI corpus (see Section 2 above) describes individual actions and gestures on four ‘layers’: head gestures, hand gestures (further separated into deictic and non-deictic), leg gestures and trunk gestures. The trunk events, to take an example, are coded as one of the following: shrug, sit_upright, lean_forward, lean_backward, other_trunk, no_trunk, or off_camera. The coding is added using one of the tools in the NITE Toolkit, the Event Editor, and it is entered into an XML file that is created purely for trunk gesture information. In other words, each layer of coding is held in a separate file.
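As a rough illustration of the principle of one layer per file (using invented element and attribute names rather than the actual AMI/NITE schema), the sketch below writes a set of timestamped trunk-gesture events to their own XML file.

```python
# Hypothetical sketch: one annotation layer (trunk gestures) held in its own XML file.
# Element and attribute names are invented, not the AMI/NITE schema.
import xml.etree.ElementTree as ET

events = [
    ("lean_forward", 12.40, 15.85),
    ("shrug", 15.85, 16.30),
    ("sit_upright", 16.30, 24.10),
]

root = ET.Element("trunk-layer", meeting="meeting01", participant="A")
for label, start, end in events:
    ET.SubElement(root, "trunk-event", type=label, start=f"{start:.2f}", end=f"{end:.2f}")

ET.ElementTree(root).write("meeting01.A.trunk.xml", encoding="utf-8", xml_declaration=True)
```

A parallel file would hold the head-gesture layer, another the hand-gesture layer, and so on, with the shared timeline allowing the layers to be aligned at query time.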
In programs such as Anvil or Observer XT, the interface is built into the program, which means that the user must possess a licence for the program and have it installed locally. In some projects, however, the aim may be to make the corpus available to the wider research community in a more independent manner, and the most likely medium for this is the world-wide web. This poses a number of challenges, including the following:
• Download speeds
• File formats
• Provision of adequate flexibility.
The corpus builder needs to consider the limitations of access to the internet for potential users of the corpus, particularly the differing speeds of data transfer and the frequent congestion of the network. When investigating an audio-visual corpus, the user does not want to wait several minutes for a video to open in the local browser; preferably the video should open almost immediately. Clearly there is much to be said for creating lighter data, particularly in the size of the video files that are called for when a hyperlink is clicked: a file which is 100 MB in size will take much longer to send and load than a 10 MB file. One solution is to use video streaming, a technique for transferring compressed video data to a computer over the internet in a continuous flow so that a user can begin viewing it before the entire file has been received. One factor affecting the choice between streaming and non-streaming video is whether the corpus holder wants to prevent the video being held, temporarily at least, on the user’s computer: streaming video prevents this, but if the file is downloaded to the user’s computer it is possible that the user will save a copy locally.
The second problem is that of file formats. At the time of writing, there is a variety of audio and video file formats, such as RealAudio (.ra), Shockwave Flash (.swf), QuickTime (.mov), Audio Video Interleave (.avi) and Windows Media Video (.wmv). One’s choice of video file format will be partly determined by the quality of the picture and by the size of the files produced, but it will also be affected by the currency of the player required for playback of the file. In most cases, video player plug-ins can be downloaded for popular internet browsers, but the corpus developer will probably want to choose a media player that is widely used and that is likely to have a long life (with new plug-ins regularly created for newer versions of the browsers).
The Scottish Corpus of Texts and Speech (SCOTS) website contains video data that are accessible through a browser. The transcripts are segmented into tone units and the user is able to click on any given point in the transcript, then select the video view icon from the bottom of the screen and activate the video at that point in the file. The video is activated by a JavaScript command that communicates with a QuickTime plug-in. The video starts playing at that point and then continues until manually stopped. The benefit of this method is that the mark-up of the document is relatively simple: a timestamp is recorded for the beginning point of each segment, and the HTML for the document has an identifier which is linked to that timestamp. It is not necessary to add information about the closing point of each segment (although, technically, it would not be difficult to derive that information from the timestamp for the next segment). This is an approach taken on several corpus websites which provide access to audio or video files.
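The underlying mark-up strategy can be sketched as follows: each tone unit carries its start time, and clicking it seeks the player to that offset and lets it run on. The example below generates such a page around a standard HTML5 video element rather than the QuickTime plug-in used on the SCOTS site, and the transcript lines, file names and element names are invented for the purpose.

```python
# Hypothetical sketch: generate a transcript page in which each tone unit carries its start
# time and clicking it seeks an HTML5 <video> element (the SCOTS site itself uses a
# QuickTime plug-in; the transcript lines and file names here are invented).
segments = [
    (0.0, "right so we were saying about the weather"),
    (3.2, "it was awfully cold yesterday"),
    (6.7, "it was, it was freezing"),
]

spans = "\n".join(
    f'  <span class="seg" data-start="{start}">{text}</span>' for start, text in segments
)

page = f"""<!doctype html>
<html>
<body>
<video id="clip" src="conversation01.mp4" controls></video>
<div id="transcript">
{spans}
</div>
<script>
  const video = document.getElementById('clip');
  document.querySelectorAll('.seg').forEach(seg => {{
    seg.addEventListener('click', () => {{
      video.currentTime = parseFloat(seg.dataset.start);  // jump to the segment's timestamp
      video.play();                                       // play on until manually stopped
    }});
  }});
</script>
</body>
</html>"""

with open("transcript01.html", "w", encoding="utf-8") as f:
    f.write(page)
```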
Some researchers, however, may prefer to have only a smaller section played, either because this is seen to be a more economical way to transfer data from one source to another (if the user is going to activate a video at a given point and then play the video straight through, then all of the video file has to be transmitted, potentially), or because the user is interested in working intensively with that section. Approaches that can be taken here are:
• Split the audio file into short chunks, in an arbitrary manner.
• Split the files at selected points, such as at turn boundaries.
• Select parts of a recording on the grounds of a query.
The COLT corpus team at the University of Bergen made the sound files for the corpus available through a web interface (Hofland 2003). The sound files were split into ten-second chunks (this can be done automatically using a file-splitting command), and timestamps were then added to the files at the break points. The ten-second files load reasonably quickly and the method of aligning text to sound is efficient, but one drawback is that a ten-second extract does not necessarily have natural boundaries.
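The alignment logic for such fixed-length chunks is straightforward, as the following sketch shows: given a timestamp from the transcript, the chunk that contains it and the offset within that chunk can be computed directly. The chunk length, naming scheme and rounding are assumptions for illustration, not details of the Bergen implementation.

```python
# Hypothetical sketch: map a transcript timestamp onto a pre-cut ten-second chunk
# (recording01_000.mp3, recording01_001.mp3, ...) and the offset within that chunk.
CHUNK_S = 10.0

def locate(timestamp_s, stem="recording01"):
    index = int(timestamp_s // CHUNK_S)
    offset = round(timestamp_s - index * CHUNK_S, 2)
    return f"{stem}_{index:03d}.mp3", offset

print(locate(47.3))   # ('recording01_004.mp3', 7.3)
```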
The SCoRE (mentioned in Section 1 above) has employed the second method: individual sound files have been created for each of the speaker turns in the original audio files. The interface employs JavaScript to activate the media player (built around the Adobe Flash player) to play the sound files. Each turn in the transcripts is coded with information about the corresponding sound file, and when a turn is retrieved and displayed in a search results page a hyperlink to the sound file is created. It is this link which retrieves the sound file and opens the media player. Once the media player opens, it automatically plays the file. The player controls allow the user to manipulate the file, with play and pause functionality.
A similar approach, but with added sophistication, exemplifying the third option of selecting parts of a recording through timestamp information, is taken in the design of the GLOSSA corpus query interface created at the Tekstlab at the University of Oslo and used by two corpus projects: a corpus of the Oslo dialect of Norwegian, and a Scandinavian dialect corpus of five Nordic languages (Norwegian, Swedish, Danish, Icelandic and Faroese). Of the many functions available in the interface, the playback features are described here (for a screenshot, see Andersen, this volume).
The video picture is displayed in the top right corner through a QuickTime plug-in, with controls to the right of the video image which allow the viewer to start and stop the playback. Pressing either button opens up a digital counter and the user can change the start and end points of the clip using these counters. At the bottom are the concordances for the search term, and the small icons at the beginning of each line allow the user to activate either the audio or video playback. On the left side of the video player is the relevant section of the transcript – here showing the selected concordance line and three more before and after (this option was selected using the ‘context’ menu above the start control). Each line of the transcript has been given a timestamp, in the preparation of the corpus. The context drop-down menu is a powerful feature as it allows the user to expand the context of the utterance in the same way that one can in a KWIC concordance program, by asking to be shown more data both before and after the line, but in this audio-visual corpus the user can access more lines both in the transcript and also in the video.
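The logic behind this kind of context expansion can be sketched as follows: each transcript line carries a start and end time, and widening the context simply means taking more lines on either side of the hit and deriving the clip boundaries from the first and last timestamps in the window. The transcript lines and function names below are invented for illustration; this is not the GLOSSA implementation.

```python
# Hypothetical sketch: widen the context around a concordance hit and derive the
# corresponding clip boundaries from the timestamps of the surrounding lines.
lines = [
    # (start_s, end_s, speaker, text) -- invented transcript lines
    (12.0, 14.5, "A", "so we went up to the cabin"),
    (14.5, 16.0, "B", "was that last summer?"),
    (16.0, 19.2, "A", "yes we were there in july"),
    (19.2, 21.0, "B", "that sounds lovely"),
]

def context(hit_index, n_before=1, n_after=1):
    lo = max(0, hit_index - n_before)
    hi = min(len(lines), hit_index + n_after + 1)
    window = lines[lo:hi]
    clip_start, clip_end = window[0][0], window[-1][1]
    return window, (clip_start, clip_end)

window, clip = context(2)
print(clip)                      # (14.5, 21.0): start/end points for the video player
for start, end, speaker, text in window:
    print(speaker, text)
```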
As suggested earlier, it is difficult to make predictions about the technologies to be used in the future, but there seems to be an increasing adoption of Extensible Markup Language (XML) solutions in the development of corpora and of tools for querying corpora. Mention has been made of the NITE Toolkit used on the AMI project and of EXMARaLDA, which use XML for creating the corpus files, and it is likely that more projects will follow suit (although it should also be noted that XML is criticised for its ‘bulkiness’, and that alternatives exist).
One prediction that can be made with confidence is that data transfer speeds and storage capacity are going to increase rapidly, and this will change current conceptions of the size and potential of such corpora. As such, the Human Speech Project (HSP), based at the Massachusetts Institute of Technology, may be a sign of the future. This project follows the language development of a single child over the first three years of life. To gather the data, cameras and microphones have been set up in all rooms of the house, and these recording devices run from morning to evening every day. While the project itself is so specialised that it is unlikely to be replicated, the technologies that have been developed to process the huge quantities of data suggest that automatic processing of visual data is likely to be an area of major development in the creation of audio-visual corpora. First, with so much video data to be examined, it is necessary to identify which parts of the data require attention, and so the HSP team have created methods for automatically reading the video data and detecting which cameras are picking up movement. On the basis of this information, the human coders are able to focus their attention on the video material that is relevant to them. Second, in order to speed up the job of transcribing all the oral interaction, speech recognition tools are used to provide a rough initial transcription of the speech, and a human transcriber then listens to the recordings and corrects the transcription as necessary. The possibility of creating speech recognition programs that can accurately convert audio signals into words remains remote (particularly in the case of spontaneous group talk), but there has been substantial progress in improving the quality of the output. Developers of audio-visual corpora are likely to contribute to, and benefit from, advances in the semi-automatic processing, transcription and annotation of data.
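The idea of detecting which cameras are picking up movement can be illustrated with a simple frame-differencing sketch: successive frames are compared and a camera is flagged as active when the average pixel change exceeds a threshold. The synthetic frames and the threshold below are assumptions for illustration, not the HSP team’s method.

```python
# Hypothetical sketch: flag the frames in which a camera picks up movement by comparing
# successive frames; synthetic frames stand in for real video input.
import numpy as np

def active_frames(frames, threshold=5.0):
    """Yield indices of frames that differ noticeably from the previous frame."""
    previous = frames[0].astype(np.int16)
    for i, frame in enumerate(frames[1:], start=1):
        current = frame.astype(np.int16)
        if np.abs(current - previous).mean() > threshold:
            yield i
        previous = current

rng = np.random.default_rng(0)
still = rng.integers(0, 256, (48, 64), dtype=np.uint8)                 # a static scene
moved = np.clip(still.astype(np.int16) + 60, 0, 255).astype(np.uint8)  # a sudden change
frames = [still, still.copy(), moved, still.copy()]
print(list(active_frames(frames)))   # [2, 3]: change appears at frame 2 and reverts at 3
```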
The HSP team have observed that language use is often connected to location and activity. Certain communicative events tend to be enacted in the kitchen, for example, such as at the time that the parents make coffee. Audio-visual corpora make it possible to search for the relationships between location, activity and language use in ways that are unique and these observations may lead to new developments in linguistic theory. Such work may offer further confirmation for those who have previously shown links between language and physical contexts of use (e.g. Mitchell 1957).
As observed in Section 2 above, one serious constraint on the creation of audio-visual corpora to be placed in the public domain is the need to obtain informed consent from all participants who are recorded, particularly where video information is captured. Projects such as the AMI corpus and the CHIL corpus have therefore worked with participants in controlled environments. At the same time, linguistic and behavioural researchers will need to gather information about language in use in natural settings. A possible solution in coming years may be to use the techniques of motion tracking and face and gesture analysis (see the Headtalk project, part of the Nottingham MultiModal Corpus; Adolphs and Knight, this volume) to build models of human physical activity from video input (see, for example, Kipp 2004), and then to convert these models into anonymised computer-generated animated figures that behave in the ways the original subjects did, without compromising their identity.
Other areas in which there is likely to be development are in the retrieval of information, and also in the playback and manipulation of the video data. First, if there are to be more XML corpora, then researchers are going to need more powerful XML-aware corpus query tools. Popular text-concordancing programs may have some limited capacity to work with audio files, but cannot cope with multi-layered annotation in XML. As discussed above, there is a lack of standardisation in video file format, and therefore also in the technologies required for playback. It is possible that in future years there may be a move towards standardisation, but at the same time the likelihood of success in such an endeavour is slim.
(2007) ‘Multimodal Corpora for Modeling Human Multimodal Behaviour’, special issue of Language Resources and Evaluation, 41(3–4): 215–429. (This special issue has a wide range of articles on multimodal corpus development, annotation and analysis projects.)
Baldry, A. and Thibault, P. (2006) Multimodal Transcription and Text Analysis. London: Equinox.
Carletta, J., Evert, S., Heid, U., Kilgour, J., Robertson, J. and Voormann, H. (2003) ‘The NITE XML Toolkit: Flexible Annotation for Multi-Modal Language Data’, Behavior Research Methods, Instruments, and Computers 35(3): 353–63.
Hofland, K. (2003) ‘A Web-based Concordance System for Spoken Language Corpora’, paper presented at Corpus Linguistics 2003, Lancaster; available at ucrel.lancs.ac.uk/publications/CL2003/papers/hofland_abstract.pdf (accessed 5 June 2009).
Kipp, M. (2004) Gesture Generation by Imitation – From Human Behavior to Computer Character Animation. Boca Raton, FL: Dissertation.com
Mitchell, T. F. (1957) ‘The Language of Buying and Selling in Cyrenaica: A Situational Statement’, Hespéris XLIV: 31–71.
Mostefa, D., Moreau, N., Choukri, K., Potamianos, G., Chu, S., Tyagi, A., Casas, J., Turmo, J., Cristoforetti, L., Tobia, F., Pnevmatokakis, A., Mylonakis, V., Talantzis, F., Burger, S., Stiefelhagen, R., Bernadin, K. and Rochet, C. (2007) ‘The CHIL Audio-visual Corpus for Lecture and Meeting Analysis Inside Smart Rooms’, Language Resources and Evaluation 41: 389–407.
Pastra, K. (2008) ‘COSMOROE: A Cross-Media Relations Framework for Modelling Multimedia Dialectics’, Multimedia Systems 14: 299–323.
Perez-Parent, M. (2002) ‘Collection, Handling, and Analysis of Classroom Recordings Data: Using the Original Acoustic Signal as the Primary Source of Evidence’, Reading Working Papers in Linguistics 6: 245–54; available at www.reading.ac.uk/internal/appling/wp6/perezparent.pdf (accessed 5 June 2009).
Scott, M. (2008) WordSmith Tools version 5. Liverpool: Lexical Analysis Software.