8

AUTOMATED ASSESSMENT OF PRONUNCIATION IN SPONTANEOUS SPEECH

Anastassia Loukina, Larry Davis, and Xiaoming Xi

Introduction

Developments in the field of natural language processing have made it possible to measure pronunciation features automatically and to use these measurements to predict human judgments of pronunciation quality. However, automated scoring of pronunciation has most often been applied to assessment tasks, such as read-aloud, that elicit highly constrained and predictable speech. Assessment of creative language production is required to fully evaluate communicative competence, but such unconstrained speech presents considerable challenges for automated scoring of pronunciation.

This chapter begins with a brief review of early efforts to automatically assess pronunciation, which generally focused on constrained speech. Key differences in the assessment of constrained vs. unconstrained speech are introduced, including the challenges of scoring spontaneous speech, where there is little or no prior knowledge of response content that can be used as a basis for measuring pronunciation quality. This is followed by a discussion of approaches used for measurement of pronunciation in unconstrained language, including an exploration of whether and how measures used for highly predictable speech can be applied to less predictable speech, as well as other approaches such as the use of measures that do not require any knowledge of the words used. This chapter will also discuss validity issues related to the automated scoring of pronunciation in unconstrained speech, focusing on how advance considerations of evidence needed to support the various claims in a validity argument can encourage critical thinking about conceptual issues involved in automated pronunciation assessment, as well as principled development of specific approaches to evaluating pronunciation. Finally, the chapter concludes with an examination of current trends and future opportunities in this domain, such as the impact of continuing improvements in speech recognition technology and improvements to pronunciation measures stimulated by such trends as the rise of “Big Data.”

Historical and current conceptualizations

Early efforts to automatically assess pronunciation

The ability to accurately and effectively use pronunciation to express meaning is an element of speaking ability that is often referenced in various ways when scoring spoken performance (Fulcher, 2003). As used in assessment, the term ‘pronunciation’ may include a variety of acoustic phenomena, but for the purposes of this chapter we consider pronunciation ability to include the accurate production of the individual sounds of speech (vowels and consonants in stressed and unstressed syllables), production of appropriate rhythm patterns, effective use of prosodic prominence to express emphasis, accurate use of intonation to indicate thought group boundaries or support syntactic structures (e.g., yes/no questions), and appropriate use of intonation to express attitudes or pragmatic meanings. Speech phenomena related to speaking rate and pausing are sometimes also considered aspects of suprasegmental pronunciation (e.g., Kang et al., 2010). We do not include them in the definition of pronunciation used here, however, given that in language proficiency frameworks and scoring rubrics such fluency features are often treated as a separate component of speaking ability, as in the Common European Framework of Reference, which considers ‘phonology’ and ‘spoken fluency’ separately (Council of Europe, 2001).

Attempts to create automated systems to evaluate the quality of speech began to bear fruit in the early 1990s. These efforts benefitted from improvements to speech recognition technology achieved in the late 1980s, which led to systems that could provide adequate transcription accuracy to support various measures of pronunciation and other language features. The first systems focused on providing feedback to inform second/foreign language learning and targeted a variety of languages and types of feedback. One of the earliest systems evaluated the pronunciation of Japanese speakers completing an English read-aloud task, with the goal of eventually incorporating such capability into an automated English training system (Bernstein, Cohen, Murveit, Rtischev, & Weintraub, 1990). The system was designed to evaluate Japanese students reading six English sentences which were selected to cover a range of phones. Scores produced by the system achieved a correlation of .81 with human scores for pronunciation quality.

Other early efforts to automatically evaluate pronunciation within a computer-assisted language learning environment included the SPELL and Voice Interactive Language Training System (VILTS) systems. Project SPELL (Interactive System for Spoken European Language Training), funded by the European Community, was a demonstration system for the teaching of English, French, and Italian pronunciation and included modules for segmental pronunciation of vowels and consonants as well as rhythm and intonation (Lefèvre, Hiller, Rooney, Laver, & Di Benedetto, 1992; Hiller, Rooney, Vaughan, Eckert, Laver, & Jack, 1994). VILTS was a demonstration effort initially targeting the teaching of French that produced scores intended to approximate overall pronunciation ratings produced by experts, with correlations up to .86 using combined data from 30 sentences (Neumeyer, Franco, Weintraub, & Price, 1996; Rypa, 1996; Franco, Neumeyer, Kim, & Ronen, 1997). Other early research/demo systems included a tool to provide feedback on the pronunciation of long vowels in Japanese (Kawai & Hirose, 2000) and a system for scoring both individual phones and overall pronunciation performance (Witt & Young, 2000). Overall, the early research-based systems demonstrated that automated evaluation of pronunciation was feasible, and under certain conditions it could produce results that correlated well with human judgments of pronunciation quality.

The first assessment to incorporate automated pronunciation measures was the PhonePass SET-10, an automated test of general English speaking proficiency developed in the late 1990s (Townshend, Bernstein, Todic, & Warren, 1998; Bernstein, De Jong, Pisoni, & Townshend, 2000). The assessment included four item types which elicited constrained speech: (1) reading sentences aloud; (2) repeating sentences; (3) saying an antonym of a word; (4) providing short answers to questions. There was also a fifth, open-ended speaking task, which was not scored. Test-taker responses were awarded an overall score as well as component scores for four aspects of oral performance (“sentence mastery,” “vocabulary,” “fluency,” and “pronunciation”); pronunciation accounted for 20% of the overall score (Bernstein et al., 2000; Hincks, 2001). Later, the PhonePass SET-10 and its underlying technology became the basis for the Versant™ family of tests, which utilize the same tasks and reporting structure to assess general and aviation English, along with Arabic, Chinese, Dutch, French, and Spanish (Pearson, 2013). The technology is also now used in the speaking section of the Pearson Test of English (PTE Academic; Pearson, 2011). Since the mid-2000s, other assessments have targeted pronunciation more specifically, including the Carnegie Speech Assessment (www.carnegiespeech.com), AutoAssess (www.spoken.com), and APST (www.phonologics.com). These assessments typically take less than ten minutes, with results available a few minutes after the test. Targeted uses of the assessments include making decisions regarding whether the speaker can be understood when speaking the target language in academic or business contexts, such as call centers.

Constrained vs. unconstrained speech

It is important to note that the systems mentioned so far analyze speech elicited from highly constrained tasks such as reading aloud or repeating a phrase or sentence, where the spoken response is highly predictable. Use of such predictable speech facilitates automated evaluation, both by streamlining the comparison of speaker output to a reference model and by optimizing the performance of automatic speech recognition, a necessary component of many types of pronunciation measures. Reference speakers (usually individuals whose first language is a major variety of the target language) can be asked to read sentences which are then used to build a model for comparison with a learner’s response. Moreover, for prosody features, the use of a corpus of known examples may be the only practical way of directly comparing a response to a target pronunciation model, as will be discussed later. For automatic speech recognition, use of a limited range of targeted utterances may make it easier to optimize the speech recognizer for accurate transcription of the language elicited in the assessment. Accurate transcription may in turn improve the performance of pronunciation measures where it is necessary to know which phoneme or word was spoken in order to evaluate pronunciation quality.

However, communication requires the ability to use language creatively, and speaking tasks that elicit highly constrained speech, such as reading aloud, provide only partial evidence of such ability. Assessments based on constrained speech may therefore not fully support decisions score users may want to make regarding real-world communication ability (Xi, 2010a, 2012). As a result, efforts have been made to develop systems to automatically assess spontaneous speech, incorporating a variety of language features including pronunciation. The first system to do this was SpeechRater℠, developed by researchers at Educational Testing Service (Xi, Higgins, Zechner, & Williamson, 2008; Zechner, Higgins, Xi, & Williamson, 2009). SpeechRater℠ has been used to provide a holistic score for a practice version of the TOEFL iBT Speaking Test, scoring spontaneous responses 45–60 seconds in length (Xi et al., 2008). The initial version of SpeechRater℠ included a measure of phoneme pronunciation quality, and in later versions additional measures of both segmental and suprasegmental pronunciation have been added to the system (Zechner et al., 2014). More recently, the PTE Academic and the Arizona English Language Learner Assessment (AZELLA) developed by Pearson have incorporated brief open-ended speaking tasks in which scores are generated from a variety of language measures, including pronunciation (Pearson, 2011; Cheng, D’antilio, Chen, & Bernstein, 2014). Automated analysis of spontaneous speech remains highly challenging and examples of such automated systems are few. However, given that automated scoring technology has the potential to dramatically reduce the cost of scoring speaking tests, research in this area continues, and it seems likely that automated scoring will be implemented more broadly for open-ended speaking tasks as the technology improves.

Main research methods

Automated methods for assessing pronunciation in constrained and unconstrained speech

The general goal of research on automatic evaluation of pronunciation is to create a system that would reliably replicate expert human evaluations. Therefore, development of such systems generally starts with a corpus of learner speech that has been annotated by human raters. These annotations can range from detailed phone-level corrections to holistic judgments of speaker proficiency. This corpus is then used to identify measurements, based on the acoustic analysis of the signal, that are likely to represent various aspects of pronunciation such as segmental accuracy, timing (durational patterns), stress, or intonation. There are two major groups of such measurements which may be used for either constrained or unconstrained speech. The first group compares the speaker’s pronunciation with a reference model for the same segment, word, or sentence, extracted from a corpus of spoken data that represents the pronunciation norms targeted by the assessment. The second group focuses on general patterns that discriminate between various levels of proficiency without reference to any particular model. Finally, these measurements are evaluated and fine-tuned based on how well they agree with the existing human annotations under the assumption that the best automatic evaluation of pronunciation accuracy would replicate the judgment of an expert human rater (see, for example, Eskenazi, 2009).
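
To make the final step of this pipeline concrete, the sketch below shows how candidate measurements might be screened by their agreement with human scores. It is a minimal Python illustration, not drawn from any of the systems cited in this chapter; the measurement names and all values are invented.

```python
# Minimal sketch: screening candidate pronunciation measurements against
# human annotations. Feature names and values are hypothetical.
import numpy as np
from scipy.stats import pearsonr

# One row per scored response: candidate measurements plus a human score (1-4).
features = {
    "mean_phone_likelihood": np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.59]),
    "vowel_duration_deviation": np.array([0.31, 0.42, 0.22, 0.47, 0.28, 0.35]),
    "pitch_range_semitones": np.array([6.1, 4.8, 7.3, 3.9, 6.6, 5.2]),
}
human_scores = np.array([3, 2, 4, 1, 3, 2])

# Rank candidate measurements by how well they agree with human judgments.
for name, values in features.items():
    r, p = pearsonr(values, human_scores)
    print(f"{name:28s} r = {r:+.2f} (p = {p:.2f})")
```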

Model-based approaches to pronunciation assessment

In model-based approaches to pronunciation assessment, the learner pronunciation is compared to the existing reference model for the same segment, word, or sentence in order to either compute a continuous similarity score between the two pronunciations (pronunciation scoring or assessment) or classify the learner pronunciation as correct or incorrect (error detection). Since these methods require reference models for each possible learner utterance, they have been most successful in situations where the inventory of possible models is relatively small and can be defined in advance. One obvious example of such a small and predefined inventory is the list of L2 segments and therefore it is perhaps unsurprising that the model-based approaches have been most successfully applied to the evaluation of segmental accuracy.

The most widespread model-based approach to evaluating segmental accuracy relies on a technology similar to what is used for automatic speech recognition (ASR). This method is covered in detail in Van Moere and Suzuki (Chapter 7, this volume) and therefore will be only briefly reviewed here. In this approach, a large corpus of data from proficient speakers is used to compute the expected distribution of spectral properties for each segment, which become the reference models. The learner pronunciation of each segment is then compared to the reference model for this and other segments to evaluate the likelihood that the learner pronunciation corresponds to a given phone produced by the speakers in the reference corpus (Franco, Neumeyer, Kim, & Ronen, 1997; Witt & Young, 2000). This approach has become the cornerstone of automatic assessment of segmental accuracy and led to the development of a number of metrics such as the influential Goodness of Pronunciation score (Witt & Young, 2000). Such phone-level scores can then be averaged across all segments in a sentence or the whole response and used to measure the overall pronunciation accuracy in both constrained and unconstrained speech (e.g., Chen, Zechner, & Xi, 2009; Cheng, D’antilio, Chen, & Bernstein, 2014). Phone-level models have also been widely used for evaluating the timing patterns of learner language for both constrained and unconstrained speech: in this approach a corpus of proficient speech is used to compute reference durations for each phone which are then compared to the phone durations in learner pronunciation. Different implementations of this approach have used various ways of estimating reference durations, from simple normalized means (Chen et al., 2009), to different probability functions (Neumeyer, Franco, Digalakis, & Weintraub, 2000; Sun & Evanini, 2011).
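
The following sketch illustrates, in simplified form, the two kinds of phone-level measures just described. It assumes that an ASR system has already supplied, for each phone in a response, the acoustic log-likelihood of the intended phone and of the best competing phone (the quantities underlying Goodness of Pronunciation-style scores), together with phone durations; all numbers are invented for illustration.

```python
# Simplified sketch of two phone-level, model-based measures (values invented):
# (1) a GOP-style score: log-likelihood of the intended phone minus the
#     log-likelihood of the best competing phone, normalized by frame count;
# (2) a duration score: deviation of each phone's duration from a reference
#     mean duration estimated on proficient speech.
import numpy as np

# Per-phone ASR output for one response:
# (phone, frames, loglik of intended phone, best loglik over all phones).
phones = [
    ("IY", 12, -55.0, -52.0),
    ("T",   6, -30.0, -29.5),
    ("AH",  9, -48.0, -41.0),   # large gap -> likely mispronounced
]

# Reference durations (in frames) estimated from a corpus of proficient speech.
ref_duration = {"IY": 10.0, "T": 5.0, "AH": 8.0}

gop_scores, dur_scores = [], []
for phone, n_frames, ll_intended, ll_best in phones:
    gop = (ll_intended - ll_best) / n_frames            # 0 = perfect match
    dur = abs(np.log(n_frames / ref_duration[phone]))   # 0 = typical duration
    gop_scores.append(gop)
    dur_scores.append(dur)

# Phone-level scores are averaged over the response to give overall measures.
print("mean GOP-style score:", np.mean(gop_scores))
print("mean duration deviation:", np.mean(dur_scores))
```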

Although segments have often been an obvious choice for training the reference models, model-based approaches have also looked at other units. Franco, Neumeyer, Digalakis, and Ronen (2000) compared the intervals between two stressed vowels to a reference model to evaluate the timing of learner speech. Cheng (2011) built reference models at the word level for intonation and energy contours and used these to assess the prosody of learner read-aloud speech for the PTE Academic. Finally, the comparison can be done at the utterance level when evaluating learner intonation. In this case, the pitch contour of the learner sentence is compared to the pitch contour of the same sentence pronounced by a reference speaker and the difference is used as a measure of the accuracy of learner intonation (e.g., Arias, Yoma, & Vivanco, 2010). The latter approach is only applicable to constrained speech where the sentences uttered by the learner are known in advance.
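
A minimal sketch of the utterance-level comparison is given below. It assumes that pitch contours for the learner and a reference speaker have already been extracted and resampled to a common length, and it uses correlation of the normalized contours as a simple stand-in for the distance measures used in the studies cited above; the pitch values are invented.

```python
# Minimal sketch of utterance-level intonation comparison for constrained
# speech: the learner's pitch contour for a known sentence is compared to a
# reference speaker's contour for the same sentence.
import numpy as np

def normalize(contour):
    """Z-score a pitch contour to remove differences in register and range."""
    contour = np.asarray(contour, dtype=float)
    return (contour - contour.mean()) / contour.std()

learner_f0 = [110, 118, 132, 140, 135, 120, 112, 108]     # Hz, per frame
reference_f0 = [180, 195, 220, 235, 225, 200, 185, 178]

# Correlation of the normalized contours: high values indicate that the
# learner reproduces the shape of the reference intonation pattern.
similarity = np.corrcoef(normalize(learner_f0), normalize(reference_f0))[0, 1]
print(f"intonation similarity: {similarity:.2f}")
```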

In addition to pronunciation scoring, model-based approaches can also be used for error detection. This strand of research has generally focused on segmental errors and has been dominated by the measures based on spectral similarity discussed at the beginning of this section. Earlier research in this area is discussed in detail in a review paper by Eskenazi (2009). More recent work on error detection has focused both on expanding the range of measurements obtained for each phone and on exploring various machine learning algorithms to improve the classification accuracy based on these measurements. For example, Strik, Truong, De Wet, and Cucchiarini (2009) showed that the accuracy of classification into correct and incorrect pronunciations of the Dutch phonemes [x] and [k] can be further improved by supplementing similarity measures such as the Goodness of Pronunciation score with information about duration and intensity as well as additional linguistically informed features. However, perfect detection of mispronounced words remains difficult, partially due to low agreement between human raters when making such judgments: when asked to mark incorrectly pronounced phones or words, human expert raters tend to mark a similar number of errors but do not agree very well on the location of such errors, with κ around 0.3 (see Cincarek, Gruhn, Hacker, Nöth, & Nakamura, 2009; and also Loukina, Lopez, Evanini, Suendermann-Oeft, & Zechner, 2015 for a discussion of rater agreement).
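
The sketch below illustrates the general idea of combining several phone-level measures in a supervised classifier for error detection. It is not a reproduction of any published system; the tiny training set, the choice of classifier, and the feature values are invented for illustration.

```python
# Sketch of phone-level error detection as supervised classification, in the
# spirit of combining a GOP-style score with duration and intensity cues.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: GOP-style score, duration deviation, relative intensity (dB).
X_train = np.array([
    [-0.1, 0.05, -1.0],   # correctly pronounced phones
    [-0.2, 0.10, -0.5],
    [-0.3, 0.08, -2.0],
    [-1.5, 0.60, -6.0],   # mispronounced phones
    [-1.2, 0.45, -5.0],
    [-1.8, 0.70, -7.5],
])
y_train = np.array([0, 0, 0, 1, 1, 1])   # 1 = mispronunciation

clf = LogisticRegression().fit(X_train, y_train)

# Classify a new phone realization.
new_phone = np.array([[-1.1, 0.5, -4.0]])
print("P(mispronounced) =", clf.predict_proba(new_phone)[0, 1])
```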

All model-based methods discussed in this section require a reference model for each unit of measurement. When the evaluation is done at the phone level, where the inventory of possible models is finite, the reference models can be obtained from large corpora of existing data which are likely to contain multiple instances of all phones. Given a sufficiently large reference corpus, this approach may also work for word-level measurements. However, for prosody assessment, where the measurement may be done at the level of a phrase or even a whole sentence, the probability of finding an existing corpus which contains all necessary sentences is rather low. For assessments done on constrained speech, this problem can be solved by collecting a new reference corpus for each utterance to be produced by test-takers. However, this solution quickly becomes impractical for large-scale assessments with multiple items and continuous introduction of new items. Moreover, it is of course not possible to obtain targeted sentence-level reference data for unconstrained speech, where the exact content of the learner response is not known in advance and the number of possible utterances is infinite. Therefore, a second approach to assessment of pronunciation was developed that focuses on general patterns which do not require a pre-existing reference model.

Generic approaches to pronunciation assessment

Generic approaches to pronunciation assessment focus on identifying general patterns of pronunciation which discriminate between different levels of proficiency. These approaches have been used most widely in the area of prosody assessment. As discussed previously, intonation, an important component of prosody evaluation, is generally evaluated at the level of the phrase or sentence, and it is often impractical, or in the case of unconstrained speech even impossible, to obtain reference models for all sentences that can potentially be uttered by the learner.

In addition to solving the issue of intonation assessment, generic approaches also have several other advantages over model-based approaches: first of all, they do not require defining who should be considered a reference speaker, a problem we will discuss in more detail later. Second, the application of model-based approaches requires knowledge of the content of the spoken response in order to identify the appropriate reference model for comparison. While this is also true for many generic methods, some of the measures discussed in this section can be computed without such information. This makes them particularly attractive for unconstrained speech, where the content is not known in advance. Finally, from a practical point of view, model-based approaches require finding or collecting a large reference corpus that matches the speech samples to be assessed in terms of the recording quality and the type of speech. There is no such requirement for generic measures, which can be developed using only the candidates’ responses collected in the assessment.

A set of generic measurements for automated evaluation of intonation was suggested by Teixeira, Franco, Shriberg, Precoda, and Sönmez (2000). These comprised metrics such as the minimum and maximum of the normalized pitch of the response or the number of changes in the intonation contour, and notably did not require knowledge of the content of the utterance. Unfortunately, these features did not show as good agreement with human scores as other features which made use of the transcription. Prosody features which do not require a transcription of the text were further explored by Maier et al. (2009), who computed 187 different features covering pitch, energy, and duration for the whole utterance or for voiced segments only. They reported that for German learners of Japanese the agreement between a model based on all these features and human ratings of proficiency reached the level of agreement between two human raters and was only slightly lower than for features which made use of the content of the response. This study was conducted on constrained speech, but the same technique could be applied to unconstrained speech.
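
The sketch below computes transcription-free intonation features of the kind listed by Teixeira et al. (2000) from a pitch track. The pitch values are invented; in practice they would come from a pitch tracker, with unvoiced frames removed.

```python
# Sketch of transcription-free intonation features: extrema of the normalized
# pitch track and the number of direction changes in the contour.
import numpy as np

f0 = np.array([120, 125, 140, 150, 145, 130, 128, 135, 150, 142, 125, 118])

norm_f0 = (f0 - f0.mean()) / f0.std()

features = {
    "min_norm_pitch": norm_f0.min(),
    "max_norm_pitch": norm_f0.max(),
    # A direction change occurs wherever the sign of the frame-to-frame
    # pitch difference flips (rise -> fall or fall -> rise).
    "n_contour_changes": int(np.sum(np.diff(np.sign(np.diff(f0))) != 0)),
}
print(features)
```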

In addition to intonation, many generic approaches have been suggested for automatic evaluation of rhythm and timing. Measures based on various properties of automatically identified prominent vowels, such as the distance between stressed vowels (e.g., Teixeira et al., 2000; Zechner, Xi, & Chen, 2011; Johnson, Kang, & Ghanem, 2016), have proved successful in predicting proficiency levels across a number of studies for both constrained and unconstrained speech. More recently, several studies have attempted to evaluate non-native prosody using popular “rhythm measures.” These measures capture the variability in duration of vocalic and consonantal intervals and were originally developed to study differences in durational patterns between different languages (see Arvaniti, 2009, for a detailed overview). Some of the most popular measures are the percentage of vocalic and consonantal intervals in speech (Ramus, Nespor, & Mehler, 1999) and the index of pairwise variability of adjacent consonantal and vocalic intervals, also known as the PVI (Ling, Grabe, & Nolan, 2000). Initial applications of these measures to the evaluation of both constrained (White & Mattys, 2007; Chen & Zechner, 2011) and unconstrained non-native speech (Lai, Evanini, & Zechner, 2013) appear promising. Furthermore, these measures can be computed without knowledge of utterance content (e.g., Loukina, Kochanski, Rosner, Keane, & Shih, 2011). However, recent studies have also raised questions about the validity of these measures, their dependency on text and speaker, and their close connection to speech rate (Arvaniti, 2012; Loukina, Rosner, Kochanski, Keane, & Shih, 2013).
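
As an illustration, the sketch below computes the proportion of vocalic material (%V, in the spirit of Ramus et al., 1999) and a vocalic normalized PVI (nPVI; Ling et al., 2000) from a hypothetical segmentation of a response into vocalic and consonantal intervals; the interval durations are invented.

```python
# Sketch of two rhythm measures computed from a segmentation of a response
# into vocalic (V) and consonantal (C) intervals; durations in milliseconds.
intervals = [("C", 90), ("V", 120), ("C", 70), ("V", 200),
             ("C", 110), ("V", 80), ("C", 60), ("V", 150)]

vocalic = [d for kind, d in intervals if kind == "V"]
total = sum(d for _, d in intervals)

# %V: share of the utterance duration taken up by vocalic intervals.
percent_v = 100.0 * sum(vocalic) / total

# nPVI: mean normalized difference between successive vocalic intervals.
npvi = 100.0 * sum(
    abs(a - b) / ((a + b) / 2.0) for a, b in zip(vocalic, vocalic[1:])
) / (len(vocalic) - 1)

print(f"%V = {percent_v:.1f}, vocalic nPVI = {npvi:.1f}")
```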

Examples of pronunciation assessment systems

As we have already discussed, both model-based and generic approaches have their advantages and disadvantages. This is why most systems for pronunciation assessment have traditionally adopted a “mix-and-match” approach, using a set of several measurements or “features” based on both approaches. Thus, SpeechRater℠, an automated scoring engine for unconstrained speech, extracts both model-based measures of segmental accuracy and duration and prosody measures based on generic methods (Higgins, Xi, Zechner, & Williamson, 2011). A combination of both types of approach is also used in the automated scoring system for the AZELLA test (Cheng et al., 2014) as well as in many research systems (e.g., Cincarek et al., 2009).
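
The sketch below shows the mix-and-match idea in its simplest form: model-based and generic measures are pooled into one feature vector per response and combined by a regression model trained to predict human holistic scores. The feature set, data, and model are invented for illustration and are far simpler than those used in the systems cited above.

```python
# Sketch: combining model-based and generic pronunciation measures into a
# single predicted score via regression against human holistic scores.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: mean GOP-style score, mean duration deviation (model-based),
# pitch range in semitones, vocalic nPVI (generic).
X = np.array([
    [-0.3, 0.10, 6.2, 45.0],
    [-0.9, 0.35, 4.1, 62.0],
    [-0.2, 0.08, 7.0, 40.0],
    [-1.4, 0.50, 3.5, 70.0],
    [-0.6, 0.20, 5.5, 52.0],
])
human_holistic = np.array([4.0, 2.5, 4.5, 1.5, 3.0])

model = LinearRegression().fit(X, human_holistic)

# Score a new response from its pooled feature vector.
print("predicted score:", model.predict([[-0.5, 0.15, 5.8, 48.0]])[0])
```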

A direct comparison between these different systems in terms of performance is difficult because they differ in the type of speech assessed as well as the type of proficiency scores produced. However, a number of studies have reported that systems based on pronunciation features using both approaches are capable of producing scores that achieve the same level of reliability as human scores, at least for constrained speech (e.g., Cincarek et al., 2009). While no targeted evaluations of pronunciation are available for unconstrained speech, systems that include both model-based and generic pronunciation measures in the mix often achieve agreement with holistic proficiency scores that is similar to, or only slightly below, the agreement between two expert raters (Cheng et al., 2014; Loukina, Zechner, Chen, & Heilman, 2015).

Challenges in automatic assessment of pronunciation in unconstrained speech

While the methods discussed in the previous section have been successfully used in different research and commercial systems for assessing pronunciation, this area still presents a number of both technical and conceptual challenges.

The accuracy of ASR

All model-based approaches and most of the generic approaches to automatic assessment of pronunciation require knowledge of the content of learner speech and the boundaries between phones and words. When the word-by-word transcription of a speech sample is known in advance, such boundaries can be established by using a technology called “forced alignment” to match the transcript to the recording (e.g., Franco et al., 1997). However, in the context of language assessment, the speech produced by the test-taker may deviate from the expected output, even for highly constrained tasks such as read-aloud (e.g., Cheng, 2011). Of course, no transcription is available for unconstrained speech where the content is not known in advance.

The solution currently adopted in most systems for pronunciation evaluation is to obtain the transcription automatically using ASR (e.g., Zechner et al., 2009; Higgins et al., 2011). This automatically generated transcription also includes information about the boundaries of words and phones.

However, despite recent advances in speech technologies, the accuracy of ASR engines on unconstrained non-native speech still varies, with at least 20% of words recognized incorrectly (Tao, Evanini, & Wang, 2014). This presents a particular problem for model-based approaches since test-taker pronunciation may be flagged as incorrect simply because it is compared to a wrong reference model due to an error in automatic transcription.

Various attempts have been made to mitigate this problem. For example, Herron et al. (1999) suggested a two-pass approach to pronunciation evaluation: first, the recognizer is trained on speech of varying levels of proficiency and these models are used for transcription; second, for pronunciation evaluation, the candidate speech is compared to a different set of models trained on proficient speech only. While Herron et al.’s (1999) study evaluated highly constrained speech, Chen et al. (2009) showed that this approach also works well on unconstrained speech. ASR accuracy can also be improved by creating separate models for each item, reflecting the words and word combinations most likely to be elicited by that item (Van Doremalen, Cucchiarini, & Strik, 2010). Finally, one can focus pronunciation assessment on only the words that are most likely to be recognized correctly, using the confidence scores usually computed by the ASR system (Chen et al., 2009). While this approach reduces the number of ‘false alarms’ due to recognition errors, it also introduces bias, since mispronounced words are often the ones that are likely to have low confidence scores. If the automatic pronunciation assessment only includes words with high confidence scores, many mispronounced words will be excluded from such analysis.
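
The last of these strategies can be illustrated with the short sketch below, in which word-level pronunciation scores are averaged only over words whose ASR confidence exceeds a threshold; the words, confidences, scores, and threshold are invented.

```python
# Sketch of confidence-based filtering: pronunciation measures are averaged
# only over words whose ASR confidence exceeds a threshold, reducing false
# alarms caused by recognition errors.
words = [
    # (word, ASR confidence, word-level pronunciation score)
    ("the", 0.95, -0.2),
    ("weather", 0.90, -0.4),
    ("tomorrow", 0.40, -1.6),   # low confidence: possibly misrecognized
    ("will", 0.88, -0.3),
    ("change", 0.55, -1.1),
]

THRESHOLD = 0.80
kept = [score for _, conf, score in words if conf >= THRESHOLD]

# Note the trade-off discussed above: filtering removes likely ASR errors but
# also tends to exclude genuinely mispronounced words, which often receive
# low confidence themselves.
print("overall pronunciation score:", sum(kept) / len(kept))
```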

Although the impact of ASR errors remains an unsolved problem for error detection in unconstrained speech, the impact of individual ASR errors on overall pronunciation scores appears to be less severe than one might expect, because model-based evaluations for each phone are averaged across the whole response. In studies of pronunciation measures based on both model-based and generic approaches, correlations between pronunciation features such as the ones discussed in previous sections and human scores changed little even when the ASR word error rate varied from 10% to 50% (Chen et al., 2009; Tao et al., 2014). In other words, such features may be relatively robust to errors in the ASR hypothesis. One should bear in mind, however, that ASR still introduces an additional confound into these measurements: since ASR accuracy tends to be lower for low-proficiency speakers, the pronunciation scores for these speakers also tend to be more affected by ASR errors. Improving ASR accuracy will therefore increase the validity of these pronunciation measurements.

The choice of reference models

A second major challenge is the question of what reference models should be used for model-based approaches to pronunciation assessments. For example, models of English pronunciation have typically targeted the pronunciation of ‘native speakers,’ which ignores the many regional and social varieties of English. Such an approach is especially problematic in international contexts where English serves as a lingua franca and the very notion of a ‘native speaker’ becomes problematic (Seidlhofer, 2004; Jenkins, 2006). Furthermore, the current focus of pronunciation assessments on intelligibility rather than accent reduction challenges the whole concept of using so-called native speech as a single reference model.

There are a variety of possibilities for dealing with such challenges; all of these incorporate the idea that target norms must be clearly identified for the assessment purpose. For international contexts, Elder and Davies (2006) have argued that assessments to be used in local contexts must target the language norms of those localities. Where the target is international English, attempts have been made to identify pronunciation features that form the ‘core’ of English as an international language (e.g., Jenkins, 2000) or that otherwise are especially important for intelligibility (e.g., Munro & Derwing, 2006; Kang & Moran, 2014). Such core features might then be targeted in a reference model. Another approach is to build a reference model based on ‘proficient speakers’ that are relatively high-performing in a given context, such as high-scoring test-takers. A number of studies of model-based approaches, such as Chen et al. (2009), trained their models on a combination of L1 speakers and proficient L2 speakers. In fact, Sun and Evanini (2011) showed that vowel duration measures which used L2-speaker models showed better agreement with proficiency scores than the models based on L1 speakers.

In addition to between-speaker variability, within-speaker variability adds another challenge to defining a coherent reference model. This is a particularly big problem for unconstrained speech since there is substantial variation in pronunciation of the same segments, words, or sentences due to context even when uttered by the same speaker (Lindblom, 1990; Bell et al., 2003; Aylett & Turk, 2004). Therefore, the fact that a learner’s production deviates from a model built on a particular reference sample does not necessarily mean that this production is wrong, especially if the two productions occurred in different contexts.

The solution to both these problems so far has been to obtain reference models from large speech corpora which include speakers from different areas and social backgrounds and a large variety of contexts. The problem with this approach is that the resulting models may be too broad: for example, if segmental accuracy is evaluated using a model which covers several regional accents, a similar score would be assigned to a learner who consistently follows the conventions of a particular accent and a learner who switches between different accents. Yet studies in speech perception have shown that speakers with more consistent production are more intelligible to listeners (Newman, Clouse, & Burnham, 2001). Therefore, in the future, evaluating consistency and/or adapting the models to each learner may be another aspect that will be addressed by automatic pronunciation assessment.

L1 specific and generic models

One way to achieve significant improvement in both pronunciation scoring and error detection systems is to fine-tune these systems to a particular pair of L1 and L2, which has several advantages. First of all, the models for ASR can be trained on speakers with the same L1 to further improve recognition accuracy. Second, the system can focus on error patterns which are particularly relevant for a given L1. For example, Van Doremalen, Cucchiarini, and Strik (2013) showed that combining acoustics-based measures of pronunciation quality with prior knowledge of expected error patterns led to a roughly 15% improvement in the performance of these measures for learners of Dutch. Finally, a system tailored to a particular pair of languages may even be able to bypass the problem of ASR accuracy: in the system developed by Moustroufas and Digalakis (2007), learner speech is recognized and evaluated twice, first using reference models trained on native speakers of the L2 and then using reference models trained on native speakers of the L1 (in their case Greek). The comparison between these two evaluations shows whether the non-native pronunciation is closer to native speakers of the L1 or the L2 without relying on the actual output of the ASR.
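
The comparison at the heart of this dual-model idea can be sketched as follows, assuming that the total log-likelihood of each response under the two sets of models is already available; the speaker labels and numbers are invented.

```python
# Sketch of the dual-model idea: the same response is scored under acoustic
# models trained on native speakers of the target language (L2) and on native
# speakers of the learner's first language (L1). The comparison indicates
# which set of pronunciation norms the learner is closer to, without relying
# on the ASR transcription being correct.
responses = {
    "speaker_A": {"loglik_L2_models": -410.0, "loglik_L1_models": -455.0},
    "speaker_B": {"loglik_L2_models": -470.0, "loglik_L1_models": -430.0},
}

for speaker, ll in responses.items():
    # Positive values: closer to the L2 reference models;
    # negative values: closer to the L1-accented models.
    ratio = ll["loglik_L2_models"] - ll["loglik_L1_models"]
    print(f"{speaker}: L2-vs-L1 log-likelihood ratio = {ratio:+.1f}")
```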

However, such L1-specific systems may not always be appropriate for large-scale assessments due to fairness issues (Xi, 2010b). While different algorithms could be developed for speakers with different L1s, this could lead to a situation where the same speaker may receive a different score depending on the L1 they report. In addition, it is likely that the availability of training data and research may vary between languages, putting speakers of minority languages in a disadvantaged position.

Validity issues related to automated assessment of pronunciation

In addition to the challenges mentioned so far, developers and users of automated assessments of pronunciation must ensure that the assessment is appropriate for the purpose for which it is used. The assessment of pronunciation can potentially serve many different purposes, ranging from making high-stakes decisions such as hiring, to medium-stakes decisions such as placing students into the right levels of instruction, to low-stakes purposes such as providing opportunities for practice and feedback in a learning context. Automated evaluation of pronunciation may be used in a few different ways to produce scores in a pronunciation assessment, including serving as the sole score, contributing to the reported score, or serving as a check on human scores. In a speaking assessment, automated features of pronunciation can also be combined with other automated features using statistical methods to predict speaking scores. Additionally, automated engines can be used to provide qualitative feedback on overall pronunciation quality and on specific pronunciation errors, both segmental and suprasegmental. Supporting these various uses requires differing kinds and degrees of information; construction of a validity argument to support an intended use is now a widely used procedure for systematically thinking about the kinds of information needed.

Xi (2010b) and Xi (2012), building on Kane’s (2006) validity argument structure and Chapelle, Enright, and Jamieson’s (2008) expansion of it, propose a list of validity questions introduced by the use of automated scoring that pertain to each validity inference in a validity argument for an entire assessment:

1.   Does the use of assessment tasks constrained by automated scoring technologies lead to construct under- or misrepresentation?

2.   Do the automated scoring features under- or misrepresent the construct of interest?

3.   Is the way the scoring features are combined to generate automated scores consistent with theoretical expectations of the relationships between the scoring features and the construct of interest?

4.   Does the use of automated scoring change the meaning and interpretation of scores, compared to scores provided by trained raters?

5.   Does automated scoring yield scores that are accurate indicators of the quality of a test performance sample? Would examinees’ knowledge of the scoring logic of an automated scoring system impact the way they interact with the test tasks, thus negatively affecting the accuracy of the scores?

6.   Does automated scoring yield scores that are sufficiently consistent across measurement contexts (e.g., across test forms)?

7.   Does automated scoring yield scores that have expected relationships with other test or non-test indicators of the targeted language ability?

8.   Do automated scores lead to appropriate score-based decisions?

9.   Does the use of automated scoring have a positive impact on test preparation, teaching and learning practices?

With regard to question 1, it is possible that limitations of pronunciation scoring technologies for spontaneous speech may prompt the use of read-aloud speech only in an assessment so as to increase the accuracy of automated scoring of pronunciation. However, this decision could constrain score inferences to the pronunciation quality of read speech only, which may under-represent the construct of interest. The automated pronunciation features used to predict human pronunciation scores may also fail to include key aspects of the construct, such as intonation and stress, or include features that are irrelevant to the construct of pronunciation (although relevant to the broader construct of speaking), such as speaking rate or pausing phenomena. Features must also be aligned to what is targeted as pronunciation quality. For example, if the assessment targets intelligibility, the measures used to evaluate pronunciation should address phenomena that influence intelligibility rather than simply contributing to perceptions of accent. Moreover, when these features are combined to predict expert human scores, inappropriate weights may be given to some features in a way that is inconsistent with the relative role they play in driving expert raters’ perceptions of overall pronunciation quality. If the automated scoring logic under- or misrepresents the target construct of pronunciation, test-takers who have knowledge of the logic may try to take advantage of the “loopholes” and game the automated system to obtain higher pronunciation scores than they actually deserve.

Another validity issue relates to the consistency of pronunciation scores across test forms. Evidence is needed to demonstrate that a test-taker will receive a similar pronunciation score regardless of the test form used. Additionally, automated and human pronunciation scores are expected to have comparable relationships with criterion measures that tap similar pronunciation constructs, such as comprehensibility. Finally, regarding the use of automated pronunciation scores, the impact on both score-based decisions and on test preparation, teaching, and learning practices needs to be investigated. These investigations include the extent to which automated scores support appropriate score-based decisions (such as screening candidates for call-center jobs) and promote a positive impact on teaching and learning practices to improve pronunciation and speaking ability more generally.

Xi (2010a) argues that validation priorities differ depending on the planned use of the assessment and the way automated scoring is intended to be implemented (e.g., as the sole score). Higher-stakes uses demand a greater amount of rigorous validity evidence, whereas the burden of evidence is much lower for lower-stakes uses.

New directions and recommendations

A number of trends are likely to influence the further development of systems to automatically assess pronunciation in spontaneous speech. One such trend is the increasing capability to store and process large amounts of data (so-called “Big Data”). Larger training sets and new machine learning algorithms have already led to substantial improvements in ASR accuracy for native speech, and these results are likely to transfer to the recognition of non-native speech. Deep learning is one particularly promising technology in this area (see LeCun, Bengio, & Hinton, 2015, for a good introduction), which has already shown very good results in improving general ASR accuracy (Hinton et al., 2012), the accuracy of detection of pronunciation errors in learner speech (Hu, Qian, Soong, & Wang, 2015), and speech scoring in general (Cheng, Chen, & Metallinou, 2015). This development will be particularly important for unconstrained speech, which is more affected by ASR accuracy.

Crowdsourcing is another developing trend that promises to contribute to better automated evaluation of pronunciation. As discussed throughout this chapter, the development and fine-tuning of systems for pronunciation assessment require human evaluations to calibrate these systems. While the amount of data that can be processed by computers continues to increase, the resources available to obtain evaluations from expert raters remain limited, hindering the rapid creation of new large training corpora with well-attested annotations. New methods of data collection such as crowdsourcing have been repeatedly shown to provide annotations similar in quality to those done by experts but at a fraction of the cost and time (Parent, 2013; Tetreault, Chodorow, & Madnani, 2013). Crowdsourcing methods have also been applied to collecting judgments about prosody (Evanini & Zechner, 2011), pronunciation accuracy (Peabody, 2011), and intelligibility (Loukina, Lopez, et al., 2015). Future improvements in protocols and processes for crowdsourced annotations (Attali, 2015), as well as the increasing use of other methods of distributed data collection such as games or cell phone applications, will lead to a substantial expansion of the training sets available for calibrating new systems.

Finally, we should expect to see greater co-operation between speech technologists, phoneticians, and second language acquisition researchers as these fields increasingly share the same tools and methods. Many existing systems for pronunciation assessment rely on traditional phonetic theories which represent speech as a sequence of non-overlapping segments with the prosodic component superimposed on this sequence. Yet since the late 1980s, a number of new empirically tested models have emerged, such as articulatory phonology (Browman & Goldstein, 1992), that are very different from the traditional “string-of-phones” representation. There have already been successful attempts to integrate such theories into systems for pronunciation assessment. Tepperman and Narayanan (2008) used an approach inspired by articulatory phonology to identify possible pronunciation errors, which improved the accuracy of error identification for some segments by 16–17%, especially in the case of minor pronunciation deviations that can be difficult to identify using traditional methods. In another study, Koniaris, Salvi, and Engwall (2013) based their error detection system for learners of Swedish on findings from perception studies, comparing whether learner pronunciation has the same perceptual qualities as the model pronunciation of a group of native speakers. In a small listening test, their model showed very high agreement with native listeners (for 73% of vowels and 100% of consonants). Such interdisciplinary cross-pollination is likely to lead to radically new approaches to pronunciation assessment.

References

Arias, J. P., Yoma, N. B., & Vivanco, H. (2010). Automatic intonation assessment for computer aided language learning. Speech Communication, 52, 254–267.

Arvaniti, A. (2009). Rhythm, timing and the timing of rhythm. Phonetica, 66, 46–63.

Arvaniti, A. (2012). The usefulness of metrics in the quantification of speech rhythm. Journal of Phonetics, 40, 351–373.

Attali, Y. (2015). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99–115.

Aylett, M., & Turk, A. (2004). The smooth signal redundancy hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Language and Speech, 47, 31–56.

Bell, A., Jurafsky, D., Fosler-Lussier, E., Girand, C., Gregory, M., & Gildea, D. (2003). Effects of disfluencies, predictability, and utterance position on word form variation in English conversation. Journal of the Acoustical Society of America, 113, 1001–1024.

Bernstein, J., Cohen, M., Murveit, H., Rtischev, D., & Weintraub, M. (1990). Automatic evaluation and training in English pronunciation. Proceedings of ICSLP 90. Kobe, Japan: International Speech Communication Association. Retrieved from http://www.isca-speech.org/archive/icslp_1990/i90_1185.html.

Bernstein, J., De Jong, J., Pisoni, D., & Townshend, B. (2000). Two experiments on automatic scoring of spoken language proficiency. In P. Delcloque (Ed.), Proceedings of STIL 2000 (pp. 57–61). Dundee, UK: University of Abertay.

Browman, C. P., & Goldstein, L. (1992). Articulatory phonology: An overview. Phonetica, 49, 155–180.

Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity argument for the Test of English as a Foreign Language™. Mahwah, NJ: Lawrence Erlbaum.

Chen, L., & Zechner, K. (2011). Applying rhythm features to automatically assess non-native speech. Proceedings of InterSpeech 2011 (pp. 1861–1864). Florence, Italy: International Speech Communication Association. Retrieved from: http://www.isca-speech.org/archive/interspeech_2011/i11_1861.html.

Chen, L., Zechner, K., & Xi, X. (2009). Improved pronunciation features for construct-driven assessment of non-native spontaneous speech. Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the Association for Computational Linguistics (pp. 442–449). Boulder, CO: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/N09-1050.

Cheng, J. (2011). Automatic assessment of prosody in high-stakes English tests. Proceedings of InterSpeech 2011 (pp. 1589–1592). Florence, Italy: International Speech Communication Association. Retrieved from: http://www.isca-speech.org/archive/interspeech_2011/i11_1589.html.

Cheng, J., Chen, X., & Metallinou, A. (2015). Deep neural network acoustic models for spoken assessment applications. Speech Communication, 73, 14–27.

Cheng, J., D’antilio, Y. Z., Chen, X., & Bernstein, J. (2014). Automatic assessment of the speech of young English learners. Proceedings of the 9th workshop on innovative use of NLP for building educational applications (pp. 12–21). Baltimore, MD: Association for Computational Linguistics. Retrieved from http://anthology.aclweb.org/W/W14/W14-1802.pdf.

Cincarek, T., Gruhn, R., Hacker, C., Nöth, E., & Nakamura, S. (2009). Automatic pronunciation scoring of words and sentences independent from the non-native’s first language. Computer Speech & Language, 23, 65–88.

Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.

Elder, C., & Davies, A. (2006). Assessing English as a Lingua Franca. Annual Review of Applied Linguistics, 26, 282–304.

Eskenazi, M. (2009). An overview of spoken language technology for education. Speech Communication, 51, 832–844.

Evanini, K., & Zechner, K. (2011). Using crowdsourcing to provide prosodic annotations for non-native speech. Proceedings of InterSpeech 2011 (pp. 3069–3072). Florence, Italy: International Speech Communication Association. Retrieved from http://www.isca-speech.org/archive/interspeech_2011/i11_3069.html.

Franco, H., Neumeyer, L., Digalakis, V., & Ronen, O. (2000). Combination of machine scores for automatic grading of pronunciation quality. Speech Communication, 30, 121–130.

Franco, H., Neumeyer, L., Kim, Y., & Ronen, O. (1997). Automatic pronunciation scoring for language instruction. 1997 IEEE international conference on acoustics, speech, and signal processing, 2, 1471–1474. Los Alamitos, CA: IEEE Computer Society Press. Retrieved from http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=604597.

Fulcher, G. (2003). Testing second language speaking. Harlow, UK: Pearson Education.

Herron, D., Menzel, W., Atwell, E., Bisiani, R., Daneluzzi, F., Morton, R., & Schmidt, J. A. (1999). Automatic localization and diagnosis of pronunciation errors for second-language learners of English. EUROSPEECH’99 (pp. 855–858). Budapest, Hungary: International Speech Communication Association. Retrieved from www.isca-speech.org/archive/eurospeech_1999/e99_0855.html.

Higgins, D., Xi, X., Zechner, K., & Williamson, D. (2011). A three-stage approach to the automated scoring of spontaneous spoken responses. Computer Speech & Language, 25, 282–306.

Hiller, S., Rooney, E., Vaughan, R., Eckert, M., Laver, J., & Jack, M. (1994). An automated system for computer-aided pronunciation learning. Computer Assisted Language Learning, 7, 51–63.

Hincks, R. (2001). Using speech recognition to evaluate skills in spoken English. Lund University Department of Linguistics, Working Papers in Linguistics, 49, 58–61.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A. R., Jaitly, N., & Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29, 82–97.

Hu, W., Qian, Y., Soong, F. K., & Wang, Y. (2015). Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers. Speech Communication, 67, 154–166.

Jenkins, J. (2000). The phonology of English as an international language. Oxford: Oxford University Press.

Jenkins, J. (2006). Current perspectives on teaching world Englishes and English as a Lingua Franca. TESOL Quarterly, 40, 157–181.

Johnson, D. O., Kang, O., & Ghanem, R. (2016). Language proficiency ratings: Human vs. machine. In J. Levis, H. Le, I. Lucic, E. Simpson, & S. Vo (Eds.), Proceedings of the 7th pronunciation in second language learning and teaching conference (pp. 119–129). Dallas, TX. Retrieved from https://apling.engl.iastate.edu/alt-content/uploads/2016/06/PSLLT_Proceedings_7updated.pdf.

Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement, 4th ed. (pp. 18–64). Washington, DC: American Council on Education-Praeger.

Kang, O., & Moran, M. (2014). Functional loads of pronunciation features in nonnative speakers’ oral assessment. TESOL Quarterly, 48, 176–187.

Kang, O., Rubin, D. L., & Pickering, L. (2010). Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. The Modern Language Journal, 94(4), 554–566.

Kawai, G., & Hirose, K. (2000). Teaching the pronunciation of Japanese double-mora phonemes using speech recognition technology. Speech Communication, 30, 131–143.

Koniaris, C., Salvi, G., & Engwall, O. (2013). On mispronunciation analysis of individual foreign speakers using auditory periphery models. Speech Communication, 55, 691–706.

Lai, C., Evanini, K., & Zechner, K. (2013). Applying rhythm metrics to non-native spontaneous speech. Proceedings of SLaTE (pp. 159–163). Grenoble, France: International Speech Communication Association. Retrieved from http://www.isca-speech.org/archive/slate_2013/sl13_137.html.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.

Lefèvre, J.-P., Hiller, S. M., Rooney, E., Laver, J., & Di Benedetto, M. G. (1992). Macro and micro features for automated pronunciation improvement in the SPELL system. Speech Communication, 11, 31–44.

Lindblom, B. (1990). Explaining phonetic variation: A sketch of the H&H theory. In W. J. Hardcastle & A. Marchal (Eds.), Speech production and speech modeling (pp. 403–439). Dordrecht, the Netherlands: Kluwer Academic.

Ling, L. E., Grabe, E., & Nolan, F. (2000). Quantitative characterizations of speech rhythm: Syllable-timing in Singapore English. Language and Speech, 43, 377–401.

Loukina, A., Kochanski, G., Rosner, B., Keane, E., & Shih, C. (2011). Rhythm measures and dimensions of durational variation in speech. Journal of the Acoustical Society of America, 129, 3258–3270.

Loukina, A., Lopez, M., Evanini, K., Suendermann-Oeft, D., & Zechner, K. (2015). Expert and crowdsourced annotation of pronunciation errors for automatic scoring systems. Proceedings of InterSpeech 2015 (pp. 2809–2813). Dresden, Germany: International Speech Communication Association. Retrieved from http://www.isca-speech.org/archive/interspeech_2015/i15_2809.html.

Loukina, A., Rosner, B., Kochanski, G., Keane, E., & Shih, C. (2013). What determines duration-based rhythm measures: Text or speaker? Laboratory Phonology, 4, 339–382.

Loukina, A., Zechner, K., Chen, L., & Heilman, M. (2015). Feature selection for automated speech scoring. Proceedings of the tenth workshop on innovative use of NLP for building educational applications (pp. 12–19). Denver, CO: Association for Computational Linguistics. Retrieved from https://aclweb.org/anthology/W/W15/W15-0602.pdf.

Maier, A., Hönig, F., Zeisser, V., Batliner, A., Körner, E., Yamanaka, N., et al. (2009). A language-independent feature set for the automatic evaluation of prosody. Proceedings of InterSpeech 2009 (pp. 600–603). Brighton, UK: International Speech Communication Association. Retrieved from http://www.isca-speech.org/archive/interspeech_2009/i09_0600.html.

Moustroufas, N., & Digalakis, V. (2007). Automatic pronunciation evaluation of foreign speakers using unknown text. Computer Speech & Language, 21, 219–230.

Munro, M. J., & Derwing, T. M. (2006). The functional load principle in ESL pronunciation instruction: An exploratory study. System, 34, 520–531.

Neumeyer, L., Franco, H., Digalakis, V., & Weintraub, M. (2000). Automatic scoring of pronunciation quality. Speech Communication, 30, 83–93.

Neumeyer, L., Franco, H., Weintraub, M., & Price, P. (1996). Automatic text-independent pronunciation scoring of foreign language student speech. Proceedings of fourth international conference on spoken language processing (pp. 1457–1460). Philadelphia, PA: International Speech Communication Association. Retrieved from http://www.isca-speech.org/archive/icslp_1996/i96_1457.html.

Newman, R. S., Clouse, S. A., & Burnham, J. L. (2001). The perceptual consequences of within-talker variability in fricative production. Journal of the Acoustical Society of America, 109, 1181–1196.

Parent, G. (2013). Crowdsourcing for speech transcription. In M. Eskenazi, G.-A. Levow, H. Meng, G. Parent, & D. Suendermann (Eds.), Crowdsourcing for speech processing: Applications to data collection, transcription and assessment (pp. 72–103). Chichester, UK: John Wiley & Sons.

Peabody, M. A. (2011). Methods for pronunciation assessment in computer aided language learning (unpublished doctoral dissertation). Massachusetts Institute of Technology, Cambridge, MA.

Pearson. (2011). Pearson test of English academic: Automated scoring. Retrieved from http://pearsonpte.com/wp-content/uploads/2015/05/7.-PTEA_Automated_Scoring.pdf.

Pearson. (2013). Versant language tests. Retrieved from http://www.versanttest.com/.

Ramus, F., Nespor, M., & Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition, 73, 265–292.

Rypa, M. (1996). VILTS: The voice interactive language training system. Paper presented at CALICO, Albuquerque, NM. Retrieved from http://www.sri.com/work/publications/vilts-voice-interactive-language-training-system.

Seidlhofer, B. (2004). Research perspectives on teaching English as a Lingua Franca. Annual Review of Applied Linguistics, 24, 209–239.

Strik, H., Truong, K., De Wet, F., & Cucchiarini, C. (2009). Comparing different approaches for automatic pronunciation error detection. Speech Communication, 51, 845–852.

Sun, X., & Evanini, K. (2011). Gaussian mixture modeling of vowel durations for automated assessment of non-native speech. 2011 IEEE international conference on acoustics, speech and signal processing (pp. 5716–5719). Prague, Czech Republic: IEEE. Retrieved from http://ieeexplore.ieee.org/document/5947658/.

Tao, J., Evanini, K., & Wang, X. (2014). The influence of automatic speech recognition accuracy on the performance of an automated speech assessment system. 2014 IEEE spoken language technology workshop (pp. 294–299). South Lake Tahoe, CA: IEEE. Retrieved from http://ieeexplore.ieee.org/document/7078590/.

Teixeira, C., Franco, H., Shriberg, E., Precoda, K., & Sönmez, K. (2000). Prosodic features for automatic text-independent evaluation of degree of nativeness for language learners. Proceedings of the 6th international conference on spoken language processing (vol. 3, pp. 187–190). Beijing, China: International Speech Communication Association. Retrieved from http://www.isca-speech.org/archive/icslp_2000/i00_3187.html.

Tepperman, J., & Narayanan, S. (2008). Using articulatory representations to detect segmental errors in nonnative pronunciation. IEEE Transactions on Audio, Speech, and Language Processing, 16, 8–22.

Tetreault, J., Chodorow, M., & Madnani, N. (2013). Bucking the trend: Improved evaluation and annotation practices for ESL error detection systems. Language Resources and Evaluation, 48, 5–31.

Townshend, B., Bernstein, J., Todic, O., & Warren, E. (1998). Estimation of spoken language proficiency. Proceedings of STIL 1998 (pp. 179–182). Marholmen, Sweden: International Speech Communication Association. Retrieved from http://www.isca-speech.org/archive_open/still98/stl8_179.html.

Van Doremalen, J., Cucchiarini, C., & Strik, H. (2010). Optimizing automatic speech recognition for low-proficient non-native speakers. EURASIP Journal on Audio, Speech, and Music Processing, 2010, Article ID 973954.

Van Doremalen, J., Cucchiarini, C., & Strik, H. (2013). Automatic pronunciation error detection in non-native speech: The case of vowel errors in Dutch. Journal of the Acoustical Society of America, 134, 1336–1347.

White, L., & Mattys, S. L. (2007). Calibrating rhythm: First language and second language studies. Journal of Phonetics, 35, 501–522.

Witt, S. M., & Young, S. J. (2000). Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30, 95–108.

Xi, X. (2010a). Automated scoring and feedback systems: Where are we and where are we heading? Language Testing, 27, 291–300.

Xi, X. (2010b). How do we go about investigating test fairness? Language Testing, 27, 147–170.

Xi, X. (2012). Validity and the automated scoring of performance tests. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 438–451). New York: Routledge.

Xi, X., Higgins, D., Zechner, K., & Williamson, D. M. (2008). Automated scoring of spontaneous speech using SpeechRater v1.0. ETS Research Report No. RR-08-62. Princeton, NJ: Educational Testing Service.

Zechner, K., Evanini, K., Yoon, S-Y., Davis, L., Wang, X., Chen, L., . . . Leong, C. W. (2014). Automated scoring of speaking items in an assessment for teachers of English as a foreign language. Proceedings of the ninth workshop on innovative use of NLP for building educational applications (pp. 134–142). Baltimore, MD: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W14-1816.

Zechner, K., Higgins, D., Xi, X., & Williamson, D. M. (2009). Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51, 883–895.

Zechner, K., Xi, X., & Chen, L. (2011). Evaluating prosodic features for automated scoring of non-native read speech. 2011 IEEE workshop on automatic speech recognition & understanding (pp. 461–466). Waikoloa, HI: IEEE. Retrieved from http://ieeexplore.ieee.org/document/6163975/.