
INTRODUCTION

Okim Kang and April Ginther

The assessment of second language (L2) speaking proficiency has been of central interest to researchers in Applied Linguistics since the first discussions of communicative competence (Hymes, 1972; Canale & Swain, 1980); however, research on pronunciation, once marginalized in part due to its association with discrete aspects of oral production (Lado, 1961, 1964), is now emerging as a revitalized field of inquiry with its own important implications and concerns. Part of this resurgence can be attributed to a shift in focus from perceptions of accentedness to broader aspects of performance, primarily intelligibility and comprehensibility.

Since the mid-1990s, there has been enormous growth in research on L2 pronunciation. Pronunciation is an essential aspect of the assessment of oral skills because it is fundamental to the construction of spoken discourse in L2 performance; that is, listeners begin by processing the individual sounds produced by L2 speakers in order to arrive at an interpretation of a stream of speech. The discrete sounds of speech remain a critical area of investigation, as listeners tend to attribute native/nonnative speaker status on the basis of pronunciation (Luoma, 2004). Pronunciation is also an important facet of proficiency, one on which most L2 learners have ready views and clear motivations (Leather, 1999). In recognition of this significance, the 4th Pronunciation in Second Language Teaching and Learning (PSLLT) conference, held in Vancouver, British Columbia, in 2012, took pronunciation and assessment as its theme. In fact, it was PSLLT 2012 that initially motivated the idea for this edited volume.

However, the history of L2 pronunciation has been compared to a pendulum swinging back and forth between times when pronunciation has been completely ignored and times when it has been of primary importance. Because the role of pronunciation in L2 language learning has been a history of extremes (Levis, 2005), its role in assessment has fluctuated accordingly. In some cases, assessment has focused on the accuracy of segmentals; in others, on the approximation or mastery of suprasegmentals. Since about 2005, however, L2 pronunciation research has gained a renewed focus on intelligibility, comprehensibility, and accentedness, and the methods of assessing pronunciation have reflected these historical shifts.


Thomson examines the representation of accentedness, comprehensibility, and intelligibility, the criterial components of pronunciation assessment. He traces the transformative influence of Munro and Derwing, who have addressed, through a long line of related investigations, both the definition and the inter-relatedness of these constructs. The recognition that accentedness, comprehensibility, and intelligibility are related but partially independent constructs has paved the way for pronunciation teaching and research to put accent in its place (Derwing & Munro, 2009) and to emphasize the end goal of comprehensibility. Interest in these inter-related constructs has led researchers to investigate how they are best operationalized and measured, how they affect listeners’ ratings, and how they should be addressed in the ESL/EFL classroom. Thomson’s comprehensive discussion of methodological approaches to the measurement and evaluation of these leading pronunciation constructs should be of use to teachers and researchers in the field.

One of the critical issues involved in assessing pronunciation is determining to what extent measures of pronunciation constructs are valid and reliable. Harding argues that L2 pronunciation assessment presents unique challenges at every step of the inferential chain: from performance, to score assignment, to the decisions for which a pronunciation assessment was intended, and to the pedagogical and social consequences beyond. He raises legitimate questions: Are administration and scoring procedures accurate and consistent? Does the task used in assessment elicit relevant target features? Does the test yield a score that is fit for decision-making purposes? Is the assessment fair? While such questions apply across all types of language assessment, some of them are central to the concept of validity in L2 pronunciation.

The discussion of validity is directly related to the context of World Englishes (WE), where the reliance on prestigious inner-circle norms of native English has been challenged. This validity issue concerns the sociolinguistic realities of diverse language learners’ actual use (Elder & Harding, 2008). Because defining a standard norm is problematic in the era of globalization, where new norms are emerging, the assessment of L2 pronunciation must address more questions now than ever before. Should pronunciation assessment focus on accuracy with respect to a norm or on mutual intelligibility? What role does accentedness play with respect to mutual intelligibility? Given that variability exists even within native-speaker varieties of English (e.g., British, American, New Zealand, or Australian English), a focus on accuracy (i.e., deviations from a native-speaker norm) introduces the complexity of identifying and selecting multiple norms for teaching and assessment: whose norm, and which norm, is the most appropriate? Furthermore, research in L2 pronunciation is now geared toward setting a realistic goal for pronunciation acquisition, namely intelligibility and comprehensibility rather than native-like pronunciation (Munro & Derwing, 2011, 2015). While most large-scale language assessments still appeal to established standard varieties, the “educated native speaker” has receded into the background, and comprehensibility and intelligibility have become the focus.


In all pronunciation endeavors, the importance of listeners cannot be overstated, and the background characteristics that listeners bring to assessment tasks have been found to influence their evaluations of accentedness, comprehensibility, and intelligibility. In this volume, two research chapters examine listeners’ speech judgments and the background factors that affect their evaluations of speech. Because idiosyncratic listener judgments pose a threat to both the reliability and the validity of operational tests, listeners become raters when they are trained to rate to a scale. Yan and Ginther discuss the differences between listeners and raters and highlight the effects of rater training in speaking and pronunciation assessment.

The use of listeners who apply minimally specified scales in broad, applied research contexts invites attention to reliability and scale representation (validity). While 9-point Likert-type scales are widely used, idiosyncratic application may be masked by the over-generality of the scales. Identifying differences in how raters apply a scale is challenging when only minimal specifications are provided; raters have in fact reported difficulty in differentiating middle scale points (Isaacs & Thomson, 2013). Empirical evidence is needed to better understand how listeners may differentially interpret scales based on their own underlying representations of different pronunciation constructs.

On a more technical side, L2 pronunciation involves many acoustic features of the speech stream, such as the quality of vowels and consonants, the presence and placement of pauses, as well as broader, measurable components of prosody: stress, rhythm, and intonation. Understanding how acoustic features influence human ratings of pronunciation presents a rich domain for research. Thanks to advances in speech science, computer-assisted instruments can aid in examining elements of the acoustic properties of L2 pronunciation. Knowledge of these instrumentally analyzed pronunciation properties has the potential to advance our understanding of speech production and to inform rubric development and rater training in oral proficiency testing. The physical properties of acoustic measures can also serve as a basis for speech recognition and processing techniques, which have increasingly attracted the attention of language testers. As the improvement of automated speech recognition (ASR) effectiveness for L2 speech continues to be of interest to L2 researchers (e.g., Oh, Yoon & Kim, 2007), a better understanding of this technology-driven approach to pronunciation assessment is of benefit to all interested parties (Van Moere & Downey, 2016). A small illustration of this kind of instrumental measurement follows.
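As a concrete, if simplified, illustration of the instrumental analysis described above, the following sketch uses the open-source librosa audio library to extract two rough measures often discussed in L2 speech research: a pause profile and pitch range. The file name is hypothetical, the energy threshold is arbitrary, and the pause heuristic is a simplification of the validated procedures used in the field.

```python
# A minimal sketch, assuming a mono recording of L2 speech;
# "l2_speech_sample.wav" is a hypothetical file name.
import librosa
import numpy as np

y, sr = librosa.load("l2_speech_sample.wav", sr=None)

# Rough pause profile: treat stretches more than 30 dB below the
# signal peak as silent; the 30 dB threshold is an arbitrary choice.
voiced_intervals = librosa.effects.split(y, top_db=30)
speech_time = sum(end - start for start, end in voiced_intervals) / sr
total_time = len(y) / sr
n_pauses = max(len(voiced_intervals) - 1, 0)  # gaps between voiced stretches
print(f"phonation-time ratio: {speech_time / total_time:.2f}")
print(f"pause count (rough estimate): {n_pauses}")

# Pitch range as a crude intonation measure: fundamental frequency
# estimated with probabilistic YIN over a plausible speaking range.
f0, voiced_flag, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
f0_voiced = f0[voiced_flag]
print(f"F0 range: {np.nanmin(f0_voiced):.0f}-{np.nanmax(f0_voiced):.0f} Hz")
```

Measures such as these are only descriptive starting points; relating them to human judgments of comprehensibility still requires the rating data and validation work discussed throughout this volume.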

Overall, despite researchers’ and language teachers’ growing interest in the issues mentioned above, there are limited resources that address them in a systematic and comprehensive manner, and a reliable, wide-ranging source of information reflecting current knowledge of these topics is needed. The current volume attempts to serve this purpose. Our contributors offer their specialized understanding of the importance of pronunciation in the assessment context, and the volume provides a bridge between the two fields, highlighting both common concerns and differences for researchers in L2 pronunciation and in language assessment. We hope that this volume can be used by students and instructors in upper-level undergraduate, graduate, and postgraduate courses, as well as by established and emerging scholars involved in research on L2 speech and language assessment.


This volume offers detailed accounts of issues related to L2 pronunciation and assessment. It is divided into two parts containing eight chapters, written by applied linguists who specialize in language assessment and by researchers with expertise in L2 pronunciation and speaking. The topics that guided the chapter selections include:

•    measurement and evaluation of pronunciation constructs;

•    validity and reliability of pronunciation assessment;

•    World Englishes and the assessment of pronunciation;

•    listeners’ individual variation and background characteristics in the evaluation of L2 accent;

•    pronunciation features to inform rating criteria;

•    assessing pronunciation via speech technology.

Part I, “Current issues in pronunciation assessment,” begins with Thomson’s “Measurement of accentedness, intelligibility, and comprehensibility” (Chapter 1), which surveys methodological approaches to the measurement and evaluation of the leading constructs associated with L2 pronunciation, with the goal of providing a detailed description for applied linguists who are just entering this area of research and a clear overview for established researchers. The chapter begins by outlining what the author considers to be valid reasons for measuring L2 pronunciation. Three dominant L2 pronunciation constructs are then introduced: intelligibility, comprehensibility, and accentedness. While these terms are widely used in the literature, they are not always applied consistently. Thomson describes key findings of research investigating accent, intelligibility, and comprehensibility and suggests new directions for L2 pronunciation assessment as well as recommendations for further research.

In Chapter 2, “Validity in pronunciation assessment,” Harding draws on recent developments in validity theory to illuminate the issues facing all test developers, but particularly those interested in pronunciation assessment. Threats to validity are identified with reference to pronunciation assessment, and research that has investigated such threats is discussed. He also outlines common research methods for conducting pronunciation assessment validation research and suggests new directions for validity-related research in pronunciation assessment, with the recommendation that pronunciation assessment is a case where effect-driven test design is clearly warranted.

Dimova, in “Pronunciation assessment in the context of World Englishes” (Chapter 3), begins her chapter by outlining early WE conceptualizations of pronunciation through the model of understanding in cross-cultural communication. She continues with a discussion of criticisms of current practices in language testing and assessment, which claim that the field has failed to adopt the WE perspective and to realistically represent the variation of pronunciation norms in international communication. She argues that embracing WE or English as a Lingua Franca, particularly in relation to pronunciation, is a challenging task due to the existing constraints guiding the design of valid tests that accurately represent the domains of target language use (Elder & Harding, 2008). She concludes that despite these constraints, strides have been made toward encompassing a WE perspective in test construction and task design, especially in listening and speaking tests.


Yan and Ginther, in “Listeners and raters: similarities and differences in evaluation of accented speech” (Chapter 4), make a distinction between listeners and raters. They compare findings from research investigating background characteristics observed to have an impact on general listeners’ perceptions of L2 accents with findings concerning background characteristics that have influenced raters’ evaluations of L2 oral proficiency in testing contexts. While there is overlap, they argue that in operational assessment contexts it is important to consider how rater interactions with L2 accents may introduce construct-irrelevant variance into the assessment domain. They emphasize the importance of rater training in mitigating potential rater bias when high-stakes decisions are involved. Accentedness and comprehensibility are discussed as embedded components within a broader speaking score.

Isbell’s “Assessing pronunciation for research purposes with listener-based numerical scales” (Chapter 5) takes a slightly different approach from the other chapters in this volume. Analyzing data from a study on L2 Korean pronunciation instruction, the chapter explores the degree to which scores derived from scales commonly used in L2 pronunciation research represent interval-level measurement, a necessary precondition for many subsequent inferential statistical analyses. Many-facet Rasch measurement, an analytical technique just beginning to be used in L2 pronunciation research (e.g., Isaacs & Thomson, 2013), is employed to investigate in detail the functioning of numerical pronunciation scales and variation in rater judgments. His findings illustrate that when raters are treated as fixed items, internal consistency for both comprehensibility and accentedness scores may be high, yet important individual differences, particularly for accentedness scores, may be hidden. While most L2 pronunciation studies report an overall reliability for each attribute measured, the illustrative example here suggests a need for closer examination of how accentedness and comprehensibility scales function across individuals. Isbell’s chapter illustrates the usefulness of Rasch-based analyses for further explicating aspects of the underlying constructs; a small simulation of the masking effect he describes follows.
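To make the masking effect concrete, the sketch below simulates three raters who agree strongly on the ordering of speakers but differ systematically in severity on a 9-point scale. All data are synthetic and the helper function is a standard Cronbach’s alpha computation; nothing here is drawn from Isbell’s actual study.

```python
# A minimal simulation, assuming synthetic ratings: overall internal
# consistency stays high even though raters differ sharply in severity.
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for a speakers-by-raters score matrix."""
    k = ratings.shape[1]
    item_var = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

rng = np.random.default_rng(0)
quality = rng.normal(5, 1.5, size=60)  # latent comprehensibility of 60 speakers

def rate(bias: float) -> np.ndarray:
    """One rater: latent quality plus a severity bias, on a 1-9 scale."""
    return np.clip(np.round(quality + bias + rng.normal(0, 0.5, 60)), 1, 9)

ratings = np.column_stack([rate(+1.5), rate(-1.5), rate(0.0)])
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")  # high
print("rater means:", ratings.mean(axis=0))  # severity gaps of ~3 scale points
```

Because alpha depends on how consistently raters rank speakers rather than on whether they use the same region of the scale, the severity gaps leave it largely untouched; facet-based models such as many-facet Rasch measurement are designed to expose exactly this kind of difference.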

Part II, “Technology and pronunciation assessment,” describes the strides that advances in technology have made possible in producing reliable and effective assessment of accented speech, as well as the challenges and issues that remain. Ghanem and Kang’s Chapter 6, “Pronunciation features in rating criteria,” provides descriptive accounts of pronunciation features that may serve to inform rating criteria. In particular, they offer a detailed explanation of how selected pronunciation features can be measured, with examples and step-by-step procedures. They then review rating descriptors currently used in high-stakes tests and link them to pronunciation evaluation criteria as part of speaking skill assessment. They further discuss the development of rating scales for pronunciation in speaking to support the rating process and argue that the most relevant features of the speaking performance should be included.


Chapter 7 by Van Moere and Suzuki, “Using speech processing technology in assessing pronunciation,” reviews the state of the art in assessing aspects of pronunciation using automatic speech recognition (ASR) and machine scoring techniques. The chapter describes, without formulas, how ASR systems can be developed to predict expert judgments of pronunciation and discusses the approach from the perspective of establishing a definable norm against which the construct of pronunciation can be measured. Test developers’ choices about who should comprise this reference set of speakers can influence how pronunciation is evaluated. Van Moere and Suzuki highlight the future potential of automated pronunciation assessment, such as the possibility of automated systems that indicate a speaker’s level of accentedness, intelligibility, and comprehensibility in reference to different L1 listener groups, and recommend possible future directions and improvements in the technology. The general logic of predicting expert judgments from ASR-derived features is sketched below.
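As a schematic of that logic, and not of any particular operational system, the sketch below fits a simple regression from a handful of ASR-style features to simulated expert scores. The feature names, data, and model choice are hypothetical stand-ins for the richer, proprietary pipelines the chapter describes.

```python
# A minimal sketch, assuming synthetic data: map ASR-derived features
# to expert pronunciation judgments with a regularized linear model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 200  # hypothetical number of scored spoken responses

# Hypothetical per-response features an ASR pipeline might emit:
# [normalized acoustic likelihood, words per second, pause ratio]
X = rng.normal(size=(n, 3))

# Simulated expert scores on a 1-6 scale, correlated with the features.
y = np.clip(3.5 + X @ np.array([0.8, 0.5, -0.6]) + rng.normal(0, 0.5, n), 1, 6)

model = Ridge(alpha=1.0)
r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2 against expert judgments: {r2.mean():.2f}")
```

In an operational system, the quality of such a model is reported as its agreement with trained human raters, which is why the choice of reference speakers and judges shapes what the machine ultimately learns to reward.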

Further examination of how technology links to pronunciation assessment comes from the Educational Testing Service team of Loukina, Davis, and Xi in “Automated assessment of pronunciation in spontaneous speech” (Chapter 8). This chapter begins with a brief review of early efforts to automatically assess pronunciation, which generally focused on constrained speech, followed by a discussion of approaches used in the measurement of pronunciation in unconstrained language. Loukina et al. also discuss validity issues related to the automated scoring of pronunciation in unconstrained speech, focusing on how careful consideration of the evidence needed to support the various claims in a validity argument can encourage critical thinking about the conceptual issues involved in automated pronunciation assessment. They conclude with an examination of current trends and future opportunities in this domain, such as the impact of continuing improvements in speech recognition technology and improvements to pronunciation measures stimulated by trends such as the rise of ‘big data.’

The current volume examines critical issues relevant to L2 pronunciation assessment. Some of these (e.g., validity and the use of scales) are not limited to L2 pronunciation but extend to language assessment in general. ESL teachers and learners can benefit from our contributors’ discussions, which can inform the teaching and learning of L2 speaking and pronunciation skills. We observe that pronunciation assessment still juggles the two contradictory principles that have guided research and instruction in L2 pronunciation: the nativeness principle vs. the intelligibility principle (Levis, 2005). This issue is discussed throughout the volume and should be a guiding point for decision-making processes in L2 oral assessment. We also argue that listener characteristics and contextual influences should be carefully examined to strengthen pedagogical and theoretical accounts in L2 speech and assessment studies. Last but not least, the future of assessment may be strongly influenced by automated scoring systems in which machines are trained to approximate the accuracy of human ratings of accented speech. It will be exciting to see where improvements in the ASR approach lead the field of speech assessment.


References

Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47.

Derwing, T. M., & Munro, M. J. (2009). Putting accent in its place: Rethinking obstacles to communication. Language Teaching, 42, 1–15.

Elder, C., & Harding, L. (2008). Language testing and English as an international language. Australian Review of Applied Linguistics, 31(3), 34.1–34.11.

Hymes, D. H. (1972). On communicative competence. In J. B. Pride & J. Holmes (Eds.), Sociolinguistics: Selected readings (pp. 269–293). New York: Penguin.

Isaacs, T., & Thomson, R. I. (2013). Rater experience, rating scale length, and judgements of L2 pronunciation: Revisiting research conventions. Language Assessment Quarterly, 10, 135–159.

Lado, R. (1961). Language testing: The construction and use of foreign language tests. London: Longman.

Lado, R. (1964). Language teaching: A scientific approach. New York: McGraw-Hill.

Leather, J. (1999). Second language speech research: An introduction. Language Learning, 49, 1–56.

Levis, J. (2005). Changing contexts and shifting paradigms in pronunciation teaching. TESOL Quarterly, 39, 369–377.

Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.

Munro, M. J., & Derwing, T. M. (2011). The foundations of accent and intelligibility in pronunciation research. Language Teaching, 44, 316–327.

Munro, M. J., & Derwing, T. M. (2015). A prospectus for pronunciation research in the 21st century: A point of view. Journal of Second Language Pronunciation, 1(1), 11–42.

Oh, Y. R., Yoon, J. S., & Kim, H. K. (2007). Acoustic model adaptation based on pronunciation variability analysis for non-native speech recognition. Speech Communication, 49, 59–70.

Van Moere, A., & Downey, R. (2016). Technology and artificial intelligence in language assessment. In D. Tsagari & J. Banerjee (Eds.), Handbook of second language assessment (pp. 342–357). Berlin: De Gruyter Mouton.