
4

LISTENERS AND RATERS

Similarities and differences in evaluation of accented speech

Xun Yan and April Ginther

Introduction

This chapter will review research findings on listener background characteristics that influence evaluations of L2 accented speech, and discuss how these findings may affect both listeners and raters when evaluating speech. In this chapter, we define raters as a group of specialists who are formally trained to rate speaking performance on a language proficiency test. In contrast, we define listeners as anyone, regardless of formal training, who is asked to evaluate speech for research purposes. Although these two terms may be used interchangeably, we want to emphasize the difference between the two groups because, in assessment contexts, not only is internal consistency of the scale important, but agreement among raters is also required, and rater training is standard practice.

Speaking with an accent is an inherent characteristic of being human, central to our identities as both first and second language speakers. Indeed, a speaker’s accent is one of the first things that listeners notice. From first impressions based on accentedness, listeners may attribute, correctly or incorrectly, many additional characteristics to the speaker. For example, Labov (2006) demonstrated that listeners infer speakers’ nationalities, regional memberships, ethnicities, and social class from speech. Furthermore, listeners may attribute other characteristics to speakers, including levels of intelligence (Lambert, Hodgson, Gardner, & Fillenbaum, 1960), communication skills (Hosoda, Stone-Romero, & Walter, 2007), social desirability (Kinzler & DeJesus, 2013), perceived competence (Nelson, Signorella, & Botti, 2016), teaching skills (Fox, 1992), and suitability for employment (Kalin & Rayko, 1978). The speaker’s social identity can also affect listener perceptions (Kang & Rubin, 2009). Despite the wide variation of attribution studies (see Giles & Billings, 2004, for an overview), findings easily lead to the conclusion that “spoken language has a socially communicative power beyond the literal information it conveys” (Kinzler, Shutts, DeJesus, & Spelke, 2009, p. 1).


Social psychologists and applied linguists concerned with listener attitudes about speakers have left no stone unturned with respect to the ability of accentedness to trigger attributions of many kinds; the salience of accent to listeners made accentedness an attractive area of investigation. In a widely cited study, Flege (1984) demonstrated that listeners are able to identify an accent different from their own after listening to as little as 30 milliseconds of recorded speech. However, research on accentedness in second language (L2) pronunciation is not primarily concerned with attribution as related to accentedness. Rather, it examines accentedness in relation to the larger constructs that matter in effective communication: intelligibility and comprehensibility. Researchers (e.g., Derwing & Munro, 2009; Kang, 2010; Munro & Derwing, 1998; Trofimovich & Isaacs, 2012) generally agree that the goal of pronunciation teaching and learning should be comprehensibility and intelligibility.

As with other aspects of language proficiency, research has found that accentedness, comprehensibility, and intelligibility, while related, remain at least partially independent (e.g., Derwing & Munro, 2009; Thomson, Chapter 1 this volume). Results from a long line of investigations examining the relationships among accentedness, intelligibility, and comprehensibility (e.g., Derwing & Munro, 1997; Munro, 2008; Munro & Derwing, 1995) suggest that while speech identified as unintelligible is always identified as highly accented, highly accented speech is not always identified as unintelligible. With respect to comprehensibility ratings, listeners display a tendency to rate accentedness more harshly and comprehensibility more leniently, again suggesting that while listeners are sensitive to accent, accented speakers largely remain comprehensible. Sharing common ground with research in social psychology on attribution, Derwing and Munro (2009) state, “we want to emphasize again that accent is about difference”; however, they extend their arguments to broader domains: “comprehensibility is about the listener’s effort, and intelligibility is the end result: how much the listener actually understands” (p. 480). A listener’s perception of difference (the speaker’s deviation from a selected or assumed norm) may influence (perceived) comprehensibility (the processing effort the listener expends) in attempts to arrive at intelligibility (what is actually understood). The interactions are important for teaching and learning because they affect decisions about what and how L2 learners should be taught.

As in social psychology, studies of accentedness and comprehensibility in applied linguistics largely depend on listeners’ subjective evaluations of accentedness and comprehensibility. Intelligibility, on the other hand, is usually measured objectively, operationalized as the accurate identification of words or phrases reported as number or percent correct. The scales for accentedness and comprehensibility tend to be minimally explicated (only endpoints are anchored), and raters are not trained to the scale; indeed, there is very little that might provide a basis for training. The use of minimally explicated scales with minimally trained listeners is attractive for ease of administration, but also for understanding the general listener’s perceptions of accentedness and how these perceptions in turn influence comprehensibility. As Derwing and Munro (2009) state, “From our perspective, listeners’ judgments are the only meaningful window into accentedness and comprehensibility. For this reason, judgment data are the gold standard; what listeners perceive is ultimately what matters most” (p. 478). As teachers, we want to focus on those aspects of a speaker’s production that will facilitate effective communication in contexts outside of the classroom.


In language testing, the contributions of listeners’ perceptions are again the “gold standard,” but listeners are seldom left to their own devices. They are trained to rate to scales that include descriptors (i.e., characteristics of performance expected at each level in order to facilitate score assignment). When rating holistically or globally, the scales and descriptors represent the test developers’ best attempts to capture the underlying constructs. Ginther (2013) explains:

To clarify what such a global assessment means, the abilities associated with scale levels are represented by level descriptors which represent a qualitative summary of the raters’ observations. In order to facilitate description, benchmark performances are selected to exemplify the levels and their descriptors. Such descriptors are typically associated with, but are not limited to, descriptions of the following components of a speaking performance at different levels of the scale: pronunciation (focusing on segmentals); phonological control (focusing on suprasegmentals); grammar/accuracy (morphology, syntax, and usage); fluency (speed and pausing); vocabulary (range and idiomaticity); coherence and organization. If the assessment involves evaluation of interaction, the following may also be included: turn-taking strategies, cooperative strategies, and asking for or providing clarification when needed.

(p. 3)

Holistic scales are best understood as providing raters with guidelines, not blueprints. The generality of rater scales serves several purposes. No scale is a perfect representation of the underlying constructs, and test developers allow room for scales to develop. Through regular quality control procedures, and as conceptualizations of the underlying constructs change over time, descriptors may be revised in order to facilitate rater reliability and enhance scale validity.

An important difference between the use of everyday listeners in L2 pronunciation research and trained raters for language testing is that the former focuses primarily on listener perceptions of accentedness, intelligibility, and comprehensibility, while the latter focuses on language proficiency, of which accentedness, intelligibility, and comprehensibility are parts. Research across these domains differs as well because of the purposes involved: pronunciation research has focused on the relationships among accentedness, comprehensibility, and intelligibility with respect to instruction, while language testing research has focused on general language proficiency with respect to decisions about selection and placement. Given the different purposes across these two domains, it is tempting to argue that there is limited common ground. Happily, this is not the case.


Both pronunciation and language testing researchers are invested in better explication of individual differences in perceptions of accentedness, comprehensibility, and intelligibility. In pronunciation research, understanding individual differences may require that generalizations be tempered, but better representations of the underlying constructs may result; in language testing, individual differences signal potential problems with language proficiency scales and/or with raters’ application of those scales. Again, the underlying constructs are central. Developing a more complete understanding of the judgment of listeners or raters is part and parcel of both domains.

This chapter continues by discussing how listener/rater background characteristics may influence oral language proficiency score assignment. In the following sections, we first provide a historical context for issues related to accentedness, intelligibility, and comprehensibility of L2 speech. Next, we review theoretical and empirical research in speech perception and language testing in order to highlight the listener background characteristics that influence evaluations of L2 accented speech. We then discuss listener background characteristics that have been examined as potential sources of rater bias in pronunciation and speaking assessment. Finally, we recommend new directions for research on listener background characteristics in pronunciation and speaking assessment, with respect to: (1) the practical impact of listener background characteristics on rater performance as compared to other sources of rater bias; (2) the effectiveness of rater training on intelligibility and comprehensibility judgments; and (3) individual differences in intelligibility and comprehensibility judgments among everyday listeners and their implications for pronunciation rating and rater training in speaking assessment.

Historical context and conceptualizations

The evaluation of accented speech has been extensively researched in speech production and perception and in language testing. However, to better understand the impact of listener background characteristics and their implications for pronunciation and speaking assessment, it is important to contextualize the evaluation of L2 accented speech in a setting that is commonly investigated in both L2 pronunciation and language testing research. In this section, issues related to the language proficiency of international teaching assistants (ITAs) are examined to provide a historical context that will be familiar to most readers. As this volume focuses on pronunciation and speaking assessment, fundamental measurement concepts and principles related to scales, raters, and rater training are also introduced to put the evaluation of L2 speech by listeners vs. raters into perspective.

Measuring and judging

How we evaluate performance is complicated by the fact that some aspects of language proficiency are easily measured (e.g., percent correct, speech rate), while others must be judged (e.g., accentedness, comprehensibility). The importance of listeners’ judgments is central not only to research examining accentedness and comprehensibility, as Derwing and Munro (1997) noted, but also to the evaluation of performance in general. Spolsky (1995) explains by comparing the evaluation of language proficiency to evaluation of performance in another domain:


While I was writing this book, the Olympic Games took place in Barcelona, and I was struck by the thought that testing has many analogies with the two kinds of events that athletes participate in. One class of events has strictly measured results: the time of the 100-metre race, the distance the shot is put, the weight that is lifted, the number of goals scored in a hockey or football match . . . Other events, however, continue to depend on subjective scores awarded by judges: the diving, the gymnastics, the skating, the dressage. In these terms, the field of language testing faces the question of whether language proficiency can be measured or judged, or perhaps only described.

(p. 6)

When accentedness, intelligibility, and comprehensibility are examined, only intelligibility lends itself to objective measurement (usually the percent of correctly identified phonemes or words); accentedness (the degree to which a performance may deviate from a particular norm) and comprehensibility (the difficulty associated with listeners’ ability to process the intended message) are usually evaluated by listeners.
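To make the measurement distinction concrete, the following sketch (ours, for illustration only; operational studies often apply stricter alignment and scoring rules, and the function name and example sentences are hypothetical) computes intelligibility as the percent of reference words a listener correctly recovers in a transcription:

```python
from collections import Counter

def percent_words_correct(reference: str, transcription: str) -> float:
    """Intelligibility as the percent of reference words recovered in a
    listener's transcription (order-insensitive multiset match)."""
    ref = Counter(reference.lower().split())
    heard = Counter(transcription.lower().split())
    correct = sum((ref & heard).values())  # words both produced and heard
    return 100 * correct / sum(ref.values())

# A listener mishears two of eight words: intelligibility = 75.0
print(percent_words_correct(
    "the results of the experiment were quite clear",
    "the results of the experiment were light here"))
```

No comparable computation exists for accentedness or comprehensibility; those remain listener judgments on rating scales.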

In the example provided by Spolsky, judges in the Olympics are not only experts in the performance domain but also are trained in the application and interpretation of the rating scales used. All spectators will have preferences and opinions, but if non-expert raters were used to actually rate a figure skater’s performance, the skater could reasonably be expected to object. Non-expert, inexperienced, or naïve raters may be subject to bias, and these biases may be a function of nationality (e.g., Americans might judge American skaters more favorably), preferences unrelated to the actual performance (e.g., a preference for music or costume), or difficulty in the use and application of the scale (e.g., untrained raters may award points differently). Untrained raters may agree or disagree with the evaluation of trained raters; nevertheless, in the Olympics, as in language testing, the use of trained raters is required.


Evaluating the language proficiency of international teaching assistants

With reference to listener judgments of accentedness and how these perceptions have influenced the assessment of L2 proficiency, nothing has had a greater impact than the increase in the number of ITAs in North American institutions of higher education. Rubin (1992) commented that the issues surrounding ITAs were among the few instructional issues that captured the attention of the popular press. Ginther (2003) describes the educational context in the late 1980s that gave rise to the controversy:

[t]he increase in the number of international students in North American institutions of higher learning coincided with the increasing dependence of graduate programs in science, technology, engineering, and mathematics (STEM) to fill graduate classes, staff research laboratories and to provide undergraduate instruction.

(p. 57)

Many undergraduate students, mostly L1 English speakers, found their studies complicated by the presence of international instructors. Undergraduate student complaints, often focused on accentedness, eventually led 27 U.S. state legislatures to mandate ITA screening at public institutions (Oppenheim, 1997). ITA language proficiency screening and instruction are now commonplace at most North American institutions with large populations of international graduate students.

This controversy, referred to at the time as the foreign TA problem (Bailey, 1984), led to a considerable amount of research investigating the sources of and solutions to the difficulties and challenges ITAs faced (see Briggs, Hyon, Aldridge, & Swales, 1990, for an overview). The development of ITA programs reflects the idea that while accentedness may need attention, especially in individual cases, instruction should include broader aspects of language proficiency that influence the performance of teaching assistants (e.g., vocabulary, discourse competence, fluency, prosody, interactivity, cultural expectations). During the heyday of ITA support programs, instructional approaches that focused primarily on accent reduction gave way to programs where pronunciation was embedded within the development of academic language proficiency.

Similarly, ITA language test developers consider pronunciation, but the construct is typically embedded within scales that focus on multiple and complementary aspects of proficiency. These scales may also include reference to teaching. However, in an examination of the viability of using the TOEFL iBT speaking subsection for ITA testing, Xi (2008) found that when speaking scales included teaching, the correlations between the local scales and the TOEFL iBT speaking subsection were much lower than for comparisons across scales that focused solely on language proficiency. Ginther (2003) argued that the inclusion of teaching in ITA evaluation is problematic since the teaching skills of L1 speakers of English are seldom considered when assistantships are assigned. Of course, that is not to say that the quality of teaching is not critical to the success of ITAs, and there are reasons why teaching can or should be included in their training; however, including teaching in an operational ITA test is a decision that must be made deliberately, in light of fairness issues as well as the additional complexity it introduces for reliable and valid assessment. In order to address the tension between language proficiency and teaching effectiveness, some ITA programs opt for direct assessment methods that simulate classroom contexts but exclude teaching quality from scale descriptors.

Actual assignment of scores seldom depends entirely on the perceptions of everyday listeners in any testing context; however, panels of judges for ITA performance tests may include faculty and undergraduate representatives of these broader domains of use. In the case of ITA testing, dependence on a general listener – in this case, most appropriately undergraduate L1 English speakers – may introduce the biases identified in social psychology attribution and language attitude studies. For instance, influential studies conducted by Rubin and Smith (1990) and Rubin (1992) employed the matched guise technique, introduced by Lambert, Hodgson, Gardner, and Fillenbaum (1960), in ITA research on language attitudes and social attribution. Undergraduate subjects evaluated four-minute lectures while looking at a photograph of either a Caucasian or Asian woman identified as the “instructor” they were hearing. Listeners subsequently completed speech evaluation instruments to measure their recall of the subject, homophily (i.e., perceived similarity between themselves and the speaker), and their experiences with and attitudes towards ITAs. Because ethnicity and lecture topic (in the humanities or in science) tended to be stronger predictors of undergraduate comprehension and attitudes than the actual speaking proficiency of the lecturers, the evaluations of undergraduate informants and their use as the appropriate referent in ITA evaluation were called into question. Similar effects have been observed in more recent investigations (e.g., Kang & Rubin, 2009; Kang, 2012): listeners’ judgments of accented speech can be substantially influenced by listener background characteristics and construct-irrelevant characteristics of the speaker. Interestingly, findings from these studies may have dampened the desire to make comparisons between raters and undergraduate listeners. This line of inquiry remains an under-investigated area (Isaacs & Thomson, 2013).


The ITA context highlights the difficulty and complexity of representing broader categories of listeners when perceived accentedness and language proficiency are involved. If listeners’ judgments are clouded by bias, their evaluations, however reliable, may not be valid. That is, negative attitudes towards accented speech may lead an unsympathetic undergraduate to judge an accented ITA as unintelligible, despite the fact that the speaker may have good pronunciation and communication skills (Lippi-Green, 1994, 1997; Isaacs, 2008). In defense of undergraduates, it should be noted that they often encounter ITAs for courses they perceive as difficult (e.g., introductory algebra), and any added complexity (e.g., an unfamiliar accent) may increase the effort needed to process the message (and to get a good or even passing grade on exams). Therefore, concerted efforts are needed to improve undergraduate students’ abilities to understand accented speech produced by ITAs (see Kang, Rubin, & Lindemann, 2015, for an example of such efforts).

Scale specificity and the native speaker

While the ITA controversy was raging, another was brewing around L2 pronunciation research. In 1986, the American Council on the Teaching of Foreign Languages introduced proficiency guidelines (ACTFL Guidelines, 1986), which were intended as a basis for criterion-referenced testing and improved professional standards (Higgs, 1982). The argument was that professional standards would improve through the provision of a standard foundation for curriculum design, teaching methods, and testing and assessment methods. As Bachman and Savignon (1986) explained:


Guidelines that define language proficiency and its measurement can provide the basis for developing “common metric” tests for a wide range of language abilities in different languages and contexts. The obvious advantages of such tests are: (1) they would provide a standard for defining and measuring language proficiency that would be independent of specific languages, contexts, and domains of discourse; and (2) scores from these tests would be comparable across different languages and contexts.

(p. 380)

While Bachman and Savignon (1986) acknowledged the appeal of these good intentions, they challenged the associated assumptions on many grounds and, rather than endorsing the opportunity for standardization, strongly argued against this attempt. Instead, they called for the development of language proficiency scales that highlight the characteristics of language that are most important within specific domains; that is, “both the use of context-dependent scale definitions . . . and the interpretation of ratings based on them are limited to the specific situations for which they are designed” (p. 386). The extent to which performance is comparable across domains and the viability of using a scale designed for one purpose for another would remain open to question.

The ACTFL Guidelines were also thoroughly criticized for their appeal to the native speaker as the listener/rater referent. While Bachman and Savignon (1986) appreciated the abandonment of the “educated native speaker” that had served as the high-end anchor of the Interagency Language Roundtable oral interview (the Guidelines’ predecessor), they found the native speaker embedded throughout the ACTFL Guidelines in phrases such as: “can be understood by native interlocutors,” “ability to . . . support opinions . . . using native-like discourse strategies,” and “errors do not disturb the native speaker” (p. 385). In this criticism, Bachman and Savignon were joined by Lantolf and Frawley (1985), who argued that the native speaker as an idealized representation is not only invalid but should be expunged from instructional and assessment contexts. Their more serious complaint was that the Guidelines implied language teachers and learners should set this standard as the goal, although the value of the native speaker has been defended by others as an important and necessary representation in research and testing (e.g., Davies, 2002; Hulstijn, 2011). The native speaker as referent and the speaker varieties of English that should be represented on rating scales remain controversial issues (see Harding (Chapter 2) and Dimova (Chapter 3), this volume). A radical but impractical solution would be to develop as many different language tests as there are varieties.

Criticisms of the ACTFL Guidelines illustrate the difficulty of representing domains of use. Operational scales reside in tension between the specificity that allows representation of a particular domain and the generality needed to accommodate the variability of performance that is influenced by L1 background and prior learning opportunities. Furthermore, holistic scales provide general descriptors in order to avoid the tendency of some raters to focus on and rate on single issues (e.g., accentedness). An examinee’s accent may be strong, but if the performance is characterized by a wide range of appropriately used vocabulary and/or accompanied by syntactic complexity, it should be rated at higher scale levels. However, an examinee who is less accented but lacks control of vocabulary and syntax should be rated at lower levels. Again, as with accentedness, comprehensibility, and intelligibility, the components of language proficiency represented on scales are related, and yet remain partially independent. Scale descriptors must address both domain specificity and examinee variability. Rater training is one way of mediating this tension.


Rater training basics

Establishing appropriate expertise for evaluating language performance is not straightforward. In one sense, all language users are appropriately expert, but successful first language (L1) acquisition is seldom the only, or most important, standard of expertise required. Indeed, success in L1 acquisition does not necessarily guarantee expertise in successful L2 teaching and learning. Most language teachers would argue that training in L1 and L2 acquisition and learning is required to be an effective teacher. Thus, teachers may be better judges of L2 performance than L1 or even successful L2 speakers. However, success in teaching does not necessarily guarantee expertise in successful application and use of a rating scale. Most test developers would argue that training in assessment and basic assessment literacy is not only beneficial but also necessary in order to be an effective rater.

For testing purposes, demonstrating reliability (internal consistency and inter- and intra-rater agreement) comprises a necessary but insufficient first pass with respect to quality control. Even after completing rater training, individual raters may assign scores in a manner that introduces construct-irrelevant variance. This may happen when a rater rates on a single issue or makes attributions to an examinee that are clearly not included in the rating scale. For example, in our experience as rater trainers, we once encountered a rater who consistently failed L1 Hindi speaking women with British RP (Received Pronunciation or “Oxford” English) on our local ITA exam. When asked why, the rater identified the pronounced accent as an indication of arrogance and even contempt for American undergraduates and questioned whether these examinees should be given teaching assignments. Kang’s (2012) study confirms that speech evaluations are subject to substantial bias based on listeners’ own backgrounds.

Analysis of rater performance typically begins by comparing score assignment across two or more raters (scores are seldom assigned based on a single rater in operational tests); when raters disagree, a third rater is assigned or negotiation is required. Sometimes suspicious patterns, like the one mentioned earlier, are obvious. More powerful statistical examinations allow analyses of how raters use scales individually and/or as a group, thereby informing the evaluation of both scale functioning and rater performance. For example, in analyses using Rasch measurement modeling (Andrich, 1988; Bond & Fox, 2007), underlying ordinal scales can be transformed into interval-level measurement in which distributional patterns are standardized; in addition, by linking each rater to every other one in the rater pool, Rasch modeling can evaluate whether individuals operationalize a scale in a similar manner and whether an ordinal rating scale (e.g., accentedness and comprehensibility scales) functions similarly to an interval scale (see Isbell, Chapter 5, this volume). However, statistical analyses indicate only that a rater is rating differently from others, not why. To further explicate different kinds of rating behavior, one must look into the characteristics of raters, the constructs they rate on, and the interaction between the two.
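For readers unfamiliar with the approach, a common formulation of the many-facet Rasch model (our illustrative gloss, not a formula presented in the studies cited) expresses the log-odds of examinee n receiving category k rather than k − 1 from rater j as:

$$\log\left(\frac{P_{njk}}{P_{nj(k-1)}}\right) = \theta_n - \alpha_j - \tau_k$$

where $\theta_n$ is examinee ability, $\alpha_j$ is rater severity, and $\tau_k$ is the threshold of scale category k; additional facets (e.g., task difficulty) enter as further subtracted terms. Rater fit statistics derived from this model flag individuals whose ratings depart from model expectations.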


While the perceptions of everyday listeners may be examined in the validation of ITA tests (e.g., Bridgeman, Powers, Stone, & Mollaun, 2012), the alignment of trained rater judgments with those of other interested parties can be tricky. As is clearly demonstrated by attribution studies, everyday listener judgments may be affected by attitudes that are only indirectly related, or entirely unrelated, to the speaker’s language proficiency as judged by trained raters in speaking tests. Although trained rater judgments may align with those of broader pools of listeners/judges, test developers base their decisions on representations of proficiency, not on the impressions of general listeners. Furthermore, when the general impressions of everyday listeners are primarily of interest, individual differences and idiosyncratic score assignment may be ignored.

The takeaway here should not be that we must account for all intervening contextual factors in investigations of accentedness, comprehensibility, and intelligibility, but rather that we must specify our intended audience and qualify generalizations. In high-stakes contexts, we must examine and evaluate the performance of raters to ensure, to the extent possible, that they are not affected by bias or construct-irrelevant variance. Typically, rater performance is examined in terms of rater consistency and consensus (i.e., intra- and inter-rater reliability), and a catalog of rater effects (i.e., severity, central tendency, and halo effects, and rater interaction with various construct-irrelevant variables) has been shown to bear important consequences for the reliability, validity, and fairness of the judgments or scores raters assign (see Yan, 2014, for a summary of common rater effects in language testing). L2 pronunciation research on accentedness, comprehensibility, and intelligibility is helpful in identifying intervening variables that can influence rater judgments. Other studies have specifically investigated intervening variables that influence rater performance, providing a link between concerns in applied linguistics and testing.
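As a minimal sketch of the consistency and consensus checks just described (illustrative only, with invented scores; operational programs typically add indices such as weighted kappa or Rasch fit statistics), consider:

```python
import numpy as np

# Hypothetical scores from two raters on the same six speaking performances
rater_a = np.array([3, 4, 2, 5, 3, 4])
rater_b = np.array([3, 3, 2, 5, 4, 4])

# Consensus: proportion of exact score agreement
exact_agreement = np.mean(rater_a == rater_b)

# Consistency: Pearson correlation between the two raters' score profiles
consistency = np.corrcoef(rater_a, rater_b)[0, 1]

# Severity check: mean score difference (positive = rater A more lenient)
severity_gap = np.mean(rater_a - rater_b)

print(f"agreement: {exact_agreement:.2f}, r: {consistency:.2f}, "
      f"severity gap: {severity_gap:.2f}")
```

High consistency with low consensus would signal a severity difference; high values on both still cannot reveal shared bias, which is why the rater background research reviewed below matters.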

Listener background characteristics affecting L2 speech ratings

Listener perceptions of the accentedness, intelligibility, and comprehensibility of L2 speech can be influenced by a number of listener background factors. The most frequently researched listener background characteristics in L2 speech perception include familiarity with the accent (e.g., shared or similar L1/L2 backgrounds, exposure to a particular accent; Winke, Gass, & Myford, 2012), familiarity with the speech topic (e.g., shared academic interests or background knowledge; Gass & Varonis, 1984), cultural expectations (e.g., Hay & Drager, 2010), attitude and motivation (e.g., Evans & Iverson, 2007), and the proficiency and linguistic awareness of the listener (e.g., Schinke-Llano, 1983, 1986).


Familiarity with the accent

Intuitively, comprehensibility is influenced by the linguistic or formal expectations of the listener. When a listener hears something that does not meet his or her expectations (e.g., speech in an unfamiliar foreign accent), he or she may notice the phonological deviation and thus perceive the speaker as less intelligible (Nelson, 2011). However, if a listener shares the same or a similar language background with the speaker, he or she will be familiar with the speaker’s accent and thus may perceive the speaker as more intelligible than speakers of other accents. Flowerdew (1994) hypothesizes that in classrooms, ESL students tend to understand L2 lecturers who share their local accents better than those who do not. However, Tauroza and Luk (1997) argue that such a view of local accents being advantageous over model accents lacks empirical evidence. They encourage research to focus on familiarity in lieu of locality because local accents are not always the only familiar ones.

In a study of the impact of familiarity on the comprehensibility ratings of nonnative speech, Gass and Varonis (1984) found that familiarity with the accent is one of a few listener background characteristics (others include familiarity with the speech topic and familiarity with the speaker) that affect L2 speech intelligibility. In Bent and Bradlow’s (2003) study, L1 Chinese, L1 Korean, and native English speakers were rated by native and nonnative listeners of matching L1 backgrounds. Their study showed that native English listeners perceived native English speakers as the most intelligible, whereas nonnative English listeners rated highly proficient L2 English speakers of the same L1 background as equally intelligible as native English speakers. In a later study, Bradlow and Bent (2008) investigated the facilitating effect of accent exposure on native English listeners’ perceptions of the intelligibility of foreign-accented English; their results suggest that exposure to different accents helps increase listeners’ comprehension. Interestingly, the improvement of intelligibility as a function of familiarity with foreign accents was not observed in Major, Fitzmaurice, Bunta, and Balasubramanian (2002). Overall, however, the majority of research on the relationship between intelligibility and familiarity supports the conclusion that intelligibility increases as listeners become more familiar with foreign-accented speech.

Familiarity with content

In addition to familiarity with the accent, perceptions of intelligibility can be influenced by a listener’s familiarity with the content or topic of speech. In L2 content classrooms, familiarity with the subject content tends to help students comprehend otherwise unintelligible speech. In Gass and Varonis’s (1984) study, familiarity with the speech topic, based on prior interactions or world knowledge, facilitated native speakers’ comprehension of nonnative speech more than familiarity with the accent or the speaker.


In addition, clinical research in speech pathology has reached similar conclusions about the positive effect of content familiarity on the perception of intelligibility. In the assessment of the intelligibility of children with developmental phonological disorders, variance in intelligibility estimates can be explained substantially by listeners’ knowledge of the linguistic content produced in a spontaneous conversational speech sample (Kwiatkowski & Shriberg, 1992). Similarly, based on both his clinical and research experience, Shriberg (1993) observed that speech unintelligibility among children with developmental phonological disorders occurred less frequently when different contextual and content clues were available. It should be noted that in these studies, intelligibility was operationalized as the percent of correctly identified words in transcription, a tradition of measuring intelligibility in speech pathology research that differs from that in L2 pronunciation research.

Attitudes towards L2 accented speech

Accent can elicit attributions that may bias the listener negatively against the speaker. Numerous studies have investigated the relationship between the perception of accent and attitudes towards the speaker from a variety of perspectives (e.g., Cargile & Giles, 1997; Giles, 1972; Giles et al., 1995; Lambert, 1967). Perceptions of accents have also been shown to be confounded with evaluations of the speaker’s power and socioeconomic status (e.g., Davila, Bohara, & Saenz, 1993; De Klerk & Bosch, 1995; Kalin & Rayko, 1978). Moreover, the presence of a foreign accent can even lead to underestimation of the language proficiency of ESL students (e.g., Schinke-Llano, 1983, 1986). However, considering the focus of pronunciation assessment, this chapter only discusses the perception of accent in relation to the perception of intelligibility and comprehensibility.

Attitudes towards certain accents can mask or interact with rater judgments. Derwing and Munro (1997) drew a distinction between perceived and actual intelligibility, arguing that certain accents can be more intelligible than listeners perceive them to be. Although few studies have empirically confirmed or disconfirmed this hypothesis, Munro, Derwing, and Sato (2006) cautioned against evaluating the language proficiency of L2 speakers based merely on the presence of an L2 accent, or on stereotypes about certain accents, rather than on unbiased evaluation of speech intelligibility.

Although there is a close relationship between accent and attitude, it is important to note that attitude towards a certain accent may change over time. A popular hypothesis is that attitudes may change due to familiarity with the accent and intelligibility perceptions of that accent (e.g., Gass & Varonis, 1984; Sato, 1998). However, research on perceptions of L2 accented speech has not reached consensus as to whether (increased) familiarity with the accent leads to higher ratings of intelligibility and more positive attitudes towards L2 accented speech. For example, Fayer and Krasinski (1987) examined attitudes of L1 English and L1 Spanish listeners towards Puerto Rican learners of English across proficiency levels. Findings of their study showed that the L1 Spanish listeners were less tolerant than the L1 English listeners towards the speech samples produced by the Spanish learners of L2 English, despite sharing the same L1 background. Therefore, given the close relationship between attitude and accent, current research appears to suggest that the influence of attitude on intelligibility perceptions of L2 accents may change over time and across contexts.


Language proficiency and linguistic awareness

In addition to the listener background characteristics mentioned above, a frequently overlooked listener factor important to the perception of (the intelligibility of) L2 accented speech is the listener’s language proficiency and/or linguistic awareness. Research on intelligibility tends to select native speakers or highly proficient ESL professionals as the judges of L2 speech intelligibility. However, listeners’ limited language proficiency may exert a negative influence on their intelligibility judgments of L2 speech. Although few studies have investigated how high- and low-proficiency listeners react to L2 accented speech, several studies suggest that listener language proficiency affects the perception of intelligibility (see, e.g., Harding, 2012; Yan, 2014).

The impact of listeners’ proficiency and linguistic awareness should be understood in conjunction with the cognitive demands of processing the meaning of speech. Though not directly concerned with proficiency, Thompson’s (1991) study suggests that higher meta-linguistic knowledge or awareness influences the perception of L2 accent. Two groups of native speakers of English (referred to as inexperienced native speakers and language experts) were asked to rate three speech samples produced by 36 Russian speakers of English of varying degrees of accentedness. Results showed that speech that was intentionally written with more difficult sounds was perceived as more accented than was regular speech. Additionally, experienced raters were more reliable and more tolerant towards L2 accents. This makes sense in that the perception of intelligibility or accentedness co-occurs with the processing of meaning, both of which impose a certain level of cognitive processing load on the listener. Therefore, when additional effort is required to decode the phonological, syntactic, or lexical information, the adjustment to an L2 accent is likely to be labored and the tolerance of the accent lowered.

In addition to phonological complexity, the syntactic and lexical complexity of spoken language may also elicit different perceptions of intelligibility from raters of different language proficiency levels. For example, Yan (2014) examined the performance of L1 English and L1 Chinese raters and found they displayed different patterns of score assignment when Indian speakers (L1 Hindi speakers) were involved. He argues that the difference can be partially attributed to the raters’ language proficiency. Yan argues that speech, especially as test performance, produced by Indian speakers of English tends to feature relatively high speech rates, high information density, and complex syntactic structures and lexical items. The combination of these factors imposes a higher processing load, which might create difficulty in the adjustment to Indian accents even for highly proficient L2 speakers, thus leading to a lower level of tolerance towards the accent. Overall, these studies suggest that the proficiency level or linguistic awareness of the listener/rater can influence score assignment.


Rater interaction with L2 accents

Although trained raters in language tests differ from everyday listeners in their experience with rating and their training in scoring speaking performance, raters are not immune to the listener background factors discussed earlier when rating intelligibility, comprehensibility, and general speaking ability. During the rating process, if raters are influenced by factors that are not included in the rating rubric, then these factors may introduce construct-irrelevant score variance (e.g., rater bias) and thus threaten valid score inferences.

In language testing, most rater studies tend to examine the impact of rater bias on scores of general speaking ability, not directly on intelligibility or comprehensibility judgments. One of the most frequently observed sources of rater bias in speaking assessment stems from the interaction between raters’ linguistic backgrounds and those of the examinees. Many of these studies attribute differences in speaking test scores to interactions between raters’ backgrounds and their intelligibility and comprehensibility judgments.

Although it is a commonly held hypothesis that increased familiarity with a certain accent leads to higher ratings of general speaking ability, studies on rater variability and rater performance have shown mixed results. Results of several studies suggest that rater familiarity with examinee accent affects the rating of pronunciation and general speaking ability. For example, Carey, Mannell, and Dunn (2011) examined the impact of familiarity with examinee accent on rater reliability for pronunciation ratings in the speaking component of IELTS. They asked 99 trained raters of varying nationality and exposure to different English accents to rate the pronunciation of Chinese, Korean, and Indian speakers of English. Their findings suggest that both raters who share the same home country with the examinees and raters who have prolonged exposure to examinee accents tend to rate the examinees higher.

Winke, Gass, and Myford (2013) investigated the impact of raters’ L2 backgrounds on their ratings of examinees’ speaking ability on the TOEFL iBT. Specifically, 107 raters’ ratings of 432 TOEFL iBT speech samples from 72 L1 Chinese, Korean, and Spanish examinees were examined; the raters were L2 speakers of these three languages, respectively. Results of their study suggest that familiarity with an accent due to raters’ L2 backgrounds leads to higher ratings of pronunciation and speaking ability. More specifically, L2 Spanish-speaking raters were more lenient on L1 Spanish examinees, and the same trend was observed among raters who speak Chinese as an L2. The authors attributed this rater interaction to the hypothesis that accent familiarity due to L2 backgrounds leads to perceptions of higher intelligibility and comprehensibility by the raters, resulting in higher ratings of overall speaking ability.


However, a significant effect of accent familiarity on the evaluation of speaking performance was not observed in Xi and Mollaun (2009, 2011). Twenty-six bilingual speakers from India, each speaking English and one or more Indian languages, were trained to score the TOEFL iBT Speaking Test. Participants were split into two groups, one receiving regular rater training and the other receiving regular rater training plus additional specialized training on the scoring of Indian-accented English. Findings of their research showed that while there was no noticeable difference in the inter-rater reliability of the two groups of raters, specialized training on the rating of Indian-accented English made raters more internally consistent in the rating of Indian speakers of English.

Effects of bias on score assignment in operational tests

Admittedly, the majority of research on rater interaction with L2 accents points to the fact that raters as listeners are likely to exhibit differential severity towards speech of different accents. On the one hand, this seems to be a consistent issue in speaking assessment wherever human raters are employed. On the other, individual differences among raters are unlikely to be eliminated, even with rater training. However, a reasonable middle ground between rater idiosyncrasy and rater alignment is to examine the impact of rater interaction with L2 accents on actual test scores, in order to determine whether rater interaction with different L2 accents creates severe test bias.

Although few studies in language testing have focused on the impact of rater interaction with L2 accent, inferences can be drawn from the results reported in existing studies. For example, in Winke et al. (2013), while significant interactions were observed, the effect sizes for the interactions on rater severity, expressed as correlation coefficients (r), ranged from .05 to .10, indicating that these interactions are unlikely to have a large impact on rater severity, let alone on test scores. Similarly, in Yan’s (2014) study, although significant interactions were observed between rater and examinee L1 backgrounds (r mostly ranged between .04 and .17, except for the interaction between L1 Hindi examinees and L1 Chinese raters, r = .43), these interactions tended to have a small impact on examinee scores. Furthermore, he argues that the differential rater severity effect of L1 Chinese raters on L1 Hindi examinees might not stem from lack of familiarity with the accent of Hindi examinees alone; instead, perceptions of L2 accents might be masked by the ratings of other linguistic aspects of speaking performance. In the evaluation of L2 accented speech, especially with accents that are often perceived as strong or accompanied by certain attributions, it is important to ensure that raters are aligned in the operationalization of comprehensibility and the evaluation of other components of speaking performance.
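To put these coefficients in perspective, squaring r gives the proportion of score variance attributable to an interaction (our gloss; the original studies report r only):

$$r = .05 \Rightarrow r^2 = .0025, \qquad r = .17 \Rightarrow r^2 \approx .03, \qquad r = .43 \Rightarrow r^2 \approx .18$$

That is, most of the reported interactions account for under 3% of score variance, while the L1 Chinese rater by L1 Hindi examinee interaction accounts for roughly 18%.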


That said, it is important to note that the argument here is not that rater interaction with L2 accents is inconsequential in L2 speaking assessment. Rather, we argue that it is important to maintain a critical view towards rater interaction with L2 accents and focus on the practical impact such interactions may have on the evaluation of speaking performance. In any testing context, be it local or large-scale, if rater interaction with L2 accent creates a large proportion of unwanted score variance, the effectiveness of associated rater training must be evaluated.

Future directions and recommendations

Although rater training has been argued to be ineffective for improving rater performance and alignment (e.g., Lumley & McNamara, 1995; Knoch, 2010), the majority of studies on the efficacy of rater training have shown supportive evidence that rater training helps mitigate rater unreliability and rater bias to varying degrees (Cumming, 1990; Wigglesworth, 1993; Weigle, 1998; Elder, Knoch, Barkhuizen, & Von Randow, 2005; Lim, 2011; Kauper, 2012; Davis, 2016). Of these studies, although only Wigglesworth, Kauper, and Davis dealt with speaking assessment, we should expect the effectiveness of rater training to be similar for both writing and speaking assessment. Similarly, studies have shown that trained and untrained raters perform differently in their evaluation of L2 accents. For example, Thompson (1991) observed that experienced raters were more reliable and tolerant in their evaluation of accent than were inexperienced raters. In other studies where trained raters were used (e.g., Xi & Mollaun, 2009; Yan, 2014), the fact that rater interactions with L2 accents, though significant, did not have a large impact on examinee scores suggests that rater training might have helped control the introduction of rater bias.

Rater training focusing on the mitigation of rater bias with respect to L2 accents should consider several points based on the literature reviewed earlier. First, clear definitions and descriptors of intelligibility and comprehensibility should be provided in the rating rubric, in training materials, and during rater training sessions. Intelligibility and comprehensibility scales with minimal specifications (anchored only at the endpoints), although useful for decomposing general listeners’ perceptions, are not desirable for operational use in language tests, where minimizing individual differences and aligning the conceptualization and operationalization of those constructs are priorities. In addition to ensuring the specificity of the rating scale, rater trainers should provide benchmark performances to exemplify descriptors of intelligibility and comprehensibility. To address unwanted attitudes and stereotypes against certain L2 accents, rater trainers can specify what raters should (and should not) focus on and provide ample examples during rater training sessions to illustrate desirable and unwanted ways of evaluating accents and pronunciation for their tests.


Second, given the subjective nature of comprehensibility judgments and the argument that rater training cannot eliminate rater idiosyncrasy, it is important for rater trainers to develop reasonable expectations for the rating of comprehensibility. Previous research on comprehensibility judgments of L2 accented speech has yielded two important findings. First, perceptions of comprehensibility vary across individual listeners, depending on their familiarity with the accent, familiarity with the speaker, attitudes towards the accent, and perhaps their own language proficiency levels. Second, perceptions of comprehensibility change over time as listeners’ experiences change. Therefore, it is highly unlikely that individual raters, regardless of prior rating experience, will have the same comprehensibility judgments of a particular speaker or accent. Rater trainers should expect a certain degree of variation among raters’ comprehensibility judgments and consider these ratings in conjunction with ratings of other analytic components in the rating rubric (e.g., fluency, lexical sophistication, grammatical accuracy) to arrive at a comprehensive evaluation of oral proficiency.

Because attitudes towards accents and perceptions of intelligibility may change over time, rater trainers need to develop systematic and regular quality control procedures to monitor and evaluate the impact of rater interaction with L2 accents on actual test scores in order to detect potential test bias. When pronounced differential severity effects appear due to rater interaction with L2 accents, rater trainers can organize rater discussions in order to create a community (see Kauper, 2012) in which raters share their experiences with and attitudes towards certain accents in relation to appropriate ways of evaluating pronunciation, with the goal of bringing them into alignment in terms of the effects of accentedness on score assignment. Finally, when it comes to assessment and rater interaction with L2 accented speech, it is important to maintain a balanced view of how much pronunciation should weigh in the evaluation of speaking performance.

In summary, this chapter explored a list of listener background characteristics that have been commonly observed to have an impact on perceptions of L2 accents and evaluations of L2 accented speech. These factors include familiarity with the accent, familiarity with the speech topic, attitude towards the accent, and language proficiency of the listener. In addition, perceptions of intelligibility and L2 accents are not static or inherent features of a speaker or listener. Rather, they may change over time as a result of changes in listener background characteristics.

In the context of pronunciation or speaking assessment, listener background characteristics that influence perceptions and evaluations of L2 accents may be potential sources of rater bias and thus deserve attention from language testers and rater trainers in particular. While unwanted interactions between listener background characteristics and different L2 accents should be important topics for rater discussion, monitoring and examination of the impact of these interactions is a necessary part of the quality control procedures for language tests. However, few studies have systematically investigated the impact of rater interactions with L2 accents. Future research can examine the impact of listener factors in comparison with other sources of test bias on rater performance and variance of examinee scores.


Regarding judgments of accentedness and comprehensibility in the context of rater training, reasonable expectations for rater agreement are recommended. The subjective nature of speech perception suggests that it is natural to expect different perceptions of comprehensibility across individuals. Therefore, rater training should focus on bringing raters into alignment on the conceptualization and operationalization of comprehensibility as represented in scale descriptors. In the context of oral proficiency testing, rater trainers should be mindful of the relationship between accentedness judgments and ratings of other analytic components of oral proficiency, and of the possibility that perceptions of L2 accents or comprehensibility may mask ratings on other components. On the one hand, from a measurement perspective, it is important for language testers to be aware of potential sources of rater unreliability due to raters’ different perceptions of L2 accented speech. On the other, examination of the impact of rater interactions gathered from the language testing literature can inform researchers in other fields about the practical impact of different rater effects. As social interaction is moderated by an array of factors, examinations of the impact of different factors on speech production and perception can help rank these factors, especially in high-stakes contexts where decisions about placement are involved.

References

American Council on the Teaching of Foreign Languages (1986). ACTFL Proficiency Guidelines. Hastings-on-Hudson, NY: ACTFL.

Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage.

Bachman, L., & Savignon, S. (1986). The evaluation of communicative language proficiency: A critique of the ACTFL oral interview. The Modern Language Journal, 70(4), 380–390.

Bailey, K. M. (1984). The “foreign TA problem.” In K. M. Bailey, F. Pialorski, & J. Zukowski Faust (Eds.), Foreign teaching assistants in U.S. universities (pp. 3–15). Washington, DC: National Association for Foreign Student Affairs.

Bent, T., & Bradlow, A. R. (2003). The interlanguage speech intelligibility benefit. Journal of the Acoustical Society of America, 114(3), 1600–1610.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Bradlow, A. R., & Bent, T. (2008). Perceptual adaptation to non-native speech. Cognition, 106, 707–729.

Bridgeman, B., Powers, D., Stone, E., & Mollaun, P. (2012). TOEFL iBT speaking test scores as indicators of oral communicative language proficiency. Language Testing, 29(1), 91–108.

Briggs, S., Hyon, S., Aldridge, P., & Swales, J. (1990). The international teaching assistant: An annotated critical bibliography. Ann Arbor, MI: The English Language Institute, University of Michigan.

Carey, M. D., Mannell, R. H., & Dunn, P. K. (2011). Does a rater’s familiarity with a candidate’s pronunciation affect the rating in oral proficiency interviews? Language Testing, 28(2), 201–219.

Cargile, A. C., & Giles, H. (1997). Understanding language attitudes: Exploring listener affect and identity. Language and Communication, 17(3), 195–217.

Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7(1), 31–51.

Davies, A. (2002). The native speaker: Myth and reality. Clevedon, UK: Multilingual Matters.

Davila, A., Bohara, A. K., & Saenz, R. (1993). Accent penalties and the earnings of Mexican Americans. Social Science Quarterly, 74(4), 902–916.

Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(2), 117–135.

De Klerk, V., & Bosch, B. (1995). Linguistic stereotypes: Nice accent – nice person? International Journal of the Sociology of Language, 116, 17–37.

Derwing, T. M., & Munro, M. J. (1997). Accent, intelligibility, and comprehensibility: Evidence from four L1s. Studies in Second Language Acquisition, 19, 1–16.

Derwing, T. M., & Munro, M. J. (2009). Putting accent in its place: Rethinking obstacles to communication. Language Teaching, 42(4), 476–490.

Elder, C., Knoch, U., Barkhuizen, G., & Von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2(3), 175–196.

Evans, B. G., & Iverson, P. (2007). Plasticity in vowel perception and production: A study of accent change in young adults. Journal of the Acoustical Society of America, 121(6), 3814–3826.

Flege, J. E. (1984). The detection of French accent by American listeners. Journal of the Acoustical Society of America, 76, 692–707.

Flowerdew, J. (1994). Research of relevance to second language lecture comprehension: An overview. In J. Flowerdew (Ed.), Academic listening (pp. 7–29). New York: Cambridge University Press.

Fox, W. S. (1992). Functions and effects of international teaching assistants at a major research institution. Dissertation Abstracts International, 52, 3193–A.

Gass, S., & Varonis, E. M. (1984). The effect of familiarity on the comprehensibility of nonnative speech. Language Learning, 34(1), 65–89.

Giles, H. (1972). The effect of stimulus mildness-broadness in the evaluation of accents. Language and Speech, 15, 65–87.

Giles, H., & Billings, A. (2004). Language attitudes. In A. Davies & C. Elder (Eds.), The handbook of applied linguistics (pp. 187–209). Oxford: Blackwell.

Giles, H., Williams, A., Mackie, D. M., & Rosselli, F. (1995). Reactions to Anglo- and Hispanic-American-accented speakers: Affect, identity, persuasion, and the English-only controversy. Language and Communication, 15(2), 107–120.

Ginther, A. (2003). International teaching assistant testing: Policies and methods. In D. Douglas (Ed.), English language testing in U.S. colleges and universities. Washington, DC: NAFSA.

Ginther, A. (2013). Assessment of speaking. In C. Chapelle (Ed.), The encyclopedia of applied linguistics. London: Blackwell Publishing Ltd.

Harding, L. (2012). Accent, listening assessment and the potential for a shared-L1 advantage: A DIF perspective. Language Testing, 29(2), 163–180.

Hay, J., & Drager, K. (2010). Stuffed toys and speech perception. Linguistics, 48(4), 865–892.

Higgs, T. (1982). What can I do to help? In T. Higgs (Ed.), Curriculum, competence and the foreign language teacher. Skokie, IL: National Textbook.

Hosoda, M., Stone-Romero, E. F., & Walter, J. N. (2007). Listeners’ cognitive and affective reactions to English speakers with standard American English and Asian accents. Perceptual and Motor Skills, 104(1), 307–326.

Hulstijn, J. H. (2011). Language proficiency in native and nonnative speakers: An agenda for research and suggestions for second-language assessment. Language Assessment Quarterly, 8(3), 229–249.

Isaacs, T. (2008). Towards defining a valid assessment criterion of pronunciation proficiency in non-native English-speaking graduate students. The Canadian Modern Language Review, 64(4), 555–580.

Isaacs, T., & Thomson, R. I. (2013). Rater experience, rating scale length, and judgments of L2 pronunciation: Revisiting research conventions. Language Assessment Quarterly, 10(2), 135–159.

Kalin, R., & Rayko, K. (1978). Discrimination in evaluative judgments against foreign-accented job candidates. Psychological Reports, 43, 1203–1209.

Kang, O. (2010). Relative salience of suprasegmental features on judgments of L2 comprehensibility and accentedness. System, 38, 301–315.

Kang, O. (2012). Impact of rater characteristics on ratings of international teaching assistants’ oral performance. Language Assessment Quarterly, 9, 249–269.

Kang, O., & Rubin, D. L. (2009). Reverse linguistic stereotyping: Measuring the effect of listener expectations on speech evaluation. Journal of Language and Social Psychology, 28, 441–456.

Kang, O., Rubin, D. L., & Lindemann, S. (2015). Mitigating US undergraduates’ attitudes toward international teaching assistants. TESOL Quarterly, 49(4), 681–706.

Kauper, N. (2012). Development and implementation of an ESL classroom assessment of face-to-face conversational interaction. (Doctoral dissertation). Retrieved from ProQuest Dissertations and Theses. (Accession Order No. 3545283).

Kinzler, K. D., & DeJesus, J. M. (2013). Northern = smart and Southern = nice: The development of accent attitudes in the United States. The Quarterly Journal of Experimental Psychology, 66(6), 1146–1158.

Kinzler, K. D., Shutts, K., DeJesus, J., & Spelke, E. (2009). Accent trumps race in guiding children’s social preferences. Social Cognition, 27(4), 623–634.

Knoch, U. (2010). Investigating effectiveness of individualized feedback to rating behavior: A longitudinal study. Language Testing, 28(2), 179–200.

Kwiatkowski, J., & Shriberg, L. D. (1992). Intelligibility assessment in developmental phonological disorders: Accuracy of caregiver gloss. Journal of Speech and Hearing Research, 35(5), 1095–1104.

Labov, W. (2006). The social stratification of English in New York City, 2nd ed. New York: Cambridge University Press.

Lambert, W. E. (1967). A social psychology of bilingualism. Journal of Social Issues, 23, 91–109.

Lambert, W. E., Hodgson, R., Gardner, R., & Fillenbaum, S. (1960). Evaluational reactions to spoken languages. The Journal of Abnormal and Social Psychology, 60, 44–51.

Lantolf, J., & Frawley, W. (1985). Oral-proficiency testing: A critical analysis. The Modern Language Journal, 69(4), 338–345.

Lim, G. S. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters. Language Testing, 28(4), 543–560.

Lippi-Green, R. (1994). Accent, standard language ideology, and discriminatory pretext in the courts. Language in Society, 23, 163–198.

Lippi-Green, R. (1997). English with an accent: Language, ideology and discrimination in the United States. London: Routledge.

Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54–71.

Major, R., Fitzmaurice, S., Bunta, F., & Balasubramanian, C. (2002). The effects of nonnative accents on listening comprehension: Implications for ESL assessment. TESOL Quarterly, 36, 173–190.

Munro, M. J. (2008). Foreign accent and speech intelligibility. In J. G. Hansen Edwards & M. L. Zampini (Eds.), Phonology and second language acquisition (pp. 193–218). Amsterdam, the Netherlands: John Benjamins.

Munro, M. J., & Derwing, T. M. (1995). Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Language Learning, 45(1), 73–97.

Munro, M. J., & Derwing, T. M. (1998). The effects of speaking rate on listener evaluations of native and foreign-accented speech. Language Learning, 48(2), 159–182.

Munro, M. J., Derwing, T. M., & Sato, K. (2006). Salient accents, covert attitudes: Consciousness-raising for pre-service second language teachers. Prospect, 21(1), 67–79.

Nelson, C. L. (2011). Intelligibility in World Englishes: Theory and application. New York: Routledge.

Nelson, L. R., Signorella, M. L., & Botti, K. G. (2016). Accent, gender, and perceived competence. Hispanic Journal of Behavioral Sciences, 38(2), 166–185.

Oppenheim, N. (1997). How international teaching assistant programs can prevent lawsuits. (ERIC Document Reproduction Service No. ED408886). Retrieved from http://files.eric.ed.gov/fulltext/ED408886.pdf.

Rubin, D. L. (1992). Nonlanguage factors affecting undergraduates’ judgments of nonnative English-speaking teaching assistants. Research in Higher Education, 33, 511–531.

Rubin, D. L., & Smith, K. A. (1990). Effects of accent, ethnicity, and lecture topic on undergraduates’ perceptions of nonnative English-speaking teaching assistants. International Journal of Intercultural Relations, 14, 337–353.

Sato, K. (1998). Evaluative reactions towards “foreign accented” English speech: The effects of listeners’ experience on their judgements. Unpublished master’s thesis, University of Alberta, Edmonton, Canada.

Schinke-Llano, L. (1983). Foreigner talk in content classrooms. In H. W. Seliger & M. H. Long (Eds.), Classroom oriented research in second language acquisition (pp. 146–165). Rowley, MA: Newbury House.

Schinke-Llano, L. (1986). Foreigner talk in joint cognitive activities. In R. R. Day (Ed.), Talking to learn (pp. 99–117). Rowley, MA: Newbury House.

Shriberg, L. D. (1993). Four new speech and prosody measures for genetics research and other studies in developmental phonological disorders. Journal of Speech and Hearing Research, 36, 105–140.

Spolsky, B. (1995). Measured words: The development of objective language testing. Oxford: Oxford University Press.

Tauroza, S., & Luk, J. (1997). Accent and second language listening comprehension. RELC Journal, 28(1), 54–71.

Thompson, I. (1991). Foreign accents revisited: The English pronunciation of Russian immigrants. Language Learning, 41(2), 177–204.

Trofimovich, P., & Isaacs, T. (2012). Disentangling accent from comprehensibility. Bilingualism: Language and Cognition, 15(4), 905–916.

Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–287.

Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10(3), 305–319.

Winke, P., Gass, S., & Myford, C. (2013). Raters’ L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2), 231–252.

Xi, X. (2008). Investigating the criterion-related validity of the TOEFL speaking scores for ITA screening and setting standards for ITAs. RR-08-02, TOEFLiBT-03. Princeton, NJ: Educational Testing Service.

Xi, X., & Mollaun, P. (2009). How do raters from India perform in scoring the TOEFL iBT speaking section and what kind of training helps? TOEFL iBT Research Report No. RR-09-31. Princeton, NJ: Educational Testing Service.

Xi, X., & Mollaun, P. (2011). Using raters from India to score a large-scale speaking test. Language Learning, 61(4), 1222–1255.

Yan, X. (2014). An examination of rater performance on a local oral English proficiency test: A mixed-methods approach. Language Testing, 31(4), 501–527.