ASSESSING PRONUNCIATION FOR RESEARCH PURPOSES WITH LISTENER-BASED NUMERICAL SCALES
Daniel R. Isbell
Introduction
When pronunciation is assessed for research purposes, listener-based numerical scales are commonly used to assign scores to speaker performances. The scales feature simple descriptors at each end, and a number of score points are marked in between. The intervals between the score points are treated as equal, and scores from several raters (ranging from 3 raters in Lord, 2008, to 188 raters in Kang, Rubin, & Pickering, 2010) are typically averaged for subsequent analyses. In principle, any aspect of a speaker’s production can be assessed using such scales, but the most commonly measured attributes are comprehensibility, accentedness, and fluency (see Munro & Derwing, 2015, for detailed definitions). Intelligibility is another pronunciation attribute of considerable interest, but due to the variety and potential complexity of operationalizations, it is not commonly investigated with listener-based numerical scales.
In L2 pronunciation research, scales have typically included five (e.g., Isaacs & Thomson, 2013), seven (e.g., Kang et al., 2010; Southwood & Flege, 1999), or nine score points (e.g., Derwing, Munro, & Wiebe, 1998; Hsieh, 2011; Isaacs & Trofimovich, 2011; Lord, 2008; O’Brien, 2014, 2016); the last is perhaps the most commonly used. Sliding scales have also gained popularity in pronunciation research (e.g., Crowther, Trofimovich, Saito, & Isaacs, 2015), but they have no intermediate points marked; instead, raters make judgments with an interactive slider. Due to these differences, sliding scales are excluded from the discussion here, which focuses on numerical scales. Measurement research on numerical scales suggests that at least five points are desirable, and that somewhere between seven and ten points is optimal in terms of reliability and capacity to discriminate (Miller, 1956; Preston & Colman, 2000; Southwood & Flege, 1999). However, as Preston and Colman (2000) point out, the optimum number of scale points is likely to vary according to purpose and circumstances. Aspects of scale presentation have also been investigated. Especially relevant to computer-administered rating, Cook, Heath, Thompson, and Thompson (2001) found that radio buttons and partitioned sliding scales were equally reliable. Another finding germane to L2 pronunciation research is that scales with left-side positive anchors lead to comparatively higher ratings than scales that are right-side positive (Hartley & Betts, 2010). Pronunciation researchers have used both formats (e.g., left-side positive, Derwing et al., 1998; right-side positive, Pinget, Bosker, Quené, & De Jong, 2014).
Numerical scales have been used in many studies that have made key contributions to the field of L2 pronunciation. The simplicity and apparent transparency of the scales are thought to make them easy for listeners without linguistic expertise to use, and thus they are often used with only minimal training. Furthermore, when researchers are interested in representing “everyday” notions of accentedness, comprehensibility, and fluency, they often intentionally select naïve (untrained) raters to evaluate speech. The general absence of researcher-prescribed notions of what constitutes varying levels of pronunciation quality arguably provides access to unfiltered listener impressions. However, if listeners apply a scale in different ways, with some raters more severe or lenient than others (i.e., drawing on different underlying representations of what it may mean to be accented, comprehensible, or fluent), then the inferences we draw about the underlying constructs can be confounded.
The scales often have high reliability when estimated with Cronbach’s alpha, commonly exceeding .90 (e.g., Hsieh, 2011; Isaacs & Thomson, 2013; Isaacs & Trofimovich, 2011; Kang et al., 2010). In such reliability analyses, each rater is treated as a fixed item, similar to multiple-choice test questions or Likert-scale survey items, and high reliability is interpreted as evidence that the “items” measure the same attribute. However, the strength of a reliability coefficient does not indicate that listeners are applying the scale in a comparable manner, nor does it guarantee a normal distribution of scores. Because such reliability estimates are chiefly sensitive to rankings, listeners who rank speakers in a similar manner can use quite different ranges of the scale and still produce high reliability. Indeed, in the absence of rigorous training, the scales have been reported as difficult for listeners to use (Isaacs & Thomson, 2013).
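To make the ranking-versus-range point concrete, the brief sketch below uses invented scores (not data from any study cited here): two raters rank five speakers identically but occupy non-overlapping halves of a 9-point scale. Cronbach’s alpha, computed by treating each rater as an “item,” is nonetheless perfect.

```python
import numpy as np

# Hypothetical scores: both raters rank the five speakers identically,
# but Rater A uses only the bottom of the 9-point scale and Rater B only the top.
scores = np.array([
    [1, 2, 3, 4, 5],   # Rater A (severe)
    [5, 6, 7, 8, 9],   # Rater B (lenient)
], dtype=float)

k = scores.shape[0]                          # raters treated as fixed "items"
item_vars = scores.var(axis=1, ddof=1)       # variance of each rater's scores
total_var = scores.sum(axis=0).var(ddof=1)   # variance of summed scores across raters
alpha = k / (k - 1) * (1 - item_vars.sum() / total_var)
print(alpha)  # 1.0: "perfect" reliability despite a 4-point gap in rater severity
```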
These problems may be exacerbated when researchers average scores across untrained raters, as averaging may smooth over potentially interesting, even critical, sources of rater variation. Consider a comparable situation in which a pollster asks respondents to indicate the extent to which they agree with the following statement: President X is doing a good job. When respondents are strongly partisan, the responses tend to form a bimodal distribution; that is, respondents (like judges or raters) tend to either strongly agree or strongly disagree. If an average is used to represent an underlying bimodal distribution, it may then appear that respondents are neutral; the characteristics of the actual distributions must therefore be examined carefully. If, as Munro and Derwing (2015) pointed out, listener-based judgments are crucial in evaluating pronunciation, then listener variation deserves to be considered and accounted for.
Finally, a key assumption of interval measurement is that equal intervals in the scale represent comparable differences in the attribute of interest (Stevens, 1946). Interval data can be normally distributed, which is important for many of the inferential statistics used to test hypotheses in research. Strictly speaking, the numerical scales used in pronunciation research yield ordinal data that are bounded by the beginning and end of the scale, and thus cannot form a genuine normal distribution (a normal distribution is asymptotic, extending to infinity in both directions). However, no real-world dataset is truly normally distributed, and it is up to the researcher to evaluate whether the obtained data approximate an underlying normal distribution of the attribute of interest. The degree to which numerical scales approximate interval measurement qualities must also be evaluated.
Evidence of the appropriateness of particular measurement techniques is crucial for the valid interpretation of L2 pronunciation scores. Thus this chapter highlights important considerations in the assessment of L2 pronunciation with numerical scales, including scale function, rater variation, and difficulties associated with the rating task. These considerations are then illustrated with data from a study on L2 Korean pronunciation instruction.
Current contributions and research
Several studies have been conducted to investigate the validity or measurement qualities of numerical scales in L2 pronunciation research. O’Brien (2016) focused on two important issues in rating comprehensibility, accentedness, and fluency: differences in scoring procedures, and differences between native and non-native listeners. For scoring procedures, O’Brien investigated whether there is a difference between scoring attributes separately (i.e., listening once and scoring comprehensibility, and then listening again and scoring accentedness) and simultaneously (i.e., listening once and scoring all attributes at the same time). Overall, she found no consistent differences across the rating conditions, and recommended that simultaneous scoring can be used for the sake of efficiency. O’Brien also compared the judgments of NS listeners and NNS listeners, finding that there were substantial differences between them, especially in the judgment of non-native speech. This finding aligns with a growing body of evidence supporting differences in pronunciation judgments across linguistic backgrounds (Crowther et al., 2015; Kang, Vo, & Moran, 2016; Saito & Shintani, 2015). Some studies, however, have found little to no difference between raters of different linguistic backgrounds (Munro, Derwing, & Morton, 2006; Schmid & Hopp, 2014). Schmid and Hopp (2014), however, did find individual variation in rater severity, and more interestingly found that the range of abilities in the speech samples had an effect on how raters used the scale.
Isaacs and Trofimovich (2010, 2011) investigated several individual differences relevant to comprehensibility, accentedness, and fluency judgments: phonological working memory, attention control, and musical ability. This work is particularly interesting because it investigates variation that exists among listeners of a particular language background. With regard to phonological working memory and attention control, Isaacs and Trofimovich (2011) found no relationship between either of these cognitive variables and the L2 pronunciation attributes. However, when considering musical ability as determined by expertise (in this case, majoring in music at university), they observed a significant difference in accentedness ratings, but not in comprehensibility or fluency ratings. Extending their investigation further, Isaacs and Trofimovich (2010) found that an aspect of musical aptitude was associated with accentedness judgments for extreme raters: those who were most severe had uniformly high melodic discrimination abilities. The authors recommended thorough training procedures to standardize rating that would otherwise be subject to this sort of individual difference among raters.
Turning to scale function and rater-scale interaction, Southwood and Flege (1999) compared a 7-point numerical scale to Direct Magnitude Estimation (DME) by having 20 listeners judge short speech samples from 90 L2 English speakers and 6 NS English speakers. DME is a type of judgment where a rater is provided with a benchmark and asked to compare subsequent samples to the benchmark. Southwood and Flege concluded that DME scores and 7-point scale scores had a linear relationship, but also found some problems with both rating techniques. This discussion of their work focuses on the 7-point scale. First, clearly non-normal score distributions were found. Listeners were observed to use score points at the ends of the scale more than would be expected. Second, due to differences in scores across raters, Southwood and Flege opted not to calculate mean scores for each speech sample. Third, while one group of raters had high inter-rater reliability (.85), the other group of raters had only moderate inter-rater reliability (.58, based on intraclass correlation, discussed in detail later), and some raters had low intra-rater agreement. When presented with 24 speech samples a second time at the end of the rating session, one rater failed to award a score within ±1 point of their original score over 40% of the time. Southwood and Flege’s analyses revealed what appeared to be a ceiling effect associated with the 7-point scale, leading them to conclude that a 9- or 11-point scale would be necessary to accommodate potential ceiling effects. However, the speech samples used in this study were elicited from speakers with considerable length of residence in an English-speaking country (i.e., many speakers would be perceived as having only slight accents due to exposure and frequency of L2 use); this is not always the case in L2 pronunciation research.
Isaacs and Thomson (2013) conducted the most thorough investigation of numerical pronunciation scale function to date. Using Rasch measurement to analyze 5- and 9-point scales, Isaacs and Thomson found that raters had difficulty distinguishing middle score points, and this problem was more noticeable in the 9-point scoring condition. In contrast to Southwood and Flege (1999), Isaacs and Thomson observed that raters were hesitant to use the endpoints of the scales. In stimulated recalls, raters reported difficulty in using the entire score range due to a gap between NS and NNS speakers, discomfort in judging someone’s accent in positive/negative terms, and inability to consistently discriminate performances and assign the most appropriate score (as one rater put it, “it’s a coin toss,” 2013, p. 148). Isaacs and Thomson concluded that raters would likely benefit from more robust description of the attributes and performances across the scale range.
That conclusion reveals a tension in L2 pronunciation research: untrained raters may provide a window into everyday listener perceptions, but such raters may struggle to assign scores uniformly with a numerical scale. Enhanced training or scale description may ameliorate uniformity problems, but doing so assumes that the attribute is robustly defined, which calls into question the need for naïve raters in the first place.
In sum, investigations into how listeners use numerical scales to judge L2 speech qualities have shown that (a) the scales can be difficult to use, (b) the distances between score points may not represent consistent differences in the attributes of interest (e.g., ceiling effects), and (c) individual raters may differ substantially in their judgments and in how they use rating scales.
Illustrations and examples
To investigate the issues of scale function, interval measurement, and rater differences in scale use, many-facet Rasch measurement (MFRM; Linacre, 1989) is presented as an analytical tool for L2 pronunciation researchers. Widely used in language assessment (McNamara, 1996; McNamara & Knoch, 2012) and in research on L2 writing (Knoch, 2017), MFRM provides a measurement model that allows researchers to account for rater variation, rater characteristics, and task difficulty. At the same time, it provides an alternative way of examining scores from what are assumed to be genuinely interval scales. An in-depth description of MFRM is beyond the scope of this chapter, but in simple terms, Rasch measurement assumes that each subject has an underlying ability level that determines their likelihood of achieving a particular score on a particular item or task. These underlying subject abilities exist on a continuum divisible into equal-interval units called logits. These assumptions comprise the core of the Rasch model, which represents a more sophisticated approach than conventional observed-score measurement. Observed-score measurement assumes more simply that someone’s ability is directly represented by the score received on the item (plus or minus error). When scores for an item can have several ordered points (e.g., a 9-point numerical scale), Rasch measurement can estimate the underlying ability level needed to achieve each successive score point. MFRM is able to account for more than subjects and items: other facets, such as task, rater, or time, can be incorporated into the analysis, and the measures for all facets are estimated and expressed uniformly in terms of logits, which in effect converts dichotomous or ordinal responses into interval-scaled measures. For more details on the inner workings of Rasch measurement, readers may consult Bond and Fox (2007).
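For readers who prefer a formula, one standard way to write the rating-scale formulation of the many-facet Rasch model (following Linacre, 1989), using the facets employed later in this chapter (speakers, tasks, and raters), is the following; the symbols are notational choices made here, not reproduced from any source cited above.

\[
\log\!\left(\frac{P_{ntrk}}{P_{ntr(k-1)}}\right) = B_n - D_t - C_r - F_k
\]

Here \(P_{ntrk}\) is the probability that speaker \(n\), performing task \(t\), receives score \(k\) from rater \(r\); \(B_n\) is the speaker’s ability, \(D_t\) the task’s difficulty, \(C_r\) the rater’s severity, and \(F_k\) the threshold at which a score of \(k\) becomes more probable than a score of \(k-1\), all expressed in logits.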
This method is illustrated with rating data from an in-progress pronunciation instruction study. Additionally, reliability analyses and a qualitative analysis of post-rating debriefing questionnaires are presented to investigate rating difficulties and triangulate findings.
Speakers and tasks
To provide context for the comprehensibility and accentedness ratings that were analyzed for this example, the speakers will be briefly described. The 36 speakers were Korean learners in their first (24) or second (12) year of language study at a US university. At the time of the pretest, the first-year students had received about 60–70 hours of instruction, and would generally be considered beginners. The second-year students had received approximately 180 hours of instruction, and perhaps could be considered to have low-intermediate proficiency. The speakers came from two L1 backgrounds: English (24) and Chinese (12). One would expect noticeable variation in speaker pronunciation, and at the same time might expect that few speakers would be highly comprehensible, and fewer still would be considered to have unnoticeable accents.
The speakers completed two tasks in a controlled, group session. A picture description elicited spontaneous speech, though speakers did incorporate some linguistic input included in the prompt. A read-aloud consisting of a 158-syllable paragraph elicited more controlled speech. It was expected that the picture description would yield lower comprehensibility scores and somewhat lower accentedness scores than the read-aloud. For both tasks, the speakers had two minutes to prepare (with no writing) and one minute to speak. Including the pretest and posttest tasks, a total of 142 speech samples were collected (pretest and posttest read-aloud recordings were missing from one participant).
Listeners and rating procedures
Ten Korean NSs were recruited from the same university. Their average age was 24.7 years, and the group included six females and four males. No listeners were linguistics or education majors, and none had formal Korean teaching experience.
To the extent possible, speech rating procedures similar to those of other studies were adopted (e.g., Derwing, Munro, & Wiebe, 1998; Isaacs & Thomson, 2013; O’Brien, 2014). Pre-rating training was thus brief, including a short introduction to both speaker tasks (to avoid issues with gradual learning/familiarization during live rating), a brief explanation of comprehensibility and accentedness, and instructions for using the 9-point scales (i.e., how to mark scores, using the whole range). Raters were presented with four samples (two NS and two NNS) to rate as practice; the scores given by raters were briefly discussed as a group, but reaching exact consensus was not required. Raters agreed on maximum scores on both attributes for the NS samples; the NNS samples were scored lower on each attribute. Live rating commenced immediately afterward, with all 142 learner speech samples, including the picture descriptions and read-alouds elicited at pretest and posttest, presented in random order. Six NS speech samples (three speakers, each recording both tasks) were seeded into the order. Mandatory breaks were taken after every 37 samples to mitigate rater fatigue.
The scales and a set of debriefing questions were presented in Korean in a paper booklet. The debriefing questions elicited foci and influential factors in raters’ comprehensibility and accentedness judgments, and also asked for comments on difficulties in the rating task.
Analyses
For the purpose of illustration, only the pretest ratings for learners (71 speech samples scored by 10 raters) were included to avoid the analytical and conceptual complexities of instructional and maturational effects, allowing for a more straightforward focus on raters and scales. Scores for the NS speech samples were also removed. Reliability analyses were conducted in R (v3.3.1; R Core Team, 2016). MFRM analyses were carried out using the FACETS software (v3.71.4; Linacre, 2014), including facets for speakers, tasks, and raters. Raters were uncentered in FACETS, as is common practice in research on raters (McNamara, 1996); this means that estimates of rater severity are relative to speakers and tasks having an average measure of 0.00 logits. The debriefing question responses were analyzed in a qualitative content analysis; all responses were included.
Comprehensibility
Visual inspection is the first step to take in analyzing the comprehensibility scores. The histogram on the left in Figure 5.1 shows the distribution of comprehensibility scores averaged across raters. The shape of this distribution appears roughly normal, but notably lacks any scores in the 1–1.99 range or scores that were exactly 9.0. On the right side of Figure 5.1 is a histogram comprising all individual scores awarded by raters. The shape of this distribution is roughly normal, and certainly appears different from that of the averaged scores. Importantly, one can see that a substantial number of 1s were awarded, as well as a small handful of 9s.
Descriptive statistics also show differences between the averaged and individual scores (Table 5.1). While the averaged and individual scores naturally have the same means (4.49), the individual scores show greater variation and span the complete range of the 9-point scale.
Turning to reliability (Table 5.2), a high Cronbach’s alpha (.96) was obtained for comprehensibility scores by treating each rater as a fixed “item,” and this high degree of internal consistency held across both speaking tasks. The intraclass correlation (ICC) is another index of reliability commonly used for scores awarded by multiple raters, and it has several variants for use in different situations (Shrout & Fleiss, 1979). ICC coefficients are commonly interpreted as poor below .40, fair from .40 to .59, good from .60 to .74, and excellent from .75 to 1.00 (Cicchetti, 1994). Importantly, compared to Cronbach’s alpha, the ICC models rater variation in addition to score variation, and thus high reliability depends on more than scores ranking subjects in a similar order: rater consistency is also necessary. The ICC(2,k) is used if averaged scores from a random group of raters are to be used for interpretation, and here these values are excellent. The ICC(2,1) is appropriate for considering the consistency of individual scores from random raters, and these values indicate only fair to good consistency.
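To make the distinction between these indices concrete, the sketch below implements Cronbach’s alpha, ICC(2,1), and ICC(2,k) directly from the two-way ANOVA mean squares in Shrout and Fleiss (1979). It is a minimal Python illustration using simulated scores, not the R code or data from the study reported here.

```python
import numpy as np

def reliability_indices(x):
    """x: speakers-by-raters score matrix with no missing values."""
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()    # between speakers
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()    # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    alpha = (ms_r - ms_e) / ms_r                           # equals ICC(3,k), raters fixed
    icc_2_1 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
    icc_2_k = (ms_r - ms_e) / (ms_r + (ms_c - ms_e) / n)
    return alpha, icc_2_1, icc_2_k

# Simulated example: 30 speakers, 10 raters who differ in severity.
rng = np.random.default_rng(1)
ability = rng.normal(5, 1.5, size=(30, 1))
severity = rng.normal(0, 0.8, size=(1, 10))
scores = np.clip(np.rint(ability - severity + rng.normal(0, 1, size=(30, 10))), 1, 9)
alpha, icc21, icc2k = reliability_indices(scores)
print(f"alpha = {alpha:.2f}, ICC(2,1) = {icc21:.2f}, ICC(2,k) = {icc2k:.2f}")
```

Because alpha ignores the rater (column) mean square, severity differences among raters lower the ICC(2,1) and ICC(2,k) values but leave alpha untouched.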
FIGURE 5.1 Histograms of averaged and individual comprehensibility scores
Cronbach’s alpha is widely used in assessment, generally for the purpose of examining the internal consistency of a particular set of test items with a particular set of test-takers. In other words, alpha reflects the average intercorrelation of items and indicates the extent to which the set of items measures the same attribute (Crocker & Algina, 1986). In L2 pronunciation research with several raters, raters are essentially treated as fixed “items” (which may be easier or harder) and alpha tells us how well these “items” work together (in fact, alpha in this case is equivalent to another variant of the ICC, the ICC(3,k); Shrout & Fleiss, 1979). But raters in pronunciation studies are not always fixed; often they are (pseudo-) randomly sampled from native speaker undergraduate populations, for example. Accounting for random variation in raters is important, especially when generalizing findings to a larger population of listeners, making ICC variants more appropriate. As Shrout and Fleiss (1979) put it, “Sometimes the choice of a unit of analysis causes a conflict between reliability considerations and substantive interpretations. A mean of k ratings might be needed for reliability, but the generalization of interest might be individuals” (p. 427). These conceptual and technical considerations aside, reliability generally looks good, although clearly individual rater scores show less consistency than the mean scores.
TABLE 5.1 Descriptive statistics for comprehensibility scores
TABLE 5.2 Reliability indices for comprehensibility ratings
MFRM also provides a means of examining rater consistency. In the Rasch model, a certain amount of variation in scores is expected. A statistic called infit is computed to assess the degree to which a rater’s scores vary as the model predicts. An ideal rater would have a value of 1.0, but in practice values between 0.6 and 1.4 are considered acceptable for rating scale judgments (Bond & Fox, 2007). Larger values mean that a rater’s scores were inconsistent, varying erratically, while smaller values indicate that a rater was too consistent, likely under-utilizing parts of the scale. In this analysis, two raters were identified as misfitting, with infit values of 1.55 (Rater 7) and 1.65 (Rater 6). Although the raters as a group were generally consistent in the way they assigned scores (i.e., they tended to rank the speaker performances similarly), they were not homogeneous, and two raters in particular were less consistent than desired.
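In its standard information-weighted form (see Bond & Fox, 2007), and using the notation introduced earlier in this chapter rather than the FACETS output itself, the infit mean square for rater \(r\) can be written as:

\[
\text{Infit}_r = \frac{\sum_{n,t} \left(x_{ntr} - E_{ntr}\right)^2}{\sum_{n,t} W_{ntr}}
\]

where \(x_{ntr}\) is the observed score given by rater \(r\) to speaker \(n\) on task \(t\), \(E_{ntr}\) is the score expected under the Rasch model, and \(W_{ntr}\) is the model variance of that observation; values near 1.0 indicate that a rater’s scores vary about as much as the model expects.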
One of the main advantages of MFRM when used with rater data is that raters’ underlying severity can be estimated in relation to person abilities. Table 5.3 presents summary statistics from the MFRM analysis for speakers and raters. The mean measures in Table 5.3 show that, relative to speaker ability, raters were somewhat severe. More interesting are the ranges of the measures (minimum and maximum for speakers and raters). Based on the scores given by raters using the 9-point scales, MFRM reveals that speakers’ underlying, interval-scaled comprehensibility in the sample spanned a range of 4.77 logits, while raters’ severity in judging comprehensibility spanned 2.05 logits. In other words, rater variability amounted to roughly 43% of speaker variability, highlighting how considerably NS listener perceptions and/or scale use can vary.
TABLE 5.3 Summary statistics of speaker and listener measures for comprehensibility
FIGURE 5.2 Histograms of comprehensibility scores awarded by each rater
Rater differences in consistency and severity are evident, and become even more salient when examining each rater’s score distribution (Figure 5.2). Most obviously, scores from individual raters generally do not fall into normal distributions, though Rater 9 and Rater 10 come remarkably close. Rater 6 and Rater 7, who were flagged as inconsistent due to their large infit values, have rather flat score distributions. Other raters appeared to have a few favorite score points, such as Rater 3, who favored 1, 3, and 7. Rater 8, who was the least severe (−0.67 logits), piled ratings toward the high end of the scale, yet did not award any 9s. This raises an important question, especially when listener judgments of L2 pronunciation are themselves the object of study: Why do some college-aged Korean NS listeners judge the same speech so differently in terms of comprehensibility? Two explanations seem likely: (1) the listeners vary in their perceptions of speech, and/or (2) the raters have different understandings of the rating scale. For the first possibility, the spread of rater severity measures provides some evidence that non-expert listener impressions vary individually. The second possibility also warrants investigation.
One piece of evidence useful for examining the second possibility is a plot of category probability curves yielded by MFRM analysis (Figure 5.3). The x-axis represents the range of speaker comprehensibility, expressed in equal-interval logits, and the y-axis indicates the probability of being awarded a particular score. The curves, then, are interpreted as the conditional probability of receiving a particular comprehensibility score. For example, someone near the low end of the present sample (i.e., the minimum comprehensibility measure of −2.17 logits) would only have about a 5% chance of being awarded a 5 by the raters. Someone with comprehensibility near the middle of the sample (0.00 logits) would be more likely to receive a score of 5. It is worth mentioning that the curves for 1 and 9 are extrapolated; in the present sample almost no one had an underlying comprehensibility measure that would make either score most likely. Generally, the picture here for the comprehensibility scale does not look bad: score points peak successively, and each peak is distinct. However, there is some indication that assigning scores was difficult for raters. For example, the comprehensibility measure most likely to receive a score of 4 was around .90 logits. At that ability level, a speaker had about a 30% chance of receiving a 4, but also had around a 25% chance of receiving a 3 and roughly a 25% chance of receiving a 5. Deciding between adjacent score points was not a clear-cut process for the raters.
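The curves in Figure 5.3 follow directly from the model sketched earlier. In closed form, and again using this chapter’s notational choices rather than FACETS’ own symbols, the probability that rater \(r\) assigns score \(k\) on the 1–9 scale to speaker \(n\) on task \(t\) is:

\[
P_{ntrk} = \frac{\exp\!\left(\sum_{h=2}^{k}\left(B_n - D_t - C_r - F_h\right)\right)}{\sum_{m=1}^{9} \exp\!\left(\sum_{h=2}^{m}\left(B_n - D_t - C_r - F_h\right)\right)},
\]

with the empty sum for \(k = 1\) defined as zero. Evaluating this expression across the logit range of the x-axis produces one probability curve per score point.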
FIGURE 5.3 Category probability curves for comprehensibility scores
Given the individually varying scores from raters and the contentious delineations between some score points, a question arises: Are the differences among score points representative of equal intervals? If the difference between a 4 and a 5 for some reason feels narrower to a rater than the difference between an 8 and a 9, there may be a misalignment somewhere in the process of a speech sample being judged by a listener and then mapped to a number on the scale. MFRM allows us to examine this question by considering the underlying logit measures of comprehensibility and the thresholds between score points. An item characteristic curve (Figure 5.4) illustrates this analysis graphically. Once again, the x-axis represents speaker comprehensibility in logits, but this time the y-axis shows us the score points of the scale. It should be noted that the scores on the y-axis are equal distances apart, just as they are presented to the raters. However, following the curve from left to right, it is apparent that score points span differing ranges of underlying comprehensibility. For example, the score point of 8 covers a much wider range of ability than any other score point. However, score points 3 through 7 generally appear to represent roughly equivalent spans of comprehensibility. The wide range of 8 is probably an artifact of the sample; there were likely few speech samples bridging the gap between the high end of the Korean learners and the Korean native speakers seeded into the rating. Likewise, the noticeably wider range of 2 could be reflective of the sample as well as the tasks, which were designed to be accessible to students in early stages of Korean learning.
FIGURE 5.4 Item characteristic curve for the comprehensibility scale
Accentedness
Turning now to accentedness, Figure 5.5 shows histograms of averaged ratings (left) and individual ratings (right). The distributions for both sets of scores appear roughly normal, though no speaker had an average of 9.0 and curiously no averaged scores fell into the 7.00–7.99 range. For the individual scores, only a single score of 9 was awarded. This positively skewed distribution would be expected based on the speakers being in relatively early stages of Korean learning. Additionally, previous research has found accentedness scores to generally be lower than comprehensibility scores for most speakers (Derwing, Munro, & Wiebe, 1998).
Descriptive statistics of the two sets of scores (Table 5.4) parallel those for the comprehensibility scores in the previous section: averaging leads to less variation and fewer data points. Both distributions are positively skewed. More so than with the comprehensibility scores, the kurtosis of the two distributions differs. While both sets of scores have kurtosis values within rules of thumb for normality (i.e., within ±2), the distribution of individual scores is flatter (less peaked) than the distribution of averaged scores.
Moving on to reliability (Table 5.5), there was generally a high degree of internal consistency (alpha = .92), similar to comprehensibility. However, compared to comprehensibility, the intraclass correlations were somewhat lower, especially when attempting to generalize the consistency of single raters. Indeed, for the picture description task, the ICC(2,1) coefficient was only .32, which is interpreted as poor. Infit values from the MFRM analysis also highlight individual rater consistency issues, with five raters identified as misfitting: Rater 5 (.53), Rater 10 (.60), Rater 7 (1.57), Rater 6 (1.76), and Rater 2 (1.83). Rater 5 and Rater 10 were too predictable, tending to rely on a limited set of score points. The other misfitting raters were erratic in their ratings, and Rater 2 in particular approached a threshold (2.0) that would negatively affect the Rasch model estimations.
FIGURE 5.5 Histograms of averaged and individual accentedness scores
As with comprehensibility scores, summary Rasch measures provide a useful way to consider the range of speaker accentedness and listener severity. Relative to speaker ability, the listeners were noticeably severe. The speakers in the sample had underlying accentedness that spanned 4.14 logits, while the raters had underlying severities in accentedness judgment spanning 2.26 logits. As logit measures are directly comparable in magnitude, the variability in rater severity was over half the variability in speaker accentedness.
TABLE 5.4 Descriptive statistics for accentedness scores
TABLE 5.5 Reliability indices for accentedness ratings
As with the comprehensibility ratings, differences in rater severity and consistency are apparent. Figure 5.6 illustrates these differences with histograms of scores given by each rater. Once again, scores from each rater do not necessarily approximate a normal distribution. Raters identified as too predictable had highly peaked distributions and limited score ranges. Rater 5, for example, favored scores of 3 and 4, but awarded few 2s and not a single 1. Rater 10 clearly favored score point 3. On the other hand, raters identified as erratic have flatter distributions, most clearly exemplified by Rater 6 and Rater 7 (who also had similar patterns when scoring comprehensibility). Curiously, some raters actually had gaps in the range of accentedness score points they used: Rater 3 and Rater 4, despite awarding a small number of 8s, did not award any 7s. Differences in severity are clear as well: compare the score distributions of the most severe rater, Rater 3 (measure = 2.07 logits), with the least severe rater, Rater 8 (measure = −0.19 logits). Rater variation in how severely accent strength is judged can account for some of the differences among score distributions, but in light of recurring patterns of inconsistency and new evidence of questionable scale use (i.e., raters having gaps in the middle of their utilized score range), it is important to investigate how well the raters were able to use the scale.
Examining the category probability curves for accentedness scores (Figure 5.7), difficulties in using the scale are visible. Most noticeably, the peak of the curve for 7 is completely subsumed by the curve for 8. This means that at no point in the continuum of accentedness was a speaker most likely to be assigned a score of 7 by a rater. In other words, the score point of 7 was largely redundant in the listeners’ understanding of the scale; scores of 6 or 8 were generally how listeners made sense of speakers with relatively weaker accents. Additionally, while curves for other score points have distinct peaks, they do overlap considerably, showing that for a given range of accentedness, there were fairly good chances that raters would choose adjacent scores rather than the most appropriate score according to the Rasch model.
TABLE 5.6 Summary statistics of speaker and listener measures for accentedness
FIGURE 5.6 Histograms of accentedness scores awarded by each rater
FIGURE 5.7 Category probability curves for accentedness scores
The degree to which rater accentedness scores approximate equal intervals was also analyzed. The item characteristic curve in Figure 5.8 presents evidence that the accentedness scores were not very representative of interval measurement. Like the comprehensibility scale, score points on the higher end appear to be much wider, and a plausible explanation for this is that the accentedness of the present group of L2 Korean speakers was relatively low, especially in comparison to NSs (Schmid & Hopp, 2014). Similarly, the width of score point 2 could plausibly be due to a floor in the accentedness of speakers in the sample. Nonetheless, one can see that score points in the middle of the scale represent a narrower range of underlying accentedness than other score points. The prime example is score point 5, which covers the narrowest range of accentedness (roughly half of a logit). In contrast, score point 3, which was the most frequently awarded score point (23% of all accentedness scores) and fully within the range of speaker accentedness in this sample, represented nearly twice the accentedness range of score point 5. This highlights potential problems for interpreting accentedness scores at face value, especially if interval-level measurement is assumed.
FIGURE 5.8 Item characteristic curve for the accentedness scale
Difficulties reported by raters
Visual inspections of score distributions, reliability indices, and MFRM analyses have highlighted some challenges in the measurement of comprehensibility and accentedness with 9-point scales. To gather additional evidence related to these challenges, as well as to learn about issues that might otherwise be overlooked, debriefing questions can be useful (interviews are another, likely richer, option). Comments from raters touched on four issues in the rating task: using the scales (n = 5), understanding the constructs (n = 4), length of rating (n = 4), and insufficient training (n = 3).
Using the scales
The most common issue elicited in the debriefing questions was related to the number of score points. As Rater 4 put it, “The rating scale was too much. I couldn’t tell the difference between a 2 and a 3 on the scale.” Rater 3 thought that “there were too many points, so comparatively, I think I often gave people who had the same (similar) ability different scores.” In a comment that also relates to training, Rater 5 wanted to hear examples of a broad range of score points before rating, indicating some uncertainty regarding the differences between the score points in terms of speaker performances. All of these comments suggest that raters had difficulty due to the length of the scale, and offer some explanation for the inconsistency found in analyses of scores. These comments align with reports from raters in Isaacs and Thomson (2013), who described the 9-point scale as “difficult to manage” (p. 148).
One rater, Rater 8, expressed concern about her own subjectivity and a degree of vagueness in the scale. She also stated that despite being able to understand the entirety of what speakers said, she “deliberately set a standard to divide the comprehensibility levels.” This comment raises an important question: How do individual raters internally “divide” the range of speaker comprehensibility into neat levels with corresponding numbers? Rater 8 was the most lenient rater for comprehensibility and accentedness, and it seems likely that Rater 8 and Rater 3 (the most severe rater for both attributes) had rather different ways of using the scales.
Understanding constructs
Despite the apparent simplicity of the definitions of comprehensibility and accentedness, some raters grappled with conceptualizing these attributes in the rating task. As mentioned previously, Rater 8 noted that she could understand all that was said by speakers, which could be interpreted as giving too much weight to the basic intelligibility of the speech sample (understanding a speaker’s intended message; Munro & Derwing, 2015) rather than the ease of understanding that defines comprehensibility. Rater 3 thought that her comprehensibility judgments would have been more accurate had she not known the contents of the speech samples beforehand. Rater 4 felt that judging accentedness was difficult and that she did not “learn Korean to a degree where I can sufficiently judge accentedness,” perhaps referring to a lack of linguistic expertise that could have been useful in differentiating speakers. This sentiment is not dissimilar to that of the non-expert raters in Isaacs and Thomson (2013), who felt that they had little authority to evaluate the speech of others. Finally, the real difference between comprehensibility and accentedness was called into question by Rater 6: “I am not sure if [the] two items are really separate measures.” As correlations between the two constructs are typically high (e.g., r = .89 in Saito, Trofimovich, & Isaacs, 2016; r = .93 in the larger pronunciation instruction study from which the present data originated), this comment is not surprising.
Length of rating
Despite efforts to break up the rating into manageable chunks, two raters commented that listening to the speech samples was fatiguing. Two other raters felt that the individual audio files were too long; these raters perhaps formed their judgments more quickly than other raters.
Insufficient training
A few raters stated that different training would have led to better judgments. As previously mentioned, Rater 5 would have liked to hear samples representative of different score points before rating. Similarly, Rater 1 commented that “more practice samples at the beginning would’ve led to better judgments.” These comments suggest that the training provided to raters was insufficient.
Discussion
In an illustrative example of L2 Korean pronunciation, reliability analyses, MFRM, and debriefing questions provided evidence pertaining to the quality of measurement when listeners use 9-point scales to judge comprehensibility and accentedness. When raters were treated as fixed items, internal consistency for both comprehensibility and accentedness scores was high, but reliability estimates treating individual raters as random were somewhat lower, particularly for accentedness scores. Both comprehensibility and accentedness scores exhibited monotonic progression across the range of speaker abilities. Comparatively, the comprehensibility scale appeared to function better than the accentedness scale: all score points were distinct and, at least in the middle of the scale where most scores were awarded, the scale provided a fair approximation of interval-level measurement. The accentedness scale, on the other hand, had one score point that was largely redundant and approximated interval measurement somewhat poorly across its range. Rater comments offered confirmatory evidence of less-than-desirable scale function and additionally provided support for more elaborate training in future studies. This evidence is relevant to the valid interpretation of comprehensibility and accentedness scores, especially when the speakers are the focus of the study (e.g., in a pronunciation instruction study).
The illustrative example also highlighted ways in which raters differed substantially in their judgments and use of the scales. Multiple pieces of evidence (ICC coefficients, Rasch fit statistics, Rasch severity estimates, and individual rater score distributions) demonstrated that raters differed in their consistency, range of score points used, preferred score points, and harshness of judgments. Rater comments indicated that some of this variation may be linked to individual difficulties in using the scales, but also suggested that raters may have unique orientations to the attributes and unique methods of internally partitioning the range of comprehensibility or accentedness, which is characteristic of an underlying ordinal scale. Compared with previous studies on rater differences, which have often focused on between-group differences in linguistic background (e.g., Kang et al., 2016; O’Brien, 2016), the results of the present analysis highlight the variation that exists among listeners from a homogeneous linguistic background, though they do not offer causal interpretations. These results reflect Southwood and Flege’s (1999) observation that “listeners are uncertain how to map responses onto the stimuli . . . they may attempt to use their own units of equal discriminability” (p. 344). This variability is important to account for when interpreting speaker attributes, and is equally important to studies that investigate influences on listener perceptions of L2 pronunciation.
New directions and recommendations
After collecting judgments from raters using numerical scales, collecting additional evidence related to measurement quality is important. While most L2 pronunciation studies report an overall reliability (usually Cronbach’s alpha, and sometimes ICCs) for each attribute measured, the illustrative example here as well as previous work suggest a need for closer examinations of how scales function and the quality of measurement (Isaacs & Thomson, 2013; Southwood & Flege, 1999). MFRM, as demonstrated here, provides a useful analytical tool for investigating scores elicited with listener judgments. Unlike an overall reliability index or a single histogram, MFRM allows the behavior of each rater to be investigated and quantified. Because MFRM is based on a prescriptive measurement model with interval units, it affords researchers the ability to evaluate the degree to which scores derived from numerical scales approximate interval measurement, which is important for subsequent use of scores in inferential statistics.
Beyond its analytical capabilities, MFRM also provides researchers with options for addressing shortcomings in listener ratings with numerical scales. In this chapter’s example, the score point of 7 for accentedness was shown to be redundant. With MFRM and the program FACETS, the possibility of combining 7 with an adjacent score point could be explored and evaluated for overall improvement of measurement (Bond & Fox, 2007; Fan & Bond, 2016). Alternatively, because the Rasch model is able to take into account speaker ability alongside rater behavior, task difficulty, and any number of other factors, a researcher could use the interval-scaled Rasch speaker ability measures for subsequent statistical analyses instead of a simple average across raters. For example, a speaker in this chapter’s example with a mean accentedness score of 5.00 points would be assigned a Rasch-based accentedness score of .91 logits, while a speaker with a mean accentedness score of 6.95 would be assigned a Rasch-based score of 2.31 logits. Pinget et al. (2014) took a similar approach through the use of a mixed-effects model, conceptually similar to Rasch measurement, to account for rater variation when transforming accentedness scores for subsequent regression analyses. In fact, using Rasch measures is quite common in L2 writing and speaking assessment research, such as when investigating rater or task effects (Knoch, 2017). Last, to address the issue of rater fatigue, researchers can use a sparse rating design in conjunction with MFRM to reduce rating volume for individual raters (Myford & Wolfe, 2000). In such a design, raters are carefully overlapped with one another in a fashion that does not require every rater to judge every speech sample.
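As an illustration of the kind of sparse design Myford and Wolfe (2000) describe, the sketch below uses hypothetical parameters (not the fully crossed design used in this chapter’s example): each speech sample is rotated through a small, overlapping block of raters, so every rater shares samples with every other rater. This keeps the design connected, which MFRM requires in order to place all raters on a common logit scale, while reducing each rater’s workload.

```python
from collections import defaultdict

def sparse_rating_design(n_samples, n_raters, raters_per_sample):
    """Assign each sample to a rotating, overlapping block of raters."""
    assignments = defaultdict(list)          # rater index -> list of sample indices
    for s in range(n_samples):
        for offset in range(raters_per_sample):
            rater = (s + offset) % n_raters  # circular overlap links all raters
            assignments[rater].append(s)
    return assignments

# Hypothetical parameters: 142 samples, 10 raters, 4 ratings per sample,
# giving each rater roughly 57 samples instead of all 142.
design = sparse_rating_design(142, 10, 4)
for rater in sorted(design):
    print(f"Rater {rater + 1}: {len(design[rater])} samples")
```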
To conclude, assessing L2 pronunciation for research purposes is a more complex enterprise than might be suggested by the simplicity of numerical rating scales and brevity of common training procedures. In reality, it is no simple task for a (linguistically naïve) listener to form judgments on attributes of L2 pronunciation and arrive at a consistent method of mapping those judgments to representative numbers. Each rater is likely to approach this task differently, which can result in as many scale interpretations as there are raters in a given study. Rater training procedures and analytical techniques commonly employed in language assessment provide some practical solutions. Accounting for and further investigating the causes of these differences presents an important area for future L2 pronunciation research.
References
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum Associates.
Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284–290.
Cook, C., Heath, F., Thompson, R. L., & Thompson, B. (2001). Score reliability in web- or internet-based surveys: Unnumbered graphic rating scales versus Likert-type scales. Educational and Psychological Measurement, 61(4), 697–706.
Crocker, L. M., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart, & Winston.
Crowther, D., Trofimovich, P., Saito, K., & Isaacs, T. (2015). Second language comprehensibility revisited: Investigating the effects of learner background. TESOL Quarterly, 49, 814–837.
Derwing, T. M., Munro, M. J., & Wiebe, G. (1998). Evidence in favor of a broad framework for pronunciation instruction. Language Learning, 48(3), 393–410.
Fan, J., & Bond, T. (2016). Using MFRM and SEM in the validation of analytic rating scales of an English speaking assessment. In Q. Zhang (Ed.), Pacific Rim Objective Measurement Symposium (PROMS) 2015 conference proceedings (pp. 29–50). Singapore: Springer Science+Business Media.
Hartley, J., & Betts, L. R. (2010). Four layouts and a finding: The effects of changes in the order of the verbal labels and numerical values on Likert-type scales. International Journal of Social Research Methodology, 13(1), 17–27.
Hsieh, C.-N. (2011). Rater effects in ITA testing: ESL teachers’ versus American undergraduates’ judgments of accentedness, comprehensibility, and oral proficiency. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 9, 47–74.
Isaacs, T., & Thomson, R. I. (2013). Rater experience, rating scale length, and judgments of L2 pronunciation: Revisiting research conventions. Language Assessment Quarterly, 10(2), 135–159.
Isaacs, T., & Trofimovich, P. (2010). Falling on sensitive ears? The influence of musical ability on extreme raters’ judgments of L2 pronunciations. TESOL Quarterly, 44(2), 375–386.
Isaacs, T., & Trofimovich, P. (2011). Phonological memory, attention control, and musical ability: Effects of individual differences on rater judgments of second language speech. Applied Psycholinguistics, 32, 113–140.
Kang, O., Rubin, D. L., & Pickering, L. (2010). Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. The Modern Language Journal, 94(4), 554–566.
Kang, O., Vo, S. C. T., & Moran, M. K. (2016). Perceptual judgments of accented speech by listeners from different first language backgrounds. TESL-EJ, 20(1), 1–24.
Knoch, U. (2017). What can pronunciation researchers learn from research into second language writing? In T. Isaacs & P. Trofimovich (Eds.), Second language pronunciation assessment: Interdisciplinary perspectives (pp. 54–71). Bristol, UK: Multilingual Matters.
Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago, IL: MESA Press.
Linacre, J. M. (2014). Facets (v. 3.71.4). Chicago, IL: Winsteps.com.
Lord, G. (2008). Podcasting communities and second language pronunciation. Foreign Language Annals, 41(2), 374–389.
McNamara, T. (1996). Measuring second language proficiency. London: Longman.
McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 25(4), 495–519.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81–97.
Munro, M. J., Derwing, T. M., & Morton, S. L. (2006). The mutual intelligibility of L2 speech. Studies in Second Language Acquisition, 28(1), 111–131.
Munro, M. J., & Derwing, T. M. (2015). A prospectus for pronunciation research in the 21st century: A point of view. Journal of Second Language Pronunciation, 1(1), 11–42.
Myford, C. M., & Wolfe, E. W. (2000). Strengthening the ties that bind: Improving the linking network in sparsely connected rating designs (TOEFL Tech. Rep. No. 15). Princeton, NJ: Educational Testing Service.
O’Brien, M. G. (2014). L2 learners’ assessments of accentedness, fluency, and comprehensibility of native and nonnative German speech. Language Learning, 64(4), 715–748.
O’Brien, M. G. (2016). Methodological choices in rating speech samples. Studies in Second Language Acquisition, 38(3), 587–605.
Pinget, A.-F., Bosker, H. R., Quené, H., & De Jong, N. H. (2014). Native speakers’ perceptions of fluency and accent in L2 speech. Language Testing, 31(3), 349–365.
Preston, C. C., & Colman, A. M. (2000). Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica, 104, 1–15.
R Core Team. (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org.
Saito, K., & Shintani, N. (2015). Do native speakers of North American and Singapore English differentially perceive comprehensibility in second language speech? TESOL Quarterly, 50(2), 421–446.
Saito, K., Trofimovich, P., & Isaacs, T. (2016). Second language speech production: Investigating linguistic correlates of comprehensibility and accentedness for learners at different ability levels. Applied Psycholinguistics, 37, 217–240.
Schmid, M. S., & Hopp, H. (2014). Comparing foreign accent in L1 attrition and L2 acquisition: Range and rater effects. Language Testing, 31(3), 367–388.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
Southwood, M. H., & Flege, J. E. (1999). Scaling foreign accent: Direct magnitude estimation versus interval scaling. Clinical Linguistics & Phonetics, 13(5), 335–349.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.