p.115
PRONUNCIATION FEATURES IN RATING CRITERIA
Romy Ghanem and Okim Kang
Introduction
Various speaking features have been shown to predict second language (L2) speakers’ proficiency level and/or to cue accentedness. Earlier L2 research tended to focus on segmental features (i.e., consonant and vowel production), measuring deviation from a native speaker norm (Jakobson, 1941; Flege & Port, 1981; Macken & Ferguson, 1983). More recent studies have highlighted the importance of suprasegmental features (i.e., features that go beyond consonants and vowels, such as the prosodic features of intonation, stress, and rhythm), particularly the extent to which prosody may contribute to a listener’s perception of a speaker’s intelligibility or comprehensibility (Hahn, 2004; Kang, 2010). Still, identifying the linguistic components most conducive to nonnative speakers’ (NNSs’) production of intelligible speech remains a challenge in L2 pronunciation research.
Over the years, there have been attempts to allow L2 research to inform and help revise descriptors for standardized speaking tests (Isaacs, 2013). Nevertheless, such attempts have often run into limitations stemming from competing theoretical frameworks (“nativeness” vs. “intelligibility”) and from the cognitive load that might be imposed on a rater. Constructs in speaking descriptors are usually based on speaking goals or benchmarks that are expected at each proficiency level (Iwashita, Brown, McNamara, & O’Hagan, 2008). Some changes have been made recently to improve these scales by increasing the number of bands and/or including more pronunciation-specific descriptors (e.g., fluency, hesitation markers, intonation patterns, and segmental errors). Such efforts may have to confront potential problems such as the fuzziness of band descriptions and the absence of certain features at particular levels (Poonpon, 2011; Isaacs, Trofimovich, Yu, & Chereau, 2015). Researchers (e.g., Kang & Pickering, 2013) thus advocate an alternative approach that combines objective, systematic measurement of speaking features (using computer software) with rater evaluations for a more comprehensive description of pronunciation constructs.
p.116
Various computer programs offer tools that allow measurement of both segmentals and suprasegmentals of L2 speech. Some of these programs are freely available online and most are relatively user-friendly (e.g., PRAAT (Boersma & Weenink, 2016), Audacity (Audacity Team, 2014), Raven Pro (Charif, Waack, & Strickman, 2008), and Speech Analyzer (SIL International, 2012)); others are not open access (e.g., Kay’s Computerized Speech Lab). A somewhat more interdisciplinary class of programs is built around recognition algorithms and developed by speech scientists or computer programmers. These programs are usually referred to as Automated Speech Recognizers (ASRs) and have been used for commercial (e.g., Dragon NaturallySpeaking) and educational (e.g., Versant (http://www.versanttest.com/), SpeechRater (Zechner, Higgins, Xi, & Williamson, 2009)) purposes. The models used in ASRs normally contain complex algorithms that can handle a large number of segmental and suprasegmental features simultaneously. Automated scoring of this kind has already been proposed as a complementary rating method that some standardized speaking tests may adopt in the near future (e.g., via SpeechRater).
This chapter describes the most common pronunciation features that have been investigated in L2 research and examines their use in speaking scales. We provide suggestions that serve as a middle ground between L2 research and current practices in speaking descriptors. We first present descriptive accounts of the pronunciation features that have been shown to be important variables in ESL/EFL pronunciation and oral assessment studies. We describe each feature and its use in different fields. We then offer detailed illustrations of the extraction and measurement of a select number of those features. Next, we report on the use of those variables in standardized test criteria and the way in which test scales reflect recent developments in pronunciation and assessment. We end with recommendations for future research. The suggestions we provide are based on current trends in L2 speaking and assessment research, but they are tailored to fit the requirements of a standardized speaking scale and the capabilities of its raters.
Current conceptualizations of pronunciation features
Segmental features
Segmental features are phonetic characteristics that occur at the phone level, that is, at the level of individual consonants and vowels. In the 1950s and 1960s, segmental features were primarily the subject of phonetic studies that used Contrastive Analysis as their theoretical backdrop (Flege, McCutcheon, & Smith, 1987; Flege, 1992). Researchers identified the sounds already found in a speaker’s first language (L1) and compared them to the sounds s/he would acquire in the target language. The main premise was that if a sound was shared by or similar in both languages, then it would be relatively easy to acquire. Conversely, difficulties were expected when the L2 sounds were not found, or were produced differently, in the L1. Segmental features have been shown to partially contribute to the variance in proficiency ratings (Magen, 1998; Kang & Moran, 2014). The importance of segmentals is further reflected in research on automated speech assessment, which examines segmental features and deviations such as phone duration, vowel quality, syllable production, voice onset time (VOT), and stop closure duration (Kazemzadeh et al., 2006; Jin & Mak, 2012). In this section, we discuss some of the most common segmental features investigated in L2 (English as a second/foreign language) pronunciation and their subsequent use (or lack thereof) in assessment research.
p.117
Consonant features. While other consonant features were examined in earlier research that investigated accentedness and language differences (e.g., aspiration in Repp, 1979), voice onset time (VOT) and stop closure duration have been most frequently addressed in more recent L2 pronunciation and assessment research.
Voice Onset Time (VOT). VOT is defined as the time between the release of the stop closure and the start of vocal fold vibration. Languages differ in the timing of voicing relative to the stop release (even for what is nominally the same phone in both languages, such as /d/ in Japanese and English). There are three types of VOT values. The first, usually occurring with voiceless aspirated stops, is a positive value. The second, most often measured when voiced stops are produced, is a VOT value equal to or close to zero. The third, a negative VOT, has been reported in some L1 speakers’ production of voiced consonants, albeit on rare occasions (Lisker & Abramson, 1964). See Figures 6.1 and 6.2 in the next section for illustrations of a long and a short VOT.
Automated oral assessment research has examined the measurement of VOT to identify proficiency (Kazemzadeh et al., 2006; Henry, Sonderegger, & Keshet, 2012) and to distinguish among accents (Hansen, Gray, & Kim, 2010; Alotaibi & AlDahri, 2012). L2 research has also identified VOT as a strong predictor of accentedness, as L2 learners frequently produce it differently (Flege & Eefting, 1987; Das & Hansen, 2004). It has been demonstrated that some L2 speakers of English consistently produce negative VOT values for different types of consonants (both voiced and voiceless) (Flege, 1992; Ettien & Abat, 2014). Even though certain deviations from the norm are produced by L1 speakers, producing a voiceless stop that should be aspirated with a negative VOT would certainly render the production accented.
Stop closure duration. This feature is measured from the sudden spike in amplitude that is observed when the glottis is constricted to produce a voiced consonant. Researchers measure the interval from this spike until the release burst, which signals the end of the consonant (Hiramatsu, 1990). This duration is typically much longer when producing a voiceless consonant than when producing a voiced one.
This sub-phonemic feature is almost never included in standardized speaking descriptors, and for good reason: even if a human rater were somehow trained to detect closure duration by ear, it would be quite taxing to keep track of every duration and deviation from the norm. However, with the advent of automated scoring, stop closure duration has been investigated more frequently in automated speech assessment. One example is the speech recognizer HTK, developed by Young et al. (2000), which has recently been used in language assessment and test validation (Bernstein, Van Moere, & Cheng, 2010; Cheng, D’antilio, Chen, & Bernstein, 2014). L2 speaking research has also examined this duration. Studies that observe stop closure duration compare L1 and L2 productions in order to detect differences in pronunciation. Results reveal that L2 speakers produce longer stop closures for voiceless consonants than for voiced ones, yet the voiced/voiceless difference is not as large as that produced by L1 English speakers (Flege, Munro, & Skelton, 1992).
p.118
Vowel features. Vowels are generally described with reference to two main acoustic characteristics: frequency (identified through vowel formants and the space between them) and length. L2 pronunciation and assessment research has identified both frequency and length as predictor variables for accent detection and for variance across proficiency levels.
Vowel formants and vowel space. Unlike consonants, vowels are produced with acoustic energy concentrated at several frequencies. By measuring these frequencies, mainly the first formant (F1) and second formant (F2), one can determine where in the mouth the vowel is produced. In phonetics, a formant is a resonance of the vocal tract and is measured in hertz, the unit of frequency. Since the mid 2000s, vowel formants and the space between them have been the focus of many studies that automatically assess L2 speech and oral proficiency (Chen, Evanini, & Sun, 2010; Peabody & Seneff, 2010; Sandoval, Berisha, Utianski, Liss, & Spanias, 2013). In fact, the space defined by F1 and F2 has been shown to cue nonnativeness for certain vowels (Chen, Evanini, & Sun, 2010). Vowel space has also been at the center of phonetic research that investigates L2 vowel production. This segmental feature has been used to acoustically map differences in vowel production between English and speakers’ L1s (Zhang, Nissen, & Francis, 2008) or to compare L1 and L2 productions of English vowels (Tsukada, 2001; Bianchi, 2007). See Figure 6.3 in the next section for an illustration of vowel formant extraction.
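One widely used way to turn F1/F2 measurements into a single vowel space figure is to compute the area of the polygon formed by the corner vowels’ mean formant values. The sketch below is a minimal illustration of that arithmetic, not the procedure of any particular study cited here; all formant values are hypothetical.

```python
# Sketch: vowel space as the area of the polygon formed by the corner vowels'
# mean (F2, F1) values, computed with the shoelace formula. All values are
# hypothetical stand-ins for speaker-specific measurements.

def polygon_area(points):
    """Shoelace formula; points are (F2, F1) pairs in Hz, listed in order around the polygon."""
    area = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

# Hypothetical mean (F2, F1) values for the corner vowels /i/, /ae/, /a/, /u/
corner_vowels = [(2300, 300), (1750, 700), (1100, 750), (900, 350)]
print(f"Vowel space area: {polygon_area(corner_vowels):.0f} Hz^2")
```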
Vowel duration. Vowel duration, though not as commonly studied as vowel space and formants, is measured to identify and evaluate L2 speakers’ oral productions (Sun & Evanini, 2011; Graham, Caines, & Buttery, 2015). Researchers have measured English vowels and proposed a typical duration range for each monophthong and diphthong, which makes deviations easier to detect (Ladefoged, 2006). Moreover, vowel duration is not only a segmental feature in its own right; it also signals properties of neighboring segments. To illustrate, a vowel’s duration changes depending on the consonant that follows it (Raphael, 1972; Flege, McCutcheon, & Smith, 1987; Ladefoged, 2006; Rojczyk, 2008). When the word-final consonant is voiceless, the vowel is significantly shorter than when the consonant is voiced (e.g., heat versus heed). See Figure 6.4 in the next section for an illustration of vowel duration measurement.
p.119
Suprasegmental features
Suprasegmental features are those that operate beyond the production of the phone itself (e.g., pauses, intonation, and stress). These features have been examined in recent studies on L2 oral production and proficiency. Earlier research focused on one or two suprasegmentals, including fluency and prosodic features such as pauses (Riazantseva, 2001), speech rate (Munro & Derwing, 1998), stress (Field, 2005), prominence, and tone height (Wennerstrom, 1997; Pickering, Hu, & Baker, 2012). A larger set of suprasegmental features (both fluency and prosodic measures) has since been incorporated into oral assessment and L2 pronunciation research to gauge their relative contribution to oral proficiency, accentedness, fluency, or intelligibility. The following section provides a detailed description of the suprasegmental features most commonly used in both L2 speaking and oral assessment studies.
Fluency features. The term fluency has been used to refer to different concepts; some studies treat the notion as synonymous with proficiency, so that a speaker is evaluated on the basis of his/her fluency level (Peters & Guitar, 1991). Others treat the construct as a hypernym comprising several sub-features; taken together, the ways speakers use these sub-features have been shown to distinguish among proficiency levels (Trofimovich & Baker, 2006). Descriptors designed for standardized tests have used fluency in the more semantic sense (i.e., referring to features such as speech flow and hesitation markers) to guide raters. Recent work with ASRs has viewed fluency in a different way (consistent with L2 pronunciation research), employing sub-features within the algorithm. The following section discusses the two most commonly explored fluency sub-features: pauses and speech rate.
Pauses. A pause is usually defined as the silent or filled time between two runs (a run being a stretch of uninterrupted speech between two silent pauses). Two types of pauses are of interest to L2 pronunciation: filled and silent pauses. Filled pauses (e.g., um and uh) have been described by some as having a specific function, for example as discourse markers used to prevent lull time or to gain time for thought. Silent pauses are instances of complete silence between runs. Current research on automated evaluation most often employs pauses as features in the model. This inclusion serves several purposes: to assess human–computer coder reliability (Cucchiarini, Strik, & Boves, 2000), to test the ability of a computer model to assess L2 learners’ utterances (Hönig, Batliner, & Nöth, 2012), and to examine natural speech (Bhat, Hasegawa-Johnson, & Sproat, 2010). Similarly, L2 speaking research has focused on the correlation between the number of silent pauses and accentedness ratings (Kang, 2010). More importantly, studies have demonstrated that a pause as short as 0.1 seconds can cue deviation from the norm (Kang, 2010). Pauses are important because learners have been shown to produce longer and more frequent pauses in their L2 than in their L1s (Riggenbach, 1991; Cucchiarini, Strik, & Boves, 2000; Riazantseva, 2001). However, consensus is yet to be reached on the effect of proficiency or length of residence on pause frequency and duration, as studies show conflicting results (Trofimovich & Baker, 2006). Interest has also emerged in assessing the location of these pauses (particularly silent ones). Research has revealed that the location of a pause (at a phrasal boundary or within a phrase) and its duration (Towell et al., 1996; Freed, 2000; Kang & Wang, 2014) do discriminate among proficiency levels.
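To make the pause measures just discussed concrete, the sketch below summarizes silent pauses from a hypothetical, hand-annotated list of pause intervals and location labels. The annotation format, the example times, and the labels are illustrative assumptions rather than any cited study’s actual coding scheme.

```python
# Sketch: summarising silent pauses from a hypothetical annotation in which each
# pause has a start time, an end time (in seconds), and a location label
# ("boundary" = at a phrasal boundary, "internal" = within a phrase).

pauses = [
    {"start": 1.20, "end": 1.55, "location": "boundary"},
    {"start": 3.10, "end": 3.22, "location": "internal"},
    {"start": 5.80, "end": 6.45, "location": "boundary"},
]
MIN_PAUSE = 0.1  # pauses as short as 0.1 s can cue deviation from the norm (Kang, 2010)

durations = [p["end"] - p["start"] for p in pauses if p["end"] - p["start"] >= MIN_PAUSE]
boundary_share = sum(p["location"] == "boundary" for p in pauses) / len(pauses)

print(f"Silent pauses of at least {MIN_PAUSE} s: {len(durations)}")
print(f"Mean pause duration: {sum(durations) / len(durations):.2f} s")
print(f"Proportion at phrasal boundaries: {boundary_share:.2f}")
```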
p.120
Speech rate. Speech rate is measured by dividing the total number of syllables by the total speaking time (syllables/second), which includes all pauses between runs. Some researchers also consider hesitations and false starts when counting the number of syllables. It has been suggested that the result should be multiplied by 60 to get the number of syllables per minute (Riggenbach, 1991; Kormos & Dénes, 2004).
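The calculation just described is simple arithmetic; the following sketch spells it out with hypothetical counts, including the optional conversion to syllables per minute.

```python
# Sketch: speech rate as syllables per second, where total speaking time includes
# all pauses between runs; multiplying by 60 gives syllables per minute.
# The counts below are hypothetical.

def speech_rate(n_syllables, total_speaking_time_s):
    """Syllables per second; the denominator includes all pause time."""
    return n_syllables / total_speaking_time_s

syllables = 187        # hypothetical syllable count for a one-minute response
speaking_time = 60.0   # hypothetical total speaking time in seconds, pauses included

rate = speech_rate(syllables, speaking_time)
print(f"Speech rate: {rate:.2f} syll/s ({rate * 60:.0f} syll/min)")
```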
Automated speech assessment has employed speech rate to validate computer models and to measure the reliability of the score given by human and machine coders (Cucchiarini et al., 2000; Van der Walt, De Wet, & Niesler, 2008; De Wet, Van der Walt, & Niesler, 2009; Evanini & Wang, 2013). In fact, Cucchiarini et al. (2000) demonstrated that speech rate has one of the highest correlations (>0.9) between human and machine raters. Speech rate has been likewise employed as a variable in L2 pronunciation for two main reasons: to differentiate between L1 and L2 speech and to evaluate oral proficiency. This fluency measure, in particular, has been said to strongly correlate with accentedness (Munro & Derwing, 1998).
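Human–machine agreement of the kind reported by Cucchiarini et al. (2000) is typically summarized as a correlation between the two sets of scores. The sketch below computes a Pearson correlation with NumPy over hypothetical scores; it illustrates the agreement statistic rather than reproducing any study’s data.

```python
# Sketch: Pearson correlation between hypothetical human and machine speaking scores.
import numpy as np

human_scores = np.array([2.5, 3.0, 3.5, 4.0, 2.0, 3.5, 4.5, 3.0])
machine_scores = np.array([2.7, 2.9, 3.6, 3.8, 2.2, 3.4, 4.4, 3.2])

r = np.corrcoef(human_scores, machine_scores)[0, 1]
print(f"Human-machine Pearson r = {r:.2f}")
```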
Prosodic features. Prosodic features are suprasegmental properties that influence oral production, especially connected speech. They have only recently been added to some automated regression models in order to identify L1 speaker features and compare them to L2 productions for evaluative purposes (Hönig, Batliner, & Nöth, 2012; Evanini & Wang, 2013; Coutinho et al., 2016). The features described in this section are those that have been found to be predictive factors in one of the following areas: the success of an ASR model, the prediction of oral proficiency, or the identification of L1 versus L2 speech.
Stress. Stress is usually identified as the syllable in a given word that has the highest values for the following measures: pitch (measured in Hz), length (measured in milliseconds), and intensity (measured in dB). This phonological feature becomes especially problematic with multi-syllabic words, where a speaker has to choose one syllable to carry primary stress. English is a lexical stress language, which means that the placement of primary stress is not fully predictable and must be learned word by word. This becomes an issue if the speaker’s L1 is syllable- or mora-timed (e.g., French or Japanese) or has a fixed stress pattern (e.g., Finnish).
Research has demonstrated that L2 English learners not only place stress on the wrong syllable but also stress too many syllables in a multi-syllabic word (Field, 2005). The field has not reached consensus on the effect of misplaced stress on listener comprehension. However, most studies show that lower-proficiency speakers are more likely to misplace stress than higher-proficiency speakers (Wennerstrom, 2000; Field, 2005; Kang, 2008). It is therefore a discriminating factor among proficiency levels. See Figure 6.6 for an illustration of a stressed syllable in English.
Prominence and tone height. Prominence is an extension of primary stress: it refers to the syllable (or syllables) that carries the main stress in a tone unit. Brazil (1997) identifies a tone unit as a stretch of speech that is most often delineated by silent pauses and that carries one or two prominent syllables. He argues that a speaker selects a syllable (or more) in a given tone unit and assigns prominence to it for various communicative purposes, including emphasis, the presentation of new information, and contrast. Research that investigates prominence following Brazil’s Discourse Intonation framework generally focuses on two features: the key and termination syllables. The former is the syllable carrying prominence at the onset of the tone unit (key); the latter is the last syllable carrying prominence in that unit (termination). Studies have shown that L2 speakers include more prominent syllables than needed in a tone unit, which not only cues accentedness but may also affect comprehension (Wennerstrom, 1997; Kang, 2010).
p.121
When exploring prominence, tone height is very often an associated feature. Tone height can be defined as the pitch height of the vowel in a prominent syllable (measured in Hz). Brazil (1997) identifies three possible tone heights on any given prominent syllable: high, mid, or low. A researcher exploring key and termination syllables would therefore choose from six possible options: (1) high key (the key syllable carries a high pitch); (2) mid key (the key syllable carries an average pitch); (3) low key (the key syllable carries a low pitch); (4) high termination (a high pitch is registered on the termination syllable); (5) mid termination (an average pitch is registered on the termination syllable); and (6) low termination (a low pitch is registered on the termination syllable).
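In practice, “high”, “mid”, and “low” are judged relative to the speaker’s own pitch range. The sketch below shows one simple, illustrative way to operationalize that judgment by z-scoring a syllable’s F0 against the speaker’s prominent-syllable F0 values; the 0.5 standard deviation cut-off and the example values are assumptions, not Brazil’s (1997) own operationalization.

```python
# Sketch: a heuristic labelling of tone height (high/mid/low) relative to the
# speaker's own F0 distribution. The z-score cut-off is an illustrative assumption.
import statistics

def tone_height(f0_hz, speaker_f0, cutoff=0.5):
    mean = statistics.mean(speaker_f0)
    sd = statistics.stdev(speaker_f0)
    z = (f0_hz - mean) / sd
    if z > cutoff:
        return "high"
    if z < -cutoff:
        return "low"
    return "mid"

# Hypothetical F0 values (Hz) of a speaker's prominent syllables
speaker_f0 = [180, 195, 210, 175, 220, 205, 190, 230, 185, 200]
print(tone_height(245, speaker_f0))  # -> high (e.g., a high key or high termination)
print(tone_height(172, speaker_f0))  # -> low
```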
Studies in automated oral assessment have shown interest in developing systems that automatically identify pitch height (Rosenberg, 2010; Johnson & Kang, 2016) or recognize pitch changes in tonal languages (Levow, 2005). In a similar vein, research in L2 pronunciation has demonstrated that high key and termination syllables, for instance, are typically linked to a change in topic, disagreement, contrast, and the presentation of new information (Wennerstrom, 1997; Pickering, Hu, & Baker, 2012). Low pitch height for both key and termination syllables is typically related to given information or relatively short answers (Pickering, 1999; Cheng, Greaves, & Warren, 2008). Therefore, if a speaker chooses a low pitch for the key and termination syllables while in fact presenting new information, these unexpected patterns could result in miscommunication, as the listener may perceive a mismatch between tone and communicative function. Various studies have determined that L2 speakers often over- or underuse certain height choices, which may make it difficult for the listener to identify new or given information (Pickering, 2004; Staples, 2015).
Tone choice. Tone choice is another prosodic characteristic that researchers measure on the termination syllable in a tone unit. Brazil (1997) identifies five possible tone choices: (1) fall (p); (2) rise (r+); (3) fall-rise (r); (4) rise-fall (p+); and (5) level (o). As with tone height, research on ASR-based oral assessment has recently included tone choice as a variable, especially after it was associated with proficiency in pronunciation studies (Johnson, Kang, & Ghanem, 2016). In fact, Kang, Rubin, and Pickering (2010) found that mid-falling tones, high-rising pitch, and mid-rising pitch were predictors of oral proficiency. L1 speakers tend to manipulate such contours according to their specific purposes and their interlocutors. L2 learners, on the other hand, might misuse some of these features, which may result in miscomprehension or misrepresentation of the message. For instance, if an L2 speaker ends a question with a falling pitch, the interlocutor may mistake the utterance for an order. It has also been shown that using level tones on termination syllables is characteristic of some L2 speakers’ speech. The use of this tone can be problematic when an L1 speaker expects the idea to be complete (expressed through the use of a falling tone) or assumes it is an invitation to contribute to the conversation (Cheng, Greaves, & Warren, 2008).
p.122
Pitch range variation. Pitch range is measured by subtracting the minimum F0 from the maximum F0 of the prominent syllables. Computer software such as PRAAT allows the user to accurately measure the highest or lowest frequency in a particular time frame. A speaker’s overall pitch variation is used to determine the extent to which he/she fluctuates in pitch. Studies in L2 pronunciation have revealed that L2 speakers often have a more restricted pitch range than L1 speakers (Kang, 2013; Staples, 2015). As with tone choice, L2 learners sometimes struggle with falling intonation to indicate a statement or closure, most often because their pitch does not fall far enough to signal the change (Binghadeer, 2008). Recent studies have included pitch range in their ASR algorithms in order to determine its effect on discriminating among proficiency levels or to improve speakers’ oral performance using ASRs (Eskenazi, 1999).
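The measurement itself is a simple difference over the prominent syllables’ F0 values, as the sketch below shows with hypothetical measurements of the kind one would read off software such as PRAAT.

```python
# Sketch: pitch range as maximum minus minimum F0 over the prominent syllables.
# The F0 values (in Hz) are hypothetical.

prominent_f0 = [182.4, 231.7, 150.9, 205.3, 168.2]

pitch_range = max(prominent_f0) - min(prominent_f0)
print(f"Pitch range: {pitch_range:.1f} Hz")
```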
Illustrations of pronunciation analysis
The following section provides illustrations of selected pronunciation features described in this chapter. The measurement of each feature is first explained, and a screenshot from PRAAT is then provided to illustrate how the measurement is obtained. The vertical dotted lines delineate the feature each figure illustrates; the horizontal lines on the spectrogram represent pitch values and movement, and the grey lines represent amplitude.
Voice onset time
VOT is measured as the interval of time between the release of the stop and the onset of voicing. When this value is short, the speaker most likely started voicing immediately after the release of the consonant. Figures 6.1 and 6.2 are screenshots of two voicing conditions produced by an L1 speaker: the first shows a long positive VOT in the production of the voiceless consonant /k/, and the second a short VOT in the production of the voiced consonant /b/.
p.123
FIGURE 6.1 Long voicing lag of consonant /k/ as produced by an L1 speaker
FIGURE 6.2 Short voicing lag of consonant /b/ as produced by an L1 speaker
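As a companion to Figures 6.1 and 6.2, the sketch below shows the arithmetic once the two time points (release burst and voicing onset) have been located by hand in PRAAT; the time values and the 30 ms long-lag cut-off are illustrative assumptions.

```python
# Sketch: computing and loosely classifying VOT from hand-located time points.
# Times and the 30 ms cut-off are hypothetical/illustrative.

def vot_ms(burst_time_s, voicing_onset_s):
    """VOT in milliseconds; negative if voicing starts before the release."""
    return (voicing_onset_s - burst_time_s) * 1000

def classify_vot(vot, long_lag_cutoff_ms=30.0):
    if vot < 0:
        return "voicing lead (negative VOT)"
    if vot < long_lag_cutoff_ms:
        return "short-lag VOT"
    return "long-lag VOT"

vot_k = vot_ms(burst_time_s=0.412, voicing_onset_s=0.487)  # hypothetical /k/, cf. Figure 6.1
vot_b = vot_ms(burst_time_s=1.105, voicing_onset_s=1.113)  # hypothetical /b/, cf. Figure 6.2
print(f"/k/: {vot_k:.0f} ms -> {classify_vot(vot_k)}")
print(f"/b/: {vot_b:.0f} ms -> {classify_vot(vot_b)}")
```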
Vowel formants
PRAAT also allows a user to obtain F1 and F2 frequencies automatically. The dotted horizontal lines on the spectrogram represent the different formant frequencies. As can be seen in Figure 6.3, pressing F1 on the keyboard makes PRAAT automatically calculate the first formant value for the selected phone (in this case the vowel /i/). However, the researcher still has to choose the exact point on the spectrogram at which the formant is measured.
p.124
FIGURE 6.3 First formant spectrograms of monophthong /i/ using PRAAT
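The same measurement can also be scripted rather than read off the editor window. The sketch below assumes the parselmouth library (a Python interface to Praat) and a hypothetical recording and vowel boundaries; it extracts F1 and F2 at the vowel midpoint, while the boundaries themselves would still be located by hand.

```python
# Sketch: extracting F1 and F2 at the midpoint of a hand-selected vowel interval,
# assuming the parselmouth library. File name and boundary times are hypothetical.
import parselmouth

snd = parselmouth.Sound("speaker01.wav")      # hypothetical recording
vowel_start, vowel_end = 0.532, 0.618         # hypothetical /i/ boundaries (seconds)
midpoint = (vowel_start + vowel_end) / 2

formants = snd.to_formant_burg()              # Burg formant analysis with Praat defaults
f1 = formants.get_value_at_time(1, midpoint)  # first formant (Hz)
f2 = formants.get_value_at_time(2, midpoint)  # second formant (Hz)
print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz")
```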
Vowel duration measurement
Vowel duration is relatively easy to measure and can be obtained with any of the software mentioned in the introduction. One begins where the preceding sound is completely inaudible and measures up to the point where the following sound can be heard. Figure 6.4 is a screenshot from PRAAT in which the highlighted section represents the tense vowel /i/ in the stressed syllable of “THEsis.” As can be seen, the sound /θ/ has been excluded from the beginning of the syllable and /s/ is cut off from the end. When the area to be measured is highlighted, PRAAT and other software (e.g., Audacity) provide the user with the exact duration of the highlighted area, which makes the figure quite exact and consistent. However, segmentation decisions can still vary slightly: some researchers cut further into the vowel to avoid any coloring from the neighboring consonants, while others include the end burst of a consonant so as not to lose vowel quality.
FIGURE 6.4 Vowel measurement of the tense vowel /i/ using PRAAT
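Once the boundaries are fixed, the duration itself is just the difference between the two time points. The sketch below uses hypothetical boundary times of the kind one would read off a PRAAT selection such as that in Figure 6.4.

```python
# Sketch: vowel duration from hand-placed boundaries; the segmentation caveat
# discussed above is left to the analyst. The times are hypothetical.

vowel_start = 1.742  # hypothetical onset of /i/ in "THEsis", after /θ/ is excluded
vowel_end = 1.901    # hypothetical offset of /i/, before /s/ begins

duration_ms = (vowel_end - vowel_start) * 1000
print(f"Vowel duration: {duration_ms:.0f} ms")
```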
p.125
Pauses
Measuring pause duration is fairly straightforward because one can easily see the exact location of the pause from the amplitude and pitch analysis (for which neighboring sounds would have values) or from the spectrogram. However, as with all other measurements, one should not rely solely on the visual representation of a pause, because certain sounds might still be audible (however faint) without registering any frequency or amplitude information. In Figure 6.5, the highlighted section is a silent pause between two runs of the speaker’s utterance. One can see that it does not register any pitch information, and yet the amplitude is not one that would normally represent complete silence. That is because the speaker was in fact taking a breath midway through the pause in order to start the next run. Researchers therefore have to use both visual and auditory information when assessing whether a particular section is indeed a silent pause. As one can also see, there are two spikes in amplitude to the right and left of the pause. The one to the left is a release burst of the voiceless consonant /t/ that was still audible even though the speaker had already finished the word first just before it. The spike to the right is the beginning of the following utterance; one would therefore have to listen to that section several times to discern by ear where the pause ends completely and at exactly which point the following sound begins.
FIGURE 6.5 Representation of a silent pause using PRAAT
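For larger datasets, candidate silent pauses are often flagged automatically before being checked by ear. The sketch below is a rough, threshold-based illustration over framed intensity values; the frame step, the 45 dB threshold, and the 0.1 s minimum duration are assumptions, and, as the discussion above makes clear, breaths and faint bursts mean the output still needs auditory verification.

```python
# Sketch: flagging candidate silent pauses from framed intensity values (dB).
# Frame step, threshold, and minimum duration are illustrative assumptions.

def find_silent_pauses(intensity_db, frame_step_s=0.01, threshold_db=45.0, min_dur_s=0.1):
    pauses, start = [], None
    for i, db in enumerate(intensity_db):
        if db < threshold_db and start is None:
            start = i                                  # pause candidate opens
        elif db >= threshold_db and start is not None:
            if (i - start) * frame_step_s >= min_dur_s:
                pauses.append((round(start * frame_step_s, 3), round(i * frame_step_s, 3)))
            start = None                               # pause candidate closes
    if start is not None and (len(intensity_db) - start) * frame_step_s >= min_dur_s:
        pauses.append((round(start * frame_step_s, 3), round(len(intensity_db) * frame_step_s, 3)))
    return pauses

# Hypothetical intensity contour: speech, a dip of just over 0.1 s, then speech again
contour = [62, 60, 58, 40, 38, 37, 36, 35, 35, 36, 38, 39, 41, 42, 59, 61, 63]
print(find_silent_pauses(contour))  # -> [(0.03, 0.14)]
```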
p.126
FIGURE 6.6 Stressed syllable in children as produced by an advanced Chinese speaker of English
Stress
A stressed syllable in a given word is identified as the syllable that typically has the highest values in the following categories: length, pitch, and amplitude. Figure 6.6 illustrates the visual representation of the stressed syllable in a two-syllable word. As shown, the nucleus of the stressed syllable, /ɪ/, has the highest values for both pitch and amplitude.
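One simple way to formalize this “highest values” judgment is to rank the syllables on each measure and pick the syllable with the best combined rank, as in the sketch below; the ranking scheme and the measurements for the two syllables of children are illustrative assumptions rather than a validated procedure.

```python
# Sketch: picking the stressed syllable as the one with the best combined rank for
# pitch (Hz), length (ms), and intensity (dB). All values are hypothetical.

syllables = {
    "chil": {"pitch": 228.0, "length": 195.0, "intensity": 74.0},
    "dren": {"pitch": 176.0, "length": 150.0, "intensity": 68.0},
}

def stress_score(name):
    ranks = 0
    for measure in ("pitch", "length", "intensity"):
        ordered = sorted(syllables, key=lambda s: syllables[s][measure], reverse=True)
        ranks += ordered.index(name)        # 0 = highest on this measure
    return ranks

stressed = min(syllables, key=stress_score)  # lowest summed rank wins
print(f"Stressed syllable: {stressed}")      # -> chil
```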
Prominence and tone choice
Figure 6.7 clearly shows how the word think is set apart by the speaker from the remaining syllables in the chunk this is just what I THINK; this syllable would therefore be flagged as prominent. Since there is a significant rise at the beginning of the syllable followed by a drop towards its end, it would be marked as prominent with a rise-fall (p+) tone.
FIGURE 6.7 The prominent syllable of a tone unit as produced by an advanced Chinese speaker
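A very crude way to automate the kind of judgment made for Figure 6.7 is to compare F0 at the start, middle, and end of the termination syllable, as in the sketch below; the three-point simplification, the 5% change threshold, and the example values are illustrative assumptions, not an implementation of the classifiers cited earlier.

```python
# Sketch: a crude three-point classifier for Brazil's (1997) five tone choices.
# The 5% relative-change threshold and all F0 values are illustrative assumptions.

def tone_choice(f0_start, f0_mid, f0_end, rel_threshold=0.05):
    def direction(a, b):
        if b > a * (1 + rel_threshold):
            return "rise"
        if b < a * (1 - rel_threshold):
            return "fall"
        return "level"

    first, second = direction(f0_start, f0_mid), direction(f0_mid, f0_end)
    if first == "rise" and second == "fall":
        return "rise-fall (p+)"
    if first == "fall" and second == "rise":
        return "fall-rise (r)"
    if "rise" in (first, second):
        return "rise (r+)"
    if "fall" in (first, second):
        return "fall (p)"
    return "level (o)"

print(tone_choice(190, 240, 170))  # rise then fall, as in Figure 6.7 -> rise-fall (p+)
print(tone_choice(210, 195, 160))  # falling throughout -> fall (p)
```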
p.127
Current practices and contributions
The following section describes the pronunciation features used in the speaking scales of current standardized tests. We examine descriptors from CELA (Cambridge English Language Assessment), the TOEFL iBT (Internet-Based Test), and the IELTS (International English Language Testing System). Our purpose is to illustrate how the features described above appear in standardized tests and to compare/contrast their operationalization in the descriptors with that of L2 research.
Segmental features
Segmental features have been shown to be significant predictors of oral proficiency in L2 pronunciation studies. These features, however, are not as thoroughly described in language tests or assessment descriptors. In addition, unlike in this chapter, segmental features are frequently combined under one major category: the descriptors typically do not separate vowel from consonant production, let alone any sub-features beyond that (e.g., VOT, vowel duration).
Some tests include segmental errors as a separate descriptor and often describe them as the speaker’s deviation from a norm. Designers at CELA do so on some of their tests. The explanation of segmental features in a descriptor is frequently general. In the Cambridge Advanced test, for instance, a speaker with a score of 3 is described as someone whose “individual sounds are articulated clearly” (Cambridge English Handbook for Teachers, 2015, p. 86). This might be somewhat difficult to quantify when a rater is evaluating a minute-long production. Furthermore, in L2 pronunciation studies, clear articulation does not equal accurate articulation, which might render this description slightly more vague than what L2 learners might be accustomed to.
Other tests place segmental features within a more general descriptor. The TOEFL iBT speaking scale includes all of its pronunciation features in the “delivery” construct. This umbrella term covers the “flow” and clarity of the speech, segmental errors, intonation patterns, and overall intelligibility. A speaker who receives a high score is one whose production exhibits only “minor difficulties” with segmental features. The same can be said for the IELTS, which places segmental errors under the term “pronunciation” and describes a proficient speaker as one who “uses a full range of pronunciation features with precision and subtlety” (IELTS Teachers’ Guide, 2015, p. 18). Although segmental features are referred to in the description for IELTS raters, they are mostly discussed in relation to comprehensibility, that is, how difficult the production is to understand and the amount of effort required to understand it (Seedhouse & Egbert, 2006). This treatment differs from L2 speaking research, which has demonstrated that comprehensibility is not affected by segmental features alone. Thus, teasing segmental features apart from other features for evaluative purposes might prove difficult. In addition, the measurement of segmental features (e.g., VOT, vowel space) requires extreme delicacy and precision because variation sometimes occurs at the millisecond level.
p.128
Suprasegmental features
Fluency features. Speaking descriptors define fluency in a more semantic sense rather than as a broad term with several sub-features; that is, fluency is a matter of whether the production flows easily. The IELTS, for example, uses the term fluency to refer to the speaker’s use of connectors and his/her ability to produce coherent language. The sub-features under this descriptor are hesitations and self-corrections. This semantic definition, however, is no longer employed in L2 pronunciation research (Peters & Guitar, 1991). In addition, hesitation has rarely been included in speaking research as a variable on its own (except in connection with filled pauses), because it is not the number of hesitations but their type that cues proficiency (Chambers, 1997; Watanabe & Rose, 2012).
Speech rate and pauses have been identified as significant factors that raters consider when evaluating speech. In her study of the iBT speaking scale, Poonpon (2011) revealed that speech rate was the only statistically significant predictor that could distinguish among different proficiency levels. However, raters also reported some difficulties (Poonpon, 2011), one being the fuzziness of the band descriptors (particularly when one band is described in relation to another). Furthermore, qualitative data revealed that raters mentioned specific criteria influencing their ratings that were not included in the iBT descriptors; some of these involved L1 influence and pause location. While such features are not explicitly stated or described in the scale, raters are sometimes aware of their influence on the way they rate speech files. These findings confirm that many more features influence a rater’s perception than those present in most standardized descriptors.
Prosodic features. Prosodic features, particularly intonation, have recently been added to revised versions of test descriptors. CELA tests describe intonation as the manner in which “the voice rises and falls, e.g., to convey the speaker’s mood, to support meaning or to indicate new information” (Cambridge English Handbook for Teachers, 2015, p. 89). The TOEFL iBT briefly mentions intonation under delivery and describes a high-proficiency production as one that uses correct intonation patterns. The IELTS scale, on the other hand, does not mention intonation either in the public version of the descriptors or in the more detailed version provided to raters.
Stress and prominence have also been mentioned in some of the speaking scales. The CELA tests incorporate both lexical stress and prominence in their descriptions of pronunciation features; these are referred to as word stress and sentence stress and are included in the descriptions of all bands. The TOEFL iBT scale similarly mentions lexical stress when describing the characteristics of a low-proficiency production. The IELTS does not include any prosodic features in its descriptors or the rater handbook. It is important to note that the IELTS has a real-time interactive component (an interview), and as L2 pronunciation studies have shown, prosodic features can play a major role both in cuing accentedness and in affecting comprehension in such interactions (Wennerstrom, 1997; Kang, 2010).
p.129
New directions and recommendations
Based on our comparison between pronunciation features examined in L2 research and those incorporated into oral assessment scales, we can make recommendations on two different levels: (1) the selection of the different constructs; and (2) the description of those constructs. The following section details the new direction we recommend speaking scales move towards.
We first suggest that certain features be incorporated into the band descriptors, particularly those that have been proven significant in L2 speaking and assessment research. We do, however, proceed with caution while providing our suggestions. We acknowledge that it is virtually impossible for a human rater to account for all the variables we have described when evaluating a short oral production. Even if this were attempted, the amount of time (and consequently, money) invested would be extremely high. A rater could take up to 45 minutes to analyze a single variable for a production and this would certainly result in rater exhaustion (Kang & Pickering, 2013). In addition, some of these features require calculations and intricate acoustic measurements that the rater is unable to conduct in real time. Therefore, test designers may include features that do not impose a large cognitive load on raters. When it comes to segmental features, consonant and vowel deviations are relatively easy to detect and are not considered as tedious as other features. The same can also be said for intonation, pause length and frequency, tone choice, stress, and prominence. These features become more noticeable in dialogic interactions. As for the remaining phonetic features that require complex measurement (e.g., VOT), we recommend that they be considered for automated speech recognizers as they are quite difficult to measure by ear and yet are still predictive features worth retaining when assessing oral productions. Similarly, these features can also be used in pronunciation research when examining comprehensibility, intelligibility, or accentedness. While recent studies have incorporated a large number of pronunciation features (Kang, 2013), more research is needed to determine which features best predict listener ratings when it comes to nonnative productions. Such research would help L2 speaking and assessment descriptors better inform one another.
In order to limit the cognitive load and make an informed decision about constructs, our second recommendation is that the constructs assessed be task-specific. Different tasks may require the speaker to use a certain variety of sub-features (e.g., declarative falling tone for constrained speech versus more rising tones for interviews). The speech patterns exhibited by a speaker in read-aloud tasks would certainly be quite different from those used in interviews (Fulcher, 2003). It would therefore be beneficial to provide a more detailed description of selected task-related features, rather than supply a large selection with very general and vague definitions.
p.130
Our third recommendation is the employment of ASRs for the addition of a larger selection of features, particularly prosodic ones. Algorithms that analyze constrained speech have made great progress and have reached high correlations between human and computer raters (Bernstein, Van Moere, & Cheng, 2010). Oral assessment, however, does not simply evaluate a speaker’s repetition of an utterance verbatim. The earlier models included acoustic and segmental features alone, but as research has demonstrated, suprasegmental features play a major role in determining oral proficiency levels (Kang et al., 2010). Consequently, recent automated scorers, such as SpeechRater, have included fluency features in the training of computer programs (Zechner et al., 2009; Evanini & Wang, 2013; Coutinho et al., 2016). The public version of SpeechRater, for instance, includes 11 different pronunciation-related features that are analyzed before producing a score. Yet, the computer–human reliability is still not high enough to rely on computer scores alone in test conditions. Therefore, their use as complementary forms of rating could be beneficial.
Still, there are many significant features, such as intonation, that have not yet been widely incorporated into such models (Johnson et al., 2016). We recommend the inclusion of more prosodic features (beyond the sentence level) in automated scoring, especially tone choice and intonation. This addition could help raise the reliability between human and computer ratings, particularly for spontaneous speech, since the automated system can be programmed to examine the same features a human rater would (see Loukina, Davis, & Xi, Chapter 8, this volume, for more discussion; or Van Moere & Suzuki, Chapter 7, this volume, for constrained speech). In fact, several researchers have called for more communication between pronunciation researchers and developers of automated scoring programs, as raters can provide insight that informs the programmers’ development and adjustment of an algorithm (Chapelle & Chung, 2010).
Our fourth recommendation is to ensure consistency in the use and definition of pronunciation constructs across different speaking descriptors. The IELTS uses very general definitions under pronunciation and delivery. The TOEFL iBT and CELA tests, on the other hand, provide more detailed descriptions but place various pronunciation variables, both segmental and suprasegmental, in one category (either pronunciation or delivery). Additionally, the definitions of some terms (such as fluency: semantic versus phonological) are not consistent with L2 speaking research. Test-takers and teachers usually have access only to the public version of the scale, which simply lists the band descriptors and provides very limited information about the specific manner in which the constructs are defined in this context. Teachers are likely to refer to textbooks that treat pronunciation features as they are commonly used in the literature, and so they might be misled regarding what to focus on in their teaching. Our recommendation is to provide clear and detailed definitions of the different constructs being assessed, especially given that certain terms (e.g., fluency) can refer to different concepts.
p.131
In conclusion, a large number of pronunciation features have been shown to be significant indicators for discriminating among proficiency levels. While ASRs have proven able to incorporate a significant number of these features, their less-than-perfect accuracy means they have not yet eliminated the need for human raters. When choosing the relevant constructs for any speaking test, it is therefore still important to select features that fit well with rater ability and task variability. We recommend that some of the features described in this chapter be taken into consideration in the design of future descriptors used in oral assessment.
References
Alotaibi, Y. A., & AlDahri, S. S. (2012). Effect of Arabic emphaticness on Voice ONSET Time (VOT). Proceedings from the 2012 International Conference on Audio Language and Image Processing (ICALIP) (pp. 297–302). Shanghai, China: Institute of Electrical and Electronics Engineers. Retrieved from: http://ieeexplore.ieee.org/document/6376655/.
Audacity Team (2014). Audacity®: Free audio editor and recorder [Computer program]. Version 2.0.0. Retrieved from http://audacity.sourceforge.net/.
Bernstein J., Van Moere A., & Cheng, J. (2010). Validating automated speaking tests. Language Testing, 27(3), 355–377.
Bhat, S., Hasegawa-Johnson, M., & Sproat, R. (2010, September). Automatic fluency assessment by signal-level measurement of spontaneous speech. Proceedings from SLaTE (ISCA workshop on spoken language technology for education) (pp. 1–4). Tokyo, Japan. Retrieved from: http://www.isle.illinois.edu/sst/pubs/2010/bhat10slate.pdf.
Bianchi, M. (2007). Effects of clear speech and linguistic experience on acoustic characteristics of vowel production. Tampa, FL: University of South Florida.
Binghadeer, N. A. (2008). An acoustic analysis of pitch range in the production of native and nonnative speakers of English. Asian EFL Journal, 10(4), 96–113.
Boersma, P., & Weenink, D. (2016). Praat: Doing phonetics by computer [Computer program]. Version 6.0.21, retrieved from http://www.praat.org/.
Brazil, D. (1997). The communicative value of intonation in English. Cambridge: Cambridge University Press.
Cambridge English Handbook for Teachers. (2015). Cambridge English Language Assessment. http://www.cambridgeenglish.org/images/167804-cambridge-english-advanced-handbook.pdf.
Chambers, F. (1997). What do we mean by fluency? System, 25(4), 535–544.
Chapelle, C. A., & Chung, Y.-R. (2010). The promise of NLP and speech processing technologies in language assessment. Language Testing, 27(3), 301–315.
Charif, R. A., Waack, A. M., & Strickman, L. M. (2008). Raven Pro 1.3 user’s manual. Ithaca, NY: Cornell Laboratory of Ornithology.
Chen, L., Evanini, K., & Sun, X. (2010). Assessment of non-native speech using vowel space characteristics. Proceedings from the Spoken Language Technology Workshop (SLT) (pp. 139–144). Berkeley, CA: Institute of Electrical and Electronics Engineers. Retrieved from: http://ieeexplore.ieee.org/document/5700836/.
p.132
Cheng, J., D’antilio, Y. Z., Chen, X., & Bernstein, J. (2014). Automatic assessment of the speech of young English learners. Proceedings from The 9th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 12–21). Baltimore, MD: Association for Computational Linguistics. Retrieved from: http://www.aclweb.org/anthology/W14-1802.
Cheng, W., Greaves, C., & Warren, M. (2008). A corpus-driven study of discourse intonation: The Hong Kong corpus of spoken English (prosodic) (Vol. 32). Amsterdam: John Benjamins Publishing.
Coutinho, E., Hönig, F., Zhang, Y., Hantke, S., Batliner, A., Nöth, E., & Schuller, B. (2016). Assessing the prosody of non-native speakers of English: Measures and feature sets. Proceedings from the Language Resources and Evaluation Conference (pp. 1328–1332). Portorož, Slovenia, N. Calzolari, et al. (Eds.). Retrieved from: http://www.lrec-conf.org/proceedings/lrec2016/pdf/1258_Paper.pdf.
Cucchiarini, C., Strik, H., & Boves, L. (2000). Quantitative assessment of second language learners’ fluency by means of automatic speech recognition technology. Journal of the Acoustical Society of America, 107(2), 989–999.
Das, S., & Hansen, J. H. (2004, June). Detection of voice onset time (VOT) for unvoiced stops (/p/,/t/,/k/) using Teager energy operator (TEO) for automatic detection of accented English. Proceedings from The 6th Nordic Signal Processing Symposium-NORSIG (pp. 344–347). Espoo, Finland, J. M. A. Tanskanen (Ed.). Retrieved from: https://www.scopus.com/record/display.uri?eid=2-s2.0–11844288934&origin=inward&txGid=2A699980110580C6BDC76A1FAB6D99AF.wsnAw8kcdt7IPYLO0V48gA%3a2 (login required).
De Wet, F., Van der Walt, C., & Niesler, T. R. (2009). Automatic assessment of oral language proficiency and listening comprehension. Speech Communication, 51(10), 864–874.
Eskenazi, M. (1999). Using automatic speech processing for foreign language pronunciation tutoring: Some issues and a prototype. Language Learning & Technology, 2(2), 62–76.
Ettien, N., & Abat, M. (2014). Negative VOT in three Montenegrin-accented English idiolects. Linguistic Portfolios, 3(1), 91–97.
Evanini, K., & Wang, X. (2013). Automated speech scoring for non-native middle school students with multiple task types. Proceedings of The 14th Annual Conference of the International Speech Communication Association (pp. 2435–2439). Lyon, France: International Speech Communication Association. Retrieved from: http://www.evanini.com/papers/evaniniWang2013toefljr.pdf.
Field, J. (2005). Intelligibility and the listener: The role of lexical stress. TESOL Quarterly, 39(3), 399–423.
Flege, J. E. (1992). Speech learning in a second language. In C. Ferguson, L. Menn, & C. Stoel-Gammon (Eds.), Phonological development: Models, research, and application (pp. 565–604). Timonium, MD: York Press.
Flege, J. E., & Eefting, W. (1987). Cross-language switching in stop consonant perception and production by Dutch speakers of English. Speech Communication, 6, 185–202.
Flege, J. E., McCutcheon, M. J., & Smith, S. C. (1987). The development of skill in producing word-final English stops. Journal of the Acoustical Society of America, 82(2), 433–447.
Flege, J. E., Munro, M. J., & Skelton, L. (1992). Production of the word-final English/t/–/d/contrast by native speakers of English, Mandarin, and Spanish. Journal of the Acoustical Society of America, 92(1), 128–143.
Flege, J. E., & Port, R. (1981). Cross-language phonetic interference: Arabic to English. Language and Speech, 24(2), 125–146.
p.133
Freed, B. F. (2000). Is fluency, like beauty, in the eyes of the beholder? In H. Riggenbach (Ed.), Perspectives on fluency (pp. 243–265). Ann Arbor, MI: The University of Michigan Press.
Fulcher, G. (2003). Testing second language speaking. London: Pearson.
Graham, C., Caines, A., & Buttery, P. (2015). Phonetic and prosodic features in automated spoken language assessment. Proceedings from the Workshop on Phonetic Learner Corpora (pp. 37–40). Glasgow, UK: International Congress of the Phonetic Sciences (ICPhS). Retrieved from: http://www.ifcasl.org/docs/Graham_final.pdf.
Hahn, L. D. (2004). Primary stress and intelligibility: Research to motivate the teaching of suprasegmentals. TESOL Quarterly, 38(2), 201–223.
Hansen, J. H., Gray, S. S., & Kim, W. (2010). Automatic voice onset time detection for unvoiced stops (/p/,/t/,/k/) with application to accent classification. Speech Communication, 52(10), 777–789.
Henry, K., Sonderegger, M., & Keshet, J. (2012). Automatic measurement of positive and negative voice onset time. Proceedings from the 13th Annual Conference of the International Speech Communication Association (pp. 871–874). Portland, OR: International Speech Communication Association. Retrieved from: http://people.linguistics.mcgill.ca/~morgan/interspeech2012.pdf.
Hiramatsu, K. (1990). Timing acquisition by Japanese speakers of English, speech research laboratory. Work in Progress, 6, 49–76. Reading, UK: Department of Linguistics, Reading University.
Hönig, F., Batliner, A., & Nöth, E. (2012, June). Automatic assessment of non-native prosody annotation, modelling and evaluation. Proceedings from the International Symposium on Automatic Detection of Errors in Pronunciation Training (IS ADEPT) (pp. 21–30). Stockholm, Sweden. Retrieved from: https://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2012/Hoenig12-AAO.pdf.
IELTS Teacher Guide. (2015). Cambridge ESOL. Retrieved from: https://www.ielts.org/-/media/publications/guide-for-teachers/ielts-guide-for-teachers-2015-uk.ashx.
Isaacs, T., Trofimovich, P., Yu, G., & Chereau, M. (2015). Examining the linguistic aspects of speech that most efficiently discriminate between upper levels of the revised IELTS Pronunciation scale. IELTS Research Reports Online Series, 4. Retrieved from: https://www.ielts.org/~/media/research-reports/ielts_online_rr_2015-4.ashx.
Isaacs, T. (2013). Assessing pronunciation. The Companion to Language Assessment, 2(8), 140–155.
Iwashita, N., Brown, A., McNamara, T., & O’Hagan, S. (2008). Assessed levels of second language speaking proficiency: How distinct? Applied Linguistics, 29(1), 24–49.
Jakobson, R. (1941). Child language, aphasia, and phonological universals. The Hague, the Netherlands: Mouton.
Jin, T., & Mak, B. (2012). Distinguishing features in scoring L2 Chinese speaking performance: How do they work? Language Testing, 30(1), 23–47.
Johnson, D. O., & Kang, O. (2016). Automatic prosodic tone choice classification with Brazil’s intonation model. International Journal of Speech Technology, 19(1), 95–109.
Johnson, D. O., Kang, O., & Ghanem, R. (2016). Language proficiency rating: Human versus machine. Proceedings from Pronunciation in Second Language Learning and Teaching (PSLLT) 2015 (pp. 119–129). Dallas, TX, J. Levis (Ed.). Retrieved from: https://apling.engl.iastate.edu/alt-content/uploads/2016/08/PSLLT7_July29_2016_B.pdf.
Kang, O. (2008). The effect of rater background characteristics on the rating of international teaching assistants speaking proficiency. Spaan Fellow Working Papers, 6, 181–205.
Kang, O. (2010). Salient prosodic features on judgments of second language accent. Proceedings of Speech Prosody. Chicago, IL: Karger, Medical and Scientific Publishers. Retrieved from http://speechprosody2010.illinois.edu/papers/100016.pdf.
p.134
Kang, O. (2013). Linguistic analysis of speaking features distinguishing general English exams at CEFR levels B1 to C2 and examinee L1 backgrounds. Research Notes, 52, 40–48.
Kang, O., & Moran, M. (2014). Pronunciation features in non-native speakers’ oral performances. TESOL Quarterly, 48, 173–184.
Kang, O., & Pickering, L. (2013). Using acoustic and temporal analysis for assessing speaking. In A. Kunnan (Ed.), Companion to language assessment (pp. 1047–1062). London: Wiley-Blackwell.
Kang, O., Rubin, D. L., & Pickering, L. (2010). Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. The Modern Language Journal, 94(4), 554–566.
Kang, O., & Wang, L. (2014). Impact of different task types on candidates’ speaking performances. Research Notes, 57, 40–49. Cambridge: Cambridge English Language Assessment, University of Cambridge. Retrieved from http://www.cambridgeenglish.org/images/177881-research-notes-57-document.pdf.
Kazemzadeh, A., Tepperman, J., Silva, J. F., You, H., Lee, S., Alwan, A., & Narayanan, S. (2006). Automatic detection of voice onset time contrasts for use in pronunciation assessment. Proceedings from the 9th Annual Conference of the International Speech Communication Association. Pittsburgh, PA: International Speech Communication Association. Retrieved from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.652.2133&rep=rep1&type=pdf.
Kormos, J., & Dénes, M. (2004). Exploring measures and perceptions of fluency in the speech of second language learners. System, 32(2), 145–164.
Ladefoged, P. (2006). A course in phonetics, 5th ed. Boston, MA: Thomson Wadsworth.
Levow, G. A. (2005). Context in multi-lingual tone and pitch accent recognition. Proceedings from the 8th Annual Conference of the International Speech Communication Association (pp. 1809–1812). Lisbon, Portugal: International Speech Communication Association. Retrieved from: https://faculty.washington.edu/levow/papers/IS05_context.pdf.
Lisker, L., & Abramson, A. S. (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20, 384–422.
Macken, M. A., & Ferguson, C. A. (1983). Cognitive aspects of phonological development: Model, evidence, and issues. In K. E. Nelson (Ed.), Children’s language, vol. 4. Hillsdale, NJ: Erlbaum.
Magen, I. (1998). The perception of foreign-accented speech. Journal of Phonetics, 26, 381–400.
Munro, M. J., & Derwing, T. M. (1998). The effects of speaking rate on listener evaluations of native and foreign-accented speech. Language Learning, 48(2), 159–182.
Peabody, M., & Seneff, S. (2010). A simple feature normalization scheme for non-native vowel assessment. Proceedings from SigSLaTE. Tokyo, Japan. Retrieved from: https://groups.csail.mit.edu/sls/publications/2010/Peabody_SLaTE_2010.pdf.
Peters, T. J., & Guitar, B. (1991). Stuttering: An integrated approach to its nature and treatment. Baltimore, MD: Williams & Wilkins.
Pickering, L. (1999). An analysis of prosodic systems in the classroom discourse of native speaker and nonnative speaker teaching assistants. Unpublished doctoral dissertation. University of Florida, USA.
Pickering, L. (2004). The structure and function of intonational paragraphs in native and nonnative speaker instructional discourse. English for Specific Purposes, 23(1), 19–43.
Pickering, L., Hu, G. G., & Baker, A. (2012). The pragmatic function of intonation: Cueing agreement and disagreement in spoken English discourse and implications for ELT. In J. Romero-Trillo (Ed.), Pragmatics and prosody in English language teaching (pp. 199–218). Dordrecht, the Netherlands: Springer.
p.135
Poonpon, K. (2011). Synergy of mixed method approach to development of ESL speaking rating scale. Proceedings from Doing Research in Applied Linguistics Conference (pp. 37–44). Bangkok, Thailand. Retrieved from: http://arts.kmutt.ac.th/dral/PDF%20proceedings%20on%20Web/37–44_Synergy_of_Mixed_Method_Approach_to_Development_of_ESL.pdf.
Raphael, L. J. (1972). Preceding vowel duration as a cue to the perception of the voicing characteristic of word-final consonants in American English. Journal of the Acoustical Society of America, 51(4B), 1296–1303.
Repp, B. H. (1979). Relative amplitude of aspiration noise as a voicing cue for syllable-initial stop consonants. Language and Speech, 22(2), 173–189.
Riazantseva, A. (2001). Second language proficiency and pausing a study of Russian speakers of English. Studies in Second Language Acquisition, 23(4), 497–526.
Riggenbach, H. (1991). Toward an understanding of fluency: A microanalysis of nonnative speaker conversations. Discourse Processes, 14(4), 423–441.
Rojczyk, A. (2008). Cross-linguistic priming on vowel duration and delayed plosion in Polish-English bilinguals. In E. Waniek-Klimczak (Ed.), Issues in accents of English (pp. 44–63). Newcastle, UK: Cambridge Scholars Publishing.
Rosenberg, A. (2010). Classification of prosodic events using quantized contour modeling. Proceedings from Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 721–724). Los Angeles, CA: Association for Computational Linguistics. Retrieved from: http://www.aclweb.org/anthology/N10-1109.
Sandoval, S., Berisha, V., Utianski, R. L., Liss, J. M., & Spanias, A. (2013). Automatic assessment of vowel space area. Journal of the Acoustical Society of America, 134(5), EL477–EL483.
Seedhouse, P., & Egbert, M. (2006). The interactional organisation of the IELTS speaking test [online]. In International English Language Testing System (IELTS) Research Reports 2006, Vol. 6 (pp. 1–45). Canberra, Australia: IELTS Australia and British Council. Retrieved from: http://search.informit.com.au/documentSummary;dn=078797279676525;res=IELHSS.
Staples, S. (2015). The discourse of nurse–patient interactions: Contrasting the communicative styles of U.S. and international nurses. Philadelphia, PA: John Benjamins.
Summer Institute of Linguistics (2012). Speech analyzer [Computer program]. Retrieved from http://www-01.sil.org/computing/sa/index.htm.
Sun, X., & Evanini, K. (2011, May). Gaussian mixture modeling of vowel durations for automated assessment of non-native speech. Proceedings from the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5716–5719). Prague, Czech Republic. Retrieved from: http://ieeexplore.ieee.org/document/5947658/.
Towell, R., Hawkins, R., & Bazergui, N. (1996). The development of fluency in advanced learners of French. Applied Linguistics, 17(1), 84–119.
Trofimovich, P., & Baker, W. (2006). Learning second language suprasegmentals: Effect of L2 experience on prosody and fluency characteristics of L2 speech. Studies in Second Language Acquisition, 28(1), 1–30.
Tsukada, K. (2001). Native vs non-native production of English vowels in spontaneous speech: An acoustic phonetic study. Proceedings from the 2nd Annual Conference of the International Speech Communication Association (pp. 305–308). Aalborg, Denmark: International Speech Communication Association.
p.136
Van der Walt, C., De Wet, F., & Niesler, T. (2008). Oral proficiency assessment: The use of automatic speech recognition systems. Southern African Linguistics and Applied Language Studies, 26(1), 135–146.
Watanabe, M., & Rose, R. L. (2012). Pausology and hesitation phenomena in second language acquisition. The Routledge encyclopedia of second language acquisition (pp. 480–483). London: Routledge.
Wennerstrom, A. (1997). Discourse intonation and second language acquisition: Three genre based studies. Unpublished doctoral dissertation, University of Washington, Seattle, USA.
Wennerstrom, A. (2000). The role of intonation in second language fluency. In H. Riggenbach (Ed.), Perspectives on fluency (pp. 102–127). Ann Arbor, MI: University of Michigan Press.
Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., & Woodland, P. (2000). The HTK book version 3.0. Cambridge: Cambridge University Press.
Zechner, K., Higgins, D., Xi, X., & Williamson, D. M. (2009). Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51(10), 883–895.
Zhang, Y., Nissen, S. L., & Francis, A. L. (2008). Acoustic characteristics of English lexical stress produced by native Mandarin speakers. Journal of the Acoustical Society of America, 123(6), 4498–4513.