21

Second language speech perception

A cross-disciplinary perspective on challenges and accomplishments

Debra M. Hardison

Historical discussion

It has been well documented that infants are able to discriminate among the phonetic units of a range of languages; however, a significant decline in this ability occurs between 6–8 and 10–12 months of age (e.g., Werker and Tees, 1984). During this period of first language (L1) perceptual attunement, perception of native-language consonants improves, beginning the process of a neural commitment to those auditory patterns (Rivera-Gaxiola et al., 2005). These simultaneous processes of decline in non-native sound discrimination and facilitation of native sound perception have been attributed to the learning of the acoustic and statistical regularities of speech, which serve as the foundation for acquiring more complex patterns as the lexicon develops (Kuhl et al., 2006). This process for the child, the epitome of the “early” language learner, raises the question of the impact of age on second-language (L2) speech perception, and underscores the important role of continued neural plasticity in L2 perceptual category development.

Many studies have explored the observation that individuals who began learning the L2 "early" outperform those who began in late adolescence or adulthood. Yamada (1995) found that native speakers (NSs) of Japanese first exposed to American English (AE) /r/ and /l/, the most commonly studied L2 sounds, in childhood rather than adulthood perceived these sounds more accurately. Some researchers linked the effect of age to maturational constraints related to the notion of a critical period for speech acquisition involving a purported loss of neural plasticity (Scovel, 1969); others linked it to differences in the state of development of L1 phonetic categories and to perceived L1–L2 phonetic distance (e.g., Flege, 1999). Native Language Magnet theory proposed that the effect of L1 perceptual attunement on L2 perception could be characterized as an L1-conditioned "warping" of speech input due to the effect of L1 "magnets," which reduces the perceptual distance between L2 sounds that are attracted by the same L1 prototype (e.g., Kuhl and Iverson, 1995; Kuhl et al., 2008).

In addition, the L1 phonological system was a major consideration in studies exploring the perception of sounds such as /r/ and /l/ by native Japanese and Korean learners of English as a second language (ESL) (e.g., Ingram and Park, 1998; Mochizuki, 1981; Sheldon and Strange, 1982). Studies involving these sounds also revealed better performance with natural versus synthesized speech stimuli (e.g., Mochizuki, 1981) and with greater experience with the L2 (e.g., Goto, 1971; MacKain et al., 1981), as well as effects of the position of the sound in a word (e.g., initial vs. final, singleton vs. cluster) (e.g., Logan et al., 1991).

More recent research, outlined in the following section, has attempted to investigate a wider range of potential factors impacting L2 speech perception. This research involves the contributions of many disciplines, including phonetics, phonology, cognitive psychology, neuroscience, and second language acquisition.

Core issues

There are several core issues in current L2 speech perception research; they are discussed in the following subsections.

Length of residence and L2 input

Many studies have been conducted to explore the range of factors accounting for the lack of uniform success among L2 learners in speech perception. Length of residence (LOR) in the L2 environment, most often explored as a factor contributing to accented production,1 was investigated in a study comparing groups of L1 Chinese adults living in the USA on several tasks, including perception of word-final English consonants presented with and without the final release burst, and with and without masking noise, which is used to reduce ceiling effects (Flege and Liu, 2001). Participants were divided into two LOR categories: 0.5–3.8 years ("short LOR") and 3.9–15.5 years ("long LOR"), each of which was subdivided into two occupational categories (i.e., university students and non-students). There was a significant effect of LOR, but only for the students; those with longer LORs had higher scores. The authors concluded "segmental phonetic perception may be influenced most importantly by native-speaker input" (p. 546), which was more readily available to the students than to the non-students (research assistants or scientists). Results support the importance of an input- and interaction-rich L2 environment in the development of L2 perception.

L1 and L2 use

In addition to LOR and type of input, studies demonstrated that learners who seldom use their L1 were better able to identify English consonants (MacKay et al., 2001), recognize more English words in noise (Meador et al., 2000), and perceive L2 vowel contrasts (Flege and MacKay, 2004). Flege and MacKay found higher discrimination scores for early learners (L1 Italian) who immigrated to Canada between 2 and 13 years of age than for late learners who immigrated between 15 and 26 years, and for learners who reported low L1 use (an average of 8 percent of the time) than for those with high L1 use (an average of 48 percent). Early learners with high L1 use differed from NSs of English, but not from learners of comparable age of arrival (AOA) and low L1 use. Findings suggested that early AOA is not a guarantee of native-like L2 vowel perception, and, importantly, that L2 learners with an established L1 phonetic system can still show comparable (though not identical) performance to NSs. Højen and Flege (2006) also point out that identical perception by early L2 learners and NSs should not be the criterion for evidence of continued plasticity of the perceptual system, given the variable statistical properties of individuals' language input.

The above findings are compatible with the Speech Learning Model hypothesis (SLM; e.g., Flege, 1995) that neural plasticity in terms of the ability to establish new perceptual categories exists throughout life. The SLM also predicts that: (a) the development of L2 perceptual categories is dependent on the perceived distance between an L2 speech sound and the closest L1 sound, and (b) as L1 phonetic categories develop throughout life, they are more likely to assimilate perceptually close L2 sounds (see also Best and Tyler, 2007, for a discussion of the Perceptual Assimilation Model for functional monolinguals). Perceived similarity is an important concept as the basis for categorization behavior (e.g., Nosofsky, 1986).

Modifying the adult perceptual system through auditory training

Of both theoretical and pedagogical significance was the series of studies by Pisoni and colleagues (e.g., Lively et al., 1993; Logan et al., 1991) from which emerged a better understanding of the hallmarks of successful adult L2 perception training. Perception of L2 sounds involving spectral differences, such as /r/ and /l/, is generally more challenging than perception of contrasts involving temporal differences (e.g., Bohn, 1995). For example, in an early study, NSs of English were trained to identify a new voice onset time (VOT) category in a single session (Pisoni et al., 1982). Initial attempts to train L1 Japanese speakers to discriminate AE /r/ and /l/ had mixed results (Strange and Dittmann, 1984). Pre- and post-test stimuli involved natural speech minimal pairs contrasting the sounds in four word positions (initial singleton, initial cluster, medial, final singleton), and two synthesized speech continua (rock-lock and rake-lake). Training (14–18 sessions) involved synthesized tokens (rock-lock continuum) and a discrimination task. Following training, participants showed more categorical-like perception of the rock-lock series, with some generalization to the rake-lake series; however, there was no significant improvement in the identification of natural speech tokens.

Using the testing stimuli from the Strange and Dittmann (1984) study, Logan et al. (1991) administered three weeks of training to adult L1 Japanese learners of English who completed forced-choice identification tasks. Training stimuli included 68 minimal pairs contrasting /r/ and /l/ in all word positions (i.e., those noted above plus final clusters) produced by five different NSs of English. Significant improvement was noted between pre- and post-tests, especially for the initial cluster and medial positions, which had the lowest pre-test accuracy. There was also a significant effect of talker. Following the post-test, additional identification tasks revealed that participants were able to generalize performance accuracy to novel words (produced by a familiar talker from training) and to a new voice.

Role of stimulus variability. A subsequent study drawing from the same population tested the hypothesis that multiple- versus single-talker training involving /r/ and /l/ promotes phonetic category development robust across context and talker variability (Lively et al., 1993). In the first experiment, testing stimuli were from Logan et al. (1991); however, training stimuli, spoken by five talkers, were limited to the more difficult word positions for L1 Japanese learners (i.e., initial singleton, initial cluster, medial). Findings revealed moderate but significant improvement in overall identification accuracy with a corresponding decrease in reaction times2 across three weeks of training. Participants were able to generalize performance to a new voice. In Experiment 2, only one talker was used in training (i.e., the one producing the best scores from Experiment 1), and final singleton and final cluster positions were included. Results indicated no significant improvement with training. Identification accuracy did not generalize well to word-initial /r/ and /l/ spoken by an unfamiliar talker. The authors concluded that participants receiving multiple-talker training showed better generalization to a new voice.

The following emerged as characteristics of successful L2 perception training: (a) multiple exemplars that are representative of the variability the sounds show in the natural language environment, (b) natural (vs. synthesized) speech to preserve all acoustic cues, (c) multiple talkers, (d) relatively implicit training, (e) identification (vs. discrimination) tasks to promote observation of within-category similarities and between-category differences, (f) testing compatible with the training, and (g) feedback during training. Feedback is particularly important for stimuli characterized by considerable variability (e.g., /r/ and /l/) (Homa and Cultice, 1984). Successful perception training should show generalization to novel stimuli and new voices, transfer ability to other tasks/skills such as production, and retention.

In addition to its use in segmental training, a high variability stimulus set was used to train L1 English learners of Mandarin to identify four Mandarin tones over a two-week period. Accuracy showed significant improvement following training, with generalization to new stimuli and talkers, and transfer to production (Wang et al., 2003).

Although the above perception training studies incorporated stimulus variability, it was controlled by using (a) stimuli spoken in citation form, recorded and presented under ideal conditions, (b) intelligible talkers, (c) a consistent rate and style of speech, and (d) stimuli produced by one talker per training session. As Leather (1990) commented, “too much or too little variability at too early a stage may prevent the learner from discovering with sufficient accuracy the prototypical forms that exemplars expound” (p. 96).

Other training techniques. Using a perceptual fading technique, Morosan and Jamieson (1989) trained Canadian Francophones to perceive the English voiced (e.g., this) and voiceless (e.g., thought) interdental fricatives. Training stimuli were synthesized consonant-vowel (CV) syllables on a continuum from voiceless to voiced, with exaggerated frication initially set at 140 ms and gradually decreased to 35 ms over a period of 90 minutes (two sessions). Participants completed an identification task with feedback. Results showed that identification accuracy for synthesized and natural tokens significantly improved, but without generalization to other word positions or contrasts; for example, participants were able to identify the voiced interdental fricative when the alternative was the voiceless one, but not when it was the voiced stop /d/.

A similar type of training was used with children exhibiting a temporal processing deficit, which results in difficulty segmenting the speech stream if input is presented too fast. Tallal and colleagues hypothesized that if critical cues (e.g., formant transitions distinguishing /ba/ from /da/) were temporally amplified or exaggerated, stimuli might be more easily identified (e.g., Tallal et al., 1996). Results indicated that some children showed perceptual improvement, but there was variability in performance.

Hyperarticulated/exaggerated speech may also offer some advantages for L2 learners. Exaggerated cues such as an expanded vowel space have been found in speech directed toward infants (Kuhl et al., 1997; Uther et al., 2007) and toward adult non-native speakers of English (Uther et al., 2007). Hyperarticulation may make vowel sounds better exemplars of their respective categories, rendering each more distinct.

Exaggerated stimuli were used to train Japanese learners of English to perceive /r/ and /l/ (McCandliss et al., 2002). In adaptive training, the /r/-/l/ distinctions in synthesized rock-lock and road-load continua were exaggerated; this was compared with fixed training, which used good examples of /r/ and /l/ stimuli from each continuum. Participants took the training at home on one continuum (3 sessions, each 20 minutes, 480 trials per session), and were tested on both continua. Results showed that fixed training with feedback was the most effective, followed by adaptive training with or without feedback. Fixed training without feedback offered little improvement. Exaggerated stimuli may draw learner attention to critical cues by making them more perceptually salient, resulting in faster gains than training in which a larger, more variable stimulus set creates competition and challenge in dividing up psychological space. However, such variability is closer to the perceptual challenge of the natural language environment. Transfer to other skills and retention are unknown with such approaches.

Modifying the adult perceptual system through auditory-visual input

All of the above studies have involved unimodal (i.e., auditory) input to the adult perceptual system. Yet, in many instances, speech communication involves both visual and auditory sources of input. In fact, some researchers have argued that multimodal speech (e.g., auditory, visual, haptic) is the primary mode of speech perception (e.g., Rosenblum, 2004). Auditory-visual integration in perception has a longer history in the fields of infant speech development (e.g., Dodd, 1987; Meltzoff and Kuhl, 1994) and adult monolingual processing (e.g., Massaro, 1998; McGurk and MacDonald, 1976) than in L2 speech processing (see Hardison, 2007). In an early study on the difficulty adult Japanese speakers face in perceiving AE /r/ and /l/, Goto (1971) stated that for L2 learners, there was the "disadvantage of not being able to read the lips of the speaker" (p. 321). Some years later, the influence of visual cues from a talker's face was explored with L2 learners of English (Hardison, 1996). This study investigated whether learners whose L1s were Japanese, Korean, Spanish, and Malay would experience the McGurk effect (e.g., McGurk and MacDonald, 1976), a perceptual illusion that arises from mismatched auditory and visual cues; for example, when NSs of English see on a monitor a talker's face whose articulatory gesture represents /ga/, dubbed with the auditory production /ba/, some report hearing /da/. The intermediate-level ESL learners in Hardison's (1996) study were presented with both matched and mismatched CV syllables involving AE /p, f, w, r, t, k/ and /a/. The Japanese speakers showed a significant increase in identification accuracy of /r/ when visual /r/ was present; Korean speakers showed more accurate perception of /r/ and /f/ with corresponding visual cues. For mismatched stimuli, findings revealed that the cue (auditory or visual) that learners could identify more accurately contributed more to the percept on each trial. These results suggested visual cues should be explored in L2 perception training.

In a subsequent study using a pre-test-post-test design, auditory-visual (AV) training was compared to auditory-only (A-only) with a focus on /r/ and /l/ for intermediate-level Japanese and Korean ESL learners (Hardison, 2003). Independent variables were word position (initial singleton and cluster, medial, final singleton and cluster), adjacent vowel (varying the dimensions of height and rounding), and training modality (AV vs. A-only). Training involved 15 sessions (each about 30 minutes) over three weeks. Learners began at comparable levels of perceptual accuracy. Comparison of A-only pre- and post-test scores (the shared modality for both training types) revealed AV and A-only training resulted in significant improvement for both L1 groups; however, AV training was significantly more effective. Visual input contributed the most in contexts where L1 phonology would suggest greater difficulty (i.e., initial word position for the Japanese and final position for the Korean learners3). Both groups’ mean scores on a video-only (V-only) condition showed they also improved their lipreading ability.4 There were significant effects of talker (in the training data), word position, and adjacent vowel; vowel effects had not been explored in previous studies. Successful generalization obtained for novel stimuli and a new talker, with transfer to improved production of /r/ and /l/.

There is variability in the degree to which visual and auditory cues are helpful for L2 learners. In one study using British English, AV training was more effective than A-only in improving the perception of the labial/labiodental contrast (/v/-/b/-/p/) by L1 Spanish speakers, but not the perception of /r/ and /l/ by Japanese speakers (Hazan et al., 2005). Hazan et al. concluded that AV training is more effective when the critical visual cues are sufficiently salient. It is worth noting that comparability is difficult to establish across studies such as Hardison (2003) and Hazan et al. (2005) due to differences in (a) the discernibility of lip movements in different varieties of English (e.g., American vs. British), (b) articulatory gestures between talkers, (c) learners’ linguistic experience, exposure to L2 articulatory gestures, motivation and attention to critical stimulus features, and (d) methodological elements such as period of training, feedback, size of video screen, stimuli, etc.

Computer-animated talking heads such as Baldi (e.g., Massaro, 1998) have been used in the training of L2 learners and children with language disorders. Because Baldi's exterior surface can be made transparent or eliminated, the internal articulators (e.g., tongue) can be displayed. Massaro and Light (2003) compared training of Japanese learners of English to perceive /r/ and /l/ using the normal view of Baldi with training using the view showing movement of the internal articulators. Both approaches resulted in improvement in identification and production; however, seeing the articulators did not provide an additional benefit.

Visual input in speech processing is not limited to human faces or talking heads. Electronic visual displays of pitch contours are helpful for L2 perception and production, and are user-friendly (see Chun et al., 2008 for a review). Waveform displays visualizing segmental duration alongside the accompanying speech gave AE learners of Japanese a significant advantage in improving geminate5 perception over learners who received A-only training (no display), although both groups improved and showed transfer to production (Motohashi Saigo and Hardison, 2009).

Relationship between perception and production

Several studies shed light on other facets of successful training such as retention and transfer to production. Japanese speakers living in Japan significantly improved their identification of /r/ and /l/ with retention of perceptual abilities (i.e., loss of only 2 percent accuracy) when tested three months later (Lively et al., 1994). One might predict that learners in the L2 environment with the advantage of continued input would also show retention although this is often difficult to investigate because of participant availability.

Studies using auditory-only input (Bradlow et al., 1997; Wang et al., 2003) and those using auditory-visual input, both from a talker's face (Hardison, 2003) and an electronic display (Motohashi Saigo and Hardison, 2009) demonstrated that perception training transferred to significant improvement in production in the absence of production training. Hardison (2003) found that the variability in production accuracy of /r/ and /l/ by Japanese and Korean learners of English as a function of word position decreased with perception training. However, these studies also pointed out that performance variability exists between these domains across individual learners. For example, in Wang et al. (2003), Mandarin tone 3 became relatively easy for L1 AE learners to perceive after training, but remained difficult to produce. Bradlow et al. (1997) and Motohashi Saigo and Hardison (2009) measured relative improvement (i.e., improvement as a proportion of the room for improvement)6 to compare perception and production improvement rates for participants in their studies. In both cases, no significant relationship was found between degrees of learning in perception and production due to individual variation.
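Written out, the relative improvement measure defined in note 6 is

\[ \text{relative improvement} = \frac{\text{post-test accuracy} - \text{pre-test accuracy}}{100 - \text{pre-test accuracy}} \]

so that, to take a hypothetical example, a learner scoring 60 percent at pre-test and 80 percent at post-test shows a relative improvement of (80 − 60)/(100 − 60) = 0.50, that is, half of the available room for improvement.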

Beyond the segmental level. The segmentation of the speech stream generally poses a challenge in the early stages of L2 acquisition when adult learners rely on cues relevant to the L1 (Strange and Shafer, 2008). Rhythmic categories such as stress units in English, syllables in French, and morae7 in Japanese play a role in this process (e.g., Cutler and Otake, 2002). A recent auditory study found that with increasing proficiency in L2 Japanese, L1 English learners adopted a segmentation strategy focusing on moraic units (vs. stressed syllables), which increased their identification accuracy of Japanese geminates (e.g., moraic obstruents), sounds that contrast in duration with their singleton counterparts (Hardison and Motohashi Saigo, 2010). Perception was facilitated by contexts with a greater consonant-vowel sonority8 difference, which enhanced the perception of mora boundaries for segmentation.

The question arises as to whether segmental-level perception training can improve learners’ word identification processes. Using the gating paradigm, which involves successive presentations of increasing amounts of a target word (e.g., Grosjean and Frauenfelder, 1996), adapted for auditory-visual input, Hardison (2005a) found that L2 learners of English identified words significantly earlier following perception training with minimal pairs, and when visual cues (talker's face) were present. This AV advantage was noted especially for words beginning with /r/ and /l/, the primary focus of the training. Findings suggested that because articulatory movements precede the associated acoustic signal, visual cues play a priming role in the word identification process for a listener-observer in face-to-face situations (see also Hardison, 2005b; Munhall and Tohkura, 1998). Results of a recent study using a gating task revealed that visual cues from a talker's face and sentence context played significant yet independent roles in word identification by L1 Japanese and Korean learners of English (Hardison, 2008). Much as the temporal precedence of an articulatory gesture can give visual cues a priming role in AV spoken language processing, so can comprehensible sentence context.

The trend toward examining the larger context of a speech event leads us to the potential role of a talker's hand-arm gestures in the processing of L2 speech input. Results of a multiple-choice listening comprehension task administered to low-intermediate and advanced ESL learners revealed significantly higher scores for both proficiency levels when learners were able to see the speaker (Sueyoshi and Hardison, 2005). For the low-intermediate level, seeing the face and gestures produced the best results; for the advanced level, seeing only the face produced the highest scores. Hand gestures were primarily related to the semantic component of the stimulus, whereas lip movements were linked to the phonological component, favoring the more proficient learners who had greater experience with L2 communication. Questionnaire responses revealed positive attitudes toward both types of visual cues (see also Gullberg, 2006). Co-speech gestures also facilitated recall of Japanese words learned by L1 English adults with no prior knowledge of the language (Kelly et al., 2009). Based on results of memory tests and event-related potentials (ERPs9), Kelly et al. suggested that co-speech gesture "deepens the imagistic memory trace for a new word's meaning in the brain" (p. 330). We will return to the issue of memory traces below (see Empirical verification).

Data and common elicitation measures

As noted above, in L2 speech research, both synthesized and natural speech have been used as stimuli. Synthesized speech allows researchers to manipulate critical acoustic cues of interest along a continuum in which the end points represent good examples of two categories that NSs clearly perceive as different (e.g., /r/ and /l/); the intermediate points represent equal physical steps along a set of critical parameters (e.g., Strange and Dittmann, 1984). In contrast to NS categorical perception of consonant categories, less experienced L2 learners typically exhibit a more continuous pattern of perception. Synthesized speech is also used to implement techniques such as perceptual fading (e.g., Morosan and Jamieson, 1989), and, in its sophisticated forms, to contribute to the creation of talking heads such as Baldi. As noted earlier, the use of synthesized speech as training stimuli tended to result in poorer generalization ability compared to natural speech.
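For concreteness, a continuum of this kind can be sketched as a set of equal steps along one critical parameter; in the sketch below the parameter (an F3 onset value) and its endpoints are hypothetical illustrations rather than values from any of the studies cited.

```python
# Hypothetical endpoint values of one critical cue (F3 onset, in Hz) for clear /r/ and /l/ exemplars.
f3_r, f3_l = 1600.0, 2600.0
steps = 9
# Nine stimuli separated by equal physical increments of the cue, from the /r/ end to the /l/ end.
continuum = [f3_r + i * (f3_l - f3_r) / (steps - 1) for i in range(steps)]
print(continuum)
```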

Synthesized speech is often used with discrimination tasks. For example, in an AXB discrimination task, three stimuli are presented per trial, where "X" is drawn from the same category as either "A" or "B," and participants determine whether "X" was more similar to "A" or to "B" (see Højen and Flege, 2006). In an AX (same-different) task, participants determine whether two stimuli are the same or different. In an identification task, listeners categorize or label the stimulus presented in each trial. Listeners can discriminate between stimuli to the extent that they can identify them as belonging to different categories. Performance on discrimination tasks relies on low-level acoustic-phonetic input, whereas identification tasks direct attention to a phonemic level, matching input to stored representations in memory, and more closely mirror daily language tasks. Researchers may measure both accuracy and reaction time.
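To make the trial structure concrete, the sketch below assembles and scores a set of AXB trials from two small pools of recorded tokens; it is not taken from any of the studies cited here, and the file names, pool sizes, and trial count are purely illustrative.

```python
import random

# Hypothetical token recordings for two categories (e.g., /r/- and /l/-initial words).
category_a = ["rock_t1.wav", "rock_t2.wav", "rock_t3.wav"]
category_b = ["lock_t1.wav", "lock_t2.wav", "lock_t3.wav"]

def make_axb_trial():
    """One AXB trial: A and B flank an X drawn from the same category as one of them."""
    a, b = random.choice(category_a), random.choice(category_b)
    x_matches_a = random.random() < 0.5                      # X matches A on half the trials
    x = random.choice(category_a if x_matches_a else category_b)
    return {"A": a, "X": x, "B": b, "answer": "A" if x_matches_a else "B"}

def accuracy(trials, responses):
    """Proportion of trials on which the listener matched X to the correct flank."""
    return sum(r == t["answer"] for t, r in zip(trials, responses)) / len(trials)

trials = [make_axb_trial() for _ in range(20)]
responses = [random.choice(["A", "B"]) for _ in trials]      # stand-in for a listener's responses
print(f"AXB accuracy: {accuracy(trials, responses):.2f}")
```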

Natural speech may be presented in citation form as is often the case with CV syllables (e.g., Hardison, 1996) or minimal pairs used in perception training (e.g., Lively et al., 1993), which may be real words or nonwords. Lively et al. found no significant difference in L2 learners’ perception of segments (e.g., /r/ and /l/) in real words versus those in nonwords. In a forced-choice identification task, the word serves as a carrier of phonetic context. In these tasks, participants are shown a minimal pair (on paper or computer screen) and asked to select the one being presented. (See Other training techniques for studies using hyperarticulated/exaggerated versus natural speech.)

In tasks of spoken word identification that require access to the mental lexicon (i.e., access to meaning-based representations), the participants' familiarity with the stimulus is an important consideration. Familiarity may be determined objectively based on corpus-based assessments of word frequency, which imply likelihood of familiarity, or subjectively using a familiarity scale. The scale may be administered prior to a study to a peer group of participants (i.e., those similar to the study participants in level of proficiency, etc.) and/or following data collection to the study participants. In addition to familiarity, the density of the lexical neighborhood is a consideration in selecting materials. Words in dense lexical neighborhoods have many neighbors (i.e., words that differ from the target by a one-phoneme addition, substitution, or deletion in any position); such words require listeners to attend carefully to the phonetic input to distinguish the target from its neighbors, which compete in the recognition process (Pisoni et al., 1985). For L2 learners, more exposure to the spoken language appears related to the ability to recognize words from denser neighborhoods (Bradlow and Pisoni, 1999). Fewer studies have examined the relationship between segmental perception and spoken word recognition for L2 learners (see Hardison, 2005a, b). As with segmental perception, both accuracy and reaction-time data can be collected in the assessment of spoken word recognition (see Grosjean and Frauenfelder, 1996 for a review of paradigms). Finally, studies drawing on technological developments such as functional magnetic resonance imaging10 (fMRI) and ERPs have begun to compare the neural organization of L1 and L2 processing (see Perani and Abutalebi, 2005 for a review).
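As a concrete rendering of the neighborhood definition above, the sketch below lists the words in a small lexicon that differ from a target by a one-phoneme addition, substitution, or deletion; the mini-lexicon and its rough phonemic transcriptions are hypothetical.

```python
def is_neighbor(a, b):
    """True if phoneme sequences a and b differ by one substitution, addition, or deletion."""
    if len(a) == len(b):                                   # substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = sorted((a, b), key=len)                 # addition/deletion
    if len(long_) - len(short) != 1:
        return False
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))

# Hypothetical mini-lexicon: orthographic forms mapped to rough phonemic transcriptions.
lexicon = {
    "rock": ("r", "a", "k"), "lock": ("l", "a", "k"), "rack": ("r", "ae", "k"),
    "rocks": ("r", "a", "k", "s"), "rot": ("r", "a", "t"), "dog": ("d", "o", "g"),
}

target = "rock"
neighbors = [w for w in lexicon if w != target and is_neighbor(lexicon[w], lexicon[target])]
print(neighbors)   # ['lock', 'rack', 'rocks', 'rot']; "dog" is not a one-phoneme neighbor
```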

Empirical verification

The studies described earlier that demonstrated variable performance across phonetic contexts and talkers support the view that sources of variability in the speech signal are a part of subsequent neural representations (Goldinger, 1996, 1997). Lively et al. (1993) suggested L2 learners may rely on context-dependent exemplars (vs. abstract prototypes) for input identification. As a result of perception training, learner attention is shifted to relevant stimulus features. These shifts are the “stretching” and “shrinking” of perceptual distances in psychophysical space to make items from different categories (e.g., /r/ and /l/) appear less similar, and within-category variants (i.e., allophones) more similar (e.g., Nosofsky, 1986).
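One way such attention-driven stretching and shrinking is commonly formalized, in the spirit of Nosofsky's attention-weighted similarity, is sketched below; the two cue dimensions, their values, and the weight settings are hypothetical.

```python
import math

def similarity(x, y, weights, c=1.0):
    """Exponential-decay similarity over an attention-weighted (city-block) distance."""
    distance = sum(w * abs(a - b) for w, a, b in zip(weights, x, y))
    return math.exp(-c * distance)

# Hypothetical stimuli described on two cue dimensions (e.g., scaled F3 onset and F2 onset).
r_token, l_token = (1.0, 3.0), (3.0, 3.2)

equal_weights   = (0.5, 0.5)   # attention spread evenly, as before training
trained_weights = (0.9, 0.1)   # attention shifted to the dimension that separates the categories

print(similarity(r_token, l_token, equal_weights))    # higher similarity: categories poorly separated
print(similarity(r_token, l_token, trained_weights))  # lower similarity: between-category distance "stretched"
```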

In an adaptation of multiple-trace memory (MTM) theory (e.g., Hintzman, 1986), also referred to as an episodic or exemplar theory, for auditory-visual L2 speech learning, Hardison (2003) suggested the exemplars of episodic models and the prototypes of an abstractionist approach could co-exist. Multiple-trace memory theory states that the memory encoding of a perceptual event involves storage of the attended details as episodes or traces, preserving aspects of variability. In processing, a retrieval cue or probe generated in primary memory contacts in parallel all stored traces in long-term memory, simultaneously activating each based on similarity to features of the probe. In AV speech perception, the features that comprise the preliminary representation that probes memory depend upon the attention given to the auditory and visual attributes of a stimulus relevant for a given task. Jusczyk (e.g., 1993) viewed the extraction of auditory features in infant speech input as part of the human auditory system's innate guided learning mechanism, whose attention-weighting role gives prominence to those features as part of L1 perceptual attunement. One of the challenges for L2 learners is to adopt a weighting scheme based on the critical features of the L2. Perception training can serve to focus learner attention on these stimulus attributes. The probe is then said to return an echo or response to primary memory. The goal for learners is to have the echo from clearly defined L2 traces overshadow that from any similar L1 traces. As a consequence of successful perception training, new L2 traces should be less ambiguous in content and at a greater psychological distance from L1 traces.
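A minimal computational sketch of the probe-echo cycle just described, in the spirit of Hintzman's (1986) model, is given below; the cubed-similarity activation rule follows that model, while the feature coding of the traces and probe is hypothetical.

```python
def echo(probe, traces):
    """Activate every stored trace in parallel by its (cubed) similarity to the probe
    and return the echo: its overall intensity and its activation-weighted content."""
    def activation(trace):
        relevant = [j for j in range(len(probe)) if probe[j] != 0 or trace[j] != 0]
        similarity = sum(probe[j] * trace[j] for j in relevant) / max(len(relevant), 1)
        return similarity ** 3                     # cubing keeps the sign and sharpens close matches
    acts = [activation(t) for t in traces]
    intensity = sum(acts)
    content = [sum(a * t[j] for a, t in zip(acts, traces)) for j in range(len(probe))]
    return intensity, content

# Features coded -1/0/+1 (0 = unattended or absent), e.g., attended auditory and visual attributes.
l2_traces = [(1, 1, -1, 0), (1, 1, -1, 1)]     # clearly specified traces of an L2 category
l1_trace  = [(-1, 1, 0, 0)]                    # a partially similar L1 trace
probe     = (1, 1, -1, 0)                      # preliminary representation of the incoming stimulus
print(echo(probe, l2_traces + l1_trace))       # the well-matched L2 traces dominate the echo
```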

At the time of information retrieval, abstract knowledge can be derived from a composite of episodic traces. A category is, thus, regarded as an aggregate of individual exemplars activated together at the time of information retrieval. The category prototype concept, as a representation of the shared features of multiple traces, captures the advantage of redundancy. Whereas exemplars preserve detailed relevant information of an event, they may be forgotten over time; however, the prototype concept is retained longer. This may account for the retention capability demonstrated by successful L2 perception training studies using multiple exemplars (e.g., Lively et al., 1994). The more exemplars of a category that are stored, the more likely one or more will generalize to the probe and influence its classification. Goldinger (2007) also provided computational evidence of a reciprocal neural network that includes inter-dependent episodic and abstract representations involved in word perception. This complementary learning systems (CLS) approach combines the advantage of a fast-learning network representing, in neurophysiological terms, the hippocampus (to rapidly memorize specific events), and a more stable network representing the cortex (to slowly learn statistical regularities of the input), replicating behavioral evidence from recognition memory data (e.g., Goldinger, 1996, 1997).

L2 learners may store details of both auditory and visual elements of speech events in memory to facilitate subsequent processing (see Hardison, 2003). In bimodal speech processing, the auditory and visual modalities initially provide separate sources of input to the perceptual system. Recent behavioral and neurophysiological evidence supports the early integration of bimodal information. One view suggests unimodal signals are integrated in multisensory areas of the cortex with feedback pathways to primary sensory areas, accounting for enhanced response of neurons to the presentation of concordant auditory and visual events, and the establishment of a multisensory perceptual representation (e.g., Calvert, 2001; Driver and Spence, 2000). Another view suggests that the contribution of multisensory sites throughout the cortex lies in the predictive role one modality's information can have on the other to reduce stimulus uncertainty and facilitate processing (van Wassenhove et al., 2005). Specifically, visible articulatory movements allow the listener-observer's brain to reduce the set of potential targets the speaker will produce (Skipper et al., 2007). This preliminary representation elicits a cohort of possible targets that are compared against the auditory input and fed back for correction. In the case of informative visual cues or fewer possible targets (e.g., salient articulations of bilabials /p,b,m/), more precise predictions can be made resulting in greater facilitation of processing. Visual speech can speed up the cortical processing of auditory signals by NSs within 100 ms of signal onset (van Wassenhove et al., 2005); however, such findings assume listener-observer knowledge of correspondences between sound and lip movements that begin to develop in infancy (e.g., Meltzoff and Kuhl, 1994). For adult L2 learners, these findings emphasize the value of AV training with feedback that exposes them to the concordance of bimodal cues using multiple exemplars from a variety of talkers, offering greater potential for the creation of an accurate L2 cohort to facilitate spoken language processing (Hardison, 2003).

Results of a recent study of bimodal L2 speech processing involving advanced ESL learners (L1 Korean) provided additional support for an episodic lexicon (Hardison, 2006). If listener-observers store information about a talker's face, then subsequent presentation of only a partial visual stimulus of a familiar talker should serve as an effective memory probe to produce greater word identification accuracy compared to a comparable stimulus from an unfamiliar talker, or to an auditory-only stimulus from either type of talker. Findings revealed that seeing only the eye and upper cheek areas of a talker's face was significantly more informative in L2 word identification compared to just hearing the voice, but only when these cues belonged to a familiar talker. Similarly, seeing just the mouth and lower jaw was equivalent to seeing the entire face, but only when the talker was familiar. These results suggest that the visual information learners attend to on a talker's face is not limited to lip movements. Initial processing may be more global allowing observers to preserve as many details as possible in memory with subsequent strategic shifts of attention. As proposed in MTM theory, a cue composed of a familiar but only partial stimulus (e.g., the talker's mouth or eyes) can be enhanced by echoes from stored representations of the entire stimulus (e.g., talker's entire face) in memory to fill in missing details in order to facilitate a task such as word identification. This function based on partial input was also modeled computationally in a CLS framework (Norman and O'Reilly, 2003).

Applications

Instructors may very well question whether one can actually teach someone to perceive a sound. For example, telling learners to focus attention on a low third formant frequency (an acoustic cue that distinguishes AE /r/ from /l/) would not be helpful. In contrast, production lends itself to teaching how sounds are produced, and how they can be affected by surrounding sounds, rate and style of speech, etc. Production is also subject to proprioceptive11 feedback. Nevertheless, instructors can facilitate the perception process by manipulating the input.

Training studies suggest that stimulus variability contributes to development of perceptual categories robust to the variable input of the natural language environment. Taking AE /r/ and /l/ as examples, these sounds show considerable variability depending on vocalic context, word position, and talker. Presenting learners with multiple, auditory-visual exemplars of these sounds in a range of phonetic contexts (e.g., road-load, hear-heal, arrive-alive, crew-clue, etc.), produced by a variety of talkers on a recurring basis would exploit the advantages demonstrated by the aforementioned successful training studies. Not all sounds, of course, show the same degree of variability. Although extensive focused training, if needed, may require sessions that are separate from regular classes, zooming in on troublesome sounds in frequent words as they are encountered in communicative activities is beneficial and compatible with an incidental focus-on-form approach. However, incorporation of variability in input raises the question of how much is helpful, and at what point it should be introduced. Decisions must be made in accordance with a learner's interlanguage stage, needs, and tolerance.

Feedback is also important, especially for those perceptual categories that are more difficult to establish because of the variability of the L2 sounds and/or their similarity to L1 sounds. Here it is helpful for instructors to have some knowledge of the phonology of the learners’ L1(s).

Given the contribution of visual cues to perception, and their precedence in articulation, instructors should consider pointing out the visible aspects of articulations as part of the knowledge learners can take with them when they leave the classroom to aid both perception and production practice. Visual cues include articulatory gestures related to segmental production and hand gestures that are part of the larger speech event, and may correspond to suprasegmental features of stress, rhythm, and intonation.

An early reference to the visual component of speech can be found in the classic textbook, Manual of American English Pronunciation (Prator and Robinett, 1985), which included diagrams of "mouth shapes" for practicing the perception and production of vowels. However, books are not the best medium to help learners take advantage of dynamic visual input, and neither books nor audiotapes provide sufficient variability (e.g., contextual, talker, stylistic). Web-based programs that are currently available are similarly limited and lack the ability to provide feedback on perceptual accuracy, but they do offer some opportunity for auditory-visual segmental input outside the classroom (e.g., http://www.uiowa.edu/~acadtech/phonetics/ for English, German, and Spanish; http://international.ouc.bc.ca/pronunciation/ for English). For instructors who wish to produce and edit their own audiorecordings for learners, Audacity is a free cross-platform sound editor (available at http://audacity.sourceforge.net/). There is also Anvil (Kipp, 2001; see http://www.anvil-software.de/ for details), a program which provides a screen display integrating the recorded audio and video components of a speech event with the associated pitch track and/or waveform extracted using Praat (http://www.fon.hum.uva.nl/praat/).

Future directions

It is evident from studies demonstrating a significant improvement in production as a result of perception training that there is a relationship between these two domains although its exact nature remains elusive. In L2 acquisition, there is a need for studies examining the transfer of production training to perception. Although assessing the retention of abilities following training presents practical challenges for researchers, this also remains an important issue in L2 learning.

In addition, there is a need for L2 studies combining behavioral and neurophysiological evidence. Based on fMRI results, Hickok et al. (2003) suggest there is a predominantly left hemisphere network that enables acoustic-phonetic input to guide the acquisition of language-specific articulatory gestures, and underlies phonological working memory in adults, which has become a focus of some studies in L2 production (e.g., O'Brien et al., 2007). In addition, phonetic training studies have demonstrated that learning can be reflected in ERP measures before behavioral measures (e.g., Tremblay and Kraus, 2002), suggesting a direction for L2 research. Finally, eye-tracking studies may indicate which components of a speech event a perceiver attends to and how this changes over the course of the event, and over the course of learning. The convergence of behavioral, neurophysiological, and psychophysical evidence is a promising direction for future research in L2 spoken language processing.

Notes

1   See Chapter 20 (Pickering) for a discussion of L2 speech production.

2   Reaction time is the elapsed time between presentation of a stimulus and the detection of the associated behavioral response as an indication of processing duration.

3   Japanese has a flap in the dental or alveolar region occurring in utterance-initial and intervocalic positions, with acoustic, articulatory, and perceptual similarities to the AE flap. Korean has a nonvelarized or clear /l/ in syllable-final position and a flap intervocalically (or between vowel and glide; see Hardison, 2003).

4   In AV speech research, it is common to assess the contribution from each modality separately (A-only, V-only) as well as both combined (AV).

5   The duration of vowels and consonants is a contrastive feature in Japanese, unlike in English. In Japanese, singleton consonants contrast with their longer geminate counterparts (e.g., kite “coming,” kitte “postage stamp”).

6   Following Bradlow et al. (1997), relative improvement is measured as post-test accuracy minus pre-test accuracy divided by 100 minus the pre-test accuracy.

7   A mora is a unit of timing, which plays a role in the temporal organization of speech in production and segmentation of speech in lexical recognition.

8   Sonority is considered an inherent property of a segment, traditionally based on stricture. Generally speaking, the more open the vocal tract (the smaller the degree of stricture), the greater the energy in the acoustic signal, and the higher the sonority value.

9   ERPs measure the timing of electrical brain responses to a stimulus. The brainwaves correspond to different types of neurocognitive processes.

10   fMRI is a method of capturing and creating an image of activity in the brain through measurement of changes in blood flow and oxygenation in the brain regions associated with a given task.

11   Proprioception refers to the ability to sense the location and movement of parts of one's own body (e.g., tongue position in the production of sounds).

References

Best, C. T., and Tyler, M. D. (2007). Non-native and second-language speech perception: Commonalities and complementarities. In O. -S. Bohn and M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James Emil Flege (pp. 13–34). Amsterdam: Benjamins.

Bohn, O. S. (1995). Cross-language perception in adults: First language transfer doesn't tell it all. In W. Strange (Ed.), Speech perception and linguistic experience: Theoretical and methodological issues (pp. 370–410). Timonium, MD: York Press.

Bradlow, A. R. and Pisoni, D. B. (1999). Recognition of spoken words by native and non-native listeners: Talker-, listener-, and item-related factors. Journal of the Acoustical Society of America, 106(4), 2074–2085.

Bradlow, A. R., Pisoni, D. B., Akahane-Yamada, R. A., and Tohkura, Y. (1997). Training Japanese listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning on speech production. Journal of the Acoustical Society of America, 101, 2299–2310.

Calvert, G. (2001). Crossmodal processing in the human brain: Insights from functional neuroimaging studies. Cerebral Cortex, 11(12), 1110–1123.

Chun, D. M., Hardison, D. M., and Pennington, M. C. (2008). Technologies for prosody in context: Past and future of L2 research and practice. In J. G. Hansen Edwards and M. L. Zampini (Eds.), Phonology and second language acquisition (pp. 323–346). Amsterdam: Benjamins.

Cutler, A. and Otake, T. (2002). Rhythmic categories in spoken-word recognition. Journal of Memory and Language, 46(2), 296–322.

Dodd, B. (1987). The acquisition of lip-reading skills by normally hearing children. In B. Dodd and R. Campbell (Eds.), Hearing by eye: The psychology of lip-reading (pp. 163–175). London: Erlbaum.

Driver, J. and Spence, C. (2000). Multisensory perception: Beyond modularity and convergence. Current Biology, 10(20), R731–R735.

Flege, J. E. (1995). Second-language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 229–273). Timonium, MD: York Press.

Flege, J. E. (1999). Age of learning and second-language speech. In D. Birdsong (Ed.), Second language acquisition and the critical period hypothesis (pp. 101–131). Mahwah, NJ: Erlbaum.

Flege, J. E. and Liu, S. (2001). The effect of experience on adults’ acquisition of a second language. Studies in Second Language Acquisition, 23(4), 527–552.

Flege, J. E. and MacKay, I. R. A. (2004). Perceiving vowels in a second language. Studies in Second Language Acquisition, 26(1), 1–34.

Goldinger, S. D. (1996). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105(2), 251–279.

Goldinger, S. D. (1997). Words and voices: Perception and production in an episodic lexicon. In K. Johnson and J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 33–66). San Diego, CA: Academic Press.

Goldinger, S. D. (2007, August). A complementary-systems approach to abstract and episodic speech perception. Paper presented at the International Congress of Phonetic Sciences (ICPhS XVI), Saarbrücken, Germany. Retrieved August 13, 2010 from http://www.icphs2007.de/conference/Papers/1781/1781.pdf

Goto, H. (1971). Auditory perception by normal Japanese adults of the sounds “l” and “r”. Neuropsychologia, 9(3), 317–323.

Grosjean, F. and Frauenfelder, U. H. (1996). A guide to spoken word recognition paradigms: Introduction. Language and Cognitive Processes, 11(6), 553–558.

Gullberg, M. (2006). Some reasons for studying gesture and second language acquisition. International Review of Applied Linguistics in Language Teaching, 44(2), 103–124.

Hardison, D. M. (1996). Bimodal speech perception by native and nonnative speakers of English: Factors influencing the McGurk Effect. Language Learning, 46(1), 3–73.

Hardison, D. M. (2003). Acquisition of second-language speech: Effects of visual cues, context, and talker variability. Applied Psycholinguistics, 24(4), 495–522.

Hardison, D. M. (2005a). Second-language spoken word identification: Effects of perceptual training, visual cues, and phonetic environment. Applied Psycholinguistics, 26(4), 579–596.

Hardison, D. M. (2005b). Variability in bimodal spoken language processing by native and nonnative speakers of English: A closer look at effects of speech style. Speech Communication, 46(1), 73–93.

Hardison, D. M. (2006). Effects of familiarity with faces and voices on L2 spoken language processing: Components of memory traces. In Proceedings of the Ninth International Conference on Spoken Language Processing (pp. 2462–2465). Bonn, Germany: International Speech Communication Association. Retrieved August 13, 2010 from https://www.msu.edu/~hardiso2/Hardison.ICSLP2006.pdf

Hardison, D. M. (2007). The visual element in phonological perception and learning. In M. C. Pennington (Ed.), Phonology in context (pp. 135–158). Basingstoke, England: Palgrave Macmillan.

Hardison, D. M. (2008, March). The priming role of context and visual cues in spoken language processing by native and non-native speakers. Paper presented at the American Association for Applied Linguistics Conference, Washington, D.C.

Hardison, D. M. and Motohashi Saigo, M. (2010). Development of perception of L2 Japanese geminates: Role of duration, sonority, and segmentation strategy. Applied Psycholinguistics, 31(1), 81–99.

Hazan, V., Sennema, A., Iba, M., and Faulkner, A. (2005). Effect of audiovisual perceptual training on the perception and production of consonants by Japanese learners of English. Speech Communication, 47(3), 360–378.

Hickok, G., Buchsbaum, B., Humphries, C., and Muftuler, T. (2003). Auditory-motor interaction revealed by fMRI: Speech, music, and working memory in area Spt. Journal of Cognitive Neuroscience, 15(5), 673–682.

Hintzman, D. L. (1986). “Schema abstraction” in a multiple-trace memory model. Psychological Review, 93(4), 411–428.

Højen, A. and Flege, J. E. (2006). Early learners’ discrimination of second-language vowels. Journal of the Acoustical Society of America, 119(5), 3072–3084.

Homa, D. and Cultice, J. C. (1984). Role of feedback, category size, and stimulus distortion on the acquisition and utilization of ill-defined categories. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10(1), 83–94.

Ingram, J. C. L. and Park, S. -G. (1998). Language, context, and speaker effects in the identification and discrimination of English /r/ and /l/ by Japanese and Korean listeners. Journal of the Acoustical Society of America, 103(2), 1161–1174.

Jusczyk, P. W. (1993). From general to language-specific capacities: The WRAPSA model of how speech perception develops. Journal of Phonetics, 21(1–2), 3–28.

Kelly, S. D., McDevitt, T., and Esch, M. (2009). Brief training with co-speech gesture lends a hand to word learning in a foreign language. Language and Cognitive Processes, 24(2), 313–334.

Kipp, M. (2001). Anvil - A generic annotation tool for multimodal dialogue. In Proceedings of the 7th European Conference on Speech Communication and Technology (pp. 1367–1370). Aalborg, Denmark: Eurospeech.

Kuhl, P. K., Andruski, J. E., Chistovich, I. A., Chistovich, L. A., Kozhevnikova, E. V., and Ryskina, V. L. (1997). Cross-language analysis of phonetic units in language addressed to infants. Science, 277(5326), 684–686.

Kuhl, P. K., Conboy, B. T., Coffey-Corina, S., Padden, D., Rivera-Gaxiola, M., and Nelson, T. (2008). Phonetic learning as a pathway to language: New data and native language magnet theory expanded (NLM-e). Philosophical Transactions of the Royal Society B, 363(1493), 979–1000.

Kuhl, P. K. and Iverson, P. (1995). Linguistic experience and the “Perceptual Magnet Effect”. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 121–154). Timonium, MD: York Press.

Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., and Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2), F13–F21.

Leather, J. (1990). Perceptual and productive learning of Chinese lexical tone by Dutch and English speakers. In J. Leather and A. James (Eds.), New Sounds 90 (pp. 72–97). Amsterdam: University of Amsterdam.

Lively, S. E., Logan, J. S., and Pisoni, D. B. (1993). Training Japanese listeners to identify English /r/ and /l/. II: The role of phonetic environment and talker variability in learning new perceptual categories. Journal of the Acoustical Society of America, 94(3), 1242–1255.

Lively, S. E., Pisoni, D. B., Yamada, R. A., Tohkura, Y., and Yamada, T. (1994). Training Japanese listeners to identify English /r/ and /l/. III: Long-term retention of new phonetic categories. Journal of the Acoustical Society of America, 96(4), 2076–2087.

Logan, J. S., Lively, S. E., and Pisoni, D. B. (1991). Training Japanese listeners to identify English /r/ and /l/: A first report. Journal of the Acoustical Society of America, 89(2), 874–886.

MacKain, K. S., Best, C. T., and Strange, W. (1981). Categorical perception of English /r/ and /l/ by Japanese bilinguals. Applied Psycholinguistics, 2(4), 369–390.

MacKay, I., Meador, D., and Flege, J. (2001). The identification of English consonants by native speakers of Italian. Phonetica, 58(1–2), 103–125.

Massaro, D. W. (1998). Perceiving talking faces: From speech perception to a behavioral principle. Cambridge, MA: MIT Press.

Massaro, D. W. and Light, J. (2003). Read my tongue movements: Bimodal learning to perceive and produce non-native speech /r/ and /l/. In Proceedings of Eurospeech (Interspeech), 8th European Conference on Speech Communication and Technology (CD-ROM). Geneva, Switzerland.

McCandliss, B. D., Fiez, J. A., Protopapas, A., and Conway, M. (2002). Success and failure in teaching the [r]-[l] contrast to Japanese adults: Tests of a Hebbian model of plasticity and stabilization in spoken language perception. Cognitive, Affective, and Behavioral Neuroscience, 2(2), 89–108.

McGurk, H. and MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.

Meador, D., Flege, J., and MacKay, I. (2000). Factors affecting the recognition of words in a second language. Bilingualism: Language and Cognition, 3(1), 55–67.

Meltzoff, A. N. and Kuhl, P. K. (1994). Faces and speech: Intermodal processing of biologically relevant signals in infants and adults. In D. J. Lewkowitz and R. Lickliter (Eds.), The development of intersensory perception: Comparative perspectives (pp. 335–369). Hillsdale, NJ: Erlbaum.

Mochizuki, M. (1981). The identification of /r/ and /l/ in natural and synthesized speech. Journal of Phonetics, 9(3), 282–303.

Morosan, D. E. and Jamieson, D. G. (1989). Evaluation of a technique for training new speech contrasts: Generalization across voices, but not word-position or task. Journal of Speech and Hearing Research, 32(3), 501–511.

Motohashi Saigo, M. and Hardison, D. M. (2009). Acquisition of L2 Japanese geminates: Training with waveform displays. Language Learning and Technology, 13(2), 29–47. Retrieved August 13, 2010 from http://llt.msu.edu/vol13num2/motohashisaigohardison.pdf

Munhall, K. G. and Tohkura, Y. (1998). Audiovisual gating and the time course of speech perception. Journal of the Acoustical Society of America, 104(1), 530–539.

Norman, K. and O'Reilly, R. (2003). Modeling hippocampal and neocortical contributions to recognition memory: A complementary learning-systems approach. Psychological Review, 110(4), 611–646.

Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115(1), 39–57.

O'Brien, I., Segalowitz, N., Freed, B., and Collentine, J. (2007). Phonological memory predicts second language oral fluency gains in adults. Studies in Second Language Acquisition, 29(4), 557–581.

Perani, D. and Abutalebi, J. (2005). The neural basis of first and second language processing. Current Opinion in Neurobiology, 15(2), 202–206.

Pisoni, D. B., Aslin, R. N., Perey, A. J., and Hennessy, B. L. (1982). Some effects of laboratory training on identification and discrimination of voicing contrasts in stop consonants. Journal of Experimental Psychology: Human Perception and Performance, 8(2), 297–314.

Pisoni, D. B., Nusbaum, H. C., Luce, P. A., and Slowiaczek, L. M. (1985). Speech perception, word recognition and the structure of the lexicon. Speech Communication, 4(1–3), 75–95.

Prator, C. H. and Robinett, B. W. (1985). Manual of American English pronunciation (Fourth Edition). Orlando, FL: Harcourt Brace.

Rivera-Gaxiola, M., Silva-Pereyra, J., and Kuhl, P. K. (2005). Brain potentials to native and non-native speech contrasts in 7- and 11-month-old American infants. Developmental Science, 8(2), 162–172.

Rosenblum, L. D. (2004). Primacy of multimodal speech perception. In D. B. Pisoni and R. E. Remez (Eds.), The handbook of speech perception (pp. 51–78). Malden, MA: Blackwell.

Scovel, T. (1969). Foreign accents, language acquisition, and cerebral dominance. Language Learning, 19(3–4), 245–253.

Sheldon, A. and Strange, W. (1982). The acquisition of /r/ and /l/ by Japanese learners of English: Evidence that speech production can precede speech perception. Applied Psycholinguistics, 3(3), 243–261.

Skipper, J. I., van Wassenhove, V., Nusbaum, H. W., and Small, S. L. (2007). Hearing lips and seeing voices: How cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex, 17(10), 2387–2399.

Strange, W. and Dittmann, S. (1984). Effects of discrimination training on the perception of /r-l/ by Japanese adults learning English. Perception and Psychophysics, 36(2), 131–145.

Strange, W. and Shafer, V. L. (2008). Speech perception in second language learners: The re-education of selective perception. In J. G. Hansen Edwards and M. L. Zampini (Eds.), Phonology and second language acquisition (pp. 153–191). Amsterdam: Benjamins.

Sueyoshi, A. and Hardison, D. M. (2005). The role of gestures and facial cues in second language listening comprehension. Language Learning, 55(4), 661–699.

Tallal, P., Miller, S. L., Bedi, G., Byma, G., Wang, X., and Nagarajan, S. S. (1996). Language comprehension in language-learning impaired children improved with acoustically modified speech. Science, 271(5245), 81–84.

Tremblay, K. L. and Kraus, N. (2002). Auditory training induces asymmetrical changes in cortical neural activity. Journal of Speech, Language, and Hearing Research, 45(3), 564–572.

Uther, M., Knoll, M. A., and Burnham, D. (2007). Do you speak E-NG-L-I-SH? A comparison of foreigner- and infant-directed speech. Speech Communication, 49(1), 2–7.

van Wassenhove, V., Grant, K. W., and Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences, 102(4), 1181–1186.

Wang, Y., Jongman, A., and Sereno, J. A. (2003). Acoustic and perceptual evaluation of Mandarin tone productions before and after perceptual training. Journal of the Acoustical Society of America, 113(2), 1033–1043.

Werker, J. F. and Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7(1), 49–63.

Yamada, R. A. (1995). Age and acquisition of second language speech sounds: Perception of American English /r/ and /l/ by native speakers of Japanese. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 305–320). Timonium, MD: York Press.