7
Phonetics and the auditory system

Matthew B. Winn and Christian E. Stilp

Introduction

Why combine these topics?

Rosen and Fourcin (1986) wrote, “trying to understand the auditory processes involved in the perception of speech is likely to lead not only to a better understanding of hearing, but also, of speech itself” (p. 373). In this chapter, we review some basic principles of the auditory system so that the perception and processing of speech can be understood within a biological framework. There are some ways that traditional acoustic descriptions of speech fail to capture how sounds are handled by the auditory system; this chapter contains some advice on how to approach experimental phonetics with the auditory system in mind. Rather than a comprehensive review of the auditory system or acoustic phonetics, this chapter focuses on a few basic principles that would be (a) essential bits of knowledge for anyone doing auditory-perceptual work, (b) useful for design and analysis of phonetics experiments, and (c) helpful for an auditory scientist interested in how basic auditory functions like spectral and temporal coding could play a role in the encoding of speech information. The scope will be limited to the factors that affect segmental/phonetic aspects of perception rather than prosody, voice quality, or tone.

Sound frequency, intensity, and timing are all processed nonlinearly, meaning that equivalent changes in the acoustic input do not map onto equivalent changes in the perceptual interpretation. Furthermore, perception of each of these acoustic properties depends on each of the others, and the encoding of a signal can unfold dramatically differently depending on what came before or after it. The spectral and temporal processing of the auditory system has meaningful implications for the processing of speech in particular, since speech is replete with meaningful changes that span wide frequency ranges, occur on multiple timescales, and fluctuate greatly in amplitude. In the next sections, we will review some of the most fundamental aspects of nonlinear auditory processing.

Nonlinear processing of frequency

Sound frequency is an organizing principle throughout the entire auditory system. The cochlea – the signature structure of the inner ear – acts as a mechanical spectrum analyzer, separating and ordering the frequency components in a way that is similar to a spectrum or a piano. However, unlike a conventional spectrum on a computer screen, the frequencies in the cochlea are arranged mostly logarithmically, meaning that equal spacing along the cochlea corresponds to equivalent proportional changes in frequency, rather than equivalent raw changes in frequency (Figure 7.1). For example, the space between cochlear locations for 500 Hz and 1000 Hz is much larger (100% change) than the space between areas for 4000 Hz and 4500 Hz (12.5% change), despite the equal linear change of 500 Hz. For this reason, many auditory researchers compare frequencies based on cochlear space using the function specified by Greenwood (1990), perceptual space using equivalent rectangular bandwidths (Glasberg and Moore, 1990), the mel scale (Stevens et al., 1937), or the Bark frequency scale (Zwicker, 1961; Traunmüller, 1990). These scales account for the expanded representation of low frequencies in the cochlea that gradually tapers off for higher frequencies.

Phoneticians can consider auditory frequency spacing in the design and analysis of speech stimuli in perceptual experiments. For example, a continuum of first formant (F1) frequencies might be easily generated as 300, 400, 500, 600, 700 and 800 Hz, but these steps are not equally spaced in the human auditory system. A continuum with values of 300, 372, 456, 554, 668 and 800 Hz would occupy equidistant intervals along a typical adult human cochlea. These numbers were generated from the Greenwood (1990) formula with standard parameters for an adult 35 mm-long human cochlea. One could translate frequencies and cochlear positions using functions written for the R programming language, with sensible default values for the formula parameters declared in the opening lines.
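A minimal sketch of such functions, assuming the standard Greenwood parameter values (A = 165.4, a = 2.1, k = 0.88) and a 35 mm cochlea, is shown below; the variable names are illustrative.

```r
# Standard Greenwood (1990) parameters for an adult human cochlea
A <- 165.4             # scaling constant (Hz)
a <- 2.1               # slope constant (per proportion of cochlear length)
k <- 0.88              # low-frequency correction constant
cochlea_length <- 35   # cochlear length in mm

position_to_frequency <- function(position_mm) {
  x <- position_mm / cochlea_length   # proportion of distance from the apex
  A * (10 ^ (a * x) - k)              # frequency in Hz
}

frequency_to_position <- function(frequency_hz) {
  x <- log10(frequency_hz / A + k) / a
  x * cochlea_length                  # position in mm from the apex
}

# Example: an F1 continuum with steps equally spaced along the cochlea
steps_mm <- seq(frequency_to_position(300), frequency_to_position(800),
                length.out = 6)
round(position_to_frequency(steps_mm))   # 300 372 456 554 668 800
```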

Figure 7.1 The place of cochlear excitation for various sound frequencies arranged on a linear scale (left panel) and a logarithmic scale (right panel), demonstrating that cochlear treatment of frequency is proportional rather than absolute.

Figure 7.2 Neurogram (top) and spectrogram (bottom) of the word “choice.”

The nonlinear arrangement of frequency and energy in the auditory system implies that the spectrogram – probably the most common way of viewing and analyzing speech signals – offers a distorted view of the signal in the ear. Scaling the spectrogram’s frequency axis to correspond with cochlear spacing and expressing energy as neural activation rather than analytical signal energy produces a neurogram, which illustrates the signal from the perspective of the auditory system. Figure 7.2 shows a neurogram along with its corresponding spectrogram for the word “choice.” The vowel dynamics, leading and trailing frication noise, and glottal pulses are visible in each image, but are spatially warped in the neurogram. Note also the “background” of resting-state neural activity, which allows activity to both increase and decrease in order to signal change in the input.

A clearer understanding of auditory frequency spacing is helpful not just for generating acoustic perceptual stimuli, but also for filtering or analyzing bands of frequency information (e.g., for contributions to intelligibility or other outcome measures). Suppose that the F1 frequency varies between 300 and 1000 Hz, and the F2 frequency varies between 1000 and 2500 Hz. The F1 range seems smaller, but it spans a distance of 6.84 mm in the standard adult cochlea, whereas the apparently larger 1000–2500 Hz range spans just 6.06 mm. More cochlear space is dedicated to F1 after all!
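These spans can be verified directly with the frequency_to_position function sketched earlier:

```r
# Cochlear extent (in mm) of the F1 and F2 ranges described above
f1_span <- frequency_to_position(1000) - frequency_to_position(300)   # ~6.84 mm
f2_span <- frequency_to_position(2500) - frequency_to_position(1000)  # ~6.06 mm
round(c(F1 = f1_span, F2 = f2_span), 2)
```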

Pitch perception

Pitch is a complicated concept because it involves both spectral and temporal cues, and because related terms are sometimes used haphazardly. Individual components (e.g., sinusoids) of a complex sound have frequency, which is the measurable repetition rate. A complex sound with a large number of components at integer multiples of the fundamental frequency (like voiced speech) has a more complicated waveform pattern, but still a general rate of repetition, which is its fundamental frequency (f0). f0 can also be derived from the spacing between the harmonics, and usually from the frequency of the lowest harmonic component. Pitch has a more slippery definition; it is the subjective perceptual quality of a sound that scales closely with fundamental frequency, and which generally gives rise to melody.

Deriving pitch from the sound spectrum can be somewhat misleading. Even though the spectrum might cleanly distinguish all the harmonic components of a voice, the auditory system cannot encode every spectral component separately. Only the lowest seven or eight harmonics are proportionally far enough apart that they each activate a separate auditory filter (i.e., a separate place on the basilar membrane in the cochlea). The difference between the first and second harmonic is 100% (e.g., 100–200 Hz); the difference between the second and third is 50% (200–300 Hz), and the proportional change continues to decrease (33%, 25%, 20%, and so on). Around the eighth harmonic, the proportional difference is smaller than the proportional width of the resolving power of the cochlea (roughly 12–14%), so the higher harmonics are not represented distinctly in the auditory system, and are said to be unresolved (Moore et al., 1986). Interestingly, this upper limit holds true even for high-pitched voices, whose harmonics extend into higher-frequency regions of the cochlea, but are also spaced farther apart, therefore preserving resolution of only the first approximately seven harmonics.
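A rough way to see where resolvability breaks down is to compare the spacing between neighboring harmonics against the equivalent rectangular bandwidth (ERB) of the auditory filter centered on each harmonic (Glasberg and Moore, 1990). The tabulation below is only an approximation of the argument in the text, not a formal model of resolvability.

```r
# How harmonic spacing compares to auditory filter bandwidth (ERB) for a
# 100-Hz fundamental. ERB formula from Glasberg and Moore (1990). The spacing
# falls to roughly one filter bandwidth around the seventh or eighth harmonic,
# which is approximately where harmonics stop being resolved.
erb <- function(f_hz) 24.7 * (4.37 * f_hz / 1000 + 1)

f0 <- 100
n  <- 1:10
data.frame(harmonic         = n,
           freq_hz          = n * f0,
           spacing_over_erb = round(f0 / erb(n * f0), 2))
```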

f0 is coded primarily by temporal cues. It is no coincidence that the upper limit of fundamental frequencies perceived to have tonal quality is around 5000 Hz, which is also the upper limit of the auditory system’s ability to lock to the repetition rate of the sound waveform. For higher-order harmonics, the exact placement of excitation is not resolved, but the interaction of multiple components within one auditory filter creates amplitude fluctuations that repeat at the same rate as the fundamental (try adding sounds of 2000 and 2200 Hz; you will get a pattern that repeats 200 times per second). Therefore, one can derive the pitch by attending to temporal fluctuations in any part of the audible spectrum. However, pitch perception is markedly better when listeners have access to the lower-frequency resolved harmonics; the two cues appear to be complementary. Because cues for pitch are distributed across the whole spectrum, pitch information cannot be neutralized by simply removing or masking low-frequency components. Even without the first harmonic present, the fundamental (and therefore the pitch) remains the same, because the basic repetition rate of the complex wave remains the same. Consider how you can still identify different talkers and distinguish questions from statements when listening to an adult over the telephone, which filters out components below 300 Hz. Therefore, for an experimenter to control or neutralize the influence of pitch, it is not enough to filter out or mask out the fundamental and lowest harmonics (as done in numerous studies in the literature).
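The parenthetical suggestion above can be carried out directly; the short sketch below simply reproduces that example.

```r
# Sum of 2000-Hz and 2200-Hz tones: the envelope of the combined waveform
# fluctuates at the 200-Hz difference frequency.
fs <- 44100                          # sampling rate (Hz)
t  <- seq(0, 0.02, by = 1 / fs)      # 20 ms of signal
x  <- sin(2 * pi * 2000 * t) + sin(2 * pi * 2200 * t)
plot(t * 1000, x, type = "l", xlab = "Time (ms)", ylab = "Amplitude")
# Four envelope peaks in 20 ms correspond to 200 fluctuations per second
```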

Listeners with typical (“normal”) hearing can distinguish complex signals with just a 0.2% change in fundamental (Moore et al., 1984). Linguistically relevant acoustic contrasts generally do not require such fine-grained discrimination. Although listeners with typical hearing can perceive such minuscule changes in steady-state vowels (Klatt, 1973), this ability grows poorer – requiring about 6% change – when (synthetic) speech contains dynamic pitch movements (‘t Hart, 1981). Rosen and Fourcin (1986) point out that there is also roughly 6% variation in f0 when talkers repeat the same speech segment, according to analysis of vowels from the classic study by Peterson and Barney (1952). Rosen and Fourcin’s work demonstrates the power of perceptual analysis driven by analysis of the acoustic output itself. These studies collectively suggest that only f0 differences of 6% or more are likely to play a meaningful role in the transmission of speech information.

Pitch is an intensely studied aspect of audition partially because it serves as a mechanism to bind together multiple components of a complex sound so that they are perceived as a unified auditory stream (Culling and Darwin, 1993). When listening to two talkers simultaneously, intelligibility improves considerably when the two voices have greater separation of pitch (Brokx and Nooteboom, 1982), which partially explains why it is easier to hear one talker when a background talker is of a different gender (Brungart, 2001). Additionally, acoustic-phonetic cues that are carried by pitch are more robust to noise, even if they are ignored when the signal is in quiet (Wardrip-Fruin, 1985; Winn et al., 2013).

Perception of the spectrum becomes unclear when intensity is increased

At low sound intensities, active mechanisms of outer hair cells in the inner ear sharpen the peak of excitation in the cochlea, creating precise coding of frequency. Higher-intensity sounds activate a wider span of cochlear space, similar to the wide disturbance of a pond when a large stone is dropped in it. This means that frequency coding could be jeopardized when the sound is presented loudly. Figure 7.3 illustrates how auditory excitation in the cochlea grows to be not only wider (known as spread of excitation), but increasingly asymmetrical with greater intensity. Low-frequency sounds are relatively more disruptive than high-frequency sounds because the traveling wave of sound energy begins at the base of the cochlea (where high frequencies are encoded) and passes through those higher-frequency regions on its way to its characteristic place of stimulation. This phenomenon is called “upward spread of masking.” Illustrations of excitation patterns in Figure 7.3, as well as other auditory responses elsewhere in this chapter, were generated using the auditory periphery model of Zilany et al. (2014). The chaotic pattern at very high intensities is due to multiple modes of vibration in the basilar membrane response, in addition to the nonlinearities in inner hair cell discharge (modeled by Zilany and Bruce, 2007). These patterns are likely to contribute to the degradation in perception at very high intensities, known in audiology as the “roll-over” effect, observed when speech is presented above normal conversational levels.

Upward spread of masking directly impacts perception of formants in speech. Like most other natural sounds, speech has greater energy in lower-frequency regions. The first formant can therefore spread excitation upward to mask the second and third formants. The amount of masking is extraordinary – about 40 dB – as originally calculated by Rand (1974), who presented F1 and the upper formants to opposite ears and thereby observed a 40 dB release from masking. A series of experiments using this principle eventually grew into the “duplex perception” studies, which played a role in the theoretical debate over the nature of phonetic representation and speech perception in general. But they began as a simple exploration of basic auditory principles.

Understanding the spread of cochlear excitation helps us to think of formants not as specific frequencies, but as regions that are less precise and more overlapping than a conventional spectrum would suggest. Zahorian and Jagharghi (1993) suggested a model of vowel recognition based on log-amplitude-scaled spectral shapes rather than formants. The Bark frequency scale further aids the intuition for the frequency separation needed to perceive the vowel spectrum. Syrdal and Gopal (1986) suggested a framework whereby spectral peaks (formants) would be interpreted not as absolute values but in terms of relative distance to other peaks in auditory-perceptual space (or, arguably, physical space in the cochlea). A “critical distance” of 3–3.5 Bark was found to categorically distinguish vowel height (Bark f0–F1 distance) and advancement (Bark F2–F3 distance). Several papers focus specifically on connecting the ideas of frequency integration and vowel perception (for reference, see Darwin and Gardner, 1986; Assmann and Nearey, 1987; Kiefte et al., 2010).
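As an illustration of the Bark critical-distance idea, the Traunmüller (1990) approximation can convert frequencies to Bark; the formant values below are hypothetical, chosen only to show the computation.

```r
# Hz-to-Bark conversion using the Traunmüller (1990) approximation
hz_to_bark <- function(f_hz) 26.81 * f_hz / (1960 + f_hz) - 0.53

# Hypothetical, roughly /i/-like values (for illustration only)
f0 <- 120; F1 <- 270; F2 <- 2300; F3 <- 3000

bark_f1_f0 <- hz_to_bark(F1) - hz_to_bark(f0)   # height dimension
bark_f3_f2 <- hz_to_bark(F3) - hz_to_bark(F2)   # advancement dimension

# Both distances fall under the 3-3.5 Bark critical distance, patterning with
# the "high" and "front" classes in the Syrdal and Gopal (1986) framework
round(c(F1_minus_f0 = bark_f1_f0, F3_minus_F2 = bark_f3_f2), 2)
```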

Figure 7.3 Auditory excitation patterns (inner hair cell discharge) for a simple 500 Hz pure tone, a 500 Hz harmonic complex, and the vowel /i/ (with 115 Hz fundamental frequency), each labeled at the right edge of each panel. The lowest panels show the analytical power spectrum of the /i/ vowel. Different lines represent different input intensity levels at 10-dB intervals. Larger intensities elicit smoother and less precise excitation patterns, including patterns with peaks unrelated to the original components.

The biological framework of formants as broad areas of cochlear excitation can help to reconcile different proposed perceptual cues for phonemes. Consider the debate over acoustic cues for place of articulation in stops. Stevens and Blumstein described static cues according to gross spectral shape, which has good explanatory power in the acoustic (Stevens and Blumstein, 1978) and perceptual (Blumstein and Stevens, 1980) domains. Kewley-Port (1983) provided an account of this phonetic contrast by instead focusing on the dynamic transitions in formant frequencies. If upper formants are considered as areas of cochlear excitation, it is much easier to see the similarities between formants and the spectral shapes as described by Stevens and Blumstein: Regardless of the acoustic measurement, spectral components as high as F2 or F3 will necessarily be transformed by the cochlea into rather gross areas of excitation, or arguably “spectral shapes.”

A more recent theory of vowel coding in the auditory system (Carney et al., 2015) shifts the focus away from formant peaks altogether. The high intensity of formant peaks would saturate inner hair cell discharge rates, so the peaks themselves would not be coded robustly at higher levels of the auditory system (e.g., the inferior colliculus) that are selectively tuned to detect modulation rather than absolute energy. However, the downward-sloping energy on the upper and lower edges of formants contains fluctuations in amplitude consistent with pitch periodicity, and would therefore project a fluctuating signal to which auditory midbrain neurons would respond. The model and empirical data presented by Carney et al. (2015) suggest that the space between formants is the property most strongly represented at upper levels of the auditory system. This framework would be consistent with basic edge-detection principles that have prevailed in the visual sciences.

Not all frequencies are equally audible or equally loud

Different frequencies require different amounts of sound pressure level in order to be audible, or to be equal in loudness. Figure 7.4 illustrates the dependence of loudness judgments on sound frequency, based on the current international standard (ISO 226), which was inspired by seminal work by Fletcher and Munson (1933) and Robinson and Dadson (1956). These equal-loudness contours have practical implications for normalizing sound levels and for delivering sound through specialized earphones. In situations where stimuli need to be shaped to match specific loudness perceptions, measuring sound pressure level alone is not sufficient; one must also consult the equal-loudness contours to ensure that any attenuation reflects the frequency content of the signal being played. Equal-loudness contours also factor into the frequency-specific energy weightings used to estimate the loudness of a broadband sound (the dBA scale for low- or moderate-intensity sounds and the dBC scale for loud sounds). The influence of frequency on loudness is less pronounced at higher levels; a high-intensity sound is loud, no matter what the frequency.

Figure 7.4 Equal-loudness contours. Each line indicates the sound pressure levels required at different frequencies to produce a fixed level of perceived loudness. The thickened lines correspond to uncomfortably loud sounds (top), medium-level sounds (middle), and the absolute thresholds for detection of different frequencies (bottom).
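As one concrete example of frequency-specific weighting, the standard A-weighting curve (IEC 61672) approximates a moderate-level equal-loudness contour by de-emphasizing low frequencies. The sketch below is offered only as an illustration of such weighting, not as a substitute for consulting the ISO 226 contours themselves.

```r
# A-weighting (in dB) as specified in IEC 61672, which de-emphasizes low
# frequencies in rough accordance with moderate-level equal-loudness contours.
a_weight_db <- function(f) {
  ra <- (12194^2 * f^4) /
        ((f^2 + 20.6^2) *
         sqrt((f^2 + 107.7^2) * (f^2 + 737.9^2)) *
         (f^2 + 12194^2))
  20 * log10(ra) + 2.00   # offset makes the weight 0 dB at 1000 Hz
}

round(a_weight_db(c(100, 500, 1000, 4000)), 1)
# roughly -19.1, -3.2, 0.0, and +1.0 dB, respectively
```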

The interaction between spectrum and time

Perception of the sound spectrum becomes more accurate with increasing duration. Moore (2013) suggests that the gradual sharpening of spectral estimation unfolds over the course of a 30 ms time window, which happens to correspond to various meaningful time windows for phonetic perception. For example, it is the minimal stimulus duration required for accurate perception of sibilant fricatives (Jongman, 1989), formant trajectories in stop consonants (Blumstein and Stevens, 1979), vowel onglides and offglides (Assmann, 1996), and voicing distinctions in intervocalic or post-vocalic fricatives (Stevens et al., 1992). It was also the time window used by Dubno and Levitt (1981) in their detailed acoustic analysis of consonants in perceptual experiments. It is also close to the short “phonetic” auditory temporal window proposed by Poeppel and Hickok (2004) and the 20-ms boundary proposed to be a natural discontinuity in auditory temporal processing by Holt and colleagues (2004). Jongman (1989) points out that, at least among fricatives, place of articulation – traditionally thought to be cued spectrally – is much more affected by brevity of stimulus duration than other features such as voicing or manner. This observation is consistent with the cueing of the velar consonants /ɡ/ and /k/, which have a distinct, relatively compact spectral peak in the mid-frequency range that demands longer exposure to establish the precision of the spectrum. Fittingly, the duration of the burst and aspiration of these consonants is longer than that of their counterparts at other places of articulation (Blumstein and Stevens, 1980).

Temporal information

At its very core, speech is a temporal signal with multiple layers of amplitude modulations. Temporal cues are essential components of familiar phonetic contrasts such as voice onset time (VOT; duration of aspiration, temporal amplitude modulation between burst and vowel onset), word-medial contrasts in voiced and voiceless stops (duration of closure gap), fricative/affricate distinctions (presence or absence of silent closure gap, speed of frication rise time), word-final consonant voicing perception (duration of preceding vowel), and the speed of formant transitions (glide/stop contrasts), to name a few. Even cues that are seemingly spectral in nature – such as formants – can be described as temporally modulated energy in adjacent auditory filters.

Temporal cues can be classified into broad categories based on the speed of modulation. Rosen (1992) provided a thorough explanation of these categories and their translation to linguistic properties. Very slow (2–50 Hz) modulations comprise the signal envelope, which can convey information about syllable rate as well as manner of articulation and segment duration. Faster modulations, on the order of 50–500 Hz, convey periodicity and can transmit information about voice pitch. For this reason, a voiceless fricative /s/ that is artificially modulated at 100 Hz can sound like a voiced fricative /z/ even without the full complement of acoustic cues for glottal voicing. Waveform changes that happen faster than 500 Hz are generally described as temporal fine structure, and correspond to cues that are commonly thought to be “spectral” in nature, such as place of articulation. Different categories of waveform modulation are illustrated in Figure 7.5.
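A sketch of the /s/-to-/z/ manipulation described above: imposing a 100-Hz periodicity on a noise (standing in for frication) amounts to multiplying it by a modulator at that rate. The modulation depth and duration below are illustrative values, not taken from any particular study.

```r
# Impose 100-Hz periodicity on a noise, as in the /s/-versus-/z/ demonstration
# described above. The noise stands in for frication; depth and duration are
# illustrative values.
fs    <- 44100
dur   <- 0.3                                  # 300 ms of "frication"
t     <- seq(0, dur, by = 1 / fs)
noise <- rnorm(length(t))                     # broadband noise carrier
mod   <- 1 + 0.8 * sin(2 * pi * 100 * t)      # 100-Hz modulator (80% depth)
modulated <- noise * mod                      # now carries a periodicity cue
```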

Although a storied history of speech perception research (namely, the work that began at Haskins Laboratories) focused on the spectral characteristics of speech, there are successful models of speech perception grounded in auditory principles that give special emphasis to temporal modulations (Dau et al., 1997). Furthermore, Giraud and Poeppel (2012) and Peelle and Davis (2012) describe how neural oscillations on different timescales might facilitate speech perception (see also Chapter 8, this volume). They suggest that oscillations are entrained to fluctuations in the amplitude envelope of the speech signal at the levels of linguistic units like phonemes, syllables, etc. (but see Obleser et al., 2012 for a supportive but critical interpretation). These models suggest that when comparing sensitivity to different kinds of linguistic segments (e.g., syllables, fricative sounds, stop sounds), one must consider the general rate of change in the time domain. Beyond serving as merely a descriptive tool, amplitude modulation spectra over very coarse frequency ranges have been used to efficiently model perception of consonants and vowels (Gallun and Souza, 2008).

Figure 7.5 The sentence “A zebra is an animal that lives in Africa.” The envelope is illustrated as a thick line above the full waveform, and the temporal fine structure for a 20-ms segment indicated by the narrow slice is illustrated in the zoomed-in inset box.

The encoding of sound energy over time does not perfectly match the intensity of the sound waveform. First, the auditory system gives special emphasis to sound onsets and changes in amplitude, as will be further discussed later in the chapter. Also, the mechanical nature of the inner ear results in some sustained activation after a sound, called “ringing,” that can interfere with perception of subsequent sounds presented in quick succession. This type of interference – called “forward masking” – can last upwards of 100 ms (Jesteadt et al., 1982), and can result in different detection thresholds for speech in isolation compared to speech in the context of other signals. Consonants – being relatively lower in intensity and shorter in duration – are more susceptible to forward masking than vowels (Fogerty et al., 2017). This might explain why consonants are relatively more difficult to perceive at syllable offset position compared to syllable onset position (Dubno et al., 1982).

Decomposing the temporal envelope

The complete waveform can be misleading. In the auditory system, it is decomposed by numerous auditory filters, and the output of each filter might look dramatically different from the full waveform you see on your computer screen. Figure 7.6 illustrates the word “sing,” shown as a full waveform and after being sent through a small number of band-pass filters, each responsible for a narrow region of frequencies. With this in mind, it is useful to be skeptical of descriptions of complete waveforms, as they have no true corresponding biological representation beyond the middle ear.

Decomposing the signal into a small number of frequency bands with sparsely coded envelopes has been useful for the development of sound processing in cochlear implants (Loizou et al., 1998), whose processing strategies can be traced back directly to the development of the Vocoder (Dudley, 1939). It is a way of simplifying the signal since the fine structure is discarded in favor of a low-fidelity envelope that transmits only crude information about how the frequency bands change in amplitude over time. As long as there is no interfering noise, these simplified envelopes can support excellent speech recognition with as few as four to eight frequency bands (Shannon et al., 1995; Friesen et al., 2001).
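As a sketch of this style of processing (not any particular device’s strategy), the steps below divide a signal into a few bands spaced equally in cochlear distance (reusing the Greenwood helper functions sketched earlier), extract each band’s envelope by rectification and low-pass filtering, and use the envelopes to modulate band-limited noise. It assumes the R signal package; the channel count, band edges, and cutoff values are illustrative.

```r
# A minimal noise-vocoder-style sketch: analysis bands with equal cochlear
# spacing, envelope extraction by rectification + low-pass filtering, and
# noise carriers modulated by those envelopes.
library(signal)

vocode <- function(x, fs, n_channels = 6, lo = 200, hi = 7000, env_cut = 50) {
  # Band edges equally spaced along the cochlea between `lo` and `hi` Hz
  edges_mm <- seq(frequency_to_position(lo), frequency_to_position(hi),
                  length.out = n_channels + 1)
  edges_hz <- position_to_frequency(edges_mm)
  env_lp   <- butter(2, env_cut / (fs / 2), type = "low")   # envelope smoother
  out      <- numeric(length(x))

  for (ch in seq_len(n_channels)) {
    band_filt <- butter(2, c(edges_hz[ch], edges_hz[ch + 1]) / (fs / 2),
                        type = "pass")
    band    <- filtfilt(band_filt, x)                   # analysis band
    env     <- filtfilt(env_lp, pmax(band, 0))          # rectify + low-pass
    env     <- pmax(env, 0)
    carrier <- filtfilt(band_filt, rnorm(length(x)))    # band-limited noise
    out     <- out + env * carrier                      # modulated carrier
  }
  out / max(abs(out))                                   # normalize peak level
}

# Example usage with a synthetic 150-Hz harmonic complex as the input
fs <- 44100
t  <- seq(0, 0.5, by = 1 / fs)
x  <- rowSums(sapply(1:20, function(h) sin(2 * pi * 150 * h * t)))
y  <- vocode(x, fs)
```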

Summary

With all of the transformations that occur in the auditory system, our knowledge of relevant acoustic cues for phonemes is seriously challenged by a need to translate acoustic measurements into properties that are actually encoded by the sensory system. The next section walks us through how this can be accomplished, with the aid of decades of work in other sensory systems engaged in the same kind of computational problem.

Figure 7.6 A complex sound decomposed into constituent frequency bands by the auditory system. Each band has its own envelope of energy over time, implying that the overall envelope on the composite waveform is typically not represented in the cochlea. The filter bands shown here are merely a six-channel view for simplicity, using frequency ranges of equal cochlear space. The typical auditory system would have many more frequency filters whose widths are narrower, but similarly proportional to their central frequencies.

Sensory systems as change detectors

Once the signal from the inner ear is transduced into action potentials at the auditory nerve, the auditory system gets quite complex. Information is relayed to the cochlear nucleus, superior olivary complex, inferior colliculus, and the medial geniculate body of the thalamus before reaching the brain. As further described in Chapter 8 of this volume, at every stage along the way, the neural code is repackaged so that more sophisticated aspects of the input can be represented. This is by far the most extensive subcortical processing in any sensory modality. Additionally, a dense network of feedforward, feedback, and lateral connections exists between auditory nuclei, so much so that nearly every figure depicting this network is noted as being highly simplified.

Given this complexity, how is one to understand how speech and other complex sounds are processed? This question has benefited from an unlikely source of inspiration: Claude Shannon’s (1948) mathematical approach to information transmission in a communication system (i.e., “information theory”). In this approach, the amount of information transmitted is inversely related to the predictability of the message. If the transmitter’s message is entirely predictable (e.g., “Thursday comes after Wednesday”), there is no new information for the receiver because it conveys nothing she/he did not already know. However, if the transmitter’s message is relatively unpredictable (e.g., “Audrey imagined a typhoon”), there is considerable potential information for the receiver.

These principal ideas in Shannon’s information theory were originally conceived to describe communications systems, but they also do a very good job of describing sensation (Barlow, 1961) and perception (Attneave, 1954). When some stimulus in the environment is completely predictable or unchanging, it provides no new information for the perceiver. Conversely, when a stimulus is unpredictable or changing, it possesses potential information for the perceiver. Neurons exhibit efficient response strategies that optimize the amount of information they can transmit. Enhanced responses can indicate unexpected or changing inputs (high information content), whereas diminished responses can indicate expected or unchanging inputs (low information content). These notions form the foundation of the efficient coding hypothesis (Barlow, 1961, 2001; Simoncelli, 2003): Recoding the sensory input reduces its predictability and increases the informativeness of stimulus representations at later stages of sensory processing. By focusing on relative change, limited neural dynamic ranges adjust to maximize the amount of change detected in the environment (Kluender et al., 2003). The auditory system displays a variety of mechanisms for change detection, and these mechanisms can be highly informative for understanding how speech is processed and perceived.

Adaptation in the auditory nerve

In the presence of continued stimulation, neurons generally decrease their firing rate. This has been termed “firing rate adaptation” or “spike-frequency adaptation” (Figure 7.7). Adaptation is not mere neural fatigue, but instead an efficient way to code information about a stimulus that is not changing (Wark et al., 2007). Neurons are observed to respond strongly to change even when adapted to previous stimuli (Smith, 1979; Smith and Zwislocki, 1975). There is a biological basis for viewing adaptation as an efficient response strategy. Producing action potentials incurs a metabolic cost, and there are limitations on the energy consumed by neural populations in the brain (Laughlin, 1981; Lennie, 2003). Thus, adaptation acts as a conservation of resources until they can be better deployed, such as when there is a change in the environment. Adaptation occurs throughout the auditory system to a wide range of stimulus properties, and it is reflected in other neural systems as well. Figure 7.7 illustrates the canonical neural activity over time in response to a simple pure tone, and Figure 7.8 expands this concept to phonemes with different manners of articulation, whose distinct patterns of airflow produce distinct amplitude envelopes. Although the different manners are distinguished by their envelopes in the waveform (see Rosen, 1992), the neural representations of those envelopes are far less distinct, primarily because of the overrepresentation of onsets.

Figure 7.7 Firing activity of a population of auditory neurons in response to a 50-ms, constant-amplitude pure tone. The neural activity begins with a resting firing rate before any stimulus, represents the tone onset with disproportionately high firing that adapts within about 15 ms, and then remains steady through the end of the stimulus. At stimulus offset, activity decreases sharply before finally returning to its resting state.

Figure 7.8 Overall neural firing response (akin to Figure 7.7) in response to phonemes with different manners of articulation.

Neurons adapt if they are coding spectral energy that does not change substantially over time (such as formant frequencies in English vowels /i/ and /u/, and numerous other steady-state vowels in other languages). These adapted neurons remain sensitive to new information so that they can respond when these frequencies change, like during coarticulation into the next speech sound. In addition, adaptation is a strong candidate for producing spectral contrast effects (see section, “Speech perception as change detection”).

Change detection in auditory cortex

Detection of sensory changes is also measured at the cortical level using mismatch negativity (MMN; Näätänen et al., 1978). MMN is an event-related potential elicited by neural populations in the brain. Studies of the MMN contrast the neural activity measured in response to a pattern of repetitive or predictable inputs (the “standard”) against an unexpected stimulus that violates this pattern (the “deviant”). Mismatch responses occur automatically, even in the absence of attention, making them a popular tool for investigating central auditory function. They arise in response to changes in simple acoustic properties (e.g., frequency, duration, intensity, and spatial location; see Näätänen et al., 2007 for a thorough review), and they scale up to the level of phonetic contrasts as well (Aaltonen et al., 1987; Näätänen et al., 1997). MMNs have been reported for violations of cue covariance (Gomes et al., 1997; Takegata et al., 2005), which is highly relevant for phonetic categorization, given the abundance of covarying cues in phonemes (Repp, 1982). These cortical responses to stimulus changes are reliable predictors of speech intelligibility at segmental and sentence levels, and of language development (Koerner et al., 2016; Molfese and Molfese, 1997). Altogether, investigations of MMN paint a rich picture of how the auditory system is sensitive to changes, from simple (i.e., acoustic) to more complex (i.e., phonetic). This is consistent with the idea that deviance detection in audition is organized hierarchically (see Grimm et al., 2011; Escera and Malmierca, 2014 for reviews).

Speech perception as change detection

The acoustic complexity of speech is considerable, with important signal properties spanning narrow and broad frequency ranges as well as brief and long durations. It is both true and fortunate that the auditory system is constantly calibrating in order to be maximally sensitive to changes across variable frequency and temporal extents (Kluender et al., 2003). Considering the general principle of change detection that cuts across all sensory systems, it is unsurprising that it plays a significant role in speech perception. Some speech sounds can be defined by how they change over time. For example, the second formant (F2) transition is a major cue for distinguishing voiced stop consonants (Delattre et al., 1955; Kewley-Port, 1983; Lahiri et al., 1984). Apart from this well-known case, there are other cases that deserve to be framed as change detection as well. Encoding spectral changes over time has proven to be fruitful in understanding speech perception by people with impaired auditory systems (Alexander and Kluender, 2009; Dorman and Loizou, 1996; Winn and Litovsky, 2015), and in highlighting the difference between automatic classification (Blumstein and Stevens, 1979) and human perception (Dorman and Loizou, 1996; Kewley-Port, 1983).

Vowel-Inherent Spectral Change

Although vowel recognition is typically described as depending primarily on the frequencies of the first two formants (and further formalized in the literature by mapping so-called steady-state formants onto a static two-dimensional space), there are extensive dynamics in play. Researchers often use tables of steady-state formant values from publications by Peterson and Barney (1952) and Hillenbrand et al. (1995) when preparing or analyzing their own speech stimuli. However, there is tremendous value in the authors’ original commentaries surrounding these tables. For instance, Hillenbrand and colleagues suggested very strongly that English vowel categories should be defined not merely by a snapshot of steady-state values, but by the changes in those values over time. This point was earlier proposed and empirically verified in Canadian English by Nearey and Assmann (1986), with supporting evidence in American English contributed by Hillenbrand and Nearey (1999). In this latter report, automatic classification of vowels based on their F1 and F2 steady-state values was greatly enhanced by the addition of formant information from a second time point – much more so than by the addition of duration, f0, or F3 information. Figure 7.9 depicts the movement of vowel formants in the production of English vowels by adult women recorded in the study by Hillenbrand et al. (1995). This illustrates how much information is typically lost when only considering formants at vowel midpoints.

Figure 7.9 Formant dynamics in English vowels spoken by women recorded by Hillenbrand et al. (1995). Formant frequencies are measured at the 20% point (beginning of each line), 50% midpoint (gray boxes), and 80% point (circle with phonetic symbol) of vowel duration.

The concept of perceptually significant changes in formants across time has come to be known as vowel-inherent spectral change (VISC; Nearey and Assmann, 1986; Hillenbrand and Nearey, 1999; Morrison and Assmann, 2013). VISC plays an important role in vowel recognition, and not just for vowels traditionally labeled as diphthongal. Accuracy of vowel perception drops considerably when formant trajectories are artificially flattened, both for vowels spoken by adults (Hillenbrand and Nearey, 1999) and for vowels spoken by children aged 3, 5, and 7 (Assmann and Katz, 2000). It bears mention that while VISC is a perceptually salient cue that helps disambiguate vowels in American and Canadian English, this might be due to the crowded nature of these vowel spaces – a point consistent with the Adaptive Dispersion theory of vowel typology and the Dynamic Specification model of vowel perception (Jenkins and Strange, 2013). VISC has not been studied extensively in other languages, but languages with less-crowded vowel spaces might have less need for vowels to be disambiguated through VISC.

Recognition of other phonemes can be interpreted through the lens of change detection, but it is beyond the scope of this chapter to list them all. The earlier examples illustrate the importance of detecting changes in acoustic details within the phoneme (i.e., intrinsic cues to phoneme identity). However, surrounding sounds form an acoustic context whose properties can influence recognition of a given phoneme (i.e., extrinsic cues to phoneme identity). This distinction between intrinsic and extrinsic cues was originally raised for vowel perception (Ainsworth, 1975; Nearey, 1989), but it is appropriate for perception of any stimulus, speech or non-speech. Next, we briefly review how spectral and temporal context influences phoneme perception.

Short-term spectral contrast

Spectral properties of neighboring sounds can alter recognition of a phoneme, such that any differences in the spectra of two neighboring sounds will be perceptually magnified. If frequencies in the surrounding context are relatively lower, frequencies in a target sound may seem higher by comparison, and vice versa. The perceptual patterns that reveal the bias away from surrounding context are known as “spectral contrast effects.” Such demonstrations date back at least to Lindblom and Studdert-Kennedy (1967), who studied identification of a target vowel as /ʊ/ or /ɪ/ in different consonantal contexts (/wVw/ or /jVj/). When F2 transitions in the consonantal context started and ended at lower frequencies (/w/), the vowel was identified as the higher-F2 /ɪ/ more often, and the opposite pattern emerged for the /jVj/ context. This pattern of results was later extended to stop consonant contexts (/bVb/, /dVd/) in perception of /o/–/ʌ/ and /ʌ/–/ɛ/ vowel continua (Nearey, 1989; Holt et al., 2000). Additionally, these results were replicated when consonantal contexts were replaced by pure tones and tone glides (Holt et al., 2000), suggesting that the contextual adjustment might be driven by general auditory contrast rather than phonemic or phonetic contrast.

Spectral context influences consonant recognition as well. Listeners are more likely to perceive /d/ (higher F3 onset frequency) following /r/ (lower F3 offset frequency), and more likely to perceive that same consonant as /ɡ/ (lower F3 onset frequency) following /l/ (higher F3 offset frequency) (Mann, 1980, 1986; Fowler et al., 1990). Similar to the results of Holt and colleagues (2000), Lotto and Kluender (1998) replicated these effects by replacing the liquid consonant with pure tones and tone glides at the appropriate frequencies. The debate remains strong over whether these context effects can be described as general auditory phenomena or something more complicated (Lotto and Holt, 2006; Fowler, 2006; Viswanathan et al., 2009; Kingston et al., 2014).

Long-term spectral contrast

The influence of spectral context is not limited to the sound immediately adjacent to the target phoneme (Figure 7.10). Similar shifts in phoneme categorization occur when spectral properties are stable across a longer listening context like a sentence. Listeners are more likely to perceive a vowel with lower F1 frequency (e.g., /ɪ/) when the preceding acoustic context features higher F1 frequencies, and vice versa (Ladefoged and Broadbent, 1957). Such demonstrations are often referred to as “extrinsic normalization” or “vowel normalization” (e.g., Ainsworth, 1974; Nearey, 1989; Johnson, 1990), but similar effects have been reported in consonant recognition as well. Studies of speech and non-speech perception produced similar results using non-speech contexts (signal-correlated noise, Watkins, 1991; sine tones, Holt, 2005; string quintet, Stilp et al., 2010) and/or targets (brass instruments, Stilp et al., 2010), indicating that talker-specific information is not necessary in order to produce spectral contrast effects. This has led to suggestions that it is not talker information per se but the long-term average spectrum of preceding sounds that biased subsequent perception in these studies (Huang and Holt, 2012). However, it bears mention that similar effects have been reported without any influence from the long-term average spectrum of preceding sounds. Visual information about the talker’s gender produced similar shifts in categorization boundaries (Glidden and Assmann, 2004), and expectations about talker gender can shift vowel categorization (Johnson et al., 1999).

Figure 7.10 Different timescales of context (gray) influence phoneme categorization (black). The context can be the immediately preceding sound (top) or several seconds of sounds (bottom), whether they are speech or non-speech. This depicts “forwards” context effects (context precedes target phoneme), but “backwards” effects (phoneme precedes context) are also known to occur.

Spectral contrast effects have been reported for recognition of phonemes cued by fundamental frequency, F1, F2, F3, and overall spectral shape (see Stilp et al., 2015 for review). A series of experiments by Stilp and colleagues (2015) demonstrated the generality of long-term spectral contrast effects on categorization of /ɪ/ and /ɛ/. Perceptions were repeatedly influenced by spectral properties of the preceding sentence, whether those properties took the form of a narrow spectral peak (100 Hz bandwidth), a broader peak (300 Hz), or spanned the entire spectral envelope (after Watkins, 1991). Importantly, the size of the shift in phoneme categorization was correlated with the size of the spectral difference between the context sentence and subsequent target vowels (see also Stilp and Alexander, 2016). This relationship also exists in consonant categorization (Stilp and Assgari, 2017) and musical instrument categorization (Frazier et al., in press). Spectral contrast effects are a general principle of auditory perception, and the wide variety of examples in the literature suggest that they might play a widespread role in everyday speech perception as well.

Temporal contrast

Temporal properties of speech can also have contrastive effects on phoneme identification. Similar to the spectral contrast effects reported earlier, these effects occur on both shorter and longer timescales. Whether formant transitions are heard as fast (indicating a stop consonant) or slow (indicating a glide/semivowel) can be affected by temporal information from the surrounding speech context, such as speaking rate and the relative durations of consonant and adjacent vowel. Longer vowel duration makes the initial consonant sound shorter (i.e., more /b/ responses), and shorter vowel duration makes the consonant sound longer (more /w/ responses; Miller and Liberman, 1979). Similar patterns emerge for other temporal contrasts, such as VOT for /b/ and /p/ (Green and Miller, 1985). Even a brief interval of silence can form a temporal context; a silent gap inserted before frication makes its onset seem more abrupt, leading to an increase in perception of affricates instead of fricatives (Dorman et al., 1979).

Temporal contrast effects can also occur across longer-duration listening contexts. Ainsworth (1973) reported that when preceding sounds had a slower tempo, a fixed-duration target consonant sounded faster; formant transitions normally slow enough to be perceived as /w/ were instead more likely to be heard as /b/. An analogous temporal contrast effect emerges for the perception of voicing in /ɡ/ and /k/ (Summerfield, 1981) and in the /ʃ/–/tʃ/ contrast (Haubert and Pichora-Fuller, 1999; Repp et al., 1978). The temporal category boundary between consonants essentially moves around to accommodate the momentary speaking rate and the exact syllable timing; listeners show sensitivity to changes in multiple timescales simultaneously. However, some controversy surrounds the consistency with which distal speech rate information affects speech perception (Heffner et al., 2017).

Together, these studies depict speech perception as being heavily influenced by detection of change relative to context. These changes can span narrow (a formant peak) to broad spectral bandwidths (spectral envelope) and narrow (tens of milliseconds) to broad temporal extents (several seconds). This acoustic diversity highlights the extent to which the auditory system is constantly adjusting its sensitivity to changes across frequency and across time in order to maximize the amount of information that can be transmitted to the perceiver. For a phonetician, this provides important perspective for evaluating phoneme perception. Not only do the (intrinsic) acoustic properties of a given phoneme affect its perception, but the (extrinsic) surrounding context plays a significant role as well.

Learning about the auditory system using speech sounds

Being conversant and knowledgeable about phonetics can be useful for auditory scientists. We highlight at least three benefits here: First, one can explore the performance of the auditory system using stimuli where signal properties translate into known meaningful differences in speech (as opposed to acoustic properties that are merely convenient and programmable). Second, one can explore cases where auditory cues permit categorization, rather than just discrimination. Finally, one can explore the effects of specific auditory deficits such as hearing loss on particular phonetic contrasts. In this section, we walk through some ways that acoustic phonetics can contribute to auditory science, and the challenges that remain.

Experimental control and external validity

There are some good reasons why psychoacousticians do not normally use speech stimuli. Speech acoustics are highly variable from one utterance to the next, so two experimenters might use different sounds even if they are nominally the same phoneme. Acoustic measurements of speech do not give an absolute template for speech perception in general. Even seminal papers on acoustic measurements such as Hillenbrand et al. (1995) warn readers against the tendency to use measurements as a framework for describing anything other than a snapshot in history in a particular region of a country (for further discussion, see Chapter 16 in this volume). There is no consensus on what constitutes an “ideal” speech stimulus, or on how to properly control acoustic factors of interest (via synthesis, partial resynthesis, waveform editing, etc.). In the psychoacoustics world, everyone agrees on what constitutes a 1000 Hz tone of 300-ms duration and 70 dB SPL intensity. However, the pure tone might not represent any meaningful experience relating to communication. What we gain in acoustic control of the stimulus we sacrifice in ecological validity, and vice versa.

Figure 7.11 Waveforms and spectrograms for “dime” (top panels) and “time” (lower panels), showing that the voicing contrast appears as differences in the amplitude envelope/duration of the consonant burst and aspiration, as well as in formant frequencies at the onset of voicing.

Suppose one is interested in measuring auditory temporal processing because there are relevant phonetic cues that are temporal in nature. It can be tempting to use the classic “out-of-the box” psychoacoustic and phonetic tests such as gap detection and VOT categorization, respectively. However, there are at least two challenges worth noting. First (and most simply), the acoustic dimensions sometimes do not match up across different testing styles. For example, VOT is not a gap in the speech signal; it can arguably be described as an amplitude envelope modulation, or asynchrony in onset time of two crudely defined “high” and “low” frequency regions. Second, the prolonged aspiration of a voiceless English stop is not “added” onto the vowel but rather a progressive devoicing of the vowel; Figure 7.11 illustrates this principle by temporally aligning the contrastive pair “dime” and “time.”

As stated earlier, it is reasonable to suspect that psychoacoustic tests probe abilities on a different scale than those needed for speech recognition. For example, while syllable-final stop sounds contain a gap whose duration is useful for voicing categorization (i.e., longer gaps typically correspond to voiceless stops), the duration of these gaps (30–150 ms) is an order of magnitude greater than just-noticeable gaps in psychoacoustic tests (2–5 ms).

Multidimensionality of phonetic contrasts

Phonetic contrasts cannot be completely described in only one acoustic domain at a time. The literature is replete with investigations of multiple acoustic parameters that covary with single linguistic features. Among them are Lisker’s (1986) catalog of 16 cues for the /b/–/p/ contrast in word-medial position, and McMurray and Jongman’s (2011) measures of 24 different cues for English fricative sounds. Although some of the cues are certainly more dominant than others in governing perception, we note these examples to highlight how synthesizing or manipulating speech must account for a great deal of complexity in order to retain natural acoustic quality.

Even when a temporal property is a salient cue for a phonetic contrast, its role can rarely be examined in isolation in the same way as in a classical psychoacoustic experiment. Consider the impact of vowel environment on VOT. Lower vowels (like /a/) are cued by higher F1 values, and therefore contain a very salient up-sweeping F1 transition. As stop consonant aspiration essentially devoices the transition into the following vowel, longer VOT results in a noticeably higher F1 frequency at voicing onset. In experimental terms, this means the F1 onset frequency (a spectral cue) can be strongly confounded with VOT (a nominally “temporal” cue), particularly in the context of low vowels like /a/. Incidentally, there are numerous examples of experiments that examined temporal perception using VOT stimuli in the /a/ context, with predictably mixed results. For example, Elangovan and Stuart (2005) found a relationship between gap detection and VOT boundary, but Mori et al. (2015) found no such pattern. It is likely that the measure of temporal processing was diluted by a strong accompanying spectral cue in the /ba/–/pa/ series used in each study. Such mixed results are peppered throughout the literature on this topic, possibly because of an incomplete appreciation of the multiplicity of acoustic cues for the voicing feature. Barring a deep dive into the acoustics literature, a good practice for an auditory scientist interested in temporal processing would be to use a steady-state vowel context with low F1, such as /i/, which affords ostensibly no extra spectral cues because there is no sweeping frequency transition.

Controlling acoustic-phonetic cues

Despite the inherent covariation among acoustic cues in speech, it is usually possible to manipulate speech sounds so that phonetic categorization mainly taps into a particular auditory domain. Detecting a stop following /s/, or detecting an affricate instead of a fricative at the end of a word is ostensibly the same as a psychoacoustic test of gap detection. Haubert and Pichora-Fuller (1999) examined the perception of word pairs that are differentiated by detecting the presence (the stop in “spoon” or the affricates in “catch” or “ditch”) or absence of a short temporal gap (resulting in perception of “soon,” “cash,” or “dish”). For those interested in perception of frequency-specific peaks in the spectrum, there could be tests that rely on perception of low-frequency formants (as for vowel height distinction between “hoop” and “hope,” or for F1 transitions in “loss” and “laws”), mid-frequency formants (for vowel advancement distinctions like “shock” and “shack”), or high-frequency energy that distinguishes words like “key” and “tea.” There are tests of amplitude modulation detection that have potential phonetic implications as well. Consider slow (2–10 Hz) rates of syllabicity and speech rate, or manner of articulation at slightly higher modulation rates (15–60 Hz). Detection of faster (≈100–200 Hz) modulations in broadband noise is akin to detecting the presence (e.g., /z/) or absence (e.g., /s/) of voicing in fricatives. The list certainly does not end here; there are numerous other examples where simple or complex patterns in speech can be mapped to corresponding simple acoustic parameters in non-speech stimuli.

Discrimination and categorization

Even after identifying appropriate complementary psychoacoustic and phonetic tests, it is worthwhile to recall that the ability typically tested in non-linguistic psychoacoustic tests is discrimination, while most phonetic perception tasks probe categorization (Holt and Lotto, 2010). A helpful guide to these abilities was written by Erber (1982) and to this day is still regarded as a useful framework for understanding the development of audition in children. The paradigms used to test speech recognition demand processing that extends beyond the simple detection or discrimination of sounds. Speech sounds must be appropriately categorized, requiring listeners to derive meaning in spite of superfluous variation, and to incorporate relevant contextual and linguistic knowledge. For people who use cochlear implants, word recognition is more strongly predicted by auditory categorization than by psychoacoustic discrimination (Winn et al., 2016).

Categorization depends on a host of methodological details that could have considerable implications for processing. For example, extremely fine discrimination of speech sounds (i.e., the “auditory” mode) emerges with short inter-stimulus comparison intervals, but coarser category-level labeling, seemingly devoid of fine discrimination (i.e., the “phonetic” mode), occurs at longer intervals (Pisoni, 1973). Early models of categorical perception (a well-known nonlinear mapping of acoustic cues onto phonological categories) broadly equated discrimination and categorization: Sounds from two different categories were discriminable, and sounds from the same category were indiscriminable. However, subsequent work demonstrated that shared category membership does not render sounds indiscriminable (Pisoni and Lazarus, 1974; Massaro and Cohen, 1983), further highlighting important differences between these abilities.

Using speech stimuli to probe for auditory deficits

The utility of mapping simple acoustic parameters onto phonetic segments can be seen very clearly in experiments designed to test auditory deficits using speech sounds. Hearing loss in its most common form (high-frequency sensorineural hearing loss, resulting from damage to the cochlea) is not merely a reduction of loudness, but instead a distortion of the spectral representation of sound, as well as a drastic reduction of the wide dynamic range that is characteristic of typical hearing. These problems are exacerbated for those who wear cochlear implants, which are prosthetic devices that directly electrically stimulate the auditory nerve but fail to faithfully transmit all of the fine temporal or spectral details of real acoustic signals. Because of these less-intuitive aspects of hearing loss, the specific spectral and temporal structure of speech contrasts can lead to testable hypotheses about what kinds of perceptions (and misperceptions) will unfold from various deficits in the auditory system.

The temporal domain

Gordon-Salant et al. (2006) showed that older listeners generally have less efficient processing of temporal cues, even in the absence of any clinical hearing loss. Examples include poorer perception of affricate gaps, VOT, spectrotemporal transition speed (e.g., the /b/–/w/ contrast), and vowel duration (for the voicing contrast in syllable-final consonants). This study provides a clean and thorough example of how speech contrasts can be used to target specific auditory abilities that are normally probed only with psychoacoustic discrimination tests.

The spectral domain

Individuals who use cochlear implants frequently have their hearing tested for spectral resolution because of the sizable challenges that stand in the way of good spectral coding by their devices. Such testing can take many forms in both the psychoacoustic and phonetic realms. At a coarse level, vowels can be used to test perception of segments that are primarily defined in the spectral domain without many helpful cues in the temporal domain. Typically, such vowel testing is done with “hVd” words such as “heed,” “hid,” “head,” etc., because the vowel is embedded in a consistent environment with virtually no confounding acoustic cues, each environment yields a valid word in English, and the acoustics of such vowels have been explored in numerous publications. Other studies examining dynamic cues in vowel perception generally found a lack of sensitivity to these crucial cues in cochlear implant users (see Iverson et al., 2006; Winn et al., 2012; Donaldson et al., 2013).

Vowels can be used to make specific predictions about how perception changes with local damage in the cochlea. DiNino et al. (2016) imposed frequency-specific degradation in a simulation of cochlear implants, in which narrow regions of frequency information were either lost or redistributed, causing predictable warping of the phonetic perceptual space. Vowel perceptions tended to gravitate away from the affected region. In cases where a mid-frequency region between 850 and 1850 Hz would carry F2 information for back vowels spoken by a woman, perception of those vowels would be more “front” in nature (e.g., “hud” would be perceived as “head”). It is worth noting that in this study and those mentioned in previous paragraphs, the test stimuli were not chosen because words of the hVd form are especially important in the lexicon, but because they provide a stable phonetic environment in which some parameter of interest can be tested directly. Using this testing approach, one might be able to diagnose deficiencies or even identify localized sites of cochlear lesion by capitalizing on the acoustic structure of vowels and consonants (Winn and Litovsky, 2015).
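A crude way to appreciate the effect of frequency-specific degradation is to remove energy in a fixed band from a signal and then inspect or listen to the result. The sketch below (in R, applied to a noise placeholder signal; this is merely an illustrative band-removal operation via the FFT, not the manipulation used by DiNino et al., 2016) zeroes out the 850–1850 Hz region mentioned above.

fs <- 16000                                   # sampling rate (Hz)
t <- seq(0, 0.5 - 1/fs, by = 1/fs)
x <- rnorm(length(t))                         # placeholder signal (noise)

X <- fft(x)
freqs <- (seq_along(X) - 1) * fs / length(X)  # frequency of each FFT bin

# Zero the 850-1850 Hz band and its mirrored (negative-frequency) counterpart
kill <- (freqs >= 850 & freqs <= 1850) |
        (freqs >= fs - 1850 & freqs <= fs - 850)
X[kill] <- 0

x_notched <- Re(fft(X, inverse = TRUE)) / length(X)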

Use of a cue vs. sensitivity to a cue

The perceptual weighting of an acoustic-phonetic cue (e.g., to distinguish between phonemes in a minimal pair) is not the same as psychoacoustic sensitivity to that cue. Work by Klatt (1982) and Rosen and Fourcin (1986) articulates the distinction between these concepts in illuminating ways. Consider voice pitch, to which human listeners are exquisitely sensitive, but which is largely discarded when perceiving stop consonant voicing, despite being a natural cue (Abramson and Lisker, 1985). Exceptions occur only in circumstances of unnatural ambiguity (Haggard, 1978) and/or under harsh masking noise or band-limited conditions (Winn et al., 2013). The phonetic perception results do not necessarily imply that frequency discrimination is poor. Instead, they imply that the cue is not regarded as reliable enough to inform categorization of this contrast because some other cue takes priority. Following the framework established in vision science (Ernst and Banks, 2002; Jacobs, 2002), Toscano and McMurray (2010) modeled auditory cue weighting as a function of the reliability of the distribution of that cue as it appears in natural speech. For example, if VOT for two segments /b/ and /p/ is realized as two tightly constrained (low-variability), well-separated distributions, and some other cue (such as vowel duration) is realized as two wide (high-variability), overlapping distributions, the secondary cue should be given much less perceptual weight on account of its lack of usefulness for the task at hand (regardless of whether it is perceptually discernible).
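One simple way to operationalize the reliability idea, in the spirit of (though far simpler than) the model of Toscano and McMurray (2010), is to express each cue’s reliability as the separation between its two category distributions and then normalize those values into weights. The sketch below (in R, with invented distribution parameters) illustrates why a tightly separated cue such as VOT would dominate a widely overlapping cue such as vowel duration.

set.seed(1)
n <- 500

# Primary cue (e.g., VOT, in ms): tight, well-separated category distributions
vot_b <- rnorm(n, mean = 5, sd = 5)
vot_p <- rnorm(n, mean = 50, sd = 8)

# Secondary cue (e.g., vowel duration, in ms): wide, overlapping distributions
dur_b <- rnorm(n, mean = 200, sd = 60)
dur_p <- rnorm(n, mean = 230, sd = 60)

# Separation index (akin to d-prime): mean difference over pooled SD
separation <- function(x1, x2) {
  abs(mean(x1) - mean(x2)) / sqrt((var(x1) + var(x2)) / 2)
}

d_vot <- separation(vot_b, vot_p)
d_dur <- separation(dur_b, dur_p)

# Normalized weights: the more reliable (better-separated) cue dominates
weights <- c(VOT = d_vot, duration = d_dur) / (d_vot + d_dur)
round(weights, 2)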

Conclusion

Basic properties of the auditory system play a vital role in the processing of speech sounds. Exploration of nonlinearities in the processing of frequency, intensity, and timing offers insight into how speech is transformed from an acoustic signal into a neural signal. Consistent with other sensory systems, auditory processing shows special sensitivity to change over time, and handles information proportionally and with respect to context. Although speech sounds are multidimensional, phonetic contrasts can be found that rely relatively heavily on spectral or temporal cues, and such contrasts can be used to probe the function of the auditory system in a fashion similar to classic psychoacoustics. Impairments of the auditory system lead to systematic deficits in phonetic perception, which can be revealed by experiments that treat the signal in an auditory framework. An examination of the auditory system, paired with an equally rigorous examination of the acoustic structure of speech, can contribute to refined methodology in phonetics, basic hearing science, and the study of speech perception by people with normal and impaired hearing.

References

Aaltonen, O., Niemi, P., Nyrke, T., & Tuhkanen, M. (1987). Event-related brain potentials and the perception of a phonetic continuum. Biological Psychology, 24(3), 197–207.

Abramson, A., & Lisker, L. (1985). Relative power of cues: F0 shift versus voice timing. In V. Fromkin (Ed.), Linguistic phonetics (pp. 25–33). New York: Academic Press.

Ainsworth, W. (1973). Durational cues in the perception of certain consonants. Proceedings of the British Acoustical Society, 2, 1–4.

Ainsworth, W. (1974). The influence of precursive sequences on the perception of synthesized vowels. Language and Speech, 17(2), 103–109.

Ainsworth, W. (1975). Intrinsic and extrinsic factors in vowel judgments. In G. Fant & M. Tatham (Eds.), Auditory analysis and perception of speech (pp. 10–113). London: Academic Press.

Alexander, J. M., & Kluender, K. R. (2009). Spectral tilt change in stop consonant perception by listeners with hearing impairment. Journal of Speech, Language, and Hearing Research, 52(3), 653–670.

Assmann, P. (1996). Modeling the perception of concurrent vowels: Role of formant transitions. The Journal of the Acoustical Society of America, 100(2 Pt 1), 1141–1152.

Assmann, P., & Katz, W. (2000). Time-varying spectral change in the vowels of children and adults. The Journal of the Acoustical Society of America, 108(4), 1856–1866.

Assmann, P., & Nearey, T. (1987). Perception of front vowels: The role of harmonics in the first formant region. The Journal of the Acoustical Society of America, 81(2), 520–534.

Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61(3), 183–193.

Barlow, H. (1961). Possible principles underlying the transformation of sensory messages. Sensory Communication, 217–234.

Barlow, H. (2001). The exploitation of regularities in the environment by the brain. Behavioral and Brain Sciences, 24(4), 602–671.

Blumstein, S., & Stevens, K. (1979). Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants. Journal of the Acoustical Society of America, 66, 1001–1017.

Blumstein, S., & Stevens, K. (1980). Perceptual invariance and onset spectra for stop consonants in different vowel environments. The Journal of the Acoustical Society of America, 67(2), 648–662.

Brokx, J., & Nooteboom, S. (1982). Intonation and the perceptual separation of simultaneous voices. Journal of Phonetics, 10(1), 23–36.

Brungart, D. (2001). Informational and energetic masking effects in the perception of two simultaneous talkers. The Journal of the Acoustical Society of America, 109(3), 1101–1109.

Carney, L., Li, T., & McDonough, J. (2015). Speech coding in the brain: Representation of vowel formants by midbrain neurons tuned to sound fluctuations. eNeuro, 2(4), ENEURO.0004-15.2015, 1–12.

Culling, J., & Darwin, C. (1993). Perceptual separation of simultaneous vowels: Within and across-formant grouping by F0. The Journal of the Acoustical Society of America, 93(6), 3454–3467.

Darwin, C., & Gardner, R. (1986). Mistuning a harmonic of a vowel: Grouping and phase effects on vowel quality. The Journal of the Acoustical Society of America, 79(3), 838–845.

Dau, T., Kollmeier, B., & Kohlrausch, A. (1997). Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. The Journal of the Acoustical Society of America, 102(5), 2906–2919.

Delattre, P., Liberman, A., & Cooper, F. (1955). Acoustic loci and transitional cues for consonants. The Journal of the Acoustical Society of America, 27(4), 769–773.

DiNino, M., Wright, R., Winn, M., & Bierer, J. (2016). Vowel and consonant confusions from spectrally-manipulated stimuli designed to simulate poor cochlear implant electrode-neuron interfaces. The Journal of the Acoustical Society of America, 140(6), 4404–4418.

Donaldson, G., Rogers, C., Cardenas, E., Russell, B., & Hanna, N. (2013). Vowel identification by cochlear implant users: Contributions of static and dynamic spectral cues. The Journal of the Acoustical Society of America, 134, 3021–3028.

Dorman, M. F., & Loizou, P. C. (1996). Relative spectral change and formant transitions as cues to labial and alveolar place of articulation. The Journal of the Acoustical Society of America, 100(6), 3825–3830.

Dorman, M., Raphael, L., & Liberman, A. (1979). Some experiments on the sound of silence in phonetic perception. The Journal of the Acoustical Society of America, 65(6), 1518–1532.

Dubno, J., Dirks, D., & Langhofer, L. (1982). Evaluation of hearing-impaired listeners using a nonsense-syllable test. II. Syllable recognition and consonant confusion patterns. Journal of Speech and Hearing Research, 25, 141–148.

Dubno, J., & Levitt, H. (1981). Predicting consonant confusions from acoustic analysis. The Journal of the Acoustical Society of America, 69, 249–261.

Dudley, H. (1939). Remaking speech. The Journal of the Acoustical Society of America, 11(2), 169–177.

Elangovan, S., & Stuart, A. (2005). Interactive effects of high-pass filtering and masking noise on word recognition. Annals of Otology, Rhinology and Laryngology, 114(11), 867–878.

Erber, N. (1982). Auditory training. Washington, DC: AG Bell Association for the Deaf.

Ernst, M., & Banks, M. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870), 429–433.

Escera, C., & Malmierca, M. (2014). The auditory novelty system: An attempt to integrate human and animal research. Psychophysiology, 111–123.

Fletcher, H., & Munson, W. (1933). Loudness, its definition, measurement and calculation. Bell Labs Technical Journal, 12(4), 377–430.

Fogerty, D., Bologna, W., Ahlstrom, J., & Dubno, J. (2017). Simultaneous and forward masking of vowels and stop consonants: Effects of age, hearing loss, and spectral shaping. The Journal of the Acoustical Society of America, 141(2), 1133–1143.

Fowler, C. A. (2006). Compensation for coarticulation reflects gesture perception, not spectral contrast. Perception & Psychophysics, 68(2), 161–177.

Fowler, C. A., Best, C., & McRoberts, G. (1990). Young infants’ perception of liquid coarticulatory influences on following stop consonants. Perception & Psychophysics, 48(6), 559–570.

Frazier, J. M., Assgari, A. A., & Stilp, C. E. (in press). Musical instrument categorization is highly sensitive to spectral properties of earlier sounds. Attention, Perception, & Psychophysics.

Friesen, L., Shannon, R., Baskent, D., & Wang, X. (2001). Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants. The Journal of the Acoustical Society of America, 110(2), 1150–1163.

Gallun, F., & Souza, P. (2008). Exploring the role of the modulation spectrum in phoneme recognition. Ear and Hearing, 29(5), 800–813.

Giraud, A., & Poeppel, D. (2012). Cortical oscillations and speech processing: Emerging computational principles and operations. Nature Neuroscience, 15(4), 511–517.

Glasberg, B., & Moore, B. (1990). Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47, 103–138.

Glidden, C. M., & Assmann, P. F. (2004). Effects of visual gender and frequency shifts on vowel category judgments. Acoustics Research Letters Online 5(4), 132–138.

Gomes, H., Bernstein, R., Ritter, W., Vaughan, H., & Miller, J. (1997). Storage of feature conjunctions in transient auditory memory. Psychophysiology, 34(6), 712–716.

Gordon-Salant, S., Yeni-Komshian, G., Fitzgibbons, P., & Barrett, J. (2006). Age-related differences in identification and discrimination of temporal cues in speech segments. The Journal of the Acoustical Society of America, 119(4), 2455–2466.

Green, K., & Miller, J. (1985). On the role of visual rate information in phonetic perception. Perception & Psychophysics, 38(3), 269–276.

Greenwood, D. (1990). A cochlear frequency-position function for several species 29 years later. The Journal of the Acoustical Society of America, 87, 2592–2605.

Grimm, S., Escera, C., Slabu, L., & Costa-Faidella, J. (2011). Electrophysiological evidence for the hierarchical organization of auditory change detection in the human brain. Psychophysiology, 48(3), 377–384.

Haggard, M. (1978). The devoicing of voiced fricatives. Journal of Phonetics, 6(2), 95–102.

Haubert, N., & Pichora-Fuller, M. K. (1999). The perception of spoken language by elderly listeners: Contributions of auditory temporal processes. Canadian Acoustics, 27(3), 96–97.

Heffner, C., Newman, R., & Idsardi, W. (2017). Support for context effects on segmentation and segments depends on the context. Attention, Perception, & Psychophysics, 79(3), 964–988.

Hillenbrand, J., Getty, L., Clark, M., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. The Journal of the Acoustical Society of America, 97(5), 3099–3111.

Hillenbrand, J., & Nearey, T. M. (1999). Identification of resynthesized /hVd/ utterances: Effects of formant contour. The Journal of the Acoustical Society of America, 105(6), 3509–3523.

Holt, L. (2005). Temporally nonadjacent nonlinguistic sounds affect speech categorization. Psychological Science, 16(4), 305–312.

Holt, L., & Lotto, A. (2010). Speech perception as categorization. Attention, Perception & Psychophysics, 72(5), 1218–1227.

Holt, L., Lotto, A., & Diehl, R. (2004). Auditory discontinuities interact with categorization: Implications for speech perception. The Journal of the Acoustical Society of America, 116(3), 1763–1773.

Holt, L., Lotto, A., & Kluender, K. (2000). Neighboring spectral content influences vowel identification. The Journal of the Acoustical Society of America, 108(2), 710–722.

Huang, J., & Holt, L. (2012). Listening for the norm: Adaptive coding in speech categorization. Frontiers in Psychology, 3, 10.

Iverson, P., Smith, C., & Evans, B. (2006). Vowel recognition via cochlear implants and noise vocoders: effects of formant movement and duration. The Journal of the Acoustical Society of America, 120, 3998–4006.

Jacobs, R. (2002). What determines visual cue reliability. Trends in Cognitive Science, 6, 345–350.

Jenkins, W., & Strange, J. (2013). Dynamic specification of coarticulated vowels. In G. Morrison & P. Assmann (Eds.), Vowel inherent spectral change (pp. 87–116). Berlin: Springer.

Jesteadt, W., Bacon, S., & Lehman, J. (1982). Forward masking as a function of frequency, masker level, and signal delay. The Journal of the Acoustical Society of America, 71(4), 950–962.

Johnson, K. (1990). The role of perceived speaker identity in F0 normalization of vowels. The Journal of the Acoustical Society of America, 88(2), 642–654.

Johnson, K., Strand, E. A., & D’Imperio, M. (1999). Auditory – visual integration of talker gender in vowel perception. Journal of Phonetics, 27(4), 359–384.

Jongman, A. (1989). Duration of frication noise required for identification of English fricatives. The Journal of the Acoustical Society of America, 85(4), 1718–1725.

Kewley-Port, D. (1983). Time-varying features as correlates of place of articulation in stop consonants. The Journal of the Acoustical Society of America, 73(1), 322–335.

Kiefte, M., Enright, T., & Marshall, L. (2010). The role of formant amplitude in the perception of /i/ and /u/. The Journal of the Acoustical Society of America, 127(4), 2611–2621.

Kingston, J., Kawahara, S., Chambless, D., Key, M., Mash, D., & Watsky, S. (2014). Context effects as auditory contrast. Attention, Perception, & Psychophysics, 76, 1437–1464.

Klatt, D. H. (1973). Discrimination of fundamental frequency contours in synthetic speech: Implications for models of pitch perception. The Journal of the Acoustical Society of America, 53, 8–16.

Klatt, D. H. (1982). Prediction of perceived phonetic distance from critical-band spectra: A first step. Proceedings – ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, 7(June), 1278–1281. doi:10.1109/ICASSP.1982.1171512

Kluender, K., Coady, J., & Kiefte, M. (2003). Sensitivity to change in perception of speech. Speech Communication, 41(1), 59–69.

Koerner, T., Zhang, Y., Nelson, P., Wang, B., & Zou, H. (2016). Neural indices of phonemic discrimination and sentence-level speech intelligibility in quiet and noise: A mismatch negativity study. Hearing Research, 339, 40–49.

Ladefoged, P., & Broadbent, D. (1957). Information conveyed by vowels. The Journal of the Acoustical Society of America, 29(1), 98–104.

Lahiri, A., Gewirth, L., & Blumstein, S. (1984). A reconsideration of acoustic invariance for place of articulation in diffuse stop consonants: Evidence from a cross-language study. The Journal of the Acoustical Society of America, 76(2), 391–404.

Laughlin, S. (1981). A simple coding procedure enhances a neuron’s information capacity. Zeitschrift Fur Naturforschung C, 36(9–10), 910–912.

Lennie, P. (2003). The cost of cortical computation. Current Biology, 13(6), 493–497.

Lindblom, B., & Studdert-Kennedy, M. (1967). On the role of formant transitions in vowel recognition. The Journal of the Acoustical Society of America, 42(4), 830–843.

Lisker, L. (1986). ‘Voicing’ in English: A catalogue of acoustic features signaling /b/ versus /p/ in trochees. Language and Speech, 29, 3–11.

Loizou, P., Dorman, M., & Powell, V. (1998). The recognition of vowels produced by men, women, boys, and girls by cochlear implant patients using a six-channel processor. The Journal of the Acoustical Society of America, 103(2), 1141–1149.

Lotto, A., & Holt, L. (2006). Putting phonetic context effects into context: A commentary on Fowler. Perception & Psychophysics, 68(2), 178–183.

Lotto, A., & Kluender, K. (1998). General contrast effects in speech perception: Effect of preceding liquid on stop consonant identification. Perception & Psychophysics, 60(4), 602–619.

Mann, V. (1980). Influence of preceding liquid on stop-consonant perception. Perception & Psychophysics, 28(5), 407–412.

Mann, V. (1986). Distinguishing universal and language-dependent levels of speech perception: Evidence from Japanese listeners’ perception of English “l” and “r.” Cognition, 24(3), 169–196.

Massaro, D. W., & Cohen, M. M. (1983). Categorical or continuous speech perception: A new test. Speech Communication, 2(1), 15–35.

McMurray, B., & Jongman, A. (2011). What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review, 118, 219–246.

Miller, J., & Liberman, A. (1979). Some effects of later-occurring information on the perception of stop consonant and semivowel. Perception & Psychophysics, 25(6), 457–465.

Molfese, D., & Molfese, V. (1997). Discrimination of language skills at five years of age using event-related potentials recorded at birth. Developmental Neuropsychology, 13(2), 135–156.

Moore, B. (2013). An introduction to the psychology of hearing (6th ed.). Leiden: Brill.

Moore, B., Glasberg, B., & Peters, R. (1986). Thresholds for hearing mistuned partials as separate tones in harmonic complexes. The Journal of the Acoustical Society of America, 80(2), 479–483.

Moore, B., Glasberg, B., & Shailer, M. (1984). Frequency and intensity difference limens for harmonics within complex tones. The Journal of the Acoustical Society of America, 75(2), 550–561.

Mori, S., Oyama, K., Kikuchi, Y., Mitsudo, T., & Hirose, N. (2015). Between-frequency and between-ear gap detections and their relation to perception of stop consonants. Ear and Hearing, 36, 464–470.

Morrison, G., & Assmann, P. (Eds.). (2013). Vowel inherent spectral change. Berlin, Heidelberg: Springer-Verlag.

Näätänen, R., Gaillard, A., & Mäntysalo, S. (1978). Early selective-attention effect on evoked potential reinterpreted. Acta Psychologica, 42(4), 313–329.

Näätänen, R., Lehtokoski, A., Lennes, M., Cheour, M., Huotilainen, M., Iivonen, A., … Alho, K. (1997). Language-specific phoneme representations revealed by electric and magnetic brain responses. Nature, 385(6615), 432–434.

Näätänen, R., Paavilainen, P., Rinne, T., & Alho, K. (2007). The mismatch negativity (MMN) in basic research of central auditory processing: A review. Clinical Neurophysiology, 118(12), 2544–2590.

Nearey, T. (1989). Static, dynamic, and relational properties in vowel perception. The Journal of the Acoustical Society of America, 85(5), 2088–2113.

Nearey, T., & Assmann, P. (1986). Modeling the role of inherent spectral change in vowel identification. The Journal of the Acoustical Society of America, 80(5), 1297–1308.

Obleser, J., Herrmann, B., & Henry, M. (2012). Neural oscillations in speech: Don’t be enslaved by the envelope. Frontiers in Human Neuroscience, 6, 250.

Peelle, J., & Davis, M. (2012). Neural oscillations carry speech rhythm through to comprehension. Frontiers in Psychology, 3, 320.

Peterson, G., & Barney, H. (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), 175–184.

Pisoni, D. (1973). Auditory and phonetic memory codes in the discrimination of consonants and vowels. Perception & Psychophysics, 13, 253–260.

Pisoni, D. B., & Lazarus, J. H. (1974). Categorical and noncategorical modes of speech perception along the voicing continuum. The Journal of the Acoustical Society of America, 55(2), 328–333.

Poeppel, D., & Hickok, G. (2004). Towards a new functional anatomy of language. Cognition, 92(1–2), 1–12.

Rand, T. (1974). Dichotic release from masking for speech. The Journal of the Acoustical Society of America, 55, 678–680.

Repp, B. (1982). Phonetic trading relations and context effects – new experimental evidence for a speech mode of perception. Psychological Bulletin, 92(1), 81–110.

Repp, B., Liberman, A., Eccardt, T., & Pesetsky, D. (1978). Perceptual integration of acoustic cues for stop, fricative, and affricate manner. Journal of Experimental Psychology Human Perception and Performance, 4(4), 621–637.

Robinson, D., & Dadson, R. (1956). A re-determination of the equal-loudness relations for pure tones. British Journal of Applied Physics, 7, 166–181.

Rosen, S. (1992). Temporal information in speech: Acoustic, auditory and linguistic aspects. Philosophical Transactions: Biological Science, 336, 367–373.

Rosen, S., & Fourcin, A. (1986). Frequency selectivity and the perception of speech. In B. Moore (Ed.), Frequency selectivity in hearing (pp. 373–488). London: Academic Press.

Shannon, C. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423.

Shannon, R., Zeng, F., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270(5234), 303–304.

Simoncelli, E. (2003). Vision and the statistics of the visual environment. Current Opinion in Neurobiology, 13(2), 144–149.

Smith, R. (1979). Adaptation, saturation, and physiological masking in single auditory-nerve fibers. The Journal of the Acoustical Society of America, 65(1), 166–178.

Smith, R., & Zwislocki, J. (1975). Short-term adaptation and incremental responses of single auditory-nerve fibers. Biological Cybernetics, 17(3), 169–182.

Stevens, K., & Blumstein, S. (1978). Invariant cues for place of articulation in stop consonants. The Journal of the Acoustical Society of America, 64(5), 1358–1368.

Stevens, K., Blumstein, S., Glicksman, L., Burton, M., & Kurowski, K. (1992). Acoustic and perceptual characteristics of voicing in fricatives and fricative clusters. The Journal of the Acoustical Society of America, 91(5), 2979–3000.

Stevens, S., Volkmann, J., & Newman, E. (1937). A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America, 8(3), 185–190.

Stilp, C., & Alexander, J. (2016). Spectral contrast effects in vowel categorization by listeners with sensorineural hearing loss. Proceedings of Meetings on Acoustics, 26, 060003.

Stilp, C., Alexander, J., Kiefte, M., & Kluender, K. (2010). Auditory color constancy: Calibration to reliable spectral properties across speech and nonspeech contexts and targets. Attention, Perception & Psychophysics, 72(2), 470–480.

Stilp, C., Anderson, P., & Winn, M. (2015). Predicting contrast effects following reliable spectral properties in speech perception. The Journal of the Acoustical Society of America, 137(6), 3466–3476.

Stilp, C., & Assgari, A. (2017). Consonant categorization exhibits graded influence of surrounding spectral context. The Journal of the Acoustical Society of America, 141(2), EL153–EL158.

Summerfield, Q. (1981). Articulatory rate and perceptual constancy in phonetic perception. Journal of Experimental Psychology: Human Perception and Performance, 7(5), 1074–1095.

Syrdal, A., & Gopal, H. (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. The Journal of the Acoustical Society of America, 79(4), 1086–1100.

‘t Hart, J. (1981). Differential sensitivity to pitch distance, particularly in speech. The Journal of the Acoustical Society of America, 69(3), 811–821.

Takegata, R., Brattico, E., Tervaniemi, M., Varyagina, O., Näätänen, R., & Winkler, I. (2005). Preattentive representation of feature conjunctions for concurrent spatially distributed auditory objects. Cognitive Brain Research, 25(1), 169–179.

Toscano, J., & McMurray, B. (2010). Cue integration with categories: Weighting acoustic cues in speech using unsupervised learning and distributional statistics. Cognitive Science, 34(3), 434–464.

Traunmüller, H. (1990). Analytical expressions for the tonotopic sensory scale. The Journal of the Acoustical Society of America, 88, 97–100.

Viswanathan, N., Fowler, C., & Magnuson, J. (2009). A critical examination of the spectral contrast account of compensation for coarticulation. Psychonomic Bulletin & Review, 16(1), 74–79.

Wardrip-Fruin, C. (1985). The effect of signal degradation on the status of cues to voicing in utterance-final stop consonants. The Journal of the Acoustical Society of America, 77(5), 1907–1912.

Wark, B., Lundstrom, B., & Fairhall, A. (2007). Sensory adaptation. Current Opinion in Neurobiology, 17(4), 423–429.

Watkins, A. (1991). Central, auditory mechanisms of perceptual compensation for spectral-envelope distortion. The Journal of the Acoustical Society of America, 90(6), 2942–2955.

Winn, M., Chatterjee, M., & Idsardi, W. (2012). The use of acoustic cues for phonetic identification: effects of spectral degradation and electric hearing. The Journal of the Acoustical Society of America, 131(2), 1465–1479.

Winn, M., Chatterjee, M., & Idsardi, W. (2013). Roles of voice onset time and F0 in stop consonant voicing perception: Effects of masking noise and low-pass filtering. Journal of Speech, Language, and Hearing Research, 56(4), 1097–1107.

Winn, M., & Litovsky, R. (2015). Using speech sounds to test functional spectral resolution in listeners with cochlear implants. The Journal of the Acoustical Society of America, 137, 1430–1442.

Winn, M., Won, J-H., & Moon, I-J. (2016). Assessment of spectral and temporal resolution in cochlear implant users using psychoacoustic discrimination and speech cue categorization. Ear and Hearing, 37, e377–e390.

Zahorian, S., & Jagharghi, A. (1993). Spectral-shape features versus formants as acoustic correlates for vowels. The Journal of the Acoustical Society of America, 94, 1966–1982.

Zilany, M., & Bruce, I. (2007). Predictions of speech intelligibility with a model of the normal and impaired auditory periphery. Proceedings of the Third International IEEE/EMBS Conference on Neural Engineering, 481–485.

Zilany, M., Bruce, I., & Carney, L. (2014). Updated parameters and expanded simulation options for a model of the auditory periphery. The Journal of the Acoustical Society of America, 135(1), 283–286.

Zwicker, E. (1961). Subdivision of the audible frequency range into critical bands. The Journal of the Acoustical Society of America, 33, 248.