Chapter 13

Reading a Sound Spectrogram

In This Chapter

arrow Appreciating the importance of the spectrogram

arrow Decoding clues in spectrogram readouts

arrow Using your knowledge with clinical cases

arrow Reading spectrograms that are less than ideal

arrow Knowing more about noise

The spectrogram is the gold standard of acoustic phonetics. These images were originally created by a machine called the sound spectrograph, built in the 1940s as part of the World War II military effort. These clunky instruments literally burned images onto specially treated paper. However, software that computes digital spectrograms has replaced this older technology. As a result, you can now make spectrograms on almost any computer or tablet. Although the technology has gotten snappier, you still need to know how to read a spectrogram, and that’s where this chapter comes in.

Reading a sound spectrogram is not easy. Even highly trained experts can’t be shown a spectrogram and immediately tell you what was said, as if they were reading the IPA or the letters of a language. However, with some training, you can usually interpret spectrograms well enough for many practical purposes. This chapter focuses on making spectrogram reading a bit more comfortable for you.

Grasping How a Spectrogram Is Made

A spectrogram takes a short snippet of speech and makes it visual by plotting out formants and other patterns over time. Time is plotted on the horizontal axis, frequency is plotted on the vertical axis, and amplitude is shown in terms of darkness (see Figure 13-1).

Developments in technology have made the production of spectrograms perhaps less exciting than the good ol’ days, but far more reliable and useful. Current systems are capable of displaying multiple plots, adjusting the time alignment and frequency ranges, and recording detailed numeric measurements of the displayed sounds. These advances in technology give phoneticians a detailed picture of the speech being analyzed.

9781118505083-fg1301.eps

Figure 13-1: A sample spectrogram of the word “spectrogram.”

tip.eps You can easily obtain software for computing spectrograms (and for other useful analyses such as tracking fundamental frequency and amplitude over time) free from the Internet. Two widely used programs are WaveSurfer and Praat (Dutch for “talk”). To use these programs, first be sure your computer has a working microphone and speakers. Simply download the software to your computer, and you can then access many online tutorials to get started with speech recording, editing, and analysis.

Take a look at Figure 13-2. You can consider the information, shown in a line spectrum, to be a snapshot of speech for a single moment in time. Now, turn this line spectrum sideways and move it over time. Voila! You have a spectrogram. The difference between a line spectrum and a spectrogram is like the difference between a photograph and a movie.

9781118505083-fg1302.eps

Figure 13-2: Relating the line spectrum to the spectrogram.
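If you’d like to see this relationship for yourself, the following minimal Python sketch (using the free NumPy, SciPy, and Matplotlib libraries) computes both views from a recording: a line spectrum of one short frame (the photograph) and a spectrogram of the whole utterance (the movie). The file name speech.wav is just a placeholder for any mono recording you have on hand, and the window settings are typical choices, not magic numbers.

import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile

# Load a recording (any mono WAV file; "speech.wav" is a placeholder name).
fs, x = wavfile.read("speech.wav")
x = x.astype(float)

# The "photograph": a line spectrum of one 30-ms frame taken at 0.5 seconds.
start = int(0.5 * fs)
frame = x[start:start + int(0.03 * fs)]
spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
freqs = np.fft.rfftfreq(len(frame), d=1 / fs)

# The "movie": many such spectra computed over time and stacked side by side.
f, t, Sxx = signal.spectrogram(x, fs, window="hamming",
                               nperseg=int(0.005 * fs),   # short (5-ms) analysis window
                               noverlap=int(0.004 * fs))

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
ax1.plot(freqs, 20 * np.log10(spectrum + 1e-10))
ax1.set(xlabel="Frequency (Hz)", ylabel="Level (dB)",
        title="Line spectrum: one moment in time")
ax2.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-10), shading="gouraud", cmap="gray_r")
ax2.set(xlabel="Time (s)", ylabel="Frequency (Hz)", ylim=(0, 7000),
        title="Spectrogram: spectra unrolled over time")
plt.tight_layout()
plt.show()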



remember.eps Don’t sell silence short while working with spectrograms. Silence plays an important role in speech. It can be a phrase marker that tells the listener when a section of speech is done. Silence can fall at word boundaries (such as in “dog biscuit”). It can be a tiny pause indicating pressure build-up, such as the closure that occurs just before a plosive is released. Sometimes silence is a pause for breathing, for emotion, or for dramatic effect.

Reading a Basic Spectrogram

Welcome to the world of spectrogram reading. I can see you are new to this, so it’s time to establish a few ground rules. You want to read a spectrogram? You had better inspect the axes. Take a look at Figure 13-3.

9781118505083-fg1303.eps

Figure 13-3: Spectrogram of the phrase “Buy Spot food!”

This is the phrase “Buy Spot food!” produced by a male speaker of American English (me). You can assume that Spot is hungry. Actually, I’ve selected this phrase because it has a nice selection of vowels and consonants to learn. Figure 13-3 is a black-and-white spectrogram, which is fairly common because it can be copied easily. However, most spectrogram programs also offer colored displays in which sections with greater energy light up in hot colors, such as red and yellow.

tip.eps When reading a spectrogram, you should first distinguish silence versus sound. Where there is sound, a spectrogram marks energy; where there is no sound, it is blank. Look at Figure 13-3 and see if you can find the silence. Look between the words — the large regions of silence are shown in white. In this figure, there is a gap between “Buy” and “Spot” and between “Spot” and “food.” I made this spectrogram very easy by recording the words for this speech sample quite distinctly. In ordinary spectrograms of connected speech, distinguishing one word from another isn’t so simple.

There are also other, shorter gaps of silence, for instance, in the word “Spot” between the /s/ and the /p/. This silent gap corresponds to the stop closure within the cluster. Two other silent regions are found before the final stops at the end of “Spot” (before the /t/) and of “food” (before the /d/). These are regions of closure before the final stop consonant release.

The horizontal axis spans a total of about 3,000 milliseconds (about 3 seconds). If you time yourself saying this same sentence, you’ll notice that I used a fairly slow, careful rate of speech (citation form, as opposed to more usual, informal connected speech). In citation form, people tend to be on their best behavior in pronunciation, making all sounds carefully so they can be well understood. I used citation form to make a very clear spectrogram.



Now, take a look at the vertical axis in Figure 13-3. The frequency ranges from 0 to 7,000 Hz, an intermediate range typically used to show both vowels and consonants in spectrograms. To highlight vowels, phoneticians usually view a lower range (such as 0 to 5,000 Hz), and when sounds with higher frequencies are being inspected (such as fricatives), a higher y-axis maximum (for example, 10,000 Hz or even 20,000 Hz) is sometimes used.

tip.eps Your next job is to determine whether the sound-containing regions are voiced (periodic) or not. A good way to start is to look for a voice bar: a dark band of energy running along the very bottom of the frequency scale, corresponding to the first and second harmonics. For men, this band sits at about 100–150 Hz; for women, it’s often around 150–250 Hz (with lots of variation between people). If a sound is periodic (that is, it’s due to a regularly vibrating source, such as your vocal folds), a voice bar will usually be present, although it may be faint or poorly defined, depending on the spectrogram’s quality and the talker’s fundamental frequency.

In Figure 13-3, you can see the voice bar at the bottom of “Buy,” in the /ɑ/ vowel of “Spot,” and in the /ud/ portion of “food.” It isn’t present for the voiceless sounds, including the /s/ and /t/ sounds of “Spot” and the /f/ of “food.”
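If you like, you can rough out this voiced-or-not decision automatically. Here’s a small Python sketch of the same logic: gate out silence first, then check whether a frame carries appreciable energy in the low-frequency band where the voice bar lives. It assumes you already have the samples in a NumPy array x with sampling rate fs, and the band edges and thresholds are illustrative guesses, not established values.

import numpy as np
from scipy import signal

def voiced_frames(x, fs, frame_ms=25, band=(60, 300)):
    """Rough voice-bar detector. Flags frames that are (a) not silent and
    (b) carry a healthy share of their energy in the low band where the
    voice bar (the lowest harmonics) would appear.
    All thresholds here are illustrative only."""
    sos = signal.butter(4, band, btype="bandpass", fs=fs, output="sos")
    low = signal.sosfilt(sos, x)                        # keep only the voice-bar region

    hop = int(fs * frame_ms / 1000)
    n = len(x) // hop
    total = np.array([np.sum(x[i*hop:(i+1)*hop] ** 2) for i in range(n)])
    low_energy = np.array([np.sum(low[i*hop:(i+1)*hop] ** 2) for i in range(n)])

    not_silent = total > 0.01 * total.max()             # crude silence gate
    has_voice_bar = low_energy > 0.1 * (total + 1e-12)  # strong low-frequency share
    return not_silent & has_voice_bar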

Visualizing Vowels and Diphthongs

Vowels on a spectrogram can be detected by tracking their steady-state formants over time. A formant appears as a broad, dark band running roughly parallel to the bottom of the spectrogram. Some of my more imaginative students have remarked that formants look like caterpillars (if this helps you, so be it). In that case, you’re searching for caterpillars cruising along at different heights, parallel to the spectrogram’s horizontal axis.

But how do you know which vowel is which? If you know the talker’s gender and accent, you can compare the center of the formant frequency band with established values for the vowels and diphthongs of English. (If you don’t know the gender or accent, your task will be even harder!) Tables 13-1 and 13-2 show frequencies of the first (F1), second (F2), and third (F3) vowel formants for common varieties of General American English and British English. Notice that the GAE values are listed separately for men and women. This matters because differences between the sexes in overall vocal tract size and in the relative proportions of the oral and pharyngeal cavities create different typical values for men and for women. Values for British women weren’t available at the time of this writing.

9781118505083-tb1301a.png

9781118505083-tb1301b.png

9781118505083-tb1302.png

In Figure 13-3, knowing that an American adult male produced “Buy Spot food,” you should be able to find the formant frequencies of the vowel in the second word shown in the spectrogram.

Figure 13-4 shows the same spectrogram with additional details about the formant estimates. In this figure, the spectrogram program plots a line at the estimated center frequency of each of the first four formants (F1 through F4). With old-fashioned spectrograms, a user had to do this manually, by eye and with a pencil.



The first monophthongal vowel in this phrase is the /ɑ/ in the word “Spot.” In Figure 13-4, you can see those values are 724, 1065, and 2571 Hz. These map fairly closely to the Table 13-1 values for the male GAE /ɑ/ (768, 1333, and 2522 Hz).

9781118505083-fg1304.eps

Figure 13-4: An annotated spectrogram.

Next, examine the /u/ of “food.” In Figure 13-4, the F1, F2, and F3 values are estimated in the same fashion: 312, 1288, and 2318 Hz. These measurements match the /u/ values for the GAE male talkers in Table 13-1 (378, 997, and 2343 Hz) fairly well. My F2 is a bit higher, perhaps because I’m from California, where there’s a dialectal tendency for /u/ vowels to begin rather /i/-like. Overall, the system works.
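This compare-to-the-table step is easy to automate. The following Python sketch picks whichever reference vowel sits closest (by plain Euclidean distance in F1-F2-F3 space) to a set of measured formants. The two-entry reference table here just reuses the male GAE values quoted above; a real application would load all the vowels from Table 13-1, and many phoneticians would weight F1 and F2 more heavily or convert to an auditory scale such as Bark first.

import math

# Toy reference table: the male GAE values quoted in the text (F1, F2, F3 in Hz).
# A real application would use the full set of vowels from Table 13-1.
REFERENCE = {
    "ɑ": (768, 1333, 2522),
    "u": (378, 997, 2343),
}

def closest_vowel(f1, f2, f3, reference=REFERENCE):
    """Return the reference vowel whose formants lie nearest the measured ones."""
    def distance(ref):
        return math.sqrt((f1 - ref[0]) ** 2 + (f2 - ref[1]) ** 2 + (f3 - ref[2]) ** 2)
    return min(reference, key=lambda vowel: distance(reference[vowel]))

# The values measured in Figure 13-4:
print(closest_vowel(724, 1065, 2571))   # the vowel of "Spot" -> ɑ
print(closest_vowel(312, 1288, 2318))   # the vowel of "food" -> u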

remember.eps Vowels that behave this way are traditionally called steady state because they maintain rather constant formant frequency values over time. Another way of putting it is that they have relatively little vowel inherent spectral change (VISC).

In contrast, the General American English diphthongs (/aɪ/, /aʊ/, and /ɔɪ/) perceptually shift from one sound quality to another. Acoustically, these diphthongs show relatively large patterns of formant frequency shift over time, as in “buy” shown in Figure 13-5. Spectrograms of /aɪ/, /aʊ/, and /ɔɪ/ are shown in Figure 13-5 for comparison.

9781118505083-fg1305.eps

Figure 13-5: Spectrograms of /aɪ/, /aʊ/, and /ɔɪ/.

remember.eps Diphthongs provide an excellent opportunity to review the rules mapping formant frequencies to physiology (refer to Chapter 12). For instance, in /aɪ/ you can see the F1 rule at work: when the tongue is low for the /a/ portion of /aɪ/, F1 is high, and F1 drops when the tongue rises for the high vowel /ɪ/ at the end of the diphthong. Meanwhile, /a/ is a central vowel, while /ɪ/ is a front vowel, so according to the F2 rule, F2 should increase as you move across the diphthong (and indeed it does).

Checking Clues for Consonants

Consonants are different beasts than vowels. Vowels are voiced and relatively long events. You make vowels by positioning the tongue freely in the mouth. That is, the tongue doesn’t need to touch or rub anywhere. Consonants can be long in duration (as in fricatives) or short and fast (like stops). Consonants involve precise positioning of the tongue, including movement against other articulators.

Identifying consonants on spectrograms involves a fair bit of detective work because you must go after several clues. Your first clue is the manner of articulation. Recall that there are stops, fricatives, affricates, approximants, and nasals. In these sections, I show you some of each. Later in the chapter, after you know what each of these manner types looks like on the spectrogram, I explain the place of articulation (labial, alveolar, velar, and so on) for stop consonants, a slightly more challenging task in spectrogram reading.

Stops (plosives)

Stop consonants can be identified on spectrograms by their brevity: they’re rapid events marked by a burst and transition. Say “pa ta ka” and “ba da ga.” Feel the burst of each initial consonantal event. Now look at the spectrograms in Figure 13-6. Notice that each has a thin, tall, pencil-like spike where the burst of noise shoots up the frequency range. As you might expect, the voiced stops have a voice bar underneath, and the voiceless stops don’t.

Stop consonants look rather different at the end of a syllable. First, of course, the transitions point in the opposite direction from when the consonant is at the beginning of the syllable. Also, as you saw in Figure 13-3 with the final consonants in “Spot” and “food,” there is a silent closure before the final release. Figure 13-7 shows two more examples, “pat” and “pad,” with the important sections labeled.

9781118505083-fg1306.eps

Figure 13-6: The spectrograms of /pɑ/, /tɑ/, and /kɑ/ (top) and of /bɑ/, /dɑ/, and /ɡɑ/ (bottom).

9781118505083-fg1307.eps

Figure 13-7: Spectrogram of “pat” and “pad.”

Fricative findings

Noise (friction) shows up in spectrograms as darkness (intensity marking) across a wide frequency range. Figure 13-8 shows the voiced and voiceless fricatives of English in vowel-consonant-vowel (VCV) contexts.

9781118505083-fg1308.eps

Figure 13-8: The spectrograms of GAE fricatives in VCV contexts.

remember.eps Here is a list of important fricative points to remember:

check.png Fricatives are fairly long. Their durations are clearly longer than those of stop consonants.

check.png The voice bar can be a good cue for telling the voiced from the voiceless.

check.png The energy distribution (spread) of the different fricatives isn’t the same. Some are darker in higher frequency regions, some in lower regions.

check.png /s/ and /ʃ/, along with their voiced counterparts /z/ and /ʒ/, are produced with strong airflow (sibilants).

check.png /f/, /v/, /ð/, and /θ/ are produced with weak airflow (non-sibilants).

Energy spread is an especially good clue to fricative identity. If you listen to /s/ and /ʃ/, you hear that these are strong and hissy because they’re made by sharply blowing air against the teeth, in addition to the oral constriction. Compare /s/ and /ʃ/ (the strong fricatives, or sibilants) with /f/, /v/, /ð/, and /θ/. This second group should sound weaker because they don’t involve such an obstacle.

Tuning in to the sibilants, you can also hear that /s/ sounds higher than /ʃ/. This shows up on the spectrogram as /s/ having more darkness at higher frequencies than /ʃ/. In general, /s/ and /z/ have maximum noise energy centering around 4000 Hz. For /ʃ/ and /ʒ/, the energy usually begins around 2500 Hz.
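You can put a single number on this “higher versus lower” impression: the spectral centroid, the center of gravity of the noise spectrum. Here’s a small Python sketch. It assumes you’ve already excised a stretch of frication noise into a NumPy array, and the 3,200 Hz cutoff is just an illustrative dividing line between the roughly 4,000 Hz and 2,500 Hz regions mentioned above, not an established boundary.

import numpy as np

def spectral_centroid(frication, fs):
    """Frequency 'center of gravity' of a stretch of frication noise."""
    spectrum = np.abs(np.fft.rfft(frication * np.hanning(len(frication))))
    freqs = np.fft.rfftfreq(len(frication), d=1 / fs)
    return np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)

def guess_sibilant(frication, fs, cutoff_hz=3200):
    """Crude /s/-versus-/ʃ/ guess: /s/ noise sits higher in frequency than /ʃ/.
    The cutoff is illustrative only."""
    return "s-like" if spectral_centroid(frication, fs) > cutoff_hz else "ʃ-like"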

Okay, the strong fricatives are out of the way, so you can now work over the weaklings (non-sibilants). A characteristic of this whole group is that they may not last as long as /s/ and /ʃ/. Because of this (and because of their weak friction), they may sometimes look like stops. Don’t let them get away with it: Check out the lineup in Figure 13-8.

The fricatives /f/ and /v/ are the strongest of the weaklings. They can show up on the spectrogram as a triangular region of frication. In most cases there is strong energy at or around 1200 Hz. The fricative /θ/ can take two forms:

check.png A burst-like form more common at syllable-initial position

check.png A more fricative-like pattern at the end of a syllable (shown in Figure 13-8)

It can sometimes be accompanied by low-frequency energy. However, its frication is usually concentrated above 3000 Hz.

The phoneme /ð/ is the wimpiest of all the fricatives; it can almost vanish in rapid speech, although unfortunately this sound occurs in many common function words in English (the, that, then, there, and so on). When observable, /ð/ may contain voiced energy at 1500 and 2500 Hz, as well as some higher-frequency energy.

Affricates

English has two affricates, /ʧ/ and /ʤ/. These have an abrupt (alveolar) beginning, marked with a burst and transition, followed by energy at an alveolar locus (approximately 1800 Hz), which quickly transitions into a palato-alveolar fricative. Old spectrogram hands suggest a trick for pulling affricate suspects out of the lineup: sometimes there’s a bulge in the lower frequency portions of the fricative part. The plosive component is detectable as a single vertical spike just to the left of the frication portion of the phoneme! Check out Figure 13-9 for such evidence.

Approximants

Approximants have more gradual transitions than those of stops, as seen in Figure 13-10. This spectrogram shows the approximants found in GAE, including /w/ and /j/, two approximants also called glides. They have this name because these consonants smoothly blend into the vowel next to them. They also have less energy than that of a vowel. A time-honored phonetician’s trick for spotting /j/ is to look for “X marks the spot” where F2 and F3 almost collide before going their merry ways. Because the constriction for /j/ is so narrow, this phoneme is often marked by frication as well as voicing.

The sounds /ɹ/ and /l/ are fun because of the unique tongue shapes involved. Taken together, these two approximants are called liquids because of the way these sounds affected the timing of the classical Greek language. The “r” sounds (rhotics) are a particularly scandalous bunch. Literally. They may involve a bunched tongue, as in some forms of American English; a retroflex gesture (curling the tongue tip and blade up toward the alveolar ridge while the back sides of the tongue contact the molars); uvular fricatives (as in French or Hebrew); taps; or trills. Looking at the American English /ɹ/ in Figure 13-10, the main acoustic characteristic becomes clear: a sharp drop in F3.

9781118505083-fg1309.eps

Figure 13-9: The spectrograms of /ʧ/ and /ʤ/.

9781118505083-fg1310.eps

Figure 13-10: The spectrograms of approximants /wa/, /ja/, /ɹa/, and /la/.

The lateral approximant /l/ creates a side-swiped situation in the oral cavity. In a typical /l/ production, the tongue tip is placed on the alveolar ridge and the sides are in the usual position (or slightly raised), with air escaping around the sides. This causes something called an anti-resonance at around 1500 Hz, which you can see as a fading out of energy in that zone of the spectrogram. An anti-resonance is an intensity minimum, or zero.

Spectrograms that contain /l/ consonants can show much variability. For example, before a vowel, F3 may drop or stay level while F2 rises, giving the phoneme a forked appearance. Following a vowel, /l/ may be signaled by the merging of F2 with F1 near or below 1000 Hz, with F3 moving up toward 3000 Hz, leaving a hole in the normal F2 region (side-swiped by /l/, acoustically).

Nasals

Imagine you entered a futuristic world where a nasty government went around spying on everyone by using voice detectors to snatch all kinds of personal information from people. How could you escape detection? The first thing I would do is change my name to something like “Norman M. Nominglan.” That is, something laden with nasals. This is because nasals are some of the most difficult sounds for phoneticians to model and interpret. They’re tough to read on a spectrogram and tend to make speech recognizers crash all over the place. Go nasal and fly under the radar.

English has three nasal stop consonants: bilabial /m/, alveolar /n/, and velar /ŋ/. They’re produced with three different sites of oral closure, combined with an opening of the velar port that allows air to escape through the nasal passageway. Opening the nasal port adds further complexity to an already complicated acoustic situation in the oral cavity. As in the case of /l/, nasal sounds have anti-resonances (or zeros), which can show up in spectrograms. To help you track down anyone named “Norman M. Nominglan,” here are some important clues:

check.png Nasal consonants are voiced events, but they have lower amplitudes than vowels or approximants. Nasals therefore appear fainter than surrounding non-nasal sounds.

check.png There may be a characteristic nasal murmur (sound that occurs just after oral closure) at 250 Hz, near F1.

check.png If nasals are at the start or end of a syllable, F1 may be the only visible formant.

check.png Nasal stops (like other plosives) have an optional release.

check.png F2 is the best clue for place of articulation. F2 moves toward the following target values:

• Bilabial /m/: 900 to 1400 Hz

• Alveolar /n/: 1650 to 1800 Hz

• Velar /ŋ/: 1900 to 2000 Hz

Check out the suspects in Figure 13-11.

9781118505083-fg1311.eps

Figure 13-11: The spectrograms of /n/, /m/, and /ŋ/.
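Because those F2 targets are given as simple ranges, you can turn them into a quick first-pass classifier. Here’s a Python sketch using the ballpark values from the list above; real tokens overlap these ranges, so treat the answer as a lead to follow up on (check the murmur and the transitions), not a verdict.

def nasal_place_from_f2(f2_hz):
    """Guess a nasal's place of articulation from the F2 target it moves toward.
    Ranges are the ballpark values listed in the text; real tokens overlap."""
    if 900 <= f2_hz <= 1400:
        return "bilabial /m/"
    if 1650 <= f2_hz <= 1800:
        return "alveolar /n/"
    if 1900 <= f2_hz <= 2000:
        return "velar /ŋ/"
    return "ambiguous: check the murmur and the formant transitions too"

print(nasal_place_from_f2(1700))   # -> alveolar /n/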

Formant frequency transitions

An important basis for tracking consonant place of articulation in spectrograms is the formant frequency transition, a region of rapid formant movement or change. Formant frequency transitions are fascinating regions of speech with many implications for speech science and psychology. A typical formant frequency transition is shown in Figure 13-12.

If a regular formant looks like a fuzzy caterpillar, then I suppose a formant frequency transition looks more like a tapered caterpillar (or one wearing styling gel). This is because the transition begins with low intensity and a narrow bandwidth, gradually expanding into the steady state portion of the sound.

9781118505083-fg1312.eps

Figure 13-12: A typical formant frequency transition.

Here’s how it works.

check.png F1: Think about what your tongue does when you say the syllable “da.” Your tongue moves quickly down (and back) from the alveolar ridge. Following the inverse rule for F1, this means that F1 rises. Because you’re moving into the vowel, the amplitude also gets larger.

check.png F2: These transitions are a bit trickier. For stop consonants, F2 frequency transitions are important cues for place of articulation. Figure 13-13 shows typical F1 and F2 patterns for the nonce (nonsense) syllables /bɑ/, /dɑ/, and /ɡɑ/. Notice that these transitions start from different frequency regions and have different slopes. For the labial, /b/, the transition starts at approximately 720 Hz and has a rising slope. The alveolar stop, /d/, starts around 1700–1800 Hz and is relatively flat. The velar stop, /ɡ/, begins relatively high, with a falling slope. A common pattern also seen for velars is a pinching together of F2 and F3, where F2 points relatively high up and F3 seems to point to about the same frequency region.

Phoneticians use these starting regions, called the locus, to help identify place of articulation in stop consonants. The physics behind these locus frequencies is complex (and a bit beyond the scope of this book), but in general they result from interactions of the front and back cavity resonances.

9781118505083-fg1313.eps

Figure 13-13: Stylized F1 and F2 patterns for /bɑ/, /dɑ/, and /ɡɑ/.
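To make the locus idea concrete, here’s a small Python sketch that guesses stop place from two measurements you can take off a spectrogram: the F2 frequency right at the start of the transition and the F2 of the following vowel (with the F3 onset as an optional extra for spotting the velar pinch). It assumes a following low back vowel like /ɑ/, and the boundary numbers are illustrative values drawn from the approximate figures above, not hard rules.

def stop_place_guess(f2_onset_hz, f2_vowel_hz, f3_onset_hz=None):
    """Very rough stop-place guess from the F2 transition into a low back vowel
    such as /ɑ/, following the locus idea in Figure 13-13. Boundary values are
    illustrative, based on the approximate figures given in the text."""
    if f2_onset_hz < 1200 and f2_onset_hz < f2_vowel_hz:
        return "labial: F2 starts low and rises"
    if 1600 <= f2_onset_hz <= 1900:
        return "alveolar: F2 starts near 1700-1800 Hz and stays fairly flat"
    if f2_onset_hz > f2_vowel_hz and (f3_onset_hz is None or
                                      abs(f3_onset_hz - f2_onset_hz) < 500):
        return "velar: F2 starts high and falls, often pinching toward F3"
    return "unclear: look at the burst spectrum and F3 as well"

print(stop_place_guess(720, 1100))           # /bɑ/-like pattern
print(stop_place_guess(1750, 1100))          # /dɑ/-like pattern
print(stop_place_guess(2100, 1100, 2400))    # /ɡɑ/-like pattern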



These rapidly changing sections of the speech signal are integrated by people’s perceptual systems in a smooth, seamless fashion. For instance, imagine you create a synthetic syllable on a computer (“da”) and then artificially chop out just the formant frequency transitions (for example, just for the “d”). If you play this section, it won’t sound like a “d”; it will instead just sound like a click or a stick hitting a table. That is, there is not much speech value in formant frequency transitions alone. They must be fused with the neighboring steady state portion in order to sound speech-like.

Spotting the Harder Sounds

A few sounds on the spectrogram may have escaped your detection. These sounds typically include /h/, glottal stop, and tap. Here are some clues for finding them.

Aspirates, glottal stops, and taps

The phoneme /h/ has been living a life of deceit. Oh, the treachery! Technically, /h/ is considered a glottal fricative, produced by creating friction at the glottis, and it is unvoiced. This is all well and fine, except that when phoneticians actually investigated the amount of turbulence at the glottis during the production of most /h/ consonants, they discovered that there is almost no friction at the glottis for this sound.

remember.eps In other words, /h/ is scandalously misclassified. Some phoneticians view it as a signal of partial devoicing for the onset of a syllable. Others call it an aspirate, as in the diacritic for aspiration [ ʰ]. You can observe a spectrogram of /h/ in Figure 13-14. One nice thing about /h/ is that it’s very good at fleshing out any formant frequencies of nearby (flanking) vowels. They’ll run right through it.

9781118505083-fg1314.eps

Figure 13-14: The spectrograms of /h/: /hɑ/, /hi/, and /hu/.



You may now turn, with relief, to another sound made at the glottis that is much simpler: the glottal stop. This is marked by silence. Clean silence. And relatively long silence. For instance, look at “uh oh” in Figure 13-15. The silent interval of the glottal stop is relatively long.

9781118505083-fg1315.eps

Figure 13-15: The spectrograms of /ʔ/ and /ɾ/ compared with /d/ and /t/.

The glottal stop may be contrasted with the alveolar tap, /ɾ/, a very short, voiced event. In American English, this is not a phoneme that stands by itself. Rather it is an allophone of the phonemes /t/ and /d/. Contrast “a doe,” “a toe,” and “Otto” (GAE accent) in Figure 13-15. Here are some hints for spotting taps:

check.png A tap is among the shortest phonemes in English — as short as two or three pitch periods (at a male F0 of around 130 Hz, that’s only about 15 to 25 milliseconds).

check.png The English tap usually has an alveolar locus (around 1800 and 2800 Hz).

check.png There is often a mini-plosion just before the resumption of the full vowel after the tap. The mini-plosion occurs when the tongue leaves the alveolar ridge.

Cluing In on the Clinical: Displaying Key Patterns in Spectrograms

Spectrograms can be an important part of a clinician’s tool chest for understanding the speech of adult neurogenic patients, as well as children with speech disorders. Chapter 19 gives you added practice and examples useful for transcribing the speech of these individuals.

remember.eps A spectrogram can handily reveal the speech errors of communication-impaired talkers. This section gives you some examples of speech produced by individuals with error-prone speech, compared with healthy adult talkers, for reference. The first two communication-impaired talkers are monolingual speakers of GAE; the last is a speaker of British English.

check.png Female with Broca’s aphasia and AOS (Apraxia of speech)

check.png Female with ALS (Amyotrophic lateral sclerosis)

check.png Male with cerebral palsy (spastic dysarthria)

In Figure 13-16, the subject describes a story about a woman being happy because she found her wallet. The intended utterance is “And she was relieved.” There is syllable segregation: the whole phrase takes quite a long time (try it yourself; it probably won’t take you 3 seconds), and there are pauses after each syllable (seen as the white regions in the spectrogram), which you likely don’t produce either. There is no voicing in the /z/ of /wǝz/ (note the missing voice bar), and the ending of “relieved” is also affected: the final consonant comes out as a type of /f/, so the word is heard as “relief.”

Dysarthria occurs in more than 80 percent of ALS patients and may cause major disability. Loss of communication can prevent these patients from participating in many activities and can reduce the quality of life. Dysarthria is often a first symptom in ALS and can be important in diagnosis.

9781118505083-fg1316.eps

Figure 13-16: The spectrogram of an individual with BA and AOS showing syllable prolongation.

There are many ways ALS speech can show up in a spectrogram. Figure 13-17 gives one common example. Look at the syllables /bib/, /beb/, and /bæb/ produced by an individual with ALS who has moderate-to-severe dysarthria (66 percent intelligibility), compared with those of an age-matched control talker. You will notice a few things:

check.png The productions by the individual with ALS are slightly longer and more variable.

check.png Whereas the healthy talker has nice sharp bursts (viewable as pencil-like spikes going up and down the page), the productions of the ALS talker have none. This is graphic evidence of why she sounds like she does: instead of sounding like a clear /b/, the oral stops sound muted.

check.png The broadened formant bandwidths and reduced formant amplitudes suggest abnormally high nasalization.

9781118505083-fg1317.eps

Figure 13-17: The spectrograms of ALS speech (a) and healthy speech (b).



People with cerebral palsy (CP) commonly have dysarthria. The speech problems associated with CP include poor respiratory control, laryngeal and velopharyngeal dysfunction, and oral articulation disorders due to restricted movement of the orofacial muscles. You can find more information on CP and dysarthria in Chapter 19.

The next spectrograms highlight spastic dysarthria in a talker with CP. Speech problems include weakness, limited range of motion, and slowness of movement. In this spectrogram (Figure 13-18), you can see evidence of issues stemming from poor respiratory control and timing. In the first attempt at the word “actually,” the pattern shows a breathy, formant-marked vocoid (a sound made with an open oral cavity) with an /æ/-like quality, then the consonant /ʧ/, followed slightly later by a /d/-like burst. There is then an intake of air and a rapid utterance of “I actually just” in 760 ms. This time, the final /t/ isn’t realized.

9781118505083-fg1318.eps

Figure 13-18: The spectrograms of spastic dysarthria in cerebral palsy (a), compared with healthy speech (b).

If you compare this with the same phrase said (rapidly) by a control speaker, you’ll notice that the formant patterns are nevertheless relatively distinct in the spectrogram of the healthy talker, particularly the formant frequency transitions and bursts. There is formant movement into and out of the /l/, a /k/ burst in the word “actually,” and a final /t/ in “just.”

Working With the Tough Cases

Certain speaker- and environment-dependent conditions can make reading spectrograms even more difficult. These sections take a closer look at these tough cases and give you some suggestions about how to handle them.

Women and children

Tutorials on spectrogram reading generally try to make things easy by presenting clear examples from male speakers and by using citation forms of speech. There’s nothing wrong with that! Until, of course, you must analyze your first case of a child or a woman with a high fundamental frequency. At this point, you may see your first case of spectrogram failure, where formants simply won’t appear as expected. Take a look at Figure 13-19. This figure shows a man, a woman, and a 5-year-old child each saying the word “heed” (/hid/ in IPA), with fundamental frequencies of 130 Hz, 280 Hz, and 340 Hz, respectively. Notice that the formants in the spectrograms of the speech produced by the man and the woman are relatively easy to spot, while those of the young child are fuzzy (F1 and F2) or missing entirely (F3).

9781118505083-fg1319.eps

Figure 13-19: The spectrograms of /hid/ by a man, woman, and child with F0s indicated.

The reason for the decreasing clarity is a problem called spectral sketching, which arises from widely spaced harmonics in cases of high fundamental frequency. Recall that the spectrograph’s job is to find formants. It does this either with a bank of analysis filters (old school) or with newer methods such as fast Fourier transform (FFT) and linear predictive coding (LPC) algorithms. If a talker has a high voice, however, there are relatively few harmonics within any given frequency band. As a result, there isn’t much energy for the machine or program to work with. The resulting spectrum is sketchy; the system tends to resolve individual harmonics instead of the formants it should be showing.

Figure 13-20 shows a male vocal tract with a deep voice and its harmonics, compared with a child’s vocal tract and its harmonics. Figures 13-20a and 13-20b each show a snapshot of the energy taken at an instant in time. There is more acoustic information present in the male’s voice that can be used to estimate the broad (formant) peaks. In the child’s voice, however, the system can’t be sure whether the peaks represent true formants or individual harmonics. There just isn’t enough energy there.

9781118505083-fg1320.eps

Figure 13-20: A male’s (a) and a child’s vocal tracts (b) with line spectra input (below) and the results of vocal tract filtering (above).
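You can watch this trade-off for yourself by re-analyzing the same recording with two different window lengths. The Python sketch below (NumPy, SciPy, and Matplotlib again; child_heed.wav is a placeholder name for a recording of a high-pitched talker) makes a wideband display, whose short window smears harmonics together into formant bands, and a narrowband display, whose long window resolves the individual, widely spaced harmonics, which is exactly the sketchy situation described above. This doesn’t cure spectral sketching, but it shows you where it comes from.

import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile

# Placeholder name for a mono recording of a high-pitched talker.
fs, x = wavfile.read("child_heed.wav")
x = x.astype(float)

# Wideband analysis: a short (~5 ms) window gives coarse frequency resolution,
# smearing harmonics together so that broad formant bands show up.
f_w, t_w, S_w = signal.spectrogram(x, fs, nperseg=int(0.005 * fs),
                                   noverlap=int(0.004 * fs))

# Narrowband analysis: a long (~30 ms) window gives fine frequency resolution,
# resolving individual harmonics. With a high F0 those harmonics are widely
# spaced, and the formant peaks become hard to see (the "sketchy" display).
f_n, t_n, S_n = signal.spectrogram(x, fs, nperseg=int(0.030 * fs),
                                   noverlap=int(0.028 * fs))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
ax1.pcolormesh(t_w, f_w, 10 * np.log10(S_w + 1e-10), shading="gouraud", cmap="gray_r")
ax1.set(xlabel="Time (s)", ylabel="Frequency (Hz)", ylim=(0, 5000),
        title="Wideband (5-ms window): formant bands")
ax2.pcolormesh(t_n, f_n, 10 * np.log10(S_n + 1e-10), shading="gouraud", cmap="gray_r")
ax2.set(xlabel="Time (s)", ylim=(0, 5000),
        title="Narrowband (30-ms window): harmonics")
plt.tight_layout()
plt.show()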

Speech in a noisy environment

Another challenge, with applications ranging from working with the deaf to forensics to military uses, is detecting a meaningful speech signal in a noisy environment.

Noise can be defined as unwanted sound. It can be regular, such as a hum (electric lights) or buzz (refrigerator, air conditioner), or random-appearing and irregular sound (traffic sounds, cafeteria noise).



9781118505083-sb1302.eps

Lombard effect

People naturally increase the loudness of their voices when they enter a noisy room, to make themselves heard more clearly. This is called the Lombard effect (named after the French otolaryngologist Etienne Lombard). What is surprising is that people do more than simply increase their volume. They also typically raise their F0, make their vowels longer, change the tilt of their output spectrum, alter their formant frequencies, and stretch out content words (such as nouns and verbs) more than function words (such as “the,” “or,” and “a”).

Incidentally, humans aren’t alone. Animals that have been found to alter their voices in the Lombard way include budgies, cats, chickens, marmosets, cotton-top tamarins, nightingales, quail, rhesus macaques, squirrel monkeys, and zebra finches.

Cocktail party effect

The cocktail party effect is quite different from the Lombard effect (see the preceding section). It’s a matter of selective attention: how people can focus on a single conversation in a noisy room while “tuning out” all the others. People are extremely good at this — much better than machines. To test this for yourself, try recording a friend during conversation in a noisy room and later play the recording back to see whether you can understand anything. You may be surprised at how difficult it is to hear on the recording what was so easy to detect “live” and in person in the room.

Such focused attention requires processing the phase of the speech waveform, which is accomplished through binaural hearing (hearing with both ears). Chapter 2 includes information on the phase of speech waveforms. In a practical sense, some people resort to the better ear effect as a strategy, cocking one ear toward the conversation and away from the party noise.

How people attend cognitively to the incoming signal is less well understood. Early models suggested that the brain could sharply filter out certain types of information while allowing other kinds of signals through. A modification of this model suggested more gradual processing, in which even the filtered information could be accessed if it was important enough. For instance, even if you aren’t paying attention in a noisy room, when somebody mentions your name, you may hear it because this information is semantically salient to you.

Many other issues are involved in the cocktail party effect, including a principle called auditory scene analysis, in which acoustic events that are similar in frequency, intensity, and sound quality (and that follow a coherent trajectory in frequency, position, and so on) are grouped together and tracked as a single stream. This principle may also be applied to speech. For instance, in a noisy room, if you hear words on a particular topic being uttered, say the weather, other words on that same topic may be more easily detected than random words relating to something entirely different. This is because when people talk about a certain topic, the listener often knows what will come next. For instance, if I tell you “the American flag is red, white, and __,” your chances of guessing the last word, blue, are really high.

Much remains to be done to understand the cocktail party effect. This research is important for many applications, including the development of hearing aids and multi-party teleconferencing systems.