Whilst vowel quality can be described in terms of only two or three parameters – tongue height, front-back position, and rounding in articulatory terms, and F1, F2, and F3 in acoustic terms – such is not the case for consonants. The International Phonetic Alphabet (IPA) suggests that consonants can be uniquely described using the properties of voicing, place, and manner, but as we shall see, these properties do not form a continuum in the same way that vowel height or vowel front–back specifications can be considered a continuum. Although it is true that consonants such as stops and nasals can have a particular place of articulation that can be accurately described in phonetic terms, how to categorize place for sounds such as fricatives or rhotics is much more problematic. Moreover, the role of overall tongue shape in determining the spectral output for sounds such as these is quite poorly understood, to the extent that it is not at all encoded in the IPA. In addition, the similarities in terms of dorsal constriction between rhotics and laterals are likewise not captured within the IPA.
In order to cover the wide variety of sounds that are considered consonants in the world’s languages, this chapter will use the IPA as the organizing framework, working through the various (pulmonic) places and manners of articulation. It should be pointed out that, like consonant place, the category of consonant manner does not represent a continuum of sounds. Whilst it might be possible to say that a stop is lenited to a fricative-type sound at the same location, which in turn might be lenited to an approximant-type sound at this location, it is not necessarily the case, for example, that a particular fricative such as /s/ or /ʃ/ has a clear counterpart at the stop manner of articulation (as noted by Gordon 2016, the oral stops never lenite to /s/ or /ʃ/). Moreover, the articulatory and aerodynamic considerations involved in producing the different manners of articulation – nasal consonants, lateral consonants, stops, fricatives, approximants – result in acoustic outputs that require quite different measures and considerations for each of the different categories. The fact that these various manners of articulation can form coherent phonological categories is a challenge to phonetic research.
The IPA chart lists seven possible places of articulation for oral stop consonants (also known as “plosives”). A prototypical oral stop involves complete closure at some point along the oral cavity. In principle, such a closure involves a period of complete silence in a spectrogram display, although in practice a “voice bar” may be visible at the bottom of the spectrogram for at least part of the stop closure (voicing in stops will be discussed a little further later in this chapter). The exact duration of this closure depends on several factors, including prosodic factors (e.g., stops in stressed syllables have longer closures than stops in unstressed syllables), place of articulation, and the active articulator involved. For instance, an alveolar stop (/t/, /d/), which is often produced with the rapidly moving tongue tip, often has a much shorter closure duration than a bilabial (/p/, /b/) or a velar (/k/, /ɡ/). In fact, it seems to be relatively difficult to achieve a complete closure in the velar region using the slower-moving tongue back/body, to the extent that phonemically velar stops are often realized with frication (see the section on Fricatives).
The three stop places of articulation just mentioned – bilabials (/p/, /b/), alveolars (/t/, /d/), and velars (/k/, /ɡ/) – are the most common stops across the world’s languages, and are precisely the stops found in English and many other European languages. There is one respect, however, in which the bilabial stops differ from all other oral stop places of articulation, and this relates to the acoustic output. At the moment that the closure is released, the portion of the oral cavity in front of the closure becomes acoustically “excited” by the acoustic energy built up behind the closure. This is because during the oral closure, as air continues to flow from the lungs into the oral cavity, pressure builds up behind the closure; and at the moment of release, air rushes into what was the front cavity, with the resulting burst of sound capturing the acoustic characteristics of this front cavity. For all oral stops except the bilabial, the cavity in front of the closure is within the oral cavity itself. However, for the bilabial stops, the “cavity” in front of the closure is the “outside world,” and as such it is difficult for the relatively small amount of air flowing through the oral cavity to produce much acoustic energy in the large space beyond the lips. As a result, bilabials show much less acoustic energy at the moment of stop burst than do the other oral stops, and this is particularly noticeable in the higher-frequency regions. For instance, in Figure 10.1, the stop burst for /p/, which is represented by the solid line, has a noticeable drop-off in energy across the frequency range, so that above 3 kHz it has very little energy overall. (The stop burst is the burst of energy that occurs in the first few milliseconds following the release of the stop closure.)
The low-frequency energy that is seen for the bilabial comes from the cavity behind the constriction, and since in this case the entire oral and pharyngeal cavities are involved, the acoustic energy is at a very low frequency (the size of the cavity determines its resonance frequencies, with a larger cavity producing a lower-frequency resonance).
By contrast, the other stop places of articulation are very much shaped by the cavity in front of the constriction at the moment of release. Broadly speaking, this front cavity becomes smaller the more forward in the oral cavity the stop is articulated, and as a result, more high-frequency energy is typically present in the stop burst for stops that are articulated further forward in the oral cavity. Again looking at Figure 10.1, which shows the stop burst spectra for stops in four languages, it can be seen that the energy for /k/ is largely concentrated at lower frequencies than the energy for /t/. The energy for /k/ (shown as a dot-dash line) is noticeable at about 1500–2000 Hz, although for English the energy is spread across about 1000–3000 Hz. By contrast, the energy for /t/ (shown as a dotted line) is higher in the frequency range above 3 kHz, and this is particularly noticeable for Arrernte and Pitjantjatjara, where the alveolar /t/ and the “retroflex” /ʈ/ have a particularly broad spectral peak in the 2000–4000 Hz range.
Figure 10.1 also brings to light some other issues relating to “place” of articulation in consonants. As mentioned, this figure shows the alveolar and retroflex stops of the Australian languages Arrernte and Pitjantjatjara (the retroflex stops are shown as light blue lines). In Australian languages, both the alveolar and the retroflex are articulated with the tongue tip. In prototypical terms, the alveolar is produced as a “plateau” articulation – the tongue tip moves upwards to contact the alveolar ridge, and it releases at more or less the same point as initial contact. By contrast, the prototypical retroflex articulation involves a closure in the post-alveolar or pre-palatal region of the oral cavity, followed by a ballistic forward movement of the tongue that results in the closure being released towards the alveolar region. This highlights the fact that “retroflex” is not actually a place of articulation, but in a sense a style of articulation that relates to the direction of travel of the front part of the tongue. It is interesting that the IPA chart lists /t d/ as spanning the dental, alveolar and post-alveolar places of articulation, with various diacritics used to separate out these three places (a bridge for the dental /t̪/, and a minus sign below for the post-alveolar /t̠/, meaning retracted). There is in fact a tremendous amount of overlap between the alveolar and retroflex phonemic categories in Australian languages (Tabain, 2009), and discussion of this minimal phonemic contrast is beyond the scope of this chapter. However, it can be noted that some of the articulations might best be classed as the retracted /t̠/, since the point of contact is post-alveolar, but the ballistic forward movement of the tongue is not so apparent.
It may have been inferred from the preceding discussion that the prototypical alveolar and prototypical retroflex stops are very similar at the moment of stop burst release, since both are released at the alveolar region. However, there are some subtle differences between the two stop bursts, which suggest that the retroflex is not released quite as far forward as the alveolar. In general, the concentration of energy for the retroflex is at ever so slightly lower frequencies than for the alveolar, and this is most consistently reflected in differences at the right edge of the broad spectral peak – in Figure 10.1, it can be seen that for Pitjantjatjara, the /t/ has more energy than the /ʈ/ above about 3200 Hz, whereas for Arrernte this cross-over occurs at around 4000 Hz (for these particular speakers). The slightly greater higher-frequency energy for /t/ suggests that the cavity in front of the constriction release is slightly smaller for the alveolar than for the retroflex. It should, however, be noted that these apical consonants (i.e., made with the tongue tip) in the Australian languages have a lot more high-frequency energy than the English and Makasar alveolar stops. The exact reasons for these differences would require some careful articulatory-to-acoustic study, but the differences themselves are a reminder that each language must be treated in its own right when a particular symbol from the IPA is used to represent the sound in that language.
Figure 10.1 also shows the stop burst spectrum for the dental stop in Arrernte. Arrernte contrasts four coronal stops (coronal means sounds made with the tongue tip or blade), which can be divided into apicals (made with the tongue tip) and laminals (made with the tongue blade, which is just behind the tongue tip). The apicals are the alveolar and retroflex, just discussed, and the laminals are the palatal (discussed later in the chapter) and the dental. The Arrernte dental has very little contact between the tongue and the palate – the contact is primarily between the tongue and the upper front teeth. The tongue body is typically quite low, since contact at the teeth involves a relatively low point within the oral cavity. It can be seen that the stop burst spectrum for dental /t̪/ in Arrernte is really quite flat compared to the alveolar and retroflex spectra. Such a flat spectrum with a very broad spread of energy is typical of this sound. Part of the explanation for this broad spread of energy lies in the fact that there may not be a cavity as such in front of the constriction at the moment of release for the dental (perhaps comparable to the case of the bilabial). However, a more important consideration is the jaw movement related to these sounds. For the alveolar and for the retroflex, the jaw remains quite high at the moment of stop burst release, whereas for the dental this is not the case, with the jaw lowering before stop burst release in anticipation of the following vowel (Tabain, 2012). What is relevant here is that the lower teeth move together with the jaw, and the lower teeth can represent a significant obstacle to airflow at the moment of stop burst release. Such an obstacle serves to enhance the amplitude of any spectral peaks, and if this secondary obstacle is not present, the sound may not have as much energy. 
In fact, the presence of the teeth is what gives sibilant sounds such as /s/ and /ʃ/ (discussed later in the chapter) their characteristic noise. In the case of the stop bursts, if the lower teeth are in the path of the airstream at stop burst, the amplitude of spectral peaks is increased; this is the case for the alveolar and for the retroflex. However, if the lower teeth are not in the path of the airstream, there is no overall increase in amplitude – this is the case with the dental, where the tongue tip is in contact with the upper teeth (it is presumably somewhat difficult to channel the air from this point at the upper teeth towards the lower teeth, at least for most speakers).
The other stop burst that can be seen in Figure 10.1 is for the palatal stop /c/ – or to be more precise, the alveo-palatal. The symbols that are used in the IPA – /c/ and /ɟ/ – are in fact ambiguous, and may represent either a “proper” palatal sound that may be heard by native English speakers as /kj/ (that is, articulated just in front of the soft-palate/hard-palate juncture), or an alveo-palatal sound that can be heard by English speakers as /tj/. A typical characteristic of the alveo-palatal stop is that it is strongly affricated, and that is certainly the case for the languages shown in Figure 10.1. This is one reason why many researchers prefer to use the symbol /ʨ/ to denote the alveo-palatal stop. Indeed, English speakers may often hear this affricated /c/ as /tʃ/ (although there are important spectral differences as will be seen for /ʃ/). There is some variety in the /c/ spectra shown in Figure 10.1, but in all cases a spectral prominence can be seen at around 4000 Hz (about 4000–6000 Hz for Makasar, and about 3000–5000 Hz for Arrernte and Pitjantjatjara). Importantly, the balance of spectral energy is shifted upwards for the alveo-palatal – it will be seen that the stop burst for this sound has less energy in the spectral range below 3000 Hz than any of the other stops. It can be said that the palatal stop has a higher spectral “center of gravity” than the other stops, while the bilabial has a relatively low spectral center of gravity. It should be noted that the alveo-palatal stop, like the alveolar and retroflex, has a high jaw position at stop burst. This means that the high-frequency energy is enhanced by the teeth serving as an obstacle to the airflow.
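The notion of a spectral “center of gravity” can be made concrete as the power-weighted mean frequency of a spectrum. The following is a minimal sketch only: the function name and the two spectra are hypothetical, made-up values chosen to mimic a falling bilabial-like burst and an alveo-palatal-like burst with a prominence near 4 kHz, and they are not measurements from Figure 10.1. The power values are assumed to be linear (not dB).

```python
def spectral_center_of_gravity(freqs_hz, power):
    """Return the power-weighted mean frequency (Hz) of a spectrum.

    freqs_hz: frequency of each spectral bin in Hz
    power:    linear (not dB) power in each bin, same length as freqs_hz
    """
    if len(freqs_hz) != len(power):
        raise ValueError("freqs_hz and power must have the same length")
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    return sum(f * p for f, p in zip(freqs_hz, power)) / total


# Hypothetical burst spectra on a coarse 1 kHz grid (arbitrary power units).
freqs = [500, 1500, 2500, 3500, 4500, 5500, 6500]
bilabial_like = [10.0, 6.0, 3.0, 1.0, 0.5, 0.3, 0.2]      # energy falls off with frequency
alveopalatal_like = [0.5, 0.5, 1.0, 8.0, 8.0, 2.0, 0.5]   # prominence around 4 kHz

cog_bilabial = spectral_center_of_gravity(freqs, bilabial_like)
cog_alveopalatal = spectral_center_of_gravity(freqs, alveopalatal_like)
print(f"bilabial-like:      {cog_bilabial:.0f} Hz")
print(f"alveo-palatal-like: {cog_alveopalatal:.0f} Hz")
```

With real data, the bins would come from a spectrum computed over the first few milliseconds after the release, and the result depends on the analysis window and frequency range chosen.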
It is important to note that the concentration of energy is at higher frequencies for the alveo-palatal than it is for the alveolar, even though in principle one might expect that the smaller cavity in front of the alveolar closure would result in a higher-frequency spectral burst for the alveolar than for the alveo-palatal. A further discussion of this issue is beyond the scope of this chapter, but it suffices to say that the shape of the tongue at the moment of release, as well as shape characteristics of the cavity in front of the closure, and properties of the cavity behind the constriction, all contribute to the final spectral output. It would be fair to say that a better understanding of all of these contributing factors is much needed for the variety of sounds in the world’s languages.
We now have only two stop “places” of articulation to consider on the IPA chart. The first is the uvular stops /q ɢ/, which are produced further back in the oral cavity than the velars. In languages that contrast velars and uvulars, it is typically the case that the uvulars induce a more back vowel quality on adjacent vowels; thus, a sequence /qa/ is more likely to sound like [qɑ]. The other stop – the glottal /ʔ/ – is not really oral, in the sense that it does not involve a closure in the oral cavity. This laryngeal consonant typically involves glottalization of adjacent vowels, and only prototypical glottal stop articulations involve a period of complete silence on a spectrogram.
Thus far, we have simply considered the stop burst as a cue to stop place of articulation, and in a sense, it is the presence of the burst that identifies the stop as being a stop. However, the actual place identity of the stop is cued by several aspects of the articulation, including the duration of the closure, the duration of the burst, and the formant transitions into and out of the stop itself.
The main formant to cue consonant place of articulation is F2, although F3 and perhaps even F4 may play a role in certain instances (Ladefoged and Maddieson, 1996). F1 is not usually considered a cue to stop place of articulation, although it is certainly a cue to stop manner (though see next paragraph regarding cues to post-velar places of articulation). F1 is typically falling into a stop, and typically rising out of a stop (this includes nasal stops, as shown in Figure 10.2). This is because, theoretically, F1 is at zero for a stop consonant, since the stop closure is an extreme case of a constriction in the oral cavity. In this case, the stop closure forms a constriction for a Helmholtz resonance – that is, the constriction can be considered the “neck” of a bottle-like resonator, and the cavity behind the constriction is the “bottle” itself. Since the cavity behind the constriction is quite large, it becomes the lowest resonance of the system, and hence is associated with F1 in an adjacent vowel. In a Helmholtz resonance, the narrower the opening at the constriction, the lower the resonance frequency – at the point where the constriction is completely closed and no opening remains, the resonance frequency is zero. This is why F1 is typically falling into and rising out of a stop consonant.
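The fall of F1 towards zero can be illustrated with the textbook lumped formula for an idealized Helmholtz resonator, f = (c / 2π) · √(A / (V · l)), where A is the cross-sectional area of the neck, l its length, and V the volume of the cavity behind it. The sketch below is an illustration under stated assumptions, not a vocal tract model: the speed of sound (35,000 cm/s), the back-cavity volume, and the neck length are all assumed or hypothetical values, and end corrections are ignored.

```python
import math

SPEED_OF_SOUND_CM_S = 35000.0  # assumed value for warm, moist air


def helmholtz_hz(neck_area_cm2, neck_length_cm, volume_cm3, c=SPEED_OF_SOUND_CM_S):
    """Resonance frequency (Hz) of an idealized Helmholtz resonator."""
    if neck_area_cm2 < 0 or neck_length_cm <= 0 or volume_cm3 <= 0:
        raise ValueError("area must be >= 0; length and volume must be > 0")
    return (c / (2.0 * math.pi)) * math.sqrt(
        neck_area_cm2 / (volume_cm3 * neck_length_cm)
    )


BACK_CAVITY_CM3 = 60.0  # hypothetical volume behind the constriction
NECK_LENGTH_CM = 1.0    # hypothetical constriction length

# Narrow the constriction step by step, ending in complete closure (area 0).
areas_cm2 = [1.0, 0.5, 0.2, 0.05, 0.0]
f1_track = [helmholtz_hz(a, NECK_LENGTH_CM, BACK_CAVITY_CM3) for a in areas_cm2]

for area, f1 in zip(areas_cm2, f1_track):
    print(f"opening {area:4.2f} cm^2 -> resonance {f1:6.1f} Hz")
```

The same formula also shows that, for a fixed opening, a longer constriction gives a lower resonance.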
However, other consonant effects on F1 relating to coarticulation must not be discounted. Certain consonants result in modifications to F1 at the midpoint of an adjacent vowel. For example, alveo-palatal stops require a relatively high jaw position due to the involvement of a large part of the tongue body in their articulation (the tongue body and the jaw are considered coupled articulators, cf. Keating et al., 1994). In a sequence such as /ca/, the tongue and jaw may therefore not have time to reach the following low vowel target, and as a result, the sequence may (assuming no phonemic vowel constraints in the language) sound more like [ce] – that is, with a lower F1 than might be expected for an /a/ vowel. This is certainly the case following the alveo-palatal consonants in Arrernte. It was also mentioned that uvular consonants often result in a more back vowel articulation than would be expected for a given vowel phoneme – in this case, the effect of the consonant is on F2 of the vowel (i.e., a lower F2 adjacent to uvular consonants). However, post-velar consonants such as uvulars and pharyngeals also have a higher F1 compared to more anterior consonants – if F1 is treated as a Helmholtz resonance, then the shorter cavity behind the consonant constriction results in higher F1 values, and thus a lower vowel. Notably, F1 tends to be higher for pharyngeals than for uvulars (Alwan, 1989; Sylak-Glassman, 2014).
Nevertheless, as already mentioned, by far the most important cues to consonant place of articulation lie in the formant transitions into and out of the consonant itself. These transitions capture the changing shapes of the cavities in front of and behind the constriction as the articulators move from the consonant into the vowel (or the vowel into the consonant). It must be stressed that while the stop burst is defined primarily by the cavity in front of the constriction, the formant transitions are much more appropriately modeled as being vowel-like, and as such, both the cavity behind and the cavity in front of the constriction are relevant. As already noted, F1 is associated with the constriction itself coupled with the back cavity. F2 and F3 are associated with the cavities behind and in front of the constriction. The cavity behind the constriction is modeled as a half-wavelength resonance – that is, the resonance is a function of twice the length of the cavity. A half-wavelength resonance is a resonance that is closed at both ends, and in this case, the resonance is deemed to be “closed” at both the glottis and the constriction. By contrast, the cavity in front of the constriction is modeled as a quarter-wavelength resonance – that is, the resonance is a function of four times the length of the cavity. A quarter-wavelength resonance is a resonance that is closed at one end, and open at the other. In this case, the resonance is deemed to be “closed” at the constriction, but open at the lips. F2 is therefore assigned to whichever cavity has the lower resonance, and typically this is the back cavity; F3 is then assigned to the cavity that has the higher resonance, which is typically the front cavity.
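In symbols, the two idealized cavity resonances just described can be written as follows. The notation is an assumption introduced here for illustration: c is the speed of sound, L_back and L_front are the cavity lengths behind and in front of the constriction, and n numbers the resonances of each cavity. The formulas are those of uniform lossless tubes, so they ignore constriction length and wall losses.

```latex
F_{\text{back}}(n) = \frac{n\,c}{2\,L_{\text{back}}}, \qquad
F_{\text{front}}(n) = \frac{(2n-1)\,c}{4\,L_{\text{front}}}, \qquad n = 1, 2, 3, \dots
```

Taking c as roughly 35,000 cm/s, a hypothetical front cavity of 2 cm would have its lowest resonance at 35,000 / (4 × 2) = 4375 Hz, and a hypothetical back cavity of 15 cm would have its lowest resonance at 35,000 / (2 × 15) ≈ 1167 Hz.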
In the case of velar consonants, F2 and F3 are theoretically very close together. This is because the velar constriction often occurs at around two-thirds of the length of the vocal tract, as measured from the glottis to the lips. Since the cavity behind the velar constriction is in this case twice as long as the cavity in front of the constriction, the half-wavelength resonance behind the constriction is at theoretically the same frequency as the quarter-wavelength resonance in front of the constriction. In practice, however, the location of the velar constriction depends very much on the adjacent vowel context, with the constriction being further forward for front vowels and further back for back vowels. Nevertheless, a “velar pinch” is sometimes seen, where F2 and F3 come together leading into or out of a velar consonant (Johnson, 2012). It should also be noted that F2 and F3 can be very close together for palatal consonants, as well.
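The coincidence of the two resonances at the two-thirds point can be checked numerically. The sketch below is a simplification under stated assumptions: a 17 cm vocal tract and a speed of sound of 35,000 cm/s are assumed values, the constriction is treated as having no length, and the function names are hypothetical. It assigns F2 to whichever cavity has the lower resonance, as described above.

```python
import math

SPEED_OF_SOUND_CM_S = 35000.0  # assumed value
TRACT_LENGTH_CM = 17.0         # assumed glottis-to-lips length


def back_cavity_hz(length_cm, c=SPEED_OF_SOUND_CM_S):
    """Lowest half-wavelength resonance (cavity closed at both ends)."""
    return c / (2.0 * length_cm)


def front_cavity_hz(length_cm, c=SPEED_OF_SOUND_CM_S):
    """Lowest quarter-wavelength resonance (closed at constriction, open at lips)."""
    return c / (4.0 * length_cm)


def f2_f3_affiliation(constriction_from_glottis_cm, tract_length_cm=TRACT_LENGTH_CM):
    """Assign F2 to the cavity with the lower resonance and F3 to the other."""
    if not 0.0 < constriction_from_glottis_cm < tract_length_cm:
        raise ValueError("constriction must lie strictly inside the tract")
    back = back_cavity_hz(constriction_from_glottis_cm)
    front = front_cavity_hz(tract_length_cm - constriction_from_glottis_cm)
    if back <= front:
        return {"F2": ("back", back), "F3": ("front", front)}
    return {"F2": ("front", front), "F3": ("back", back)}


pinch_cm = TRACT_LENGTH_CM * 2.0 / 3.0
pinch_back = back_cavity_hz(pinch_cm)
pinch_front = front_cavity_hz(TRACT_LENGTH_CM - pinch_cm)
print(f"two-thirds point: back {pinch_back:.0f} Hz, front {pinch_front:.0f} Hz")

# Hypothetical constriction locations, measured from the glottis.
for label, position_cm in [("further back", 9.0), ("further forward", 15.0)]:
    result = f2_f3_affiliation(position_cm)
    print(label, {name: (cav, round(hz)) for name, (cav, hz) in result.items()})
```

Because real constrictions have length and the cavities are not uniform tubes, these figures indicate the direction of the effect rather than predicting measured formants.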
When the consonant is further forward than this velar/palatal “two-thirds point,” F2 is affiliated with the back cavity, and F3 with the front cavity (i.e., for the coronal consonants). When the consonant is further back than this velar “two-thirds point,” F2 is affiliated with the front cavity, and F3 with the back cavity (i.e., for uvular consonants). For an alveolar, for example, F2 is a back cavity resonance, even though the alveolar burst is associated with the front cavity. The alveolar “locus” for F2 is typically around 2 kHz for adult female speakers (a bit lower for male speakers, and higher for children). However, it is important to note that the relative size of the front and back cavities for a given speaker depends not just on the location of the consonant constriction, but also on constriction length. For instance, alveolar and dental consonants have relatively similar constriction locations, but if one is an apical articulation (i.e., it uses only the smallest front portion of the tongue) and the other is a laminal articulation (i.e., it uses a larger portion of the front-most part of the tongue), the relative sizes of the front and back cavities may be affected. In Australian languages, alveolar and dental consonants tend to have similar F2 values. However, dentals tend to have a higher F3 value. Since dentals are laminal consonants, it is possible that the greater constriction length for the dental results in F2 (the back cavity resonance) being similar to the alveolar consonants, but F3 (the front cavity resonance) being higher than for the alveolar. F1 for laminal articulations also tends to be lower than for apical articulations, and this is likely also related to the greater constriction length for laminals, since the F1 Helmholtz resonance is lower when the constriction is longer.
(I should stress at this point that for the sake of simplicity, I am using the tubes and resonators approach to modeling speech production, whereas in some instances an approach that uses perturbation theory [Chiba and Kajiyama, 1941] may be more appropriate. The reader is referred to Johnson (2012) for further discussion of perturbation theory.)
The question of laminal versus apical articulations raises another issue that concerns formant transitions, and that is transition duration. All else being equal, apical articulations reach their vowel target more quickly than do laminal articulations, so that transitions into and out of apical articulations tend to be quite short, while transitions into and out of laminal articulations (and also into and out of velars) tend to be a bit longer (cf. Dart, 1991). This is because the tongue tip on its own is quite a rapidly moving articulator; in addition, a laminal articulation typically recruits a large part of the tongue body, and the release of the longer (in length) constriction may take longer (in duration). This observation is particularly relevant for palatal consonants (which are never apical articulations!), where the transitions into and out of the vowel tend to be quite long. Palatals have a very high F2 and also F3 (which as already mentioned may be quite close together). This is because the cavity behind the constriction (F2) is relatively short compared to the cavity behind the constriction for an alveolar. At the same time however, due to the large amount of tongue-palate contact for palatals, the front cavity (F3) is also relatively small. Due to the involvement of large portions of the tongue, and also to the relatively high point of contact within the oral cavity, palatal transitions tend to be very long and very noticeable on a spectrogram, as they move from the high F2 locus for the consonant (often above 2.5 kHz even for adult male speakers) to the F2 for the vowel target (see the palatal nasal consonant in Figure 10.2 for an example).
As a point of contrast, we might consider the retroflex consonants. These are consonants that are articulated with the tongue tip, and so involve a relatively thin band of contact between the tongue and the palate. As discussed earlier, initial contact for the retroflex is further back than for the alveolar consonants. Since the cavity in front of the retroflex constriction is longer for the retroflex than for an alveolar articulated with the same active articulator, F3 is lower for the retroflex than for the alveolar at the onset of contact. Indeed, a low F3 is considered a defining characteristic of a retroflex articulation. The important point to note, however, is that F3 is noticeably lower for the retroflex than for the alveolar only for the transition into the consonant. It is not always the case that F3 is lower for the retroflex than for the alveolar for the transition out of the consonant, since as already mentioned, the ballistic forward movement of the tongue that is seen during the closure for a prototypical retroflex results in the retroflex consonant being released very close to the alveolar zone. Differences in F2 between alveolars and retroflexes are often non-significant, and this may be due to a slightly different tongue configuration at the point of contact, resulting in the back cavity having relatively similar proportions for both of these consonant sounds.
Finally, we consider the bilabial consonant transitions (the glottal stop does not have a transition as such, since it does not involve a constriction at any point along the oral cavity). Bilabials are particular in that a lip constriction means that the tongue is free to articulate any adjacent vowels without being biomechanically constrained by the articulation for the consonant. This means that bilabial formant transitions do not tend to have very long durations. The theoretical F2 and F3 locus for bilabial consonants is quite low. This can be explained if one considers that at the moment of bilabial closure or release, the oral cavity forms one long tube, with the formants arising at multiples of the fundamental tube resonance. This tube can be modeled as a tube open at one end, and so as a quarter-wavelength resonance, although it is also possible that at the point of closure and release, the tube is more appropriately modeled as a tube closed at both ends, and so a half-wavelength resonance. The F2 locus for a bilabial consonant is typically around 1 kHz for an adult male speaker.
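The two single-tube models mentioned here give different resonance series, which a short calculation makes explicit. The sketch assumes a uniform 17 cm tube and a speed of sound of 35,000 cm/s; both values and the function names are assumptions for illustration, and a real vocal tract at bilabial closure is neither uniform nor lossless.

```python
import math

SPEED_OF_SOUND_CM_S = 35000.0  # assumed value
TRACT_LENGTH_CM = 17.0         # assumed glottis-to-lips length


def open_closed_resonances(length_cm, n_modes=3, c=SPEED_OF_SOUND_CM_S):
    """Quarter-wavelength series: odd multiples of c / (4 * length)."""
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, n_modes + 1)]


def closed_closed_resonances(length_cm, n_modes=3, c=SPEED_OF_SOUND_CM_S):
    """Half-wavelength series: integer multiples of c / (2 * length)."""
    return [n * c / (2.0 * length_cm) for n in range(1, n_modes + 1)]


quarter_wave = open_closed_resonances(TRACT_LENGTH_CM)
half_wave = closed_closed_resonances(TRACT_LENGTH_CM)
print("open at lips:     ", [round(f) for f in quarter_wave])
print("closed at lips:   ", [round(f) for f in half_wave])
```

Under these assumed values, the lowest non-zero resonance of the closed-closed tube falls close to 1 kHz, which is in the region of the bilabial F2 locus cited above; whether that model is the right one for a given token is an empirical question.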
We turn now to the nasal consonants, or more properly, the nasal stop consonants. Nasal consonants are often conceived of as being the exact place-of-articulation counterparts of the oral stop consonants, and there are many phonetic and phonological reasons for this conception – for instance, in many languages, the only consonant clusters that are allowed are nasal + stop clusters with the same place of articulation (i.e., homorganic clusters), and indeed the presence of a nasal in the phoneme inventory implies a stop at the same place of articulation (Gordon, 2016). Nasal stops and oral stops both involve a complete closure within the oral cavity, but nasal stops differ from oral stops in that the primary airflow is in principle through the nasal cavity: The velum is lowered in order to allow for this airflow to happen. For this reason, there are no nasal consonants that are articulated further back than the uvular place of articulation. Indeed, the pharyngeal and glottal places of articulation are blocked out on the IPA chart, in order to indicate that if air is flowing through the nasal cavity, there cannot be a complete constriction at the pharyngeal or glottal places of articulation.
The manner class of nasals is primarily characterized by a noticeable reduction in spectral energy (Dang, Honda, and Suzuki, 1994; Dang and Honda, 1996). This loss in acoustic energy arises from two main sources: from losses within the nasal cavity itself, and from losses arising from the interaction between the oral and nasal cavities. We first consider the losses within the nasal cavity. Since in theory the primary airflow for nasal stops is through the nasal cavity, it is generally thought that energy is lost via the four pairs of para-nasal sinuses (maxillary, sphenoidal, frontal and ethmoidal). The para-nasal sinuses are very small side-branches from the main nasal cavity, with little openings (ostia) that result in the sinuses acting as Helmholtz (i.e., bottle-like) resonators that “trap” acoustic energy from the main nasal airflow. This means that the resonant frequencies at which the various para-nasal sinuses vibrate are removed from the spectral output – that is, the sinuses introduce anti-resonances (in signal processing terms these anti-resonances are often called “zeros”) into the spectral output. The larger sinuses have Helmholtz resonances of around 500 Hz, while the smaller sinuses have Helmholtz resonances of around 1500 Hz, or above 3000 Hz. The exact values of the sinus resonances vary dramatically from speaker to speaker, and are thought to contribute to individual speaker characteristics. Empirical studies also find zeros in the nasal spectrum well above 5 kHz, resulting in a heavily dampened spectrum above this frequency range. The important point to note from the phonetic point of view is that energy is lost across a wide range of frequency values, purely due to the presence of these sinuses.
In addition, the presence of nasal hairs and mucus can also contribute to losses in the spectral output of nasal consonants (damping of acoustic energy), as can the presence of any asymmetries between the two nasal pathways as the airflow approaches the nares.
The second source of energy loss in nasals arises from acoustic energy being trapped within the oral cavity while the main airflow is (theoretically) through the nasal cavity. In this case, the exact frequency of the anti-resonance is dependent upon place of articulation and on the length of the constriction. Since the main path of the airflow is through the pharyngeal cavity and the nasal cavity, the pocket of air behind the constriction in the oral cavity forms a side-branch to the main airflow, just as the sinuses do within the nasal cavity. For a bilabial nasal, the side-branch is comparatively long, and estimates of the anti-resonance arising from this oral side-branch tend to be around 1000 Hz for an adult male (based on treating the side-branch as a quarter-wavelength resonator of about 8–9 cm, closed at the bilabial constriction). This anti-resonance becomes progressively higher in frequency as the constriction becomes further back (estimates of the anti-resonance frequency for velar /ŋ/ tend to be around 3500 Hz). At the point of a uvular constriction for the nasal /ɴ/, the side-branch theoretically disappears, as the air flows along the pharyngeal and nasal cavities only. However, it must be stressed that the size of the oral cavity side-branch depends not just on the passive place-of-articulation, but on the active articulator as well – for instance, in Australian languages, where the dental is a laminal articulation and the alveolar is an apical articulation, the anti-resonance for the dental may be higher than the anti-resonance for the alveolar, since the more extensive tongue–palate contact for the dental may result in a comparatively smaller back cavity. In addition, for some speakers, a retroflex nasal may have a higher anti-resonance than a palatal nasal, if for that speaker the tongue tip contacts the roof of the mouth at a point further back than the contact for the palatal.
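The quarter-wavelength estimate for the oral side-branch can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the speed of sound (35,000 cm/s) is an assumed value, the function names are hypothetical, and the side-branch is treated as a uniform tube closed at the oral constriction. A quarter-wavelength resonator also resonates at odd multiples of its lowest frequency, so the function returns a short series of zeros rather than a single value.

```python
import math

SPEED_OF_SOUND_CM_S = 35000.0  # assumed value


def side_branch_zeros(branch_length_cm, n_zeros=3, c=SPEED_OF_SOUND_CM_S):
    """Anti-resonances (Hz) of a uniform side-branch closed at its far end."""
    if branch_length_cm <= 0:
        raise ValueError("branch length must be positive")
    return [(2 * k - 1) * c / (4.0 * branch_length_cm) for k in range(1, n_zeros + 1)]


def branch_length_for_zero(first_zero_hz, c=SPEED_OF_SOUND_CM_S):
    """Side-branch length (cm) implied by a given lowest anti-resonance."""
    if first_zero_hz <= 0:
        raise ValueError("frequency must be positive")
    return c / (4.0 * first_zero_hz)


# Bilabial nasal: side-branch of about 8-9 cm, as in the estimate quoted above.
for length_cm in (8.0, 8.5, 9.0):
    zeros = side_branch_zeros(length_cm)
    print(f"{length_cm:.1f} cm side-branch -> zeros near", [round(z) for z in zeros])

# Length implied by a lowest zero of about 3500 Hz (the velar estimate above).
velar_branch_cm = branch_length_for_zero(3500.0)
print(f"3500 Hz zero -> side-branch of about {velar_branch_cm:.1f} cm")
```

The shorter the side-branch, the higher the lowest zero, which is the pattern described above as the constriction moves further back.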
The combined effect of all of these losses in acoustic energy is evident in Figure 10.2, which shows an example of a palatal nasal /ɲ/ followed by the low vowel /a:/ in the Pitjantjatjara word nyaa (“what?”). Since the nasal is word-initial, there seems to be much less energy for the first half of the consonant than for the second half, especially in the higher frequencies. Overall, it can be seen that there is much less energy for the nasal than there is for the following vowel, at all frequency ranges. Some energy is visible from 0 to 1 kHz, and from 2 to 3 kHz. However, there are large gaps in the spectrum between about 1 kHz and 2 kHz, and above 4 kHz, which represent the combined effects of the oral and sinus anti-resonances (note that the anti-resonance from the oral side-branch also has harmonics at odd multiples, so that for instance, an anti-resonance of 1000 Hz has another anti-resonance at 3000 Hz).
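The odd-multiple pattern noted in parentheses above can be made concrete with a few lines of code. This is only a sketch of the arithmetic for an idealized side-branch; the function name is hypothetical.

```python
from typing import List


def odd_multiples(base_hz: float, count: int = 3) -> List[float]:
    """Return the first `count` odd multiples (1x, 3x, 5x, ...) of a base frequency."""
    if base_hz <= 0 or count < 1:
        raise ValueError("base_hz must be positive and count at least 1")
    return [base_hz * (2 * k + 1) for k in range(count)]


# An oral side-branch anti-resonance at 1000 Hz recurs at 3000 Hz, 5000 Hz, ...
print(odd_multiples(1000.0))  # [1000.0, 3000.0, 5000.0]
```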
Two other points can be noted based on Figure 10.2. The first is that F1 is clearly rising in the transition from nasal to vowel. This is because at the moment of nasal stop release, the tongue is still in the configuration for an oral stop consonant, and the normal F1 transition that is seen for oral stops is also usually seen for nasal stops (assuming that the velopharyngeal port has been sufficiently closed to prevent nasal airflow, as appears to be the case in this spectrogram). At the same time in this spectrogram, F2 is clearly falling, from about 2500 Hz at the nasal-vowel boundary, to about 1500 Hz for the vowel. F3 falls from about 3500 Hz to about 3000 Hz. These are typical transitions for palatal consonants of all different manners of articulation.
As a side note, Australian languages such as Pitjantjatjara seem to show minimal nasalization of segments surrounding nasal consonants. This is thought to be so in order to maintain strong formant cues to consonant place of articulation, given that nasalization of vowels significantly alters vowel formant frequencies and bandwidths (nasalization of vowels is particularly common cross-linguistically when the nasal is in syllable coda position, making it very difficult to find boundaries between the vowel and the following nasal in the spectrogram). Altered formant cues may compromise access to place of articulation information in Australian languages, which have up to six places of articulation in oral stops, nasals and laterals. It is therefore important to note that despite the presence of the nasal “murmur,” the main cues to place of articulation for nasals still lie in the formant transitions into and out of adjacent segments. To a certain extent, it could be argued that the presence of the nasal murmur simply serves to cue nasal manner itself, since it is very hard to distinguish nasal place of articulation based on the nasal murmur alone (Harrington, 1994; see also Malecot, 1956; Nakata, 1959; Fujimura, 1962; Recasens, 1983; Huffman and Krakow, 1993). Indeed, due to the low spectral energy of the nasal murmur, these sounds are all the more difficult to distinguish in noise, since any cues to place of articulation from the oral anti-resonance are compromised by the presence of background noise (because the anti-resonance cue to place of articulation is actually a spectral minimum, not a peak, and so is masked by background noise).
Thus far, we have focused on the anti-resonances that define the nasal consonants. This is because there is perhaps greater agreement among researchers as to exactly how these anti-resonances arise (see Fant, 1970, and Stevens, 1998). However, nasal consonants are also defined by their resonances, and exactly how these arise is not quite so clear – a lack of cross-language research on nasal consonants certainly doesn’t help matters. In terms of spectral energy, nasal consonants often have a very low first formant, around 200–400 Hz for an adult male speaker and a little higher for an adult female speaker (such a low first nasal formant – usually labeled N1 – can be seen in Figure 10.2 for the female speaker). Theoretically, a second nasal formant may be seen at about 1000 Hz, with a third nasal formant at about 2000–2500 Hz. Nasal formants are characterized by very wide bandwidths. For the low first nasal formant, the bandwidth is about 100 Hz. For the higher formants, the bandwidth values tend around 300 Hz.
Some researchers consider that the first formant is a Helmholtz resonance formed by the pharyngeal cavity and by the constriction at the velopharyngeal port; the wide bandwidth is a result of the losses that arise within the nasal cavity. As a result of the low frequency of N1, it is often difficult to distinguish this first nasal formant from the fundamental frequency, with the two combining to form a significant spectral prominence in the low-frequency region. The second nasal formant (and its odd multiple, which may become the fourth formant) may arise as the natural resonance of the nasal cavity itself, while the third formant may arise as the natural resonance of the pharyngeal cavity (see Stevens, 1998; Fant, 1970; Tabain, Butcher, Breen, and Beare, 2016b). In contrast, other researchers consider that the entire nasal and pharyngeal tract forms one long resonating cavity, with no significant constriction at the velopharyngeal port. In this case, the nasal resonances are simply odd multiples of the quarter-wavelength resonance that extends from the glottis to the nares, albeit with the oral and sinus side-branches producing anti-resonances.
All of these modeling approaches to nasal consonants, however, fail to account for results from several empirical studies (on languages as diverse as the European languages Catalan, Polish, Czech, German, Hungarian, Russian and Swedish, and the Australian languages Yanyuwa, Yindjibarndi, Arrernte, Pitjantjatjara and Warlpiri) that find significant differences in the nasal formants based on place of articulation. The first nasal formant, N1, is found to be significantly lower for bilabials and significantly higher for velars, with the coronal consonants in between. Moreover, the bilabial nasal has the lowest N2, N3, and N4, and these higher nasal formants have similarly low values for the velar nasal. Given that acoustic models of nasals do not consider the possibility that place of articulation within the oral cavity has any influence on the spectral output other than via the introduction of anti-formants, these consistent empirical results are surprising, and suggest that oral resonances (and not just anti-resonances) play an important role in the output of nasal consonants. Interestingly, at least for the Australian languages, the palatal nasal /ɲ/ has a low N1 (like the bilabial /m/), but very high N2 and N3 (note that this low N1 does not seem to be replicated in European language data, where the palatal N1 instead patterns between the alveolar and the velar). As we saw earlier, a low F1, and a high F2 and F3, are important cues to palatal place of articulation for the vowel formant transition, so the very high N2 and N3 in the nasal murmur suggest a significant contribution from the oral resonances to the nasal consonant output. There are also significant differences in bandwidth values between the various nasal places of articulation – for instance, the velar has by far the greatest bandwidth for N1, whereas the palatal has the greatest bandwidth for N2, N3, and N4.
So what might be some of the sources for these various differences in the places of articulation for nasals? It is certainly true that the presence of anti-resonances in the spectrum affects any nearby resonances, shifting them slightly in frequency, and broadening their bandwidths. However, this cannot account for all of the differences observed. One possibility for palatal nasals is that the proximity of the tongue body and the lowered uvula sets up a special coupling relationship between the oral and pharyngeal cavities. By contrast, for anterior coronal sounds such as the dental /n̪/, the alveolar /n/ and the retroflex /ɳ/, the back of the tongue may form a Helmholtz-like constriction with the lowered uvula, likewise leading to a greater contribution from the oral formants. The bilabial /m/, by contrast, should not involve any proximity between the tongue and the uvula (at least for most vowel contexts), and it is possible that the nasal formants for the bilabial nasal are simply harmonic multiples of the nasal cavity. Overall, the picture for where resonances arise for nasal consonants is a very murky one, and despite some good work on determining the resonance properties of the nasal and para-nasal cavities, the question of how/why the resonances in nasal consonants arise is one that requires much further articulatory-to-acoustic modeling.
Lateral consonants are similar to nasal consonants in that they both involve a reduction of spectral energy due to the presence of anti-resonances in the spectrum. However, in the case of laterals, the importance of these anti-resonances is much diminished, since they only arise from within the oral cavity, and not from within the nasal and para-nasal cavities.
Laterals involve continuous airflow through the oral cavity, along the side(s) of the tongue. A constriction is formed at some point along the roof of the mouth, and the side(s) of the tongue are lowered in order to allow air to escape laterally. It is not clear whether the sides of the tongue are actively lowered, or whether the tongue is lengthened in order to shorten the width of the tongue (Ladefoged and Maddieson, 1996). Regardless of the exact mechanism, due to the volume-preserving nature of the tongue, when air is allowed to flow along the sides of the tongue, another part of the tongue must be displaced in order for this to happen. We will return to this issue later in the chapter, in relation to the class of liquids (i.e., laterals and rhotics).
According to the IPA chart, the central constriction for a lateral consonant may occur at any point from the dental place of articulation to the uvular place of articulation. However, the only official lateral (non-fricative) symbols are /l/, the retroflex /ɭ/, the palatal /ʎ/, and the velar /ʟ/. The dental or alveolar /l/ is by far the most common lateral in the world’s languages, while the retroflex and palatal are not uncommon. The velar /ʟ/ is very rare; it occurs in some languages of Papua New Guinea, and it may occur allophonically in some dialects of English. It is not a sound that is well understood as yet, and to many English speakers it may simply sound like [gl]. It is possible that it is comparatively difficult to produce a full closure centrally in the velar or uvular region, while at the same time allowing air to flow along the sides of the tongue.
One important consequence of the central constriction for a lateral is that there is a pocket of air trapped behind the constriction. This pocket of air is considered to be the main source of any anti-resonances that occur in the lateral spectrum. The main airflow is along the side(s) of the tongue, and the pocket of air behind the constriction serves as a cavity to trap any acoustic energy at the resonant frequency of the cavity. The cavity is treated as a quarter-wavelength resonance (closed at the constriction), and as may be inferred from previous discussions, the exact length of this cavity will vary greatly depending on the place of articulation and on the active articulator (as an aside, it is not clear whether this cavity is of any appreciable size behind a velar constriction, as opposed to a more anterior constriction). The frequency of this anti-resonance may be less than 1000 Hz, or it may be as high as 4000 Hz – it is really not clear for the different lateral places of articulation. Additional anti-resonances may also arise when the left and right pathways for the lateral airflow are not symmetrical, and this may vary greatly from speaker to speaker.
Figure 10.3a shows a spectrogram of the Pitjantjatjara word palyani /paʎaɳi/ (“to make/fix”). It can be seen that like the nasal, the lateral has somewhat less energy relative to the adjacent vowel, although it has more energy than the nasal in this same word. Some other similarities can be seen between the palatal lateral in this figure, and the palatal nasal in Figure 10.2. In both figures, there is a large gap in spectral energy in the mid-frequencies; in the case of the lateral, this gap is between about 500 Hz and about 2000 Hz. However, the lateral has more energy than the nasal in the higher frequencies, above 4 kHz, and this is most likely due to the fact that there are no losses from the para-nasal sinuses. The lowest resonance for the lateral is at or below 500 Hz, and this is typical for all lateral sounds. It is thought that this lowest lateral resonance is due to a Helmholtz resonance formed between the lateral constriction along the side(s) of the tongue, and the large cavity behind the constriction. The lateral F1 is lower for laminal articulations than for apical articulations. This is most likely due to a longer constriction or a smaller cross-sectional area for the constriction in the case of a laminal articulation (laminal articulations involve a higher tongue body position and a higher jaw position than do apical articulations).
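The reasoning about the lateral F1 can be illustrated with the textbook formula for a Helmholtz resonator, f = (c / 2π) · √(A / (V · l)), where A and l are the cross-sectional area and length of the constriction and V is the volume of the cavity behind it. The sketch below is illustrative only: the speed of sound and all of the dimensions are hypothetical values chosen to give a plausible order of magnitude, not measurements from any speaker.

```python
import math

# ASSUMPTION: speed of sound of 35,000 cm/s; all dimensions below are hypothetical.
SPEED_OF_SOUND_CM_PER_S = 35_000.0


def helmholtz_frequency(
    neck_area_cm2: float,
    neck_length_cm: float,
    cavity_volume_cm3: float,
    c: float = SPEED_OF_SOUND_CM_PER_S,
) -> float:
    """Helmholtz resonance in Hz: f = (c / 2*pi) * sqrt(A / (V * l))."""
    if min(neck_area_cm2, neck_length_cm, cavity_volume_cm3) <= 0:
        raise ValueError("all dimensions must be positive")
    return (c / (2.0 * math.pi)) * math.sqrt(
        neck_area_cm2 / (cavity_volume_cm3 * neck_length_cm)
    )


baseline = helmholtz_frequency(0.2, 1.0, 40.0)   # reference constriction
longer = helmholtz_frequency(0.2, 2.0, 40.0)     # longer constriction
narrower = helmholtz_frequency(0.1, 1.0, 40.0)   # smaller cross-sectional area

print(f"baseline: {baseline:.0f} Hz")
print(f"longer constriction: {longer:.0f} Hz")
print(f"narrower constriction: {narrower:.0f} Hz")
```

With these hypothetical dimensions the resonance falls below 500 Hz, and either lengthening the constriction or reducing its cross-sectional area lowers it further. This is the direction of change that the laminal versus apical comparison relies on.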
As regards the higher frequency lateral resonances in Figure 10.3a, formants can be seen at about 2250 Hz, about 3250 Hz, and about 4200 Hz. These values are very typical for palatal laterals, the high second formant being a defining characteristic of all palatal consonants, as already seen. A high F4 (above rather than below 4 kHz) seems to be a characteristic of laminal laterals. Since as already mentioned, laminals are characterized by a longer constriction than apicals, it is likely that F4 arises from a back cavity resonance, which is presumably smaller for a laminal articulation than for an apical articulation.
Figure 10.3b shows spectrograms of an Arrernte speaker producing the minimal pair alheme /al̪əmə/ (“goes”) followed by aleme /aləmə/ (“liver”), which shows the (laminal) dental versus (apical) alveolar contrast in this language. It can be seen that the lateral formants for these sounds are a little more evenly spaced than for the palatal. Some researchers have suggested that the relatively even spacing of the upper formants, with a large gap between F1 and F2, is what gives laterals their “ringing” quality (Fant, 1970). In Figure 10.3b, the first lateral formant is once again below about 500 Hz, and it might be seen that this formant is a little lower for the laminal “lh” /l̪/ than for the apical /l/. For both of these sounds, the second formant sits at around 1700 Hz, and this is also typical for the retroflex /ɭ/, which is not shown here (recall that for the palatal, the second formant was above 2 kHz). A third formant is visible at about 3000 Hz in both cases, although there seems to be movement in the lateral F2 and F3 for the alveolar. This may relate to the marginal phonemic contrast between alveolars and retroflexes in this language, whereby the phonemic alveolar may show articulatory characteristics of a post-alveolar or retroflex consonant, as previously mentioned. F3 in a retroflex lateral “ring” is a little lower than in the alveolar. It can also be seen in Figure 10.3b that F4 in the lateral ring is higher in the laminal dental than in the alveolar, as mentioned earlier (although whether or not this is perceptible is a question for further research).
One final point worth commenting on in regards to the lateral figures is the sharp change in amplitude between the lateral and the vowel (Stevens, 1998). However, in many languages there is no such abrupt onset to the lateral when it is in coda position (Tabain, Butcher, Breen and Beare, 2016a). This is because lateral consonants are prone to vocalization (e.g., Hardcastle and Barry, 1989; Recasens, 1996). When the central constriction is released, the articulatory configuration very much resembles a vowel articulation. Indeed, many researchers have commented on the similarity between a lateral configuration and an /i/ vowel configuration. In both cases there is a Helmholtz resonance producing the first formant, with the second formant arising from the cavity behind the constriction (with the multiple of this resonance becoming the fourth formant), and the third formant arising from the cavity in front of the constriction. The main difference between the /i/ articulation and the lateral articulation is the presence of the anti-resonances, which may slightly shift the frequency of the formants and increase their bandwidths. It is perfectly possible to synthesize a lateral sound without any anti-resonances at all, simply by manipulating the resonance frequencies and bandwidths. We may note in passing that some researchers have suggested that the anti-resonance may cancel any resonances in front of the lateral constriction, but this result does not seem likely based on empirical data which shows a lower F3 for retroflexes, for example.
Thus far, we have been considering a relatively “clear” lateral sound – namely with a relatively high F2 in the lateral ring. However, in “dark” laterals, F2 is considerably lower, and this is due to the presence of a second constriction within the oral cavity. In the case of Russian, which has a phonemic contrast between dark and light (i.e., velarized versus palatalized) consonants, it is possible to say that due to the small cross-sectional area of the velar constriction, a second Helmholtz resonance is formed between the back part of the tongue and the soft palate, thereby producing a much lower second formant.¹ In this case, it could be said that the “dark” lateral articulation somewhat resembles the articulation for an /u/ vowel, with a dorsal constriction, and an anterior constriction in the alveolar region rather than at the lips. Thus, if in coda position a dorsal constriction is formed for the lateral, but the anterior lateral constriction is not formed, or is formed a little later, the exact boundary between the vowel nucleus and the coda lateral can be a little difficult to determine visually on a spectrogram (this is often the case in many varieties of English, where the vocalized lateral usually sounds like the unrounded back vowel [ɯ]).
It should be noted that labial laterals are not considered possible in the IPA chart. It is certainly possible to allow air to flow only out of one corner of the mouth, so the reason for this theoretical restriction may be that it is not clear how a pocket of air could be trapped behind a central constriction involving the lips. Pharyngeal laterals are also considered impossible, since it is not clear that the root of the tongue could modify the airflow in the correct manner for a lateral. Finally, glottal laterals are not possible, since laterals necessarily involve a supralaryngeal modification of the airflow. In conclusion, it seems that a particular tongue configuration is needed for the class of lateral sounds.
Before we consider the class of fricative sounds, a few words are needed regarding the other pulmonic consonant sounds on the IPA chart. The IPA lists trill, tap/flap,² lateral-fricative, and approximant manners of articulation (in addition to the oral stop, nasal, lateral, and fricative manners of articulation). Perhaps the most significant grouping of this rather miscellaneous collection of manners of articulation is the class of rhotics. Rhotics are /r/-like sounds, and it has been a point of some phonetic and phonological mystery as to why such a diverse range of sounds pattern together across the world’s languages. Most researchers would agree that rhotic sounds are characterized by a comparatively low third formant, and that this low third formant is brought about by a constriction towards the uvular or pharyngeal region of the oropharyngeal space.
Figure 10.4 shows a spectrogram example for the two phonemic rhotic sounds of Arrernte in a minimal pair: arreme /arəmə/ (“louse”) followed by areme /aɻəmə/ (“see”). The first rhotic is a trill, and phonologically it is considered to be alveolar. In this spectrogram, four periods of contact are visible (the last period of contact may not involve full closure). The second rhotic is an approximant or glide (the terms are often used interchangeably), and like the trill, it involves a sharp drop in energy from the adjacent vowels (note however that although this is typically the case in Arrernte, a rhotic glide does not always involve a clear drop in energy in all languages). The rhotic glide in this particular example has a very strong spectral prominence centered at about 1750 Hz. This is most likely the combined spectral prominence of F2 and F3 together. The other formants are comparatively much weaker for the glide in this particular spectrogram. Notably, however, both the trill and the glide induce a low F3 in the adjacent vowels (perhaps a little less obvious in the case of the preceding vowel for the trill, but noticeable in all three other adjacent vowels).
In this particular example I have used the symbol for a retroflex glide /ɻ/, but some authors use the symbol for an alveolar glide /ɹ/ to denote the rhotic consonant in languages such as English. The exact articulatory differences between these two sounds are perhaps not terribly clear, but in general the retroflex symbol tends to be used for slightly “darker” sounds than the alveolar rhotic symbol (“dark” being a subjective measure dependent on the hearer’s language background, but in general perhaps referring to just how low the F2 + F3 spectral prominence might be). In principle, the point of constriction for a “retroflex” rhotic glide should be a little further back than for an alveolar glide. In the case of the English rhotic, much research has shown that the low F3 arises thanks to a combination of articulatory strategies: The dorsal constriction in the uvular or pharyngeal region, a coronal constriction in the alveolar or post-alveolar region, and a labial constriction as well. All of these constrictions target an anti-node in the standing wave of the third formant resonance of a schwa-like tube, each one of them serving to slightly lower the third formant, and together combining to significantly lower the third formant. (This approach to the English rhotic is based on perturbation theory, but other researchers have used tube models to account for the formant structure of this sound; see for example Zhou, 2009, and Zhou et al., 2008, for discussion of the retroflex versus bunched rhotic of English).
Perhaps somewhat surprisingly, the trill /r/ also involves a significant dorsal constriction in the pharyngeal region. It is thought that this constriction serves to brace the rear-most portion of the tongue, in order to facilitate trilling in the anterior portion of the tongue. The anterior portion for a trill is low in comparison to stop, nasal and lateral sounds at the same “place” of articulation. The other rhotic, the uvular /ʀ/ (sometimes realized in a fricated version as /ʁ/ or /χ/), by definition has a dorsal constriction in the uvular region.
As an aside, the dorsal constriction for the rhotics may be what brings these sounds together with the laterals in the natural class of liquids. It was mentioned earlier that the volume-preserving nature of the tongue leads to the rear-most portion of the tongue being more back for laterals than for stops or nasals with the same place of articulation. Whilst it is certainly not the case that all laterals are “dark” – that is, involving a significant dorsal constriction in the uvular or pharyngeal region – it may be the case that a sufficient number of laterals do involve some dorsal retraction, and that this is sufficient to systematically bring the class of laterals and the class of rhotics together into a natural class of liquid sounds. This is clearly an area for further cross-linguistic research.
Finally, it should be mentioned that the IPA chart recognizes other trill and approximant/glide sounds, such as the bilabial trill /ʙ/ and the glides /j/, /w/, /ɥ/ and /ɰ/. For a trill to be initiated, certain aerodynamic conditions must be met, which are not dissimilar to the conditions required for the initiation of voicing. The relevant trill articulator is set into vibration, and the volume of airflow that is required to maintain trilling leads to compromises between the duration of the trill and the continuation of voicing (see further below for discussion of voicing in fricatives and stops).
Glide articulations are the class of consonant articulations that are most similar to vowels. Air flows relatively unimpeded in the oral cavity, and indeed most glides are simply a relatively short version of a high vowel. The glide counterpart of the unrounded front (palatal) vowel /i/ is /j/; /w/ is the glide counterpart of the rounded back (velar) vowel /u/; /ɥ/ is the glide counterpart of the rounded front (palatal) vowel /y/; and /ɰ/ is the glide counterpart of the unrounded back (velar) vowel /ɯ/. In many languages, the unrounded glides serve as variants of stop or other manners of articulation at the same place of articulation. For instance, where there is not a complete closure for /g/, the sound may be lenited to [ɣ] or even to [ɰ]. In many languages, a palatal /ɲ/ or /ʎ/ may be realized over time as [j] – that is, the nasal or lateral components of the articulation are not realized.
Fricatives are characterized by a critical constriction in the oral cavity. By “critical” it is meant that the constriction creates turbulent airflow, and the resultant spectrum is characterized by noise (i.e., spectral energy is distributed randomly across the frequency range). The exact spectral properties of the fricative are dependent on both the location of the constriction and on the length of the constriction. The primary determiner of spectral characteristics of a fricative is the size of the cavity in front of the constriction. However, although constriction location and constriction length are relevant factors for the other consonant manners of articulation, as discussed earlier, they seem to be all the more critical for fricatives in ways that are not yet fully understood. Although by far the largest number of symbols in the IPA chart is for the class of fricatives, the precise articulatory and spectral differences between the various fricatives, in particular the typologically less common fricatives, are in need of further study. In this section I will consider some of the salient characteristics of fricatives, including those in some less well-studied languages.
Figure 10.5a shows an example of the glottal fricative /h/ in the word hala (“call to move buffalo”) from the Indonesian language Makasar. It can be seen that the spectrum for the glottal is much noisier than the spectrum for the vowel, and the glottal has comparatively much less energy. The glottal fricative is not, strictly speaking, a fricative, since there is no constriction in the oral cavity that shapes the spectrum. Instead, the noise simply arises as a result of airflow through the glottis that has not been modified by voicing. (A breathy glottal “fricative” /ɦ/ is also possible, in which case vocal fold vibration is present, but closure at the vocal folds is not complete, resulting in a larger volume of air “leaking” through the glottis). It can be seen that the glottal fricative has a moderate amount of energy up to about 6 kHz in this spectrogram, and that the energy drops off greatly above this frequency range. Moreover, the peaks in the fricative noise correspond almost exactly to the formant peaks in the following vowel /a/. Indeed, this is the defining characteristic of a glottal fricative: It is effectively the voiceless version of an adjacent vowel.
Although not shown here, a labial fricative such as /f/ has a very similar spectrum to /h/ in the frequency range to about 5–6 kHz, but instead of dropping off above this range, the energy remains constant right up to the very highest frequencies (at least up to 20 kHz). This very constant, even spread of energy is a defining characteristic of the labial fricatives, and also of the dental fricative /θ/. Indeed, it is very difficult to distinguish the labio-dental fricative /f/ from the bilabial fricative /ɸ/, and /f/ from /θ/ in languages that have this contrast; studies have consistently shown that hearers rely on formant transition cues to distinguish place of articulation for these sounds, and also on visual cues. In the case of the labio-dental fricative, it is possible that a cavity exists between the upper teeth and the upper lip, particularly if the upper lip is slightly raised in production of this fricative. However, given that there are very few spectral differences between the anterior-most fricatives, it seems to be the case that any cavity in front of the constriction is negligible, and noise is simply generated at the constriction.
Figure 10.5b shows an example of the alveolar fricative /s/ in the word sala (“wrong/miss”) by the same speaker of Makasar. It can immediately be seen that in comparison to /h/, there is very little energy below 4 kHz for /s/. By contrast, there is much energy above this frequency range, with a broad peak extending from about 5 kHz to 10 kHz. Although not shown here, the energy drops off above 10–11 kHz, where the energy levels for /s/ are comparable to the energy levels for the anterior fricatives /f/ and /θ/. (This particular instance of /s/ seems to have a little less energy around 7500 Hz, though this would not be typical of most instances of this sound). The strong spectral peak for the alveolar fricative /s/ is due to the presence of an obstacle downstream from the constriction. The main constriction for this sound is at the alveolar ridge, and involves a very narrow channel for airflow. Importantly, the airflow through the constriction is channeled towards the lower teeth (for this reason, sounds such as /s/, and /ʃ/ which is discussed next, have a very high jaw position), and these teeth act as an obstacle that increases the turbulence of the airflow, and thereby the amplitude of any spectral peaks. In the case of the alveolar, the cavity in front of the constriction is quite small, and as such the spectral peaks are at a very high frequency, as already seen. By far the most common fricative in the world’s languages is /s/, and the salience of its noise is one reason why it is believed to be so common.
High-energy noise is the defining characteristic of fricative sounds that are termed “sibilants.” The other fricative that can be termed sibilant is the sound that is spelled “sh” in English, namely /ʃ/. On the IPA chart this sound is listed as post-alveolar, although some researchers use different terms for this sound, such as “palato-alveolar.” As the name suggests, the place of constriction is slightly further back for the palato-alveolar/post-alveolar than it is for the alveolar, and the constriction is a little wider. However, as for the /s/, the lower teeth serve as an obstacle downstream of the constriction for /ʃ/, likewise amplifying any spectral peaks. Since the constriction for /ʃ/ is located further back in the oral cavity than the constriction for /s/, the spectral peaks for /ʃ/ are located at a correspondingly lower frequency. This can be seen in Figure 10.5c, which shows an utterance of the English word “shopping” /ʃɔpɪŋ/. It can be seen that there is very little energy below 2 kHz, and that there is a broad spectral peak between about 2.5 kHz and about 5.5 kHz, which is typical for this sound. Although there is less energy above 6 kHz, the amount of energy between 6 and 10 kHz is greater for /ʃ/ than for the non-sibilant sounds /f/ and /θ/. Above 10 kHz (not shown here), the energy in /ʃ/ continues to drop off, with the net result that /ʃ/ has less energy above 10 kHz than do the non-sibilant fricatives /f/ and /θ/. As an aside, the F2 transition for /ʃ/ has a very high locus compared to the alveolar (or the dental), somewhat resembling a palatal (hence the term palato-alveolar used by some researchers).
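The relationship between front-cavity size and spectral peak frequency can be roughed out with a simple calculation. The sketch below assumes that the front cavity behaves like a uniform tube closed at the constriction and open at the lips, so that its lowest resonance is c / (4L), with c taken as 35,000 cm/s. This is a first-order simplification adopted purely for illustration (it ignores the teeth as an obstacle, lip rounding, and any sublingual cavity), and the function name is hypothetical.

```python
# ASSUMPTION: front cavity modeled as a uniform quarter-wavelength tube; c = 35,000 cm/s.
SPEED_OF_SOUND_CM_PER_S = 35_000.0


def implied_front_cavity_cm(peak_hz: float, c: float = SPEED_OF_SOUND_CM_PER_S) -> float:
    """Front-cavity length in cm whose quarter-wavelength resonance equals peak_hz."""
    if peak_hz <= 0:
        raise ValueError("peak_hz must be positive")
    return c / (4.0 * peak_hz)


# Edges of the broad spectral peaks described in the text for /s/ and /ʃ/.
peak_ranges_hz = {
    "s": (5_000.0, 10_000.0),
    "ʃ": (2_500.0, 5_500.0),
}

for symbol, (low_hz, high_hz) in peak_ranges_hz.items():
    longest = implied_front_cavity_cm(low_hz)    # lower frequency -> longer cavity
    shortest = implied_front_cavity_cm(high_hz)  # higher frequency -> shorter cavity
    print(f"/{symbol}/: about {shortest:.2f} to {longest:.2f} cm")
```

Under this simplification, the lower-frequency peak for /ʃ/ corresponds to a front cavity roughly twice as long as that for /s/, which is in keeping with the more posterior constriction described above. The absolute lengths should not be read as articulatory measurements.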
Figure 10.5d shows an example of the retroflex fricative in the word /ʂʅ33/ (“to die”) from the Tibeto-Burman language Lisu. It can be seen that for this speaker, the retroflex fricative /ʂ/ has a strong spectral peak between about 2.5 and 3.5 kHz, with an additional, less intense peak between about 6 and 8 kHz (the energy gradually decreases above this range). The lower-frequency peak likely arises from the cavity in front of the constriction, and is lower than for the alveolar /s/, given that /ʂ/ likely has a longer front resonating cavity. The additional, less intense peak may arise from a further cavity that is formed in a retroflex articulation, perhaps underneath the tongue. This is truly an area for further articulatory-to-acoustic modeling.
Figure 10.5e shows an example of the alveo-palatal fricative in the word /ɕø21/ (“walk”), also from Lisu. For this speaker it can be seen that there is a broad energy peak between about 6 and 10 kHz, and a narrower-bandwidth peak at about 3 kHz. For this speaker the alveo-palatal /ɕ/ has less energy overall than the retroflex /ʂ/, although to what extent this is typical for Lisu or even cross-linguistically is not clear. Although the Lisu retroflex seems perceptually and acoustically similar to the same sound in Polish, the alveo-palatal seems to be different in the two languages. It is also not clear whether either the retroflex or the alveo-palatal sound can be treated as a sibilant fricative. The amount of spectral energy for /ɕ/ suggests that it may be a sibilant, but further research is needed to determine exactly how the noise is channeled.
The back fricatives, such as the velar /x/, have much less energy overall than the coronal fricatives just described, although thanks to a large resonating cavity in front of the constriction, they do have a relatively large amount of low-frequency energy. As noted earlier, post-velar consonants tend to have a higher F1 in adjacent vowels – with the pharyngeal consonants having an even higher F1 than the uvular consonants – and often a lower F2. It should be noted that the voiced pharyngeal fricative /ʕ/ is often realized without frication, as either an approximant or a stop, whereas the voiced uvular fricative /ʁ/ can be realized as either a fricative or an approximant (Alwan, 1989).
Finally, it should be mentioned that lateral fricatives (voiceless /ɬ/ and voiced /ɮ/) are also possible if the air channel along the side(s) of the tongue forms a constriction narrow enough to generate turbulence. In addition, stops and fricatives may be combined to form affricates such as /ʦ/, /ʧ/ or /ʨ/, which are cross-linguistically relatively common. Articulatorily and spectrally, affricates are very similar to stop + fricative sequences, but their timing characteristics are such that the overall duration is much shorter than for a stop and fricative produced separately.
Before concluding this section on fricatives, it is worth emphasizing that the above discussion has focused on the voiceless stops and fricatives. This is because it is in the first instance simpler to consider the acoustic properties of these obstruent sounds without considering the consequences of voicing. In general, a certain amount of airflow is required to generate and maintain voicing, and a certain amount of airflow is required to generate noise within the oral cavity – this noise being necessary for accurate production of stop and fricative sounds, as seen earlier. At a certain point, these two requirements become incompatible. As a consequence, voicing rarely continues throughout the duration of a phonemically voiced stop or fricative consonant.
For voicing to continue, the air pressure below the glottis must be greater than the pressure above the glottis. When air flows freely through the oral or nasal cavity, as for a vowel, voicing may continue for quite some time, depending on the speaker’s ability. However, when there is an obstruction in the oral cavity, the pressure above the glottis quickly becomes equal to the pressure below the glottis, and in this case voicing ceases. As a consequence, voicing ceases relatively quickly for a velar stop, since the cavity behind the velar constriction is relatively small. Voicing can continue for relatively longer in a bilabial stop, since the cavity behind the bilabial constriction is the entire oral cavity. In fact, speakers may recruit other articulatory strategies to enlarge the supralaryngeal cavities and thereby maintain voicing, including expanding the cheeks, lowering the velum (to allow some nasal leakage), and lowering the larynx (to expand the pharyngeal cavity). Similarly for fricatives, voicing typically does not continue throughout the duration of phonemically voiced fricatives. In fact, since airflow is impeded by voicing, the pressure build-up required for the generation of turbulence at the fricative constriction, or for a high-energy stop burst, is not as great. As a result, the stop burst for voiced stops may be less intense than for voiceless stops, and spectral energy in voiced fricatives compared to voiceless fricatives may be reduced at higher frequencies. The presence of voicing in the spectrum therefore has important consequences for the spectral output of voiced stops and fricatives compared to voiceless stops and fricatives.
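A toy calculation can make the reasoning about cavity size concrete. The sketch below is not an established aerodynamic model. It simply assumes that, once the oral cavity is closed, the pressure drop across the glottis decays exponentially from the subglottal pressure, with a time constant proportional to the volume of the cavity behind the constriction, and that voicing stops when the drop falls below a threshold. The function name, the pressure values, the cavity volumes, and the scaling constant are all hypothetical, so only the relative durations are meaningful.

```python
import math


def voicing_duration_ms(cavity_volume_cm3,
                        subglottal_cm_h2o=8.0,
                        min_drop_cm_h2o=2.0,
                        ms_per_cm3=1.0):
    """Time until the transglottal pressure drop falls below the voicing threshold.

    Toy model: the drop decays as Ps * exp(-t / tau), with tau assumed
    proportional to the volume of the cavity behind the closure.
    All default values are illustrative assumptions, not measurements.
    """
    if cavity_volume_cm3 <= 0 or ms_per_cm3 <= 0:
        raise ValueError("volume and scaling constant must be positive")
    if not 0 < min_drop_cm_h2o < subglottal_cm_h2o:
        raise ValueError("threshold must lie between 0 and the subglottal pressure")
    tau_ms = ms_per_cm3 * cavity_volume_cm3
    return tau_ms * math.log(subglottal_cm_h2o / min_drop_cm_h2o)


# Hypothetical cavity volumes behind the closure, in cubic centimeters.
hypothetical_volumes_cm3 = {
    "velar": 30.0,
    "bilabial": 80.0,
    "bilabial, cavity expanded": 100.0,
}

for label, volume_cm3 in hypothetical_volumes_cm3.items():
    duration = voicing_duration_ms(volume_cm3)
    print(f"{label}: voicing lasts about {duration:.0f} ms (arbitrary scaling)")
```

In this sketch a larger cavity behind the closure always gives a longer voiced interval, which mirrors the contrast between velar and bilabial stops and the effect of cavity-enlarging strategies described above. Nasal leakage is not modeled.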
We have seen that there are many place of articulation contrasts that are encoded within the consonant portion itself – be it the stop burst, the nasal murmur, the lateral “ring,” or the fricative noise. We have gained some insight into the various acoustic models that are used to understand these sounds, and yet the mystery remains that despite the very different acoustic results produced by the articulatory gestures, the cues to consonant place of articulation remain surprisingly constant across manner. There is no doubt that formant cues into and out of a consonant play a crucial role in identifying consonant place of articulation. However, even within a nasal murmur or a lateral “ring” (i.e., the lateral sound itself, akin to the nasal murmur), the palatal place of articulation is identified by a high second formant, and the retroflex “place” of articulation by a lowered third formant. Laminal articulations are routinely longer in duration and have a lower first formant value than apical articulations. Although we have not discussed more gross spectral measures such as spectral center of gravity or standard deviation of the spectrum in the present chapter, there are likewise remarkable similarities across manner classes for a given place of articulation. For instance, compared to alveolars, palatals have a higher spectral center of gravity and retroflexes a lower one, and dentals have a much flatter spectrum than other places of articulation. Moreover, the study of dynamic cues to consonant production is only in its early stages (for some examples, see Reidy 2016; Holliday, Reidy, Beckman, and Edwards 2015; and Iskarous, Shadle, and Proctor 2011 on English fricatives), and this is another area for further research.
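The gross spectral measures mentioned above can be computed in several ways. The following is a minimal sketch of one common formulation, in which the center of gravity is the power-weighted mean frequency of the spectrum and the standard deviation is the power-weighted spread around that mean. The function name, the Hamming window, and the use of the power spectrum as the weighting are assumptions made for illustration, and the two pure tones are synthetic stand-ins rather than real fricative recordings.

```python
import numpy as np


def spectral_moments(signal, sample_rate_hz):
    """Return (center of gravity, standard deviation) in Hz of the power spectrum."""
    signal = np.asarray(signal, dtype=float)
    windowed = signal * np.hamming(len(signal))  # taper to reduce spectral leakage
    power = np.abs(np.fft.rfft(windowed)) ** 2
    total = power.sum()
    if total == 0:
        raise ValueError("signal contains no energy")
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate_hz)
    weights = power / total  # treat the power spectrum as a distribution
    center = float(np.sum(freqs * weights))
    spread = float(np.sqrt(np.sum(weights * (freqs - center) ** 2)))
    return center, spread


# Synthetic stand-ins: energy concentrated at a lower versus a higher frequency.
sample_rate_hz = 44100
t = np.arange(2048) / sample_rate_hz
lower_tone = np.sin(2 * np.pi * 3000 * t)
higher_tone = np.sin(2 * np.pi * 7000 * t)

for label, tone in (("lower", lower_tone), ("higher", higher_tone)):
    center, spread = spectral_moments(tone, sample_rate_hz)
    print(f"{label} tone: center of gravity {center:.0f} Hz, SD {spread:.0f} Hz")
```

With real fricative noise, analyses typically also involve choices about filtering, window length, and spectral averaging, none of which are addressed in this sketch.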
Much acoustic modeling work has been based on major world languages such as English, and as empirical data keeps coming in from languages that have not yet enjoyed quite as much academic attention as the European languages, we may come to a better understanding of exactly why place of articulation is so robustly encoded across different manners of articulation.
Alwan, A. (1989). “Perceptual cues for place of articulation for the voiced pharyngeal and uvular consonants.” The Journal of the Acoustical Society of America, 86, 549–556.
Chiba, T. & Kajiyama, M. (1941). The Vowel: Its Nature and Structure. (Kaiseikan, Tokyo).
Dang, J., Honda, K. & Suzuki, H. (1994). “Morphological and acoustical analysis of the nasal and the paranasal cavities.” The Journal of the Acoustical Society of America, 96, 2088–2100.
Dang, J. & Honda, K. (1996). “Acoustic characteristics of the human paranasal sinuses derived from transmission characteristic measurement and morphological observation.” The Journal of the Acoustical Society of America, 100, 3374–3383.
Dart, S. (1991). Articulatory and acoustic properties of apical and laminal articulations. UCLA Working Papers in Phonetics, 79.
Fant, G. (1970). Acoustic Theory of Speech Production, 2nd ed. (Mouton, The Hague).
Fujimura, O. (1962). “Analysis of nasal consonants.” The Journal of the Acoustical Society of America, 34, 1865–1875.
Gordon, M. (2016). Phonological Typology. (Oxford University Press, Oxford).
Hardcastle, W. & Barry, W. (1989). “Articulatory and perceptual factors in /l/ vocalisations in English.” Journal of the International Phonetic Association, 15, 3–17.
Harrington, J. (1994). “The contribution of the murmur and vowel to the place of articulation distinction in nasal consonants.” The Journal of the Acoustical Society of America, 96, 19–32.
Holliday, J. J., Reidy, P. F., Beckman, M. E., & Edwards, J. (2015). “Quantifying the robustness of the English sibilant fricative contrast in children.” Journal of Speech, Language, and Hearing Research, 58, 622–637.
Huffman, M. & Krakow, R., eds. (1993). Phonetics and Phonology: Nasals, Nasalization, and the Velum (Academic Press, San Diego).
Iskarous, K., Shadle, C. H., & Proctor, M. I. (2011). “Articulatory acoustic kinematics: The production of American English /s/.” The Journal of the Acoustical Society of America, 129, 944–954.
Johnson, K. (2012). Acoustic and Auditory Phonetics, 3rd ed. (Wiley Blackwell, Malden, Oxford, West Sussex).
Keating, P., Lindblom, B., Lubker, J. & Kreiman, J. (1994). “Variability in jaw height for segments in English and Swedish VCVs.” Journal of Phonetics, 22, 407–422.
Ladefoged, P. & Maddieson, I. (1996). The Sounds of the World’s Languages. (Wiley Blackwell, Oxford, UK, Cambridge, MA).
Malecot, A. (1956). “Acoustic cues for nasal consonants: An experimental study involving a tape-splicing technique.” Language, 32, 274–284.
Nakata, K. (1959). “Synthesis and perception of nasal consonants.” The Journal of the Acoustical Society of America, 31, 661–666.
Recasens, D. (1983). “Place cues for nasal consonants with special reference to Catalan.” The Journal of the Acoustical Society of America, 73, 1346–1353.
Recasens, D. (1996). “An articulatory-perceptual account of vocalization and elision of dark /l/ in the Romance languages.” Language and Speech, 39, 63–89.
Reidy, P. F. (2016). “Spectral dynamics of sibilant fricatives are contrastive and language specific.” The Journal of the Acoustical Society of America, 140, 2518–2529.
Stevens, K. (1998). Acoustic Phonetics (Massachusetts Institute of Technology, Cambridge, MA).
Sylak-Glassman, J. (2014). Deriving Natural Classes: The Phonology and Typology of Post-velar Consonants. Ph.D., UC Berkeley.
Tabain, M. (2009). “An EPG study of the alveolar vs. retroflex apical contrast in Central Arrernte.” Journal of Phonetics, 37, 486–501.
Tabain, M. (2012). “Jaw movement and coronal stop spectra in Central Arrernte.” Journal of Phonetics, 40, 551–567.
Tabain, M., Butcher, A., Breen, G. & Beare, R. (2016a). “An acoustic study of multiple lateral consonants in three Central Australian languages.” The Journal of the Acoustical Society of America, 139, 361–372.
Tabain, M., Butcher, A., Breen, G. & Beare, R. (2016b). “An acoustic study of nasal consonants in three Central Australian languages.” The Journal of the Acoustical Society of America, 139, 890–903.
Zhou, X., Espy-Wilson, C., Boyce, S., Tiede, M., Holland, C. & Choe, A. (2008). “A magnetic resonance imaging-based articulatory and acoustic study of ‘retroflex’ and ‘bunched’ American English /r/.” The Journal of the Acoustical Society of America, 123, 4466–4481.
Zhou, X. (2009). An MRI-based Articulatory and Acoustic Study of American English Liquid Sounds /r/ and /l/. Ph.D., University of Maryland.