The late Peter Ladefoged was fond of telling stories about his days on the set of the 1964 film version of My Fair Lady. He loved to regale the UCLA students of the day (tempered with appropriate British reserve) with the story of the day that Audrey Hepburn brought cookies to the set. The lunch room at UCLA had a picture of him on set, coaching Rex Harrison on how to understand the visible speech symbols that were prominently featured in many scenes in that film. That anecdote comes to mind in writing this chapter because of a line that Mr. Harrison’s character Henry Higgins utters in that film: “Why can’t a woman be more like a man?” More to the point of this chapter, why can’t a man sound more like a woman? Why do men and women speak differently? What consequences do male-female differences in speech have for our understanding of the nature of human language? How are these differences rooted in our biology, our evolutionary path since diverging from our last common species, and our participation in different social and cultural activities?
Variation in the spoken form of language is pervasive. Consider Figure 18.1. This shows four plots of phonetic characteristics of a set of single-word productions by 22 men and 22 women. These are the men and women described in Munson, McDonald, DeBoe, and White (2006). Briefly, Munson et al. (2006) was an exploratory study of the acoustic characteristics of heterosexual and gay speech from young adult men and women from a uniform dialect region (see Munson 2007a, 2007b for additional analyses of this corpus of talkers). The recordings from that study are used to illustrate various phenomena in this chapter. For a description of the general methods used to make the recordings and the measures, please see Munson et al. (2006). The parameters plotted in Figure 18.1 are temporal and spectral characteristics of consonants and vowels. These parameters were chosen at random from Munson et al.’s exploratory study. Values for men are plotted with crosses, and values for women are plotted with an open circle. As these visualizations show, there is considerable variation in all eight parameters. There are clear clusters of talkers in Figure 18.1a in both dimensions, and clear clusters along one of the two dimensions in Figures 18.1b and 18.1c. These clusters correspond to whether the talkers self-reported as men or women. There is no clear cluster in either dimension of Figure 18.1d. If one didn’t know anything about the speakers, and were to subject their acoustic data to an unsupervised clustering algorithm, one of the main divisions in the stimuli would likely correspond to whether the participant identified as male or female when asked to report either sex or gender. 
Hierarchical clustering applied to these four plots resulted in a perfect classification for Figure 18.1a, near-perfect classification for Figures 18.1b and 18.1c (in these cases, one man was classified into a group that included the 22 women), and poor classification for Figure 18.1d (where there were two clusters, one comprising six women and one man, and the other comprising the remaining individuals). Together, these figures show that men and women don’t differ in all possible acoustic parameters. However, they do differ in enough parameters that they pattern differently in many cases, even when the acoustic parameters are chosen at random.
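As a rough illustration of the kind of unsupervised analysis described here, the following sketch clusters talkers from two acoustic parameters without using any sex or gender labels. The linkage rule (average linkage), the distance measure (Euclidean), and the six toy data points are all assumptions made for illustration; the text does not specify the clustering settings, and the values are invented rather than taken from Munson et al. (2006).

```python
import math

def average_linkage(points, n_clusters=2):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Cluster distance is the mean Euclidean distance between members
    (average linkage, an assumed choice). Returns lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        _, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Invented, unit-free toy values for two acoustic parameters per talker.
toy_talkers = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
               (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
groups = average_linkage(toy_talkers, n_clusters=2)
print(sorted(sorted(g) for g in groups))
```

With real measurements, the parameters would normally be put on a common scale first, so that a parameter measured in Hertz does not dominate one measured in milliseconds.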
Given how strongly men and women’s speech cluster together, it is not surprising that a handbook on phonetics would dedicate a chapter to this topic. What is surprising, however, is how complicated the explanation is for these differences. Understanding why men and women differ so strongly involves understanding anatomy (including its development over the lifespan), acoustics, and social dynamics. It requires understanding the full range of variation that occurs in different languages, in different cultural contexts, and at different points in development. We argue that it is a question of deep importance. Given their status as two of the most pervasive sources of variation, understanding the mechanisms through which sex and gender affect speech has the potential to help us better understand how to model the production of variation more generally, an argument we make in greater detail elsewhere (Babel & Munson, 2014).
The phrase “sex and gender” in the title of this chapter immediately invites a definition of each of these terms. Most of the quick definitions of these concepts can be summarized simply as “sex is biological, while gender is either socially or culturally determined.” This definition has a strong core of truth to it, but its brevity belies the complexity associated with defining and understanding both sex and gender. Regardless, this definitional split is greatly preferable to simply using the terms interchangeably, or to using “gender” simply because it sounds more polite than “sex,” which also has the meaning “sexual intercourse.” Neither sex nor gender is completely binary. Biological sex in complex multicellular organisms like birds and mammals encompasses numerous traits, such as chromosomal composition, genitals, hormone levels, and secondary sex characteristics. While there are statistical patterns of co-occurrence among these variables, these correlations are not inevitable. The presence of a Y chromosome is correlated at far greater-than-chance levels with the existence of a penis and testicles, a particular level of testosterone, and a large larynx that descends at puberty. However, counterexamples can be found: individuals can be born with ambiguous genitals, or can have a Y chromosome but female genitals. Individuals can opt to take hormone therapy, or to undergo genital reassignment surgery. Both of these interventions would lead an individual to have some canonically male and some female biological attributes at a moment in time. Consequently, individuals can demonstrate characteristics intermediate between the canonical male and female presentations. That is, while there are strong clusters of characteristics that imply the existence of two predominant categories of biological sex, male and female, by no means do these categories describe all individuals.
The existence of sexual dimorphism (i.e., anatomical differences between sexes) is pervasive in the animal kingdom, and the human animal is no exception to this. Many of men and women’s speech production mechanisms – from the capacity of the lungs, to the size and position of the larynx, to the size and shape of the vocal tract – create phonetic properties that form statistical distributions whose means differ greatly (but across large population samples, show some degree of overlap). Not surprisingly, then, a recurring theme in this chapter is that many of the sex differences in the speech of men and women are grounded in sexual dimorphism. These speech differences are either direct consequences of the sexual dimorphism of the speech-production mechanism, or are intentional exaggerations or attenuations of the effects of sexual dimorphism.
Still other differences between men and women appear, however, to be completely arbitrary. The apparent arbitrary nature of those differences implies that they are learned behaviors. Another theme in this chapter is that these learned behaviors might reflect an individual’s expression of their gender. What, then, is gender? Just as biological sex is complex, so is gender. For the purposes of this chapter, we consider very broadly the notion that gender is a set of practices that individuals engage in that reinforce a social division between men and women. A summary of this as it relates to language variation is given by Eckert and Wenger (2005). In this sense, gender is highly dimensional. As an example, consider the act of expressing emotional vulnerability. According to a survey study conducted in Amsterdam, some experimental evidence suggests that individuals identify this as a “female” trait (Timmers, Fischer, & Manstead, 2003). While the origin and existence of sex differences in emotional expression are a matter of debate (Eagly & Wood, 1999; Chaplin, 2015), the data collected thus far on the topic of gender suggest that when a woman practices vulnerability, and when a man fails to practice vulnerability, they are reinforcing this gender association. Failing to conform to these stereotypes may have consequences: When a man expresses vulnerability, he may face the potential consequence of being thought of as feminine, at least in the Western cultures that have been the subject of most of the research on this topic. In these cultures, these gender differences evolve over generations, are culturally specific, and are potentially fluid over one’s lifetime. As reviewed by Eagly and Wood, some of these gender differences are grounded in sex differences in anatomy or physiology, or in the roles that men and women are expected to take in different cultures, like caregiving or money-earning.
As the reader of this chapter will see, our understanding of sex and gender differences in speech is richest when we consider those differences in speech acoustics that can be traced directly to anatomical and physiological differences between the sexes. These differences have parallels in the vocalizations of nonhuman animals, and hence we can speculate on how they arose over the course of human evolution. Our understanding of gender and speech is considerably more tenuous. This is because it is only recently that studies have begun to use equally sophisticated methods for measuring speech and for coding social categories and social meanings. Indeed, only very recently have experimental phoneticians begun to exploit the scholarship on coding gender and sex as something other than as a self-report as either “male” or “female.” See, for example, Eckert (2014) for recommendations on improving the coding of these constructs in research. However, there is a small but robust literature in sociocultural linguistics that we can draw on to begin to understand the ways that the many components of gender are coded phonetically.
This chapter is structured as follows: We begin by talking about sex differences between men and women’s speech production mechanisms, and describing the speech differences that arise because of these. We then describe those differences in speech that do not have a plausible basis in anatomical or physiological differences between the sexes. We then discuss how sex- and gender-based differences in speech are perceived, and how they develop over the lifespan. We end the chapter with suggestions for ongoing research in this area.
Any discussion of speech acoustics should follow the most influential theory of speech production, the acoustic source-filter theory (Chiba & Kajiyama, 1941; Fant, 1970). This theory posits that the acoustic characteristics of speech sounds reflect the combined influences of a sound source and a filter through which it is passed. The first set of speech features that we consider are those related to the sound source for voiced sounds, namely, the vibration of the vocal folds. Women tend to have smaller and lighter vocal folds than men. By virtue of this difference in morphology, women’s vocal folds vibrate at a faster rate than men’s (Titze, 1989). Acoustically, this produces a higher fundamental frequency (henceforth f0). As listeners, we perceive this increased rate of vibration and higher f0 as an increase in pitch. The source spectrum generated at the glottis is a complex signal, composed of the fundamental frequency and harmonics, which are positive, whole-number multiples of the fundamental frequency. With harmonics as multiples of the fundamental, this means that men have harmonically more dense voices than women.
Consider the following example to make this important distinction between male and female voices more concrete. Figure 18.2 presents power spectra from the midpoint of the vowel /æ/ for a woman (dashed line) and a man (solid line). These are two speakers from the corpus of 44 talkers described by Munson et al. (2006). They represent the extremes of f0 in that sample. The female who produced this token has a mean f0 of 223 Hz, and, therefore, has harmonics occurring at 223 Hz, 446 Hz, 669 Hz, and all other positive whole-number multiples of the f0. The male who produced this token has a mean f0 of 96 Hz, with harmonics occurring regularly every 96 Hz: 96 Hz, 192 Hz, 288 Hz, etc. So, from 0 to 1000 Hertz, this male has ten harmonics, while this female only has four. While a direct laryngoscopic examination of these individuals was not performed, we can presume that these differences are due, at least in part, to physiological and anatomical differences between these two people’s vocal folds.
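The harmonic counts in this example follow directly from the definition of harmonics as whole-number multiples of f0. A minimal sketch of that arithmetic, using the two mean f0 values given above (the function name is ours, introduced only for illustration):

```python
def harmonics_below(f0_hz, limit_hz):
    """Return the harmonic frequencies (k * f0, k = 1, 2, ...) up to limit_hz."""
    count = int(limit_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]

female_harmonics = harmonics_below(223, 1000)  # 223, 446, 669, 892
male_harmonics = harmonics_below(96, 1000)     # 96, 192, ..., 960

print(len(female_harmonics), len(male_harmonics))  # prints: 4 10
```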
Given the sex differences illustrated by Figure 18.2, one sensible assumption is that the habitual rate of vibration of the vocal folds can be calculated, and used as an estimate of the mass and thickness of an individual’s vocal folds. Perhaps this is true, given that the samples of speech used to estimate f0 are long enough to average over all of the many other sources of f0 variation. Over shorter samples, there are likely to be substantial differences. This is illustrated by Sandage, Plexico, and Schiwitz’s (2015) assessment of f0 estimates taken from a variety of speech samples that are often used in clinical assessments of voice. Sandage et al. showed that the estimates for women’s f0 can vary considerably across different speech genres (i.e., counting, sentence reading, and narrative readings). A simple illustration of the effect that one variable can have on f0 measures is given in the two pitch tracks shown in Figure 18.3. These are estimates of f0 over the course of two readings of the sentence the red bus that you get in the city center takes forever.
The first of these, shown in the top f0 trace, was produced with a single intonational phrase, in which the words red, bus, city, center, takes, and forever were given intonational accents, followed by a fall in pitch at the end of the utterance. The typed words represent the approximate location of each of the words, as determined from a spectrogram. In the second reading of the sentence, there were three intonational phrases, with an intonational phrase boundary after the words bus and center. This is shown in the bottom pitch trace. The medial intonational phrase was produced as a parenthetical comment, and hence had a lower overall pitch than the first and third intonational phrases (Dehé, 2014). The net result of these differences is that the average f0 for the first reading is 130 Hz, and 120 Hz for the second reading. This difference reflects the meaningful linguistic difference between the two utterances. The information status of the phrase which you get in the city center is different in the two utterances, and the f0 range is one of the salient cues to this difference. This difference does not, however, have anything to do with the mass or thickness of the vocal folds. Hence, caution must be exercised when interpreting f0 estimates of individuals’ voices, as they reflect the combined influence of many factors.
Sex differences in the mass of the vocal folds also contribute to a typical voice quality difference between male and female speakers. Voice quality is a complex phenomenon. Readers are referred to Chapter 4 in this volume for a more detailed treatment of this topic. One relevant parameter for understanding voice quality variation is the open quotient, i.e., the amount the glottis is unobstructed during the opening-and-closing gesture of the vocal folds within each cycle of vibration during voicing. Modal voicing occurs when the open quotient is 0.5, that is, when the open and closed phases of the vibratory cycle are approximately equally long. Higher open quotient values indicate that the glottis is more open than it is closed, resulting in a voice quality that is termed breathy. Lower open quotient values indicate the glottis is more closed than open, generating a creaky voice quality. Many studies have documented breathy voice qualities for women’s voices (Klatt & Klatt, 1990), and computational modeling demonstrates that women’s thinner vocal folds are less likely to achieve complete closure when voicing (Titze, 1989), thus producing a breathier voice quality.
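The relationship between open quotient and the voice-quality labels used above can be sketched as a simple calculation. This assumes that the open-phase duration and the cycle period have already been measured (for example, from an electroglottographic or inverse-filtered signal). The tolerance band around 0.5 is a hypothetical choice for illustration, not a value from the literature.

```python
def open_quotient(open_phase_ms, period_ms):
    """Proportion of one vibratory cycle during which the glottis is open."""
    if period_ms <= 0 or not 0 <= open_phase_ms <= period_ms:
        raise ValueError("open phase must lie within a positive cycle period")
    return open_phase_ms / period_ms

def describe_voice_quality(oq, tolerance=0.05):
    """Map an open quotient to a coarse label; the tolerance is an assumed value."""
    if abs(oq - 0.5) <= tolerance:
        return "modal"
    return "breathier" if oq > 0.5 else "creakier"

# A 5 ms cycle (200 Hz) with the glottis open for 3.5 ms of it.
oq = open_quotient(3.5, 5.0)
print(oq, describe_voice_quality(oq))
```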
A commonly used measure of voice quality is H1-H2, which is calculated by subtracting the amplitude of the second harmonic from that of the first. The two individuals whose productions are plotted in Figure 18.2 show a difference in H1-H2. The woman’s production has a much larger difference between H1 and H2 than does the man’s. This may be due to her producing the word with a breathier voice quality than the man. Indeed, that is the perceptual impression that these two speakers give (at least to the first author, who was trained in the assessment of voice disorders): the man sounds considerably less breathy than the woman. In Figure 18.2, the H1 and H2 frequencies are far away from the nearest formant frequency; hence, their amplitudes can be assumed to largely reflect voice quality, and to reflect vocal-tract resonance properties only minimally. This would not be the case, for example, if the vowel were /i/ or /u/, which have F1 frequencies that are much closer to the frequencies of H1 and H2. In those cases, the amplitudes of H1 and H2 reflect a combination of voice quality and vocal-tract resonance. In these cases and other instances where the estimation of F1 can be a challenge, H1-H2 is a less-useful measure of voice quality. A complex correction needs to be made for the potential effects of vocal-tract resonances, such as that described by Iseli, Shue, and Alwan (2007).
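To make the definition of H1-H2 concrete, the sketch below measures the amplitudes of the first two harmonics of a synthetic signal and takes their difference in decibels. Several things here are assumptions for illustration: f0 is taken as already known, the signal is a toy sum of two sinusoids rather than recorded speech, and no correction for vocal-tract resonances (of the kind described by Iseli et al.) is attempted.

```python
import math

def synthesize(f0_hz, amplitudes, sample_rate, n_samples):
    """Toy source signal: sinusoids at k * f0 with the given linear amplitudes."""
    return [
        sum(a * math.sin(2 * math.pi * (k + 1) * f0_hz * i / sample_rate)
            for k, a in enumerate(amplitudes))
        for i in range(n_samples)
    ]

def harmonic_amplitude_db(samples, sample_rate, freq_hz):
    """Amplitude in dB of the component at freq_hz (single-frequency DFT).

    Assumes the component is present; a zero amplitude would make log10 fail.
    """
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(samples))
    return 20 * math.log10(2 * math.hypot(re, im) / n)

def h1_minus_h2(samples, sample_rate, f0_hz):
    """Uncorrected H1-H2: first-harmonic level minus second-harmonic level."""
    h1 = harmonic_amplitude_db(samples, sample_rate, f0_hz)
    h2 = harmonic_amplitude_db(samples, sample_rate, 2 * f0_hz)
    return h1 - h2

# Second harmonic at half the amplitude of the first: about 6.02 dB expected.
signal = synthesize(200, [1.0, 0.5], sample_rate=8000, n_samples=800)
print(round(h1_minus_h2(signal, 8000, 200), 2))
```

The analysis window here contains a whole number of cycles of both harmonics, which keeps the toy measurement exact. Real speech would need windowing and an f0 estimate for each analysis frame.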
Even when corrections are made for vocal-tract characteristics, this measure is still potentially problematic. Simpson (2012) strongly advises against the use of this measure for voice quality, particularly to compare across male and female speakers. The specific problem relates to the fact that the difference between H1 and H2 amplitude can also be affected by the degree of nasality in a person’s speech. The reasoning is important to our understanding of sex differences in speech because issues arise when an acoustic parameter that has no global sex difference (e.g., nasal formants) is superimposed on known sex differences (e.g., oral formants), thus obscuring the mechanism behind the measurement. The acoustics of nasals are difficult to predict because of individual differences in nasal anatomy. To create a nasal stop or a nasalized oral sound (e.g., nasalized vowels), a speaker lowers his/her velum. This lowers the amplitude and increases the F1 bandwidth of the resulting sound (Chen, 1997). This also creates a nasal formant around 200–300 Hertz (Chen, 1997), for both male and female speakers. The individual differences in nasal anatomy outweigh any global sex-based differences in the location of the nasal formant. However, this amplification in the lower-frequency spectrum generally serves to increase the amplitude of H1 for female speakers and H2 or H3 for male speakers. So, while this nasal formant does not have a sex-specific component itself, it amplifies different harmonics in the female and male spectra, based on sex- or gender-specific f0 patterns. This confounds the use of H1-H2 as a measurement for voice quality, a phonetic feature we know to have sex-specific patterns based in part on physiology. In sum, laryngeal differences can result in male-female differences in speech. However, great care must be applied to the interpretation of these measures.
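The interaction between a sex-neutral nasal formant and sex-specific f0 can be checked with simple arithmetic. The sketch below lists which harmonics fall inside the 200–300 Hz region for a given f0. Treating the nasal formant as a hard-edged band is a simplifying assumption, and the f0 values are illustrative ones taken from earlier in the chapter (223 Hz and 96 Hz from Figure 18.2, 120 Hz from the second sentence reading).

```python
def harmonics_in_band(f0_hz, low_hz=200.0, high_hz=300.0):
    """Harmonic numbers k for which k * f0 lies within [low_hz, high_hz]."""
    k_max = int(high_hz // f0_hz)
    return [k for k in range(1, k_max + 1) if low_hz <= k * f0_hz <= high_hz]

print(harmonics_in_band(223))  # [1]: H1 sits in the nasal-formant region
print(harmonics_in_band(120))  # [2]: H2 at 240 Hz
print(harmonics_in_band(96))   # [3]: H3 at 288 Hz (H2 at 192 Hz is just below)
```

Because a different harmonic is boosted in each case, the same nasal resonance raises H1-H2 for the higher-pitched voice and lowers or leaves it unchanged for the lower-pitched ones.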
In the source-filter model of speech production, the source spectrum generated at the glottis for voiced sounds is filtered by the vocal tract. This filtering process amplifies the harmonic frequencies that fall at or near the resonant peaks of the vocal tract. Those resonant peaks are determined by the size and shape of the vocal tract, that is, the articulatory configuration of the speech organs. Hence, the next set of male-female differences that we discuss is plausibly related to differences in anatomy. This discussion begins with one basic fact: On average, men have longer vocal tracts than women (Fitch & Giedd, 1999). The resonant frequencies of sounds produced with a relatively open vocal tract are inversely proportional to the length of the vocal tract. Hence, it is not surprising that the formant frequencies of women’s vowels are, on average, higher than those of men’s. This was first illustrated in Peterson and Barney (1952), a large-scale study on individual differences in vowel production in English, and by Fant (1970), which was a smaller-scale study of speakers of Swedish. They have been confirmed in every large-scale study on vowel production that has been conducted subsequently, including Hillenbrand, Getty, Clark, and Wheeler’s (1995) follow-up to Peterson and Barney.
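The inverse relationship between vocal-tract length and resonant frequency can be illustrated with the textbook idealization of the vocal tract as a uniform tube that is closed at the glottis and open at the lips, whose resonances are F_n = (2n − 1)c / 4L. This model, the speed of sound (35,000 cm/s), and the two tube lengths below are assumptions chosen for illustration. They are not measurements from the studies cited above, and real vocal tracts are not uniform tubes.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # approximate value for warm, humid air

def tube_resonances(length_cm, n_formants=3):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4 * length_cm)
        for n in range(1, n_formants + 1)
    ]

longer_tract = tube_resonances(17.5)   # 500, 1500, 2500 Hz
shorter_tract = tube_resonances(15.0)  # every resonance is higher

for long_f, short_f in zip(longer_tract, shorter_tract):
    print(round(long_f), round(short_f), round(short_f / long_f, 3))
```

In this idealization, every resonance of the shorter tube is higher by the same ratio (17.5 / 15). Real male-female formant differences depart from that uniform scaling, as the following paragraph discusses.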
One somewhat surprising finding from these studies is that the differences between men and women’s formant frequencies are not equivalent for all of the vowels of English or Swedish. This appears to be due to the fact that the ratio of the sizes of the oral and pharyngeal cavities is sex-specific. In women, the two cavities are approximately equal in size. However, men have disproportionately lower larynxes than women (Fant, 1966; Fitch & Giedd, 1999). This means that they have larger pharyngeal cavities than oral cavities. This leads to especially low F2 frequencies for men’s productions of low-back vowels, when compared to women’s productions. Nonetheless, these vowel-specific sex differences can still be explained in strictly anatomical terms, following the source-filter theory of speech production (Chiba & Kajiyama, 1941; Fant, 1970).
The fact that men and women’s formant frequencies are scaled differently shows that listeners’ perception of vowels cannot involve a simple mapping between ranges of formant frequencies and particular vowel categories. This problem of acoustic invariance is one of the longest-standing questions in speech perception: How do listeners map a seemingly infinitely variable acoustic signal onto invariant perceptual representations? The finding that men and women’s formant frequencies differ in ways that appear to be related to the overall length of their vocal tracts implies that listeners can account for this source of variability by making inferences about speakers’ overall vocal-tract size, and then factoring out the effect of vocal-tract length on formant frequencies. This possibility has been explored at great length in experimental and computational work by Patterson and colleagues (e.g., Irino & Patterson, 2002; Patterson, Smith, van Dinther, & Walters, 2008; Smith & Patterson, 2005; Smith, Patterson, Turner, Kawahara, & Irino, 2005). That work suggests that speakers’ overall stature can be estimated from formant frequencies, and that listeners make these estimates at a relatively early stage of speech processing. Indeed, overall height and vocal-tract length are correlated (Cherng, Wong, Hsu, & Ho, 2002). Once an estimate of size is made, subsequent formants can be interpreted relative to that scaling factor. This explanation is appealing not only because of its mathematical elegance, but also because it is observed across numerous species. Reby et al. (2005) argue that male red deer (Cervus elaphus) make inferences about an antagonist’s body size during antagonistic interactions from formant frequencies. Rendall, Owren, Weerts, and Hienz (2004) showed that baboons (Papio hamadryas) can also discriminate caller size from call acoustics.
There are, however, data that seem to contradict the strong claim that human male-female formant-frequency differences can be reduced to differences in overall size. Much of this research focuses on within-sex relationships between talker size and resonant frequencies. The first of these pieces of evidence is that the correlation between talker height and formants is much weaker than anticipated if the scaling of the entire ensemble of vowels in the F1/F2 space were due only to overall size. Moreover, the correlation is stratified by sex: it is stronger in females than males (Gonzalez, 2004). The lower correlation in males might indicate that men make more maneuvers than women to mask their true body size (i.e., to sound bigger or smaller). The composition of the stimuli used to assess this relationship is likely of great importance. Barreda (2017) found that listener judgments of speaker heights were more accurate when based on productions of certain vowels. Barreda’s finding suggests that a comparison of relationships between height and acoustics must consider stimulus composition carefully.
As an illustration of the correlation between height and speech acoustics, consider Figure 18.4, which plots the relationship between height and both the average F1 and the average F2 for the 44 adults in Munson et al. (2006). These are the same adults whose data are presented in Figures 18.1 and 18.2, using the same symbols for men and women. These data show the correlation between height and average F1 (top) and average F2 (bottom) for a variety of monophthongal vowels. These averages are simply the linear mean of the F1 and F2 of many dozens of productions of different vowels, taken at the vowel midpoint. They were the same vowels and the same words for all speakers. When both sexes are included, these correlations are very robust (r = −0.68 for F1 and r = −0.76 for F2; p < 0.001 for both correlations). Indeed, they are much larger than those noted by Gonzalez. However, they do follow the sex differences shown by Gonzalez: the correlations are stronger for women than for men (for F1: r = −0.51, p = 0.02 for women, r = −0.01, p = 0.97 for men; for F2: r = −0.39, p = 0.07 for women, r = −0.37, p = 0.09 for men). These patterns hold when the logs of formant frequencies are used, as is discussed by Nearey (1989). Simply put, while average formant frequencies are affected by speaker size, the weak within-sex correlations with size suggest that other factors are at play in determining these values. The general dearth of research in this area, and the small, age-limited sample sizes that have been used, leave this a very open area for future research.
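The contrast between a strong pooled correlation and weak within-sex correlations can arise whenever two groups differ in their means on both variables. The sketch below computes Pearson's r for pooled and within-group data. The height and F2 values are invented toy numbers, chosen only to show the pattern; they are not the data plotted in Figure 18.4.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented toy values: height in cm, mean F2 in Hz.
women_height, women_f2 = [160, 164, 168, 172], [1900, 1850, 1900, 1850]
men_height, men_f2 = [174, 178, 182, 186], [1600, 1650, 1600, 1650]

r_pooled = pearson_r(women_height + men_height, women_f2 + men_f2)
r_women = pearson_r(women_height, women_f2)
r_men = pearson_r(men_height, men_f2)
print(round(r_pooled, 2), round(r_women, 2), round(r_men, 2))
```

In these toy numbers the pooled correlation is strongly negative even though neither group shows a consistent height effect on its own. This is why within-sex correlations are the more informative test of whether size alone drives formant scaling.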
The second, somewhat striking counterexample to the relationship between height and formant frequencies comes from Johnson’s (2006) survey of available formant frequency data for men and women’s productions of the five vowels /i e a o u/ by speakers of 17 languages. (Johnson’s data came from published sources, a full list of which can be found in the Appendix accompanying his article.) While the languages were overwhelmingly Indo-European (and within Indo-European, overwhelmingly Germanic), the set does have some typological diversity (i.e., it includes Hebrew and Korean). Within these data, the degree of gender difference varied widely across languages. One possibility is that these differences are simply related to population differences in vocal-tract length, which we can estimate from differences in overall height (as in Cherng et al., 2002). To explore this possibility, Johnson compared Bark-scaled formant-frequency data to height data within each language population, correlating the gender difference in formant values for a given language to the gender difference in height within that same population. The relationship that Johnson found does not support the hypothesis that the language-specificity of sex differences is due to the population-specific differences in men and women’s height: there was a weak relationship between population differences in height and differences in formant frequencies. Johnson concluded that these differences reinforce the idea that there are socially, culturally, and linguistically conventional ways of expressing gender through speech, and that these can mitigate or enhance sex differences in speech communities.
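For readers unfamiliar with Bark scaling, the sketch below converts formant frequencies in Hertz to the auditory Bark scale and expresses a female-male difference in Bark. It uses Traunmüller's (1990) approximation, z = 26.81f / (1960 + f) − 0.53. Whether this is the exact conversion used in Johnson's survey is not stated here, so that choice is an assumption, and the two F2 values are hypothetical.

```python
def hz_to_bark(freq_hz):
    """Hertz to Bark using Traunmüller's (1990) approximation (assumed choice)."""
    return 26.81 * freq_hz / (1960.0 + freq_hz) - 0.53

def bark_difference(female_hz, male_hz):
    """Female-minus-male formant difference on the Bark scale."""
    return hz_to_bark(female_hz) - hz_to_bark(male_hz)

# Hypothetical F2 values for one vowel category.
print(round(hz_to_bark(1000.0), 2))
print(round(bark_difference(2700.0, 2300.0), 2))
```

Because the Bark scale compresses high frequencies, a given difference in Hertz corresponds to a smaller auditory difference at high formant frequencies than at low ones. This is one reason to compare gender differences across vowels and languages on such a scale.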
There are many ways to interpret and test Johnson’s hypothesis, and we return to it later in this chapter. Here, we discuss one related hypothesis that Johnson’s data inspires. This is the idea that the mitigation or enhancement of sex differences that Johnson noted is the result of people modifying the length of their vocal tract. The notion that people’s vocal-tract length can be manipulated is supported by a great deal of research (Sachs, Lieberman, & Erikson, 1972). Vocal-tract length can be increased by lowering the larynx or protruding the lips, and, conversely, can be shortened by raising the larynx or spreading the lips. Articulatory studies show that individuals can perform both maneuvers. For example, French- and Mandarin-speaking individuals lower their larynx when producing the round vowels /u/ and /y/ with an articulatory perturbation that prevents them from rounding their lips (Riordan, 1977). Further evidence for the malleability of vocal-tract length comes from experimental studies of the acoustic characteristics of the speech produced by individuals who were instructed to change the impression of their overall body size. K. Pisanski et al. (2016) examined the speech of men and women who were instructed to make themselves sound larger or smaller. They found that, across a variety of languages and cultures, individuals made modifications to the formant frequencies that are consistent with an overall stretching or compression of the vocal tract. Similar findings are reported in studies of imitated masculinity and femininity in speech. Individuals lower all of the lowest four formant frequencies when instructed to speak in a more masculine style, and increase them when asked to speak in a more feminine style (Cartei, Cowles, & Reby, 2012). While Cartei et al. and K. Pisanski et al. did not conduct articulatory measures to ascertain how individuals changed their formant frequencies, the consistency with which they did so (across vowels and different speaking tasks) makes it very plausible that they did so by lengthening and shortening their vocal tracts, respectively.
The relationship between body size and formant frequencies can be seen in other animal species that communicate vocally and are characterized by variation in body size. For example, Budka and Osiejuk (2013) show that formant frequencies are related to body size in male corncrakes (Crex crex). As described by Fitch and Hauser (2002) and Rendall and Owren (2010), there are numerous cases of deceptive vocalizations in nonhuman animals that give the illusion of being larger or smaller than one actually is. Indeed, one of the motivations of the studies by Cartei and colleagues and by K. Pisanski and colleagues is to understand the extent to which vocal signals communicate information that might be consistent across a diverse group of animal species, rather than specific to human language.
In sum, there is ample evidence that men and women’s different-sized vocal tracts lead to acoustic differences in their production of vowels. There is also evidence that male-female differences in vocal-tract size cannot account for all of the differences in vowels between men and women.
This next section considers sex differences in the production of consonants. As the reader will see, there are relatively fewer of these studies than there are studies of sex differences in vowels. We interpret this asymmetry in the research as stemming from two factors. The first of these is that the articulatory-acoustic relationships for consonants are more poorly understood than are those for vowels. Second, the acoustic measures that are generally used to characterize stops and fricatives are less transparently related to their articulatory characteristics than are those for vowels.
Fricatives are characterized by turbulent airflow, which can be produced either by forcing air through a constriction narrow enough that, at a given volume velocity, the flow becomes turbulent, or by forcing air against an obstacle, such as the vocal tract walls or the front teeth. The turbulent noise is then shaped or filtered by the vocal tract cavity in front of the noise source. The result of this filtering is that different places of articulation for fricatives have different frequency-amplitude distributions. The size of the channel constriction, the articulatory configuration (especially involving the tongue), and the distance between channel and obstacle turbulence sources are some of the many factors at play in determining the acoustic qualities of a fricative (Catford, 1977; Shadle, 1991; Stevens, 2000). For our purposes, we focus on articulatory positions that are potentially related to sex differences. For example, given the sexually dimorphic differences between males and females, we can predict that at a particular oral constriction, the front cavity for a male speaker will be larger than the equivalent for a female speaker’s vocal tract, resulting in lower resonant frequencies. Just as with vowels, this cavity size can also be easily manipulated. Speakers can, for example, round their lips, effectively extending the length of the front cavity and, thus, lowering the frequencies that will be amplified. Speakers can also subtly retract the constriction location to create a larger front cavity.
Jongman, Wayland, and Wong (2000) provide a detailed acoustic analysis of the voiced and voiceless fricatives in American English /f v θ ð s z ʃ ʒ/ for 20 speakers. They document higher spectral peaks for female speakers, but not uniformly across fricative place of articulation. While female speakers have higher spectral peak frequencies in the sibilant fricatives, the pattern is less robust in non-sibilant fricatives, where perceptually listeners seem to rely rather heavily on adjacent formant transitions (Babel & McGuire, 2013). Acoustically, English fricatives produced by males have lower spectral means, lower variance, lower kurtosis, and higher skewness than the same fricatives produced by females. It is important to note, however, that these effect sizes are tiny, and vary across the different acoustic measures and the different fricative places of articulation (Jongman et al., 2000).
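The spectral mean, variance, skewness, and kurtosis mentioned above are the first four moments of a spectrum that is treated as if it were a probability distribution over frequency. The following is a minimal sketch of that computation, not a reconstruction of the procedure used in any cited study. The two toy spectra are invented for illustration, and subtracting 3 from the kurtosis (excess kurtosis) is an assumed convention. Windowing, frequency range, and pre-emphasis, which matter in real analyses, are omitted.

```python
import math


def spectral_moments(freqs_hz, power):
    """Return (mean, variance, skewness, excess kurtosis) of a power spectrum."""
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum must contain positive energy")
    # Normalize power so the weights sum to 1, like a probability distribution.
    weights = [p / total for p in power]
    mean = sum(f * w for f, w in zip(freqs_hz, weights))
    variance = sum((f - mean) ** 2 * w for f, w in zip(freqs_hz, weights))
    if variance == 0:
        raise ValueError("skewness and kurtosis are undefined for zero variance")
    sd = math.sqrt(variance)
    skewness = sum((f - mean) ** 3 * w for f, w in zip(freqs_hz, weights)) / sd ** 3
    # Assumed convention: report excess kurtosis (normal distribution = 0).
    kurtosis = sum((f - mean) ** 4 * w for f, w in zip(freqs_hz, weights)) / sd ** 4 - 3.0
    return mean, variance, skewness, kurtosis


# Toy spectra (hypothetical values): energy weighted toward low vs. high frequencies.
freqs = [4000, 5000, 6000]
low_weighted = spectral_moments(freqs, [4, 2, 1])
high_weighted = spectral_moments(freqs, [1, 2, 4])

for name, moments in [("low-weighted", low_weighted), ("high-weighted", high_weighted)]:
    mean, variance, skewness, kurtosis = moments
    print(f"{name}: mean={mean:.0f} Hz, variance={variance:.0f}, "
          f"skewness={skewness:.2f}, kurtosis={kurtosis:.2f}")
```

In this toy case, the spectrum with more low-frequency energy has the lower mean and the more positive skewness. This is the same direction of difference reported for male relative to female talkers.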
Stuart-Smith (2007) investigated gendered productions of /s/ in Glaswegian English in males and females from social groups that varied in terms of their prescribed gender norms. While the number of participants in this study was relatively small (n = 32), it is an example of research that is rich in the sophistication of its acoustic measurements, analyses, and social theory. Hence, it is a useful illustration of research on the phonetics of sex and gender. According to Stuart-Smith, middle-class Glaswegian concepts of gender involve more differentiation between men and women than do working-class gender norms. For example, female behavior in the middle class involves more explicit discussion of the domestic realm, compared to the gendered norms for working-class females. In Stuart-Smith’s study, younger (aged 13–14) and older (aged 40–60) males and females from middle-class and working-class socioeconomic backgrounds were recorded reading a wordlist that included /s/ words and one /ʃ/ word. These /s/ and /ʃ/ words were predicted not to show social variation between the groups of speakers. Indeed, they were originally conceived of as control words in a study that focused on sounds whose variation had been studied previously. Stuart-Smith reasoned that, if fricative acoustics were not influenced by socially constructed class-prescribed norms, we would expect that age and sex would both affect fricative acoustics, as speakers developmentally changed in size throughout the lifespan. By contrast, interactions of age and gender with social class, together with a lack of simple age and gender effects, would provide evidence that /s/ production is shaped by the social conventions of the working- and middle-class cultures.
Across a range of acoustic measurements, Stuart-Smith found that young working-class girls have /s/ whose acoustics pattern with the male speakers, despite the lack of evidence of anatomical similarities between these speaker groups. Boys and men from both working- and middle-class backgrounds clustered together in the analyses, providing evidence of a socially structured “male” style that is exploited by male speakers regardless of age and class. Hence, Stuart-Smith provided some of the first evidence that variation in /s/ was not due to anatomical influences alone in women, as /s/ differed across groups of women for whom there were no obvious differences in anatomy.
Further data on the social mediation of sex differences in /s/ are provided by Podesva and Van Hofwegen (2014), who studied variation in a rural Northern California community. They found greater /s/ differentiation between men and women in that community than had been found previously in people living in urban centers. The authors speculated that this related to the people in this community adhering to more traditional male and female variants, while those in urban centers were freer to use a variety of /s/ variants. Like Stuart-Smith, Podesva and Van Hofwegen provide strong evidence that an apparent sex difference can be mediated by a number of other social factors, like age, social class, and whether one is rural or urban.
There are many opportunities for talker sex to conceivably affect the acoustic-phonetic realization of oral stops. One commonly investigated feature is voice onset time. With their larger vocal tract volumes, males may have an easier time setting up the specific aerodynamic configurations to generate voicing of the vocal folds, leading to lower voice onset time values in voiceless unaspirated stops (e.g., Whiteside & Irving, 1998). Across languages, however, we see considerable variation in terms of whether males or females exhibit longer or shorter voice onset time values, strongly suggesting that temporal patterns of stop articulation are subject to language-specific learned patterns of gender performance (Oh, 2011).
Foulkes and Docherty (2005) examined variants of /t/ in English speakers in and around Newcastle, England. There are at least two variants of word-medial /t/ in this dialect: a plain variant, and a variant that is produced with a very constricted glottis. The glottal variant is produced at high rates by men, regardless of age or social class. For women, the rate of production of the glottal variant differs systematically across age (older women produce fewer glottal tokens than younger women) and social class (middle-class women produce fewer glottal tokens than working-class women).
The fact that sex differences in /s/ and /t/ production in the studies reviewed in the preceding are so strongly mediated by other social variables calls into question whether an explanation for them based on strictly biological factors is even plausible. There is no reason to believe that vocal tract size and shape should differ systematically between, for example, rural and urban individuals. However, research directly examining the relative influence of vocal-tract variables and potentially social influences on consonant variation is sparse. One example of such a study is presented by Fuchs and Toda (2010), who examined whether a variety of anatomical measures of palate shape and size could predict the magnitude of male-female differences in /s/ production. They found no evidence that anatomical factors can explain the sex differences in /s/ production. Given this lack of a relationship, they reasoned that sex differences in /s/ must reflect learned factors.
The cases of /s/ in Glaswegian and /t/ in Newcastle are arguably cases of socially driven differences in speech. Given that these are stratified by self-reported biological sex, it is not inaccurate to label them as sex differences. However, the fact that they are mediated by social factors like age and socioeconomic status suggests that they are cases of socially driven variation. In the remainder of this chapter, we will refer to these as cases of potentially socially driven variation. By this, we mean that the variation does not have any basis in anatomical differences between the sexes. The word potentially is key, because we regard it as unwise to attribute a social origin or social meaning to a pattern of variation simply because one can think of no other source of the variation. We believe that something cannot be deemed to be socially meaningful variation unless there are rigorous, methodologically appropriate studies both of its origin and of the consequences of this variation on social functioning and social evaluation.
With that proviso in mind, consider again the finding described earlier that men’s and women’s formant frequencies differ across languages. What would a social explanation for those patterns involve? One possibility, discussed earlier, is that it might involve language-specific maneuvers to lengthen or shorten the vocal tract. One reasonable hypothesis is that such behaviors developed because a culture placed high value on sounding a particular way. Such an explanation would need to be verified by rigorous examination of this hypothesis. One model study of a method for a hypothesis like this is presented by Van Bezooijen (1995). This study examined f0 differences across languages, rather than formant differences, but the methodology is one that we feel is ideally suited to introduce the reader to a method for studying socially motivated variation of any phonetic variable.
Van Bezooijen investigated why there appears to be more sex differentiation in f0 in Japanese speakers than in Dutch speakers, a finding previously documented by Yamazawa and Hollien (1992). She hypothesized that these language-specific tendencies relate to culture-specific value ascribed to f0 variation. She reasoned that Japanese culture would place a higher value on having an f0 that matches one’s biological sex than would Dutch culture. A series of rating experiments found evidence for this: Japanese-speaking listeners evaluated women as being less attractive when their speech had been altered to have a low f0. They rated men to be less attractive when the speech was altered to have a high f0. The effect of f0 on Dutch listeners’ evaluation was smaller. Moreover, Japanese-speaking listeners indicated that the attributes of what they believed to be an ideal woman were those associated with higher pitch across both Dutch and Japanese subjects, and the attributes for an ideal man were those associated with lower pitch. Dutch listeners’ attributes of an ideal man and woman were more similar, and not as clearly linked to having a high or low f0. Van Bezooijen’s study is an example of an investigation that seeks to ground differences in observed male-female speech in differences in values and expectations across cultures.
Identifying and understanding phonetic differences between men and women is challenging in any context, but researchers’ ability to delineate physiological or anatomically induced variation from potentially social variation is further diminished when the community is undergoing sound changes. Let us take as an example GOOSE-fronting, a sound change affecting many varieties of English including varieties spoken in the United Kingdom (Harrington, Kleber, & Reubold, 2008) and North America (Hagiwara, 1997; Hall-Lew, 2011). GOOSE-fronting affects the second formant frequency of a high, non-front vowel. The historically older vowel is a true high-back vowel, with a characteristically low F2. The more contemporary production involves a generally more front tongue body position. The smaller front cavity of the new production results in a higher F2. Sound changes that affect resonant properties, like GOOSE-fronting, can masquerade as sex-based differences if viewed through a narrow lens. That is, if women show a higher second formant frequency in GOOSE than men, one may incorrectly conclude that this is due to women making an articulatory maneuver to shorten their vocal tract. However, an equally plausible explanation is simply that women engage in the sound change more frequently than men, or to a greater extent than men.
A recent and impressive example of potentially social variation is described in Labov, Rosenfelder, and Fruehwald (2013). Labov et al. describe sound changes in the speech of 379 White Philadelphians born between 1888 and 1991, who were recorded across a nearly 40-year period. Labov et al. describe several sound changes in progress in the Philadelphian English vowel system. For many of these sound changes, women were more likely to engage in the sound change at its inception than were men. For example, women and men born around the beginning of the 20th century did not differ in the position of the GOAT vowel, symbolized with /oʊ/ in many American dialects, in their vowel space. As the century progressed, women began fronting their GOAT vowel, as evidenced by its F2 frequencies. That is, there is a correlation between when women were born and the F2 of their GOAT vowel. The F2 in GOAT reaches its peak for those who were born around 1940, and women born after this date have progressively lower F2 frequencies in GOAT. This is a strong example of potentially social variation. There is no evidence that women born in Philadelphia around the middle of the 20th century are markedly different anatomically from women born in the earlier and later parts of the century. There is no epidemiological evidence to suggest, for example, that there was a spike in a particular craniofacial malformation, or in a distinctive pattern of dentition that might affect the front-back movement of the tongue. A reasonable hypothesis, then, is that women in different time periods were simply engaging in different social trends that affect the pronunciation of the GOAT vowel.
Further evidence of social variation comes from studies using ethnographic methods. Ethnography involves careful observation of behaviors occurring in natural environments. One example of a detailed ethnographic study that examined (among other things) phonetic variation is given by Mendoza-Denton (2008). Mendoza-Denton examined a variety of markers of gang affiliation in girls in a high school in East Palo Alto, CA. Her study was noteworthy in many ways, two of which are relevant to this chapter. First, she sampled data very densely over a long period of time. Hence, the observations are very rich, and show how individuals’ behaviors changed as a function of their affiliation with different groups in this high school. The ethnographic method allowed Mendoza-Denton to assess the various meanings that the people she observed ascribed to different behaviors, including the use of different phonetic variants. Second, she used acoustic analysis to examine speech. One salient social category that emerged from her study is an identity of “having an attitude” (Mendoza-Denton, 2008: 251). Acoustic analysis showed that individuals who projected this particular identity were more likely to produce the KIT vowel with a more peripheral articulation, resulting in a vowel that is raised in the vowel space with a lower F1 and a higher F2. This more FLEECE-like pronunciation is not related to anatomical differences in these individuals, but is an example of potentially social variation, where a speaker’s production reflects the identity they wish to project.
One example of a study using a mix of ethnographic and experimental methods to examine gendered speech is given by Zimman’s research on the speech of transgender men (some of whom self-identify with the terms trans man or transmasculine man). The speech of trans men is a particularly interesting area of research, because it provides a potential to study how speech changes as an individual’s gender identity (e.g., one’s experience with their gender, which may or may not correspond to one’s sex assigned at birth) and gender presentation (i.e., visible markers of gender, like hair length and clothes) change. In a series of observational, acoustic, and perceptual studies (e.g., Zimman, 2013, 2017), Zimman has shown that there is considerable variation in the acoustic characteristics of the speech of transmasculine men – individuals who are born female, but identify with masculine practices more than female practices. These appear to correlate with a variety of characteristics, such as their gender presentation, and participation in gendered roles and communities. For example, Zimman (2017) shows how these different factors predict variation in /s/ among a group of 15 transmasculine men.
The papers reviewed thus far in this section show convincingly that there are cases where anatomical information can potentially explain phonetic variation, and cases where it cannot. The latter cases include ones in which there is clear evidence, either from demographic patterns or from ethnographic observation, that social factors are a driving force in the variation. The last discussion in this section highlights a particular problem for this research: cases in which there are multiple potential explanations for a given sex difference. The specific case is that of the size of the vowel space in men and women. If we were to plot the peripheral vowels of English in the first formant (F1) by second formant (F2) space, we would likely find that the size of the F1/F2 space is larger for women than it is for men (e.g., Hillenbrand et al., 1995). Why is this? Is this a case of anatomical variation, of potentially social variation, or something else altogether that we haven’t yet considered? The answer is “yes,” which is to say that there are many potential explanations for this finding. These explanations are not meant to be exhaustive, nor are they necessarily mutually exclusive. But they are presented in this chapter for two reasons. The first is simply pedagogical. As this is a review chapter, we feel that it is important to give the reader a sense of the myriad explanations that might exist for a particular pattern of variation. The second, however, is to emphasize that there are often multiple plausible explanations for patterns of variation that come from very different academic fields. The norms for research methods and sources of evidence can differ widely across fields, and these differences can then limit researchers’ willingness to read research that uses methods that are different from their own. Ultimately, all of these accounts have great merit and potential explanatory value. Let us now consider these explanations in turn.
The discussion in this section follows similar explanations given by Simpson (2009).
The first explanation for these differences is anatomical. Women have smaller-sized vocal tracts than men. Hence, equivalent articulatory movements will have different consequences for men and women. A tongue movement of 7 mm by a woman might result in a more extreme articulation than would the same 7 mm movement by a man, given her smaller-sized vocal tract. This more-extreme movement would presumably translate to more-extreme acoustic values. Hence, we might posit that this sex difference is due in large part to an anatomical difference between the sexes. This argument is elaborated upon in Simpson (2002).
The second set of explanations can be called potentially social. These explanations appeal to the sociolinguistic literature on the use of standard and nonstandard forms of language, such as isn’t vs. ain’t. As reviewed by Coates (2016), there is a long history of research showing that women are less likely than men to use nonstandard forms like ain’t. There are numerous purely social explanations for this. For example, because women are involved in child-rearing roles, they perhaps have an onus to preserve the standard forms of the language in their speech to children, which is frequently described as hyperarticulated (Kuhl et al., 1997). Alternatively, women’s historical status as subordinate to men might prompt them to use more standard language as a way of sounding polite or educated, as this would potentially regain some of the ground lost due to social inequities. These purely social explanations assume that larger-sized vowel spaces are more standard than smaller-sized ones. Indeed, this is likely to be so, as larger-sized vowel spaces are associated with more intelligible speech (Bradlow, Torretta, & Pisoni, 1996).
The third explanation concerns the perceptibility of male and female voices. Consider again Figure 18.2. The higher fundamental frequency for the woman in Figure 18.2 leads her to have more widely spaced harmonics than the man. This is sometimes referred to as sparser sampling of harmonics. The F1 and F2 frequencies that determine the size of the vowel space are based on the resonant frequencies of the vocal tract. These formant frequencies are used to identify vowels, and to discriminate vowels from one another. The sparser sampling of women’s harmonics means that their harmonics are less likely to align with a formant frequency than are men’s harmonics. The larger-sized vowel spaces produced by women might reflect their tacit knowledge that their high f0 makes their speech vulnerable to misperception, and that producing an exaggerated acoustic difference between pairs of vowels might remedy this (Diehl, Lindblom, Hoemeke, & Fahey, 1996). The hypothesis that sex differences in vowel-space size are affected by ease of perception is further supported by Munson (2007b). Munson showed that the differences between men’s and women’s vowel-space sizes are biggest for words that are predicted to be easy to perceive based on their frequency of use and their inherent confusability with other real words, suggesting that significant social phonetic variation manifests in words where that variation may be more likely to be noticed.
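The idea of sparser harmonic sampling can be illustrated with simple arithmetic. Harmonics fall at integer multiples of f0, so the nearest harmonic can never be more than f0/2 away from a formant, and a higher f0 widens that gap. The sketch below computes the distance from a formant to its nearest harmonic and averages it over a sweep of formant frequencies. The two f0 values and the sweep range are hypothetical choices for illustration, not values taken from Figure 18.2 or from any cited study.

```python
def nearest_harmonic_distance_hz(f0_hz, formant_hz):
    """Distance in Hz from a formant frequency to the closest harmonic of f0."""
    if f0_hz <= 0 or formant_hz <= 0:
        raise ValueError("frequencies must be positive")
    # Harmonics are n * f0 for n = 1, 2, 3, ...; pick the closest n (at least 1).
    harmonic_number = max(1, round(formant_hz / f0_hz))
    return abs(formant_hz - harmonic_number * f0_hz)


def mean_distance_hz(f0_hz, formants_hz):
    """Average nearest-harmonic distance across a set of formant frequencies."""
    distances = [nearest_harmonic_distance_hz(f0_hz, f) for f in formants_hz]
    return sum(distances) / len(distances)


# Hypothetical values: a lower and a higher f0, and a sweep of formant frequencies.
LOWER_F0_HZ = 110
HIGHER_F0_HZ = 220
formant_sweep = list(range(300, 901, 10))

for f0 in (LOWER_F0_HZ, HIGHER_F0_HZ):
    average = mean_distance_hz(f0, formant_sweep)
    print(f"f0 = {f0} Hz: worst case {f0 / 2:.0f} Hz, "
          f"average over sweep {average:.1f} Hz")
```

With these assumed values, the higher f0 leaves formants farther, on average, from the nearest harmonic. This is the sense in which the spectral envelope is sampled more sparsely.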
As noted at the outset of this section, these explanations for why the F1/F2 vowel spaces are larger for females than males are neither exhaustive nor are they mutually exclusive. Here, they serve to illustrate the wide variety of methods used to assess and understand sex and gender differences in speech.
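One of several ways to quantify the size of an F1/F2 vowel space is the area of the polygon whose vertices are the corner vowels. The sketch below computes that area with the shoelace formula. The formant values are hypothetical round numbers, not data from Hillenbrand et al. (1995) or any other cited study. The second talker is simply the first with every formant raised by an assumed 15%, standing in for a uniformly shorter vocal tract.

```python
def polygon_area(points):
    """Area of a simple polygon via the shoelace formula.

    points: (F2, F1) pairs in Hz, listed in order around the perimeter.
    """
    if len(points) < 3:
        raise ValueError("a polygon needs at least three vertices")
    twice_area = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]  # wrap around to the first vertex
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Hypothetical corner-vowel formants as (F2, F1) in Hz, ordered /i/, /æ/, /ɑ/, /u/.
talker_a = [(2300, 300), (1800, 700), (1100, 750), (900, 350)]

# Assumption for illustration: talker B's formants are uniformly 15% higher.
SCALE = 1.15
talker_b = [(f2 * SCALE, f1 * SCALE) for f2, f1 in talker_a]

area_a = polygon_area(talker_a)
area_b = polygon_area(talker_b)
print(f"Talker A area: {area_a:.0f} Hz^2")
print(f"Talker B area: {area_b:.0f} Hz^2 (ratio {area_b / area_a:.3f})")
```

Because area in Hz grows with the square of a uniform scaling factor, a talker whose formants are all proportionally higher will show a larger vowel space on this measure even with identical articulation. This is one reason a raw area difference cannot by itself decide among the anatomical, social, and perceptual accounts above.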
The discussion of speech perception at the end of the previous section segues into talking about how sex and gender differences in speech are perceived. This section summarizes the literature on the consequence of sex- and gender-based phonetic variation for speech perception. First, we consider individuals’ perception of speaker sex through patterns of phonetic variation. Given how pervasively sex affects phonetic characteristics, it is not surprising that statistical discrimination functions can readily identify sex from even brief stretches of content-neutral speech (Bachorowski & Owren, 1999). Listener identification is similarly robust. Only four of the 44 talkers in Munson et al. (2006) were identified as their biological sex less than 95% of the time by a group of naïve listeners who were presented with single words and asked to judge sex. Moreover, all 44 talkers were identified accurately at greater-than-chance levels. This is noteworthy because the group varied considerably in other measures of sex typicality, including ratings of the masculinity and femininity of their speech (Munson, 2007a). Cases of the opposite can be found: Matar, Portes, Lancia, Legou, and Baider (2016) examined the acoustic and perceptual characteristics of the speech of women with Reinke’s edema (RE). RE is a condition that affects the health of the vocal folds, and which leads to a dramatic decrease in f0. Matar et al. report that Lebanese Arabic-speaking women with RE are often mistaken for men, and that they find this very troubling.
What do people listen for when they identify a talker as male or female? Readers of the chapter thus far should not be surprised to know that the answer is: many things. First, consider the relative contribution of characteristics of the voicing source and the vocal-tract filter to the identification of sex. This was the subject of a recent series of studies (Skuk & Schweinberger, 2014; Skuk, Dammann, & Schweinberger, 2015). Skuk and Schweinberger (2014) created sets of male to female voice morphs that independently morphed f0 trajectories or spectral properties related to vocal tract resonance, and asked listeners to categorize the voices as male or female. The identification of certain voice morphs as androgynous-sounding then allowed them to combine, for example, an androgynous f0 trajectory with female-sounding vocal tract resonance properties. Their results demonstrate that while f0 may be the most important single parameter, such a conclusion stems from the outsized role of f0 on the perception of voices for those with the lowest of f0 values. For all other voice types – which is most voice types – f0 and vocal tract resonances both substantially contribute to the perception of voices as male or female. Using similar morphing techniques in the context of a selective adaptation paradigm, Skuk et al. (2015) find that timbre (operationalized by the authors as the acoustic characteristics that distinguish voices with the same perceived pitch and loudness) induces larger aftereffects than f0 manipulations, suggesting timbre affects the identification of voice gender more strongly than fundamental frequency.
The interaction between source and filter characteristics in perceiving sex was also explored by Bishop and Keating (2012). Speakers were asked to produce their maximal f0 ranges, eliciting steady-state vowels in nine roughly equally spaced steps across their f0 range for the vowel /a/. Listeners then evaluated where they felt that production landed within the speakers’ ranges. In a second experiment, they asked listeners to categorize the gender of the speakers. They found that f0 position-within-range was interpreted with respect to the speaker’s gender, indicating that listeners have clear expectations about f0 distributions across genders and interpret variable f0 accordingly. Listeners seem to compare a given f0 to a gender-based population sample, which indicates that gender categorization is necessary in interpreting absolute f0 values.
Listeners have pervasive stereotypes of how men and women speak. Indeed, these stereotypes are so strong that they are the subject of frequent debate in the media. Rigorous instrumental investigation of these stereotypes shows that they are often deceptive. One common stereotype that is apparent through internet searches and pop science books (e.g., Brizendine, 2006) is that women speak faster than men. Acoustic-phonetic analyses suggest otherwise: Byrd (1994) provided some early evidence that the contrary is true, namely that women speak more slowly than men. Following from the fact that more complex phonetic events are perceived as being longer (Lehiste, 1976; Cumming, 2011), Weirich and Simpson (2014) reasoned that the perception that women speak more quickly than men relates to the increased vowel space typically produced by women (cite): “a speaker traversing a larger acoustic space in the same time as a speaker traversing a small acoustic space might be perceived as speaking faster” (Weirich & Simpson, 2014: 2). Weirich and Simpson synthetically controlled the f0 contours and durations of 40 male and female speakers, who naturally varied in the size of their vowel space. Indeed, listeners judged speakers with a larger acoustic vowel space as speaking faster, resulting in women being judged as faster talkers, despite the lack of actual duration differences in the speech samples.
The descriptive norms for men’s and women’s voices sometimes appear to translate into listeners having prescriptive expectations for how men and women should speak. That is, listeners’ preferences for male and female voices often reflect a preference for voices that fit typical gendered speech patterns. Listeners’ expectations and stereotypes about voices appear to be stronger for female speakers, at least in American English: correlations between judgments of what is vocally attractive and what is vocally typical are stronger for female voices than for male voices (Babel & McGuire, 2015).
Vocal preferences have been most frequently explored in the literature from the perspective of vocal attractiveness, given the evolutionary interest in voices as a sign of Darwinian fitness and a guiding force in mate selection. This line of research typically finds that listeners have a preference for vocal patterns that provide acoustic evidence for sexual dimorphism, conveying evidence for gender-stereotypical vowel acoustics. That is, listeners tend to find average or slightly higher-than-average pitch preferable for female voices and average or slightly lower-than-average pitch more vocally attractive for male voices (Zuckerman & Miyake, 1993; Riding, Lonsdale, & Brown, 2006; Saxton, Caryl, & Roberts, 2006; Feinberg, Jones, Little, Burt, & Perrett, 2005; Feinberg, DeBruine, Jones, & Perrett, 2008). Previous research on vocal preferences has approached the speech signal as though humans are robots programmed to vocalize based on the constraints imposed by their hardware (e.g., unpliable vocal tracts and articulators). While providing important basic insights on the apparent roles of perceived pitch on voice preferences, this work often uses linguistically impoverished stimuli, such as vowels produced in isolation, which tend to then be produced in a sing-song quality with artificially high f0. The cultural and social variations on spoken language, however, highlight that speakers have some agency in shaping the outputs of their vocal tracts. In this spirit, Babel, McGuire, and King (2014) attempted a step forward in our understanding of vocal preferences, assessing listeners’ judgments of vocal attractiveness from a perspective of speaker agency and malleability, making acoustic measurements that go beyond f0 and apparent vocal tract size. 
While still using non-ideal speech data – single words produced in isolation – they showed that (specifically Californian) listeners’ vocal preferences were based on a suite of acoustic-phonetic features that related to sexual dimorphism, as previously well documented, in addition to apparent health and youth and a speaker’s engagement with community-level sound changes. These results suggest that vocal preferences cannot be reduced to signals listeners use to infer a speaker’s physical stature.
To better understand social meanings, experimental phoneticians must be mindful not to fall into the trap of over-interpreting the results of perception experiments. Consider, for example, a hypothetical experiment examining how a sentence’s average f0 affects judgments of how masculine a man’s speech is. Imagine that the two sentences shown in Figure 18.3 were used as stimuli, and that they elicited different judgments of vocal masculinity. It might be tempting to interpret this finding as evidence that pitch influences masculinity. However, given that the pitch differences are due to the presence of a parenthetical comment in the sentence, this interpretation is likely to be an oversimplification. The association between pitch and masculinity in this hypothetical experiment is just as likely to reflect knowledge of the extent to which men and women use parenthetical comments in discourse.
Indeed, the literature on the perception of speaker attributes is replete with examples of cases where listeners identify a speaker as having an attribute that is different from what they might be trying to express. In Matar et al.’s research on women with RE, described earlier, the source of the mismatch is biomechanical. Other, more complex cases might be due to listeners’ behavior when asked to judge a social category about which they know little. For example, Smith, Hall, and Munson (2010) showed that different variants of /æ/ elicited judgments of different states of health (i.e., weight, and whether or not a person smoked). Questions about speaker health were included in Smith et al.’s study as filler items. The main focus of that study was on sexual orientation and /æ/ variation. There is no evidence that a person’s health affects the type of /æ/ they produce. A more likely explanation is that listeners developed a complex logic over the course of the experiment that led to these judgments. For example, listeners might have associated /æ/ variants with different geographic regions (rightly so, as /æ/ does vary according to region), and then applied stereotypes about people who live in those regions. Another example of this is given by Zimman (2013), who showed that many transmasculine men are rated by listeners as sounding gay, despite the fact that they do not identify as gay men. This may be due to listeners having only weak stereotypes about how gay men talk, or due to the lack of response categories that reflected what the listeners actually perceived. One potential solution to this problem comes from D’Onofrio (2015), who showed that listeners perceive speech more quickly when primed with an authentic social category than when provided with one that is overly general.
Another fruitful source of information about the nature of gendered speech comes from its development. The evidence presented thus far shows that sex differences in the speech production mechanism explain many acoustic differences between men and women, and that potentially social explanations account for other differences. The potentially social variation would have to be learned. This invites an investigation of when sex differences in the vocal-tract mechanism develop, and when and how the learned differences between men and women are acquired.
Evidence for the early emergence of sex-specific articulatory behavior is given by Perry, Ohde, and Ashmead (2001). Perry et al. examined listeners’ perception of hVd words produced by 4-, 8-, 12-, and 16-year-old boys and girls. Listeners rated their production on a six-point scale, ranging from “definitely a male” to “definitely a female.” There were significant differences in the ratings for all four groups. The ratings for the youngest three groups clustered around the two middle values, which represented “unsure, may have been a male,” and “unsure, may have been a female.” The ratings were predicted by measures of vowels’ formant frequencies. Not surprisingly, the largest differences were found for the oldest group. This includes speakers who have undergone puberty. At puberty, the male larynx descends dramatically. This change in the size of the vocal tract is the primary factor driving sex differentiation in formant frequencies.
Other studies have examined the acoustic characteristics of boys’ and girls’ speech. The formant frequencies of children aged 7–8 show gender differences that vary in size across the vowel space, with part of this difference attributable to sex differences in physical size (Bennett, 1981). Bennett also reasons that some of the differences are due to “sex-specific articulatory behaviors” (p. 238). Similar results are reported by Whiteside (2001), who found that the differences in resonant frequencies for males and females across the lifespan cannot be wholly attributed to vocal tract morphology, but must have some footing in learned, gendered ways of pronunciation.
Part of the seemingly contradictory evidence in the studies reviewed in the previous paragraph arises because they did not have access to anatomical data from their participants. Recent research by Vorperian and colleagues has considered the question of the development of sex differentiation in vocal tracts. Vorperian et al. (2011) made a large suite of measurements on MRI and CT images of 605 individuals aged birth to 19 (broken down into four cohorts, as follows: Cohort 1, birth to 4;11; Cohort 2, 5 to 9;11; Cohort 3, 10 to 14;11; Cohort 4, 15 to 19;11) to identify and quantify developmental changes in vocal tract morphology. These included the following measures: vocal tract length, vocal tract vertical (distance from the glottis to the palatal plane), posterior cavity length, nasopharyngeal length, vocal tract horizontal (distance from lips to the pharyngeal wall), lip thickness, anterior cavity length, oropharyngeal width, and vocal tract oral (vocal tract horizontal minus lip thickness). They found pre-puberty differences in the vocal tract’s horizontal length and oral size, with the gender differences increasing in the puberty and post-puberty age groups in the pharynx (vocal tract vertical and posterior cavity length increases). There were no differences in nasopharyngeal length. Crucially, Vorperian et al. document that sex differences in vocal tract anatomy are not stable across development. Rather, the sex differences wax and wane at different developmental time points. However, Barbier et al. (2015) were unable to replicate the finding of gender differences before puberty in growth curves fit to measurements taken from midsagittal X-rays recorded in longitudinal studies of the development of dentition, and argued that the early sex differences found in Vorperian et al. (2009) were an artifact of imbalanced sampling of sexes and ages within the large moving window used in their statistical model.
The very question of the robustness of the differences in early vocal-tract size is an open one. This, in turn, invites the question of the extent to which these vocal-tract differences might explain any observed differences in the speech of boys and girls.
Still other cases provide evidence for the learning of potentially social variation. One such case involves the variants of medial /t/ in Newcastle English, described earlier. Foulkes, Docherty, and Watt (2005) examined productions of medial /t/ by children acquiring that dialect. They found that boys and girls began to produce variants of medial /t/ between three and four years of age that mirrored those spoken by adult men and women, respectively. This is a very striking finding, as it suggests very early learning of this feature. Even more striking was the finding that mothers’ use of different variants of /t/ in child-directed speech differed depending on whether they were speaking to a boy or to a girl. Mothers used more male-typed variants when speaking to boys than when speaking to girls. This suggests that child-directed speech may be one mechanism through which linguistic variation is taught.
Still other evidence of the learning of gendered speech comes from studies of the relationship between social preferences and speech in children. Two recent studies found an association between the extent to which children’s interests and self-presentation aligned with those that are expected of their biological sex, and how sex-typical their voices were. Munson, Crocker, Pierrehumbert, Owen-Anderson, and Zucker (2015) examined acoustic and perceptual characteristics of 30 North American English-speaking boys diagnosed with gender identity disorder (GID). A label of GID (a now-obsolete diagnostic category, replaced by the label Gender Dysphoria) indicates that the child has social preferences and, potentially, a gender identity that does not align with her or his biological sex. Munson et al. found that boys with GID were rated by a group of naïve listeners to sound less prototypically boy-like than boys without GID. However, the acoustic differences between the groups were subtle, and did not suggest that they differed in ways that would suggest differences in vocal-tract size or vocal-fold morphology. That is, the differences suggested that listeners were attending to subtle, learned differences between the groups. Li et al. (2016) examined correlations between the spectral characteristics of /s/ and measures of children’s gender identity taken from a parent questionnaire about children’s behavior. Li et al. found significant correlations between the acoustic characteristics of boys’ productions of /s/ and performance on the questionnaire. Together, Li et al. and Munson et al.’s findings support the hypothesis that children learn gendered speech variants, and that the learning of these variants correlates with social preferences.
The literature reviewed in this chapter is merely a subset of the existing literature on the phonetics of sex and gender. However, even if the literature had been reviewed comprehensively, the message would be the same: previous work has raised more questions about the nature of gender and sex effects on speech than it has answered. Indeed, we regard this fact as one appealing aspect of the science of phonetics: much work is left to be done. Many sex and gender differences in speech are, in truth, under-documented and understudied, and have origins that remain unknown. Hence, there is much room for new research. This section outlines three areas of research that we believe to be very important, at least at the time that this chapter was written in late 2018.
The first item on the future agenda is to better understand the potential anatomical grounding of sex differences in consonant production. There are numerous benefits to studying sex differences in sounds produced with a relatively open vocal tract: their acoustic modeling is readily accessible to individuals with relatively modest training in mathematics, and they comprise a large portion of the speech stream. However, as detailed acoustic-phonetic investigations of social differences progress, it has become increasingly clear that there are many cases of consonant variation, such as the variation in /s/ and /t/ described above. While these are seemingly unrelated to vocal-tract features, that conclusion is limited by the fact that our understanding of the articulatory-acoustic relationships for consonants is less developed than our understanding for vowels. One useful set of techniques to address this shortcoming is analysis by synthesis. By this, we mean using speech synthesis to infer the articulatory configurations that might have produced specific vocalizations. There now exist a variety of speech synthesizers based on articulatory models of the vocal tract that use simple interfaces (e.g., Plummer & Beckman, 2016). Individuals can manipulate articulatory variables and compare acoustic output with that observed in laboratory and field studies. The result of this is a step toward understanding the articulatory bases of the sex and gender differences that exist in acoustic signals.
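The logic of analysis by synthesis can be illustrated with a deliberately minimal sketch. It does not use any of the articulatory synthesizers cited above. Instead, it assumes the textbook idealization of the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/4L. The function names `tube_formants` and `fit_tube_length`, the speed-of-sound constant, and the search range are all illustrative assumptions. The sketch "synthesizes" formants for a range of candidate tube lengths and keeps the length whose output best matches a set of observed formants.

```python
import math

# Assumed approximate speed of sound in warm, moist air, in cm/s.
SPEED_OF_SOUND_CM_S = 35000.0


def tube_formants(length_cm, n_formants=3):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Uses F_n = (2n - 1) * c / (4 * L), a strong simplification of a real vocal tract.
    """
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]


def fit_tube_length(observed_hz, lo_cm=10.0, hi_cm=20.0, step_cm=0.01):
    """Grid-search the tube length whose synthesized formants best match the observed ones.

    Returns (best_length_cm, error), where error is the summed squared
    log-ratio between synthesized and observed formants.
    """
    best_length, best_error = None, math.inf
    n_steps = int(round((hi_cm - lo_cm) / step_cm))
    for i in range(n_steps + 1):
        length = lo_cm + i * step_cm
        synthesized = tube_formants(length, len(observed_hz))
        error = sum(math.log(s / o) ** 2
                    for s, o in zip(synthesized, observed_hz))
        if error < best_error:
            best_length, best_error = length, error
    return best_length, best_error
```

Under these assumptions, calling `fit_tube_length([500, 1500, 2500])` recovers a length of about 17.5 cm. Real analysis by synthesis would replace the single length parameter with the many articulatory variables of a full vocal-tract model, and, for consonants, would need to model noise sources as well as resonances.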
A second agenda item moving forward concerns the availability of corpora to study sex and gender differences. It is here that there is perhaps the biggest mismatch between research on gender differences in speech that has been conducted using methods from sociolinguistics and research that has been conducted using methods from experimental phonetics. Consider the corpus studies of Peterson and Barney (1952) and Hillenbrand et al. (1995). These studies examine productions of words that minimize linguistic sources of variation (i.e., they minimize coarticulatory effects by using words with phonetic contexts that are as neutral as possible). Moreover, the use of read speech, produced without an interlocutor present, minimizes any motivation to produce speech with socially meaningful variation. Indeed, the task of producing speech in such a context may suppress the production of socially meaningful variation.
In contrast, consider sociolinguistic studies of gender, such as those described by Stuart-Smith, Zimman, and others. Those studies use methods that maximize the likelihood of eliciting socially significant variation, by making recordings in socially meaningful environments, and with specific interlocutors. These are sensible choices when the goal is to elicit socially meaningful variation. However, these same choices increase the likelihood that recordings will include background noise that will obscure acoustic measures, or that sounds will be elicited in a small number of words, and that the characteristics of words will obscure or enhance sex differences, or that the speech will reflect accommodation to particular interlocutors.
There is no easy solution to this conundrum. One solution is to be careful to interpret male-female differences cautiously, given the limitations of the methods used to collect the corpus in question. Consider, for example, the finding that there are some differences in speech that appear not to be due to vocal-tract length differences. This conclusion does not compromise the fact that vocal-tract length influences are nonetheless present in speech signals: individuals do not have the capacity to shorten or lengthen their vocal tracts infinitely. A naïve listener who assumes that all speaker differences are due to individual vocal-tract sizes will be right more often than they are wrong. Such a finding invites the question of whether children’s learning of sex and gender differences first involves attending to these gross speech features, before attending to finer-grained features like the phonetic detail of a medial /t/ or the spectral characteristics of an /s/.
The third item on our future research agenda is to collect data on more languages. The advent of affordable, high-quality recording equipment and free acoustic analysis software means that there are more opportunities than ever to analyze data on under-documented languages. We make the strong argument that no language documentation should be conducted without considering carefully the demographic characteristics of speakers. Having access to more carefully balanced samples can lead to testing hypotheses about the universality of some of the male-female differences that we summarized in this paper. The charge to collect data on more languages includes data on development.
The final item in our research agenda relates to mindset. Many readers will have noticed that throughout this article, we have presented our argument as if the male pronunciation was the standard, and the female pronunciation was the aberrant one, requiring an explanation. This was done in part out of habit. However, once we realized that we were writing this way, we decided to remain consistent, and to close the chapter by encouraging readers to break out of this mindset when studying gender and speech. We hope that the work we have reviewed convinces the reader that sex and gender differences are best understood as cases of linguistic variation, not as cases of deviation from a standard. We encourage the reader to continue to pursue this topic with a mindset of understanding the roots and consequences of this variation, rather than carrying on Henry Higgins’ lament that women don’t speak like men.
The authors would like to thank their collaborators for helping them develop the ideas presented in this article, especially Mary E. Beckman, Jan Edwards, Keith Johnson, Grant McGuire, and Andrew Plummer. All remaining errors are ours alone.
Babel, M., & McGuire, G. (2013). Listener expectations and gender bias in nonsibilant fricative perception. Phonetica, 70(1–2), pp. 117–151.
Babel, M., & McGuire, G. (2015). Perceptual fluency and judgments of vocal aesthetics and stereotypicality. Cognitive Science, 39(4), pp. 766–787.
Babel, M., McGuire, G., & King, J. (2014). Towards a more nuanced view of vocal attractiveness. PLoS One, 9(2), p. e88616.
Babel, M., & Munson, B. (2014). Producing socially meaningful variation. In M. Goldrick, V. Ferreira, & M. Miozzo (Eds.), The Oxford handbook of language production (pp. 308–328). Oxford: Oxford University Press.
Bachorowski, J., & Owren, M. (1999). Acoustic correlates of talker sex and individual talker identity are present in a short vowel segment produced in running speech. The Journal of the Acoustical Society of America, 106(2), pp. 1054–1063.
Barbier, G., Boë, L., Captier, G., & Laboissière, R. (2015). Human vocal tract growth: A longitudinal study of the development of various anatomical structures. Interspeech-2015, pp. 364–368.
Barreda, S. (2017). Listeners respond to phoneme-specific spectral information when assessing speaker size from speech. Journal of Phonetics, 63, pp. 1–18.
Bennett, S. (1981). Vowel formant frequency characteristics of preadolescent males and females. The Journal of the Acoustical Society of America, 69(1), pp. 321–328.
Bishop, J., & Keating, P. (2012). Perception of pitch location within a speaker’s range: Fundamental frequency, voice quality and speaker sex. The Journal of the Acoustical Society of America, 132(2), pp. 1100–1112.
Bradlow, A., Torretta, G., & Pisoni, D. (1996). Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics. Speech Communication, 20(3–4), pp. 255–272.
Brizendine, L. (2006). The female brain. New York, NY: Broadway Books.
Budka, M., & Osiejuk, T. (2013). Formant frequencies are acoustic cues to caller discrimination and are a weak indicator of the body size of corncrake males. Ethology, 119(11), pp. 1–10.
Byrd, D. (1994). Relations of sex and dialect to reduction. Speech Communication, 15(1–2), pp. 39–54.
Cartei, V., Cowles, H., & Reby, D. (2012). Spontaneous voice gender imitation abilities in adult speakers. PLoS One, 7(2), p. e31353.
Catford, J. C. (1977). Fundamental problems in phonetics. Bloomington, IN: Indiana University Press.
Chaplin, T. M. (2015). Gender and emotion expression: A developmental contextual perspective. Emotion Review, 7(1), pp. 14–21.
Chen, M. Y. (1997). Acoustic correlates of English and French nasalized vowels. The Journal of the Acoustical Society of America, 102(4), pp. 2360–2370.
Cherng, C. H., Wong, C. S., Hsu, C. H., & Ho, S. T. (2002). Airway length in adults: Estimation of the optimal endotracheal tube length for orotracheal intubation. Journal of Clinical Anesthesia, 14(4), pp. 271–274.
Chiba, T., & Kajiyama, M. (1941). The vowel: Its nature and structure. Tokyo: Kaiseikan.
Coates, J. (2016). Women, men, and language: A sociolinguistic account of gender differences in language (3rd ed.). New York, NY: Routledge.
Cumming, R. (2011). The effect of dynamic fundamental frequency on the perception of duration. Journal of Phonetics, 39(3), pp. 375–387.
Dehé, N. (2014). Parentheticals in spoken English: The syntax-prosody relation. New York, NY: Cambridge University Press.
Diehl, R., Lindblom, B., Hoemeke, K., & Fahey, P. (1996). On explaining certain male-female differences in the phonetic realization of vowel categories. Journal of Phonetics, 24(2), pp. 187–208.
Docherty, G. J., & Foulkes, P. (2005). Glottal variants of /t/ in the Tyneside variety of English. In W. J. Hardcastle & J. M. Beck (Eds.), A figure of speech: A festschrift for John Laver (pp. 173–197). New York, NY: Routledge.
D’Onofrio, A. (2015). Persona-based information shapes linguistic perception: Valley Girls and California vowels. Journal of Sociolinguistics, 19(2), pp. 241–256.
Eagly, A. H., & Wood, W. (1999). The origins of sex differences in human behavior: Evolved dispositions versus social roles. American Psychologist, 54(6), pp. 408–423.
Eckert, P. (2014). The problem with binaries: Coding for gender and sexuality. Language and Linguistics Compass, 8(11), pp. 529–535.
Eckert, P., & Wenger, E. (2005). Communities of practice in sociolinguistics. Journal of Sociolinguistics, 9(4), pp. 582–589.
Fant, G. (1966). A note on vocal tract size factors and nonuniform F-pattern scalings. Speech Transmission Laboratory Quarterly Progress and Status Report, 7(44), pp. 22–30.
Fant, G. (1970). Acoustic theory of speech production (2nd ed.). Paris: Mouton de Gruyter.
Feinberg, D. R., DeBruine, L. M., Jones, B. C., & Perrett, D. I. (2008). The role of femininity and averageness of voice pitch in aesthetic judgments of women’s voices. Perception, 37(4), pp. 615–623.
Feinberg, D. R., Jones, B. C., Little, A. C., Burt, D. M., & Perrett, D. I. (2005). Manipulations of fundamental and formant frequencies influence the attractiveness of human male voices. Animal Behaviour, 69(3), pp. 561–568.
Fitch, W. T., & Giedd, J. (1999). Morphology and development of the human vocal tract: A study using magnetic resonance imaging. The Journal of the Acoustical Society of America, 106(3), pp. 1511–1522.
Fitch, W. T., & Hauser, M. (2002). Unpacking “honesty”: Vertebrate vocal production and the evolution of acoustic signals. In A. Simmons, R. Fay, & A. Popper (Eds.), Acoustic communication (pp. 65–137). New York, NY: Springer.
Foulkes, P., Docherty, G., & Watt, D. (2005). Phonological variation in child-directed speech. Language, 81(1), pp. 177–206.
Fuchs, S., & Toda, M. (2010). Do differences in male versus female /s/ reflect biological or sociophonetic factors? In S. Fuchs, M. Toda, & M. Zygis (Eds.), Turbulent sounds: An interdisciplinary guide (pp. 281–302). Berlin: Mouton de Gruyter.
González, J. (2004). Formant frequencies and body size of speaker: A weak relationship in adult humans. Journal of Phonetics, 32(2), pp. 277–287.
Hagiwara, R. (1997). Dialect variation and formant frequency: The American English vowels revisited. The Journal of the Acoustical Society of America, 102(1), pp. 655–658.
Hall-Lew, L. (2011). The completion of a sound change in California English. Proceedings of the International Congress of Phonetic Sciences, 17, pp. 807–810.
Harrington, J., Kleber, F., & Reubold, U. (2008). Compensation for coarticulation, /u/-fronting, and sound change in standard southern British: An acoustic and perceptual study. The Journal of the Acoustical Society of America, 123(5), pp. 2825–2835.
Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. The Journal of the Acoustical Society of America, 97(5), pp. 3099–3111.
Irino, T., & Patterson, R. D. (2002). Segregating information about the size and shape of the vocal tract using a time-domain auditory model: The stabilised wavelet-mellin transform. Speech Communication, 36(3–4), pp. 181–203.
Iseli, M., Shue, Y. L., & Alwan, A. (2007). Age, sex, and vowel dependencies of acoustic measures related to the voice source. The Journal of the Acoustical Society of America, 121(4), pp. 2283–2295.
Johnson, K. (2006). Resonance in an exemplar-based lexicon: The emergence of social identity and phonology. Journal of Phonetics, 34(4), pp. 485–499.
Jongman, A., Wayland, R., & Wong, S. (2000). Acoustic characteristics of English fricatives. The Journal of the Acoustical Society of America, 108(3), pp. 1252–1263.
Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. The Journal of the Acoustical Society of America, 87(2), pp. 820–857.
Kuhl, P. K., Andruski, J. E., Chistovich, I. A., Chistovich, L. A., Kozhevnikova, E. V., Ryskina, V. L., … Lacerda, F. (1997). Cross-language analysis of phonetic units in language addressed to infants. Science, 277(5326), pp. 684–686.
Labov, W., Rosenfelder, I., & Fruehwald, J. (2013). One hundred years of sound change in Philadelphia: Linear incrementation, reversal, and reanalysis. Language, 89(1), pp. 30–65.
Lehiste, I. (1976). Influence of fundamental frequency pattern on the perception of duration. Journal of Phonetics, 4(2), pp. 113–117.
Li, F., Rendall, D., Vasey, P. L., Kinsman, M., Ward-Sutherland, A., & Diano, G. (2016). The development of sex/gender-specific /s/ and its relationship to gender identity in children and adolescents. Journal of Phonetics, 57, pp. 59–70.
Matar, N., Portes, C., Lancia, L., Legou, T., & Baider, F. (2016). Voice quality and gender stereotypes: A study on Lebanese women with Reinke’s edema. Journal of Speech, Language, and Hearing Research, 59(6), pp. S1608–S1617.
Mendoza-Denton, N. (2008). Homegirls: Symbolic practices in the making of Latina Youth styles. Malden, MA: Wiley-Blackwell.
Munson, B. (2007a). The acoustic correlates of perceived masculinity, perceived femininity, and perceived sexual orientation. Language and Speech, 50(1), pp. 125–142.
Munson, B. (2007b). Lexical characteristics mediate the influence of talker sex and sex typicality on vowel-space size. In J. Trouvain & W. Barry (Eds.), Proceedings of the International Congress of Phonetic Sciences (pp. 885–888). Saarbrücken, Germany: University of Saarland.
Munson, B., Crocker, L., Pierrehumbert, J., Owen-Anderson, A., & Zucker, K. (2015). Gender typicality in children’s speech: A comparison of the speech of boys with and without gender identity disorder. The Journal of the Acoustical Society of America, 137(4), pp. 1995–2003.
Munson, B., McDonald, E. C., DeBoe, N. L., & White, A. R. (2006). Acoustic and perceptual bases of judgments of women and men’s sexual orientation from read speech. Journal of Phonetics, 34(2), pp. 202–240.
Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. The Journal of the Acoustical Society of America, 85(5), pp. 2088–2113.
Oh, E. (2011). Effects of speaker gender on voice onset time in Korean stops. Journal of Phonetics, 39(1), pp. 59–67.
Patterson, R. D., Smith, D. R. R., van Dinther, R., & Walters, T. C. (2008). Size information in the production and perception of communication sounds. In W. A. Yost, A. N. Popper, & R. R. Fay (Eds.), Auditory perception of sound sources (pp. 43–75). New York, NY: Springer.
Perry, T. L., Ohde, R. N., & Ashmead, D. H. (2001). The acoustic bases for gender identification from children’s voices. The Journal of the Acoustical Society of America, 109(6), pp. 2988–2998.
Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), pp. 175–184.
Pisanski, K., Mora, E., Pisanski, A., Reby, D., Sorokowski, P., Frackowiak, T., & Feinberg, D. (2016). Volitional exaggeration of body size through fundamental and formant frequency modulation in humans. Scientific Reports, 6, p. 34389.
Plummer, A., & Beckman, M. (2016). Sharing speech synthesis software for research and education within low-tech and low-resource communities. Proceedings of INTERSPEECH (pp. 1618–1622). San Francisco, CA: International Speech Communication Association.
Podesva, R., & Van Hofwegen, J. (2014). How conservatism and normative gender constrain variation in inland California: The case of /s/. University of Pennsylvania Working Papers in Linguistics, 20(2), pp. 129–137.
Reby, D., McComb, K., Cargnelutti, B., Darwin, C., Fitch, W. T., et al. (2005). Red deer stags use formants as assessment cues during intrasexual agonistic interactions. Proceedings of the Royal Society B, 272(1566), pp. 941–947.
Rendall, D., & Owren, J. (2010). Vocalizations as tools for influencing the affect and behavior of others. In Brudzynski, S. (Ed.), Handbook of mammalian vocalization: An integrative neuroscience approach (pp. 177–185). Oxford: Academic Press.
Rendall, D., Owren, M., Weerts, E., & Hienz, R. (2004). Sex differences in the acoustic structure of vowel-like vocalizations in baboons and their perceptual discrimination by baboon listeners. The Journal of the Acoustical Society of America, 115(1), pp. 411–421.
Riding, D., Lonsdale, D., & Brown, B. (2006). The effects of average fundamental frequency and variance of fundamental frequency on male vocal attractiveness to women. Journal of Nonverbal Behavior, 30(2), pp. 55–61.
Riordan, C. (1977). Control of vocal-tract length in speech. The Journal of the Acoustical Society of America, 62(4), pp. 998–1002.
Sachs, J., Lieberman, P., & Erikson, D. (1972). Anatomical and cultural determinants of male and female speech. In R. W. Shuy & R. W. Fasold (Eds.), Language attitudes: Current trends and prospects (pp. 74–84). Washington, DC: Georgetown University Press.
Sandage, M., Plexico, L., & Schiwitz, A. (2015). Clinical utility of CAPE-V sentences for determination of speaking fundamental frequency. Journal of Voice, 29(4), pp. 441–445.
Saxton, T., Caryl, P., & Roberts, S. C. (2006). Vocal and facial attractiveness judgments of children, adolescents and adults: The ontogeny of mate choice. Ethology, 112(12), pp. 1179–1185.
Shadle, C. H. (1991). The effect of geometry on source mechanisms of fricative consonants. Journal of Phonetics, 19(4), pp. 409–424.
Simpson, A. P. (2002). Gender-specific articulatory – acoustic relations in vowel sequences. Journal of Phonetics, 30(3), pp. 417–435.
Simpson, A. P. (2009). Phonetic differences between male and female speech. Language and Linguistics Compass, 3(2), pp. 621–640.
Simpson, A. P. (2012). The first and second harmonics should not be used to measure breathiness in male and female voices. Journal of Phonetics, 40(3), pp. 477–490.
Skuk, V., Dammann, L., & Schweinberger, S. (2015). Role of timbre and fundamental frequency in voice gender adaptation. The Journal of the Acoustical Society of America, 138(2), pp. 1180–1193.
Skuk, V., & Schweinberger, S. (2014). Influences of fundamental frequency, formant frequencies, aperiodicity, and spectrum level on the perception of voice gender. Journal of Speech, Language, and Hearing Research, 57(1), pp. 285–296.
Smith, D., & Patterson, R. (2005). The interaction of glottal-pulse rate and vocal-tract length in judgements of speaker size, sex and age. The Journal of the Acoustical Society of America, 118(5), pp. 3177–3186.
Smith, D., Patterson, R., Turner, R., Kawahara, H., & Irino, T. (2005). The processing and perception of size information in speech sounds. The Journal of the Acoustical Society of America, 117(1), pp. 305–318.
Smith, E. A., Hall, K. C., & Munson, B. (2010). Bringing semantics to sociophonetics: Social variables and secondary entailments. Laboratory Phonology, 1(1), pp. 121–155.
Stevens, K. N. (2000). Acoustic phonetics. Boston, MA: MIT Press.
Stuart-Smith, J. (2007). Empirical evidence for gendered speech production: /s/ in Glaswegian. In J. Cole & J. Hualde (Eds.), Laboratory phonology 9 (pp. 65–86). Berlin: Mouton de Gruyter.
Timmers, M., Fischer, A., & Manstead, A. (2003). Ability versus vulnerability: Beliefs about men and women’s emotional behavior. Cognition and Emotion, 17(1), pp. 41–63.
Titze, I. (1989). Physiologic and acoustic differences between male and female voices. The Journal of the Acoustical Society of America, 85(4), pp. 1699–1707.
van Bezooijen, R. (1995). Sociocultural aspects of pitch differences between Japanese and Dutch women. Language and Speech, 38(3), pp. 253–265.
Vorperian, H. K., Wang, S., Chung, M. K., Schimek, E. M., Durtschi, R. B., Kent, R. D., … Gentry, L. R. (2009). Anatomic development of the oral and pharyngeal portions of the vocal tract: An imaging study. The Journal of the Acoustical Society of America, 125(3), pp. 1666–1678.
Vorperian, H. K., Wang, S., Schimek, E. M., Durtschi, R. B., Kent, R. D., Gentry, L. R., & Chung, M. K. (2011). Developmental sexual dimorphism of the oral and pharyngeal portions of the vocal tract: An imaging study. Journal of Speech, Language, and Hearing Research, 54(4), pp. 995–1010.
Weirich, M., & Simpson, A. P. (2014). Differences in acoustic vowel space and the perception of speech tempo. Journal of Phonetics, 43, pp. 1–10.
Whiteside, S. P. (2001). Sex-specific fundamental and formant frequency patterns in a cross-sectional study. The Journal of the Acoustical Society of America, 110(1), pp. 464–478.
Whiteside, S. P., & Irving, C. J. (1998). Speakers’ sex differences in voice onset time: A study of isolated word production. Perceptual and Motor Skills, 86(2), pp. 651–654.
Yamazawa, H., & Hollien, H. (1992). Speaking fundamental frequency patterns of Japanese women. Phonetica, 49(2), pp. 128–140.
Zimman, L. (2013). Hegemonic masculinity and the variability of gay-sounding speech: The perceived sexuality of transgender men. Journal of Language & Sexuality, 2(1), pp. 1–39.
Zimman, L. (2017). Variability in /s/ among transgender speakers: Evidence for a socially-grounded account of gender and sibilants. Linguistics, 55(5), pp. 993–1019.
Zuckerman, M., & Driver, R. (1989). What sounds beautiful is good: The vocal attractiveness stereotype. Journal of Nonverbal Behavior, 13(2), pp. 67–82.
Zuckerman, M., & Miyake, K. (1993). The attractive voice: What makes it so? Journal of Nonverbal Behavior, 17(2), pp. 119–135.