… if the play makes the public aware that there are such people as phoneticians, and that they are among the most important people in England at present, it will serve its turn.
George Bernard Shaw, Preface to Pygmalion
In 1964 my eighth-grade school marching band won a local competition in the Imperial Valley of Southern California. I played baritone horn and was an enthusiastic member. And I knew that, by winning this contest, we would now be able to go to the higher regional competition in Los Angeles, about 130 miles (210 kilometres) to the northwest of our little town of Holtville, near the Mexican border.
Our director wanted to expose the band to some higher culture while in the LA area, so he petitioned the school board to allow us to attend a showing of Mozart’s opera Don Giovanni. The school board said no. Too risqué for junior high schoolers. Instead, we were allowed to attend the band director’s second choice, a showing at the Egyptian Theater in Hollywood of My Fair Lady, starring Rex Harrison and Audrey Hepburn. The band instructor prepared us by talking about the play by George Bernard Shaw, the source for the movie.
This film eventually played a role in my decision to become a linguist, as it revolved around the transformative power of human speech, told from the perspective of Henry Higgins and his reluctant pupil Eliza Doolittle. What is this thing called speech that all humans possess, that George Bernard Shaw believed to be the key to success in life? In The Kingdom of Speech Tom Wolfe claims that speech is the most important invention in the history of the world. It not only enables us to speak to one another but also immediately classifies us by economic class, age group and educational attainment. If erectus were around today, would people consider them brutish because of the way they spoke, even if one could dress them so as to pass them off as peculiar-looking modern humans?
Although communication is ancient, human speech is evolutionarily recent. Cognitive scientist and phonetician Philip Lieberman claims that the speech apparatus of modern Homo sapiens is only about 50,000 years old, so recent that even earlier Homo sapiens could not speak as we do today.1 This is not to be confused, though, with the 50,000-year date proposed by other authors for the appearance of language. Speech followed language. Therefore, Lieberman’s 50,000-year date would be, if correct, evidence against the idea that language proper appeared suddenly 50,000 years ago. If erectus did indeed invent symbols and begin humanity’s upward trek through the progression of signs to language, enhanced speech would have come later. It is to be expected that the first languages would have been inferior to our present languages. No invention begins at the top. All human inventions get better over time. And yet this does not mean that erectus spoke a subhuman language. What it does mean, however, is that they lacked fully modern speech, for physiological reasons, and that their information flow was slower – they didn’t have as much to talk about as we do today, nor do they seem to have had sufficient brain power to process and produce information as quickly as modern sapiens. Erectus’s physiological shortcoming was overcome by gradual biological evolution. The development of information processing and enhanced grammatical ability resulted from cultural evolution. Both biological and cultural evolution over 60,000 subsequent generations of humans improved our linguistic abilities dramatically.
In a 2016 paper in a publication of the American Association for the Advancement of Science, Tecumseh Fitch and his colleagues argue in effect that Lieberman is mistaken in his view of the evolution of the human vocal tract. They claim instead that the vocal apparatus is much older than the fifty millennia proposed by Lieberman – so old, in fact, that it is found in macaque monkeys.2 While the study by Fitch and colleagues is intriguing, there are three reasons why it is not particularly useful in understanding language evolution. First, most of the extreme tongue positions that they claim to be similar between macaques and humans come from macaques yawning. The assumption of Fitch and his co-authors seems to be that, if they can get macaques to place their tongues in the right positions to produce certain human vowels while yawning, the macaques could repeat this if they were speaking. This is a doubtful assumption, though, because yawning is not nearly so effortless a gesture for a macaque as the production of a back vowel (which involves a similar tongue shape) is for humans. The tongue is retracted in effortful ways, and it is doubtful that speech sounds would evolve out of a yawning vocal tract shape. Another problem with this study is that the authors compare the phonetics of the macaque against archival human phonetics. They should instead have retested human phonetic properties using the very same methods they used on macaques, in order to compare them more equally.3 Finally, and most importantly, language does not require speech as we know it. Languages can be whistled, hummed, or spoken with a single vowel, with or without a consonant. It is the confluence of culture and the Homo brain that gives us language. Our modern speech is a nice, functional add-on.
On the surface of it, human speech is simple. Vowels and consonants are created along the same principles as the notes formed by wind blowing through a clarinet. The root of both is basic physics. Air flows up from the lungs and out of the mouth and is modified as it passes through either the tube of the clarinet or the tube of the human vocal apparatus. In the case of the clarinet the airflow is transformed by keys and a reed that alter it so that it can make the sublime sounds Benny Goodman produced, or the squeaking and squawking of a beginner. In the case of speech, the flow of air is transformed by the larynx, the tongue, the teeth and the different shapes and movements of all the stuff in our throats, noses and mouths that lies above the larynx.
But speech is more complex than a mere wind-tube effect. That is because the tube of human speech is controlled by a complex respiratory physiology directed by an even more complex brain. The creation of speech requires precise control of more than one hundred muscles: the muscles of the larynx, the respiratory muscles – the diaphragm and the muscles between our ribs, our ‘intercostal’ muscles – and the muscles of our mouth and face, our orofacial muscles. The muscle movements required of all these parts during speech are mind-bogglingly complex. The ability to make these movements required that evolution change the structures of the brain and the physiology of the human respiratory apparatus. On the other hand, none of these subsequent adaptations was required for language. They all simply made speech the highly efficient form of transmitting language that we know today. Still, it is unlikely that any erectus woman could have played Eliza. Her appearance would never have fooled anyone.
There are three basic parts to the human speech capability that evolution needed to provide us with to enable us to talk and sing as humans do today: the lower respiratory tract, which includes the lungs, heart, diaphragm and intercostal muscles; the upper respiratory tract, which includes the larynx, the pharynx, the nasopharynx, the oropharynx, the tongue, the roof of the mouth (the palate), the lips and the teeth; and, most importantly by far, the brain.
The average human produces 135–185 words per minute. Two things about this are deeply impressive. First, it is amazing that humans can talk that fast and consider it normal. Second, it is nearly incredible that people can understand anyone speaking that fast. But, of course, humans do both of these, producing and perceiving speech, without the slightest effort when they are healthy. These are the two sides to speech – production (speaking or signing) and perception (hearing with understanding). To grasp how speech production and perception evolved one has to know not only how the upper and lower respiratory tracts evolved, but also how the brain is able to control the physical components of speech so well and so quickly.
To tell the story of speech, we need to look at the vocal apparatus and the evidence for speech capabilities across the various species of Homo. It is important to have a clear idea of how sounds are made, how sounds are perceived and how the brain is able to manage all of this. But prior to that, it is essential to understand the state of human speech today. How does speech work in modern Homo sapiens? Knowing the answer to such questions makes it possible to judge how effective the speech of other Homo species would have been relative to sapiens’, and whether they were, in fact, capable of speech at all.
Speech comes out of mouths, travels through the air and enters the ears of hearers, to be interpreted by their brains. Each of the three steps in the creation, transmission and understanding of speech has an entire subfield of phonetics, the science of speech sounds, dedicated to it. The creation of sounds is the domain of ‘articulatory phonetics’. The transmission of sounds through the air is ‘acoustic phonetics’. And the hearing and interpretation of sounds is ‘auditory phonetics’. There are also subfields concerned with other questions, such as studies of the physics and mechanics of speech perception and speech production, which are often grouped together under the name ‘experimental phonetics’. It isn’t necessary to understand all of these to understand the evolution of speech, but a wee bit of understanding of them will be helpful.
The larynx is vital to the understanding of the language of Homo species, as it enables humans not only to pronounce human speech sounds, but also to have intonation and use pitch to indicate what aspect of an utterance is new, what is old, what is particularly important, whether people are asking a question, or making a statement. The larynx is where the airflow from the lungs is manipulated in order to produce phonation, the confluence of energy, muscles and airflow required to produce the sounds of human speech.
The larynx is a small transducer that sits atop the trachea, with a top called the epiglottis that can flip closed to keep food or liquid from entering through the larynx and into the lungs, potentially causing great harm. Figure 21 shows just a glimpse of its complexity.*
One thing that every researcher into the evolution of speech agrees upon is that our speech production evolved in tandem with our speech perception. As Crelin puts it in his pioneering work, ‘there tends to be a precise match between the broadcast bandwidth and the tuning of perceptual acuity’. Or, again, ‘the possession of articulate speech therefore implies that both production and perception are attuned to each other, so that parameters carrying the bulk of the speech information are optimized in both production and perception’. In other words, the ears and the mouth work well together because they have evolved together for several million years.
Figure 21: The larynx
Speech begins with air, which can create human sounds when it flows into the mouth or when it is expelled from the mouth. The former types of sounds are called ‘ingressive’ sounds and the latter ‘egressive’ sounds. English and other European languages use egressive sounds exclusively in normal speech. Ingressive sounds in these better-known languages are rare, usually found only in interjections, such as the sound of ‘huh’ when the air is sucked in. The place where the air for a sound begins is called the ‘initiator’. In all of the speech sounds of English, the lungs are the initiator. Thus one says that all sounds of English are ‘pulmonic’ sounds. But there are two other major air initiators that many languages of the world use: the glottis (the opening in the larynx, for glottalic sounds) and the tongue (for lingual sounds). Sounds made with these initiators do not occur in English.
To quote from my Language: The Cultural Tool:
[In] Tzeltal, Ch’ol and others, so-called ‘glottalised’ sounds – implosives and ejectives – are common.
When I began my linguistic career, in the mid-1970s, I went to live for several months among the Tzeltales of Chiapas, Mexico. One of my favourite phrases was c’uxc’ajc’al ‘it’s hot outside’, which contains three glottalised consonants (indicated in Tzeltal orthography by the apostrophe). To make these sounds, the glottis, the space between the two vocal cords in the larynx, must be closed, cutting off air from the lungs. If the entire larynx is then forced up at the same time that the lungs or tongue cut off the flow of air out of the mouth, then pressure is created. When the tongue or lips then release the air out of the mouth, an explosive-like sound is produced. This type of sound, seen in the Tzeltal phrase above, is called an ‘ejective’. We could also produce the opposite of an ejective, known as an ‘implosive’ sound. To make an implosive, the larynx moves down instead of up, but everything else remains the same as for an ejective. This downward motion of the larynx will produce an implosive – caused by air suddenly rushing into the mouth. We do not have anything like these sounds in English.
I remember practising ejectives and implosives constantly for several days, since the Tzeltales I worked with use them both. They are interesting sounds – not only are they fun to make, but they extend the range of human speech sounds beyond the strictly lung-produced sounds of European languages.
The glottis can be used to modify sounds in other ways. Again, from Language:
A different type of glottalised sound worth mentioning is produced by nearly, but not quite, closing the glottis and allowing lung air to barely flow out. This effect is what linguists call ‘creaky voice’. People often produce creaky voice involuntarily in the mornings after first arising, especially if their vocal cords are strained through yelling, drinking, or smoking. But in some languages, creaky voice sounds function as regular vowels.
The lingual sounds mentioned above are known as clicks. These are created by using the tongue to block the flow of air into or out of the mouth, while pressure builds up behind a second closure made by the back of the tongue. As with sounds initiated by the lungs or the glottis, lingual sounds can also be egressive or ingressive, produced by closing the airflow off with the tip of the tongue while building pressure inward or outward with the back of the tongue. Clicks are found in a very small number of languages, all of them in Africa and almost all of them languages of the Bantu family. I remember first hearing clicks in Miriam Makeba’s ‘click song’. Makeba’s native language was Xhosa, a Bantu language.
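For readers who like to see the combinations at a glance, the airstream mechanisms just described can be summarised in a few lines of Python. This is simply my own restatement of the groupings above, not an exhaustive phonetic inventory:

```python
# A minimal sketch (not from the book) of the airstream mechanisms described in the text,
# pairing each initiator and airflow direction with the sound types it produces.
AIRSTREAMS = {
    ("lungs (pulmonic)", "egressive"): "ordinary English consonants and vowels, e.g. [p], [s], [a]",
    ("glottis (glottalic)", "egressive"): "ejectives, e.g. the glottalised consonants of Tzeltal",
    ("glottis (glottalic)", "ingressive"): "implosives",
    ("tongue (lingual)", "ingressive"): "clicks, e.g. those of Xhosa",
}

for (initiator, direction), examples in AIRSTREAMS.items():
    print(f"{initiator:20s} {direction:10s} -> {examples}")
```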
Figure 22: The International Phonetic Alphabet
A list of all the consonants that are produced with lung air is given in the portion of the International Phonetic Alphabet shown in Figure 22.
Consonants are different from vowels in several ways. Unlike vowels, consonants impede (rather than merely shape) the flow of air as it comes out of the mouth. The International Phonetic Alphabet (IPA) chart is recognised by scientists as the accepted way of representing human speech sounds. The rows of the consonant chart are ‘modes’ (or ‘manners’) of pronunciation. These modes include allowing the air to flow out of the nose, which produces nasal sounds like [m], [n] and [ŋ]. Other modes are ‘occlusives’ or ‘stops’ (air is completely blocked as it flows through the mouth), sounds such as [d], [t], [k] or [g]. And there are ‘fricatives’, where the airflow is not completely stopped but is impeded enough to cause friction – turbulent or sibilant sounds such as [s], [f] and [h].
The columns in the IPA chart are places of articulation. The chart starts on the left with sounds produced near the front of the mouth and moves rightwards to sounds produced further back in the throat. The sounds [m] and [b] are ‘bilabials’. They are produced by blocking the flow of air at the lips, both upper and lower lips coming together to completely block the flow of air. The sound [f] is made a bit further back. It is produced by the lower lip touching the upper teeth and only partially impeding rather than completely blocking the flow of air. Then we get to sounds like [n], [t] and [d], where the tongue blocks the flow of air either just behind the teeth (as in Spanish) or at the small ridge (the alveolar ridge) on the hard palate (the roof of the mouth), not far behind the teeth (as in English).
We eventually get to the back of the mouth, where sounds like [k] and [g] are produced by the back of the tongue rising to close off air at the soft palate. In other languages, the sounds go back further. Arabic languages are known for their pharyngeal sounds, made by constricting the epiglottis or by retracting the tongue into the pharynx. The epiglottis is a piece of stretchy cartilage that comes down to cover the hole at the top of the larynx just in case food or liquid tries to get in. One should not talk with a full mouth; if the epiglottis is not at the ready this could be fatal. Humans, except human infants, are the only creatures that cannot eat and vocalise at the same time.
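To make the geometry of the chart concrete, here is a toy fragment of it in Python – just the handful of example sounds mentioned above, with places of articulation across the top and manners down the side. It is an illustration of the layout, not the full IPA:

```python
# A toy fragment of the IPA consonant chart described in the text (place across, manner down),
# limited to the example sounds mentioned above; '-' marks cells not illustrated here.
PLACES = ["bilabial", "labiodental", "alveolar", "velar"]
CHART = {
    "stop":      {"bilabial": "p b", "labiodental": None,  "alveolar": "t d", "velar": "k g"},
    "nasal":     {"bilabial": "m",   "labiodental": None,  "alveolar": "n",   "velar": "ŋ"},
    "fricative": {"bilabial": None,  "labiodental": "f v", "alveolar": "s z", "velar": None},
}

print(f"{'':10s}" + "".join(f"{p:>14s}" for p in PLACES))
for manner, row in CHART.items():
    cells = "".join(f"{(row[p] or '-'):>14s}" for p in PLACES)
    print(f"{manner:10s}{cells}")
```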
What is crucial in the IPA charts is that the segments they list almost completely exhaust all the sounds that are used for human languages anywhere in the world. The phonetic elements therein are all easy (at least with a bit of practice) for humans to make. But the basal ganglia do favour habits, and so once we have mastered our native set of phonemes, it can be hard to get the ganglia out of their rut to learn the articulatory habits necessary for the speech of other languages.
But consonants alone do not speech make. Humans also need vowels. To take one example, the vowels of my dialect of English, Southern California English, are shown in Figure 23.
Just like the consonant chart, the vowel chart in Figure 23 is ‘iconic’. Its columns run from the front of the mouth on the left to the back of the mouth on the right. The rows of the vowel chart indicate the relative height of the tongue when producing the vowel. The trapezoidal shape of the chart is intended to indicate, again iconically, that as the tongue lowers, the front-to-back space in the mouth available for distinguishing vowels shrinks.
California vowels, like all vowels, have target areas to which the tongue raises or lowers within the mouth. At the same time, as the tongue moves to raise or lower towards the target area, the tongue muscles are either tense or relaxed (‘lax’), and the lips can be rounded or flat. The tense vowel [i] is the vowel of the word ‘beet’. The lax vowel [ɪ], on the other hand, is the vowel heard in the word ‘bit’. In other words, ‘bit’ and ‘beet’ are identical except that the tongue muscles are tense in ‘beet’ and relaxed in ‘bit’. Another way of talking about ‘lax’ vs ‘tense’ vowels, one preferred by many linguists, is to refer to them as ‘Advanced Tongue Root’ (the tongue is tensed by being pushed forward in the mouth and flexing) or ‘Not Advanced Tongue Root’ (the tongue is relaxed, its root further back in the mouth), usually written in the linguistics literature as [+ATR] or [−ATR].
The funny-looking vowel character [æ] is the vowel in my dialect for ‘cat’. It is low, front and unrounded. Moving up the chart and towards the back of the mouth, we reach the sound [u], the vowel of the word ‘boot’. This is a back, rounded vowel. ‘Back’ in this sense means that the back portion of the tongue is raised, rather than the front portion (the blade or tip) as in the unrounded vowel [i]. The lips form a round ‘o’ shape when producing [u]. Any vowel can be rounded. Thus, to make the French vowel [y], make the English vowel [i] while rounding the lips.
Figure 23: Southern California English vowels
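The same sort of toy table can be built for the vowels just described. The feature labels below simply restate the text – height, backness, rounding and tenseness ([+ATR]/[−ATR]) – for the handful of vowels mentioned; it is a sketch, not a complete chart of Southern California English:

```python
# A hedged sketch of the vowel description above: each vowel mentioned in the text is tagged
# for height, backness, rounding and tenseness.
VOWEL_FEATURES = {
    "i": ("high", "front", "unrounded", "+ATR (tense)"),   # 'beet'
    "ɪ": ("high", "front", "unrounded", "-ATR (lax)"),     # 'bit'
    "æ": ("low",  "front", "unrounded", "-ATR (lax)"),     # 'cat'
    "u": ("high", "back",  "rounded",   "+ATR (tense)"),   # 'boot'
    "y": ("high", "front", "rounded",   "+ATR (tense)"),   # French [y]: [i] with rounded lips
}

for symbol, (height, backness, rounding, atr) in VOWEL_FEATURES.items():
    print(f"[{symbol}]  {height:4s} {backness:5s} {rounding:9s} {atr}")
```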
The point is that the various speech sounds available to human languages are conceptually easy to understand. What is hard about them is not how to classify them, nor even how to analyse them, but how to produce them. Humans can learn all the sounds they want when they are young and their basal ganglia are not yet in a rut. But as we get older, the ganglia are challenged to make new connections.
When I was enrolling for my first course in articulatory phonetics (in order to learn how to make all the sounds of speech used in all the languages of the world) at the University of Oklahoma in 1976, the teaching assistants for the course gave each student an individual interview in order to place them in sections according to phonetic ‘talent’ (or perceptual ability). I walked into the classroom used for this purpose and the first thing I was asked to do was to say the word ‘hello’ – but by sucking air into my lungs rather than expelling air outwards. ‘Weird,’ I thought. But I did it. Then I was asked to imitate a few words from Mayan languages with glottal ejectives. These are ‘popping’ sounds in which the air comes out of the mouth but originates above the lungs, with pressure built up by bringing the vocal folds together and then letting the air behind them ‘eject’ out of the mouth. And I tried imitating African click sounds. This course was going to be valuable to me, I knew, because I was preparing to go to the Amazon to conduct field research on a language that was still largely unknown to the outside world, Pirahã.
Again, every language in the world, from Armenian to Zapotec, uses the same inventory of articulatory movements and sounds. The reason for this is that the human auditory system co-evolved with the human articulatory system – that is, humans learned to hear best the sounds they are able to make. There are always outliers, of course, and there are still unexpected novelties to be discovered. In fact, I have personally discovered two sounds in the Amazon over the years (one in the Chapakuran language family, the other in Pirahã) not found in any other language of the world.
The field linguist needs to learn what sounds the human body can make and use for speech because she or he must be prepared to begin work from the very first minute they arrive at their field destination. They have to know what they are hearing in order to begin an analysis of the speech and language of the people they have gone to live with.
This brief introduction covers only part of one-third of the science of phonetics, namely articulatory phonetics. But what happens to speech sounds once they exit the mouth? How are people able to distinguish them? Hearers are not usually able to look into the mouth of the person speaking to them, so how can they tell whether she or he is making a [p] or a [t], an [i] or an [a]?
This is the domain of acoustic phonetics. An immediate question regarding sound perception is why, if air is coming out of the mouth when one is talking, only the consonants and vowels are heard, rather than the sound of air rushing out of the mouth. First, the larynx energises the air by the vibration of the vocal cords, or by oscillation of other parts of the larynx. This changes the frequency of the sound and brings it within the perceptible range for humans, because evolution has matched these frequencies to human ears. Second, the sounds of the air rushing out of the mouth have been tuned out by evolution, falling below the normal frequency ranges that the human auditory system can easily detect. That is a good thing. Otherwise people would sound like they were wheezing instead of talking.
The energising of the flow of air in speech by the larynx is known as phonation, which produces for each sound what is known as the ‘fundamental frequency’ of the sound. The fundamental frequency is the rate of vibration of the vocal folds during phonation, and this varies according to the size, shape and bulk (fat) of the larynx. Small people will generally have higher voices, that is, higher fundamental frequencies, than larger people. Adults have deeper voices, with lower fundamental frequencies, than children; men have lower voices than women; and taller people often have deeper voices than shorter people.
The fundamental frequency, usually written as F0, is one of the ways that people can recognise who is talking to them. We grow accustomed to others’ frequency range. The varying frequency of the vibration of the vocal cords is also how people sing and how they control the relative pitches over syllables in tone languages, such as Mandarin or Pirahã, among hundreds of others where the tone on the syllable is as important to the meaning of the word as the consonants and vowels. This ability to control frequency is also vital in producing and perceiving the relative pitches of entire phrases and sentences, referred to as intonation. F0 is also how some languages are whistled, using either the relative pitches on syllables or the inherent frequencies of individual speech sounds.
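For the computationally inclined, the idea of a fundamental frequency can be made concrete with a small sketch. The example below is my own illustration, not an analysis from the book: it synthesises one second of a buzzy 120 Hz source – a crude stand-in for vocal fold vibration – and recovers F0 from the autocorrelation of the signal. It assumes the numpy library is available:

```python
import numpy as np

# A minimal sketch of F0 estimation on a synthetic 'voice' (illustration only).
sr = 8_000                        # samples per second
t = np.arange(sr) / sr            # one second of time points
f0_true = 120.0                   # Hz, the rate of 'vocal fold' vibration in this toy source
# sum of the first ten harmonics: a crude stand-in for a train of glottal pulses
signal = sum(np.sin(2 * np.pi * f0_true * k * t) / k for k in range(1, 11))

ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]   # autocorrelation, lag >= 0
min_lag = int(sr / 400)           # ignore lags that would imply an F0 above 400 Hz
best_lag = min_lag + int(np.argmax(ac[min_lag:]))
print(f"estimated F0 ≈ {sr / best_lag:.1f} Hz")   # prints roughly 120 Hz
```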
It may surprise no one to learn, however, that F0 is not all there is. In addition to the fundamental frequency, as each speech sound is produced, resonant frequencies, or formants, are produced that are associated uniquely with that particular sound. These formants enable us to distinguish the different consonants and vowels of our native language. One does not directly hear the syllable [dad], for example. What is heard are the formants, and their changes, associated with these sounds.
A formant can be visualised by striking a tuning fork that produces the note ‘E’ and placing it on the face of an acoustic guitar near the sound hole. If the guitar is tuned properly, the ‘E’ string of the same octave as the tuning fork will vibrate, or resonate, with the fork’s vibrations. Resonance of this kind is responsible for the formants of each speech sound: the vocal tract reinforces some of the harmonics of the fundamental frequency and damps others. These formants can be seen in a spectrogram, each showing up as a band of reinforced energy at a particular frequency region (Figure 24).
Figure 24: Vowel spectrogram
In this spectrogram of four vowels the fundamental frequency is visible at the bottom and dark bands are visible going up the columns. Each band, associated with a frequency on the left side of the spectrogram, is a harmonic resonance or formant of the relevant vowel. Going left to right across the bottom, the time elapsed in the production of the sounds is measured. The darkness of the bands indicates the relative loudness of the sound produced. It is the formants that are the ‘fingerprints’ of all speech sounds. Human ears have evolved to hear just these sounds, picking out the formants which reflect the physical composition of our vocal tract. The formants, from low to high frequency, are simply referred to as F1, F2, F3 and so on. They are caused by effects of resonators such as the shape of the tongue, rounding of the lips and other aspects of the sound’s articulation.
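A rough calculation shows where formant values of this kind come from. The snippet below is my own back-of-the-envelope illustration, not a calculation from the book: it treats the vocal tract as the simplest textbook model – a uniform tube, closed at the glottis and open at the lips, roughly the posture of a neutral ‘uh’ vowel – whose resonances fall at odd quarter-wavelength frequencies:

```python
# A back-of-the-envelope sketch (illustration only): the classic uniform-tube model of a
# neutral, schwa-like vocal tract. A tube closed at the glottis and open at the lips resonates
# at F_n = (2n - 1) * c / (4 * L), which lands near the familiar textbook formant values of
# roughly 500, 1500 and 2500 Hz.
SPEED_OF_SOUND = 350.0    # m/s, warm moist air in the vocal tract (approximate)
TRACT_LENGTH = 0.17       # m, a typical adult male vocal tract (approximate)

for n in (1, 2, 3):
    formant = (2 * n - 1) * SPEED_OF_SOUND / (4 * TRACT_LENGTH)
    print(f"F{n} ≈ {formant:.0f} Hz")
```

Real vowels depart from these values precisely because the tongue and lips reshape the tube, which is what lets the formants distinguish one vowel from another.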
The formant frequencies of the vowels are seen in the spectrogram (given in hertz (Hz)). What is amazing is not only that we hear these frequency distinctions between speech sounds, but that we do so without knowing what we are doing, even though we produce and perceive these formants so unerringly. It is the kind of tacit knowledge that often leads linguists to suppose that these abilities are congenital rather than learned. And certainly some aspects of this are inborn. Human mouths and ears are a matched set, thanks to natural selection.
Too little is understood about how sounds are interpreted physiologically by our ears and brains to support a detailed discussion of auditory phonetics, the physiology of hearing. But the acoustics and articulation of sounds are quite enough to prime a discussion of how these abilities evolved.
If it is correct to say that language preceded speech, then it would be expected that Homo erectus, having invented symbols and come up with a G1 language, would still not have possessed top-of-the-line human speech capabilities. And they did not. Their larynges were more ape-like than human-like. In fact, although neanderthalensis had relatively modern larynges, erectus lagged far behind.
The main differences between the erectus vocal apparatus and the sapiens apparatus were in the hyoid bone and pre-Homo vestiges, such as air sacs in the centre of the larynx. Tecumseh Fitch was one of the first biologists to point out the relevance of air sacs to human vocalisation. Their effect would have been to render many sounds emitted less clear than they are in sapiens. The evidence that they had air sacs is based on luck in finding fossils of erectus hyoid bones. The hyoid bone sits above the larynx and anchors it via tissue and muscle connections. By contracting and relaxing the muscles connecting the larynx to the hyoid bone, humans are able to raise and lower the larynx, altering the F0 and other aspects of speech. In the hyoid bones of erectus, on the other hand, though not in any fossil Homo more recent than erectus, there are no places of attachment to anchor the hyoid. And these are not the only differences. So different are the vocal apparatuses of erectus and sapiens that Crelin concludes: ‘I judge that the vocal tract was basically apelike.’ Or, as others say:
Authors describe a hyoid bone body, without horns, attributed to Homo erectus from Castel di Guido (Rome, Italy), dated to about 400,000 years BP. The hyoid bone body shows the bar-shaped morphology characteristic of Homo, in contrast to the bulla-shaped body morphology of African apes and Australopithecus. Its measurements differ from those of the only known complete specimens from other extinct human species and early hominid (Kebara Neanderthal and Australopithecus afarensis) and from the mean values observed in modern humans. The almost total absence of muscular impressions on the body’s ventral surface suggests a reduced capability for elevating this hyoid bone and modulating the length of the vocal tract in Homo erectus. The shield-shaped body, the probable small size of the greater horns and the radiographic image appear to be archaic characteristics; they reveal some similarities to non-humans and pre-human genera, suggesting that the morphological basis for human speech didn’t arise in Homo erectus.4
There is no way, therefore, that erectus could have produced the same kind or quality of speech as modern humans, in terms of the ability to clearly discriminate the same range of speech sounds in perception or production. None of this means that erectus would have been incapable of language, however. Erectus had sufficient memory to retain a large number of symbols, at least in the thousands – after all, dogs can remember hundreds – and would have been able, through the use of context and culture, to disambiguate symbols that were insufficiently distinct in their formants, due to the lesser articulatory capacity of erectus. However, what is expected is that the new dependency on language would have created a Baldwin effect, such that natural selection would have preferred Homo offspring with greater speech production and perception abilities, both in the vocal apparatus as well as in the various control centres of the brain. Eventually, humans went from erectus’s low-quality speech to their current high-fidelity speech.
What size inventory of consonants and vowels, intonation and gestures does a language need to ensure that it has the right ‘carrying capacity’ for all the meanings it wants to communicate? Language can be thought of in many ways. One way to view it, though, is as matching up meanings to forms and knowledge in such a way that hearers can understand speakers.
If it were known with certainty that Homo erectus and Homo neanderthalensis were incapable of making the full range of sounds of anatomically modern humans, would this mean that they could not have had languages as rich as sapiens? It is hard to say. It is almost certain that sapiens are better at speech than erectus and other hominins that preceded sapiens. There are innumerable benefits and advantages to being the proud possessor of a modern speech apparatus. It makes speech easier to understand. But sapiens’ souped-up vocal tract is not necessary to either speech or language. It is just very, very good to have. Like having a nice travel trailer and a powerful 4x4 pickup instead of a covered wagon pulled by two mules.
In fact, computers show that a language can work just fine with only two symbols, 0 and 1. All computers communicate by means of those two symbols: current on – 1 – and current off – 0. All the novels, treatises, PhD dissertations, love letters and so on in the history of the world can, with many deficiencies, such as the absence of gestures, intonation and information about salient aspects of sentences, be translated into sequences of 0 and 1. So, if erectus could have made just a few sounds, more or less consistently, they could have been in the language game, right there with sapiens. This is why linguists recognise that language is distinct from speech. Sapiens quite possibly speak more clearly, with sounds that are easier to hear. But this only means, again, that erectus drove a Model T language. Sapiens drive the Tesla. But both the Model T and the Tesla are cars. The Model T is not a ‘protocar’.
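The point about two symbols is easy to demonstrate. The few lines below (my illustration, with an arbitrary example sentence) push a sentence down into a stream of 0s and 1s and pull it back out unchanged – though, as noted, the intonation and gesture that accompany real speech are nowhere in the bits:

```python
# A tiny illustration (not from the book): any sentence can be carried by 0s and 1s alone,
# at the cost of a much longer message and the loss of intonation and gesture.
message = "Language precedes speech."
bits = "".join(f"{byte:08b}" for byte in message.encode("utf-8"))
print(bits)                                   # the sentence as a stream of 0s and 1s

decoded = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)).decode("utf-8")
assert decoded == message                     # nothing is lost in the round trip
```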
Though hard to reconstruct from fossil records, the human vocal tract, like the human brain, also evolved dramatically from earlier hominids to modern sapiens. But in order to tell this part of the story, it is necessary to back up a bit and talk about the sounds that modern humans use in their languages. This is the end point of evolution and the starting point of any discussion of modern human speech sounds.
The evolutionary questions that lie beneath the surface of all linguistics field research are ‘How did humans come to make the range of sounds that are found in the languages of the world today?’ and, next, ‘What are these sounds?’
The sounds that the human vocal apparatus uses are all formed from the same ingredients.
The technical label for any speech sound in any language tells how that sound is articulated. The consonant [p] is known as a ‘voiceless bilabial occlusive (also called a “stop”) with egressive lung air’. This long but very helpful description of the speech sound means that the [p] sound in the word ‘spa’, to take one example, is pronounced by relaxing the vocal cords so that they do not vibrate. The sound is therefore ‘voiceless’. (The sound [b] is pronounced exactly like [p] except that in [b] the vocal folds – also called cords – are tensed and vibrating, rendering [b] a ‘voiced’ sound.) The phrase ‘egressive lung air’ means that the air is flowing out of the mouth or nose or both and that it originated with the lungs. This needs to be stated because not all speech sounds use lung air. The term ‘occlusive’, or ‘stop’, means that the airflow is blocked entirely, albeit momentarily. The term ‘bilabial’ refers to the action of the upper and lower lips together. In conjunction with the term ‘occlusive’, ‘bilabial’ means that the airflow is blocked entirely by the lips. If one pronounces the sounds in the hypothetical word [apa] while lightly holding an index finger on the ‘Adam’s apple’ (which is actually the larynx), the vibration of the vocal cords can be felt to cease between the first [a] and the [p] and then start again on the second [a]. But if the same procedure is followed for the hypothetical word [aba], the vocal cords will continue to vibrate for each of [a], [b] and [a], that is, for the entire duration of the word.
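Since the label is built mechanically out of those ingredients – voicing, place, manner and airstream – it can be generated by a trivial function. The sketch below is my own and covers only the sounds just discussed:

```python
# A hedged sketch of the naming recipe described in the text: combine voicing, place, manner
# and airstream to produce the technical label of a consonant.
def label(voicing: str, place: str, manner: str, airstream: str = "egressive lung air") -> str:
    return f"{voicing} {place} {manner} with {airstream}"

print("[p] =", label("voiceless", "bilabial", "occlusive (stop)"))
print("[b] =", label("voiced", "bilabial", "occlusive (stop)"))
print("[s] =", label("voiceless", "alveolar", "fricative"))
```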
Though there are hundreds of sounds in the world’s 7,000+ languages, they are all named and produced according to these procedures. And what is even more important, these few simple procedures, using parts of the body that evolved independently of language – the teeth, tongue, larynx, lungs and nasal cavity – are sufficient to say anything that can be said in any language on the planet. Very exciting stuff.
Humans can, of course, bypass speech altogether and communicate with sign languages or written languages. Human modes of communication, whether writing, sign languages or spoken languages, engage one or both of two distinct channels – the aural-oral and the manual-visual. In modern human languages, both channels are engaged from start to finish. This is essential in human language, where gestures, grammar and meaning are combined in every utterance humans make. There are other ways to manifest language, of course. Humans can communicate using coloured flags, smoke signals, Morse code, typed letters, chicken entrails and other visual means. But, funnily enough, no one expects to find a community that communicates exclusively by writing or smoke signals unless its members share some sort of physical challenge or are all cooperating with people who do.
One question worth asking is whether there is anything special about human speech or whether it is just composed of easy-to-make noises.5 Would other noises work as well for human speech?
As Philip Lieberman has pointed out, one alternative to human speech sounds is Morse code.6 The fastest speed a code operator can achieve is about fifty words per minute. That is about 250 letters per minute. Operators working this quickly, however, need to rest frequently and can barely remember what they have transcribed. But a hungover college student can easily follow a lecture given at the rate of 150 words per minute! We can produce speech sounds at roughly twenty-five per second.
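The arithmetic behind that comparison is simple to spell out. The figures below come from the text (fifty words per minute for Morse, 150 for a lecture, roughly twenty-five speech sounds per second); the five-letters-per-word figure is the conventional assumption used in Morse speed counts:

```python
# Rough arithmetic (illustration only) behind the Morse-versus-speech comparison in the text.
AVG_LETTERS_PER_WORD = 5                 # conventional assumption in Morse speed counts
morse_wpm = 50                           # an expert operator's best rate
lecture_wpm = 150                        # a comfortably followable lecture
speech_sounds_per_sec = 25               # rough speech-sound production rate

morse_letters_per_sec = morse_wpm * AVG_LETTERS_PER_WORD / 60
print(f"Morse:  ~{morse_letters_per_sec:.1f} letters per second")           # about 4
print(f"Speech: ~{speech_sounds_per_sec} sounds per second")                # several times faster
print(f"Listening at {lecture_wpm} wpm is {lecture_wpm / morse_wpm:.0f}x the best Morse rate")
```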
Speech also works by structuring the sounds we make. The main structure in the speech stream is the syllable. Syllables are used to organise phonemes into groups that follow a few highly specific patterns across the world’s languages.7 The most common patterns are things like Consonant (C) + Vowel (V); C+C+V; C+V+C; C+C+C+V+C+C+C and so on (with three consonants on either side of the vowel pushing the upper limits of the largest syllables observed in the world’s languages). English provides an example of complex syllable structure, seen in words like strength, s-t-r-e-n-g-th, which illustrates the pattern C+C+C+V+C+C+C (with ‘th’ representing a single sound). But what I find interesting is that in the majority of languages C+V is either the only syllable type or by far the most common. With the organisational and mnemonic help of syllables, together with our neural evolution and our contingency judgements – based on significant exposure to our native language – we are able to parse speech sounds and words far faster than other sounds.
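Reading off these templates is mechanical: classify each sound as a consonant or a vowel and concatenate the labels. The sketch below is my own and uses the book’s rough segmentation of ‘strength’ (treating ‘th’ as one sound):

```python
# A minimal sketch of reading off a syllable template: classify each phone as consonant (C)
# or vowel (V). The phone lists are rough, illustrative segmentations only.
VOWELS = {"a", "e", "i", "o", "u"}

def cv_template(phones: list[str]) -> str:
    return "".join("V" if p in VOWELS else "C" for p in phones)

print(cv_template(["s", "t", "r", "e", "n", "g", "th"]))   # 'strength' -> CCCVCCC
print(cv_template(["m", "a"]))                             # a bare CV syllable, the commonest type
```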
Suppose you want to say, ‘Piss off, mate!’ How do you get those sounds out of your mouth on their way to someone’s ear? There are three syllables, five consonants and three vowels in these three words, based on the actual spoken words rather than on their English spelling. The sounds are, technically, [pʰ], [ɪ], [s], [ɔ], [f], [m], [eɪ] and [t]. The syllables are [pʰɪs], [ɔf] and [meɪt] and so, unusually for English, each word of this insult is also a single syllable.
Sign languages also have much to teach us about our neural cognitive-cerebral platform. Native users of sign languages can communicate as quickly and effectively as speakers using the vocal apparatus. So our brain development cannot be connected to speech sounds so tightly as to render all other modalities or channels of speech unavailable. It seems unlikely that every human being comes equipped by evolution with separate neuronal networks, one for sign languages and another for spoken languages. It is more parsimonious to assume instead that our brains are equipped to process signals of different modalities and that our hands and mouths provide the easiest ones. Sign languages, by the way, also show evidence for syllable-like groupings of gestures, so we know that we are predisposed to such groupings, in the sense that our minds quickly latch on to syllabic groupings as ways of better processing parts of signs. Regardless of other modalities, though, the fact remains that vocal speech is the channel exclusively used by the vast majority of people. And this is interesting, because in this fact we do see evidence that evolution has altered human physiology for speech.
Human infants begin life vocally much as other primates do. The child’s vocal tract anatomy above the larynx (the supralaryngeal vocal tract or SVT) is very much like the anatomy of the corresponding tract in chimps. When human newborns breathe, their larynx rises to lock into the passage leading to the nose (the nasopharyngeal passage). This seals off the trachea from the flow of mother’s milk or other things in the newborn’s mouth. Thus human babies can eat and breathe without choking, just as a chimp can.
Adults lose this advantage. As they mature, their vocal tract elongates. Their mouths get shorter while the pharynx (the section of the throat immediately behind the mouth and above the larynx, trachea and oesophagus) gets longer. Consequently, the adult larynx doesn’t go up as high relative to the mouth and thus is left exposed to food or drink falling on it. As noted earlier, if this kind of stuff enters our trachea, people can choke to death. It is necessary, therefore, to coordinate carefully the tongue, the larynx, a small flap called the epiglottis and the oesophageal sphincter (the round muscle in our food pipe) to avoid choking while eating. One thing people take care to avoid is talking with their mouths full. Talking and eating can kill or cause severe discomfort. Humans seem to have lost an advantage possessed by chimps and newborn humans.
But the news is not all bad. Although the full inventory of changes to the human vocal apparatus is too large and technical to discuss here, the final result of these developments enables us to talk more clearly than Homo erectus. This is because we can make a larger array of speech sounds, especially vowels, like the supervowels ‘i’, ‘a’ and ‘u’, which are found in all languages of the world. These are the easiest vowels to perceive. We’re the only species that can make them well. Moreover, the vowel ‘i’ is of special interest. It enables the hearer to judge the length of the speaker’s vocal tract and thus determine the relative size as well as the gender of the speaker and to ‘normalise’ the expectations for recognising that speaker’s voice.
This evolutionary development of the vocal apparatus gives more options in the production of speech sounds, a production that begins with the lungs. The human lungs are to the vocal apparatus as a bottle of helium is to a carnival balloon. The mouth is like the airpiece. As air is released, the pitch of the escaping air sound can be manipulated by relaxing the piece, widening or narrowing the hole by which the air hisses out, cutting the air off intermittently and even ‘jiggling’ the balloon as the air is expelled.
But if human mouths and noses are like the balloon’s airpiece, they also have more moving parts and more twists and chambers for the air to pass through than a balloon. So people can make many more sounds than a balloon. And since human ears and their inner workings have co-evolved with humans’ sound-making system, it isn’t surprising that they have evolved to make and be sensitive to a relatively narrow set of sounds that are used in speech.
According to evolutionary research, the larynges of all land animals evolved from the same source – the lung valves of ancient fish, in particular as seen in Protopterus, Neoceratodus and Lepidosiren. Fish gave us speech as we know it. The two slits in this archaic fish valve functioned to prevent water from entering the lungs of the fish. To this simple muscular mechanism, evolution added cartilage and tinkered a bit more to allow for mammalian breathing and the process of phonation. Our resultant vocal cords are therefore actually a complex set of muscles. They were first called cordes by the French researcher Ferrein, who conceived of the vocal apparatus as a musical instrument.8
What is complicated, on the other hand, is the control of this device. Humans do not play their vocal apparatuses by hand. They control each move of hundreds of muscles, from the diaphragm to the tongue to the opening of the naso-pharyngeal passage, with their brains. Just as the shape of the vocal apparatus has changed over the millennia to produce more discernible speech, matching more effectively the nuances of the language that speakers have in their heads, so the brain evolved connections to control the vocal apparatus.
Humans must be able to control their breathing effectively to produce speech. Whereas breathing involves inspiration and expiration, speech is almost exclusively expiration. This requires control of the flow of air and the regulation of air-pressure from the lungs through the vocal folds. Speech requires the ability to keep producing speech sounds even after the point of ‘quiet breathing’ (wherein air is not forced out of the lungs in exhalation by normal muscle action, but allowed to seep out of the lungs passively). This control enables people to speak in long sentences, with the attendant production not only of individual speech sounds, such as vowels and consonants, but also of the pitch and modulation of loudness and duration of segments and phrases.
It is obvious that the brain has a tight link to vocal production, because electrical stimulation of parts of the brain can produce articulatory movements and some examples of phonation (vowel sounds in particular). Other primate brains respond differently. Stimulation of the regions corresponding to Brodmann Area 44 in other primates produces face, tongue and vocal cord movements, but not the phonation that similar stimulation produces in humans.
To state the obvious, chimpanzees are unable to talk. But this is not, as some claim, because of their vocal tract. A chimp’s vocal tract certainly could produce enough distinct sounds to support a language of some sort. Chimps do not talk, rather, because of their brains – they are not intelligent enough to support the kind of grammars that humans use and they are not able to control their vocal tracts finely enough for speech production. Lieberman locates the main controllers of speech in the basal ganglia, what he and others refer to as our reptilian brain. The basal ganglia are, again, responsible for habit-like behaviours, among other things. Disruption of the neural circuits linking the basal ganglia to the cortex can result in disorders such as obsessive-compulsive disorder, schizophrenia and Parkinson’s disease. The basal ganglia are implicated in motor control, aspects of cognition, attention and several other aspects of human behaviour.
Therefore, in conjunction with the evolved form of the FOXP2 gene, which allows for better control of the vocal apparatus and for the kind of mental processing used in modern humans’ language, the evolution of connections between the basal ganglia and the larger human cerebral cortex was essential to support human speech (or sign language). Recognising these changes helps us to see that human language and speech are part of a continuum found in several other species. It is not that there is any special gene for language, or that an unbridgeable gap appeared suddenly to provide humans with language and speech. Rather, what the evolutionary record shows is that the language gap was formed over millions of years by baby steps. At the same time, erectus is a fine example of how early the language threshold was crossed, of how changes in the brain and human intelligence were able to offer up human language even with ape-like speech capabilities. Homo erectus is the evidence that apes could talk if they had brains large enough. Humans are those apes.
* For those interested in the history of studies of human speech, these go back for centuries. But the modern investigation of the physiology and anatomy of human speech is perhaps best exemplified in a book by Edmund S. Crelin of the Yale University School of Medicine, entitled The Human Vocal Tract: Anatomy, Function, Development, and Evolution (New York: Vantage Press, 1987). It contains hundreds of drawings and photographs not only of the modern human vocal apparatus but also of the relevant sections of fossils of early humans, as well as technical discussions of each.