An easy place to begin a discussion of the interface between phonetics and phonology is with the speech chain, a phrase introduced by Denes and Pinson (1963) to describe the series of transformations a spoken message undergoes from its origin in the mind of the speaker to its destination in the mind of the listener. There appear to be at least two interfaces in this chain, the first between the phonological representation of the speaker’s intended message and its realization in their articulations, and the second between the auditory qualities evoked by the speech signal’s acoustics and the phonological representation that the listener recognizes. One might imagine that there are also two others, between the articulations and the acoustics of the speech signal and between the acoustics and the auditory qualities they evoke, but they are more properly transductions than interfaces because the natures of these transformations are entirely predictable from the physics, physiology, and psychology of the vocal and auditory apparatus. Figure 13.1 shows three representations of the most accessible link in this chain, the acoustic speech signal itself, its phonetic transcription (Figure 13.1a), the waveform (Figure 13.1b), and the spectrogram (Figure 13.1c). The phonetic transcription includes some more detail than the phonological representation (Figure 13.1d), but clearly corresponds more closely to it than to either representation of the utterance’s acoustics (Figure 13.1b,c).
Indeed, if one were to compare this rather detailed phonetic transcription of this rather hyperarticulated one-syllable word with the corresponding waveform or spectrogram, one would be immediately struck with how difficult it is to identify the correspondences between the discrete units of the transcription and the intervals and continuous acoustic properties of the waveform or spectrogram. The features and their values in the phonological representation of that word (Figure 13.1d) correspond even less well to the waveform’s or spectrogram’s intervals and acoustic properties. Correspondences between phonetic transcriptions or phonological representations would be equally hard to detect in a record of the articulatory movements that produced the acoustic signal displayed in the waveform and spectrogram, or in a record of the neural responses to the acoustic properties of that signal in the auditory system. And even though the processes that operate on or the constraints that regulate the contents of phonological representations may be motivated by the continuous articulatory, acoustic, or auditory properties of speech, those processes or constraints refer to their constituent segments and features as discrete units.1 If phonological and phonetic representations nonetheless differ in kind like this, then mechanisms are needed to accomplish two translations. The first translation is from the categories of the speaker’s message to the utterance’s articulatory continuum, and the second is from the auditory continuum to the categories of the listener’s recognition of the message’s phonological content. At a minimum, the actual translation mechanism(s) would serve as or mark the interface(s) between the phonology of an utterance and its phonetics.
This description of possible differences between phonological and phonetic representations is amplified in Pierrehumbert’s (1990) catalogue. Phonological representations are in the mind, where they constitute a part of speakers’ and listeners’ implicit linguistic knowledge, differences between them are qualitative, and the rules and constraints that specify phonological well-formedness are syntactic, in specifying combinations and structural arrangements of discrete, symbolic categories. Phonetic representations are events in the world, differences between them are quantitative, and the gradient relationships between their constituents are expressed in calculus.2 The difference is also often characterized as one of knowing versus doing, where the phonology of a language represents what its speakers and listeners know about its sounds and the patterns they enter into, and its phonetics represents the behaviors of those speakers and listeners when actually producing or perceiving speech. Both this distinction between knowledge and behavior and this equation of phonology with knowledge and phonetics with behavior are, however, challenged by a number of scholars whose proposals are discussed in the following.
If the phonetic representations of utterances whose phonological representations differ in their values for just one distinctive feature were instead compared, then it would be possible to observe phonetic correspondences to phonological contrasts, at least within individual languages in specified contexts. For example, the spectrograms of tore and bore in Figures 13.2a,b differ from that of door in Figure 13.1c in ways that correspond systematically to their different specifications for voicing and place of articulation contrasts, e.g., the long delay between the stop burst and the onset of voicing in tore compared to the negligible delay in “door,” and the far lower onset of the second formant in bore compared to the noticeably higher onset in door. These correspondences can, however, be produced by a mechanism that translates the categorical differences between minimally contrasting phonological representations into the phonetic realization of these contrasts as systematically different articulatory movements and one that translates systematically different auditory qualities back into categorical contrasts.3
Before presenting evidence in section 4, “There is an interface between phonetics and phonology,” for a more substantive interface between phonology and phonetics than (a) mere translation mechanism(s), I will first sketch three rather different arguments against there being any substantive interface between phonology and phonetics in the following section.
Here, I present and evaluate three proposals that there is no interface between phonetics and phonology. The first such proposal integrates the direct realist theory of speech perception (Fowler, 1986a, 1986b) with articulatory phonology (Browman & Goldstein, 1986, 1989, 1992, 1995a; Gafos & Goldstein, 2012; Goldstein & Fowler, 2003), in which the constituents of phonological representations in both the speaker’s and listener’s minds are articulatory gestures, specifying the location and degree of vocal tract constrictions and their relative coordination or timing. The second consists of an argument that phonological representations have no phonetic substance, nor do the rules or constraints that operate on or regulate them refer to any phonetic substance (Hale & Reiss, 2000, 2008). This argument has two parts. First, the constituents of phonological representations must be purely formal objects, and the rules and constraints that refer to them must be strictly syntactic, as Pierrehumbert (1990) proposed earlier. Second, independent theories of speakers’, listeners’, and learners’ behavior are needed anyway, so it is theoretically extravagant to incorporate phonetics into phonology. The third proposal turns this second argument on its head by insisting that the phonology is so thoroughly entwined with the physiology, physics, and psychology of movement, aerodynamics, acoustics, and audition that it cannot be separated from the phonetics. Phonological patterns and phonetic behavior also cannot be accounted for without appealing to general theories of pattern recognition, memory, and learning (Ohala, 1990, 2005). More modestly, perhaps, phonetic continua may be incorporated directly into phonological grammars (Flemming, 2001). The first of these arguments against any interface between phonetics and phonology asserts that only the world exists, while the last two arguments either deny any connection between phonology and the world or insist that phonology cannot be taken out of the world.
The first argument that there is no interface between phonology and phonetics rests on a perhaps unfamiliar and certainly underappreciated approach to theorizing about speech production and perception. This approach is motivated by the idea that both speakers’ and listeners’ behavior can be understood entirely by observing what they do in and receive from the world, and that it is therefore unnecessary to hypothesize that what speakers do as speakers is implement a representation whose constituents differ in kind from their actualization as acts of speaking or that listeners map the acoustic properties of the speech signal onto such representations. This idea in turn motivates arguments in Fowler (1986a, 1986b) that there can be no interface between phonology and phonetics, because, if there is a phonology, it cannot be distinguished from the phonetics. Her argument is laid out programmatically in these two quotes:
The essential modification is to our conceptualization of the relation between knowing and doing. First, phonetic segments as we know them can only have properties that can be realized in articulation. Indeed, from an event perspective, the primary reality of a phonetic segment is its public realization as vocal-tract activity. What we know of segments, we know from hearing them produced by other talkers or by producing them ourselves. Secondly, the idea that speech production involves a translation from a mental domain into a physical, non-mental domain such as the vocal tract must be discarded.
(Fowler, 1986a, pp. 9–10)
In a “bottom-up” account of perceiving speech, a listener has to reconstruct the talker’s mental intent from hints that the physical acoustic signal provides. In a “top-down” account, that bottom-up effort is aided by cognitive mediation. In a top-down account of talking, talkers use a mental plan to guide physical gestures of the vocal tract. In all three accounts (i.e., the bottom-up and top-down accounts of speech perception and the top-down account of speech production) there is a causal process transforming inputs or outputs into or out of a mental domain from or to a physical domain. This is an impossible kind of process. Theories of speech production use a conceptual sleight of hand known as the “motor command” to make the translation (Fowler, 1983). It is a sleight of hand because commands are clearly mental kinds of things that a mind can formulate, but the “commands” are obeyed by motor neurons that are physical kinds of things responsive to release of a transmitter substance, not to commands. A workable theory of production and perception has to avoid translation across domains like this.
(Fowler, 1986b, pp. 168–169)
Knowing does not differ from doing, or alternatively, what’s in the mind is what’s in the world, for speakers or listeners. While making clear that Fowler rejects any translation between mental and physical representations, these quotes also appear to convey that she is concerned only with the phonetics of speech. They might, therefore, reasonably prompt the question: where is the phonology in either of these pronouncements?
This appearance is misleading. To see this, we must distinguish between two kinds of information: information about the content of the speaker’s message versus information about the articulatory gestures that the speaker produces to convey the message’s content. In more familiar conceptions of a phonological representation, the information about the content of the speaker’s message is encoded in the phonological representation, in its constituent phonemes or their constituent features and in their arrangement with respect to one another. And the speaker’s articulatory gestures are intended to convey that information to the listener, who has to decode it from the acoustic properties of the speech signal produced by those articulatory gestures. In this conception, the information changes form or is represented differently at each stage in the speech chain between the speaker and listener until its original form in the speaker’s phonological representation is reconstituted in the listener’s phonological representation. But despite its formal differences, it is nonetheless the same information at all stages in the speech chain. Fowler (1986a, pp. 23–24) instead argues, following Gibson (1982), that information about the speaker’s articulatory gestures is different from information about the contents of the speaker’s message, because information about articulatory gestures is perceived directly from the acoustic properties of the speech signal, while information about the contents of the speaker’s message is instead perceived indirectly, as a product of the direct perception of those gestures.
It’s worth noting that distinguishing between two kinds of information in this way is entirely compatible with the consensus that the meaning of a linguistic utterance is arbitrarily related to its form. In the view of Fowler (1986a, 1986b), theories of speech production and perception, as theories of these behaviors, are responsible for explaining how speakers produce forms that listeners recognize as forms, and not for how speech production and perception contribute to conveying the content of the speaker’s message to the listener.
Contemporaneous work by Browman and Goldstein (1986) can be viewed as taking up the task of developing an account of phonological representations that is compatible with this view. Browman and Goldstein (1986, et seq.) proposed that the articulatory gestures that speakers produce, and that, according to Fowler (1986a, 1986b), listeners perceive, are also the constituents of the utterance’s phonological representation – their proposal is known as “articulatory phonology.” As such, articulatory gestures represent as well as convey the speaker’s message in just the same, arbitrary way as do the abstract features of a phonological representation like that in Figure 13.1d (see Gafos & Goldstein, 2012; Goldstein & Fowler, 2003, for an explicit argument that representations composed of articulatory gestures are functionally and combinatorially equivalent to representations composed of features). Browman and Goldstein’s (1986) proposal thus takes Fowler’s argument one essential step further, in also spelling out the relationship between the form of a linguistic message and the meaning that it conveys, i.e., what is indirectly perceived as well as what is directly perceived.
Browman and Goldstein’s (1986) proposal nonetheless raises a further question: are the constituents of the phonological representation of an utterance detectable in and identifiable from a record of the speaker’s articulatory movements after all, and are they equally detectable in and identifiable from a record of the acoustic properties produced by those movements, as displayed, e.g., in Figures 13.1b,c and 13.2a,b? For Fowler (1986a, 1986b), these are the wrong questions; the right one is instead: can listeners find information in the acoustic signal produced by these articulations that reliably identifies the articulatory gestures which produced those properties? The pervasiveness of coarticulation, which can extend beyond immediately adjacent segments, and reduction, which can in the limit make it impossible to detect an intended segment’s occurrence, encourages a skeptical if not an outright negative answer to this question. Fowler nonetheless answers it in the affirmative, and in doing so appears to reject Hockett’s (1955) famous description of the phonetic realization of an utterance as the uninterpretable destruction of its phonological representation:
Imagine a row of Easter eggs traveling along a moving belt; the eggs are of various sizes, and variously colored, but not boiled. At a certain point, the belt carries the row of eggs between the two rollers of a wringer, which quite effectively smash them and rub them more or less into each other. The flow of eggs before the wringer represents the series of impulses from the phoneme source; the mess that emerges from the wringer represents the output of the speech transmitter. At a subsequent point, we have an inspector whose task it is to examine the passing mess and decide, on the basis of the broken and unbroken yolks, the variously spread-out albumen, and the variously colored bits of shell, the nature of the flow of eggs which previously arrived at the wringer. Note that he does not have to put the eggs together again – a manifest physical impossibility – but only to identify.
(Hockett, 1955, p. 210)
This passage’s implied pessimism about the likelihood of the inspector – acknowledged as the hearer in the following passage – successfully identifying the phonemic sources of the phonetic mess is actually misleading. Hockett goes on to propose that the inspector/hearer may succeed after all if they treat the allophones of successive phonemes as overlapping in time, and parse each phoneme’s contribution to the signal’s acoustic properties during the interval when they overlap (see Fowler & Smith, 1986, for a very similar account). As Hockett ultimately does, Fowler (1986a, 1986b) argues that the phonetic realization does not destroy the phonological representation, despite the overlap of the articulations of successive speech sounds.
The argument in Fowler (1986a, 1986b) that representations in the mind are not translated into a different kind of representation in the world appears to rule out any interface, and any need for one. However, some sort of interface is still needed after all to handle the overlap and context-specific variation of articulations and their acoustic products. For that interface between the phonological representation and its phonetic realization to be nondestructive, the articulatory gestures must still be detectable and identifiable in the acoustic signal, at least by the listener, even if not so easily by the phonetician. An interface that is supposed to preserve the detection and identification of articulatory gestures is the task dynamics (Kelso, Saltzman, & Tuller, 1986a, 1986b; Saltzman, 1995; Saltzman & Kelso, 1987; Saltzman & Munhall, 1989). The following description relies principally on Saltzman and Munhall (1989).
To understand how the task dynamics serves as an interface between discrete articulatory gestures and continuous articulatory movements, we must first describe articulatory gestures more precisely. An articulatory gesture is itself a dynamical system, one defined by the goal that it is intended to achieve, for example, to close the lips. These goals vary neither across contexts nor depending on what other goals must be achieved at the same time; rather, the articulatory movements that achieve them do. Thus, closing the lips may require more jaw raising following the open vowel in /æp, æb, æm/ than following the close vowel in /ip, ib, im/. Goals are specified as values for context-independent dynamical parameters that specify the gesture’s target, the speed of approach to and away from the target, and the extent to which the gesture controls articulations moment-by-moment over the course of its realization. Individual speech sounds – the “phonetic segments” referred to by Fowler (1986a, 1986b) – consist of gestures whose beginnings and ends are sufficiently contemporaneous that they shape the vocal tract and its acoustic products more or less simultaneously, while successive speech sounds consist of gestures and thus articulations and acoustic products that overlap less in time.
Regardless of how extensive the overlap is, the movements of the various articulators that could contribute to realizing the targets of contemporaneously active gestures must be coordinated and their relative contributions adjusted so that each gesture’s target is reached. The kinematics of articulatory movements will thus differ within individual speech sounds and between successive speech sounds, depending on what demands temporally overlapping gestures make on articulators at any moment in time. An example of such variation between contrasting speech sounds is the difference between voiced and voiceless stops in the articulatory adjustments needed to manage their contrasting aerodynamic regimes: the oral cavity must be expanded during the closure of a voiced stop to keep oral air pressure from building up so fast that air no longer flows up through the glottis, but greater force is required to maintain the closure of a voiceless stop because the open glottis causes intraoral air pressure to build up more behind it. An example of such variation between successive speech sounds is the difference between vowels before voiceless stops and vowels before voiced stops or sonorants: the mouth may not open as far during the vowel in /æp/ as it does in /æb/ or /æm/, because the following lip-closing gesture overlaps more with the vowel and requires that the jaw be raised earlier (but cf. Summers, 1987, for evidence of faster and larger jaw opening before a voiceless than a voiced stop), and the velum may be lowered more or earlier in /æm/ than it is in /im/ to ensure that the open vowel is nasalized enough for that property to be detected (Kingston & Macmillan, 1995; Macmillan, Kingston, Thorburn, Walsh Dickey, & Bartels, 1999).4 The task dynamics ensures that the gestures’ dynamics remain invariant, and the gestures themselves thus remain detectable and identifiable, while permitting the articulators’ kinematics to vary within individual speech sounds and between successive speech sounds.
The task dynamics is a genuine interface between a phonological representation consisting of goal-directed articulatory gestures and its phonetic realization as articulatory movements: it takes the time- and context-invariant values of the gestures’ dynamical parameters as inputs and outputs the time- and context-varying kinematic properties of actual articulatory movements. It is not, however, an interface that translates between one kind of representation and another, but instead one that transduces the continuous quantities represented by those dynamical parameters into continuous movements with particular temporal and spatial kinematics.
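To make this division of labor concrete, the following sketch models a single gesture in the way task-dynamic accounts standardly do, as a critically damped mass-spring system driving one tract variable toward a context-independent target. The function name, the parameter values, and the lip-aperture example are illustrative assumptions of mine, not details of Saltzman and Munhall’s (1989) implementation.

```python
# A minimal sketch, assuming illustrative parameter values: one gesture as a
# critically damped mass-spring system acting on a single tract variable
# (here, lip aperture in arbitrary mm-like units).

def simulate_gesture(start, target, stiffness=400.0, dt=0.001, duration=0.25):
    """Return the tract-variable trajectory z(t) for one gesture.

    The gesture is defined by invariant dynamical parameters (target,
    stiffness, critical damping); the kinematics that result also depend on
    the context-dependent starting state, not just on those parameters.
    """
    damping = 2.0 * (stiffness ** 0.5)   # critical damping: no overshoot
    z, velocity = start, 0.0
    trajectory = []
    for _ in range(int(duration / dt)):
        accel = -stiffness * (z - target) - damping * velocity
        velocity += accel * dt
        z += velocity * dt
        trajectory.append(z)
    return trajectory

# The same lip-closing gesture (same target, same stiffness) launched from an
# open vowel like /ae/ and a close vowel like /i/: the resulting movements
# differ in extent and peak velocity even though the gesture's dynamical
# definition does not change.
closing_from_open = simulate_gesture(start=18.0, target=0.0)
closing_from_close = simulate_gesture(start=8.0, target=0.0)
```

This is the sense in which the dynamics remain invariant while the kinematics vary: the same parameter settings yield different movements in different contexts.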
This discussion has focused so far on getting from articulatory gestures to movements of articulators. Before closing, it is worthwhile to recognize subsequent developments of the proposal that phonological representations are composed of articulatory gestures that specify goals, not articulator movements, and to note that representing the constituents of speech sounds phonologically as gestures does not differ as much as one might expect from representing them as autosegments. Developments in articulatory phonology show that it accounts very differently for at least: (a) fast speech phenomena – gestures are overlapped and their acoustics obscured by or mixed with those of neighboring sounds rather than sounds deleting or assimilating to neighboring sounds (Browman & Goldstein, 1990), (b) incomplete neutralization of contrasts – the input consists of the articulatory gestures for both the neutralized and unneutralized goals, with the former greater in strength than the latter in neutralizing contexts (Gafos, 2006; see also Smolensky & Goldrick, 2016, for a generalization of this approach to any case where more than one phonological element is activated simultaneously during phonetic implementation),5 (c) syllable structure – the gestures of the consonants in onsets can be coordinated with those of the nuclear vowel differently than the gestures of consonants in codas (Browman & Goldstein, 1988), and languages can differ in how consonants in onsets or codas are coordinated with one another (Nam, Goldstein, & Saltzman, 2009; Shaw, Gafos, Hoole, & Zeroual, 2009, 2011), and (d) speech errors – errors are coproductions of the intended gesture and an intruded one rather than whole segment intrusions (Goldstein, Pouplier, Chen, Saltzman, & Byrd, 2007; Pouplier & Goldstein, 2010). Phonological representations composed of articulatory gestures differ from those composed of autosegments in lacking any correspondent to a timing unit or superordinate prosodic unit. But time remained as abstract as in autosegmental representations until the recent development of dynamical models that incorporate time explicitly (Roon & Gafos, 2016).
Instead of arguing that there is no interface between phonology and phonetics because phonological representations cannot and must not be distinguished from their phonetic realizations, Hale and Reiss (2000, 2008), Morén (2007), and Reiss (2017) argue that there is no interface other than translation or transduction from phonological categories to phonetic continua. In their view, neither the form of phonological representations nor the processes operating on them are regulated by the phonetic substance of the speech sounds that realize those representations and processes. The argument has two parts. In the first, these authors argue that what appears to be evidence of phonetic regulation of phonological representations and processes must instead be interpreted as evidence that phonetic substance influences language learning and/or language change. There is thus no need to duplicate the account of that diachronic or ontogenetic influence in the synchronic phonological grammar, nor should such duplication be permitted by phonological theory. In the second part, they argue that phonological theory is responsible only for defining “the set of computationally possible human grammars” (Hale & Reiss, 2000, p. 162, emphasis in the original), because that definition would be a step toward explaining the phonological competence that exists in the mind of the speaker.6 The two parts of this argument turn out to be inseparable at present because there is debate about whether computationally possible grammars must include constraints or other formal devices that are motivated by the phonetic substance of speech and/or by the learning of a language’s phonology (for discussion, see Bermúdez-Otero & Börjars, 2006, and see the discussion of Beguš (2017); Hayes (1999) later in this chapter). Nonetheless, the discussion that follows addresses only the first part of this argument, as developed in Hale and Reiss (2000), because only it addresses the nature or existence of any interface between phonology and phonetics.
Hale and Reiss (2000) begin their argument for eliminating phonetic substance from phonological representations and the rules and constraints that refer to those representations by attributing contextually constrained contrasts to acquisition and language change. Beckman (1997) accounted for some of these phenomena as instances of positional faithfulness.7 Sounds contrast for a feature in some but not all contexts in a language when an identity constraint (Ident[feature]specific) requiring faithful preservation of input feature values in specific contexts outranks a markedness constraint (*Feature) that would rule out that specification, which in turn outranks a more general faithfulness constraint preserving that specification everywhere: Ident[feature]specific >> *Feature >> Ident[feature]general.
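To make the effect of this ranking concrete, the sketch below (my own illustration, not Beckman’s formalism) instantiates it with [voice] as the feature and the syllable onset as the specific context, choosing the winning candidate by lexicographic comparison of violation profiles in ranking order; the constraint names, candidate encoding, and hypothetical /dVd/ input are simplifying assumptions.

```python
# A minimal sketch, assuming a toy candidate encoding, of positional
# faithfulness: Ident[voice]-onset >> *[voice] >> Ident[voice]-general
# preserves the voicing contrast in onsets while neutralizing it in codas.

def violations(input_form, candidate):
    onset_changed = input_form["onset_voice"] != candidate["onset_voice"]
    coda_changed = input_form["coda_voice"] != candidate["coda_voice"]
    return {
        "Ident[voice]-onset": int(onset_changed),
        "*[voice]": int(candidate["onset_voice"]) + int(candidate["coda_voice"]),
        "Ident[voice]-general": int(onset_changed) + int(coda_changed),
    }

RANKING = ["Ident[voice]-onset", "*[voice]", "Ident[voice]-general"]

def optimal(input_form, candidates):
    # Lexicographic comparison of violation profiles implements strict ranking.
    return min(candidates,
               key=lambda c: [violations(input_form, c)[k] for k in RANKING])

# Hypothetical /dVd/ input: the onset /d/ survives, the coda /d/ devoices,
# i.e., the Thai-like pattern of a contrast licensed only in onsets.
dvd = {"onset_voice": True, "coda_voice": True}
candidates = [
    {"onset_voice": True, "coda_voice": True},    # [dVd]
    {"onset_voice": True, "coda_voice": False},   # [dVt]  <- winner
    {"onset_voice": False, "coda_voice": False},  # [tVt]
]
print(optimal(dvd, candidates))
```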
If the language permits the contrast everywhere, Ident[feature]general >> *Feature and Ident[feature]specific can be ranked anywhere, and if it permits it nowhere, *Feature >> Ident[feature]general, Ident[feature]specific. The distribution of the [voice] contrast in Thai, English, and Creek represents the three possibilities: in Thai, /b, d/ contrast with /p, t/ in syllable onsets but not codas, in English, they contrast in codas as well as onsets, and in Creek, there is no [voice] contrast in any position:
Beckman’s (1997) account explicitly appeals to the substance of the utterance because the possible specific contexts are psycholinguistically prominent and/or phonetically salient, e.g., at the beginning of a word or syllable or in a stressed syllable. Hale and Reiss (2000) observe that the psycholinguistically prominent or phonetically salient contexts in which a contrast is preserved are those where its phonetic correlates are either more likely to be attended to, e.g., in the first syllables of words or the onsets of syllables, or easier to detect because their values are more extreme, e.g., in stressed syllables. Attention and detection are less likely and/or harder in other contexts. A child learning the language might therefore fail to attend to or detect a contrast’s correlates in those other contexts. If these failures become general, the language would change into one where the contrast is preserved only in contexts where its correlates are reliably attended to or detected, and the contrast would be restricted synchronically to those contexts. This restriction would be explained by the merger in other contexts resulting from earlier generations’ failure to learn the yet earlier language where this contrast appeared in all contexts, not by encoding the restriction as an active constraint in the synchronic phonological grammar (see Blevins, 2004, for a generalization of this proposal regarding the division of explanatory labor between psycholinguistically or phonetically motivated sound change and the contents of the synchronic grammars of individual languages).8
On these grounds, Hale and Reiss (2000) argue that there is no practical difference between the experience of a child learning a language like German in which the voicing contrast is restricted to onsets and one learning a language like English in which this contrast is not restricted to any position in the syllable. The child learning German will never encounter voiced obstruents in syllable codas, while the one learning English will encounter numerous examples of voiced obstruents in codas.9 Both children will receive far more than enough positive evidence regarding the distribution of voiced obstruents for them to successfully learn the contextual distribution of the voicing contrast in their respective languages (again, see Blevins, 2004, for a more general argument that positive evidence is more than enough to inform the learner about what sounds a language has, and how they behave, without requiring that the resulting synchronic grammar consist of anything more than a description of those sounds and their behavior).
This and other similar arguments lead Hale and Reiss (2000) to conclude that,
Phonology is not and should not be grounded in phonetics since the facts that phonetic grounding is meant to explain can be derived without reference to phonology. Duplication of the principles of acoustics and acquisition inside the grammar violates Occam’s razor and thus must be avoided. Only in this way will we be able to correctly characterize the universal aspects of phonological computation.
(p. 162, emphasis in the original)
and a bit later,
The goal of phonological theory, as a branch of cognitive science, is to categorize what is a computationally possible phonology, given the computational nature of the phonological component of UG.
(p. 163)
where “computationally possible” is to be contrasted with “diachronically possible” and “ontogenetically possible.” The latter are the (much smaller?) set of phonological grammars that can develop over time and that are learnable given psycholinguistic and phonetic constraints. The following two, surprisingly similar cases serve as illustrations (see Browman & Goldstein, 1995b, for further discussion of their similarity). Krakow (1989, 1999) shows that soft palate lowering begins earlier than the oral constriction in a syllable-final nasal like that in seam E,10 and as a result the preceding vowel is nasalized. In a syllable-initial nasal like that in see me, soft palate lowering and the oral constriction begin and end more or less at the same time, and neither preceding nor following vowels are nasalized. Sproat and Fujimura (1993) show that the raising of the tongue tip toward the alveolar ridge in a syllable-final lateral like that in Mr. Beel Hikkovsky is delayed relative to the dorsal constriction at the uvula, and as a result the rime is a diphthong with an ɰ-like offglide. In a syllable-initial lateral like that in Mr. B. Likkovsky, the tongue tip raising and the dorsal constriction begin and end more or less at the same time. The results reported in both studies come entirely from English speakers, but if these timing patterns are shared with other languages, then distinctive vowel nasalization is more likely to develop before syllable-final than syllable-initial nasals, and a syllable-final lateral is more likely to develop into the offglide of a diphthong than a syllable-initial lateral. Both likelihood differences appear to be borne out in what language changes are observed versus unobserved.
But there is no computational reason why a speech community or language learner could not ignore these psycholinguistic and phonetic facts and nasalize a vowel after a syllable-initial nasal or turn a syllable-initial lateral into a velar glide. de Lacy and Kingston (2013) make an analogous argument, that a theory of phonology sensu stricto is not responsible for accounting for how often particular sounds or sound patterns occur, as both may be accounted for more perspicuously by an appropriate account of language learning or learnability (see also Staubs, 2014). Blanket appeals to greater learnability must be made cautiously, however, as there is experimental evidence that other criteria may outweigh ease of learning. For example, Rafferty, Griffiths, and Ettlinger (2013) show that an artificial language exhibiting vowel harmony is more successfully learned than one that lacks it, but that vowel harmony disappears when the language is transmitted across generations of learners.
Once again, the essential difference here is between computationally possible and diachronically or ontogenetically probable languages. The computationally possible languages include those that have entirely arbitrary phonological rules (also known as “crazy” rules; Bach & Harms, 1972; see also Blust, 2005). Such rules are diachronically or ontogenetically improbable. Hale and Reiss (2000, 2008) and Reiss (2017) see the learnability of arbitrary rules as a feature rather than a bug, in that arbitrary rules or constraints are computationally possible. On the one hand, this argument is supported by the fact that the languages spoken today, and the much smaller subset of which we have adequate descriptions, are surely only a subset of the possible human languages and thus of the computationally possible phonological grammars. On the other hand, it hasn’t been demonstrated yet that the set of possible languages is equal to the set of those that are computationally possible, nor is it clear what the limits on computational possibility are. One needs to know what the primitives may be, how they may be operated on, and how the patterns they enter into may be constrained. For recent useful discussions of and commentary on a range of possibilities, see Heinz and Idsardi’s (2017) introduction to a special issue of Phonology on computational phonology and the other papers in that special issue. Their introduction and that special issue show that computational approaches to phonology are still refreshingly diverse in the computational tools used, the standards of evidence and argument applied, and the data considered. Heinz (to appear) and related work test the formal limits of possible phonological constraints and operations, and thus aim to determine what the set of possible phonologies is within limits of greater or lesser strictness. In trying to determine how strict these limits are, this work proceeds by examining the range of E-languages rather than assuming that the possible formal contents of I-languages can be decided a priori. Moreover, the possible representations and operations are of quite general types, and thus the limits considered are based on entirely different criteria than those imposed by Hale and Reiss (2000, 2008) or Reiss (2017).
Hasty decisions about how to analyze a particular set of facts and the lack of caution they entail can also lead to error. A telling example is intervocalic /r/-insertion in Massachusetts English, which Hale and Reiss cite as a case of an arbitrary rule. It was described as such in McCarthy’s (1993) original account of its behavior, but findings reported by Gick (2002) suggest that it’s instead well-motivated phonetically. Gick shows that /r/, or more precisely its English approximant pronunciation [ɻ], is the expected glide after the vowels where it’s inserted, /a, ɔ, ə/, because it resembles those vowels in being produced with a pharyngeal constriction. Inserting /r/ after these vowels is thus just as expected on phonetic grounds as inserting /j/ and /w/ after /i, e/ and /u, o/, respectively. While this case shows that rules which have been supposed to be arbitrary can turn out to be phonetically motivated after all, and therefore motivates reexamination of other supposedly arbitrary rules, it does not rule out the possibility that a residue of genuinely arbitrary rules may remain (see, once again, Bach & Harms, 1972, for possible examples).
Rather than arguing that phonetics must be kept out of phonology, Ohala (1990) argues that it cannot be kept out. He also argues that there can be no interface between phonetics and phonology, because describing their relationship as or via an interface incorrectly assumes that they are two “domains.” In one conception of the interface, one domain is phonology, the other phonetics, which meet “where phonological representations become implemented physically.” In another conception, phonology and phonetics are “largely autonomous disciplines whose subject matters may be similar … but which [are studied] in different ways; there is an area in between [the interface: JK] where the two can cooperate” (p. 153). In place of an interface, Ohala proposes instead a thorough integration of phonetics and phonology, first as a superior account of their actual relationship,
it will be impossible to characterize either discipline as autonomous or independent. If it boils down to a kind of “phonology = knowing” vs. “phonetics = doing” dichotomy, the knowledge – of units, distinctive features, etc. – is acquired and updated by complex phonetic analysis… . There is a growing literature showing that phonological units and processes are what they are largely due to the physical and physiological structure of the speech mechanism… . It is equally impossible to imagine a “phonetic” component working largely independently of the phonological knowledge or competence of a speaker … there has never been any question that the movements in speech are implementations of phonologically-specified word shapes.
(p. 155)
Second, he argues that integrating research on phonology and phonetics is a more fruitful method of investigating phonological and phonetic phenomena. Ohala (1990), as well as Ohala (1981, 1983, 2005), Ohala and Jaeger (1986), and the references cited therein, explains a variety of sound patterns and changes as products of the phonetic behavior of speakers and/or listeners.
Ohala (1990, 2005) in particular also contrasts the character and quality of these phonetic explanations with those that cast them in terms of phonological representations. Two examples of such explanations suffice as illustrations. First, the contents of vowel inventories are explained very differently by Lindblom’s dispersion model (Liljencrants & Lindblom, 1972; Lindblom, 1986) than by Chomsky and Halle’s (1968) markedness conventions (2a–d) as follows. In Lindblom’s model, vowels are dispersed sufficiently within a space defined by the ranges of first and second formant frequencies (F1, F2) that can be produced by a human vocal tract. The expression of these formants’ frequencies in auditory units (Bark rather than Hz) emphasizes that it is the listener who needs the vowels to be sufficiently dispersed. Schwartz, Boë, Vallée, and Abry (1997a, 1997b) added a preference for focal vowels; that is, those vowels in which adjacent formants are close enough together (in Bark) to produce a single prominent and presumably perceptually salient spectral peak.11 Simulations presented in de Boer (2000), Lindblom (1986), and Schwartz et al. (1997b) show that dispersion and focalization successfully capture the contents and structure of cross-linguistically common vowel inventories (but see Kingston, 2007, for remaining shortcomings). de Boer’s (2000) simulations also showed that sufficiently dispersed vowel inventories could evolve from the interactions between agents trying to learn each other’s pronunciations without explicitly invoking a dispersion requirement.
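To give a sense of the mechanics, the sketch below implements the core of the dispersion idea in the spirit of Liljencrants and Lindblom (1972), scoring an inventory by the summed inverse-squared distances among its vowels in a two-dimensional Bark space. The particular Bark values, the inventories compared, and the simple cost function are rough illustrative assumptions of mine, not the original model, which optimized vowel positions rather than merely comparing fixed inventories.

```python
# A minimal sketch, assuming rough illustrative Bark values: an inventory is
# better dispersed to the extent that the summed inverse-squared auditory
# distances among its vowels are small.
from itertools import combinations

def dispersion_cost(inventory):
    """Sum of 1/d^2 over all vowel pairs in (F1, F2) Bark space;
    a lower cost means a more dispersed, perceptually safer inventory."""
    cost = 0.0
    for (f1a, f2a), (f1b, f2b) in combinations(inventory.values(), 2):
        d_squared = (f1a - f1b) ** 2 + (f2a - f2b) ** 2
        cost += 1.0 / d_squared
    return cost

# Roughly dispersed corner vowels vs. a crowded front inventory (Bark values
# are illustrative approximations, not measured data).
corner = {"i": (2.5, 14.0), "u": (2.5, 6.0), "a": (7.0, 10.0)}
crowded = {"i": (2.5, 14.0), "e": (4.0, 13.0), "ae": (5.5, 12.5)}
print(dispersion_cost(corner) < dispersion_cost(crowded))  # True: corners win
```

Even this toy comparison prefers the corner vowels /i, u, a/ to a crowded front series, which is the intuition that the full model formalizes and optimizes.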
Chomsky and Halle’s (1968) marking conventions for vowels in (2a,b) represent the contents of the minimal three-vowel inventory, /i, u, a/, which has more non-low than low vowels, and where the non-low vowels are high. (2a) also represents the fact that larger inventories than /i, u, a/ usually have more non-low than low vowels, while (2b) also represents the fact that high vowels usually occur in these larger inventories, too. (2c,d) represent the common agreement between rounding and backness in non-low vowels, and the common absence of rounding in low vowels.
Despite Chomsky and Halle’s (1968) stated goal of trying to capture the contribution of features’ phonetic correlates to phonological inventories and patterns by these means, markedness conventions remain no more than descriptively adequate, because they don’t explain why high vowels should outnumber low ones nor why a back vowel is more likely to be rounded, and a front vowel unrounded.
The dispersion plus focalization model achieves explanatory adequacy by appealing directly to the limits and liberties on transforming the articulations of vowels into their acoustics and on transforming vowels’ acoustics into auditory qualities. The auditory system is more sensitive to frequency differences in the lower part of the spectrum where F1 varies as a function of tongue body height/closeness. Because vowels differing in F1 are thus inherently more dispersed than those differing in their frequencies for higher formants as a function of tongue body backness and lip rounding, languages can reliably make finer distinctions in height than backness and rounding – Chomsky and Halle’s (1968) markedness conventions don’t even describe the greater number of height than backness and rounding distinctions. Differences in tongue body backness also produce a larger range of possible F2 values when the body constricts the oral cavity along the palate, as it does in high vowels, than when it constricts the oral cavity in the pharynx, as it does in low vowels. Thus, non-low, especially high vowels can be more dispersed than low vowels. The dispersion of high vowels can also be enhanced more than that of low vowels by rounding the lips when the tongue body is backed, and spreading them when it’s fronted (see Linker, 1982, for articulatory evidence of more extreme lip positions in higher vowels, and see Kaun (1995, 2004) and Pasquereau (2018) for phonological consequences of these height-determined differences in the degree of lip rounding). The corner vowels are also more likely to be focal, in that F1 and F2 are close together in /u, a/ and F2 and F3 are close together in /i/.
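The claim about differential sensitivity can be illustrated with a standard Hz-to-Bark approximation (the specific formula and frequencies below are my illustrative choices): the same 100 Hz step spans roughly a full Bark in the F1 region but only about a quarter of a Bark in the F2/F3 region.

```python
# A minimal sketch, using one standard approximation of the Bark scale, of why
# equal Hz differences are auditorily larger in the F1 region than higher up.

def hz_to_bark(f):
    """Approximate critical-band (Bark) value for a frequency f in Hz."""
    return 26.81 * f / (1960.0 + f) - 0.53

low_step = hz_to_bark(400) - hz_to_bark(300)      # F1-like region: ~1.0 Bark
high_step = hz_to_bark(2600) - hz_to_bark(2500)   # F2/F3-like region: ~0.26 Bark
print(round(low_step, 2), round(high_step, 2))
```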
A second comparison of phonetic with phonological explanations concerns the intrusive stops that appear between nasals or laterals and adjacent fricatives for some English speakers, e.g., in warmth [wɔɻmpθ] and false [fɔlts]. Clements (1987) accounts for the intrusive [p] in warm[p]th as a product of the nasal’s [−continuant] and place values spreading onto the following fricative’s timing slot, where their overlap with the fricative’s [−voice] value produces a voiceless, oral bilabial closure. According to Ohala (1990, 2005), the intrusive stop is instead produced in warm[p]th when the velum is raised and the glottis is abducted early in anticipation of the fricative’s states for these articulators, before the complete oral closure is replaced by the fricative’s narrow constriction. On this account, the intrusive stop results from a timing error rather than from spreading two of the nasal’s phonological feature values onto the fricative. Ohala argues, moreover, that the intrusive stops between laterals and fricatives cannot be explained as a product of spreading the feature values from the lateral to the fricative, because both sounds are [+continuant]. Instead, in the transition from air flow around the sides of the tongue in the lateral to air flow down the center of the tongue in the fricative, the sides of the tongue might be raised before the lateral’s central constriction is released. The result would be a brief interval when air neither flows around the sides nor down the center of the tongue, i.e., a complete oral closure like that of a stop. The stop would be voiceless if the /s/’s large glottal abduction begins earlier than the lowering of the center of the tongue. Because a timing error can explain intrusive stops between laterals and fricatives as well as between nasals and fricatives, it is a more general and parsimonious explanation than feature spreading.13
This explanation is, however, incomplete, in that it does not account for all the facts. A timing error is no more than a phonetic event without phonological consequences unless and until it becomes systematic. Ohala (1990, 2005) cites examples from English and other languages where the intrusive stop has become part of words’ lexical entries. Citing Fourakis and Port (1986), Clements (1987) also proposes that stop intrusion has become systematic between nasals and fricatives in some English dialects, while remaining absent in others. Can a timing error still be responsible for cases where a stop systematically intrudes in particular words or in all eligible words in a particular dialect? Or is the stop’s occurrence no longer dependent on its original phonetic cause? (I return to these questions later in sections “Distinguishing between phonology and phonetics” and “Phonetic implementation.”) The same question can be raised regarding the role of dispersion and focalization in regulating the contents of vowel inventories. Both may have originally influenced the locations of vowels in the vowel space when a particular vowel inventory emerged in a language, and both may influence any shifts in those locations as vowels are added to or subtracted from that space as a result of sound changes. But do dispersion and focalization continue to regulate vowels’ locations in the vowel space during periods, often quite lengthy ones, in a language’s history when its vowel inventory remains unchanged? Putting this question more generally, must lexicalized or phonologized sound patterns remain dependent on their phonetic causes?
There appears to be a contradiction here, but one that proves resolvable. On the one hand, it appears that the answer to this question must be “yes,” because the phonetic constraints on articulations, their transformation to acoustic properties, and the transformation of those properties to auditory qualities don’t go away once they’ve shaped a language’s phonology. They could therefore continue to regulate each of these links in the speech chain between the speaker and listener. On the other hand, it appears that the answer to this question must instead be “no,” because subsequent sound changes often obscure or remove entirely a sound pattern’s original phonetic motivation, interactions with the language’s lexicon or morphology can limit or extend the pattern’s scope analogically, and the accumulation of changes can create foci or environments for phonological processes which consist of arbitrary disjunctive sets of features (Garrett, 2015; Mielke, 2008, 2011, and references cited therein). The end results in each kind of subsequent development are synchronic sound patterns that would otherwise be phonetically unlikely if not altogether impossible.
From the evidence that phonologized sound patterns both do and do not remain dependent on their phonetic causes, Ohala (2005) argues that it is a mistake to incorporate phonetic naturalness requirements into phonological grammars. The phonetic constraints will regulate what speakers and listeners do without their having to know them. They will thus determine not only how speakers of a language will behave contemporaneously, but also what is likely to happen during a language’s history, as well as what the common typological patterns are. Phonetic constraints will not, however, determine what must happen; that is, what the sound patterns of particular languages are. Ohala also suggests that what speakers and listeners actually do know will most likely be revealed by psycholinguistic studies of the production and perception of speech.
A case that has been much discussed in the literature recently supports Ohala’s assessment of both the extent and limitations of phonetic explanations, as well as his suggestion that what speakers and listeners know can best be discovered by studying acts of production and perception. This case is the devoicing of stops and affricates after nasals in two closely related Southern Bantu languages, Tswana (Beguš, 2017; Coetzee & Pretorius, 2010; Gouskova, Zsiga & Boyer, 2011; Hyman, 2001) and Shekgalagari (Solé, Hyman & Monaka, 2010). Hyman (2001) first presented the Tswana case as a challenge to the proposal that there is a phonetically well-motivated constraint against the occurrence of voiceless stops and affricates after nasals, *NT (Hayes, 1999; Hayes & Stivers, 1995; Pater, 1999). If *NT is itself as phonetically motivated as these studies suggest, if a possible, indeed frequent repair is to voice the stop, and if markedness constraints must be phonetically grounded, then there can in principle be no complementary constraint, *ND, against voiced stops and affricates after nasals.14 The last condition is essential, as otherwise the observed facts could be obtained so long as *ND remains lower-ranked, and repairs to *NT other than voicing are ruled out by higher ranked faithfulness constraints. For example, *NT, ID[nasal], Max >> Ident[voice], *ND maps /nt nd/ onto [nd nd].15 Examples of synchronic devoicing, as a possible repair to violation of *ND, after the first person object clitic N- from Tswana and Shekgalagari are given on the right in (3, 4) (Hyman, 2001; Solé et al., 2010):
There is some debate in this literature about the facts of Tswana, in particular about whether stops only devoice after nasals. This debate can in part be attributed to individual and possibly also dialectal differences between the speakers who provided the data for the various studies (Coetzee & Pretorius, 2010; Gouskova et al., 2011), but instrumental analyses show that at least some speakers consistently devoice voiced stops after nasals and not elsewhere. An instrumental analysis by Solé et al. (2010) produced similar results for the speech of a single speaker of Shekgalagari. So how can voiced stops devoice after nasals, when not only is this change not phonetically motivated, but its opposite, post-nasal voicing, is?
Hyman (2001) proposes that at the first relevant stage in Tswana’s history (5a), voiced stops that didn’t follow nasals lenited to homorganic continuants.16 Subsequently at the second stage (5b), all stops that remained stops devoiced. Stop devoicing was context-free; that voiced stops only occurred after nasals following their earlier lenition elsewhere was irrelevant (see Beguš, 2017, for a generalization of this account to stop devoicing after nasals in other languages). At the final stage (5c), the voiced fricatives produced by lenition at the first stage became stops once more:
Once the third change turned continuants back into stops, reversing the first change, voiced stops elsewhere were expected to alternate with their voiceless counterparts after nasals.
As noted by Coetzee and Pretorius (2010), the underlying forms of labial stops in the first column in Table 13.1 should therefore have been realized with the surface forms in the second column at this stage in Tswana’s history.18
Two characteristics of the pronunciation of stops in Tswana (and Shekgalagari) support Ohala’s (2005) argument against incorporating a requirement for phonetic naturalness into phonological grammars.
The first and most obvious is the synchronic alternation in both languages between voiceless stops after nasals and voiced stops elsewhere. Post-nasal voicing is phonetically motivated, but post-nasal devoicing is not and cannot be phonetically motivated.19 Voicing is likely to continue from the nasal into the oral interval of a nasal-oral stop sequence because the leak through the velopharyngeal port prevents any appreciable rise in intraoral air pressure during the nasal interval. That oral interval is also typically quite brief, as the soft palate is raised just in time to ensure that at least but no more than the release is oral rather than nasal (Beddor & Onsuwan, 2003; Hubbard, 1995; Maddieson, 1989; Maddieson & Ladefoged, 1993). The soft palate is also not raised until late in the closure interval of an oral stop following a contrastively nasalized vowel (Huffman, 1993). Because the entire oral interval is so brief, much or all of it may be voiced even if voicing continues only for a short time itself after the soft palate is raised and air no longer leaks out through the velopharyngeal port. Devoicing would therefore not be expected if it occurred only after nasals.
Table 13.1

| Underlying | Expected | Tswana I | Tswana II |
|---|---|---|---|
| bV | bV | ? | ? |
| pV | pV | ? | ? |
| VbV | VbV | VbV ~ VβV | VbV ~ VβV |
| VpV | VpV | VpV | VpV ~ VbV |
| mbV | mpV | mpV | mbV ~ mpV |
| mpV | mpV | mpV | mbV ~ mpV |
Beguš (2017) argues, however, that stops are expected to devoice because intraoral air pressure will rise fast enough behind a stop closure to eliminate the pressure drop across the glottis unless the oral cavity is expanded to slow its rise. On his account, it’s necessarily irrelevant that the only voiced stops that remain in either Tswana or Shekgalagari after lenition occur after nasals, a context where the aerodynamics not only don’t threaten but actually encourage the maintenance of voicing. That is, devoicing has to be context-free in a strict sense, or more precisely, it has to occur without regard to the phonetic conditions in which voiced stops occur after lenition. Beguš makes this point somewhat differently, in emphasizing the continuation of the complementary distribution between stop pronunciations after nasals versus fricative pronunciations elsewhere following the first stage. The competition between one phonetically natural process, stop devoicing, and another, post-nasal voicing, is resolved in favor of the former because it’s more general, applying to stops wherever they occur, even if the only context they occur in is one that should inhibit this sound change phonetically (see Hayes, 1999; Moreton & Pater, 2012a, 2012b, for further argument that simpler processes are favored). Devoicing is and must be indifferent to the fact that stops only occur after nasals. This argument implies that a preference for formal generality can determine which of two opposing phonetically motivated outcomes is chosen, and not the relative strengths of the phonetic motivations.
The second piece of evidence supporting Ohala’s argument against directly incorporating phonetic naturalness into phonological grammars can be found in Coetzee and Pretorius’s (2010) detailed description of individual differences between Tswana speakers’ pronunciations. That description demonstrates the push and pull between phonetic naturalness and other pressures (third and fourth columns in Table 13.1). The seven speakers they label “Tswana I” devoice stops after nasals in 90% or more of tokens, and all but one pronounce /b/ as a voiced stop [b] rather than a fricative [β] after vowels in more than 80% of tokens; the exceptional speaker does so in about 50% of tokens. Tswana I speakers pronounce /b/ as a stop [b] rather than a fricative [β] after vowels most often when a clitic boundary falls between the vowel and /b/, less often when a prefix boundary does, and least often inside a morpheme. Voiceless stops remain voiceless stops in nearly all instances and contexts. The five speakers in Tswana II pronounce /b/ as [b] after nasals with variable frequency, one in less than half the tokens, three in more than half, and one in roughly 80% of tokens. Unlike the Tswana I speakers, these speakers appear to be reverting to the phonetically natural pattern of voicing rather than devoicing after nasals. However, the speakers in this group resemble those in Tswana I in producing /b/ as a stop [b] or fricative [β] with the same sensitivity to the presence and nature of the morphological boundary between the preceding vowel and the consonant. Four of the five Tswana II speakers pronounce /p/ as voiced [b] after vowels in a noticeable minority of tokens, but the voiceless pronunciation is more common for all five speakers in this context. The speakers in these two groups thus differ from one another in how they pronounce voiced stops after nasals, but resemble one another in how they pronounce them after vowels, and in how their pronunciations in the latter context depend on the sound’s morphological context. These facts show that sound patterns are not solely governed by their phonetic realizations, and therefore that phonological grammars should not incorporate a requirement that the patterns they’re responsible for be just those that are phonetically natural.
The purpose of the extended discussion of this example has been to support Ohala’s (2005) argument that phonological grammars need not be limited to phonetically natural processes or constraints, nor are phonetically natural processes or constraints free from regulation by other components of the grammar. Instead, phonetically unexpected patterns, like stop devoicing after nasals, can be incorporated into synchronic grammars as the product of a series of sound changes, each of which may itself be phonetically motivated. And phonetically expected patterns, like post-vocalic spirantization, can come to be grammatically conditioned. Nonetheless, Ohala’s conception of the contents of a phonological grammar remains distinct from Hale and Reiss’s (2000) conception, in permitting phonetics to motivate its symbolic content and operations directly, and thereby limiting its contents and operations far more narrowly than to whatever can be computed by manipulating symbols.
Like Ohala (2005), Flemming (2001) argues that the phonetics and phonology overlap so thoroughly that there is no interface between them whose purpose is to translate between the categories of phonological representations and the continua of their phonetic realizations. But unlike Ohala, Flemming argues that grammars incorporate naturalness in the form of constraints that are not only phonetically or substantively grounded but that also refer directly to phonetic continua (see also Boersma, 1998, 2011; Boersma & Hamann, 2009, for proposals that grammars refer directly to phonetic continua and that differ formally but not substantively from Flemming’s proposals). Flemming motivates his argument by pointing to a variety of cases where phonetic and phonological processes closely resemble one another. More generally, he argues that,
Phonetics and phonology are not obviously distinguished by the nature of the representations involved, or in terms of the phenomena they encompass. As far as representation is concerned, most of the primitives of the phonological representation remain phonetically based, in the sense that features and timing units are provided with broadly phonetic definitions. This has the peculiar consequence that sound is represented twice in the grammar, once at a coarse level of detail in the phonology and then again at a finer grain in the phonetics. Perhaps more significant is the fact that there are also substantial similarities between many phenomena which are conventionally classified as phonetic and those which are conventionally classified as phonological; for example coarticulation is similar in many respects to assimilation.
(pp. 9–10)
Before continuing, it’s worth noting that Ohala (2005) would see no need to formalize the phonetic motivations for phonological patterns as rules or constraints in the phonological grammar; quite the contrary, it would be enough for him to identify those motivations and to construct from them a phonetic explanation of the patterns.
The rest of this discussion of Flemming’s proposal is limited to his central illustration of the last of these observations, namely, the fronting of back vowels by coarticulation with neighboring coronal consonants and the assimilation of back to front vowels in such contexts in languages such as Cantonese. As Flemming notes, coarticulation with a neighboring coronal consonant raises the frequency of the second formant (F2) in back vowels such as /u/ and /o/. In Cantonese, this phonetic effect of coarticulation is phonologized when coronal consonants occur on both sides of /u/ or /o/, where the contrast between back rounded vowels /u/ and /o/, tʰok “to support,” put “to wipe out,” kot “to cut,” and the corresponding front rounded vowels /y/ and /ø/, kʰyt “to decide,” søŋ “wish,” neutralizes to the front vowels, tyt “to take off,” jyt “moon,” jøt “weak,” tøn “a shield” – neither *tut nor *ton is a possible word in Cantonese.
Flemming (2001) accounts for coarticulation via a markedness-like constraint that favors shorter movements of articulators. Perfect satisfaction of this constraint would result in no movement of the articulators between a consonant and a neighboring vowel, and thus no difference between the consonant’s acoustic target for the second formant (F2) and the vowel’s. Violations of this constraint are a direct scalar function of the difference between the consonant’s and the vowel’s F2 values in a candidate. This effort-minimization constraint competes with faithfulness-like constraints requiring that both the consonant’s and the vowel’s F2 targets be reached in a candidate; violations of these constraints are also direct scalar functions of the difference between these targets and the actual F2 values in a candidate. All these functions are weighted, and the cost (disharmony) of a candidate is the sum of their weighted values. The optimal candidate is then the one with the lowest value for this sum, i.e., the lowest cost.
Neutralization of the contrast between back rounded /u, o/ and front rounded /y, ø/ between coronal consonants is accounted for by a constraint, MinDist, which requires that contrasting categories differ by some minimum amount along the relevant phonetic dimension, here F2. Violations of this constraint, too, are assessed by a direct, weighted scalar function of the F2 difference between the back rounded vowels and their front counterparts. This constraint competes with MaximizeContrast, which requires that the number of contrasts be maximized and which would therefore preserve the contrast between back and front rounded vowels. When both flanking consonants are coronal, effort minimization produces an F2 value for a back rounded vowel that is too close to the F2 target of its front counterpart, and the back:front contrast neutralizes to the front category when the cost of maintaining the contrast is greater than the benefit of doing so, as specified by the weight of MaximizeContrast.
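To make the mechanics of this evaluation concrete, the following sketch computes the cost of a small set of candidates as a weighted sum of scalar constraint violations and selects the candidate with the lowest cost. The F2 targets, the weights, and the use of squared deviations as the scalar violation function are all assumptions made for illustration, not Flemming’s (2001) actual figures; the same style of weighted comparison would extend to the trade-off between MinDist and MaximizeContrast that decides whether a contrast is neutralized.

```python
# A minimal sketch of weighted, scalar constraint evaluation in the spirit of
# Flemming (2001). All numbers (F2 targets, weights) and the choice of squared
# deviations as the scalar violation function are assumptions for illustration.

def cost(v_f2, c_f2_target, v_f2_target, w_effort, w_faith_v):
    """Disharmony of a candidate: a weighted sum of scalar constraint violations."""
    effort = (c_f2_target - v_f2) ** 2     # minimize articulator movement from C to V
    faith_v = (v_f2_target - v_f2) ** 2    # reach the vowel's own F2 target
    return w_effort * effort + w_faith_v * faith_v

C_TARGET = 1800.0   # hypothetical F2 target (Hz) of a coronal consonant
U_TARGET = 800.0    # hypothetical F2 target (Hz) of back rounded /u/
W_EFFORT, W_FAITH_V = 1.0, 1.5

# Candidate outputs differ only in the F2 actually realized during the vowel.
candidates = [800.0, 1100.0, 1400.0, 1700.0]
best = min(candidates, key=lambda f2: cost(f2, C_TARGET, U_TARGET, W_EFFORT, W_FAITH_V))
print(best)  # 1100.0: a compromise between the consonant's and the vowel's targets
```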
Assessing violations via scalar functions that refer to distances along continuous phonetic dimensions incorporates sounds’ phonetic properties directly into the evaluation of candidates’ well-formedness. Moreover, phonological categories themselves are represented by their values along continuous phonetic dimensions, so there is no longer any need to translate between discrete categories and continuous values. Finally, unequivocally phonological outcomes like neutralization of contrasts can be achieved by the interaction of constraints that refer to continuous phonetic dimensions, such as MinDist, with categorical constraints, such as MaximizeContrast, which refer to contrasting sounds as discrete units.
Flemming’s (2001) effort to erase the formal difference between phonetics and phonology by these means should not, however, be read as an argument that phonology can be reduced to phonetics. Instead, it is intended to show that operations on continua can produce outcomes that amount to operations on categories (see the discussion of Steriade, 2001, 2008, in the section on the phonetic grounding of alternations below, for an account that resembles Flemming’s and Boersma’s in its assumption that the phonetics regulates the phonology, but that does not incorporate direct reference to phonetic continua into grammars).
Three distinct arguments have been reviewed for eliminating a separate interface between phonetics and phonology. The first, that no interface is needed (or possible), holds that the units of phonological representations must be of a kind that can be realized in the act of speaking, namely, articulatory gestures. The second, prompted by the question of whether phonological representations have any phonetic substance, holds that the apparent psycholinguistic and phonetic grounding of phonological representations, processes, and constraints must instead be attributed to learning and diachrony, and that phonology must be restricted to formal computation. The third, that there is no phonology without phonetics nor phonetics without phonology, holds that phonology and phonetics are so inextricably and mutually intertwined that there is no border between them, no possibility of fruitfully investigating the one without also investigating the other, and finally no need to keep reference to continua out of the phonology. Given the force of each of these arguments, and their variety, it’s reasonable to ask whether it is still possible to argue convincingly for an interface between phonetics and phonology. In the remaining sections of this chapter, I present compelling reasons for responding “yes” to this question.
As a bridge to the section “There is an interface between phonetics and phonology,” it is useful to discuss the arguments that support a widely accepted alternative to the beginning of the preceding quote from Flemming (2001), repeated here for convenience:
Phonetics and phonology are not obviously distinguished by the nature of the representations involved, or in terms of the phenomena they encompass.
(p. 9)
Cohn (1993, 2007), Myers (2000), and Solé (1992, 1995, 2007) all offer diagnostics for deciding whether a pattern is phonological or phonetic. All these diagnostics derive from a distinction between phonological categories and phonetic gradients: the contents of phonological representations consist of categories and the processes that operate on them or the constraints that regulate them refer to these categories, while the contents of phonetic representations are gradients and the procedures that implement them refer to gradients (see Pierrehumbert, Beckman, & Ladd, 2000, for an argument that phonology and phonetics cannot be distinguished as referring to categories versus gradients, and compare Pierrehumbert (1990)). These diagnostics are first discussed with respect to the facts of vowel nasalization, VOT, and vowel duration in English versus Spanish or Catalan, before turning to a more general account of diagnostics that might distinguish phonology from phonetics.
Solé (1992, 1995) shows that nasalization extends much further into vowels preceding nasals in English than in Spanish, and that nasalization’s extent increases with the rate-determined duration of the vowel in English but not Spanish. She interprets these facts as evidence that an allophonic rule produces the extensive vowel nasalization in English, but that coarticulation produces the modest vowel nasalization in Spanish; that is, vowels become phonologically specified as [+nasal] before nasals in English, but not in Spanish, where the brief interval of coarticulation arises in the phonetics.20
It’s important to carefully examine the effect of rate on the extent of vowel nasalization in English, because Myers (2000) and Solé (2007) have both argued that rate-determined variation is evidence of gradient phonetic implementation, not a categorical phonological process or constraint. Solé (1992, 1995) shows that the vowel lengthened and the absolute extent of nasalization increased as speech rate slowed down, but the duration of the initial unnasalized portion of the vowel remained constant. Its constancy suggests that English speakers time the onset of soft palate lowering relative to the beginning of the vowel and continue to lower it for the remainder of the vowel, no matter how short or long it lasts. Spanish speakers instead time the onset of soft palate lowering relative to the onset of the nasal, and thus only nasalize as much of the vowel as they must, given the mechanics of moving from a closed to an open velopharyngeal port. The sort of rate-determined variation observed in English is therefore not evidence that vowel nasalization is phonetically gradient rather than phonologically categorical in that language.
Solé (2007) shows that VOT in long-lag (voiceless aspirated) stops in English also increases as rate slows, but the VOT of short-lag (voiceless unaspirated) stops in Catalan does not, and that vowel duration differences before voiced versus voiceless obstruents increase as rate slows in English, but remain constant across rates in Catalan.21 The similarity in the effects of rate on vowel nasalization, VOT, and vowel duration leads Solé to argue that the temporal extents of all three properties are controlled by English speakers so as to scale with rate. On her account, controlled properties are phonological properties; others might attribute them instead to phonetic grammars (see 4.3).
Myers (2000) proposes a number of mutually exclusive criteria for diagnosing whether a process or pattern is phonological or phonetic. On the one hand, any process or pattern that refers to morphosyntactic categories, morphological structure, or specific morphemes can only be phonological. Such processes cannot be phonetic because phonetics can only refer to quantities and such processes refer instead to categories. There is a deeper reason why none of these processes can be phonetic: they all entail restrictions on or exceptions to what would otherwise be completely general processes. Phonetic processes are assumed to be exceptionless because they are automatic byproducts of the physics, physiology, and/or psychology of speaking and listening. On the other hand, any process or pattern that refers to gradients, speaking rate or style, or to differences between speakers can only be phonetic. Such processes or patterns cannot be phonological because they refer to quantities, and there is no evidence in Myers’s view that phonological patterns do (cf. Flemming, 2001).
If such a bright line can be drawn between phonological and phonetic processes and patterns, then it follows that some sort of interface is needed to mediate the interaction between the categorical and gradient characteristics of speech sounds.
This section begins with discussions of the phonetic grounding of inventories and alternations, before turning to phonetic implementation. Inventories and alternations are static, synchronic properties or characteristics of languages, so the focus in the first two sections is on how they are motivated by, or grounded in, the phonetic properties of the affected sounds (for other discussions of grounding, see Archangeli & Pulleyblank, 1994; Flack, 2007; Padgett, 2002; Smith, 2002). Implementation is instead behavior, so the focus of the last section is on the procedures that realize the phonological representation of an utterance phonetically.
Following Ohala (1983), Hayes (1999) reviews phonetic explanations for the absence of anterior voiceless stops, e.g., /p/ in Arabic, and posterior voiced stops, e.g., /ɡ/ in Dutch. But Hayes also observes that both stops frequently occur in the voiced and voiceless series of stop phonemes, despite the phonetic challenges to reliably producing or perceiving them – indeed either gap, no /p/ or no /ɡ/, occurs significantly less often than would be expected by chance.22 Hayes further observes that phonotactic constraints more often regulate an entire series rather than just its phonetically more challenged members; for example, all voiced stops are devoiced in syllable codas in German, not just the one most susceptible to devoicing, /ɡ/. Hayes argues that this discrepancy can best be resolved by distinguishing between the particularity of phonetic regulation of speaking and hearing and the systematicity of phonological constraints on the occurrence and distribution of segments.23 Hayes’s argument resembles Beguš’s (2017) in giving priority to formal generality over phonetic motivation tout court, but differs in that formal generality regulates inventory content rather than the choice between competing phonetic motivations.
If the influences of phonetics and phonology on segment inventories and distributions do differ in this way, then the need for an interface between them re-emerges. That interface cannot, however, be a sharp boundary across which the categories of phonological representations are translated into the quantities of their phonetic realizations, but must instead consist of a negotiation between the finer demands of the phonetics and coarser demands of the phonology. In Hayes’s account, that negotiation consists of trading off the phonetic grounding of a phonological constraint with its complexity, where a constraint is less complex than another when its structural description is properly included within the other’s structural description.
Hayes (1999) estimates that the specific and therefore more complex constraints against /p/ and /ɡ/ are more phonetically grounded than the more general and thus less complex constraints against voiceless and voiced obstruents, respectively. According to Hayes, if phonetic grounding were the only criterion, then stop inventories like those observed in Dutch and Arabic would be far more common than they are (see also Bermúdez-Otero & Börjars, 2006, for further discussion of this point). Instead, languages tend to have voiceless or voiced stops at all three major places of articulation or to have them at none of these places. This fact suggests that there may be general markedness constraints prohibiting voiceless or voiced stops across places of articulation, i.e., *[−Continuant/−Voice] or *[+Voice/−Continuant], rather than place-specific markedness constraints prohibiting aerodynamically challenging /p/ or /ɡ/, i.e., *[−Continuant/−Voice, labial] or *[+Voice/−Continuant, dorsal] – these constraints are stated in this way because the most frequent repairs that languages apply to eliminate /p/ or /ɡ/ in outputs are spirantization to /ɸ/ or /f/ and devoicing to /k/, respectively. Apparently, simplicity (or generality) governs the contents of inventories, not phonetic grounding. As Ohala (1979) observed, consonant inventories favor maximal over less than maximal use of available distinctive features, and in this respect differ from vowel inventories, which are more obviously governed by a requirement that their members be sufficiently dispersed in a space defined by phonetic dimensions.
This discussion has had three purposes. First, it lays bare the uncertainties about how far a phonetic explanation of the contents of an inventory may reach, or perhaps in this example, how short its reach may be. Second, it exposes an essential difference between phonetic and phonological explanations, as they are currently understood. Phonetic explanations refer to individual speech sounds or to their characteristics, and they can thus recognize differences between individual sounds that are otherwise very similar to one another, such as those between voiceless or voiced stops at different places of articulation.
Phonological explanations refer instead to sets of speech sounds, which ideally resemble one another phonetically, but which minimally pattern alike. This distinction between more particular and more general reference resembles what Hayes (1999) refers to as more versus less complex, but those terms apply only to the phonological representations of sounds. For example, the sets consisting of /p/ or /ɡ/ alone are more complex than those consisting of /p, t, k/ or /b, d, ɡ/ because they must refer to place.
The relative simplicity of the more inclusive phonological sets does not carry over, however, to the corresponding phonetic sets, [p, t, k] and [b, d, ɡ], even if the context in which these sounds occur was held constant. It does not carry over because the aerodynamic adjustments required to ensure that each member of the set [p, t, k] is realized as a stop and to ensure that each member of the set [b, d, ɡ] is realized as voiced differ between places of articulation, as do the articulators used to produce the closures at each place.
The simplicity of the sets /p, t, k/ and /b, d, ɡ/ does not therefore follow from the phonetic characteristics of their members, but instead only from the length of their phonological definitions. This point is by now obvious, perhaps even belabored, but it was the only way to bring us to the final point of this discussion, which is this: the fact that phonological inventories consist of sets of sounds with simpler phonological descriptions is something to be explained; it is not itself an explanation. A postscript can be added to this final point, namely, not all sets of phonemes, even of consonants, pattern like stops; for example, languages without gaps at any of the three major places in their stop inventories often still lack a contrastive dorsal nasal /ŋ/. A more comprehensive approach to investigating the possible phonetic grounding of typologically common inventories is needed before any decision can be made as to how to divide the labor of explanation between a particular phonetics and a general phonology, along with a more critical examination of what purport to be phonological explanations.
Steriade (2001, 2008) proposes that phonological alternations are governed by a requirement that alternants differ minimally from one another, and that minimally different alternants are those that are perceptually most similar to one another, not those that differ minimally in their distinctive feature values. Perceptual similarity estimates are obtained from the distances between alternants in the P-map, a representation of the perceptual correspondents of alternants’ context-specific phonetic correlates.
Steriade’s (2008) central example is the alternation between obstruents that contrast for [voice] in some contexts, and neutralize that contrast to [−voice] in other contexts. Steriade identifies the contexts where the [voice] contrast is usually maintained as those where a vowel or more open sonorant articulation follows the obstruent, and those where the contrast instead neutralizes to [−voice] as preceding another obstruent or a word boundary. An obstruent may also assimilate in [voice] to a following obstruent. In both papers, she argues that the contexts where the contrast is preserved are those where its correlates, or “cues,” are best realized or best conveyed to the listener, while those contexts in which the contrast is neutralized are those in which those cues are unrealized, poorly realized, or not conveyed to the listener. So Steriade (2008) first grounds her account of alternations phonetically by observing that whether a contrast is maintained in a context depends on how reliably the listener can detect its identifying cues in that context.
The second way in which Steriade (2008) grounds her account is in her solution to the “too-many solutions” problem. This problem arises in optimality-theoretic grammars because markedness constraints don’t specify how to repair their violation, yet languages don’t avail themselves of all the possible repairs. In the case of interest here, the /b/ in inputs such as /tabza, tab/ where its cues would be poorly conveyed could be repaired by devoicing (/tapza, tap/), nasalization (/tamza, tam/), spirantization (/taβza, taβ/), gliding (/tawza, taw/), deletion (/taza, ta/), epenthesis (/tabəza, tabə/), or a number of other changes, yet the only repair observed is the first, devoicing.24
Steriade (2008) argues that only this repair is chosen because /tapza, tap/ are phonologically the most similar to the unrepaired strings /tabza, tab/ of all the possible output strings produced by the alternative repairs. /tabza, tab/ are more similar phonologically to /tapza, tap/ than they are to /tamza, tam/, /taβza, taβ/, etc., because /b/ and /p/ are perceptually more similar to one another than are /b/ and /m/, /b/ and /β/, or /b/ and any of the other unobserved repairs in those contexts. And the contrast does not neutralize before a sonorant, e.g., in /bra, ba/, because these strings are even more dissimilar phonologically from /pra, pa/ than are /tabza, tab/ from /tapza, tap/. They are more dissimilar because the cues to the [voice] contrast are more robust before a sonorant or a vowel than before an obstruent or a word boundary.
Phonological similarity must reflect perceptual similarity between sounds rather than the number of differences in their feature values, because at least three unobserved repairs, nasalization, spirantization, and gliding, like devoicing, change the value of just one distinctive feature: [−nasal] > [+nasal], [−continuant] > [+continuant], and [−sonorant] > [+sonorant], respectively, cf. [+voice] > [−voice].25 The /b/ is devoiced in /tabza, tab/ because obstruents contrasting for [voice] before another obstruent or a word boundary are perceptually more similar to one another than those before a sonorant or a vowel. They are perceptually more similar before an obstruent or a word boundary because the number, size, and detectability of the phonetic differences are reduced compared to before a sonorant or a vowel.
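The force of this point can be illustrated with a small calculation. In the sketch below, each repair is credited with the single feature change listed above, so a feature-count metric ties them all; the perceptual distances are invented placeholders standing in for the context-specific measurements a P-map would require, and are not drawn from any study.

```python
# Each repair of /b/ before an obstruent changes just one feature value
# (following the counts given in the text), so a feature-count metric cannot
# single out devoicing. The perceptual distances are invented placeholders.

feature_changes = {              # repair: features whose values change
    "devoicing (b > p)": {"voice"},
    "nasalization (b > m)": {"nasal"},
    "spirantization (b > β)": {"continuant"},
    "gliding (b > w)": {"sonorant"},
}

# Hypothetical context-specific perceptual distances from [b] before an obstruent.
perceptual_distance = {
    "devoicing (b > p)": 0.2,
    "nasalization (b > m)": 0.6,
    "spirantization (b > β)": 0.7,
    "gliding (b > w)": 0.8,
}

for repair, changed in feature_changes.items():
    print(f"{repair}: {len(changed)} feature(s) changed, "
          f"perceptual distance {perceptual_distance[repair]}")
# All repairs tie at one feature changed; only the (invented) perceptual metric
# would pick out devoicing as the minimal change, as on Steriade's P-map account.
```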
The other sound pattern that Steriade (2001) argues reflects perceptual similarity is the progressive assimilation in clusters of apico-alveolars and retroflexes26 and the neutralization of contrasts between them when not preceded by a vowel. She observes that these sounds differ more acoustically during the transition from a preceding vowel than during the transition to a following vowel. She also cites Anderson’s (1997) evidence (Table 13.2) that Western Arrente speakers identify apico-alveolars and retroflexes far less well in CV strings made by removing the preceding V from VCV strings than in the original VCV strings, but identify all but one of the dental and post-alveolar laminal consonants roughly as well in the CV as in the VCV strings – the exception is the dental lateral [l̪]. The values in the Drop column for the apical consonants in Table 13.2 also show that retroflex identification suffers more than apico-alveolar identification when the preceding vowel is removed.
Anderson’s (1997) results are compelling evidence in support of the first of Steriade’s proposals, that contrasts are kept in contexts where their distinguishing acoustic correlates are present and detectable and lost in contexts where they are not. However, they fail to speak to her second proposal, that preferred or possible alternants must be perceptually similar to one another. They fail to do so because the Western Arrente listeners’ task in Anderson’s experiment was to identify the consonant in each stimulus with a linguistic category. As the following discussion of consonant confusion studies makes clear, identification tasks in which stimuli are assigned to a category in the listeners’ native language do not provide any measure of perceptual similarity distinguishable from a measure of linguistic similarity represented by shared distinctive feature values.
The difficulty with both examples, of course, as Steriade (2008) herself acknowledges, is that we lack the necessary measurements of context-specific perceptual similarity between consonants. One possible source of such assessments is studies of confusions between consonants when mixed with noise or filtered, or when the listener’s hearing is impaired. A measure of perceptual similarity could be derived from those studies, on the assumption that more similar consonants would be more confusable. This assumption is made explicit by Dubno and Levitt (1981):
First, it is postulated that in a recognition task, an incorrect response implies that the available acoustic cues were not sufficient to differentiate a certain consonant from a number of other consonants. Second, it is assumed that the listener’s incorrect response would be that consonant which is most similar to the target in terms of the acoustic variables that are important in the recognition process.
(p. 249)
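Dubno and Levitt’s assumption suggests one way such a measure might be computed. The sketch below derives a symmetrized confusion rate from a small confusion matrix; the counts are invented placeholders rather than data from any of the studies cited here.

```python
import numpy as np

# A minimal sketch of deriving pairwise "perceptual similarity" from a confusion
# matrix, on the assumption that more confusable consonants are more similar.
# The counts below are invented placeholders, not data from any cited study.

consonants = ["p", "t", "k", "b"]
# rows = stimulus, columns = response
confusions = np.array([
    [50,  8,  6,  4],
    [ 7, 55,  5,  2],
    [ 5,  6, 52,  3],
    [ 6,  2,  3, 48],
])

# Convert counts to response probabilities given each stimulus, then symmetrize:
# similarity(i, j) = mean of P(respond j | stimulus i) and P(respond i | stimulus j).
p_resp = confusions / confusions.sum(axis=1, keepdims=True)
similarity = (p_resp + p_resp.T) / 2

for i, ci in enumerate(consonants):
    for j, cj in enumerate(consonants):
        if i < j:
            print(f"sim({ci}, {cj}) = {similarity[i, j]:.3f}")
```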
However, not all studies of consonant confusions have compared the same consonantal contrasts between contexts where their cues are expected to be more versus less reliably conveyed to listeners; exceptions are Benkí (2003), Dubno and Levitt (1981), Redford and Diehl (1999), Cutler, Weber, Smits, and Cooper (2004), and Wang and Bilger (1973). But it has proven difficult to derive a consistent measure of perceptual similarity from even these studies, because their results differ as a function of how confusions were induced, whether confusions were more numerous after than before vowels or vice versa, what the native language of the listeners was, what responses listeners used, and other factors.
Even more research into perceived similarity using confusion studies is unlikely to be fruitful, given the unrelieved uncertainty about whether the confusions documented in these studies reflect the phonetic or the phonological features of the stimuli. On the one hand, we have Wang and Bilger’s (1973) reluctant conclusion that they had not been able to show that the listeners’ confusions in their study reflected “natural perceptual features” (p. 1249) instead of the phonological features that represent voice, manner, and place contrasts,
it is possible to account for a large proportion of transmitted information in terms of articulatory and phonological features … if natural perceptual features do exist, then there is little evidence in confusion-matrix data to support their existence.
(p. 1264)27
On the other hand, Dubno and Levitt (1981) identify a suite of acoustic properties that reliably predict the patterns of consonant confusions produced by the listeners in their study. But Dubno and Levitt also observe that,
the best predictions of incorrect responses was provided by different groups of variables [acoustic properties: JK] for each nonsense syllable subset and each experimental condition. That is, no one set of acoustic information could accurately predict confusions between all types of syllables.
(p. 258)
The subsets they refer to consist of different sets of consonants preceding or following vowels. Dubno and Levitt’s observation that the acoustic properties that predict confusions differ between contexts appears to support Steriade’s (2008) assertion that alternations arise because the cues to contrasts differ between, or are not conveyed equally well in, all contexts. However, no one has yet appealed to Dubno and Levitt’s specific observations in explaining why some phonological alternations are observed and others are not.
As an alternative to the equivocal evidence from studies of consonant confusions, Steriade (2008) turns instead to evidence from partial rhymes, direct judgments of similarity, and judgments of foreign accents, all of which point to [−voice] obstruents being more similar to [+voice] obstruents in the neutralization contexts than the other possible repairs (see her paper for citations to the specific studies). But none of this evidence distinguishes a similarity judgment based on comparing sounds’ distinctive feature specifications from one based on the perceptual correspondents of their context-specific phonetic correlates any better than the studies of consonant confusions do. Phonological similarity and its regulation of alternations may indeed be grounded in perceptual similarity, but that remains a hypothesis that has not yet been tested directly.
The account offered by Steriade (2001, 2008) of the restrictions on possible alternations was discussed in this section because it appears to appeal so directly to the phonetic characteristics of the alternating sounds in the contexts in which they do and do not alternate. But as the discussion in the preceding paragraphs emphasizes, it only appears to do so, in the sense that phonological similarity is not yet demonstrably grounded in the phonetics of the sounds. In fact, it’s difficult to conceive of a means by which phonetic grounding of similarity, as phonetic grounding, could be demonstrated with mature language users, as their perception of the sounds of their native language cannot easily be separated from their phonological representations of them. Perhaps similarity judgments could be made using non-speech analogues, that is, sounds which preserve the acoustics of the speech sounds of interest but which are recognized neither as those sounds nor as speech. But there remains the necessity of showing that the perceived similarity between the original speech sounds scales directly with the perceived similarity of the corresponding non-speech analogues.
Given that this chapter began with a reference to Pierrehumbert, perhaps it is fitting that it closes with one, too, specifically, a reference to her work developing a model of the phonetic implementation of intonation in English and Japanese (Pierrehumbert, 1980; Liberman & Pierrehumbert, 1984; Pierrehumbert & Beckman, 1988); see also Bruce (1977) for a fundamentally similar account of Swedish intonation. The way in which the labor of explanation of the observed patterns is divided between the phonology and the phonetics in that model is also instructive about how that labor might be divided more generally.
Briefly, in these accounts of both English and Japanese, the phonological representation of intonation consists of a string of high (H) and low (L) tones which are aligned with the string of segments; that is, a tune consisting of H and L tones is aligned with a text consisting of segments. Japanese differs from English in that some of these tones arise from the lexicon, in morphemes bearing HL pitch accents, as well as from the intonational component, while all English tones arise from the intonational component. The intonational components of the two languages also differ considerably in which tones are aligned with particular locations in the text. However, once the string of Hs and Ls is specified and aligned in the utterance’s phonological representation, the two languages resemble one another closely in how those Hs and Ls are mapped onto actual f0 values: both adjust tones’ relative prominence, interpolate between specified f0 targets, apply downstep (AKA “catathesis”) and final lowering, and exhibit declination. The mapping consists of translating Hs and Ls into f0 targets followed by further transformations of those f0 targets as a function of their prominence and context. In this account of English and Japanese intonation, the phonetics bears the bulk of the labor in determining how an utterance is pronounced, not the phonology.28 A similar division of labor may be observed in Bruce’s (1977) account of Swedish intonation.
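As a rough illustration of what such a mapping might look like, the sketch below converts a string of H and L tones into f0 targets, downsteps each successive H, adds a declination slope, and interpolates linearly between targets. The parameter values and equations are invented for illustration; they are not those of Pierrehumbert (1980), Liberman and Pierrehumbert (1984), or Pierrehumbert and Beckman (1988).

```python
# A toy mapping from a tone string to an f0 contour: assign targets to H and L,
# downstep each successive H, add declination, and interpolate between targets.
# All parameter values and equations are invented for illustration only.

def tune_to_f0(tones, h_f0=220.0, l_f0=140.0, downstep=0.85,
               declination=-10.0, step_dur=0.25, samples_per_step=10):
    # 1. Assign an f0 target to each tone, downstepping every H after the first.
    targets, h_count = [], 0
    for i, tone in enumerate(tones):
        if tone == "H":
            f0 = h_f0 * (downstep ** h_count)
            h_count += 1
        else:
            f0 = l_f0
        time = i * step_dur
        targets.append((time, f0 + declination * time))  # add declination

    # 2. Interpolate linearly between successive targets.
    contour = []
    for (t0, f0_0), (t1, f0_1) in zip(targets, targets[1:]):
        for k in range(samples_per_step):
            frac = k / samples_per_step
            contour.append((t0 + frac * (t1 - t0), f0_0 + frac * (f0_1 - f0_0)))
    contour.append(targets[-1])
    return contour

for t, f0 in tune_to_f0(["H", "L", "H", "L", "H"])[::10]:
    print(f"{t:.2f} s  {f0:6.1f} Hz")
```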
Liberman and Pierrehumbert (1984) briefly discuss the generalization of this division of explanatory labor beyond intonation. Their example is the intrusion of stops between nasals and following fricatives already discussed, e.g., warm[p]th, prin[t]ce, and leng[k]th. These intrusions were previously described as a product of devoicing and soft palate raising occurring before opening the complete closure of the nasal into the fricative’s narrow constriction, i.e., as an error in the timing of two of the fricative’s articulations relative to a third. The occurrence of intrusive stops between a lateral and a following fricative, e.g., heal[t]th, fal[t]se, and wel[t]sh, was explained similarly, as early raising of the sides of the tongue before lowering its center, producing a brief complete closure. The intrusive vowels that occur between liquids and neighboring consonants in a number of languages, e.g., Chilean Spanish c[o]ronica “chronicle,” Mono ɡąf[ū]rū “mortar,” Dutch kal[a]m “quiet,” Scots Gaelic tar[a]v “bull,” may also be explained as a byproduct of timing, where the articulation of the liquid overlaps so little with the other consonant in the cluster that a simultaneous vowel articulation becomes briefly audible between them (see Hall, 2006, for extensive documentation and the analysis summarized here).
As already noted, Clements (1987) proposes that the phonology rather than timing is responsible for the stops which intrude between nasals and fricatives. This proposal rests first on the observation that (some) speakers of South African English never pronounce nasal-fricative sequences with intrusive stops, while (some) speakers of American English always do (Fourakis & Port, 1986). According to Clements, if members of a speech community behave systematically, then that behavior is necessarily encoded in the grammar, in this case, the phonological grammar of their language. He would presumably make a similar argument for the intrusive stops between laterals and fricatives in English, because not all English speakers produce them, and for the intrusive vowels documented between liquids and other consonants by Hall (2006), because they’re not observed in all languages.
But Liberman and Pierrehumbert (1984) suggest an equally plausible alternative, that dialects or languages could differ systematically in the timing and coordination regimes for the articulations of successive consonants. A telling piece of evidence in support of this alternative is that the stops that intrude between nasals and fricatives are considerably shorter than those that realize a /t/ in the input, i.e., in prince compared to prints (Fourakis & Port, 1986). The nasal is also shorter when the /t/ arises in the input. Clements (1987) accounts for the difference in the stop’s duration in a feature-geometric analysis by spreading an oral cavity node that dominates the [−continuant] and place specifications of the nasal to the following fricative. Because the result is a sequence of [−continuant][+continuant] specifications dominated by a single timing slot, i.e., warm[p͡θ], prin[t͡s]e, and len[k͡θ], the stop closure is shorter than it would be if it occupied an entire timing slot by itself. Because the stop shares a timing slot with the fricative rather than occupying one of its own, this account also explains why the nasal is not shortened as much as it would be when the /t/ arises in the input. A genuine cluster would compress all the segments in the rime more.
Clements also argues that the intruded stop must be phonological because it is glottalized in some dialects. In his account, glottalization of voiceless stops in codas, too, is a phonological process because it is produced by the members of some speech communities and not by the members of others. On this logic, if glottalization is a phonological process, then stop intrusion must be, too, because only a phonological process can feed another one. This is unconvincing on two grounds. First, my own pronunciation shows that it is possible for a speaker to consistently glottalize voiceless stops in syllable codas, but to not glottalize those that intrude between nasals and fricatives. If stop intrusion doesn’t consistently feed glottalization, perhaps it’s not a phonological process. Second and less personally, glottalization itself is plausibly a dialect-specific phonetic implementation process, which could be realized on any voiceless closure in the right prosodic context, regardless of whether that closure originates in the input or as the result of a particular pattern of articulatory timing and coordination. That is, there’s no need to presume any ordering of processes of the sort found in phonological derivations; intrusion and glottalization would instead be simultaneous (and independent to account for my speech). More generally still, a common characteristic of intrusive vowels does not lend itself to such an account: they are most often identical in quality to the vowels occurring on the other side of the liquid. Assuming the intrusive vowels constitute a vowel node, achieving identity by autosegmental spreading would entail crossing association lines,29 whereas a phonetic account in which the liquid is loosely coordinated with the preceding consonant and overlaps with the simultaneous vowel gesture faces no such problem (see Hall, 2006, and references cited therein for further discussion of differences between cases that are handled well with autosegments versus gestures).
Although this discussion implies that the division of explanatory labor between phonetics and phonology can be extended quite readily and confidently from intonation to other sound patterns, these remarks should instead be read as only tentatively suggesting an alternative analysis of a very limited if frequently occurring phenomenon, the intrusive segments produced by particular patterns of articulatory coordination. A more comprehensive effort to test the generality of this division of labor can scarcely be said to have begun, much less progressed to a point where one can indeed confidently assert that phonetic implementation is responsible for most of what is observed in patterns of articulations, acoustics, or perception, and phonology is responsible for relatively little (see Kingston & Diehl, 1994; Stevens & Keyser, 2010, pp. 423–426 and p. 18, respectively), nor that the reverse is true and the phonetics cedes that responsibility to the phonology.
This lack of progress can probably be attributed to three obvious facts. First, in most cases, no other elements in a phonological representation correspond to a single phonetic correlate in the way that the tones that represent intonational contrasts correspond to f0. And f0 is only the most accessible phonetic correlate, among many, of tonal contrasts.
Second, the phonetic correlates of other contrasts, besides being multiple, vary substantially in their realization across contexts, speakers, and languages, such that it’s been difficult to proceed beyond mere descriptions of their variation to generalizable formal statements of their implementation.
The third fact requires more discussion. It is the accumulation of evidence that apparently similar or even identical phonological categories are implemented phonetically in different ways across languages. One such example was already discussed, the differences in the timing of soft palate lowering in vowels preceding nasals in English versus Spanish documented by Solé (1992, 1995). An additional example is the differences between languages in the duration of the delay in the onset of voicing following the release of the stop in voiceless unaspirated and aspirated or short- and long-lag stops (VOT) documented by Cho and Ladefoged (1999). This example illustrates particularly well the uncertainty about the extent to which a sound pattern can be explained phonetically, because it shows that variation can be explained only in part by referring to the physics and physiology of speech production. Cho and Ladefoged set out to describe and explain the finding that across languages, velar stops have longer VOTs than dental or alveolar (henceforth “coronal”) and bilabial stops. Cho and Ladefoged document the various purely phonetic factors that could contribute to this difference in VOT between velar and more anterior places of articulation. The first three lead to a slower venting of the air trapped behind a velar than a more anterior closure, a slower drop in intraoral air pressure above the glottis, and a longer delay in producing a large enough pressure drop across the glottis for voicing to begin:
All four of these factors could contribute to the longer VOTs observed in velar than in bilabial stops.
Because Cho and Ladefoged’s (1999) sample includes relatively few languages with voiceless aspirated stops and of those only two have bilabial voiceless aspirated stops, the rest of this discussion is limited to comparisons of voiceless unaspirated stops in the 13 (out of 18) languages that have them at bilabial, coronal, and velar places of articulation. Mean VOTs (standard errors) are 16 ms (1.5), 18 ms (1.6), and 37 ms (3.0), respectively. One-sided paired t-tests30 show that bilabial and coronal VOTs don’t differ (t(12) = −1.06, p > 0.1), but bilabial and coronal VOTs are both significantly shorter than velar VOTs (t(12) = −6.95, p < 0.001; t(12) = −6.88, p < 0.001). Thus far, the voiceless unaspirated VOTs differ across places as predicted from the phonetic factors listed above: bilabial ≈ coronal < velar.
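For readers who wish to reproduce this kind of comparison, the sketch below runs paired t-tests of the sort just reported, pairing VOT values by language. The values are randomly generated placeholders, not Cho and Ladefoged’s (1999) measurements.

```python
import numpy as np
from scipy import stats

# One-sided paired t-tests comparing VOTs across places of articulation, paired
# by language. The values below are invented placeholders; they are not
# Cho and Ladefoged's (1999) measurements.

rng = np.random.default_rng(0)
n_langs = 13
bilabial = rng.normal(16, 5, n_langs)   # ms, one value per language
coronal  = rng.normal(18, 5, n_langs)
velar    = rng.normal(37, 10, n_langs)

# H1: bilabial < velar (one-sided); scipy >= 1.6 supports the `alternative` argument.
t_bv, p_bv = stats.ttest_rel(bilabial, velar, alternative="less")
t_cv, p_cv = stats.ttest_rel(coronal, velar, alternative="less")
t_bc, p_bc = stats.ttest_rel(bilabial, coronal, alternative="two-sided")

print(f"bilabial vs velar:   t({n_langs - 1}) = {t_bv:.2f}, p = {p_bv:.4f}")
print(f"coronal  vs velar:   t({n_langs - 1}) = {t_cv:.2f}, p = {p_cv:.4f}")
print(f"bilabial vs coronal: t({n_langs - 1}) = {t_bc:.2f}, p = {p_bc:.4f}")
```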
But Figure 13.3 shows that there is considerable language-specific variation in how much longer velar VOTs are relative to either bilabial or coronal VOTs. Because velar VOTs were just shown to be significantly longer than either bilabial or coronal VOTs, it’s expected that all the points would lie above the y = x line, but the figure also shows that velar VOTs do not scale with either bilabial or coronal VOTs: the lines fit by linear regression to the points both have positive slopes, but these slopes are not significantly different from 0, and the extent of the spread of values around these lines (the gray shading) shows that little of the variation in velar VOTs is predicted from variation in either the bilabial or coronal VOTs. Instead, velar VOTs vary unpredictably between languages, from modestly to very much greater than either bilabial or coronal VOTs. Velar VOTs in some languages, e.g., Yapese, Wari’, Montana Salish, Navajo, and Hupa, are long enough to be treated as aspirated in other languages, while others are considerably shorter, e.g., Apache, Defaka, Tsou, Gaelic, Khonoma Angami, and Dahalo. The language with the longest VOT for velar stops, Yapese, doesn’t have contrasting voiceless aspirated stops, but two of the languages with long velar VOTs contrast them with voiceless aspirated stops with even longer VOTs, and only one of the languages with shorter VOTs, Apache, contrasts them with voiceless aspirated stops (see Cho & Ladefoged, 1999, for more extensive observations of language-specific differences in VOTs). These differences are not predicted by the physical and physiological differences between velar and more anterior stops listed previously.
[Figure 13.3: each language’s velar VOT plotted against its bilabial and coronal VOTs, with linear regression fits. Source: Cho and Ladefoged (1999).]
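A companion sketch shows how the regressions summarized in Figure 13.3 could be run: regress each language’s velar VOT on its bilabial (or coronal) VOT and test whether the slope differs from zero. Again, the values are invented placeholders rather than Cho and Ladefoged’s (1999) data.

```python
import numpy as np
from scipy import stats

# Regress per-language velar VOT on bilabial VOT and test whether the slope
# differs from 0. The values are invented placeholders, not Cho and Ladefoged's.

rng = np.random.default_rng(1)
bilabial = rng.normal(16, 5, 13)        # ms, one value per language
velar = 37 + rng.normal(0, 10, 13)      # unrelated to bilabial by construction

result = stats.linregress(bilabial, velar)
print(f"slope = {result.slope:.2f}, p = {result.pvalue:.3f}, r^2 = {result.rvalue**2:.3f}")
# A slope indistinguishable from 0 means variation in bilabial VOT predicts
# little of the cross-language variation in velar VOT, as in Figure 13.3.
```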
Kingston and Diehl (1994) offer two related arguments that speakers’ phonetic behavior cannot be entirely predicted by the physical and physiological constraints on articulations and on the transduction of those articulations into the acoustic properties of the speech signal. The first argument was built on a review of explanations for the differences in f0 in vowels following stops that contrast for [voice]: f0 is lower following [+voice] than [−voice] obstruents. These f0 differences are observed in languages like English, where the [voice] contrast is otherwise realized as a difference between voiceless unaspirated versus voiceless aspirated (or short- versus long-lag) stops at the beginnings of words, and in languages like French, where the contrast is instead realized as a difference between voiced versus voiceless unaspirated (or prevoiced versus short-lag) stops. This review led us to reject all attempts to explain the covariation of f0 with VOT as the automatic byproducts of physical or physiological dependencies between the articulations that produce the differences in VOT and those that produce the differences in f0 (see also Dmitrieva, Llanos, Shultz, & Francis, 2015). In place of such explanations, we next argued that the responsible articulations were controlled independently by speakers in order to combine acoustic properties that would integrate auditorily: a lower f0 in a following vowel would integrate with an earlier onset of voicing, either in a prevoiced or short-lag stop, to produce a percept of low-frequency energy, while a higher f0 would instead integrate with a later onset of voicing, either in a short- or a long-lag stop, to produce a percept of the absence of low-frequency energy. Further support for this second argument was reported by Kingston and Diehl (1995) and Kingston, Diehl, Kirk, and Castleman (2008); various challenges are offered by Hanson (2009), Holt, Lotto, and Kluender (2001), and Kirby and Ladd (2016).
If speakers can and do exercise such detailed control over their articulations, then the language-specific differences in VOT documented by Cho and Ladefoged (1999) are entirely unremarkable, even if the specific values observed in each language are still unexplained. These observations also suggest that speakers and listeners have learned a phonetic grammar along with their phonological grammar in the course of acquiring competence in their language. A language’s phonetic grammar consists of the procedures for implementing the language’s phonological representations in ways that characterize the behavior of native speakers and listeners of that language, and that can distinguish their behavior from that of non-native speakers and listeners.
As an aide-mémoire for the reader, the principal studies cited in this review are listed as follows, with brief reminders of their arguments:
This variety in points of view may dishearten a reader who has gotten this far, by leaving them to wonder whether any consensus is possible regarding the nature of the interface between phonology and phonetics. After all, they have read three different arguments denying that there is such an interface or any need for one, and the positive proposals also differ dramatically in their accounts of the interface. This uncertainty and the debates sketched in this chapter show that there is indeed no consensus, but instead at least this multitude of points of view. In fact, this review is certainly not the only attempt to discuss and develop an account of the interface between phonetics and phonology, nor do those other attempts agree with the account developed here. The interested reader would therefore also do well to consult at least Keating (1988a, 1988b, 1990), Kingston (2007), and Scobbie (2007) for further discussion and different points of view regarding the interface between phonology and phonetics.
This is of course an entirely unsatisfactory state of affairs. Getting satisfaction will not, however, depend on forcibly reconciling this multitude to a single well-agreed-upon point of view, but instead on a concerted effort to study the interface(s), by developing testable models of the relationship between the speaker’s phonological specification of an utterance and its implementation as articulations and their acoustic consequences, and similarly developing testable models of the relationship between an utterance’s acoustic properties, their auditory transformation, and their mapping onto the listener’s phonological representation. In this way of putting things, there are at least two interfaces where implementation can be studied, modeled, and formalized, from the speaker to speech and from speech to the listener. This way of putting things also suggests how progress may be made, by focusing on one or another of these two interfaces, and devising proof-of-concept experiments followed up by more thorough investigation once the proof is obtained. It is because this is the only way to get satisfaction that this chapter closed with a discussion of the division of explanatory labor between phonetics and phonology in the account of English, Japanese, and Swedish intonation. As sketched in that discussion, the proposed division of labor was arrived at by developing just such a model of how a tune’s phonological representation is implemented in the speaker’s production of the corresponding intonation contour in these languages. That division of labor was extended, in favor of a phonetic explanation, to accounting for the occurrence and characteristics of intrusive segments. But this is just potential progress on a small part of the speaker-to-speech interface; an equally general approach to the speech-to-listener interface is lacking, largely because of unresolved debates regarding the objects of speech perception (Diehl, Lotto, & Holt, 2004; Fowler, 2006; Lotto & Holt, 2006) and the extent to which the listener’s linguistic knowledge interacts with their perception of the signal’s acoustic properties (McClelland, Mirman, & Holt, 2006; Norris, McQueen, & Cutler, 2000). This recommendation also implies that developing formal models of the phonetic grounding of sound inventories and patterns may best be achieved by first developing such models of phonetic implementation, rather than trying to adjudicate the division of labor on the grounds of a priori assumptions of what is properly phonology versus phonetics.
One certain consequence of developing such models of implementation and grounding will be profound and unexpected changes in our conceptions of what phonological representations consist of, as well as of the processes and constraints acting on them. It is equally certain that our conceptions of phonetic implementation will change profoundly and unexpectedly, too. But both kinds of changes should be welcome, so long as they are supported by sound empirical work and fully general formal models.
I remain very grateful to the editors, William Katz and Peter Assmann, for their invitation to think hard once again about the interface between phonetics and phonology, and even more so for their patience in waiting for me to complete this chapter. I also appreciate the enormous care and interest of three reviewers, Ryan Bennett, Marie Huffman, and Amanda Rysling, whose close readings and clear guidance regarding what needed to be revised, reworked, added, or discarded have not only made this chapter more readable and intelligible than it would have been without their efforts but have also focused and improved my own thinking about the phenomena and theories. Any remaining errors and infelicities are mine alone.
Anderson, V. B. (1997). The perception of coronals in Western Arrente. In Proceedings of Eurospeech ’97: Fifth European conference on speech communication and technology (Vol. 1, pp. 389–392).
Archangeli, D., & Pulleyblank, D. (1994). Grounded phonology. Cambridge, MA: MIT Press.
Bach, E., & Harms, R. (1972). How do languages get crazy rules? In R. Stockwell & R. Macaulay (Eds.), Linguistic change and generative theory (pp. 1–21). Bloomington, IN: Indiana University Press.
Beckman, J. (1997). Positional faithfulness, positional neutralization, and Shona vowel harmony. Phonology, 14(1), 1–46.
Beddor, P. S. (2007). Nasals and nasalization: The relationship between segmental and coarticulatory timing. In J. Trouvain & W. J. Barry (Eds.), Proceedings of the XVIth international congress of phonetic sciences (pp. 249–254). Saarbrücken, Germany: Saarland University.
Beddor, P. S., & Krakow, R. A. (1999). Perception of coarticulatory nasalization by speakers of English and Thai: Evidence for partial compensation. The Journal of the Acoustical Society of America, 106, 2868–2887.
Beddor, P. S., & Onsuwan, C. (2003). Perception of prenasalized stops. In M. J. Solé, D. Recasens, & J. Romero (Eds.), Proceedings of the XVth international congress of phonetic sciences (pp. 407–410). Barcelona: Causal Productions.
Beguš, G. (2017). Post-nasal devoicing and a probabilistic model of phonological typology. Ms.
Benkí, J. R. (2003). Analysis of English nonsense syllable recognition in noise. Phonetica, 60, 129–157.
Bermúdez-Otero, R., & Börjars, K. (2006). Markedness in phonology and in syntax: The problem of grounding. Lingua, 116, 710–756.
Blevins, J. (2004). Evolutionary phonology: The emergence of sound patterns. Cambridge: Cambridge University Press.
Blust, R. (2005). Must sound change be linguistically motivated? Diachronica, 22, 219–269.
Boersma, P. (1998). Functional phonology: Formalizing the interaction between articulatory and perceptual drives. The Hague: Holland Academic Graphics.
Boersma, P. (2011). A programme for bidirectional phonology and phonetics and their acquisition and evolution. In A. Benz & J. Mattausch (Eds.), Bidirectional optimality theory (pp. 33–72). Amsterdam: John Benjamins.
Boersma, P., & Hamann, S. (2009). Cue constraints and their interaction in perception and production. In P. Boersma & S. Hamann (Eds.), Phonology in perception (pp. 55–110). Berlin: Mouton de Gruyter.
Browman, C. P., & Goldstein, L. M. (1986). Towards an articulatory phonology. Phonology Yearbook, 3, 219–252.
Browman, C. P., & Goldstein, L. M. (1988). Some notes on syllable structure in articulatory phonology. Phonetica, 45, 140–155.
Browman, C. P., & Goldstein, L. M. (1989). Articulatory gestures as phonological units. Phonology, 6, 201–251.
Browman, C. P., & Goldstein, L. M. (1990). Tiers in articulatory phonology, with some implications for casual speech. In J. Kingston & M. E. Beckman (Eds.), Papers in laboratory phonology I (pp. 341–376). Cambridge: Cambridge University Press.
Browman, C. P., & Goldstein, L. M. (1992). Articulatory phonology: An overview. Phonetica, 49, 155–180.
Browman, C. P., & Goldstein, L. M. (1995a). Dynamics and articulatory phonology. In R. F. Port & T. V. Gelder (Eds.), Mind as motion: Explorations in the dynamics of cognition (pp. 175–193). Cambridge, MA: MIT Press.
Browman, C. P., & Goldstein, L. M. (1995b). Gestural syllable position effects in American English. In F. Bell-Berti & L. J. Raphael (Eds.), Producing speech: Contemporary issues. For Katherine Harris (pp. 19–33). New York, NY: AIP Press.
Bruce, G. (1977). Swedish word accents in sentence perspective. Lund: Gleerup.
Cho, T., & Ladefoged, P. (1999). Variation and universals in VOT: Evidence from 18 languages. Journal of Phonetics, 27, 207–229.
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York, NY: Harper & Row.
Clements, G. N. (1983). The hierarchical representation of tone features. In I. R. Dihoff (Ed.), Current approaches to African linguistics (pp. 145–176). Dordrecht: Foris.
Clements, G. N. (1987). Phonological feature representation and the description of intrusive stops. In A. Bosch, B. Need, & E. Schiller (Eds.), CLS 23: Parasession on autosegmental and metrical phonology (pp. 29–50). Chicago, IL: Chicago Linguistic Society.
Clements, G. N. (2003). Feature economy in sound systems. Phonology, 20(3), 287–333.
Coetzee, A. W., & Pretorius, R. (2010). Phonetically grounded phonology and sound change: The case of Tswana labial plosives. Journal of Phonetics, 38, 404–421.
Cohn, A. C. (1993). Nasalisation in English: Phonology or phonetics. Phonology, 10, 43–81.
Cohn, A. C. (2007). Is there gradient phonology? In G. Fanselow, C. Féry, M. Schlesewsky, & R. Vogel (Eds.), Gradience in grammar: Generative perspectives (pp. 25–44). Oxford: Oxford University Press.
Cutler, A., Weber, A., Smits, R., & Cooper, N. (2004). Patterns of English phoneme confusions by native and non-native listeners. The Journal of the Acoustical Society of America, 116, 3668–3678.
de Boer, B. (2000). Self-organization in vowel systems. Journal of Phonetics, 28, 441–465.
de Lacy, P., & Kingston, J. (2013). Synchronic explanation. Natural Language and Linguistic Theory, 31, 287–355.
Denes, P. B., & Pinson, E. (1963). The speech chain. New York, NY: Bell Telephone Laboratories.
Diehl, R. L., Lotto, A. J., & Holt, L. L. (2004). Speech perception. Annual Review of Psychology, 55, 149–179.
Dmitrieva, O., Llanos, F., Shultz, A. A., & Francis, A. L. (2015). Phonological status, not voice onset time, determines the acoustic realization of onset f0 as a secondary voicing cue in Spanish and English. Journal of Phonetics, 49, 77–95.
Donegan, P. J., & Stampe, D. (1979). The study of natural phonology. In D. A. Dinnsen (Ed.), Current approaches to phonological theory (pp. 126–173). Bloomington, IN: Indiana University Press.
Donegan, P. J., & Stampe, D. (2009). Hypotheses of natural phonology. Poznań Studies in Contemporary Linguistics, 45(1), 1–31.
Dubno, J. R., & Levitt, H. (1981). Predicting consonant confusions from acoustic analysis. The Journal of the Acoustical Society of America, 69, 249–261.
Flack, K. (2007). The sources of phonological markedness (Ph.D. dissertation). University of Massachusetts, Amherst.
Flemming, E. (2001). Scalar and categorical phenomena in a unified model of phonetics and phonology. Phonology, 18(1), 7–44.
Fourakis, M., & Port, R. (1986). Stop epenthesis in English. Journal of Phonetics, 14, 197–221.
Fowler, C. A. (1983). Converging sources of evidence on spoken and perceived rhythms of speech: Cyclic production of vowels in monosyllabic stress feet. Journal of Experimental Psychology: General, 112(3), 386.
Fowler, C. A. (1986a). An event approach to the study of speech perception from a direct realist perspective. Journal of Phonetics, 14, 3–28.
Fowler, C. A. (1986b). Reply to commentators. Journal of Phonetics, 14, 149–170.
Fowler, C. A. (2005). Parsing coarticulated speech in perception: Effects of coarticulation resistance. Journal of Phonetics, 33, 199–213.
Fowler, C. A. (2006). Compensation for coarticulation reflects gesture perception, not spectral contrast. Perception and Psychophysics, 68, 161–177.
Fowler, C. A., & Smith, M. (1986). Speech perception as vector analysis: An approach to the problems of segmentation and invariance. In J. Perkell & D. H. Klatt (Eds.), Invariance and variability of speech processes (pp. 123–136). Hillsdale, NJ: Lawrence Erlbaum Associates.
Gafos, A. I. (2006). Dynamics in grammar: Comment on Ladd and Ernestus & Baayen. In L. M. Goldstein, D. H. Whalen, & C. T. Best (Eds.), Laboratory phonology 8: Varieties of phonological competence (pp. 51–79). Berlin: Mouton de Gruyter.
Gafos, A. I., & Goldstein, L. M. (2012). Articulatory representation and organization. In A. C. Cohn, C. Fougeron, & M. K. Huffman (Eds.), Oxford handbook of laboratory phonology (pp. 220–231). Oxford: Oxford University Press.
Garrett, A. (2015). Sound change. In C. Bowern & B. Evans (Eds.), The Routledge handbook of historical linguistics (pp. 227–248). London: Routledge.
Gibson, J. J. (1982). Notes on affordances. In E. Reed & R. Jones (Eds.), Reasons for realism: Selected essays of James J. Gibson (pp. 401–418). Hillsdale, NJ: Lawrence Erlbaum Associates.
Gick, B. (2002). An x-ray investigation of pharyngeal constriction in American English schwa. Phonetica, 59, 38–48.
Goldstein, L. M., & Fowler, C. A. (2003). Articulatory phonology: A phonology for public language use. In N. O. Schiller & A. S. Meyer (Eds.), Phonetics and phonology in language comprehension and production (pp. 159–207). Berlin: Mouton de Gruyter.
Goldstein, L. M., Pouplier, M., Chen, L., Saltzman, E., & Byrd, D. (2007). Dynamic action units slip in speech production errors. Cognition, 103(3), 386–412.
Gouskova, M., Zsiga, E., & Boyer, O. T. (2011). Grounded constraints and the consonants of Tswana. Lingua, 121, 2120–2152.
Hale, M. R., & Reiss, C. (2000). Substance abuse and dysfunctionalism: Current trends in phonology. Linguistic Inquiry, 31, 157–169.
Hale, M. R., & Reiss, C. (2008). The phonological enterprise. Oxford: Oxford University Press.
Hall, N. (2006). Crosslinguistic patterns of vowel intrusion. Phonology, 23(3), 387–429.
Hanson, H. M. (2009). Effects of obstruent consonants on fundamental frequency at vowel onset in English. The Journal of the Acoustical Society of America, 125, 425–441.
Hayes, B. (1994). “Gesture” in prosody: Comments on the paper by Ladd. In P. A. Keating (Ed.), Papers in laboratory phonology III: Phonological structure and phonological form (pp. 64–74). Cambridge: Cambridge University Press.
Hayes, B. (1999). Phonetically-driven phonology: The role of optimality theory and inductive grounding. In M. Darnell, E. Moravscik, M. Noonan, F. Newmeyer, & K. Wheatley (Eds.), Functionalism and formalism in linguistics (Vol. 1, pp. 243–285). Amsterdam: John Benjamins.
Hayes, B., & Stivers, T. (1995). Postnasal voicing. Ms.
Heinz, J. (to appear). The computational nature of phonological generalizations. In L. M. Hyman & F. Plank (Eds.), Phonological typology. Berlin: Mouton de Gruyter.
Heinz, J., & Idsardi, W. J. (2017). Computational phonology today. Phonology, 34, 211–291.
Hockett, C. (1955). A manual of phonology. International Journal of American Linguistics Monograph Series, 21, Memoir 11.
Holt, L. L., Lotto, A. J., & Kluender, K. R. (2001). Influence of fundamental frequency on stop consonant voicing perception: A case of learned covariation or auditory enhancement? The Journal of the Acoustical Society of America, 109(2), 764–774.
Nam, H., Goldstein, L. M., & Saltzman, E. (2009). Self-organization of syllable structure: A coupled oscillator model. In F. Pellegrino, E. Marsico, & I. Chitoran (Eds.), Approaches to phonological complexity (pp. 299–328). Berlin and New York, NY: Mouton de Gruyter.
Hubbard, K. (1995). ‘Prenasalized consonants’ and syllable timing: Evidence from Runyambo and Luganda. Phonology, 12, 235–256.
Huffman, M. K. (1993). Phonetic patterns of nasalization and implications for feature specification. In M. K. Huffman & R. A. Krakow (Eds.), Nasals, nasalization, and the velum (pp. 303–327). New York, NY: Academic Press.
Hyman, L. M. (1986). The representation of multiple tone heights. In K. Bogers, H. V. D. Hulst, & M. Mous (Eds.), The phonological representation of suprasegmentals (pp. 109–152). Dordrecht: Foris.
Hyman, L. M. (2001). The limits of phonetic determinism in phonology: *NC revisited. In E. Hume & K. Johnson (Eds.), The role of speech perception in phonology (pp. 141–185). San Diego: Academic Press.
Iskarous, K., Fowler, C. A., & Whalen, D. H. (2010). Locus equations are an acoustic expression of articulatory synergy. The Journal of the Acoustical Society of America, 128, 2021–2032.
Iskarous, K., Mooshammer, C., Hoole, P., Recasens, D., Shadle, C., Saltzman, E., & Whalen, D. H. (2013). The coarticulation/invariance scale: Mutual information as a measure of coarticulation resistance, motor synergy, and articulatory invariance. The Journal of the Acoustical Society of America, 134, 1271–1282.
Kaun, A. (1995). The typology of rounding harmony: An optimality theoretic account (Ph.D. dissertation). University of California, Los Angeles.
Kaun, A. (2004). The typology of rounding harmony. In B. Hayes, R. Kirchner, & D. Steriade (Eds.), Phonetically based phonology (pp. 87–116). Cambridge: Cambridge University Press.
Keating, P. (1988a). The phonology-phonetics interface. In F. J. Newmeyer (Ed.), Linguistics: The Cambridge survey (Vol. 1, pp. 281–302). Cambridge: Cambridge University Press.
Keating, P. (1988b). Underspecification in phonetics. Phonology, 5, 275–292.
Keating, P. (1990). Phonetic representation in generative grammar. Journal of Phonetics, 18, 321–334.
Kelso, J. A. S., Saltzman, E., & Tuller, B. (1986a). The dynamical theory of speech production: Data and theory. Journal of Phonetics, 14, 29–60.
Kelso, J. A. S., Saltzman, E., & Tuller, B. (1986b). Intentional contents, communicative context, and task dynamics: A reply to the commentators. Journal of Phonetics, 14, 171–196.
Kingston, J. (2007). The phonetics-phonology interface. In P. de Lacy (Ed.), The Cambridge handbook of phonology (pp. 435–456). Cambridge: Cambridge University Press.
Kingston, J., & Diehl, R. L. (1994). Phonetic knowledge. Language, 70, 419–454.
Kingston, J., & Diehl, R. L. (1995). Intermediate properties in the perception of distinctive feature values. In B. Connell & A. Arvaniti (Eds.), Phonology and phonetics: Papers in laboratory phonology (Vol. IV, pp. 7–27). Cambridge: Cambridge University Press.
Kingston, J., Diehl, R. L., Kirk, C. J., & Castleman, W. A. (2008). On the internal perceptual structure of distinctive features: The [voice] contrast. Journal of Phonetics, 36(1), 28–54.
Kingston, J., & Macmillan, N. (1995). Integrality of nasalization and F1 in vowels in isolation and before oral and nasal consonants: A detection-theoretic application of the Garner paradigm. The Journal of the Acoustical Society of America, 97, 1261–1285.
Kirby, J. P., & Ladd, D. R. (2016). Effects of obstruent voicing on vowel f0: Evidence from true “voicing” languages. The Journal of the Acoustical Society of America, 140, 2400–2411.
Krakow, R. A. (1989). The articulatory organization of syllables: A kinematic analysis of labial and velar gestures (Ph.D. dissertation). Yale University.
Krakow, R. A. (1999). Physiological organization of syllables: A review. Journal of Phonetics, 27, 23–54.
Ladd, D. R. (1994). Constraints on the gradient variability of pitch range, or, Pitch level 4 lives! In P. A. Keating (Ed.), Papers in laboratory phonology III: Phonological structure and phonological form (pp. 43–63). Cambridge: Cambridge University Press.
Liberman, M., & Pierrehumbert, J. B. (1984). Intonational invariance under changes in pitch range and length. In M. Aronoff & R. T. Oehrle (Eds.), Language sound structure (pp. 157–233). Cambridge, MA: MIT Press.
Liljencrants, J., & Lindblom, B. (1972). Numerical simulation of vowel quality systems: The role of perceptual contrast. Language, 48, 839–862.
Lindblom, B. (1986). Phonetic universals in vowel systems. In J. J. Ohala & J. J. Jaeger (Eds.), Experimental phonology (pp. 13–44). Orlando: Academic Press.
Linker, W. (1982). Articulatory and acoustic correlates of labial activity in vowels: A cross-linguistic study. UCLA Working Papers in Phonetics, 56.
Lotto, A. J., & Holt, L. L. (2006). Putting phonetic context effects into context: A commentary on Fowler (2006). Perception and Psychophysics, 68, 178–183.
Macmillan, N., Kingston, J., Thorburn, R., Walsh Dickey, L., & Bartels, C. (1999). Integrality of nasalization and F1. II. Basic sensitivity and phonetic labeling measure distinct sensory and decision-rule interactions. The Journal of the Acoustical Society of America, 106, 2913–2932.
Maddieson, I. (1989). Prenasalized stops and speech timing. Journal of the International Phonetic Association, 19, 57–66.
Maddieson, I., & Ladefoged, P. (1993). Phonetics of partially nasalized consonants. In M. K. Huffman & R. A. Krakow (Eds.), Nasals, nasalization, and the velum (pp. 251–301). New York, NY: Academic Press.
Maddieson, I., & Precoda, K. (1992). Syllable structure and phonetic models. Phonology, 9, 45–60.
McCarthy, J. J. (1993). A case of surface rule inversion. Canadian Journal of Linguistics, 38, 169–195.
McClelland, J., Mirman, D., & Holt, L. (2006). Are there interactive processes in speech perception? Trends in Cognitive Sciences, 10, 363–369.
Mielke, J. (2008). The emergence of distinctive features. Oxford: Oxford University Press.
Mielke, J. (2011). The nature of distinctive features and the issue of natural classes. In A. C. Cohn, C. Fougeron, & M. K. Huffman (Eds.), Handbook of laboratory phonology (pp. 185–196). Oxford: Oxford University Press.
Morén, B. (2007). The division of labor between segment-internal structure and violable constraints. In S. Blaho, P. Bye, & M. Krämer (Eds.), Freedom of analysis? (pp. 313–344). Berlin: Mouton de Gruyter.
Moreton, E., & Pater, J. (2012a). Structure and substance in artificial-phonology learning. Part 1: Structure. Language and Linguistics Compass, 6, 686–701.
Moreton, E., & Pater, J. (2012b). Structure and substance in artificial-phonology learning. Part 2: Substance. Language and Linguistics Compass, 6, 702–718.
Myers, S. (2000). Boundary disputes: The distinction between phonetic and phonological sound patterns. In N. Burton-Roberts, P. Carr, & G. Docherty (Eds.), Phonological knowledge: Conceptual and empirical issues (pp. 245–272). Oxford: Oxford University Press.
Norris, D., McQueen, J. M., & Cutler, A. (2000). Merging information in speech recognition: Feedback is never necessary. Behavioral and Brain Sciences, 23, 299–370.
Ohala, J. J. (1979). Phonetic universals in phonological systems and their explanation. In Proceedings of the 9th international congress of phonetic sciences (pp. 5–8).
Ohala, J. J. (1981). The listener as a source of sound change. In Proceedings of the Chicago Linguistic Society: Papers from the parasession on language and behavior (pp. 178–203). Chicago, IL: Chicago Linguistic Society.
Ohala, J. J. (1983). The phonological end justifies any means. In S. Hattori & K. Inoue (Eds.), Proceedings of the 13th international congress of linguists (pp. 232–243). Tokyo: Sanseido Shoten.
Ohala, J. J. (1990). There is no interface between phonology and phonetics: A personal view. Journal of Phonetics, 18, 153–171.
Ohala, J. J. (2005). Phonetic explanations for sound patterns: Implications for grammars of competence. In W. J. Hardcastle & J. M. Beck (Eds.), A figure of speech, a festschrift for John Laver (pp. 23–38). London: Lawrence Erlbaum Associates.
Ohala, J. J., & Jaeger, J. J. (1986). Experimental phonology. Orlando: Academic Press.
Padgett, J. (2002). Constraint conjunction versus grounded constraint subhierarchies in optimality theory. Ms., University of California, Santa Cruz.
Pasquereau, J. (2018). Phonological degrees of labiality. Language, 94(4), e216–e265.
Pater, J. (1999). Austronesian nasal substitution and other NC effects. In R. Kager, H. van der Hulst, & W. Zonneveld (Eds.), The prosody-morphology interface (pp. 310–343). Cambridge: Cambridge University Press.
Pierrehumbert, J. B. (1980). The phonetics and phonology of English intonation (Doctoral dissertation). Massachusetts Institute of Technology.
Pierrehumbert, J. B. (1990). Phonological and phonetic representation. Journal of Phonetics, 18, 375–394.
Pierrehumbert, J. B., & Beckman, M. E. (1988). Japanese tone structure. Cambridge, MA: MIT Press.
Pierrehumbert, J. B., Beckman, M. E., & Ladd, D. R. (2000). Conceptual foundations of phonology as a laboratory science. In N. Burton-Roberts, P. Carr, & G. Docherty (Eds.), Phonological knowledge: Conceptual and empirical issues (pp. 273–303). Oxford: Oxford University Press.
Pouplier, M., & Goldstein, L. M. (2010). Intention in articulation: Articulatory timing in alternating consonant sequences and its implications for models of speech production. Language and Cognitive Processes, 25(5), 616–649.
Rafferty, A. N., Griffiths, T. L., & Ettlinger, M. (2013). Greater learnability is not sufficient to produce cultural universals. Cognition, 129(1), 70–87.
Recasens, D. (1985). Coarticulatory patterns and degrees of coarticulatory resistance in Catalan CV sequences. Language and Speech, 28(2), 97–114.
Redford, M. A., & Diehl, R. L. (1999). The relative perceptual distinctiveness of initial and final consonants in CVC syllables. The Journal of the Acoustical Society of America, 106, 1555–1565.
Reiss, C. (2017). Substance free phonology. In S. J. Hannahs & A. Bosch (Eds.), Routledge handbook of phonological theory. Abingdon: Routledge.
Roon, K. D., & Gafos, A. I. (2016). Perceiving while producing: Modeling the dynamics of phonological planning. Journal of Memory and Language, 89, 222–243.
Saltzman, E. (1995). Dynamics and coordinate systems in skilled sensorimotor activity. In R. F. Port & T. van Gelder (Eds.), Mind as motion: Explorations in the dynamics of cognition (pp. 149–173). Cambridge, MA: MIT Press.
Saltzman, E., & Kelso, J. A. S. (1987). Skilled actions: A task dynamic approach. Psychological Review, 94, 84–106.
Saltzman, E., & Munhall, K. (1989). A dynamical approach to gestural modeling in speech production. Ecological Psychology, 1, 333–382.
Schwartz, J. L., Boë, L. J., Vallée, N., & Abry, C. (1997a). Major trends in vowel system inventories. Journal of Phonetics, 25, 233–253.
Schwartz, J. L., Boë, L. J., Vallée, N., & Abry, C. (1997b). The dispersion focalization theory of vowel systems. Journal of Phonetics, 25, 255–286.
Scobbie, J. (2007). Interface and overlap in phonetics and phonology. In G. Ramchand & C. Reiss (Eds.), The Oxford handbook of linguistic interfaces (pp. 17–52). Oxford: Oxford University Press.
Shaw, J. A., Gafos, A. I., Hoole, P., & Zeroual, C. (2009). Syllabification in Moroccan Arabic: Evidence from patterns of stability in articulation. Phonology, 26, 187–215.
Shaw, J. A., Gafos, A. I., Hoole, P., & Zeroual, C. (2011). Dynamic invariance in the phonetic expression of syllable structure: A case study of Moroccan Arabic consonant clusters. Phonology, 28(03), 455–490.
Smith, J. (2002). Phonological augmentation in prominent positions (Ph.D. dissertation). University of Massachusetts, Amherst.
Smolensky, P., & Goldrick, M. (2016). Gradient symbolic representations in grammar: The case of French liaison. Rutgers Optimality Archive, 1552.
Solé, M. J. (1992). Phonetic and phonological processes: The case of nasalization. Language and Speech, 35, 29–43.
Solé, M. J. (1995). Spatio-temporal patterns of velopharyngeal action in phonetic and phonological nasalization. Language and Speech, 38, 1–23.
Solé, M. J. (2007). Controlled and mechanical properties of speech: A review of the literature. In M. J. Solé, P. S. Beddor, & J. J. Ohala (Eds.), Experimental approaches to phonology (pp. 302–321). Oxford: Oxford University Press.
Solé, M. J., Hyman, L. M., & Monaka, K. C. (2010). More on post-nasal devoicing: The case of Shekgalagari. Journal of Phonetics, 38, 604–615.
Sproat, R., & Fujimura, O. (1993). Allophonic variation in English /l/ and its implications for phonetic implementation. Journal of Phonetics, 21, 291–311.
Staubs, R. (2014). Computational modeling of learning biases in stress typology (Unpublished doctoral dissertation). University of Massachusetts, Amherst.
Steriade, D. (2001). Directional asymmetries in place assimilation: A perceptual account. In E. Hume & K. Johnson (Eds.), The role of speech perception in phonology (pp. 219–250). New York, NY: Academic Press.
Steriade, D. (2008). The phonology of perceptibility effects: The P-map and its consequences for constraint organization. In K. Hanson & S. Inkelas (Eds.), The nature of the word: Studies in honor of Paul Kiparsky (pp. 151–180). Cambridge, MA: MIT Press.
Stevens, K. N., & Keyser, S. J. (2010). Quantal theory, enhancement and overlap. Journal of Phonetics, 38, 10–19.
Summers, W. V. (1987). Effects of stress and final-consonant voicing on vowel production: Articulatory and acoustic analyses. The Journal of the Acoustical Society of America, 82, 847–863.
Wang, M. D., & Bilger, R. C. (1973). Consonant confusions in noise: A study of perceptual features. The Journal of the Acoustical Society of America, 54, 1248–1266.