Critique of Fowler’s Direct Realism

Randy L. Diehl

Section IA (starting on p. 7) outlines the tenets of direct realism and suggests that auditory perception is not fundamentally different from visual and haptic perception in being environment-directed, or direct. That is, the perceptual system is assumed to acquaint the perceiver with the distal causes of the stimulation. In speech perception, these distal causes are vocal-tract activities, or what Carol (1989) elsewhere refers to as "phonetic gestures." Because several of Carol's main arguments presuppose the correctness, or at least the plausibility, of the direct-realist account of speech perception, it is necessary first to consider whether speech perception really is analogous to, say, the visual detection of surface layout in the environment.

Even before I had heard of J. J. Gibson, I don't believe that I ever doubted that observers actually experience surface properties and states of motion of objects in the environment. With respect to visual perception of surface layout, I guess that I've always been a direct realist. There are at least four arguments to be made for the direct-realist characterization of visual perception.

The first argument is experiential. When observers look at properly illuminated tables and chairs, they see tables and chairs. They become acquainted with them in the perfectly obvious sense that they can describe their surface properties--size, shape, orientation, texture, etc.--in almost unlimited detail. And, generally speaking, observers agree in their descriptions of those surface properties.

The second argument is ecological. Carol is correct to say that "if perceivers are to survive...[p]erceptual systems must acquaint perceivers with the environment in which they participate as actors" (p. 7). For the most part, it is the surface properties and states of motion of objects in the environment that perceivers must be visually acquainted with in order to find food and potential mates and to avoid predators, cliffs, and dangerous flying objects.

The third argument has to do with the nature of the relation between the distal source of the visual stimulation and the optical information that is presumed to specify that distal source. In order for direct realism to work in principle, the specificational relation between information and distal source must be unique and unambiguous. That is, there must be sufficient information (potentially available to an observer free to explore the environment) to uniquely specify the shape, orientation, etc. of the distal object. We have every reason to believe that this fundamental condition on the specificational relation is satisfied in the case of visual detection of surface layout, assuming a few general constraints hold (see, for example, Ullman, 1984). An important characteristic of these general constraints, e.g., that most objects can be assumed to be rigid in their motion, is that they can be independently verified on perceptual grounds.

The final argument also involves the specificational relation between information and distal source. The theory of direct perception assumes that the informational medium (e.g., optical or acoustic signals) is directly structured by the distal event that is perceived. It is this direct structuring that makes light and sound informative about the sources that impose the structure.
In the optical case, it is evident that the informational medium really is directly structured by the distal events (i.e., the layout and motion of environmental surfaces) that are the objects of direct perception.

Now, the question is: do these four arguments (let's call them, respectively, the arguments from experience, from ecological advantage, from specificational uniqueness, and from direct informational structuring) also apply in support of a direct-realist account of audition in general and of speech perception in particular?

First, consider the argument from experience. Carol correctly points out that there are at least some properties of sound-producing events that listeners do experience auditorily and about which they can give accurate descriptions: sound location is a good example. But many other properties of sound-producing events seem not to find their way into the listener's auditory experience. The desktop computer in front of me makes a low-frequency noisy sound that I happen to know is produced by a fan. However, if I didn't already know this (from reading the manual and visually inspecting the inside of the machine), I would be quite uncertain what electrical or mechanical events were structuring the sound. My auditory experience is certainly not that of a rotating fan blade.

In an earlier critique of Carol's direct-realist account of speech perception, I wrote the following: “Another apparent problem with articulatory gestures as objects of perception has to do with their accessibility to the observer. I can look around and describe the layout of my environment in rather remarkable detail. This is one of the reasons why I feel comfortable referring to the layout as an object of perception. However, when I ask a phonetically naive individual what is going on in my vocal tract when I talk, his knowledge is utterly deficient. He can describe my words and even my phonemes, but he has almost no intuitive grasp of my non-visible articulatory gestures or the changes they effect in vocal-tract shape. Conceivably, this profound difference in accessibility between surface layout and articulatory events is inconsequential. But the theory must make clear why this is so. In her paper [Fowler, 1986a], Fowler acknowledges "the failure of our intuitions in speech to recognize that perceived phonetic events are articulatory..." (p. 6). However, she goes on to suggest that part of the problem arises from deficiencies in the way articulatory events are described by most researchers. If there is a significant mismatch between these conventional descriptions and the actual level of description recovered by listeners, then perhaps it is no surprise that listeners' articulatory intuitions seem so impoverished. I do not think this argument is very convincing. The problem is not that listeners recover a different kind of articulatory description than the conventional ones offered by researchers; it is rather that they seem to have no reliable intuitions about articulation at all. Coordinated gestures, or coordinative structures (see Kelso, Saltzman & Tuller, 1986), may well be a significant theoretical improvement over conventional notions of articulation, but it seems evident that listeners have no more intuitive grasp of them than they do of any other level of description” (Diehl, 1986, pp. 62-63).

In her reply to the above comment, Carol wrote: “As for accessibility, that is not the only, or even the best, index as to whether articulated phonetic segments are perceived.
A more telling index is that we shadow or imitate speech both well and remarkably rapidly (e.g., Porter & Castellanos, 1980; Porter & Lubker, 1980). Thus, the facts appear to be that we do perceive speech as articulated, and yet we are not easily made aware that we do. As for why we are unaware, Remez' reference to tacit knowledge may be on the right track. Language is tiered and its ecologically most significant information is provided by levels more encompassing than the phonetic level. It may be very difficult to ignore them in order to attend to their constituents” (Fowler, 1986b, p. 154).

I must disagree with the first point. The ability to imitate speech (rapidly or otherwise) is certainly not telling evidence that we perceive articulated phonetic segments. Performance in this case could just as well reflect our ability to produce acoustic signals that resemble the ones we hear. As Diehl and Kluender (1989b) recently pointed out, the latter account has the added virtue of being able to handle, say, our ability to imitate a melody played on a musical instrument by whistling or humming (despite the fact that the original tune and the imitation are produced in very different ways). Although the ability to imitate speech is not telling evidence one way or the other, it would be very telling indeed if phonetic gestures were accessible to listeners in something like the way tables and chairs are visually accessible. The fact that they are not requires some explanation.

Carol offers one type of explanation in her second point of the above paragraph: that other levels of language beyond the articulatory are more ecologically significant. I agree, and this brings us to the argument from ecological advantage. Whereas visual acquaintance with the surface layout of the environment confers enormous survival advantage on an organism, auditory acquaintance with someone else's phonetic gestures per se appears to be ecologically inconsequential. What matters to a human listener is whether the meaning of an utterance is understood. Even assuming that listeners can recover phonetic gestures from the acoustic signal, there would appear to be no ecological requirement that they do so. If meanings can be accessed via the gestures, they can presumably also be accessed directly from the acoustic signal without first recovering the gestures.

At the level of auditory perception in general, there are of course some physical properties of sound sources that matter for the survival of organisms--again, location is a good example. But for many classes of sound source, there is no conceivable adaptive advantage that would accrue from perceiving the physical events that directly structure the sound. What matters in these cases is whether the listener is able to judge what kind of object or event is associated with a given sound. A couple of examples may clarify this point. A rodent that has experienced the sight, smell, and sound of a rattlesnake will probably make an avoidance response in the future upon hearing the distinctive rattle. Survival here depends not on directly perceiving the physical events that give rise to the rattle sound, but on knowing that a certain kind of predator makes a certain kind of sound. Similarly, if I'm walking in the African grasslands and I hear a lion's roar, my adaptive behavior depends on my knowing that lions make that kind of sound. Direct acquaintance with the lion's vocal-tract structures seems completely irrelevant.
Survival requires that auditory perception be environment-directed, but it does not require that it always or even typically be direct in Carol's sense. The point is that the argument from ecological advantage applies in the auditory case only selectively.

It is interesting that the arguments from experience and from ecological advantage seem convergently to favor certain source properties (e.g., location, velocity of motion) and convergently to disfavor others (e.g., human vocal-tract events) as likely objects of direct auditory perception. This convergence also appears to hold for the argument from specificational uniqueness. The spatial location of a sound source in a free field is redundantly specified by several distributed properties of the sound wave. We know that many organisms are capable of detecting these properties and, on that basis, of identifying source location with a high degree of accuracy. In other words, the specificational relation between information and source property in this case is virtually unambiguous.

Does a similar specificational uniqueness obtain in the case of sound sources such as the human vocal tract? The answer is clearly no. Even if we confine the discussion to actualizable vocal-tract shapes and aerodynamic conditions, there is typically an unlimited number of ways to produce any vocal signal. The reason is that most acoustic parameters depend on several different vocal-tract parameters, such that a change in the value of one parameter can be offset by changes in one or more other parameters. For example, one can achieve about the same formant pattern corresponding to /u/ either by rounding the lips, by lowering the larynx, or by doing a little bit of both. And among the various ways to implement the lip-rounding gesture, one can achieve about the same output by trading lip protrusion against lip constriction. Carol (1989) has argued that, since phonetic gestures are construed as equivalence classes defined by a common phonetic end (e.g., lip closure), this many-to-one mapping problem is mitigated. However, as Diehl and Kluender (1989b) replied, many of the kinds of trade-offs illustrated above are cross-gestural, making it virtually impossible for the listener to know which gesture or combination of gestures is responsible for a given acoustic pattern.

Moreover, the many-to-one mapping between source and sound is not limited to the articulatory domain. For example, the Reynolds number corresponding to the transition between laminar and turbulent airflow maps onto an infinite set of combinations of volume velocities (usually corresponding to velocity of lung collapse) and constriction areas. Within a range of such combinations, the acoustic outputs are virtually identical (a toy numerical illustration of this trade-off is sketched below).

Consistent with this specificational ambiguity, talkers are known to produce the same sound classes in gesturally diverse ways. Consider the following example. A primary distinctive acoustic property of the American English vowel /er/ is a low-frequency F3. It is theoretically possible to lower F3 by, among other things, constricting the vocal tract at any of the three antinodes in the volume-velocity waveform corresponding to F3 (Fant, 1960). These antinodes happen to occur at the lips, the midpalate, and the mid oral pharynx, and as Ohala (1985) and Lindau (1985) have pointed out, talkers tend to make constrictions at just these locations when producing /er/.
What is interesting is that some talkers constrict at all three points, whereas others tend to use only two, e.g., palatal and pharyngeal, or palatal and labial. And for the palatal constriction, some talkers use a bunched-tongue configuration, whereas others use a retroflex configuration. What all these diverse gestural types have in common is their similar acoustic effects. Now, the question is: why would all this gestural diversity occur in a speech community if its members were capable of recovering the actual source properties from the sound? Additional evidence for the same point is presented in Ladefoged et al. (1972).

Matters only get worse when we drop the restriction that the vocal parameters be physiologically actualizable in humans. Ladefoged et al. (1977) have shown that the same vowel formant patterns that can be generated by a realistic vocal configuration can be duplicated with a variety of physiologically unrealistic vocal configurations. Now, conceivably listeners use somesthetic knowledge of their own vocal tracts to winnow down the alternatives, but this should not be required if auditory perception really is direct. Also, this gambit would not be available to nonhumans listening to speech, and Carol (1989) has elsewhere suggested that nonhumans can directly perceive human phonetic gestures just as humans can.

What is true of human vocal tracts is also true of many other sound sources. The tone produced by a vibrating string is equivalent across a range of values of string length and string tension. [The kinematics may be the same, but the dynamics are different, and Carol's theory requires that listeners be able to recover the dynamics as well as the kinematics of sound sources (see, e.g., her treatment of the separate dynamical sources contributing to F0 variation in speech, Fowler, 1989).] Analogously, the same resonant sound may be initiated by an air pressure source created by piston-like compression, bellows-like compression, or even heat-induced expansion against a fixed container. Examples of this kind of source ambiguity are virtually unlimited.

Finally, any sound produced by a mechanical/aerodynamic system can of course be duplicated with an electrical or electronic analog device. Although humans may have enough knowledge in most cases to distinguish the two classes of sound source, this is typically not done on the basis of auditory perception alone. It may appear fanciful to ask how a quail or chinchilla knows whether it is listening to mechanical events in the human vocal tract or to electrical ones, but for the direct realist, this is a genuine difficulty. To say that the output of a speech synthesizer is a "mirage" that mimics the output of a real sound source is, as far as an animal listener is concerned, to beg the question. How could an animal know what is real and what is a mirage in this case? Mechanical systems and their electrical analogs are virtually isomorphic in terms of physical theory, so an appeal to parsimony isn't much help. It seems evident also that there is no selection pressure that would drive an animal to favor one source interpretation over the other. Notice that this kind of quandary arises only if one insists that organisms perceive sound-producing events rather than sounds.

Like the arguments from experience and from ecological advantage, the argument from specificational uniqueness is only selectively applicable in the case of auditory perception in general, and it does not apply in the particular case of speech perception.
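To make the many-to-one character of these source-to-sound mappings concrete, here is a minimal numerical sketch of my own; it is not part of the original argument. The parameter values are assumed for illustration, and the constriction is idealized as a circular orifice so that an effective diameter can be defined. Part (a) shows that distinct string length-tension pairs yield the same fundamental frequency; part (b) shows that distinct volume-velocity/constriction-area pairs yield the same Reynolds number, hence the same laminar/turbulent regime.

```python
"""Two toy many-to-one mappings from source parameters to an
acoustically relevant quantity. All numbers are illustrative
assumptions, not measurements."""
import math

# (a) Vibrating string: fundamental frequency f = (1/(2L)) * sqrt(T/mu).
# Fixing the pitch f, any string length L can be matched by a suitable
# tension T, so the tone does not uniquely specify (L, T).
mu = 0.001                  # assumed linear density, kg/m
f_target = 220.0            # target fundamental, Hz
for L in (0.30, 0.50, 0.65):
    T = mu * (2 * L * f_target) ** 2          # tension yielding f_target
    f = math.sqrt(T / mu) / (2 * L)
    print(f"L = {L:.2f} m, T = {T:6.2f} N -> f = {f:.1f} Hz")

# (b) Airflow through a constriction: Reynolds number
# Re = rho * v * d / eta, with particle velocity v = U/A for volume
# velocity U and constriction area A, and effective diameter
# d = 2 * sqrt(A/pi) for a circular orifice. Since Re ~ U/sqrt(A),
# different (U, A) pairs with the same ratio give the same Re.
rho, eta = 1.2, 1.8e-5      # density and viscosity of air, SI units

def reynolds(U, A):
    v = U / A                                 # particle velocity, m/s
    d = 2.0 * math.sqrt(A / math.pi)          # effective diameter, m
    return rho * v * d / eta

for U, A in ((2e-4, 1e-5), (4e-4, 4e-5)):     # equal U/sqrt(A)
    print(f"U = {U:.0e} m^3/s, A = {A:.0e} m^2 -> Re = {reynolds(U, A):.0f}")
```

Both pairs in part (b) produce a Reynolds number of about 4,760, even though the volume velocities and constriction areas differ by factors of two and four.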
The failure of the first two arguments to carry over from the domain of visual perception of surface layout is certainly not helpful for the direct-realist case, but neither is it devastating. As we have seen, Carol attributes the relative inaccessibility of gestures to their relatively lowly status in the hierarchy of language, and this is not an unreasonable argument. (The same argument might be invoked to explain why the acoustic/auditory properties of phonetic segments are often not very accessible. But, then, since I make no claims that the perception of acoustic signals is closely analogous to the perception of tables and chairs, I am not very bothered about having to invoke such an argument.) And the fact that it is irrelevant to our survival whether we perceive phonetic gestures or their acoustic consequences does not imply that the direct-realist account is wrong. However, the failure of the argument from specificational uniqueness to apply generally in the auditory case is another matter altogether. When the specificational relation between sound and sound source is fundamentally ambiguous, the direct-realist account simply cannot be true.

Before leaving this topic, I should point out that audition does not provide the only examples of specificational ambiguity. Although surface layout can be recovered by perceivers, they cannot unambiguously detect the spectral distribution of light (even within the visible range of wavelengths) and hence cannot recover the true physical reflectance of an object. The fact that color perception schemes are usually based on three kinds of photopigments implies that an infinite number of possible spectral distributions will converge on a single distribution of photopigment absorption values. (As in the speech case, this ambiguity appears to be of limited ecological significance.) The point of this example is that specificational uniqueness (or ambiguity) is not a property of a given sense modality per se. Rather, it follows from the particular physical and physiological principles that apply in a given case. Contrary to what Carol suggests, we are not claiming that speech or auditory perception is special vis-à-vis the rest of perception. What we are claiming is that, for direct perception to be possible in a given domain, there has to be a certain kind of specificational relation between the information and the distal source. In both vision and audition, this specificational relation exists for some source properties but not for others.
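The trichromatic point lends itself to a small numerical demonstration. The sketch below is my own illustration under stated assumptions: the three sensitivity curves are idealized Gaussians rather than measured cone fundamentals, but the linear-algebraic moral carries over to any three-receptor scheme: a 3-by-N sensitivity matrix has a large null space, so infinitely many physically distinct spectra (metamers) yield identical absorption triples.

```python
"""Toy demonstration of metamerism: with three photopigment classes,
infinitely many spectra map onto one triple of absorption values.
Sensitivity curves are idealized Gaussians, not real cone data."""
import numpy as np

wl = np.linspace(400, 700, 31)                 # sampled wavelengths, nm

def pigment(peak, width=40.0):
    """Idealized Gaussian absorption curve peaking at `peak` nm."""
    return np.exp(-0.5 * ((wl - peak) / width) ** 2)

S = np.stack([pigment(560), pigment(530), pigment(420)])   # 3 x 31

spectrum = np.ones_like(wl)                    # a flat "white" spectrum

# Rows 3..30 of Vt from the SVD span the null space of S; adding any
# such vector to the spectrum leaves all three absorptions unchanged.
_, _, Vt = np.linalg.svd(S)
perturb = Vt[-1] / np.abs(Vt[-1]).max()        # one null-space direction
metamer = spectrum + 0.5 * perturb             # still nonnegative

print("absorptions, original:", S @ spectrum)
print("absorptions, metamer: ", S @ metamer)
print("largest spectral difference:", np.abs(metamer - spectrum).max())
```

The two printed absorption triples agree to floating-point precision even though the spectra differ by as much as 0.5 at some wavelengths: the three receptors cannot distinguish them.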
The final reason I cited for favoring a direct-realist account in the case of visual perception of environmental layout was the argument from direct informational structuring. Light is informative about surfaces in the environment because it is the surfaces that directly structure the light. (Our previous discussion implies that such direct structuring is a necessary but not a sufficient condition for the direct-realist account to work.) Now consider the case of speech perception. In reply to a comment by Ohala (1986), Carol (1986b) wrote: “Ohala asks why, if I was going to focus on something other than the acoustic signal as an object of perception, I picked articulatory activity when there are other possibilities further "upstream," including muscle contractions, neuronal activity and mental events... One answer to Ohala's question is that, in a theory of direct perception, the distal event has to structure the informational medium; otherwise it cannot be directly perceived. That rules out anything that might be truly "upstream" from articulation” (p. 150).

This response explicitly invokes the argument from direct informational structuring to eliminate events such as muscle contractions as likely objects of direct perception. Although muscle contractions ultimately affect the acoustic signal, their influence is not as direct as that of articulatory events (i.e., phonetic gestures), which makes the latter the more appropriate sound sources to be recovered by the perceiver. But it is necessary to ask: do phonetic gestures themselves directly structure the acoustic signal, or is there some other class of events that is even more directly responsible for structuring the acoustic signal? The answer to this is unequivocal: the speech acoustic signal is directly structured by the aerodynamic properties of the vocal tract, which are in turn created by combinations of initiatory, phonatory, and articulatory events. As J. C. Catford (1977) reminds us: “The aerodynamic phase is also an extremely important one. It is the link between the speaker's bodily activity (in organic phase) and the resultant sound waves (in the acoustic phase). We must always remember that organic postures and movements do not themselves generate sounds [his italics]: they merely create the necessary aerodynamic conditions, for the generation of speech sounds is in all cases an aerodynamic process. Some of the organic activities cause pressure changes in the vocal tract, which result in a flow of air; other organic activities regulate the flow in ways that create sounds, either channelling the air-flow through narrow spaces, generating the audible hiss of turbulence, or by allowing it to burst forth in rapid periodic puffs, generating the sound of voice, and so on” (p. 11).

Thus, the argument from direct informational structuring, which Carol has used to eliminate events "upstream" of phonetic gestures (e.g., muscle contractions) from consideration as objects of direct perception, also effectively eliminates phonetic gestures from such consideration. It would seem that the direct-realist account of speech perception must be recast with aerodynamic events assuming the role previously played by articulatory events, or else some independent motivation must be found for assigning a privileged status to phonetic gestures.

In sum, the attempt to draw a close analogy between visual perception of environmental layout and auditory perception in general (or speech perception in particular) encounters serious difficulties, and I am not optimistic that these can be overcome. Moreover, there is another important reason to doubt that listeners actually recover phonetic gestures when they perceive speech sounds. Lindblom and his colleagues (Liljencrants and Lindblom, 1972; Lindblom, 1986; Lindblom, MacNeilage, and Studdert-Kennedy, in preparation) have shown that, among various selection criteria that govern the structure of phonetic segment inventories, one of the most salient is the principle of dispersion: segments tend to be arranged in the phonetic space so as to maximize perceptual distinctiveness. Diehl and Kluender (1989a) have argued that an articulatory construal of the phonetic space to which the dispersion principle applies cannot account for the relevant facts, and that the dispersion principle rather applies to an auditorily defined space (see critical commentaries by Fowler, 1989, Remez, 1989, and Studdert-Kennedy, 1989, and a reply by Kluender and Diehl, 1989b).
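To convey what the dispersion principle amounts to computationally, here is a deliberately crude sketch of my own: it scatters five "vowels" in a schematic F1-F2 rectangle and keeps the configuration that maximizes the minimum pairwise distance. Liljencrants and Lindblom (1972) used a different optimization criterion and a perceptually scaled space; the formant ranges, the Euclidean metric, and the random-search procedure here are assumptions for illustration only.

```python
"""Crude random-search sketch of the dispersion principle: place N
vowel points in a schematic F1-F2 space so that the minimum pairwise
distance is as large as possible. Ranges and metric are assumed."""
import itertools
import math
import random

random.seed(1)
N = 5
F1_RANGE = (250.0, 850.0)      # assumed F1 limits, Hz
F2_RANGE = (600.0, 2500.0)     # assumed F2 limits, Hz

def random_system():
    """One candidate vowel system: N random points in the space."""
    return [(random.uniform(*F1_RANGE), random.uniform(*F2_RANGE))
            for _ in range(N)]

def min_pairwise(points):
    """Smallest distance between any two vowels in a system."""
    return min(math.dist(p, q)
               for p, q in itertools.combinations(points, 2))

# Keep the most dispersed of many random vowel systems.
best = max((random_system() for _ in range(20000)), key=min_pairwise)
for f1, f2 in sorted(best):
    print(f"F1 = {f1:4.0f} Hz, F2 = {f2:4.0f} Hz")
print(f"minimum pairwise distance: {min_pairwise(best):.0f} Hz")
```

As expected, the surviving points crowd the periphery of the space, much as peripheral vowels such as /i/, /a/, and /u/ dominate the world's vowel inventories. Nothing in this sketch settles whether the space over which dispersion operates is articulatory or auditory; that is precisely the point at issue between Diehl and Kluender (1989a) and their commentators.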
Stevens and Keyser (1989) have also presented evidence favoring an auditory construal of the dispersion principle. Accordingly, I am obliged to reject as unsound any arguments leveled against Diehl and Walsh (1989) that presuppose the correctness or plausibility of the direct-realist account of auditory perception in general or of speech perception in particular.