Critique of Fowler's Direct Realism
Randy L. Diehl
Section IA (starting on p. 7) outlines the tenets of direct realism and
suggests that auditory perception is not fundamentally different from
visual and haptic perception in being environment-directed or direct.
That is, the perceptual system is assumed to acquaint the perceiver
with the distal causes of the stimulation. In speech perception, these
distal causes are vocal-tract activities, or what Carol (1989)
elsewhere refers to as "phonetic gestures." Because several of Carol's
main arguments presuppose the correctness or at least the plausibility
of the direct-realist account of speech perception, it is necessary
first to consider whether speech perception really is analogous to,
say, the visual detection of surface layout in the environment.
Even before hearing of J.J. Gibson, I don't believe that I ever doubted
that observers actually experience surface properties and states of
motion of objects in the environment. With respect to visual
perception of surface layout, I guess that I've always been a direct
realist. There are at least four arguments to be made for the direct-realist characterization of visual perception. The first argument is
experiential. When observers look at properly illuminated tables and
chairs, they see tables and chairs. They become acquainted with them
in the perfectly obvious sense that they can describe their surface
properties--size, shape, orientation, texture, etc.--in almost
unlimited detail. And, generally speaking, observers agree in their
descriptions of those surface properties.
The second argument is ecological. Carol is correct to say that "if
perceivers are to survive...[p]erceptual systems must acquaint
perceivers with the environment in which they participate as actors"
(p. 7). For the most part, it is the surface properties and states of
motion of objects in the environment that perceivers must be visually
acquainted with in order to find food and potential mates and to avoid
predators, cliffs, and dangerous flying objects.
The third argument has to do with the nature of the relation between
the distal source of the visual stimulation and the optical information
that is presumed to specify that distal source. In order for direct
realism to work in principle, it is necessary that the specificational
relation between information and distal source be unique and
unambiguous. That is, there must be sufficient information
(potentially available to an observer free to explore the environment)
to uniquely specify the shape, orientation, etc. of the distal object.
We have every reason to believe that this fundamental condition on the
specificational relation is satisfied in the case of visual detection
of surface layout, assuming a few general constraints hold (see, for
example, Ullman, 1984). An important characteristic of these general
constraints, e.g., that most objects can be assumed to be rigid in
their motion, is that they can be independently verified on perceptual
grounds.
The final argument also involves the specificational relation between
information and distal source. The theory of direct perception assumes
that the informational medium (e.g., optical or acoustic signals) is
directly structured by the distal event that is perceived. It is this
direct structuring that makes light and sound informative about the
sources that impose the structure. In the optical case, it is evident
that the informational medium really is directly structured by the
distal events (i.e., the layout and motion of environmental surfaces)
that are the objects of direct perception.
Now, the question is: Do these four arguments (let's call them,
respectively, the arguments from experience, from ecological advantage,
from specificational uniqueness, and from direct informational
structuring) also apply in support of a direct-realist account of
audition in general and of speech perception in particular? First,
consider the argument from experience. Carol correctly points out that
there are at least some properties of sound-producing events that
listeners do experience auditorily and about which they can give
accurate descriptions: sound location is a good example. But many
other properties of sound-producing events seem not to find their way
into the listener's auditory experience. The desktop computer in front
of me makes a low-frequency noisy sound that I happen to know is
produced by a fan. However, if I didn't already know this (from
reading the manual and visually inspecting the inside of the machine),
I would be quite uncertain what electrical or mechanical events were
structuring the sound. My auditory experience is certainly not that of
a rotating fan blade.
In an earlier critique of Carol's direct-realist account of speech
perception, I wrote the following:
“Another apparent problem with articulatory gestures as
objects of perception has to do with their accessibility to the
observer. I can look around and describe the layout of my
environment in rather remarkable detail. This is one of the
reasons why I feel comfortable referring to the layout as an object
of perception. However, when I ask a phonetically naive
individual what is going on in my vocal tract when I talk, his
knowledge is utterly deficient. He can describe my words and
even my phonemes, but he has almost no intuitive grasp of my
non-visible articulatory gestures or the changes they effect in
vocal-tract shape. Conceivably, this profound difference in
accessibility between surface layout and articulatory events is
inconsequential. But the theory must make clear why this is so.
In her paper [Fowler, 1986a], Fowler acknowledges
"the failure of our intuitions in speech to recognize that
perceived phonetic events are articulatory..." (p. 6). However, she
goes on to suggest that part of the problem arises from
deficiencies in the way articulatory events are described by most
researchers. If there is a significant mismatch between these
conventional descriptions and the actual level of description
recovered by listeners, then perhaps it is no surprise that
listeners' articulatory intuitions seem so impoverished.
I do not think this argument is very convincing. The
problem is not that listeners recover a different kind of
articulatory description than the conventional ones offered by
researchers; it is rather that they seem to have no reliable
intuitions about articulation at all. Coordinated gestures, or
coordinative structures (see Kelso, Saltzman & Tuller, 1986),
may well be a significant theoretical improvement over
conventional notions of articulation, but it seems evident that
listeners have no more intuitive grasp of them than they do of any
other level of description” (Diehl, 1986, pp. 62-63).
In her reply to the above comment, Carol wrote:
“As for accessibility, that is not the only, or even the best,
index as to whether articulated phonetic segments are perceived.
A more telling index is that we shadow or imitate speech both
well and remarkably rapidly (e.g., Porter & Castellanos, 1980;
Porter & Lubker, 1980). Thus, the facts appear to be that we do
perceive speech as articulated, and yet we are not easily made
aware that we do. As for why we are unaware, Remez' reference
to tacit knowledge may be on the right track. Language is tiered
and its ecologically most significant information is provided by
levels more encompassing than the phonetic level. It may be very
difficult to ignore them in order to attend to their constituents”
(Fowler, 1986b, p. 154).
I must disagree with the first point. The ability to imitate speech
(rapidly or otherwise) is certainly not telling evidence that we
perceive articulated phonetic segments. Performance in this case could
just as well reflect our ability to produce acoustic signals that
resemble the ones we hear. As Diehl and Kluender (1989b) recently
pointed out, the latter account has the added virtue of being able to
handle, say, our ability to imitate a melody played on a musical
instrument by whistling or humming (despite the fact that the original
tune and the imitation are produced in very different ways). Although
the ability to imitate speech is not telling evidence one way or the
other, it would be very telling indeed if phonetic gestures were
accessible to listeners in something like the way tables and chairs are
visually accessible. The fact that they are not requires some
explanation. Carol offers one type of explanation in her second point
of the above paragraph. It is that other levels of language beyond the
articulatory are more ecologically significant. I agree, and this
brings us to the argument from ecological advantage.
Whereas visual acquaintance with the surface layout of the environment
confers enormous survival advantage on an organism, auditory
acquaintance with someone else's phonetic gestures per se appears to be
ecologically inconsequential. What matters to a human listener is
whether the meaning of an utterance is understood. Even assuming that
listeners can recover phonetic gestures from the acoustic signal, there
would appear to be no ecological requirement that they do so. If
meanings can be accessed via the gestures, they can presumably also be
accessed directly from the acoustic signal without first recovering the
gestures.
At the level of auditory perception in general, there are of course
some physical properties of sound sources that matter for the survival
of organisms--again, location is a good example. But for many classes
of sound source, there is no conceivable adaptive advantage that would
accrue from perceiving the physical events that directly structure the
sound. What matters in these cases is whether the listener is able to
judge what kind of object or event is associated with a given sound. A
couple of examples may clarify this point. A rodent that has
experienced the sight, smell, and sound of a rattlesnake will probably
make an avoidance response in the future upon hearing the distinctive
rattle. Survival here depends not on directly perceiving the physical
events that give rise to the rattle sound, but on knowing that a
certain kind of predator makes a certain kind of sound. Similarly, if
I'm walking in the African grasslands and I hear a lion's roar, my
adaptive behavior depends on my knowing that lions make that kind of
sound. Direct acquaintance with the lion's vocal-tract structures seems
completely irrelevant. Survival requires that auditory perception be
environment-directed, but it does not require that it always or even
typically be direct in Carol's sense. The point is that the argument
from ecological advantage applies in the auditory case only selectively.
It is interesting that the arguments from experience and from
ecological advantage seem convergently to favor certain source
properties (e.g., location, velocity of motion) and convergently to
disfavor others (e.g., human vocal-tract events), as likely objects of
direct auditory perception. This convergence also appears to hold for
the argument from specificational uniqueness. The spatial location of
a sound source in a free field is redundantly specified by several
distributed properties of the sound wave. We know that many organisms
are capable of detecting these properties and, on that basis, of
identifying source location with a high degree of accuracy. In other
words, the specificational relation between information and source
property in this case is virtually unambiguous.
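To make the uniqueness claim concrete, here is a minimal sketch (my illustration, not part of the original argument) based on the classic Woodworth spherical-head approximation: within a lateral hemifield, the mapping from source azimuth to interaural time difference is monotonic and hence invertible, with redundant cues (level and spectral differences) resolving the residual front-back ambiguity. The head radius and the formula's idealizations are assumptions of the sketch.

```python
# A minimal sketch: azimuth-to-ITD mapping under the Woodworth
# spherical-head approximation. The monotonicity of the mapping is
# what makes source direction (near-)uniquely specified.
import math

HEAD_RADIUS_M = 0.0875   # assumed average head radius
SPEED_OF_SOUND = 343.0   # m/s, at roughly 20 degrees C

def itd_woodworth(azimuth_rad: float) -> float:
    """Interaural time difference for a distant source at a given azimuth."""
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (azimuth_rad + math.sin(azimuth_rad))

def azimuth_from_itd(itd_s: float) -> float:
    """Invert the mapping by bisection: one ITD -> one azimuth in [0, pi/2]."""
    lo, hi = 0.0, math.pi / 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if itd_woodworth(mid) < itd_s:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for deg in (10, 30, 60, 90):
    itd = itd_woodworth(math.radians(deg))
    recovered = math.degrees(azimuth_from_itd(itd))
    print(f"azimuth {deg:2d} deg -> ITD {itd*1e6:6.1f} us -> recovered {recovered:5.1f} deg")
```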
Does a similar specificational uniqueness obtain in the case of sound
sources such as the human vocal tract? The answer is clearly no. Even
if we confine the discussion to actualizable vocal-tract shapes and
aerodynamic conditions, there are typically an unlimited number of ways
to produce any vocal signal. The reason for this is that most acoustic
parameters are dependent on several different vocal-tract parameters,
such that a change in the value of one parameter can be offset by
changes in one or more other parameters. For example, one can achieve
about the same formant pattern corresponding to /u/ either by rounding
the lips, by lowering the larynx, or by doing a little bit of both.
And among the various ways to implement the lip rounding gesture, one
can achieve about the same output by trading lip protrusion against lip
constriction. Carol (1989) has argued that, since phonetic gestures
are construed as equivalence classes defined by a common phonetic end
(e.g., lip closure), this many-to-one mapping problem is mitigated.
However, as Diehl and Kluender (1989b) replied, many of the kinds of
trade-offs illustrated above are cross-gestural, making it virtually
impossible for the listener to know which gesture or combination of
gestures is responsible for a given acoustic pattern. Moreover, the
many-to-one mapping between source and sound is not limited to the
articulatory domain. For example, the Reynolds number corresponding to
the transition between laminar and turbulent airflow maps onto an
infinite set of combinations of volume-velocities (usually
corresponding to velocity of lung collapse) and constriction areas.
Within a range of such combinations, the acoustic outputs are virtually
identical.
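The Reynolds-number point can be made concrete with a back-of-the-envelope sketch. The formula below is a simplifying assumption of mine (treating the constriction as a duct of area A, with particle velocity U/A and characteristic dimension sqrt(A)), not a formula from the commentary; it shows how a whole family of volume-velocity/constriction-area combinations collapses onto a single Reynolds number, and hence onto essentially the same aerodynamic regime.

```python
# A minimal sketch under stated assumptions: particle velocity v = U/A and
# characteristic dimension d = sqrt(A), so Re = v*d/nu = U / (nu * sqrt(A)).
# Any family of (U, A) pairs with constant U/sqrt(A) then yields one Re.
import math

NU_AIR = 1.5e-5  # kinematic viscosity of air, m^2/s (approximate)

def reynolds(volume_velocity_m3s: float, area_m2: float) -> float:
    velocity = volume_velocity_m3s / area_m2
    dimension = math.sqrt(area_m2)
    return velocity * dimension / NU_AIR

# Scale U and A together so that U/sqrt(A) stays constant: Re is unchanged.
base_U, base_A = 2e-4, 1e-5   # 200 cm^3/s through a 10 mm^2 constriction
for scale in (1.0, 2.0, 5.0):
    U, A = base_U * math.sqrt(scale), base_A * scale
    print(f"U = {U*1e6:7.1f} cm^3/s, A = {A*1e6:5.1f} mm^2, Re = {reynolds(U, A):8.1f}")
```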
Consistent with this specificational ambiguity, talkers are known to
produce the same sound classes in gesturally diverse ways. Consider
the following example. A primary distinctive acoustic property of the
American English vowel /er/ is a low-frequency F3. It is
theoretically possible to lower F3 by, among other things, constricting
the vocal tract at any of the three antinodes in the volume-velocity
waveform corresponding to F3 (Fant, 1960). These antinodes happen to
occur at the lips, the midpalate, and the mid oral pharynx, and as
Ohala (1985) and Lindau (1985) have pointed out, talkers tend to make
constrictions at just these locations when producing /er/. What is
interesting is that some talkers constrict at all three points, whereas
others tend to use only two, e.g., palatal and pharyngeal, or palatal
and labial. And for the palatal constriction, some talkers use a
bunched-tongue configuration, whereas others use a retroflex
configuration. What all these diverse gestural types have in common is
their similar acoustic effects. Now, the question is: why would all
this gestural diversity occur in a speech community if its members were
capable of recovering the actual source properties from the sound?
Additional evidence for the same point is presented in Ladefoged et
al. (1972).
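The acoustic equivalence of these constriction sites follows from standard perturbation theory. The toy sketch below assumes a uniform tube closed at the glottis and open at the lips (a deliberate idealization; Fant's treatment is far more detailed): the volume-velocity antinodes for F3 fall at x/L = 0.2, 0.6, and 1.0, roughly the mid pharynx, the palate, and the lips, and a constriction at any of them lowers F3 by the same first-order amount.

```python
# A toy uniform-tube perturbation sketch (Chiba-Kajiyama style), not Fant's
# full model: for a tube closed at the glottis and open at the lips, the
# volume-velocity standing wave of formant n is U(x) ~ sin(k_n x), with
# k_n L = (2n-1) pi / 2. First-order perturbation theory gives
#   dF/F ~ (dA/A) * (sin^2(k x) - cos^2(k x)),
# so a constriction (dA < 0) at a volume-velocity antinode lowers the formant.
import math

L = 0.175          # assumed vocal-tract length, m
n = 3              # third formant
k = (2 * n - 1) * math.pi / (2 * L)

def relative_f3_shift(x: float, dA_over_A: float) -> float:
    """First-order relative formant shift for a small area change at x."""
    return dA_over_A * (math.sin(k * x) ** 2 - math.cos(k * x) ** 2)

# The three volume-velocity antinodes for F3 fall at x/L = 0.2, 0.6, 1.0:
# roughly the mid pharynx, the palate, and the lips.
for frac in (0.2, 0.6, 1.0):
    shift = relative_f3_shift(frac * L, dA_over_A=-0.3)  # a 30% constriction
    print(f"constriction at x/L = {frac:.1f}: dF3/F3 = {shift:+.2f}")
```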
Matters only get worse when we drop the restriction that the vocal
parameters be physiologically actualizable in humans. Ladefoged et
al. (1977) have shown that the same vowel formant patterns that can be
generated by a realistic vocal configuration can be duplicated with a
variety of physiologically unrealistic vocal configurations. Now,
conceivably listeners use somesthetic knowledge of their own vocal
tracts to winnow down the alternatives, but this should not be required
if auditory perception really is direct. Also, this gambit would not
be available to nonhumans listening to speech, and Carol (1989)
elsewhere has suggested that nonhumans can directly perceive human
phonetic gestures just as humans can.
What is true of human vocal tracts is also true of many other sound
sources. The tone produced by a vibrating string is equivalent across
a range of values of string length and string tension.
[The kinematics may be the same but the dynamics are different, and
Carol's theory requires that listeners be able to recover the dynamics
as well as the kinematics of sound sources (see, e.g., her treatment of
the separate dynamical sources contributing to F0 variation in speech,
Fowler, 1989).] Analogously, the same resonant sound may be initiated
by an air-pressure source created by piston-like compression,
bellows-like compression, or even heat-induced expansion against a fixed
container. Examples of this kind of source ambiguity are virtually
unlimited.
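For the string case, the equivalence is transparent from the textbook formula for the fundamental, f = sqrt(T/mu) / (2L): scaling length by s and tension by s^2 leaves f unchanged. A minimal sketch (the parameter values are mine, chosen only for plausibility):

```python
# A minimal sketch of the string example: fundamental f = sqrt(T/mu) / (2L).
# Scaling length by s and tension by s^2 leaves f unchanged, so one tone maps
# back onto a continuum of (length, tension) source configurations.
import math

MU = 0.006  # assumed linear density, kg/m

def fundamental_hz(length_m: float, tension_n: float) -> float:
    return math.sqrt(tension_n / MU) / (2 * length_m)

base_L, base_T = 0.65, 70.0   # a plausible guitar-like string
for s in (1.0, 1.2, 1.5):
    L, T = base_L * s, base_T * s * s
    print(f"L = {L:.3f} m, T = {T:6.1f} N -> f = {fundamental_hz(L, T):.1f} Hz")
```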
Finally, any sound produced by a mechanical/aerodynamic system can of
course be duplicated with an electrical or electronic analog device.
Although humans may have enough knowledge in most cases to distinguish
the two classes of sound source, this is typically not done on the
basis of auditory perception alone. It may appear fanciful to ask how
a quail or chinchilla knows whether it is listening to mechanical
events in the human vocal tract or to electrical ones, but for the
direct realist, this is a genuine difficulty. To say that the output
of a speech synthesizer is a "mirage" that mimics the output of a real
sound source is, as far as an animal listener is concerned, to beg the
question. How could an animal know what is real and what is a mirage
in this case? Mechanical systems and their electrical analogs are
virtually isomorphic in terms of physical theory, so an appeal to
parsimony isn't much help. It seems evident also that there is no
selection pressure that would drive an animal to favor one source
interpretation over the other. Notice that this kind of quandary only
arises if one insists that organisms perceive sound-producing events
rather than sounds.
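The isomorphism between mechanical systems and their electrical analogs can be stated exactly: a mass-spring-damper and a series RLC circuit obey formally identical second-order equations under the mapping m <-> L, b <-> R, k <-> 1/C, F(t) <-> V(t). The sketch below (my illustration, with arbitrary parameter values) integrates both with the same routine and confirms that matched drives yield identical output waveforms, so nothing in the output alone distinguishes the two sources.

```python
# A minimal sketch of the mechanical/electrical isomorphism: under the
# mapping m <-> L, b <-> R, k <-> 1/C, F <-> V, both systems reduce to the
# same equation, inertia*y'' + damping*y' + stiffness*y = drive(t).
import math

def euler_2nd_order(inertia, damping, stiffness, drive, dt=1e-4, steps=4000):
    """Forward-Euler response of inertia*y'' + damping*y' + stiffness*y = drive(t)."""
    y, v, out = 0.0, 0.0, []
    for i in range(steps):
        accel = (drive(i * dt) - damping * v - stiffness * y) / inertia
        v += accel * dt
        y += v * dt
        out.append(y)
    return out

drive = lambda t: math.sin(2 * math.pi * 50.0 * t)  # a 50 Hz drive

# Mechanical: m x'' + b x' + k x = F(t)
x = euler_2nd_order(inertia=0.01, damping=0.5, stiffness=400.0, drive=drive)
# Electrical: L q'' + R q' + (1/C) q = V(t), with L = m, R = b, 1/C = k
C = 1.0 / 400.0                # capacitance chosen so that 1/C = k
q = euler_2nd_order(inertia=0.01, damping=0.5, stiffness=1.0 / C, drive=drive)

print("max |x(t) - q(t)| =", max(abs(a - b) for a, b in zip(x, q)))  # ~0
```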
Like the arguments from experience and from ecological advantage, the
argument from specificational uniqueness is only selectively applicable
in the case of auditory perception in general, and it does not apply in
the particular case of speech perception. The failure of the first two
arguments to carry over from the domain of visual perception of surface
layout is certainly not helpful for the direct-realist case, but
neither is it devastating. As we have seen, Carol attributes the
relative inaccessibility of gestures to their relatively lowly status
in the hierarchy of language, and this is not an unreasonable argument.
(The same argument might be invoked to explain why the
acoustic/auditory properties of phonetic segments are often not very
accessible. But, then, since I make no claims that the perception of
acoustic signals is closely analogous to the perception of tables and
chairs, I am not very bothered about having to invoke such an
argument.) And the fact that it is irrelevant to our survival whether
we perceive phonetic gestures or their acoustic consequences does not
imply that the direct-realist account is wrong. However, the failure
of the argument from specificational uniqueness to apply generally in
the auditory case is another matter altogether. When the
specificational relation between sound and sound source is
fundamentally ambiguous, the direct-realist account simply cannot be
true.
Before leaving this topic, I should point out that audition does not
provide the only examples of specificational ambiguity. Although
surface layout can be recovered by perceivers, they cannot
unambiguously detect the spectral distribution of light (even within
the visible range of wavelengths) and hence recover the true physical
reflectance of an object. The fact that color perception schemes are
usually based on three kinds of photopigments implies that an infinite
number of possible spectral distributions will converge on a single
distribution of photopigment absorption values. (As in the speech
case, this ambiguity appears to be of limited ecological significance.)
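The metamer argument is, at bottom, linear algebra: three absorption curves define a linear map from an N-point spectrum to three absorption values, and the map's null space has dimension N - 3. The sketch below (my construction, using schematic Gaussian pigment curves rather than measured ones, and ignoring the physical nonnegativity constraint on spectra) builds two distinct spectra with identical receptor responses.

```python
# A minimal sketch of the metamer point, with schematic pigment curves:
# three absorption curves define a 3 x N linear map; any null-space vector
# can be added to a spectrum without changing the receptor responses.
import numpy as np

wavelengths = np.linspace(400.0, 700.0, 31)          # nm, visible range

def pigment(peak_nm, width_nm=40.0):
    """Schematic Gaussian absorption curve for one photopigment."""
    return np.exp(-0.5 * ((wavelengths - peak_nm) / width_nm) ** 2)

P = np.stack([pigment(440.0), pigment(535.0), pigment(565.0)])  # 3 x 31

spectrum = 1.0 + 0.5 * np.sin(wavelengths / 40.0)    # an arbitrary spectrum

# Any vector in the null space of P can be added without changing responses.
_, _, vt = np.linalg.svd(P)
null_vec = vt[-1]                                    # orthogonal to all rows of P
metamer = spectrum + 0.5 * null_vec

print("spectra differ by:   ", np.max(np.abs(spectrum - metamer)))            # > 0
print("responses differ by: ", np.max(np.abs(P @ spectrum - P @ metamer)))    # ~1e-16
```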
The point of this example is that specificational uniqueness (or
ambiguity) is not a property of a given sense modality per se. Rather,
it follows from the particular physical and physiological principles
that apply in a given case. Contrary to what Carol suggests, we are
not claiming that speech or auditory perception is special vis-à-vis
the rest of perception. What we are claiming is that, for direct
perception to be possible in a given domain, there has to be a certain
kind of specificational relation between the information and the distal
source. In both vision and audition, this specificational relation
exists for some source properties, but not for others.
The final reason I cited for favoring a direct-realist account in the
case of visual perception of environmental layout was the argument from
direct informational structuring. Light is informative about surfaces
in the environment, because it is the surfaces that directly structure
the light. (Our previous discussion implies that such direct
structuring is a necessary but not a sufficient condition for the
direct-realist account to work.) Now consider the case of speech
perception.
In reply to a comment by Ohala (1986), Carol (1986b) wrote:
“Ohala asks why, if I was going to focus on something other
than the acoustic signal as an object of perception, I picked
articulatory activity when there are other possibilities further
"upstream," including muscle contractions, neuronal activity and
mental events...
One answer to Ohala's question is that, in a theory of
direct perception, the distal event has to structure the
informational medium; otherwise it cannot be directly perceived.
That rules out anything that might be truly "upstream" from
articulation” (p. 150).
This response explicitly invokes the argument from direct informational
structuring to eliminate events such as muscle contractions as likely
objects of direct perception. Although muscle contractions ultimately
affect the acoustic signal, their influence is not as direct as that of
articulatory events (i.e., phonetic gestures), which makes the latter
the more appropriate sound sources to be recovered by the perceiver.
But it is necessary to ask: do phonetic gestures themselves directly
structure the acoustic signal, or is there some other class of events
that is even more directly responsible for structuring the acoustic
signal? The answer to this is unequivocal: the speech acoustic signal
is directly structured by the aerodynamic properties of the vocal
tract, which are in turn created by combinations of initiatory,
phonatory, and articulatory events. As J.C. Catford (1977) reminds us:
“The aerodynamic phase is also an extremely important
one. It is the link between the speaker's bodily activity (in the
organic phase) and the resultant sound waves (in the acoustic
phase). We must always remember that organic postures and
movements do not themselves generate sounds [his italics]: they
merely create the necessary aerodynamic conditions, for the
generation of speech sounds is in all cases an aerodynamic
process. Some of the organic activities cause pressure changes
in the vocal tract, which result in a flow of air; other organic
activities regulate the flow in ways that create sounds, either by
channelling the air-flow through narrow spaces, generating the
audible hiss of turbulence, or by allowing it to burst forth in
rapid periodic puffs, generating the sound of voice, and so on”
(p. 11).
Thus, the argument from direct informational structuring, which Carol
has used to eliminate events "upstream" of phonetic gestures (e.g.,
muscle contractions) from consideration as objects of direct
perception, also effectively eliminates phonetic gestures from such
consideration. It would seem that the direct-realist account of speech
perception must be recast with aerodynamic events assuming the role
previously played by articulatory events, or else some independent
motivation must be found for assigning a privileged status to phonetic
gestures.
In sum, the attempt to draw a close analogy between visual perception
of environmental layout and auditory perception in general (or speech
perception in particular) encounters serious difficulties, and I am not
optimistic that these can be overcome.
Moreover, there is another important reason to doubt that listeners
actually recover phonetic gestures when they perceive speech sounds.
Lindblom and his colleagues (Liljencrants and Lindblom, 1972; Lindblom,
1986; Lindblom, MacNeilage, and Studdert-Kennedy, in preparation) have
shown that, among various selection criteria that govern the structure
of phonetic segment inventories, one of the most salient is the
principle of dispersion: segments tend to be arranged in the phonetic
space so as to maximize perceptual distinctiveness. Diehl and Kluender
(1989a) have argued that an articulatory construal of the phonetic
space to which the dispersion principle applies cannot account for the
relevant facts, and that the dispersion principle rather applies to an
auditorily defined space (see critical commentaries by Fowler, 1989,
Remez, 1989, and Studdert-Kennedy, 1989, and a reply by Kluender and
Diehl, 1989b). Stevens and Keyser (1989) have also presented evidence
favoring an auditory construal of the dispersion principle.
Accordingly, I am obliged to reject as unsound any arguments leveled
against Diehl and Walsh (1989) that presuppose the correctness or
plausibility of the direct-realist account of auditory perception in
general or of speech perception in particular.