A Review of Auditory Perceptual Theories
and the Prospects for an Ecological Account
Ewan A. Macpherson
Department of Psychology
University of Wisconsin-Madison
(In partial fulfillment of Preliminary Exam requirements)
July 1995
Contents
1 Introduction
1.1 Motivation
1.2 Definitions
1.3 The neglect of auditory perceptual theory
2 Opinions on the role of auditory perception
2.1 Opinions on the role of perception in general
2.2 The role of hearing according to Helmholtz
2.3 The role of hearing according to James
2.4 The role of hearing according to Gibson
2.5 Other opinions: identification or source recovery?
2.6 Summary
3 Theories of Auditory Perception
3.1 Helmholtz's account of audition
3.2 James' account of audition
3.3 Brunswik's probabilistic functionalism
3.4 Gibson's account of audition
3.5 Computational accounts of audition
4 Establishment ecological research
4.1 Brunswik & Mohrmann: loudness constancy
4.2 Auditory scene analysis & auditory image perception
4.2.1 Bregman: auditory scene analysis
4.2.2 Yost: auditory image perception
4.2.3 Summary
4.3 Ballas & Howard: interpreting environmental sound
5 Ecological ecological research
5.1 Time-to-contact: acoustic looming
5.2 Using auditory information for active contact
5.3 Transformational invariants: breaking & bouncing
5.4 Perceiving numbers by audition
5.5 Acoustic texture in distance perception
6 Prospects for an Ecological account
6.1 Auditory ecology
6.2 Superposition
6.3 Specification
6.4 Auditory affordances
7 Conclusions & Speculations
References
1 Introduction
1.1 Motivation
During the spring of 1995 I attended a seminar given by Bill Epstein entitled "Thinking and
Perceiving". The discussion centered around the conception of perception as a process of
unconscious inference, and starting with the writings of Helmholtz and Berkeley, continued through
to the computational constructivism of Marr. In addition to the various construals of this notion, we
also dealt with the objections and alternatives which have been offered, and discussed what sorts of
experimental results would stand as evidence for one position or another. While unspecified in the
course title, the 'perceiving' referred to throughout was uniformly visual, with little reference to
audition or the other modalities. Thus, this paper is motivated in part by my speculation about how
the content of the seminar would have changed if hearing were the canonical sense for perceptual
theorizing. In keeping with this theme, I have somewhat liberally included short quotations from
several authors in lieu of "readings" since it is often enlightening to see cited authors' original words.
More specifically, my aims in this paper are threefold: to review perceptual theorizing carried
out within the context of audition; to examine a selection of experiments motivated by the differing
theoretical viewpoints; and finally, to look critically at the difficulties that proponents of direct
perception might face in importing a theory framed primarily in terms of vision into the auditory
modality. Since these aims are rather interdependent, some topics are discussed in more than one
context, but I hope that I have been successful in minimizing repetition. Rather than getting mired
in a full-fledged analysis of the direct perception debate, I have attempted to take each position on
its own merits while (I hope) maintaining a suitable and evenhanded skepticism.
1.2 Definitions
Before beginning the discussion proper, I would like to define what I mean by the terms
'Establishment' and 'Ecological', which I use to characterize the two main styles of perceptual theory.
These refer respectively to accounts of perception which posit mediation by various psychological
processes and those which do not. In terms of a succinct introduction to the direct-indirect debate
I do not feel that I can do better than to offer the following passage by Rock:
"... the essence of a direct theory is that stimulus information is available that
uniquely correlates with each particular perception. Thus the specification of such
information provides the necessary and sufficient explanation of perception. The
essence of an indirect theory is that the stimulus information, while a necessary
determinant, is not sufficient, because certain mediating processes must occur, once
the stimulus information is registered or picked up, prior to the achievement of the
percept. Such mediating processes can be described in psychological language and
are a necessary part of the chain of events leading to the final perception. In my
opinion, these processes could be either interactive in nature, such as were stressed
by the Gestaltists, or they could be cognitive or thoughtlike in character. Examples
of such processes are variously referred to as 'organizing' or 'grouping', 'interpreting',
'taking account of', 'computing', 'inferring', 'describing', 'deciding', and the like."
(Rock 1980)
The position which sets itself against accounts involving mediation is variously referred to as direct
perception, direct realism, or ecological perception. The program identified by these terms involves
a rather radical redefinition of perception and stimulation, and since words tend to take on special
meanings in this context I will use 'Ecological' to refer to the specific approach and 'ecological' for
environment-directed perception in general. The Ecological approach actively contrasts itself with,
and defines itself negatively with respect to inferential accounts - it is "indirect" which "wears the
trousers" (Turvey et al. 1981). Therefore it seems reasonable to adopt Fodor and Pylyshyn's use of
the term 'Establishment' to refer to the collection of theories with which the Ecologists take issue
(Fodor & Pylyshyn 1981).
1.3 The neglect of auditory perceptual theory
Also as a preliminary, I would like to briefly discuss the relative neglect of audition in perceptual
theorizing. The roots of the traditional, constructivist view of perception lie in analyses of vision.
Bishop Berkeley discussed the perception of space, and it was in terms of vision that Helmholtz
presented his theory of perception as a process of unconscious inference (Helmholtz 1867). He made
little reference to similar matters in his monograph on auditory perception (Helmholtz 1877), and
in fact the title of the latter refers explicitly to the "sensations of tone". Boring (1942) construed
"auditory theory" to be a framework for discussing the physiology of the inner ear, while more
modern collections with pan-modal titles (Cognitive Approaches to Human Perception (Ballesteros
1994), for example) still cheerfully ignore the non-visual modalities. Contemporary contributors to
cognitive constructivist theory also work primarily with vision (for example Gregory 1993, Rock
1983, and Shepard 1990), as do computational constructivists such as Marr (1982).
In the last three decades an alternative to familiar accounts of perception, inspired by the
work of Gibson, has gained some acceptance. While Gibson addresses all the modalities as
'perceptual systems' (1966), the full exposition of his theory deals explicitly with vision (1979). The
most serious proponents of his program tend to similarly dwell on vision, sometimes restricting their
discussion of audition to a single page (Michaels & Carello 1981).
Unsurprisingly, this visual bias persists in most discussions of the relative merits of
traditional and direct theories of perception. Typical examples are the target article by Ullman and
the resulting commentary (Ullman 1980), the debate between Fodor & Pylyshyn (1981) and Turvey
et al. (1981), and the analyses by Bruce & Green (1990) and Hochberg (1994).
The main thrust of auditory research also seems to have proceeded in the absence of
discussion of the fundamental nature of perceptual processes. Licklider (1959) remarks:
"There is no over-all theory of hearing. No one since Helmholtz has tried to handle
anything like all the known problems within a single framework. Each of the several
theories of hearing that are extant deals with a restricted set of questions."
This seems true today, and of course the number of "known problems" continues to increase. To
what can we attribute this lack of theorizing, or conversely why has such work been more often
undertaken in the visual domain? The explanation seems to lie partly in beliefs or intuitions about
differences between the two modalities and about sound itself, but more so in the historical roots of
certain lines of experimentation.
Firstly, hearing has traditionally been thought of as passive and vision active. For example,
Dowling et al. (1987) cite Schopenhauer's belief that music's affective power is due to the passive
nature of hearing, which allowed "brain-fibres" to vibrate in synchrony with musical tones.
Secondly, the "products" of hearing were more often described in terms of sensation and rarely in
terms of object perception, and the interest in the perception of musical tones rather than of the
"noises" produced by everyday sources reinforced this emphasis.
Yost (1990) elaborates on this, and suggests that the early direction of hearing research and
experimentation rests on an historical accident of timing. Sound was not considered to be localized
in space, and thus it was unclear how sound sources could be localized except by association with
perceptions derived from sight and touch. Helmholtz's psychoacoustic investigations revealed the
ear to be a sensitive frequency analyzer, and Lord Rayleigh's sound localization experiments came
after interest had been focused on the analysis problem (Strutt 1877). Licklider (1959) also credits
Helmholtz with establishing boundaries of interest within auditory science, and points out that
although von Békésy's discovery of mechanical tuning within the cochlea disproved Helmholtz's
resonance theory, it merely altered the way frequency selectivity was studied. A final factor may be
that hearing provides a fruitful and "clean" domain for the application of the theory of signal detection
(Green & Swets 1966). Thus the tendency to concentrate on basic psychophysics has persisted
throughout this century.
2 Opinions on the role of auditory perception
2.1 Opinions on the role of perception in general
Before examining some writings on the nature of the auditory process, I would like to survey
comments by a number of authors on the role of hearing. Differing views of what should properly
be considered its function or end-products must have an effect on the types of processes postulated
or required. In general, Ecological and Establishment advocates hold somewhat different views of
the role of perception, which I will present before moving on to comments specifically about
audition.
The problem of comparison is complicated by the fact that the two camps not only ascribe
different roles to perception but also define what counts as perception differently. Both the
Establishment and Ecological accounts acknowledge that perception serves to provide information
about the environment. For the Establishment, perception is the process of deriving mental
representations of the objects and events in the environment - the process of "getting the outside
inside". For example Pylyshyn (1984) defines sensory transducers as mechanisms for producing
symbols which depend on states of the environment, William James refers to perception in terms of
the conscious awareness of external objects (James 1890), and Fodor (1975) makes frequent
reference to "perceptual knowledge". Perception serves to provide knowledge of "what is where"
in the world, and action is guided on the basis of that knowledge.
In the Ecological view, perception is a keeping-in-contact which supports action, while the
emphasis in Establishment theories is more epistemological. In Ecological accounts, action is
'directly' related to perception, while in Establishment theories the relationship is 'mediated' by other
processes. The following passages illustrate the Ecological view:
"Perceiving is an achievement of the individual, not an appearance in the theater of
his consciousness. It is a keeping-in-touch with the world, an experiencing of things,
rather than a having of experiences. It involves awareness-of instead of just
awareness. It may be awareness of something in the environment or something in the
observer or both at once, but there is no content of awareness independent of that of
which one is aware." (Gibson 1966)
"Fodor and Pylyshyn, as Establishment theorists, concentrate on how one takes the
environment, appealing to verbal labels of experience to lead the way in delineating
subject matter. When the concentration is shifted to perceptual guidance of activity,
however, it is clear that most of this continuous, nested perceiving lacks words for
referring to it. ... Fodor and Pylyshyn's kind of perception (in percepts) is whatever
eventuates in a perceptual judgement or belief. Gibson's kind of perception, in
contrast, is that which eventuates in the 'proper' adjustment or oriented (to various
levels of the environment) activity." (Turvey et al. 1981)
This distinction between non-propositional perception-of and propositional perception-as is a major
point for Ecological theorists. The division between those adhering to the -as and -of interpretations
is not cleanly along "mediated" and "direct" lines, however. John Searle, certainly no supporter of
unconscious inference accounts of mental phenomena, explicitly states that "all perception is
perception-as" (Searle 1992).
2.2 The role of hearing according to Helmholtz
Most of Helmholtz's writings on hearing are found in the monograph On the Sensations of Tone As
a Physiological Basis for the Theory of Music. As the title suggests, this is a work with rather
specific aims. In particular, it deals with the perception of "musical tones" (defined as steady-state
combinations of sine-tone partials) and not with everyday sounds, which Helmholtz referred to as
"noises". Despite this emphasis on hearing in a musical context it may be possible to draw some
conclusions about his thinking about the role of hearing in general.
Firstly, sensation is stressed as playing a more dominant role in hearing than in the other
senses (again, in a musical context). In the introduction, Helmholtz writes:
"Music stands in a much closer relation to pure sensation than do the other arts. The
latter rather deal with what the senses apprehend, that is with the images of outwards
objects, collected by psychical processes from immediate sensation. ... in music, the
sensations of tone are the material of the art. So far as these sensations are excited
in music, we do not create out of them any external objects or actions. Again, when
in hearing a concert we recognize one tone as due to a violin and another as due to
a clarinet, our artistic enjoyment does not depend upon our conception of a violin or
clarinet, but solely on our hearing of the tones they produce, whereas the artistic
enjoyment resulting from viewing a marble statue does not depend on the white light
which it reflects into the eye, but upon the mental image of the beautiful human form
which it calls up." (Helmholtz 1877)
So although the listener can identify the source of a tone, the "raw" sensation of timbre is very clearly
present in awareness. Source identification is possible, but not necessarily the single overriding goal.
A second emphasis, on the challenge of source separation, does suggest an important place
for the "images of outwards objects" in hearing. As well as considering the ability to follow separate
melodic lines in a piece of music, the reader is also asked to consider a ballroom:
"Here we have a number of musical instruments in action, speaking men and women,
rustling garments, gliding feet, clinking glasses, and so on. All these causes give rise
to systems of waves, which dart through the mass of air in the room, are reflected
from its walls, return, strike the opposite wall, are reflected again, and so on until
they die out. ... in short, a tumbled entanglement of the most different kinds of
motion, complicated beyond conception. And yet the ear is able to distinguish all the
separate constituent parts of this confused whole ..." (Helmholtz 1877)
Presumably this separation is supposed by Helmholtz to allow the listener to "apprehend" the
speaking men and women, the rustling clothes, etc.
2.3 The role of hearing according to James
William James also advanced a "knowing what is where" view of perception's role. Throughout the
chapters on perception in The Principles of Psychology, he discusses both the visual and auditory
modalities in parallel, drawing no fundamental distinction between them. Perception results, in his
view, in conscious ideas suggested by sensation. "The first of these ideas is that of the thing to
which the sensible quality belongs. The consciousness of particular things present to sense is
nowadays called perception" (James 1890). In an auditory example (taken somewhat out of
context), he writes: "Thus, I hear a sound, and say 'a horse-car'". That is, the object is identified by
its sound.
2.4 The role of hearing according to Gibson
Leaping ahead to the mid-20th century, one might expect Gibson to have a somewhat different view
on the role of auditory perception, but the explicit differences to be found are subtle. In The Senses
Considered as Perceptual Systems he writes:
"The function of the auditory system, then, is not merely to permit hearing, if by that
is meant the arousal of auditory sensations. Its exteroceptive function is to pick up
the direction of an event, permitting orientation to it, and the nature of an event,
permitting identification of it." (Gibson 1966)
The obvious difference is the substitution of 'event' for 'object', but since by necessity the production
of sound involves a dynamic event, this might be construed as a difference in terminology. A greater
difference is the proposal that the 'nature' of an event is picked up. This presumably consists of the
shapes, motions, and materials involved in the production of the sound, but it is difficult to interpret
Gibson's usage precisely, and the point is not elaborated in the most mature incarnation of his theory
(1979), which considers only vision. In light of the theory of affordances, the idea that picking up
the nature of an event subserves its identification seems somewhat inconsistent with an Ecological
stance. I will return to the discussion of the problem of auditory affordances in Section 6.4.
2.5 Other opinions: identification or source recovery?
Other writers also stress identification in the auditory modality. Of these some explicitly identify
their viewpoint as Ecological while others do not. As an example of the latter, Schubert (1974)
proposes the Source Identification Theory as an organizing principle of the auditory system, at least
in the processing of non-speech sounds. For speech he extends this to include a principle of Source
Behavior recognition in an effort to embrace the motor theory of speech perception. In this account,
the listener uses the sound stimulus to identify articulatory gestures, and from these derives the
phonemic and semantic content of an utterance. The means by which Schubert suggests this is
accomplished are far from unmediated, however. Another promotion of source identification is
found in Jenkins' ecological but somewhat un-Gibsonian meditation on acoustic information (Jenkins
1985). The majority of the examples given refer to gaining "what is where" knowledge of sound-producing objects.
The idea that listening to speech is exceptional is challenged by Fowler, a committed direct
realist. While in agreement with Schubert that in this case the auditory system recovers "the causal
source of the acoustic signal" (Fowler 1991), she maintains that it is wholly unspecial in that regard
and that all hearing involves event recovery rather than associating objects with sounds (i.e.,
identification). While admitting that there are situations in which there is no adaptive advantage in
perceiving events directly, her argument is that there frequently is such an advantage and therefore
that evolutionary pressures will have produced an auditory system which attempts to do exactly that.
In addition to Fowler's writings in the context of speech perception, perhaps the most serious
examination of the role of audition from an Ecological perspective is to be found in a pair of papers
by Gaver (1993a, 1993b). Here he proposes that our auditory sense exists to pick up sound-carried
information "...about an interaction of materials at a location in an environment". The sound
reaching a listener's ears is held to bear information about each of these elements: the nature of the
interaction, striking or scraping, say; the materials involved, wood or water; the location relative to
the listener or to the environmental setting; and the nature of the environment itself, in terms of
reflectiveness and configuration of surfaces. The example of sound from a moving car is provided:
"We can hear an approaching automobile, its size and its speed. We can hear where
it is and how fast it is approaching. And we can hear the narrow echoing walls of the
alley it is driving along. These are the phenomena of concern to an ecological
approach to perception." (Gaver 1993a)
Thus what are heard are various physical features of environmental events, but as with Fowler, Gaver
does not attempt to make the case that these are necessarily ecologically-significant features
analogous to Gibson's visual affordances.
2.6 Summary
To review then, there seem to be two views on the role of auditory perception in addition to
Helmholtz's sensation-based account of music perception. The Establishment story is that hearing
serves to localize and identify sound-producing objects, while the Ecological view holds that the
physical nature of sound-producing events is directly perceived - the causal source of the acoustic
signal is recovered. As noted previously, this account is not strictly ecological in the way visual
theories of perception-for-action claim to be.
3 Theories of Auditory Perception
Having examined the range of viewpoints on the role of auditory perception, I now turn to
discussions of the processes which are held to underlie the fulfillment of this role. These are quite
varied, including hints of unconscious inference in the writings of Helmholtz, the direct perception
approach of Gibson, and auditory applications of computational constructivism.
3.1 Helmholtz's account of audition
Beginning again with Helmholtz, we find that he devotes little discussion to the mental processes
involved in hearing. This may be largely due to his emphasis on the "sensation of tone", rather than
on adaptive auditory perception outside a musical context. However, a number of passages suggest
that Helmholtz feels that a great deal of work needs to be done on the auditory input in order to
produce separate percepts for the sound sources contributing to it. He is not as explicit as in his
advancement of unconscious inference as a theory of visual perception, but he certainly suggests
ratiomorphic, constructional mental activity. The three quotations which follow give the sense that
the auditory system is involved in analysis, inference, and problem solving respectively. The second
is preceded in the original by a passage describing the visual inspection of the surface of the ocean
and the ease with which the superimposed systems of waves are separated by eye. (The emphases
are not present in the originals).
"We shall see that the ear has no decisive test by which it can in all cases distinguish
between the effect of a motion of the air caused by several different music tones
arising from different sources, and that caused by the music tone of a single sounding
body. Hence the ear has to analyze the composition of single musical tones, under
proper conditions, by means of the same faculty which enabled it to analyze the
composition of simultaneous music tones."
"I must own that whenever I attentively observe this spectacle [the visual separation
of ocean wave systems] it awakens in me a peculiar kind of intellectual pleasure,
because it bares to the bodily eye, what the mind's eye [perception in general?] grasps
only by the help of a long series of complicated conclusions for the waves of the
invisible atmospheric ocean."
"Now there are many circumstances which assist us first in separating the musical
tones arising from different sources, and secondly, in keeping together the partial
tones of each separate source. Thus when one musical tone is heard for some time
before being joined by the second, and then the second continues after the first has
ceased, the separation in sound is facilitated by the succession in time. We have
already heard the first musical tone by itself and hence know immediately what we
have to deduct from the compound effect for the effect of this first tone."
3.2 James' account of audition
James discusses perception as a general process without strongly differentiating between the
modalities, although he does seem to side with Bishop Berkeley in asserting the primacy of touch,
and is quite explicit in his description of the processes. The account is sensation-based and
constructivist, and is well-summarized in the following two quotations:
"Sensational and reproductive brain processes combined, then are what give us the
content of our perceptions" (James 1890)
"Perception may then be defined, in Mr. Sully's words, as that process by which the
mind supplements a sense-impression by an accompaniment or escort of revived
sensations, the whole aggregate of actual and revived sensation being
solidified or 'integrated' into the form of a percept, that is, an apparently
immediate apprehension or cognition of an object now present in a particular
locality or region of space." (James 1890)
Moreover, James' account is also clearly empiricist:
"Infants must go through a long education of the eye and ear before they can perceive
the realities which adults perceive. Every perception is an acquired perception."
(James 1890)
and continuing in a footnote, he makes special reference to audition:
"The educative process is particularly obvious in the case of the ear, for all sudden
sounds seem alarming to babies. The familiar noises of house and street keep them
in constant trepidation until such time as they have either learned the objects which
emit them, or have become blunted to them by frequent experience of their
innocuity." (James 1890)
3.3 Brunswik's probabilistic functionalism
Occupying a position somewhere between traditional, perception-as constructivism and Gibson's
Ecological approach lies Brunswik's probabilistic functionalism, which influenced Gibson's thinking
significantly (Lombardo 1987). In this framework, the emphasis is on the perceptual constancies,
referred to as distal focusing, and on their achievement in non-laboratory, or "representative"
contexts. The perceptual process is held to take the form of statistical inference; proximal cues of
varying reliability are weighted and combined to produce a "best bet" at the distal state of affairs.
The model of the process incorporates three types of weightings or correlations, referred to as
validities. Correlations between distal features and proximal cues are ecological validities; the
weightings placed on cues to produce percept features are criterial validities; and the degree of
correspondence between the distal feature and the percept is the functional validity. This last is a
metric of achievement.
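The three kinds of validity can be made concrete with a small simulation. The following is only a minimal sketch under stated assumptions, not Brunswik's own procedure: the cue noise levels, the cue weights, the linear combination rule, and the use of Pearson correlation as the index of each validity are all hypothetical choices made for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def simulate(n_trials=2000, seed=0):
    """Simulate a perceiver who forms a percept as a weighted sum of noisy cues."""
    rng = random.Random(seed)
    cue_noise = [0.3, 1.0, 3.0]    # hypothetical: cue 0 is reliable, cue 2 is not
    weights = [0.70, 0.25, 0.05]   # hypothetical weights placed on each cue
    distal, percept = [], []
    cues = [[] for _ in cue_noise]
    for _ in range(n_trials):
        d = rng.gauss(0.0, 1.0)                          # distal feature
        c = [d + rng.gauss(0.0, s) for s in cue_noise]   # proximal cues
        p = sum(w * ci for w, ci in zip(weights, c))     # the "best bet"
        distal.append(d)
        percept.append(p)
        for i, ci in enumerate(c):
            cues[i].append(ci)
    ecological = [pearson(distal, ci) for ci in cues]    # distal feature vs. cue
    criterial = [pearson(ci, percept) for ci in cues]    # cue vs. percept
    functional = pearson(distal, percept)                # distal feature vs. percept
    return ecological, criterial, functional

ecological, criterial, functional = simulate()
print("ecological validities:", [round(r, 2) for r in ecological])
print("criterial validities: ", [round(r, 2) for r in criterial])
print("functional validity:  ", round(functional, 2))
```

In this toy setting the perceiver weights the cues roughly in line with their reliability, so the functional validity stays high even though one cue is nearly uninformative.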
While Brunswik himself applied his methods principally to the three canonical visual
constancies (size, shape, and color) the same system has been applied to audition in a study of
loudness constancy (Mohrmann 1939). This work will be described in Section 4.1 as one example
of Establishment-style experimentation.
3.4 Gibson's account of audition
Gibson's account of the basis of auditory perception exactly parallels his treatment of vision, and has
no place for the cues which play such an important role in Brunswik's conception of ecological
perception. The hearing organism is said to use its listening system, "two ears together with the
muscles for orienting them to a source of sound", to sample the 'acoustic array'. This permits the
pick-up of invariants which specify the mechanical sound-producing event. No mediation by
inference, memory, or computation is required. As in any direct theory, the usefulness of such a
process rests on specification, or the one-to-one mapping from sound-field properties to sound-source properties. For example, interaural time and amplitude differences and their patterns of
change as the head moves are identified as specifiers of the location of a source. Two quotations will
serve as evidence of his belief in acoustic specificity:
"In meaningful sounds, these variables [spectral and temporal features] can be
combined to yield higher-order variables of staggering complexity. But these
mathematical complexities seem nevertheless to be the simplicities of auditory
information, and it is just these variables that are distinguished naturally by an
auditory system. Moreover, it is just these variables that are specific to the source of
the sound - the variables that identify the wind in the trees or the rushing of water,
the cry of the young or the call of the mother. The sounds of rubbing, scraping,
rolling, and brushing, for example, are distinctive acoustically and are distinguished
phenomenally." (Gibson 1966)
"... the kind of wave train is specific to the kind of mechanical event at the source of
the field; that is, the sequence and composition of pressure changes at a point in the
air correspond to what happened mechanically... This correspondence is the
justification for our metaphorical assertions that the waterfall 'splashes', the wind
'whistles', and the thunder 'cracks'." (Gibson 1966)
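The earlier remark that interaural differences and their patterns of change under head movement specify source location can be illustrated numerically. The sketch below is not drawn from Gibson; it assumes Woodworth's spherical-head approximation for the interaural time difference (ITD), a hypothetical head radius of 8.75 cm, and a sound speed of 343 m/s. Under those assumptions a static ITD is shared by a front source and its mirror image behind the listener, while a small head turn changes the two ITDs in opposite directions.

```python
import math

HEAD_RADIUS_M = 0.0875     # assumed head radius (hypothetical value)
SPEED_OF_SOUND_M_S = 343.0 # assumed speed of sound in air

def itd_seconds(azimuth_deg):
    """ITD from Woodworth's spherical-head approximation.

    Azimuth is measured from straight ahead: 0 = front, +90 = right, 180 = behind.
    Rear angles are folded onto the equivalent frontal lateral angle, which is
    what makes front and back sources ambiguous in this simple model.
    """
    az = math.radians(azimuth_deg)
    lateral = math.asin(math.sin(az))   # lateral angle in [-pi/2, pi/2]
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (lateral + math.sin(lateral))

def itd_after_head_turn(azimuth_deg, turn_right_deg):
    """ITD after the head rotates to the right by turn_right_deg."""
    return itd_seconds(azimuth_deg - turn_right_deg)

front, back = 30.0, 150.0
print("static ITD, front source:", itd_seconds(front))
print("static ITD, back source: ", itd_seconds(back))
print("after 10 deg turn, front:", itd_after_head_turn(front, 10.0))
print("after 10 deg turn, back: ", itd_after_head_turn(back, 10.0))
```

On this toy model the static ITD alone does not map one-to-one onto source direction, but the ITD together with its direction of change under a head turn does separate the two cases.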
Gibson also repeats his argument against sensations as a basis for perception in the context of
hearing. A sound signal, as a function of time, can be decomposed into a collection of sinusoids, but
he points out that adopting this mode of analysis leads to the dubious assumption that any complex
sound can be reduced to a collection of pitch sensations. The point he makes is similar to what
Jenkins (1985) calls Johansson's Law of Perceptual Richness, which is that mathematically complex
stimuli may be hard to describe, but are information-rich, while mathematically-simple stimuli may
not be so simply dealt with by the perceptual system (Johansson 1985).
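The decomposition that Gibson refers to can be shown with a short example. This is a minimal sketch using a naive discrete Fourier transform; the sample rate, signal length, and the two component frequencies are hypothetical values chosen so that each component falls exactly on an analysis bin. It shows only that the decomposition is mathematically available, and says nothing about whether the perceptual system works this way.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Amplitude of each frequency bin from 0 up to the Nyquist bin (naive DFT)."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        # Scaling by 2/n recovers sinusoid amplitude for bins strictly between
        # DC and Nyquist; those two edge bins are not of interest here.
        mags.append(abs(acc) * 2 / n)
    return mags

SAMPLE_RATE_HZ = 800   # hypothetical
N_SAMPLES = 400        # gives 2 Hz per bin

# A "complex" sound built from two sinusoids of different amplitude.
signal = [1.0 * math.sin(2 * math.pi * 100 * t / SAMPLE_RATE_HZ)
          + 0.5 * math.sin(2 * math.pi * 250 * t / SAMPLE_RATE_HZ)
          for t in range(N_SAMPLES)]

mags = dft_magnitudes(signal)
peaks = [(k * SAMPLE_RATE_HZ / N_SAMPLES, round(m, 3))
         for k, m in enumerate(mags) if m > 0.1]
print(peaks)   # (frequency in Hz, amplitude) for each component found
```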
3.5 Computational accounts of audition
In Gibson's auditory theory, the pick-up of information is said to be performed by neural structures
which 'resonate' to the invariants of stimulation.
By removing these processes from the
psychological "domain of discourse" (Ullman 1980) Gibson left them unanalyzed. Those interested
in artificial intelligence and the development of perceiving machines do not have this luxury,
however, and must face the problem of actually extracting invariants. Despite this component, and
an emphasis on representational transformation, Gibson's Ecological approach is often identified as
a source of inspiration (as well as exasperation) by those who practice computational
constructivism (see footnote 1). For example, Sloman attempts to incorporate affordance-like objects of perception
into his computational theory, but writes: "... we need not stick with Gibson's mystifying and
unanalysed notions of direct information 'pickup' and 'resonance', although I shall sketch a design
for such a system that has distant echoes of these notions" (Sloman 1989). Marr holds a similar
view:
"Gibson's important contribution was to take the debate away from the philosophical
considerations of sense-data and the affective qualities of sensation and to note
instead that the important thing about the senses is that they are channels for
perception of the real world outside or, in the case of vision, of the visible surfaces."
(Marr 1982)
¹ The 'computational' in 'computational constructivism' refers specifically to a style of
processing involving mathematical manipulations and explicitly geometrical representations. As
Pylyshyn has pointed out (1984), all forms of constructivism can be considered computational
since inference is couched in terms of propositions, propositions are represented symbolically,
and an operation over symbols is computation.
"Although one can criticize certain shortcomings in the quality of Gibson's analysis,
its major, and in my view, fatal shortcoming lies at a deeper level and results from
a failure to realize two things. First, the detection of physical invariants, like image
surfaces, is exactly and precisely an information-processing problem, in modern
terminology. And second, he underestimated the sheer difficulty of such detection."
(Marr 1982)
This combination of consideration of ecological constraints and formal computation has been termed
'natural computation' by Richards (1988), and forms yet another class of auditory theory. C.J. Searle
(1982) and Lyon (1983), among others, have applied these methods to auditory processes. Other
major impetuses are soundscape understanding or 'machine listening' (Ellis 1995), and automatic
music transcription (Nunn 1995). Curiously the design of speech recognition systems seems to have
proceeded without much contact with perceptual science, and the techniques used are often general-purpose pattern recognition algorithms rather than auditory models.
Bruce & Green (1990) offer a possible reconciliation between computational and Ecological
accounts, framed in terms of non-symbolic representation. Neural "maps" can represent variables
of the input and preserve isomorphisms, but as Searle (1992) maintains, once the neurophysiological
bases of these maps are understood, the incentive to characterize the process in terms of symbolic
computation is greatly reduced. For example, it appears that interaural time differences are mapped
to "place" in the medial nucleus of the superior olivary complex (Pickles 1988) - the representation
is not symbolic. Certainly much of Marr's theory of early vision could be read simply as a functional
description of simple neural processing. Hatfield (1990) also proposes a rapprochement between
direct and representational transformation accounts via connectionist "symbol" processing.
4 Establishment ecological research
In the next two sections of the paper I will review a number of examples of individual experiments or
of research programs conducted from the Establishment and Ecological viewpoints. The dual aims
are to compare the style of experimentation within the two camps and to provide some context for
the discussion of the Ecological approach with which I conclude in Section 6. Experiments which
are self-consciously motivated by an anti-direct stance tend to seek the effects of perceiver
knowledge on percepts (Hochberg 1994), while others tacitly working within the classical framework
uncritically offer inference-based explanations of their observations. There are also many examples
of the types of experiments which are a favorite target of Ecologists: snapshot theories of motion
perception, lateralization of sine tone stimuli, fixed-head sound localization, and auditory illusions
using "impoverished stimuli" of various sorts.
Since the subject matter and analyses found in these studies are so obviously different from
those in Ecologically-motivated work, it does not seem particularly illuminating to discuss them
here. Instead I will focus on experiments which address issues relevant to object or event perception
from an Establishment viewpoint. The emphasis in this work is usually on filling unsatisfying gaps
in the direct perception account (e.g., how are cues or invariants extracted?) or on offering alternative
accounts involving mediating processes. I will attempt to show by example that to some extent one
can address auditory perception ecologically without being strictly Ecological.
4.1 Brunswik & Mohrmann: loudness constancy
As mentioned previously, the concern with investigating environmental perception did not originate
with Gibson. Brunswik and his colleagues examined many perceptual constancies within the
framework of his probabilistic functionalism. An example of this approach applied in hearing is a
study of loudness constancy by Mohrmann (1939, described by Postman & Tolman 1959). The task
of the subjects was to report the loudness of the sounds produced by a number of sources while
adopting one of two attitudes. The first, the naive-realistic attitude, was distally focused, and
required the listener to estimate the intensity at the source, while the analytic, or sensorial, attitude
concerned the intensity at the listener's position. The actual intensity was measured using
microphones at the source and listener positions, but the response method is not described.
Achievement of constancy was calculated by correlating the judgements with the physical
measurements. Presumably the proximal intensity was varied by altering the distance to the source
rather than changing its amplitude, since the latter would cause both distal and proximal intensities
to vary in parallel. In addition, the experiment was run under three viewing conditions: in the dark,
with listeners blindfolded after viewing the source, and with the source in plain view throughout.
If listeners were able to adopt the desired attitudes perfectly, the constancy ratios obtained
should be 1 in the naive-realistic case and 0 in the analytic case. This trend was observed, but
constancy ratios ranged from approximately 0.65 (for tones) to 0.95 (for speech) in the realistic case,
and from about 0.1 to 0.5 in the analytic. This suggests that on the whole observers are more
successful at reporting distal intensities than proximal ones. In addition, constancy was favored
when subjects could see the source and how far away it was no matter which attitude they were
requested to take, but visual cues hindered proximal reporting more than they assisted already-good
distal reporting. That is, listeners could only successfully adopt an analytic stance in the dark
condition. Another feature of the data is that the complex sounds, such as speech and music,
permitted much higher loudness constancy than tones and noise.
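The logic of the two attitudes can be made concrete with a small numerical sketch. It is not Mohrmann's procedure, which the summary above does not describe in detail. It assumes free-field inverse-square spreading and uses a Brunswik-style ratio as one plausible constancy index. The function names and the example levels are hypothetical.

```python
import math


def proximal_level_db(source_level_db, distance_m, ref_distance_m=1.0):
    """Level at the listener, assuming inverse-square spreading in a free field.

    source_level_db is the level measured at ref_distance_m from the source.
    """
    return source_level_db - 20.0 * math.log10(distance_m / ref_distance_m)


def constancy_index(judged_db, distal_db, proximal_db):
    """Brunswik-style ratio (an assumed index, not necessarily Mohrmann's).

    Returns 1.0 when the judgement equals the distal (source) level and
    0.0 when it equals the proximal (at-the-ear) level.
    """
    return (judged_db - proximal_db) / (distal_db - proximal_db)
```

For a source fixed at 70 dB (re 1 m) and a listener 4 m away, the proximal level is about 12 dB lower than the distal level. A judgement that falls between the two levels gives an index between 0 and 1, which is the pattern reported for both attitudes.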
These results can of course be interpreted in several ways. In Brunswik's terms, the adaptive
value of perception lies in distal focusing, and therefore it should not be surprising that we have
easier access to distal representations than to the proximal cues from which they are derived.
Unconscious inference could be invoked to explain the achievement of greater constancy in the
visible-source condition, in which vision provides information about the distance to the source. This
could be used by the auditory system, which "knows" how intensity varies with distance, to
determine the source's loudness². This of course raises the question of how the visual system obtains
unambiguous distance information. The advocate of direct perception would explain the difficulty
of reporting proximal intensity as evidence that the auditory system is designed to recover source
properties. The better constancy obtained with speech and music could be attributed to their greater
ecological validity and informational richness in comparison to the lowly sine tone and noise burst,
for which source recovery would be ambiguous. The advantage bestowed by visual information is
less conveniently explained within an Ecological account, but conceivably cross-modal invariants
for loudness could be hypothesized.
² Warren (1982) discusses an approach in which estimates of loudness are actually held to be
disguised estimates of distance.
4.2 Auditory scene analysis & auditory image perception
As Helmholtz pointed out, a central issue in auditory research concerns the means by which the
complex superposition of sounds from several sources is processed so that each may be perceived
separately. This process is referred to as source segregation or auditory scene analysis, and is
addressed in an extensive program of research conducted by Bregman and his associates (Bregman
1990) and in a theoretical paper by Yost (1990). Each author is concerned with slightly different
aspects of the problem, and phrases his assumptions and motivations differently. Here I will give
a review of their theoretical orientations and the types of experiments associated with each.
4.2.1 Bregman: auditory scene analysis
For Bregman, like Marr, “perception is the process of using information provided by our senses to
form mental representations of the world around us”, and as in visual scene analysis an important
problem is the grouping of separate pieces of information about the same object together. He writes:
“it is important to emphasize again that the way the sensory inputs are grouped by our nervous
systems determines the patterns that we perceive”. So, the products of perception are in this account
very much influenced by mental activity (or at least alterable neurophysiological activity). The scene
analysis task is posed as a problem to be solved by the auditory system through a process of
representational transformation. Bregman stresses that on one hand it is important to examine the
ecology of audition - the constraints on and commonalities among natural auditory scenes - and
suggests that the auditory system uses knowledge of this sort in the form of useful heuristics in
order to achieve source separation. The formation of representations is held to be constrained both
by innate, primitive grouping rules and by learned rule complexes, which he calls schemas.
Grouping occurs both sequentially (on successively-presented segments of a sound pattern)
and in parallel (on sound components present simultaneously). The end result of the grouping
processes is one or more sound streams, which are described variously as the auditory equivalents
of visual objects, perceptual units representing single happenings, or as perceptual representations:
“a computational stage on the way to a full description of an auditory event”. When a stream is
compared to an object, clearly the meaning is not that streams exist in the environment, but that a
stream is a unit of auditory experience with its own properties (rhythm, pitch contour and timbre,
say) just as a visual object is a unit of experience. Despite referring to ecological constraints as a
guide to grouping processes, it is clear that a stream does not necessarily correspond one-to-one with
a sound source. It is possible for the sound from many sources to merge into a single stream, or for
sound from a single source to be segregated into several streams.
The latter effect is revealed in experiments on pitch streaming, a sequential grouping process
(Bregman and Campbell 1971). When a tone sequence consisting of alternating high and low tones
is presented it can appear as a single stream if it is played slowly or the tones are not widely
separated in pitch, or it may split into two streams if played fast or with wide separation. In
situations where a single sequence is grouped into multiple streams, it is very difficult for listeners
to discern the temporal relationships between them. For example, rhythmic patterns perceived in
a single stream can dissolve if pitch manipulations cause it to split into multiple streams. Bregman
writes that we can:
"...look at the streaming effect as the auditory system's description as a mixture of
two sources - one high in pitch and the other low. This is the system's best bet as to
the deep structure of the situation. The heuristic that seems to be involved here is
this: Temporally adjacent segments are not necessarily to be grouped as arising from
the same source, especially when the segments themselves have sharp boundaries.
... In such cases, the events are to be grouped according to similarity." (Bregman
1981).
The reference to "deep structure" is not accidental; he often compares the heuristics involved in
"parsing" the auditory input to Chomskian grammatical rules. Formal generative grammars have
been used in modelling the perception of music (Lerdahl and Jackendoff 1983), and Ballas (1987,
see Section 4.3) uses a speech metaphor in his account of environmental sound perception, so this
approach is not unique.
Bregman's experimental program also includes investigations of other streaming phenomena,
such as those based on timbre differences, and of auditory analogs to visual amodal completion
effects. When tone and noise bursts are alternated in sequence, the tone appears to become
continuous when the noise is sufficiently intense that it would have rendered a truly continuous tone
inaudible. This effect also occurs when a tone glide is interrupted by noise - under the appropriate
conditions the glide appears to persist through the noise while continuing to change in pitch, a
phenomenon which has been used to investigate the auditory system's 'assumptions' about the rates
of change of sound source characteristics (Kluender & Jenison 1992). Warren (1982) reviews
several of these illusory continuity effects, and Bregman is in agreement with his suggestion that
their function is to group together sound segments originating from the same source which would
otherwise be separated by masking signals. The ability to elaborate sketchy, temporally-limited
sensory information into temporally-extended stable percepts has also been noted in binaural
experiments (Stellmack 1994).
4.2.2 Yost: auditory image perception
In a paper entitled "Auditory image perception and analysis: The basis for hearing", Yost (1990) also
addresses the scene analysis problem, although he distinguishes his point of view from Bregman's.
His emphasis is on processes which allow the separation of concurrently active sources, under the
premise that the main function of the auditory system is the "determination of sound sources".
'Determination' is explicitly distinguished from 'identification': it is the generation of an 'auditory image'
corresponding to a single sound source. These images are the objects of the identification process
although identification need not be successful in order for them to be perceived. An auditory image
seems to be approximately the same as a stream although in a sense its identification with a single
physical source suggests that it is a more environmentally-oriented concept. While Bregman's
proposal that streams are the units of auditory experience seems clear, the use of the word 'image'
is rather more confusing. Consider this passage:
"Because the sounds from different sources do not arrive at the auditory system
separately, the auditory system must process the neural representation of the complex
sound field into elements ('auditory images') that allow the listener to potentially
determine the source. The presence of sound sources is inferred or deduced from
percepts, the auditory images, based on the information arriving at the ears of a
listener. Thus auditory images are the bases for hearing." (Yost 1990)
Such an image is clearly not something which is imagined; it is not the sort of thing studied by those
interested in auditory imagery (Reisberg 1992). Nor is it analogous to a retinal image. If the images
are percepts (i.e., the experiential outcomes of the process of perception) then they are in classical
terms the conscious representations of sound sources in the environment. However Yost seems to
introduce an extra step of inferring the existence of sound sources from percepts rather than taking
the Helmholtzian position that percepts are the result of inference. To complicate matters Yost
elsewhere states that image perception is sufficient for sound source determination. The proliferation
of levels seems to result from an awkward attempt to keep the discussion outside the realm of
cognition. For example:
"If one reviews the literature on image formation (Handel 1990, Bregman 1990), the
topic may appear to be more closely related to cognitive science, or even to
phenomenology, than to issues that would be of direct interest to psychoacoustics and
auditory physiology. An assumption of this paper is that the auditory system is
responsible for auditory image formation and the four questions posed above are
amenable for study by auditory scientists."
Yost claims to seek an explanation in terms of neurophysiology or basic psychophysics, but,
although denying it, needs a foot in both camps. If auditory images are not phenomenological
entities, then they are hardly percepts.
Rather than belabor this point, I will press on and discuss the experiment Yost presents as
an example of image formation and briefly describe the means by which he feels this is achieved.
The necessity for scene analysis occurs whenever more than one source is active at the same time.
The experimental stimulus in this case was a mixture of a man uttering the vowel /a/ and a
synthesized pipe organ note. Neither the physical frequency spectrum nor the output of an auditory
filter bank model make it obvious that two and only two sources are present, but all of the subjects
who heard the stimulus reported hearing only two. Identifying the sources was more variable, but
all listeners heard some spoken vowel and a musical note.
The strategy adopted in explaining this ability is to examine the ecology of sound production
for physical attributes of sources which might be encodable in the auditory nerve signals. The seven
physical variables suggested are: spectral separation, intensity profile, harmonicity, spatial
separation, temporal separation, common temporal onsets and offsets, and coherent slow temporal
modulation. While this does not exactly constitute a search for invariants, it is meant as a first step
in a neurophysiological account of source separation. Note that the importance of temporal
separation and common onsets and offsets was recognized by Helmholtz (1877, see Section 3.1).
4.2.3 Summary
The search for an account at this ecological-neural level is perhaps the only feature of this approach
and of Bregman's which Ecological theorists would not object to. Talk of grouping rules,
representations, problem solving, deductions, and inference is the antithesis of a direct theory.
However, no convincing account of source separation in terms of acoustic invariants has yet been
offered. Gibson (1966) proposed that orienting the head so as to synchronize the binaural inputs for
one source while desynchronizing those for others was the basis of 'selective listening'. While spatial
separation and binaural input do assist in source segregation, segregation is clearly quite possible with a
single channel of input, and thus Gibson's account is inadequate. This issue is discussed in more
detail in Section 6.2.
Bregman's work might be subjected to the standard criticism that his stimuli are
impoverished and unnatural and that the results therefore have little or no relevance to ecological
listening. In addition, any account which posits rules is vulnerable both to questions about who is
applying the rules (i.e., the homunculus problem) and to objections about lack of constraints. One can
keep adding rules to explain whatever behavior is observed. However, Bregman states that he is
interested in a functional description of these processes - the rules are tools for predicting percepts
rather than actual constituents of the auditory system. His is an as-if, not an in-fact, rule-following
account. The primitive rules are described as "automatic innate processes that act without conscious
control" (Bregman 1990). On the other hand, his description of the more sophisticated, top-down,
schema-based processes is less reconcilable with direct perception accounts. Here consciously-directed attention and "the activation of stored knowledge of familiar patterns" are held to play a
role. A Gibsonian explanation would involve an account of perceptual learning, which involves the
discovery of additional variables of stimulation permitting finer discriminations.
4.3 Ballas & Howard: interpreting environmental sound
As a final example of a non-Ecological approach to environmental perception I will examine Ballas
and Howard's paper, "Interpreting the Language of Environmental Sounds" (1987). Their main point
is that a useful analogy can be made between the perception and understanding of speech and of
environmental sound. Both seem to involve bottom-up, data-driven processes combined with top-down, context-dependent, knowledge-based cognitive processes which serve to resolve ambiguities
and permit the recovery of meaning. This is similar to Bregman's distinction between primitive and
schema-based processes, but the authors' intent is to present evidence that not only is the general
form of perceptual processing similar, but that specific details are too. The claim that environmental
sound can be considered a language is therefore more than metaphorical.
Ballas and Howard discuss four experiments in support of their contention. The first
involved the free-response identification of a number of short recorded sounds, several of which
were intended to represent events in water or steam-pipe systems. It was found that (with the
exception of a water drip sound) actions were much more accurately identified than agents. That is,
listeners could more reliably say whether the event involved an impact, friction or flow than whether
the materials involved were water, wood, metal or air. These results are contrasted with those of
Vanderveer (1979), who obtained much more accurate judgements, but Ballas and Howard offer the
explanation that Vanderveer's stimuli (e.g. jingling keys, fingers drumming on a table) were
presented in an appropriate context, a seminar room, and that this cued the listeners. Their
conclusion is that, taken in isolation, the meanings of individual environmental sounds (i.e., the
identities of the source events) can be ambiguous, as can the meanings of isolated words.
A second study attempted to draw a parallel between sound and speech homonyms using an
Information Theory approach. Listeners were again asked to identify the recorded sounds from the
first experiment and to rate the confidence of their identifications. The responses were sorted into
categories and the "entropy" of each sound calculated based on the number of different categories
into which it was placed. The correlation between confidence and entropy was significant,
suggesting that identification is affected by the number of different causes to which a sound might
be attributed. The authors also suggest that identification might be influenced by the frequency of
occurrence of particular sounds in the same way that word recognition depends on frequency, but
they admit that quantifying this may be difficult.
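As an illustration of the kind of measure involved, the sketch below computes the Shannon entropy of the category labels that listeners assign to a single sound. Whether Ballas and Howard used exactly this formula is an assumption here, and the function name and labels are invented for the example.

```python
import math
from collections import Counter


def response_entropy(categories):
    """Shannon entropy (in bits) of the category labels given to one sound.

    `categories` holds one label per listener response. The value is zero
    when every listener names the same cause and grows as responses spread
    over more, and more evenly used, categories.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return entropy
```

A sound that every listener attributes to the same cause has zero entropy, while a sound whose responses are split evenly over four causes has an entropy of two bits. On the authors' account the second sound behaves like a homonym and should attract lower confidence ratings.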
The final two studies used the same set of sounds presented in sequences and were concerned
with the effect of context on the identification of individual sounds within sequences or the learning
of sequences as whole units. Context was found to influence the interpretation of individual sounds.
For example a hammer striking a pipe was thought to be a factory machine in one sequence and a
car crash in another. This effect is compared to the resolution of homonym meaning in sentences:
"... it appears that the integration of sequences of sounds resembles the integration
of sequences of words in a sentence. In the latter case, multiple interpretations of
each word might be activated initially and all but one eliminated on the basis of the
context provided by the other words." (Ballas & Howard 1987)
Although not mentioned by the authors, the activation and inhibition which they propose could
perhaps be investigated using the established tools of experimental psycholinguistics.
In the final experiment, listeners were asked to learn sequences of two sorts. One set
contained randomly-ordered combinations of drips, clangs, flushes etc, while the other consisted of
causally-sensible structured sequences created using a small finite state grammar. In addition, half
the subjects in each condition were informed that they would hear sounds involving water and half
were given no instruction. The structured sequences were learned more quickly than the random
ones, and there was an interaction with the instructions given. Prior information aided those learning
the structured patterns but hindered those learning the random ones. The interpretation is that the
expectation of causally-logical sequences interfered with the learning of random patterns. In effect
there is held to be a grammar of causality which listeners use to parse environmental sound
sequences. Jackendoff (1987) makes similar claims about the representation of visual events and
their relationship to language.
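The contrast between structured and random sequences can be sketched with a toy finite state grammar. The states, sound names, and transitions below are hypothetical and are not the grammar used by Ballas and Howard. They only show how a small grammar yields causally sensible sequences and how a sequence can be checked against it.

```python
import random

# Hypothetical grammar: each state maps to (sound, next_state) arcs.
# Within a state, every sound leads to exactly one next state.
GRAMMAR = {
    "start": [("valve_open", "flowing")],
    "flowing": [
        ("water_flow", "flowing"),
        ("drip", "flowing"),
        ("flush", "draining"),
        ("valve_close", "end"),
    ],
    "draining": [("gurgle", "draining"), ("clang", "flowing")],
}


def generate_structured(rng):
    """Walk the grammar from 'start' until the 'end' state is reached."""
    state, sequence = "start", []
    while state != "end":
        sound, state = rng.choice(GRAMMAR[state])
        sequence.append(sound)
    return sequence


def generate_random(rng, length):
    """Draw sounds from the same vocabulary with no sequential constraint."""
    vocabulary = sorted({s for arcs in GRAMMAR.values() for s, _ in arcs})
    return [rng.choice(vocabulary) for _ in range(length)]


def is_grammatical(sequence):
    """Return True if the sequence is a complete path from 'start' to 'end'."""
    state = "start"
    for sound in sequence:
        next_states = [nxt for s, nxt in GRAMMAR.get(state, []) if s == sound]
        if not next_states:
            return False
        state = next_states[0]
    return state == "end"
```

Every sequence from `generate_structured` passes `is_grammatical`, whereas most sequences from `generate_random` do not. In the terms used above, a listener who expects causally logical sequences is in effect applying a check of this kind.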
An Ecological response to this might question the validity of results obtained with sounds
taken out of an environmental and causal context. In other words, Ballas and Howard might have
too restricted a view of what should comprise an environmental sound or stimulus. Fodor and
Pylyshyn (1981) discuss this move with respect to the phonemic restoration effect and conclude that
widening the conception of the effective stimulus allows the resolution of ambiguity, but
concomitantly reduces the ability to explain the perceptual similarity which can occur in differing
contexts. For direct theorists who hold that the auditory system seeks to recover the sound-producing physical events, the existence of sound homonyms may pose no problem, since these
sounds are often produced by similar physical systems. Ballas and Howard give the example of a
loud sharp bang, which could be caused by an engine backfire, a gun, or an explosion. In all cases
the physical cause of the sound is the rapid expulsion of air from an enclosure, but, as I argue in
Section 6.4, the environmental significances of these causes differ significantly, and identification
is important.
5 Ecological ecological research
The number of auditory studies explicitly inspired by the Ecological approach is not large. A
substantial portion of the literature consists of speculative discussions of the applicability of direct
or Ecological accounts to audition rather than descriptions of experimental work. The five examples
discussed below have been chosen to indicate the types of experiments performed and the relative
successes and failures which were encountered.
The influence of the Ecological approach manifests itself in the objectives of particular
experiments or studies and consequently in their design. Characteristic aims are: the discovery of
invariants of stimulation; obtaining evidence that perception is causally related to these invariants,
which in turn is taken as evidence for direct perception; and characterizing the manner in which
perception guides action. The search for invariants involves either mathematical analysis or physical
measurements of a given environmental situation. In order to show that perceptual systems actually
utilize a particular invariant its presence must be shown to be a sufficient condition for the relevant
percepts to arise. Thus, observers must be shown to be able to perceive the environmental property
which the invariant specifies and their percepts must be alterable by experimental manipulations of
the invariant.
It is of course difficult to prove experimentally that perception is unmediated, particularly
since the putative mediating processes are presumably unconscious and inaccessible to introspection.
The argument for direct perception therefore generally consists of the identification of an invariant,
verification of its efficacy, and a subsequent appeal to parsimony. If perception appears to be a
function of stimulation, why invoke unconscious inference or other processes? Ultimately this is a
somewhat unsatisfying approach since it must proceed case-by-case and leaves open the question
of directness in situations for which no invariant has yet been discovered. However, one may also
hold that since it is an empirical matter, there is no logical inconsistency in simply assuming the
existence of specification until it is disproved (Fowler 1991).
The style of experimentation also differs from the bulk of Establishment perceptual research
in the types of stimuli used and the types of responses required of participants. Typically, the stimuli
are complex or "realistic". Subjects are asked to characterize events or perform certain actions based
on their perceptions. These are frequently more complex or natural actions than the typical
psychophysical discrimination task.
5.1 Time-to-contact: acoustic looming
The derivation and investigation of an acoustic variable for time-to-contact provides a good model
of the Ecological approach. In vision the inverse of the relative rate of expansion of an object's retinal
projection, r / (dr/dt), specifies the time-to-contact if the object is moving directly towards the observer. Shaw,
McGowan and Turvey (1991) derive an acoustic equivalent based on the simplifying assumptions
that the source is a compact monopole, the acoustic medium is non-absorbing, and the surroundings
are anechoic. Under these conditions, acoustic time-to-contact, or τ_a, is equal to twice the inverse
of the relative rate of change in intensity, 2I / (dI/dt), at the observer's position. If time-to-contact
were to be deduced only from successive "snapshot" judgements of distance, accuracy would suffer,
since estimation of auditory distance is notoriously poor (Gardner 1969). Prior to any experimental
verification of the effectiveness of this invariant, Guski (1992) questioned whether the auditory
system could in principle use this variable since he thought it required access to the absolute intensity
of the sound source. This concern seems to be based on a misapprehension; the intensity in question
is not that of the source, but the proximal intensity. The acoustic tau is independent of overall
intensity and distance, just as the visual one is independent of size.
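The relation can be checked numerically. Under the stated assumptions the proximal intensity I is proportional to 1/r², so for a source closing at constant speed v, dI/dt = 2Iv/r, and therefore r/v = 2I / (dI/dt). The sketch below is an illustration, not the derivation given by Shaw, McGowan and Turvey. It estimates the rate of change with a simple finite difference, and all names and parameter values are hypothetical.

```python
import math


def proximal_intensity(distance_m, source_power_w=1.0):
    """Intensity at the listener for a compact monopole in a free field."""
    return source_power_w / (4.0 * math.pi * distance_m ** 2)


def acoustic_tau(i_prev, i_now, dt_s):
    """Time-to-contact estimated as 2I / (dI/dt) from two intensity samples."""
    rate = (i_now - i_prev) / dt_s
    return 2.0 * i_now / rate


def simulate_tau(distance_m, speed_m_s, source_power_w, dt_s=0.001):
    """Acoustic tau for a source now at distance_m, approaching head-on."""
    previous_distance = distance_m + speed_m_s * dt_s
    i_prev = proximal_intensity(previous_distance, source_power_w)
    i_now = proximal_intensity(distance_m, source_power_w)
    return acoustic_tau(i_prev, i_now, dt_s)
```

For a source 20 m away approaching at 10 m/s, `simulate_tau(20.0, 10.0, 1.0)` is close to the true value of 2 s. Changing `source_power_w` leaves the estimate essentially unchanged, which is the point made above in reply to Guski: only proximal intensity and its rate of change are needed.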
A number of studies have examined the ability of listeners to judge the time-of-passage of
a moving sound source (for example Rosenblum et al. 1987), but these do not directly address the
effectiveness of the τ_a invariant since other sources of information such as intensity and Doppler
shift changes also specify the time of closest approach. The τ_a variable offers prospective information for time-of-arrival, and therefore the important test is whether arrival time can be accurately predicted from
acoustic information collected before the "collision" when other variables are uninformative.
Rosenblum (1993) describes an experiment in which recordings of cars passing an observer at
various speeds were edited into thirds to evaluate the usefulness of information from different
portions of the stimulus. The results indicate that information available prior to passage is as useful
in estimating arrival time as hearing the actual passage. Jenison (1994) has derived variables
involving intensity, interaural time difference, and Doppler shift which are cues to parameters of the
more general approach problem, in which the source moves past the observer at some distance and
at a particular trajectory angle. Wightman & Jenison (1995) report data from an experiment using
such synthesized stimuli which show that listeners can use prospective information to discriminate
arrival times differing by about 300 ms.
While the effectiveness of this invariant seems to have been established, the assumptions
under which it was derived are actually quite restrictive. Sources radiating short wavelengths cannot
be approximated by compact monopoles and in reverberant environments the invariant applies only
to the direct signal. I shall discuss issues of this sort in the concluding sections of the paper.
5.2 Using auditory information for active contact
The studies of acoustic tau discussed above required subjects to judge time-to-contact independent
of any other action. In an experiment conducted by Heine and Guski (1993), participants were
requested to catch a ball rolling towards them using only acoustic information. The balls were
released on a ramp which they rolled down and continued towards the edge of a table at which the
subject was seated. Only a single reach-and-catch gesture was permitted, so good performance
depended on estimation of time-to-contact from the sound produced by the ball.
While results varied with the size of ball used (and hence the strength of the sound produced),
performance turned out to be quite poor overall. The authors advance various explanations for this,
the first of which is that the experiment was conducted in an anechoic room, a condition under which
it is very difficult to judge distance auditorily. This seems like an unfortunate point to raise, since
the advantage of the "looming" invariant for time-to-contact is that it is independent of distance. If
distance judgements are required, the case for the efficacy of the invariant is undermined. A second
point raised is that sighted humans rarely rely only on acoustic information in natural situations.
Hearing typically aids orientation and preparation for visually-guided action. However the fact that
blind athletes can apparently use similar information to play games involving rolling balls leads the
authors to conclude that sufficient information is present in the acoustic signal, but that their sighted
subjects were not attuned to it.
5.3 Transformational invariants: breaking & bouncing
An early and frequently-cited example of Ecological acoustics is the study of the perception of
breaking and bouncing by Warren and Verbrugge (1984).
The emphasis is on identifying
transformational invariants (specifying a dynamic characteristic) for bouncing and breaking events.
It is suggested that a "single damped quasi-periodic pulse train" specifies a bouncing event and that
an "initial rupture burst dissolving into overlapping multiple damped quasi-periodic pulse trains"
specifies breaking. Subjects listened to natural tokens of bottles and jars hitting a linoleum floor and
were asked to identify the type of event independent of the material involved. In addition to the
breaking and bouncing categories, subjects were encouraged to respond "don't know" if they could
not decide or if they perceived some other type of event. Given this three-choice task, correct
identification was better than 98% for both types of tokens. To verify that the hypothesized
invariants do specify the two types of events, synthetic tokens were constructed using recorded
sounds from four single pieces of glass. Here correct identification was 90.7% for bouncing and
86.7% for breaking.
It is of course possible that subjects used prior knowledge of similar events to perform the
classification rather than perceiving them directly via the temporal patternings. As the authors
acknowledge, an additional problem is the response method used. If these temporal structures truly
specify the events, then rates of correct identification should be unaffected by the number of different
sorts of events to be identified. If non-breaking and non-bouncing events were included, would
performance deteriorate? Predefining the categories brings to mind a criticism which has been
leveled at Establishment theorists; Turvey et al. (1981) state that those opposed to Establishment
theory should ask of its proponents "both why and how any given thing comes to be described in just
those predicates that are consonant with the hypothesis mediating its interpretation." By restricting
the responses and the categories, perhaps Warren and Verbrugge have cast the task into the form of
a statistical inference problem.
5.4 Perceiving numbers by audition
Occasionally, as in the ball-catching study, invariants of stimulation may exist, but seem to be
poorly-utilized by observers. The task in this experiment (Heine, Guski & Pittenger 1993) was for
listeners to estimate the number of steel balls dropped and allowed to bounce on a wooden surface.
For a single ball the sound consisted simply of a series of impacts, while with two or more balls there
were also collisions between balls. Recordings were made in an attempt to find acoustical correlates
of the number of balls dropped. Correlation coefficients with magnitudes from 0.95 to 0.99 were
found between the number of balls and the peak sound level, the time interval between the first and
second bounces, and the overall duration of the event.
Although subjects were able to identify the single-ball case reliably, in all other cases the
number of balls tended to be under-estimated, and the variability in responses was high. In fact,
from the data presented, it does not appear that listeners could reliably distinguish between 2 and 9
balls. The explanations offered for this result are similar to those in the ball-catching experiment.
The task is somewhat unnatural and makes atypical demands on the auditory system, which may not
be attuned to pick up the acoustic invariants available. The authors again make an un-Ecological
remark about the subjects' lack of knowledge of the situation. Apparently when shown the
experimental setup before being blindfolded the correlation between judgements and number of balls
increased from 0.73 to 0.84, which suggests that prior, non-auditorily-derived knowledge of the
situation may be as important as "attunement", which was not demonstrated.
5.5 Acoustic texture in distance perception
The final example of Ecologically-inspired experimentation is an investigation of the utility of
providing "acoustic texture" in a distance judgement task (Höger 1993). Gibson (1979) proposed
that texture gradients are invariants for surface slant and that the amount of texture occluded by an
object serves to specify its distance from the observer. By (rather weak) analogy, "it is assumed that
characteristic changes of background sounds from different locations constitute an acoustic texture
gradient of depth". Four loudspeakers were positioned at 4 m increments from the listener, whose
task it was to identify the position from which one of three sounds (truck, dog or ducks) was
presented. In a "texture" condition a recording of singing birds was played in a random order from
each loudspeaker prior to presentation of the test stimulus.
The data revealed no significant effect of adding texture except at one distance for the truck
sound. A second experiment employed monaural recordings of stationary or moving cars at various
distances. These were presented over headphones with and without texture, and listeners were asked
to report the apparent distance to the car. Texture had no effect for the stationary car, but slightly
improved a tendency to underestimate distance in the moving car condition. This bias did not exist
for the stationary car, which is puzzling since moving stimuli contain dynamic Doppler shift and
intensity cues to distance, and hence judgements might be expected to be more accurate.
It is clear that acoustic texture cannot specify distance in the way that visual texture is held
to do. In the visual case, occlusion is essential and this does not exist in the auditory case. In
Section 6, I argue that attempts of this sort to apply the principles of visual ecological theory directly
to the auditory realm are ill-advised.
6 Prospects for an Ecological account
My third aim in this paper is to have a critical look at the current status of Ecological accounts of
audition in order to assess their successes and shortcomings. Since Gibson's approach is so deeply
rooted in vision, the first step taken is to examine the differences between the auditory and visual
ecologies. Following this, I discuss the problem of the superposition of acoustic signals, of acoustic
specificity, and of auditory affordances.
6.1 Auditory ecology
A theory of audition (whether Establishment or Ecological in style) must take account of the
particulars of acoustic ecology. The manner in which sound is usefully structured by the world
differs greatly from the way light is, and therefore auditory systems (and auditory theories) are faced
with many challenges dissimilar to those found in vision. Although many differences can be listed,
I contend that the root cause is the fact that audible sound has very long wavelengths in comparison
to those of light. The range of human hearing covers wavelengths from 10 m to 2 cm, while visual
sensitivity spans wavelengths from approximately 400 to 700 nm. Light and sound are both
wave phenomena, but their differing scales mean that the manner in which they interact with the
same objects in the world is dissimilar.
The first consequence of wavelength is that there can be no "acoustic retina". To achieve
the same spatial resolving power as the eye, an acoustic lens would need a diameter of approximately
200 m for the highest frequencies and 100 km for the lowest. The transduction of sound is therefore
non-directional; sound impinging upon the listener from any direction is "projected" to a single point
- the eardrum. There is no geometrical preservation of space or place-to-place mapping from the
world to a receptor surface as there is in vision. The challenge facing the visual system is often
stated in the form of the inverse projection problem. Given a 2-dimensional retinal projection, there
are infinitely many 3-D surface layouts which could have produced it. Clearly the problem is even
worse in audition since the projection is from three dimensions to a 0-dimensional point. The
situation is ameliorated somewhat by the facts that we possess two ears and that sound travels
relatively slowly, allowing interaural time differences to specify one component of source direction.
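The aperture figures quoted above can be reproduced with a simple scaling argument. This is a sketch under stated assumptions: diffraction-limited angular resolution is taken to scale as wavelength divided by aperture, the pupil diameter is assumed to be about 5 mm, and a representative light wavelength of 500 nm is assumed. None of these values is given in the text; they are chosen for illustration.

```python
def matched_aperture(eye_aperture_m, light_wavelength_m, sound_wavelength_m):
    """Aperture giving sound the same wavelength-to-aperture ratio as the eye has for light."""
    return eye_aperture_m * sound_wavelength_m / light_wavelength_m

EYE_APERTURE_M = 5e-3        # assumed pupil diameter
LIGHT_WAVELENGTH_M = 500e-9  # assumed mid-visible wavelength

# Shortest and longest audible wavelengths as quoted in the text (2 cm and 10 m).
lens_for_highest = matched_aperture(EYE_APERTURE_M, LIGHT_WAVELENGTH_M, 0.02)
lens_for_lowest = matched_aperture(EYE_APERTURE_M, LIGHT_WAVELENGTH_M, 10.0)
print(f"{lens_for_highest:.0f} m, {lens_for_lowest / 1000:.0f} km")
```

With these assumed values the required diameters are roughly 200 m and 100 km, which matches the orders of magnitude stated above.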
A second consequence of sound's large wavelengths is that sound-emitting objects, unless
they are very large, do not occlude other sources in the way that visual objects do. Diffraction permits
sound to sweep past objects and to propagate around corners, and thus occlusion cannot provide
information about the relative distances of interposed objects. Auditory masking is sometimes
compared to visual occlusion, but the processes are really quite different. An intense sound will
mask other sounds independent of their direction of origin, and there is no way to "listen around" a
masker. It is a cotemporal process rather than a codirectional one.
The combination of 3-D to 0-D projection and the lack of occlusion means that the auditory
system is faced with determining the spatial positions and character of sources whose sounds
are superimposed at the receptor. There is no independent access to sounds from different
directions or at different distances, and the information from all concurrently active sources must
pass through a single channel. Somehow this information gives rise to percepts of individual
sources.
The situation is further confounded by a third consequence of wavelength, which is that
sound reflection is specular and maintains the important temporal structure of the original source
signal. It is generally specular because sound-reflecting surfaces are much smoother at the
wavelength scale than are the same surfaces when reflecting light. Frequently there is little to
distinguish an echo from an additional source, and these reflections are themselves superimposed
on the signal at the eardrum.
Because of sound's long wavelength and our lack of acoustic retinae, the information
contained in sound reflected from an object is rather low-resolution. Humans, unlike bats, rely
primarily on the sound-emitting properties of objects rather than their sound-reflecting properties.
Bregman vividly sums up the situation this way:
"This way of using sound has the effect of making acoustic events transparent; they
do not occlude energy from what lies behind them. The auditory world is like the
visual world would be if all objects were very, very transparent and glowed in
sputters and starts by their own light, as well as reflecting the light of their neighbors.
This would be a very hard world for the visual system to deal with." (Bregman 1990)
Helmholtz also addresses the problem of superposition in his discussion of the separation of systems
of ripples on the surface of a body of water:
"But the ear is much more unfavorably situated in relation to a system of waves of
sound, than the eye for a system of waves of water. The ear is affected only by the
motion of that mass of air which happens to be in the immediate neighborhood of its
tympanum within the aural passage. ... The ear is therefore in nearly the same
condition as the eye would be if it looked at one point of the water through a long
narrow tube, which would permit of its seeing its rising and falling, and were then
required to undertake an analysis of the compound waves. It is easily seen that the
eye would, in most cases, completely fail in the solution of such a problem. The ear
is not in a condition to discover how the air is moving at distant spots, whether the
waves which strike it are spherical or plane, whether they interlock in one or more
circles, or in what direction they are advancing. The circumstances on which the eye
chiefly depends in forming a judgement, are all absent for the ear.
If, then, notwithstanding all these difficulties, the ear is capable of
distinguishing musical tones arising from different sources - and it really shews a
marvelous readiness in so doing - it must employ means and possess properties
altogether different from those employed or possessed by the eye." (Helmholtz 1877)
Thus, the auditory system relies mainly on different sorts of structures in stimulation than the visual
system - temporal ones rather than spatial. The acoustic signal therefore supplies information about
very different properties of objects than does light, and this potentially leads to a further source of
ambiguity in the stimulus. I will discuss the problem of acoustic specificity shortly, but first wish
to examine the significance of the superposition problem for an Ecological theory of hearing.
6.2 Superposition
A central tenet of the Ecological approach is the idea that what count as stimuli should be broadened
with respect to the traditional view. Thus the stimulus in vision is taken to be the optic array, rather
than the retinal image. There is no reason why an acoustic array could not be defined to give spectral
content as a function of time and direction over a sphere centered on the listener. However it is not
clear that defining the stimulus in this way is of much use since there is no directional access to this
array prior to transduction. One cannot sample the acoustic array in the same sense that the visual
system can sample the optic array. Directional information such as binaural difference cues and
direction-dependent pinna filtering might be held sufficient to define unambiguously the location
of a source, but these are properties of sounds corresponding to individual sources and not of the
complex superposition of signals at the eardrum.
In general, superposition seems to be an unaddressed and difficult problem for direct realism.
Proposed invariants such as the acoustic tau and Warren and Verbrugge's bounce-specifying
temporal patterns are properties of individual sources or events. If a listener is presented with a
stationary source and a looming one, the acoustic tau of the overall signal does not specify time-to-contact. The
sources must be separated so that only those components belonging to the moving source are
subjected to the looming "computation". Again, suppose a bouncing event is heard simultaneously
with babble from a group of speakers - the stimulus as a whole will not take the form of a quasi-periodic pulse train.
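The looming case can be illustrated numerically. The sketch assumes free-field inverse-square intensities, that the intensities of independent sources simply add at the listener, and the form tau = 2I/(dI/dt). The scenario values and function names are hypothetical.

```python
import math

def intensity(power, distance):
    """Inverse-square intensity at the listener for a point source in a free field."""
    return power / (4.0 * math.pi * distance ** 2)

def tau_of_signal(intensity_fn, t, dt=1e-4):
    """Compute 2*I / (dI/dt) for whatever intensity signal reaches the receptor."""
    di_dt = (intensity_fn(t + dt) - intensity_fn(t - dt)) / (2.0 * dt)
    return 2.0 * intensity_fn(t) / di_dt

def looming(t):
    # Source starting 30 m away and approaching at 10 m/s.
    return intensity(1.0, 30.0 - 10.0 * t)

def stationary(t):
    # Equally powerful source fixed 5 m from the listener.
    return intensity(1.0, 5.0)

def mixture(t):
    # Assumed incoherent superposition: intensities add at the eardrum.
    return looming(t) + stationary(t)

tau_alone = tau_of_signal(looming, 1.0)
tau_mixed = tau_of_signal(mixture, 1.0)
print(tau_alone, tau_mixed)
```

In this example the moving source alone yields about 2 s, the true time-to-contact, while the same computation applied to the mixture yields about 34 s. The stationary source inflates the intensity term without contributing to its rate of change, so the variable is informative only once the moving source has been separated from the rest.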
As mentioned previously, Gibson (1966) suggests that orienting to a sound source
synchronizes the arrivals at the two ears, but separation is also possible with monaural listening and
with diffuse, unlocalizable sources. It is hard to imagine how separation occurs without something
like the segregation and fusion processes proposed by Yost and Bregman, but these seem to operate
heuristically and to impose a structure on the stimulation. Ultimately the percepts derived seem to
owe as much to the processes of separation as to the sound-structuring properties of the environment.
This is not the sort of explanation proponents of Ecological perception have in mind, but no serious
alternative has been proposed.
A point to note is that in the domains where the Ecological approach has been most
successfully applied, vision and haptics, the superposition problem does not exist. Only one object
can be in contact with the skin at any point, and only light from the nearest surface in a particular
direction contributes to the optic array. The fact that source separation is less problematic in these
modalities perhaps explains why it has not been dealt with in Ecological accounts of audition.
6.3 Specification
Setting aside the issue of superposition, let us consider specification in the auditory domain. For a
direct account to succeed, detectable properties of the acoustic signal must stand in a one-to-one
relationship with the perceived properties of sound sources. The source-to-sound mapping is clearly
unique, but, even for a single source in a noise-free environment, can we be sure that the reverse
mapping is also unique? Can the inverse problem of recovering the causal source of the acoustic
signal be solved?
For a number of simple sound-producing systems it seems that it cannot. First consider the
Helmholtz resonator, which consists of a vessel enclosing a volume of air with a neck containing a
"plug" of air. The resonant frequency of such a device depends only on the mass of air in the plug
and on the volume of air in the main chamber. Vessels of many shapes and sizes can produce the
same sound, and therefore these parameters cannot be specified. Similarly, the frequency of
vibration of a stretched string depends on its length, mass, and tension. Thus length, for example,
cannot be specified since a change in length can always be compensated by appropriate adjustments
in tension or mass.
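Both ambiguities can be made concrete with the standard textbook formulas. The sketch assumes the idealized Helmholtz resonator frequency f = (c / 2π) · sqrt(A / (V · L)), with neck area A, neck length L, and chamber volume V and no end correction, and the ideal string fundamental f = (1 / 2L) · sqrt(T / μ). All dimensions are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def helmholtz_frequency(neck_area, neck_length, volume):
    """Idealized Helmholtz resonance in Hz, ignoring end corrections."""
    return SPEED_OF_SOUND / (2.0 * math.pi) * math.sqrt(neck_area / (volume * neck_length))

def string_fundamental(length, tension, mass_per_length):
    """Fundamental frequency in Hz of an ideal stretched string."""
    return math.sqrt(tension / mass_per_length) / (2.0 * length)

# Two hypothetical vessels with different volumes and neck geometries.
small_vessel = helmholtz_frequency(neck_area=3e-4, neck_length=0.05, volume=1e-3)
large_vessel = helmholtz_frequency(neck_area=6e-4, neck_length=0.025, volume=4e-3)

# Doubling the length of a string is compensated by quadrupling its tension.
short_string = string_fundamental(length=0.65, tension=60.0, mass_per_length=1e-3)
long_string = string_fundamental(length=1.30, tension=240.0, mass_per_length=1e-3)

print(math.isclose(small_vessel, large_vessel, rel_tol=1e-9))
print(math.isclose(short_string, long_string, rel_tol=1e-9))
```

In each pair the two systems resonate at the same frequency, so the frequency alone cannot tell a listener which vessel or which string produced it.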
The 2-dimensional counterpart of the string, the stretched membrane or drum, also suffers
from this same ambiguity. While the frequencies of various modes of vibration provide information
about the area of the membrane and the length of its perimeter, it has been proven that drums of
different shapes can vibrate with exactly the same set of frequencies when struck (Cipra 1992,
Driscoll 1995). Hence "one cannot hear the shape of a drum" (Gordon et al. 1992). Finally, it has
been demonstrated that identical vowel spectra can be produced by the human vocal tract in very
different configurations (Ladefoged et al. 1978), and in principle a given set of formants can be
produced by a variety of vocal tract area functions. This is a problem for those who maintain that
speech is perceived on the basis of articulator position recovery.
No amount of sampling or scanning of the acoustic array can resolve the ambiguities, so it
must be assumed that these particular sound-specifying parameters are not specified in the sound
produced. Perhaps these are merely overly-simplified systems, which Gaver might group with
musical sounds;
"Musical sounds are not representative of the range of sounds we normally hear. ...
Musical sounds seem to reveal little about their sources, whereas everyday sounds
provide a great deal of information about theirs." (Gaver 1993a)
Fowler (1991) states that a claim that we "hear the world" is not a claim that we hear every property
of the world or that every different thing is perceived differently, but claims that specificity exists
nearly always "for relevant properties of objects and events with which we interact". This
assertion seems vaguely circular, since it would be rather lucky for us to live in a world where no
relevant properties of objects are unspecified. It is clear that there are properties which cannot be
specified acoustically - whether these are relevant or not is a matter for debate. There are also
properties which do seem to be specifiable. For example the elasticity of a vibrating material is
indicated by the decay rate of vibration when it is struck (Wildes & Richards 1988). Fowler also
refers to the rareness of "mirages" outside of the laboratory, but in addition to the sound homophones
described by Ballas, one can think of more natural examples. Gibson mentions that thunder 'cracks',
but tree branches also 'crack', and the physical causes are quite dissimilar.
Given that some properties of the systems discussed cannot be specified, it is necessary either
to suggest means of resolving ambiguity, to refine the idea of what it means to recover the source,
or to abandon the inverse problem altogether (Kluender 1991). For an Ecological account, supplying
the perceiver with knowledge of the constraints of the system is not an option. For example if one
knew the possible configurations for a human vocal tract it might be of assistance in recovering
articulator positions, although in modeling this appears to be difficult even with careful X-ray
measurements of one individual speaker (Bailly et al. 1991). Fowler's move to block "premature
allegations of lack of specificity in acoustic speech signals" (1991) is the proposal that in running
speech the situation is different. The requirement that the current configuration must be smoothly
connected to those before and after may constrain the problem enough to yield a unique solution.
A different view is held by Kluender (1991) who refers to work on visual structure-from-motion in
maintaining that once rigidity is given up (and he claims it must) all bets are off in solving the
inverse problem. Gaver (1993b) recognizes the limits of specification and suggests that what are
specified are constraints on solutions to the inverse problem.
Coupling this with Fowler's position that we do not hear everything, the question seems to
be what do we hear? With how fine a brush is the auditory world painted, and can the answer to that
question be accounted for by the information available in acoustic stimulation? Answering these
questions is of course rather difficult since even in free identification tasks it is impossible for
listeners to describe every aspect of their percepts. Discrimination studies leave open the question
of whether responses are based on recovery of source properties or simply on differences in the
acoustic signals. Studies in which subjects are asked to detect source-properties often limit the
domain of responses, and thus do not speak directly to the specificity question. Examples of the
latter are studies of the perception of breaking and bouncing (Warren and Verbrugge 1984), hand-clapping (Repp 1987), and mallet hardness (Freed 1990).
The foregoing comments mainly concern the specification of shape and vibrational properties
of sound emitters, but questions of specificity also exist in determining the spatial layout of sources
and the environment in which they are active. In determining the direction from which sound is
arriving the auditory system, absent head movements, depends on binaural difference information
and the directional filtering performed by the pinnae. The spectrum of the sound reaching the
eardrum is ambiguous with respect to this spectral cue because the contributions of the source
spectrum and the pinna filtering are not separately available. Yet listeners can localize sounds
without employing head movements. The explanation for this achievement has traditionally been
that listeners employ a priori knowledge of the source spectrum to recover the pinna filter function
and to identify the source position, although this idea has not been tested rigorously.
The sound field at a listener's position is structured not only by the sources of sound but also
by the layout of reflecting surfaces in the environment, and it is often proposed that as a consequence
of this we can hear the location of these surfaces. A previously-mentioned example was that "we
can hear the narrow, echoing walls of the alley it [a car] is driving along" (Gaver 1993a). We can
obviously tell the difference between the interior of a cathedral and a coat closet, but how much
information for the layout of surfaces is actually present? Consider two properties of a sound field
which depend on the characteristics of the enclosure in which events occur: the reverberation time
and the direction of arrival of reflections. It has been shown that reverberation time of a room is
directly proportional to its volume and inversely proportional to the surface area of the walls and
their absorptivity (Morse and Ingard 1968). Therefore the shape of the room cannot be specified by
this parameter. Secondly, in an experiment using synthesized stimuli carried out in our laboratory
(HDRL, Waisman Center) we found that subjects were unable to discriminate between cases in
which wall reflections accurately duplicated those in a rectangular room and those in which the
reflections came from arbitrary directions with the same distribution of time delays. In this case the
auditory system was not sensitive to the locations of the walls, but only to their distances relative to
the listener and the source. Thus only "fuzzy" information about the layout of surfaces in the
environment seems to be present in the acoustic array.
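The point about reverberation time can be illustrated with Sabine's formula, RT60 ≈ 0.161 · V / (S · α) in metric units, which I take to be the relation referred to above. The sketch assumes rectangular rooms with a single average absorption coefficient; the room dimensions are hypothetical.

```python
import math

def sabine_rt60(width, depth, height, absorption):
    """Sabine reverberation time in seconds for a rectangular room (metres)."""
    volume = width * depth * height
    surface = 2.0 * (width * depth + depth * height + width * height)
    return 0.161 * volume / (surface * absorption)

# A cube and an elongated hall, both with the same assumed average absorption.
cube_room = sabine_rt60(6.0, 6.0, 6.0, absorption=0.2)
long_hall = sabine_rt60(4.0, 6.0, 12.0, absorption=0.2)

print(cube_room, long_hall)
print(math.isclose(cube_room, long_hall, rel_tol=1e-9))
```

The two rooms differ in both shape and volume, yet their volume-to-surface ratios coincide and so do their reverberation times. Under this formula the parameter constrains only that ratio, weighted by absorption, and not the layout of the walls.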
The final observations I will make about specificity concern the auditory perception of
distance. In an Establishment analysis, the auditory system is faced with an inverse projection
problem exactly analogous to that in vision. An image projected on the retina could arise from an
object at any distance if its size is chosen appropriately, and (in an open space without reflecting
walls) a sound of given proximal intensity could be caused by a source at any distance given the
appropriate sound level. Another feature which varies with distance, the absorption of high
frequencies, is ambiguous in the same way as the pinna filtering cue to direction. Gibson's solution
to the visual problem is to point out that objects are not encountered floating in a featureless void,
but that they generally appear against some sort of textured background (Gibson 1979). The
amount of texture surrounding and occluded by the object is held to specify its distance and size, but
a similar invariant cannot exist in audition because there is generally no such thing as acoustic
occlusion.
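The intensity half of this ambiguity amounts to a line of arithmetic. The sketch assumes free-field spreading from a point source, so that level falls by 20 · log10(r) dB relative to the level at 1 m. The proximal level and the distances are hypothetical.

```python
import math

def level_at_listener(level_at_1m_db, distance_m):
    """Free-field level at the listener; about 6 dB is lost per doubling of distance."""
    return level_at_1m_db - 20.0 * math.log10(distance_m)

def source_level_needed(proximal_db, distance_m):
    """Source level (referred to 1 m) that yields the given proximal level at this distance."""
    return proximal_db + 20.0 * math.log10(distance_m)

PROXIMAL_DB = 60.0  # hypothetical level measured at the listener's position

# Every candidate distance has some source level that reproduces the same proximal level.
candidates = {d: source_level_needed(PROXIMAL_DB, d) for d in (1.0, 2.0, 10.0, 50.0)}
for distance, level in candidates.items():
    print(f"{distance:5.1f} m: source level {level:5.1f} dB gives "
          f"{level_at_listener(level, distance):.1f} dB at the listener")
```

Each distance and source level pair produces the same proximal level, which is the auditory counterpart of the size-distance ambiguity in the retinal image.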
Höger (1993) attempted to devise an auditory counterpart to Gibson's surface texture, but
found little improvement in the accuracy of distance judgements when "texture" was added. I feel
that this experiment is an example of a tendency which, at worst, leads to the assumption that
principles derived for Ecological optics apply equally well in the auditory modality, and at best to
the production of rather strained analogies.
6.4 Auditory affordances
Although not intrinsic to direct perception, affordances are an important constituent of the Ecological
approach. While one might envision an account in which just spatial layout itself is perceived
without mediation, Gibsonians emphasize the ecological significance of certain configurations. This
is necessary since their reconceptualization of perception ties it intimately to action. One can find
various definitions of affordances in the literature, some more straightforward than others:
"The affordances of the environment are what it offers the animal, what it provides
or furnishes for good or ill. The verb to afford is found in the dictionary, but the
noun affordance is not. I have made it up. I mean by it something that refers to both
the environment and the animal in a way which no existing term does. It implies the
complementarity of the animal and the environment." (Gibson 1979)
"Affordances are the acts or behaviors permitted by objects, places, and events."
(Michaels and Carello 1981)
"A propertied thing X ... affords an activity Y ... for a propertied thing Z ... if and
only if certain properties of X ... are dually complemented by certain properties of Z,
where dual complementation of properties translates approximately as properties that
are related by a symmetrical transformation or duality T such that: T(P1) → P2 and
T(P2) → P1." (Turvey et al. 1981)
Although Ecological theorists define affordances in rather general terms, those which are commonly
introduced to explain the idea tend to be of a particular type. For example we are given climbability,
grabability, crawl-intoability (Turvey et al. 1981), sit-onability, and drink-fromability (Michaels &
Carello 1981). These affordances are said to be the objects of perception specified by variables of
stimulation and as such they share one essential property. This is that the characteristics of surfaces
responsible for structuring the optic array (and thereby providing information for the affordances)
are the same characteristics which underlie their ecological significance. In other words, a group of
surfaces provides certain affordances by virtue of its shape, and it is its shape which structures the
information-bearing light. In fact the case for direct perception of affordances rests on this type of
specification and the additional assertion that there is a one-to-one mapping between layout and
variables of the optic array.
Whether or not affordances are directly perceived in vision, the situation in audition is clearly
somewhat different. In general the ecological significance of a sound-emitter need have little to do
with the means by which it produces sound, although there are certainly exceptions. A snake may
be identified as a threat by hearing its hiss, but it is threatening because it is a snake and not because
it lacks vocal cords. The ringing of a telephone has significance because a telephone is a
message-conveying device and not because it contains a brass bell or a buzzer. In both of these examples it
is perception-as which is important, and not perception-of. Of course counter-examples are also
available; a woodpecker may detect hollows in a tree trunk by tapping and an organism may judge
the approximate size of an enclosure by variables related to reverberation.
The affordances of objects are generally related to their shapes. These may be specified by
light, but need not be specified by sound. Sounds are signifiers as well as specifiers. The sorts of
characteristics which comprise affordances are not always the sorts of things which can be specified
in the acoustic array. Michaels and Carello (1981) state that "to detect affordances is, quite simply,
to detect meaning", but it seems clear that meaning can be detected without affordances as they are
typically construed. If one accepts this, it would seem that we have found an instance of one of
Mace's "five ways to have a theory of indirect perception" (Mace 1977) because meaning is not
specified without an additional step of identification or recognition. Mace points out that a direct
theory of perception must be an Ecological one, although an ecological theory need not be direct
(Fowler 1990). Unless meaning itself, in other words affordance, is specified and picked up,
mediation is required to interface perception with the psychological systems controlling action.
The difficulty of translating the standard concept of affordance into the auditory domain is
reflected in the infrequency with which writers on ecological acoustics use the term. For example
it is not mentioned by Jenkins (1985), Fowler (1990, 1991), or Gaver (1993a, 1993b). When authors
do refer to affordances in an auditory context they seem to do so with some carelessness, or in ways
which distress strict adherents of Gibson's program. For example in Handel's monograph, Listening
(1989), he claims to take an approach inspired in part by Gibson, but makes the following statement
about sound source identification:
"At a third level, we hear objects. [The first two levels being physical features of
sounds and more abstract timbral qualities.] I am thinking of 'violinness', 'President
Carterness', 'President Reaganness', and 'airplaneness'. What is characteristic is that
the sounds seem directly perceived as objects. Gibson (1979) has used the term
affordances."
for which he is rightly taken to task by Heine & Guski (1991). Affordances are not objects; they are
what objects afford.
Other explicit examples are few in number. Gibson (1966) mentions that sound sources
afford orientation and localization, meaning that an organism can establish its position and heading
in space relative to a sound-emitter. Note however that this affordance is related to spatial layout
and not to the properties of the object which determine what sort of sound it produces. Michaels &
Carello (1981) briefly discuss the complementarity of perception and action in the context of an
articulatory basis for speech perception, but it is not clear that this really addresses the issue of
affordances. While it is an indispensable part of an Ecological account, the theory of affordances
seems to be one which has yet to be seriously addressed by Ecological acousticians.
7 Conclusions & Speculations
In this paper I have attempted to review some theoretical writings on the nature of auditory
perception, to examine the sorts of experimentation on distally-focused perception carried out by
Establishment and Ecological researchers, and to look critically at the state and prospects of Ecological
acoustics.
In general it seems that despite the emphasis on vision in perceptual theorizing, theories of
audition have paralleled those of vision. Advocates of perception-as and perception-of accounts do
not seem to differentiate between the perceptual systems. Those who propose unconscious inference,
association, or representational transformations as accounts of visual perception do so also in the
case of hearing. Those who maintain that perception is direct and unmediated and that the world is
specified by stimulation apply their analysis with equal conviction to both modalities. These
consistencies are evidence of the desire to develop a theory of perception, either mediated or direct,
in which all the modalities are governed by the same principles.
To date, attempts to devise an Ecological account of audition seem to suffer from three
shortcomings. First, the proposed source-specifying invariants of sound are not in general invariants
of the effective stimulus - the eardrum signal - in which sounds from many sources are
superimposed. Therefore an account of direct source separation is required. Second, many sound-structuring properties of objects cannot be specified uniquely in the acoustic signal. Therefore it is
important to give an account of what it is that is directly perceived. Finally, no serious attempt has
been made to define auditory affordances, and without them a theory of perception-for-action is a
step short of direct.
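
The first of these shortcomings can be illustrated with a small numerical sketch. It assumes, purely for illustration, that each source is a sampled sinusoid and that the eardrum signal is the sample-by-sample sum of the source waveforms; the sample rate, frequencies, and function names are hypothetical and do not come from any of the studies reviewed here. The sketch shows that two different pairs of sources can produce the same summed signal, so a property of one source need not survive as a property of the sum.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for illustration)


def tone(freq_hz, amplitude, n_samples):
    """Return a sampled sinusoid standing in for one sound source."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]


def superimpose(*sources):
    """Eardrum signal: the sample-by-sample sum of all source waveforms."""
    return [sum(samples) for samples in zip(*sources)]


n = 800

# Scenario 1: one source emits a 200 Hz tone, the other a 310 Hz tone.
a1 = tone(200, 1.0, n)
b1 = tone(310, 1.0, n)

# Scenario 2: two different sources, each emitting a blend of both tones.
a2 = superimpose(tone(200, 0.5, n), tone(310, 0.5, n))
b2 = superimpose(tone(200, 0.5, n), tone(310, 0.5, n))

eardrum_1 = superimpose(a1, b1)
eardrum_2 = superimpose(a2, b2)

# Largest sample-wise difference between the two eardrum signals.
max_diff = max(abs(x - y) for x, y in zip(eardrum_1, eardrum_2))

# Whether the first source differs between the two scenarios.
sources_differ = any(abs(x - y) > 1e-6 for x, y in zip(a1, a2))

print("max eardrum difference:", max_diff)
print("sources differ:", sources_differ)
```

The two scenarios are deliberately artificial. The point is only that summation is many-to-one, so an account of direct perception must say how source-specifying structure is picked up from the sum.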
I will conclude with some (perhaps ill-advised) speculations about the objects of auditory
perception and the difference between audition and vision. In a series of papers, Diehl & Kluender
and Fowler engage in a lively debate about what should properly be considered the objects of speech
perception (Diehl & Kluender 1989a, Fowler 1989, Diehl & Kluender 1989b, Fowler 1990, Diehl
et al. 1991, Fowler 1991). Fowler's position is that, directly or not, the auditory system attempts to
recover sound-producing events, and thus the objects of speech perception are articulatory. Diehl
& Kluender maintain that speech perception involves decoding auditory information, and that
effective communication does not require access to the vocal tract configurations of one's
conversants. Does the auditory system seek to satisfy the same goals as the visual in general, or are
there natural situations in which we hear sounds and not the properties of sound-emitters?
Gaver makes a distinction between musical listening and everyday listening, in which the
former involves attending to the timbre and other abstract auditory properties of a sound (not
necessarily music), and the latter involves hearing objects. The distinction is somewhat akin to the
difference between Brunswik's analytic and naive-realistic attitudes. The type of listening one
indulges in is to some extent under conscious control. Although everyday listening may be the
default, it is quite possible to listen musically to environmental sounds, and in fact contemporary
musical genres like musique concrète and the electroacoustic pieces of Alvin Lucier rely on this
ability. In addition, shifting one's focus from everyday to musical listening does not result in a
relocation of the percept to the ears of the listener. Thus the following criticism of Diehl &
Kluender's account is moot:
"In acoustic perception, Diehl et al. aver, however, stimulus structure in the air that
has been caused by an event is heard in itself. Why? And why does this allegedly
acoustic-signal perceiving system localize sound, not where the acoustic signal is (in
the ear), but where the acoustic-signal causing event is in the world?" (Fowler 1991).
It is simply a fact that this does not happen, whichever style of listening one happens to be involved
in. Nor does it happen in vision. An observer can compare the relative projective sizes or the colors
of objects without the location of the percepts jumping to the retina. Fowler mentions "nonsense
stimuli" such as sinewave analogs to speech, which do not contain enough information to specify
their causal sources, but these stimuli are localized in-the-world to the same degree that real speech
or any other sound is.
So, it is possible to avoid recovering a sound's causal source (or at least to ignore its
recovery), but in the absence of adjusting one's Brunswikian attitude, is it always the case that this
recovery takes place? It seems clear that the visual system attempts to recover light-structuring
properties of the world, that is, surface layout. We do see surfaces and objects, but, regardless of
identification, do we always hear materials and the events they are involved in? It is my feeling that
the answer depends on time scale. We seem to hear events such as bouncing and approach, which
structure sound at a scale of tenths of seconds to seconds, but it is not clear that we hear properties
which structure vibration at smaller time scales.
Consider the following examples. When we hear ventilation noise, do we really hear air
turbulence resonating in a duct? In perceiving speech, do we really hear the vocal cords vibrating?
When we listen to a door swinging shut and slamming do we really hear the "stiction" in the rusty
hinges and the vibration of the door, or just a squeak and a bang? When walking in the park do we
really hear the crickets' little legs rubbing away or just a curious buzz? In fact, the sounds of many
animals seem to pose this sort of problem; even Jenkins provides an example while extolling the
richness of acoustic information:
"From our backyard locale, my wife and I heard a remarkable burst of song - some
kind of warbler. At length we located a small bird on a high wire at the end of the
yard. Could it be that this tiny bird was the source of the song? We thought it
unlikely, but we were rapidly convinced by the synchrony of the bursts of song and
the movements of the bird." (Jenkins 1985)
Note that the sound source was identified as a warbler, but even this did not help to specify the size
of the bird.
Is there any a priori reason to think that audition is fundamentally different from vision in
this way? Assuming that these intuitions are correct, why should it be the case that we can hear
sounds in the environment without hearing vibration-structuring properties? A possible explanation
lies in the previously mentioned fact that vision is primarily directed at reflectors of energy, while
audition is primarily directed at sources of energy. What happens when we view sources of light
directly? It seems that the experience is of "a source of light of a particular color and intensity at a
particular location". While one can perhaps identify the spectrum-structuring properties of the
source (it's an LED, it's a sodium lamp etc.) the primary experience is of the radiant light itself.
Gibson makes the following remarks about radiant light:
"Is there any kind of information in radiant light? The answer must be yes, for the
spectrum of any radiant beam specifies vibrations in the atoms that emitted the
energy. The astronomer with a spectroscope can identify the substance of the star.
One could aim the instrument at a luminous object and determine whether it is
incandescent, fluorescent, bioluminescent, etc. But note that an eye cannot do this;
it cannot register the distribution of wavelengths and cannot measure their absolute
intensities. This is not the kind of information an eye can pick up. A single spot of
light in darkness conveys only a minimum of information to an eye." (Gibson 1966).
"Radiant light has no structure; ambient light has structure. Radiant light is
propagated; ambient light is not, it is simply there. Radiant light comes from atoms
and returns to atoms; ambient light depends on an environment of surfaces. Radiant
light is energy; ambient light can be information." (Gibson 1979).
The perception of radiant light is an exceptional case in Gibson's visual theory, but radiant sound is
the main stuff of audition. It seems somewhat perverse to hold that radiant light specifies atoms but
that atoms are not perceived while maintaining that radiant sound specifies vibration-structuring
properties of objects and that these can be perceived. If it is indeed the case that we can hear
temporally-extended events but that we sometimes hear only sounds (while still being able to
identify their sources), perhaps Schubert's concepts of Source Identification and Source Behavior
Recognition can serve as a model for a uniquely auditory theory of perception.
Whether or not the foregoing comments are convincing, it is clear that transferring any
perceptual theory wholesale from one modality to another can be problematic. The ecology of
audition poses unique challenges which must be taken seriously by theorists of any stripe.
References
Bailley G, Laboissière R, Schwartz JL (1991): A model of coarticulation based on connectionist
sequential networks: can we recover articulatory movements from acoustics. Conference on Current
Phonetic Research Paradigms: Implications for Speech Motor Control. Stockholm, Sweden, August
1991. (cited in Kluender 1991)
Ballas JA, Howard Jr. JH (1987): Interpreting the language of environmental sounds. Environment
and Behavior 19(1):91-114.
Ballesteros S (ed) (1994): Cognitive approaches to human perception. Lawrence Erlbaum Associates.
Boring EG (1942): Sensation and perception in the history of experimental psychology. Appleton-Century.
Bregman AS (1981): Asking the "what for" question in auditory perception. In Perceptual
Organization, ed Kubovy M & Pomerantz JR. Lawrence Erlbaum Associates.
Bregman AS (1990): Auditory scene analysis. MIT Press.
Bregman AS, Campbell J (1971): Primary auditory stream segregation and perception of order in
rapid sequences of tones. J.Exp.Psych. 89:244-249.
Bruce V, Green PR (1990): Visual perception: Physiology, psychology and ecology. Lawrence
Erlbaum Associates.
Cipra B (1992): You can't hear the shape of a drum. Science 255:1642-1643.
Dowling JW, Lung KM, Herrbold S (1987): Aiming attention in pitch and time in the perception of
interleaved melodies. Perception & Psychophysics 41(6):642-656.
Diehl RL, Kluender KR (1989a): On the objects of speech perception. Eco.Psych. 1(2):121-144.
Diehl RL, Kluender KR (1989b): Reply to commentators. Eco.Psych. 1(2):195-225.
Diehl RL, Walsh MA, Kluender KR (1991): On the interpretability of speech/nonspeech
comparisons: A reply to Fowler. J.Acoust.Soc.Am. 89(6):2905-2909.
Driscol A (1995): Eigenmodes of isospectral drums. World Wide Web document. URL:
http://cam.cornell.edu/~driscol/research/drums.html.
Ellis D (1995): Hard problems in computational auditory scene analysis. World Wide Web
document. URL: http://sound.media.mit.edu/~dpwe/writing/hard-probs-1995jul09.html.
Fodor JA (1975): The language of thought. Harvard University Press.
Fodor J, Pylyshyn Z (1981): How direct is visual perception? Some reflections on Gibson's
'Ecological Approach'. Cognition 9:139-196.
Fowler CA (1989): Real objects of speech perception: A commentary on Diehl and Kluender.
Eco.Psych. 1(2):145-160.
Fowler CA (1990): Sound-producing sources as objects of perception: Rate normalization and
nonspeech perception. J.Acoust.Soc.Am. 88(3):1236-1249.
Fowler CA (1991): Auditory perception is not special: We see the world, we feel the world, we hear
the world. J.Acoust.Soc.Am. 89(6):2910-2915.
Freed D (1990): Auditory correlates of perceived mallet hardness for a set of recorded percussive
sound events. J.Acoust.Soc.Am. 87:311-322.
Gardner MB (1969): Distance estimation of 0° or apparent 0°-oriented speech signals in anechoic
space. J.Acoust.Soc.Am. 45:47-53.
Gaver WW (1993a): What in the world do we hear? An Ecological approach to auditory event
perception. Eco.Psych. 5(1):1-29.
Gaver WW (1993b): How do we hear the world?: Explorations in ecological acoustics. Eco.Psych.
5(4):285-313.
Gibson JJ (1966): The senses considered as perceptual systems. Houghton Mifflin.
Gibson JJ (1979): The ecological approach to visual perception. Houghton Mifflin.
Gordon C, Webb D, Wolpert S (1992): One cannot hear the shape of a drum. Bull.Am.Math.Soc.
27:134-138.
Green DM, Swets JA (1966): Signal detection theory and psychophysics. Wiley.
Gregory RL (1993): Seeing and thinking. Italian J.Psych. 20:749-769.
Guski R (1992): Acoustic tau: An easy analogue to visual tau? Eco.Psych. 4(3):189-197.
Handel S (1989): Listening: An introduction to the perception of auditory events. MIT Press.
Hatfield G (1990): Gibsonian representations and connectionist symbol processing: Prospects for
unification. Psych.Rev. 52:243-252.
Heine WD, Guski R (1991): Listening: The perception of auditory events? An essay review of
Listening: an introduction to the perception of auditory events. by Stephen Handel. Eco.Psych.
3(3):263-275.
Heine WD, Guski R (1993): Using auditory information for active contact with sound sources
moving rectilinearly with respect to a listener. In Contributions to psychological acoustics: Results
of the 6th Oldenburg Symposium on Psychological Acoustics, ed. Schick A. 349-359.
Heine WD, Guski R, Pittenger JB (1993): Perceiving numbers of steel balls by audition. In
Contributions to psychological acoustics: Results of the 6th Oldenburg Symposium on Psychological
Acoustics, ed. Schick A. 361-371.
Helmholtz H von (1867/1925): Physiological optics. Vol. 3. Optical Society of America.
Helmholtz H von (1877/1954): On the sensations of tone. Dover.
Hochberg J (1994): Perceptual theory and visual cognition. In Cognitive approaches to human perception.
ed. Ballesteros S. Lawrence Erlbaum Associates. 269-289.
Höger R (1993): Acoustic texture in distance perception. In Contributions to psychological
acoustics: Results of the 6th Oldenburg Symposium on Psychological Acoustics, ed. Schick A. 337-348.
Jackendoff R (1987): Consciousness and the computational mind. MIT Press.
James W (1890/1950): The principles of psychology Vol.2. Dover.
Jenison RL (1994): On acoustic information for auditory motion. Perception. (in press?).
Jenkins JJ (1985): Acoustic information for objects, places and events. In Persistence and change:
Proc. 1st Internat. Conf. on Event Perception, eds. Warren W, Shaw R. Lawrence Erlbaum
Associates. 115-138.
Johansson G (1985): About visual event perception. In Persistence and change: Proc. 1st Internat.
Conf. on Event Perception, eds. Warren W, Shaw R. Lawrence Erlbaum Associates. 29-54.
Kluender KR (1991): Psychoacoustic complementarity and the dynamics of speech perception and
production. Perilus XIV:131-136.
Kluender KR, Jenison RL (1992): Effects of glide slope, noise intensity, and noise duration on the
extrapolation of FM glides through noise. Perception & Psychophysics 51(3):231-238.
Ladefoged P, Harshmann R, Goldstein L, Rice L (1978): Generating vocal tract shapes from formant
frequencies. J.Acoust.Soc.Am. 64:1027-1035.
Lerdahl F & Jackendoff R (1983): A generative theory of tonal music. MIT Press.
Licklider JCR (1959): Three auditory theories. In Psychology: A study of a science, ed S. Koch.
McGraw-Hill.
Lombardo TJ (1987): The reciprocity of perceiver and environment: The evolution of James J.
Gibson's ecological psychology. Lawrence Erlbaum Associates, Hillsdale NJ.
Lyon RF (1983): Binaural localization and source separation. Proc. ICASSP 83:1148-1151.
(reprinted in Richards 1988)
Mace WM (1977): James J. Gibson's strategy for perceiving: Ask not what's inside your head, but
what your head's inside of. In Perceiving, acting, and knowing: Towards an ecological psychology.
ed Shaw R, Bransford J. Lawrence Erlbaum Associates.
Marr D (1982): Vision. Freeman.
Michaels CF & Carello C (1981): Direct Perception. Prentice-Hall.
Mohrmann K (1939): Lautheitskonstanz im Entfernungswechsel. Z. Psychol. 145:146-199. (cited
in Postman & Tolman 1959).
Morse PM, Ingard KU (1968): Theoretical acoustics. Princeton University Press.
Nunn D (1995): Pictures of some research issues. World Wide Web document. URL:
http://capella.dur.ac.uk/doug/pictures.html.
Pickles JO (1988): An introduction to the physiology of hearing. Academic Press.
Postman L & Tolman EC (1959): Brunswik's probabilistic functionalism. In Psychology: A study
of a science. ed. Koch S. McGraw-Hill. 502-564.
Pylyshyn ZW (1984): Computation and cognition. MIT Press.
Reisberg D (ed) (1992): Auditory imagery. Lawrence Erlbaum Associates.
Repp BH (1987): The sound of two hands clapping: an exploratory study. J.Acoust.Soc.Am.
81(4):1100-1109.
Richards W (ed) (1988): Natural computation. MIT Press.
Rock I (1980): Difficulties with a theory of direct perception. Behavioral and Brain Sciences 3:398-399. (Commentary on Ullman 1980).
Rock I (1983): The logic of perception. MIT Press.
Rosenblum LD (1993): Acoustical information for controlled collisions. In Contributions to
psychological acoustics: Results of the 6th Oldenburg Symposium on Psychological Acoustics, ed.
Schick A. 303-322.
Rosenblum LD, Carello C, Pastore RE (1987): Relative effectiveness of three stimulus variables for
locating a moving sound source. Perception 16:175-186.
Schubert ED (1974): The role of auditory perception in language processing. In Reading, perception
and language. eds Duane DD, Rawson MB. York Press, Baltimore.
Searle CJ (1982): Representing acoustic information. Can.J.Psych. 36:402-419. (reprinted in
Richards 1988)
Searle JR (1992): The rediscovery of the mind. MIT Press.
Shaw BK, McGowan RS, Turvey MT (1991): An acoustic variable specifying time-to-contact.
Eco.Psych. 3(3):253-261.
Shepard RN (1990): Mind Sights. Freeman.
Sloman A (1989): On designing a visual system: Towards a Gibsonian computational model of
vision. J. Experimental & Theoretical Artificial Intelligence 1:289-337.
Stellmack MA (1994): The reduction of binaural interference by the temporal nonoverlap of
components. J.Acoust.Soc.Am. 96(3):1465-1470.
Strutt JW (1907): On our perception of sound direction. Philosophical Magazine 13:214-232.
Turvey MT, Shaw RE, Reed ES, Mace WM (1981): Ecological laws of perceiving and acting: In reply to Fodor and
Pylyshyn (1981). Cognition 9:237-304.
Ullman S (1980): Against direct perception. (with commentaries). Behavioral and Brain Sciences
3:373-415.
Vanderveer NJ (1979): Ecological acoustics: human perception of environmental sounds.
Dissertation Abstracts International 40:4543B. (University Microfilms no. 8004002). (Cited by
Ballas and Howard, 1987).
Warren RM (1982): Auditory perception: a new synthesis. Pergamon.
Warren WH & Verbrugge RR (1984): Auditory perception of breaking and bouncing events.
J.Exp.Psych.:Human Perception and Performance 10:704-712. (reprinted in Richards 1988).
Wightman FL, Jenison RL (1995): Auditory spatial layout. In Handbook of perception and cognition
Vol 5: Perception of space and motion. eds Epstein W, Rogers S. Academic Press. (in press?)
Wildes RP, Richards WA (1988): Recovering material properties from sound. In Natural
Computation. ed Richards WA. MIT Press. 356-363.
Yost WA (1990): Auditory image perception and analysis: The basis for hearing. Hearing Research 56:8-18.