A Review of Auditory Perceptual Theories and the Prospects for an Ecological Account

Ewan A. Macpherson
Department of Psychology
University of Wisconsin-Madison
(In partial fulfillment of Preliminary Exam requirements)
July 1995

Contents

1 Introduction
1.1 Motivation
1.2 Definitions
1.3 The neglect of auditory perceptual theory
2 Opinions on the role of auditory perception
2.1 Opinions on the role of perception in general
2.2 The role of hearing according to Helmholtz
2.3 The role of hearing according to James
2.4 The role of hearing according to Gibson
2.5 Other opinions: identification or source recovery?
2.6 Summary
3 Theories of Auditory Perception
3.1 Helmholtz's account of audition
3.2 James' account of audition
3.3 Brunswik's probabilistic functionalism
3.4 Gibson's account of audition
3.5 Computational accounts of audition
4 Establishment ecological research
4.1 Brunswik & Mohrmann: loudness constancy
4.2 Auditory scene analysis & auditory image perception
4.2.1 Bregman: auditory scene analysis
4.2.2 Yost: auditory image perception
4.2.3 Summary
4.3 Ballas & Howard: interpreting environmental sound
5 Ecological ecological research
5.1 Time-to-contact: acoustic looming
5.2 Using auditory information for active contact
5.3 Transformational invariants: breaking & bouncing
5.4 Perceiving numbers by audition
5.5 Acoustic texture in distance perception
6 Prospects for an Ecological account
6.1 Auditory ecology
6.2 Superposition
6.3 Specification
6.4 Auditory affordances
7 Conclusions & Speculations
References

1 Introduction

1.1 Motivation

During the spring of 1995 I attended a seminar given by Bill Epstein entitled "Thinking and Perceiving". The discussion centered around the conception of perception as a process of unconscious inference and, starting with the writings of Helmholtz and Berkeley, continued through to the computational constructivism of Marr.
In addition to the various construals of this notion, we also dealt with the objections and alternatives which have been offered, and discussed what sorts of experimental results would stand as evidence for one position or another. While unspecified in the course title, the 'perceiving' referred to throughout was uniformly visual, with little reference to audition or the other modalities. Thus, this paper is motivated in part by my speculation about how the content of the seminar would have changed if hearing were the canonical sense for perceptual theorizing. In keeping with this theme, I have somewhat liberally included short quotations from several authors in lieu of "readings" since it is often enlightening to see cited authors' original words.

More specifically, my aims in this paper are threefold: to review perceptual theorizing carried out within the context of audition; to examine a selection of experiments motivated by the differing theoretical viewpoints; and finally, to look critically at the difficulties that proponents of direct perception might face in importing a theory framed primarily in terms of vision into the auditory modality. Since these aims are rather interdependent, some topics are discussed in more than one context, but I hope that I have been successful in minimizing repetition. Rather than getting mired in a full-fledged analysis of the direct perception debate, I have attempted to take each position on its own merits while (I hope) maintaining a suitable and evenhanded skepticism.

1.2 Definitions

Before beginning the discussion proper, I would like to define what I mean by the terms 'Establishment' and 'Ecological', which I use to characterize the two main styles of perceptual theory. These refer respectively to accounts of perception which posit mediation by various psychological processes and those which do not.
As a succinct introduction to the direct-indirect debate, I do not feel that I can do better than to offer the following passage by Rock:

"... the essence of a direct theory is that stimulus information is available that uniquely correlates with each particular perception. Thus the specification of such information provides the necessary and sufficient explanation of perception. The essence of an indirect theory is that the stimulus information, while a necessary determinant, is not sufficient, because certain mediating processes must occur, once the stimulus information is registered or picked up, prior to the achievement of the percept. Such mediating processes can be described in psychological language and are a necessary part of the chain of events leading to the final perception. In my opinion, these processes could be either interactive in nature, such as were stressed by the Gestaltists, or they could be cognitive or thoughtlike in character. Examples of such processes are variously referred to as 'organizing' or 'grouping', 'interpreting', 'taking account of', 'computing', 'inferring', 'describing', 'deciding', and the like." (Rock 1980)

The position which sets itself against accounts involving mediation is variously referred to as direct perception, direct realism, or ecological perception. The program identified by these terms involves a rather radical redefinition of perception and stimulation, and since words tend to take on special meanings in this context I will use 'Ecological' to refer to the specific approach and 'ecological' for environment-directed perception in general. The Ecological approach actively contrasts itself with, and defines itself negatively with respect to, inferential accounts - it is "indirect" which "wears the trousers" (Turvey et al. 1981).
Therefore it seems reasonable to adopt Fodor and Pylyshyn's use of the term 'Establishment' to refer to the collection of theories with which the Ecologists take issue (Fodor & Pylyshyn 1981).

1.3 The neglect of auditory perceptual theory

Also as a preliminary, I would like to briefly discuss the relative neglect of audition in perceptual theorizing. The roots of the traditional, constructivist view of perception lie in analyses of vision. Bishop Berkeley discussed the perception of space, and it was in terms of vision that Helmholtz presented his theory of perception as a process of unconscious inference (Helmholtz 1867). He made little reference to similar matters in his monograph on auditory perception (Helmholtz 1877), and in fact the title of the latter refers explicitly to the "sensations of tone". Boring (1942) construed "auditory theory" to be a framework for discussing the physiology of the inner ear, while more modern collections with pan-modal titles (Cognitive Approaches to Human Perception (Ballesteros 1994), for example) still cheerfully ignore the non-visual modalities. Contemporary contributors to cognitive constructivist theory also work primarily with vision (for example Gregory 1993, Rock 1983, and Shepard 1990), as do computational constructivists such as Marr (1982).

In the last three decades an alternative to familiar accounts of perception, inspired by the work of Gibson, has gained some acceptance. While Gibson addresses all the modalities as 'perceptual systems' (1966), the full exposition of his theory deals explicitly with vision (1979). The most serious proponents of his program tend to similarly dwell on vision, sometimes restricting their discussion of audition to a single page (Michaels & Carello 1981).

Unsurprisingly, this visual bias persists in most discussions of the relative merits of traditional and direct theories of perception.
Typical examples are the target article by Ullman and the resulting commentary (Ullman 1980), the debate between Fodor & Pylyshyn (1981) and Turvey et al. (1981), and the analyses by Bruce & Green (1990) and Hochberg (1994).

The main thrust of auditory research also seems to have proceeded in the absence of discussion of the fundamental nature of perceptual processes. Licklider (1959) remarks:

"There is no over-all theory of hearing. No one since Helmholtz has tried to handle anything like all the known problems within a single framework. Each of the several theories of hearing that are extant deals with a restricted set of questions."

This seems true today, and of course the number of "known problems" continues to increase. To what can we attribute this lack of theorizing, or conversely, why has such work been more often undertaken in the visual domain? The explanation seems to lie partly in beliefs or intuitions about differences between the two modalities and about sound itself, but more so in the historical roots of certain lines of experimentation.

Firstly, hearing has traditionally been thought of as passive and vision active. For example, Dowling et al. (1987) cite Schopenhauer's belief that music's affective power is due to the passive nature of hearing, which allowed "brain-fibres" to vibrate in synchrony with musical tones. Secondly, the "products" of hearing were more often described in terms of sensation and rarely in terms of object perception, and the interest in the perception of musical tones rather than of the "noises" produced by everyday sources reinforced this emphasis.

Yost (1990) elaborates on this, and suggests that the early direction of hearing research and experimentation rests on an historical accident of timing. Sound was not considered to be localized in space, and thus it was unclear how sound sources could be localized except by association with perceptions derived from sight and touch.
Helmholtz's psychoacoustic investigations revealed the ear to be a sensitive frequency analyzer, and Lord Rayleigh's sound localization experiments came after interest had been focused on the analysis problem (Strutt 1877). Licklider (1959) also credits Helmholtz with establishing boundaries of interest within auditory science, and points out that although von Békésy's discovery of mechanical tuning within the cochlea disproved Helmholtz's resonance theory, it merely altered the way frequency selectivity was studied. A final factor may be that hearing provides a fruitful and "clean" domain for the application of the theory of signal detection (Green & Swets 1966). Thus the tendency to concentrate on basic psychophysics has persisted throughout this century.

2 Opinions on the role of auditory perception

2.1 Opinions on the role of perception in general

Before examining some writings on the nature of the auditory process, I would like to survey comments by a number of authors on the role of hearing. Differing views of what should properly be considered its function or end-products must have an effect on the types of processes postulated or required. In general, Ecological and Establishment advocates hold somewhat different views of the role of perception, which I will present before moving on to comments specifically about audition. The problem of comparison is complicated by the fact that the two camps not only ascribe different roles to perception but also define what counts as perception differently.

Both the Establishment and Ecological accounts acknowledge that perception serves to provide information about the environment. For the Establishment, perception is the process of deriving mental representations of the objects and events in the environment - the process of "getting the outside inside".
For example, Pylyshyn (1984) defines sensory transducers as mechanisms for producing symbols which depend on states of the environment, William James refers to perception in terms of the conscious awareness of external objects (James 1890), and Fodor (1975) makes frequent reference to "perceptual knowledge". Perception serves to provide knowledge of "what is where" in the world, and action is guided on the basis of that knowledge.

In the Ecological view, perception is a keeping-in-contact which supports action, while the emphasis in Establishment theories is more epistemological. In Ecological accounts, action is 'directly' related to perception, while in Establishment theories the relationship is 'mediated' by other processes. The following passages illustrate the Ecological view:

"Perceiving is an achievement of the individual, not an appearance in the theater of his consciousness. It is a keeping-in-touch with the world, an experiencing of things, rather than a having of experiences. It involves awareness-of instead of just awareness. It may be awareness of something in the environment or something in the observer or both at once, but there is no content of awareness independent of that of which one is aware." (Gibson 1966)

"Fodor and Pylyshyn, as Establishment theorists, concentrate on how one takes the environment, appealing to verbal labels of experience to lead the way in delineating subject matter. When the concentration is shifted to perceptual guidance of activity, however, it is clear that most of this continuous, nested perceiving lacks words for referring to it. ... Fodor and Pylyshyn's kind of perception (in percepts) is whatever eventuates in a perceptual judgement or belief. Gibson's kind of perception, in contrast, is that which eventuates in the 'proper' adjustment or oriented (to various levels of the environment) activity." (Turvey et al.
1981)

This distinction between non-propositional perception-of and propositional perception-as is a major point for Ecological theorists. The division between those adhering to the -as and -of interpretations is not cleanly along "mediated" and "direct" lines, however. John Searle, certainly no supporter of unconscious inference accounts of mental phenomena, explicitly states that "all perception is perception-as" (Searle 1992).

2.2 The role of hearing according to Helmholtz

Most of Helmholtz's writings on hearing are found in the monograph On the Sensations of Tone As a Physiological Basis for the Theory of Music. As the title suggests, this is a work with rather specific aims. In particular, it deals with the perception of "musical tones" (defined as steady-state combinations of sine-tone partials) and not with everyday sounds, which Helmholtz referred to as "noises". Despite this emphasis on hearing in a musical context it may be possible to draw some conclusions about his thinking about the role of hearing in general.

Firstly, sensation is stressed as playing a more dominant role in hearing than in the other senses (again, in a musical context). In the introduction, Helmholtz writes:

"Music stands in a much closer relation to pure sensation than do the other arts. The latter rather deal with what the senses apprehend, that is with the images of outwards objects, collected by psychical processes from immediate sensation. ... in music, the sensations of tone are the material of the art. So far as these sensations are excited in music, we do not create out of them any external objects or actions.
Again, when in hearing a concert we recognize one tone as due to a violin and another as due to a clarinet, our artistic enjoyment does not depend upon our conception of a violin or clarinet, but solely on our hearing of the tones they produce, whereas the artistic enjoyment resulting from viewing a marble statue does not depend on the white light which it reflects into the eye, but upon the mental image of the beautiful human form which it calls up." (Helmholtz 1877)

So although the listener can identify the source of a tone, the "raw" sensation of timbre is very clearly present in awareness. Source identification is possible, but not necessarily the single overriding goal.

A second emphasis, on the challenge of source separation, does suggest an important place for the "images of outwards objects" in hearing. As well as considering the ability to follow separate melodic lines in a piece of music, the reader is also asked to consider a ballroom:

"Here we have a number of musical instruments in action, speaking men and women, rustling garments, gliding feet, clinking glasses, and so on. All these causes give rise to systems of waves, which dart through the mass of air in the room, are reflected from its walls, return, strike the opposite wall, are reflected again, and so on until they die out. ... in short, a tumbled entanglement of the most different kinds of motion, complicated beyond conception. And yet the ear is able to distinguish all the separate constituent parts of this confused whole ..." (Helmholtz 1877)

Presumably this separation is supposed by Helmholtz to allow the listener to "apprehend" the speaking men and women, the rustling clothes, etc.

2.3 The role of hearing according to James

William James also advanced a "knowing what is where" view of perception's role. Throughout the chapters on perception in The Principles of Psychology, he discusses both the visual and auditory modalities in parallel, drawing no fundamental distinction between them.
Perception results, in his view, in conscious ideas suggested by sensation. "The first of these ideas is that of the thing to which the sensible quality belongs. The consciousness of particular things present to sense is nowadays called perception" (James 1890). In an auditory example (taken somewhat out of context), he writes: "Thus, I hear a sound, and say 'a horse-car'". That is, the object is identified by its sound.

2.4 The role of hearing according to Gibson

Leaping ahead to the mid-20th century, one might expect Gibson to have a somewhat different view on the role of auditory perception, but the explicit differences to be found are subtle. In The Senses Considered as Perceptual Systems he writes:

"The function of the auditory system, then, is not merely to permit hearing, if by that is meant the arousal of auditory sensations. Its exteroceptive function is to pick up the direction of an event, permitting orientation to it, and the nature of an event, permitting identification of it." (Gibson 1966)

The obvious difference is the substitution of 'event' for 'object', but since by necessity the production of sound involves a dynamic event, this might be construed as a difference in terminology. A greater difference is the proposal that the 'nature' of an event is picked up. This presumably consists of the shapes, motions, and materials involved in the production of the sound, but it is difficult to interpret Gibson's usage precisely, and the point is not elaborated in the most mature incarnation of his theory (1979), which considers only vision. In light of the theory of affordances, the idea that picking up the nature of an event subserves its identification seems somewhat inconsistent with an Ecological stance. I will return to the discussion of the problem of auditory affordances in Section 6.4.

2.5 Other opinions: identification or source recovery?

Other writers also stress identification in the auditory modality.
Of these, some explicitly identify their viewpoint as Ecological while others do not. As an example of the latter, Schubert (1974) proposes the Source Identification Theory as an organizing principle of the auditory system, at least in the processing of non-speech sounds. For speech he extends this to include a principle of Source Behavior recognition in an effort to embrace the motor theory of speech perception. In this account, the listener uses the sound stimulus to identify articulatory gestures, and from these derives the phonemic and semantic content of an utterance. The means by which Schubert suggests this is accomplished are far from unmediated, however.

Another promotion of source identification is found in Jenkins' ecological but somewhat un-Gibsonian meditation on acoustic information (Jenkins 1985). The majority of the examples given refer to gaining "what is where" knowledge of sound-producing objects.

The idea that listening to speech is exceptional is challenged by Fowler, a committed direct realist. While in agreement with Schubert that in this case the auditory system recovers "the causal source of the acoustic signal" (Fowler 1991), she maintains that it is wholly unspecial in that regard and that all hearing involves event recovery rather than associating objects with sounds (i.e., identification). While admitting that there are situations in which there is no adaptive advantage in perceiving events directly, her argument is that there frequently is such an advantage and therefore that evolutionary pressures will have produced an auditory system which attempts to do exactly that.

In addition to Fowler's writings in the context of speech perception, perhaps the most serious examination of the role of audition from an Ecological perspective is to be found in a pair of papers by Gaver (1993a, 1993b).
Here he proposes that our auditory sense exists to pick up sound-carried information "...about an interaction of materials at a location in an environment". The sound reaching a listener's ears is held to bear information about each of these elements: the nature of the interaction, striking or scraping, say; the materials involved, wood or water; the location relative to the listener or to the environmental setting; and the nature of the environment itself, in terms of reflectiveness and configuration of surfaces. The example of sound from a moving car is provided:

"We can hear an approaching automobile, its size and its speed. We can hear where it is and how fast it is approaching. And we can hear the narrow echoing walls of the alley it is driving along. These are the phenomena of concern to an ecological approach to perception." (Gaver 1993a)

Thus what are heard are various physical features of environmental events, but as with Fowler, Gaver does not attempt to make the case that these are necessarily ecologically significant features analogous to Gibson's visual affordances.

2.6 Summary

To review, then, there seem to be two views on the role of auditory perception in addition to Helmholtz's sensation-based account of music perception. The Establishment story is that hearing serves to localize and identify sound-producing objects, while the Ecological view holds that the physical nature of sound-producing events is directly perceived - the causal source of the acoustic signal is recovered. As noted previously, this account is not strictly ecological in the way visual theories of perception-for-action claim to be.

3 Theories of Auditory Perception

Having examined the range of viewpoints on the role of auditory perception, I now turn to discussions of the processes which are held to underlie the fulfillment of this role.
These are quite varied, including hints of unconscious inference in the writings of Helmholtz, the direct perception approach of Gibson, and auditory applications of computational constructivism.

3.1 Helmholtz's account of audition

Beginning again with Helmholtz, we find that he devotes little discussion to the mental processes involved in hearing. This may be largely due to his emphasis on the "sensation of tone", rather than on adaptive auditory perception outside a musical context. However, a number of passages suggest that Helmholtz feels that a great deal of work needs to be done on the auditory input in order to produce separate percepts for the sound sources contributing to it. He is not as explicit as in his advancement of unconscious inference as a theory of visual perception, but he certainly suggests ratiomorphic, constructional mental activity. The three quotations which follow give the sense that the auditory system is involved in analysis, inference, and problem solving respectively. The second is preceded in the original by a passage describing the visual inspection of the surface of the ocean and the ease with which the superimposed systems of waves are separated by eye. (The emphases are not present in the originals).

"We shall see that the ear has no decisive test by which it can in all cases distinguish between the effect of a motion of the air caused by several different music tones arising from different sources, and that caused by the music tone of a single sounding body. Hence the ear has to analyze the composition of single musical tones, under proper conditions, by means of the same faculty which enabled it to analyze the composition of simultaneous music tones."

"I must own that whenever I attentively observe this spectacle [the visual separation of ocean wave systems] it awakens in me a peculiar kind of intellectual pleasure, because it bares to the bodily eye, what the mind's eye [perception in general?]
grasps only by the help of a long series of complicated conclusions for the waves of the invisible atmospheric ocean."

"Now there are many circumstances which assist us first in separating the musical tones arising from different sources, and secondly, in keeping together the partial tones of each separate source. Thus when one musical tone is heard for some time before being joined by the second, and then the second continues after the first has ceased, the separation in sound is facilitated by the succession in time. We have already heard the first musical tone by itself and hence know immediately what we have to deduct from the compound effect for the effect of this first tone."

3.2 James' account of audition

James discusses perception as a general process without strongly differentiating between the modalities, although he does seem to side with Bishop Berkeley in asserting the primacy of touch, and is quite explicit in his description of the processes. The account is sensation-based and constructivist, and is well summarized in the following two quotations:

"Sensational and reproductive brain processes combined, then are what give us the content of our perceptions" (James 1890)

"Perception may then be defined, in Mr. Sully's words, as that process by which the mind supplements a sense-impression by an accompaniment or escort of revived sensations, the whole aggregate of actual and revived sensation being solidified or 'integrated' into the form of a percept, that is, an apparently immediate apprehension or cognition of an object now present in a particular locality or region of space." (James 1890)

Moreover, James' account is also clearly empiricist:

"Infants must go through a long education of the eye and ear before they can perceive the realities which adults perceive. Every perception is an acquired perception."
(James 1890)

and continuing in a footnote, he makes special reference to audition:

"The educative process is particularly obvious in the case of the ear, for all sudden sounds seem alarming to babies. The familiar noises of house and street keep them in constant trepidation until such time as they have either learned the objects which emit them, or have become blunted to them by frequent experience of their innocuity." (James 1890)

3.3 Brunswik's probabilistic functionalism

Somewhere between traditional, perception-as constructivism and Gibson's Ecological approach lies Brunswik's probabilistic functionalism, which influenced Gibson's thinking significantly (Lombardo 1987). In this framework, the emphasis is on the perceptual constancies, referred to as distal focusing, and on their achievement in non-laboratory, or "representative", contexts. The perceptual process is held to take the form of statistical inference; proximal cues of varying reliability are weighted and combined to produce a "best bet" at the distal state of affairs. The model of the process incorporates three types of weightings or correlations, referred to as validities. Correlations between distal features and proximal cues are ecological validities; the weightings placed on cues to produce percept features are criterial validities; and the degree of correspondence between the distal feature and the percept is the functional validity. This last is a metric of achievement.

While Brunswik himself applied his methods principally to the three canonical visual constancies (size, shape, and color), the same system has been applied to audition in a study of loudness constancy (Mohrmann 1939). This work will be described in Section 4.1 as one example of Establishment-style experimentation.
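As a rough numerical illustration of how the three validities relate, the following minimal sketch may be helpful. It is not Brunswik's own procedure: it assumes synthetic data, two hypothetical proximal cues (`cue_a`, `cue_b`) with invented noise levels, a percept formed as a simple weighted sum, and Pearson correlation as the measure of ecological and functional validity. All names and numbers are assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def lens_model(n_trials=2000, seed=0):
    """Simulate one distal feature, two proximal cues, and a weighted percept."""
    rng = random.Random(seed)
    # Distal feature (hypothetical): some property of the source, arbitrary units.
    distal = [rng.gauss(0.0, 1.0) for _ in range(n_trials)]
    # Two hypothetical proximal cues: cue_a is fairly reliable, cue_b is noisy.
    cue_a = [d + rng.gauss(0.0, 0.3) for d in distal]
    cue_b = [d + rng.gauss(0.0, 2.0) for d in distal]
    # Assumed weightings placed on the cues (the criterial validities).
    weights = {"cue_a": 0.8, "cue_b": 0.2}
    percept = [weights["cue_a"] * a + weights["cue_b"] * b
               for a, b in zip(cue_a, cue_b)]
    return {
        # Correlation between the distal feature and each proximal cue.
        "ecological_validity": {
            "cue_a": pearson(distal, cue_a),
            "cue_b": pearson(distal, cue_b),
        },
        "criterial_validity": weights,
        # Correspondence between distal feature and percept (achievement).
        "functional_validity": pearson(distal, percept),
    }


result = lens_model()
print(result)
```

In this toy setting the more reliable cue has the higher ecological validity, and weighting the cues roughly in line with their reliability yields a functional validity well above that of the noisier cue alone. Changing the assumed weights shows how achievement depends on the match between criterial and ecological validities.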
3.4 Gibson's account of audition

Gibson's account of the basis of auditory perception exactly parallels his treatment of vision, and has no place for the cues which play such an important role in Brunswik's conception of ecological perception. The hearing organism is said to use its listening system, "two ears together with the muscles for orienting them to a source of sound", to sample the 'acoustic array'. This permits the pick-up of invariants which specify the mechanical sound-producing event. No mediation by inference, memory, or computation is required. As in any direct theory, the usefulness of such a process rests on specification, or the one-to-one mapping from sound-field properties to sound-source properties. For example, interaural time and amplitude differences and their patterns of change as the head moves are identified as specifiers of the location of a source. Two quotations will serve as evidence of his belief in acoustic specificity:

"In meaningful sounds, these variables [spectral and temporal features] can be combined to yield higher-order variables of staggering complexity. But these mathematical complexities seem nevertheless to be the simplicities of auditory information, and it is just these variables that are distinguished naturally by an auditory system. Moreover, it is just these variables that are specific to the source of the sound - the variables that identify the wind in the trees or the rushing of water, the cry of the young or the call of the mother. The sounds of rubbing, scraping, rolling, and brushing, for example, are distinctive acoustically and are distinguished phenomenally." (Gibson 1966)

"... the kind of wave train is specific to the kind of mechanical event at the source of the field; that is, the sequence and composition of pressure changes at a point in the air correspond to what happened mechanically...
This correspondence is the justification for our metaphorical assertions that the waterfall 'splashes', the wind 'whistles', and the thunder 'cracks'." (Gibson 1966)

Gibson also repeats his argument against sensations as a basis for perception in the context of hearing. A sound signal, as a function of time, can be decomposed into a collection of sinusoids, but he points out that adopting this mode of analysis leads to the dubious assumption that any complex sound can be reduced to a collection of pitch sensations. The point he makes is similar to what Jenkins (1985) calls Johansson's Law of Perceptual Richness, which is that mathematically complex stimuli may be hard to describe, but are information-rich, while mathematically simple stimuli may not be so simply dealt with by the perceptual system (Johansson 1985).

3.5 Computational accounts of audition

In Gibson's auditory theory, the pick-up of information is said to be performed by neural structures which 'resonate' to the invariants of stimulation. By removing these processes from the psychological "domain of discourse" (Ullman 1980), Gibson left them unanalyzed. Those interested in artificial intelligence and the development of perceiving machines do not have this luxury, however, and must face the problem of actually extracting invariants. Despite this component, and an emphasis on representational transformation, Gibson's Ecological approach is often identified as a source of inspiration (as well as exasperation) by those who practice computational constructivism [1]. For example, Sloman attempts to incorporate affordance-like objects of perception into his computational theory, but writes:

"... we need not stick with Gibson's mystifying and unanalysed notions of direct information 'pickup' and 'resonance', although I shall sketch a design for such a system that has distant echoes of these notions" (Sloman 1989).
Marr holds a similar view:

"Gibson's important contribution was to take the debate away from the philosophical considerations of sense-data and the affective qualities of sensation and to note instead that the important thing about the senses is that they are channels for perception of the real world outside or, in the case of vision, of the visible surfaces." (Marr 1982)

"Although one can criticize certain shortcomings in the quality of Gibson's analysis, its major, and in my view, fatal shortcoming lies at a deeper level and results from a failure to realize two things. First, the detection of physical invariants, like image surfaces, is exactly and precisely an information-processing problem, in modern terminology. And second, he underestimated the sheer difficulty of such detection." (Marr 1982)

[1] The 'computational' in 'computational constructivism' refers specifically to a style of processing involving mathematical manipulations and explicitly geometrical representations. As Pylyshyn has pointed out (1984), all forms of constructivism can be considered computational since inference is couched in terms of propositions, propositions are represented symbolically, and an operation over symbols is computation.

This combination of consideration of ecological constraints and formal computation has been termed 'natural computation' by Richards (1988), and forms yet another class of auditory theory. C.J. Searle (1982) and Lyon (1983), among others, have applied these methods to auditory processes. Other major impetuses are soundscape understanding or 'machine listening' (Ellis 1995), and automatic music transcription (Nunn 1995). Curiously, the design of speech recognition systems seems to have proceeded without much contact with perceptual science, and the techniques used are often general-purpose pattern recognition algorithms rather than auditory models.
Bruce & Green (1990) offer a possible reconciliation between computational and Ecological accounts, framed in terms of non-symbolic representation. Neural "maps" can represent variables of the input and preserve isomorphisms, but as Searle (1992) maintains, once the neurophysiological bases of these maps are understood, the incentive to characterize the process in terms of symbolic computation is greatly reduced. For example, it appears that interaural time differences are mapped to "place" in the medial nucleus of the superior olivary complex (Pickles 1988) - the representation is not symbolic. Certainly much of Marr's theory of early vision could be read simply as a functional description of simple neural processing. Hatfield (1990) also proposes a rapprochement between direct and representational transformation accounts via connectionist "symbol" processing.

4 Establishment ecological research

In the next two sections of the paper I will review a number of examples of individual experiments or of research programs conducted from the Establishment and Ecological viewpoints. The dual aims are to compare the style of experimentation within the two camps and to provide some context for the discussion of the Ecological approach with which I conclude in Section 6. Experiments which are self-consciously motivated by an anti-direct stance tend to seek the effects of perceiver knowledge on percepts (Hochberg 1994), while others tacitly working within the classical framework uncritically offer inference-based explanations of their observations. There are also many examples of the types of experiments which are a favorite target of Ecologists: snapshot theories of motion perception, lateralization of sine tone stimuli, fixed-head sound localization, and auditory illusions using "impoverished stimuli" of various sorts.
Since the subject matter and analyses found in these studies are so obviously different from those in Ecologically-motivated work, it does not seem particularly illuminating to discuss them here. Instead I will focus on experiments which address issues relevant to object or event perception from an Establishment viewpoint. The emphasis in this work is usually on filling unsatisfying gaps in the direct perception account (e.g. how are cues or invariants extracted?) or on offering alternative accounts involving mediating processes. I will attempt to show by example that to some extent one can address auditory perception ecologically without being strictly Ecological.

4.1 Brunswik & Mohrmann: loudness constancy

As mentioned previously, the concern with investigating environmental perception did not originate with Gibson. Brunswik and his colleagues examined many perceptual constancies within the framework of his probabilistic functionalism. An example of this approach applied in hearing is a study of loudness constancy by Mohrmann (1939, described by Postman & Tolman 1959). The task of the subjects was to report the loudness of the sounds produced by a number of sources while adopting one of two attitudes. The first, the naive-realistic attitude, was distally focused, and required the listener to estimate the intensity at the source, while the analytic, or sensorial, attitude concerned the intensity at the listener's position. The actual intensity was measured using microphones at the source and listener positions, but the response method is not described. Achievement of constancy was calculated by correlating the judgements with the physical measurements. Presumably the proximal intensity was varied by altering the distance to the source rather than changing its amplitude, since the latter would cause both distal and proximal intensities to vary in parallel.
In addition, the experiment was performed under three viewing conditions: in the dark, with listeners blindfolded after viewing the source, and with the source in plain view throughout. If listeners were able to adopt the desired attitudes perfectly, the constancy ratios obtained should be 1 in the naive-realistic case and 0 in the analytic case. This trend was observed, but constancy ratios ranged from approximately 0.65 (for tones) to 0.95 (for speech) in the realistic case, and from about 0.1 to 0.5 in the analytic. This suggests that on the whole observers are more successful at reporting distal intensities than proximal ones. In addition, constancy was favored when subjects could see the source and how far away it was no matter which attitude they were requested to take, but visual cues hindered proximal reporting more than they assisted already-good distal reporting. That is, listeners could only successfully adopt an analytic stance in the dark condition. Another feature of the data is that the complex sounds, such as speech and music, permitted much higher loudness constancy than tones and noise.

These results can of course be interpreted in several ways. In Brunswik's terms, the adaptive value of perception lies in distal focusing, and therefore it should not be surprising that we have easier access to distal representations than to the proximal cues from which they are derived. Unconscious inference could be invoked to explain the achievement of greater constancy in the visible-source condition, in which vision provides information about the distance to the source. This could be used by the auditory system, which "knows" how intensity varies with distance, to determine the source's loudness [2]. This of course begs the question of how the visual system obtains unambiguous distance information. The advocate of direct perception would explain the difficulty of reporting proximal intensity as evidence that the auditory system is designed to recover source properties.
The better constancy obtained with speech and music could be attributed to their greater ecological validity and informational richness in comparison to the lowly sine tone and noise burst, for which source recovery would be ambiguous. The advantage bestowed by visual information is less conveniently explained within an Ecological account, but conceivably cross-modal invariants for loudness could be hypothesized.

[2] Warren (1982) discusses an approach in which estimates of loudness are actually held to be disguised estimates of distance.

4.2 Auditory scene analysis & auditory image perception

As Helmholtz pointed out, a central issue in auditory research concerns the means by which the complex superposition of sounds from several sources is processed so that each may be perceived separately. This process is referred to as source segregation or auditory scene analysis, and is addressed in an extensive program of research conducted by Bregman and his associates (Bregman 1990) and in a theoretical paper by Yost (1990). Each author is concerned with slightly different aspects of the problem, and phrases his assumptions and motivations differently. Here I will give a review of their theoretical orientations and the types of experiments associated with each.

4.2.1 Bregman: auditory scene analysis

For Bregman, like Marr, “perception is the process of using information provided by our senses to form mental representations of the world around us”, and as in visual scene analysis an important problem is the grouping of separate pieces of information about the same object together. He writes: “it is important to emphasize again that the way the sensory inputs are grouped by our nervous systems determines the patterns that we perceive”. So, the products of perception are in this account very much influenced by mental activity (or at least alterable neurophysiological activity).
The scene analysis task is posed as a problem to be solved by the auditory system through a process of representational transformation. Bregman stresses that on one hand it is important to examine the ecology of audition - the constraints on and commonalities among natural auditory scenes - and suggests that the auditory system uses knowledge of this sort in the form of useful heuristics in order to achieve source separation. The formation of representations is held to be constrained both by innate, primitive grouping rules and by learned rule complexes, which he calls schemas. Grouping occurs both sequentially (on successively-presented segments of a sound pattern) and in parallel (on sound components present simultaneously). The end result of the grouping processes is one or more sound streams, which are described variously as the auditory equivalents of visual objects, perceptual units representing single happenings, or as perceptual representations: “a computational stage on the way to a full description of an auditory event”. When a stream is compared to an object, clearly the meaning is not that streams exist in the environment, but that a stream is a unit of auditory experience with its own properties (rhythm, pitch contour and timbre, say) just as a visual object is a unit of experience.

Despite referring to ecological constraints as a guide to grouping processes, it is clear that a stream does not necessarily correspond one-to-one with a sound source. It is possible for the sound from many sources to merge into a single stream, or for sound from a single source to be segregated into several streams. The latter effect is revealed in experiments on pitch streaming, a sequential grouping process (Bregman and Campbell 1971).
When a tone sequence consisting of alternating high and low tones is presented it can appear as a single stream if it is played slowly or the tones are not widely separated in pitch, or it may split into two streams if played fast or with wide separation. In situations where a single sequence is grouped into multiple streams, it is very difficult for listeners to discern the temporal relationships between them. For example, rhythmic patterns perceived in a single stream can dissolve if pitch manipulations cause it to split into multiple streams. Bregman writes that we can: "...look at the streaming effect as the auditory system's description as a mixture of two sources - one high in pitch and the other low. This is the system's best bet as to the deep structure of the situation. The heuristic that seems to be involved here is this: Temporally adjacent segments are not necessarily to be grouped as arising from the same source, especially when the segments themselves have sharp boundaries. ... In such cases, the events are to be grouped according to similarity." (Bregman 1981). The reference to "deep structure" is not accidental; he often compares the heuristics involved in "parsing" the auditory input to Chomskian grammatical rules. Formal generative grammars have been used in modelling the perception of music (Lerdahl and Jackendoff 1983), and Ballas (1987, see Section 4.3) uses a speech metaphor in his account of environmental sound perception, so this approach is not unique.

Bregman's experimental program also includes investigations of other streaming phenomena, such as those based on timbre differences, and of auditory analogs to visual amodal completion effects. When tone and noise bursts are alternated in sequence, the tone appears to become continuous when the noise is sufficiently intense that it would have rendered a truly continuous tone inaudible.
This effect also occurs when a tone glide is interrupted by noise - under the appropriate conditions the glide appears to persist through the noise while continuing to change in pitch, a phenomenon which has been used to investigate the auditory system's 'assumptions' about the rates of change of sound source characteristics (Kluender & Jenison 1992). Warren (1982) reviews several of these illusory continuity effects, and Bregman is in agreement with his suggestion that their function is to group together sound segments originating from the same source which would otherwise be separated by masking signals. The ability to elaborate sketchy, temporally-limited sensory information into temporally-extended stable percepts has also been noted in binaural experiments (Stellmack 1994).

4.2.2 Yost: auditory image perception

In a paper entitled "Auditory image perception and analysis: The basis for hearing", Yost (1990) also addresses the scene analysis problem, although he distinguishes his point of view from Bregman's. His emphasis is on processes which allow the separation of concurrently active sources, under the premise that the main function of the auditory system is the "determination of sound sources". 'Determination' is explicitly distinguished from 'identification'. It is the generation of an 'auditory image' corresponding to a single sound source. These images are the objects of the identification process, although identification need not be successful in order for them to be perceived. An auditory image seems to be approximately the same as a stream, although in a sense its identification with a single physical source suggests that it is a more environmentally-oriented concept. While Bregman's proposal that streams are the units of auditory experience seems clear, the use of the word 'image' is rather more confusing.
Consider this passage: "Because the sounds from different sources do not arrive at the auditory system separately, the auditory system must process the neural representation of the complex sound field into elements ('auditory images') that allow the listener to potentially determine the source. The presence of sound sources is inferred or deduced from percepts, the auditory images, based on the information arriving at the ears of a listener. Thus auditory images are the bases for hearing." (Yost 1990)

Such an image is clearly not something which is imagined; it is not the sort of thing studied by those interested in auditory imagery (Reisberg 1992). Nor is it analogous to a retinal image. If the images are percepts (i.e. the experiential outcomes of the process of perception) then they are in classical terms the conscious representations of sound sources in the environment. However, Yost seems to introduce an extra step of inferring the existence of sound sources from percepts rather than taking the Helmholtzian position that percepts are the result of inference. To complicate matters, Yost elsewhere states that image perception is sufficient for sound source determination. The proliferation of levels seems to result from an awkward attempt to keep the discussion outside the realm of cognition. For example: "If one reviews the literature on image formation (Handel 1990, Bregman 1990), the topic may appear to be more closely related to cognitive science, or even to phenomenology, than to issues that would be of direct interest to psychoacoustics and auditory physiology. An assumption of this paper is that the auditory system is responsible for auditory image formation and the four questions posed above are amenable for study by auditory scientists." Yost claims to seek an explanation in terms of neurophysiology or basic psychophysics, but, although denying it, needs a foot in both camps.
If auditory images are not phenomenological entities, then they are hardly percepts. Rather than belabor this point, I will press on and discuss the experiment Yost presents as an example of image formation and briefly describe the means by which he feels this is achieved.

The necessity for scene analysis occurs whenever more than one source is active at the same time. The experimental stimulus in this case was a mixture of a man uttering the vowel /a/ and a synthesized pipe organ note. Neither the physical frequency spectrum nor the output of an auditory filter bank model makes it obvious that two and only two sources are present, but all of the subjects who heard the stimulus reported hearing only two. Identifying the sources was more variable, but all listeners heard some spoken vowel and a musical note. The strategy adopted in explaining this ability is to examine the ecology of sound production for physical attributes of sources which might be encodable in the auditory nerve signals. The seven physical variables suggested are: spectral separation, intensity profile, harmonicity, spatial separation, temporal separation, common temporal onsets and offsets, and coherent slow temporal modulation. While this does not exactly constitute a search for invariants, it is meant as a first step in a neurophysiological account of source separation. Note that the importance of temporal separation and common onsets and offsets was recognized by Helmholtz (1877, see Section 3.1).

4.2.3 Summary

The search for an account at this ecological-neural level is perhaps the only feature of this approach and of Bregman's which Ecological theorists would not object to. Talk of grouping rules, representations, problem solving, deductions, and inference is the antithesis of a direct theory. However, no convincing account of source separation in terms of acoustic invariants has yet been offered.
Gibson (1966) proposed that orienting the head so as to synchronize the binaural inputs for one source while desynchronizing those for others was the basis of 'selective listening'. While spatial separation and binaural input do assist in source segregation, segregation is clearly quite possible with a single channel of input, and thus Gibson's account is inadequate. This issue is discussed in more detail in Section 6.2.

Bregman's work might be subjected to the standard criticism that his stimuli are impoverished and unnatural and that the results therefore have little or no relevance to ecological listening. In addition, any account which posits rules is vulnerable both to questions about who is applying the rules (i.e. the homunculus problem) and to objections about lack of constraints. One can keep adding rules to explain whatever behavior is observed. However, Bregman states that he is interested in a functional description of these processes - the rules are tools for predicting percepts rather than actual constituents of the auditory system. His is an as-if, not an in-fact, rule-following account. The primitive rules are described as "automatic innate processes that act without conscious control" (Bregman 1990). On the other hand, his description of the more sophisticated, top-down, schema-based processes is less reconcilable with direct perception accounts. Here consciously-directed attention and "the activation of stored knowledge of familiar patterns" are held to play a role. A Gibsonian explanation would involve an account of perceptual learning, which involves the discovery of additional variables of stimulation permitting finer discriminations.

4.3 Ballas & Howard: interpreting environmental sound

As a final example of a non-Ecological approach to environmental perception I will examine Ballas and Howard's paper, "Interpreting the Language of Environmental Sounds" (1987).
Their main point is that a useful analogy can be made between the perception and understanding of speech and of environmental sound. Both seem to involve bottom-up, data-driven processes combined with top-down, context-dependent, knowledge-based cognitive processes which serve to resolve ambiguities and permit the recovery of meaning. This is similar to Bregman's distinction between primitive and schema-based processes, but the authors' intent is to present evidence that not only is the general form of perceptual processing similar, but that specific details are too. The claim that environmental sound can be considered a language is therefore more than metaphorical.

Ballas and Howard discuss four experiments in support of their contention. The first involved the free-response identification of a number of short recorded sounds, several of which were intended to represent events in water or steam-pipe systems. It was found that (with the exception of a water drip sound) actions were much more accurately identified than agents. That is, listeners could more reliably say whether the event involved an impact, friction or flow than whether the materials involved were water, wood, metal or air. These results are contrasted with those of Vanderveer (1979), who obtained much more accurate judgements, but Ballas and Howard offer the explanation that Vanderveer's stimuli (e.g. jingling keys, fingers drumming on a table) were presented in an appropriate context, a seminar room, and that this cued the listeners. Their conclusion is that, taken in isolation, the meanings of individual environmental sounds (i.e. the identities of the source events) can be ambiguous, as can the meanings of isolated words.

A second study attempted to draw a parallel between sound and speech homonyms using an Information Theory approach. Listeners were again asked to identify the recorded sounds from the first experiment and to rate the confidence of their identifications.
The responses were sorted into categories and the "entropy" of each sound calculated based on the number of different categories into which it was placed. The correlation between confidence and entropy was significant, suggesting that identification is affected by the number of different causes to which a sound might be attributed. The authors also suggest that identification might be influenced by the frequency of occurrence of particular sounds in the same way that word recognition depends on frequency, but they admit that quantifying this may be difficult.

The final two studies used the same set of sounds presented in sequences and were concerned with the effect of context on the identification of individual sounds within sequences or the learning of sequences as whole units. Context was found to influence the interpretation of individual sounds. For example a hammer striking a pipe was thought to be a factory machine in one sequence and a car crash in another. This effect is compared to the resolution of homonym meaning in sentences: "... it appears that the integration of sequences of sounds resembles the integration of sequences of words in a sentence. In the latter case, multiple interpretations of each word might be activated initially and all but one eliminated on the basis of the context provided by the other words." (Ballas & Howard 1987) Although not mentioned by the authors, the activation and inhibition which they propose could perhaps be investigated using the established tools of experimental psycholinguistics.

In the final experiment, listeners were asked to learn sequences of two sorts. One set contained randomly-ordered combinations of drips, clangs, flushes etc, while the other consisted of causally-sensible structured sequences created using a small finite state grammar. In addition, half the subjects in each condition were informed that they would hear sounds involving water and half were given no instruction.
The structured sequences were learned more quickly than the random ones, and there was an interaction with the instructions given. Prior information aided those learning the structured patterns but hindered those learning the random ones. The interpretation is that the expectation of causally-logical sequences interfered with the learning of random patterns. In effect there is held to be a grammar of causality which listeners use to parse environmental sound sequences. Jackendoff (1987) makes similar claims about the representation of visual events and their relationship to language.

An Ecological response to this might question the validity of results obtained with sounds taken out of an environmental and causal context. In other words, Ballas and Howard might have too restricted a view of what should comprise an environmental sound or stimulus. Fodor and Pylyshyn (1981) discuss this move with respect to the phonemic restoration effect and conclude that widening the conception of the effective stimulus allows the resolution of ambiguity, but concomitantly reduces the ability to explain the perceptual similarity which can occur in differing contexts. For direct theorists who hold that the auditory system seeks to recover the sound-producing physical events, the existence of sound homonyms may pose no problem, since these sounds are often produced by similar physical systems. Ballas and Howard give the example of a loud sharp bang, which could be caused by an engine backfire, a gun, or an explosion. In all cases the physical cause of the sound is the rapid expulsion of air from an enclosure, but, as I argue in Section 6.4, the environmental significances of these causes differ significantly, and identification is important.

5 Ecological ecological research

The number of auditory studies explicitly inspired by the Ecological approach is not large.
A substantial portion of the literature consists of speculative discussions of the applicability of direct or Ecological accounts to audition rather than descriptions of experimental work. The five examples discussed below have been chosen to indicate the types of experiments performed and the relative successes and failures which were encountered.

The influence of the Ecological approach manifests itself in the objectives of particular experiments or studies and consequently in their design. Characteristic aims are: the discovery of invariants of stimulation; obtaining evidence that perception is causally related to these invariants, which in turn is taken as evidence for direct perception; and characterizing the manner in which perception guides action. The search for invariants involves either mathematical analysis or physical measurements of a given environmental situation. In order to show that perceptual systems actually utilize a particular invariant its presence must be shown to be a sufficient condition for the relevant percepts to arise. Thus, observers must be shown to be able to perceive the environmental property which the invariant specifies and their percepts must be alterable by experimental manipulations of the invariant.

It is of course difficult to prove experimentally that perception is unmediated, particularly since the putative mediating processes are presumably unconscious and inaccessible to introspection. The argument for direct perception therefore generally consists of the identification of an invariant, verification of its efficacy, and a subsequent appeal to parsimony. If perception appears to be a function of stimulation, why invoke unconscious inference or other processes? Ultimately this is a somewhat unsatisfying approach since it must proceed case-by-case and leaves open the question of directness in situations for which no invariant has yet been discovered.
However, one may also hold that since it is an empirical matter, there is no logical inconsistency in simply assuming the existence of specification until it is disproved (Fowler 1991).

The style of experimentation also differs from the bulk of Establishment perceptual research in the types of stimuli used and the types of responses required of participants. Typically, the stimuli are complex or "realistic". Subjects are asked to characterize events or perform certain actions based on their perceptions. These are frequently more complex or natural actions than the typical psychophysical discrimination task.

5.1 Time-to-contact: acoustic looming

The derivation and investigation of an acoustic variable for time-to-contact provides a good model of the Ecological approach. In vision the inverse of the relative rate of expansion of an object's retinal projection, r / (dr/dt), specifies the time-to-contact if it is moving directly towards the observer. Shaw, McGowan and Turvey (1991) derive an acoustic equivalent based on the simplifying assumptions that the source is a compact monopole, the acoustic medium is non-absorbing, and the surroundings are anechoic. Under these conditions, acoustic time-to-contact, or tau_a, is equal to twice the inverse of the relative rate of change in intensity, 2I / (dI/dt), at the observer's position. If time-to-contact were to be deduced only from successive "snapshot" judgements of distance, accuracy would suffer, since estimation of auditory distance is notoriously poor (Gardner 1969). Prior to any experimental verification of the effectiveness of this invariant, Guski (1992) questioned whether the auditory system could in principle use this variable since he thought it required access to the absolute intensity of the sound source. This concern seems to be based on a misapprehension; the intensity in question is not that of the source, but the proximal intensity.
The acoustic tau is independent of overall intensity and distance, just as the visual one is independent of size. A number of studies have examined the ability of listeners to judge the time-of-passage of a moving sound source (for example Rosenblum et al. 1987), but these do not directly address the effectiveness of the tau_a invariant since other sources of information such as intensity and Doppler shift changes also specify the time of closest approach. Tau_a offers prospective information for time-of-arrival, and therefore the important test is whether arrival time can be accurately predicted from acoustic information collected before the "collision" when other variables are uninformative. Rosenblum (1993) describes an experiment in which recordings of cars passing an observer at various speeds were edited into thirds to evaluate the usefulness of information from different portions of the stimulus. The results indicate that information available prior to passage is as useful in estimating arrival time as hearing the actual passage.

Jenison (1994) has derived variables involving intensity, interaural time difference, and Doppler shift which are cues to parameters of the more general approach problem, in which the source moves past the observer at some distance and at a particular trajectory angle. Wightman & Jenison (1995) report data from an experiment using such synthesized stimuli which show that listeners can use prospective information to discriminate arrival times differing by about 300 ms.

While the effectiveness of this invariant seems to have been established, the assumptions under which it was derived are actually quite restrictive. Sources radiating short wavelengths cannot be approximated by compact monopoles and in reverberant environments the invariant applies only to the direct signal. I shall discuss issues of this sort in the concluding sections of the paper.
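The looming relation discussed in this section can be checked with a short derivation. The sketch below is a reconstruction under the three idealizations named in the text (compact monopole, non-absorbing medium, anechoic surroundings), plus the further assumption of a head-on approach at constant speed. The symbols P (source power), r_0 (initial distance) and v (approach speed) are introduced here for illustration and are not taken from the cited papers.

```latex
% Assumed setup: a compact monopole of constant acoustic power P approaches
% the listener head-on at constant speed v through a lossless, anechoic medium.
\begin{align}
I(t) &= \frac{P}{4\pi r(t)^{2}}, \qquad r(t) = r_{0} - vt, \\
\frac{dI}{dt} &= -\frac{2P}{4\pi r^{3}}\,\frac{dr}{dt}
              = \frac{2Pv}{4\pi r^{3}}
              = \frac{2v}{r}\,I, \\
\tau_{a} &= \frac{2I}{dI/dt} = \frac{r}{v}.
\end{align}
```

The source power P cancels, so only the proximal intensity and its rate of change are needed, and r/v is the time remaining until the source reaches the listener. This is consistent with the reply to Guski's objection given above.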
5.2 Using auditory information for active contact

The studies of acoustic tau discussed above required subjects to judge time-to-contact independent of any other action. In an experiment conducted by Heine and Guski (1993), participants were requested to catch a ball rolling towards them using only acoustic information. The balls were released on a ramp which they rolled down and continued towards the edge of a table at which the subject was seated. Only a single reach-and-catch gesture was permitted, so good performance depended on estimation of time-to-contact from the sound produced by the ball. While results varied with the size of ball used (and hence the strength of the sound produced), performance turned out to be quite poor overall. The authors advance various explanations for this, the first of which is that the experiment was conducted in an anechoic room, a condition under which it is very difficult to judge distance auditorily. This seems like an unfortunate point to raise, since the advantage of the "looming" invariant for time-to-contact is that it is independent of distance. If distance judgements are required, the case for the efficacy of the invariant is undermined. A second point raised is that sighted humans rarely rely only on acoustic information in natural situations. Hearing typically aids orientation and preparation for visually-guided action. However the fact that blind athletes can apparently use similar information to play games involving rolling balls leads the authors to conclude that sufficient information is present in the acoustic signal, but that their sighted subjects were not attuned to it.

5.3 Transformational invariants: breaking & bouncing

An early and frequently-cited example of Ecological acoustics is the study of the perception of breaking and bouncing by Warren and Verbrugge (1984). The emphasis is on identifying transformational invariants (specifying a dynamic characteristic) for bouncing and breaking events.
It is suggested that a "single damped quasi-periodic pulse train" specifies a bouncing event and that an "initial rupture burst dissolving into overlapping multiple damped quasi-periodic pulse trains" specifies breaking. Subjects listened to natural tokens of bottles and jars hitting a linoleum floor and were asked to identify the type of event independent of the material involved. In addition to the breaking and bouncing categories, subjects were encouraged to respond "don't know" if they could not decide or if they perceived some other type of event. Given this three-choice task, correct identification was better than 98% for both types of tokens. To verify that the hypothesized invariants do specify the two types of events, synthetic tokens were constructed using recorded sounds from four single pieces of glass. Here correct identification was 90.7% for bouncing and 86.7% for breaking. It is of course possible that subjects used prior knowledge of similar events to perform the classification rather than perceiving them directly via the temporal patternings. As the authors acknowledge, an additional problem is the response method used. If these temporal structures truly specify the events, then rates of correct identification should be unaffected by the number of different sorts of events to be identified. If non-breaking and non-bouncing events were included, would performance deteriorate? Predefining the categories brings to mind a criticism which has been leveled at Establishment theorists; Turvey et al. (1981) state that those opposed to Establishment theory should ask of its proponents "both why and how any given thing comes to be described in just those predicates that are consonant with the hypothesis mediating its interpretation." By restricting the responses and the categories, perhaps Warren and Verbrugge have cast the task into the form of a statistical inference problem.
5.4 Perceiving numbers by audition

Occasionally, as in the ball-catching study, invariants of stimulation may exist, but seem to be poorly utilized by observers. The task in this experiment (Heine, Guski & Pittenger 1993) was for listeners to estimate the number of steel balls dropped and allowed to bounce on a wooden surface. For a single ball the sound consisted simply of a series of impacts, while with two or more balls there were also collisions between balls. Recordings were made in an attempt to find acoustical correlates of the number of balls dropped. Correlation coefficients with magnitudes from 0.95 to 0.99 were found between the number of balls and the peak sound level, the time interval between the first and second bounces, and the overall duration of the event. Although subjects were able to identify the single-ball case reliably, in all other cases the number of balls tended to be under-estimated, and the variability in responses was high. In fact, from the data presented, it does not appear that listeners could reliably distinguish between 2 and 9 balls. The explanations offered for this result are similar to those in the ball-catching experiment. The task is somewhat unnatural and makes atypical demands on the auditory system, which may not be attuned to pick up the acoustic invariants available. The authors again make an un-Ecological remark about the subjects' lack of knowledge of the situation. Apparently, when subjects were shown the experimental setup before being blindfolded, the correlation between judgements and number of balls increased from 0.73 to 0.84, which suggests that prior, non-auditorily-derived knowledge of the situation may be as important as "attunement", which was not demonstrated.

5.5 Acoustic texture in distance perception

The final example of Ecologically-inspired experimentation is an investigation of the utility of providing "acoustic texture" in a distance judgement task (Höger 1993).
Gibson (1979) proposed that texture gradients are invariants for surface slant and that the amount of texture occluded by an object serves to specify its distance from the observer. By (rather weak) analogy, "it is assumed that characteristic changes of background sounds from different locations constitute an acoustic texture gradient of depth". Four loudspeakers were positioned at 4 m increments from the listener, whose task it was to identify the position from which one of three sounds (truck, dog or ducks) was presented. In a "texture" condition a recording of singing birds was played in a random order from each loudspeaker prior to presentation of the test stimulus. The data revealed no significant effect of adding texture except at one distance for the truck sound. A second experiment employed monaural recordings of stationary or moving cars at various distances. These were presented over headphones with and without texture, and listeners were asked to report the apparent distance to the car. Texture had no effect for the stationary car, but slightly reduced a tendency to underestimate distance in the moving car condition. This bias did not exist for the stationary car, which is puzzling since moving stimuli contain dynamic Doppler shift and intensity cues to distance, and hence judgements might be expected to be more accurate. It is clear that acoustic texture cannot specify distance in the way that visual texture is held to do. In the visual case, occlusion is essential and this does not exist in the auditory case. In Section 6, I argue that attempts of this sort to apply the principles of visual ecological theory directly to the auditory realm are ill-advised.

6 Prospects for an Ecological account

My third aim in this paper is to have a critical look at the current status of Ecological accounts of audition in order to assess their successes and shortcomings.
Since Gibson's approach is so deeply rooted in vision, the first step taken is to examine the differences between the auditory and visual ecologies. Following this, I discuss the problem of the superposition of acoustic signals, of acoustic specificity, and of auditory affordances.

6.1 Auditory ecology

A theory of audition (whether Establishment or Ecological in style) must take account of the particulars of acoustic ecology. The manner in which sound is usefully structured by the world differs greatly from the way light is, and therefore auditory systems (and auditory theories) are faced with many challenges dissimilar to those found in vision. Although many differences can be listed, I contend that the root cause is the fact that audible sound has very long wavelengths in comparison to those of light. The range of human hearing covers wavelengths from roughly 10 m to 2 cm, while the eye is sensitive to wavelengths from approximately 400 to 700 nm. Light and sound are both wave phenomena, but their differing scales mean that the ways in which they interact with the same objects in the world are dissimilar. The first consequence of wavelength is that there can be no "acoustic retina". To achieve the same spatial resolving power as the eye, an acoustic lens would need a diameter of approximately 200 m for the highest frequencies and 100 km for the lowest. The transduction of sound is therefore non-directional; sound impinging upon the listener from any direction is "projected" to a single point - the eardrum. There is no geometrical preservation of space or place-to-place mapping from the world to a receptor surface as there is in vision. The challenge facing the visual system is often stated in the form of the inverse projection problem. Given a 2-dimensional retinal projection, there are infinitely many 3-D surface layouts which could have produced it.
Clearly the problem is even worse in audition since the projection is from three dimensions to a 0-dimensional point. The situation is ameliorated somewhat by the facts that we possess two ears and that sound travels relatively slowly, allowing interaural time differences to specify one component of source direction. A second consequence of sound's large wavelengths is that sound-emitting objects, unless they are very large, do not occlude other sources in the way that visual objects do. Diffraction permits sound to sweep past objects and to propagate around corners, and thus occlusion cannot provide information about the relative distances of interposed objects. Auditory masking is sometimes compared to visual occlusion, but the processes are really quite different. An intense sound will mask other sounds independent of their direction of origin, and there is no way to "listen around" a masker. It is a cotemporal process rather than a codirectional one. The combination of 3-D to 0-D projection and the lack of occlusion means that the auditory system is faced with determining the spatial positions and character of sources whose sounds are superimposed at the receptor. There is no independent access to sounds from different directions or at different distances, and the information from all concurrently active sources must pass through a single channel. Somehow this information gives rise to percepts of individual sources. The situation is further confounded by a third consequence of wavelength, which is that sound reflection is specular and maintains the important temporal structure of the original source signal. It is generally specular because sound-reflecting surfaces are much smoother at the wavelength scale than are the same surfaces when reflecting light. Frequently there is little to distinguish an echo from an additional source, and these reflections are themselves superimposed on the signal at the eardrum.
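The aperture figures quoted in this section for a hypothetical acoustic lens can be checked with a rough diffraction argument: angular resolution scales with wavelength divided by aperture diameter, so matching the eye requires the same ratio of aperture to wavelength. The sketch below assumes a pupil diameter of 5 mm and a representative light wavelength of 500 nm; both values are illustrative assumptions and do not come from the text.

```python
def matching_aperture_m(sound_wavelength_m,
                        pupil_diameter_m=5e-3,
                        light_wavelength_m=500e-9):
    """Aperture giving sound the same diffraction-limited resolution as the eye.

    Resolution scales as wavelength / aperture, so the aperture must grow in
    proportion to the wavelength. The default pupil size and light wavelength
    are assumed values chosen for illustration.
    """
    return pupil_diameter_m * sound_wavelength_m / light_wavelength_m


# Shortest and longest audible wavelengths used in the text: 2 cm and 10 m.
for wavelength in (0.02, 10.0):
    aperture = matching_aperture_m(wavelength)
    print(f"wavelength {wavelength} m -> aperture of about {aperture:.0f} m")
```

With these assumed values the result is on the order of 200 m for the shortest audible wavelengths and 100 km for the longest, in line with the figures given above.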
Because of sound's long wavelength and our lack of acoustic retinae, the information contained in sound reflected from an object is rather low-resolution. Humans, unlike bats, rely primarily on the sound-emitting properties of objects rather than their sound-reflecting properties. Bregman vividly sums up the situation this way: "This way of using sound has the effect of making acoustic events transparent; they do not occlude energy from what lies behind them. The auditory world is like the visual world would be if all objects were very, very transparent and glowed in sputters and starts by their own light, as well as reflecting the light of their neighbors. This would be a very hard world for the visual system to deal with." (Bregman 1990) Helmholtz also addresses the problem of superposition in his discussion of the separation of systems of ripples on the surface of a body of water: "But the ear is much more unfavorably situated in relation to a system of waves of sound, than the eye for a system of waves of water. The ear is affected only by the motion of that mass of air which happens to be in the immediate neighborhood of its tympanum within the aural passage. ... The ear is therefore in nearly the same condition as the eye would be if it looked at one point of the water through a long narrow tube, which would permit of its seeing its rising and falling, and were then required to undertake an analysis of the compound waves. It is easily seen that the eye would, in most cases, completely fail in the solution of such a problem. The ear is not in a condition to discover how the air is moving at distant spots, whether the waves which strike it are spherical or plane, whether they interlock in one or more circles, or in what direction they are advancing. The circumstances on which the eye chiefly depends in forming a judgement, are all absent for the ear. 
If, then, notwithstanding all these difficulties, the ear is capable of distinguishing musical tones arising from different sources - and it really shews a marvelous readiness in so doing - it must employ means and possess properties altogether different from those employed or possessed by the eye." (Helmholtz 1877) Thus, the auditory system relies mainly on different sorts of structures in stimulation than the visual system - temporal ones rather than spatial. The acoustic signal therefore supplies information about very different properties of objects than does light, and this potentially leads to a further source of ambiguity in the stimulus. I will discuss the problem of acoustic specificity shortly, but first wish to examine the significance of the superposition problem for an Ecological theory of hearing.

6.2 Superposition

A central tenet of the Ecological approach is the idea that what count as stimuli should be broadened with respect to the traditional view. Thus the stimulus in vision is taken to be the optic array, rather than the retinal image. There is no reason why an acoustic array could not be defined to give spectral content as a function of time and direction over a sphere centered on the listener. However, it is not clear that defining the stimulus in this way is of much use since there is no directional access to this array prior to transduction. One cannot sample the acoustic array in the same sense that the visual system can sample the optic array. Directional information such as binaural difference cues and direction-dependent pinna filtering might be held sufficient to define unambiguously the location of a source, but these are properties of sounds corresponding to individual sources and not of the complex superposition of signals at the eardrum. In general, superposition seems to be an unaddressed and difficult problem for direct realism.
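As a rough numerical illustration of why superposition matters for a proposed invariant such as the acoustic tau of Section 5.1, consider a looming monopole heard together with a steady background source. The sketch assumes free-field inverse-square spreading, a head-on approach at constant speed, and incoherent sources whose intensities simply add at the eardrum. All function names and numerical values are hypothetical.

```python
import math


def monopole_intensity(power_w, distance_m):
    """Free-field intensity of a compact monopole (inverse-square law)."""
    return power_w / (4.0 * math.pi * distance_m ** 2)


def looming_rate(power_w, distance_m, speed_m_s):
    """dI/dt for a head-on approach: with I = P / (4 pi r^2), dI/dt = 2 I v / r."""
    return 2.0 * monopole_intensity(power_w, distance_m) * speed_m_s / distance_m


def tau_of_mixture(power_w, distance_m, speed_m_s, background_intensity):
    """2*I/(dI/dt) computed on the summed signal rather than on the looming source alone.

    The steady background adds to the total intensity but contributes nothing
    to its rate of change, so the estimate is inflated.
    """
    total = monopole_intensity(power_w, distance_m) + background_intensity
    return 2.0 * total / looming_rate(power_w, distance_m, speed_m_s)


# Hypothetical case: source 15 m away closing at 10 m/s (true time-to-contact 1.5 s).
alone = tau_of_mixture(1.0, 15.0, 10.0, background_intensity=0.0)
equal_background = monopole_intensity(1.0, 15.0)
mixed = tau_of_mixture(1.0, 15.0, 10.0, background_intensity=equal_background)
print(alone, mixed)
```

Under these assumptions the estimate taken from the mixture exceeds the true time-to-contact by the factor (1 + I_background / I_looming), so a background as intense as the looming source doubles it. The variable is informative only if it is computed on the components belonging to the moving source.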
Proposed invariants such as the acoustic tau and Warren and Verbrugge's bounce-specifying temporal patterns are properties of individual sources or events. If a listener is presented with a stationary source and a looming one, the acoustic tau of the overall signal does not specify time-to-contact. The sources must be separated so that only those components belonging to the moving source are subjected to the looming "computation". Again, suppose a bouncing event is heard simultaneously with babble from a group of speakers - the stimulus as a whole will not take the form of a quasi-periodic pulse train. As mentioned previously, Gibson (1966) suggests that orienting to a sound source synchronizes the arrivals at the two ears, but separation is also possible with monaural listening and with diffuse, unlocalizable sources. It is hard to imagine how separation occurs without something like the segregation and fusion processes proposed by Yost and Bregman, but these seem to operate heuristically and to impose a structure on the stimulation. Ultimately the percepts derived seem to owe as much to the processes of separation as to the sound-structuring properties of the environment. This is not the sort of explanation proponents of Ecological perception have in mind, but no serious alternative has been proposed. A point to note is that in the domains where the Ecological approach has been most successfully applied, vision and haptics, the superposition problem does not exist. Only one object can be in contact with the skin at any point, and only light from the nearest surface in a particular direction contributes to the optic array. The fact that source separation is less problematic in these modalities perhaps explains why it has not been dealt with in Ecological accounts of audition.

6.3 Specification

Setting aside the issue of superposition, let us consider specification in the auditory domain.
For a direct account to succeed, detectable properties of the acoustic signal must stand in a one-to-one relationship with the perceived properties of sound sources. The source-to-sound mapping is clearly unique, but, even for a single source in a noise-free environment, can we be sure that the reverse mapping is also unique? Can the inverse problem of recovering the causal source of the acoustic signal be solved? For a number of simple sound-producing systems it seems that it cannot. First consider the Helmholtz resonator, which consists of a vessel enclosing a volume of air with a neck containing a "plug" of air. The resonant frequency of such a device depends only on the mass of air in the plug and on the volume of air in the main chamber. Vessels of many shapes and sizes can produce the same sound, and therefore these parameters cannot be specified. Similarly, the frequency of vibration of a stretched string depends on its length, mass, and tension. Thus length, for example, cannot be specified since a change in length can always be compensated by appropriate adjustments in tension or mass. The 2-dimensional counterpart of the string, the stretched membrane or drum, also suffers from this same ambiguity. While the frequencies of various modes of vibration provide information about the area of the membrane and the length of its perimeter, it has been proven that drums of different shapes can vibrate with exactly the same set of frequencies when struck (Cipra 1992, Driscoll 1995). Hence "one cannot hear the shape of a drum" (Gordon et al. 1992). Finally, it has been demonstrated that identical vowel spectra can be produced by the human vocal tract in very different configurations (Ladefoged et al. 1978), and in principle a given set of formants can be produced by a variety of vocal tract area functions. This is a problem for those who maintain that speech is perceived on the basis of articulator position recovery.
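The string case above can be made concrete with the textbook expression for the fundamental of an ideal stretched string, f = (1 / 2L) * sqrt(T / mu), where L is the length, T the tension, and mu the mass per unit length. The sketch below is only an illustration of the many-to-one mapping; the particular lengths, tensions, and densities are hypothetical.

```python
import math


def string_fundamental_hz(length_m, tension_n, mass_per_length_kg_m):
    """Fundamental frequency of an ideal stretched string."""
    return math.sqrt(tension_n / mass_per_length_kg_m) / (2.0 * length_m)


# Three physically different strings (hypothetical values).
short_string = string_fundamental_hz(length_m=0.65, tension_n=70.0, mass_per_length_kg_m=0.001)
# Doubling the length is compensated by quadrupling the tension.
long_tight_string = string_fundamental_hz(length_m=1.30, tension_n=280.0, mass_per_length_kg_m=0.001)
# Halving the length is compensated by quadrupling the mass per unit length.
short_heavy_string = string_fundamental_hz(length_m=0.325, tension_n=70.0, mass_per_length_kg_m=0.004)

print(short_string, long_tight_string, short_heavy_string)
```

All three strings share the same fundamental, so the frequency alone constrains only the combination of length, tension, and mass, not any one of them.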
No amount of sampling or scanning of the acoustic array can resolve the ambiguities, so it must be assumed that these particular sound-specifying parameters are not specified in the sound produced. Perhaps these are merely overly-simplified systems, which Gaver might group with musical sounds; "Musical sounds are not representative of the range of sounds we normally hear. ... Musical sounds seem to reveal little about their sources, whereas everyday sounds provide a great deal of information about theirs." (Gaver 1993a) Fowler (1991) states that a claim that we "hear the world" is not a claim that we hear every property of the world or that every different thing is perceived differently, but claims that specificity exists nearly always "for relevant properties of objects and events with which we interact". This assertion seems vaguely circular, since it would be rather lucky for us to live in a world where no relevant properties of objects are unspecified. It is clear that there are properties which cannot be specified acoustically - whether these are relevant or not is a matter for debate. There are also properties which do seem to be specifiable. For example, the elasticity of a vibrating material is indicated by the decay rate of vibration when it is struck (Wildes & Richards 1988). Fowler also refers to the rareness of "mirages" outside of the laboratory, but in addition to the sound homophones described by Ballas, one can think of more natural examples. Gibson mentions that thunder 'cracks', but tree branches also 'crack', and the physical causes are quite dissimilar. Given that some properties of the systems discussed cannot be specified, it is necessary either to suggest means of resolving ambiguity, to refine the idea of what it means to recover the source, or to abandon the inverse problem altogether (Kluender 1991). For an Ecological account, supplying the perceiver with knowledge of the constraints of the system is not an option.
For example, if one knew the possible configurations for a human vocal tract it might be of assistance in recovering articulator positions, although in modeling this appears to be difficult even with careful X-ray measurements of one individual speaker (Bailly et al. 1991). Fowler's move to block "premature allegations of lack of specificity in acoustic speech signals" (1991) is the proposal that in running speech the situation is different. The requirement that the current configuration must be smoothly connected to those before and after may constrain the problem enough to yield a unique solution. A different view is held by Kluender (1991), who refers to work on visual structure-from-motion in maintaining that once rigidity is given up (and he claims it must be) all bets are off in solving the inverse problem. Gaver (1993b) recognizes the limits of specification and suggests that what are specified are constraints on solutions to the inverse problem. Coupling this with Fowler's position that we do not hear everything, the question seems to be: what do we hear? With how fine a brush is the auditory world painted, and can the answer to that question be accounted for by the information available in acoustic stimulation? Answering these questions is of course rather difficult since even in free identification tasks it is impossible for listeners to describe every aspect of their percepts. Discrimination studies leave open the question of whether responses are based on recovery of source properties or simply on differences in the acoustic signals. Studies in which subjects are asked to detect source properties often limit the domain of responses, and thus do not speak directly to the specificity question. Examples of the latter are studies of the perception of breaking and bouncing (Warren and Verbrugge 1984), handclapping (Repp 1987), and mallet hardness (Freed 1990).
The foregoing comments mainly concern the specification of shape and vibrational properties of sound emitters, but questions of specificity also exist in determining the spatial layout of sources and the environment in which they are active. In determining the direction from which sound is arriving the auditory system, absent head movements, depends on binaural difference information and the directional filtering performed by the pinnae. The spectrum of the sound reaching the eardrum is ambiguous with respect to this spectral cue because the contributions of the source spectrum and the pinna filtering are not separately available. Yet listeners can localize sounds without employing head movements. The explanation for this achievement has traditionally been that listeners employ a priori knowledge of the source spectrum to recover the pinna filter function and to identify the source position, although this idea has not been tested rigorously. The sound field at a listener's position is structured not only by the sources of sound but also by the layout of reflecting surfaces in the environment, and it is often proposed that as a consequence of this we can hear the location of these surfaces. A previously-mentioned example was that "we can hear the narrow, echoing walls of the alley it [a car] is driving along" (Gaver 1993a). We can obviously tell the difference between the interior of a cathedral and a coat closet, but how much information for the layout of surfaces is actually present? Consider two properties of a sound field which depend on the characteristics of the enclosure in which events occur: the reverberation time and the direction of arrival of reflections. It has been shown that the reverberation time of a room is directly proportional to its volume and inversely proportional to the surface area of the walls and their absorptivity (Morse and Ingard 1968). Therefore the shape of the room cannot be specified by this parameter.
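The point about reverberation time can be illustrated with Sabine's classical approximation, T60 = 0.161 V / (S a), with the volume V in cubic metres, the surface area S in square metres, and a the average absorption coefficient. This specific form and constant are assumed here as the standard statement of the proportionality described above, and the room dimensions and absorption value are hypothetical.

```python
import math


def sabine_rt60_s(length_m, width_m, height_m, absorption):
    """Sabine reverberation time of a rectangular room with uniform absorption."""
    volume = length_m * width_m * height_m
    surface = 2.0 * (length_m * width_m + width_m * height_m + length_m * height_m)
    return 0.161 * volume / (surface * absorption)


# Two differently shaped rooms with the same volume-to-surface ratio
# (hypothetical dimensions, same assumed absorption coefficient).
cube_room = sabine_rt60_s(6.0, 6.0, 6.0, absorption=0.2)
flat_room = sabine_rt60_s(4.0, 8.0, 8.0, absorption=0.2)

print(cube_room, flat_room)
```

Under this approximation the cubic room and the flatter, wider room have the same reverberation time, so the parameter carries information about a ratio of volume to absorbing surface rather than about the shape of the enclosure.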
Secondly, in an experiment using synthesized stimuli carried out in our laboratory (HDRL, Waisman Center) we found that subjects were unable to discriminate between cases in which wall reflections accurately duplicated those in a rectangular room and those in which the reflections came from arbitrary directions with the same distribution of time delays. In this case the auditory system was not sensitive to the locations of the walls, but only to their distances relative to the listener and the source. Thus only "fuzzy" information about the layout of surfaces in the environment seems to be present in the acoustic array. The final observations I will make about specificity concern the auditory perception of distance. In an Establishment analysis, the auditory system is faced with an inverse projection problem exactly analogous to that in vision. An image projected on the retina could arise from an object at any distance if its size is chosen appropriately, and (in an open space without reflecting walls) a sound of given proximal intensity could be caused by a source at any distance given the appropriate sound level. Another feature which varies with distance, the absorption of high frequencies, is ambiguous in the same way as the pinna filtering cue to direction. Gibson's solution to the visual problem is to point out that objects are not encountered floating in a featureless void, but that they generally appear against some sort of textured background (Gibson 1979). The amount of texture surrounding and occluded by the object is held to specify its distance and size, but a similar invariant cannot exist in audition because there is generally no such thing as acoustic occlusion. Höger (1993) attempted to devise an auditory counterpart to Gibson's surface texture, but found little improvement in the accuracy of distance judgements when "texture" was added.
I feel that this experiment is an example of a tendency which, at worst, leads to the assumption that principles derived for Ecological optics apply equally well in the auditory modality, and at best to the production of rather strained analogies.

6.4 Auditory affordances

Although not intrinsic to direct perception, affordances are an important constituent of the Ecological approach. While one might envision an account in which just spatial layout itself is perceived without mediation, Gibsonians emphasize the ecological significance of certain configurations. This is necessary since their reconceptualization of perception ties it intimately to action. One can find various definitions of affordances in the literature, some more straightforward than others:

"The affordances of the environment are what it offers the animal, what it provides or furnishes for good or ill. The verb to afford is found in the dictionary, but the noun affordance is not. I have made it up. I mean by it something that refers to both the environment and the animal in a way which no existing term does. It implies the complementarity of the animal and the environment." (Gibson 1979)

"Affordances are the acts or behaviors permitted by objects, places, and events." (Michaels and Carello 1981)

"A propertied thing X ... affords an activity Y ... for a propertied thing Z ... if and only if certain properties of X ... are dually complemented by certain properties of Z, where dual complementation of properties translates approximately as properties that are related by a symmetrical transformation or duality T such that: T(P1) → P2 and T(P2) → P1." (Turvey et al. 1981)

Although Ecological theorists define affordances in rather general terms, those which are commonly introduced to explain the idea tend to be of a particular type. For example we are given climbability, grabability, crawl-intoability (Turvey et al. 1981), sit-onability, and drink-fromability (Michaels & Carello 1981).
These affordances are said to be the objects of perception specified by variables of stimulation and as such they share one essential property. This is that the characteristics of surfaces responsible for structuring the optic array (and thereby providing information for the affordances) are the same characteristics which underlie their ecological significance. In other words, a group of surfaces provides certain affordances by virtue of its shape, and it is its shape which structures the information-bearing light. In fact the case for direct perception of affordances rests on this type of specification and the additional assertion that there is a one-to-one mapping between layout and variables of the optic array. Whether or not affordances are directly perceived in vision, the situation in audition is clearly somewhat different. In general the ecological significance of a sound-emitter need have little to do with the means by which it produces sound, although there are certainly exceptions. A snake may be identified as a threat by hearing its hiss, but it is threatening because it is a snake and not because it lacks vocal cords. The ringing of a telephone has significance because a telephone is a message-conveying device and not because it contains a brass bell or a buzzer. In both of these examples it is perception-as which is important, and not perception-of. Of course counter-examples are also available; a woodpecker may detect hollows in a tree trunk by tapping and an organism may judge the approximate size of an enclosure by variables related to reverberation. The affordances of objects are generally related to their shapes. These may be specified by light, but need not be specified by sound. Sounds are signifiers as well as specifiers. The sorts of characteristics which comprise affordances are not always the sorts of things which can be specified in the acoustic array.
Michaels and Carello (1981) state that "to detect affordances is, quite simply, to detect meaning", but it seems clear that meaning can be detected without affordances as they are typically construed. If one accepts this, it would seem that we have found an instance of one of Mace's "five ways to have a theory of indirect perception" (Mace 1977) because meaning is not specified without an additional step of identification or recognition. Mace points out that a direct theory of perception must be an Ecological one, although an ecological theory need not be direct (Fowler 1990). Unless meaning itself, in other words affordance, is specified and picked up, mediation is required to interface perception with the psychological systems controlling action. The difficulty of translating the standard concept of affordance into the auditory domain is reflected in the infrequency with which writers on ecological acoustics use the term. For example it is not mentioned by Jenkins (1985), Fowler (1990, 1991), or Gaver (1993a, 1993b). When authors do refer to affordances in an auditory context they seem to do so with some carelessness, or in ways which distress strict adherents of Gibson's program. For example in Handel's monograph, Listening (1989), he claims to take an approach inspired in part by Gibson, but makes the following statement about sound source identification: "At a third level, we hear objects. [The first two levels being physical features of sounds and more abstract timbral qualities.] I am thinking of 'violinness', 'President Carterness', 'President Reaganness', and 'airplaneness'. What is characteristic is that the sounds seem directly perceived as objects. Gibson (1979) has used the term affordances." for which he is rightly taken to task by Heine & Guski (1991). Affordances are not objects; they are what objects afford. Other explicit examples are few in number.
Gibson (1966) mentions that sound sources afford orientation and localization, meaning that an organism can establish its position and heading in space relative to a sound-emitter. Note however that this affordance is related to spatial layout and not to the properties of the object which determine what sort of sound it produces. Michaels & Carello (1981) briefly discuss the complementarity of perception and action in the context of an articulatory basis for speech perception, but it is not clear that this really addresses the issue of affordances. While it is an indispensable part of an Ecological account, the theory of affordances seems to be one which has yet to be seriously addressed by Ecological acousticians.

7 Conclusions & Speculations

In this paper I have attempted to review some theoretical writings on the nature of auditory perception, to examine the sorts of experimentation on distally-focused perception carried out by Establishment and Ecological researchers, and to look critically at the state and prospects of Ecological acoustics. In general it seems that despite the emphasis on vision in perceptual theorizing, theories of audition have paralleled those of vision. Advocates of perception-as and perception-of accounts do not seem to differentiate between the perceptual systems. Those who propose unconscious inference, association, or representational transformations as accounts of visual perception do so also in the case of hearing. Those who maintain that perception is direct and unmediated and that the world is specified by stimulation apply their analysis with equal conviction to both modalities. These consistencies are evidence of the desire to develop a theory of perception, either mediated or direct, in which all the modalities are governed by the same principles. To date, attempts to devise an Ecological account of audition seem to suffer from three shortcomings.
First, the proposed source-specifying invariants of sound are not in general invariants of the effective stimulus - the eardrum signal - in which sounds from many sources are superimposed. Therefore an account of direct source separation is required. Second, many sound-structuring properties of objects cannot be specified uniquely in the acoustic signal. Therefore it is important to give an account of what it is that is directly perceived. Finally, no serious attempt has been made to define auditory affordances, and without them a theory of perception-for-action is a step short of direct.

I will conclude with some (perhaps ill-advised) speculations about the objects of auditory perception and the difference between audition and vision. In a series of papers, Diehl & Kluender and Fowler engage in a lively debate about what should properly be considered the objects of speech perception (Diehl & Kluender 1989a, Fowler 1989, Diehl & Kluender 1989b, Fowler 1990, Diehl et al. 1991, Fowler 1991). Fowler's position is that, directly or not, the auditory system attempts to recover sound-producing events, and thus the objects of speech perception are articulatory. Diehl & Kluender maintain that speech perception involves decoding auditory information, and that effective communication does not require access to the vocal tract configurations of one's conversants.

Does the auditory system seek to satisfy the same goals as the visual in general, or are there natural situations in which we hear sounds and not the properties of sound-emitters? Gaver makes a distinction between musical listening and everyday listening, in which the former involves attending to the timbre and other abstract auditory properties of a sound (not necessarily music), and the latter to hearing objects. The distinction is somewhat akin to the difference between Brunswik's analytic and naive-realistic attitudes. The type of listening one indulges in is to some extent under conscious control.
Although everyday listening may be the default, it is quite possible to listen musically to environmental sounds, and in fact contemporary musical genres like musique concrète and the electroacoustic pieces of Alvin Lucier rely on this ability. In addition, shifting one's focus from everyday to musical listening does not result in a relocation of the percept to the ears of the listener. Thus the following criticism of Diehl & Kluender's account is moot:

"In acoustic perception, Diehl et al. aver, however, stimulus structure in the air that has been caused by an event is heard in itself. Why? And why does this allegedly acoustic-signal perceiving system localize sound, not where the acoustic signal is (in the ear), but where the acoustic-signal causing event is in the world?" (Fowler 1991).

It is simply a fact that this does not happen, whichever style of listening one happens to be involved in. Nor does it happen in vision. An observer can compare the relative projective sizes or the colors of objects without the location of the percepts jumping to the retina. Fowler mentions "nonsense stimuli" such as sinewave analogs to speech, which do not contain enough information to specify their causal sources, but these stimuli are localized in-the-world to the same degree that real speech or any other sound is.

So, it is possible to avoid recovering a sound's causal source (or at least to ignore its recovery), but in the absence of adjusting one's Brunswikian attitude, is it always the case that this recovery takes place? It seems clear that the visual system attempts to recover light-structuring properties of the world, that is, surface layout. We do see surfaces and objects, but, regardless of identification, do we always hear materials and the events they are involved in? It is my feeling that the answer depends on time scale.
We seem to hear events such as bouncing and approach, which structure sound at a scale of tenths of seconds to seconds, but it is not clear that we hear properties which structure vibration at smaller time scales. Consider the following examples. When we hear ventilation noise, do we really hear air turbulence resonating in a duct? In perceiving speech, do we really hear the vocal cords vibrating? When we listen to a door swinging shut and slamming do we really hear the "stiction" in the rusty hinges and the vibration of the door, or just a squeak and a bang? When walking in the park do we really hear the crickets' little legs rubbing away or just a curious buzz? In fact, the sounds of many animals seem to pose this sort of problem; even Jenkins provides an example while extolling the richness of acoustic information:

"From our backyard locale, my wife and I heard a remarkable burst of song - some kind of warbler. At length we located a small bird on a high wire at the end of the yard. Could it be that this tiny bird was the source of the song? We thought it unlikely, but we were rapidly convinced by the synchrony of the bursts of song and the movements of the bird." (Jenkins 1985)

Note that the sound source was identified as a warbler, but even this did not help to specify the size of the bird.

Is there any a priori reason to think that audition is fundamentally different from vision in this way? Assuming that these intuitions are correct, why should it be the case that we can hear sounds in the environment without hearing vibration-structuring properties? A possible explanation lies in the previously mentioned fact that vision is primarily directed at reflectors of energy, while audition is primarily directed at sources of energy. What happens when we view sources of light directly? It seems that the experience is of "a source of light of a particular color and intensity at a particular location".
While one can perhaps identify the spectrum-structuring properties of the source (it's an LED, it's a sodium lamp etc.) the primary experience is of the radiant light itself. Gibson makes the following remarks about radiant light:

"Is there any kind of information in radiant light? The answer must be yes, for the spectrum of any radiant beam specifies vibrations in the atoms that emitted the energy. The astronomer with a spectroscope can identify the substance of the star. One could aim the instrument at a luminous object and determine whether it is incandescent, fluorescent, bioluminescent, etc. But note that an eye cannot do this; it cannot register the distribution of wavelengths and cannot measure their absolute intensities. This is not the kind of information an eye can pick up. A single spot of light in darkness conveys only a minimum of information to an eye." (Gibson 1966).

"Radiant light has no structure; ambient light has structure. Radiant light is propagated; ambient light is not, it is simply there. Radiant light comes from atoms and returns to atoms; ambient light depends on an environment of surfaces. Radiant light is energy; ambient light can be information." (Gibson 1979).

The perception of radiant light is an exceptional case in Gibson's visual theory, but radiant sound is the main stuff of audition. It seems somewhat perverse to hold that radiant light specifies atoms but that atoms are not perceived while maintaining that radiant sound specifies vibration-structuring properties of objects and that these can be perceived.

If it is indeed the case that we can hear temporally-extended events but that we sometimes hear only sounds (while still being able to identify their sources), perhaps Schubert's concepts of Source Identification and Source Behavior Recognition can serve as a model for a uniquely auditory theory of perception.
Whether or not the foregoing comments are convincing, it is clear that transferring any perceptual theory wholesale from one modality to another can be problematic. The ecology of audition poses unique challenges which must be taken seriously by theorists of any stripe.

References

Bailley G, Laboissière R, Schwartz JL (1991): A model of coarticulation based on connectionist sequential networks: can we recover articulatory movements from acoustics. Conference on Current Phonetic Research Paradigms: Implications for Speech Motor Control. Stockholm, Sweden, August 1991. (cited in Kluender 1991)
Ballas JA, Howard Jr. JH (1987): Interpreting the language of environmental sounds. Environment and Behavior 19(1):91-114.
Ballesteros S (ed) (1994): Cognitive approaches to human perception. Lawrence Erlbaum Associates.
Boring EG (1942): Sensation and perception in the history of experimental psychology. Appleton-Century.
Bregman AS (1981): Asking the "what for" question in auditory perception. In Perceptual Organization, ed Kubovy M & Pomerantz JR. Lawrence Erlbaum Associates.
Bregman AS (1990): Auditory scene analysis. MIT Press.
Bregman AS, Campbell J (1971): Primary auditory stream segregation and perception of order in rapid sequences of tones. J.Exp.Psych. 89:244-249.
Bruce V, Green PR (1990): Visual perception: Physiology, psychology and ecology. Lawrence Erlbaum Associates.
Cipra B (1992): You can't hear the shape of a drum. Science 255:1642-1643.
Dowling JW, Lung KM, Herrbold S (1987): Aiming attention in pitch and time in the perception of interleaved melodies. Perception & Psychophysics 41(6):642-656.
Diehl RL, Kluender KR (1989a): On the objects of speech perception. Eco.Psych. 1(2):121-144.
Diehl RL, Kluender KR (1989b): Reply to commentators. Eco.Psych. 1(2):195-225.
Diehl RL, Walsh MA, Kluender KR (1991): On the interpretability of speech/nonspeech comparisons: A reply to Fowler. J.Acoust.Soc.Am. 89(6):2905-2909.
Driscol A (1995): Eigenmodes of isospectral drums. World Wide Web document. URL: http://cam.cornell.edu/~driscol/research/drums.html.
Ellis D (1995): Hard problems in computational auditory scene analysis. World Wide Web document. URL: http://sound.media.mit.edu/~dpwe/writing/hard-probs-1995jul09.html.
Fodor JA (1975): The language of thought. Harvard University Press.
Fodor J, Pylyshyn Z (1981): How direct is visual perception? Some reflections on Gibson's 'Ecological Approach'. Cognition 9:139-196.
Fowler CA (1989): Real objects of speech perception: A commentary on Diehl and Kluender. Eco.Psych. 1(2):145-160.
Fowler CA (1990): Sound-producing sources as objects of perception: Rate normalization and nonspeech perception. J.Acoust.Soc.Am. 88(3):1236-1249.
Fowler CA (1991): Auditory perception is not special: We see the world, we feel the world, we hear the world. J.Acoust.Soc.Am. 89(6):2910-2915.
Freed D (1990): Auditory correlates of perceived mallet hardness for a set of recorded percussive sound events. J.Acoust.Soc.Am. 87:311-322.
Gardner MB (1969): Distance estimation of 0° or apparent 0°-oriented speech signals in anechoic space. J.Acoust.Soc.Am. 45:47-53.
Gaver WW (1993a): What in the world do we hear? An Ecological approach to auditory event perception. Eco.Psych. 5(1):1-29.
Gaver WW (1993b): How do we hear the world?: Explorations in ecological acoustics. Eco.Psych. 5(4):285-313.
Gibson JJ (1966): The senses considered as perceptual systems. Houghton Mifflin.
Gibson JJ (1979): The ecological approach to visual perception. Houghton Mifflin.
Gordon C, Webb D, Wolpert S (1992): One cannot hear the shape of a drum. Bull.Am.Math.Soc. 27:134-138.
Green DM, Swets JA (1966): Signal detection theory and psychophysics. Wiley.
Gregory RL (1993): Seeing and thinking. Italian J.Psych. 20:749-769.
Guski R (1992): Acoustic tau: An easy analogue to visual tau? Eco.Psych. 4(3):189-197.
Handel S (1989): Listening: An introduction to the perception of auditory events. MIT Press.
Hatfield G (1990): Gibsonian representations and connectionist symbol processing: Prospects for unification. Psych.Rev. 52:243-252.
Heine WD, Guski R (1991): Listening: The perception of auditory events? An essay review of Listening: an introduction to the perception of auditory events. by Stephen Handel. Eco.Psych. 3(3):263-275.
Heine WD, Guski R (1993): Using auditory information for active contact with sound sources moving rectilinearly with respect to a listener. In Contributions to psychological acoustics: Results of the 6th Oldenburg Symposium on Psychological Acoustics, ed. Schick A. 349-359.
Heine WD, Guski R, Pittenger JB (1993): Perceiving numbers of steel balls by audition. In Contributions to psychological acoustics: Results of the 6th Oldenburg Symposium on Psychological Acoustics, ed. Schick A. 361-371.
Helmholtz H von (1867/1925): Physiological optics. Vol. 3. Optical Society of America.
Helmholtz H von (1877/1954): On the sensations of tone. Dover.
Hochberg J (1994): Perceptual theory and visual cognition. In Cognitive approaches to human perception. ed. Ballesteros S. Lawrence Erlbaum Associates. 269-289.
Höger R (1993): Acoustic texture in distance perception. In Contributions to psychological acoustics: Results of the 6th Oldenburg Symposium on Psychological Acoustics, ed. Schick A. 337-348.
Jackendoff R (1987): Consciousness and the computational mind. MIT Press.
James W (1890/1950): The principles of psychology Vol.2. Dover.
Jenison RL (1994): On acoustic information for auditory motion. Perception. (in press?).
Jenkins JJ (1985): Acoustic information for objects, places and events. In Persistence and change: Proc. 1st Internat. Conf. on Event Perception, eds. Warren W, Shaw R. Lawrence Erlbaum Associates. 115-138.
Johansson G (1985): About visual event perception. In Persistence and change: Proc. 1st Internat. Conf. on Event Perception, eds. Warren W, Shaw R. Lawrence Erlbaum Associates. 29-54.
Kluender KR (1991): Psychoacoustic complementarity and the dynamics of speech perception and production. Perilus XIV:131-136.
Kluender KR, Jenison RL (1992): Effects of glide slope, noise intensity, and noise duration on the extrapolation of FM glides through noise. Perception & Psychophysics 51(3):231-238.
Ladefoged P, Harshmann R, Goldstein L, Rice L (1978): Generating vocal tract shapes from formant frequencies. J.Acoust.Soc.Am. 64:1027-1035.
Lerdahl F & Jackendoff R (1983): A generative theory of tonal music. MIT Press.
Licklider JCR (1959): Three auditory theories. In Psychology: A study of a science, ed S. Koch. McGraw-Hill.
Lombardo TJ (1987): The reciprocity of perceiver and environment: The evolution of James J. Gibson's ecological psychology. Lawrence Erlbaum Associates, Hillsdale NJ.
Lyon RF (1983): Binaural localization and source separation. Proc. ICASSP 83:1148-1151. (reprinted in Richards 1988)
Mace WM (1977): James J. Gibson's strategy for perceiving: Ask not what's inside your head, but what your head's inside of. In Perceiving, acting, and knowing: Towards an ecological psychology. ed Shaw R, Bransford J. Lawrence Erlbaum Associates.
Marr D (1982): Vision. Freeman.
Michaels CF & Carello C (1981): Direct Perception. Prentice-Hall.
Mohrmann K (1939): Lautheitkonstanz im Entfernungswechsel. Z. Psychol. 145:146-199. (cited in Postman & Tolman 1959).
Morse PM, Ingard KU (1968): Theoretical acoustics. Princeton University Press.
Nunn D (1995): Pictures of some research issues. World Wide Web document. URL: http://capella.dur.ac.uk/doug/pictures.html.
Pickles JO (1988): An introduction to the physiology of hearing. Academic Press.
Postman L & Tolman EC (1959): Brunswik's probabilistic functionalism. In Psychology: A study of a science. ed. Koch S. McGraw-Hill. 502-564.
Pylyshyn ZW (1984): Computation and cognition. MIT Press.
Reisberg D (ed) (1992): Auditory imagery. Lawrence Erlbaum Associates.
Repp BH (1987): The sound of two hands clapping: an exploratory study. J.Acoust.Soc.Am. 81(4):1100-1109.
Richards W (ed) (1988): Natural computation. MIT Press.
Rock I (1980): Difficulties with a theory of direct perception. Behavioral and Brain Sciences 3:398-399. (Commentary on Ullman 1980).
Rock I (1983): The logic of perception. MIT Press.
Rosenblum LD (1993): Acoustical information for controlled collisions. In Contributions to psychological acoustics: Results of the 6th Oldenburg Symposium on Psychological Acoustics, ed. Schick A. 303-322.
Rosenblum LD, Carello C, Pastore RE (1987): Relative effectiveness of three stimulus variables for locating a moving sound source. Perception 16:175-186.
Schubert ED (1974): The role of auditory perception in language processing. In Reading, perception and language. eds Duane DD, Rawson MB. York Press, Baltimore.
Searle CJ (1982): Representing acoustic information. Can.J.Psych. 36:402-419. (reprinted in Richards 1988)
Searle JR (1992): The rediscovery of the mind. MIT Press.
Shaw BK, McGowan RS, Turvey MT (1991): An acoustic variable specifying time-to-contact. Eco.Psych. 3(3):253-261.
Shepard RN (1990): Mind Sights. Freeman.
Sloman A (1989): On designing a visual system: Towards a Gibsonian computational model of vision. J. Experimental & Theoretical Artificial Intelligence 1:289-337.
Stellmack MA (1994): The reduction of binaural interference by the temporal nonoverlap of components. J.Acoust.Soc.Am. 96(3):1465-1470.
Strutt JW (1907): On our perception of sound direction. Philosophical Magazine 13:214-232.
Turvey MT, Shaw RE, Reed ES, Mace WM (1981): Ecological laws of perceiving and acting: In reply to Fodor and Pylyshyn (1981). Cognition 9:237-304.
Ullman S (1980): Against direct perception. (with commentaries). Behavioral and Brain Sciences 3:373-415.
Vanderveer NJ (1979): Ecological acoustics: human perception of environmental sounds. Dissertation Abstracts International, 40:4543B. (University Microfilms no. 8004002). (Cited by Ballas and Howard, 1987).
Warren RM (1982): Auditory perception: a new synthesis. Pergamon.
Warren WH & Verbrugge RR (1984): Auditory perception of breaking and bouncing events. J.Exp.Psych.:Human Perception and Performance 10:704-712. (reprinted in Richards 1988).
Wightman FL, Jenison RL (1995): Auditory spatial layout. In Handbook of perception and cognition Vol 5: Perception of space and motion. eds Epstein W, Rogers S. Academic Press. (in press?)
Wildes RP, Richards WA (1988): Recovering material properties from sound. In Natural Computation. ed Richards WA. MIT Press. 356-363.
Yost (1990): Auditory image perception and analysis: The basis for hearing. Hearing Research 56:8-18.