For the European Review of Philosophy 7

Auditory Objects[i]

Mohan Matthen
University of Toronto

What do we directly hear? In section I, I define direct perception, and outline the logical atomist way of attacking the question. I argue in section II that atomism fails. Then, in sections III-V, I propose that a better alternative to atomism is to revive and modernize another traditional empiricist doctrine: that we directly sense what the senses deliver to automatic (i.e., sub-personal) processes of learning. Having discussed the criterial issue, I return to the question.

One obvious answer to our question, on any way of proceeding, is that we directly hear sounds. I argue in sections VI-VII that this obvious answer starts us off in the right direction, provided that it is informed by a proper conception of the nature of sounds. Despite its virtues, though, this notion about the direct objects of audition is overly circumscribed. The doctrine that sounds are the only direct audibles is seriously infected by the atomistic methodology that I question early in the paper. It demands significant supplementation. One of my main aims in this paper is to show that even if everything we directly hear consists of sounds – and I agree with this notion – the latter have no priority from the perspective of audition. That something is an auditory composite does not imply that we hear it indirectly, i.e., in virtue of its component sounds. For example, melodic phrases are composite entities composed of sounds – they are not themselves sounds, I shall argue – but they are often direct objects of audition alongside the notes of which they are composed – they are not heard merely in virtue of hearing the notes that make them up.

The auditory system presents its objects under cross-cutting types. At any given time, a perceiver is conscious of a temporally extended auditory scene in which there are melodies, harmonies, sequences of phonemes, individual voices, meaning-carrying sounds, and so on. The very same sounds may simultaneously belong to more than one of these types: for instance, some of the sounds that constitute a sentence may also be parts of a melody sung in parts by two individual human voices. Often, the perceiver cannot simultaneously attend to all of the above-mentioned elements of a scene. In section VIII, I argue that this is largely because sounds have to be differently grouped to constitute these different kinds of items. In order to hear a composite one has to sense the grouping as well as the constituents, and it is difficult to attend to several cross-cutting groupings together. I conclude, in section IX, by attempting to indicate how aesthetic appreciation depends on this variety.

I. Perceptual Atomism

Empiricists generally subscribe to the "atomistic" doctrine that we sense wholes in virtue of sensing their parts – except in the case of wholes that have no parts. These partless wholes ground all other sensation, and may be called sensory atoms. In the empiricist tradition, direct sensing is understood as follows:

D1. S senses x directly if S senses x, and there is no y (distinct from x) such that S senses x in virtue of sensing y. (cf. Jackson 1977, 19-20)

I take the 'in virtue of' in D1 to imply something along the following lines:

D1 (Codicil). S senses x in virtue of sensing Σ (where Σ is a set) if S constructs her x-sensum from her sensa of members of Σ.

In the Codicil, "construction" can be taken in a fairly non-demanding sense. Consider the figure sketched below.

[Figure: two touching circles, which can also be read as a figure-of-eight or as a stylized infinity symbol.]
This figure can be seen in a variety of ways: as (a) two touching circles, as (b) a figure-of-eight, and as (c) a stylized version of the symbol for infinity. Should one say that one sees (b) and (c) in virtue of (a)? In order to say this, it is not sufficient to show that the figure-of-eight and infinity symbols decompose into two circles, but not vice versa. What one must show, by D1 (Codicil), is first and foremost that the sensa of the two circles come first, and further that the perceiver (S in the definition) constructs sensa of (b) and (c) from these. Thus, what needs to be shown is that the visual system on its own delivers (a), but not (b) and (c) – the perceiver's mental activity is required for (b) and (c).

Empiricists traditionally assumed (see Lewis 1966) that when one senses a whole that has sensible parts, one senses it in virtue of sensing its parts. This implies:

A1. If x is an object that has parts that can be discriminated from one another by sense modality M, then S senses x through M in virtue of sensing those parts of x that themselves have no parts that can be discriminated from one another by M (and which are available to M at the time[ii]).

Thus, empiricists assumed that the only things we directly sense are minimal sensa. They assume that, in the figure sketched above, the composite objects are sensed in virtue of the circles, and the circles themselves in virtue of the minimal arcs that compose them. The latter are the only things we directly sense.

We do not just sense objects; we also apprehend their sensory qualities or features. Atomist assumptions prevail here as well. Minimal visual objects (in the sense of A1) have only colour and brightness, empiricists assume. If we see visually extended objects in virtue of apprehending their minimal parts, it must follow that the additional visual qualities attributed to these extended objects – their shape, texture, size, and so on – are constructed from the colour and brightness of their minimal parts. Similarly, minimal sounds have only pitch and loudness. It would follow by atomism that the additional auditory qualities of non-minimal auditory objects – the contour of a melody, the chord produced by two strings plucked together, the timbre of a voice, etc. – are constructed from the pitch and loudness of the minimal auditory parts of these objects. Thus we have:

A2. If F is a feature of an extended object O, then S senses F in virtue of sensing qualities that belong to the minimal parts of O.

Together A1 and A2 constitute a method for identifying the things we directly see and hear as per D1. Let us call this method perceptual atomism. My concern in the first two sections of this paper is with this doctrine interpreted in the light of D1 (Codicil).

Now, one kind of argument against perceptual atomism is that it is often harder to register the qualities of the parts of a thing than to register the thing itself. For as anybody who has tried to draw from life or to compose a photograph will readily attest, it takes more effort to attend to or describe the two-dimensional fuzzy-edged outline that a cat projects to the retina than it does to attend to the features of the cat itself. This has often been taken to suggest that we see the cat's fuzzy outline in virtue of seeing the cat, not the other way around. Roderick Firth (1949) tried to mount such an argument against the atomist order of priority.
(He saw himself as following the lead of phenomenologists such as Edmund Husserl and Gestalt psychologists such as Wolfgang Köhler.) His strategy was to posit a post-sensory operation known as perceptual reduction, which is supposed to account for our awareness of things like the two-dimensional projections of three-dimensional objects:

The operation of perceptual reduction . . . make(s) the ostensible physical object progressively less and less determinate. If I were to perform the operation while looking at a tomato, for example, the ostensible tomato which is present to consciousness would, so to speak, become less specifically distinguished as an individual. Starting as a tomato with worm holes it might be reduced to a tomato with "some sort of holes" in it, and then to a tomato with spots on its surface, and so on. It might eventually become "some sort of globular object", or even just "some sort of physical thing". But when this last stage is reached, or perhaps even before, there is a second effect: a radical change takes place and a new object of consciousness appears and grows more and more determinate . . . this new object is . . . not an ostensible physical object at all. . . (I)t is not until this second stage in the process has begun that we are able to describe what we "really see" and to report, for example, that we are presented with "a red patch of a round and somewhat bulgy shape". (ibid., 460)

Firth ascribes the difficulty of seeing two-dimensional outlines to perceptual priority: the apprehension of a three-dimensional object comes first, then three-dimensional properties are gradually reduced to abstractions compatible with two-dimensionality – worm holes become spots, the tomato itself becomes a somewhat bulgy shape. Perceptual reduction brings it about that a quite different thing – a surface or a shape – begins to appear. His claim is that surfaces and projections are seen only by means of the mental act of perceptual reduction, and that this act can only be performed when one already has before one's mind the percept of the three-dimensional object. This throws doubt on perceptual atomism by D1, because it seems to imply, by D1 (Codicil), that we see a "coloured shape" in virtue of seeing the three-dimensional object that projects it, not vice versa.

Firth's argument is not conclusive, however. He seems to be thinking of cases where one has to report on, or in some other way be explicitly conscious of, certain aspects of what one senses. This is where his argument misses its mark. The atomist may not be worried about the kind of awareness that underlies verbal report; he might be trying to get at the latent content of a visual or auditory state. Thus, the atomist might well concede that something like Firth's perceptual reduction is necessary in order to attend to the parts of a whole, but he might still think that sensing the whole demands sensing the parts as a prior condition. This is, in fact, the position that Firth himself attributes to H. H. Price (to whom the quoted words at the end of the above passage allude). What Firth misses is the possibility that perceptual reduction might just bring to a perceiver's attention what she has already sensed. He does not show, or even try to show, that the red-patch sensum was created by perceptual reduction.

II. Gestalts Over Parts

The Gestalt psychologists have a more direct argument against atomism. They show that our sensory apprehension of a whole can often influence how we apprehend its parts.
Consider the display known as the Kanizsa triangle (see Kanizsa 1976):

[Figure: the Kanizsa triangle display (left), in which an illusory white triangle appears in the foreground, and a comparison display (right) in which no such triangle appears.]

In the left figure, we appear to see a white triangular object in the foreground, partially obscuring three objects in the background – three black circles and the outline of a triangle, each partially occluded. (Note the contrast with the figure on the right, which appears as a simple two-dimensional pattern, and hence with no occlusion.) In the present context, the interesting thing about the foreground triangle on the left is that it appears brighter than the background. This is not the case on the right. Perceptual atomism demands that we see the extended bright white of the foreground triangle in virtue of seeing its bright parts and that the brightness of the triangle is constructed from the brightness of its parts. If this were true, awareness of the brightness of the central point of the display would precede that of the foreground triangle. But this seems false. The foreground triangle is, of course, exactly as bright as the background, and if one were to isolate it by covering up other relevant parts of the display, one would see this quite easily. (On the right, the same kinds of parts coexist with a different appearance of the whole.)

The reason why the foreground triangle looks brighter in the left-hand Kanizsa display is that the visual system (wrongly) infers the presence of a triangular occluding object in front of the occluded objects. For if the figure were modified in such a way as to make the foreground triangle disappear – for instance, by filling in the wedges of the black figures or by rotating these figures in such a way that the wedges don't line up – then the very same parts of the above display that appear brighter would cease to appear so. Prevent the visual system from inferring the presence of an occluding triangle, or prevent it from inferring that it is in the foreground (as in the right-hand figure), and the parts will be seen differently. When the foreground triangle is inferred, the visual system actively enhances its brightness. The look of brightness is, in other words, inserted by the visual system itself, in order to mark the object it infers (and perhaps to mark that it is closer to the light). Let's call this qualia-insertion. It implies that, at least as far as the visual system is concerned, seeing the whole is a prior condition of seeing the parts. It does not deliver sensa of the parts without a sensum of the whole – on the contrary, the sensa of the parts depend on that of the whole – and so the perceiver does not become aware of the whole by assembling the parts.
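The dependence of the parts' appearance on the inferred whole can be made vivid with a small sketch (Python with matplotlib; the sizes and positions are arbitrary choices made only for this illustration, and the resulting figure is only an approximation of Kanizsa's display). Three notched discs whose notches line up induce the illusory bright triangle; the very same discs with their notches rotated do not.

```python
# Illustrative sketch only: a Kanizsa-style display (left) and a control in
# which the notches are rotated out of alignment (right). All layout
# parameters are arbitrary choices made for this illustration.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Wedge

def draw_display(ax, aligned=True):
    vertices = np.array([[0.0, 1.0], [-0.87, -0.5], [0.87, -0.5]])  # equilateral triangle
    centre = vertices.mean(axis=0)
    for v in vertices:
        dx, dy = centre - v
        facing = np.degrees(np.arctan2(dy, dx))        # direction of the triangle's interior
        notch = facing if aligned else facing + 90.0   # rotate the notch away in the control
        # A disc with a 60-degree notch: draw the remaining 300-degree wedge.
        ax.add_patch(Wedge(v, 0.45, notch + 30.0, notch + 330.0, color="black"))
    ax.set_xlim(-1.6, 1.6); ax.set_ylim(-1.3, 1.7)
    ax.set_aspect("equal"); ax.axis("off")

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 4))
draw_display(left, aligned=True)    # an illusory bright triangle appears in the foreground
draw_display(right, aligned=False)  # the very same parts, but no illusory triangle
plt.show()
```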
Similar whole-over-part phenomena are found in the auditory realm. Consider a syllable like /da/. Phenomenologically, it appears that this syllable consists of two phones in temporal order, the consonant /d/ followed by the vowel /a/. Further, when we hear /di/ or /du/, we seem to hear the same initial sound – the consonant /d/ – followed by a vowel. Thus, speech perception appears compositional: we seem to hear distinct sounds strung together, each corresponding to individual consonants or vowels. This is the thought that Aristotle seems to be expressing when he asserts that "written marks are symbols of spoken sounds" – the written 'd' is a symbol of a distinct spoken "sound" /d/ that precedes the /a/ in 'da', he seems to imply. (See, however, note 4 below.) This phenomenology would suggest that we hear /da/ in virtue of hearing /d/ and /a/ in that order.

Investigation of the actual pattern of wave forms in speech reveals, however, that this part-over-whole model is not correct. There is, as it turns out, no auditory wave pattern common to the different syllables that start with /d/ – that is, the acoustic patterns corresponding to /da/, /do/, /du/, etc. share no initial segment. The analysis of these acoustic patterns reveals that the first overtone (the "second formant") of each consists in an upward or downward glide from a fixed frequency of roughly 1800 hertz, smoothly rising or falling to the frequency of the vowel part – these glides are different in slope, and so they cannot be identified with a common element in the syllables that start with /d/. Moreover, the same vowel combined with different consonants will sound at different frequencies. Thus, as Alvin Liberman and his colleagues (1967) write:

The speech signal typically does not contain segments corresponding to the discrete and commutable phonemes . . . We cannot cut either the /di/ or the /du/ pattern in such a way as to obtain some piece that will produce /d/ alone. If we cut progressively from the right-hand end, we hear /d/ plus a vowel, or a non-speech sound; at no point will we hear only a /d/. (436)

Liberman et al. say, illuminatingly, that the acoustic form of speech is not a cipher, in which there are discrete parts, each of which stands for some part of what is enciphered. Rather, speech is a code in which a temporally extended sound pattern can stand as a whole for a sequence of phones, without parts of that sound pattern standing for the individual constituents of the encoded sequence. The phenomenology of /da/ and /du/ suggests compositionality or encipherment, but in fact they do not share any common acoustic element that corresponds to /d/. These syllables are encoded in the acoustic streams that we produce; they are not enciphered therein.

In order to understand this phenomenon, it is necessary to understand how /da/ is produced. A speaker produces /da/ by first creating a resonant frequency of 1800 hertz in the articulatory tract. This is done by closing the tract off with the tongue pressed against the palate. Because the tract is closed, this first articulatory "gesture" is actually silent. Having performed this first gesture, the speaker then opens her mouth in the /a/ shape, and releases the acoustic energy stored in the articulatory tract. This produces a glide to the vowel frequency and then a steady sound at that frequency. Thus, /da/ is produced in three phases: (i) a silent resonance, (ii) a glide, and (iii) a steady pitch. It is the starting point of the glide, embedded in (i), the silent part, that corresponds to /d/ – this is what /da/ and /du/ and the rest of the initial /d/ syllables share. (Correspondingly, syllables that end with /d/ are produced by a sequence of articulatory gestures that ends with such a silent resonance.)

But (i) is silent and forms no part of the acoustic signal. We have to conclude, therefore, that the speech perception system infers (i) from (ii) and (iii), and inserts a separate /d/ quale – an auditory experience that is the same in different /d/ syllables despite their auditory variety – into sensory consciousness to mark the silent gesture it infers – just as in the Kanizsa triangle, the visual system inserts qualia of increased brightness to mark the triangle it infers. This is why we hear a /d/ followed by /a/, though in the acoustic signal itself there are no such separate components.
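The point can be made concrete with a minimal sketch. The vowel formant values and timings below are rough, assumed figures, not measurements; the sketch merely generates schematic second-formant contours for /da/ and /du/ and shows that, apart from the silent starting point of the glide, they share no common initial segment.

```python
# Illustrative sketch only: schematic second-formant (F2) trajectories for
# /d/ + vowel syllables -- a glide from a fixed "locus" near 1800 Hz to the
# vowel's steady F2. Vowel values and durations are assumed for illustration.
import numpy as np

LOCUS_HZ = 1800.0                                    # starting point of the /d/ glide
VOWEL_F2 = {"a": 1200.0, "i": 2500.0, "u": 900.0}    # assumed steady-state F2 values
GLIDE_MS, STEADY_MS = 50, 200

def f2_trajectory(vowel, step_ms=5):
    """F2 contour of a /d/ + vowel syllable: a linear glide, then a steady vowel."""
    glide = np.linspace(LOCUS_HZ, VOWEL_F2[vowel], GLIDE_MS // step_ms)
    steady = np.full(STEADY_MS // step_ms, VOWEL_F2[vowel])
    return np.concatenate([glide, steady])

da, du = f2_trajectory("a"), f2_trajectory("u")
# The two syllables agree only at the very first sample (the silent locus);
# every subsequent portion of the signal differs, so no acoustic segment
# common to /da/ and /du/ can be cut out and identified with /d/.
print(np.isclose(da[:10], du[:10]))   # True only at index 0
```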
As I said earlier, perceptual atomism would indicate that we hear /da/ in virtue of hearing /d/ (and /a/). But this seems wrong. In fact, the reverse seems to be the case: that is, the speech perception system has to decode the entire syllable from the information available in the acoustic signal before it can insert the auditory quale corresponding to the consonant with which the syllable begins or ends.

One more example. Albert Bregman (1990, 27-29) describes an alternately falling and rising pure tonal glide – in effect, something like a melody, but a particularly predictable (and boring) one. Call this "melody" glides. Now, modify glides in two successive ways. First, snip some portions out. When this is done, one hears, as one might expect, a broken series of short tonal glides – not the entirety of what we have called glides, but a series of disconnected fragments thereof. Call this bursts. Now, fill in bursts by inserting broadband noise into the gaps. The result is a continuous acoustic signal that consists of fragments of glides separated by noise. Call this patches.

This third pattern is not heard as bursts with noise in the gaps – though this, of course, is what it is. Rather, the listener hears (or rather appears to hear) the original glides, with noise superimposed over but only partially masking the parts that were snipped out. In other words, the snipped-out portions of glides reappear with the second modification, though partially obscured by noise. The snipped-out portions are phantoms! They are heard, but they are not there. They are exactly the same in this respect as consonants, and the "subjective contours" in the Kanizsa display: they are inserted by the sensory system; all of these are phantoms.

The atomist's part-over-whole principle would lead him to say that in the case of patches, one hears the whole of glides in virtue of hearing the parts, including the phantom parts. But the phantoms are actually hallucinations inserted by the auditory system. Since they are inserted in virtue of reconstructing the whole, it is wrong to think that the sensum of glides (when patches is played) is simply constructed out of one's sensa of the bits. One would not hear these parts but for the appearance of the whole. It seems, thus, more reasonable to hold that one hears the snipped-out portions directly, delivered by the auditory system only after it has reconstructed the whole.
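For concreteness, here is a minimal sketch of how the three stimuli just described can be constructed. The frequencies, durations, and noise level are arbitrary illustrative choices; Bregman's own stimuli differ in detail.

```python
# Illustrative sketch only: schematic versions of "glides" (an alternately
# rising and falling tonal glide), "bursts" (glides with portions snipped out),
# and "patches" (the snipped gaps filled with broadband noise).
import numpy as np

RATE = 16000                              # samples per second
t = np.arange(0, 4.0, 1.0 / RATE)

# glides: a pure tone whose frequency sweeps up and down between 400 and 1200 Hz.
freq = 800.0 + 400.0 * np.sin(2 * np.pi * 0.5 * t)
phase = 2 * np.pi * np.cumsum(freq) / RATE
glides = np.sin(phase)

# bursts: snip out 200 ms of every 600 ms, leaving silent gaps.
gap = (t % 0.6) > 0.4
bursts = np.where(gap, 0.0, glides)

# patches: fill the very same gaps with loud broadband noise instead of silence.
rng = np.random.default_rng(0)
noise = rng.normal(scale=1.5, size=t.shape)
patches = np.where(gap, noise, glides)

# Played back, bursts is heard as disconnected fragments, whereas patches is
# heard as the original continuous glide with noise superimposed -- the
# "snipped" portions are supplied by the auditory system, not by the signal.
```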
In the examples reviewed in this section, the phenomenology is that of parts and wholes. In each case, we sense a spatially or temporally extended whole as well as its parts. Reflecting on this phenomenology, some philosophers have supposed that the wholes are seen in virtue of their parts. This is wrong. The mistake is that the part-whole phenomenology is taken as indicating something about the processes that create sensa. Phenomenology cannot deliver this kind of knowledge. Even if it is authoritative about what sensa are present to consciousness, it does not even pretend to reveal whence they came.

III. Sensory States: A Modular Approach

Atomism fails as a way of demarcating what we sense directly. But there is another thread in traditional empiricism that yields a more promising line of inquiry. The beginnings of this idea can be discerned in a passage from Berkeley's First Dialogue Between Hylas and Philonous:

We may, in one acceptation, be said to perceive sensible things mediately by sense – that is, when, from a frequently perceived connection, the immediate perception of ideas by one sense suggests to the mind others, perhaps belonging to another sense, which are wont to be connected with them. For instance, when I hear a coach drive along the streets, immediately I perceive only the sound; but from experience I have had that such a sound is connected with a coach, I am said to hear the coach. It is nevertheless evident that, in truth and strictness, nothing can be heard but sound, and the coach is not then properly perceived by sense, but suggested from experience.

Berkeley's way of drawing the distinction between what is immediately, or properly, heard and what is only mediately heard appeals to a distinction between what is delivered by the senses, and what is "suggested by experience" as a result of a "frequently perceived connection". (In what follows, I shall stick to the terminology of 'direct' and 'indirect' and will not use Berkeley's term, 'immediate'.)

Berkeley himself was unable to make much progress with this important idea; as Gary Hatfield (1990, 42) says, he provides "precious little direct analysis of 'suggestion'." But he does (as Hatfield points out) distinguish "suggestion" from "judgements and inferences [made] by the understanding." For a later author like Hermann von Helmholtz, this was important. By Helmholtz's time, the association of ideas – a more contemporary term for what Berkeley calls "suggestion" – had become, by contrast with judgement, an automatic mechanism beyond the voluntary control of the perceiver. In Berkeley, this was at best implicit in the distinction between suggestion and judgement. (Note that "suggestion" is cross-modal in Berkeley's usage: the sound of the carriage suggests a certain visual and tactual object – here I am indebted to Nicolas Bullot.)

Now, in Berkeley's view, the association of ideas acts in a manner akin to what I called "qualia-insertion" above. For instance, it accounts, according to him, for the Moon appearing larger when it is low in the sky than when it is in the zenith, though it projects the same-sized retinal image in both positions. Berkeley held that this Moon Illusion arose out of associations of ideas established by past experience. Yet, like sensation, it is involuntary and experience-modifying – the Moon really looks larger when it is low in the sky, and there is nothing one can do to change how it looks. In these ways, it operates differently from voluntary "judgements" – judging that something is of a certain size does not make it look that way. Similarly, for Helmholtz, things look the same colour in diverse conditions of illumination. According to him, this is because perceivers come, with experience, to be able to "discount the illuminant". This may or may not lead these perceivers to judge that things are constant in colour, but it does result in things looking the same colour in different conditions of illumination.

Today, most sensory psychologists treat the senses as active information-processing systems that are innately equipped to make inferences about the state of the external world, even in the absence of suggestion or acquired associations of ideas.
Their attitude is that the sensory systems comprise not just receptors, but also data-processing pathways that extract information about the external world from receptoral activation states. (See Matthen 2005 passim, but especially Part I, for a detailed treatment of this point.) As Peterson (2001, 175) succinctly says, Gestalt "grouping processes [such as the one involved in the Kanizsa display] are visual processes." Similarly, the inferential processes at work in phonetic perception and in patches simply are auditory processes. Contrary to Berkeley and Helmholtz and others of their empiricist persuasion, not all sensory awareness of external things should be attributed to post-sensory associations of ideas. (In a moment, I shall argue, however, that Berkeley's insights were nonetheless extremely valuable.) From the state of receptoral arrays, automatic and innate processes extract information (cf. Pylyshyn 1999) about constant colour, three-dimensional shape, objective motion, phonemes, melodies, and so on.

This view of sensory processing renders part-whole approaches largely irrelevant. Atomism loses its appeal, for there is very little reason to believe that the content delivered by sensory modules concerns only minimal parts. Indeed, the holistic phenomena discussed in the last section demonstrate that they deal with extended things. But nor is there good reason for supposing that the whole has priority over the parts, as Gestalt psychology maintains. In the Kanizsa display, both the foreground triangle and the bright qualia that constitute it are delivered by the visual module. Neither has priority over the other in the sense that it is the sensory material from which the other is constructed. Similarly, speech perception delivers /da/ and its constituent parts in a single act; melodic perception provides both the continuity of the melody in patches, and the (false) awareness of the phantom notes. These wholes may have priority within the sensory process, but not on the terms proposed by D1 (Codicil). We are not aware of /d/ independently of /da/.

The best-attested current views have it, then, that sensory systems include data-processors as well as receptive organs. In effect, they analyse the data received by the sensory receptors and infer the presence of external objects and objective features that belong to them. Sensory awareness is the record of this activity. Perceivers are more or less passive with respect to sensory awareness – they do not control its character. The mistake that perceptual atomists made was to assume that perceivers voluntarily assemble their awareness of temporally and spatially extended wholes from parts. We cannot follow them in this assumption. We need another way of identifying the deliverances of the senses.

IV. Sensory States and Epistemic Operations

Though he was wrong about the role of associations of ideas, Berkeley's insight is nonetheless very important. There is a level at which sensation is simply an event in consciousness to which the perceiver's history does not contribute. At this level, sensation is an internal event that provides the perceiver with information about an external event that has just occurred. For example, I might hear a loud noise, and thus come to know that something has fallen off the counter in the next room. Or I see a blue thing, and come to know that such a thing is in front of me. Let's call this the event-tracking function of sensation.
What the empiricists noticed was that sensation has another function as well. Through automatic post-sensory processes, it contributes to an organism's representation of the world not just as it is at the moment of sensing, but as it is in a more extensive time-frame. The formation of associations of ideas is an example of this. I put a fruit in my mouth and find it bitter; automatically an expectation forms within me that fruits of that kind are bitter. Here, of course, sensation is contributing to my knowledge of a general truth that is more or less permanently true. But it can also contribute to knowledge of lasting but impermanent conditions of the world. For instance, if I observe somebody putting something into a box, then automatically a memory forms in me of where that object is or, less specifically, of there being something in the box. (Such memories form even in very young infants, who display surprise when, because of an experimenter's trick, the box is found to be empty.) I shall call this the record-keeping function of sensation.

What Berkeley draws our attention to in the passage quoted above is that some of this record-keeping is automatic, just as the event-tracking function is. The question remains: how should we demarcate the direct objects of sensation? The connection between sensation and automatic associations of ideas helps us here. Berkeley gives no independent characterization of the deliverances of sense; rather, they are defined by contrast with "suggestion", and recognized (though Berkeley does not explicitly say so) by their phenomenal character. In effect, Berkeley treats them as inputs to "suggestion". Could one not demarcate the contents of sensory awareness by this aspect of their role? This is what I shall now attempt to do.

Consider Pavlovian, or classical, conditioning. Here, a naturally motivational event – placing food in the mouth, in Pavlov's classic experiment with dogs – is repeatedly presented slightly after an event that is motivationally neutral with regard to ingestion, namely a tone. As a result, the motivationally neutral event, or conditioned stimulus, begins to elicit the same response – lubricating salivation and other digestive preparations for the ingestion of food – as the naturally motivational event, or unconditioned stimulus. Pavlov (1904/1965) recognized that this association had psychological significance:

When an object from a distance attracting the attention of the dog produces a flow of saliva, one has ground for assuming that this is a psychical and not a physiological phenomenon. When, however, after the dog has eaten something or has had something forced into his mouth, saliva flows, it is necessary to prove that in this phenomenon, there is actually present a physiological cause, and not only a purely psychical one which, owing to the special conditions, is perhaps reinforced . . . (565-566)

Pavlov recounts how by cutting the sensory nerves of the tongue, and by "more radical measures, such as poisoning the animal or extirpation of the higher parts of the central nervous system," one can "convince oneself that between a substance stimulating the oral cavity and the salivary glands there exists not only a psychical but a purely physiological connection." What he means is that placing food in the mouth will stimulate salivation even when the animal's "psychical" faculties are "extirpated" – thus one has reason to conclude that this connection is not routed through sensation or cognition.
By contrast, the conditioned stimulus acts by an essentially psychical connection; sensory input and brain functioning are necessary to establish the connection between the tone and salivation.

In a closely related paradigm, operant conditioning (discovered by Edward Thorndike, 1898) is used to probe the perceptual discrimination abilities of animals. For instance, a honeybee or moth might be presented with blue dishes filled with pure water, and yellow dishes with sugar-water. Once they have had a chance to sample the contents of each type of dish, it is found that they learn preferentially to sample the yellow dishes to find the sugar-water (which they happen to prefer). Here, a "psychic" connection is established between an initially unmotivated impulse to feed from yellow dishes, and a reward, the sugar-water. Generally, experiments of this type are used to show that the subject animals possess certain abilities of sensory discrimination, in this case colour vision. Similar experiments can be, and are, conducted in auditory contexts in order to map out the auditory similarity space of various animals. The idea is that sensory discrimination is required for operant conditioning.

The exact character of these processes is somewhat contested, though increasingly a representational view has become standard (Gallistel 1990 passim, but see especially chapters 12-13; see also Mackintosh 1994b and Hall 1994). In such a representational view, we have three components. First, there are certain perceptual processes: the animal's perception of coloured dishes, and of being rewarded when feeding from them. Second, there is an innate process of record-keeping into which this perception feeds: as Gallistel puts it, "something form[s] inside the organism isomorphic to the dependencies to which the animal's behaviour becomes adapted" (1990, 385). Gallistel's idea – which parallels Pavlov's own (see above) – is that there is an objective "dependency" between yellow dishes and sugar-water, or between the tone and the presentation of food. When an organism experiences instances of this dependency, something gradually forms within it that parallels the dependency – a "psychical" representation of the dependency, to use Pavlov's adjective. To this, we may add: if the dependency concerns an individual object, such as a place, a natural object, or another organism – if, for instance, the animal forms an expectation that food is to be found in a particular location or in a particular receptacle – then the psychical representation will denote that individual. Third, there is a learned response: the expectation that the conditioned stimulus will precede food, or (more accurately) the impulse to search in yellow dishes for sugar-water.

In view of the role that the representation plays (in the second step above) in mediating between contingent dependency and appropriate impulse, conditioning can be regarded as an automatic representation-forming or record-keeping process – an epistemic operation, as I have called it elsewhere (cf. Matthen 2005, chapter 9). (When I speak of an epistemic operation, I mean in the present context to be restricting myself to such automatic processes, though I will usually add the qualifier for clarity.)

On the modular view, sensory content is the output of subpersonal sensory processors. The replacement proposal that I would like to make is that sensory states can equally well be demarcated by reference to epistemic operations of the sort outlined in the preceding paragraph.
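It may help, before spelling this proposal out, to have a toy illustration of the kind of automatic record-keeping operation at issue. The sketch below uses the well-known Rescorla–Wagner updating rule; it is introduced purely for illustration of "something forming inside the organism" that tracks a dependency, and is not offered as Pavlov's or Gallistel's own formulation.

```python
# Illustrative sketch only: a toy "record-keeping" process. The standard
# Rescorla-Wagner rule is used here merely as one familiar way of modelling
# how an internal record comes to track a dependency (e.g. yellow dish ->
# sugar-water); it is not a claim about the actual mechanism.
def update_association(v, present, reward, rate=0.1):
    """One trial: nudge the stored association v toward the observed outcome."""
    if not present:                  # the cue (tone, yellow dish) did not occur
        return v
    return v + rate * (reward - v)   # prediction error drives the update

v = 0.0                              # initial strength of the internal record
for trial in range(50):
    v = update_association(v, present=True, reward=1.0)

print(round(v, 3))   # close to 1.0: the record now "expects" the reward
```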
Sensation is to be regarded as the input to core automatic epistemic operations. For the sake of definiteness, I propose that it be regarded as the unlearned input to classical or operant conditioning. This yields the following pair of definitions:

D2. Sensory states are those that (a) provide unlearned input to classical or operant conditioning, and (b) are potentially or actually conscious.

(Clause (b) is optional: it is added in order to accommodate the intuition, if it is held, that completely non-conscious inputs to conditioning and other automatic epistemic operations should be excluded, since these may not properly be regarded as sensory.)

Having defined sensory states in this way, we may add:

D3. Objects and features are directly sensed if they figure in the representational (or record-keeping) content of sensory states (as these states are defined by D2).

Notice how D2 and D3 complement D1, the definition of the direct objects of sense. We find that (as remarked earlier) both the parts and wholes of certain sensory entities may be directly sensed. In the case of phonemic perception, the auditory system analyses the shape of the acoustic signal incident on the basilar membrane, diagnoses the articulatory gestures made to produce this signal, and produces qualia to mark these gestures. On the assumption that the sensory system picks the whole as well as the parts out of the ambient flux of acoustic energy, both are directly sensed, by D2 and D3. It follows from D1 that neither is sensed in virtue of the other.

Much the same holds for melody. If being conditioned by a melody, or learning it, or recognizing it, is a "holistic" process – not merely the sum of learning processes targeted on the constituent notes – then there is a sensory state involving melodies, by D2. But then melodies are directly sensed, by D3. And the same goes for individual notes: if there is direct conditioning on notes, they too are directly heard. Some find it intuitive to say that we hear the melody in virtue of hearing the notes. But this goes against the evidence. In the first place, Bregman's patches example indicates that at least sometimes, the notes are created by the auditory system to complete an inferred melody. In this case, the notes are clearly supplied by the sensory system simultaneously with the whole – just as in the Kanizsa triangle. But this makes it plausible to think that when one hears an unobscured melody, one's auditory system comprehends the whole in some fashion.

V. Indications of Direct Sensing

Let us return now to the question of what things are directly heard. In the light of D2 and D3, Berkeley's assertion that "nothing can be heard but sound" appears to be unsupported. Berkeley's line of thinking about "suggestion" runs broadly in parallel with the treatment of innate and automatic epistemic operations in the preceding section. That is, he thinks of "suggestion" as an automatic extra-sensory faculty, much as I have been presenting conditioning. Further, it is at least consistent with his text to suggest that the direct objects of sense are those that initiate the processes he calls "suggestion." But he gives us no reason to suppose that sounds are the only objects provided by audition to "suggestion". As we noted earlier, the question cannot be settled a priori by reference to a doctrine such as atomism.
So the question arises: On what evidence does Berkeley think that we come to know that a coach is present because it is "suggested from experience"? How can he be sure that coaches are not direct objects of auditory experience? It is my view that he has no evidence, and is not entitled to any confidence on this point.

What sort of evidence would support the proposition that this or that kind of object is or is not directly sensed? Automaticity hypotheses are notoriously insecure, and it is even more difficult to identify the inputs to hypothesized automatic learning mechanisms. However, there are certain experimental and phenomenological marks which indicate the automaticity of a sensory process. Here is an incomplete list of such indications, some of which have figured in our argument up to this point.

Separate Conditionability

As mentioned above, operant conditioning has traditionally been used to probe the sensory discrimination abilities of animals. In Berkeley's example, one would want to test how reactions to the sounds of a coach are reinforced, persist, and are extinguished. If it is the coach itself that is the target of conditioning, then the conditioned reaction should transfer to other auditory coach-related stimuli. In other words, if coaches were inputs to conditioning, then we would expect that there would be "constancy phenomena" related to objects of this sort – equivalences among the different auditory stimuli that emanate from the same object, such that a response to one transfers to equivalents.

Now, audition seems, by this criterion, to be concerned not so much with objects as with their activity. The rumbling of a coach does not characterize the coach itself, because the coach does not rumble when it is stabled and at rest. Rather, the rumbling of the coach characterizes its activity when it is moving over a road. With regard to this activity, constancies do apply – the rumbling will trigger conditioning, and will be immediately recognizable, even when heard over other sounds or from a distance, etc. This indicates that audition does not track objects as such, but tracks rather their activities and conditions.

Constancy

Continuing with the line of thought concerning separate conditionability, many complex objects are instinctively recognized as the same from different points of view. In vision, three-dimensional objects can directly be recognized through rotation (provided that the initial view is sufficiently good), and animals can, for instance, recognize a receptacle within which their food is stored, even when it is viewed from different attitudes or perspectives. This implies that different stimulus arrays are instinctively recognized as emanating from the same object – despite the different outline an object projects when it is rotated, it is still recognizable. Since these objects are instinctively reidentified, the epistemic operations will be indifferent to substitutions of equivalent sensory arrays. Here the nature of the equivalence marks the kind of object that the sense modality targets. Vision targets shape, I have been suggesting, and shape is a property of material objects. So one might conclude that vision targets material objects. To figure out what kind of object audition targets, one must reflect upon the nature of auditory equivalence.

Turning, then, to audition, melodies are analogues of three-dimensional objects in vision. When a child repeats a melody you sing to him, he repeats it in a higher key. Yet it is the same melody. Thus, as Daniel J. Levitin (2006) says:

A melody is an auditory object that maintains its identity in spite of transformations, just as a chair maintains its identity when you move it to the other side of the room, turn it upside down, or paint it red. (25)

Not only does the melody retain its identity, but we recognize it despite transformations. In fact the lay person does not even notice when somebody sings a melody in a different key. These are reasons to think that sufficiently brief melodic fragments – "phrases", let us call them – are directly perceived. (I don't think that a whole tune or theme is directly perceived; it is probably perceived in virtue of constituent phrases.)
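What survives transposition can be made concrete with a minimal sketch; the tune and its numerical encoding below are merely illustrative assumptions. The individual notes change when the melody is sung in a higher key, but the sequence of intervals between successive notes does not.

```python
# Illustrative sketch only: a melody, coded as MIDI note numbers, keeps its
# interval structure under transposition. The tune (the opening of "Twinkle,
# Twinkle") and the five-semitone transposition are illustrative choices.
def intervals(notes):
    """Successive pitch steps, in semitones -- one simple transposition-invariant."""
    return [b - a for a, b in zip(notes, notes[1:])]

melody = [60, 60, 67, 67, 69, 69, 67]        # sung in one key
transposed = [n + 5 for n in melody]         # the child sings it a fourth higher

print(intervals(melody) == intervals(transposed))   # True: same melody
print(melody == transposed)                         # False: different notes
```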
Along the same lines, think of timbre constancy. Albert Bregman says:

A friend's voice has the same perceived timbre in a quiet room as at a cocktail party. Yet at the party, the set of frequency components arising from that voice is mixed at the listener's ear with frequency components from other sources. The total spectrum of energy that reaches the ear may be quite different in different environments. To recognize the unique timbre of the voice we have to isolate the frequency components that are responsible for it from others that are present at the same time. (1990, 2)

Both melody and timbre feed into learning processes. That they do so is, as the famous image "His Master's Voice" illustrates for timbre, independent of the notes that (in the case of melody, sequentially, and in the case of timbre, simultaneously) constitute them. A voice heard over an old phonograph is considerably distorted. Yet the dog recognizes it. This indicates that it is heard directly, and not in virtue of its parts.

Note, however, that constancy of melody or timbre is not necessarily constancy regarding its bearer. The same singer can sing a different melody; she can sing at different pitches and variable volume, and the timbre of her voice will change accordingly. The target of audition is not, then, the singer – it is the song. (Do we, however, hear voices directly? Certainly, people are recognizable in this way, and this may be a conspecific recognition mechanism, much as face recognition is. So audition might sometimes be targeted on individual people.)

I must here enter a caveat hinted at in the introduction. I have been suggesting that melodies, for instance, are objects for the auditory system on the grounds that the system processes them for conditioning etc. even when they are transformed. I am thus identifying objects by the activity of the auditory system – in effect, I am giving an account of what the auditory system treats as an object, rather than an account of objecthood independently of the system. Thus, I intend no claim about what auditory objects there really are, and I am not suggesting that the auditory system's object-construction activity is explained by the function of tracking naturally unified objects. Of course, it is possible to make some observations along these lines. Earlier, I suggested that audition tracks the activities of material objects. This spawns some norms. For it follows that it would be an error, or an illusion, if, listening to a stereophonic recording of a train, one seemed to hear the movement of an object from left to right. But piecemeal observations of this kind do not imply the existence of general norms regulating the correctness of the auditory system's activity.
For example, melodies can be sung in many parts, and thus they are not always activities of single objects or single agents. It would be wrong to think that this lack of correspondence to the activities of a single object implies there is some kind of illusion involved in their appearing as unitary objects. There is no general norm on auditory object-construction that would license such a conclusion.

Qualia Insertion

States that are marked by characteristic qualia are almost certainly the products of sensory systems. This was noted in the examples discussed in section II above. The Kanizsa triangle looks brighter than its "background". The phenomenal awareness of brightness cannot have been inserted by what Berkeley calls "judgement"; judgement does not have the power to alter visual phenomenology. Nor does "suggestion", i.e., conditioning and the like – though Berkeley probably thought that it did. (See note 3 for some qualifications.) That is, I may judge how bright a certain surface is by calculating what light is falling on it, but such a calculation will not make it look brighter. (In the Kanizsa display, I can easily verify that the triangle is actually no brighter than the background, but I am unable to adjust how bright it looks.) That the triangle has a characteristic look indicates the involvement of a sub-personal sensory system. Qualia insertion in the case of phonemes and "melodies" such as patches indicates that these items too are directly processed by the auditory system.

Categoricity

In the case of phonemes, a sharp phenomenological distinction is felt between phonemes – that is, a consonant will be heard as /d/ or /t/ with no intermediates. Yet the spectrographic form of spoken sounds can vary smoothly. As these sounds are varied, the observer will notice a sudden change from /d/ to /t/ at a certain point. This mismatch between the smooth variation of acoustic signals and the discrete nature of their sensory representation again reveals active sensory processing. The discontinuity pertains to articulatory gestures, as we have seen, not to acoustics; once the gestures have been decoded, the system creates phenomenological discontinuity to accord with gestural discontinuity. Categoricity is a form of qualia insertion.
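The pattern can be sketched as follows; the choice of voice onset time as the acoustic continuum, the 30 ms boundary, and the steepness of the curve are all assumptions made only for this illustration. The stimulus varies in equal acoustic steps, yet the reported percept flips abruptly from /d/ to /t/.

```python
# Illustrative sketch only: a smoothly varying acoustic parameter mapped onto a
# discrete category. Voice onset time (VOT) stands in for the relevant acoustic
# continuum; the boundary and sharpness values are assumed, not measured.
import math

def p_heard_as_t(vot_ms, boundary=30.0, sharpness=1.5):
    """Probability that the stimulus is heard as /t/ rather than /d/."""
    return 1.0 / (1.0 + math.exp(-sharpness * (vot_ms - boundary)))

for vot in range(10, 55, 5):
    label = "/t/" if p_heard_as_t(vot) > 0.5 else "/d/"
    print(f"VOT {vot:2d} ms -> {label}  (p = {p_heard_as_t(vot):.3f})")
# The acoustic parameter changes in equal steps, but the percept flips
# abruptly from /d/ to /t/ near the boundary, with virtually no intermediates.
```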
Non-decomposability

Sometimes a quality is presented holistically even though the perceiver has separate access to the elements from which a sensory system computes it. In vision, face perception is a well-known example. A face has a look, and is reidentified by that look. Presumably we also possess visual access to the facial features that the system uses in face-recognition – the shape of the face, the shape of and distance between the eyes, the shape and placement of the nose, etc. Yet, we have no instinctive idea of how these component features combine to make up the look of a face. As it happens, we know on independent grounds that face perception is modular – at least in the sense that it (or some sub-process thereof) is localizable in the brain, automatically computed, used to recognize other members of our species, and thus to update our knowledge base concerning these individuals. (See Bergeron and Matthen 2007 for discussion and references.)

Something of the same kind holds of voice recognition. A man and a woman can sing the same melody in the same key at the same pitch. Yet each may be recognizable as a man or as a woman. This ability has to do with the timbre of each voice (mostly with formants above the second). We have separate access to the overtones that constitute timbre, but we cannot instinctively analyse the components of each voice. We just sense that a voice sounds like a man's, or like a woman's. The same is true of individual voices – your spouse's voice just sounds different from your sibling's. You may well have separate access to what makes them different, but you do not consciously weigh up these factors when you recognize a voice (over a telephone, for instance). These are indications of automaticity.

Familiarity

Certain things can be reidentified in a non-decomposable, or holistic, manner – faces, voices, melodies are examples discussed above. A feeling of familiarity is a sign that there are already internal records concerning such individuals, and that the incoming signal is linked to that record for purposes of update. Thus, it indicates that originally the stimulus was directly sensed.

VI. Auditory Object Construction

There is a gap in Berkeley's argument, as we saw. Nevertheless, we are now in a position to appreciate that he was more or less on the right track in the particular case that he discusses. This can be shown by a relatively direct use of D3, i.e., by inspecting the form of auditory states.

The features or qualities that audition delivers to consciousness are of the following sort: loud, soft, high, low, and so forth. Features of this sort are not attributable to the coach or to its wheels. The squeak of the coach's wheels may be high, and the rumble that it makes as it rolls along the road might be low. However, the wheels themselves are not high, and the coach itself is not low. This is more than a matter of language: loudness and pitch are not continuing features of material objects, but of certain events in which these objects are involved. Since auditory features characterize something other than the coach (or its wheels), the target of auditory representation must equally be something other than the coach. If the auditory state eventually leads to an update of internal records concerning the coach, it does so only indirectly – through whatever is the bearer of the auditory features, high/low, soft/loud, etc. If it is right to say that we hear the coach because we hear something high (some squeaks) and something low (some rumbles), then it seems right to say that we hear the coach only because its presence is suggested by our hearing its squeaks and rumbles. And squeaks and rumbles are sounds – they are the things to which auditory qualities, or features, are attributed. Berkeley seems to be correct, therefore: it is sounds that are properly or directly heard in this case, not a coach. (Berkeley, of course, thinks of sounds as sensations, and, as we shall see, this is a mistake.)
There is another kind of reason for thinking that we hear sounds rather than material objects. It lies in a disanalogy with vision. The disanalogy is a fine one, though, and it lies within a broader analogy. Let us first examine the broad analogy. Vision demarcates objects by marking their boundaries off from the background – the Kanizsa triangle is an example of how this is done. That is, visual objects are most often demarcated by figure-ground boundaries. "A perceptual object is that which is susceptible to figure-ground segregation," say Kubovy and Van Valkenburg (2001, 102). Albert Bregman (1990) has argued in great detail that the figure-ground operations of vision are all repeated in audition, but with sound-streams, not material objects – i.e., with temporally (rather than spatially) extended figures – patches is but one of the examples that he offers. (See also Kubovy and Van Valkenburg 2001 and Griffiths and Warren 2004.) This is a strong reason for supposing that there are auditory objects, just as there are visual objects.

Now, many of the direct objects of vision are material objects (see Matthen 2005, chapter 12): that is, the visual system uses figure-ground segregation to mark the boundaries of material objects. But the direct objects of audition are not material objects: as we saw in the preceding section, auditory constancy does not mark material objects as such, but rather certain events and activities in which material objects are involved; and, as we saw two paragraphs ago, auditory features are not attributable to material objects. Nevertheless, there is some reason for thinking that auditory objects often correlate with material objects. Audition tracks voices. When somebody is singing, her voice sounds like a continuous and connected stream emanating from the place where she is singing. So one might think that here, audition is targeting a material object. It parses the voice as a single continuous stream, one might think, because the voice emanates from a single person. Here, the evidence suggests that the auditory system is tracking a material object: the sound appears to be continuous and connected because it is analysed as coming from a single material object.

Perhaps this is true, but there are certain strong disanalogies between visual and auditory objects. In the first place, audition presents its objects as temporally composed. Thus, if O is an auditory object that persists through time, and P(O,t) is what we hear of O at t, then P(O,t) will often be sensed as a part of O. For example, a phoneme will be heard as part of a word, a note as part of a melody, and so on. On the other hand, the synchronic components of auditory objects are usually not sensed as distinct parts. For example, suppose that a string quartet plays a chord by each instrument playing one note of that chord: the resultant harmony has a holistic quality in which the separate notes are not heard as separate parts. If, on the other hand, the parts of the string quartet are not in harmony, they are heard as separate auditory objects – separate voices. In this case, there is no one object of which they are all sensed as parts.

In vision, the situation is reversed. Thus, if O is a visual object that persists through time (for instance, a blue sphere), and P(O,t) is what we see of O at t, P(O,t) is generally not sensed as a part of O – what we see of a blue sphere at t is sensed as the whole of the blue sphere, not a part thereof. On the other hand, spatial components of the blue sphere that are seen at the same time – for example, the blue hemispheres – are generally seen as parts. This phenomenological difference arises from an important difference in the principles of object construction used by these modalities – vision joins spatial parts together to form unitary wholes; audition works on temporal parts in the same way.

Material objects are visually presented as continuing through time, but not as consisting of a sequence of temporal parts.
Auditory objects, by contrast, are presented as unfolding through time, but not as having simultaneous parts. The part-whole ontology of vision parallels the ontology of material objects. Auditory objects, however, are of a fundamentally different kind from visual objects – they don't have synchronic parts, but do have temporal parts. They are not material objects. Of course, some do hold that material objects have temporal parts, and I don't want to deny that they do: my point is that the intuitive part-whole ontology of material objects is that of spatial parts, and that of events is that of temporal parts. The ontologies of vision and audition parallel these intuitive ontologies.

Moreover, some of the principles of auditory figure-ground segregation work against material-object identification. We just saw that there are occasions when several distinct voices appear to merge: namely, when they sound in harmony. According to the principle articulated by Kubovy and Van Valkenburg (see above), this is an auditory object. For here too the sensed unity is attributable to figure-ground segregation in accordance with a principle that plays a role in vision. In vision, strongly correlated edges are seen as edges of the same object: if two edges run more or less in parallel, the squiggles of one correlating closely with those of the other, they will be perceived as two edges of a single figure seen against a ground (cf. Hoffman 1998, 60-61). In vision, this principle is clearly targeted at material objects: the evolutionary rationale for the principle is that edges can be correlated only if they have a common cause, namely a single object. The merging of sound-streams that consistently harmonize with each other is an instance of the same phenomenon – they are merged into a single object because they are highly correlated. However, such dual-source sound streams are not correlated with single material objects. They are emitted by two objects singing in parallel.

VII. Located Sounds

The majority view is that we hear sounds. This accords with the facts about attribution mentioned earlier: sounds are bearers of auditory features (though not the only such bearers, as we shall see in a moment). I shall argue in the following section that sounds are not the only things we hear. In the preceding section, it was already possible to glimpse reasons for denying that sounds are the only audibles: figure-ground segregation is often concerned with sound streams – temporally extended collections of sounds, sometimes from different sources – not individual sounds. For the next few pages, I want to ignore this. In the present section, I want simply to inquire into what sounds are.

Berkeley thought that sounds are sensations. Others have thought that they are vibrations of the air. However, Robert Pasnau (1999) has shown in a seminal paper that these views are false. His argument is very simple:

We do not hear sounds as being in the air; we hear them as being at the place where they are generated. Listening to the birds outside your window, the students outside your door, the cars going down your street, in the vast majority of cases you will perceive those sounds as being located at the place where they originate. At least, you will hear those sounds as being located somewhere in the distance, in a certain general direction. But if sounds are in the air, as the standard view holds, then the cries of birds and of students are all around you. (ibid., 311)
VII. Located Sounds

The majority view is that we hear sounds. This accords with the facts about attribution mentioned earlier: sounds are bearers of auditory features (though not the only such bearers, as we shall see in a moment). I shall argue in the following section that sounds are not the only things we hear. In the preceding section, it was already possible to glimpse reasons for denying that sounds are the only audibles: figure-ground segregation is often concerned with sound streams – temporally extended collections of sounds, sometimes from different sources – not individual sounds. For the next few pages, I want to ignore this. In the present section, I want simply to inquire into what sounds are.

Berkeley thought that sounds are sensations. Others have thought that they are vibrations of the air. However, Robert Pasnau (1999) has shown in a seminal paper that these views are false. His argument is very simple:

We do not hear sounds as being in the air; we hear them as being at the place where they are generated. Listening to the birds outside your window, the students outside your door, the cars going down your street, in the vast majority of cases you will perceive those sounds as being located at the place where they originate. At least, you will hear those sounds as being located somewhere in the distance, in a certain general direction. But if sounds are in the air, as the standard view holds, then the cries of birds and of students are all around you. (ibid., 311)

Now, as will emerge, I don't think that everything we hear is located in quite so simple a manner as Pasnau suggests, and I certainly do not think that each audible thing originates in a single place. Nevertheless, there is certainly an important sub-class of audibles that have definite location in just the way that Pasnau indicates. Generally, but not always, these arise from a discrete event – a bang or a whimper or a laugh. That these sounds seem to have location is not merely an illusion or error of audition. For it is clear that audition is generally quite accurate about the location of events from which air-vibrations originate. By contrast with such originating events, air-vibrations are diffuse; they have no confined location. This indicates that audition is (often) functionally targeted on the origins of the air-vibrations, not on the vibrations themselves. If sounds are what we hear, then sounds are located events. Let's call these audibles located sounds. Sounds are like colours, Pasnau urges: they are located at "their point of origin". They are not like odours: they do not fill the air (313).

The same argument tells against sounds being sensations. Sensations are not located outside your door. What you hear may be, or may seem to be, outside your door; sensations, however, are not, and don't seem to be, in any physical place – they are in the mind. (Again, the argument of section IV tells against sounds being sensations: there are no constancies regarding sensations. And auditory features like high and loud do not belong to sensations; rather, they belong to things in the public domain.) Sensations are the hearings of things; they are mental episodes. They should not be confused with the things that are heard. Sounds are not episodes of audition; they are what we hear.

What then are sounds? Pasnau proposes that they are located in material objects – they "either are the vibrations of such objects, or supervene on such vibrations," he says (316). This seems right. However, Pasnau occasionally implies – he slips here, I think – that the sound we hear is a property of an object. "We should insist on putting sound back where it belongs, among the various sensible properties of objects: among colour, shape, and size" (324). Presumably, he is led to this view by thinking that a vibration can be a property of an object. For instance, the vibration of a trumpet – its sound, according to Pasnau – can be regarded as a property of the trumpet.

I do not wish to contest that vibration is a property of the trumpet. I do want to note, however, that in general we sense both objects and their features – here, I mean 'object' to range more widely than material objects. For instance, we sense a particular object – a disc in the corner – and sense of it that it is blue. Here, sensation represents a subject-predicate connection between the disc and its colour. Auditory sensations represent subject-predicate connections too. We sense of auditory objects that they have auditory features. If Pasnau were right about sounds being properties or features, then we would sense of their subjects that they had these properties. For instance, when we listen to Purcell's Voluntary, we would be sensing of a trumpet (the subject) that it is vibrating (the feature). But this implies by D3 that we (directly) hear the trumpet. And this is precisely the conclusion that I argued against earlier.
The trumpet is not high or piercing; it is the sound that it emits that possesses these characteristics. Audition does indeed represent subject-predicate connections, but it is the sound (not the maker of the sound) that is the subject, and features like high/low and loud/soft that are the predicates.

But this does not tell the whole story. I do not have the space to argue the point in detail here, but audition tends to individuate located sounds as if the material object – and not just the place – from which they emanate were important (cf. O'Callaghan forthcoming b). In addition to the evidence adduced in the last section – that audition tends to track melodic lines and voices – there is this additional consideration. Audition is closely allied to object characterization in vision. For instance, visual attention moves to the heard location of a noise. Then there is the ventriloquist's effect: if there is a moving mouth in the vicinity of an auditorily located sound, the auditory system relocates the sound to the mouth. Again, there is the McGurk effect: the visually apprehended movement of the mouth and tongue will influence what one hears somebody saying. In recognition of the importance of material-object location in the demarcation of sounds, I will say that sounds are not merely located, but object-located events. (See O'Callaghan forthcoming a, b, for a similar view.) This, I believe, does justice to the intuitions that led Pasnau to suggest that they are properties of objects. But he is wrong to think that sounds are attributes of individuals. Object-located events are not material objects; they are events.

Consider then a chain of events, the last member of which is a vibration of the air. I have in mind a chain like this:

Violinist reads music → violinist moves bow across string → string vibrates → air vibrates at string-air interface.

In such a chain, there is a last member that is a cause of vibration propagated through the air, but which is not itself a vibration of air. That last item is a sound. (It should be noted that air-flow can be a sound – when one whistles, or when the exhaust of a jet engine makes a roar. Air-flow is not in itself air-vibration.) In the chain shown above, the third item is the last cause of vibration in the air. My claim (partially following Pasnau), therefore, is that this vibration-in-the-string is a sound. This is the subject of the representational content of the auditory sensation; the predicates are its pitch and loudness. (Note that on this view, sounds can be silent: the bow can cause a string to vibrate even in a vacuum, where no air-vibrations are created. Pasnau defends this as follows: "If x has the property of being a squeaker, it would seem peculiar to claim that x loses that property when it is put in a vacuum for five minutes. After all, it would still be squeaking when you take it out of the vacuum."iii)

It could be objected that sounds so identified lack the auditory characteristics we normally attribute to them – loudness or softness, highness or lowness, and so on (cf. Pasnau 319). Loudness and pitch, it might be said, belong to sound waves. This objection does not seem correct. It is certainly true that vibrations in the air have amplitude and frequency. Sounds, however, do not have amplitude and frequency; they have loudness and pitch. The latter are, of course, closely related to amplitude and frequency. But they are different. The loudness and pitch of a located sound are definitely located in space and in objects, just as sounds are. The loudness of an aircraft taking off is not simply an ambient quality; it is located somewhere near the aircraft, just where the sound is. So it is more accurate, I think, to say that loudness and pitch are qualities that the auditory system attributes to sounds on the basis of the amplitude and frequency of the air-vibrations that these sounds cause. (Loudness, it should be said, is a perspectival quality, something like the visual quality of being above or below the subject, or looming over her. It is a property that a sound has from the perspective, or at the place, where the auditor is. Thus: "That stereo is too soft/loud [from where I am sitting]." Moreover, one can hear a loud sound – indeed, it can seem loud – even when one is very far away and the amplitude of the sound waves that one's ear receives is small.)
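The gap between the physical magnitudes and the perceived qualities can be put in a rough psychoacoustic sketch – standard textbook approximations offered for illustration, not the paper's own formulation:

L_{SPL} = 20 \log_{10}(p/p_0)\ \text{dB}, \quad p_0 = 20\ \mu\text{Pa}
S \propto I^{0.3} \quad \text{(Stevens' power law: roughly a doubling of loudness per 10 dB)}
n = 12 \log_2(f_2/f_1)\ \text{semitones}

Loudness and pitch, so characterized, are nonlinear, ratio-based functions of amplitude and frequency rather than identical with them; and whereas the pressure p in the first formula is measured at the listener's ear, the loudness we attribute to the aircraft is not – which is one way of marking the difference between amplitude and loudness that the text insists on.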
VIII. Other Audibles

Located sounds are not the only things we hear directly. Consider melodic phrases. These, recall, are relatively brief sequences of notes – the opening ta-ta-ta-tum of Beethoven's Fifth Symphony is an example – that are recognized as wholes. Phrases are heard as possessing contour, metre, and rhythm. It is sometimes said that they are heard in virtue of hearing their constituent notes – but since a phrase transposed into a different key, or played at higher or lower volume, retains its identity, this seems false. As well, as patches shows, how one hears a phrase will influence how one hears the notes. As we argued in sections II, IV, and V, this is a case where both the whole and the parts are heard directly. Melodic phrases are not object-located sounds. A phrase could be started by the violins and completed by the cellos. The opening notes would then be in one object and the closing notes in another, without the phrase ever occupying the places in between. Thus, object-located sounds are not the only things we directly hear.

Again, consider harmonies. Contrast a chord sounded by a string quartet, each instrument playing a different constituent note, with a chord sounded by a single instrument. The first can be heard, but where is it? Certainly not in the four separate locations that the component sounds occupy. The harmony is a single thing and is not splintered in such a way. Perhaps it spreads into the large location that the whole quartet occupies. But then it is nothing like an object-located sound, for no one object is the source of the air-vibrations by which the chord is heard, and no one object occupies the large location. Here, as with melodies, the individual notes may be heard directly, but they have no priority: we hear the harmony directly – it has a distinctive and non-decomposable quality that persists when new notes (in the same interval) are substituted for the old. A fifth sounds like a fifth regardless of the key. Here, again, audition is not tracking material objects, but constructing auditory objects, which have individuation conditions and figure-ground segregation conditions of their own.

Melodic phrases and harmonies are auditory objects. They are not sounds, but are composed of sounds. Often they are heard directly. So not everything we hear is a sound. Some of the things we hear are composed of sounds.
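The transposition point can be spelled out in elementary terms – a standard observation about musical intervals, added here for illustration rather than taken from the paper. If a phrase is a sequence of fundamental frequencies f_1, ..., f_n, transposition multiplies each by a constant c, and every interval survives unchanged:

12 \log_2\!\left(\frac{c f_{i+1}}{c f_i}\right) = 12 \log_2\!\left(\frac{f_{i+1}}{f_i}\right)

Likewise, a perfect fifth is the ratio f_{high}/f_{low} ≈ 3/2 (2^{7/12} in equal temperament), whether it is built on A4 (440 Hz up to about 660 Hz) or on C5 (about 523 Hz up to about 785 Hz). What persists under transposition or substitution is a pattern of ratios, not any particular set of pitches – which fits the claim that the phrase or the chord quality is not simply recovered from the absolute pitches of its parts.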
Audibles come in other varieties too. A landscape or visual scene consists in things you see at the same time, and their visually apprehended relationships to one another. A soundscape or auditory scene consists in things that one hears together over an extended period of time, and their auditorily apprehended spatial relationships to one another. An auditory scene may then consist of several sound streams arrayed in space; a sound stream, of many sounds arrayed in time. (Question: do we ever hear soundscapes directly? I am not sure of the answer to this question.)

Suppose that you are listening to "Yesterday" by the Beatles. You hear several things here. You hear several melodic lines: a human voice, a bass guitar, and a string quartet. The string quartet is sometimes one voice, sometimes two; and sometimes a single voice will pass from one instrument to another. Each of these is a temporally extended unit – a sound stream. As well, the human voice is not just a melody. It utters a sequence of phonemes; and these form words and meaningful text. Moreover, you hear a number of harmonies or chords: the backup, the bass guitar, etc., are responsible for these. Lastly, you hear individual tones. These things are not all that you hear when you are listening to "Yesterday" – you hear various environmental sounds as well – but let us pause to consider these elements.

Let's consider the melodic lines first. The separation of these lines from the total energy flux at the ears is no mean feat. Bregman (1990, 3) makes the point by contrasting the following displays:

AI CSAITT STIOTOS
AI CSAITT STIOTOS

The top line makes little sense; the visual cues provided in the second line help decode the message. The signal that the ear receives when listening to a many-part composition like "Yesterday" is much like the first line; yet what we actually seem to hear is segregated in the manner of the second. A "baby starts to imitate her mother's voice," Bregman says. "However, she does not insert into the imitation the squeaks of her cradle that have been occurring at the same time" (5). This is the auditory equivalent of the kind of separation that occurs in the second line of the display shown above.

Auditory properties are distributed among entities in a soundscape. Frank Jackson (1977, 65) drew attention to a certain kind of structure that obtains in visual images. The image of a green square to the left of a red circle is different from that of a green circle to the left of a red square. Jackson points out that this shows that the visual image doesn't just contain red, circle, green, and square. Rather, these properties are bound together in determinate ways: the red either to circle or to square, and similarly for the green. Similarly, as Albert Bregman says:

Suppose there are two acoustic sources of sound, one high and near and the other low and far. It is only because of the fact that nearness and highness are grouped as properties of one stream and farness and lowness as properties of the other that we can experience the uniqueness of the two individual sounds rather than a mush of four properties. (1990, 11)

In the visual image, the properties are not merely co-located; they are co-predicated (Matthen 2005, 272-277). The redness belongs to the circle or the square; it isn't merely co-located with these things – thus, circularity individuates the subject; redness is a feature attributed to this subject. Similarly here: a melodic line is something that falls or rises, and possesses timbre and loudness, etc. The melodic line is individuated by figure-ground segregation; various auditory features are attributed to it.
It is not merely that these qualities happen to be associated with the melodic line; they are predicated of the melodic line. (This, I argued in the previous section, is what Pasnau overlooks when he argues that sounds are properties of objects.) The separate sound-streams heard in "Yesterday" each have different features; this, as Bregman suggests, is how the auditory scene is constituted.

Notice that the different objects in an auditory scene consist of different groupings of object-located sounds. A chord consists of notes sounded simultaneously, whether by the same or by different objects. A melody consists of successive notes sounded by the same voice, or by different voices in sequence, or by a group of voices in harmony. Thus, one might say that a scene has certain elemental parts which get combined in different ways to form different extended objects. These extended objects overlap to some extent; there are elemental parts that belong to more than one of them. The elemental parts are object-located sounds in the sense of the preceding section; the extended wholes are not. These complex part-whole relations involving individual sounds no doubt encourage a form of auditory atomism, the doctrine that sounds are the only direct objects of audition. But I have argued throughout this paper that such atomism is misguided. The minimal parts of everything we hear are object-located sounds, but we do not hear all other things in virtue of hearing object-located sounds. On the other hand, it is certainly true that the auditory system does segment the ambient acoustic energy flux received by the ears into complex overlapping wholes. The important point to keep in mind is that these wholes are not heard in virtue of their parts.

IX. Aesthetic Appreciation and the Variety of Audibles

Consider, in conclusion, the following oddity. It is difficult to attend to contrapuntal harmonies at the same time as one attends to melodies. It is, of course, easy to hear harmonies when a single melody is played as a sequence of chords. But when two melodic lines are played, it is hard to hear the chords formed by the simultaneous sounding of notes across the two lines. For example, it is harder to hear the chords formed by the notes of the singer together with the violin in "Yesterday" than to hear the chords that sound when the string quartet is playing together as one sound-stream. This is puzzling. It's not that one has difficulty apprehending a plurality of auditory features simultaneously. For example, one can listen to the contour of a melodic line at the same time as one attends to its beat, metre, and internal harmonies. Why then is it hard to hear contrapuntal harmony at the same time as one attends to the separated melodic lines of violin and voice?

Here again, a parallel with vision throws light on the puzzle. In vision, it is hard to attend to what one might call accidental relations between two visual objects, relations that arise out of some peculiarity of our viewpoint. When one is composing a photograph of a friend, one fails to see that there is a lamp-post or pillar behind her, which in the photograph will appear to sprout from her skull. Looking at paintings from the 16th century, one fails to notice – unless one has read about it in advance – that the main figures are arranged in a triangle. Why is this?
Because such juxtapositions are accidents of one's point of view, and the visual system disregards accidental juxtapositions, since they are not germane to the actual spatial relations that obtain between the objects. That is, the juxtaposition of lamp-post and head, or the arrangement of figures in a triangle, would be disturbed if one shifted one's position a little bit. By contrast, conjunctive features that constitute an object – the attachment of head to shoulders, the triangularity of a Yield sign – persist with changes of point of view. Automatic visual processes tend to ignore accidental conjunctions and highlight intrinsic conjunctions.

In audition, something similar seems to be at work. Harmonic relationships that would be obvious when they are present within a melodic line are difficult to perceive when they hold between notes in distinct melodic lines. Why is this? Because the within-object relations are constitutive of the object, while the cross-object ones are accidental (or would be if they hadn't been created by an artist). Cross-object relations are essentially a product of the situation, and could easily have been different. Automatic auditory processes ignore them, just as automatic visual processes ignore cross-object juxtapositions. It takes close attention to perceive these relationships, and prima facie this indicates that they are indirectly heard – i.e., heard in virtue of hearing the individual melodic lines by means of a post-sensory process.

In addition to the musical entities considered above, "Yesterday" also incorporates phonetic streams. These too have to be extracted from the total acoustic flux. Imagine a person singing a prolonged 'ah' at high C (C6). Now imagine her singing 'oh' at the same note. Obviously, one can readily tell the difference – but how? What is the sonic difference between an 'ah' and an 'oh' sung at the same frequency? It turns out that the difference lies in the timbre of the two vowels – namely, in the frequency of the second formant (Cogan 1969, Handel 2006, chapter 8). This means that if the singer were to sing a diphthong that went from 'ah' to 'oh', while holding a single note, the phonetic system would pick up the change of vowel, though the melody – which resides in the fundamental – would consist of a single note. In this way, the phonetic line will be separate from, though it will complement, the melodic line in sung music. Again, consonants will be perceived as separate from the main melodic line, since, as we saw in section II, they involve glides in the second formant. Consonants are perceived as points of attack, or metrical elements, rather than as part of a smooth melodic line. The melodic line possesses, then, a somewhat different contour from that of the phonetic stream. Yet we are not readily aware of these departures from parallelism. We are not readily aware of the changes of timbre as changes of timbre – we hear only transitions from one spoken phone to another.
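For concreteness, here are rough, representative formant values for the two spoken vowels – illustrative textbook figures of the kind reported in classic vowel-formant studies, not values taken from Cogan or Handel, and sung vowels at very high pitch shift them considerably:

'ah' (/ɑ/): F1 roughly 700-850 Hz, F2 roughly 1100-1220 Hz
'oh' (/o/): F1 roughly 450-500 Hz, F2 roughly 800-900 Hz

The note being sung fixes the fundamental f_0, and with it the melody; the configuration of the vocal tract fixes the formants. So the singer can hold f_0 constant while F2 drops by several hundred hertz, and that drop is heard not as a change of note but as a change of vowel.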
Objects of aesthetic appreciation always contain accidental relations of this sort. They are deliberately inserted by the creator, and a full grasp of the aesthetic properties of a work of art demands an appreciation of these relations. In his essay "The Work of Art as an Object", Richard Wollheim (1973) asserts that "modern art, or the painting of our age, exhibits, across its breadth, a common theory . . . according to which a work of art is importantly or significantly, and not just peripherally, a physical object" (118). What he means is something like this. Pictures depict objects by means of spots of coloured dye on a flat surface. When you look at a picture, you see both the flat array of dye and the depicted objects. In realistic art, there is a strong tendency to see the depicted objects. But in documentary photography, for instance – photography whose main concern is to document the objects – one tends not to see the photograph itself. That is, one attends not to the coloured marks on the surface of a piece of paper, but to the objects that these coloured marks depict. But in "modern art", Wollheim says, whether depictive or not, the artist wants to draw your attention to the flat array.

It was . . . optional for Velazquez or for Gainsborough whether they expressed their predilection for the medium. What was necessary within their theory of art was that, if they did, it found expression within the depiction of natural phenomena. For, say, Matisse or Rothko, the priorities are reversed. (ibid., 120)

Without wishing in any way to challenge Wollheim's sketch of priorities, it is worth pointing out that the tension between the medium and the view of "natural phenomena" is a feature of every work of art. That is, every artistic depiction deliberately inserts accidental correlations into the medium, and conveys meaning by these correlations. In painting, principles of composition are features of the medium: the devices of arrangement that a painter uses to highlight and decorate a work of art are juxtapositions that would be absent in a natural view, simply because slight changes of perspective or position would "rearrange" the objects. Thus, to appreciate how a painter has composed a picture is to attend to what I have called accidental relations in his scene – to the triangular composition, or the relationships of size and colour, etc. In real life, the fact that one figure is in sunlight and another is in shade is of purely accidental significance – that relationship could change in a minute or two. Automatic visual processes thus disregard the difference and compensate for the colour and brightness changes consequent upon it. In art, the fact that somebody is in the light conveys the artist's attitude toward that person. To appreciate this attitude, one has to attend to something that vision itself tends to ignore.

An analogous point holds of auditory works of art. A work of music or a recitation creates an auditory scene that is not natural – it is a range of auditory objects plucked out of the flux of acoustic energy as commanded by the composer or performer; it is not a range of objects that can be found in nature. Here too there are two kinds of thing that one hears and to which one attends: the natural ones, consisting of the voices, vocables, and other sound-streams that emanate from the performers, and the unnatural acoustic scene that they constitute. Crucial to appreciating these works as aesthetic objects is appreciating accidental relations between different auditory objects in this scene – how the rhythm of spoken words interacts with the melody, the contrapuntal harmonies, the merging and separation of voices in a piece. All of these relations are possible only because of the variety of auditory objects that we have discussed in this article. The artist creates these objects and makes them stand in accidental relations. To hear and understand these accidental relations is of the essence of auditory appreciation.
LITERATURE CONSULTED

Bergeron, Vincent and Matthen, Mohan (2007) Assembling the Emotions. Canadian Journal of Philosophy.
Bregman, Albert S. (1990) Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, Mass.: Bradford Books, MIT Press.
Casati, Roberto and Dokic, Jerome (2005) Sounds. Stanford Encyclopedia of Philosophy (Fall Edition), Edward N. Zalta (ed.), URL = <http://plato.stanford.edu/archives/fall2005/entries/sounds/>.
Cogan, Robert (1969) Toward a Theory of Timbre: Verbal Timbre and Musical Line in Purcell, Sessions, and Stravinsky. Perspectives of New Music 8: 75-81.
Firth, Roderick (1949) Sense-Data and the Percept Theory. Part I. Mind 58: 434-65.
Gallistel, C. R. (1990) The Organization of Learning. Cambridge, Mass.: Bradford Books, MIT Press.
Griffiths, Timothy D. and Warren, Jason D. (2004) What is an Auditory Object? Nature Reviews Neuroscience 5: 887-892.
Hall, Geoffrey (1994) Pavlovian Conditioning: Laws of Association. In Mackintosh 1994a: 15-43.
Handel, Stephen (2006) Perceptual Coherence: Hearing and Seeing. New York: Oxford University Press.
Hatfield, Gary (1990) The Natural and the Normative: Theories of Spatial Perception from Kant to Helmholtz. Cambridge, Mass.: Bradford Books, MIT Press.
Hickok, Gregory and Poeppel, David (2007) The Cortical Organization of Speech Processing. Nature Reviews Neuroscience 8: 393-402.
Hoffman, Donald D. (1998) Visual Intelligence: How We Create What We See. New York: W. W. Norton.
Jackson, Frank (1977) Perception: A Representative Theory. Cambridge, England: Cambridge University Press.
Kanizsa, Gaetano (1976) Subjective Contours. Scientific American 234: 48-52.
Kubovy, Michael and Van Valkenburg, David (2001) Auditory and Visual Objects. Cognition 80: 97-126.
Kumar, S., Stephan, K. E., Warren, J. D., Friston, K. J., and Griffiths, T. D. (2007) Hierarchical Processing of Auditory Objects in Humans. PLoS Computational Biology 3: e100. doi:10.1371/journal.pcbi.0030100.
Levitin, Daniel J. (2006) This Is Your Brain on Music: The Science of a Human Obsession. New York: Dutton.
Lewis, David (1966) Percepts and Color Mosaics in Visual Experience. The Philosophical Review 75: 357-68.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., and Studdert-Kennedy, M. (1967) Perception of the Speech Code. Psychological Review 74: 431-61.
Mackintosh, N. J. (1994a) Animal Learning and Cognition. San Diego: Academic Press.
Mackintosh, N. J. (1994b) Introduction. In Mackintosh 1994a: 1-13.
O'Callaghan, Casey (forthcoming a) Sounds. Oxford: Oxford University Press.
O'Callaghan, Casey (forthcoming b) Seeing What You Hear: Cross-Modal Illusions and Perception. Philosophical Issues.
Pasnau, Robert (1999) What is Sound? Philosophical Quarterly 49: 309-324.
Pavlov, Ivan Petrovich (1904/1968) The 1904 Nobel Lecture, excerpted in a translation by W. Horsley Gantt, in Richard Herrnstein and Edwin G. Boring (eds) A Source Book in the History of Psychology. Cambridge: Harvard University Press.
Peterson, Mary A. (2001) Object Perception. In E. B. Goldstein (ed.) Blackwell Handbook of Perception. Oxford: Blackwell: 168-203.
Pylyshyn, Zenon (1999) Is Vision Continuous with Cognition? The Case for Cognitive Impenetrability of Visual Perception. Behavioral and Brain Sciences 22: 341-423.
Recanzone, Gregg H. (2002) Where was that? – Human Auditory Spatial Processing. Trends in Cognitive Sciences 6: 319-20.
Thorndike, Edward L. (1898) Animal Intelligence: An Experimental Study of the Associative Processes in Animals. Psychological Review Monograph Supplement 2 (no. 4): 1-109.
NOTES

i Many thanks to Nicolas Bullot and Casey O'Callaghan for detailed written comments and extensive discussion of issues covered in this article.

ii It is generally thought that one senses an extended object by sensing some, but not necessarily all, of its parts. For example, one sees a cube by seeing its facing surfaces. Thus, one does not have to see all of a thing's minimal parts in order to see it. The empiricist doctrine takes note of this by maintaining that one sees something in virtue of seeing those minimal parts that are in view.

iii It is interesting that on this view, it could be held that silent articulatory gestures are sounds. They are events that would have caused propagated vibrations of the air if the articulatory tract had been open. So it is possible to argue, with Aristotle, that consonants are sounds after all. Of course, the silent articulatory gesture that the speech perception system decodes as /b/ would sound quite different if the articulatory tract had been open.