For the European Review of Philosophy 7
Auditory Objects
Mohan Matthen
University of Toronto
What do we directly hear? In section I, I define direct perception, and outline the logical
atomist way of attacking the question. I argue in section II that atomism fails. Then, in
sections III-V, I propose that a better alternative to atomism is to revive and modernize
another traditional empiricist doctrine: that we directly sense what the senses deliver to
automatic (i.e., sub-personal) processes of learning.
Having discussed the criterial issue, I return to the question.
One obvious
answer to our question, on any way of proceeding, is that we directly hear sounds. I
argue in sections VI-VII that this obvious answer starts us off in the right direction,
provided that it is informed by a proper conception of the nature of sounds. Despite its
virtues, though, this notion about the direct objects of audition is overly circumscribed.
The doctrine that sounds are the only direct audibles is seriously infected by the
atomistic methodology that I question early in the paper.
It demands significant
supplementation.
One of my main aims in this paper is to show that even if everything we directly
hear consists of sounds – and I agree with this notion – the latter have no priority from
the perspective of audition. That something is an auditory composite does not imply
that we hear it indirectly, i.e., in virtue of its component sounds. For example, melodic
phrases are composite entities composed of sounds – they are not themselves sounds, I
shall argue – but they are often direct objects of audition alongside the notes of which
they are composed – they are not heard merely in virtue of hearing the notes that make
them up.
The auditory system presents its objects under cross-cutting types. At any given
time, a perceiver is conscious of a temporally extended auditory scene in which there
are melodies, harmonies, sequences of phonemes, individual voices, meaning-carrying
sounds, and so on. The very same sounds may simultaneously belong to more than one
of these types: for instance, some of the sounds that constitute a sentence may also be
parts of a melody in parts, sung by two individual human voices. Often, the perceiver
cannot simultaneously attend to all of the above-mentioned elements of a scene. In
section VIII, I argue that this is largely because sounds have to be differently grouped to
constitute these different kinds of items. In order to hear a composite one has to sense
the grouping as well as the constituents, and it is difficult to attend to several cross-cutting groupings together. I conclude, in section IX, by attempting to indicate how
aesthetic appreciation depends on this variety.
I. Perceptual Atomism
Empiricists generally subscribe to the “atomistic” doctrine that we sense wholes in
virtue of sensing their parts – except in the case of wholes that have no parts. These
partless wholes ground all other sensation, and may be entitled sensory atoms.
In the empiricist tradition, direct sensing is understood as follows:
D1. S senses x directly if S senses x, and there is no y (distinct from x) such that S senses x in virtue of sensing y. (cf. Jackson 1977, 19-20)
I take the ‘in virtue of’ in D1 to imply something along the following lines:
D1 (Codicil) S senses x in virtue of sensing Γ (where Γ is a set) if S constructs her x-sensum from her sensa of members of Γ.
In the Codicil, “construction” can be taken in a fairly non-demanding sense.
Consider the figure sketched below.
This figure can be seen in a variety of ways: as (a) two touching circles, as (b) a figure-of-eight, and as (c) a stylized version of the symbol for infinity. Should one say that one
sees (b) and (c) in virtue of (a)? In order to say this, it is not sufficient to show that the
figure-of-eight and infinity symbols decompose into two circles, but not vice versa.
What one must show, by D1 (Codicil), is first and foremost that the sensa of the two circles come first, and further that the perceiver (S in the definition) constructs sensa of (b) and (c) from these. Thus, what needs to be shown is that the visual system on its own delivers (a), but not (b) and (c) – the perceiver's mental activity is required for (b) and (c).
Empiricists traditionally assumed (see Lewis 1966) that when one senses a whole
that has sensible parts, one senses it in virtue of sensing its parts. This implies:
A1. If x is an object that has parts that can be discriminated from one another by sense modality M, then S senses x through M in virtue of sensing those parts of x that themselves have no parts that can be discriminated from one another by M (and which are available to M at the time).
Thus, empiricists assumed that the only things we directly sense are minimal sensa.
They assume that, in the figure sketched above, the composite objects are sensed in
virtue of the circles, and the circles themselves in virtue of the minimal arcs that
compose them. The latter are the only things we directly sense.
We do not just sense objects; we also apprehend their sensory qualities or features.
Atomist assumptions prevail here as well. Minimal visual objects (in the sense of A1)
have only colour and brightness, empiricists assume.
If we see visually extended
objects in virtue of apprehending their minimal parts, it must follow that the additional
visual qualities attributed to these extended objects – their shape, texture, and size, and
so on – are constructed from the colour and brightness of their minimal parts.
Similarly, minimal sounds have only pitch and loudness. It would follow by atomism
that the additional auditory qualities of non-minimal auditory objects – the contour of a
melody, the chord produced by two strings plucked together, the timbre of a voice, etc –
are constructed from the pitch and loudness of the minimal auditory parts of these
objects. Thus we have
A2. If F is a feature of an extended object O, then S senses F in virtue of sensing qualities that belong to the minimal parts of O.
Together A1 and A2 constitute a method for identifying the things we directly see and
hear as per D1. Let us call this method perceptual atomism. My concern in the first two
sections of this paper is with this doctrine interpreted in the light of D1 (Codicil).
Now, one kind of argument against perceptual atomism is that it is often harder
to register the qualities of the parts of a thing than to register the thing itself. For as
anybody who has tried to draw from life or to compose a photograph will readily attest,
it takes more effort to attend to or describe the two-dimensional fuzzy-edged outline
that a cat projects to the retina than it does to attend to the features of the cat itself. This
has often been taken to suggest that we see the cat’s fuzzy outline in virtue of seeing the
cat, not the other way around.
Roderick Firth (1949) tried to mount such an argument against the atomist order
of priority.
(He saw himself as following the lead of phenomenologists such as
Edmund Husserl and Gestalt psychologists such as Wolfgang Kohler.) His strategy was
to posit a post-sensory operation known as perceptual reduction, which is supposed to
account for our awareness of things like the two-dimensional projections of three-dimensional objects:
The operation of perceptual reduction . . . make(s) the ostensible physical object
progressively less and less determinate. If I were to perform the operation while looking at
a tomato, for example, the ostensible tomato which is present to consciousness would, so to
speak, become less specifically distinguished as an individual. Starting as a tomato with
worm holes it might be reduced to a tomato with "some sort of holes" in it, and then to a
tomato with spots on its surface, and so on. It might eventually become "some sort of
globular object", or even just "some sort of physical thing ".
But when this last stage is reached, or perhaps even before, there is a second effect: a radical
change takes place and a new object of consciousness appears and grows more and more
determinate . . . this new object is . . . not an ostensible physical object at all. . . (I)t is not until
this second stage in the process has begun that we are able to describe what we "really see" and to report, for example, that we are presented with "a red patch of a round and somewhat bulgy shape". (ibid., 460)
Firth ascribes the difficulty of seeing two-dimensional outlines to perceptual
priority: the apprehension of a three-dimensional object comes first, then three-dimensional properties are gradually reduced to abstractions compatible with two-dimensionality – worm holes become spots, the tomato itself becomes a somewhat
bulgy shape. Perceptual reduction brings it about that a quite different thing – a surface
or a shape – begins to appear. His claim is that surfaces and projections are seen only
by means of the mental act of perceptual reduction, and that this act can only be
performed when one already has before one's mind the percept of the three-dimensional object. This throws doubt on perceptual atomism by D1, because it seems to imply, by D1 (Codicil), that we see a "coloured shape" in virtue of seeing the three-dimensional object that projects it, not vice versa.
Firth’s argument is not conclusive, however. He seems to be thinking of cases
where one has to report on, or in some other way be explicitly conscious of, certain
aspects of what one senses. This is where his argument misses its mark. The atomist
may not be worried about the kind of awareness that underlies verbal report; he might
be trying to get at the latent content of a visual or auditory state. Thus, the atomist
might well concede that something like Firth’s perceptual reduction is necessary in
order to attend to the parts of a whole, but he might still think that sensing the whole
demands sensing the parts as a prior condition. This is, in fact, the position that Firth
himself attributes to H. H. Price (to whom the quoted words at the end of the above
passage allude). What Firth misses is the possibility that perceptual reduction might
just bring to a perceiver’s attention what she has already sensed. He does not show, or
even try to show, that the red-patch sensum was created by perceptual reduction.
II. Gestalts Over Parts
The Gestalt psychologists have a more direct argument against atomism. They show
that our sensory apprehension of a whole can often influence how we apprehend its
parts.
Consider the display known as the Kanizsa triangle (see Kanizsa 1976):
In the left figure, we appear to see a white triangular object in the foreground, partially
obscuring three objects in the background – three black circles and the outline of a
triangle, each partially occluded. (Note the contrast with the figure on the right which
appears as a simple two-dimensional pattern, and hence with no occlusion.) In the
present context, the interesting thing about the foreground triangle on the left is that it
appears brighter than the background. This is not the case on the right.
Perceptual atomism demands that we see the extended bright white of the
foreground triangle in virtue of seeing its bright parts and that the brightness of the
triangle is constructed from the brightness of its parts. If this were true, awareness of
the brightness of the central point of the display would precede that of the foreground
triangle. But this seems false. The foreground triangle is, of course, exactly as bright as
the background, and if one were to isolate it by covering up other relevant parts of the
display, one would see this quite easily. (On the right, the same kinds of parts coexist
with a different appearance of the whole.)
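That the brightness is physically equal can in principle be checked directly on an image of the display. The following sketch is not part of the original discussion; it builds a crude Kanizsa-style pixel array (the sizes and positions are illustrative assumptions) and confirms that the region inside the illusory triangle is no brighter than the plain background.

```python
# Illustrative sketch (not from the paper): a crude Kanizsa-style display as
# a pixel array; the region inside the illusory triangle is physically no
# brighter than the plain background.
import numpy as np

size = 300
img = np.full((size, size), 255.0)          # uniformly white canvas

# Three black disc "inducers" (real Kanizsa inducers have wedges removed;
# for the luminance check only the discs themselves matter).
yy, xx = np.mgrid[0:size, 0:size]
for cy, cx in [(90, 90), (90, 210), (210, 150)]:
    img[(yy - cy) ** 2 + (xx - cx) ** 2 <= 30 ** 2] = 0.0

inside_triangle = img[120:160, 130:170]      # region inside the illusory figure
background = img[20:60, 130:170]             # plain background region

print(inside_triangle.mean(), background.mean())   # both 255.0: physically equal
# The apparent extra brightness of the foreground triangle is therefore
# contributed by the visual system, not by the stimulus.
```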
The reason why the foreground triangle looks brighter in the left-hand Kanizsa
display is that the visual system (wrongly) infers the presence of a triangular occluding
object in front of the occluded objects. For if the figure were modified in such a way as
to make the foreground triangle disappear – for instance, by filling in the wedges of the
black figures or by rotating these figures in such a way that the wedges don’t line up –
then the very same parts of the above display that appear brighter would cease to
appear so. Prevent the visual system from inferring the presence of an occluding
triangle, or prevent it from inferring that it is in the foreground (as in the right hand
figure), and the parts will be seen differently. When the foreground triangle is inferred,
the visual system actively enhances its brightness. The look of brightness is, in other
words, inserted by the visual system itself, in order to mark the object it infers (and
perhaps to mark that it is closer to the light). Let's call this qualia-insertion. It implies
that at least as far as the visual system is concerned, seeing the whole is a prior
condition of seeing the parts. It does not deliver sensa of the parts without a sensum of
the whole – on the contrary, the sensa of the parts depend on that of the whole – and so
the perceiver does not become aware of the whole by assembling the parts.
Similar whole-over-part phenomena are found in the auditory realm. Consider a
syllable like /da/. Phenomenologically, it appears that this syllable consists of two
phones in temporal order, the consonant /d/ followed by the vowel /a/. Further,
when we hear /di/ or /du/, we seem to hear the same initial sound – the consonant
/d/ – followed by a vowel. Thus, speech perception appears compositional: we seem
to hear distinct sounds strung together, each corresponding to individual consonants or
vowels. This is the thought that Aristotle seems to be expressing when he asserts that
“written marks are symbols of spoken sounds” – the written ‘d’ is a symbol of a distinct
spoken “sound” /d/ that precedes the /a/ in ‘da’, he seems to imply. (See, however,
note 4 below.) This phenomenology would suggest that we hear /da/ in virtue of
hearing /d/ and /a/ in that order.
Investigation of the actual pattern of wave forms in speech reveals, however, that
this part-over-whole model is not correct. There is, as it turns out, no auditory wave
pattern common to the different syllables that start with /d/ – that is, the acoustic
patterns corresponding to /da/, /do/, /du/, etc. share no initial segment. The analysis
of these acoustic patterns reveals that the first overtone (the “second formant”) of each
consists in an upward or downward glide from a fixed frequency of roughly 1800 hertz,
smoothly rising or falling to the frequency of the vowel part – these glides are different
in slope, and so they cannot be identified with a common element in the syllables that
start with /d/. Moreover, the same vowel combined with different consonants will
sound at different frequencies.
Thus, as Alvin Liberman and his colleagues (1967) write:
The speech signal typically does not contain segments corresponding to the discrete and
commutable phonemes . . . We cannot cut either the /di/ or the /du/ pattern in such
way as to obtain some piece that will produce /d/ alone. If we cut progressively from the
right hand end, we hear /d/ plus a vowel, or a non-speech sound; at no point will we hear
only a /d/. (436)
Liberman et al. say, illuminatingly, that the acoustic form of speech is not a cipher, in
which there are discrete parts, each of which stands for some part of what is
enciphered. Rather, speech is a code in which a temporally extended sound pattern can
stand as a whole for a sequence of phones, without parts of that sound pattern standing
for the individual constituents of the encoded sequence. The phenomenology of /da/
and /du/ suggests compositionality or encipherment, but in fact they do not share any
common acoustic element that corresponds to /d/. These syllables are encoded in the
acoustic streams that we produce; they are not enciphered therein.
In order to understand this phenomenon, it is necessary to understand how /da/
is produced. A speaker produces /da/ by first creating a resonant frequency of 1800
hertz in the articulatory tract. This is done by closing the tract off with the tongue
pressed against the palate. Because the tract is closed, this first articulatory “gesture” is
actually silent. Having performed this first gesture, the speaker then opens her mouth
in the /a/ shape, and releases the acoustic energy stored in the articulatory tract. This
produces a glide up to the vowel frequency and then a steady sound at that frequency.
Thus, /da/ is produced in three phases, (i) a silent resonance, (ii) a glide, and (iii) a
steady pitch. It is the starting point of the glide that is embedded in (i), the silent part,
that corresponds to /d/ – this is what /da/ and /du/ and the rest of the initial /d/
syllables share.
(Correspondingly, syllables that end with /d/ are produced by a
sequence of articulatory gestures that ends with such a silent resonance.)
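The acoustic side of this production story can be roughly illustrated in code. The sketch below is only an illustration under stated assumptions (the sample rate, durations, and vowel formant targets are placeholders, and a single frequency-modulated sine stands in for full formant synthesis): it builds second-formant trajectories for two /d/ syllables, each gliding from roughly 1800 hertz to a different vowel frequency, so that the rendered signals share no initial acoustic segment.

```python
import numpy as np

SR = 16000  # sample rate in Hz (assumption)

def d_syllable_f2(vowel_f2_hz, glide_ms=50, steady_ms=200):
    """Second-formant (F2) trajectory for a /d/+vowel syllable: a glide from
    ~1800 Hz (the silent /d/ resonance) to the vowel's steady F2, followed by
    a steady portion. Durations and targets are illustrative."""
    glide = np.linspace(1800.0, vowel_f2_hz, int(SR * glide_ms / 1000))
    steady = np.full(int(SR * steady_ms / 1000), vowel_f2_hz)
    return np.concatenate([glide, steady])

def render(freq_track):
    """Render a frequency track as a frequency-modulated sine tone."""
    phase = 2 * np.pi * np.cumsum(freq_track) / SR
    return np.sin(phase)

# Rough vowel F2 targets (illustrative): /a/ lower, /i/ higher than 1800 Hz.
da = render(d_syllable_f2(vowel_f2_hz=1200))   # glide falls for /da/
di = render(d_syllable_f2(vowel_f2_hz=2300))   # glide rises for /di/

# The two signals share no initial acoustic segment: the /d/ "information"
# is carried by differently sloped glides, not by a common /d/ sound.
```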
But (i) is silent and forms no part of the acoustic signal. We have to conclude,
therefore, that the speech perception system infers (i) from (ii) and (iii), and inserts a
separate /d/ quale – an auditory experience that is the same in different /d/ syllables
despite their auditory variety – into sensory consciousness to mark the silent gesture it
infers – just as in the Kanizsa triangle, the visual system inserts qualia of increased
brightness to mark the triangle it infers. This is why we hear a /d/ followed by /a/,
though in the acoustic signal itself there are no such separate components. As I said
earlier, perceptual atomism would indicate that we hear /da/ in virtue of hearing /d/
(and /a/). But this seems wrong. In fact, the reverse seems to be the case: that is, the
speech perception system has to decode the entire syllable from the information
available in the acoustic signal before it can insert the auditory quale corresponding to
the consonant with which the syllable begins or ends.
One more example. Albert Bregman (1990, 27-29) describes an alternately falling
and rising pure tonal glide – in effect, something like a melody, but a particularly
predictable (and boring) one. Call this “melody” glides. Now, modify glides in two
successive ways. First, snip some portions out. When this is done, one hears, as one
might expect, a broken series of short tonal glides – not the entirety of what we have
called glides, but a series of disconnected fragments thereof. Call this bursts. Now, fill in
bursts by inserting broadband noise into the gaps. The result is a continuous acoustic
signal that consists of fragments of glides separated by noise. Call this patches. This
third pattern is not heard as bursts with noise in the gaps – though this, of course, is
what it is. Rather, the listener hears (or rather appears to hear) the original glides, with
noise superimposed over but only partially masking the parts that were snipped out. In
other words, the snipped out portions of glides reappear with the second modification,
though partially obscured by noise. The snipped out portions are phantoms! They are
heard, but they are not there. They are exactly the same in this respect as consonants,
and the “subjective contours” in the Kanizsa display: they are inserted by the sensory
system; all of these are phantoms.
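Bregman's manipulation can be summarized as a simple signal-construction recipe. The sketch below is an illustrative reconstruction, not Bregman's actual stimuli; the durations, frequencies, and noise level are placeholder assumptions.

```python
import numpy as np

SR = 16000  # sample rate (assumption)

def rising_falling_glide(seconds=4.0, lo=400.0, hi=1600.0, cycles=4):
    """'glides': a pure tone whose frequency alternately rises and falls."""
    t = np.arange(int(SR * seconds)) / SR
    # triangle wave between 0 and 1 controls the instantaneous frequency
    tri = 2 * np.abs(cycles * t / seconds - np.floor(cycles * t / seconds + 0.5))
    freq = lo + (hi - lo) * tri
    return np.sin(2 * np.pi * np.cumsum(freq) / SR)

glides = rising_falling_glide()

# 'bursts': snip out regular portions, leaving silent gaps.
bursts = glides.copy()
gap_len = int(0.1 * SR)                       # 100 ms gaps (placeholder)
for start in range(int(0.3 * SR), len(bursts), int(0.4 * SR)):
    bursts[start:start + gap_len] = 0.0

# 'patches': fill the gaps with loud broadband noise instead of silence.
patches = bursts.copy()
rng = np.random.default_rng(0)
for start in range(int(0.3 * SR), len(patches), int(0.4 * SR)):
    seg = slice(start, min(start + gap_len, len(patches)))
    patches[seg] = 2.0 * rng.standard_normal(seg.stop - seg.start)

# Listeners presented with 'patches' report hearing the original glides
# continuing behind the noise, although those portions are absent.
```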
The atomist’s part-over-whole principle would lead him to say that in the case of
patches, one hears the whole of glides in virtue of hearing the parts, including the
phantom parts. But the phantoms are actually hallucinations inserted by the auditory
system. Since they are inserted in virtue of reconstructing the whole, it is wrong to
think that the sensum of glides (when patches is played) is simply constructed out of
one’s sensa of the bits. One would not hear these parts but for the appearance of the
whole. It seems, thus, more reasonable to hold that one hears the snipped out portions only because the auditory system delivers them after reconstructing the whole.
In the examples reviewed in this section, the phenomenology is that of parts and
wholes. In each case, we sense a spatially or temporally extended whole as well as its
parts. Reflecting on this phenomenology, some philosophers have supposed that the
wholes are seen in virtue of their parts. This is wrong. The mistake is that the part-whole phenomenology is taken as indicating something about the processes that create
sensa.
Phenomenology cannot deliver this kind of knowledge.
Even if it is
authoritative about what sensa are present to consciousness, it does not even pretend to
reveal whence they came.
III. Sensory States: A Modular Approach
Atomism fails as a way of demarcating what we sense directly. But there is another
thread in traditional empiricism that yields a more promising line of inquiry.
The beginnings of this idea can be discerned in a passage from Berkeley’s First
Dialogue Between Hylas and Philonous:
We may, in one acceptation, be said to perceive sensible things mediately by sense – that is,
when, from a frequently perceived connection, the immediate perception of ideas by one
sense suggests to the mind others, perhaps belonging to another sense, which are wont to be
connected with them. For instance, when I hear a coach drive along the streets, immediately
I perceive only the sound; but from experience I have had that such a sound is connected
with a coach, I am said to hear the coach. It is nevertheless evident that, in truth and
strictness, nothing can be heard but sound, and the coach is not then properly perceived by
sense, but suggested from experience.
Berkeley’s way of drawing the distinction between what is immediately, or properly,
heard and what is only mediately heard appeals to a distinction between what is
delivered by the senses, and what is “suggested by experience” as a result of a
“frequently perceived connection”. (In what follows, I shall stick to the terminology of
‘direct’ and ‘indirect’ and will not use Berkeley’s term, ‘immediate’.)
Berkeley himself was unable to make much progress with this important idea; as
Gary Hatfield (1990, 42) says, he provides “precious little direct analysis of
‘suggestion’.”
But he does (as Hatfield points out) distinguish “suggestion” from
“judgements and inferences [made] by the understanding.” For a later author like
Hermann von Helmholtz, this was important. By Helmholtz’s time, the association of
ideas – a more contemporary term for what Berkeley calls “suggestion” – had become,
by contrast with judgement, an automatic mechanism beyond the voluntary control of
the perceiver.
In Berkeley, this was at best implicit in the distinction between
suggestion and judgement. (Note that “suggestion” is cross-modal in Berkeley’s usage:
the sound of the carriage suggests a certain visual and tactual object – here I am
indebted to Nicolas Bullot.)
Now, in Berkeley’s view, the association of ideas acts in a manner akin to what I
called “qualia-insertion” above. For instance, it accounts, according to him, for the
Moon appearing larger when it is low in the sky than when it is in the zenith, though it
projects the same-sized retinal image in both positions. Berkeley held that this Moon
Illusion arose out of associations of ideas established by past experience. Yet, like
sensation, it is involuntary and experience-modifying – the Moon really looks larger
when it is low in the sky, and there is nothing one can do to change how it looks. In
these ways, it operates differently from voluntary “judgements” – judging that
something is of a certain size does not make it look that way. Similarly, for Helmholtz,
things look the same colour in diverse conditions of illumination.
According to him,
this is because perceivers come, with experience, to be able to “discount the illuminant”.
This may or may not lead these perceivers to judge that things are constant in colour,
but it does result in things looking the same colour in different conditions of
illumination.
Today, most sensory psychologists treat the senses as active information-processing systems that are innately equipped to make inferences about the state of the external world, even in the absence of suggestion or acquired associations of ideas. Their attitude is that the sensory systems comprise not just receptors, but also data-processing pathways that extract information about the external world from receptoral
activation states. (See Matthen 2005 passim, but especially Part I for a detailed treatment
of this point.) As Peterson (2001, 175) succinctly says, Gestalt “grouping processes
[such as the one involved in the Kanizsa display] are visual processes.” Similarly, the
inferential processes at work in phonetic perception and in patches simply are auditory
processes.
Contrary to Berkeley and Helmholtz and others of their empiricist
persuasion, not all sensory awareness of external things should be attributed to post-sensory associations of ideas. (In a moment, I shall argue, however, that Berkeley's
insights were nonetheless extremely valuable.) From the state of receptoral arrays,
automatic and innate processes extract information (cf. Pylyshyn 1999) about constant
colour, three-dimensional shape, objective motion, phonemes, melodies, and so on.
This view of sensory processing renders part-whole approaches largely
irrelevant. Atomism loses its appeal, for there is very little reason to believe that the
content delivered by sensory modules concerns only minimal parts. Indeed, the holistic
phenomena discussed in the last section demonstrate that they deal with extended
things. But nor is there good reason for supposing that the whole has priority over the
parts, as Gestalt psychology maintains. In the Kanizsa display, both the foreground
triangle and the bright qualia that constitute it are delivered by the visual module.
Neither has priority over the other in the sense that it is the sensory material from
which the other is constructed. Similarly, speech perception delivers /da/ and its
constituent parts in a single act; melodic perception provides both the continuity of the
melody in patches, and the (false) awareness of the phantom notes. These wholes may
have priority within the sensory process, but not on the terms proposed by D1 (Codicil).
We are not aware of /d/ independently of /da/.
The best-attested current views have it, then, that sensory systems include data-processors as well as receptive organs. In effect, they analyse the data received by the
sensory receptors and infer the presence of external objects and objective features that
belong to them. Sensory awareness is the record of this activity. Perceivers are more or
less passive with respect to sensory awareness – they do not control its character. The
mistake that perceptual atomists made was to assume that perceivers voluntarily
assemble their awareness of temporally and spatially extended wholes from parts. We
cannot follow them in this assumption.
We need another way of identifying the
deliverances of the senses.
IV. Sensory States and Epistemic Operations
Though he was wrong about the role of associations of ideas, Berkeley’s insight is
nonetheless very important. There is a level at which sensation is simply an event in
consciousness to which the perceiver’s history does not contribute.
At this level,
sensation is an internal event that provides the perceiver with information about an
external event that has just occurred. For example, I might hear a loud noise, and thus
come to know that something has fallen off the counter in the next room. Or I see a blue
thing, and come to know that such a thing is in front of me. Let's call this the event-tracking function of sensation.
What the empiricists noticed was that sensation has another function as well.
Through automatic post-sensory processes, it contributes to an organism's representation of the world not just as it is at the moment of sensing, but as it is in a more extensive time-frame. The formation of associations of ideas is an example of this.
I put a fruit in my mouth and find it bitter; automatically an expectation forms within
me that fruits of that kind are bitter. Here, of course, sensation is contributing to my
knowledge of a general truth that is more or less permanently true. But it can also
contribute to knowledge of lasting but impermanent conditions of the world. For
instance, if I observe somebody putting something into a box, then automatically a
memory forms in me of where that object is or, less specifically, of there being
something in the box. (Such memories form even in very young infants who display
surprise when, because of an experimenter's trick, the box is found to be empty.) I
shall call this the record-keeping function of sensation.
What Berkeley draws our
attention to in the passage quoted above is that some of this record-keeping is automatic as well, just as the event-tracking function is.
The question remains: how should we demarcate the direct objects of sensation? The connection between sensation and automatic associations of ideas helps us here. In Berkeley, the direct objects of sense are defined by contrast with "suggestion", and recognized (though Berkeley does not explicitly say so) by their phenomenal character. In effect, Berkeley treats them as inputs to "suggestion". Could one not demarcate the contents of sensory
awareness by this aspect of their role? This is what I shall now attempt to do.
Consider Pavlovian, or classical, conditioning. Here, a naturally motivational
event – placing food in the mouth, in Pavlov’s classic experiment with dogs – is
repeatedly presented slightly after an event that is motivationally neutral with regard to
ingestion, namely a tone. As a result, the motivationally neutral event, or conditioned
stimulus, begins to elicit the same response – lubricating salivation and other digestive
preparations for the ingestion of food – as the naturally motivational event, or
unconditioned stimulus.
Pavlov (1904/1965) recognized that this association had psychological
significance:
When an object from a distance attracting the attention of the dog produces a flow of saliva,
one has ground for assuming that this is a psychical and not a physiological phenomenon.
When, however, after the dog has eaten something or has had something forced into his
mouth, saliva flows, it is necessary to prove that in this phenomenon, there is actually
present a physiological cause, and not only a purely psychical one which, owing to the
special conditions, is perhaps reinforced . . . (565-566)
Pavlov recounts how by cutting the sensory nerves of the tongue, and by “more radical
measures, such as poisoning the animal or extirpation of the higher parts of the central
nervous system,” one can “convince oneself that between a substance stimulating the
oral cavity and the salivary glands there exists not only a psychical but a purely
physiological connection.” What he means is that placing food in the mouth will
stimulate salivation even when the animal’s “psychical” faculties are “extirpated” –
thus one has reason to conclude that this connection is not routed through sensation or
cognition.
By contrast, the conditioned stimulus acts by an essentially psychical
connection; sensory input and brain functioning are necessary to establish the
connection between the tone and salivation.
In a closely related paradigm, operant conditioning (discovered by Edward
Thorndike 1898) is used to probe the perceptual discrimination abilities of animals. For
instance, a honeybee or moth might be presented with blue dishes filled with pure
water, and yellow dishes with sugar-water. Once they have had a chance to sample the
contents of each type of dish, it is found that they learn preferentially to sample the
yellow dishes to find the sugar water (which they happen to prefer). Here, a “psychic”
connection is established between an initially unmotivated impulse to feed from yellow
dishes, and a reward, the sugar water. Generally, experiments of this type are used to
show that the subject animals possess certain abilities of sensory discrimination, in this
case colour vision. Similar experiments can be and are conducted in auditory contexts in
order to map out the auditory similarity space of various animals. The idea is that
sensory discrimination is required for operant conditioning.
The exact character of these processes is somewhat contested, though
increasingly a representational view has become standard (Gallistel 1990 passim, but see
especially chapters 12-13; see also Mackintosh 1994b and Hall 1994).
In such a
representational view, we have three components.
First, there are certain perceptual processes: the animal’s perception of
coloured dishes, and of being rewarded when feeding from them.
Second, there is an innate process of record-keeping into which this
perception feeds: as Gallistel puts it, “something form[s] inside the organism
isomorphic to the dependencies to which the animal’s behaviour becomes
adapted” (1990, 385). Gallistel’s idea – which parallels Pavlov’s own (see
above) – is that there is an objective “dependency” between yellow dishes
and sugar water, or between the tone and the presentation of food. When an
organism experiences instances of this dependency, something gradually
forms within it that parallels the dependency – a “psychical” representation of
the dependency, to use Pavlov’s adjective. To this, we may add: if the
dependency concerns an individual object, such as a place, a natural object, or
another organism – if, for instance, the animal forms an expectation that food
is to be found in a particular location or in a particular receptacle – then the
psychical representation will denote that individual.
Third, there is a learned response: the expectation that the conditioned stimulus
will precede food, or (more accurately) the impulse to search in yellow
dishes for sugar water.
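This three-component picture can be illustrated with a toy associative model. The sketch below is not Gallistel's or Pavlov's own model; it is a minimal Rescorla-Wagner-style update, offered only to show how repeated pairings can automatically build an internal record that then drives a learned response. All parameter values are illustrative.

```python
# Toy sketch of conditioning as automatic record-keeping (illustrative only;
# a Rescorla-Wagner-style update, not a model endorsed in the paper).

def condition(trials, learning_rate=0.2):
    """trials: list of (stimulus_present, reward_present) booleans.
    Returns the trial-by-trial associative strength (the 'record')."""
    strength = 0.0
    history = []
    for stimulus, reward in trials:
        if stimulus:
            # 1. perception of the stimulus and (possibly) the reward;
            # 2. record-keeping: the internal value moves toward what was observed.
            strength += learning_rate * ((1.0 if reward else 0.0) - strength)
        history.append(strength)
    return history

# 3. learned response: once strength is high, the stimulus alone elicits
# anticipation (e.g., salivation to the tone, approach to the yellow dish).
pairings = [(True, True)] * 10 + [(True, False)] * 10   # acquisition, then extinction
print([round(s, 2) for s in condition(pairings)])
```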
In view of the role that the representation plays (in the second step above) in mediating
contingent dependency and appropriate impulse, conditioning can be regarded as an
automatic representation-forming or record-keeping process – an epistemic operation, as I
have called it elsewhere (cf. Matthen 2005, chapter 9). (When I speak of an epistemic
operation, I mean in the present context to be restricting myself to such automatic
processes, though I will usually add the qualifier for clarity.)
On the modular view, sensory content is the output of subpersonal sensory
processors. The replacement proposal that I would like to make is that sensory states
can equally well be demarcated by reference to epistemic operations of the sort outlined
in the preceding paragraph. Sensation is to be regarded as the input to core automatic
epistemic operations. For the sake of definiteness, I propose that it be regarded as the
unlearned input to classical or operant conditioning.
This yields the following pair of definitions:
D2. Sensory states are those that (a) provide unlearned input to classical or operant conditioning, and (b) are potentially or actually conscious.
(Clause (b) is optional: it is added in order to accommodate the intuition, if it
is held, that completely non-conscious inputs to conditioning and other
automatic epistemic operations should be excluded, since these may not
properly be regarded as sensory.)
Having defined sensory states in this way, we may add:
D3. Objects and features are directly sensed if they figure in the representational (or record-keeping) content of sensory states (as these states are defined by D2).
Notice how D2 and D3 complement D1, the definition of the direct objects of
sense. We find that (as remarked earlier) both the parts and wholes of certain sensory
entities may be directly sensed.
In the case of phonemic perception, the auditory
system analyses the shape of the acoustic signal incident on the basilar membrane,
diagnoses the articulatory gestures made to produce this signal, and produces qualia to
mark these gestures. On the assumption that the sensory system picks the whole as
well as the parts out of the ambient flux of acoustic energy, both are directly sensed, by
D2 and D3. It follows from D1 that neither is sensed in virtue of the other.
Much the same holds for melody. If being conditioned by a melody, or learning
it, or recognizing it, is a “holistic” process – not merely the sum of learning processes
targeted on the constituent notes – then there is a sensory state involving melodies, by
D2. But then melodies are directly sensed, by D3. And the same goes for individual
notes: if there is direct conditioning on notes, they too are directly heard. Some find it
intuitive to say that we hear the melody in virtue of hearing the notes. But this goes
against the evidence. In the first place, Bregman’s patches example indicates that at least
sometimes, the notes are created by the auditory system to complete an inferred
melody.
In this case, the notes are clearly supplied by the sensory system
simultaneously with the whole – just as in the Kanizsa triangle. But this makes it
plausible to think that when one hears an unobscured melody, one’s auditory system
comprehends the whole in some fashion.
V. Indications of Direct Sensing
Let us return now to the question of what things are directly heard.
In the light of D2 and D3, Berkeley’s assertion that “nothing can be heard but
sound” appears to be unsupported. Berkeley’s line of thinking about “suggestion” runs
broadly in parallel with the treatment of innate and automatic epistemic operations in
the preceding section. That is, he thinks of “suggestion” as an automatic extra-sensory
faculty, much as I have been presenting conditioning. Further, it is at least consistent
with his text to suggest that the direct objects of sense are those that initiate the
processes he calls “suggestion.” But he gives us no reason to suppose that sounds are
the only objects provided by audition to “suggestion”.
As we noted earlier, the
question cannot be settled a priori by reference to a doctrine such as atomism. So the
question arises: On what evidence does Berkeley think that we come to know that a
coach is present because it is “suggested from experience”? How can he be sure that
coaches are not direct objects of auditory experience? It is my view that he has no
evidence, and is not entitled to any confidence on this point.
What sort of evidence would support the proposition that this or that kind of
object is or is not directly sensed? Automaticity hypotheses are notoriously insecure,
and it is even more difficult to identify the inputs to hypothesized automatic learning
mechanisms. However, there are certain experimental and phenomenological marks
which indicate the automaticity of a sensory process. Here is an incomplete list of such
indications, some of which have figured in our argument up to this point.
Separate Conditionability As mentioned above, operant conditioning has traditionally
been used to probe the sensory discrimination abilities of animals.
In Berkeley’s
example, one would want to test how reactions to the sounds of a coach are reinforced,
persist, and are extinguished. If it is the coach itself that is the target of conditioning,
then the conditioned reaction should transfer to other auditory coach-related stimuli. In
other words, if coaches were inputs to conditioning, then we would expect that there
would be “constancy phenomena” related to objects of this sort – equivalences among
the different auditory stimuli that emanate from the same object, such that a response to
one transfers to equivalents.
Now, audition seems, by this criterion, to be concerned not so much with objects
as with their activity. The rumbling of a coach does not characterize the coach itself,
because the coach does not rumble when it is stabled and at rest. Rather, the rumbling
of the coach characterizes its activity when it is moving over a road. With regard to this
activity, constancies do apply – the rumbling will trigger conditioning, and will be
immediately recognizable, even when heard over other sounds or from a distance etc.
This indicates that audition does not track objects as such, but tracks rather their
activities and conditions.
Constancy Continuing with the line of thought concerning separate conditionability,
many complex objects are instinctively recognized as the same from different points of
view. In vision, three-dimensional objects can directly be recognized through rotation
(provided that the initial view is sufficiently good), and animals can, for instance,
recognize a receptacle within which their food is stored, even when it is viewed from
different attitudes or perspectives.
This implies that different stimulus arrays are
instinctively recognized as emanating from the same object – despite the different
outline an object projects when it is rotated, it is still recognizable. Since these objects
are instinctively reidentified, the epistemic operations will be indifferent to
substitutions of equivalent sensory arrays.
Here the nature of the equivalence marks
the kind of object that the sense modality targets. Vision targets shape, I have been
suggesting, and shape is a property of material objects. So one might conclude that
vision targets material objects. To figure out what kind of object audition targets, one
must reflect upon the nature of auditory equivalence.
Turning, then, to audition, melodies are analogues of three-dimensional objects
in vision. When a child repeats a melody you sing to him, he repeats it in a higher key.
Yet it is the same melody. Thus, as Daniel J. Levitin (2006) says:
A melody is an auditory object that maintains its identity in spite of transformations, just as
a chair maintains its identity when you move it to the other side of the room, turn it upside
down, or paint it red. (25)
Not only does the melody retain its identity, but we recognize it despite transformations.
In fact the lay person does not even notice when somebody sings a melody in a different
key. These are reasons to think that sufficiently brief melodic fragments – “phrases”, let
us call them – are directly perceived. (I don’t think that a whole tune or theme is
directly perceived; it is probably perceived in virtue of constituent phrases.)
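One way to see why transposition leaves a phrase recognizable is to note that it preserves the sequence of pitch intervals. The sketch below offers merely one candidate encoding, as an illustration; nothing in the argument turns on this particular representation.

```python
# Illustrative sketch: a melodic phrase as a sequence of pitch intervals.
# Transposing the phrase to another key changes every absolute pitch but
# leaves the interval sequence - one candidate invariant that a constancy
# mechanism could track - unchanged.

def intervals(midi_pitches):
    """Successive pitch intervals (in semitones) of a phrase."""
    return [b - a for a, b in zip(midi_pitches, midi_pitches[1:])]

phrase = [60, 62, 64, 65, 67]            # C D E F G (MIDI note numbers)
transposed = [p + 5 for p in phrase]     # the same phrase a fourth higher

assert intervals(phrase) == intervals(transposed)  # [2, 2, 1, 2] both times
```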
Along the same lines, think of timbre constancy. Albert Bregman says:
A friend’s voice has the same perceived timbre in a quiet room as at a cocktail party. Yet at
the party, the set of frequency components arising from that voice is mixed at the listener’s
ear with frequency components from other sources. The total spectrum of energy that
reaches the ear may be quite different in different environments. To recognize the unique
timbre of the voice we have to isolate the frequency components that are responsible for it
from others that are present at the same time. (1990, 2)
Both melody and timbre feed into learning processes. That they do so is, as the famous
photograph, “His Master’s Voice”, illustrates for timbre, independent of the notes that
(in the case of melody, sequentially, and in the case of timbre, simultaneously)
constitute them. A voice heard over an old phonograph is considerably distorted. Yet
the dog recognizes it. This indicates that it is heard directly, and not in virtue of its
parts.
Note, however, that constancy of melody or timbre is not necessarily constancy
regarding its bearer. The same singer can sing a different melody; she can sing at
different pitches and variable volume, and the timbre of her voice will change
accordingly. The target of audition is not, then, the singer – it is the song. (Do we,
however, hear voices directly? – Certainly, people are recognizable in this way, and this
may be a conspecific recognition mechanism, much as face recognition is. So audition
might sometimes be targeted on individual people.)
I must here enter a caveat hinted at in the introduction. I have been suggesting
that melodies, for instance, are objects for the auditory system on the grounds that the
system processes them for conditioning etc. even when they are transformed. I am thus
identifying objects by the activity of the auditory system – in effect, I am giving an
account of what the auditory system treats as an object, rather than an account of
objecthood independently of the system. Thus, I intend no claim about what auditory
objects there really are, and I am not suggesting that the auditory system's object-construction activity is explained by the function of tracking naturally unified objects.
Of course, it is possible to make some observations along these lines.
Earlier, I
suggested that audition tracks the activities of material objects. This spawns some
norms.
For it follows that it would be an error, or an illusion, if, listening to a
stereophonic recording of a train, one seemed to hear the movement of an object from
left to right. But piecemeal observations of this kind do not imply the existence of
general norms regulating the correctness of the auditory system’s activity. For example,
melodies can be sung in many parts, and thus they are not always activities of single
objects or single agents. It would be wrong to think that this lack of correspondence to
the activities of a single object implies there is some kind of illusion involved in their
appearing as unitary objects. There is no general norm on auditory object-construction
that would license such a conclusion.
Qualia Insertion States that are marked by characteristic qualia are almost certainly the
products of sensory systems. This was noted in the examples discussed in section II
above. The Kanizsa triangle looks brighter than its “background”. The phenomenal
awareness of brightness cannot have been inserted by what Berkeley calls “judgement”;
judgement does not have the power to alter visual phenomenology.
Nor does
“suggestion”, i.e., conditioning et. al. – though Berkeley probably thought that it did.
(See note 3 for some qualifications.) That is, I may judge how bright a certain surface is
by calculating what light is falling on it, but such a calculation will not make it look
brighter. (In the Kanizsa display, I can easily verify that the triangle is actually no
brighter than the background, but I am unable to adjust how bright it looks.) That the
triangle has a characteristic look indicates the involvement of a sub-personal sensory
system. Qualia insertion in the case of phonemes and “melodies” such as patches
indicates that these items too are directly processed by the auditory system.
Categoricity In the case of phonemes, a sharp phenomenological distinction is felt
between phonemes – that is, a consonant will be heard as /d/ or /t/ with no
intermediates. Yet, the spectrographic form of spoken sounds can vary smoothly. As
these sounds are varied, the observer will notice a sudden change from /d/ to /t/ at a
certain point. This mismatch between the smooth variation of acoustic signals and the
discrete nature of their sensory representation again reveals active sensory processing.
The discontinuity pertains to articulatory gestures, as we have seen, not to acoustics;
once the gestures have been decoded, the system creates phenomenological
discontinuity to accord with gestural discontinuity. Categoricity is a form of qualia
insertion.
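The mismatch between smooth stimulus variation and discrete percepts can be pictured with a toy mapping. The sketch below uses a single acoustic parameter and an arbitrary boundary value; both are illustrative stand-ins for the real, more complex cues that separate /d/ from /t/.

```python
# Toy illustration of categorical perception (values are illustrative, not
# measured): a continuously varying acoustic cue is mapped onto one of two
# discrete phoneme percepts, with an abrupt switch at a boundary.

BOUNDARY_MS = 30.0   # hypothetical category boundary (voice onset time)

def perceived_consonant(voice_onset_time_ms):
    return "/d/" if voice_onset_time_ms < BOUNDARY_MS else "/t/"

for vot in range(0, 61, 10):             # smooth physical continuum
    print(vot, "ms ->", perceived_consonant(vot))
# The output switches discretely from /d/ to /t/ at the boundary, with no
# intermediate percepts, despite the smooth variation of the stimulus.
```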
Non-decomposability Sometimes a quality is presented holistically even though the
perceiver has separate access to the elements from which a sensory system computes it.
In vision, face perception is a well-known example. A face has a look, and is reidentified by that look. Presumably we also possess visual access to the facial features
that the system uses in face-recognition – the shape of the face, the shape and distance
between the eyes, the shape and placement of the nose, etc. Yet, we have no instinctive
idea of how these component features combine to make up the look of a face. As it
happens, we know on independent grounds that face perception is modular – at least in
the sense that it (or some sub-process thereof) is localizable in the brain, automatically
computed, used to recognize other members of our species, and thus to update our
knowledge base concerning these individuals. (See Bergeron and Matthen 2007 for
discussion and references.)
Something of the same kind holds of voice recognition. A man and a woman can
sing the same melody in the same key at the same pitch. Yet each may be recognizable
as a man or as a woman. This ability has to do with the timbre of each voice (mostly
with formants above the second).
We have separate access to the overtones that
constitute timbre, but we cannot instinctively analyse the components of each voice.
We just sense that a voice sounds like a man’s, or like a woman’s. The same is true of
individual voices – your spouse’s voice just sounds different from your sibling’s. You
may well have separate access to what makes them different, but you do not
consciously weigh up these factors when you recognize a voice (over a telephone, for
instance). These are indications of automaticity.
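The holistic character of timbre can be illustrated by synthesis: two tones with the same fundamental, and hence the same pitch, but different overtone weightings differ audibly in timbre even though the listener cannot say which overtones are responsible. The harmonic weights below are arbitrary illustrations, not measurements of real voices.

```python
import numpy as np

SR = 16000  # sample rate (assumption)

def complex_tone(f0_hz, harmonic_weights, seconds=1.0):
    """A periodic tone: fundamental f0 plus overtones with given weights.
    Same f0 => same pitch; different weights => different timbre."""
    t = np.arange(int(SR * seconds)) / SR
    tone = sum(w * np.sin(2 * np.pi * f0_hz * (k + 1) * t)
               for k, w in enumerate(harmonic_weights))
    return tone / np.max(np.abs(tone))

# Two "voices" singing the same note (illustrative overtone profiles):
voice_a = complex_tone(220.0, [1.0, 0.8, 0.3, 0.1])   # brighter spectrum
voice_b = complex_tone(220.0, [1.0, 0.2, 0.6, 0.4])   # darker, hollower spectrum
# A listener hears the difference holistically, as a difference of timbre,
# without consciously weighing the individual overtone amplitudes.
```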
Familiarity Certain things can be reidentified in a non-decomposable, or holistic, manner
– faces, voices, melodies are examples discussed above. A feeling of familiarity is a sign
that there are already internal records concerning such individuals, and that the
incoming signal is linked to that record for purposes of update. Thus, it indicates that
originally the stimulus was directly sensed.
VI. Auditory Object Construction
There is a gap in Berkeley’s argument, as we saw. Nevertheless, we are now in a
position to appreciate that he was more or less on the right track in the particular case
that he discusses. This can be shown by a relatively direct use of D3, i.e., by inspecting
the form of auditory states.
The features or qualities that audition delivers to
consciousness are of the following sort: loud, soft, high, low, and so forth. Features of
this sort are not attributable to the coach or to its wheels. The squeak of the coach’s
wheels may be high and the rumble that it makes as it rolls along the road might be
low. However, the wheels themselves are not high, and the coach itself is not low. This
is more than a matter of language: loudness and pitch are not continuing features of
material objects, but of certain events in which these objects are involved.
Since auditory features characterize something other than the coach (or its
wheels), the target of auditory representation must equally be something other than the
coach. If the auditory state eventually leads to an update of internal records concerning
the coach, it is indirectly – through whatever is the bearer of the auditory features,
high/low, soft/loud, etc.
If it is right to say that we hear the coach because we hear
something high (some squeaks) and something low (some rumbles), then it seems right
to say that we hear the coach only because its presence is suggested by our hearing its
squeaks and rumbles. And squeaks and rumbles are sounds – they are the things to
which auditory qualities, or features, are attributed.
Berkeley seems to be correct
therefore: it is sounds that are properly or directly heard in this case, not a coach.
(Berkeley, of course, thinks of sounds as sensations, and, as we shall see, this is a
mistake.)
There is another kind of reason for thinking that we hear sounds rather than
material objects. It lies in a disanalogy with vision. The disanalogy is a fine one
though, and it lies within a broader analogy. Let us first examine the broad analogy.
Vision demarcates objects by marking their boundaries off from the background – the
Kanizsa triangle is an example of how this is done. That is, visual objects are most often
demarcated by figure-ground boundaries.
“A perceptual object is that which is
susceptible to figure-ground segregation,” say Kubovy and Van Valkenberg (2001, 102).
Albert Bregman (1990) has argued in great detail that the figure-ground operations of
vision are all repeated in audition, but with sound-streams, not material objects – i.e.,
with temporally (rather than spatially) extended figures – patches is but one of the
examples that he offers. (See also Kubovy and Van Valkenberg 2001 and Griffiths and
Warren 2004.) This is a strong reason for supposing that there are auditory objects, just
as there are visual objects.
Now, many of the direct objects of vision are material objects (see Matthen 2005,
chapter 12): that is, the visual system uses figure-ground segregation in ways to indicate
the boundaries of material objects. But the direct objects of audition are not material
objects: for as we saw in the preceding section, auditory constancy does not mark
material objects as such, but rather certain events and activities in which material
objects are involved; moreover, as we saw two paragraphs ago, auditory features are not
attributable to material objects. Nevertheless, there is some reason for thinking that
auditory objects often correlate with material objects. Audition tracks voices. When
somebody is singing, her voice sounds like a continuous and connected stream
emanating from the place where she is singing. So one might think that here, audition
is targeting a material object. It parses the voice as a single continuous stream, one
might think, because the voice emanates from a single person. Here, the evidence
suggests that the auditory system is tracking a material object: the sound appears to be
continuous and connected because it is analysed as coming from a single material
object.
Perhaps this is true, but there are certain strong disanalogies between visual and
auditory objects. In the first place, audition presents its objects as temporally composed.
Thus, if O is an auditory object that persists through time, and P(O,t) is what we hear of
O at t, then P(O,t) will often be sensed as a part of O. For example, a phoneme will be
heard as part of a word, a note as part of a melody, and so on. On the other hand, the
synchronic components of auditory objects are usually not sensed as distinct parts. For
example, suppose that a string quartet plays a chord by each instrument playing one
note of that chord: the resultant harmony has a holistic quality in which the separate
notes are not heard as separate parts. If, on the other hand, the parts of the string
quartet are not in harmony, they are heard as separate auditory objects – separate
voices. In this case, there is no one object of which they are all sensed as parts.
In vision, the situation is reversed. Thus, if O is a visual object that persists
through time (for instance, a blue sphere), and P(O,t) is what we see of O at t, P(O,t) is
generally not sensed as a part of O – what we see of a blue sphere at t is sensed as the
whole of the blue sphere, not a part thereof. On the other hand, spatial components of
the blue sphere that are seen at the same time – for example, the blue hemispheres – are
generally seen as parts.
This phenomenological difference arises from an important difference in the
principles of object construction used by these modalities – vision joins spatial parts
together to form unitary wholes; audition works on temporal parts in the same way.
Material objects are visually presented as continuing through time, but not as consisting
of a sequence of temporal parts.
Auditory objects, by contrast, are presented as
unfolding through time, but not as having simultaneous parts. The part-whole ontology
of vision parallels the ontology of material objects. Auditory objects, however, are of a
fundamentally different kind from visual objects – they don’t have synchronic, but do
have temporal parts. They are not material objects. Of course, some do hold that
material objects have temporal parts, and I don’t want to deny that they do: my point is
that the intuitive part-whole ontology of material objects is that of spatial parts, and that
of events is of temporal parts. The ontologies of vision and audition parallel these
intuitive ontologies.
Moreover, some of the principles of auditory figure-ground segregation work
against material object identification. We just saw that there are occasions when several
distinct voices appear to merge: namely when they sound in harmony. According to
the principle articulated by Kubovy and Van Valkenberg (see above), this is an auditory
object. For here too the sensed unity is attributable to figure-ground segregation in
accordance with a principle that plays a role in vision. In vision, strongly correlated
edges are seen as edges of the same object: if two edges run more or less in parallel, the
squiggles of one correlating closely with those of the other, they will be perceived as
two edges of a single figure seen against a ground (cf. Hoffman 1998, 60-61). In vision,
this principle is clearly targeted at material objects: the evolutionary rationale for the
principle is that edges can be correlated only if they have a common cause, namely a
single object. The merging of sound-streams that consistently harmonize with each
other is an instance of the same phenomenon – they are merged into a single object
because they are highly correlated. However, such dual-source sound streams are not
correlated with single material objects. They are emitted by two objects singing in
parallel.
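The grouping principle just described can be given a concrete, if crude, rendering. The following is a minimal sketch in Python, offered only as an illustration: the contour representation, the correlation threshold, and the function names are my own assumptions, not a claim about how the auditory system computes. Two streams whose pitch contours rise and fall together are merged into one object; an independently moving stream is kept separate.

import numpy as np

def should_merge(contour_a, contour_b, threshold=0.9):
    # Group two pitch contours if their moment-to-moment changes are
    # strongly correlated (an illustrative stand-in for the principle above).
    r = np.corrcoef(np.diff(contour_a), np.diff(contour_b))[0, 1]
    return r >= threshold

t = np.linspace(0, 1, 100)
melody = 440 + 40 * np.sin(2 * np.pi * 3 * t)                   # a wavering line
parallel_voice = melody * 1.25                                  # moves in strict parallel
independent_voice = 440 + 40 * np.sin(2 * np.pi * 5 * t + 1.0)  # its own contour

print(should_merge(melody, parallel_voice))      # True: merged into one object
print(should_merge(melody, independent_voice))   # False: heard as separate voices

On this toy criterion, as in the visual case of correlated edges, it is the correlation itself that does the grouping; nothing in the computation consults whether one or two material objects produced the contours.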
VII.
Located Sounds
The majority view is that we hear sounds. This accords with the facts about attribution
mentioned earlier: sounds are bearers of auditory features (though not the only such
bearers, as we shall see in a moment). I shall argue in the following section that sounds
are not the only things we hear. In the preceding section, it was already possible to
glimpse reasons for denying that sounds are the only audibles: figure-ground
segregation is often concerned with sound streams – temporally extended collections of
sounds, sometimes from different sources – not individual sounds. For the next few
pages, I want to ignore this. In the present section, I want simply to inquire into what
sounds are.
Berkeley thought that sounds are sensations. Others have thought that they are
vibrations of the air. However, Robert Pasnau (1999) has shown in a seminal paper that
these views are false. His argument is very simple:
We do not hear sounds as being in the air; we hear them as being at the place where they are
generated. Listening to the birds outside your window, the students outside your door, the
cars going down your street, in the vast majority of cases you will perceive those sounds as
being located at the place where they originate. At least, you will hear those sounds as being
located somewhere in the distance, in a certain general direction. But if sounds are in the air,
as the standard view holds, then the cries of birds and of students are all around you. (ibid.,
311)
Now, as will emerge, I don’t think that everything we hear is located in quite so simple
a manner as Pasnau suggests, and I certainly do not think that each audible thing
originates in a single place. Nevertheless, there is certainly an important sub-class of
audibles that have definite location in just the way that Pasnau indicates. Generally, but
not always, these arise from a discrete event – a bang or a whimper or a laugh. That
these sounds seem to have location is not merely an illusion or error of audition. For it
is clear that audition is generally quite accurate about the location of events from which
air-vibrations originate. By contrast with such originating events, air-vibrations are
diffuse; they have no confined location.
This indicates that audition is (often)
functionally targeted on the origins of the air-vibrations, not on the vibrations
themselves. If sounds are what we hear, then sounds are located events. Let’s call these
audibles located sounds.
Sounds are like colours, Pasnau urges: they are located at “their point of origin”.
They are not like odours: they do not fill the air (313). The same argument tells against
sounds being sensations. Sensations are not located outside your door. What you hear
may be, or may seem to be, outside your door; sensations, however, are not, and don’t
seem to be in any physical place – they are in the mind. (Again, the argument of section
IV tells against sounds being sensations: there are no constancies regarding sensations.
And auditory features like high and loud do not belong to sensations; rather, they belong
to things in the public domain.) Sensations are the hearings of things; they are mental
episodes. They should not be confused with the things that are heard. Sounds are not
episodes of audition; they are what we hear.
What then are sounds? Pasnau proposes that they are located in material objects
– they “either are the vibrations of such objects, or supervene on such vibrations,” he
says (316). This seems right. However, Pasnau occasionally implies – he slips here, I
think – that the sound we hear is a property of an object. “We should insist on putting
sound back where it belongs, among the various sensible properties of objects: among
colour, shape, and size” (324). Presumably, he is led to this view by thinking that a
vibration can be a property of an object. For instance, the vibration of a trumpet – its
sound, according to Pasnau – can be regarded as a property of the trumpet.
I do not wish to contest that vibration is a property of the trumpet. I do want to
note, however, that in general we sense both objects and their features – here, I mean
‘object’ to range more widely than material objects. For instance, we sense a particular
object – a disc in the corner – and sense of it that it is blue. Here, sensation represents a
subject-predicate connection between the disc and its colour.
Auditory sensations
represent subject-predicate connections too. We sense of auditory objects that they have
auditory features. If Pasnau were right about sounds being properties or features, then
we would sense of their subjects that they had these properties. For instance, when we
listen to Purcell’s Voluntary, we would be sensing of a trumpet (the subject) that it is
vibrating (the feature). But this implies by D3 that we (directly) hear the trumpet. And
this is precisely the conclusion that I argued against earlier. The trumpet is not high or
piercing; it is the sound that it emits that possesses these characteristics. Audition does
indeed represent subject-predicate connections, but it is the sound (not the maker of the
sound) that is the subject, and features like high/low, loud/soft that are predicates.
But this does not tell the whole story. I do not have the space to argue the point
in detail here, but audition tends to individuate located sounds as if the material object –
not just the place – from which they emanate is important (cf. O’Callaghan forthcoming
b). In addition to the evidence adduced in the last section – that audition tends to track
melodic lines and voices – there is this additional consideration. Audition is closely
allied to object characterization in vision. For instance, visual attention moves to the
heard location of a noise. Then, there is the ventriloquist’s effect: if there is a moving
mouth in the vicinity of an auditorily located sound, the auditory system relocates the
sound to the mouth. Again, there is the McGurk effect: the visually apprehended movement of
the mouth and tongue will influence what one hears somebody saying. In recognition
of the importance of material-object location in the demarcation of sounds, I will say
that sounds are not merely located, but object-located events.
(See O’Callaghan
forthcoming a, b, for a similar view.) This, I believe, does justice to the intuitions that
led Pasnau to suggest that they are properties of objects. But he is wrong to think that
sounds are attributes of individuals. Object-located events are not material objects; they
are events.
Consider then a chain of events, the last member of which is a vibration of the air. I
have in mind a chain like this:
Violinist reads music → violinist moves bow across string → string vibrates →
air vibrates at string-air interface.
In such a chain, there is a last member that is a cause of vibration propagated through
the air, but which is not itself a vibration of air. That last item is a sound. (It should be
noted that air-flow can be a sound – when one whistles, or when the exhaust of a jet
engine makes a roar. Air-flow is not in itself air-vibration.) In the chain shown above,
the third item is the last cause of vibration in the air. My claim (partially following
Pasnau), therefore, is that this vibration-in-the-string is a sound. This is the subject of
the representational content of the auditory sensation; the predicates are its pitch and
loudness. (Note that on this view, sounds can be silent: the bow can cause a string to
vibrate even in a vacuum when no air-vibrations are created. Pasnau defends this as
follows: “If x has the property of being a squeaker, it would seem peculiar to claim that
x loses that property when it is put in a vacuum for five minutes.
After all, it would
still be squeaking when you take it out of the vacuum.”iii)
It could be objected that sounds so identified lack the auditory characteristics we
normally attribute to them – loudness or softness, highness or lowness, and so on (cf.
Pasnau 319). Loudness and pitch, it might be said, belong to sound waves. This
objection does not seem correct. It is certainly true that vibrations in the air have
amplitude and frequency. Sounds, however, do not have amplitude and frequency;
they have loudness and pitch. The latter are, of course, closely related to amplitude and
frequency. But they are different. The loudness and pitch of a located sound are
definitely located in space and in objects, just as sounds are. The loudness of an aircraft
taking off is not simply an ambient quality; it is located somewhere near the aircraft,
just where the sound is. So it is more accurate, I think, to say that loudness and pitch
are qualities that the auditory system attributes to sounds on the basis of the amplitude
and frequency of the air-vibrations that these sounds cause. (Loudness, it should be
said, is a perspectival quality, something like the visual quality of being above or below
the subject, or looming over her. It is a property that a sound has from the perspective or
at the place where the auditor is. Thus: “That stereo is too soft/loud [from where I am
sitting].” Moreover, one can hear a loud sound – indeed, it can seem loud – even when
one is very far away and the amplitude of the sound waves that one’s ear receives is
small.)
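The distinction drawn here between physical magnitude and perceptual quality can be illustrated with a small sketch. The formulas below are standard textbook approximations (sound pressure level in decibels relative to the 20 micropascal reference, and the conventional equal-tempered pitch scale on which A4 = 440 Hz is note 69); the source level and the listening distances are invented for illustration and carry no empirical weight.

import math

def sound_pressure_level(pressure_pa):
    # Level in dB SPL relative to the standard 20 micropascal reference.
    return 20 * math.log10(pressure_pa / 20e-6)

def equal_tempered_pitch(frequency_hz):
    # Pitch expressed as a MIDI note number (A4 = 440 Hz = note 69).
    return 69 + 12 * math.log2(frequency_hz / 440.0)

# The same take-off heard from two listening positions: the pressure at the
# ear falls roughly as 1/distance, so the level received differs, which is one
# way of picturing loudness as a quality had from where the auditor stands.
source_pressure_at_1m = 200.0   # pascals; an invented figure for illustration
for distance_m in (50.0, 1000.0):
    p = source_pressure_at_1m / distance_m
    print(f"{distance_m:>6.0f} m: {sound_pressure_level(p):6.1f} dB SPL")

print(f"1046.5 Hz maps to note {equal_tempered_pitch(1046.5):.0f} (C6)")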
VIII.
Other Audibles
Located sounds are not the only things we hear directly.
Consider melodic phrases. These, recall, are relatively brief sequences of notes –
the opening ta-ta-ta-tum of Beethoven’s Fifth Symphony is an example – that are
recognized as wholes. Phrases are heard as possessing contour, metre, and rhythm. It
is sometimes said that they are heard in virtue of hearing their constituent notes – but
since a phrase transposed into a different key, or played at higher or lower volume,
retains its identity, this seems false. As well, as patches shows, how one hears a phrase
will influence how one hears the notes. As we argued in section II, IV, and V, this is a
case where both the whole and the parts are heard directly. Melodic phrases are not
object-located sounds. A phrase could be started by the violins and completed by the
cellos. The opening notes would then be in one object and the closing notes in another,
without the phrase ever occupying the places in between. Thus, object-located sounds
are not the only things we directly hear.
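The transposition point can be made concrete. In the sketch below, the motif is rendered as a sequence of note numbers purely for illustration (the encoding is my assumption); what survives transposition is the pattern of intervals, even though no individual note survives.

def intervals(notes):
    # The step from each note to the next, in semitones.
    return [b - a for a, b in zip(notes, notes[1:])]

motif = [67, 67, 67, 63]             # G G G E-flat: the ta-ta-ta-tum figure
transposed = [n + 2 for n in motif]  # the same phrase, a whole tone higher

print(intervals(motif))        # [0, 0, -4]
print(intervals(transposed))   # [0, 0, -4]: same contour, different notes
print(motif == transposed)     # False: no constituent note is shared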
Again, consider harmonies.
Contrast a chord sounded by a string quartet
playing different constituent notes from one sounded by a single instrument. The first
can be heard, but where is it? Certainly not in the four separate locations that the
component sounds occupy. The harmony is a single thing and is not splintered in such
a way. Perhaps it spreads into the large location that the whole quartet occupies. But
then it is nothing like an object-located sound, for no one object is the source of the air-vibrations by which the chord is heard, and no one object occupies the large location.
Here as with melodies, the individual notes may be heard directly, but they have no
priority: we hear the harmony directly – it has a distinctive and non-decomposable
quality that persists when new notes (in the same interval) are substituted for the old.
A fifth sounds like a fifth regardless of the key. Here, again, audition is not tracking
material objects, but constructing auditory objects, which have individuation conditions
and figure-ground segregation conditions of their own.
Melodic phrases and harmonies are auditory objects. They are not sounds, but
are composed of sounds. Often they are heard directly. So not everything we hear is a
sound. Some of the things we hear are composed of sounds.
Audibles come in other varieties too. A landscape or visual scene consists in
things you see at the same time, and their visually apprehended relationships to one
another. A soundscape or auditory scene consists in things that one hears together over
an extended period of time, and their auditorily apprehended spatial relationships to
one another. An auditory scene may consist then of several sound streams arrayed in
space; a sound stream of many sounds arrayed in time. (Question: do we ever hear
soundscapes directly? I am not sure of the answer to this question.)
Suppose that you are listening to “Yesterday” by the Beatles. You hear several
things here. You hear several melodic lines: a human voice, a bass guitar, and a string
quartet. The string quartet is sometimes one voice, sometimes two; and sometimes a
single voice will pass from one instrument to another. Each of these is a temporally
extended unit – a sound stream. As well, the human voice is not just a melody. It utters
a sequence of phonemes; and these form words and meaningful text. Moreover, you
hear a number of harmonies or chords: the backup, the bass guitar etc. are responsible
for these. Lastly, you hear individual tones. These things are not all that you hear
when you are listening to “Yesterday” – you hear various environmental sounds as well
– but let us pause to consider these elements.
Let’s consider the melodic lines first. The separation of these lines from the total
energy flux at the ears is no mean feat.
Bregman (1990, 3) makes the point by
contrasting the following displays:
AI CSAITT STIOTOS
Ai CsAiTt StIoToS
The top line makes little sense; the visual cues provided in the second line help decode
the message.
The signal that the ear receives when listening to a many-part
composition like “Yesterday” is much like the first line; yet what we actually seem to
hear is segregated in the manner of the second. A “baby starts to imitate her mother’s
voice,” Bregman says. “However, she does not insert into the imitation the squeaks of
her cradle that have been occurring at the same time” (5).
This is the auditory
equivalent of the kind of separation that occurs in the second line of the display shown
above.
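The letters of the display appear to interleave two short messages, and the sketch below renders the segregation problem in this toy form: without a grouping cue the mixture reads as gibberish, while a cue (here, simply taking every other letter) recovers each stream. The cue and the messages are illustrative assumptions; nothing here models the auditory computation itself.

from itertools import zip_longest

def interleave(message_a, message_b):
    # Mix two messages letter by letter, as in the display above.
    return "".join(x + y for x, y in zip_longest(message_a, message_b, fillvalue=""))

def segregate(mixture):
    # A toy grouping cue: assign alternate letters to alternate streams.
    return mixture[0::2], mixture[1::2]

voice, cradle_squeak = "ACATSITS", "ISITTOO"
mixture = interleave(voice, cradle_squeak)
print(mixture)             # AICSAITTSTIOTOS: little sense without grouping
print(segregate(mixture))  # ('ACATSITS', 'ISITTOO'): each stream recovered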
Auditory properties are distributed among entities in a soundscape.
Frank
Jackson (1977, 65) drew attention to a certain kind of structure that obtains in visual
images. The image of a green square to the left of a red circle is different from that of a
green circle to the left of a red square. Jackson points out that this shows that the visual
image doesn’t just contain red, circle, green, and square. Rather these properties are
bound together in determinate ways: the red either to circle or to square, and similarly for
the green. Similarly, as Albert Bregman says:
Suppose there are two acoustic sources of sound, one high and near and other low and far.
It is only because of the fact that nearness and highness are grouped as properties of one
stream and farness and lowness as properties of the other that we can experience the
uniqueness of the two individual sounds rather than a mush of four properties. (1990, 11)
In the visual stream, the properties are not merely co-located; they are co-predicated
(Matthen 2005, 272-277). The redness belongs to the circle or the square; it isn’t merely
co-located with these things – thus, circularity individuates the subject; redness is a
feature attributed to this subject. Similarly, here: a melodic line is something that falls
or rises, and possesses timbre and loudness, etc. The melodic line is individuated by
figure-ground segregation; various auditory features are attributed to it.
It is not
merely that these qualities happen to be associated with the melodic line; they are
predicated of the melodic line. (This, I argued in the previous section, is what Pasnau
overlooks when he argues that sounds are properties of objects.) The separate sound-streams heard in “Yesterday” each have different features; this, as Bregman suggests, is
how the auditory scene is constituted.
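The difference between mere co-occurrence and co-predication can be pictured with a toy data structure. The dictionary representation below is an illustrative assumption, not a proposal about auditory encoding: a flat pool of the four properties fixes nothing, whereas properties bound to their streams settle which sound is the near, high one.

unbound = {"high", "near", "low", "far"}   # a mere "mush of four properties"

bound = [
    {"stream": "stream 1", "pitch": "high", "distance": "near"},
    {"stream": "stream 2", "pitch": "low",  "distance": "far"},
]

# Only the bound representation answers the question the unbound one cannot:
# which stream is the near one, and what pitch does it have?
for s in bound:
    print(f'{s["stream"]}: {s["pitch"]} and {s["distance"]}')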
Notice that the different objects in an auditory scene consist of different
groupings of object-located sounds. A chord consists of notes sounded simultaneously,
whether by the same or by different objects. A melody consists of successive notes
sounded by the same voice, or by different voices in sequence, or a group of voices in
harmony. Thus, one might say that a scene has certain elemental parts which get
combined in different ways to form different extended objects. These extended objects
overlap to some extent; there are elemental parts that belong to more than one of them.
The elemental parts are object-located sounds in the sense of the preceding section; the
extended wholes are not. These complex part-whole relations involving individual
sounds no doubt encourage a form of auditory atomism, the doctrine that sounds are
the only direct objects of audition. But I have argued throughout this paper that such
atomism is misguided. The minimal parts of everything we hear are object-located
sounds, but we do not hear all other things in virtue of hearing object-located sounds.
On the other hand, it is certainly true that the auditory system does segment the
ambient acoustic energy flux received by the ears into complex overlapping wholes.
The important point to keep in mind is that these wholes are not heard in virtue of their
parts.
IX.
Aesthetic Appreciation and the Variety of Audibles
Consider, in conclusion, the following oddity. It is difficult to attend to contrapuntal
harmonies at the same time as one attends to melodies. It is, of course, easy to hear
harmonies when a single melody is played as a sequence of chords. But when two
melodic lines are played, it is hard to hear the chords formed by the simultaneous
sounding of notes across the two lines. For example, it is harder to hear the chords
formed by the notes of the singer together with the violin in “Yesterday” than to hear
the chords that sound when the string quartet is playing together as one sound-stream.
This is puzzling. It’s not that one has difficulty apprehending a plurality of auditory
features simultaneously. For example, one can listen to the contour of a melodic line at
the same time as one attends to its beat, metre, and internal harmonies. Why then is it
hard to hear contrapuntal harmony at the same time as one attends to the separated
melodic lines of violin and voice?
Here again, a parallel with vision throws light on the puzzle. In vision, it is hard
to attend to what one might call accidental relations between two visual objects, relations
that arise out of some peculiarity of our viewpoint.
When one is composing a
photograph of a friend, one fails to see that there is a lamp-post or pillar behind her,
which in the photograph will appear to sprout from her skull. Looking at paintings
from the 16th century, one fails to notice – unless one has read about it in advance – that
the main figures are arranged in a triangle. Why is this? Because such juxtapositions
are accidents of one’s point of view, and the visual system disregards accidental
juxtapositions, since they are not germane to the actual spatial relations that obtain
between the objects.
That is, the juxtaposition of lamp-post and head, or the
arrangement of figures in a triangle, would be disturbed if one shifted one’s position a
little bit. By contrast, conjunctive features that constitute an object – the attachment of
head to shoulders, the triangularity of a Yield sign – persist with changes of point of
view. Automatic visual processes tend to ignore accidental conjunctions and highlight
intrinsic conjunctions.
In audition, something similar seems to be at work. Harmonic relationships that
would be obvious when they are present within a melodic line are difficult to perceive
when they hold between notes in distinct melodic lines. Why is this? Because the
within-object relations are constitutive of the object, while the cross-object ones are
accidental (or would be if they hadn’t been created by an artist). Cross-object relations
are essentially a product of the situation, and could easily have been different.
Automatic auditory processes ignore them, just as automatic visual processes ignore
cross-object juxtapositions. It takes close attention to perceive these relationships, and
prima facie this indicates that they are indirectly heard – i.e., heard in virtue of hearing
the individual melodic lines by means of a post-sensory process.
In addition to the musical entities considered above, “Yesterday” also
incorporates phonetic streams. These too have to be extracted from the total acoustic
flux. Imagine a person singing a prolonged ‘ah’ at high C (C6). Now imagine her
singing ‘oh’ at the same note. Obviously, one can readily tell the difference – but how?
What is the sonic difference between an ‘ah’ and an ‘oh’ sung at the same frequency? It
turns out that the difference lies in the timbre of the two vowels – namely, in the
frequency of the second formant (Cogan 1969, Handel 2006, chapter 8). This means that
if the singer were to sing a diphthong that went from ‘ah’ to ‘oh’, while holding a single
note, the phonetic system would pick up the change of vowel, though the melody –
which resides in the fundamental – would consist of a single note. In this way, the
phonetic line will be separate from, though it will complement, the melodic line in sung
music. Again, consonants will be perceived as separate from the main melodic line,
since, as we saw in section II, they involve glides in the second formant. Consonants
are perceived as points of attack, or metrical elements, rather than as part of a smooth
melodic line. The melodic line possesses, then, a somewhat different contour than the
phonetic stream. Yet, we are not readily aware of these departures from parallelism.
We are not readily aware of the changes of timbre as changes of timbre – we hear only
transitions from one spoken phone to another.
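A crude synthetic sketch of this point follows. The formant values, the Gaussian spectral envelope, and the use of middle C rather than the high C of the example are all simplifying assumptions made purely for illustration; real vowel spectra are far richer. The two tones share a fundamental, and so would occupy the same place in a melodic line, yet their spectral envelopes, and hence their timbres, differ.

import numpy as np

def vowel_like_tone(f0, formant_hz, duration=0.5, rate=44100, harmonics=20):
    # A harmonic tone whose partials are weighted by their closeness to a
    # single "formant" peak; a rough stand-in for vowel colouring.
    t = np.arange(int(duration * rate)) / rate
    tone = np.zeros_like(t)
    for k in range(1, harmonics + 1):
        freq = k * f0
        weight = np.exp(-((freq - formant_hz) ** 2) / (2 * 300.0 ** 2))
        tone += weight * np.sin(2 * np.pi * freq * t)
    return tone

ah_like = vowel_like_tone(f0=262.0, formant_hz=1200.0)  # 'ah'-ish envelope
oh_like = vowel_like_tone(f0=262.0, formant_hz=850.0)   # 'oh'-ish envelope

# Same fundamental, so the melody registers a single sustained note; the
# differing envelopes are what a phonetic system could exploit to hear a glide.
print(np.allclose(ah_like, oh_like))   # False: the spectra (timbres) differ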
Objects of aesthetic appreciation always contain accidental relations of this sort.
They are deliberately inserted by the creator, and a full grasp of the aesthetic properties
of a work of art demands an appreciation of these relations. In his essay, “The Work of
Art as an Object”, Richard Wollheim (1973) asserts that “modern art, or the painting of
our age, exhibits, across its breadth, a common theory . . . according to which a work of
art is importantly or significantly, and not just peripherally, a physical object” (118).
What he means is something like this. Pictures depict objects by means of spots of
coloured dye on a flat surface. When you look at a picture, you see both the flat array of
dye, and the depicted objects. In realistic art, there is a strong tendency to see the
depicted objects. But in documentary photography, for instance – photography whose
main concern is to document the objects – one tends not to see the
photograph itself. That is, one attends not to the coloured marks on the surface of a
piece of paper, but to the objects that these coloured marks depict. But in “modern art”,
Wollheim says, whether depictive or not, the artist wants to draw your attention to the
flat array.
It was . . . optional for Velazquez or for Gainsborough whether they expressed their
predilection for the medium. What was necessary within their theory of art was that, if they
did, it found expression within the depiction of natural phenomena. For, say, Matisse or
Rothko, the priorities are reversed (ibid., 120)
Without wishing in any way to challenge Wollheim’s sketch of priorities, it is
worth pointing out that the tension between the medium and the view of “natural
phenomena” is a feature of every work of art.
That is, every artistic depiction
deliberately inserts accidental correlations into the medium, and conveys meaning by
these correlations. In painting, principles of composition are features of the medium:
the devices of arrangement that a painter uses to highlight and decorate a work of art
are juxtapositions that would be absent in a natural view simply because slight changes
of perspective or position would “rearrange” the objects. Thus, to appreciate how a
painter has composed a picture is to attend to what I have called accidental relations in
his scene – to the triangular composition, or the relationships of size and colour, etc. In
real life, the fact that one figure is in sunlight and another is in shade is of purely
accidental significance – that relationship could change in a minute or two. Automatic
visual processes thus disregard the difference and compensate for the colour and
brightness changes consequent upon it. In art, the fact that somebody is in the light
conveys the artist’s attitude toward that person. To appreciate this attitude, one has to
attend to something that vision itself tends to ignore.
An analogous point holds of auditory works of art. A work of music or a
recitation creates an auditory scene that is not natural – it is a range of auditory objects
plucked out of the flux of acoustic energy as commanded by the composer or
performer; it is not a range of objects that can be found in nature. Here too there are
two kinds of thing that one hears and to which one attends: the natural ones consisting
of the voices, vocables, and other sound-streams that emanate from the performers, and
the unnatural acoustic scene that they constitute. Crucial to appreciating these works as
aesthetic objects is appreciating accidental relations between different auditory objects
in this scene – how the rhythm of spoken words interacts with the melody, the
contrapuntal harmonies, the merging and separation of voices in a piece. All of these
relations are possible only because of the variety of auditory objects that we have
discussed in this article. The artist creates these objects and makes them stand in
accidental relations. To hear and understand these accidental relations is of the essence
of auditory appreciation.
LITERATURE CONSULTED
Bergeron, Vincent and Matthen, Mohan (2007) Assembling the Emotions. Canadian
Journal of Philosophy
Bregman, Albert S. (1990) Auditory Scene Analysis: The Perceptual Organization of Sound
Cambridge Mass: Bradford Books, MIT Press.
Casati, Roberto and Dokic, Jerome (2005) Sounds. Stanford Encyclopedia of Philosophy
(Fall Edition), Edward N. Zalta (ed.),
URL=<http://plato.stanford.edu/archives/fall2005/entries/sounds/>.
Cogan, Robert (1969) Toward a Theory of Timbre: Verbal Timbre and Musical Line in
Purcell, Sessions, and Stravinsky. Perspectives of New Music 8: 75-81.
Firth, Roderick (1949) Sense-Data and the Percept Theory. Part I. Mind 57: 434-65.
Gallistel, C. R. (1990) The Organization of Learning Cambridge Mass: Bradford Books,
MIT Press.
Griffiths, Timothy D. and Warren, Jason D. (2004) What is an Auditory Object? Nature
Reviews Neuroscience 5: 887-892.
Hall, Geoffrey (1994) Pavlovian Conditioning: Laws of Association. In Mackintosh
1994a: 15-43.
Handel, Stephen (2006) Perceptual Coherence: Hearing and Seeing New York: Oxford
University Press.
Hatfield, Gary (1990) The Natural and the Normative: Theories of Spatial Perception from
Kant to Helmholtz Cambridge MA: Bradford Books, MIT Press.
Hickok, Gregory and Poeppel, David (2007) The Cortical Organization of Speech
Processing. Nature Reviews Neuroscience 8: 393-402.
Hoffman, Donald D. (1998) Visual Intelligence: How We Create What We See New York:
W. W. Norton.
Jackson, Frank (1977) Perception: A Representative Theory Cambridge England:
Cambridge University Press.
Kanizsa, Gaetano (1976) Subjective contours. Scientific American 234: 48–52.
Kubovy, Michael and Van Valkenberg, David (2001) Auditory and Visual Objects.
Cognition 80: 97-126.
Kumar, S., Stephan, K. E., Warren, J. D., Friston, K. J., and Griffiths, T. D. (2007)
Hierarchical Processing of Auditory Objects in Humans. PLoS Computational Biology
3: e100. doi:10.1371/journal.pcbi.0030100.
Levitin, Daniel J. (2006) This Is Your Brain on Music: The Science of a Human Obsession
New York: Dutton.
Lewis, David (1966) Percepts and Color Mosaics in Visual Experience. The Philosophical
Review 75: 357-68.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., and Studdert-Kennedy M. (1967)
Perception of the Speech Code. Psychological Review 74: 431-61.
Mackintosh, N. J.
(1994a) Animal Learning and Cognition. San Diego: Academic Press.
(1994b) Introduction. In Mackintosh 1994a: 1-13.
O’Callaghan, Casey
(forthcoming a) Sounds Oxford: Oxford University Press.
(forthcoming b) Seeing What You Hear: Cross-Modal Illusions and Perception.
Philosophical Issues.
Pasnau, Robert (1999) What is Sound? Philosophical Quarterly 49: 309-324.
Peterson, Mary A. (2001) Object Perception. In E. B. Goldstein (ed.) Blackwell Handbook
of Perception Oxford: Blackwell: 168-203.
Pavlov, Ivan Petrovich (1904/1968) The 1904 Nobel Lecture, excerpted in a translation
by W. Horsley Gantt in Richard Herrnstein and Edwin G. Boring (eds) A Source
Book in the History of Psychology. Cambridge: Harvard University Press.
Pylyshyn, Zenon (1999) Is Vision Continuous with Cognition? The Case for Cognitive
Impenetrability of Visual Perception. Behavioral and Brain Sciences 22: 341-423.
Recanzone, Gregg H. (2002) Where was that? – Human Auditory Spatial Processing.
Trends in Cognitive Sciences 6: 319-20.
Thorndike, Edward L. (1898) Animal Intelligence: An Experimental Study of the
Associative Processes in Animals. Psychological Review Monograph Supplement, 2 (no.
4): 1-109.
NOTES
i
Many thanks to Nicolas Bullot and Casey O’Callaghan for detailed written comments
and extensive discussion of issues covered in this article.
ii
It is generally thought that one senses an extended object by sensing some, but not
necessarily all, of its parts. For example, one sees a cube by seeing its facing surfaces.
Thus, one does not have to see all of a thing’s minimal parts in order to see it. The
empiricist doctrine takes note of this by maintaining that one sees something in virtue
of seeing those minimal parts that are in view.
iii
It is interesting that on this view, it could be held that silent articulatory gestures are
sounds. They are events that would have caused propagated vibrations of the air if the
articulatory tract had been open.
So it is possible to argue, with Aristotle, that
consonants are sounds after all. Of course, the silent articulatory gesture that the speech
perception system decodes as /b/ would sound quite different if the articulatory tract
had been open.