Telluride Workshop

Workshop Report: Object Counting Project
Serin Atiani and Sridhar Krishna Nemala
When we are presented with a rich acoustic scene, we involuntarily parse it into individual auditory events. This analysis relies on the decomposition of the sensory input into auditory objects or streams. The working hypothesis is that the auditory system, like the visual system, groups low- to intermediate-level cues or features to form our perception of auditory objects or streams, which makes temporal coincidence key to this grouping. The organization of sound is therefore expected to include acoustic events spanning different scales of time and frequency.
The problem of object formation in the auditory system is multilayered, and can be informed by exploring the limits of our perception of auditory objects/streams. What is the maximum number of auditory objects/sources we are able to identify in an acoustic scene? How does that inform us about the cues or features important to the formation of these auditory objects/streams? And how do we in turn use these acoustic features to model our ability to identify different auditory objects/sources?
We started this project by designing a psychoacoustic experiment to address this question. We used a combination of speech sounds by female and male speakers; natural sounds such as water, wind, fire, and animal and human vocalizations; and some non-natural sounds generated by machinery and equipment used in daily life. The existing literature puts the maximum number of perceivable objects/sources between 3 and 6 []. We synthesized a set of auditory scenes containing combinations of speech sounds only, natural sounds only, non-natural sounds only, and a mixture of all of these. We varied the number of sound objects/sources in these scenes between 2 and 6. We asked listeners to report the number of objects/sources heard in each of the 14 different auditory scenes under two conditions: the first allowed them to hear each auditory scene only once, and the second allowed them to play the sound as often as they wanted before reporting the number of objects/sources. Results of both conditions are shown in Fig. 1.
Fig. 1
A number of observations can be made from the results of the experiment. First, there is a limit of, on average, 4 objects that can be perceived in an auditory scene; this was observed when people were allowed to replay the sound multiple times. That limit dropped when people were allowed to play the sound only once. Second, in the cases of Eng4, Eng5, and Eng6, where the sources consisted of 4, 5, and 6 speakers of English sentences, people reported hearing only 3 speakers, and there was very little variability in the responses as the number of speakers increased; interestingly, participants' performance did not improve when they were allowed to hear the samples more than once. For the combinations of natural sounds, as well as of non-natural sounds, increasing the number of sources resulted in a decrease in the reported number of objects, which might result from two or more objects/sources fusing. The opposite seems to take place when the combination included speech, natural sounds, and non-natural sounds, which might be due to the different profiles of these sounds.
Fig. 2
To model how auditory objects/streams form and how we recognize a certain number of objects/sources, we needed to identify the relevant low- to intermediate-level features that are combined by the auditory system to form objects/sources. Insight into what these features are was taken from the literature and confirmed by the psychoacoustic experiment we conducted here in Telluride. The features we worked with are stimulus frequency, temporal modulation rate (rate, in Hz), spectral modulation rate (scale, in cycles/octave), pitch, and the transience of the sound. We used the auditory cortical model [] to extract the frequency, scale, and rate of the stimuli (Figure 2). We used a pitch extraction algorithm to model the pitch of the stimuli, and we high-pass filtered the spectrogram of the sound to emphasize its transients.
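The transient-emphasis step can be sketched as follows. This is a minimal illustration, not the cortical model itself: it computes a magnitude spectrogram and high-pass filters each frequency channel along time, so that fast onsets and offsets dominate. The window, hop, and cutoff values are illustrative assumptions, not parameters from the project.

```python
import numpy as np
from scipy import signal

def transient_emphasis(x, fs, win_ms=25.0, hop_ms=10.0, cutoff_hz=20.0):
    """Sketch: emphasize transients by high-pass filtering a magnitude
    spectrogram along the time axis (the full cortical rate-scale
    analysis is not reproduced here; all parameters are assumptions)."""
    nper = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    f, t, S = signal.stft(x, fs=fs, nperseg=nper, noverlap=nper - hop)
    mag = np.abs(S)
    # frame rate of the spectrogram (frames per second)
    frame_rate = fs / hop
    # first-order Butterworth high-pass applied along time in each channel
    b, a = signal.butter(1, cutoff_hz / (frame_rate / 2), btype="high")
    transients = signal.filtfilt(b, a, mag, axis=1)
    return f, t, transients
```

Slowly varying spectral energy is attenuated by the filter, while sharp changes (onsets, clicks) pass through, which is the sense in which the high-passed spectrogram "emphasizes" transients.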
Fig. 3
We computed the coherence matrix at every instant in time between the scale-frequency channels, integrating across different rates. We computed the eigenvectors and eigenvalues of this symmetric matrix and used them as an indicator of the complexity of the scene. The simpler the scene (the lower the number of objects/sources), the lower the rank of the matrix and the higher the values of the first few eigenvalues; the more complex the scene (the higher the number of sources), the smaller the contribution of the first few eigenvalues to the overall sum of eigenvalues. For a more intuitive relationship, we took 1/(ratio of the first few eigenvalues to the sum) as a measure and plotted it against the number of objects/sources in each of the 19 different auditory scenes, as shown in Figure 3.
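The complexity measure described above can be sketched in a few lines. This is an assumed minimal form: random data stands in for the cortical model's scale-frequency channel outputs, the coherence matrix is taken to be the channel-by-channel correlation matrix, and the choice of k (how many "first few" eigenvalues) is an assumption.

```python
import numpy as np

def complexity_measure(features, k=3):
    """Sketch of the scene-complexity measure: given an
    (n_channels, n_frames) array of scale-frequency channel signals,
    form the symmetric coherence (correlation) matrix, take its
    eigenvalues, and return 1 / (fraction of the eigenvalue sum
    carried by the first k eigenvalues).  Fewer sources -> more
    coherent channels -> the first eigenvalues dominate -> a smaller
    measure; more sources -> a larger measure."""
    C = np.corrcoef(features)            # symmetric coherence matrix
    evals = np.linalg.eigvalsh(C)[::-1]  # eigenvalues, descending
    ratio = evals[:k].sum() / evals.sum()
    return 1.0 / ratio
```

When all channels carry one common source, the coherence matrix is close to rank 1, the first eigenvalue carries almost the entire sum, and the measure approaches 1; as channels decorrelate (more independent sources), the eigenvalue spectrum flattens and the measure grows.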
Fig. 4
This figure shows a positive correlation between the number of objects/sources in an auditory scene and 1/(contribution of the first few eigenvalues to the sum), i.e., a negative correlation between the number of objects and the contribution of the first few eigenvalues to the sum of all eigenvalues. We fitted this relationship with a linear equation. This relationship allows us to predict the number of objects from the contribution of the eigenvalues to the sum of all eigenvalues.
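The prediction step amounts to an ordinary least-squares line. The numbers below are hypothetical placeholders standing in for the measured (measure, object-count) pairs; the actual values come from the experiment, not from this sketch.

```python
import numpy as np

# Hypothetical stand-in data: the complexity measure computed for each
# synthesized scene and the true number of objects/sources in it.
measure = np.array([1.1, 1.4, 1.8, 2.1, 2.6])
n_objects = np.array([2, 3, 4, 5, 6])

# Least-squares linear fit: n_objects ~ slope * measure + intercept
slope, intercept = np.polyfit(measure, n_objects, 1)

def predict_count(m):
    """Predict the number of objects/sources from the complexity measure."""
    return slope * m + intercept
```

A positive fitted slope corresponds to the positive correlation reported above: scenes with a larger measure (flatter eigenvalue spectrum) are predicted to contain more objects/sources.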
This project is not complete, but it is a very promising pilot for a more comprehensive project to recognize the number of auditory objects/sources and to model how biological systems form auditory objects. Future work will include the following:
- Compiling a larger list of auditory features relevant to object formation and counting
- Studying possible algorithms for grouping these features into auditory objects
- Developing a more comprehensive set of auditory scenes that includes different types of objects and stimuli representing as diverse a range of situations as possible
- Developing a large test set of auditory scenes to evaluate the model
- Investigating the relationship between the limits of the model and the limits of human behavior: whether the limits are systematically related, and whether they can be related to certain auditory scenes, objects, or sources