Workshop report for the Object Counting project
Serin Atiani and Sridhar Krishna Nemala

When we are presented with a rich acoustic scene, we automatically parse it into individual auditory events. This analysis of the auditory scene relies on the decomposition of the sensory input into auditory objects or streams. The working hypothesis is that the auditory system, like the visual system, groups low- to intermediate-level cues or features to form our perception of auditory objects or streams, which makes temporal coincidence key to this grouping. The organization of sound is therefore expected to include acoustic events spanning different scales of time and frequency.

The problem of object formation in the auditory system is multilayered, and it can be informed by exploring the limits of our perception of auditory objects/streams. What is the maximum number of auditory objects/sources we are able to identify in an acoustic scene? How does that inform us about the cues or features important to the formation of these auditory objects/streams? And how do we in turn use these acoustic features to model our ability to identify different auditory objects/sources?

We started this project by designing a psychoacoustic experiment to address these questions. We used a combination of speech sounds from female and male speakers; natural sounds such as water, wind, fire, and animal and human vocalizations; and some non-natural sounds generated by machinery and equipment used in daily life. Existing literature on this issue puts the maximum number of perceivable objects/sources between 3 and 6 []. We synthesized a set of auditory scenes containing combinations of speech sounds only, natural sounds only, non-natural sounds only, and mixtures of all of these sounds, varying the number of sound objects/sources in each scene between 2 and 6.
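The scene-synthesis step described above can be sketched as follows. This is a minimal illustration, not the stimulus-generation code actually used in the project: the toy signals, the RMS normalization, and the function name are all assumptions standing in for the real recordings and mixing procedure.

```python
import numpy as np

def synthesize_scene(sources, n_objects, rng=None):
    """Mix `n_objects` randomly chosen sources (1-D arrays at a common
    sample rate) into one auditory scene. Each source is RMS-normalized
    so that no single source dominates the mixture."""
    rng = rng or np.random.default_rng()
    chosen = rng.choice(len(sources), size=n_objects, replace=False)
    length = max(len(sources[i]) for i in chosen)
    scene = np.zeros(length)
    for i in chosen:
        s = np.asarray(sources[i], dtype=float)
        s = s / (np.sqrt(np.mean(s ** 2)) + 1e-12)  # RMS normalize
        scene[: len(s)] += s
    return scene / n_objects  # keep the mixture amplitude bounded

# Toy example: six noise signals standing in for the speech, natural,
# and non-natural recordings used in the experiment.
rng = np.random.default_rng(1)
toy_sources = [rng.standard_normal(1000) for _ in range(6)]
scene = synthesize_scene(toy_sources, n_objects=4, rng=rng)
```

In the actual experiment, `sources` would hold the recorded speech, natural, and non-natural sounds, and `n_objects` would be swept from 2 to 6 to build the 14 scenes.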
We asked listeners to report the number of objects/sources heard in each of the 14 different auditory scenes under two conditions: in the first, they could hear each scene only once; in the second, they could replay each scene as often as they wished before reporting the number of objects/sources. Results of both conditions are shown in Fig. 1.

Fig. 1

Several observations can be made from the results of the experiment. First, there is a limit of about 4 objects, on average, that can be perceived in an auditory scene; this was observed when listeners were allowed to replay the sound multiple times. The limit dropped when listeners could play the sound only once. Second, in the cases of Eng4, Eng5, and Eng6, where the sources consisted of 4, 5, and 6 speakers of English sentences, listeners reported hearing only 3 speakers, and there was very little variability in the responses as the number of speakers increased; interestingly, performance did not improve when listeners were allowed to hear the samples more than once. For the combinations of natural sounds, as well as the non-natural sounds, increasing the number of sources resulted in a decrease in the reported number of objects, which may be the result of two or more objects/sources fusing. The opposite seems to occur when the combination includes speech, natural sounds, and non-natural sounds, which may be due to the different spectrotemporal profiles of these sounds.

Fig. 2

To model how auditory objects/streams form and how we recognize a certain number of objects/sources, we needed to identify the relevant low- to intermediate-level features that are combined by the auditory system to form objects/sources. Insight into what these features are was taken from the literature in the field and confirmed by the psychoacoustic experiment we conducted here in Telluride.
The relevant features we worked with in this project are stimulus frequency, temporal modulation rate (rate, Hz), spectral modulation rate (scale, cycles/oct), pitch, and the transience of the sound. We used the auditory cortical model [] to extract the frequency, scale, and rate of the stimuli (Fig. 2). We used a pitch-extraction algorithm to model the pitch of the stimuli, and we high-pass filtered the spectrogram of the sound to emphasize its transients.

Fig. 3

We computed the coherence matrix at every instant in time between the scale-frequency channels, integrating across the different rates. We computed the eigenvectors and eigenvalues of this symmetric matrix and used them as an indicator of the complexity of the scene. The simpler the scene (the lower the number of objects/sources), the lower the rank of the matrix and the larger the first few eigenvalues; the more complex the scene (the higher the number of sources), the smaller the contribution of the first few eigenvalues to the overall sum of eigenvalues. For a more intuitive relationship, we took 1/(ratio of the first few eigenvalues to the sum of all eigenvalues) as a measure and plotted it against the number of objects/sources in each of the 19 different auditory scenes, as shown in Fig. 3.

Fig. 4

This figure shows a positive correlation between the number of objects/sources in an auditory scene and 1/(contribution of the first few eigenvalues to the sum), i.e., a negative correlation between the number of objects and the contribution of the first few eigenvalues to the sum of all eigenvalues. We fitted this relationship with a linear equation, which allows us to predict the number of objects from the contribution of the eigenvalues to the sum of all eigenvalues. This project is not complete, but it is a very promising pilot for a more comprehensive project to recognize the number of auditory objects/sources and to model how biological systems form auditory objects.
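The eigenvalue-based complexity measure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the project's implementation: it assumes the cortical-model features have already been reduced to a (time × channel) matrix of scale-frequency activations with rates integrated out, uses correlation as the coherence measure, and treats the top three eigenvalues as "the first few."

```python
import numpy as np

def complexity_measure(features, k=3):
    """Estimate scene complexity from the coherence of feature channels.

    features : (n_frames, n_channels) array of scale-frequency activations.
    k        : number of leading eigenvalues treated as "the first few".
    Returns 1 / (fraction of total eigenvalue mass in the top-k eigenvalues).
    Larger values indicate a more complex scene (more objects/sources).
    """
    # Coherence (here: correlation) matrix between channels across time.
    C = np.corrcoef(features, rowvar=False)
    C = np.nan_to_num(C)  # guard against constant (silent) channels
    # C is symmetric, so its eigenvalues are real.
    eigvals = np.sort(np.abs(np.linalg.eigvalsh(C)))[::-1]  # largest first
    top_fraction = eigvals[:k].sum() / eigvals.sum()
    return 1.0 / top_fraction

# Toy check: a "simple" scene where all channels follow one source should
# score lower than a "complex" scene of independent channels.
rng = np.random.default_rng(0)
source = rng.standard_normal(500)
simple = np.outer(source, np.ones(8)) + 0.1 * rng.standard_normal((500, 8))
complex_scene = rng.standard_normal((500, 8))
print(complexity_measure(simple) < complexity_measure(complex_scene))
```

For a near-rank-1 coherence matrix the top eigenvalues carry almost all of the mass, so the measure stays near 1; as channels decorrelate, the eigenvalue mass spreads out and the measure grows, matching the positive correlation with source count reported above.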
Future work will include the following:
- Compiling a larger list of auditory features relevant to object formation and counting
- Studying possible grouping algorithms for combining these features into auditory objects
- Developing a more comprehensive set of auditory scenes that includes different types of objects and stimuli, representing as diverse a range of situations as possible
- Developing a large test set of auditory scenes to evaluate the model
- Investigating the relationship between the limits of the model and the limits of human behavior: whether the two are systematically related, and whether they can be tied to particular auditory scenes, objects, or sources