Unstructured Audio Classification for Environment Recognition
Selina Chu
Signal and Image Processing Institute and Viterbi School of Engineering
Department of Computer Science, University of Southern California
Los Angeles, CA 90089, USA
selinach@sipi.usc.edu
Abstract
My thesis aims to contribute toward building autonomous agents that are able to understand their surrounding environment through the use of both audio and visual information. To capture a more complete description of a scene, fusing audio and visual information can be advantageous in enhancing a system's context awareness. The goal of this work is the characterization of unstructured environmental sounds for understanding and predicting the context surrounding an agent. Most research on audio recognition has focused primarily on speech and music. Less attention has been paid to the challenges and opportunities of using audio to characterize unstructured environments. Unlike speech and music, which have formant and harmonic structures, environmental sounds are considered unstructured since they are variably composed of different sound sources. My research will investigate challenging issues in characterizing environmental sounds, such as the development of appropriate feature extraction algorithms and learning techniques for modeling the dynamics of the environment. A final aspect of my research will consider the decision making of an autonomous agent based on the fusion of visual and audio information.
Acoustic Environment Recognition
We consider the task of recognizing environmental sounds to understand the scene (or context) surrounding an audio sensor. By auditory scenes, we refer to locations with distinct acoustic characteristics, such as a coffee shop, a park, or a quiet hallway. Consider, for example, applications in robotic navigation and obstacle detection, assistive robots, surveillance, and other mobile device-based services. Many of these systems are predominantly vision-based. When they are employed to understand unstructured environments, their robustness or utility will be lost if visual information is compromised or totally absent. Audio data can be acquired easily, even under challenging external conditions such as poor lighting or visual obstruction, and is relatively cheap to store and process compared with visual signals. To enhance a system's context awareness, we need to incorporate and adequately utilize such audio information.
Research on general audio environment recognition has received some interest in the last few years (Ellis, 1996; Huang, 2002; Malkin and Waibel, 2005; Eronen et al., 2006), but the activity is much less than that for speech or music. Other applications include those in the domain of wearables and context-aware computing (Waibel et al., 2004; Ellis and Lee, 2004). Unstructured environment characterization is still in its infancy. Most research on environmental sounds has centered on the recognition of specific events or sounds (Cai et al., 2006). To date, only a few systems have been proposed to model raw environmental audio without pre-extracting specific events or sounds (Eronen et al., 2006; Malkin and Waibel, 2005). Similarly, our focus is not on analyzing and recognizing discrete sound events, but rather on characterizing general acoustic environment types as a whole.
Time- and Frequency-Domain Feature Extraction
The first step in building a recognition system for auditory environments was to investigate techniques for developing a scene classification system using audio features. We performed the study by first collecting real-world audio with a robot and then building a classifier to discriminate between different environments, which allowed us to explore suitable features and the feasibility of designing an automatic environment recognition system using audio information (Chu et al., 2006). We showed that we can predict the environment in which the robot is positioned with fairly accurate results (92% accuracy for 5 types of environments).
Many previous efforts utilize a high-dimensional feature vector to represent audio signals (Eronen et al., 2006). We showed in (Chu et al., 2006) that a high-dimensional feature set does not always produce good classification performance. This in turn leads to the issue of selecting an optimal subset from a larger set of possible features to yield the most effective subset. In the same work, we utilized a simple forward feature selection algorithm to obtain a smaller feature set that reduces the computational cost and running time and achieves a higher classification rate. Although the results showed improvements, the features found by the selection process were specific to each classifier and environment type. These findings motivated us to look for a more effective approach to representing environmental sounds. Toward this goal, we investigated
new ways of extracting features and introduced the novel idea of using matching pursuit to extract features for unstructured sounds (Chu et al., 2008).
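As a rough illustration of the forward feature selection step mentioned above, the sketch below greedily adds, one at a time, the feature whose inclusion most improves a validation score; the scoring function, stopping rule, and classifier behind it are placeholder assumptions rather than the exact procedure of (Chu et al., 2006).

    import numpy as np

    def forward_feature_selection(score_fn, n_features, max_selected=None):
        """Greedy forward selection: start from an empty set and repeatedly
        add the feature whose inclusion gives the best score, stopping when
        no remaining candidate improves on the current score."""
        selected, best_score = [], -np.inf
        remaining = set(range(n_features))
        while remaining and (max_selected is None or len(selected) < max_selected):
            trial_score, best_f = max((score_fn(selected + [f]), f) for f in remaining)
            if trial_score <= best_score:
                break                      # no candidate improves performance
            selected.append(best_f)
            best_score = trial_score
            remaining.remove(best_f)
        return selected, best_score

    # score_fn(feature_indices) would train the chosen classifier on those
    # feature columns and return, e.g., cross-validation accuracy.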
As with most pattern recognition systems, selecting proper features is key to effective performance. Audio signals have traditionally been characterized by Mel-frequency cepstral coefficients (MFCCs) or other time-frequency representations, such as the short-time Fourier transform and the wavelet transform (Rabiner and Juang, 1993). MFCCs have been shown to work well for structured sounds such as speech and music, but their performance degrades in the presence of noise. Environmental sounds, for example, contain a large variety of sounds, which may include components with strong temporal-domain signatures, such as the chirping of insects and the sound of rain. These sounds are in fact noise-like, with a broad flat spectrum, and are not effectively modeled by MFCCs.
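For reference, a minimal sketch of this conventional frequency-domain front end (framewise MFCCs summarized per clip) is given below; it assumes the librosa library and a hypothetical input file, and the coefficient count is illustrative rather than the configuration used in our experiments.

    import numpy as np
    import librosa

    # Compute framewise MFCCs for a clip and summarize the clip by the
    # per-coefficient mean and standard deviation.
    y, sr = librosa.load("clip.wav", sr=None)            # hypothetical file
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)   # 12 coefficients per frame
    clip_feature = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])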
Therefore, we proposed a novel feature extraction method that utilizes the matching pursuit (MP) algorithm to select a small set of time-domain features (Chu et al., 2008), which we call MP-features. MP-features have been shown to classify sounds where frequency-domain features (e.g., MFCCs) fail, and they can be advantageous when combined with MFCCs to improve overall performance. Extensive experiments were conducted to demonstrate the advantages of MP-features, as well as of joint MFCC and MP-features, in environmental sound classification. To the best of our knowledge, we were the first to propose using MP for feature extraction from environmental sounds. This method has been shown to perform well in classifying fourteen different audio environments, achieving 83% classification accuracy. This result is very promising considering that, due to the high variance and other difficulties in working with environmental sounds, recognition rates have thus far been limited as the number of target classes increases, i.e., approximately 60% for 13 or more classes (Eronen et al., 2006).
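To make the decomposition underlying MP-features concrete, the following is a minimal sketch of greedy matching pursuit over a generic unit-norm dictionary matrix; the Gabor dictionary and the atom statistics actually used to form MP-features in (Chu et al., 2008) are not reproduced here, so the details are illustrative.

    import numpy as np

    def matching_pursuit(signal, dictionary, n_atoms=5):
        """Greedy matching pursuit: repeatedly select the dictionary atom
        (a row of `dictionary`, assumed unit-norm) most correlated with the
        residual, record it, and subtract its contribution."""
        residual = np.asarray(signal, dtype=float).copy()
        selected = []
        for _ in range(n_atoms):
            correlations = dictionary @ residual
            k = int(np.argmax(np.abs(correlations)))
            coefficient = correlations[k]
            residual = residual - coefficient * dictionary[k]
            selected.append((k, coefficient))   # atom index and weight
        return selected, residual

    # Features can then be derived from the parameters of the selected
    # atoms (e.g., statistics of their frequencies and scales).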
Adaptive Audio Background Modeling
The next goal of this thesis is to develop a way to perform background modeling that can capture the dynamic nature of the environment. Such models should have the following characteristics: 1) require few or no assumptions about prior knowledge, 2) be able to adapt to audio changes over time, and 3) be able to handle multiple dynamic audio sources. Most methods for audio background modeling have been adaptations of work on video (Cristani et al., 2004). The adaptations used for audio have relied on specific fixed parameters for each type of context or environment. A starting point would be to extend an existing background modeling technique to make the system more flexible by integrating a learning or statistical model into the process.
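To make this direction concrete, here is a minimal sketch of an online-adaptive Gaussian background model over frame-level audio features, in the spirit of adaptive background modeling for audio surveillance (Cristani et al., 2004); the learning rate, distance test, and feature choice are illustrative assumptions rather than the method this thesis will adopt.

    import numpy as np

    class AdaptiveAudioBackground:
        """Running Gaussian background model over audio feature frames.
        Frames that deviate strongly from the model are flagged as
        foreground events; otherwise the background statistics adapt."""

        def __init__(self, dim, alpha=0.05, threshold=3.0):
            self.alpha = alpha            # adaptation (learning) rate
            self.threshold = threshold    # normalized distance threshold
            self.mean = np.zeros(dim)
            self.var = np.ones(dim)

        def update(self, frame):
            frame = np.asarray(frame, dtype=float)
            # normalized distance of the new frame from the background model
            dist = np.sqrt(np.mean((frame - self.mean) ** 2 / self.var))
            is_foreground = dist > self.threshold
            if not is_foreground:
                # adapt the background statistics toward the new frame
                self.mean = (1 - self.alpha) * self.mean + self.alpha * frame
                self.var = (1 - self.alpha) * self.var + self.alpha * (frame - self.mean) ** 2
            return is_foreground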
Decision Making in Dynamic Environments
The final goal of this thesis is to perform planning and decision making in dynamic environments using a fusion of audio and visual information. We will investigate methods and techniques for decision making under uncertainty that build on probability, decision, and utility theories. We start by building a framework for decision making that incorporates different modalities. The first step is to develop a framework for fusing information from the two modalities, audio and visual sensor data, to enhance the characterization of the environment or scene context. A major issue in this area is the synchronization, or coupling, of events from different data streams. An initial approach would be to build a system where the decisions for determining certain scenes or environments are made in each individual information stream, based on low-level information processing. Then, from further processing such as correlation analysis on the training data, we can extract linkage information between features of the different data streams and combine the two decisions for prediction, decision making, and planning.
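As a rough illustration of such a late-fusion step, the sketch below combines per-stream class posteriors with a weight that could, for example, be derived from correlation analysis on training data; the weight, the classes, and the source of the posteriors are hypothetical.

    import numpy as np

    def fuse_decisions(audio_probs, visual_probs, audio_weight=0.5):
        """Weighted late fusion of per-modality class posteriors.
        Both inputs are arrays of P(class | stream) over the same classes."""
        audio_probs = np.asarray(audio_probs, dtype=float)
        visual_probs = np.asarray(visual_probs, dtype=float)
        fused = audio_weight * audio_probs + (1 - audio_weight) * visual_probs
        return int(np.argmax(fused)), fused

    # Example with three candidate environments (e.g., hallway, cafe, street):
    label, fused = fuse_decisions([0.2, 0.5, 0.3], [0.1, 0.3, 0.6], audio_weight=0.4)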
References
Cai, R., Lu, L., Hanjalic, A., Zhang, H., and Cai, L.-H. 2006. A flexible framework for key audio effects detection and auditory context inference. IEEE Transactions on Audio, Speech and Language Processing 14(3):1026-1039.
Chu, S., Narayanan, S., and Kuo, C.-C. J. 2006. Content analysis for acoustic environment classification in mobile robots. In Proc. of the AAAI Fall Symposium.
Chu, S., Narayanan, S., and Kuo, C.-C. J. 2008. Environmental sound recognition using MP-based features. In Proc. of IEEE ICASSP.
Cristani, M., Bicego, M., and Murino, V. 2004. On-line adaptive background modelling for audio surveillance. In Proc. of ICPR.
Ellis, D. P. W. 1996. Prediction-driven computational auditory scene analysis. Ph.D. thesis, MIT Dept. of EE and CS.
Ellis, D. P. W., and Lee, K. 2004. Minimal-impact audio-based personal archives. In Proc. of CARPE.
Eronen, A., Peltonen, V., Tuomi, J., Klapuri, A., Fagerlund, S., Sorsa, T., Lorho, G., and Huopaniemi, J. 2006. Audio-based context recognition. IEEE Transactions on Speech and Audio Processing.
Huang, J. 2002. Spatial auditory processing for a hearing robot. In Proc. of ICME.
Malkin, R., and Waibel, A. 2005. Classifying user environment for mobile applications using linear autoencoding of ambient audio. In Proc. of IEEE ICASSP.
Rabiner, L., and Juang, B.-H. 1993. Fundamentals of Speech Recognition. Prentice-Hall.
Waibel, A., Steusloff, H., Stiefelhagen, R., and the CHIL Project Consortium. 2004. CHIL: Computers in the Human Interaction Loop. In Proc. of WIAMIS.