Robert A. Hicks II
ECE 4007 RCC, Advisor: Dr. Elliot Moore
“The Colors of Music”
Audio Feature Extraction and Pattern Recognition
Introduction
Extracting, recognizing, and categorizing features of audio signals (such as genre, tempo, timbre, or mood) are
processes that humans perform almost subconsciously. However, because feature recognition and categorization are largely experience-based, digital implementations of these processes have historically been subjective. The field has recently gained momentum due to the growing availability of powerful yet adaptive Digital Signal
Processing (DSP) software packages. This paper focuses on the audio feature extraction and pattern recognition
aspects of DSP, discusses some important commercial applications of this technology, and identifies some
algorithms and software packages that are used to analyze audio data.
Commercial Applications
According to [1], the recent explosion of extensive multimedia libraries online has created a growing interest in
automating the music information retrieval and classification process. The traditional means of searching for
and classifying music files has involved the use of annotated metadata. Recently proposed frameworks [2] use a
combination of high and low level feature extraction to automatically classify songs based on genre, artist, and
even artist gender. Recent research [3] has also led to algorithms that can use these extracted features to
recognize and classify different parts of a song that may have different characteristics. Automated Music
Information Retrieval (MIR) technology is currently being tested and implemented as an improvement over the
traditional methods of searching annotated metadata or manual music classification.
Another important application of audio feature extraction and pattern recognition lies in the field of speech
recognition, which was introduced by Texas Instruments in the 1960s [4]. In security applications, the end user's
voice is used as a biometric password, and features of the user's voice are extracted and used to create a voice
key. Speech recognition is also becoming more important in control of embedded systems and human-machine
interaction. Traditional speech processing methods such as Artificial Neural Networks (ANN) and the Hidden
Markov Model (HMM) are difficult to implement and computationally intensive [5], and are therefore not
feasible for applications with power limitations. However, algorithms based on fuzzy logic have recently been
used [5] to provide speech processing for low-cost and deeply embedded systems. Fuzzy logic-based algorithms
can be used in combination with pattern recognition rules that are tuned by self-learning instead of the traditional
manual tuning, and are subsequently being investigated in applications involving automated human-machine
interaction.
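To make the rule-based idea concrete, the short Python sketch below evaluates a single fuzzy rule over two voice-derived features using triangular membership functions. It is only a rough illustration of the general technique, not the system described in [5]; the feature names, breakpoints, and rule are invented for the example.

# Minimal fuzzy-rule sketch: "IF energy is HIGH AND pitch_var is LOW
# THEN command_match is LIKELY". Feature names and breakpoints are
# hypothetical; real systems tune memberships from training data.

def tri(x, a, b, c):
    """Triangular membership function peaking at b, zero outside (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def rule_strength(energy, pitch_var):
    # Degree to which each antecedent holds; min() acts as fuzzy AND.
    energy_high = tri(energy, 0.4, 0.8, 1.2)
    pitch_var_low = tri(pitch_var, -0.2, 0.0, 0.3)
    return min(energy_high, pitch_var_low)

# A frame with strong energy and stable pitch fires the rule strongly.
print(rule_strength(energy=0.75, pitch_var=0.05))  # ~0.83

In a self-learning system of the kind described above, the membership breakpoints would be adjusted automatically from training data rather than tuned by hand.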
Underlying Technology
Audio feature extraction and pattern recognition consists of two distinct processes. First, a set of predetermined features is extracted from the audio file or signal. Second, the feature set is processed by one or more
algorithms to obtain an end result determined by the application. According to [6], the crucial step in analyzing
audio data is finding a compact but effective set of features to analyze. The desired feature set is generally
processed in either the time or frequency domain, but according to [7], the best results are often obtained using a
Joint Time-Frequency (TF) approach. Feature extraction algorithms generally rely on analysis of training data to
gain a basis for recognizing how attributes are represented in each domain. Once the desired attributes are
extracted from the audio data, one or more algorithms process this feature set to obtain a desired result. For
example, in an automated MIR system the musical emotion is determined by mapping the extracted features onto
an array of arousal and valence (AV) values with a fuzzy inference engine [8]. The AV array reflects emotion
type as well as the emotional intensity. In this way, MIR systems can recognize musical emotion from low-level
extracted features such as beat, pitch, rhythm, and tempo. In other applications, there is not an extensive set of
training data with which to train feature recognition and extraction. In these cases, discriminative training methods such as the Support Vector Machine (SVM) can be used to obtain improved feature recognition rates [9].
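As a toy illustration of the second stage of the pipeline, the following Python sketch trains an SVM on synthetic feature vectors and classifies a new one. The use of scikit-learn and the two-cluster "genre" data are assumptions of this example, not tools or data taken from the cited work.

# Sketch of the classification stage: an SVM labels pre-extracted
# feature vectors. The four-dimensional features are synthetic
# stand-ins for quantities such as tempo, pitch, or MFCCs.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two synthetic "genres": Gaussian clusters in feature space.
genre_a = rng.normal(loc=0.0, scale=0.5, size=(50, 4))
genre_b = rng.normal(loc=2.0, scale=0.5, size=(50, 4))
X = np.vstack([genre_a, genre_b])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf").fit(X, y)

# A new feature vector near the second cluster is labeled accordingly.
print(clf.predict([[1.9, 2.1, 1.8, 2.2]]))  # -> [1]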
In some applications, such as speech extraction, the training data (the target voice) is well defined, but extracting
this target voice from noisy interference in a practical environment is challenging. For example, Mel-Frequency Cepstral Coefficients (MFCC) work well for speech extraction when the testing environment matches
the training environment, but may fail in the presence of noisy interference. According to [10], post-processing
with a minimum variance modulation filter can be used to successfully extract speech from a noisy background,
or when the training and testing environments are different. However, this method tends to be weak when the
target voice and the interference have similar spectral bases [11].
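For reference, MFCCs follow a standard recipe: windowed power spectrum, mel-scale triangular filterbank, logarithm, and discrete cosine transform. The Python sketch below computes the coefficients for a single frame; the parameter values are typical choices assumed for this example, not values prescribed by [10] or [11].

# Rough MFCC computation for one frame (pre-emphasis omitted for
# brevity): power spectrum -> mel filterbank -> log -> DCT.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=26, n_coeffs=13, n_fft=512):
    # Windowed power spectrum of the frame.
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2

    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log filterbank energies, decorrelated with a DCT.
    energies = np.log(fbank @ spectrum + 1e-10)
    return dct(energies, type=2, norm="ortho")[:n_coeffs]

sr = 16000
t = np.arange(400) / sr                      # one 25 ms frame
frame = np.sin(2 * np.pi * 440 * t)          # toy stand-in for speech
print(mfcc_frame(frame, sr).shape)           # (13,)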
Building Blocks (Implementation)
Because audio signal processing often involves complex and computationally intensive algorithms, the most
efficient implementation of required processes is generally software-intensive (if not software-exclusive). In
applications that do not require an embedded hardware audio processing component, it has become common to
use commercially available software packages for algorithm implementation. For example, MATLAB [12] has a
built-in toolbox that contains a collection of industry standard digital and analog signal processing algorithms.
The strength of such software packages lies in combining adaptability with ease of use. MATLAB implements a wide array of industry-standard tools in an open programming language, making it possible for users
to view and modify source code for advanced algorithm development. Simulink, a powerful software package
that runs alongside MATLAB, also uses a collection of signal processing toolboxes, but implements the
industry-standard tools in “block” format. This graphical user interface makes it easier to implement complex
processing algorithms in a visual manner. Simulink also has the ability to perform complex audio processing
algorithms in real time [13].
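As a rough analogue of such a toolbox workflow, sketched here in Python with scipy.signal (an assumption of this example rather than a package discussed above), the fragment below designs a low-pass Butterworth filter and uses it to suppress high-frequency interference.

# Design a low-pass filter and run a noisy signal through it, the
# kind of task such toolboxes package as ready-made functions.
import numpy as np
from scipy import signal

sr = 8000
t = np.arange(sr) / sr                               # 1 s of samples
tone = np.sin(2 * np.pi * 200 * t)                   # 200 Hz signal
noisy = tone + 0.5 * np.sin(2 * np.pi * 3000 * t)    # 3 kHz interference

# 6th-order low-pass with a 1 kHz cutoff, realized as second-order
# sections for numerical stability.
sos = signal.butter(6, 1000, btype="low", fs=sr, output="sos")
filtered = signal.sosfilt(sos, noisy)

# The 3 kHz component is attenuated by roughly 57 dB relative to the
# 200 Hz tone (bins of the 1 s FFT fall on integer frequencies).
spec = np.abs(np.fft.rfft(filtered))
print(spec[3000] / spec[200])                        # small ratio, ~1e-3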
The hardware required for audio feature extraction and pattern recognition varies widely based on the particular
application's requirements. When applications are power-constrained or require embedded processing, such as
human-machine interaction or audio processing systems for portable multimedia devices, it often becomes
difficult to find a microprocessor capable of satisfying computational requirements while meeting the
application's constraints on size and power consumption. However, to meet such requirements, Altera [14]
has developed hardware that is able to run “next-generation” complex DSP algorithms on a single FPGA.
References
[1] M.F. McKinney, M. Bruderer, and A. Novello, "Perceptual Constraints in Systems for Automatic Music Information Retrieval," in ICCE 2007. Digest of Technical Papers, International Conference on Consumer Electronics, Jan. 2007, pp. 1-2.
[2] N.A. Draman, C. Wilson, and S. Ling, "Bio-inspired Audio Content-Based Retrieval Framework (BACRF)," in Proceedings of World Academy of Science, Engineering, and Technology, May 2009, vol. 41, pp. 791-796.
[3] C. Xu, Y. Zhu, and Q. Tian, "Automatic music summarization based on temporal, spectral, and cepstral features," in Proceedings of 2002 IEEE International Conference on Multimedia and Expo, Aug. 2002, vol. 1, pp. 117-120.
[4] School of Industrial Engineering, Purdue University, "Voice Recognition Technology," Purdue University. [Online]. Available: http://cobweb.ecn.purdue.edu/~tanchoco/MHE/ADC-is/Voice/main.shtml. [Accessed Jan. 25, 2010].
[5] M. Malcangi, "Fuzzy Logic-Based Audio Pattern Recognition," in AIP Conference Proceedings, Nov. 2008, vol. 1060, no. 1, pp. 225-228.
[6] J.D. Deng, C. Simmermacher, and S. Cranefield, "A Study on Feature Analysis for Musical Instrument Classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 38, no. 2, pp. 429-438, April 2008. [Abstract]. Available: http://ieeexplore.ieee.org. [Accessed Jan. 20, 2010].
[7] Z. Jiming, W. Guohua, and Y. Chunde, "Modified Local Discriminant Bases and Its Application in Audio Feature Extraction," in International Forum on Information Technology and Applications, 2009 (IFITA '09), May 2009, vol. 3, pp. 49-52.
[8] J. Sanghoon, R. Seungmin, B. Han, and E. Hwang, "A Fuzzy Inference-based Music Emotion Recognition System," in 5th International Conference on Visual Information Engineering, 2008, July-Aug. 2008, pp. 673-677.
[9] A. Sloin, A. Alfandary, and D. Burshtein, "Support Vector Machine Re-Scoring Algorithm of Hidden Markov Models," MUSCLE Network of Excellence, Jan. 2008. [Online]. Available: http://www.ifs.tuwien.ac.at/mir/muscle/del/audio_tools.html. [Accessed Jan. 20, 2010].
[10] Y. Chiu and R.M. Stern, "Minimum Variance Modulation Filter for Robust Speech Recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing, April 2009, pp. 3917-3920.
[11] S. Park, J. Yoo, and S. Choi, "Target Speech Extraction With Learned Spectral Bases," in IEEE International Conference on Acoustics, Speech and Signal Processing, April 2009, pp. 1789-1792.
[12] The MathWorks, Inc., "Signal Processing Toolbox 6," 9317v05 datasheet, May 2004.
[13] W.J. Palm III, Introduction to MATLAB 7 for Engineers. New York, NY: McGraw-Hill, 2005.
[14] Altera Corporation, "Audio Processing," Altera Corporation. [Online]. Available: http://www.altera.com/end-markets/auto/audio/aut-processing.html. [Accessed Jan. 20, 2010].