Robert A. Hicks II
ECE 4007 RCC, Advisor: Dr. Elliot Moore
“The Colors of Music”: Audio Feature Extraction and Pattern Recognition

Introduction

Extracting, recognizing, and categorizing features of audio signals (such as genre, tempo, timbre, or mood) are processes that humans perform almost subconsciously. Because audio feature recognition and categorization are largely experience-based, their digital implementation has historically been subjective. The field has recently gained momentum, however, due to the growing availability of powerful yet adaptable Digital Signal Processing (DSP) software packages. This paper focuses on the audio feature extraction and pattern recognition aspects of DSP, discusses some important commercial applications of this technology, and identifies some algorithms and software packages that are used to analyze audio data.

Commercial Applications

According to [1], the recent explosion of extensive online multimedia libraries has created growing interest in automating the music information retrieval and classification process. The traditional means of searching for and classifying music files has relied on annotated metadata. Recently proposed frameworks [2] use a combination of high- and low-level feature extraction to automatically classify songs by genre, artist, and even artist gender. Recent research [3] has also produced algorithms that use these extracted features to recognize and classify different parts of a song that may have different characteristics. Automated Music Information Retrieval (MIR) technology is currently being tested and deployed as an improvement over the traditional methods of searching annotated metadata or classifying music manually.

Another important application of audio feature extraction and pattern recognition lies in speech recognition, which was introduced by Texas Instruments in the 1960s [4]. In security applications, the end user's voice serves as a biometric password: features of the user's voice are extracted and used to create a voice key. Speech recognition is also becoming more important in the control of embedded systems and in human-machine interaction. Traditional speech processing methods such as Artificial Neural Networks (ANN) and the Hidden Markov Model (HMM) are difficult to implement and computationally intensive [5], and are therefore not feasible for applications with power limitations. However, algorithms based on fuzzy logic have recently been used [5] to provide speech processing for low-cost and deeply embedded systems. Fuzzy logic-based algorithms can be combined with pattern recognition rules that are tuned by self-learning rather than by traditional manual tuning, and are consequently being investigated for applications involving automated human-machine interaction.

Underlying Technology

Audio feature extraction and pattern recognition consist of two distinct processes. First, a set of predetermined features is extracted from the audio file or signal. Second, the feature set is processed by one or more algorithms to obtain an end result determined by the application. According to [6], the crucial step in analyzing audio data is finding a compact but effective set of features to analyze. The desired feature set is generally computed in either the time or the frequency domain, but according to [7], the best results are often obtained using a joint Time-Frequency (TF) approach. A simple illustration of both domains appears below.
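To make the two domains concrete, the short Python sketch below computes one time-domain feature (zero-crossing rate) and one frequency-domain feature (spectral centroid) from a framed signal. It is a minimal illustration of the kind of low-level features discussed above, not an implementation from the cited works; the function names and the frame and hop sizes are arbitrary choices for this example.

    import numpy as np

    def frame_signal(x, frame_len=1024, hop=512):
        """Split a 1-D signal into overlapping frames (arbitrary default sizes)."""
        n_frames = 1 + max(0, (len(x) - frame_len) // hop)
        return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

    def zero_crossing_rate(frames):
        """Time-domain feature: fraction of adjacent samples that change sign."""
        return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

    def spectral_centroid(frames, fs):
        """Frequency-domain feature: magnitude-weighted mean frequency in Hz."""
        window = np.hanning(frames.shape[1])
        mags = np.abs(np.fft.rfft(frames * window, axis=1))
        freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
        return (mags @ freqs) / (np.sum(mags, axis=1) + 1e-12)

    if __name__ == "__main__":
        fs = 16000
        t = np.arange(fs) / fs
        x = np.sin(2 * np.pi * 440 * t)           # synthetic 440 Hz test tone
        frames = frame_signal(x)
        print(zero_crossing_rate(frames)[:3])     # low, steady rate for a pure tone
        print(spectral_centroid(frames, fs)[:3])  # close to 440 Hz

Features such as these are typically collected into a per-frame vector and handed to the pattern recognition stage described next; a joint TF approach would instead track how such quantities evolve across both axes at once.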
Feature extraction algorithms generally rely on analysis of training data to learn how attributes are represented in each domain. Once the desired attributes are extracted from the audio data, one or more algorithms process this feature set to obtain the desired result. For example, in an automated MIR system the musical emotion is determined by mapping the extracted features onto an array of arousal and valence (AV) values with a fuzzy inference engine [8]. The AV array reflects the emotion type as well as the emotional intensity. In this way, MIR systems can recognize musical emotion from low-level extracted features such as beat, pitch, rhythm, and tempo.

In other applications, there is no extensive set of training data with which to practice feature recognition and extraction. In these cases, discriminative training methods such as the Support Vector Machine (SVM) can be used to obtain improved feature recognition rates [9]. In some applications, such as speech extraction, the training data (the target voice) is well defined, but extracting this target voice from noisy interference in a practical environment is challenging. For example, Mel-Frequency Cepstral Coefficients (MFCC) work well for speech extraction when the testing environment matches the training environment, but may fail in the presence of noisy interference. According to [10], post-processing with a minimum variance modulation filter can be used to successfully extract speech from a noisy background, or when the training and testing environments differ. However, this method tends to be weak when the target voice and the interference have similar spectral bases [11]. A sketch of the basic MFCC computation follows.
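The modulation filtering of [10] and the learned spectral bases of [11] build on a standard MFCC front end, sketched below in Python. This is an illustrative baseline only; the 25 ms frames, 26 mel filters, 512-point FFT, and 13 retained coefficients are common textbook defaults rather than values taken from the cited works.

    import numpy as np
    from scipy.fftpack import dct

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filters, n_fft, fs):
        """Triangular filters spaced evenly on the mel scale."""
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
        fb = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, center, right = bins[i - 1], bins[i], bins[i + 1]
            for k in range(left, center):
                fb[i - 1, k] = (k - left) / max(center - left, 1)
            for k in range(center, right):
                fb[i - 1, k] = (right - k) / max(right - center, 1)
        return fb

    def mfcc(x, fs, frame_len=400, hop=160, n_fft=512, n_filters=26, n_coeffs=13):
        """MFCCs with 25 ms frames and a 10 ms hop at fs = 16 kHz."""
        x = np.append(x[0], x[1:] - 0.97 * x[:-1])             # pre-emphasis
        n_frames = 1 + max(0, (len(x) - frame_len) // hop)
        frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
        frames *= np.hamming(frame_len)                         # taper frame edges
        power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
        fb = mel_filterbank(n_filters, n_fft, fs)
        energies = np.maximum(power @ fb.T, 1e-10)              # floor avoids log(0)
        return dct(np.log(energies), type=2, axis=1, norm='ortho')[:, :n_coeffs]

    if __name__ == "__main__":
        fs = 16000
        t = np.arange(fs) / fs                                  # one second of audio
        x = np.sin(2 * np.pi * 300 * t)                         # stand-in for speech
        print(mfcc(x, fs).shape)                                # -> (98, 13)

In a matched training and testing environment, per-frame coefficient vectors like these feed a recognizer directly; the post-processing methods cited above then operate on such representations when noise or mismatched conditions are present.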
Building Blocks (Implementation)

Because audio signal processing often involves complex and computationally intensive algorithms, the most efficient implementation of the required processes is generally software-intensive (if not software-exclusive). In applications that do not require an embedded hardware audio processing component, it has become common to use commercially available software packages for algorithm implementation. For example, MATLAB [12] has a built-in toolbox that contains a collection of industry-standard digital and analog signal processing algorithms. The strength of such software packages lies in their adaptability combined with ease of use: MATLAB implements a wide array of industry-standard tools in an open programming language, making it possible for users to view and modify source code for advanced algorithm development. Simulink, a powerful software package that runs in parallel with MATLAB, also uses a collection of signal processing toolboxes, but implements the industry-standard tools in “block” format. This graphical user interface makes it easier to implement complex processing algorithms visually. Simulink can also perform complex audio processing algorithms in real time [13].

The hardware required for audio feature extraction and pattern recognition varies widely with the particular application's requirements. When applications are power-constrained or require embedded processing, such as human-machine interaction or audio processing systems for portable multimedia devices, it often becomes difficult to find a microprocessor capable of satisfying the computational requirements while meeting the application's constraints on size and power consumption. However, to meet such requirements Altera [14] has developed hardware that is able to run “next-generation” complex DSP algorithms on a single FPGA.

References

[1] M.F. McKinney, M. Bruderer, and A. Novello, “Perceptual Constraints in Systems for Automatic Music Information Retrieval,” in ICCE 2007 Digest of Technical Papers, International Conference on Consumer Electronics, Jan. 2007, pp. 1-2.
[2] N.A. Draman, C. Wilson, and S. Ling, “Bio-inspired Audio Content-Based Retrieval Framework (BACRF),” in Proceedings of World Academy of Science, Engineering and Technology, May 2009, vol. 41, pp. 791-796.
[3] C. Xu, Y. Zhu, and Q. Tian, “Automatic music summarization based on temporal, spectral, and cepstral features,” in Proceedings of the 2002 IEEE International Conference on Multimedia and Expo, Aug. 2002, vol. 1, pp. 117-120.
[4] School of Industrial Engineering, Purdue University, “Voice Recognition Technology.” [Online]. Available: http://cobweb.ecn.purdue.edu/~tanchoco/MHE/ADC-is/Voice/main.shtml. [Accessed Jan. 25, 2010].
[5] M. Malcangi, “Fuzzy Logic-Based Audio Pattern Recognition,” in AIP Conference Proceedings, Nov. 2008, vol. 1060, no. 1, pp. 225-228.
[6] J.D. Deng, C. Simmermacher, and S. Cranefield, “A Study on Feature Analysis for Musical Instrument Classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 38, no. 2, pp. 429-438, Apr. 2008. [Abstract]. Available: http://ieeexplore.ieee.org. [Accessed Jan. 20, 2010].
[7] Z. Jiming, W. Guohua, and Y. Chunde, “Modified Local Discriminant Bases and Its Application in Audio Feature Extraction,” in International Forum on Information Technology and Applications (IFITA '09), May 2009, vol. 3, pp. 49-52.
[8] J. Sanghoon, R. Seungmin, B. Han, and E. Hwang, “A Fuzzy Inference-based Music Emotion Recognition System,” in 5th International Conference on Visual Information Engineering, Jul.-Aug. 2008, pp. 673-677.
[9] A. Sloin, A. Alfandary, and D. Burshtein, “Support Vector Machine Re-Scoring Algorithm of Hidden Markov Models,” MUSCLE Network of Excellence, Jan. 2008. [Online]. Available: http://www.ifs.tuwien.ac.at/mir/muscle/del/audio_tools.html. [Accessed Jan. 20, 2010].
[10] Y. Chiu and R.M. Stern, “Minimum Variance Modulation Filter for Robust Speech Recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, Apr. 2009, pp. 3917-3920.
[11] S. Park, J. Yoo, and S. Choi, “Target Speech Extraction With Learned Spectral Bases,” in IEEE International Conference on Acoustics, Speech and Signal Processing, Apr. 2009, pp. 1789-1792.
[12] The MathWorks, Inc., “Signal Processing Toolbox 6,” 9317v05 datasheet, May 2004.
[13] W.J. Palm III, Introduction to MATLAB 7 for Engineers. New York, NY: McGraw-Hill, 2005.
[14] Altera Corporation, “Audio Processing.” [Online]. Available: http://www.altera.com/end-markets/auto/audio/aut-processing.html. [Accessed Jan. 20, 2010].