Using Sub-Band Wavelet Packets Strategy for Feature Extraction Jiang Hai, Er Meng Joo School of Electrical and Electronic Engineering, Nanyang Technological University S1-B4b-06, Nanyang Avenue, Nanyang Technological University, Singapore 639798 Telephone: (65) 67905472 EDICS: 2.SPEE Abstract: In this paper, a new filter structure using Wavelet Packets (WPs) decomposition strategy based on different frequency bands is presented. These filters have the advantage of having more decomposition times in the human-ear-sensitive frequency segment than those in the human-ear-insensitive frequency segments. Each frequency segment has frequency bands spacing similar to the Mel scale. The framework of Linear Discriminant Analysis (LDA) is used to derive an efficient and reduced-dimension speech parametric vector space for the speech recognition system. Experiments are performed on the speaker independent isolated-word speech recognition task. It is found that the Sub-Band WPs method achieves better recognition performance than the Mel Filter-Like WPs method and the most popular Mel Frequency Cepstral Coefficients (MFCC) feature extraction method in a noisy environment. Keywords Wavelet Packets, Linear Discriminant Analysis, Mel Frequency Cepstral Coefficients. 1 Introduction Most of the speech recognition systems use Mel Frequency Cepstral Coefficients (MFCC) for speech recognition[1]-[3]. The drawback of MFCC is that it uses Short Time Fourier Transform (STFT), which has poor time-frequency resolution and an assumption that the signal is stationary. Thus, it is relatively difficult to recognize stop (plosive) phonemes by these features. -1- Wavelet transform has been used recently for the purpose of feature extraction. The discrete wavelet transform has been used with limited success because of its left recursive nature [1],[2]. A filter structure with Mel spacing and based on admissible wavelet packets (WPs) was proposed[3]. In this paper, an improved filter structure based on the Mel Filter-Like WPs method to extract features was proposed. These features are found to be superior to those based on the Mel Filter-Like WPs method. 2 Mel Filter-Like Wavelet Packets Decomposition Basically, wavelet transform can decompose a signal space Ai into a lower resolution (approximation) space Ai 1 and a detailed space Di 1 by dividing the orthogonal basis (i (t 2 i n )) nZ of A j into two i 1 new orthogonal bases, namely (i 1 (t 2 n )) nZ of Ai 1 and ( i 1 (t 2 j 1 n )) nZ of Di 1 where z is a set of integers, and (t ) and (t ) are scaling and wavelet functions respectively. The WPs decomposes the approximate space as well as the detailed space. Each subspace in the tree is indexed by its depth i and the number of subspaces p below it. The two WPs orthogonal bases at a parent node (i, p ) are defined by i2p1 ( k ) and 2 p 1 i 1 h[n] n (k ) p i ( k 2 in ) g[n] p i ( k 2in ) (1) (2) n where h[n ] is a low-pass filter and g[n ] is a high-pass filter given by and h[n ] i2p1 (u ), ip (u 2 in ) (3) g[n ] i2p1 1 (u ), ip (u 2in ) (4) Thus, by using full i level WPs decomposition, a balanced binary tree structure is formed having more i 1 2 than 2 orthogonal bases. Decomposition by WPs helps to partition the frequency axis at the higher frequency side into smaller bands that cannot be achieved by using discrete wavelet transform. WPs -2- decomposition gives an over-complete set of basis, and the problem is to select the best set of basis for a given signal (in other words, selecting the best partitioning of the frequency axis). For the Mel scale, the edge bandwidths were calculated based on [4] y 2595 log 10 (1 f ) 700 (5) where y is the Mel scale value and f is the bandwidth frequency. The speech in the Texas Instruments database TI46 is sampled at 11.025 kHz, giving a 5.512 kHz bandwidth signal. According to the Mel Filter-like method [3], first, a full two-level WPs decomposition is carried out. This partitions the frequency axis into four bands, each of bandwidth of 1378 Hz. The lowest band of 0~1378 Hz is further decomposed by applying again the full three-level WPs decomposition. This divides the 0~1378 Hz band into eight subbands each of bandwidth of 172 Hz. The frequency band of 1378~2756 Hz is further decomposed by applying two-level WPs decomposition, thereby giving four subbands of 344 Hz. Next, the 2756~4134 Hz is selected to complete one-level decomposition. This gives two bands each of bandwidth of 689 Hz. The frequency between 4134 and 5512 is also decomposed by one-level WPs. The results are summarized in Table 1. Filter Number Bandwith Frequency Mel Scale Mel Bandwidth 0 0 0 1 172 229 229 2 344 426 200 3 516 600 173 4 688 755 155 5 860 895 140 6 1032 1023 138 7 1204 1140 127 8 1376 1249 99 -3- 9 1721 1444 195 10 2065 1616 162 11 2409 1769 153 12 2756 1909 140 13 3445 2152 243 14 4134 2360 208 15 4823 2541 181 16 5512 2703 162 Table 1 Frequency bands for Mel Filter-Like WPs 3 Sub-Band Wavelet Packets Decomposition It is well known that the sensitive-to-human-ear frequency scope is from 1 kHz to 6 kHz, and the most sensitive frequency scope is from 1 kHz to 3 kHz. At first, we divide the frequency band into three bigger bands. The first band from 0-1 kHz and the third band from 3-5.5 kHz are the wide dividing frequency bands. The second band from 1-3 kHz will be divided into detailed frequency bands spacing similar to the Mel scale. First, a full four-level WPs decomposition is carried out. This partitions the frequency axis into 16 bands. We choose the most-sensitive-to-human-ear frequency bands 1034~3101 Hz from these bands to be further decomposed. The band of 1034~1378 Hz is further decomposed by applying again the full onelevel WPs decomposition. This divides the 1034~1378 Hz band into two subbands each of bandwidth of 172 Hz. The frequency band of 1034~1207 Hz is further decomposed by applying one-level WPs decomposition, thereby giving two subbands of 87 Hz. Next, the band of 1378~1723 Hz, the band of 1723~2067 Hz and the band of 2067~2412 Hz are selected to complete one-level decomposition. This gives six bands each of bandwidth of 172 Hz. From the relative insensitive-to-human-ear frequency band -4- 3101~5512 Hz, we choose only the 3101~4134 Hz frequency band and the 4134~5512 Hz frequency band as the feature extraction. As a consequence, the following table is obtained: Filter Number Bandwidth Frequency Mel Scale Mel Bandwidth 0 0 0 1 345 451 451 2 689 772 321 3 1034 1022 250 4 1120 1077 55 5 1207 1129 52 6 1378 1226 97 7 1552 1317 90 8 1723 1400 83 9 1897 1477 77 10 2067 1549 72 11 2242 1618 68 12 2412 1681 63 13 2756 1800 119 14 3101 1907 107 3445 2005 3790 2095 4134 2178 4479 2256 4823 2328 5168 2396 5512 2460 15 16 -5- 270 280 Table 2 Frequency bands for Sub-Band WPs After performing decomposition by Mel Filter-Like WPs and Sub-Band WPs, energy in each of the frequency bands is calculated. The log of energy is then calculated giving a total of 16 coefficients taken as features. In order to get a more robust acoustic feature effective for the automatic speech recognition system, the static segment WPs coefficients plus dynamic WPs coefficients are given by WPs + delta WPs, where WPs is the 16th order WPs decomposition coefficients applied to the windowed speech frame and delta WPs coefficients are calculated by using a classical regression method. Now, we have a 32-dimensional vector space to represent the speech utterance signal. The 16-dimension feature space is obtained by the LDA transformation with minimal loss in discrimination information. 4 Experimental Results The algorithm is applied to the T146 corpus which is designed and collected at the Texas Instruments (TI). The TI46 corpus contains 16 speakers: 8 males and 8 females. There are 15 utterances of each English digit (0~9) from each speaker: 10 designated as training tokens and 5 designed as testing tokens in the proposed system. A speech independent English digits recognition system is built in the experiment. In all our simulations, we chose output feature space dimensionality of m 16 . To compare different feature extraction methods, we employed 16-dimensional feature vectors compressed by the Discrete Cosine Transform method from the 40-band MFCC in our simulations. At the same time, a speech recognition system is also created based on the method of general Mel Filter-Like WPs. To test the robust performance of recognition systems at the noisy environment, we need to build noisy speech data. The noisy speech data sets were created by adding white Gaussian noise to the clear speech data sets according to different SNR values. The speech recognition rates were obtained on the clear data sets, the noisy data sets when SNR=20 and when SNR=10. The test results are displayed in Figure 1 as shown. -6- MFCC Mel Filter-Like WPs Sub-Band WPs % recognition 100 90 80 70 60 50 CLEAR SNR=20 SNR=10 AVERAGE Figure 1 Speech recognition results The proposed Sub-Band WPs method shows 3%, 2% and 4% recognition rate increase in the quite environment, SNR=20 and SNR=10 than the general Mel Filter-Like WPs method respectively. The general Mel Filter-Like WPs also shows better results than the traditional MFCC in the clear environment and SNR=20 noisy condition. From above results, the Sub-Band WPs method shows relatively better performance than the general Mel Filter-Like WPs method in different testing data sets. From Figure 1, we can also observe that the average recognition rate of Sub-Band WPs method successfully outperforms that of the MFCC and general Mel Filter-Like WPs from 80.3% and 84.3% to 87.3%. This observation suggests that our proposed Sub-Band WPs method is especially useful when the feature dimension hold 16 same as that of MFCC and general Mel Filter-Like WPs. 5 Conclusions In this paper, we have presented a Sub-Band WPs strategy for speech feature extraction. Speakersindependent experiments to recognize ten English digits using Sub-Band WPs method were carried out successfully. Among different SNR values, the Sub-Band WPs method shows better recognition rates than the traditional MFCC and the general Mel Filter-Like WPs methods. Experiment results suggest that the proposed Sub-Band WPs method increase the speech recognition rate while the dimension of feature doesn’t increase. The comparative study with the state-of-the-art method shows that our method is superior. -7- References: [1] Potamifis,I., Fakotakis, N., Kokkinakis, G., “Improving the robustness of noisy MFCC features using minimal recurrent neural networks” Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, Volume:5, 2000 Page(s):271-276 vol.5 [2] Wei-Wen Hung, Hsiao-Chuan Wang, “On the use of weighted filter bank analysis for the derivation of robust MFCCs” IEEE Signal Processing Letters, Volume:8 Issue: 3, March 2001 Page(s):70-73 [3] Shang-Ming Lee, Shi-Hau Fang, Jeih-welh Hung, Lin-Shan Lee, “Improved mfcc feature extraction by pca-optimized filter-bank for speech recognition” Automatic Speech Recognition and Understanding, 2001.ASRU ’01.IEEE Workshop on, 2001 [4] Tufekci,Z., and Gowdy,J.N.:“Feature extraction using discrete wavelet transform for speech recognition”. Proc. IEEE Southeastcon 2000, Nashville, USA, 2000, pp. 116-123 [5] Farooq,O., and DattaS.: “Dynamic feature extraction by wavelet analysis”. Proc. 6yh Int. Conf. Spoken Language Processing, Beijing, China, Oct. 2000, Vol.4, pp. 696-699 [6] O.Farooq and S. Datta, “Mel Filter-Like Admissible Wavelet Packet Structure for Speech Recognition” IEEE Signal Processing Letters, vol.8, No.7, July 2001. [7] E.Zwicker and E.Terhardt, “Analytical expression for critical band rate and critical bandwidth as a function of frequency,” J.Acoust.Soc.Amer., vol.68,pp.1523-1525, Dec.1980 -8-