Feature Extraction for Speech Recognition Based on LDA

advertisement
Using Sub-Band Wavelet Packets Strategy for Feature Extraction
Jiang Hai, Er Meng Joo
School of Electrical and Electronic Engineering, Nanyang Technological University
S1-B4b-06, Nanyang Avenue, Nanyang Technological University, Singapore 639798
Telephone: (65) 67905472
EDICS: 2.SPEE
Abstract: In this paper, a new filter structure using Wavelet Packets (WPs) decomposition strategy based
on different frequency bands is presented. These filters have the advantage of having more decomposition
times in the human-ear-sensitive frequency segment than those in the human-ear-insensitive frequency
segments. Each frequency segment has frequency bands spacing similar to the Mel scale. The framework
of Linear Discriminant Analysis (LDA) is used to derive an efficient and reduced-dimension speech
parametric vector space for the speech recognition system. Experiments are performed on the speaker
independent isolated-word speech recognition task. It is found that the Sub-Band WPs method achieves
better recognition performance than the Mel Filter-Like WPs method and the most popular Mel
Frequency Cepstral Coefficients (MFCC) feature extraction method in a noisy environment.
Keywords Wavelet Packets, Linear Discriminant Analysis, Mel Frequency Cepstral Coefficients.
1 Introduction
Most of the speech recognition systems use Mel Frequency Cepstral Coefficients (MFCC) for speech
recognition[1]-[3]. The drawback of MFCC is that it uses Short Time Fourier Transform (STFT), which
has poor time-frequency resolution and an assumption that the signal is stationary. Thus, it is relatively
difficult to recognize stop (plosive) phonemes by these features.
-1-
Wavelet transform has been used recently for the purpose of feature extraction. The discrete wavelet
transform has been used with limited success because of its left recursive nature [1],[2]. A filter structure
with Mel spacing and based on admissible wavelet packets (WPs) was proposed[3]. In this paper, an
improved filter structure based on the Mel Filter-Like WPs method to extract features was proposed.
These features are found to be superior to those based on the Mel Filter-Like WPs method.
2 Mel Filter-Like Wavelet Packets Decomposition
Basically, wavelet transform can decompose a signal space Ai into a lower resolution (approximation)
space Ai 1 and a detailed space Di 1 by dividing the orthogonal basis (i (t  2 i n )) nZ of A j into two
i 1
new orthogonal bases, namely (i 1 (t  2 n )) nZ of Ai 1 and ( i 1 (t  2 j 1 n )) nZ of Di 1 where z is
a set of integers, and  (t ) and  (t ) are scaling and wavelet functions respectively.
The WPs decomposes the approximate space as well as the detailed space. Each subspace in the tree is
indexed by its depth i and the number of subspaces p below it. The two WPs orthogonal bases at a parent
node (i, p ) are defined by
 i2p1 ( k ) 

and
2 p 1
i 1

 h[n]
n  
(k ) 
p
i
( k  2 in )

 g[n]
p
i
( k  2in )
(1)
(2)
n  
where h[n ] is a low-pass filter and g[n ] is a high-pass filter given by
and
h[n ]   i2p1 (u ),  ip (u  2 in )
(3)
g[n ]   i2p1 1 (u ),  ip (u  2in )
(4)
Thus, by using full i level WPs decomposition, a balanced binary tree structure is formed having more
i 1
2
than 2 orthogonal bases. Decomposition by WPs helps to partition the frequency axis at the higher
frequency side into smaller bands that cannot be achieved by using discrete wavelet transform. WPs
-2-
decomposition gives an over-complete set of basis, and the problem is to select the best set of basis for a
given signal (in other words, selecting the best partitioning of the frequency axis).
For the Mel scale, the edge bandwidths were calculated based on [4]
y  2595 log 10 (1 
f
)
700
(5)
where y is the Mel scale value and f is the bandwidth frequency.
The speech in the Texas Instruments database TI46 is sampled at 11.025 kHz, giving a 5.512 kHz
bandwidth signal. According to the Mel Filter-like method [3], first, a full two-level WPs decomposition
is carried out. This partitions the frequency axis into four bands, each of bandwidth of 1378 Hz. The
lowest band of 0~1378 Hz is further decomposed by applying again the full three-level WPs
decomposition. This divides the 0~1378 Hz band into eight subbands each of bandwidth of 172 Hz. The
frequency band of 1378~2756 Hz is further decomposed by applying two-level WPs decomposition,
thereby giving four subbands of 344 Hz. Next, the 2756~4134 Hz is selected to complete one-level
decomposition. This gives two bands each of bandwidth of 689 Hz. The frequency between 4134 and
5512 is also decomposed by one-level WPs. The results are summarized in Table 1.
Filter Number
Bandwith Frequency
Mel Scale
Mel Bandwidth
0
0
0
1
172
229
229
2
344
426
200
3
516
600
173
4
688
755
155
5
860
895
140
6
1032
1023
138
7
1204
1140
127
8
1376
1249
99
-3-
9
1721
1444
195
10
2065
1616
162
11
2409
1769
153
12
2756
1909
140
13
3445
2152
243
14
4134
2360
208
15
4823
2541
181
16
5512
2703
162
Table 1 Frequency bands for Mel Filter-Like WPs
3 Sub-Band Wavelet Packets Decomposition
It is well known that the sensitive-to-human-ear frequency scope is from 1 kHz to 6 kHz, and the most
sensitive frequency scope is from 1 kHz to 3 kHz. At first, we divide the frequency band into three bigger
bands. The first band from 0-1 kHz and the third band from 3-5.5 kHz are the wide dividing frequency
bands. The second band from 1-3 kHz will be divided into detailed frequency bands spacing similar to the
Mel scale.
First, a full four-level WPs decomposition is carried out. This partitions the frequency axis into 16 bands.
We choose the most-sensitive-to-human-ear frequency bands 1034~3101 Hz from these bands to be
further decomposed. The band of 1034~1378 Hz is further decomposed by applying again the full onelevel WPs decomposition. This divides the 1034~1378 Hz band into two subbands each of bandwidth of
172 Hz. The frequency band of 1034~1207 Hz is further decomposed by applying one-level WPs
decomposition, thereby giving two subbands of 87 Hz. Next, the band of 1378~1723 Hz, the band of
1723~2067 Hz and the band of 2067~2412 Hz are selected to complete one-level decomposition. This
gives six bands each of bandwidth of 172 Hz. From the relative insensitive-to-human-ear frequency band
-4-
3101~5512 Hz, we choose only the 3101~4134 Hz frequency band and the 4134~5512 Hz frequency
band as the feature extraction. As a consequence, the following table is obtained:
Filter Number
Bandwidth Frequency
Mel Scale
Mel Bandwidth
0
0
0
1
345
451
451
2
689
772
321
3
1034
1022
250
4
1120
1077
55
5
1207
1129
52
6
1378
1226
97
7
1552
1317
90
8
1723
1400
83
9
1897
1477
77
10
2067
1549
72
11
2242
1618
68
12
2412
1681
63
13
2756
1800
119
14
3101
1907
107
3445
2005
3790
2095
4134
2178
4479
2256
4823
2328
5168
2396
5512
2460
15
16
-5-
270
280
Table 2 Frequency bands for Sub-Band WPs
After performing decomposition by Mel Filter-Like WPs and Sub-Band WPs, energy in each of the
frequency bands is calculated. The log of energy is then calculated giving a total of 16 coefficients taken
as features.
In order to get a more robust acoustic feature effective for the automatic speech recognition system, the
static segment WPs coefficients plus dynamic WPs coefficients are given by WPs + delta WPs, where
WPs is the 16th order WPs decomposition coefficients applied to the windowed speech frame and delta
WPs coefficients are calculated by using a classical regression method. Now, we have a 32-dimensional
vector space to represent the speech utterance signal. The 16-dimension feature space is obtained by the
LDA transformation with minimal loss in discrimination information.
4 Experimental Results
The algorithm is applied to the T146 corpus which is designed and collected at the Texas Instruments
(TI). The TI46 corpus contains 16 speakers: 8 males and 8 females. There are 15 utterances of each
English digit (0~9) from each speaker: 10 designated as training tokens and 5 designed as testing tokens
in the proposed system. A speech independent English digits recognition system is built in the
experiment. In all our simulations, we chose output feature space dimensionality of m  16 .
To compare different feature extraction methods, we employed 16-dimensional feature vectors
compressed by the Discrete Cosine Transform method from the 40-band MFCC in our simulations. At the
same time, a speech recognition system is also created based on the method of general Mel Filter-Like
WPs. To test the robust performance of recognition systems at the noisy environment, we need to build
noisy speech data. The noisy speech data sets were created by adding white Gaussian noise to the clear
speech data sets according to different SNR values. The speech recognition rates were obtained on the
clear data sets, the noisy data sets when SNR=20 and when SNR=10. The test results are displayed in
Figure 1 as shown.
-6-
MFCC
Mel Filter-Like WPs
Sub-Band WPs
% recognition
100
90
80
70
60
50
CLEAR
SNR=20
SNR=10
AVERAGE
Figure 1 Speech recognition results
The proposed Sub-Band WPs method shows 3%, 2% and 4% recognition rate increase in the quite
environment, SNR=20 and SNR=10 than the general Mel Filter-Like WPs method respectively. The
general Mel Filter-Like WPs also shows better results than the traditional MFCC in the clear environment
and SNR=20 noisy condition. From above results, the Sub-Band WPs method shows relatively better
performance than the general Mel Filter-Like WPs method in different testing data sets. From Figure 1,
we can also observe that the average recognition rate of Sub-Band WPs method successfully outperforms
that of the MFCC and general Mel Filter-Like WPs from 80.3% and 84.3% to 87.3%. This observation
suggests that our proposed Sub-Band WPs method is especially useful when the feature dimension hold
16 same as that of MFCC and general Mel Filter-Like WPs.
5 Conclusions
In this paper, we have presented a Sub-Band WPs strategy for speech feature extraction. Speakersindependent experiments to recognize ten English digits using Sub-Band WPs method were carried out
successfully. Among different SNR values, the Sub-Band WPs method shows better recognition rates
than the traditional MFCC and the general Mel Filter-Like WPs methods. Experiment results suggest that
the proposed Sub-Band WPs method increase the speech recognition rate while the dimension of feature
doesn’t increase. The comparative study with the state-of-the-art method shows that our method is
superior.
-7-
References:
[1]
Potamifis,I., Fakotakis, N., Kokkinakis, G., “Improving the robustness of noisy MFCC features
using minimal recurrent neural networks” Neural Networks, 2000. IJCNN 2000, Proceedings of the
IEEE-INNS-ENNS International Joint Conference on, Volume:5, 2000 Page(s):271-276 vol.5
[2]
Wei-Wen Hung, Hsiao-Chuan Wang, “On the use of weighted filter bank analysis for the
derivation of robust MFCCs” IEEE Signal Processing Letters, Volume:8 Issue: 3, March 2001
Page(s):70-73
[3]
Shang-Ming Lee, Shi-Hau Fang, Jeih-welh Hung, Lin-Shan Lee, “Improved mfcc feature
extraction by pca-optimized filter-bank for speech recognition” Automatic Speech Recognition and
Understanding, 2001.ASRU ’01.IEEE Workshop on, 2001
[4]
Tufekci,Z., and Gowdy,J.N.:“Feature extraction using discrete wavelet transform for speech
recognition”. Proc. IEEE Southeastcon 2000, Nashville, USA, 2000, pp. 116-123
[5]
Farooq,O., and DattaS.: “Dynamic feature extraction by wavelet analysis”. Proc. 6yh Int. Conf.
Spoken Language Processing, Beijing, China, Oct. 2000, Vol.4, pp. 696-699
[6]
O.Farooq and S. Datta, “Mel Filter-Like Admissible Wavelet Packet Structure for Speech
Recognition” IEEE Signal Processing Letters, vol.8, No.7, July 2001.
[7]
E.Zwicker and E.Terhardt, “Analytical expression for critical band rate and critical bandwidth as a
function of frequency,” J.Acoust.Soc.Amer., vol.68,pp.1523-1525, Dec.1980
-8-
Download