Speech Feature Analysis Using Step-Weighted Linear Discriminant Analysis

Jiang Hai, Er Meng Joo
School of Electrical and Electronic Engineering, Nanyang Technological University
S1-B4b-06, Nanyang Avenue, Nanyang Technological University, Singapore 639798
Telephone: (65) 67905472
EDICS: 2.SPEE

Abstract: In the speech feature extraction procedure, a simple strategy to increase the discriminative power of the feature vectors is to append their deltas. The dimension of the feature vector then grows considerably, so an effective way of reducing the feature space dimension is key to computational performance. In this paper, a step-weighted linear discriminant dimensionality reduction technique is proposed. Dimensionality reduction using linear discriminant analysis (LDA) is commonly based on optimizing certain separability criteria in the output space. The resulting optimization problem is linear, but these separability criteria are not directly related to the classification accuracy in the output space. As a result, even the best weighting function chosen in the input space can result in poor classification of the data in the output space. With the step-weighted linear discriminant dimensionality reduction technique, the weighting function of the between-class scatter matrix is readjusted in the current output space each time one dimension is removed. We describe this method and present an application to a speaker-independent isolated digit recognition task.

Keywords: Dimensionality reduction, Linear Discriminant Analysis, Speech recognition

  n, k, l                  Counters over patterns and classes
  N                        The total number of data samples
  K                        The total number of classes
  d^{(kl)}                 The Euclidean distance between the means of class k and class l in the input space
  w(d)                     The weight function
  n_k                      The number of training vectors in class k
  x_n^k, k                 A training pattern and its class label, for n = 1, ..., N and k = 1, ..., K
  S_W, \tilde{S}_W         The within-class scatter matrices in the input space and output space
  S_B, \tilde{S}_B         The between-class scatter matrices of the class means in the input space and output space
  v_k, \tilde{v}_k         The mean of class k in the input space and output space
  v, \tilde{v}             The global sample mean in the input space and output space
  T_n                      The transformation matrix formed by [\phi_1, \phi_2, ..., \phi_n]
  \Phi: x -> y             The mapping from the input space to the output space
  n, m                     The dimensions of the input space and output space

Table 1. Notation Conventions Used in This Paper

1. Introduction

Dimensionality reduction is the process of mapping high-dimensional patterns to a lower-dimensional subspace, and it is typically used as a preprocessing step in classification applications. The optimality criterion of choice for classification purposes is the Bayes error, the minimum achievable classification error given the underlying distribution. However, estimating the Bayes error is a time-consuming and unreliable task. Because of the difficulty of estimating the Bayes error directly, linear projections based on scatter matrices are quite popular for dimensionality reduction aimed at classification. The optimality criterion of Fisher's linear discriminant is J = tr(S_W^{-1} S_B) [1]. It is common to apply linear discriminant analysis (LDA) in statistical pattern classification tasks to reduce computation and to decrease the dimension. The LDA transformation attempts to reduce the dimension while keeping most of the discriminative information in the feature space. Recently, LDA and improved variants of LDA have been applied to several problems, such as face recognition [2][3] and speech recognition [4]. In speech recognition tasks, the feature space dimension is often increased by extending the feature vector with data from a range of neighboring frames.
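For instance, the experiments in Section 4 append first-order deltas to 16 Mel-frequency cepstral coefficients per frame. The Python sketch below illustrates this kind of extension; it is only a sketch, not the paper's actual front end: the regression window half-width, the edge padding and the placeholder MFCC matrix are assumptions made here for illustration.

import numpy as np

def append_deltas(mfcc, window=2):
    # mfcc: (num_frames, num_coeffs) matrix of cepstral coefficients.
    # window: assumed half-width of the delta regression window.
    # Returns a (num_frames, 2 * num_coeffs) matrix with deltas appended.
    padded = np.pad(mfcc, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = np.zeros_like(mfcc)
    num_frames = len(mfcc)
    for k in range(1, window + 1):
        # Standard regression formula: delta_t ~ sum_k k * (c_{t+k} - c_{t-k})
        deltas += k * (padded[window + k:window + k + num_frames]
                       - padded[window - k:window - k + num_frames])
    return np.hstack([mfcc, deltas / denom])

# 16 MFCCs per frame become a 32-dimensional feature vector, as in Section 4.
frames = np.random.randn(120, 16)       # placeholder MFCCs for one utterance
features = append_deltas(frames)        # shape (120, 32)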
Extending the feature vector in this way noticeably increases the discriminative power of the feature space, but the computation becomes impractical at the same time, so efficiently compressing the dimension of the feature space is very useful in speech signal processing. Because an optimality criterion based on scatter matrices is in general not directly related to classification accuracy, a weighted scatter matrix is often constructed in which smaller distances are more heavily weighted than larger distances [5]. However, this scatter matrix is calculated in the input space, and its value is far from the true output-space scatter once the dimensionality is reduced by more than one dimension. Rohit Lotlikar and Ravi Kothari proposed the fractional-step dimensionality reduction method to overcome this problem [6]. However, they considered only the between-class scatter matrix and did not compute the within-class scatter matrix, which loses useful discriminant information during the projection. In this paper, we introduce the concept of step-weighted dimensionality reduction, wherein the dimensionality is reduced from n to m (m < n) one dimension at a time. In addition to describing the step-weighted LDA algorithm, we present an application to the speaker-independent isolated digit word recognition problem.

2. The Conventional LDA

The LDA problem is formulated as follows [7]. Let x \in R^n be a feature vector. We seek a transformation \Phi: R^n -> R^m with m < n such that minimum loss of discrimination occurs in the transformed space. In practice, m is much smaller than n. A common form of the optimality criterion to be maximized is J = tr(S_W^{-1} S_B). In classical LDA, the input-space between-class and within-class scatter matrices are defined by

  S_B = \sum_{k=1}^{K} n_k (v_k - v)(v_k - v)^t                          (1)

  S_W = \sum_{k=1}^{K} \sum_{n=1}^{n_k} (x_n^k - v_k)(x_n^k - v_k)^t     (2)

  v_k = \frac{1}{n_k} \sum_{n=1}^{n_k} x_n^k                             (3)

  v = \frac{1}{N} \sum_{k=1}^{K} n_k v_k                                 (4)

LDA maximizes, in some sense, the ratio of the between-class and within-class scatter matrices after the transformation. This makes it possible to choose a transform that keeps the most discriminative information while reducing the dimension. Precisely, we want to maximize the objective function

  max_\Phi |\Phi^t S_B \Phi| / |\Phi^t S_W \Phi|

The columns of the optimal \Phi are the generalized eigenvectors corresponding to the m largest-magnitude eigenvalues of

  S_B \phi = \lambda S_W \phi                                            (5)

3. Step-Weighted Linear Discriminant Analysis (SW-LDA)

Because the definition of the between-class scatter matrix is not directly related to classification accuracy, a weighted scatter matrix is often constructed in which smaller distances are more heavily weighted than larger distances [5]:

  S_B = \sum_{k=1}^{K} \sum_{l=1}^{K} w(d^{(kl)}) n_k n_l (v_k - v_l)(v_k - v_l)^t     (6)

In conventional LDA, if we wish to reduce the dimensionality from n to m (with m much smaller than n), we would compute S_B and its eigenvectors \phi_1, \phi_2, ..., \phi_n and obtain the m-dimensional representation spanned by \phi_1, \phi_2, ..., \phi_m. When there are enough classes, it is quite possible that a pair of classes is oriented along \phi_n or one of \phi_{n-1}, \phi_{n-2}, ..., \phi_{m+1}. Because \phi_1, \phi_2, ..., \phi_n are mutually orthogonal, the two classes then overlap heavily in the m-dimensional space. Although the two classes are well separated in the original space, they are not sufficiently weighted in computing S_B once several dimensions have been projected away. We therefore compress the data gradually, one dimension per step.
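Before detailing the step-wise procedure, the following sketch makes Eqs. (1)-(5) concrete: it builds the scatter matrices from labeled data and solves the generalized eigenproblem of Eq. (5) for the projection. It is a minimal illustration under assumed inputs (a data matrix X of row vectors and integer labels y), not the recognizer's implementation; the weighted between-class scatter of Eq. (6) would replace S_B with the pairwise weighted sum.

import numpy as np
from scipy.linalg import eigh

def lda_transform(X, y, m):
    # X: (N, n) training matrix, y: (N,) class labels, m: target dimension.
    # Returns the (n, m) matrix whose columns are the generalized eigenvectors
    # of S_B phi = lambda S_W phi (Eq. 5) with the largest eigenvalue magnitudes.
    N, n = X.shape
    v = X.mean(axis=0)                          # global mean, Eq. (4)
    S_W = np.zeros((n, n))
    S_B = np.zeros((n, n))
    for k in np.unique(y):
        Xk = X[y == k]
        vk = Xk.mean(axis=0)                    # class mean, Eq. (3)
        S_W += (Xk - vk).T @ (Xk - vk)          # within-class scatter, Eq. (2)
        diff = (vk - v)[:, None]
        S_B += len(Xk) * (diff @ diff.T)        # between-class scatter, Eq. (1)
    vals, vecs = eigh(S_B, S_W)                 # generalized symmetric eigenproblem
    order = np.argsort(np.abs(vals))[::-1]
    return vecs[:, order[:m]]

# Example: reduce assumed 32-dimensional features to 24 dimensions.
X = np.random.randn(1000, 32)
y = np.random.randint(0, 10, size=1000)
Phi = lda_transform(X, y, 24)
Y = X @ Phi                                     # projected features, shape (1000, 24)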
At each dimension-reduction step, we recompute the between-class and within-class scatter matrices from the changed interclass and intraclass distances, rebuild the weighting function, and then compute the eigenvectors again. In this way, class centers that come closer together receive increasingly larger weights. The corresponding output-space between-class and within-class scatter matrices are defined by

  \tilde{S}_B = \sum_{k=1}^{K} n_k (\tilde{v}_k - \tilde{v})(\tilde{v}_k - \tilde{v})^t                                                  (7)

  \tilde{S}_B = \sum_{k=1}^{K} \sum_{l=1}^{K} w(\tilde{d}^{(kl)}) n_k n_l (\tilde{v}_k - \tilde{v}_l)(\tilde{v}_k - \tilde{v}_l)^t       (8)

  \tilde{S}_W = \sum_{k=1}^{K} \sum_{n=1}^{n_k} (y_n^k - \tilde{v}_k)(y_n^k - \tilde{v}_k)^t                                             (9)

  \tilde{v}_k = \frac{1}{n_k} \sum_{n=1}^{n_k} y_n^k                                                                                     (10)

  \tilde{v} = \frac{1}{N} \sum_{k=1}^{K} n_k \tilde{v}_k                                                                                 (11)

where y_n^k denotes a training pattern projected into the current output space. The entire procedure for reducing the dimensionality to m is as follows.

Step 1: Calculate S_B and S_W according to Eqs. (1) and (2);
Step 2: Compute the transformation matrix T_{n-1} = [\phi_1, \phi_2, ..., \phi_{n-1}] and reduce the feature space dimensionality from n to n-1;
Step 3: Calculate \tilde{S}_B and \tilde{S}_W according to Eqs. (8) and (9);
Step 4: Compute the transformation matrix T_{n-2} = [\phi_1, \phi_2, ..., \phi_{n-2}] and reduce the feature space dimensionality from n-1 to n-2;
Step 5: Repeat Steps 3 and 4 until the feature space reaches m dimensions;
Step 6: Compute the overall transformation matrix T = T_m * T_{m+1} * ... * T_{n-2} * T_{n-1};
Step 7: After the training procedure, use the transformation matrix T to project the observed feature vectors from n to m dimensions.
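A minimal sketch of Steps 1-7 is given below, again assuming a data matrix X of row vectors with integer labels y. For brevity the weighted matrices of Eqs. (8)-(9) are applied at every step (the paper's Step 1 uses the unweighted Eqs. (1)-(2)), and the weighting function is parameterised as w(d) = d^p, matching the powers of d explored in Section 4; these are illustrative choices, not the exact implementation.

import numpy as np
from scipy.linalg import eigh

def weighted_scatters(Y, y, p):
    # Weighted between-class (Eq. 8) and within-class (Eq. 9) scatter matrices,
    # computed in the current output space Y.
    dim = Y.shape[1]
    classes = np.unique(y)
    means = {k: Y[y == k].mean(axis=0) for k in classes}      # Eq. (10)
    counts = {k: int(np.sum(y == k)) for k in classes}
    S_W = np.zeros((dim, dim))
    S_B = np.zeros((dim, dim))
    for k in classes:
        Yk = Y[y == k]
        S_W += (Yk - means[k]).T @ (Yk - means[k])
        for l in classes:
            if l == k:
                continue
            d = np.linalg.norm(means[k] - means[l])           # interclass distance
            diff = (means[k] - means[l])[:, None]
            S_B += (d ** p) * counts[k] * counts[l] * (diff @ diff.T)
    return S_B, S_W

def sw_lda(X, y, m, p=0):
    # Reduce X from n to m dimensions one dimension per step,
    # reweighting the between-class scatter after every projection.
    Y = X.copy()
    T = np.eye(X.shape[1])
    while Y.shape[1] > m:
        S_B, S_W = weighted_scatters(Y, y, p)                 # Step 1 / Step 3
        vals, vecs = eigh(S_B, S_W)
        order = np.argsort(np.abs(vals))[::-1]
        T_step = vecs[:, order[:Y.shape[1] - 1]]              # drop one dimension
        Y = Y @ T_step                                        # Step 2 / Step 4
        T = T @ T_step                                        # accumulate T, Step 6
    return T                                                  # apply as in Step 7: X @ T

# Example: compress assumed 32-dimensional features to 16 dimensions with w(d) = d^-8.
X = np.random.randn(1600, 32)
y = np.random.randint(0, 10, size=1600)
T = sw_lda(X, y, 16, p=-8)
Y_test = np.random.randn(50, 32) @ T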
4. Application to Speech Database

Our speech recognition experiments are based on an HMM-based speech recognizer for a speaker-independent isolated English digit task. The TI46 corpus of isolated words, designed and collected at Texas Instruments (TI), is used in the proposed system. The TI46 corpus contains 16 speakers: 8 males and 8 females. There are 15 utterances of each English digit (0-9) from each speaker: 10 designated as training tokens and 5 as testing tokens in the proposed system. The front-end features of this system are 16 Mel-frequency cepstral coefficients plus their deltas, so the original dimensionality of the speech feature space is 32. The recognition system is trained on 10 clean training tokens per speaker, giving a training corpus of 1600 speech utterances altogether. To test the robustness of the speech recognition, white noise is added to the clean testing utterances at different signal-to-noise ratios (SNR).

To apply LDA, the common weighted LDA (W-LDA) and the step-weighted LDA (SW-LDA) to our speech recognition system, we need labeled training data; the labels are derived from the digits of the training data. In all our simulations, we chose compressed feature space dimensionalities of m = 24 and m = 16. We present results obtained with each data set for the LDA, W-LDA and the proposed SW-LDA algorithms. For W-LDA and SW-LDA, the simulations were run with w(d) taken from the set {d^0, d^-2, d^-4, d^-6, d^-8, d^-10, d^-12, d^-14, d^-16}. For each choice of w(d), the training accuracy was noted.

[Figure 1: Recognition rate (%) versus the power of d for LDA, W-LDA and SW-LDA when the dimensionality is reduced from 32 to 24; panels show SNR = 100, SNR = 20, SNR = 10 and the average recognition rate.]

Figure 1 shows the speech recognition results when the dimension of the feature vectors is reduced from 32 to 24. The recognition accuracies show that SW-LDA generally performs better than the W-LDA method. SW-LDA is also better than the common LDA when the weighting function is d^0, d^-1 or d^-2. The best weighting function for SW-LDA is w(d) = d^0.

[Figure 2: Recognition rate (%) versus the power of d for LDA, W-LDA and SW-LDA when the dimensionality is reduced from 32 to 16; panels show SNR = 100, SNR = 20, SNR = 10 and the average recognition rate.]

Figure 2 shows the testing accuracies obtained with the common LDA, the conventional W-LDA and SW-LDA for different weighting functions. The recognition rates show that SW-LDA is better than W-LDA over the whole range of powers. SW-LDA also performs better than the conventional LDA for powers of d down to -8. The best weighting function for SW-LDA is w(d) = d^-8.

5. Conclusion

We have proposed a dimensionality reduction method based on SW-LDA. Using SW-LDA, one can obtain better dimensionality reduction performance than with the common LDA technique. When the dimensionality is reduced further, SW-LDA shows a relatively larger advantage over the conventional weighted LDA method. With SW-LDA, the speech recognition accuracy based on MFCC feature extraction increases more noticeably than with the common LDA and the weighted LDA.

References

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition, New York: Academic, 1990.
[2] Belhumeur, Hespanha and Kriegman, "Eigenfaces vs. Fisherfaces: recognition using class specific linear projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, July 1997.
[3] K. Etemad and R. Chellappa, "Discriminant analysis for recognition of human face images," J. Opt. Soc. Am. A, vol. 14, no. 8, pp. 1724-1733, 1997.
[4] Martin, Charlet and L. Mauuary, "Robust speech/non-speech detection using LDA applied to MFCC," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), vol. 1, 7-11 May 2001.
[5] Y., Y. Gao and H. Erdogan, "Weighted pairwise scatter to improve linear discriminant analysis," in Proc. ICSLP, vol. 4, pp. 608-611, 2000.
[6] R. Lotlikar and R. Kothari, "Fractional-Step Dimensionality Reduction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 6, June 2000.
[7] Duchene and S. Leclercq, "An Optimal Transformation for Discriminant Principal Component Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10, no. 6, November 1988.