VTLN Based Approaches for Speech Recognition with Very Limited Training Speakers

Sung Min Ban, Bo Kyung Choi, Young Ho Choi, Hyung Soon Kim
Department of Electronics Engineering, Pusan National University, Busan, South Korea
Email: {bansungmin, choibok15, choiyh, kimhs}@pusan.ac.kr

Abstract—In this paper, two approaches using vocal tract length normalization (VTLN) are examined to deal with the acoustic mismatch due to different speakers in automatic speech recognition, for the special case that training data is available only for a small number of speakers. One is the conventional VTLN approach, in which both training and test utterances are frequency warped according to a maximum likelihood (ML) based warping factor estimation scheme in order to normalize the speaker characteristics. The other approach is to build a virtually speaker-independent (SI) acoustic model using artificially generated multiple-speaker data, obtained by VTLN based frequency warping of the training utterances from the limited speakers. To compare the performance of the two approaches, Korean isolated word recognition experiments are performed with a small amount of training data from limited speakers. The experimental results show that the virtually SI acoustic model approach yields better performance than both the conventional VTLN approach and the baseline system in the case of very limited training speakers.

Keywords-speech recognition; vocal tract length normalization

I. INTRODUCTION
Mismatch between training and test environments degrades speech recognition performance. The sources of such mismatch include additive noise, channel distortion, inter-speaker variability and speaking rate, and many studies have sought to alleviate these discrepancies. To enhance speech signals contaminated by additive noise, Wiener filtering and spectral subtraction are widely used [1], [2]; these techniques provide reliable performance under stationary noise. Cepstral mean normalization (CMN) is a very simple and powerful tool for removing short-term convolutional channel distortion [3] and can be combined with other temporal modulation filtering methods [4]. Only a few studies have tried to normalize speaking rate; recently, a multiple acoustic modeling method was proposed [5], in which multiple feature sets were generated for various speaking rates by the continuous frame rate normalization technique [6].

In this paper, we focus on reducing the mismatch introduced by inter-speaker variability. To reduce this mismatch, various speaker adaptation and normalization techniques have been proposed, including maximum a posteriori (MAP) adaptation [7], maximum likelihood linear regression (MLLR) adaptation [8] and speaker clustering based adaptation such as the eigenvoice approach [9]. Eigenvoice adaptation is known as a fast speaker adaptation method because it needs to estimate only a small number of parameters to describe a particular speaker.

VTLN, one of the speaker normalization techniques, reduces the acoustic mismatch caused by different speakers by normalizing the vocal tract length of each speaker. Unlike the speaker adaptation techniques mentioned above, VTLN does not usually require additional adaptation data from the test speaker, which is a strong point when deploying the technique in real-world applications. Since vocal tract length is an intrinsic property of a speaker, the positions of the spectral formant peaks of speech vary from speaker to speaker [10]. More specifically, the positions of the formant peaks are inversely proportional to the vocal tract length, which results in the acoustic mismatch. In VTLN, normalization of the vocal tract length is performed by optimally warping the frequency axis in the process of feature extraction, thereby compensating for the differences in speaker characteristics.

In this paper, two approaches using VTLN are examined to deal with the acoustic mismatch due to different speakers in automatic speech recognition, for the special case that training data is available only for a small number of speakers. The target application of this special case is speech recognition for resource-limited languages, where obtaining sufficient speech data to build SI models is either not feasible or very difficult and costly.

This paper is organized as follows. In Section 2, the conventional VTLN method is introduced; building a virtually SI acoustic model is described in Section 3. Finally, the performance of the two described algorithms is evaluated in Section 4, and the conclusion of this paper is drawn.

II. CONVENTIONAL VTLN METHOD

In the process of VTLN, a warping factor is estimated to normalize the acoustic mismatch due to the difference in vocal tract lengths of individual speakers, and then the frequency axis is scaled according to this warping factor during feature extraction. Using the warped features from the training data, a normalized acoustic model is built, and the test data is also normalized by its estimated warping factor to match the normalized acoustic model.

The following equations represent two commonly used warping functions, the piece-wise linear warping function [11] and the bilinear warping function [12], both shown in Fig. 1:

\tilde{\omega} = \begin{cases} \alpha\omega, & \text{if } \omega < \omega_0 \\ b\omega + c, & \text{if } \omega \ge \omega_0 \end{cases} \quad (1)

\tilde{\omega} = \omega + 2\tan^{-1}\!\left(\frac{(1-\alpha)\sin\omega}{1-(1-\alpha)\cos\omega}\right) \quad (2)

[Figure 1. Warping functions: (a) piece-wise linear warping function; (b) bilinear warping function.]

Here, α is the warping factor representing speaker characteristics, and ω̃ is the frequency transformed from the unwarped frequency ω. ω0 is a fixed value that controls the bandwidth mismatch problem, and the values b and c can be calculated from ω0. The bilinear warping function in equation (2) depends only on the warping factor and, in contrast to the piece-wise linear function, is nonlinear. In Fig. 1(a), α > 1.0 corresponds to compressing the spectrum, α < 1.0 corresponds to stretching the spectrum, and α = 1.0 corresponds to the no-warping case. The same holds for the bilinear warping function in Fig. 1(b), and it is known that α = 0.42 corresponds to the Mel scale warping.
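As an aside, both warping functions are straightforward to implement. The sketch below is a minimal Python rendering of equations (1) and (2), not code from the paper; in particular, the default ω0 = 0.875π and the continuity and endpoint constraints used to derive b and c are our assumptions, since the paper only states that b and c can be calculated from ω0.

```python
import numpy as np

def piecewise_linear_warp(omega, alpha, omega0=0.875 * np.pi):
    # Equation (1). b and c are derived from two common constraints:
    # continuity at omega0 and mapping pi to pi (an assumption; the paper
    # only says b and c can be calculated from omega0).
    b = (np.pi - alpha * omega0) / (np.pi - omega0)
    c = alpha * omega0 - b * omega0
    return np.where(omega < omega0, alpha * omega, b * omega + c)

def bilinear_warp(omega, alpha):
    # Equation (2): nonlinear, depends only on the warping factor alpha.
    return omega + 2.0 * np.arctan(
        (1.0 - alpha) * np.sin(omega)
        / (1.0 - (1.0 - alpha) * np.cos(omega)))

# alpha > 1 compresses the spectrum, alpha < 1 stretches it.
omega = np.linspace(0.0, np.pi, 9)
print(piecewise_linear_warp(omega, alpha=1.10))
print(bilinear_warp(omega, alpha=0.42))  # roughly Mel-scale warping
```

Under these conventions one can verify numerically that both functions map 0 to 0 and π to π, so the warped spectrum keeps its original bandwidth.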
In the maximum likelihood (ML) method, the optimal warping factor is obtained by maximizing the likelihood function as follows:

\hat{\alpha}_i = \arg\max_{\alpha} \, p(x_i^{\alpha} \mid \lambda, w^u) \quad (3)

Here, x_i^α denotes the feature vectors of speaker i normalized by warping factor α, and w^u denotes the transcription of the unwarped feature vectors x_i. In the training stage, the warping factor α̂_i is estimated by applying the unwarped acoustic model λ and the transcription w^u to equation (3), and the feature vectors are normalized using the estimated warping factor. After the normalized model λ_N is obtained from the warped features, λ is substituted by λ_N in equation (3), and this process is iterated until no additional changes in the estimated warping factors are observed.

A block diagram of the VTLN based speech recognition method is shown in Fig. 2. To estimate the warping factor for test data, a transcription of the test utterance is required; it is obtained from a 1st-pass speech recognition step using the unwarped feature vectors. Among the possible warping factors in the acceptable range, the optimal warping factor is chosen to maximize the likelihood of the warped feature vectors. Then, the final (or 2nd-pass) recognition result is determined using the feature vectors warped with the optimal warping factor.

[Figure 2. VTLN based speech recognition [10].]
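In practice, the maximization in equation (3) is a grid search over a small set of candidate warping factors, interleaved with model re-training. The following sketch shows one plausible realization of that loop under stated assumptions: extract_features(utt, alpha) (feature extraction with frequency warping), align_loglik(model, feats, transcript) (forced-alignment log-likelihood) and train_model(...) are hypothetical helpers standing in for a real HMM toolkit, not APIs from the paper.

```python
import numpy as np

# Thirteen candidate warping factors evenly spaced over [0.88, 1.12],
# the grid used in Section IV.
ALPHAS = np.linspace(0.88, 1.12, 13)

def estimate_warp(speaker_data, model, extract_features, align_loglik):
    # Equation (3): pick the alpha that maximizes the likelihood of the
    # warped features given the current model and the transcription.
    def total_loglik(alpha):
        return sum(align_loglik(model, extract_features(utt, alpha), txt)
                   for utt, txt in speaker_data)
    return max(ALPHAS, key=total_loglik)

def vtln_train(speakers, model, extract_features, align_loglik,
               train_model, max_iter=10):
    # Alternate per-speaker warp estimation and model re-training until
    # the estimated warping factors stop changing.
    warp = {spk: 1.0 for spk in speakers}
    for _ in range(max_iter):
        new_warp = {spk: estimate_warp(data, model, extract_features,
                                       align_loglik)
                    for spk, data in speakers.items()}
        if new_warp == warp:  # warping factors have converged
            break
        warp = new_warp
        # Build the normalized model lambda_N from warped features and
        # substitute it for lambda in the next round of equation (3).
        model = train_model([(extract_features(utt, warp[spk]), txt)
                             for spk, data in speakers.items()
                             for utt, txt in data])
    return model, warp
```

At test time, the same grid search is run with the 1st-pass hypothesis in place of the reference transcription, and the utterance is re-decoded with the winning α.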
III. VIRTUALLY SI ACOUSTIC MODEL

In the case of speech recognition for resource-limited languages, it is difficult to guarantee stable recognition performance over a variety of test speakers. Employing the conventional VTLN method can resolve the problem to a certain extent. Alternatively, we can approach the problem by building a virtually SI acoustic model. Generally, building an SI model requires speech data from many speakers, but collecting such data may be infeasible or very costly for resource-limited languages. In this paper, instead of collecting a large amount of speech data from many speakers, a method to build a virtually SI model is proposed to ensure the diversity of acoustic characteristics of various speakers. For this purpose, a number of feature sets are extracted according to multiple warping factors to cover the variety of speaker characteristics, and these feature sets are then used to build the SI model. The range of warping factors for speaker i is defined as

\max[\alpha_{\min}, \hat{\alpha}_i - \beta] \le \alpha_i^{SI} \le \min[\alpha_{\max}, \hat{\alpha}_i + \beta] \quad (4)

Here, α̂_i is the optimal warping factor of speaker i obtained from equation (3), and the various α_i^SI values within this range are applied in building the virtually SI acoustic model. 2β represents the width of the range of possible warping factors, and α_min and α_max are the lower and upper limits of the warping factors, respectively, to prevent excessive warping.
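Equation (4) simply clips a symmetric interval of half-width β around each speaker's estimated warping factor to [α_min, α_max]. Below is a minimal sketch of how the virtually SI training set could be assembled, reusing the hypothetical extract_features helper introduced above; sampling the interval at evenly spaced points is our assumption, since the paper states only that 13 warping factors are applied per utterance.

```python
import numpy as np

ALPHA_MIN, ALPHA_MAX = 0.8, 1.2   # limits used in Section IV-B
N_FACTORS = 13                    # warping factors applied per utterance

def si_warp_range(alpha_hat, beta, n=N_FACTORS):
    # Equation (4): warping factors covering
    # [max(alpha_min, alpha_hat - beta), min(alpha_max, alpha_hat + beta)].
    lo = max(ALPHA_MIN, alpha_hat - beta)
    hi = min(ALPHA_MAX, alpha_hat + beta)
    return np.linspace(lo, hi, n)

def virtually_si_training_set(speakers, warp, beta, extract_features):
    # Warp every utterance of every speaker with each factor in its
    # range, turning data from a few speakers into a training set that
    # mimics the acoustic diversity of many speakers.
    training_set = []
    for spk, data in speakers.items():
        for alpha in si_warp_range(warp[spk], beta):
            training_set += [(extract_features(utt, alpha), txt)
                             for utt, txt in data]
    return training_set
```

The resulting multi-warp feature set is then used to train the SI model exactly as in ordinary HMM training.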
IV. EXPERIMENTAL RESULTS

We compare the virtually SI model with the conventional VTLN method. To evaluate the performance of these methods, Korean isolated word recognition experiments are performed. The phonetically optimized words (POW) DB [13] and the phonetically balanced words (PBW) DB [14] are used for training and test, respectively. To construct the very limited training DB set from the POW DB, only 4 female and 4 male speakers are randomly selected. The PBW DB contains 452 words from 32 female and 38 male speakers, and about 31,000 utterances are used for test. A hidden Markov model (HMM) based acoustic model is trained; each model has 3 states with 6 mixtures, and the baseline system uses 150 tied states obtained by tree-based clustering (TBC). We use 39-dimensional Mel-frequency cepstral coefficients (static: 13, delta: 13, delta-delta: 13) including C0.

A. Conventional VTLN method

In the VTLN based speech recognition process, the optimal warping factor for each speaker is chosen from a set of 13 factors evenly spaced over the range from 0.88 to 1.12 [10]. Fig. 3 shows the distribution of the warping factors estimated from the test data; these various warping factors are normalized properly by conventional VTLN.

[Figure 3. Distribution of the estimated warping factors (histogram of counts over the 13 warping factors from 0.88 to 1.12).]

Table I shows the performance of the VTLN based speech recognition according to the number of training iterations. As expected, warping both the training and the test data gives better performance than warping the training data only. In the case of the warped test data, the best performance is observed at the 3rd iteration.

Table I. Performance (recognition rate, %) of the VTLN based speech recognition with respect to the normalized acoustic model

Iteration            1      2      3      4      5
Unwarped test data   82.40  82.57  82.28  80.41  79.86
Warped test data     82.53  83.22  83.61  82.94  83.07

B. Virtually SI acoustic model

In training the virtually SI acoustic model, 13 warping factors are applied to each utterance in the process of feature extraction. The range of warping factors is bounded by α_min = 0.8 and α_max = 1.2 in equation (4). Fig. 4 shows the speech recognition performance of the virtually SI acoustic model according to β, assuming α̂_i = 1.0. As shown in the figure, the best result is obtained at β = 0.2, which outperforms both the baseline acoustic model and the VTLN based speech recognition, with error rate reductions of 18.5% and 15.0%, respectively.

[Figure 4. Performance of the virtually SI model method according to β: recognition rate (%) of the virtually SI model versus the conventional VTLN and baseline systems, for β from 0.04 to 0.20.]

V. CONCLUSION

In this paper, two approaches using VTLN were examined to deal with the mismatch introduced by inter-speaker variability when training data is available only for a small number of speakers. The experimental results show that the virtually SI acoustic model approach yields better performance than the conventional VTLN approach in the case of very limited training speakers. As future work, in order to improve the performance of the virtually SI acoustic model for resource-limited languages, we will examine selecting the warping factors used to build the virtually SI acoustic model according to an ML based measure of their reliability with respect to the acoustic model.

ACKNOWLEDGMENT

This work was supported by the Quality of Life Technology development program 10036438, "Development of speech synthesizer and AAC software for the visually and vocally impaired," funded by the Ministry of Trade, Industry and Energy of Korea.

REFERENCES

[1] J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proc. of the IEEE, vol. 67, no. 12, pp. 1586-1604, Dec. 1979.
[2] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113-120, Apr. 1979.
[3] O. Viikki and K. Laurila, "Cepstral domain segmental feature vector normalization for noise robust speech recognition," Speech Communication, vol. 25, pp. 133-147, 1998.
[4] C.-P. Chen and J. Bilmes, "MVA processing of speech features," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 257-270, 2007.
[5] S. M. Ban and H. S. Kim, "Speaking rate dependent multiple acoustic models using continuous frame rate normalization," in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference, Dec. 2012.
[6] S. M. Chu and D. Povey, "Speaking rate adaptation using continuous frame rate normalization," in Proc. ICASSP, pp. 4306-4309, Mar. 2010.
[7] C.-H. Lee, C.-H. Lin, and B.-H. Juang, "A study on speaker adaptation of the parameters of continuous density hidden Markov models," IEEE Trans. Signal Processing, vol. 39, no. 4, pp. 806-814, Apr. 1991.
[8] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol. 9, no. 2, pp. 171-185, Apr. 1995.
[9] R. Kuhn, P. Nguyen, J.-C. Junqua, L. Goldwasser, N. Niedzielski, S. Fincke, K. Field, and M. Contolini, "Eigenvoices for speaker adaptation," in Proc. of the 5th International Conference on Spoken Language Processing (ICSLP), vol. 5, pp. 1771-1774, 1998.
[10] L. Lee and R. Rose, "A frequency warping approach to speaker normalization," IEEE Trans. Speech and Audio Processing, vol. 6, no. 1, pp. 49-59, Jan. 1998.
[11] L. F. Uebel and P. C. Woodland, "An investigation into vocal tract length normalization," in Proc. of EUROSPEECH, Budapest, Hungary, 1999.
[12] A. Acero and R. M. Stern, "Robust speech recognition by normalization of the acoustic space," in Proc. of ICASSP, Toronto, Canada, 1991.
[13] Y. Lim and Y. Lee, "Implementation of the POW (phonetically optimized words) algorithm for speech database," in International Conference on
[14] Y.-J. Lee, B.-W. Kim, J.-J. Kim, O.-Y. Yang, and S.-Y. Lim, "Some considerations for construction of PBW set," in Proc. of the 12th Workshop on Speech Communications and Signal Processing, Acoustical Society of Korea, pp. 310-314, June 1995.