Discriminative Disfluency Modeling for Spontaneous Speech Recognition

Chung-Hsien Wu and Gwo-Lang Yan

Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
{Chwu,yangl}@csie.ncku.edu.tw

Abstract

Most automatic speech recognizers (ASRs) have concentrated on read speech, which differs from speech containing disfluencies. These ASRs cannot handle speech with a high rate of disfluencies such as filled pauses, repetitions, repairs, false starts, and silence pauses, as found in actual spontaneous speech or dialogues. In this paper, we focus on the modeling of the filled pauses "ah" and "um." These filled pauses exhibit nasal and lengthening characteristics, and the acoustic parameters for these characteristics are analyzed and adopted for disfluency modeling. A Gaussian mixture model (GMM), trained by a discriminative training algorithm that minimizes the recognition error, is proposed. A transition probability density function is defined from the GMM and used to weight the transition probability between the boundaries of the fluency and disfluency models in the one-stage algorithm. Experimental results show that the proposed method yields an improvement rate of 27.3% on disfluent speech compared to the baseline system.

1. Introduction

In recent years, ASR has been in growing demand for applications such as dialogue systems, call managers, and weather-forecasting systems. The most noticeable problem is the poor recognition rate on disfluent speech, because spontaneous speech is punctuated with and interrupted by a wide variety of seemingly meaningless words such as "ah" and "um." Disfluencies include filled pauses, repetitions, repairs, false starts, and silence pauses. Any kind of disfluency breaks the smooth speaking style of an utterance and therefore degrades ASR performance.

Most of the research on disfluency reported in the literature deals with read speech and treats the phenomena with a general recognition model [1]. These speech recognizers are typically HMM-based and accept only fluent read or planned speech without disfluency. They have difficulty dealing with filled pauses and word lengthening because the duration of a phone lengthens in varying ways. Some previous studies [2][3] focus on the language model to overcome and correct the recognition errors caused by disfluency; these works either take the disfluency into account in the language model or skip the disfluent words. This is still not effective enough for filled pauses, because a pause can be inserted at almost any position. Other studies [4][5] analyze the prosody of disfluent speech, exploiting F0 and the spectrum to derive rules and parameters for detecting the position of a disfluency in an utterance.

In this paper, the filled pauses "ah" and "um" are investigated because the properties of these two disfluencies are similar. Recognizing these pauses is important because they play valuable roles in oral communication, such as signaling thinking and helping a speaker keep a conversation turn. The proposed system architecture is shown in Figure 1. The parameters of the input speech are analyzed according to the properties of these two kinds of disfluencies; they comprise MFCCs, delta MFCCs, formant 1 (F1), formant 2 (F2), and the formant magnitude ratio (FMR). A GMM [8], whose weights are estimated by a discriminative training algorithm that recursively minimizes the recognition error, is proposed to model these parameters. A transition probability density function is defined and used to weight the transition probability between the boundaries of the fluency and disfluency models in the one-stage algorithm.

Figure 1: System architecture of the speech recognizer for speech with disfluency. (Blocks: speech wave → MFCC/delta MFCC and F1/F2/FMR analysis → fluency and disfluency GMMs → estimation of boundary transition probability → one-stage algorithm → syllable recognizer with HMMs including "ah"/"um" models.)
2. Parameter Analysis of Filled Pauses "ah" and "um"

Since filled pauses can appear anywhere in speech when people talk with each other, the corpus of spontaneous conversations was collected from natural human-to-human dialogues. The spontaneous speech database contains over 30 hours of recorded speech, spoken by over 40 speakers of both sexes. According to our preliminary observation, the filled pauses "ah" and "um" can be characterized by two properties: lengthening and nasality. The acoustic analysis is described in the following sections.

2.1 Parameter for Lengthening Characteristic

During voice lengthening, the vocal cords vibrate periodically and the vocal tract is maintained in a relatively stable configuration throughout the utterance; in other words, the produced voice changes smoothly. Figures 2(a) and 2(b) show the waveform and spectrogram of the utterance "嗯…你好" (um … how are you). The lengthened voice "um" occurs at the beginning of the utterance, and its spectrogram is almost steady compared to the voice at the end of the utterance. Based on this property, cepstral coefficients, which model the vocal tract, are chosen as parameters. In our approach, we choose 12 mel-frequency cepstrum coefficients (MFCCs) and 12 delta MFCCs, which are useful for detecting steady portions of an utterance. Figure 2(c) shows that the 12 MFCCs and 12 delta MFCCs are stable over the lengthened, steady segment.

Figure 2: (a) Waveform, (b) spectrogram, and (c) 12 MFCCs and 12 delta MFCCs of the utterance "嗯…你好" (um … how are you)
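As an illustration of how the lengthening cue can be computed, the sketch below extracts the 12 MFCCs and 12 delta MFCCs and scores how steady the cepstral trajectory is. This is a minimal sketch rather than the system described above: the use of librosa, the 16 kHz sampling rate, and the frame-to-frame stability score are our assumptions.

```python
import numpy as np
import librosa

def lengthening_features(wav_path, n_mfcc=12):
    """12 MFCCs + 12 delta MFCCs, the parameters chosen in Section 2.1."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (12, T)
    dmfcc = librosa.feature.delta(mfcc)                     # shape (12, T)
    return np.vstack([mfcc, dmfcc])                         # shape (24, T)

def spectral_stability(feats):
    """Mean frame-to-frame change of the cepstral trajectory; lengthened
    segments such as "ah"/"um" keep a nearly constant vocal-tract shape,
    so low values match the steady spectrogram in Figure 2."""
    return float(np.mean(np.linalg.norm(np.diff(feats, axis=1), axis=0)))
```

A low stability score sustained over a sufficiently long run of frames flags the kind of steady, lengthened region visible at the start of Figure 2.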
2.2 Parameter for Nasal Effect

The second property of "ah" and "um" is the nasal effect. In the production of a nasal, the resonance characteristics are conditioned by the oral cavity forward and backward from the velum and by the nasal tract from the velum to the nostrils. This particular production procedure causes a characteristic formant change, and much research [6][7] has been reported on nasalized voice. The noticeable cues are the first two formants, F1 and F2, at about 300 Hz and 1000 Hz, compared to normal sounds with F1 at about 250-800 Hz and F2 at about 700-2500 Hz. Figure 3 shows the plot of F2 versus F1 for the vowels "a", "i", "u", "e", and "o", and the filled pauses "ah" and "um." The marks for "ah" and "um" (plotted as "x") fall outside the vowel triangle and can be clearly distinguished from the vowels. The frequencies F1 and F2 for "ah" and "um" therefore characterize the nasal sound well, so they are also chosen as parameters. The average frequencies of F1 and F2 for "ah" and "um" measured in our experiment are listed in Table 1.

Figure 3: Plot of F2 versus F1 for the vowels "a" (*), "i" (+), "u" (square), "e" (diamond), and "o" (o), and the filled pauses "ah" and "um" (x)

Table 1: The formant frequencies F1 and F2 of "ah" and "um"

                    "ah"       "um"
  Formant 1 (F1)    239 Hz     233 Hz
  Formant 2 (F2)    1021 Hz    1268 Hz

2.3 Formant Magnitude Ratio

Since the filled pauses have steady F1 and F2, the magnitudes of the formants play an important role in characterizing the disfluencies. A formant magnitude ratio is thus defined as

R = \frac{\mathrm{Magnitude}(F2)}{\mathrm{Magnitude}(F1)}    (1)

where Magnitude(F1) and Magnitude(F2) represent the magnitudes of F1 and F2. The ratio quantifies the decrease in formant magnitude when the voice is produced through the nostrils. Figure 4 shows the histogram of the formant magnitude ratio for "ah" and "um"; its mean is about 0.08. In contrast, Figure 5 shows the histogram of the formant magnitude ratio for normal voices, whose mean is about 0.8. The formant magnitude ratio is therefore highly characteristic of the nasal quality of "ah" and "um."

Figure 4: The histogram of FMR for "ah" and "um"

Figure 5: The histogram of FMR for normal voices
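Equation (1) can be realized with a standard all-pole (LPC) analysis of a voiced frame: the pole angles give F1 and F2, and the LPC spectrum evaluated at those frequencies gives their magnitudes. The following is a hedged sketch, not the authors' implementation; the LPC order, the sampling rate, and the frequency band used to pick formant candidates are our assumptions.

```python
import numpy as np
import librosa

def formants_and_fmr(frame, sr=16000, order=12):
    """Estimate F1, F2 and the formant magnitude ratio of Equation (1)
    from a single voiced frame. Assumes at least two resonances are
    found between 90 Hz and 4 kHz."""
    a = librosa.lpc(frame.astype(float), order=order)  # A(z) coeffs, a[0] = 1
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # one pole per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)   # pole angles -> Hz
    freqs = np.sort(freqs[(freqs > 90) & (freqs < 4000)])
    f1, f2 = freqs[0], freqs[1]                  # first two formants

    def magnitude(f):
        # |1 / A(e^{jw})|: the all-pole spectrum evaluated at frequency f
        w = 2 * np.pi * f / sr
        return 1.0 / np.abs(np.sum(a * np.exp(-1j * w * np.arange(len(a)))))

    fmr = magnitude(f2) / magnitude(f1)          # Equation (1)
    return f1, f2, fmr
```

With the distributions reported above, a frame whose ratio falls near 0.08 behaves like the nasalized filled pauses of Figure 4, while a value near 0.8 behaves like the normal voices of Figure 5.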
3. Disfluency Modeling Using GMM

3.1 Gaussian Mixture Model

The Gaussian mixture model is the most commonly used statistical model in speech and speaker recognition systems. In practice, the covariance matrix is usually assumed to be diagonal; this assumption discards the cross-correlations between parameters in exchange for computational efficiency. In speech or speaker recognition systems, the parameters are modeled as a class whose output probability is represented by a Gaussian mixture density. In the GMM [8], the output probability is computed from all mixtures, each weighted by its importance, because each mixture contributes differently to the output probability. The framework is depicted in Figure 6: the input parameters are fed into all Gaussian mixtures together with their corresponding weights, and the output probability is the weighted sum of the mixture outputs.

Figure 6: The framework of the GMM (input parameters feed M mixtures with weights W1, ..., WM; their weighted sum is the output probability)

In our approach, the parameters analyzed in Section 2 are modeled by a GMM with 16 Gaussian mixtures trained using the modified k-means algorithm. The weights are initially set to the fraction of parameter vectors assigned to each mixture out of the total number, and are then updated by gradient descent using the discriminative training algorithm described next.

3.2 Discriminative Training of Mixture Weights

In this work, we define a discriminative training framework tailored to the disfluency and fluency GMMs. A disfluency verification function is defined to form a linear discriminator whose weights are discriminatively trained. Given a disfluency GMM, GMM_H, and a fluency GMM, GMM_{\bar H}, the verification function is defined as

V(x_t; H) = \sum_{m=1}^{M} \left[ W_{H,m}\,GMM_{H,m}(x_t) - W_{\bar{H},m}\,GMM_{\bar{H},m}(x_t) \right]    (2)

where x_t represents the t-th parameter vector of the input speech, and W_{H,m} and W_{\bar{H},m} are the m-th mixture weights of the disfluency model H and the fluency model \bar{H}, respectively. The terms W_{H,m} GMM_{H,m}(x_t) and W_{\bar{H},m} GMM_{\bar{H},m}(x_t) represent the output probabilities of the disfluency and fluency GMMs. A loss function, representing a smooth functional form of the verification error, is defined and minimized with respect to the weights. It takes the form of a sigmoid:

R(W_m, x_t) = \frac{1}{1 + \exp[\beta\, b\, V(x_t; H)]}    (3)

W_m = [W_{H,m}, W_{\bar{H},m}]    (4)

b = \begin{cases} 1 & \text{if } x_t \in H \\ -1 & \text{if } x_t \in \bar{H} \end{cases}    (5)

where \beta is a constant that controls the steepness of the sigmoid function. According to the usual discriminative training methodology, an optimization criterion is defined to minimize the recognition error, and a gradient descent algorithm would ideally update the mixture weights iteratively. However, since the probability density function of x_t is not known, a gradient-based iterative procedure is used to minimize R as follows:

(W_m)_{n+1} = (W_m)_n - \epsilon\, \nabla R((W_m)_n, x_t)    (6)

where \epsilon is the step size and \nabla R((W_m)_n, x_t) is the gradient of the loss function with respect to W_m, estimated from the training samples.

3.3 Integration of One-Stage Algorithm and GMM

The one-stage algorithm is employed to calculate the local distances of each test frame against the state mixtures of the HMMs, applying two types of transition constraints: one for the interior of the models and one at the model boundaries. Finally, the n-best paths are backtracked. The minimum accumulated distance D(x_i, s, k) along any path to a grid point is defined as

D(x_i, s, k) = d(x_i, s, k) + \min\{D(x_{i-1}, s, k),\; D(x_{i-1}, s-1, k),\; D(x_i, s-1, k)\}    (7)

where d(x_i, s, k) is the local distance between the feature x_i and the s-th state of acoustic model k. At the model boundaries (s = 1), the boundary transition yields

D(x_i, 1, k) = d(x_i, 1, k) + \min\{D(x_{i-1}, 1, k),\; D(x_{i-1}, SN(k^*), k^*)\}    (8)

where SN(k^*) represents the number of states of model k^*. In traditional approaches, the model boundary transition probability is set to 1. In this paper, the boundary transition probability is instead determined by the GMM. The transition probability density function is defined as

TPDF_H(V) = G_H(V) = \int_{-\infty}^{V} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] dx    (9)

where V is the verification score calculated by Equation (2) and G_H is the cumulative normal distribution of the verification scores, with mean \mu and standard deviation \sigma estimated from the training examples tagged as disfluencies. The transition probability density function is embedded in the transition between model boundaries as

D(x_i, 1, k) = d(x_i, 1, k) + \min\{D(x_{i-1}, 1, k),\; BTP(x_i \mid k^*) \cdot D(x_{i-1}, SN(k^*), k^*)\}    (10)

where the boundary transition probability BTP(x_i | k^*) in the one-stage algorithm is defined as

BTP(x_i \mid k^*) = \begin{cases} TPDF_{k^*}(V(x_i; k^*)) & \text{if } k^* \in \text{Disfluency and } TPDF_{k^*}(V) \geq T \\ 0 & \text{if } k^* \in \text{Disfluency and } TPDF_{k^*}(V) < T \\ 1 & \text{if } k^* \notin \text{Disfluency} \end{cases}    (11)

where T is a threshold, chosen as 0.6 according to our experiments. This approach reduces the recognition errors caused by disfluent speech, and the GMM controls the model boundary transitions between the fluency and disfluency models.
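To make Equations (2)-(6) of Section 3.2 concrete, the sketch below performs one gradient step on the mixture weights of the two GMMs, holding the means and covariances fixed as in Section 3.1. It is a schematic reading of the update, not the authors' code: the Gmm container, the analytic gradient of the sigmoid loss, and the use of scipy are our assumptions.

```python
import numpy as np
from dataclasses import dataclass
from scipy.stats import multivariate_normal

@dataclass
class Gmm:
    means: np.ndarray  # (M, D) mixture means
    covs: np.ndarray   # (M, D) diagonal covariances (Section 3.1)
    w: np.ndarray      # (M,) discriminatively trained mixture weights

def densities(x, gmm):
    """Per-mixture Gaussian densities N(x; mu_m, diag(sigma_m))."""
    return np.array([multivariate_normal.pdf(x, mean=m, cov=np.diag(c))
                     for m, c in zip(gmm.means, gmm.covs)])

def mce_step(x, b, dis, flu, eps=1e-2, beta=1.0):
    """One iteration of Equation (6); b = +1 for a disfluency sample and
    -1 for a fluency sample, as in Equation (5)."""
    n_dis, n_flu = densities(x, dis), densities(x, flu)
    v = dis.w @ n_dis - flu.w @ n_flu          # Equation (2)
    r = 1.0 / (1.0 + np.exp(beta * b * v))     # Equations (3)-(5)
    g = beta * b * r * (1.0 - r)               # g = -dR/dV for this sigmoid
    dis.w += eps * g * n_dis                   # W <- W - eps * dR/dW, Eq. (6)
    flu.w -= eps * g * n_flu
```

In line with Section 3.1, the weights would be initialized to the per-mixture occupancy fractions from the modified k-means clustering and then refined by repeated calls to mce_step over the tagged training frames.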
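The next sketch shows how the boundary transition probability of Equations (9)-(11) can be folded into the one-stage search of Section 3.3. It is an illustrative, log-domain reading: rather than multiplying an accumulated distance as in Equation (10), it accumulates log-probabilities and adds log BTP at model-entry transitions; the model interface (a per-state log-likelihood function) and this scoring convention are our assumptions.

```python
import numpy as np
from scipy.stats import norm

def btp(v, mu, sigma, T=0.6, disfluent=True):
    """Equations (9) and (11): Gaussian CDF of the verification score v,
    with mu and sigma estimated on training tokens tagged as disfluencies,
    gated by the threshold T (0.6 in the experiments of Section 4)."""
    if not disfluent:
        return 1.0
    p = norm.cdf(v, loc=mu, scale=sigma)   # TPDF_H(v), Equation (9)
    return p if p >= T else 0.0

def one_stage(frames, models, v_scores, mu, sigma, T=0.6):
    """Forward pass of a one-stage search over left-to-right models.
    models: list of dicts {'n_states', 'loglik': f(frame, s), 'disfluent'};
    v_scores[i] is the verification score V(x_i) from Equation (2).
    Returns the accumulated log-probability grid of each model."""
    n = len(frames)
    D = [np.full((n, m['n_states']), -np.inf) for m in models]
    for i, x in enumerate(frames):
        # best model-final score at the previous frame, for cross-model entry
        prev_exit = -np.inf if i == 0 else max(
            D[k][i - 1, models[k]['n_states'] - 1] for k in range(len(models)))
        for k, m in enumerate(models):
            for s in range(m['n_states']):
                if s == 0:
                    if i == 0:
                        best = 0.0           # any model may start the utterance
                    else:
                        # Equations (8) and (10): stay in the entry state, or
                        # enter from the best exit, weighted by the BTP
                        p = btp(v_scores[i], mu, sigma, T, m['disfluent'])
                        enter = -np.inf if p <= 0.0 else prev_exit + np.log(p)
                        best = max(D[k][i - 1, 0], enter)
                else:
                    # Equation (7): stay in s, or advance from s-1
                    best = max(D[k][i - 1, s] if i > 0 else -np.inf,
                               D[k][i - 1, s - 1] if i > 0 else -np.inf,
                               D[k][i, s - 1])
                D[k][i, s] = m['loglik'](x, s) + best
    return D
```

Backtracking the n-best paths is omitted for brevity; the point is that a fluent model is always enterable (BTP = 1), while entry into an "ah"/"um" model is opened or closed by the GMM-based score.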
4. Experiments

4.1 Baseline System and Database

The baseline system for this work is an ASR with normal sub-syllable HMMs plus disfluency HMMs for "ah" and "um." The training corpus consists of two parts. The first part is TCC300, collected by three well-known universities in Taiwan; it contains 1642 sentences from male speakers and 1131 from female speakers, all without disfluencies. The second part was collected from 40 speakers and contains 240 sentences from male speakers and 160 from female speakers, all with the disfluencies "ah" and "um." The test data were collected separately and consist of 398 fluent and 127 disfluent sentences.

4.2 Experiments on Baseline System with GMM

An experiment was conducted to compare the performance of the baseline system and the baseline system with the GMM. Before the comparison, the threshold T in Equation (11) must be determined, so recognition was evaluated over different values of T; the result is shown in Figure 7. The average recognition rate decreases dramatically when T is smaller than 0.5, because the transition probability then licenses some incorrect boundary transitions in the one-stage algorithm. The best average recognition rate is achieved when T is chosen as 0.6.

Figure 7: Recognition rates for fluent sentences, disfluent sentences, and their average as a function of the threshold T (T swept from 1.0 down to 0.3)

The recognition rates with T = 0.6 are shown in Table 2. On fluent sentences, the baseline system performs slightly better than the baseline system with the GMM; this is because "ah" and "um" share phonetic properties with some sub-syllables, which causes a small degradation in recognition performance. On disfluent sentences, the baseline system with the GMM performs much better than the baseline system, achieving an improvement rate (relative error reduction, from a 26% to an 18.9% error rate) of 27.3%, because the transition probability density function correctly guides the model boundary transitions. The experimental results show that the transition probability generated from the GMM helps avoid incorrect transitions in the one-stage algorithm.

Table 2: Recognition rates (%) for the baseline system and the baseline system with GMM

                              Fluent sentences   Disfluent sentences
  Baseline system                   80.4                74.0
  Baseline system with GMM          79.1                81.1

5. Conclusion

In this paper, the properties of the filled pauses "ah" and "um" have been analyzed and modeled by a GMM. A discriminative training methodology with a gradient-based iterative procedure is employed to train the weights of the GMM. The GMM then generates the boundary transition probability for model transitions, which is integrated into the one-stage algorithm to produce the final recognition results. Experimental results show that the transition probability generated from the GMM helps avoid incorrect transitions in the one-stage algorithm. Without affecting the performance on fluent speech, a significant improvement on disfluent speech was achieved using the GMM.

6. Acknowledgment

The authors would like to thank the National Science Council, R.O.C., for its financial support of this work under Contract No. NSC89-2614-H-006-002-F20.
7. References

[1] A. Kai and S. Nakagawa, "Investigation on unknown word processing and strategies for spontaneous speech understanding," in Proc. Eurospeech '95, pp. 2095-2098, 1995.
[2] A. Stolcke and E. Shriberg, "Statistical language modeling for speech disfluencies," in Proc. ICASSP-96, vol. 1, pp. 405-408, 1996.
[3] M. Siu and M. Ostendorf, "Variable n-grams and extensions for conversational speech language modeling," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, pp. 63-75, Jan. 2000.
[4] M. Gabrea and D. O'Shaughnessy, "Detection of filled pauses in spontaneous conversational speech," in Proc. ICSLP 2000.
[5] M. Goto, K. Itou, and S. Hayamizu, "A real-time filled pause detection system for spontaneous speech recognition," in Proc. ICASSP 2000.
[6] G. Feng and E. Castelli, "Some acoustic features of nasal and nasalized vowels: A target for vowel nasalization," J. Acoust. Soc. Am., vol. 99, no. 6, pp. 3694-3706, 1996.
[7] D. Recasens, "Place cues for nasal consonants with special reference to Catalan," J. Acoust. Soc. Am., vol. 73, no. 4, pp. 1346-1353, 1983.
[8] F. Beaufays, M. Weintraub, and Y. Konig, "Discriminative mixture weight estimation for large Gaussian mixture models," in Proc. ICASSP 1999, vol. 1, pp. 337-340.