DISTORTION-CLASS WEIGHTED ACOUSTIC MODELING FOR ROBUST SPEECH RECOGNITION UNDER GSM RPE-LTP CODING

Juan M. Huerta and Richard M. Stern
Carnegie Mellon University
Department of Electrical and Computer Engineering
Email: {juan, rms}@speech.cs.cmu.edu

ABSTRACT

We present a method to reduce the effect of full-rate GSM RPE-LTP coding by combining two sets of acoustic models during recognition, one set trained on GSM-distorted speech and the other trained on clean uncoded speech. During recognition, the a posteriori probabilities of an utterance are calculated as a sum of the posteriors of the individual models, weighted according to the distortion class each state in the model represents. We analyze the origin of the spectral distortion introduced by the RPE-LTP codec into the quantized long-term residual and discuss how this distortion varies according to phonetic class. For the Resource Management corpus, the proposed method reduces the degradation in frame-by-frame phonetic recognition accuracy introduced by GSM coding by more than 20 percent.

1. INTRODUCTION

Speech coding reduces the accuracy of speech recognition systems [4].
As speech recognition applications in cellular and mobile environments become more ubiquitous, robust recognition in these conditions becomes crucial. Even though the speech codec is only one of several factors that contribute to the degradation in recognition accuracy, it is necessary to understand and compensate for this degradation in order to achieve a system's full potential. We focus on the acoustic distortion introduced by the full-rate GSM codec [1]. This distortion can be traced to the quantization of the log-area-ratio (LAR) coefficients and to the quantization and downsampling performed in the RPE-LTP process. The distortion introduced in the residual signal affects recognition to a larger extent than the quantization that the LAR coefficients undergo [2].

In Section 2 of this paper, we discuss the origin of the distortion in the RPE-LTP codec. We observe that, depending on the "predictability" of the short-term residual signal, the RPE-LTP coder will be able to minimize the error in the quantized long-term residual. This predictability is later shown to be related to phonetic characteristics of the signal. In Section 3 we show that the relative spectral distortion introduced in the quantized long-term residual tends to be concentrated around three levels, and that the amount of relative spectral distortion can be loosely related to the relative degradation in frame recognition accuracy introduced by GSM coding. In Section 4 we separate the set of phonemes into clusters according to their relative spectral distortion. In Section 5 we describe a method to weight two sets of acoustic models according to the distortion categories introduced in Section 4. We describe the results of recognition experiments using these techniques in Section 6.

2. THE RPE-LTP CODEC AS A SOURCE OF ACOUSTIC DEGRADATION

The full-rate GSM codec decomposes the speech signal into a set of LAR coefficients and a short-term residual signal [1]. The LAR coefficients are quantized and transmitted to the decoder, while the short-term residual is segmented into subframes and coded by an RPE-LTP coder [3]. In this section we discuss the sources of degradation that exist in the RPE-LTP coder. Figures 1 and 2 show a simplified version of an ideal RPE-LTP codec and a simplified version of a real RPE-LTP codec, respectively.

Figure 1 depicts an ideal codec able to reconstruct a signal that is identical to the original. Even though this codec would not achieve any reduction of the bit rate of the input sequence, it helps illustrate the sources and nature of the distortion introduced by the actual RPE-LTP codec. The ideal codec works as follows: a time sequence x[n] (the short-term residual) enters the codec and is compared to a predicted sequence (a reconstruction of the short-term residual) produced by the predictor block (the LTP section). The difference between these two signals is what the predictor was unable to predict, which we will refer to as the "innovation sequence". In the absence of any acoustic degradation, our ideal codec will generate a reconstructed time sequence by adding together the innovation sequence and the predicted sequence. This reconstructed sequence will, by definition, be identical to the original time sequence x[n].
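The ideal predict-and-reconstruct loop described above can be sketched in a few lines of Python. This is a toy illustration only: the one-tap long-term predictor, its lag, and its gain are illustrative assumptions, not the actual GSM LTP search.

```python
import numpy as np

def ideal_codec_roundtrip(x, lag=40, gain=0.8):
    """Toy sketch of the ideal predict/innovate/reconstruct loop.

    The one-tap predictor with a fixed lag and gain is an illustrative
    assumption; the real GSM LTP estimates its lag and gain per subframe.
    """
    recon = np.zeros_like(x, dtype=float)
    innovation = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        # Predict the current sample from previously *reconstructed* samples.
        predicted = gain * recon[n - lag] if n >= lag else 0.0
        # The innovation is whatever the predictor could not anticipate.
        innovation[n] = x[n] - predicted
        # The ideal decoder adds back the exact innovation: lossless.
        recon[n] = predicted + innovation[n]
    return recon, innovation

x = np.random.default_rng(0).standard_normal(160)
recon, innov = ideal_codec_roundtrip(x)
assert np.allclose(recon, x)  # the ideal codec reconstructs x[n] exactly
```

The real coder differs from this sketch exactly where the `innovation` array is used: instead of the exact innovation, only a subsampled, quantized version reaches the decoder.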
In reality, the RPE-LTP coder does not transmit all the information necessary for the decoder to reconstruct the innovation sequence exactly. Instead, it sends subsampled and quantized information that results in a reconstructed innovation sequence (called the reconstructed long-term residual) that approximates the original innovation sequence to the extent possible. The degradation in the reconstructed signal will depend on the energy of the innovation signal, which in turn depends on how well the predictor module (the LTP section) can "follow" or predict the next subframe of the time sequence based on previous reconstructed subframes.

Figure 1. Simplified block diagram of an ideal RPE-LTP coder. [Diagram: input sequence, predictor, innovation sequence, reconstructed sequence.]

Figure 2. Simplified block diagram of an actual RPE-LTP coder. [Diagram: input sequence, predictor, subsampling and coding, reconstructed innovation, reconstructed sequence.]

3. THE RPE-LTP INDUCED SPECTRAL DISTORTION

3.1. Relative log spectral distortion introduced by the RPE-LTP coder

We use the relative log spectral distance defined below to measure the dissimilarity between the reconstructed and the original innovation sequence.
E(ω) is the power spectrum of the innovation sequence and E_R(ω) is the power spectrum of the quantized innovation sequence. We integrate the absolute value of the difference between the two log power spectra and normalize it by the integral of the log of the power spectrum of the original innovation signal:

L_1 = \frac{\int_0^{\pi} \left| \log E(\omega) - \log E_R(\omega) \right| \, d\omega}{\int_0^{\pi} \log E(\omega) \, d\omega}

3.2. Distribution of the relative log spectral distortion

We computed the relative log spectral distortion introduced by the RPE-LTP codec to the RM corpus. Figure 3 is a histogram that shows the log of the relative frequency of observing various levels of log spectral distortion. The horizontal axis is the amount of relative distortion observed in a frame. We observe that the counts are roughly clustered in three regions: low, middle, and high distortion, separated by breakpoints near 33 and 67 percent. We also observe that most of the frames suffer only a small amount of distortion, i.e., most of the time the LTP section of the codec does a reasonably good job of predicting the residual signal.

Figure 3. Histogram of the log distribution of the relative spectral distortion introduced to a large sample of frames.

3.3. Impact of relative log spectral distortion on phonetic recognition

In order to analyze the relation that exists between the degradation in recognition performance and the amount of relative log spectral distortion introduced by the RPE-LTP block, we performed two phonetic recognition experiments: one training and testing using clean (i.e., non-GSM-coded) speech data, and a second experiment using speech that underwent GSM coding. We computed the frame accuracy for each amount of relative spectral distortion, both when GSM coding was present and when it was absent. Figure 4 depicts how much the frame-based phonetic recognition error rate increased due to the introduction of GSM coding as a function of the relative spectral distortion. We observe that, generally speaking, the phonemes that suffered a greater amount of distortion due to GSM coding also suffered greater increases in frame error rate.

Figure 4. Relative degradation in frame-based phonetic accuracy as a function of the relative spectral distortion introduced by the RPE-LTP coder.
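A discrete, per-frame approximation of the L_1 relative log spectral distance of Section 3.1 can be sketched as follows. The FFT size, the small spectral floor, and the absolute value applied to the denominator (for numerical robustness) are our assumptions, not details from the text.

```python
import numpy as np

def relative_log_spectral_distortion(e, e_r, nfft=512, eps=1e-10):
    """Discrete approximation of the L1 relative log spectral distance.

    `e` is a frame of the original innovation sequence and `e_r` the
    reconstructed (quantized) one. The FFT size, flooring constant, and
    absolute value in the denominator are illustrative assumptions.
    """
    E = np.abs(np.fft.rfft(e, nfft)) ** 2 + eps
    E_r = np.abs(np.fft.rfft(e_r, nfft)) ** 2 + eps
    num = np.sum(np.abs(np.log(E) - np.log(E_r)))
    den = np.sum(np.abs(np.log(E)))
    return num / den

rng = np.random.default_rng(1)
frame = rng.standard_normal(160)          # one 20-ms frame at 8 kHz
quantized = np.round(frame * 4) / 4       # crude stand-in for RPE quantization
assert relative_log_spectral_distortion(frame, frame) == 0.0
assert relative_log_spectral_distortion(frame, quantized) > 0.0
```

An identical signal yields zero distortion, and any quantization of the frame yields a strictly positive value, matching the intended use of the measure.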
4. RELATING RELATIVE LOG SPECTRAL DISTORTION PATTERNS TO PHONETIC CLASSES

4.1. Clustering phonetic classes using the relative log spectral distortion

We grouped the 52 phonetic units used by our recognition system into phonetic clusters by incrementally clustering the closest histograms of the counts of the relative log spectral distortion for the frames associated with each phonetic unit in the corpus. The distance between each pair of histograms was calculated using normalized-area histograms. The clustering yielded five classes, as shown in Table 1.

A plot of the normalized histograms of the five classes is shown in Figure 5. These five classes exhibit a distinct pattern of distortion across the three regions defined in Section 3.2. Classes 1 and 3 have a large percentage of their counts in the high-distortion region and some presence in the middle region, while Classes 2 and 4 have more of their counts in the middle region. Class 5 is almost completely concentrated in the low-distortion region.

Figure 5. Histograms of the relative spectral distortion for the five phonetic classes of Table 1.

Class  Phonemes in class                                      Frame Acc.  Frame Acc., GSM  % Degradation
1      PD KD                                                  46.76%      44.10%           5%
2      IX B BD DD DH M N JH NG V W Y DX ZH AX SH              68.28%      58.34%           31.34%
3      D G K P T TD CH                                        70.88%      65.94%           16.96%
4      F HH S Z TH TS                                         72.39%      57.63%           53.43%
5      IY OW OY UW L R AA AE AH AXR EH AO IH ER AW AY UH EY   74.37%      67.90%           25.24%

Table 1. Phonetic classes generated by automatically clustering the phoneme distortion histograms.
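One possible implementation of the incremental clustering described in Section 4.1 is sketched below. The greedy pairwise merging and the L1 distance between area-normalized histograms are illustrative choices on our part; the text specifies only that normalized-area histograms were compared.

```python
import numpy as np

def merge_closest_histograms(hists, n_clusters):
    """Greedy agglomerative clustering of per-phoneme distortion histograms.

    `hists` maps phoneme name -> counts over distortion bins. Histograms
    are area-normalized; the L1 distance and mean-merge rule are assumed
    details, shown only for illustration.
    """
    clusters = {name: np.asarray(h, dtype=float) / np.sum(h)
                for name, h in hists.items()}
    while len(clusters) > n_clusters:
        names = list(clusters)
        best = None
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                d = np.abs(clusters[names[i]] - clusters[names[j]]).sum()
                if best is None or d < best[0]:
                    best = (d, names[i], names[j])
        _, a, b = best
        merged = (clusters.pop(a) + clusters.pop(b)) / 2.0
        clusters[a + "+" + b] = merged
    return clusters

# Toy counts over the [low, mid, high] distortion regions of Section 3.2.
hists = {"IY": [90, 8, 2], "AA": [85, 12, 3], "S": [10, 30, 60],
         "F": [12, 28, 60], "PD": [50, 10, 40]}
out = merge_closest_histograms(hists, 3)
assert len(out) == 3
```

In this toy run the two vowel-like histograms merge, the two fricative-like histograms merge, and the remaining unit stays alone, mirroring how broad phonetic classes fall out of the distortion patterns.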
Classes 1, 3, and 4 encompass the majority of the consonants, Class 4 being mostly fricatives. Class 2 includes the remaining consonants and a couple of vowels, while Class 5 encompasses the vowels, diphthongs, and semivowels. Silence units were omitted from this analysis. We can see that even though the clusters do not exactly correspond to traditional phonetic classes, they achieve a reasonable partition between broad classes of sounds. The pattern of distributions of relative log spectral distortion introduced by the RPE-LTP coder is strongly related to phonetic properties of the signal.

5. ACOUSTIC MODEL WEIGHTING

Acoustic modeling for HMM-based speech recognition commonly makes use of mixtures of Gaussian densities, with each distribution representing a set of tied states. The probability that an observed vector has been emitted by a certain state is thus expressed by

b_j(o_t) = \sum_{k=1}^{M_j} c[j,k] \, N(o_t, \mu_{jk}, C_{jk})

The term c[j,k] expresses the prior probability of the kth Gaussian component of the jth state of the HMM. For a given state j, the sum of c[j,k] over all k is equal to 1.

We can take into consideration the amount of distortion that a phonetic class undergoes while evaluating a probability using several models, by introducing a function f that weights the probability of model p depending on d_t, the distortion of the observed frame at time t, and on j, which indicates the model representing a given phonetic unit. The function f also depends on the prior probability of model p, c_p[j,k], so it can be seen as a weighted version of c_p[j,k]. The resulting posterior probability becomes

b_{j,d}(o_t) = \sum_{p=1}^{2} \sum_{k=1}^{M} f(c_p[j,k], d_t, j) \, N_p(o_t, \mu_{jk}, C_{jk})

The function f can also be dependent on the distortion class that the model represents. This way, f will weight the clean models more heavily for states that model phonemes that suffer small average GSM distortion. Alternatively, one can make the function f depend on knowledge of the instantaneous relative distortion d_t of each frame, if this information is available. Such a function should give more weight to the distorted models when the relative frame distortion is greater.
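A minimal sketch of the weighted observation probability b_{j,d}(o_t) follows, assuming diagonal covariances and taking f(c_p[j,k], d_t) = λ_p(d_t) · c_p[j,k] with λ = (1 − d_t, d_t) as one illustrative choice of weighting function (the actual optimized weights are not specified here).

```python
import numpy as np

def gaussian_diag(o, mu, var):
    """Diagonal-covariance Gaussian density N(o; mu, var)."""
    return np.exp(-0.5 * np.sum((o - mu) ** 2 / var)) / \
        np.sqrt(np.prod(2.0 * np.pi * var))

def weighted_observation_prob(o, models, d):
    """b_{j,d}(o_t): mixture over both model sets, weighted by distortion d.

    `models` holds two (priors, means, variances) triples for one tied
    state: clean-trained and GSM-trained. The interpolation
    lambda = (1 - d, d) is an illustrative assumption.
    """
    lambdas = (1.0 - d, d)  # more weight on GSM models as distortion grows
    total = 0.0
    for lam, (priors, means, variances) in zip(lambdas, models):
        for c, mu, var in zip(priors, means, variances):
            total += lam * c * gaussian_diag(o, mu, var)
    return total

rng = np.random.default_rng(2)
dim, M = 3, 2
clean = ([0.5, 0.5], rng.standard_normal((M, dim)), np.ones((M, dim)))
gsm = ([0.6, 0.4], rng.standard_normal((M, dim)), np.ones((M, dim)))
o = rng.standard_normal(dim)
p = weighted_observation_prob(o, [clean, gsm], d=0.2)
assert p > 0.0
```

Because the λ_p sum to one and each set of mixture priors sums to one, the combined expression remains a valid density over o_t.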
6. SPEECH RECOGNITION EXPERIMENTS

Phonetic recognition experiments were performed using the Resource Management corpus and the SPHINX-3 system. Our basic acoustic models consisted of 2500 states, each modeled by a mixture of two Gaussians. One set of models was trained on clean speech, while the other set was trained with speech that underwent GSM coding. We obtained our reference transcription by performing automatic forced alignment on the 1600 test utterances using their pronunciations. We computed both the phone accuracy and the frame accuracy. Phone accuracy is calculated considering insertions and deletions of phones in the hypothesis. We evaluated the frame accuracy by making a frame-by-frame comparison of the output of the phonetic recognizer and the forced alignment. Table 2 summarizes recognition accuracy for clean and GSM speech, using clean and GSM models. We also show the results when models are trained using both GSM-coded speech and clean speech (multi-style trained models).

Testing data  Training data            Phone Accuracy  Frame Accuracy
Clean         Clean                    63.3%           69.86%
GSM           Clean                    55.7%           61.20%
GSM           GSM                      60.1%           64.50%
GSM           GSM+Clean (multi-style)  60.1%           64.64%

Table 2. Baseline frame accuracy and phoneme recognition accuracy in Resource Management.
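As a quick arithmetic check, assuming the percent-degradation figures (Table 1 and Figure 4) are computed as the relative increase in frame error rate, the reported numbers can be reproduced from the accuracies:

```python
def relative_degradation(acc_clean, acc_gsm):
    """Percent increase in frame error rate when GSM coding is introduced."""
    err_clean = 100.0 - acc_clean
    err_gsm = 100.0 - acc_gsm
    return 100.0 * (err_gsm - err_clean) / err_clean

# Table 2: clean/clean vs. GSM/GSM frame accuracy differ by 5.36 points.
assert abs((69.86 - 64.50) - 5.36) < 1e-9

# Table 1, Class 2: 68.28% -> 58.34% frame accuracy gives ~31.34% degradation.
assert abs(relative_degradation(68.28, 58.34) - 31.34) < 0.05
```

The same formula reproduces the other rows of Table 1 (e.g., Class 3's 16.96%), which is how the definition was inferred.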
From Table 2 we can see that the presence of GSM codec distortion during training and recognition reduces frame accuracy by 5.36 points. Recognition accuracy can be improved somewhat by using multi-style trained models.

6.1. Experiments using weighted acoustic models

We performed recognition experiments using weighted models by considering different mixing factors for the two acoustic models, trained on GSM-coded and on clean speech. Table 3 shows the results of weighting both models equally. We also made the weights dependent on the phonetic class, using three automatically-clustered phonetic classes. While there is a modest advantage to weighting the models differently, either of these two weighting schemes provides better results than using multi-style models. The 3-class optimized weights reduce the degradation in frame error rate introduced by GSM coding by 27%, from 5.36% to 3.87%. The weighting improves recognition accuracy approximately equally for all three phonetic classes.

Weighted modeling          Phone Accuracy  Frame Accuracy
Equal weights              61.2%           65.77%
3-class optimized weights  61.2%           65.99%

Table 3. Frame accuracy obtained when using equally weighted models and when using phonetic-class-based weighting.

7. CONCLUSIONS

In this paper we examined the distortion introduced by the RPE-LTP GSM codec into the short-term residual signal. We related this distortion to the relative spectral distortion produced by the quantized long-term residual. We observed that this distortion influences recognition performance, and that it can be related to phonetic properties of the speech signal. We introduced a method to mix two sets of acoustic models, trained on clean and on distorted speech, by taking into consideration the amount of distortion suffered by each phonetic class.

8. REFERENCES

[1] European Telecommunication Standards Institute, "European digital cellular telecommunications system (Phase 2); Full rate speech processing functions (GSM 06.01)," 1994.

[2] Huerta, J. M. and Stern, R. M., "Speech Recognition from GSM Codec Parameters," Proc. ICSLP-98, 1998.

[3] Kroon, P., Deprettere, E. F., and Sluyter, R. F., "Regular-Pulse Excitation - A Novel Approach to Effective and Efficient Multipulse Coding of Speech," IEEE Trans. on Acoustics, Speech and Signal Processing, 34:1054-1063, 1986.

[4] Lilly, B. T. and Paliwal, K. K., "Effect of Speech Coders on Speech Recognition Performance," Proc. ICSLP-96, 1996.