ACOUSTIC-PHONETIC SPEECH PARAMETERS FOR SPEAKER-INDEPENDENT SPEECH RECOGNITION

Om Deshmukh, Carol Y. Espy-Wilson and Amit Juneja
University of Maryland, Department of Electrical & Computer Engineering, A. V. Williams Bldg., College Park, MD 20742

ABSTRACT

Coping with inter-speaker variability (i.e., differences in the vocal tract characteristics of speakers) is still a major challenge for automatic speech recognizers. In this paper, we discuss a method that compensates for differences in speaker characteristics. In particular, we demonstrate that when a continuous-density hidden Markov model (HMM) based system is used as the back end, a Knowledge-Based Front End (KBFE) can outperform the traditional Mel-Frequency Cepstral Coefficients (MFCCs), particularly when there is a mismatch in the gender and ages of the subjects used to train and test the recognizer.

This work was supported by NSF grant #SBR9729688 and NIH grant #1K02DC00149.

1. INTRODUCTION

A major challenge for present speech recognition systems is tolerance to speaker differences and to variations in context and environment. Differences in speaker characteristics are a major source of inter-speaker variation. A considerable amount of research has attempted to deal with one aspect of this variability, the differences in vocal tract length across speakers, by developing Vocal Tract Length Normalization (VTLN) techniques [1,2]. The most popular VTLN technique performs speaker-specific piecewise linear frequency scaling of the Mel-Frequency Cepstral Coefficients (MFCCs) [3]. The overall improvement in word error rate (WER) obtained with this technique is usually on the order of 0.6% relative to results obtained without VTLN. Other methods use an exponential warping scale, or a one-Bark shift in the filter bank analysis of male speech relative to female speech [4].

In this paper, we extend previous work reported in [5], where we developed a front end consisting of acoustic parameters (APs) related to the manner phonetic features sonorant, syllabic, continuant, voiced and strident, in addition to silence. Recognition results showed that the relative measures used for the APs were considerably more speaker-independent than the MFCCs. Presently, we have extended the set of APs to include relative measures that extract information about some of the place phonetic features. Specifically, we focused on the place features alveolar and palatal for strident fricatives, and alveolar and labial for stop consonants. Additionally, APs were developed for the place features high, low, front and back for vowels, and rhotic for /r/.

Section 2 discusses the design procedure and the APs used. Section 3 gives the details of the HMM models and other system specifications. Results from the two recognition systems, the AP front end and the MFCC front end, are compared in Section 4. Conclusions and future work are outlined in Section 5.

2. KBFE

The phonetic components of a speech signal carry the information needed to interpret the linguistic message. To extract this information, we use a signal representation consisting of continuous-valued parameters that automatically extract the acoustic correlates of the phonetic features. In previous work [6], we showed the importance of using relative measures to reduce the speaker dependency of the APs. In particular, the frequency bands used to characterize the spectral shape of strident fricatives were based on the third formant (F3) estimated from each utterance. Similarly, in the present system, we have incorporated the spectral shape parameters Ahi-A23 and Av-Ahi proposed in [7] for characterizing alveolar and labial stop consonants. We have modified these parameters to make them more speaker-independent through F3 normalization. The APs for the place features high, low, front and back in vowels, and rhotic in /r/, were normalized by computing differences between formants in Hertz and in Bark; a schematic example of such relative measures is sketched below.

Note that at present, KBFE consists of only 28 APs, since we have not yet designed automatic and robust algorithms to extract the acoustic correlates of all of the 20-odd phonetic features. The 28 APs were designed to capture the acoustic correlates of only 13 of the phonetic features (there is some redundancy in the parameters that needs to be removed). Notably, we do not yet have algorithms for the manner features nasal and lateral, the place features alveolar, velar and labial for nasal consonants, the place features labial and dental for the nonstrident fricatives, and the place feature velar for stop consonants.
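To make the idea of a relative, speaker-normalized measure concrete, the following minimal Python/NumPy sketch ties a spectral-balance cue to the talker's own estimated F3 and measures formant spacing on the Bark scale. The function names, band edges and floor constants are our own illustrative assumptions, not the exact definitions used in KBFE.

```python
import numpy as np

def hz_to_bark(f):
    """Zwicker & Terhardt approximation of the critical-band rate (Bark)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def relative_band_measure(spectrum, freqs, f3):
    """Illustrative relative AP: dB ratio of energy above vs. below the
    talker's estimated F3.  Because the band edge moves with F3, the
    measure tracks the speaker's vocal-tract scale instead of fixed
    frequency bands (band edges here are hypothetical)."""
    power = np.abs(spectrum) ** 2
    hi = power[freqs >= f3].sum()
    lo = power[(freqs >= 100.0) & (freqs < f3)].sum()
    return 10.0 * np.log10((hi + 1e-12) / (lo + 1e-12))

def formant_difference_bark(f_a, f_b):
    """Speaker-normalized place cue: formant spacing measured in Bark."""
    return hz_to_bark(f_a) - hz_to_bark(f_b)
```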
3. SYSTEM DESCRIPTION

HTK [8], an HMM-based sub-word recognition system, was used for the back-end processing. Thirteen Mel-frequency cepstral coefficients, along with their first and second derivatives, were computed every 5 ms over a 25 ms window, using 30 channels in the Mel filter bank calculations. Zero-mean normalization was also incorporated. The APs were likewise computed every 5 ms over a window of 25 ms.

3.1. Model Topology

Ten HMM models, one for each digit, were developed. Different topologies with various tying schemes were tried out, and the one that gave optimal results with the cepstral parameters was selected. The final topology was as follows: each digit was modeled as a three-state (plus two terminal states) HMM with eight Gaussian mixtures per state. Each mixture had a diagonal covariance and was initialized with zero mean and unit variance, and all mixture weights in all states were initialized to the same value. Left-to-right state transitions with no skips were used, with the additional constraint that each model must start in its first state. All allowable transitions were initialized as equiprobable. The same topology was maintained for both the MFCCs and the KBFE; a sketch of it appears below.
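As a rough illustration of this topology, the sketch below builds one digit model with the hmmlearn package (our choice for illustration; the actual experiments used HTK). The three emitting states, eight diagonal-covariance mixtures, forced start in the first state, and equiprobable left-to-right transitions with no skips follow the description above; everything else is an assumption of the sketch.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

N_STATES, N_MIX = 3, 8  # three emitting states, eight mixtures per state

# One model per digit; covariance_type="diag" gives diagonal covariances.
model = GMMHMM(n_components=N_STATES, n_mix=N_MIX,
               covariance_type="diag",
               init_params="mcw",   # initialize means/covars/weights from data
               params="tmcw")       # re-estimate transitions and mixtures

# Each model must start in its first state.
model.startprob_ = np.array([1.0, 0.0, 0.0])

# Left-to-right with no skips; allowable transitions start out equiprobable.
# The final state only self-loops here, standing in for the exit transition
# that HTK models with a non-emitting terminal state.
model.transmat_ = np.array([[0.5, 0.5, 0.0],
                            [0.0, 0.5, 0.5],
                            [0.0, 0.0, 1.0]])

# Usage: model.fit(features, lengths), where `features` stacks the frames of
# all training utterances of one digit and `lengths` gives their frame counts.
```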
3.2. Databases

Recognition experiments were run on isolated digits from the TIDIGITS [9] and TI46 [10] databases. The adult speech was taken from TI46, whereas the children's speech was taken from TIDIGITS. The children's speech, originally sampled at 20 kHz, was downsampled to the 12.5 kHz sampling rate of the TI46 data. The suggested training set of TI46 has 160 utterances of each digit: 10 repetitions by each of 8 male and 8 female subjects. The test set has 256 utterances of each digit: 16 repetitions by each of the 8 male and 8 female subjects. The TIDIGITS training set has 102 utterances of each digit: two repetitions by each of 25 boys and 26 girls. The test set has 100 utterances of each digit: two repetitions by each of 25 boys and 25 girls. The boys were between 6 and 14 years of age and the girls between 7 and 15.

4. RESULTS

4.1. Experiments

A series of experiments was carried out to test the robustness of the parameters to gender and age variability. The performance benchmark was set by training and testing on subjects from the same group (adult training with adult testing, and children training with children testing). To test the effect of age variability, the system was trained on adult data and tested on children's data; in a second experiment, the training and test data were swapped. To test the effect of gender variability, the system was trained on males and tested on females, and trained on boys and tested on girls, and then run again with the training and testing subjects swapped.

4.2. Performance

Tables 1-3 show the overall word accuracy results for the two systems described above. In all tables, 'Tr' specifies the training data and 'Te' specifies the test data.

Table 1: Results for gender variability in adults (% accuracy).

                    Tr: Adult    Tr: Males    Tr: Females
                    Te: Adult    Te: Females  Te: Males
  MFCCs (39 pars)     99.88        70.27        68.29
  KBFE  (28 pars)     97.26        91.67        78.33

Table 2: Results for gender variability in children (% accuracy).

                    Tr: Children  Tr: Boys    Tr: Girls
                    Te: Children  Te: Girls   Te: Boys
  MFCCs (39 pars)      98.30        95.60       95.00
  KBFE  (28 pars)      94.27        96.56       92.33

Table 3: Results for age variability (% accuracy).

                    Tr: Adult  Tr: Child  Tr: Adult  Tr: Child
                    Te: Adult  Te: Child  Te: Child  Te: Adult
  MFCCs (39 pars)     99.88      98.30      60.20      62.37
  KBFE  (28 pars)     97.26      94.71      85.15      85.88

4.3. Discussion

The first column of Tables 1-3 shows that at present, KBFE does not perform quite as well as the MFCCs. Presumably, this slightly lower performance is due to the fact that KBFE is not yet a complete representation of the speech signal. As stated above, KBFE consists of only 28 APs; APs for some phonetic features needed for digit recognition, e.g. the feature nasal and the place features for the labial and dental weak fricatives, are missing. Analyses of the recognition results show that the major sources of error are confusions between "one" and "nine" (8.3% substitutions) and between "five" and "nine" (2.4% substitutions). Thus, we feel strongly that the performance of KBFE with HTK will improve greatly once APs have been added for the missing phonetic features.

The other columns of Tables 1 and 3 show that KBFE outperforms the MFCCs when the test data are very different from the training data. Relative to the results obtained with the MFCCs, the results with KBFE show an error reduction of 39% when there is a gender difference between the subjects used to train and test the system (Table 1). Similarly, KBFE shows an error reduction of about 63% when there is a considerable age difference between the subjects used to train and test the recognizer (Table 3); the calculation sketched below makes this figure explicit. Note that there is very little change in the results of Table 2. Such consistent results suggest that boys and girls between 7 and 15 years of age have very similar vocal tract lengths.
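For concreteness, the error reductions quoted above follow directly from the tabulated word accuracies; for example, for the adult-trained, child-tested column of Table 3:

```python
# Relative error reduction for Table 3, Tr: Adult / Te: Child.
err_mfcc = 100.0 - 60.20          # MFCC word error rate: 39.80%
err_kbfe = 100.0 - 85.15          # KBFE word error rate: 14.85%
reduction = (err_mfcc - err_kbfe) / err_mfcc
print(f"{reduction:.1%}")         # 62.7%, i.e. the ~63% quoted above
```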
5. CONCLUSION & FUTURE WORK

In this paper, we have shown that the APs, which are based on phonetic features, target the linguistic information in the speech signal. Further, we have shown that the use of relative measures can greatly reduce the effect of differences in speaker characteristics. Work is in progress to complete the development of KBFE. Once it is complete, we expect considerably improved accuracy and even more consistent results across different speaker groups.

REFERENCES

[1] P. Zhan and A. Waibel, "Vocal tract length normalization for large vocabulary continuous speech recognition," Tech. Rep. CMU-CS-97-148, Carnegie Mellon University, May 1997.
[2] P. Zhan and M. Westphal, "Speaker normalization based on frequency warping," Proc. ICASSP 1997, Munich, 1997.
[3] T. Hain et al., "The 1998 HTK system for transcription of conversational telephone speech," Proc. ICASSP 1999, Phoenix, pp. 57-60.
[4] Y. Ono et al., "Speaker normalization using constrained spectra shifts in auditory filter domain," Proc. Eurospeech 1993, pp. 355-358, 1993.
[5] N. Bitar and C. Espy-Wilson, "Knowledge-based parameters for HMM speech recognition," Proc. ICASSP 1996, Vol. 1, pp. 29-32, May 1996.
[6] N. Bitar and C. Espy-Wilson, "The design of acoustic parameters for speaker-independent speech recognition," Proc. Eurospeech 1997, Vol. 3, pp. 1239-1242.
[7] K. N. Stevens and S. Y. Manuel, "Revisiting place of articulation measures for stop consonants: Implications for models of consonant production," Proc. XIV International Congress of Phonetic Sciences, Vol. 2, pp. 1117-1120.
[8] HMM Tool Kit (HTK), version 3.0, http://htk.eng.cam.ac.uk/
[9] R. G. Leonard and G. Doddington, TIDIGITS, Linguistic Data Consortium, http://www.ldc.upenn.edu/
[10] M. Liberman, TI46-Word, Linguistic Data Consortium, http://www.ldc.upenn.edu/