GESTURE-BASED DYNAMIC BAYESIAN NETWORK FOR NOISE ROBUST SPEECH RECOGNITION

Vikramjit Mitra1, Hosung Nam2, Carol Y. Espy-Wilson1, Elliot Saltzman2,3, Louis Goldstein2,4
1 Institute for Systems Research & Department of ECE, University of Maryland, College Park, MD
2 Haskins Laboratories, New Haven, CT
3 Department of Physical Therapy and Athletic Training, Boston University, USA
4 Department of Linguistics, University of Southern California, USA
vmitra@umd.edu, nam@haskins.yale.edu, espy@umd.edu, esaltz@bu.edu, louisgol@usc.edu

ABSTRACT

We have previously proposed models for estimating articulatory gestures and vocal tract variable (TV) trajectories from synthetic speech. We have shown that, when deployed on natural speech, such models can help to improve the noise robustness of a hidden Markov model (HMM) based speech recognition system. In this paper we propose a model for estimating TVs trained on natural speech and present a Dynamic Bayesian Network (DBN) based speech recognition architecture that treats vocal tract constriction gestures as hidden variables, eliminating the need for explicit gesture recognition. Using the proposed architecture we performed a word recognition task on the noisy data of Aurora-2. Using the gestural information as hidden variables in the DBN architecture yielded a significant improvement over using only a mel-frequency cepstral coefficient based HMM or DBN backend. We also compare our results with other noise-robust front ends.

Index Terms— Noise-robust Speech Recognition, Dynamic Bayesian Network, Articulatory Speech Recognition, Vocal-Tract Variables, Articulatory Phonology, Task Dynamic Model

1. INTRODUCTION

Current state-of-the-art automatic speech recognition (ASR) systems suffer from the acoustic variability present in spontaneous speech utterances. Such variability can be broadly categorized into three classes: variability due to (i) background noise, (ii) speaker differences, and (iii) differences in the recording device, commonly known as channel variation. Studies [1, 2, 3] have shown that articulatory information can potentially improve the robustness of speech recognition systems against noise contamination and speaker variation. A further motivation behind the use of articulatory information in speech recognition is to model coarticulation and reduction in a more systematic way. Coarticulation has been described in several ways, including the spreading of features from one segment to another [4] and the influence on one phone of its neighboring phones. Current phone-based spontaneous speech ASR systems model speech as a string of non-overlapping phone units [5], which limits the acoustic model's ability to properly learn the underlying variations in spontaneous or conversational speech. Studies [5] suggest that such a phone-based acoustic model is inadequate for ASR tasks, specifically for spontaneous or conversational speech. In an attempt to model coarticulation, current ASR systems use tri- or quin-phone acoustic models, where different models are created for each phone in all possible phone contexts. Unfortunately, such tri- or quin-phone models limit the contextual influence to immediately neighboring phones and hence may fail to account for certain coarticulatory effects (e.g., syllable deletions [6]).
Also, some tri- or quin-phone units are less frequent than others, introducing data sparsity that necessitates a large training database to create models of all possible tri- or quin-phone units. The goal of this study is to propose an ASR architecture inspired by Articulatory Phonology (AP), which models speech as constellations of overlapping articulatory gestures [7, 8], and that can potentially overcome the limitations of phone-based units in addressing coarticulation. AP models a speech utterance as a constellation of articulatory gestures (known as a gestural score) [7, 8], where the gestures are invariant action units that define the onset and the offset of constriction actions by various constricting organs (lips, tongue tip, tongue body, velum, and glottis), together with a set of dynamic parameters (e.g., target, stiffness, etc.) [7]. In AP, intergestural temporal overlap accounts for acoustic variations in speech originating from coarticulation, reduction, rate variation, etc. Gestures are defined by relative measures of the constriction degree and location at distinct constriction organs, that is, by the tract variables in Table 1. The gestures' dynamic realizations are the tract variable trajectories, referred to as TVs here. Note that a TV (described in more detail in [9]) does not stand for a tract variable itself but for its time-function output, i.e., its temporal trajectory.

To use gestures as ASR units, they first need to be recognized from the speech signal. In addition, we have observed that the accuracy of gesture recognition improves with prior knowledge of the TVs [10], indicating the necessity of estimating TVs from the acoustic signal before performing gesture-based ASR. In [10] we presented a Hidden Markov Model (HMM) based ASR architecture whose inputs were Mel-Frequency Cepstral Coefficients (MFCCs), estimated TVs and recognized gestures. Such an architecture required explicit recognition of articulatory gestures from the speech signal. In our current work we propose a Gesture-based Dynamic Bayesian Network (G-DBN) architecture that uses the articulatory gestures as hidden variables, eliminating the need for explicit recognition of articulatory gestures. The proposed G-DBN uses MFCCs and estimated TVs as observations, where the estimated TVs are obtained from an Artificial Neural Network (ANN) based TV-estimator.

The organization of the paper is as follows: Section 2 describes the data used in this paper, Section 3 describes the hybrid ANN-DBN architecture, and Section 4 presents the results, followed by conclusions and future directions in Section 5.

Table 1. Constriction organs and tract variables

Constriction organ : Tract variables
Lip                : Lip Aperture (LA); Lip Protrusion (LP)
Tongue Tip         : Tongue tip constriction degree (TTCD); Tongue tip constriction location (TTCL)
Tongue Body        : Tongue body constriction degree (TBCD); Tongue body constriction location (TBCL)
Velum              : Velum (VEL)
Glottis            : Glottis (GLO)

2. DATA

In [11] we proposed an architecture for annotating articulatory gestures and TV trajectories for natural speech utterances. Using that methodology, we annotated the entire X-ray microbeam (XRMB [12]) database and the Aurora-2 [13] clean training set with gestures and TVs, sampled at 200 Hz. There are eight TV trajectories and corresponding articulatory gestures defining the constriction location and degree of the different constricting organs in the vocal tract, as shown in Table 1.
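To make the notion of a gestural score concrete, the sketch below shows one possible encoding of a constellation of overlapping gestures, with one activation interval and a set of dynamic parameters per gesture. The dataclass layout, field names and all numeric values are illustrative assumptions, not the annotation format produced by [11].

```python
from dataclasses import dataclass

@dataclass
class Gesture:
    tract_variable: str   # e.g. "LA", "TTCD", "GLO" (see Table 1)
    onset: float          # activation onset time (s)
    offset: float         # activation offset time (s)
    target: float         # constriction target (degree or location)
    stiffness: float      # dynamic parameter of the gesture

# A gestural score is a constellation of gestures that may overlap in time.
# All numeric values below are made-up placeholders, purely for illustration.
gestural_score = [
    Gesture("LA",   0.10, 0.25, target=-1.0, stiffness=8.0),  # lip closure
    Gesture("GLO",  0.08, 0.30, target=0.6,  stiffness=6.0),  # glottal opening
    Gesture("TTCD", 0.22, 0.40, target=-0.5, stiffness=8.0),  # tongue-tip constriction
]

def active_gestures(score, t):
    """Return the gestures whose activation interval contains time t."""
    return [g for g in score if g.onset <= t < g.offset]

print([g.tract_variable for g in active_gestures(gestural_score, 0.23)])
# -> ['LA', 'GLO', 'TTCD']: intergestural overlap at t = 0.23 s
```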
The XRMB database includes speech utterances recorded from 47 American English speakers, each producing at most 56 read-speech tasks consisting of strings of digits, TIMIT sentences, or a paragraph read from a book. In this work we used the acoustic data and the corresponding TV trajectories for the 56 tasks performed by male speaker 12 of the XRMB database to train the TV-estimator; 76.8% of the data was used for training, 10.7% for validation and the rest for testing. For the TV-estimator, the speech waveform was downsampled to 8 kHz and parameterized as 13 MFCCs, computed at a 5 ms frame interval with a 10 ms window duration. The acoustic features and the TV trajectories were z-normalized (with std. dev. = 0.25), and the resulting acoustic coefficients were scaled such that their dynamic range was confined within [-0.95, +0.95]. The acoustic features were created by stacking the scaled and normalized acoustic coefficients from nine frames (selecting every other frame), with the 5th frame centered at the current time instant. The resulting contextualized acoustic feature vector had a dimensionality of 117 (= 9 × 13).

The Aurora-2 [13] database consists of connected digits spoken by American English speakers, sampled at 8 kHz. The annotated gestures for the Aurora-2 training corpus were used to train the G-DBN. Test sets A and B, which contain eight different noise types at seven different SNR levels, were used to evaluate the noise robustness of the proposed word recognizer.

3. THE HYBRID ANN-DBN ARCHITECTURE

Our prior study [10] has shown that articulatory gestures can be recognized with higher accuracy if knowledge of the TV trajectories is used in addition to the acoustic parameters (MFCCs), as opposed to using the acoustic parameters alone. However, in a typical ASR setup the only available observable is the acoustic signal, which is parameterized as acoustic features. This indicates the necessity of estimating the TVs from the acoustic features. We performed an exhaustive study using synthetic speech in [9] and observed that a feed-forward ANN can be used to estimate the TV trajectories from acoustic features.

We used a 4-hidden-layer feed-forward ANN with tanh-sigmoid activation functions, 117 input nodes and 8 output nodes (corresponding to the 8 TVs shown in Table 1). The number of hidden layers was restricted to 4 because, with an increasing number of hidden layers, (a) the error surface of the cost function becomes complex with a large number of spurious minima, (b) the training time as well as the network complexity increase, and (c) no appreciable improvement in performance was observed with further addition of hidden layers. The ANN was trained on the XRMB training set for up to 4000 iterations. The number of neurons in each hidden layer was optimized using the validation set; the optimal numbers of neurons were found to be 225, 150, 225 and 25. Finally, the ANN outputs were processed with a Kalman smoother to filter out the estimation noise.
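As a concrete illustration of the TV-estimator front end just described, the following minimal numpy sketch stacks nine alternate frames of 13 scaled MFCCs into a 117-dimensional context vector and passes it through a feed-forward network with hidden layers of 225, 150, 225 and 25 tanh units and 8 linear outputs. The class and function names, the weight initialization and the edge-frame handling are assumptions made for illustration; the training procedure and the Kalman smoothing stage are omitted.

```python
import numpy as np

def stack_context(mfcc, n_frames=9, skip=2):
    """Build 117-dim context vectors from a (T, 13) MFCC matrix.

    Nine frames are taken at every other frame (skip=2), with the 5th frame
    centered at the current time instant, as described in Section 2.
    Edge frames are handled by clamping indices (an assumption).
    """
    T, _ = mfcc.shape
    half = (n_frames - 1) // 2                      # 4 frames on each side
    offsets = np.arange(-half, half + 1) * skip     # [-8, -6, ..., 6, 8]
    rows = []
    for t in range(T):
        idx = np.clip(t + offsets, 0, T - 1)        # clamp at utterance edges
        rows.append(mfcc[idx].reshape(-1))          # 9 x 13 = 117 values
    return np.stack(rows)                           # (T, 117)

class TVEstimatorMLP:
    """Feed-forward TV estimator: 117 -> 225 -> 150 -> 225 -> 25 -> 8."""

    def __init__(self, rng=np.random.default_rng(0)):
        sizes = [117, 225, 150, 225, 25, 8]
        self.weights = [rng.normal(0, 0.1, (m, n))
                        for m, n in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(n) for n in sizes[1:]]

    def forward(self, x):
        # tanh activations in the four hidden layers, linear output layer
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            x = np.tanh(x @ W + b)
        return x @ self.weights[-1] + self.biases[-1]   # (T, 8) raw TV estimates

# Usage: mfcc is a (T, 13) matrix of scaled, z-normalized cepstra at 5 ms frames
# feats = stack_context(mfcc)
# raw_tvs = TVEstimatorMLP().forward(feats)   # subsequently Kalman-smoothed
```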
In the hybrid ANN-DBN architecture, the ANN outputs were used as an observation set by the DBN [14]. A DBN is essentially a Bayesian Network (BN) with temporal dependencies. A BN is a form of graphical model in which a set of random variables (RVs) and their inter-dependencies are modeled using the nodes and edges of a directed acyclic graph (DAG): the nodes represent the RVs and the edges represent their functional dependencies. BNs exploit the conditional independence properties among a set of RVs, where dependence is reflected by a connecting edge between a pair of RVs and independence is reflected by its absence. For N RVs X_1, X_2, ..., X_N, the joint distribution is given by

    p(x_1, x_2, \ldots, x_N) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_N \mid x_1, \ldots, x_{N-1})    (1)

Given knowledge of the conditional independencies, a BN simplifies equation (1) to

    p(x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} p(x_i \mid x_{\pi_i})    (2)

where x_{\pi_i} denotes the parents of X_i.

Fig. 1 shows a DBN with four discrete hidden RVs (W, P, S and T), two continuous observable RVs (O1 and O2) and N partly-observable, partly-hidden RVs (A1 to AN). The 'prologue' and the 'epilogue' in Fig. 1 represent the initial and final frames, and the 'center' represents the intermediate frames, which are unrolled in time to match the duration of a specific utterance [15]. Unlike HMMs, DBNs offer the flexibility to realize multiple hidden state variables at a time, which makes them appropriate for realizing articulatory gesture models that involve several variables, one for each articulatory gesture. The other advantage of DBNs is that they can explicitly model the interdependencies among the gestures and simultaneously perform gesture recognition and word recognition, hence eliminating the need to perform gesture recognition as a separate prior step. In this work we used GMTK [15] to implement our DBN models, in which conditional probability tables (CPTs) describe the probability distributions of the discrete RVs given their parents, and Gaussian mixture models (GMMs) define the probability distributions of the continuous RVs.

In a typical HMM-based ASR setup, word recognition is performed using the maximum a posteriori probability

    \hat{w} = \arg\max_{w_i} P(w_i \mid o) = \arg\max_{w_i} \frac{P(w_i)\, P(o \mid w_i)}{P(o)} = \arg\max_{w_i} P(w_i)\, P(o \mid w_i)    (3)

where o is the observation variable and P(w_i) is the language model, which can be ignored for an isolated word recognition problem in which all words are equally probable. Hence we can focus only on P(o | w_i), which can be simplified further as

    P(o \mid w) = \sum_{q} P(q, o \mid w) = \sum_{q} P(q \mid w)\, P(o \mid q, w)
                \approx \sum_{q} P(q_1 \mid w)\, P(o_1 \mid q_1, w) \prod_{i=2}^{n} P(q_i \mid q_{i-1}, w)\, P(o_i \mid q_i, w)    (4)

where q is the hidden state sequence in the model. Thus, in this setup the likelihood of the acoustic observation given the model is calculated in terms of the emission probabilities P(o_i | q_i) and the transition probabilities P(q_i | q_{i-1}). The use of articulatory information introduces another RV a, and (4) can then be reformulated as

    P(o \mid w) \approx \sum_{q} P(q_1 \mid w)\, P(o_1 \mid q_1, a_1, w) \prod_{i=2}^{n} P(q_i \mid q_{i-1}, w)\, P(a_i \mid a_{i-1}, q_i)\, P(o_i \mid q_i, a_i, w)    (5)

A DBN can realize the causal relationship between the articulators and the acoustic observations, P(o | q, a, w), and can also model the dependency of the articulators on the current phonetic state and the previous articulators, P(a_i | a_{i-1}, q_i).
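To clarify how a factorization like (5) is evaluated, the toy numpy sketch below runs a forward recursion over a joint word-state/gesture-state space and sums out all hidden sequences. The CPTs and emission scores are random placeholders, the initial gesture prior is an added assumption, and the articulatory state is collapsed into a single discrete RV; in the actual G-DBN the articulatory state is factored into several gesture RVs and the emissions are GMMs trained in GMTK, neither of which is reproduced here.

```python
import numpy as np

def log_likelihood(obs_logprob, pi_q, A_q, A_a, pi_a):
    """Forward recursion for the factorization in Eq. (5).

    obs_logprob[t, q, a] = log P(o_t | q, a, w)   (placeholder emission scores)
    pi_q[q]              = P(q_1 = q | w)
    A_q[p, q]            = P(q_t = q | q_{t-1} = p, w)
    A_a[a, q, b]         = P(a_t = b | a_{t-1} = a, q_t = q)
    pi_a[a]              = P(a_1 = a)             (initial gesture prior, an assumption)
    Returns log P(o | w), summing over all hidden (q, a) sequences.
    """
    T, _, _ = obs_logprob.shape
    # alpha[q, a] = P(o_1..o_t, q_t = q, a_t = a | w), kept in the linear domain
    alpha = pi_q[:, None] * pi_a[None, :] * np.exp(obs_logprob[0])
    for t in range(1, T):
        # sum over previous word state p and previous gesture state a
        trans = np.einsum('pa,pq,aqb->qb', alpha, A_q, A_a)
        alpha = trans * np.exp(obs_logprob[t])
    return np.log(alpha.sum())

# Toy usage with random, properly normalized CPTs
rng = np.random.default_rng(0)
T, Nq, Na = 20, 8, 4                      # 8 word states, 4 gesture configurations
pi_q = np.eye(Nq)[0]                      # start in the first word state
A_q  = rng.random((Nq, Nq)); A_q /= A_q.sum(1, keepdims=True)
A_a  = rng.random((Na, Nq, Na)); A_a /= A_a.sum(2, keepdims=True)
pi_a = np.full(Na, 1.0 / Na)
obs  = np.log(rng.random((T, Nq, Na)))    # placeholder log-emission scores
print(log_likelihood(obs, pi_q, A_q, A_a, pi_a))
```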
Based on this formulation, the G-DBN shown in Fig. 1 can be constructed, where the discrete hidden RVs W, P, T and S represent the word, word position, word transition and word state. Of the continuous observed RVs, O1 is the acoustic observation in the form of MFCCs and O2 is the articulatory observation in the form of the estimated TVs. The partially shaded discrete RVs A1, ..., AN represent the discrete hidden gestures; they are partially shaded because they are observed at the training stage and then made hidden during the testing stage. The overall hybrid ANN-DBN architecture is shown in Fig. 2.

Fig. 1. The G-DBN graphical model.

Fig. 2. The hybrid ANN-DBN architecture.

In the hybrid ANN-DBN architecture, two sets of observations were fed to the DBN: (1) O1, the 39-dimensional MFCCs (13 cepstral coefficients and their Δ and Δ² coefficients), and (2) O2, the estimated TVs obtained from the feed-forward ANN-based TV-estimator. Note that the TV-estimator includes a Kalman smoother that smoothes the raw TV estimates generated by the feed-forward ANN.

4. RESULTS

The TV-estimator was trained and optimized using the training and validation sets for speaker 12 of the XRMB database. Table 2 presents the Pearson product-moment correlation (PPMC) obtained between the estimated and the ground-truth TVs for the XRMB test set.

Table 2. PPMC for the estimated TVs
GLO    VEL    LA     LP     TBCL   TBCD   TTCL   TTCD
0.853  0.854  0.801  0.834  0.860  0.851  0.807  0.801

We implemented three different versions of the DBN. In the first version we used only the 39-dimensional MFCCs as the acoustic observation and no articulatory gesture RVs; we call this the DBN-MFCC-baseline system. In this setup the word models consisted of 18 states (16 states per word and 2 dummy states). There were 11 whole-word models (zero to nine and oh) and 2 models for 'sil' and 'sp', where 'sil' had 3 states and 'sp' had one state. The maximum number of mixtures allowed per state was four, with vanishing of mixture coefficients allowed for weak mixtures. The second version is identical to the first, except for an additional observation RV corresponding to the estimated TVs; we call this the DBN-MFCC-TV system. Finally, the third version was the G-DBN architecture (shown in Fig. 1, with the MFCCs and the estimated TVs as the two sets of observations), in which we used 6 articulatory gestures as hidden RVs, so N in Fig. 1 was 6.

Note that the articulatory gesture RVs modeled only the gestural activations, i.e., they were binary RVs reflecting whether a gesture is active or not and did not carry any target information (i.e., degree and location of the constriction). This was done deliberately to keep the system tractable; otherwise the multi-dimensional CPT linking the word-state RVs and the gesture-state RVs becomes extremely large, making the DBN overly complex. Hence our current implementation of the G-DBN uses 6 gesture RVs: GLO, VEL, LA, LP, TT and TB. Since the gestural activations for TTCL and TTCD are identical, they were replaced by a single RV TT (tongue tip); the same is true for TBCL and TBCD, which were replaced by TB (tongue body). Since the TVs were used as a set of observations and the TVs by themselves contain coarse target-specific information about the gestures, it can be expected that the system retains gestural target information to some extent. Another major change in the G-DBN was the number of states per word, which was reduced to eight with two additional dummy states. The numbers of states for 'sil' and 'sp' were kept the same as before. In this setup the discrete gesture RVs are treated as observable during training and then converted to hidden RVs during testing.
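As a small illustration of the reduction from the eight annotated TV activation streams to the six binary gesture RVs used by the G-DBN, the sketch below merges TTCL/TTCD into TT and TBCL/TBCD into TB. The dictionary-of-arrays representation and stream names are assumptions for illustration; the actual observation files are GMTK-specific and not shown here.

```python
import numpy as np

def collapse_to_gesture_rvs(act):
    """Map 8 per-frame TV activation streams (1 = gesture active) to the
    6 binary gesture RVs of the G-DBN: GLO, VEL, LA, LP, TT and TB."""
    # TTCL/TTCD (and TBCL/TBCD) share identical activations, so each pair
    # collapses to a single stream, as described in the text.
    assert np.array_equal(act["TTCL"], act["TTCD"])
    assert np.array_equal(act["TBCL"], act["TBCD"])
    return {
        "GLO": act["GLO"],
        "VEL": act["VEL"],
        "LA":  act["LA"],
        "LP":  act["LP"],
        "TT":  act["TTCL"],   # tongue tip
        "TB":  act["TBCL"],   # tongue body
    }
```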
Fig. 3 shows the overall word recognition accuracy obtained from the three DBN versions implemented.

Fig. 3. Overall word recognition accuracy (%), averaged across all noise types and levels, for the three DBN versions: DBN-MFCC-baseline 56.6%, DBN-MFCC-TV 62.7%, G-DBN 72.8%.

Fig. 3 shows that the G-DBN setup provided the best overall word recognition accuracy despite having a lower number of states per word model. Using the estimated TVs in addition to the MFCCs offered higher recognition accuracy than the MFCC baseline, which is in line with our previous observation [16]. In Table 3 we compare our results with some previously obtained HMM-based results. Table 3 shows that the performance of the G-DBN architecture is better than that of our previously proposed gesture-based HMM architecture [10]. Overall, the G-DBN accuracy is better than all other results shown in Table 3 except the ETSI-advanced front end. Note that only the G-DBN in Table 3 uses an eight-state-per-word model, while all others use a sixteen-state-per-word model.

Table 3. Averaged recognition accuracies (0 to 20 dB) obtained using the DBN architectures presented in this paper, our prior HMM-based articulatory gesture system [10], and some state-of-the-art word recognition systems reported so far.

Backend  System                              Rec. Acc. (%)
HMM      MFCC [15]                           51.05
HMM      MFCC+TV+Gesture [10]                72.71
HMM      Soft Margin Estimation (SME) [17]   67.44
HMM      ETSI-advanced [18]                  86.13
DBN      MFCC                                58.00
DBN      MFCC+TV                             66.37
DBN      G-DBN                               78.77

To examine whether the proposed G-DBN architecture provides any improvement over a mono-phone based ASR system, we created a phone-based DBN model (modeling 60 phones, with a maximum of 30 phones per word). Results on the clean test set of Aurora-2 showed that the G-DBN offered a 0.63% relative improvement over the mono-phone based model.

5. CONCLUSION

We have proposed an articulatory gesture based DBN architecture that uses acoustic observations in the form of MFCCs and estimated TV trajectories as input. Using an eight-state-per-word model, we have shown that the G-DBN architecture can significantly improve word recognition accuracy over DBN architectures using MFCCs only, or MFCCs along with TVs, as input. Our results also show that the proposed G-DBN significantly improves performance over the gesture-based HMM architecture we previously proposed, indicating the capability of DBNs to properly model parallel streams of information (in our case, the gestures).

Note that the current system has several limitations. First, the TV-estimator is trained on only a single speaker; a multi-speaker trained TV-estimator can potentially increase the accuracy of the TV estimates for the Aurora-2 database, which in turn can further increase word recognition accuracy. Second, we only modeled the gestural activations as hidden binary RVs; future research should include gestural target information as well. Finally, we have seen [10] that contextualized acoustic observations can potentially improve gesture recognition; however, in our current implementation the acoustic observation had no contextual information, and this should be pursued in future research. We also compared the performance of the proposed G-DBN architecture with a mono-phone based word recognition model and observed a 0.63% absolute improvement. Given the limited vocabulary size of Aurora-2, a logical future direction is to compare phone-based models (possibly using contextualized phone units such as tri- or quin-phones) with the G-DBN architecture on a large-vocabulary database. Future research should also explore using the ETSI features together with the TVs to see if they offer any improvement beyond what has been observed with the ETSI front end alone.

6. ACKNOWLEDGEMENT
The authors would like to thank Dr. Karen Livescu (TTI-Chicago) and Arthur Kantor (UIUC) for their valuable suggestions on GMTK.

7. REFERENCES

[1] K. Kirchhoff, G. Fink, and G. Sagerer, "Combining acoustic and articulatory feature information for robust speech recognition", Speech Comm., vol. 37, pp. 303-319, 2000.
[2] K. Kirchhoff, "Robust Speech Recognition Using Articulatory Information", PhD Thesis, University of Bielefeld, 1999.
[3] M. Richardson, J. Bilmes and C. Diorio, "Hidden-articulator Markov models for speech recognition", Speech Comm., vol. 41(2-3), pp. 511-529, 2003.
[4] R. Daniloff and R. Hammarberg, "On defining coarticulation", J. of Phonetics, vol. 1, pp. 239-248, 1973.
[5] M. Ostendorf, "Moving beyond the 'beads-on-a-string' model of speech", Proc. of IEEE ASRU, vol. 1, pp. 79-83, CO, 1999.
[6] D. Jurafsky, W. Ward, Z. Jianping, K. Herold, Y. Xiuyang and Z. Sen, "What kind of pronunciation variation is hard for triphones to model?", Proc. of ICASSP, vol. 1, pp. 577-580, Utah, 2001.
[7] C. Browman and L. Goldstein, "Articulatory Gestures as Phonological Units", Phonology, 6: 201-251, 1989.
[8] C. Browman and L. Goldstein, "Articulatory Phonology: An Overview", Phonetica, 49: 155-180, 1992.
[9] V. Mitra, H. Nam, C. Espy-Wilson, E. Saltzman and L. Goldstein, "Retrieving Tract Variables from Acoustics: a comparison of different Machine Learning strategies", IEEE J. of Selected Topics in Sig. Proc., vol. 4(6), pp. 1027-1045, 2010.
[10] V. Mitra, H. Nam, C. Espy-Wilson, E. Saltzman and L. Goldstein, "Robust word recognition using articulatory trajectories and gestures", Proc. of Interspeech, pp. 2038-2041, Japan, 2010.
[11] H. Nam, V. Mitra, M. Tiede, E. Saltzman, L. Goldstein, C. Espy-Wilson and M. Hasegawa-Johnson, "A procedure for estimating gestural scores from natural speech", Proc. of Interspeech, pp. 30-33, Japan, 2010.
[12] J. Westbury, "X-ray Microbeam Speech Production Database User's Handbook", Univ. of Wisconsin, 1994.
[13] H. G. Hirsch and D. Pearce, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions", Proc. of ISCA ITRW ASR2000, pp. 181-188, Paris, France, 2000.
[14] Z. Ghahramani, "Learning dynamic Bayesian networks", in Adaptive Processing of Temporal Information, C. L. Giles and M. Gori, Eds., pp. 168-197, Springer-Verlag, 1998.
[15] J. Bilmes, "GMTK: The Graphical Models Toolkit", SSLI Laboratory, Univ. of Washington, October 2002. [http://ssli.ee.washington.edu/~bilmes/gmtk/]
[16] V. Mitra, H. Nam, C. Espy-Wilson, E. Saltzman and L. Goldstein, "Tract variables for noise robust speech recognition", to appear in IEEE Trans. Audio, Speech & Lang. Processing.
[17] X. Xiao, J. Li, E. S. Chng, H. Li and C. Lee, "A Study on the Generalization Capability of Acoustic Models for Robust Speech Recognition", IEEE Trans. Audio, Speech & Lang. Processing, vol. 18(6), pp. 1158-1169, 2009.
[18] ETSI ES 202 050 Ver. 1.1.5, 2007. Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithms.