GESTURE-BASED DYNAMIC BAYESIAN NETWORK FOR NOISE ROBUST SPEECH RECOGNITION

Vikramjit Mitra1, Hosung Nam2, Carol Y. Espy-Wilson1, Elliot Saltzman2,3, Louis Goldstein2,4
1 Institute for Systems Research & Department of ECE, University of Maryland, College Park, MD
2 Haskins Laboratories, New Haven, CT
3 Department of Physical Therapy and Athletic Training, Boston University, USA
4 Department of Linguistics, University of Southern California, USA
vmitra@umd.edu, nam@haskins.yale.edu, espy@umd.edu, esaltz@bu.edu, louisgol@usc.edu
ABSTRACT
Previously we have proposed different models for estimating
articulatory gestures and vocal tract variable (TV) trajectories from
synthetic speech. We have shown that when deployed on natural
speech, such models can help to improve the noise robustness of a
hidden Markov model (HMM) based speech recognition system. In
this paper we propose a model for estimating TVs trained on
natural speech and present a Dynamic Bayesian Network (DBN)
based speech recognition architecture that treats vocal tract
constriction gestures as hidden variables, eliminating the necessity
for explicit gesture recognition. Using the proposed architecture we performed a word recognition task on the noisy data of Aurora-2. Significant improvement was observed when the gestural information was used as hidden variables in the DBN architecture, compared to using only a mel-frequency cepstral coefficient based HMM or DBN backend. We also compare our results with other noise-robust front ends.
Index Terms— Noise-robust Speech Recognition, Dynamic
Bayesian Network, Articulatory Speech Recognition, Vocal-Tract
variables, Articulatory Phonology, Task Dynamic model.
1. INTRODUCTION
Current state-of-the-art automatic speech recognition (ASR)
systems suffer from acoustic variability present in spontaneous
speech utterances. Such variability can be broadly categorized into
three classes: variability due to (i) background noises, (ii) speaker
differences and (iii) differences in the recording device, commonly known as channel variation.
Studies [1, 2, 3] have shown that articulatory information can
potentially improve robustness in speech recognition systems
against noise contamination and speaker variation. However, the
motivation behind the use of articulatory information in speech
recognition is to model coarticulation and reduction in a more
systematic way. Coarticulation has been described in several ways,
including the spreading of features from one segment to another
[4], the influence of neighboring phones on a given phone, and so on.
Current phone-based spontaneous-speech ASR systems model speech as a string of non-overlapping phone units [5], which limits the acoustic model's ability to properly learn the underlying variations in spontaneous or conversational speech. Studies [5] suggest that such a phone-based acoustic model is inadequate for ASR tasks, particularly for spontaneous or conversational speech.
In an attempt to model coarticulation, current ASR systems use tri- or quin-phone acoustic models, where different models are created for each phone in all possible phone contexts. Unfortunately such tri- or quin-phone models limit the contextual influence to immediately neighboring phones and hence may fail to account for certain coarticulatory effects (e.g., syllable deletions [6]). Also, some tri- or quin-phone units are less frequent than others, introducing data sparsity that necessitates a large training database to create models for all possible tri- or quin-phone units.
The goal of this study is to propose an ASR architecture
inspired by Articulatory Phonology (AP) that models speech as
overlapping articulatory gestures [7, 8] and can potentially
overcome the limitations of phone-based units in addressing
coarticulation. AP models a speech utterance as a constellation of
articulatory gestures (known as gestural score) [7, 8], where the
gestures are invariant action units that define the onset and the
offset of constriction actions by various constricting organs (lips,
tongue tip, tongue body, velum, and glottis) and a set of dynamic
parameters (e.g., target, stiffness etc.) [7]. In AP, the intergestural
temporal overlap accounts for acoustic variations in speech
originating from coarticulation, reduction, rate variations etc.
Gestures are defined by relative measures of the constriction
degree and location at distinct constriction organs, that is, by one
of the tract variables in Table 1. The gestures' dynamic realizations
are the tract variable trajectories, referred to here as TVs. Note that a TV (described in more detail in [9]) does not stand for a tract variable itself but for its time-function output, i.e., its temporal trajectory. To use gestures as ASR units, they first need to be recognized from the speech signal. In addition, we have observed that the accuracy of gesture recognition
improves with prior knowledge of TVs [10], indicating the
necessity of estimating TVs from the acoustic signal before
performing the gesture based ASR.
In [10] we presented a Hidden Markov Model (HMM) based
ASR architecture whose inputs were Mel-Frequency Cepstral
Coefficients (MFCCs), estimated TVs and recognized gestures.
Such an architecture required explicit recognition of articulatory
gestures from the speech signal. In our current work we propose a
Gesture-based Dynamic Bayesian Network (G-DBN) architecture
that uses the articulatory gestures as hidden variables, eliminating
the necessity of explicit recognition of articulatory gestures. The
proposed G-DBN uses MFCCs and estimated TVs as observations,
where the estimated TVs are obtained from an Artificial Neural
Network (ANN) based TV-estimator.
The organization of the paper is as follows: Section 2 describes the data used in this paper, Section 3 describes the hybrid ANN-DBN architecture, Section 4 presents the results, followed by conclusions and future directions in Section 5.
Table 1. Constriction organs and tract variables

Constriction organ   Tract variables
Lip                  Lip Aperture (LA); Lip Protrusion (LP)
Tongue Tip           Tongue tip constriction degree (TTCD); Tongue tip constriction location (TTCL)
Tongue Body          Tongue body constriction degree (TBCD); Tongue body constriction location (TBCL)
Velum                Velum (VEL)
Glottis              Glottis (GLO)
2. DATA
We proposed an architecture in [11] to annotate articulatory
gestures and TV trajectories for a natural speech utterance. Using this methodology, we annotated the entire X-ray microbeam (XRMB) [12] database and the Aurora-2 [13] clean training set with gestures and TVs sampled at 200 Hz. There are eight TV
trajectories and corresponding articulatory gestures defining the
constriction location and degree of different constricting organs in
the vocal tract as shown in Table 1.
The XRMB database includes speech utterances recorded from
47 different American English speakers, each producing at most 56
read-speech tasks consisting of strings of digits, TIMIT sentences
or a paragraph read from a book. In this work we have used the
acoustic data and its corresponding TV trajectories for the 56 tasks
performed by male speaker 12 from the XRMB database to train
the TV estimator. 76.8% of the data was used for training, 10.7%
for validation and the rest for testing. For the TV-estimator the speech waveform was downsampled to 8 kHz and parameterized as 13 MFCCs, computed at a frame rate of 5 ms with a window duration of 10 ms. The acoustic features and the TV trajectories were z-normalized (to a standard deviation of 0.25), and the resulting acoustic coefficients were scaled such that their dynamic range was confined within [-0.95, +0.95]. The
acoustic features were created by stacking the scaled and
normalized acoustic coefficients from nine frames (selecting every
other frame), where the 5th frame is centered at the current time
instant. The resulting contextualized acoustic feature vector had a
dimensionality of 117 (= 9×13).
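As an illustration of this contextualization step, the following is a minimal sketch (assuming the 13-dimensional MFCC frames have already been normalized and scaled as described above; the array name `mfcc` and the boundary handling are our own assumptions, not taken from the paper):

```python
import numpy as np

def stack_context(mfcc, num_stacked=9, skip=2):
    """Stack `num_stacked` frames (every other frame) around each time
    instant, yielding a 9 x 13 = 117-dimensional feature vector.
    Edge frames are handled by clamping indices to the valid range (assumption)."""
    num_frames, dim = mfcc.shape
    half_span = (num_stacked // 2) * skip            # 4 * 2 = 8 frames on each side
    offsets = np.arange(-half_span, half_span + 1, skip)
    stacked = np.empty((num_frames, num_stacked * dim))
    for t in range(num_frames):
        idx = np.clip(t + offsets, 0, num_frames - 1)
        stacked[t] = mfcc[idx].reshape(-1)           # 5th stacked frame is the current frame
    return stacked

# Example: 100 frames of 13-dimensional MFCCs -> (100, 117) contextualized features
features = stack_context(np.random.randn(100, 13))
```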
The Aurora-2 [13] database consists of connected digits
spoken by American English speakers, sampled at 8 kHz. The
annotated gestures for the Aurora-2 training corpus were used to
train the G-DBN. Test sets A and B, which contain eight different noise types at seven different SNR levels, were used to evaluate the noise robustness of the proposed word recognizer.
3. THE HYBRID ANN-DBN ARCHITECTURE
Our prior study [10] has shown that articulatory gestures can be
recognized with higher accuracy if knowledge of the TV trajectories is used in addition to the acoustic parameters (MFCCs), as opposed to using the acoustic parameters alone. However, in a typical ASR setup the only available observable is the acoustic signal, which is parameterized as acoustic features. This indicates the necessity of estimating the TVs from the acoustic features. We
have performed an exhaustive study using synthetic speech in [9]
and have observed that a feed-forward ANN can be used to
estimate the TV trajectories from acoustic features.
We used a 4-hidden-layer feed-forward ANN with tanh-sigmoid activation functions, 117 input nodes and 8 output nodes (corresponding to the 8 TVs shown in Table 1). The number of hidden layers was restricted to 4 because, with an increase in the number of hidden layers, (a) the error surface of the cost function becomes complex with a large number of spurious minima, (b) the training time as well as the network complexity increases, and (c) no appreciable improvement in performance was observed with further addition of hidden layers. The ANN was trained using the
training set of XRMB for up to 4000 iterations. The number of
neurons in each hidden layer of the ANN was optimized using the validation set; the optimal numbers of neurons in the four hidden layers were found to be 225, 150, 225 and 25. Finally, the
ANN outputs were processed with a Kalman smoother to filter out
the estimation noise.
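For concreteness, the sketch below shows the topology of such a TV-estimator as a plain numpy forward pass. The weights here are random placeholders (the actual network was trained on the XRMB training set), the linear output layer is an assumption, and the smoothing step is only indicated, not implemented:

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [117, 225, 150, 225, 25, 8]   # input, four hidden layers, 8 TV outputs

# Placeholder weights/biases; a trained network would supply these.
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def estimate_tvs(contextual_features):
    """Map 117-dim contextualized acoustic features to 8 raw TV estimates."""
    h = contextual_features
    for w, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ w + b)              # tanh-sigmoid hidden activations
    return h @ weights[-1] + biases[-1]     # linear output layer (assumption)

raw_tvs = estimate_tvs(np.random.randn(100, 117))   # (100 frames, 8 TVs)
# A Kalman smoother (not shown here) would then filter these raw estimates.
```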
In the hybrid ANN-DBN architecture, the ANN outputs were
used as an observation set by the DBN [14]. A DBN is basically a
Bayesian Network (BN) containing temporal dependency. A BN is
a form of graphical model where a set of random variables (RVs)
and their inter-dependencies are modeled using nodes and edges of
a directed acyclic graph (DAG). The nodes represent the RVs and
the edges represent their functional dependency. BNs help to
exploit the conditional independence properties between a set of
RVs, where dependence is reflected by a connecting edge between
a pair of RVs and independence is reflected by its absence. For $N$ RVs $X_1, X_2, \ldots, X_N$, the joint distribution is given by

$p(x_1, x_2, \ldots, x_N) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_N \mid x_1, \ldots, x_{N-1})$   (1)
Given the knowledge of conditional independence, a BN simplifies equation (1) into

$p(x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} p(x_i \mid x_{\pi_i})$   (2)

where $X_{\pi_i}$ are the conditional parents of $X_i$.
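As a toy illustration of the factorization in (2), the following sketch (not from the paper; the variables and CPT values are made up) computes the joint probability of an assignment to three binary RVs in a small DAG A → B → C:

```python
# Toy Bayesian network A -> B -> C over binary variables.
# CPTs map (value, parent value) to probabilities; the numbers are illustrative only.
p_A = {0: 0.6, 1: 0.4}
p_B_given_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7}
p_C_given_B = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.25, (1, 1): 0.75}

def joint(a, b, c):
    """p(a, b, c) = p(a) * p(b | a) * p(c | b), per equation (2)."""
    return p_A[a] * p_B_given_A[(b, a)] * p_C_given_B[(c, b)]

# Joint probabilities over all assignments sum to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(joint(1, 1, 0), total)   # 0.4 * 0.7 * 0.25 = 0.07 ; total = 1.0
```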
Fig. 1 shows a DBN with four discrete hidden RVs (W, P, S
and T), two continuous observable RVs (O1 and O2) and N partly-observable, partly-hidden RVs (A1 to AN). The ‘prologue’ and
the ‘epilogue’ in Fig. 1 represent the initial and the final frames
and the ‘center’ represents the intermediate frames, which are
unrolled in time to match the duration of a specific utterance [15].
Unlike HMMs, DBNs offer the flexibility of realizing multiple hidden state variables at a time, which makes them appropriate for realizing articulatory gesture models that involve several variables, one for each articulatory gesture. Another advantage of DBNs is that they can explicitly model the interdependencies amongst the gestures and simultaneously perform gesture recognition and word recognition, hence eliminating the necessity of performing gesture recognition as a separate prior step. In this work we have used
GMTK [15] to implement our DBN models, in which conditional
probability tables (CPT) are used to describe the probability
distributions of the discrete RVs given their parents, and Gaussian
mixture models (GMMs) are used to define the probability
distributions of the continuous RVs.
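To illustrate the role of the GMMs, the snippet below evaluates the log-likelihood of a continuous observation vector under a diagonal-covariance Gaussian mixture; the mixture parameters here are arbitrary placeholders, not values from the trained models:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of observation x under a diagonal-covariance GMM.
    weights: (K,), means/variances: (K, D), x: (D,)."""
    diff = x - means                                             # (K, D)
    log_comp = (-0.5 * np.sum(diff ** 2 / variances, axis=1)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1))
    return float(np.logaddexp.reduce(np.log(weights) + log_comp))

# Placeholder 4-component mixture over a 39-dimensional observation (e.g., MFCC + deltas)
K, D = 4, 39
rng = np.random.default_rng(1)
w = np.full(K, 1.0 / K)
mu = rng.standard_normal((K, D))
var = np.ones((K, D))
print(gmm_log_likelihood(rng.standard_normal(D), w, mu, var))
```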
In a typical HMM-based ASR setup, word recognition is performed using the maximum a posteriori criterion

$\hat{w} = \arg\max_{w_i} P(w_i \mid o) = \arg\max_{w_i} \dfrac{P(w_i)\, P(o \mid w_i)}{P(o)} = \arg\max_{w_i} P(w_i)\, P(o \mid w_i)$   (3)
where $o$ is the observation variable and $P(w_i)$ is the language model, which can be ignored for an isolated word recognition problem in which all words are equally probable. Hence we can focus only on $P(o \mid w_i)$, which can be simplified further as
$P(o \mid w) = \sum_{q} P(q, o \mid w) = \sum_{q} P(q \mid w)\, P(o \mid q, w)$
$\qquad\approx \sum_{q} P(q_1 \mid w)\, P(o_1 \mid q_1, w) \prod_{i=2}^{n} P(q_i \mid q_{i-1}, w)\, P(o_i \mid q_i, w)$   (4)
where $q$ denotes the hidden state sequence in the model. Thus, in this setup the likelihood of the acoustic observation given the model is calculated in terms of the emission probabilities $P(o_i \mid q_i)$ and the transition probabilities $P(q_i \mid q_{i-1})$. Use of articulatory information introduces another RV, $a$, and (4) can then be reformulated as
$P(o \mid w) \approx \sum_{q} P(q_1 \mid w)\, P(o_1 \mid q_1, a_1, w) \prod_{i=2}^{n} P(q_i \mid q_{i-1}, w)\, P(a_i \mid a_{i-1}, q_i)\, P(o_i \mid q_i, a_i, w)$   (5)
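The sums in (4) and (5) run over exponentially many state sequences, but they can be computed efficiently with the standard forward recursion; the sketch below does this for the purely acoustic case of (4), using placeholder transition and per-frame emission probabilities (illustrative values, not the trained word models):

```python
import numpy as np

def forward_likelihood(init_prob, trans_prob, emit_prob):
    """Compute P(o | w) for one word model via the forward algorithm.
    init_prob: (S,) initial state probabilities P(q_1 | w)
    trans_prob: (S, S) transition probabilities P(q_i | q_{i-1}, w), row = previous state
    emit_prob: (T, S) per-frame emission likelihoods P(o_i | q_i, w)"""
    alpha = init_prob * emit_prob[0]                 # forward variable at frame 1
    for t in range(1, emit_prob.shape[0]):
        alpha = (alpha @ trans_prob) * emit_prob[t]  # recursion over frames
    return float(alpha.sum())                        # sum over final states

# Placeholder 3-state left-to-right word model observed for 5 frames
init = np.array([1.0, 0.0, 0.0])
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
emit = np.full((5, 3), 0.2)                          # dummy emission likelihoods
print(forward_likelihood(init, trans, emit))
```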
A DBN can realize the causal relationship between the articulators and the acoustic observations, $P(o \mid q, a, w)$, and can also model the dependency of the articulators on the current phonetic state and the previous articulators, $P(a_i \mid a_{i-1}, q_i)$. Based on this formulation, the G-DBN shown in Fig. 1 can be constructed, where the discrete hidden RVs W, P, T and S represent the word, word-position, word-transition and word-state. The continuous observed RV O1 is the acoustic observation in the form of MFCCs, and O2 is the articulatory observation in the form of the estimated TVs. The partially shaded discrete RVs A1, …, AN represent the discrete hidden gestures; they are partially shaded because they are observed at the training stage and made hidden during the testing stage. The overall hybrid ANN-DBN architecture is shown in Fig. 2.
Fig. 1. The G-DBN graphical model.
We implemented three different versions of the DBN. In the first version we used only the 39-dimensional MFCCs as the acoustic observation and no articulatory gesture RV was used; we refer to this model as the DBN-MFCC-baseline system. In this setup the word models consisted of 18 states (16 states per word and 2 dummy states). There were 11 whole-word models (zero to nine and oh) and 2 models for ‘sil’ and ‘sp’, where ‘sil’ had 3 states and ‘sp’ had one state. The maximum number of mixtures allowed per state was four, with mixture coefficients of weak mixtures allowed to vanish. The second version is identical to the first, except that there was an additional observation RV corresponding to the estimated TVs; we refer to this model as the DBN-MFCC-TV system. Finally, the third version was the G-DBN architecture
(shown in Fig. 1, with MFCCs and the estimated TVs as the two sets of observations), where we used 6 articulatory gestures as hidden RVs, so N in Fig. 1 was 6. Note that the articulatory gesture RVs modeled only the gestural activations, i.e., they were binary RVs reflecting whether a gesture is active or not and carried no target information (i.e., no constriction degree or location information). This was done deliberately to keep the system tractable; otherwise the multi-dimensional CPT linking the word-state RVs and the gesture-state RVs would become extremely large, making the DBN overly complex. Hence our current implementation of the G-DBN uses 6 gesture RVs: GLO, VEL, LA, LP, TT and TB. Since the gestural activations for TTCL and TTCD are identical, they were replaced by a single RV, TT (tongue tip); the same is true for TBCL and TBCD, which were replaced by TB (tongue body). Since the TVs were used as a set of observations, and the TVs by themselves contain coarse target-specific information about the gestures, the system can be expected to have access to gestural target information to some extent. Another major change in the G-DBN was the number of states per word, which was reduced to eight with two additional dummy states; the number of states for ‘sil’ and ‘sp’ was kept the same as before. In this setup the discrete gesture RVs are treated as observable during training and then converted to hidden RVs during testing. Fig. 3 shows the overall word recognition accuracy obtained from the three DBN versions implemented.
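As an illustration of how such binary activation streams can be derived from annotated gestural scores, the following sketch (our own illustration; the interval format and names are assumptions, not the paper's annotation format) converts onset/offset intervals into frame-level 0/1 streams and merges TTCL/TTCD into TT and TBCL/TBCD into TB:

```python
import numpy as np

FRAME_RATE_HZ = 200   # gestures and TVs were sampled at 200 Hz

def activation_stream(intervals, num_frames):
    """Convert a list of (onset_sec, offset_sec) gesture intervals
    into a binary per-frame activation stream."""
    stream = np.zeros(num_frames, dtype=np.int8)
    for onset, offset in intervals:
        start = int(round(onset * FRAME_RATE_HZ))
        end = int(round(offset * FRAME_RATE_HZ))
        stream[start:end] = 1
    return stream

# Hypothetical annotated intervals (in seconds) for one short utterance
num_frames = 200                                        # 1 second of speech at 200 Hz
ttcd = activation_stream([(0.10, 0.25)], num_frames)
ttcl = activation_stream([(0.10, 0.25)], num_frames)    # identical to TTCD
tbcd = activation_stream([(0.40, 0.70)], num_frames)
tbcl = activation_stream([(0.40, 0.70)], num_frames)    # identical to TBCD

tt = np.maximum(ttcd, ttcl)   # single TT stream (activations are identical)
tb = np.maximum(tbcd, tbcl)   # single TB stream
```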
Fig. 2. The hybrid ANN-DBN architecture.
In the hybrid ANN-DBN architecture two sets of observations were fed to the DBN: (1) O1, the 39-dimensional MFCCs (13 cepstral coefficients and their Δ and Δ²), and (2) O2, the estimated TVs obtained from the feed-forward ANN based TV-estimator. Note that the TV-estimator includes a Kalman smoother that smooths the raw TV estimates generated by the feed-forward ANN.
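A minimal sketch of how the Δ and Δ² streams can be appended to the 13 static MFCCs is shown below; it uses a simple finite-difference approximation rather than the regression formula typically used in HTK-style front ends, so the details may differ from the actual setup:

```python
import numpy as np

def add_deltas(static_mfcc):
    """Append delta and delta-delta coefficients to static MFCCs.
    static_mfcc: (num_frames, 13) -> returns (num_frames, 39)."""
    delta = np.gradient(static_mfcc, axis=0)    # first-order time derivative
    delta2 = np.gradient(delta, axis=0)         # second-order time derivative
    return np.concatenate([static_mfcc, delta, delta2], axis=1)

o1 = add_deltas(np.random.randn(100, 13))       # (100, 39) acoustic observation O1
```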
4. RESULTS
Fig. 3. Overall word recognition accuracy (%), averaged across all noise types and levels, obtained from the three DBN versions: DBN-MFCC-baseline (56.6%), DBN-MFCC-TV (62.7%) and G-DBN (72.8%).
The TV-estimator was trained and optimized using the training and validation sets for speaker 12 of the XRMB database. Table 2 presents the correlation obtained between the estimated and the ground-truth TVs for the XRMB test set.

Table 2. Pearson product-moment correlation (PPMC) for the estimated TVs

GLO    VEL    LA     LP     TBCL   TBCD   TTCL   TTCD
0.853  0.854  0.801  0.834  0.860  0.851  0.807  0.801
Fig. 3 shows that the G-DBN setup provided the best overall word recognition accuracy despite having a lower number of states
per word model. Use of the estimated TVs in addition to the MFCCs offered higher recognition accuracy than the MFCC baseline, which is in line with our previous observation [16]. In Table 3 we compare our current results with some previously obtained HMM-based results. Table 3 shows that the performance of the G-DBN architecture is better than that of our previously proposed gesture-based HMM architecture [10]. Overall, the G-DBN accuracy is better than all other results shown in Table 3 except the ETSI-advanced front end. Note that only the G-DBN in Table 3 uses an eight-state/word model while all others use a sixteen-state/word model.
Table 3. Averaged recognition accuracies (0 to 20 dB) obtained using the DBN architectures presented in this paper, our prior HMM-based articulatory gesture system [10], and some state-of-the-art word recognition systems reported so far.

System                                        Rec. Acc. (%)
HMM   MFCC [15]                               51.05
HMM   MFCC+TV+Gesture [10]                    72.71
HMM   Soft Margin Estimation (SME) [17]       67.44
HMM   ETSI-advanced [18]                      86.13
DBN   MFCC                                    58.00
DBN   MFCC+TV                                 66.37
DBN   G-DBN                                   78.77
To assess whether the proposed G-DBN architecture provides any improvement over a mono-phone based ASR system, we created a phone-based DBN model (modeling 60 phones, with a maximum of 30 phones per word). Results on the clean test set of Aurora-2 showed that the G-DBN offered a 0.63% relative improvement over the mono-phone based model.
5. CONCLUSION
We have proposed an articulatory gesture based DBN architecture that uses acoustic observations in the form of MFCCs and estimated TV trajectories as input. Using an eight-state/word model we have shown that the G-DBN architecture can significantly improve word recognition accuracy over DBN architectures using MFCCs only or MFCCs along with TVs as input. Our results also show that the proposed G-DBN significantly improves performance over the gesture-based HMM architecture we previously proposed, indicating the capability of DBNs to properly model parallel streams of information (in our case, the gestures).
Note that the current system has several limitations. First, the TV estimator is trained with data from only a single speaker; a multi-speaker trained TV estimator can potentially increase the accuracy of the TV estimates for the Aurora-2 database, which in turn can further increase the word recognition accuracy. Second, we only modeled the gestural activations as hidden binary RVs; future research should include gestural target information as well. Finally, we have seen [10] that contextualized acoustic observations can potentially increase the performance of gesture recognition; however, in our current implementation the acoustic observations had no contextual information. This should be pursued in future research.
We also compared the performance of the proposed G-DBN
architecture with a mono-phone based word recognition model and
witnessed a 0.63% absolute improvement. Given the limited vocabulary size of Aurora-2, a logical future direction is to compare phone-based models (possibly using contextualized phone units such as tri- or quin-phones) with the G-DBN architecture on a large-vocabulary database. Future research should also try to combine ETSI features with the TVs to see if they offer any improvement beyond what has been observed from the ETSI features alone.
6. ACKNOWLEDGEMENT
The authors would like to thank Dr. Karen Livescu (TTI-Chicago) and
Arthur Kantor (UIUC) for their valuable suggestions on GMTK.
7. REFERENCES
[1] K. Kirchhoff, G. Fink, and G. Sagerer, “Combining acoustic
and articulatory feature information for robust speech recognition”,
Speech Comm., vol.37, pp. 303-319, 2000.
[2] K. Kirchhoff, “Robust Speech Recognition Using Articulatory
Information”, PhD Thesis, University of Bielefeld, 1999.
[3] M. Richardson, J. Bilmes and C. Diorio, “Hidden-articulator
Markov models for speech recognition”, Speech Comm., 41(2-3),
pp. 511-529, 2003.
[4] R. Daniloff and R. Hammarberg, “On defining coarticulation”,
J. of Phonetics, Vol.1, pp. 239-248, 1973.
[5] M. Ostendorf, “Moving beyond the 'beads-on-a-string’ model
of speech”, Proc. of IEEE ASRU, Vol.1, pp. 79-83, CO, 1999.
[6] D. Jurafsky, W. Ward, Z. Jianping, K. Herold, Y. Xiuyang and
Z. Sen, “What kind of pronunciation variation is hard for triphones
to model?”, Proc. of ICASSP, Vol.1, pp. 577-580, Utah, 2001.
[7] C. Browman and L. Goldstein, “Articulatory Gestures as
Phonological Units”, Phonology, 6: 201-251, 1989.
[8] C. Browman and L. Goldstein, “Articulatory Phonology: An Overview”, Phonetica, 49: 155-180, 1992.
[9] V. Mitra, H. Nam, C. Espy-Wilson, E. Saltzman and L.
Goldstein, “Retrieving Tract Variables from Acoustics: a
comparison of different Machine Learning strategies”, IEEE J. of
Selected Topics on Sig. Proc., Vol. 4(6), pp. 1027-1045, 2010.
[10] V. Mitra, H. Nam, C. Espy-Wilson, E. Saltzman and L.
Goldstein, “Robust word recognition using articulatory trajectories
and Gestures”, Proc. of Interspeech, pp. 2038-2041, Japan, 2010.
[11] H. Nam, V. Mitra, M. Tiede, E. Saltzman, L. Goldstein, C.
Espy-Wilson and M. Hasegawa-Johnson, “A procedure for
estimating gestural scores from natural speech”, Proc. of
Interspeech, pp. 30-33, Japan, 2010.
[12] J. Westbury, “X-ray microbeam speech production database user’s handbook”, Univ. of Wisconsin, 1994.
[13] H.G. Hirsch and D. Pearce, “The Aurora experimental
framework for the performance evaluation of speech recognition
systems under noisy conditions”, Proc. of ISCA ITRW ASR2000,
pp. 181-188, Paris, France, 2000.
[14] Z. Ghahramani, “Learning dynamic Bayesian networks”, in,
Adaptive Processing of Temporal Information, C. L. Giles and M.
Gori, eds, pp. 168–197. Springer-Verlag, 1998.
[15] J. Bilmes, “GMTK: The Graphical Models Toolkit”, SSLI
Laboratory, Univ. of Washington, October 2002.
[http://ssli.ee.washington.edu/~bilmes/gmtk/]
[16] V. Mitra, H. Nam, C. Espy-Wilson, E. Saltzman and L.
Goldstein, “Tract variables for noise robust speech recognition”,
To appear in IEEE Trans. Audio, Speech & Lang. Processing.
[17] X. Xiao, J. Li, E.S. Chng, H. Li and C. Lee, “A Study on the
Generalization Capability of Acoustic Models for Robust Speech
Recognition”, IEEE Trans. Audio, Speech & Lang. Processing,
18(6), pp. 1158-1169, 2009.
[18] ETSI ES 202 050 Ver. 1.1.5, 2007. Speech processing,
transmission and quality aspects (STQ); distributed speech
recognition; advanced frontend feature extraction algorithm;
compression algorithms.