Discriminative Disfluency Modeling for Spontaneous Speech Recognition

Chung-Hsien Wu and Gwo-Lang Yan
Department of Computer Science and Information Engineering,
National Cheng Kung University, Tainan, Taiwan, R.O.C.
{Chwu,yangl}@csie.ncku.edu.tw
Abstract
Most automatic speech recognizers (ASRs) have concentrated on read speech, which differs from speech containing disfluencies. These ASRs cannot handle speech with a high rate of disfluencies, such as filled pauses, repetitions, repairs, false starts, and silent pauses, as found in actual spontaneous speech or dialogues. In this paper, we focus on modeling the filled pauses “uh” and “um.” Filled pauses are characterized by nasality and lengthening, and the acoustic parameters for these characteristics are analyzed and adopted for disfluency modeling. A Gaussian mixture model (GMM), trained by a discriminative training algorithm that minimizes the recognition error, is proposed. A transition probability density function is defined from the GMM and used to weight the transition probability between the boundaries of fluency and disfluency models in the one-stage algorithm. Experimental results show that the proposed method yields an improvement rate of 27.3% for disfluent sentences compared with the baseline system.
1. Introduction
Demand for ASR has grown in recent years in applications such as dialog systems, call managers, and weather forecasting systems. The most noticeable problem is the poor recognition rate of disfluent speech, because spontaneous speech is punctuated with and interrupted by a wide variety of seemingly meaningless words such as “uh” and “um.” These types of disfluent speech contain filled pauses, repetitions, repairs, false starts, and silent pauses. Any kind of disfluency disrupts the smooth speaking style of speech and therefore degrades ASR performance.
Most of the research on disfluency reported in the literature deals with read speech and treats the phenomena with a general recognition model [1]. These speech recognizers are typically HMM-based and accept only fluent read or planned speech without disfluency. They have difficulty dealing with filled pauses and word lengthening because the duration of a phone tends to lengthen differently. Some previous studies [2][3] focus on the language model to overcome and correct the recognition errors caused by disfluency. These works either take the disfluency into account or skip the disfluent words in the language model. This approach is also not effective enough for filled pauses because the pause can be inserted at almost arbitrary positions. Other studies [4][5] analyze the prosody of disfluent speech. They exploit F0 and the spectrum to derive rules and parameters for detecting the disfluent positions in speech.
In this paper, the filled pauses “ah” and “um” are investigated because the properties of these two disfluencies are similar. The recognition of these pauses is important because they play valuable roles in oral communication, such as thinking and helping a speaker keep a conversation turn.
The proposed system architecture is shown in Figure 1. The parameters of the input speech are analyzed according to the properties of these two kinds of disfluencies. These parameters comprise MFCC, delta MFCC, formant 1 (F1), formant 2 (F2), and formant magnitude ratio (FMR). A GMM [8], whose weights are estimated by a discriminative training algorithm that minimizes the recognition error recursively, is proposed to model these parameters. A transition probability density function is defined and used to weight the transition probability between the boundaries of fluency and disfluency models in the one-stage algorithm.
Figure 1: System architecture of the speech recognizer for speech with disfluency
[Figure 1 blocks: Speech Wave; MFCC, delta MFCC, F1, F2, FMR; Disfluency GMM; Fluency GMM; Estimation of Boundary Transition Probability; One-Stage Algorithm; Recognizer; Syllable HMMs with ah, um]
2. Parameter Analysis of Filled Pauses “ah” and “um”
Since filled pauses can appear anywhere in speech when people are talking with each other, the corpus of spontaneous conversations was collected from natural human-to-human dialogues. The spontaneous speech database contains over 30 hours of recorded speech, spoken by over 40 speakers of both sexes. According to our preliminary observation, the filled pauses “ah” and “um” can be characterized by two properties: lengthening and nasality. The acoustic analysis is described in the following sections.
2.1 Parameter for Lengthening Characteristic
For voice lengthening, the vocal cords vibrate periodically and the vocal tract is maintained in a relatively stable configuration throughout the utterance. In other words, the produced voice changes smoothly. Figure 2 (a) and (b) show the waveform and spectrogram of the utterance “嗯…你好” (um … how are you). The lengthened voice “um” occurs at the beginning of the utterance. Its spectrogram is almost steady compared with that of the voice at the end of the utterance. Accordingly, the cepstral coefficients modeling the vocal tract are chosen as the parameters. In our approach, we choose 12 mel-frequency cepstrum coefficients (MFCCs) and 12 delta MFCCs, which are useful for detecting the steady property of the utterance. Figure 2 (c) shows that the 12 MFCCs and 12 delta MFCCs are stable over the lengthened, steady portion of the utterance.
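The role of the delta coefficients as a steadiness cue can be illustrated with a small sketch. The paper does not state how its delta MFCCs are computed, so the regression formula, the window of 2 frames, the edge handling, and the function names below are all assumptions, and the toy three-coefficient frames stand in for real 12-dimensional MFCC vectors.

```python
from typing import List


def delta(frames: List[List[float]], window: int = 2) -> List[List[float]]:
    """Regression-based delta coefficients (assumed formula, not from the paper).

    Frames beyond either end are replaced by the first or last frame.
    """
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for c in range(len(frames[t])):
            num = sum(
                n * (frames[min(t + n, last)][c] - frames[max(t - n, 0)][c])
                for n in range(1, window + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out


def mean_abs_delta(frames: List[List[float]]) -> float:
    """Average magnitude of the delta coefficients: near zero for a steady segment."""
    d = delta(frames)
    return sum(abs(v) for row in d for v in row) / (len(d) * len(d[0]))


# Toy data: a perfectly steady segment versus one whose coefficients drift.
steady = [[1.0, -0.5, 0.25] for _ in range(6)]
changing = [[1.0 * t, -0.5 * t, 0.25 * t] for t in range(6)]

print(mean_abs_delta(steady))    # steady segment: deltas vanish
print(mean_abs_delta(changing))  # drifting segment: clearly non-zero
```

A lengthened filled pause would be expected to behave like the steady segment here, with delta values close to zero over many consecutive frames.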
Figure 2: (a) Waveform, (b) spectrogram, and (c) 12 MFCCs and 12 delta MFCCs of the utterance “嗯…你好” (um … how are you)
2.2 Parameter for Nasal Effect
The second property of “ah” and “um” is the nasal effect. In the production of nasals, the resonance characteristics are conditioned by the oral cavity characteristics forward and backward from the velum and by the nasal tract characteristics from the velum to the nostrils. This special production procedure causes a particular formant change. Many studies [6][7] have reported on the nasalized voice. The noticeable cues are the first two formants, F1 and F2 (at about 300 and 1000 Hz), compared with normal sounds, which have F1 and F2 at about 250-800 and 700-2500 Hz. Figure 3 shows the plot of F2 versus F1 for the vowels “a”, “i”, “u”, “e”, and “o”, and the filled pauses “ah” and “um.” The marks “x”, representing the filled pauses “ah” and “um,” can be distinguished from the vowels in the vowel triangle. Because the frequencies F1 and F2 for “ah” and “um” characterize the nasal sound well, they are also chosen as parameters. In our experiment, the average frequencies of F1 and F2 for “ah” and “um” are listed in Table 1.
Figure 3: Plot of F2 versus F1 for vowels “a”, “i”, “u”, “e”, and “o”, and filled pauses “ah” and “um.”
[Figure 3 axes: F1 (Hz), 0 to 1000; F2 (Hz), 500 to 3000. Legend: (*) “a”, (+) “i”, square “u”, diamond “e”, (o) “o”, (x) “ah” and “um”]
Table 1: The formant frequencies F1 and F2 of “ah” and “um”
             “ah”       “um”
Formant 1    239 Hz     233 Hz
Formant 2    1021 Hz    1268 Hz
2.3 Formant Magnitude Ratio
Since the filled pauses have the characteristics of steady F1 and F2, the magnitudes of the formants play an important role in characterizing the disfluencies. A formant magnitude ratio is thus defined as follows.
$$ R = \frac{\mathrm{Magnitude}(F_2)}{\mathrm{Magnitude}(F_1)} \tag{1} $$
where Magnitude(F1) and Magnitude(F2) represent the magnitudes of F1 and F2. The equation formulates the degree of the decrease in magnitude when the voice is produced through the nostrils. Figure 4 shows the histogram of the formant magnitude ratio for “ah” and “um”. The mean of the histogram is about 0.08.
Figure 4: The histogram of FMR for “ah” and “um”
By contrast, Figure 5 shows the histogram of the formant magnitude ratio for normal voices. The distribution mean is about 0.8. The formant magnitude ratio is therefore a distinctive cue for the nasal characteristic of “ah” and “um.”
Figure 5: The histogram of FMR for normal voices
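As a rough illustration of how far apart the two FMR distributions lie, the ratio of Equation 1 can be computed and compared with the two histogram means quoted in the text (about 0.08 and about 0.8). The nearest-mean comparison below is only an illustrative assumption; the paper itself feeds FMR into a GMM together with the other parameters, and the function names and sample magnitudes are hypothetical.

```python
FILLED_PAUSE_MEAN = 0.08  # approximate FMR histogram mean for "ah"/"um" (from the text)
NORMAL_VOICE_MEAN = 0.8   # approximate FMR histogram mean for normal voices (from the text)


def formant_magnitude_ratio(mag_f1: float, mag_f2: float) -> float:
    """Equation 1: magnitude at F2 divided by magnitude at F1."""
    if mag_f1 <= 0:
        raise ValueError("Magnitude(F1) must be positive")
    return mag_f2 / mag_f1


def closer_to_filled_pause(ratio: float) -> bool:
    """Illustrative nearest-mean check; not the classifier used in the paper."""
    return abs(ratio - FILLED_PAUSE_MEAN) < abs(ratio - NORMAL_VOICE_MEAN)


# Hypothetical magnitudes: a strongly attenuated F2 versus a comparable F2.
nasal_like = formant_magnitude_ratio(mag_f1=10.0, mag_f2=1.0)
vowel_like = formant_magnitude_ratio(mag_f1=10.0, mag_f2=7.0)
print(nasal_like, closer_to_filled_pause(nasal_like))
print(vowel_like, closer_to_filled_pause(vowel_like))
```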
3. Disfluency Modeling Using GMM
3.1 Gaussian Mixture Model
The Gaussian mixture model is the most commonly used statistical model in speech and speaker recognition systems. In this model, the covariance matrix is usually assumed to be diagonal in applications. This assumption discards the cross-correlation between parameters in exchange for reduced computation. In speech or speaker recognition systems, parameters are modeled as a class whose output probability is represented by a Gaussian mixture density. In the GMM [8], the output probability is calculated from each class with its weight of importance, because each class should make a different contribution to the output probability. The framework is depicted in Figure 6.
Figure 6: The framework of the GMM
[Figure 6 blocks: Input Parameters; Gaussian Mixture Model (Mixture 1, Mixture 2, Mixture 3, ..., Mixture M); Mixture Weights (W1, W2, W3, ..., WM); Output Probability]
In our approach, the parameters analyzed in Section 2 are modeled by a GMM with 16 Gaussian mixtures using the modified k-means algorithm. The weights are initially set to the percentage of the number of parameter sets belonging to each class with respect to the total number. They are then updated by gradient descent using the discriminative training algorithm. In this model, the input parameters are fed into all Gaussian mixtures, and the output probability is obtained from the weighted summation of the mixture outputs.
3.2 Discriminative Training of Mixture Weights
In this work, we define a discriminative training framework that is tailored to the disfluency and fluency GMMs. A disfluency verification function is defined to form a linear discriminator whose weights are discriminatively trained. Given a disfluency GMM, $GMM_H$, and a fluency GMM, $GMM_{\bar{H}}$, the verification function is defined as
$$ V(x_t; H) = \sum_m \left[ W_{H,m}\, GMM_{H,m}(x_t) - W_{\bar{H},m}\, GMM_{\bar{H},m}(x_t) \right] \tag{2} $$
where $x_t$ represents the t-th parameter vector of the input speech, and the weight vectors $W_{H,m}$ and $W_{\bar{H},m}$ are the m-th mixture weights for the disfluency model $H$ and the fluency model $\bar{H}$, respectively. The terms $W_{H,m}\, GMM_{H,m}(x_t)$ and $W_{\bar{H},m}\, GMM_{\bar{H},m}(x_t)$ represent the output probabilities of the disfluency and fluency GMMs. A loss function is defined and minimized with respect to the weights. The loss function represents a smooth functional form of the verification error. It takes the form of a sigmoid function, which is written as
$$ R(W_m, x_t) = \frac{1}{1 + \exp[-\gamma\, b\, V(x_t; H)]} \tag{3} $$
where
$$ b = \begin{cases} -1 & \text{if } x_t \in H \\ +1 & \text{if } x_t \in \bar{H} \end{cases} \tag{4} $$
$$ W_m = [W_{H,m},\, W_{\bar{H},m}] \tag{5} $$
and $\gamma$ is a constant that controls the steepness of the sigmoid function.
According to the usual discriminative training methodology, an optimization criterion is defined to minimize the recognition error, and a gradient descent algorithm is ideally used to iteratively update the mixture weights. However, since the probability density function of $x_t$ is not known, a gradient-based iterative procedure is used to minimize R as follows:
$$ (W_m)_{n+1} = (W_m)_n - \varepsilon\, \nabla R((W_m)_n, x_t) \tag{6} $$
where $\varepsilon$ is the step size and $\nabla R((W_m)_n, x_t)$ is the gradient of the loss function with respect to $W_m$, estimated from the training samples.
3.3 Integration of One-Stage Algorithm and GMM
The one-stage algorithm is employed to calculate the local distances for each test frame against the state mixtures in the HMMs, and it applies two types of transition constraints: one for the interior of the models and one for the model boundaries. Finally, the n-best paths are backtracked. A minimum accumulated distance, $D(x_i, s, k)$, along any path to the grid point is defined as
$$ D(x_i, s, k) = d(x_i, s, k) + \min\{ D(x_{i-1}, s, k),\ D(x_{i-1}, s-1, k),\ D(x_i, s-1, k) \} \tag{7} $$
where $d(x_i, s, k)$ is the local distance between the feature $x_i$ and the s-th state of the acoustic model $k$. With the model boundaries at s=1, the boundary transition yields:
$$ D(x_i, 1, k) = d(x_i, 1, k) + \min\{ D(x_{i-1}, 1, k),\ D(x_{i-1}, SN(k^*), k^*) \} \tag{8} $$
where $SN(k^*)$ represents the state number of model $k^*$.
In traditional approaches, the model boundary transition probability is set to 1. In this paper, the boundary transition probability is determined by the GMM. The transition probability density function is defined as
$$ TPDF_H(V) = G_H\!\left(\frac{V-\mu}{\sigma}\right) = \int_{-\infty}^{V} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] dx \tag{9} $$
where V is the verification score calculated by Equation 2 and G is the integral function for the normal curve of verification scores, with mean $\mu$ and standard deviation $\sigma$, estimated from the training examples that are tagged as disfluencies. The transition probability density function is embedded in the transition between model boundaries as
$$ D(x_i, 1, k) = d(x_i, 1, k) + \min\{ D(x_{i-1}, 1, k),\ BTP(x_i | k^*) \cdot D(x_{i-1}, SN(k^*), k^*) \} \tag{10} $$
where the boundary transition probability $BTP(x_i | k^*)$ in the one-stage algorithm is defined as follows:
$$ BTP(x_i | k^*) = \begin{cases} TPDF_{k^*}(V(x_i, k^*)) & \text{if } k^* \in \text{Disfluency} \\ T,\quad 0 \le T \le 1 & \text{if } k^* \notin \text{Disfluency} \end{cases} \tag{11} $$
where T is a threshold, chosen as 0.6 according to our experiment. This approach can minimize the recognition errors caused by disfluent speech, and the GMM controls the model boundary transition between fluency and disfluency models.
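The chain from verification score to boundary transition in Equations 2 to 11 can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the loss assumes the sign convention in which a disfluent frame takes b = -1, the steepness, step size, and the values of mu and sigma are invented for the example, and the mixture outputs are toy numbers rather than real GMM densities.

```python
import math
from typing import List, Sequence, Tuple


def verification_score(w_dis: Sequence[float], p_dis: Sequence[float],
                       w_flu: Sequence[float], p_flu: Sequence[float]) -> float:
    """Eq. (2): weighted disfluency mixture outputs minus weighted fluency ones."""
    dis = sum(w * p for w, p in zip(w_dis, p_dis))
    flu = sum(w * p for w, p in zip(w_flu, p_flu))
    return dis - flu


def sigmoid_loss(v: float, is_disfluent: bool, gamma: float = 1.0) -> float:
    """Eqs. (3)-(4): smooth verification error, with b = -1 for disfluent frames."""
    b = -1.0 if is_disfluent else 1.0
    return 1.0 / (1.0 + math.exp(-gamma * b * v))


def gradient_step(w_dis: Sequence[float], p_dis: Sequence[float],
                  w_flu: Sequence[float], p_flu: Sequence[float],
                  is_disfluent: bool, gamma: float = 1.0,
                  step: float = 0.1) -> Tuple[List[float], List[float]]:
    """Eq. (6): one gradient-descent update of both weight vectors."""
    v = verification_score(w_dis, p_dis, w_flu, p_flu)
    r = sigmoid_loss(v, is_disfluent, gamma)
    b = -1.0 if is_disfluent else 1.0
    dr_dv = gamma * b * r * (1.0 - r)          # derivative of the sigmoid loss w.r.t. V
    new_dis = [w - step * dr_dv * p for w, p in zip(w_dis, p_dis)]  # dV/dW_H = +p
    new_flu = [w + step * dr_dv * p for w, p in zip(w_flu, p_flu)]  # dV/dW_Hbar = -p
    return new_dis, new_flu


def tpdf(v: float, mu: float, sigma: float) -> float:
    """Eq. (9): normal cumulative distribution evaluated at the verification score."""
    return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))


def boundary_transition_probability(v: float, prev_is_disfluency: bool,
                                    mu: float, sigma: float,
                                    threshold: float = 0.6) -> float:
    """Eq. (11): TPDF when the preceding model k* is a disfluency, else the constant T."""
    return tpdf(v, mu, sigma) if prev_is_disfluency else threshold


def boundary_distance(d_local: float, stay: float, exit_prev: float, btp: float) -> float:
    """Eq. (10): first-state distance with the weighted cross-model transition."""
    return d_local + min(stay, btp * exit_prev)


# Toy frame tagged as disfluent, with two mixtures per model.
w_dis, w_flu = [0.5, 0.5], [0.5, 0.5]
p_dis, p_flu = [0.2, 0.4], [0.1, 0.1]
v = verification_score(w_dis, p_dis, w_flu, p_flu)
loss_before = sigmoid_loss(v, is_disfluent=True)
new_dis, new_flu = gradient_step(w_dis, p_dis, w_flu, p_flu, is_disfluent=True)
loss_after = sigmoid_loss(verification_score(new_dis, p_dis, new_flu, p_flu), True)
print(v, loss_before, loss_after)
print(boundary_transition_probability(v, True, mu=0.2, sigma=0.1))
```

The sketch updates the weights without renormalizing them; whether the original training constrains the mixture weights is not stated in the text.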
4. Experiments
4.1 Baseline System and Database
The baseline system for this work is an ASR with normal subsyllable HMMs and disfluency HMMs for “ah” and “um.” The training corpus consists of two parts. The first part is TCC300, collected by three universities in Taiwan. TCC300 contains 1642 and 1131 sentences without disfluencies from male and female speakers, respectively. The second part of the corpus was collected from 40 speakers and contains 240 and 160 sentences with the disfluencies “ah” and “um” from male and female speakers, respectively. The testing data was also collected and consists of 398 fluent and 127 disfluent sentences.
4.2 Experiments on Baseline System with GMM
An experiment was conducted to compare the performance of the baseline system with that of the baseline system with GMM. Before the comparison, the threshold T in Equation 11 must be determined. An experiment was designed to evaluate different values of T, and the result is shown in Figure 7. The average recognition rate decreases dramatically when T is smaller than 0.5, because the transition probability then dominates some incorrect boundary transitions in the one-stage algorithm. When T is chosen as 0.6, the system achieves the best average recognition rate.
Figure 7: Recognition rates for fluency and disfluency as a function of the values of T
[Figure 7 axes: recognition rate, 0.65 to 0.83, versus T from 1 down to 0.3. Curves: recognition rate of fluent sentences, recognition rate of disfluent sentences, average recognition rate]
Table 2: Recognition rates (%) for the baseline system and the baseline system with GMM
                            Fluent sentences    Disfluent sentences
Baseline system             80.4                74
Baseline system with GMM    79.1                81.1
The recognition rates with T=0.6 are shown in Table 2. On fluent sentences, the baseline system performs slightly better than the baseline system with GMM (80.4% versus 79.1%). This is because “ah” and “um” have phonetic properties similar to those of some sub-syllables, which results in a slight degradation in recognition performance. On disfluent sentences, the baseline system with GMM performs much better than the baseline system (81.1% versus 74%). The improvement rate reaches 27.3%. This is because the transition probability density function correctly guides the model boundary transition. The experimental result shows that the transition probability generated from the GMM is helpful for avoiding incorrect transitions in the one-stage algorithm.
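The paper does not define "improvement rate." One reading that reproduces the reported 27.3% from the Table 2 figures for disfluent sentences (74% and 81.1%) is the relative reduction in error rate; the sketch below assumes that interpretation, and the function name is illustrative.

```python
def relative_error_reduction(baseline_rate: float, new_rate: float) -> float:
    """Percentage of the baseline error rate removed by the new system.

    Both arguments are recognition rates in percent.
    """
    baseline_error = 100.0 - baseline_rate
    new_error = 100.0 - new_rate
    return 100.0 * (baseline_error - new_error) / baseline_error


# Disfluent sentences in Table 2: 74% (baseline) versus 81.1% (with GMM).
print(round(relative_error_reduction(74.0, 81.1), 1))  # 27.3
```

Under the same reading, the fluent-sentence figures (80.4% and 79.1%) correspond to a small relative increase in error, in line with the slight degradation described in the text.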
5. Conclusion
In this paper, the properties of the filled pauses “ah” and “um” have been analyzed and modeled by the GMM. A discriminative training methodology is employed to train the weights in the GMM with a gradient-based iterative procedure. The GMM then generates the boundary transition probability for model transition. This boundary transition probability is integrated into the one-stage algorithm to give the final recognition results. Experimental results show that the transition probability generated from the GMM is helpful for avoiding incorrect transitions in the one-stage algorithm. With only a slight degradation for fluent speech (80.4% to 79.1%), a significant improvement for disfluent speech was achieved using the GMM.
6. Acknowledgment
The authors would like to thank the National Science Council,
R.O.C., for its financial support of this work, under Contract
No. NSC89-2614-H-006-002-F20.
7. References
[1] A. Kai and S. Nakagawa. “Investigation on unknown word processing and strategies for spontaneous speech understanding.” Proceedings of Eurospeech’95, pp. 2095-2098, 1995.
[2] A. Stolcke and E. Shriberg. “Statistical Language Model for speech disfluencies.” Proceedings of ICASSP-96, vol. 1, pp. 405-408.
[3] M. Siu and M. Ostendorf. “Variable N-Grams and extensions for conversational speech Language Modeling.” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, pp. 63-75, Jan. 2000.
[4] M. Gabrea and D. O’Shaughnessy. “Detection of filled pauses in spontaneous conversation speech.” Proceedings of ICSLP 2000.
[5] M. Goto, K. Itou, and S. Hayamizu. “A real-time filled pause detection system for spontaneous speech recognition.” Proceedings of ICASSP 2000.
[6] G. Feng and E. Castelli. “Some acoustic features of nasal and nasalized vowels: A target for vowel nasalization.” J. Acoust. Soc. Am., 99(6): 3694-3706, 1996.
[7] D. Recasens. “Place cues for nasal consonants with special reference to Catalan.” J. Acoust. Soc. Am., 73(4): 1346-1353, 1983.
[8] F. Beaufays, M. Weintraub, and Y. Konig. “Discriminative mixture weight estimation for large Gaussian mixture models.” Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 337-340.