Flexible Speech Act Identification of Spontaneous Speech with Disfluency

Chung-Hsien Wu and Gwo-Lang Yan
Department of Computer Science and Information Engineering
National Cheng Kung University, Tainan, Taiwan, ROC
chwu@csie.ncku.edu.tw
ABSTRACT
This paper describes an approach for flexible speech act
identification of spontaneous speech with disfluency. In this
approach, semantic information, syntactic structure, and
fragment features of an input utterance are statistically
encapsulated into a proposed speech act hidden Markov model
(SAHMM) to characterize the speech act. To deal with the
disfluency problem in a sparse training corpus, an interpolation
mechanism is exploited to re-estimate the state transition
probability in SAHMM. Finally, the dialogue system accepts the speech act with the best score and returns the corresponding
response. Experiments were conducted to evaluate the proposed
approach using a spoken dialogue system for the air travel
information service. A testing database from 25 speakers
containing 480 dialogues including 3038 sentences was collected
and used for evaluation. Using the proposed approach, the
experimental results show that the performance can achieve
90.3% in speech act correct rate (SACR) and 85.5% in fragment
correct rate (FCR) for fluent speech and gains a significant
improvement of 5.7% in SACR and 6.9% in FCR compared to
the baseline system without considering filled pauses for
disfluent speech.
1. Introduction
In the age of information explosion, access to information is becoming a pivotal service in our daily lives. A spoken dialogue system can play a crucial role as a human-computer interface, enabling users to consult computers for their needs. In the past decade, several spoken dialogue systems have been demonstrated in real-world
applications such as air travel information services (ATIS),
weather forecast systems, automatic call managers, and railway
ticket reservations [1][2]. However, there are still many problems that make dialogue systems impractical. Many approaches have been proposed to deal with these problems. One
promising approach is to model the speech act of the input
speech, which conveys the speaker’s intention, goal, and need in
a spoken utterance. In addition, the speech act is also helpful for handling ill-formed or badly recognized utterances [3][4].
Some approaches that used a large set of rules to explain the
syntactic and semantic possibilities for spoken sentences
suffered from a lack of robustness when faced with the wide variety of spoken sentences and the disfluency problem. The
derivation of syntactic and semantic rules is labor intensive, time
consuming and tedious. It is indeed difficult to collect
appropriate and complete rules to describe the syntactic and
semantic diversities. Moreover, these methods do not consider the disfluency problem. In the keyword-based approach [5],
keywords were collected for a specific speech act to identify the
intention from a sentence. This approach lacks the semantic
information and syntactic structure of a sentence. Ambiguous results generally occur when these indispensable relations are lost.
In this paper, a statistical speech act hidden Markov
model (SAHMM) is proposed to statistically characterize the
semantic information and syntactic structure of a speech act. An
interpolation mechanism is exploited to re-estimate the state
transition probability in SAHMM in order to deal with the
disfluency problem in a sparse training corpus. Performance
evaluation was conducted using a spoken dialogue system for an
ATIS. The system architecture includes four main components.
In the speech recognition component, a speech recognizer containing fluency and disfluency HMMs is adopted to recognize the input speech into possible fragment sequences according to a fragment dictionary and a fragment bigram language model. In the semantic analysis component, 30 speech acts
are defined and their corresponding SAHMMs are trained and
used to select a potential speech act. For the identified speech
act, the dialogue component and the text-to-speech component
will give a corresponding response via speech.
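To make the data flow through these four components concrete, the following minimal Python sketch wires them together as plain function stubs. All names (recognize_fragments, identify_speech_act, and so on) and the toy outputs are hypothetical illustrations, not interfaces taken from the described system.

```python
# Hypothetical pipeline sketch of the four components described above.
# None of these function names come from the paper; they only illustrate
# the order in which information flows through the system.

def recognize_fragments(speech):
    """Speech recognition: fluency/disfluency HMMs produce candidate
    fragment sequences scored by a fragment bigram language model."""
    return [["list", "flights", "ah", "to", "Taipei"]]  # toy output

def identify_speech_act(fragment_sequences):
    """Semantic analysis: score each candidate against the 30 SAHMMs
    and keep the best (speech act, fragment sequence) pair."""
    return "query_flight", fragment_sequences[0]

def dialogue_manager(speech_act, fragments):
    """Dialogue component: decide the response content."""
    return f"Handling '{speech_act}' for fragments {fragments}"

def text_to_speech(response_text):
    """Text-to-speech component: here we simply print the response."""
    print(response_text)

if __name__ == "__main__":
    candidates = recognize_fragments(speech=None)   # stand-in for audio input
    act, frags = identify_speech_act(candidates)
    text_to_speech(dialogue_manager(act, frags))
```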
2. Speech Act Hidden Markov Model
In a spoken dialogue system, an intelligent behavior depends on
explicitly identifying the speech act of a sentence. In this paper,
a fragment is defined as a combination of words or
characters that generally appear together in a specific domain. A
fragment extraction algorithm in [4] is adopted to extract the
fragments from the training corpus for the following modeling.
A speech act hidden Markov model (SAHMM) is proposed to
model not only the syntactic structure but also the semantic
information of a spoken utterance.
2.1. Observation Probability in SAHMM
Three measures are defined to encapsulate semantic information,
syntactic structure, and fragment features into the observation
probability for each SAHMM, as described below.
2.1.1. Semantic Observation Probability
We adopt the Kullback-Leibler distance to measure the semantic
similarity between an input fragment and the center of the
fragment class (FC) in the SAHMM. In order to acquire a
symmetric distance measure, the divergence measure is generally
defined as equation (1) to represent the “distance” between two
probability distributions p and q.
$$\mathrm{Div}(p, q) = \frac{KL(p \,\|\, q) + KL(q \,\|\, p)}{2} \qquad (1)$$
The semantic observation probability is derived based on the
fragment occurrence with respect to a speech act. We formulate
the conditional probability of a fragment $w$ with respect to a speech act $SA_x$ as a pmf given by the set of probabilities $p_w(SA_x) = P(w \mid SA_x)$, $x = 1, 2, \ldots, H$. $P(w \mid SA_x)$ represents the prior probability of a fragment $w$ present in the $x$-th speech act, calculated from the training corpus. Then the pmf for the FC $c_j$ can be defined as:
$$p_{c_j}(SA_x) = \sum_{w_k \in c_j} p_{w_k}(SA_x) \cdot u_{jk} \qquad (2)$$
where $u_{jk}$ is the fuzzy membership of the fragment $w_k$ in the FC $c_j$. In our approach, the divergence measure for an input fragment $w_i$ in the FC $c_j$ is defined as $\mathrm{Div}(p_{w_i}(SA_x), p_{c_j}(SA_x))$.
For the normalization of observation measures, an exponential
function is adopted to convert the divergence measure to the
semantic observation probability defined as:
$$p_{Sem}(w_i \mid S_t = c_j) = e^{-\mathrm{Div}(p_{w_i}(SA_x),\, p_{c_j}(SA_x))} \qquad (3)$$

where $S_t$ denotes the state at time $t$ in an SAHMM.
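As a concrete illustration of equations (1)-(3), the following Python sketch computes the symmetric divergence between a fragment's speech-act pmf and a fragment class's fuzzy-weighted pmf, and converts it to the semantic observation probability. The toy pmfs and membership values are invented; a small floor constant avoids log(0) and the class pmf is renormalized, two details the equations leave implicit.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete pmfs given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def divergence(p, q):
    """Symmetric divergence of equation (1)."""
    return (kl(p, q) + kl(q, p)) / 2.0

def class_pmf(fragment_pmfs, memberships):
    """Equation (2): fuzzy-membership-weighted sum of fragment pmfs."""
    dims = len(next(iter(fragment_pmfs.values())))
    pmf = [0.0] * dims
    for frag, p in fragment_pmfs.items():
        u = memberships[frag]
        for x in range(dims):
            pmf[x] += p[x] * u
    total = sum(pmf)
    return [v / total for v in pmf]  # renormalize so it remains a pmf

def semantic_observation_prob(p_w, p_c):
    """Equation (3): exponential of the negative divergence."""
    return math.exp(-divergence(p_w, p_c))

# Toy example over H = 3 speech acts (values are illustrative only).
fragment_pmfs = {"flight": [0.7, 0.2, 0.1], "ticket": [0.5, 0.3, 0.2]}
memberships = {"flight": 0.9, "ticket": 0.6}
p_c = class_pmf(fragment_pmfs, memberships)
print(semantic_observation_prob(fragment_pmfs["flight"], p_c))
```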
2.1.2. Syntactic Observation Probability
The syntactic observation probability shows how closely the fragment matches the FC. This probability is estimated
according to the cosine measure [8] and defined as
$$p_{Syn}(w_i \mid S_t = c_j) = \cos\bigl(BBV(w_i), BBV(\mu_j)\bigr) = \frac{BBV(w_i) \cdot BBV(\mu_j)}{\lVert BBV(w_i) \rVert \, \lVert BBV(\mu_j) \rVert} \qquad (4)$$
where $BBV(w_i)$ is a vector defined as

$$BBV(w_i) = \bigl[\, f(Fg_1, w_i), \ldots, f(Fg_K, w_i),\; f(w_i, Fg_1), \ldots, f(w_i, Fg_K) \,\bigr] \qquad (5)$$
In the above equation, $f(Fg_k, w_i)$ represents the frequency of the fragment $Fg_k$ just preceding (or $f(w_i, Fg_k)$ for succeeding) the input fragment $w_i$ in the training corpus, and $K$ is the total number of extracted fragments. The class center vector $BBV(\mu_j)$ of the FC $c_j$ can be estimated as follows.
$$BBV(\mu_j) = \sum_{i=1}^{K} u_{ji} \, BBV(w_i) \qquad (6)$$

where $u_{ji}$ is the fuzzy membership of $w_i$ in the FC $c_j$.

2.1.3. Fragment Class Observation Probability
The fragment class observation probability for fragment $w_i$ at state $S_t$ in the SAHMM is defined as

$$p_{FC}(w_i \mid S_t = c_j) = \frac{N(w_i) \cdot u_{ji}}{\sum_{w_k \in c_j} N(w_k) \cdot u_{jk}} \qquad (7)$$

where $N(w_i)$ represents the frequency of fragment $w_i$ in speech act $SA_x$ and $u_{ji}$ is the membership of $w_i$ in the FC $c_j$.
For each input fragment, the observations are represented as $Y = \{y_1, y_2, \ldots, y_M\}$, in which $y_1$ is for the semantic observation, $y_2$ for the syntactic observation, and $y_3$ for the fragment class observation in an SAHMM. The observation probability of an input fragment $w_i$ is defined as the linear combination of the three observations $y_z$ ($z = 1, \ldots, M$ and $M = 3$) at state $S_t$ and estimated as

$$b_{c_j}(w_i) = \sum_{y_z \in Y,\ 1 \le z \le M} m_z \, P(w_i \mid y_z) = m_1 \, p_{Sem}(w_i \mid S_t = c_j) + m_2 \, p_{Syn}(w_i \mid S_t = c_j) + m_3 \, p_{FC}(w_i \mid S_t = c_j) \qquad (8)$$

where $m_1$, $m_2$ and $m_3$ are the weights for the semantic, syntactic, and fragment class observation probabilities, respectively, and satisfy the condition $m_1 + m_2 + m_3 = 1$.
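The sketch below illustrates equations (4) and (8) with invented counts and weights: the cosine similarity between bigram-based vectors and the weighted combination of the three observation probabilities. It assumes the semantic and fragment class probabilities are already available from the preceding steps.

```python
import math

def cosine(u, v):
    """Equation (4): cosine similarity between two BBV vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def observation_prob(p_sem, p_syn, p_fc, weights=(0.4, 0.3, 0.3)):
    """Equation (8): linear combination with m1 + m2 + m3 = 1."""
    m1, m2, m3 = weights
    assert abs(m1 + m2 + m3 - 1.0) < 1e-9
    return m1 * p_sem + m2 * p_syn + m3 * p_fc

# BBV(w_i): preceding- and succeeding-fragment frequencies (toy values).
bbv_w = [3, 0, 1, 2, 5, 0]
bbv_mu = [2.4, 0.3, 1.1, 1.8, 4.2, 0.2]  # fuzzy-weighted class center, eq. (6)
p_syn = cosine(bbv_w, bbv_mu)
print(observation_prob(p_sem=0.62, p_syn=p_syn, p_fc=0.35))
```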
2.2. State Transition Probability
The state-transition probability from state $S_t = c_i$ at time $t$ to state $S_{t+1} = c_j$ at time $t+1$ is estimated as

$$a_{c_i c_j} = P(S_{t+1} = c_j \mid S_t = c_i) = \frac{N_c(S_{t+1} = c_j \mid S_t = c_i)}{N_c(S_t = c_i)}, \quad 1 \le c_i, c_j \le N \qquad (9)$$

where $N_c(\cdot)$ is the frequency. In spontaneous speech, disfluencies such as filled pauses, repetitions, lengthening, repairs, false starts and silence pauses are quite common. They are not easy to characterize with a language model. Many strategies, such as language model adaptation techniques, backoff models, and the maximum entropy criterion, have been tried to solve this problem. Up to now, disfluency modeling is still a tough problem, especially for spoken language. It is impossible to collect a corpus of spontaneous conversations that covers all occurrences of disfluencies. Instead of collecting the diversified conversations exhaustively, an alternative approach is proposed to estimate the probabilities from the sparse corpus. In this approach, an interpolation method is exploited to re-estimate the state transition probability if a disfluency fragment $c_i$ occurs, as shown in Fig. 1(a). In this figure, the transition probabilities $a_{c_{i-1} c_i}$ and $a_{c_i c_{i+1}}$ for the fragment classes from $c_{i-1}$ to $c_i$ and from $c_i$ to $c_{i+1}$, respectively, cannot be properly estimated due to the problem of data sparseness. The only information is the jump cue $a_{c_{i-1} c_{i+1}}$, which represents the transition probability without disfluency. This probability is estimated from the training corpus and then used to back off the bigram probability by interpolation, as shown in Fig. 1(b).

Figure 1: Diagram of the interpolation method to estimate the transition probabilities $a_{c_{i-1} c_i}$ and $a_{c_i c_{i+1}}$ using the jump cue $a_{c_{i-1} c_{i+1}}$.

These two transition probabilities are estimated as follows.

$$\hat{a}_{c_{i-1} c_i} = \begin{cases} \delta_i \cdot a_{c_{i-1} c_i} + (1 - \delta_i) \, a_{c_{i-1} c_{i+1}}, & \text{if } c_i \in \text{Disfluency} \\ a_{c_{i-1} c_i}, & \text{otherwise} \end{cases} \qquad (10)$$

$$\hat{a}_{c_i c_{i+1}} = \begin{cases} \gamma_i \cdot a_{c_i c_{i+1}} + (1 - \gamma_i) \, a_{c_{i-1} c_{i+1}}, & \text{if } c_i \in \text{Disfluency} \\ a_{c_i c_{i+1}}, & \text{otherwise} \end{cases} \qquad (11)$$
The weights δi and γi in the interpolation equations are calculated
by observing a disfluency fragment at state ci in the training
corpus and defined as follows:
$$\delta_i = \frac{\lambda_{c_i}}{N_c(c_{i-1}) + N_{unseen}(c_{i-1}, c_i) \cdot \lambda_{c_i}} \qquad (12)$$

$$\gamma_i = \frac{\lambda_{c_i}}{N_c(c_{i+1}) + N_{unseen}(c_i, c_{i+1}) \cdot \lambda_{c_i}} \qquad (13)$$
where $N_c(\cdot)$ is the number of occurrences of a FC and $\lambda_{c_i}$ is the expected number of occurrences of the unseen FC $c_i$. $N_{unseen}(c_{i-1}, c_i)$ is the number of occurrences of the unseen FC $c_i$ following $c_{i-1}$. The most widely used value for $\lambda_{c_i}$ is 0.5. This choice can be theoretically justified as the expectation of the same quantity that is maximized by MLE. The original idea of this method comes from the Jeffreys-Perks law [7].
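A minimal Python sketch of the interpolation in equations (10)-(13), assuming bigram transition counts over fragment classes are available as a dictionary. The weight formula follows the reconstruction above with the expected unseen count set to 0.5; the variable names, the toy counts, and the single-unseen-class simplification are assumptions for illustration only.

```python
def transition_prob(counts, ci, cj):
    """Equation (9): relative-frequency estimate of a_{ci cj}."""
    total = sum(n for (a, b), n in counts.items() if a == ci)
    return counts.get((ci, cj), 0) / total if total else 0.0

def interpolated_transition(counts, c_prev, c_dis, c_next, lam=0.5, n_unseen=1):
    """Equations (10) and (12): back off a_{c_{i-1} c_i} toward the jump cue
    a_{c_{i-1} c_{i+1}} when c_i is a disfluency fragment class."""
    a_direct = transition_prob(counts, c_prev, c_dis)   # sparse / unreliable
    a_jump = transition_prob(counts, c_prev, c_next)    # jump cue
    n_prev = sum(n for (a, _), n in counts.items() if a == c_prev)
    delta = lam / (n_prev + n_unseen * lam)             # interpolation weight
    return delta * a_direct + (1.0 - delta) * a_jump

# Toy bigram counts over fragment classes; "FP" stands for a filled-pause class.
counts = {("FROM", "CITY"): 40, ("FROM", "FP"): 1, ("FP", "CITY"): 1,
          ("FROM", "TIME"): 10}
print(interpolated_transition(counts, "FROM", "FP", "CITY"))
```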
2.3. Weight Determination Using Expectation Maximization
(EM) Algorithm
In the training process, the weights can be determined using the
Expectation Maximization algorithm. For a fragment sequence
of N observations W: w1,w2,…,wN and its corresponding state
(FC) sequence FCS: c1,c2,…,cN in the training corpus, we
estimate the parameter set based on maximum likelihood to
obtain an optimal solution. The Q function is defined as
Q(SA , SA )

 E log P(W  w1 , w2 ,..., wN , FCS  c1 , c2 ,..., cN | SA ) | W , SA
Ks

n 1 i 1
 P( w
n
 yz )  1  0
N
mz  P( wn  y z ) 
N
n
 P( w
n
 yz ) 
re-estimated probability under parameter set SA . Our concern
is the Q function Q(P(wn | cn )| P(wn | cn )) because we want to
maximize the expectation with respect to the combination
weights. Then it can be derived as
Q(P(wn | cn )| P(wn | cn ))
  P (cn | wn , SA ) log P(cn | SA )]
n
Ks
  P(cn  Si | wn  y z , SA ) log[ P (wn  y z ) P ( wn  y z | cn  Si )] (15)
n 1 i 1
Ks
  P(cn  Si | wn  y z , SA ) log P (wn  y z )
n 1 i 1
Ks
  P (cn  Si | wn  y z , SA ) log P ( wn  y z | cn  Si )
n 1 i 1
where Ks is the state number. The Q function should be
maximized under the condition m1+m2+m3=1 (i.e.,
 P(wn  yz )  1 ). We add the Lagrange multiplier into the Q
z
function and get the equation:
Ks
n
n 1 i 1

 

 P (w
n
z

 y z )  1

 Si | wn  y z ,SA )

z
1  0
(20)
After the arrangement of equation (20), η can be calculated as:
N
Ks
 P(c
n
 Si | wn  y z ,SA )
(21)
n 1 i 1
Ks
 P(c
n
mz  P( wn  y z ) 
 Si | wn  y z ,SA )
n 1 i 1
N Ks
 P(c
(22)
 Si | wn  y z ,SA )
z n 1 i 1
2.4. Speech Act Identification
Speech act identification attempts to choose the most probable speech act $SA^*_{W^k}$ with the highest probability conditioned on the input utterance $U$ along with its corresponding $k$-th fragment sequence $W^k$ according to the SAHMM, and is defined as follows.

$$
SA^*_{W^k} = \arg\max_{\lambda_{SA}} P(\lambda_{SA} \mid W^k, U) = \arg\max_{\lambda_{SA}} P(W^k \mid FCS^k) \, P(W^k \mid \lambda_{SA}) \qquad (23)
$$

Eventually, the score $P(SA^*_{W^k} \mid W^k, U)$ estimated from the SAHMM is combined with the acoustic score $BS(W^k \mid U)$ to choose the most probable fragment sequence $W^*$ and its corresponding speech act $SA^*$, as described below.

$$
(SA^*, W^*) = \arg\max_{(SA_{W^k}, W^k)} \bigl\{(1 - \beta) \log P(SA^*_{W^k} \mid W^k, U) + \beta \, BS(W^k \mid U)\bigr\} \qquad (24)
$$

where $\beta$ is a weight between 0 and 1. In our approach, the semantic information from $W^*$ and $SA^*$ is then used as the input for the dialogue component.
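Equation (24) is a weighted log-linear combination of the SAHMM score and the acoustic score over all candidate fragment sequences. The sketch below assumes both scores are already available per candidate; the candidate structure, score values, and β value are invented for illustration.

```python
import math

def select_speech_act(candidates, beta=0.4):
    """Equation (24): pick (SA*, W*) maximizing
    (1 - beta) * log P(SA | W_k, U) + beta * BS(W_k | U).

    `candidates` maps a fragment-sequence id to a tuple
    (best_speech_act, sahmm_prob, acoustic_score).
    """
    best = None
    for wk, (speech_act, sahmm_prob, acoustic_score) in candidates.items():
        score = (1.0 - beta) * math.log(sahmm_prob) + beta * acoustic_score
        if best is None or score > best[0]:
            best = (score, speech_act, wk)
    return best

# Toy candidates: two recognized fragment sequences with their scores.
candidates = {
    "W1": ("query_flight", 0.82, -3.1),
    "W2": ("query_fare",   0.55, -2.4),
}
print(select_speech_act(candidates))
```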
3. Experiments
Performance evaluation experiments were conducted using a spoken dialogue system for ATIS. The system was trained on two corpora: an initial corpus (4250 utterances) and an adaptation corpus (2200 utterances). The initial corpus was collected from a real ATIS in fluent spoken dialogue and real human interactions. The adaptation corpus was augmented in a wizard environment. It is helpful for improving the system performance because it covers the various habits and behaviors people exhibit when facing a computer service system. Using the collected corpus, 206 fragments were extracted and clustered into 38 fuzzy fragment classes [4]. The 30 SAHMMs were then constructed and trained on the corresponding speech act sub-corpora using the proposed approach. For fluent speech, experiments were carried out using 3038 sentences from 25 speakers (15 male and 10 female). For disfluent speech, the testing database was collected particularly to contain disfluencies in speech. It contains 352 sentences in total.

3.1. Experiment on the performance of SAHMM
In this experiment, the system chose the fragment sequence with the highest score obtained from equation (24) among all the SAHMMs and determined its corresponding speech act. The fragment correct rate (FCR) and speech act correct rate (SACR) defined in [4] were adopted to evaluate the performance of our approach. The SACR and FCR results for our proposed SAHMM, whose mixture weights are determined by the EM algorithm, are shown in Fig. 2. The best SACR achieved 90.3% with an 85.5% FCR when β was chosen as 0.4. The high performance in SACR and FCR is attributable to the EM algorithm, which provides a closed-form solution for the determination of the mixture weights. In addition, the proposed approach, by adopting the semantic information, also provides the ability to avoid confusion and ambiguity among speech acts.

Figure 2: SACR and FCR as a function of the values of β for SAHMM.

3.2. Experiments on Disfluency Modeling
In this experiment, filled pauses [6] such as “ah,” “ung,” “um,” “em,” and “hem” were investigated. The input speech is recognized into possible fragment sequences by a composite acoustic model comprising fluency and disfluency HMMs. In order to evaluate our efforts on the disfluency problem, a baseline system without interpolation was established, with the state-transition probability estimated according to equation (9). The SAHMMs and the baseline system were trained on the same training corpus. The experimental results in Fig. 3 and Fig. 4 show the SACR and FCR for different values of β. In the baseline system, the best results in SACR and FCR were 69% and 63.3%, respectively, when β=0.3. The baseline suffers from performance degradation compared to fluent speech. This is due to the out-of-grammar problem in language modeling for filled pauses. The SAHMM with filled-pause modeling achieves 74.7% SACR and 70.2% FCR when β=0.5. Obviously, the modeling of filled pauses shows its effectiveness on the disfluency problem using the SAHMM.

Figure 3: SACR as a function of the values of β for SAHMM on the testing database with disfluency.

Figure 4: FCR as a function of the values of β for SAHMM on the testing database with disfluency.

4. Conclusion
In this paper, a novel SAHMM is proposed to characterize the semantic information and syntactic structure of a speech act in a sentence and used to choose the speech act of the input sentence. The semantic information, syntactic structure, and fragment features of an input sentence are statistically encapsulated into the proposed SAHMM to characterize the speech act. An interpolation method is adopted to re-estimate the transition probabilities in the SAHMM concerning disfluencies. In the evaluation, for disfluent speech, the experimental results show that the SAHMM gains a significant improvement of 5.7% in SACR and 6.9% in FCR compared to the baseline system without considering filled pauses. For fluent speech, the experimental results show that the performance achieves 90.3% in SACR and 85.5% in FCR using the SAHMM.
5. References
[1] H. Meng, S. Busayapongchai, and V. Zue, “WHEELS: A Conversational System in the Automobile Classification Domain,” in Proc. ICSLP, 1996, pp. 542-545.
[2] A. L. Gorin, G. Riccardi, and J. H. Wright, “How may I help you?,” Speech Communication, Vol. 23, No. 1, pp. 113-127, 1997.
[3] H. U. Block, “The language components in VERBMOBIL,” in Proc. ICASSP, 1997, pp. 79-82.
[4] C. H. Wu, G. L. Yan, and C. L. Lin, “Speech act modeling in a spoken dialog system using a fuzzy fragment-class Markov model,” Speech Communication, Vol. 38, pp. 183-199, 2002.
[5] T. Kawahara, C. H. Lee, and B. H. Juang, “Flexible Speech Understanding Based on Combined Key-Phrase Detection and Verification,” IEEE Transactions on Speech and Audio Processing, Vol. 6, No. 6, pp. 558-568, Nov. 1998.
[6] C. H. Wu and G. L. Yan, “Acoustic feature analysis and discriminative modeling of filled pauses for spontaneous speech recognition,” Journal of VLSI, to be published.
[7] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999.
[8] K. Arai, J. H. Wright, et al., “Grammar fragment acquisition using syntactic and semantic clustering,” Speech Communication, Vol. 27, No. 1, pp. 43-62, Feb. 1999.