Flexible Speech Act Identification of Spontaneous Speech with Disfluency

Chung-Hsien Wu and Gwo-Lang Yan
Department of Computer Science and Information Engineering
National Cheng Kung University, Tainan, Taiwan, ROC
chwu@csie.ncku.edu.tw

ABSTRACT

This paper describes an approach for flexible speech act identification of spontaneous speech with disfluency. In this approach, the semantic information, syntactic structure, and fragment features of an input utterance are statistically encapsulated into a proposed speech act hidden Markov model (SAHMM) to characterize the speech act. To deal with the disfluency problem in a sparse training corpus, an interpolation mechanism is exploited to re-estimate the state transition probabilities in the SAHMM. Finally, the dialogue system accepts the speech act with the best score and returns the corresponding response. Experiments were conducted to evaluate the proposed approach using a spoken dialogue system for the air travel information service. A testing database of 480 dialogues (3,038 sentences) from 25 speakers was collected for evaluation. The experimental results show that the proposed approach achieves a 90.3% speech act correct rate (SACR) and an 85.5% fragment correct rate (FCR) for fluent speech, and for disfluent speech gains a significant improvement of 5.7% in SACR and 6.9% in FCR over a baseline system that does not consider filled pauses.

1. Introduction

In the age of information explosion, access to information is increasingly becoming a pivotal service in our daily lives. Spoken dialogue systems are expected to play a crucial role as a human-computer interface that enables users to consult computers for their needs. In the past decade, several spoken dialogue systems have been demonstrated in real-world applications such as air travel information services (ATIS), weather forecast systems, automatic call managers, and railway ticket reservations [1][2].
However, many problems still make such dialogue systems impractical, and many approaches have been proposed to deal with them. One promising approach is to model the speech act of the input speech, which conveys the speaker's intention, goal, and need in a spoken utterance. Speech acts are also helpful for handling ill-formed or badly recognized utterances [3][4]. Approaches that use a large set of rules to describe the syntactic and semantic possibilities of spoken sentences suffer from a lack of robustness when faced with the wide variety and disfluency of spontaneous speech. The derivation of syntactic and semantic rules is labor-intensive, time-consuming, and tedious; it is difficult to collect appropriate and complete rules that cover all syntactic and semantic diversity, and these methods do not consider the disfluency problem. In the keyword-based approach [5], keywords are collected for each speech act to identify the intention of a sentence. This approach lacks the semantic information and syntactic structure of the sentence, and ambiguous results generally occur when these indispensable relations are lost.

In this paper, a statistical speech act hidden Markov model (SAHMM) is proposed to characterize the semantic information and syntactic structure of a speech act. An interpolation mechanism is exploited to re-estimate the state transition probabilities in the SAHMM in order to deal with the disfluency problem in a sparse training corpus. Performance evaluation was conducted using a spoken dialogue system for an ATIS. The system architecture consists of four main components. In the speech recognition component, a speech recognizer containing fluency and disfluency HMMs recognizes the input speech into possible fragment sequences according to a fragment dictionary and a fragment bigram language model.
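As an illustration of how a fragment bigram language model can rank candidate fragment sequences, the following minimal Python sketch scores one sequence; the fragment names, probabilities, and backoff constant are invented for illustration and are not from the paper's corpus.

```python
import math

# Hypothetical fragment-bigram log-probabilities; in the actual system these
# would be trained from the fragment-segmented ATIS corpus.
bigram_logp = {
    ("<s>", "i-want"): math.log(0.4),
    ("i-want", "a-flight"): math.log(0.5),
    ("a-flight", "to-taipei"): math.log(0.6),
}

def sequence_logp(fragments, backoff=math.log(1e-4)):
    """Score one candidate fragment sequence with the bigram language model,
    backing off to a small constant for unseen bigrams."""
    total, prev = 0.0, "<s>"
    for frag in fragments:
        total += bigram_logp.get((prev, frag), backoff)
        prev = frag
    return total

# The recognizer would rank competing fragment sequences by this score.
print(sequence_logp(["i-want", "a-flight", "to-taipei"]))
```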
In the semantic analysis component, 30 speech acts are defined, and their corresponding SAHMMs are trained and used to select a potential speech act. For the identified speech act, the dialogue component and the text-to-speech component return the corresponding response via speech.

2. Speech Act Hidden Markov Model

In a spoken dialogue system, intelligent behavior depends on explicitly identifying the speech act of a sentence. In this paper, a fragment is defined as a combination of words or characters that generally appear together in a specific domain. The fragment extraction algorithm in [4] is adopted to extract the fragments from the training corpus for the following modeling. A speech act hidden Markov model (SAHMM) is proposed to model not only the syntactic structure but also the semantic information of a spoken utterance.

2.1. Observation Probability in SAHMM

Three measures are defined to encapsulate the semantic information, syntactic structure, and fragment features into the observation probability of each SAHMM, described as follows.

2.1.1. Semantic Observation Probability

We adopt the Kullback-Leibler (KL) distance to measure the semantic similarity between an input fragment and the center of a fragment class (FC) in the SAHMM. To obtain a symmetric measure, the divergence between two probability distributions p and q is defined as

  Div(p, q) = (KL(p || q) + KL(q || p)) / 2    (1)

The semantic observation probability is derived from fragment occurrences with respect to the speech acts. The conditional probability of a fragment w with respect to speech act SA_x is formulated as a pmf by the set of probabilities p_w(SA_x) = P(w | SA_x), x = 1, 2, ..., H, where P(w | SA_x) is the prior probability of fragment w appearing in the x-th speech act, calculated from the training corpus.
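The symmetric divergence of Eq. (1), and the exponential mapping that later converts it into the semantic observation probability, can be sketched in Python as follows; the toy pmfs and the smoothing constant are hypothetical.

```python
import math

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two pmfs over speech acts."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def divergence(p, q):
    """Symmetric divergence of Eq. (1): Div(p, q) = (KL(p||q) + KL(q||p)) / 2."""
    return 0.5 * (kl(p, q) + kl(q, p))

def p_sem(p_fragment, p_class_center):
    """Semantic observation probability: exp(-Div) maps the divergence between
    a fragment pmf and a fragment-class center pmf into (0, 1]."""
    return math.exp(-divergence(p_fragment, p_class_center))

# Toy pmfs over H = 3 speech acts (hypothetical numbers).
p_w = [0.7, 0.2, 0.1]   # p_w(SA_x) for an input fragment w
p_c = [0.6, 0.3, 0.1]   # p_c(SA_x) for a fragment-class center c
print(p_sem(p_w, p_c))  # near 1.0 when the two pmfs are similar
```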
Then the pmf for the FC c_j can be defined as

  p_{c_j}(SA_x) = \sum_{w_k \in c_j} u_{jk} \, p_{w_k}(SA_x)    (2)

where u_{jk} is the fuzzy membership of the fragment w_k in the FC c_j. In our approach, the divergence measure for an input fragment w_i with respect to the FC c_j is defined as Div(p_{w_i}(SA_x), p_{c_j}(SA_x)). For the normalization of observation measures, an exponential function is adopted to convert the divergence measure into the semantic observation probability, defined as

  p_{Sem}(w_i | S_t = c_j) = \exp(-Div(p_{w_i}(SA_x), p_{c_j}(SA_x)))    (3)

where S_t denotes the state at time t in an SAHMM.

2.1.2. Syntactic Observation Probability

The syntactic observation probability shows how closely the fragment belongs to the FC. This probability is estimated according to the cosine measure [8] and defined as

  p_{Syn}(w_i | S_t = c_j) = \cos(BBV(w_i), BBV(\mu_j)) = \frac{BBV(w_i) \cdot BBV(\mu_j)}{\|BBV(w_i)\| \, \|BBV(\mu_j)\|}    (4)

where BBV(w_i) is a vector defined as

  BBV(w_i) = [f(Fg_1, w_i), ..., f(Fg_K, w_i), f(w_i, Fg_1), ..., f(w_i, Fg_K)]    (5)

In the above equation, f(Fg_k, w_i) represents the frequency with which the fragment Fg_k immediately precedes (or, for f(w_i, Fg_k), succeeds) the input fragment w_i in the training corpus, and K is the total number of extracted fragments. The class center vector BBV(\mu_j) of the FC c_j can be estimated as

  BBV(\mu_j) = \frac{\sum_{i=1}^{K} u_{ji} \, BBV(w_i)}{\sum_{i=1}^{K} u_{ji}}    (6)

where u_{ji} is the fuzzy membership of w_i in the FC c_j.

2.1.3. Fragment Class Observation Probability

The fragment class observation probability for fragment w_i at state S_t in the SAHMM is defined as

  p_{FC}(w_i | S_t = c_j) = \frac{N(w_i) \, u_{ji}}{\sum_{w_k \in c_j} N(w_k) \, u_{jk}}    (7)

where N(w_i) represents the frequency of fragment w_i in speech act SA_x and u_{ji} is the membership of w_i in the FC c_j. For each input fragment, the observations are represented as Y = {y_1, y_2, ..., y_M}, in which y_1 is the semantic observation, y_2 the syntactic observation, and y_3 the fragment class observation in an SAHMM. The observation probability of an input fragment w_i is defined as the linear combination of the three observations y_z (z = 1, ..., M and M = 3) at state S_t and estimated as

  b_{c_j}(w_i) = \sum_{z=1}^{M} m_z \, P(w_i = y_z | S_t = c_j)
             = m_1 \, p_{Sem}(w_i | S_t = c_j) + m_2 \, p_{Syn}(w_i | S_t = c_j) + m_3 \, p_{FC}(w_i | S_t = c_j)    (8)

where m_1, m_2, and m_3 are the weights of the semantic, syntactic, and fragment class observation probabilities, respectively, and satisfy the condition m_1 + m_2 + m_3 = 1.

2.2. State Transition Probability

The state transition probability from state S_t = c_i at time t to state S_{t+1} = c_j at time t+1 is estimated as

  a_{c_i c_j} = P(S_{t+1} = c_j | S_t = c_i) = \frac{N_c(S_t = c_i, S_{t+1} = c_j)}{N_c(S_t = c_i)}    (9)

where N_c(•) is the frequency count. In spontaneous speech, disfluencies such as filled pauses, repetitions, lengthening, repairs, false starts, and silence pauses are quite common, and they are not easy to characterize with a language model. Many strategies, such as language model adaptation techniques, backoff models, and the maximum entropy criterion, have been tried to solve this problem. Disfluency modeling remains a tough problem, especially for spoken language, since it is impossible to collect a corpus of spontaneous conversations that covers all occurrences of disfluencies.

Instead of exhaustively collecting diversified conversations, an alternative approach is proposed to estimate the probabilities from the sparse corpus. In this approach, an interpolation method is exploited to re-estimate the state transition probabilities when a disfluent fragment class c_i occurs, as shown in Fig. 1(a). In this figure, the transition probabilities a_{c_{i-1} c_i} and a_{c_i c_{i+1}}, for the fragment classes from c_{i-1} to c_i and from c_i to c_{i+1}, respectively, cannot be properly estimated due to data sparseness. The only available information is the jump cue a_{c_{i-1} c_{i+1}}, which represents the transition probability without the disfluency. This probability is estimated from the training corpus and then used to back off the bigram probability by interpolation, as shown in Fig. 1(b).

Figure 1: Diagram of the interpolation method to estimate the transition probabilities a_{c_{i-1} c_i} and a_{c_i c_{i+1}} using the jump cue.

These two transition probabilities are estimated as follows:

  a_{c_{i-1} c_i} = \delta_i \, a^*_{c_{i-1} c_i} + (1 - \delta_i) \, a_{c_{i-1} c_{i+1}},  if c_i \in Disfluency;  a^*_{c_{i-1} c_i} otherwise    (10)

  a_{c_i c_{i+1}} = \gamma_i \, a^*_{c_i c_{i+1}} + (1 - \gamma_i) \, a_{c_{i-1} c_{i+1}},  if c_i \in Disfluency;  a^*_{c_i c_{i+1}} otherwise    (11)

where a^* denotes the transition probability estimated directly from the sparse corpus. The weights \delta_i and \gamma_i in the interpolation equations are calculated from the observations of a disfluent fragment at state c_i in the training corpus and defined as follows:

  \delta_i = \frac{\lambda_{c_i}}{N_c(c_{i-1}) + \lambda_{c_i} \, N_{unseen}(c_{i-1}, c_i)}    (12)

  \gamma_i = \frac{\lambda_{c_i}}{N_c(c_{i+1}) + \lambda_{c_i} \, N_{unseen}(c_i, c_{i+1})}    (13)

where N_c(•) is the number of occurrences of an FC, \lambda_{c_i} is the expected number of occurrences of the unseen FC c_i, and N_{unseen}(c_{i-1}, c_i) is the number of occurrences of the unseen FC c_i following c_{i-1}. The most widely used value for \lambda_{c_i} is 0.5. This choice can be theoretically justified as the expectation of the same quantity that is maximized by MLE; the original idea comes from the Jeffreys-Perks law [7].
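As a concrete reading of the interpolation scheme above, the sketch below re-estimates the two transitions around a disfluent fragment class from the jump cue; the function and argument names are hypothetical, and the count-based weights follow the Jeffreys-Perks-style estimates with the pseudo-count 0.5.

```python
def interpolate_transition(a_prev_cur, a_cur_next, a_jump,
                           n_prev, n_unseen_in, n_next, n_unseen_out,
                           lam=0.5, disfluent=True):
    """Re-estimate the transitions around a fragment class c_i.

    a_prev_cur, a_cur_next: sparse estimates of a(c_{i-1} -> c_i) and
    a(c_i -> c_{i+1}); a_jump: the jump cue a(c_{i-1} -> c_{i+1}) observed
    in fluent speech; lam: the Jeffreys-Perks pseudo-count (0.5).
    All names are illustrative; the weights follow Eqs. (12)-(13)."""
    if not disfluent:
        # "Otherwise" branch of Eqs. (10)-(11): keep the direct estimates.
        return a_prev_cur, a_cur_next
    delta = lam / (n_prev + lam * n_unseen_in)    # Eq. (12)
    gamma = lam / (n_next + lam * n_unseen_out)   # Eq. (13)
    a_in = delta * a_prev_cur + (1.0 - delta) * a_jump    # Eq. (10)
    a_out = gamma * a_cur_next + (1.0 - gamma) * a_jump   # Eq. (11)
    return a_in, a_out
```

With a small delta or gamma (counts dominated by fluent data), the re-estimated transitions stay close to the jump cue, which is the intended backoff behavior.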
2.3. Weight Determination Using the Expectation Maximization (EM) Algorithm

In the training process, the weights can be determined using the EM algorithm. For a fragment sequence of N observations W: w_1, w_2, ..., w_N and its corresponding state (FC) sequence FCS: c_1, c_2, ..., c_N in the training corpus, we estimate the parameter set based on maximum likelihood to obtain an optimal solution. The Q function is defined as

  Q(\lambda_{SA}, \bar{\lambda}_{SA}) = E[\log P(W = w_1, ..., w_N, FCS = c_1, ..., c_N | \bar{\lambda}_{SA}) | W, \lambda_{SA}]    (14)

where \lambda_{SA} is the parameter set of the SAHMM and \bar{\lambda}_{SA} is the re-estimated parameter set. Our concern is the term Q(P(w_n | c_n) | \bar{P}(w_n | c_n)), because we want to maximize the expectation with respect to the combination weights. It can be derived as

  Q(P(w_n | c_n) | \bar{P}(w_n | c_n))
    = \sum_{n=1}^{N} \sum_{i=1}^{K_s} P(c_n = S_i | w_n = y_z, \lambda_{SA}) \log [\bar{P}(w_n = y_z) P(w_n = y_z | c_n = S_i)]
    = \sum_{n=1}^{N} \sum_{i=1}^{K_s} P(c_n = S_i | w_n = y_z, \lambda_{SA}) \log \bar{P}(w_n = y_z)
      + \sum_{n=1}^{N} \sum_{i=1}^{K_s} P(c_n = S_i | w_n = y_z, \lambda_{SA}) \log P(w_n = y_z | c_n = S_i)    (15)

where K_s is the number of states. The Q function should be maximized under the condition m_1 + m_2 + m_3 = 1, i.e.,

  \sum_z \bar{P}(w_n = y_z) = 1    (16)

Adding a Lagrange multiplier \eta into the Q function and differentiating with respect to \bar{P}(w_n = y_z) gives

  \frac{\partial}{\partial \bar{P}(w_n = y_z)} \left[ Q + \eta \left( \sum_z \bar{P}(w_n = y_z) - 1 \right) \right] = 0    (17)

  \sum_{n=1}^{N} \sum_{i=1}^{K_s} \frac{P(c_n = S_i | w_n = y_z, \lambda_{SA})}{\bar{P}(w_n = y_z)} + \eta = 0    (18)

From equation (18), by transposition we obtain

  \bar{P}(w_n = y_z) = -\frac{1}{\eta} \sum_{n=1}^{N} \sum_{i=1}^{K_s} P(c_n = S_i | w_n = y_z, \lambda_{SA})    (19)

Substituting \bar{P}(w_n = y_z) from equation (19) into the constraint (16) gives

  \sum_z \left[ -\frac{1}{\eta} \sum_{n=1}^{N} \sum_{i=1}^{K_s} P(c_n = S_i | w_n = y_z, \lambda_{SA}) \right] = 1    (20)

so that \eta can be calculated as

  \eta = -\sum_z \sum_{n=1}^{N} \sum_{i=1}^{K_s} P(c_n = S_i | w_n = y_z, \lambda_{SA})    (21)

Replacing \eta in equation (19), we obtain the optimal closed-form solution:

  m_z = \bar{P}(w_n = y_z) = \frac{\sum_{n=1}^{N} \sum_{i=1}^{K_s} P(c_n = S_i | w_n = y_z, \lambda_{SA})}{\sum_z \sum_{n=1}^{N} \sum_{i=1}^{K_s} P(c_n = S_i | w_n = y_z, \lambda_{SA})}    (22)

2.4. Speech Act Identification

Speech act identification attempts to choose the most probable speech act SA*_{W_k}, the one with the highest probability conditioned on the input utterance U and its corresponding k-th fragment sequence W_k, according to the SAHMM:

  SA^*_{W_k} = \arg\max_{SA} P(SA | W_k, U) = \arg\max_{SA} P(W_k | FCS_k) P(FCS_k | SA)    (23)

Eventually, the score P(SA^*_{W_k} | W_k, U) estimated from the SAHMM is combined with the acoustic score BS(W_k | U) to choose the most probable fragment sequence W* and its corresponding speech act SA*:

  (SA^*, W^*) = \arg\max_{(SA_k, W_k)} \{ (1 - \beta) \log P(SA^*_{W_k} | W_k, U) + \beta \, BS(W_k | U) \}    (24)

where \beta is a weight between 0 and 1. In our approach, the semantic information from W* and SA* is then used as the input for the dialogue component.

3. Experiments

Performance evaluation experiments were conducted using a spoken dialogue system for ATIS. The system was trained on two corpora: an initial corpus (4,250 utterances) and an adaptation corpus (2,200 utterances). The initial corpus was collected from a real ATIS in fluent spoken dialogues and real human interactions. The adaptation corpus was augmented in a wizard environment; it is helpful for improving the system performance because it captures the various habits and behaviors of people facing a computer service system.

3.1. Experiment on the Performance of the SAHMM

In this experiment, the system chose the fragment sequence with the highest score obtained from equation (24) among all the SAHMMs and determined its corresponding speech act. The fragment correct rate (FCR) and speech act correct rate (SACR) defined in [4] were adopted to evaluate the performance of our approach. The SACR and FCR of the proposed SAHMM, whose mixture weights are determined by the EM algorithm, are shown in Fig. 2. The best SACR achieved 90.3% with an 85.5% FCR when β was set to 0.4.
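The combined selection rule of Eq. (24), which the system uses to pick a speech act and fragment sequence, can be sketched as follows; the candidate tuples, scores, and names are invented for illustration.

```python
import math

def pick_speech_act(candidates, beta=0.4):
    """Combine the SAHMM score with the acoustic score as in Eq. (24) and
    keep the best (speech act, fragment sequence) pair.

    `candidates` is a list of hypothetical tuples
    (speech_act, fragments, sahmm_prob, acoustic_logscore)."""
    def combined(c):
        _, _, p_sa, bs = c
        # (1 - beta) * log P(SA | W, U) + beta * BS(W | U)
        return (1.0 - beta) * math.log(p_sa) + beta * bs
    sa, frags, _, _ = max(candidates, key=combined)
    return sa, frags

# Two invented hypotheses for one utterance.
hyps = [
    ("query_flight", ["i-want", "a-flight", "to-taipei"], 0.9, -5.0),
    ("greeting", ["hello"], 0.2, -4.0),
]
print(pick_speech_act(hyps, beta=0.4))
```

Varying beta trades off the SAHMM evidence against the acoustic evidence, which is what the experiments below sweep.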
Apparently, the high performance in SACR and FCR results from the EM algorithm providing a closed-form solution for the mixture weights. In addition, by adopting semantic information, the proposed approach also helps avoid confusion and ambiguity among speech acts.

Using the collected corpus, 206 fragments were extracted and clustered into 38 fuzzy fragment classes [4]. The 30 SAHMMs were then constructed and trained on the corresponding speech act sub-corpora using the proposed approach. For fluent speech, experiments were carried out using 3,038 sentences from 25 speakers (15 male and 10 female). For disfluent speech, a testing database of 352 sentences was collected specifically to contain disfluencies.

Figure 2: SACR and FCR as a function of the values of β for the SAHMM.

3.2. Experiments on Disfluency Modeling

In this experiment, filled pauses [6] such as "ah," "ung," "um," "em," and "hem" were investigated. The input speech is recognized into possible fragment sequences by a composite acoustic model consisting of fluency and disfluency HMMs. To evaluate our treatment of the disfluency problem, a baseline system without interpolation was established, with the state transition probabilities estimated according to equation (9). The SAHMMs and the baseline system were trained on the same corpus. The experimental results in Fig. 3 and Fig. 4 show the SACR and FCR for different values of β. In the baseline system, the best results were 69% SACR and 63.3% FCR at β = 0.3. The baseline suffers performance degradation compared to fluent speech because filled pauses are out-of-grammar for the language model. The SAHMM with filled-pause modeling achieved 74.7% SACR and 70.2% FCR at β = 0.5. Evidently, modeling filled pauses with the SAHMM is effective for the disfluency problem.

Figure 3: SACR as a function of the values of β for the SAHMM on the testing database with disfluency.

Figure 4: FCR as a function of the values of β for the SAHMM on the testing database with disfluency.

4. Conclusion

In this paper, a novel SAHMM is proposed to characterize the semantic information and syntactic structure of a speech act in a sentence and to choose the speech act of the input sentence. The semantic information, syntactic structure, and fragment features of an input sentence are statistically encapsulated into the proposed SAHMM to characterize the speech act. An interpolation method is adopted to re-estimate the transition probabilities in the SAHMM in the presence of disfluencies. In the evaluation, for disfluent speech, the experimental results show that the SAHMM gains a significant improvement of 5.7% in SACR and 6.9% in FCR over the baseline system that does not consider filled pauses. For fluent speech, the SAHMM achieves 90.3% SACR and 85.5% FCR.

5. References

[1] H. Meng, S. Busayapongchai, and V. Zue, "WHEELS: A Conversational System in the Automobile Classification Domain," in Proc. ICSLP, 1996, pp. 542-545.
[2] A. L. Gorin, G. Riccardi, and J. H. Wright, "How May I Help You?," Speech Communication, Vol. 23, No. 1, pp. 113-127, 1997.
[3] H. U. Block, "The Language Components in VERBMOBIL," in Proc. ICASSP, 1997, pp. 79-82.
[4] C. H. Wu, G. L. Yan, and C. L. Lin, "Speech Act Modeling in a Spoken Dialog System Using a Fuzzy Fragment-Class Markov Model," Speech Communication, Vol. 38, pp. 183-199, 2002.
[5] T. Kawahara, C. H. Lee, and B. H. Juang, "Flexible Speech Understanding Based on Combined Key-Phrase Detection and Verification," IEEE Transactions on Speech and Audio Processing, Vol. 6, No. 6, pp. 558-568, November 1998.
[6] C. H. Wu and G. L. Yan, "Acoustic Feature Analysis and Discriminative Modeling of Filled Pauses for Spontaneous Speech Recognition," Journal of VLSI Signal Processing, to be published.
[7] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999.
[8] K. Arai, J. H. Wright, et al., "Grammar Fragment Acquisition Using Syntactic and Semantic Clustering," Speech Communication, Vol. 27, No. 1, pp. 43-62, Feb. 1999.