“k hypotheses + other” belief updating in spoken dialog systems
  Dialogs on Dialogs Talk, March 2006
  Dan Bohus
  www.cs.cmu.edu/~dbohus
  dbohus@cs.cmu.edu
  Computer Science Department
  Carnegie Mellon University
  Pittsburgh, PA 15213

problem
  spoken language interfaces lack robustness when faced with understanding errors
    errors stem mostly from speech recognition
    typical word error rates: 20-30%
    significant negative impact on interactions

guarding against understanding errors
  use confidence scores
    machine learning approaches for detecting misunderstandings [Walker, Litman, San-Segundo, Wright, and others]
  engage in confirmation actions
    explicit confirmation
      did you say you wanted to fly to Seoul?
      yes → trust hypothesis
      no → delete hypothesis
      “other” → non-understanding
    implicit confirmation
      traveling to Seoul … what day did you need to travel?
      rely on new values overwriting old values

related work : data : user response analysis : proposed approach: experiments and results : conclusion

today’s talk …
  construct accurate beliefs by integrating information over multiple turns in a conversation
    S: Where would you like to go?
    U: Huntsville [SEOUL / 0.65]
      destination = {seoul/0.65}
    S: traveling to Seoul. What day did you need to travel?
    U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
      destination = {?}

belief updating: problem statement
  destination = {seoul/0.65}
  S: traveling to Seoul. What day did you need to travel?
  [THE TRAVELING TO BERLIN P_M / 0.60]
  destination = {?}
  given
    an initial belief B_initial(C) over concept C
    a system action SA
    a user response R
  construct an updated belief
    B_updated(C) ← f(B_initial(C), SA, R)

outline
  proposed approach
  data
  experiments and results
  effect on dialog performance
  conclusion

proposed approach: data: experiments and results : effect on dialog performance : conclusion

belief updating: problem statement
  destination = {seoul/0.65}
  S: traveling to Seoul. What day did you need to travel?
  [THE TRAVELING TO BERLIN P_M / 0.60]
  destination = {?}
  given
    an initial belief B_initial(C) over concept C
    a system action SA(C)
    a user response R
  construct an updated belief
    B_updated(C) ← f(B_initial(C), SA(C), R)

belief representation
  B_updated(C) ← f(B_initial(C), SA(C), R)
  most accurate representation
    probability distribution over the set of possible values
  however
    system will “hear” only a small number of conflicting values for a concept within a dialog session
    in our data
      max = 3 (conflicting values heard)
      only in 6.9% of cases, more than 1 value heard

belief representation
  B_updated(C) ← f(B_initial(C), SA(C), R)
  compressed belief representation
    k hypotheses + other
    at each turn, the system retains the top m initial hypotheses and adds n new hypotheses from the input (m+n=k)

belief representation
  B_updated(C) ← f(B_initial(C), SA(C), R)
  B(C) modeled as a multinomial variable {h_1, h_2, …, h_k, other}
    B(C) = <c_h1, c_h2, …, c_hk, c_other>
    where c_h1 + c_h2 + … + c_hk + c_other = 1
  belief updating can be cast as a multinomial regression problem:
    B_updated(C) ← B_initial(C) + SA(C) + R

system action
  B_updated(C) ← f(B_initial(C), SA(C), R)
  request
    S: For when do you want the room?
    U: Friday [FRIDAY / 0.65]
  explicit confirmation
    S: Did you say you wanted a room for Friday?
    U: Yes [GUEST / 0.30]
  implicit confirmation
    S: a room for Friday … starting at what time?
    U: starting at ten a.m. [STARTING AT TEN A_M / 0.86]
  unplanned implicit confirmation
    S: I found 5 rooms available Friday from 10 until noon. Would you like a small or a large room?
    U: not Friday, Thursday [FRIDAY THURSDAY / 0.25]
  no action / unexpected update
    S: okay. I will complete the reservation. Please tell me your name or say ‘guest user’ if you are not a registered user.
    U: guest user [THIS TUESDAY / 0.55]

user response
  B_updated(C) ← f(B_initial(C), SA(C), R)
  acoustic / prosodic
    acoustic and language scores, duration, pitch (min, max, mean, range, std. dev., min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause, etc.
  lexical
    number of words, lexical terms highly correlated with corrections or acknowledgements (selected via mutual information computation)
  grammatical
    number of slots (new and repeated), parse fragmentation, parse gaps, etc.
  dialog
    dialog state, turn number, expectation match, new value for concept, timeout, barge-in, concept identity
  priors
    priors for concept values (manually constructed by a domain expert for 3 of 29 concepts: date, start_time, end_time; uniform assumed otherwise)
  confusability
    empirically derived confusability scores

approach
  B_updated(C) ← f(B_initial(C), SA(C), R)
  problem
    <uc_h1, …, uc_hk, uc_oth> ← f(<ic_h1, …, ic_hk, ic_oth>, SA(C), R)
  approach: multinomial generalized linear model
    regression model, multinomial dependent variable
    sample efficient
  stepwise approach
    feature selection
    BIC to control over-fitting
  one model for each system action
    <uc_h1, …, uc_hk, uc_oth> ← f_SA(C)(<ic_h1, …, ic_hk, ic_oth>, R)

data
  collected with RoomLine
    a phone-based mixed-initiative spoken dialog system
    conference room reservation
    explicit and implicit confirmations
  simple heuristic rules for belief updating
    explicit confirm: yes / no
    implicit confirm: new values overwrite old ones

corpus
  user study
    46 participants (naïve users)
    10 scenario-based interactions each
    compensated per task success
  corpus
    449 sessions, 8848 user turns
    orthographically transcribed
    manually annotated
      misunderstandings
      corrections
      correct concept values

baselines
  initial baseline
    accuracy of system beliefs before the update
  heuristic baseline
    accuracy of heuristic update rule used by the system
  oracle baseline
    accuracy if we knew exactly when the user corrects

k=2 hypotheses + other
  informative features
    priors and confusability
    initial confidence score
    concept identity
    barge-in
    expectation match
    repeated grammar slots

a question remains …
  … does this really matter?
  what is the effect on global dialog performance?

let’s run an experiment
  guinea pigs from Speech Lab for exp: $0
  getting change from guys in the lab: $2/$3/$5
  real subjects for the experiment: $25
  picture with advisor of the VERY last exp at CMU: priceless!!!!
  [courtesy of Mohit Kumar]

a new user study …
  implemented models in RavenClaw, performed a new user study
  40 participants, first-time users
  10 scenario-driven interactions each
  non-native speakers of North-American English
    improvements more likely at higher WER
    supported by empirical evidence
  between-subjects; 2 gender-balanced groups
    control: RoomLine using heuristic update rules
    treatment: RoomLine using runtime models

effect on task success
  task success: control 73.6%, treatment 81.3%
  even though
  average user WER: control 21.9%, treatment 24.2%

effect on task success … a closer look
  [figure: probability of task success (y-axis) vs. word error rate (x-axis), one curve each for treatment and control; annotated values: 78%, 78%, 64%; annotated points: 30% WER, 16% WER]
  Task Success ← 2.09 - 0.05∙WER + 0.69∙Condition
  p=0.001

improvements at different WER
  [figure: absolute improvement in task success (y-axis, 0 to 0.18) vs. word-error-rate (x-axis, 0 to 100)]

effect on task duration (for successful tasks)
  ANOVA on task duration for successful tasks
    Duration ← -0.21 + 0.013∙WER - 0.106∙Condition
  significant improvement, equivalent to 7.9% absolute reduction in WER

summary
  data-driven approach for constructing accurate system beliefs
    integrate information across multiple turns
    bridge together detection of
    misunderstandings and corrections
  significantly outperforms current heuristics
  significantly improves effectiveness and efficiency

other advantages
  sample efficient
    performs a local one-turn optimization
    good local performance leads to good global performance
  scalable
    works independently on concepts
    29 concepts, varying cardinalities
  portable
    decoupled from dialog task specification
    doesn’t make strong assumptions about dialog management technology

thank you!
  questions …

user study
  10 scenarios, fixed order
  presented graphically (explained during briefing)
  participants compensated per task success
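
worked example: reading the task success regression
  The formula on the "effect on task success … a closer look" slide (Task Success ← 2.09 - 0.05∙WER + 0.69∙Condition) can be evaluated numerically. The sketch below is only an illustration under assumptions the slides do not state: that the formula is a logistic regression in logit units, that WER is in percentage points, and that Condition is 1 for treatment and 0 for control. The function names are hypothetical.

```python
import math

# Coefficients as printed on the slide.
INTERCEPT = 2.09
WER_COEF = -0.05
CONDITION_COEF = 0.69


def task_success_probability(wer_percent: float, treatment: bool) -> float:
    """Predicted probability of task success.

    Assumed reading (not stated on the slide): logistic link,
    WER in percentage points, Condition = 1 for treatment, 0 for control.
    """
    condition = 1.0 if treatment else 0.0
    logit = INTERCEPT + WER_COEF * wer_percent + CONDITION_COEF * condition
    return 1.0 / (1.0 + math.exp(-logit))


def equivalent_wer_reduction() -> float:
    """WER decrease (percentage points) that moves the logit as much as the treatment."""
    return CONDITION_COEF / -WER_COEF


if __name__ == "__main__":
    for wer, treatment in [(30, False), (30, True), (16, False)]:
        label = "treatment" if treatment else "control"
        p = task_success_probability(wer, treatment)
        print(f"{label:9s} at {wer}% WER: {p:.3f}")
    print(f"equivalent WER reduction: {equivalent_wer_reduction():.1f} points")
```

  Under this reading, control at 30% WER comes out near 64%, treatment at 30% WER near 78%, and control at 16% WER near 78%, which lines up with the 64% / 78% / 78% and 30% WER / 16% WER annotations on the plot. The ratio of the two coefficients (about 14 points) is the WER reduction that would give the control system the same predicted task success as the treatment.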