“k hypotheses + other” belief updating in spoken dialog systems

Dialogs on Dialogs Talk, March 2006

Dan Bohus
www.cs.cmu.edu/~dbohus
dbohus@cs.cmu.edu
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213

problem

spoken language interfaces lack robustness when faced with understanding errors
- errors stem mostly from speech recognition
- typical word error rates: 20-30%
- significant negative impact on interactions

guarding against understanding errors

- use confidence scores
  - machine learning approaches for detecting misunderstandings [Walker, Litman, San-Segundo, Wright, and others]
- engage in confirmation actions
  - explicit confirmation
    did you say you wanted to fly to Seoul?
    - yes → trust hypothesis
    - no → delete hypothesis
    - “other” → non-understanding
  - implicit confirmation
    traveling to Seoul … what day did you need to travel?
    - rely on new values overwriting old values

related work : data : user response analysis : proposed approach : experiments and results : conclusion

today’s talk …

construct accurate beliefs by integrating information over multiple turns in a conversation

S: Where would you like to go?
U: Huntsville [SEOUL / 0.65]
   destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
   destination = {?}

belief updating: problem statement

example
   destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
   [THE TRAVELING TO BERLIN P_M / 0.60]
   destination = {?}

- given
  - an initial belief B_initial(C) over concept C
  - a system action SA(C)
  - a user response R
- construct an updated belief
  - B_updated(C) ← f(B_initial(C), SA(C), R)

outline

- proposed approach
- data
- experiments and results
- effect on dialog performance
- conclusion

proposed approach : data : experiments and results : effect on dialog performance : conclusion

belief representation

B_updated(C) ← f(B_initial(C), SA(C), R)

- most accurate representation
  - probability distribution over the set of possible values
- however
  - system will “hear” only a small number of conflicting values for a concept within a dialog session
  - in our data
    - max = 3 (conflicting values heard)
    - only in 6.9% of cases, more than 1 value heard
- compressed belief representation
  - k hypotheses + other
  - at each turn, the system retains the top m initial hypotheses and adds n new hypotheses from the input (m+n=k)
- B(C) modeled as a multinomial variable
  - {h1, h2, …, hk, other}
  - B(C) = <c_h1, c_h2, …, c_hk, c_other>
  - where c_h1 + c_h2 + … + c_hk + c_other = 1
- belief updating can be cast as a multinomial regression problem:
  B_updated(C) ← B_initial(C) + SA(C) + R

system action

request
   S: For when do you want the room?
   U: Friday [FRIDAY / 0.65]

explicit confirmation
   S: Did you say you wanted a room for Friday?
   U: Yes [GUEST / 0.30]

implicit confirmation
   S: a room for Friday … starting at what time?
   U: starting at ten a.m. [STARTING AT TEN A_M / 0.86]

unplanned implicit confirmation
   S: I found 5 rooms available Friday from 10 until noon. Would you like a small or a large room?
   U: not Friday, Thursday [FRIDAY THURSDAY / 0.25]

no action / unexpected update
   S: okay. I will complete the reservation. Please tell me your name or say ‘guest user’ if you are not a registered user.
   U: guest user [THIS TUESDAY / 0.55]

user response

- acoustic / prosodic: acoustic and language scores, duration, pitch (min, max, mean, range, std. dev., min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause, etc.
- lexical: number of words, lexical terms highly correlated with corrections or acknowledgements (selected via mutual information computation)
- grammatical: number of slots (new and repeated), parse fragmentation, parse gaps, etc.
- dialog: dialog state, turn number, expectation match, new value for concept, timeout, barge-in, concept identity
- priors: priors for concept values (manually constructed by a domain expert for 3 of 29 concepts: date, start_time, end_time; uniform assumed otherwise)
- confusability: empirically derived confusability scores

approach

- problem
  - <uc_h1, …, uc_hk, uc_oth> ← f(<ic_h1, …, ic_hk, ic_oth>, SA(C), R)
- approach: multinomial generalized linear model
  - regression model, multinomial response variable
  - sample efficient
- stepwise approach
  - feature selection
  - BIC to control over-fitting
- one model for each system action
  - <uc_h1, …, uc_hk, uc_oth> ← f_SA(C)(<ic_h1, …, ic_hk, ic_oth>, R)

data

- collected with RoomLine
  - a phone-based mixed-initiative spoken dialog system
  - conference room reservation
- explicit and implicit confirmations
- simple heuristic rules for belief updating
  - explicit confirm: yes / no
  - implicit confirm: new values overwrite old ones

corpus

- user study
  - 46 participants (naïve users)
  - 10 scenario-based interactions each
  - compensated per task success
- corpus
  - 449 sessions, 8848 user turns
  - orthographically transcribed
  - manually annotated
    - misunderstandings
    - corrections
    - correct concept values

baselines

- initial baseline
  - accuracy of system beliefs before the update
- heuristic baseline
  - accuracy of the heuristic update rule used by the system
- oracle baseline
  - accuracy if we knew exactly when the user corrects

k=2 hypotheses + other

informative features
- priors and confusability
- initial confidence score
- concept identity
- barge-in
- expectation match
- repeated grammar slots

a question remains …

… does this really matter? what is the effect on global dialog performance?
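
The "k hypotheses + other" representation described on the belief representation slides can be sketched as follows. This is only an illustration under stated assumptions: the function names select_hypotheses and compress are hypothetical, and assigning all unclaimed probability mass to "other" is an assumption made for the example. The talk's actual update is the learned multinomial model, which is not reproduced here.

```python
from typing import Dict, List


def select_hypotheses(initial: Dict[str, float], new_values: List[str],
                      m: int, n: int) -> List[str]:
    """Pick k = m + n hypothesis slots: top m initial values plus n new ones."""
    top_initial = sorted(initial, key=initial.get, reverse=True)[:m]
    fresh = [v for v in new_values if v not in top_initial][:n]
    return top_initial + fresh


def compress(belief: Dict[str, float], slots: List[str]) -> Dict[str, float]:
    """Keep confidences for the chosen slots; fold the rest into 'other'.

    Assumption for illustration: whatever mass is not held by the kept
    hypotheses goes to 'other', so the entries sum to 1.
    """
    kept = {h: belief.get(h, 0.0) for h in slots}
    kept["other"] = max(0.0, 1.0 - sum(kept.values()))
    return kept


# Example from the talk: destination = {seoul/0.65}, then "berlin" is heard.
initial = {"seoul": 0.65}
slots = select_hypotheses(initial, ["berlin"], m=1, n=1)
belief = compress(initial, slots)
print(slots, belief)
```

With m = n = 1 this gives the k = 2 configuration: one slot for the best existing value, one slot for the newly heard value, and a final "other" entry. A learned model would then supply the updated confidences for these three entries.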
let’s run an experiment

- guinea pigs from Speech Lab for exp: $0
- getting change from guys in the lab: $2/$3/$5
- real subjects for the experiment: $25
- picture with advisor of the VERY last exp at CMU: priceless!!!!
[courtesy of Mohit Kumar]

a new user study …

- implemented models in RavenClaw, performed a new user study
- 40 participants, first-time users
- 10 scenario-driven interactions each
- non-native speakers of North-American English
  - improvements more likely at higher WER
  - supported by empirical evidence
- between-subjects; 2 gender-balanced groups
  - control: RoomLine using heuristic update rules
  - treatment: RoomLine using runtime models

effect on task success

- task success: control 73.6%, treatment 81.3%
- even though average user WER: control 21.9%, treatment 24.2%

effect on task success … a closer look

[figure: probability of task success (0 to 100%) vs. word error rate (0 to 100%), one curve each for treatment and control; annotated values: 78%, 78%, 64%; annotated points: 30% WER, 16% WER]

Task Success ← 2.09 - 0.05∙WER + 0.69∙Condition
p=0.001

improvements at different WER

[figure: absolute improvement in task success (0 to 0.18) vs. word-error-rate (0 to 100)]

effect on task duration (for successful tasks)

- ANOVA on task duration for successful tasks
  Duration ← -0.21 + 0.013∙WER - 0.106∙Condition
- significant improvement, equivalent to 7.9% absolute reduction in WER

summary

- data-driven approach for constructing accurate system beliefs
  - integrate information across multiple turns
  - bridge together detection of misunderstandings and corrections
- significantly outperforms current heuristics
- significantly improves effectiveness and efficiency

other advantages

- sample efficient
  - performs a local one-turn optimization
  - good local performance leads to good global performance
- scalable
  - works independently on concepts
  - 29 concepts, varying cardinalities
- portable
  - decoupled from dialog task specification
  - doesn’t make strong assumptions about dialog management technology

thank you!

questions …

user study

- 10 scenarios, fixed order
- presented graphically (explained during briefing)
- participants compensated per task success
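
The task success model reported earlier in the talk (Task Success ← 2.09 - 0.05∙WER + 0.69∙Condition) can be evaluated numerically. The sketch below assumes a logistic link, WER expressed in percentage points, and Condition coded as 1 for treatment and 0 for control. None of these codings is stated on the slides; they are assumptions chosen because they reproduce the values annotated on the plot (64% and 78%). The function name p_success is hypothetical.

```python
import math


def p_success(wer: float, condition: int) -> float:
    """Probability of task success under the assumed logistic model.

    wer: word error rate in percentage points (assumption)
    condition: 1 = treatment, 0 = control (assumption)
    """
    logit = 2.09 - 0.05 * wer + 0.69 * condition
    return 1.0 / (1.0 + math.exp(-logit))


control_30 = p_success(30, 0)    # control at 30% WER
treatment_30 = p_success(30, 1)  # treatment at 30% WER
control_16 = p_success(16, 0)    # control at 16% WER

# WER reduction that would give the control the same logit shift as the treatment.
wer_equivalent = 0.69 / 0.05

print(f"{control_30:.3f} {treatment_30:.3f} {control_16:.3f} {wer_equivalent:.1f}")
```

Under these assumptions, the treatment at 30% WER and the control at 16% WER both come out near 78%, while the control at 30% WER is near 64%. In other words, the coefficient ratio suggests the treatment is worth roughly a 14-point absolute reduction in WER for task success.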
