Belief Updating in Spoken Dialog Systems

Dan Bohus
www.cs.cmu.edu/~dbohus
dbohus@cs.cmu.edu
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA, 15217

problem
- spoken language interfaces lack robustness when faced with understanding errors
- stems mostly from speech recognition
- spans most domains and interaction types

more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry I’m not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I’m still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at …

non- and misunderstandings
- NON-understanding
- MIS-understanding
(both illustrated on the dialog above)

approaches for increasing robustness
- fix recognition
- gracefully handle errors through interaction
  - detect the problems
  - develop a set of recovery strategies
  - know how to choose between them (policy)

six not-so-easy pieces …
- two problem types: misunderstandings, non-understandings
- three pieces for each: detection, strategies, policy

belief updating (misunderstandings: detection)
- construct more accurate beliefs by integrating information over multiple turns

S: Where would you like to go?
U: Huntsville [SEOUL / 0.65]
    destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
    destination = {?}

belief updating: problem statement
given:
- an initial belief P_initial(C) over concept C
    destination = {seoul/0.65}
- a system action SA
    S: traveling to Seoul. What day did you need to travel?
- a user response R
    [THE TRAVELING TO BERLIN P_M / 0.60]
construct an updated belief:
    destination = {?}
    P_updated(C) ← f(P_initial(C), SA, R)

outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- some caveats and future work

confidence annotation + heuristic updates
- confidence annotation
  - traditionally focused on word-level errors [Chase, Cox, Bansal, Ravishankar]
  - more recently: semantic confidence annotation [Walker, San-Segundo, Bohus]
  - machine learning approach
  - results fairly good, but not perfect
- heuristic updates
  - explicit confirmation: no → don’t trust; yes → trust
  - implicit confirmation: no → don’t trust; otherwise → trust
  - suboptimal for several reasons

correction detection
- detect if the user is trying to correct the system [Litman, Swerts, Hirschberg, Krahmer, Levow]
- machine learning approach
  - features from different knowledge sources in the system
  - results fairly good, but not perfect

integration
- confidence annotation and correction detection are useful tools
- but separately, neither solves the problem
- bridge them together in a unified approach to accurately track beliefs

belief updating: general form
given:
- an initial belief P_initial(C) over concept C
- a system action SA
- a user response R
construct an updated belief:
    P_updated(C) ← f(P_initial(C), SA, R)

restricted version: 2 simplifications
1. compact belief
   - system unlikely to “hear” more than 3 or 4 values
     - single vs. multiple recognition results
     - in our data: max = 3 values, only 6.9% have >1 value
   - confidence score of top hypothesis
2. updates after confirmation actions
reduced problem:
    ConfTop_updated(C) ← f(ConfTop_initial(C), SA, R)

data
- collected with RoomLine
  - a phone-based mixed-initiative spoken dialog system
  - conference room reservation
  - search and negotiation
- explicit and implicit confirmations
  - confidence threshold model (+ some exploration)
- unplanned implicit confirmations
  - I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?

corpus
- user study
  - 46 participants (naïve users)
  - 10 scenario-based interactions each
  - compensated per task success
- corpus
  - 449 sessions, 8848 user turns
  - orthographically transcribed
  - rich annotation: correct concepts, corrections, etc.
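
The reduced problem ConfTop_updated(C) ← f(ConfTop_initial(C), SA, R) can be illustrated with the heuristic update rules listed under related work as one possible f. The sketch below is not the RoomLine implementation. The names `SystemAction`, `ResponseType` and `heuristic_update` are hypothetical, the values 1.0 and 0.0 are assumed stand-ins for "trust" and "don't trust", and leaving the confidence unchanged for an OTHER response to an explicit confirmation is also an assumption, since the slides do not cover that case.

```python
from enum import Enum


class SystemAction(Enum):
    EXPLICIT_CONFIRM = "explicit_confirm"
    IMPLICIT_CONFIRM = "implicit_confirm"


class ResponseType(Enum):
    YES = "yes"
    NO = "no"
    OTHER = "other"


# Assumed numeric stand-ins: the slides only say "trust" / "don't trust".
TRUST = 1.0
DONT_TRUST = 0.0


def heuristic_update(conf_top_initial: float,
                     action: SystemAction,
                     response: ResponseType) -> float:
    """One possible f in ConfTop_updated(C) <- f(ConfTop_initial(C), SA, R)."""
    # Both confirmation types: a "no" means the value is not trusted.
    if response is ResponseType.NO:
        return DONT_TRUST
    if action is SystemAction.EXPLICIT_CONFIRM:
        if response is ResponseType.YES:
            return TRUST
        # Assumption: explicit confirmation + OTHER leaves the belief unchanged.
        return conf_top_initial
    # Implicit confirmation: anything other than "no" means trust.
    return TRUST


if __name__ == "__main__":
    # Implicit confirmation of destination = {seoul/0.65}, user rejects it.
    print(heuristic_update(0.65, SystemAction.IMPLICIT_CONFIRM, ResponseType.NO))
```

A rule like this ignores both the initial confidence and everything in the response beyond its type, which is one way to read "suboptimal for several reasons".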

user response types
- following Krahmer and Swerts
  - study on Dutch train-table information system
- 3 user response types
  - YES: yes, right, that’s right, correct, etc.
  - NO: no, wrong, etc.
  - OTHER
- cross-tabulated against correctness of confirmations

user responses to explicit confirmations

from transcripts [numbers in brackets from Krahmer & Swerts]
              YES          NO           Other
CORRECT       94% [93%]    0% [0%]      5% [7%]
INCORRECT     1% [6%]      72% [57%]    27% [37%]

from decoded
              YES     NO      Other
CORRECT       87%     1%      12%
INCORRECT     1%      61%     38%

(slide annotation: ~10%)

other responses to explicit confirmations
- ~70% users repeat the correct value
- ~15% users don’t address the question
  - attempt to shift conversation focus

              User does not correct    User corrects
CORRECT       1159                     0
INCORRECT     29 [10% of incor]        250 [90% of incor]

user responses to implicit confirmations

transcripts [numbers in brackets from Krahmer & Swerts]
              YES         NO           Other
CORRECT       30% [0%]    7% [0%]      63% [100%]
INCORRECT     6% [0%]     33% [15%]    61% [85%]

decoded
              YES     NO      Other
CORRECT       28%     5%      67%
INCORRECT     7%      27%     66%

ignoring errors in implicit confirmations

              User does not correct    User corrects
CORRECT       552                      2
INCORRECT     118 [51% of incor]       111 [49% of incor]

- users correct later (40% of 118)
- users interact strategically
  - correct only if essential

              ~correct later    correct later
~critical     55                2
critical      14                47

machine learning approach
- need good probability outputs
  - low cross-entropy between model predictions and reality
  - cross-entropy = negative average log posterior
- logistic regression
  - sample efficient
  - stepwise approach → feature selection
- logistic model tree for each action
  - root splits on response-type

features. target.
- initial situation
  - initial confidence score
  - concept identity, dialog state, turn number
- system action
  - other actions performed in parallel
- features of the user response
  - acoustic / prosodic features
  - lexical features
  - grammatical features
  - dialog-level features
- target: was the value correct?
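
The definition "cross-entropy = negative average log posterior" can be made concrete with a small sketch. Several things here are assumptions rather than details from the talk: the model is taken to output the probability that the top value is correct, the "posterior" is read as the probability the model assigns to the true outcome, the natural logarithm is used, and hard error is taken to be the misclassification rate at a 0.5 threshold. The function names `soft_error` and `hard_error` are illustrative.

```python
import math
from typing import Sequence


def soft_error(p_correct: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Cross-entropy: negative average log of the probability given to the true outcome."""
    if len(p_correct) == 0 or len(p_correct) != len(is_correct):
        raise ValueError("need two non-empty sequences of equal length")
    eps = 1e-12  # guard against log(0) for overconfident predictions
    total = 0.0
    for p, y in zip(p_correct, is_correct):
        posterior = p if y else 1.0 - p
        total += math.log(max(posterior, eps))
    return -total / len(p_correct)


def hard_error(p_correct: Sequence[float], is_correct: Sequence[bool],
               threshold: float = 0.5) -> float:
    """Fraction of cases where the thresholded prediction disagrees with the truth."""
    if len(p_correct) == 0 or len(p_correct) != len(is_correct):
        raise ValueError("need two non-empty sequences of equal length")
    wrong = sum((p >= threshold) != bool(y) for p, y in zip(p_correct, is_correct))
    return wrong / len(p_correct)


if __name__ == "__main__":
    probs = [0.9, 0.2]       # predicted P(value is correct)
    truth = [True, False]    # target: was the value correct?
    print(soft_error(probs, truth), hard_error(probs, truth))
```

Under these assumptions, a model can have zero hard error and still a nonzero soft error, which is why well-calibrated probability outputs matter for belief updating.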

baselines
- initial baseline
  - accuracy of system beliefs before the update
- heuristic baseline
  - accuracy of heuristic rule currently used in the system
- oracle baseline
  - accuracy if we knew exactly when the user is correcting the system

results: explicit confirmation
[figure: bar charts of hard error (%) and soft error; series: Initial, Heuristic, LMT, Oracle]
- hard error (%) bar values: 31.15, 8.41, 3.57, 2.71
- soft error bar values: 0.51, 0.19, 0.12

results: implicit confirmation
[figure: bar charts of hard error (%) and soft error; series: Initial, Heuristic, LMT, Oracle]
- hard error (%) bar values: 30.40, 23.37, 16.15, 15.33
- soft error bar values: 0.67, 0.61, 0.43

results: unplanned implicit confirmation
[figure: bar charts of hard error (%) and soft error; series: Initial, Heuristic, LMT, Oracle]
- hard error (%) bar values: 15.40, 14.36, 12.64, 10.37
- soft error bar values: 0.46, 0.43, 0.34

informative features
- initial confidence score
- prosody features
- barge-in
- expectation match
- repeated grammar slots
- concept id

eliminate simplification 1
- current restricted version
  - belief = confidence score of top hypothesis
  - only 6.9% of cases had more than 1 hypothesis
- extend to
  - N hypotheses + 1 (other), where N is a small integer (2 or 3)
- approach: multinomial generalized linear model
  - use information from multiple recognition hypotheses

eliminate simplification 2
- current restricted version
  - only updates following system confirmation actions
- users might correct the system at any point
- extend to updates after all system actions

shameless self promotion
(grid: misunderstandings / non-understandings × detection / strategies / policy)
non-understandings
- detection [Interspeech-05]
  - rejection threshold adaptation
  - non-understanding impact on performance
- strategies [SIGdial-05]
  - comparative analysis of 10 recovery strategies
- policy [SIGdial-05]
  - wizard experiment
  - towards learning non-understanding recovery policies

shameless CMU promotion
- Ananlada (Moss) Chotimongkol: automatic concept and task structure acquisition
- Antoine Raux: turn-taking, conversation micro-management
- Jahanzeb Sherwani: multimodal personal information management
- Satanjeev Banerjee: meeting understanding
- Stefanie Tomko: universal speech interface
- Thomas Harris: multi-participant dialog
- DoD / Young Researchers’ Roundtable

thank you!

a more subtle caveat
- distribution of training data
  - confidence annotator + heuristic update rules
- distribution of run-time data
  - confidence annotator + learned model
- always a problem when interacting with the world
- hopefully, distribution shift will not cause large degradation in performance
  - remains to validate empirically
  - maybe a bootstrap approach?
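
For the extension proposed under "eliminate simplification 1" (N hypotheses + 1 "other", via a multinomial generalized linear model), one standard form is a softmax over per-outcome linear scores. The sketch below only illustrates that form. The feature vector, the weight values and the function name `multinomial_belief` are hypothetical and do not come from the talk.

```python
import math
from typing import List, Sequence


def multinomial_belief(features: Sequence[float],
                       weights: Sequence[Sequence[float]]) -> List[float]:
    """Softmax over linear scores, with one weight row per outcome.

    With N recognition hypotheses there are N + 1 rows: one per
    hypothesis plus one for "other" (none of the hypotheses is correct).
    """
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


if __name__ == "__main__":
    # Hypothetical features: a bias term and an initial confidence score.
    features = [1.0, 0.65]
    # Hypothetical weights for 2 hypotheses + "other".
    weights = [[0.0, 2.0], [0.0, 0.5], [0.0, 0.0]]
    belief = multinomial_belief(features, weights)
    print(belief, sum(belief))
```

The output is a probability distribution over the N + 1 outcomes, so it replaces the single top-hypothesis confidence score used in the restricted version.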