Belief Updating in Spoken Dialog Systems
Dialogs on Dialogs Reading Group, June 2005
Dan Bohus, Carnegie Mellon University

Misunderstandings
- Misunderstandings are an important problem in spoken dialog systems: the system obtains an incorrect semantic interpretation of the user's utterance.
- They affect 15-40% of turns and have a significant negative impact on the overall success rate.

Confidence annotation
- Use confidence scores to guard against potential misunderstandings.
- Traditionally, scores come from the speech recognition engine [Chase, Bansal, Cox, Kemp, etc.]; these focus on word error rate and are not tuned to the task at hand.
- More recently, system-specific semantic confidence scores [Carpenter, Walker, San-Segundo, etc.] integrate knowledge from different levels in the system: speech recognition, language understanding, dialog management.

Correction Detection
- Detect whether or not the user is trying to correct the system.
- Related problem: aware-site detection.
- Similar machine-learning approaches, using multiple sources of knowledge [Litman, Swerts, Krahmer, etc.].

Proposed: Belief Updating
- Integrate confidence annotation and correction detection in a unified framework for continuously tracking beliefs.
- A "belief updating" problem:

    S: Where are you flying from?
    U: [CityName={Aspen/0.6; Austin/0.2}]            ← initial belief
    S: Did you say you wanted to fly out of Aspen?   ← system action
    U: [No/0.6] [CityName={Boston/0.8}]              ← user response
       [CityName={Aspen/?; Austin/?; Boston/?}]      ← updated belief

Formally…
- Given an initial belief Pinitial(C) over concept C, a system action SA, and a user response R,
- construct an updated belief Pupdated(C) that is as "accurate" as possible:

    Pupdated(C) ← f(Pinitial(C), SA, R)

(A code sketch of this interface appears after the outline below.)

Examples
[figure]

Examples - continued
[figure]

Outline
- Introduction
- Data
- A simplified version of the problem; approach
- User behaviors
- Learning: preliminary results
- More on evaluation
- Where to from here?

Data
- Collected in an experiment with RoomLine, a phone-based, mixed-initiative system for making conference room reservations, equipped with explicit and implicit confirmations.
- Corpus statistics:
  - 46 participants
  - 449 sessions, 8278 turns
  - 13.5% misunderstandings [9.8% / 22.5%]
  - 25.6% WER [19.6% / 39.5%]
  - 11362 concept updates

System actions and concept updates
- Explicit and implicit confirmations:
  - Start time: explicit confirmation / grounding [EC]
  - Date: implicit confirmation / grounding [IC]

System actions and concept updates (continued)
- Implicit confirmations:
  - Date: implicit confirmation / grounding [IC]
  - Start time: implicit confirmation / grounding [IC]
  - End time: implicit confirmation / task [ICT]

# of Conflicting Hypotheses
- Fewer than 3% of updates involve more than one hypothesis, and the system does not currently make use of multiple hypotheses.
- [Future work: regenerate multiple hypotheses in batch.]

Outline (recap)
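The talk does not give an implementation, but as a concrete anchor for the update function above, here is a minimal sketch of the interface, with a crude hand-written rule standing in for the learned f. All names are hypothetical, and this is not the heuristic baseline evaluated later.

```python
# Minimal sketch of the belief-updating interface
#   Pupdated(C) ← f(Pinitial(C), SA, R)
# All names are hypothetical; the hand-written rule below is only a
# placeholder for the learned f discussed in the talk.

def update_belief(initial: dict[str, float], system_action: str,
                  response: dict) -> dict[str, float]:
    """initial: hypothesis -> confidence, e.g. {"Aspen": 0.6, "Austin": 0.2}
    system_action: e.g. "EC" = explicit confirmation of the top hypothesis
    response: parsed user turn, e.g. {"no": 0.6, "new_value": ("Boston", 0.8)}
    """
    updated = dict(initial)
    if system_action == "EC" and updated:
        top = max(updated, key=updated.get)
        if "yes" in response:            # confirmation: boost the top hypothesis
            updated[top] = min(1.0, updated[top] + 0.3)
        elif "no" in response:           # disconfirmation: penalize it
            updated[top] *= 0.2
    if "new_value" in response:          # the user offered a new value
        value, conf = response["new_value"]
        updated[value] = max(updated.get(value, 0.0), conf)
    return updated

# The CityName example from the slide above:
belief = update_belief({"Aspen": 0.6, "Austin": 0.2}, "EC",
                       {"no": 0.6, "new_value": ("Boston", 0.8)})
# -> {"Aspen": 0.12, "Austin": 0.2, "Boston": 0.8}
```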
A Simplified Version
- Given that only 3% of updates have more than one hypothesis, update the belief in the top hypothesis after implicit and explicit confirmations.
- Instead of: Pupdated(C) ← f(Pinitial(C), SA, R)
- Do: ConfTopupdated(C) ← f(ConfTopinitial(C), SA, R), for SA in {EC, IC, ICT}

Approach
- Use machine learning.
- Dataset: concept updates for ECs, ICs, and ICTs.
- Features: the initial confidence score ConfTopinitial(C), the system action (SA), and the user response (R).
- Target: the updated confidence score ConfTopupdated(C). The data is labeled, so we have a binary target.
- (A sketch of one such training example appears at the end of this section.)

User behaviors
- Study of user behaviors in response to ICs and ECs:
  - Can inform feature selection and feature development.
  - Provides insight into where the difficulties are.
  - Can inform potential strategy refinements.

User responses to ECs

  Transcripts     YES                   NO                    Other
  CORRECT         1097 [94.2% of cor]   8                     62
  INCORRECT       3                     202 [69.9% of inc]    84

  Decoded         YES                   NO                    Other
  CORRECT         1016 [87.3% of cor]   11                    137
  INCORRECT       2                     171 [59.2% of inc]    116

- ~10% of responses fall into the "Other" category.

"Other" Responses to EC
- "Eyeball" estimates (out of 146 responses):
  - ~70% simply repeat the correct concept value; that should come in as a handy feature.
  - ~10% change the conversation focus.
  - ~10% are turn-overtaking issues (maybe inhibit barge-in until Antoine finishes his thesis).
  - ~10% other.

User responses to ICs

  Transcripts     YES                   NO                    Other
  CORRECT         166 [31.3% of cor]    38                    326
  INCORRECT       15                    75 [31.5% of inc]     148

  Decoded         YES                   NO                    Other
  CORRECT         151 [28.5% of cor]    20                    369
  INCORRECT       16                    62 [26.1% of inc]     160

Users Don't Always Correct ICs
- Users actually corrected in only 45% of the cases:

                User does not correct   User corrects
  CORRECT       557                     1
  INCORRECT     126 [55% of incor]      104 [45% of incor]

- Even if we knew exactly when users correct, we would still have (126+1)/788 = 16% error.
- So what do users do when they don't correct?
  - They may correct only partially.
  - They may completely ignore the error (if it is non-essential).
  - They may readjust to accommodate the task.

More questions…
- Understand this "ignore" phenomenon better:
  - Impact on task success? The IC correction rate is 49% in successful tasks vs. 41% in unsuccessful ones.
  - Fixed vs. more "flexible" scenarios.
  - Impact of prompt length on P(user will correct)?
  - "Essential" vs. "non-essential" concepts?

Outline (recap)
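Before turning to the choice of learner, here is a sketch of what one training example for the simplified problem might look like; the field names are illustrative, not the actual feature encoding used in the experiments.

```python
# Hypothetical shape of one training example for
#   ConfTopupdated(C) ← f(ConfTopinitial(C), SA, R)
# Field names are illustrative only.

example = {
    "conf_initial": 0.62,         # ConfTopinitial(C)
    "system_action": "IC",        # one of {"EC", "IC", "ICT"}
    # a few response features of the kinds studied above
    "has_yes_marker": False,
    "has_no_marker": True,
    "repeats_concept_value": True,
    "num_words": 7,
    "turn_length_secs": 2.4,
    "label": 0,                   # binary target: was the top hypothesis correct?
}
```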
Which ML technique?
- We need good probability outputs; margins produced by discriminant classifiers are inadequate.
- If we want calibrated probability scores (i.e., conf = 0.85 means that in 85% of cases with conf = 0.85 the concept is right), we should evaluate on a soft metric. [I'll contradict myself later!]
- Choice: step-wise logistic regression
  - Sample-efficient
  - Performs feature selection
  - Good soft-metric performance: it optimizes the average log-likelihood of the data

Data. Features
- For each system action in {EC, IC, ICT}:
  - Initial confidence score.
  - Other indicators of the current state: how well the dialog has been going, which concept we are talking about, how far back the concept was acquired.
  - Features on the user response:
    - Confirmation and disconfirmation markers
    - Acoustic/prosodic: f0 (min, max, range, max slope, etc.) plus normalized versions
    - Number of words; turn length (secs)
    - Concept information: expected / repeated / new concepts and grammar slots
    - Confidence
    - Barge-in and timeout information
    - Lexical features (preselected by mutual information with the target or with confirm/disconfirm markers)

Results
- Actually using a 1-level logistic model tree:
  - Split on answer_type = {yes, no, other, no_parse}.
  - Perform step-wise logistic regression on the 4 leaves (P-entry = 0.05, P-reject = 0.30, BIC stopping criterion).
- Also tried a full-blown model tree; the results are similar, maybe marginally worse.

Explicit Confirmation

                  HARD     SOFT
  Initial         31.1%    -0.5076
  Heuristic        8.6%    -0.1943
  LMT(CV)          3.7%    -0.1160
  LMT(training)    2.9%    -0.0851

[Bar charts: hard error rate (%) and average log-likelihood for Initial, Heuristic, LMT(CV), and LMT(training).]

Implicit Confirmation

                   HARD     SOFT
  Initial          31.4%    -0.6217
  Heuristic        24.0%    -0.6736
  LMT(CV)          19.6%    -0.4521
  LMT(training)    18.8%    -0.4124
  Oracle baseline  16.1%    -

[Bar charts: hard error rate (%) and average log-likelihood for the same conditions.]

Outline (recap)

What can Logistic Regression / AVG-LL do for you?
- Data D = {d1, d2, d3, d4, …}, with each di = 1/0.
- Likelihood: P(D) = ∏i P(di | xi).
- Express the density as a logistic: P(d=1 | x) = 1 / (1 + exp(-w·x)). (You can actually derive this form if you start with a Gaussian class-conditional density P(x | d).)
- Find parameters w that maximize P(D):
    argmax P(D) = argmax ∏i P(di | xi) = argmin ∑i -log P(di | xi)
- Hence we maximize the average log-likelihood of the data. But what does that mean?
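A minimal NumPy sketch of the quantities on this slide, the logistic model and the average log-likelihood it optimizes; illustrative only, not the stepwise regression package actually used.

```python
# Sketch of P(d=1|x) = 1 / (1 + exp(-w·x)) and the average log-likelihood.
# Plain NumPy, for illustration only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_log_likelihood(w, X, d):
    """Mean of log P(di | xi) under the logistic model.
    X: (n, k) feature matrix; d: (n,) array of 0/1 labels; w: (k,) weights."""
    p = sigmoid(X @ w)                            # P(di = 1 | xi)
    ll = d * np.log(p) + (1 - d) * np.log(1 - p)  # log-likelihood of each label
    return ll.mean()                              # maximized over w in training
```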
Loss function in Logistic Regression
- Under the log-likelihood loss function, if d=1, then P(d=1)=0.01 is ten times worse than P(d=1)=0.1, but P(d=1)=0.7 costs about the same as P(d=1)=0.8. Things are mirrored for d=0.
- This does not match the "threshold" model commonly used to engage actions.

A New Loss Function: T2
- A loss function that better matches our domain: T2 (or even T3).
- [Figure: step loss functions for d=1 and d=0, with costs C1-C4 and thresholds t1 and t2 on the confidence axis from 0 to 1.]
- Optimize: argmax ∑i T2(P(di=c | xi))
- But T2 is not differentiable and not convex.

Smoothed version
- A smoothed version of the loss:
    SmoothT2(p) = σ1(p) + σ2(p), where σi(p) = 1 / (1 + exp(ki(p - θi))), with the k's and θ's chosen accordingly.
- [Figure: smooth approximation of the d=1 step loss, with costs C1 and C2 and thresholds t1 and t2.]
- Optimize: argmax ∑i SmoothT2(P(di=c | xi))
- Differentiable! But still not convex: multiple local maxima. (A code sketch of SmoothT2 appears at the end of these notes.)

Costs & Thresholds
- Costs: where from? "Expert" knowledge, or derived from data (which might be tricky).
- Thresholds: where from? Fixed, or actually optimized at the same time:
    SmoothT2 = SmoothT2(w, th1, th2) is differentiable in th1 and th2, so we can do gradient search over them as well.
- This calibrates, in one step, both the belief updating and the thresholds so as to minimize loss.

Questions: What Next?
- ICT: can we do anything there?
- Push for better performance; push for better understanding.
- Optimize for the new loss function.
- Further in the future: look at the full belief updating problem. (It looks really tough…)
- Add more features?
- Debug the models more; eliminate singularities.
- Why doesn't the model tree do better?
- What are the other interesting questions?

Thank You!

Encoding System Actions
- For each concept update, define a system action signature: <IC, ICT, EC, REQ>
  - IC: implicit confirm [grounding]
  - ICT: implicit confirm [task]
  - EC: explicit confirm
  - REQ: request
- Each variable can take one of 4 values:
  - 0 (the action does not happen)
  - C (the action happens on the concept of interest)
  - OC (the action happens on some other concept)
  - C&OC (the action happens on both the concept of interest and some other concept)
- Only certain combinations are valid and appear in the data. (A representation sketch follows.)
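As a sketch of how the <IC, ICT, EC, REQ> signature above might be represented, assuming nothing beyond the values defined on the slide:

```python
# One possible encoding of the action signature <IC, ICT, EC, REQ>;
# the representation is an assumption, only the four values come from the slide.

from enum import Enum

class Scope(Enum):
    NONE = "0"         # the action does not happen
    C = "C"            # on the concept of interest
    OC = "OC"          # on some other concept
    C_AND_OC = "C&OC"  # on both

# e.g. implicitly confirming this concept while requesting another one:
signature = {"IC": Scope.C, "ICT": Scope.NONE, "EC": Scope.NONE, "REQ": Scope.OC}
```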
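And, referring back to the "Smoothed version" slide, a sketch of SmoothT2 built directly from its definition σi(p) = 1 / (1 + exp(ki(p - θi))); the slopes and thresholds below are illustrative (the talk leaves them to be chosen), and unit per-step costs are assumed.

```python
# Sketch of SmoothT2(p) = σ1(p) + σ2(p), σi(p) = 1/(1+exp(ki(p-θi))).
# Slopes and thresholds are illustrative; unit costs are assumed.

import numpy as np

def smooth_t2(p, k=(40.0, 40.0), theta=(0.3, 0.7)):
    """Smooth two-step loss for d=1: ~2 when the confidence p is below both
    thresholds, ~1 between them, ~0 above the upper one. Differentiable in p
    (and in the thresholds, enabling the gradient search described above)."""
    s1 = 1.0 / (1.0 + np.exp(k[0] * (p - theta[0])))
    s2 = 1.0 / (1.0 + np.exp(k[1] * (p - theta[1])))
    return s1 + s2
```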