Flexible Turn-Taking for Spoken Dialog Systems
PhD Thesis Defense
Antoine Raux
Language Technologies Institute, CMU
December 12, 2008
Thesis Committee: Maxine Eskenazi (chair), Alan W Black, Reid Simmons, Diane J. Litman

Spoken Dialog Systems
Spoken dialog systems have long promised to improve human-machine interaction.
Speech is a natural means of communication.
Recent improvements in underlying technologies have made such systems a reality.

Sometimes they work…
S: What can I do for you?
U: I'd like to go to the Waterfront.
S: Going to Waterfront. Is this correct?
U: Yes.
S: Alright. Where do you want to leave from?
U: Oakland.
S: Leaving from Oakland. When are you going to take that bus?
U: Now. The next bus.
S: Hold on. Let me check that for you. The next 61C leaves Forbes Avenue at Atwood Children's Hospital at 5:16 PM.

…but not always…
S: What can I do for you?
U: 'kay. 51C Carrick from Century Square to Downtown Pittsburgh, to Oakland.
S: The 61…
S: If you want…
S: Leaving from Oakland. Is this correct?
U: 51C leaving Century Square going to Oakland, I mean go to South Side.
S: Leaving…
S: Leaving from McKeesport. Is…
U: No. Leaving from Century Square.
S: Leaving from McKeesport. Did I get that right?

Key Definitions
(Conversational) Floor: "The right to address an assembly" (Merriam-Webster). The interactional state that describes which participant in a dialog has the right to provide or request information at any point.
Turn-Taking: The process by which participants in a conversation alternately own the conversational floor.

Thesis Statement
Incorporating different levels of knowledge using a data-driven decision model will improve the turn-taking behavior of spoken dialog systems.
Specifically, turn-taking can be modeled as a finite-state decision process operating under uncertainty.

Floor, Intentions and Beliefs
The floor is not an observable state. Rather, participants have:
• intentions to claim the floor or not
• beliefs over whether others are claiming it
Participants negotiate the floor to limit gaps and overlaps. [Sacks et al 1974, Clark 1996]

Uncertainty over the Floor
Uncertainty over the floor leads to breakdowns in turn-taking:
• Cut-ins
• Latency
• Barge-in latency
• Self interruptions

Turn-Taking Errors by System
Cut-ins: the system grabs the floor before the user releases it.
U: 'kay. 51C Carrick from Century Square (…)
S: The 61…
Latency: the system waits after the user has released the floor.
S: (…) Is this correct?
U: Yeah.
S: Alright (…)

Turn-Taking Errors by System
Barge-in latency: the system keeps the floor while the user is claiming it.
S: For example, you can say "When is the next 28X from downtown to the airport?" or "I'd like to go from McKee…
U: When is the next 54…
S: Leaving from Atwood. Is this correct?
Self interruptions: the system releases the floor while the user is not claiming it.
S: What can I do for you?
U: 61A.
S: For example, you can say when is…
S: Where would you li…
S: Let's proceed step by step. Which neighb…
S: Leaving from North Side. Is this correct?
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
A domain-independent data-driven turn-taking model
Conclusion

Pipeline Architectures
[Diagram: Speech Recognition → Natural Language Understanding → Dialog Management (with Backend) → Natural Language Generation → Speech Synthesis]
• Turn-taking imposed by full-utterance-based processing
• Sequential processing
• Lack of reactivity
• No sharing of information across modules
• Hard to extend to multimodal/asynchronous events

Multi-layer Architectures
• Separate reactive from deliberative behavior
– turn-taking vs dialog act planning
• Different layers work asynchronously [Thorisson 1996, Allen et al 2001, Lemon et al 2003]
• But no previous work:
– addressed how the conversational floor interacts with dialog management
– successfully deployed a multi-layer architecture in a broadly used system

Proposed Architecture: Olympus 2
[Diagram: Sensors → Speech Recognition → Natural Language Understanding → Interaction Management ↔ Dialog Management (with Backend) → Natural Language Generation → Speech Synthesis → Actuators]

Olympus 2 Architecture
• Explicitly models turn-taking
• Integrates dialog features from both low and high levels
• Operates on generalized events and actions
• Uses floor state to control planning of conversational acts

Olympus 2 Deployment
• Ported Let's Go to Olympus 2
– publicly deployed telephone bus information
– originally built using Olympus 1
• The new version has processed about 30,000 dialogs since deployment
– no performance degradation
• Allows research on turn-taking models to be guided by real users' behavior

Outline
Introduction
An event-driven architecture for spoken dialog systems
Using dialog features to inform turn-taking
  End-of-turn detection
  Decision tree-based thresholds
  Batch evaluation
  Live evaluation

End-of-Turn Detection
S: What can I do for you?
U: I'd like to go to the airport. (end of turn)
Detecting when the user releases the floor.
Potential problems:
• Cut-ins
• Latency

Latency / Cut-in Tradeoff
S: What can I do for you?
U: I'd like to go to … the airport.
• Long threshold: few cut-ins, long latency.
• Short threshold: many cut-ins, short latency.
Can we exploit dialog information to get the best of both worlds?

End-of-Turn Detection as Classification
• Classify pauses as internal/final based on words, syntax, prosody [Sato et al, 2002]
• Repeat the classification every n milliseconds until the pause ends or an end of turn is detected [Ferrer et al, 2003, Takeuchi et al, 2004]
• But no previous work:
– successfully combined a wide range of features
– tested such a model in a real dialog system

Using Variable Thresholds
S: What can I do for you?
U: I'd like to go to the airport.
Dialog state (open question, specific question, confirmation); does the partial hypothesis match current expectations?
Feature types:
• Discourse (dialog state)
• Semantics (partial ASR)
• Prosody (F0, duration)
• Timing (pause start)
• Speaker (avg # pauses)
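As a concrete illustration of the five feature types listed above, the sketch below shows how such features might be gathered for one pause. It is a hypothetical example: the class name, field names, and values are invented to mirror the categories on this slide, and do not reflect the actual Olympus/Let's Go feature set.

```python
# Hypothetical sketch: the five feature types used to pick an endpointing
# threshold for the current pause. Names and values are illustrative only.
from dataclasses import dataclass

@dataclass
class EndpointingFeatures:
    # Discourse: what the system just asked
    dialog_state: str            # "open_question" | "specific_question" | "confirmation"
    # Semantics: properties of the partial ASR hypothesis
    partial_hyp: str
    matches_expectations: bool   # does the partial hypothesis fit current expectations?
    # Prosody: pitch and duration at the end of speech
    final_f0_slope: float
    last_vowel_duration_ms: float
    # Timing: where in the utterance the pause starts
    utterance_duration_ms: float
    # Speaker: running statistics about this caller
    avg_pauses_per_turn: float

def example_features() -> EndpointingFeatures:
    """One made-up pause, after 'I'd like to go to' followed by silence."""
    return EndpointingFeatures(
        dialog_state="open_question",
        partial_hyp="i'd like to go to",
        matches_expectations=False,   # destination slot not filled yet
        final_f0_slope=-2.3,
        last_vowel_duration_ms=95.0,
        utterance_duration_ms=1400.0,
        avg_pauses_per_turn=1.8,
    )
```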
Example Decision Tree
[Figure: decision tree for endpointing thresholds, trained on 1326 dialogs with the Let's Go public dialog system. Internal nodes test features such as: utterance duration < 2000 ms; whether the partial ASR hypothesis matches expectations; average non-understanding ratio < 15%; whether the dialog state is an open question; average pause duration < 200 ms or < 300 ms; whether the partial ASR hypothesis contains "YES" or has fewer than 3 words; whether a partial ASR hypothesis is available; the number of consecutive user turns without a system prompt. Leaves assign thresholds ranging from 200 ms to 1440 ms (e.g. 205 ms, 637 ms, 779 ms, 1078 ms). The incomplete hypothesis "I'd like to go to" and the complete "I'd like to go to the airport." fall into different leaves and therefore receive different thresholds.]

Outline — Using dialog features to inform turn-taking: batch evaluation

Performance per Feature Set
[Plot: average latency (ms) vs turn cut-in rate (%) for a fixed threshold, state-specific thresholds, and the decision tree. At equal latency the decision tree reduces the cut-in rate by 38%; at equal cut-in rate it reduces latency by 22%.]

Performance per Feature Set
[Plot: average latency vs turn cut-in rate for the fixed threshold and for decision trees using each feature type (discourse, speaker, prosody, timing, semantics) and all features combined. Semantics is the most useful feature type.]

Outline — Using dialog features to inform turn-taking: live evaluation

Live Evaluation
• Implemented the decision tree in the Let's Go interaction manager (IM)
• Operating point: 3% cut-in rate, 635 ms average latency
• 1061 dialogs collected in May '08
– 548 control dialogs (fixed threshold = 700 ms)
– 513 treatment dialogs (decision tree)

Cut-in Rate per Dialog State
[Bar chart: cut-in rate (0–14%) for control vs decision tree, overall and per dialog state (open question, specific question, confirmation). Fewer cut-ins overall (p < 0.05); the largest improvement is after open requests.]

Average Latency per State
[Bar chart: average latency (500–1200 ms) for control vs decision tree, overall and per state. The decision tree is slower on answers to open questions and faster on confirmations.]

Non-Understanding Rate per State
[Bar chart: non-understanding rate (0–35%) for control vs decision tree, overall and per state. Significant reduction after confirmations (p < 0.01).]
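To make the live-evaluation setup concrete, here is a minimal sketch of how a threshold decision tree could be applied at run time: at each pause, features are fed to the tree to obtain a silence threshold, and the turn is endpointed once the pause exceeds it. The toy tree below only stands in for the trained Let's Go tree; its splits and leaf values are taken loosely from the example-tree figure and are otherwise hypothetical.

```python
# Minimal run-time sketch of decision-tree-based endpointing thresholds
# (hypothetical implementation, not the actual Olympus/Let's Go code).
from typing import Callable, Dict

def should_endpoint(
    silence_ms: float,
    features: Dict[str, object],
    predict_threshold: Callable[[Dict[str, object]], float],
) -> bool:
    """Endpoint the user turn once the current pause exceeds the
    tree-predicted, context-dependent threshold."""
    return silence_ms >= predict_threshold(features)

def toy_tree(features: Dict[str, object]) -> float:
    """Toy stand-in for the trained tree: two splits resembling the example
    decision tree, with illustrative leaf values in milliseconds."""
    if features.get("matches_expectations"):
        # Complete, expected answers can be endpointed aggressively.
        return 205.0 if "yes" in str(features.get("partial_hyp", "")) else 400.0
    # Incomplete or unexpected hypotheses get a conservative threshold.
    return 1078.0

if __name__ == "__main__":
    feats = {"partial_hyp": "i'd like to go to", "matches_expectations": False}
    for t in (200, 600, 1100):   # milliseconds of silence so far
        print(t, should_endpoint(t, feats, toy_tree))
```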
Outline — A domain-independent data-driven turn-taking model: the Finite-State Turn-Taking Machine; application to end-of-turn detection (in pauses, anytime); application to barge-in detection

The Finite-State Turn-Taking Machine
[State diagram, built up over four slides: the two single-speaker states System and User; the two free-floor states Free_S and Free_U (floor free after the system or the user spoke last); and the two overlap states Both_S and Both_U (both speaking, with the system or the user holding the floor).]
Similar models were proposed by Brady (1969) and Jaffe and Feldstein (1970) for the analysis of human conversations.

Uncertainty in the FSTTM
• The system:
– knows whether it is claiming the floor or not
– holds probabilistic beliefs over whether the user is
• Probability distribution over the state
• In some (useful) cases, approximations allow us to reduce the uncertainty to two states:
– User vs Free_U during user utterances
– System vs Both_S during system prompts

Making Decisions with the FSTTM
• Actions
– YIELD, KEEP if the system is currently holding the floor
– GRAB, WAIT if it is not
– Different costs in different states
• Decision-theoretic action selection
– Pick the action with the lowest expected cost given the belief distribution over the states

Outline — Application to end-of-turn detection in pauses

End-of-Turn Detection in the FSTTM
[Diagram: during a user utterance the relevant states are User and Free_U; in each, the system chooses between GRAB and WAIT.]

Action/State Cost Matrix in Pauses

System action     Floor state: User     Floor state: Free_U
WAIT              0                     C_G · t  (grows with time in pause)
GRAB              C_U  (constant)       0

• Latency cost increases linearly with time
• Constant cut-in cost

Action Selection
At time t in a pause, take the action with minimal expected cost:
  C_t(GRAB) = P_t(User) · C_U
  C_t(WAIT) = P_t(Free_U) · C_G · t

Estimating State Probabilities
  P_t(Free_U) ∝ P(Free_U)
  (probability that the user releases the floor, estimated at the beginning of the pause)
  P_t(User) ∝ (1 − P(Free_U)) · e^(−t/μ)
  (probability that the user keeps the floor, estimated at the beginning of the pause, with exponential decay over time spent in the pause)

Estimating P(Free_U)
• Step-wise logistic regression
• Selected features:
– boundary LM score, "YES" in the ASR hypothesis
– energy, F0 before the pause
– barge-in

                        Baseline    Logistic Regression
Classification Error    21.9%       21.7%
Log Likelihood          -0.52       -0.44

In-Pause Detection Results
[Plot: average latency (ms) vs cut-in rate (%) for the fixed threshold, the decision tree, and the FSTTM. The FSTTM achieves a 28% latency reduction.]
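The in-pause decision rule above can be written out in a few lines. The sketch below is a schematic re-implementation under the stated assumptions (constant cut-in cost C_U, latency cost C_G · t, exponential decay with mean pause duration μ); it is not the actual Olympus interaction manager code, and the numeric values of C_U, C_G, μ, and P(Free_U) are placeholders.

```python
import math

def fsttm_pause_action(t_ms: float,
                       p_free0: float,              # P(Free_U) estimated at the start of the pause
                       mu_ms: float,                # mean internal pause duration (exponential model)
                       cost_cut_in: float,          # C_U: constant cost of grabbing while the user holds the floor
                       cost_latency_per_ms: float   # C_G: latency cost per ms of waiting on a free floor
                       ) -> str:
    """Return 'GRAB' or 'WAIT' after t_ms of silence, by comparing expected costs.

    Unnormalized joint probabilities with the observed silence (the common
    normalizer cancels out of the argmin):
        P_t(Free_U, sil) = p_free0
        P_t(User,  sil) = (1 - p_free0) * exp(-t / mu)
    """
    p_free = p_free0
    p_user = (1.0 - p_free0) * math.exp(-t_ms / mu_ms)
    cost_grab = p_user * cost_cut_in                   # C_t(GRAB) = P_t(User) * C_U
    cost_wait = p_free * cost_latency_per_ms * t_ms    # C_t(WAIT) = P_t(Free_U) * C_G * t
    return "GRAB" if cost_grab <= cost_wait else "WAIT"

if __name__ == "__main__":
    # Illustrative values only (not figures from the thesis experiments).
    for t in (100, 300, 600, 1000):
        print(t, fsttm_pause_action(t, p_free0=0.6, mu_ms=300.0,
                                    cost_cut_in=10_000.0, cost_latency_per_ms=1.0))
```

With these illustrative numbers the rule waits for short silences and grabs once the silence approaches one second, which is exactly the context-dependent-threshold behavior the cost matrix implies.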
Outline — Application to end-of-turn detection at pauses and anytime; application to barge-in detection

Delays in Pause Detection
U: I'd like to go to the airport.
• About 200 ms elapse between the start of a pause and the VAD change of state
• In some cases, we can make the decision before VAD detection:
– from partial hypotheses during speech
– with the previous model once a pause is detected
→ Anytime end-of-turn detection

End-of-Turn Detection in Speech
Cost matrix:

System action     Floor state: User     Floor state: Free_U
WAIT              0                     C_W  (constant)
GRAB              C_U  (constant)       0

• Leads to a fixed threshold on P(Free_U)

Estimating P(Free_U) in Speech
• Step-wise logistic regression
• Features:
– boundary LM score, "YES"/"NO" in the hypothesis
– number of words
– barge-in

                        Baseline    Logistic Regression
Classification Error    38.9%       19.2%
Log Likelihood          -0.67       -0.45

Anytime Detection Results
[Plot: average latency (ms) vs cut-in rate (%) for the fixed threshold, the in-pause FSTTM, and the anytime FSTTM. The anytime FSTTM achieves a 35% latency reduction.]

Histogram of Turn Latencies
[Histogram: distribution of turn latencies (0–1500 ms) for the in-pause FSTTM and the anytime FSTTM. The distribution is bimodal, with highly predictable ends of turns at short latencies and less predictable ones at longer latencies. With anytime detection, about 40% of the highly predictable cases are predicted during speech and 10% of all turn ends are detected during speech; the less predictable cases are unchanged.]

Outline — A generic, trainable turn-taking model: application to barge-in detection

Barge-in Detection in the FSTTM
[Diagram: during a system prompt the relevant states are System and Both_S; in each, the system chooses between KEEP and YIELD.]

Cost Matrix during System Prompts

System action     Floor state: System     Floor state: Both_S
KEEP              0                       C_O  (constant)
YIELD             C_S  (constant)         0

• Constant costs
• Equivalent to setting a threshold on P(Both_S)

Estimating P(Both_S)
• Estimated at each new partial ASR hypothesis
• Logistic regression
• Features:
– whether the partial hypothesis matches expectations
– cue words in the hypothesis
  • selected using mutual information on a previous corpus
  • e.g. "When" in a state where "When is the next/previous bus" is expected

Barge-in Detection Results
[Plot: average latency (ms) vs self-interruption rate (%) for two feature sets: "hypothesis matches expectations" alone, and "hypothesis matches + cue words".]
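With constant costs, the barge-in decision just described reduces to comparing P(Both_S) against a fixed threshold derived from C_O and C_S. Below is a schematic sketch of that rule; the logistic-regression weights, feature names, and cost values are invented for illustration and do not come from the thesis.

```python
import math
from typing import Dict

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def p_both_s(features: Dict[str, float], weights: Dict[str, float], bias: float) -> float:
    """Logistic-regression estimate of P(Both_S), i.e. the probability that the user
    is genuinely claiming the floor during a system prompt, updated at each partial
    ASR hypothesis."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)

def barge_in_action(p_both: float, cost_overlap: float, cost_self_interruption: float) -> str:
    """KEEP vs YIELD with constant costs C_O and C_S. Expected cost of KEEP is
    P(Both_S) * C_O; expected cost of YIELD is (1 - P(Both_S)) * C_S; so YIELD
    exactly when P(Both_S) exceeds C_S / (C_S + C_O)."""
    threshold = cost_self_interruption / (cost_self_interruption + cost_overlap)
    return "YIELD" if p_both >= threshold else "KEEP"

if __name__ == "__main__":
    # Invented weights: an expectation match and the cue word "when" both raise P(Both_S).
    w = {"matches_expectations": 2.0, "cue_word_when": 1.5}
    feats = {"matches_expectations": 1.0, "cue_word_when": 1.0}
    p = p_both_s(feats, w, bias=-2.0)
    print(p, barge_in_action(p, cost_overlap=500.0, cost_self_interruption=1500.0))
```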
Outline — Conclusion

Thesis Statement
Incorporating different levels of knowledge using a data-driven decision model will improve the turn-taking behavior of spoken dialog systems.
• latency and/or cut-in rate reduced by both the decision tree and the FSTTM
• semantic features are the most useful
Specifically, turn-taking can be modeled as a finite-state decision process operating under uncertainty.
• the FSTTM

Contributions
• An architecture for spoken dialog systems that incorporates dialog and interaction management
• An analysis of the dialog features underlying turn-taking
• The Finite-State Turn-Taking Machine
– a domain-independent turn-taking model
– data-driven
– improves end-of-turn and barge-in detection

Extending the FSTTM
• A framework to organize turn-taking
• Extensions
– generalized FSTTM topology
  • multi-party conversation
– richer cost functions
  • non-linear latency cost, non-uniform cut-in cost, etc.
– better tracking of uncertainty
  • priors
  • Partially Observable Markov Decision Processes

FSTTM Dialog
S: What can I do for you?
U: Next bus from Fifth and Negley to Fifth and Craig.
S: Leaving from Fifth and Negley. Is this correct?
U: Yes.
S: Alright. Going to Fifth and Craig. Is this correct?
U: Yes.
S: Alright. I think you want the next bus. Am I…
U: Yes.
S: Right. Just a minute. I'll look that up. The next 71D leaves Fifth Avenue at Negley at 10:54 AM.

Thank you! Questions?

Extra Slides

Building Threshold Decision Trees
1. Cluster pauses using automatically extracted features from discourse, semantics, prosody, timing, and the speaker.
2. Set one threshold for each cluster so as to minimize overall latency.

Learning Curve
[Plot: average turn-final latency (ms, 450–700) and tree size (number of internal nodes, 0–25) as a function of the number of dialogues in the training set (0–1200).]

Estimating Parameters
• μ
– overall mean pause duration
– state-specific mean pause duration
– predicted using dialog features through stepwise generalized linear regression
  • correlation: 0.42
  • features: barge-in, dialog state, LM score, "YES"
[Plot: predicted vs actual pause duration (0–1500 ms).]

Endpointing Threshold
The threshold is the solution of:
  P_0(Free_U) · t = (1 − P_0(Free_U)) · e^(−t/μ) · K
[Plot: resulting threshold (ms, 0–2000) as a function of P_0(Free_U), from 0 to 1.]

User Barge-in Time Distribution
[Histogram: distribution of user barge-in times over the prompt "Going to <destination>. Is this correct?"]

The First Step
Two versions of the same dialog: the original system prompts and shorter, more conversational prompts.
S: What can I do for you?
U: When is the next 54C coming to 18th street?
S: The 54C. Did I get that right?  →  S: The 54C, right?
U: Yes.
S: Okay. Where do you wanna go?
U: Carson.
S: Going to Carson. Is this correct?  →  S: Carson, correct?
U: Yes.
S: Okay. Let me check that for you.
Ingredients: prompt design, prosody, turn-taking, incremental processing.

Spoken Dialog Systems
(The Let's Go example dialog from the introduction: "I'd like to go to the Waterfront" … "The next 61C leaves Forbes Avenue at Atwood Children's Hospital at 5:16 PM.")

Turn Endpointing
S: What can I do for you?
U: I'd like to go to the airport.
[Diagram: VAD output over the user utterance — speech detected, silence detected, speech, silence; the turn is endpointed when the silence exceeds a threshold.]

Endpointing Issues
[Diagram: same utterance; if the threshold elapses during the internal pause ("I'd like to go to … the airport"), the system endpoints too early — a cut-in.]

End-of-Turn Detection Issues
[Diagram: same utterance; after the final silence, the threshold adds latency before the end of the turn is detected.]
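The three diagrams above describe plain threshold-based endpointing on VAD output; the sketch below spells that logic out. It is a toy illustration (the frame size, the boolean VAD input, and the example thresholds are assumptions), not the Olympus endpointer.

```python
from typing import Iterable, Optional

def fixed_threshold_endpoint(vad_frames: Iterable[bool],
                             threshold_ms: float = 700.0,
                             frame_ms: float = 10.0) -> Optional[float]:
    """Return the time (ms) at which the turn is endpointed, i.e. when the
    accumulated silence after speech first reaches the threshold.
    vad_frames: True = speech detected, False = silence, one value per frame."""
    t, silence, heard_speech = 0.0, 0.0, False
    for is_speech in vad_frames:
        t += frame_ms
        if is_speech:
            heard_speech, silence = True, 0.0
        elif heard_speech:
            silence += frame_ms
            if silence >= threshold_ms:
                return t          # endpoint: silence exceeded the threshold
    return None                   # no endpoint detected yet

if __name__ == "__main__":
    # "I'd like to go to" (500 ms) + 600 ms internal pause + "the airport" (400 ms) + long silence.
    frames = [True] * 50 + [False] * 60 + [True] * 40 + [False] * 120
    # A 700 ms threshold waits through the internal pause (no cut-in, but added latency)...
    print(fixed_threshold_endpoint(frames, threshold_ms=700.0))
    # ...while a 300 ms threshold endpoints during the internal pause (a cut-in).
    print(fixed_threshold_endpoint(frames, threshold_ms=300.0))
```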
The Endpointing Trade Off
S: What can I do for you?
U: I'd like to go to … the airport.
• Long threshold: few cut-ins, long latency.
• Short threshold: many cut-ins, short latency.

Using Variable Thresholds
Features used to vary the threshold:
• Discourse (dialog state)
• Semantics (partial ASR)
• Prosody (F0, duration)
• Timing (pause start)
• Speaker (avg # pauses)

Standard Approach to Turn-Taking in Spoken Dialog Systems
• Typically not explicitly modeled
• Rules based on low-level features
– threshold-based end-of-utterance detection
– (optionally) barge-in detection
• Fixed behavior
• Not integrated in the overall dialog model

The Finite-State Turn-Taking Machine
[State diagrams, one transition per slide: a smooth transition (the user yields, User → Free_U, then the system grabs, Free_U → System); latency (the system and the user both wait while the floor stays in Free_U); a cut-in (the system grabs while the user still claims the floor, User → Both_U); a time out (the system grabs the floor again after the user fails to take it, Free_S → System); a barge-in (the user grabs while the system claims the floor, System → Both_S); and the system yielding in response to a barge-in (Both_S → User).]

Optimal C_W
[Plot: average latency (ms, 350–490) as a function of C_W (0–2000), used to select the optimal C_W; C_U is set to maintain an overall cut-in rate of 5%.]

Estimating State Probabilities
If the user has released the floor, we assume they remain silent indefinitely at the end of the turn (no transition Free_U → User), so without knowledge of the silence duration, P_t(Free_U) = P_0(Free_U):
  P_t(Free_U, sil) = P_t(Free_U) · P_t(sil | Free_U) = P_0(Free_U) · 1 = P_0(Free_U)
If the user still holds the floor, the probability that they are still silent at time t, given that they have not finished their turn, is modeled with an exponential distribution over internal silence durations, where μ is the mean pause duration:
  P_t(User, sil) = P_t(User) · P_t(sil | User) = P_0(User) · P(dur(sil) > t | User) = (1 − P_0(Free_U)) · e^(−t/μ)
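To make the two estimates concrete, here is a small worked example with assumed values (P_0(Free_U) = 0.6 and μ = 300 ms are illustrative, not figures from the thesis), evaluated 400 ms into a pause:
  P_t(Free_U, sil) = P_0(Free_U) = 0.6
  P_t(User, sil) = (1 − 0.6) · e^(−400/300) ≈ 0.4 × 0.264 ≈ 0.105
  P_t(Free_U | sil) = 0.6 / (0.6 + 0.105) ≈ 0.85
So after 400 ms of silence the belief that the user has released the floor has risen from 0.6 to about 0.85; the expected-cost rule from the main slides then compares 0.105 · C_U against 0.6 · C_G · t (the common normalizer cancels out of the comparison).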
Reducing Uncertainty
Different levels of information can help reduce uncertainty over the floor:
• Immediate information: syntax, semantics, prosody of the current turn…
• Discourse information: dialog state, task structure, expectations…
• Environment information: acoustic conditions, user characteristics…

Endpointing Threshold
The threshold is the solution of:
  P(Free_U) · t = (1 − P(Free_U)) · e^(−t/μ) · K
• The parameter K is set empirically (typically 10,000 ms)
[Plot: expected cost of GRAB and WAIT as a function of time (0–1000 ms); the endpointing threshold is the point where the two curves cross.]

Pause Endpointing Results
[Plot: average latency (ms) vs cut-in rate (%) for the baseline (threshold optimization) and three FSTTM variants: only P_f estimated; P(U|O) and μ estimated; P(U|O) estimated with oracle μ.]
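Since the waiting cost grows linearly in t while the grabbing cost decays exponentially, the threshold equation above has a single crossing and can be solved numerically. The sketch below does this by bisection; it is an illustrative re-derivation (the search bracket and the example parameter values are assumptions), not code from the thesis.

```python
import math

def endpointing_threshold(p_free: float, mu_ms: float, k_ms: float = 10_000.0,
                          upper_ms: float = 20_000.0) -> float:
    """Solve  P(Free_U) * t = (1 - P(Free_U)) * exp(-t/mu) * K  for t by bisection.

    The left side (expected latency cost of waiting) increases with t while the
    right side (expected cut-in cost of grabbing) decreases, so their difference
    f(t) is monotonically increasing and has a single root in (0, upper_ms] for
    0 < P(Free_U) < 1.
    """
    def f(t: float) -> float:
        return p_free * t - (1.0 - p_free) * math.exp(-t / mu_ms) * k_ms

    lo, hi = 0.0, upper_ms
    for _ in range(60):              # 60 halvings: precision far below 1 ms
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

if __name__ == "__main__":
    # Illustrative values: K = 10,000 ms (the typical value quoted above),
    # a 300 ms mean pause duration, and a range of initial P(Free_U) estimates.
    for p in (0.2, 0.5, 0.8):
        print(p, round(endpointing_threshold(p, mu_ms=300.0), 1))
```

As on the threshold plot above, the resulting threshold shrinks as the initial estimate P(Free_U) grows: the more confident the system is that the user has released the floor, the sooner it grabs it.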