Characterizing Task-Oriented Dialog using a Simulated ASR Channel
Jason D. Williams
Machine Intelligence Laboratory, Cambridge University Engineering Department

SACTI-1 Corpus: Simulated ASR-Channel – Tourist Information
• Motivation for the data collection
• Experimental set-up
• Transcription & annotation
• Effects of ASR error rate on…
  – Turn length / dialog length
  – Perception of error rate
  – Task completion
  – "Initiative"
  – Overall satisfaction (PARADISE)

ASR channel vs. HH (human-human) channel: properties
HH dialog:
• "Instant" communication
• Effectively perfect recognition of words
• Prosodic information carries additional information
ASR channel:
• Turns explicitly segmented
• Barge-in, end-pointed
• Prosody virtually eliminated
• ASR & parsing errors

Observations
HH dialog:
• Frequent but brief overlaps
• 80% of utterances contain fewer than 12 words; 50% contain fewer than 5
• Approximately equal turn length
• Approximately equal balance of initiative
• About half of turns are ACK (often spliced)
ASR channel:
• Few overlaps
• Longer system turns; shorter user turns
• Initiative more often with the system
• Virtually no turns are ACK
• Virtually no splicing

Are models of HC dialog/grounding appropriate in the presence of the ASR channel?

My approach
1. Study the ASR channel in the abstract
  – WoZ experiments using a simulated ASR channel
  – Understand how people behave with an "ideal" dialog manager
    • For example, grounding model
    • Use these insights to inform state space and action set selection
  – Note that the collected data has unique properties useful to:
    • RL-based systems
    • Hidden-state estimation
    • User modeling
2. Formulate the dialog management problem as a POMDP
  – Decompose state into BN nodes – for example:
    • Conversation state (grounding state)
    • User action
    • User belief (goal)
  – Train using the data collected
  – Solve using approximations

The paradox of "dialog data"
• To build a user model, we need to see the user's reaction to all kinds of misunderstandings
• However, most systems use a fixed policy
  – Systems typically do not take different actions in the same situation
  – Taking random actions is clearly not an option!
  – Constraining actions means building very complex systems…
• … and which actions should be in the system's repertoire?

An ideal data collection…
• …would show users' reactions to a variety of error handling strategies (no fixed policy)
• BUT would not be nonsense dialogs!
• …would use the ASR channel
• …would explore a variety of operating conditions – e.g., WER
• …would not assume a particular state space
• …would somehow "discover" the set of system actions

Data collection set-up
[Diagram: data collection set-up]

ASR simulation state machine
• Simple energy-based barge-in (user interrupts wizard)
• States: SILENCE, USER_TALKING, TYPIST_TYPING, WIZARD_TALKING
• Transitions:
  – SILENCE → USER_TALKING: user starts talking
  – SILENCE → WIZARD_TALKING: wizard starts talking
  – USER_TALKING → TYPIST_TYPING: user stops talking
  – TYPIST_TYPING → WIZARD_TALKING: typist done; reco result displayed
  – WIZARD_TALKING → SILENCE: wizard stops talking
  – WIZARD_TALKING → USER_TALKING: user starts talking (barge-in)

ASR simulation
• Simplified FSM-based recognizer
  – Weighted finite state transducer (WFST)
• Flow:
  – Reference input
  – Spell-checked against full dictionary
  – Converted to phonetic string using full dictionary
  – Phonetic lattice generated based on confusion model
  – Word lattice produced
  – Language model composed to re-score lattice
  – "De-coded" to produce word strings
  – N-best list extracted
• Various free variables to induce random behavior
• Plumb the N-best list for variability
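To make the simulation flow concrete, here is a minimal, much-simplified sketch of an error-simulating channel. It is not the WFST implementation used for the corpus: it works at the word level with a hand-made confusion table, skips the phonetic lattice and language-model re-scoring steps, and all names, probabilities, and confusion pairs are illustrative assumptions.

```python
import random

# Illustrative confusion table (hypothetical pairs, not drawn from the real corpus).
CONFUSIONS = {
    "hotel": ["motel", "hostel"],
    "square": ["spare", "stair"],
    "quiet": ["quite"],
    "bus": ["bars"],
}

def corrupt(words, error_rate, rng):
    """Apply random substitutions/deletions, roughly targeting a word error rate."""
    hyp = []
    for w in words:
        if rng.random() < error_rate:
            # An error fires: substitute if a confusion is available, else delete.
            if w in CONFUSIONS:
                hyp.append(rng.choice(CONFUSIONS[w]))
        else:
            hyp.append(w)
    return hyp

def simulate_nbest(reference, error_rate, n=5, seed=None):
    """Produce an N-best list of corrupted hypotheses for one user turn."""
    rng = random.Random(seed)
    words = reference.lower().split()
    return [" ".join(corrupt(words, error_rate, rng)) for _ in range(n)]

if __name__ == "__main__":
    for hyp in simulate_nbest("I need a quiet hotel near the main square",
                              error_rate=0.4, seed=1):
        print(hyp)
```

The real simulation operates at the phone level and re-scores with a language model, so its errors are acoustically plausible rather than arbitrary like those above; its free variables play roughly the role that error_rate plays here, allowing different WER operating points to be targeted.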
ASR simulation evaluation
• Hypothesis: the simulation produces errors similar to errors induced by additive noise, w.r.t. concept accuracy
• Assess concept accuracy as F-measure using an automated, data-driven procedure (HVS model)
• Plot concept accuracy of:
  – Real additive noise
  – A naïve confusion model (simple insertion, substitution, deletion)
  – WFST confusion model
• The WFST model appears to follow the real data much more closely than the naïve model

Scenario & tasks
• Tourist / tourist-information scenario
  – Intentionally goal-directed
  – Intentionally simple tasks
  – Mixtures of simple information gathering and basic planning
• Wizard (information giver)
  – Access to bus times, tram times, restaurants, hotels, bars, tourist attraction information, etc.
• User given a series of tasks
• Likert scores asked at the end of each task
• 4 dialogs / user; 3 users / wizard

Example task: finding the perfect hotel
You're looking for a hotel for you and your travelling partner that meets a number of requirements. You'd like the following:
• En suite rooms
• Quiet rooms
• As close to the main square as possible
Given those desires, find the least expensive hotel. You'd prefer not to compromise on your requirements, but of course you will if you must! Please indicate the location of the hotel on the map and fill in the boxes below.
  Name of accommodation: ____________
  Cost per night for 2 people: ____________

User's map
[Map of the fictitious town given to the user: streets such as Fountain Road, North Road, Park Road, Cascade Road, Middle Road, Alexander Street, Nine Street, Castle Loop and West Loop, and landmarks including the castle, fountain, cinema, museum, post office, tourist information, shopping area, main square, Art Square, tower, and park.]

Wizard's map
[The same town map, additionally labelled with hotels H1–H6, restaurants R1–R6, and bars B1–B6.]

Likert-scale questions
• User and wizard were each given 6 questions after each task
• Responses on a 7-point scale: Disagree strongly (1), Disagree (2), Disagree somewhat (3), Neither agree nor disagree (4), Agree somewhat (5), Agree (6), Strongly agree (7)
• Subject example:
  1. In this task, I accomplished the goal.
  2. In this task, I thought the speech recognition was accurate.
  3. In this task, I found it difficult to communicate because of the speech recognition.
  4. In this task, I believe the other subject was very helpful.
  5. In this task, the other subject found using the speech recognition difficult.
  6. Overall, I was very satisfied with this past task.

Transcription
• User-side transcribed during experiments
  – Prioritized for speed
  – Example: "I NEED UH I'M LOOKING FOR A PIZZA"
• Wizard-side transcribed using a subset of the LDC transcription guidelines – more detail
  – Example: "ok %uh% (()) sure you -- i can"  epErrorEnd=true
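Working with the two transcription styles usually needs a small normalization step before analysis. The sketch below is hypothetical: it assumes the markup conventions suggested by the wizard-side example above (%uh% for filled pauses, (()) for unintelligible speech, -- for interruptions, and trailing key=value flags such as epErrorEnd), which may not match the actual guideline subset in every detail.

```python
import re

def clean_wizard_transcript(line):
    """Strip wizard-side markup to get a plain word string plus any flags.

    Assumed conventions (based on the example above, not the full guideline):
      %word%         filled pause, e.g. %uh%
      (( ... ))      unintelligible or uncertain speech
      --             incomplete / interrupted word or phrase
      key=value      trailing annotation flags, e.g. epErrorEnd=true
    """
    flags = dict(re.findall(r"(\w+)=(\w+)", line))
    text = re.sub(r"\b\w+=\w+\b", " ", line)    # drop key=value flags
    text = re.sub(r"\(\([^)]*\)\)", " ", text)  # drop (( ... )) regions
    text = re.sub(r"%[^%]*%", " ", text)        # drop %uh%-style fillers
    text = text.replace("--", " ")              # drop interruption marks
    return " ".join(text.split()), flags

if __name__ == "__main__":
    print(clean_wizard_transcript("ok %uh% (()) sure you -- i can epErrorEnd=true"))
    # -> ('ok sure you i can', {'epErrorEnd': 'true'})
```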
Annotation (acts)
• Each turn is a sequence of tags
• Inspired by Traum's "Grounding Acts"
  – More detailed / easier to infer from surface words
(OS = the other speaker)

Tag                 Meaning
Request             Question/request requiring a response
Inform              Statement/provision of task information
Greet-Farewell      "Hello", "How can I help", "that's all", "Thanks", "Goodbye", etc.
ExplAck             Explicit statement of acknowledgement, showing the speaker understands the OS
Unsolicited-Affirm  Explicit statement of acknowledgement, showing the OS understands the speaker
HoldFloor           Explicit request for the OS to wait
ReqRepeat           Request for the OS to repeat their last turn
ReqAck              Request for the OS to show understanding
RspAffirm           Affirmative response to ReqAck
RspNegate           Negative response to ReqAck
StateInterp         A statement of the OS's intention
DisAck              Display of lack of understanding of the OS
RejOther            Display of lack of understanding of the speaker's intention or desire by the OS

Annotation (understanding)
• Each wizard turn was labeled to indicate whether the wizard understood the previous user turn

Label           Wizard's understanding of the previous user turn
Full            All intentions understood correctly.
Partial         Some intentions understood; none misunderstood.
Non             Wizard made no guess at user intention.
Flagged-Mis     The wizard formed an incorrect hypothesis of the user's meaning and signalled a dialog problem.
Un-Flagged-Mis  The wizard formed an incorrect hypothesis of the user's meaning, accepted it as correct, and continued with the dialog.

Corpus summary

WER target  # Wiz  # Users  # Tasks  Completed in time limit  Per-turn WER  Per-dialog WER
None        2      6        24       83 %                     0 %           0 %
Low         4      12       48       83 %                     32 %          28 %
Med         4      12       48       77 %                     46 %          41 %
Hi          2      6        24       42 %                     63 %          60 %

Perception of ASR accuracy
• How accurately do users & wizards perceive WER?
• Perceptions of recognition quality broadly reflected actual performance, but users consistently gave higher quality scores than wizards for the same WER
[Chart: average Likert score (wizard vs. user) by WER target (Hi, Med, Low, None)]

Average turn length (words)
• How does WER affect wizard & user turn length?
• Wizard turn length increases
• User turn length stays relatively constant
[Chart: average words per turn (wizard vs. user) by WER target (Hi, Med, Low, None)]

Grounding behavior
• How does WER affect wizard grounding behavior?
• As WER increases, wizard grounding behaviors become increasingly prevalent
[Chart: % of all tags by WER target, broken down by tag (Inform, ExplAck, Request, ReqRepeat, ReqAck, StateInterp, DisAck, all others)]

Wizard understanding
• How does WER affect wizard understanding status?
• Misunderstanding increases with WER…
• …and task completion falls (83 %, 83 %, 77 %, 42 %)
[Chart: % of wizard turns by WER target, broken down by understanding label (Full, Partial, Non, Flagged-Mis, Un-Flagged-Mis)]

Wizard strategies
• Classify each wizard turn into one of 5 "strategies"

Label     Meaning                                  Wiz init?  Tags
REPAIR    Attempt to repair                        Yes        ReqAck, ReqRepeat, StateInterp, DisAck, RejOther
ASKQ      Ask task question                        Yes        Request
GIVEINFO  Provide task info                        No         Inform
RSPND     Non-initiative-taking grounding actions  No         ExplAck, RspAffirm, RspNegate, Unsolicited-Affirm
OTHER     Not included in analysis                 n/a        All others

Wizard strategies (after dialog trouble)
• Which strategy is most successful after known dialog trouble?
• The plot shows wizard understanding status one turn after known dialog trouble: the effect of REPAIR vs. ASKQ; 'S' indicates significant differences
[Chart: % of wizard turns by understanding label (Full, Partial, Non, Flagged-Mis, Un-Flagged-Mis) for REPAIR vs. ASKQ at the Hi and Med WER targets]

User reactions to misunderstandings
• How does a user respond after being misunderstood? Surprisingly little explicit indication!

            % of user turns including tag
WER target  DisAck  RejOther  Request
None        N/A     N/A       N/A
Low         0.0 %   3.8 %     92.3 %
Med         2.5 %   19.0 %    75.9 %
Hi          0.0 %   12.3 %    87.0 %
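As a concrete illustration of how the strategy analysis could be computed from the act annotations, here is a minimal sketch that maps each wizard turn's tag sequence to one of the five strategies and tallies strategies per WER condition. Only the tag-to-strategy mapping comes from the "Wizard strategies" table above; the precedence order, data structures, and example turns are assumptions for illustration.

```python
from collections import Counter

# Tag-to-strategy mapping from the "Wizard strategies" table above.
STRATEGY_TAGS = {
    "REPAIR":   {"ReqAck", "ReqRepeat", "StateInterp", "DisAck", "RejOther"},
    "ASKQ":     {"Request"},
    "GIVEINFO": {"Inform"},
    "RSPND":    {"ExplAck", "RspAffirm", "RspNegate", "Unsolicited-Affirm"},
}

def classify_turn(tags):
    """Assign a wizard turn (a sequence of act tags) to one strategy.

    The precedence is an assumption of this sketch: initiative-taking
    strategies (REPAIR, ASKQ) win over GIVEINFO, which wins over RSPND;
    anything else falls into OTHER.
    """
    tag_set = set(tags)
    for strategy in ("REPAIR", "ASKQ", "GIVEINFO", "RSPND"):
        if tag_set & STRATEGY_TAGS[strategy]:
            return strategy
    return "OTHER"

if __name__ == "__main__":
    # Hypothetical annotated wizard turns: (WER condition, [tags]).
    turns = [
        ("Hi",  ["ReqRepeat"]),
        ("Hi",  ["ExplAck", "Inform"]),
        ("Med", ["Request"]),
        ("Low", ["Inform"]),
    ]
    counts = Counter((wer, classify_turn(tags)) for wer, tags in turns)
    for (wer, strategy), n in sorted(counts.items()):
        print(f"{wer:>4}  {strategy:<8}  {n}")
```

A similar tally over user turns that follow a misunderstood wizard turn is essentially what the reaction table above summarizes.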
Level of wizard "initiative"
• How does "initiative" vary with WER?
• Define wizard "initiative" using the strategies above
[Stacked chart: % of wizard turns in each strategy (REPAIR, ASKQ, RSPND, GIVEINFO) by WER target (Hi, Med, Low, None)]

Reward measures / PARADISE
• Satisfaction = task completion + dialog cost metrics
• 2 kinds of user satisfaction:
  – Single
  – Combi
• 3 kinds of task completion:
  – User
  – Obj
  – Hyb
• Cost metrics:
  – PerDialogWER
  – %UnFlaggedMis
  – %FlaggedMis
  – %Non
  – Turns
  – %REPAIR
  – %ASKQ

Reward measures / PARADISE: regression results
• In almost all experiments using the User task completion metric, it was the only significant predictor
• The Single/Combi satisfaction metrics almost always selected the same predictors

Dataset  Metric (task completion & user sat)  R²    Significant predictors
ALL      User-S                               52 %  1.03 Task
ALL      User-C                               60 %  5.29 Task - 1.54 %UnFlagMis
ALL      Obj-S                                24 %  -0.49 Turns + 0.38 Task
ALL      Obj-C                                27 %  -2.43 Turns - 1.45 %UnFlagMis + 1.35 Task
ALL      Hyb-S                                41 %  0.74 Task - 0.36 Turns
Hi       Obj-S                                40 %  0.98 Task
Hi       Hyb-S                                48 %  1.07 Task
Med      Obj-S                                16 %  -0.62 %Non
Med      Obj-C                                37 %  -3.35 %Non - 2.94 Turns
Med      Hyb-S                                38 %  0.97 Task
Low      Obj-S                                28 %  -0.59 Turns
Low      Hyb-S                                40 %  -0.49 Turns + 0.40 Task

(Metric names combine the task completion type with the satisfaction type, e.g. Obj-S = Obj task completion with Single satisfaction.)

Reward measures / PARADISE: interpretation
• What indicators best predict user satisfaction?
  – When run on all data, mixtures of Task, Turns, and %UnFlaggedMis best predict user satisfaction.
• %UnFlaggedMis serves as a better measurement of understanding accuracy than WER alone, since it effectively combines recognition accuracy with a measure of confidence.
• Broadly speaking:
  – Task completion is most important at the High WER level
  – Task completion and dialog quality are most important at the Med WER level
  – Efficiency is most important at the Low WER level
• These patterns mirror findings from other PARADISE experiments using human/computer data
• This gives us some confidence that this data set is valid for training human/computer systems

Conclusions / next steps
• At moderate WER levels, asking task-related questions appears to be more successful than direct dialog repair.
• Levels of expert "initiative" increase with WER, primarily as a result of grounding behavior.
• Users infrequently give a direct indication of having been misunderstood, with no clear correlation to WER.
• When run on all data, mixtures of Task, Turns, and %UnFlaggedMis best predict user satisfaction.
• Task completion appears to be most predictive of user satisfaction; however, efficiency shows some influence at lower WERs.
• Next… apply this corpus to statistical systems.

Thanks!
Jason D. Williams
jdw30@cam.ac.uk
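Finally, for readers who want to run a PARADISE-style analysis like the one above on their own data, here is a minimal sketch of regressing user satisfaction on task completion and cost metrics with normalized predictors. The feature names mirror the slides, but the numbers are synthetic; this illustrates the method only and does not reproduce the corpus results.

```python
import numpy as np

# Synthetic per-dialog records for illustration only.
data = np.array([
    # Task  Turns  %UnFlaggedMis  Satisfaction
    [1.0,   20.0,  0.05,          6.0],
    [1.0,   35.0,  0.10,          5.0],
    [0.0,   40.0,  0.30,          2.0],
    [1.0,   15.0,  0.00,          7.0],
    [0.0,   50.0,  0.25,          3.0],
    [1.0,   30.0,  0.15,          5.0],
])
X_raw, y = data[:, :3], data[:, 3]

# PARADISE z-score normalizes each predictor before fitting a linear
# model of user satisfaction.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X = np.hstack([np.ones((len(X), 1)), X])  # intercept term

coefs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
for name, c in zip(["intercept", "Task", "Turns", "%UnFlaggedMis"], coefs):
    print(f"{name:>14}: {c:+.2f}")

# R^2 of the fit, analogous to the R² column in the results table above.
pred = X @ coefs
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(f"R^2: {1 - ss_res / ss_tot:.2f}")
```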