Error Detection in Human-Machine Interaction
Dan Bohus
DoD Group, Oct 2002

Errors in Spoken-Language Interfaces
- Speech recognition is problematic:
  - Input signal quality
  - Accents, non-native speakers
  - Spoken language disfluencies: stutters, false starts, /mm/, /um/
- Typical word error rates in SDS: 10-30%
- Systems today lack the ability to recover gracefully from errors

An example
(S = system, U = user; recognizer output in brackets)
S: Are you a registered user ?
U: No I'm not. No [NO I'M NOT NO]
S: What is your full name ?
U: Adam Schumacher [ATHENS IN AKRON]
S: There is an Athens in Georgia and in Greece. Which destination did you want ?
U: Neither [NEITHER]
S: I'm sorry, I didn't catch that. There is an Athens in Georgia and in Greece. Which destination did you want ?
U: Georgia [GEORGIA]
S: A flight from Athens... Where do you want to go ?
U: Start over [START OVER]
S: Do you really want to start over ?
U: Yes [YES]
S: What city are you leaving from ?
U: Hamilton, Ontario [HILTON ONTARIO]
S: Sorry, I'm not sure I understood what you said. Where are you leaving from ?
U: Hamilton [HILTON]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from ?
U: Toronto [TORONTO]

Pathway to a solution
- Make systems aware of unreliability in their inputs
  - Confidence scores
- Develop a model which learns to optimally choose between several prevention/repair strategies
  - Identify strategies
  - Express them in a computable manner
  - Develop the model

Papers
- Error Detection in Spoken Human-Machine Interaction [E. Krahmer, M. Swerts, M. Theune, M. Weegels]
- Problem Spotting in Human-Machine Interaction [E. Krahmer, M. Swerts, M. Theune, M. Weegels]
- The Dual of Denial: Disconfirmations in Dialogue and Their Prosodic Correlates [E. Krahmer, M. Swerts, M. Theune, M.
  Weegels]

Goals
[Let's look at dialog on page 2]
(1) Analysis of positive and negative cues we use in response to implicit and explicit verification questions
(2) Explore the possibilities of spotting errors on-line

Explicit vs. Implicit
Explicit
- Presumably easier for the system to verify
- But there's evidence that it's not as easy …
- Leads to more turns, less efficiency, frustration
Implicit
- Efficiency
- But induces a higher cognitive burden, which can result in more confusion
- ~ Systems don't deal very well with it…

Clark & Schaefer
Grounding model
- Presentation phase
- Acceptance phase
Various indicators
- Go ON / YES
- Go BACK / NO
Can we detect them reliably (when following implicit and explicit verification questions) ?

Positive and Negative Cues
  Positive               Negative
  Short turns            Long turns
  Unmarked word order    Marked word order
  Confirm                Disconfirm
  Answer                 No answer
  No corrections         Corrections
  No repetitions         Repetitions
  New info               No new info

Experimental Setup / Data
- 120 dialogs: Dutch SDS providing train timetable information
- 487 utterances
- 44 (~10%) not used
  - Users accepting a wrong result
  - Barge-in
  - Users starting their own contribution
- Left: 443 resulting adjacent S/U utterances

Results – Nr of words
               Explicit    Implicit
  ~Problems    1.68        3.21
  Problems     3.44        7.12

Results – Empty turns (%)
               Explicit    Implicit
  ~Problems    0%          3.4%
  Problems     2.6%        10.3%

Results – Marked word order %
               Explicit    Implicit
  ~Problems    3.3%        1.2%
  Problems     4.4%        26.9%

Results – Yes/No
                      ~Problems    Problems
  Explicit   Yes      92.8%        6.1%
             No       0%           56.6%
             Other    7.1%         37.1%
  Implicit   Yes      0%           0%
             No       0%           15.4%
             Other    100% ?       84.6%

Results – Repeated/Corrected/New
                          ~Problems    Problems
  Explicit   Repeated     8.5%         23.9%
             Corrected    0%           72.6%
             New          11.4%        12.4%
  Implicit   Repeated     2.4%         61.0%
             Corrected    0%           92.3%
             New          53.6%        36.5%

First conclusion
- People use more negative cues when there are problems
- And even more so for implicit confirmations (vs.
  explicit ones)

How well can you classify
- Using individual features
  - Look at precision/recall
  - Explicit: absence of confirmation
  - Implicit: non-zero number of corrections
- Multiple features
  - Used memory-based learning
  - 97% accuracy (majority baseline 68%)
  - Confirm + Correct is winning, although individually less good
  - This is overall, right ? How about for explicit vs. implicit ?

BUT !!!
How many of these features are available on-line ?
  Positive               Negative
  Short turns            Long turns
  Unmarked word order    Marked word order
  Confirm                Disconfirm
  Answer                 No answer
  No corrections ?       Corrections ?
  No repetitions ?       Repetitions ?
  New info ?             No new info ?

What else can we throw at it ?
- Prosody (next paper)
- Lexical information
- Acoustic confidence scores
  - Maybe also of previous utterances
- Repetitions/Corrections/New info on transcript ?
- … …

Papers
- Error Detection in Spoken Human-Machine Interaction [E. Krahmer, M. Swerts, M. Theune, M. Weegels]
- Problem Spotting in Human-Machine Interaction [E. Krahmer, M. Swerts, M. Theune, M. Weegels]
- The Dual of Denial: Disconfirmations in Dialogue and Their Prosodic Correlates [E. Krahmer, M. Swerts, M. Theune, M. Weegels]

Goals
- Investigate the prosodic correlates of disconfirmations
  - Is this slightly different than before ? (i.e. now looking at any corrections ? Answer: No)
- Looked at prosody on "NO" as a go_on vs a go_back:
  - Do you want to fly from Pittsburgh ?
  - Shall I summarize your trip ?
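
Sketch: precision/recall of a single cue
Looking back at the "How well can you classify" slide, the single-feature detectors (explicit: absence of confirmation; implicit: non-zero number of corrections) can be scored with precision and recall roughly as follows. This is only a minimal sketch: the Turn record, its field names and the toy turns are hypothetical illustrations, not the papers' data or annotation scheme.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One user turn following a verification question (hypothetical record)."""
    question_type: str    # "explicit" or "implicit"
    confirmed: bool       # user turn contains a confirmation ("yes")
    num_corrections: int  # number of corrected slot values in the user turn
    problem: bool         # gold label: the preceding system turn contained an error


def flags_problem(turn: Turn) -> bool:
    """Single-cue detector taken from the slide.

    Explicit verification: absence of confirmation signals a problem.
    Implicit verification: a non-zero number of corrections signals a problem.
    """
    if turn.question_type == "explicit":
        return not turn.confirmed
    return turn.num_corrections > 0


def precision_recall(turns):
    """Precision and recall of flags_problem against the gold labels."""
    tp = sum(1 for t in turns if flags_problem(t) and t.problem)
    fp = sum(1 for t in turns if flags_problem(t) and not t.problem)
    fn = sum(1 for t in turns if not flags_problem(t) and t.problem)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented toy turns, for illustration only (NOT the corpus from the papers).
TOY_TURNS = [
    Turn("explicit", True, 0, False),   # confirmed, no problem
    Turn("explicit", False, 0, True),   # not confirmed, problem
    Turn("explicit", False, 0, False),  # "other" answer, but no problem
    Turn("explicit", True, 0, True),    # user accepted a wrong value
    Turn("implicit", False, 2, True),   # corrections, problem
    Turn("implicit", False, 0, False),  # no corrections, no problem
    Turn("implicit", False, 0, True),   # problem left uncorrected
    Turn("implicit", False, 1, True),   # correction, problem
]

if __name__ == "__main__":
    p, r = precision_recall(TOY_TURNS)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Both cues depend on knowing what the user actually said, which is exactly the on-line availability concern raised on the "BUT !!!" slide.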
Human-human
- Higher pitch range, longer duration
- Preceded by a longer delay
- High H% boundary tone
- Expected to see same behavior for disconfirmation in human-machine

Prosodic correlates
  Features         POSITIVE ('go on')    NEGATIVE ('go back')
  Boundary tone    Low                   High
  Duration         Short                 Long
  Delay            Short                 Long
  Pause            Short                 Long
  Pitch range      Low                   High
Yes, the correlations are there as expected

Perceptual analysis
- Took 40 "No" from No+stuff: 20 go_on and 20 go_back (note that some features are lost this way…)
- Forced choice randomized task, w/ no feedback; 25 native speakers of Dutch
- Results
  - 17 go_on correctly identified above chance
  - 15 go_back correctly identified above chance; but also 1 incorrectly identified above chance

Discussion
Q1: Blurred relationships …
- Confidence annotation
- Go_on / Go_back signal
- Is that the same as corrections ?
- Is that the most general case for responses to implicit/explicit verifications, or should we have a separate detector ?
Q2: What other features could we throw at these problems ? What are the "most juicy" ones ?

Discussion
Q3: For implicit confirms, are these different in terms of induced response behavior ?
- When do you want to leave Pittsburgh ?
- Travelling from Pittsburgh … when do you want to leave ?
- When do you want to leave from Pittsburgh to Boston ?
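
Sketch: what "above chance" could mean with 25 listeners
The perceptual analysis reports stimuli as identified "above chance" by 25 listeners in a two-way forced choice. The slides do not say which statistical test or significance level was used, so the following is only an illustration under assumptions: a one-sided binomial test against p = 0.5 at alpha = 0.05. The function names are made up for this sketch.

```python
from math import comb


def upper_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_votes_above_chance(n_listeners: int, alpha: float = 0.05):
    """Smallest number of agreeing listeners whose one-sided binomial
    tail probability under pure guessing (p = 0.5) is at most alpha.

    The test and the alpha level are assumptions, not taken from the paper.
    Returns None if even unanimous agreement does not reach alpha.
    """
    for k in range(n_listeners + 1):
        if upper_tail(n_listeners, k) <= alpha:
            return k
    return None


if __name__ == "__main__":
    n = 25  # number of listeners reported on the slide
    k = min_votes_above_chance(n)
    print(f"{k} of {n} listeners must agree (one-sided, alpha = 0.05)")
```

Under these assumptions a single "No" stimulus counts as classified above chance once 18 of the 25 listeners give the same label. The same threshold applies to the one go_back stimulus that was labelled incorrectly above chance.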