Error Detection and Correction in SDS
Julia Hirschberg
CS 4706
7/15/2016

Today
• Avoiding errors
• Detecting errors
  – From the user side: what cues does the user provide to indicate an error?
  – From the system side: how likely is it the system made an error?
• Dealing with Errors: what can the system do when it thinks an error has occurred?
• Evaluating SDS: evaluating ‘problem’ dialogues

Avoiding misunderstandings
• The problem
• By imitating human performance
  – Timing and grounding (Clark ’03)
  – Confirmation strategies
  – Clarification and repair subdialogues

Learning from Human Behavior: Features in repetition corrections (KTH)
[Figure: bar chart of the percentage of all repetitions (y-axis, 0 to 50) showing each feature, for adults vs. children. Features: more clearly articulated, increased loudness, shifting of focus.]

Learning from Human Behavior (Krahmer et al ’01)
• Learning from human behavior
  – ‘go on’ and ‘go back’ signals in grounding situations (implicit/explicit verification)
  – Positive: short turns, unmarked word order, confirmation, answers, no corrections or repetitions, new info
  – Negative: long turns, marked word order, disconfirmation, no answer, corrections, repetitions, no new info
  – Hypotheses supported but…
    • Can these cues be identified automatically?
    • How might they affect the design of SDS?
Systems Have Trouble Knowing When They’ve Made a Mistake
• Hard for humans to correct system misconceptions (Krahmer et al ’99)
    User: I want to go to Boston.
    System: What day do you want to go to Baltimore?
  – Easier: answering explicit requests for confirmation or responding to ASR rejections
    System: Did you say you want to go to Baltimore?
    System: I'm sorry. I didn't understand you. Could you please repeat your utterance?
• But constant confirmation or over-cautious rejection lengthens dialogue and decreases user satisfaction

…And Systems Have Trouble Recognizing User Corrections
• Probability of recognition failures increases after a misrecognition (Levow ’98)
• Corrections of system errors are often hyperarticulated (louder, slower, more internal pauses, exaggerated pronunciation), which leads to more ASR errors (Wade et al ’92, Oviatt et al ’96, Swerts & Ostendorf ’97, Levow ’98, Bell & Gustafson ’99)

Can Prosodic Information Help Systems Perform Better?
• If errors occur where speaker turns are prosodically ‘marked’…
  – Can we recognize turns that will be misrecognized by examining their prosody?
  – Can we modify our dialogue and recognition strategies to handle corrections more appropriately?

Approach
• Collect corpus from interactive voice response system
• Identify speaker ‘turns’
  – misrecognition: turns that are incorrectly recognized
  – aware site: turns where speakers are first aware of an error
  – correction: turns that correct misrecognitions
• Identify prosodic features of turns in each category and compare to other turns
• Use Machine Learning techniques to train a classifier to make these distinctions automatically

Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York.
[correction; aware site]

TOOT Dialogues
• Collected to study effects of differences in dialogue strategy on user performance and satisfaction (Litman & Pan ’99)
  – type of initiative (system, user, mixed)
  – type of confirmation (explicit, implicit, none)
  – adaptability condition
• Subjects
  – 39 summer students
  – 16/23 (F/M)
  – 20/19 (native speaker/non)
• Platform: combined over-the-phone ASR and TTS (Kamm et al ’97) with web access to train information
• Task: find train information for 4 scenarios
• Corpus for current study: 2328 speaker turns, 52 dialogues
• Misrecognitions
  – Overall word accuracy: 61%
  – Overall concept accuracy (CA): 71%
  – “I want to go to Boston from Philadelphia” (2 domain concepts) recognized as “I want to go to Boston” (one concept) = 50%

A Successful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I would like a train to New York City from Philadelphia on Sunday at ten thirty P M
S: I heard you say “go to New York City from Philadelphia on Sunday approximately at ten thirty p.m.” Do you want me to find the trains from Philadelphia to New York City on Sunday approximately at ten thirty p.m. now?
U: Yes.
S: I am going to get the train schedule for you....

Are Misrecognitions, Aware Turns, Corrections Measurably Different from Other Turns?
• For each type of turn:
  – For each speaker, for each prosodic feature, calculate mean values for e.g.
    all correctly recognized speaker turns and for all incorrectly recognized turns
  – Perform paired t-tests on these speaker pairs of means (e.g., for each speaker, pairing mean values for correctly and incorrectly recognized turns)

How: Prosodic Features Examined per Turn
• Raw prosodic/acoustic features
  – f0 maximum and mean (pitch excursion/range)
  – rms maximum and mean (amplitude)
  – total duration
  – duration of preceding silence
  – amount of silence within turn
  – speaking rate (estimated from syllables of recognized string per second)
• Normalized versions of each feature (compared to first turn in task, to previous turn in task, Z scores)

Distinguishing Correct Recognitions from Misrecognitions (NAACL ’00)
• Misrecognitions differ prosodically from correct recognitions in
  – F0 maximum (higher)
  – RMS maximum (louder)
  – turn duration (longer)
  – preceding pause (longer)
  – speaking rate (slower)
• Effect holds up across speakers and even when hyperarticulated turns are excluded

WER-Based Results
Misrecognitions are higher in pitch, louder, longer, more preceding pause and less internal silence

Feature          T stat    Mean Misrec - Mean Rec    P
F0 Max            5.78     25.84 Hz                  0.000
F0 Mean           1.52     1.56 Hz                   0.140
RMS Max           2.52     150.56                    0.020
RMS Mean         -1.82     -25.05                    0.080
Duration          9.94     2.13 sec                  0.000
Prior Pause       5.586    0.29 sec                  0.000
Tempo            -4.71     0.54 sps                  0.000
% Sil in Turn    -1.48     -.02%                     0.150

Predicting Turn Types Automatically
• Ripper (Cohen ’96) automatically induces rule sets for predicting turn types
  – greedy search guided by measure of information gain
  – input: vectors of feature values
  – output: ordered rules for predicting dependent variable and (X-validated) scores for each rule set
• Independent variables:
  – all prosodic features, raw and normalized
  – experimental conditions (adaptability of system, initiative type, confirmation style, subject, task)
  – gender, native/non-native status
  – ASR recognized string, grammar, and acoustic confidence score

ML
Results: WER-defined Misrecognition

Feature Set                           Estimated Error (SE)
Baseline                              32.35% (NA)
Prosody, ASR Conf, String, Grammar    8.64% (0.53)
ASR Conf, String, Grammar             14.83% (0.81)
ASR String                            18.00% (0.86)
ASR Conf                              18.91% (1.00)
Prosody                               19.20% (0.80)
Duration                              20.92% (0.85)

Best Rule-Set for Predicting WER
Using prosody, ASR conf, ASR string, ASR grammar
  if (conf <= -2.85) ^ (duration >= 1.27) ^ then F
  if (conf <= -4.34) then F
  if (tempo <= .81) then F
  if (conf <= -4.09) then F
  if (conf <= -2.46) ^ (str contains “help”) then F
  if (conf <= -2.47) ^ (ppau >= .77) ^ (tempo <= .25) then F
  if (str contains “nope”) then F
  if (dur >= 1.71) ^ (tempo <= 1.76) then F
  else T

Error Handling Strategies
• If systems can recognize their lack of recognition, how should they inform the user that they don’t understand (Goldberg et al ’03)?
  – System rephrasing vs. repetitions vs. statement of not understanding
  – Apologies
• What behaviors might these produce?
  – Hyperarticulation
  – User frustration
  – User repetition vs. rephrasing
• What lessons do we learn?
  – When users are frustrated they are generally harder to recognize accurately
  – When users are increasingly misrecognized they tend to be misrecognized more often and become increasingly frustrated
  – Apologies combined with rephrasing of system prompts tend to decrease frustration and improve WER: Don’t just repeat!
  – Users are better recognized when they rephrase their input
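
As a rough illustration of how the rules on the "Best Rule-Set for Predicting WER" slide could be applied to a single turn, here is a minimal Python sketch. Several things in it are assumptions rather than anything stated on the slides: the names `Turn` and `predict_correct` are hypothetical, "F" is read as "predicted misrecognized" and "T" as "predicted correctly recognized", "str contains" is treated as a case-insensitive substring match, and the first rule (which looks truncated on the slide) uses only its two visible conditions. The thresholds are copied from the slide as they appear.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """Features of one user turn (field names are illustrative)."""
    conf: float      # ASR acoustic confidence score
    duration: float  # turn duration in seconds
    tempo: float     # speaking rate in syllables per second
    ppau: float      # preceding pause in seconds
    string: str      # ASR recognized string


def predict_correct(turn: Turn) -> bool:
    """Return False (the slide's F) if any rule fires, else True (T)."""
    text = turn.string.lower()
    rules = [
        # Assumption: only the two conditions visible on the slide are used.
        turn.conf <= -2.85 and turn.duration >= 1.27,
        turn.conf <= -4.34,
        turn.tempo <= 0.81,
        turn.conf <= -4.09,
        turn.conf <= -2.46 and "help" in text,
        turn.conf <= -2.47 and turn.ppau >= 0.77 and turn.tempo <= 0.25,
        "nope" in text,
        turn.duration >= 1.71 and turn.tempo <= 1.76,
    ]
    # Every rule predicts the same class, so evaluation order does not
    # change the outcome and any() is equivalent to first-match.
    return not any(rules)
```

Because all the listed rules predict F and the default is T, the ordered rule list reduces to a single disjunction here; a rule set with mixed outcomes would need true first-match evaluation.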
Recognizing ‘Problematic’ Dialogues
• Hastie et al, “What’s the Trouble?” ACL 2002
• How to define a dialogue as problematic?
  – User satisfaction is low
  – Task is not completed
• How to recognize?
  – Train on a corpus of recorded dialogues (1242 DARPA Communicator dialogues)
  – Predict
    • User Satisfaction
    • Task Completion (0,1,2)
  – User Satisfaction features:

Results

Next Class
• Speech data mining
• HW3c due
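
Returning to the per-speaker comparison described earlier (compute each speaker's mean feature value over misrecognized turns and over correctly recognized turns, then run a paired t-test over those speaker pairs), the computation might be sketched as follows. This is a minimal standard-library sketch and not the code used in the study: the function names `speaker_pairs` and `paired_t` and the input layout are hypothetical, and only the t statistic and degrees of freedom are computed (a p-value would require a t-distribution table or a statistics package).

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def speaker_pairs(turns):
    """Build one (mean_misrecognized, mean_recognized) pair per speaker.

    turns: iterable of (speaker, feature_value, misrecognized) tuples.
    Speakers lacking either kind of turn are skipped.
    """
    mis, rec = defaultdict(list), defaultdict(list)
    for speaker, value, misrecognized in turns:
        (mis if misrecognized else rec)[speaker].append(value)
    return [(mean(mis[s]), mean(rec[s])) for s in sorted(mis) if s in rec]


def paired_t(pairs):
    """Return (t statistic, degrees of freedom) for a paired t-test."""
    diffs = [a - b for a, b in pairs]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two speakers")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1
```

A positive t statistic under this convention means the feature is higher on misrecognized turns, matching the "Mean Misrec - Mean Rec" direction used in the WER-based results table.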