Lecture 22
Spoken Dialogue Systems
CS 4705

Talking to a Machine... and Getting an Answer
• Today’s spoken dialogue systems make it possible to accomplish real tasks, over the phone, without talking to a person
  – Real-time speech technology enables real-time interaction
  – Speech recognition and understanding are ‘good enough’ for limited, goal-directed interactions
  – Careful dialogue design can be tailored to the capabilities of the component technologies
    • Limited domain
    • Judicious use of system initiative vs. mixed initiative

Why is Dialogue Different?
• Different phenomena to model
  – Turn-taking
  – Grounding
  – Speech acts/dialogue acts
• New problems to handle
  – How much flexibility to allow: initiative strategies
  – How to deal with error: confirmation strategies
• How to evaluate ‘success’

Turns and Utterances
• Dialogue is characterized by turn-taking: who should talk next, and when they should talk
• How do we identify turns in recorded speech?
  – Little speaker overlap (around 5% in English, although this depends on domain)
  – But little silence between turns either
• How do we know when a speaker is giving up or taking a turn?

Simplified Turn-Taking Rule (Sacks et al)
• At each transition-relevance place of each turn:
  – If during this turn the current speaker has selected A as the next speaker, then A must speak next
  – If the current speaker does not select the next speaker, any other speaker may take the next turn
  – If no one else takes the next turn, the current speaker may take the next turn
• Transition-relevance places are where the structure of the language allows speaker shifts to occur.
• Adjacency pairs (set up next speaker expectations)
  – GREETING / GREETING
  – QUESTION / ANSWER
  – COMPLIMENT / DOWNPLAYER
  – REQUEST / GRANT
• Significant silence follows first element of adjacency pair
    A: Is there something bothering you or not? (1.0s)
    A: Yes or no? (1.5s)
    A: Eh?
    B: No.
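The simplified turn-taking rule can be sketched as a small function evaluated at one transition-relevance place. This is only an illustrative sketch: the function name `next_speaker`, its parameters, and the tie-break (the first other speaker to start wins) are assumptions made for the example and are not part of the rule as stated by Sacks et al. The rule also says only that the current speaker *may* continue; the sketch models the case where they do.

```python
from typing import Optional, Sequence


def next_speaker(
    current: str,
    selected: Optional[str],
    volunteers: Sequence[str],
) -> str:
    """Apply the simplified turn-taking rule at one transition-relevance place.

    current    -- speaker holding the turn
    selected   -- speaker the current speaker has selected, or None
    volunteers -- speakers trying to take the turn, in the order they start
    """
    # Rule 1: a selected next speaker must speak next.
    if selected is not None:
        return selected
    # Rule 2: otherwise any other speaker may self-select.
    # Assumption for this sketch: the first one to start gets the turn.
    for speaker in volunteers:
        if speaker != current:
            return speaker
    # Rule 3: if no one else takes the turn, the current speaker may continue.
    return current


print(next_speaker("A", "B", []))      # A selects B, e.g. by asking B a question
print(next_speaker("A", None, ["C"]))  # nobody selected, C self-selects
print(next_speaker("A", None, []))     # nobody else speaks, A continues
```

In the adjacency-pair example above, A's question selects B, so the silences of 1.0s and 1.5s are gaps in a turn that already belongs to B, which is why they are noticeable.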
Utterances
• Transition-relevance places typically occur at utterance boundaries, but how should we define ‘utterance’?
  – Spoken utterances are typically shorter, contain more pronouns, include repairs …, compared to written ones
  – Cue words, n-grams, prosody
  – A single sentence may span several turns
      A: We've got you on USAir flight 99
      B: Yep
      A: leaving on December 1.
  – Multiple sentences may occur in a single turn
      A: We've got you on USAir flight 99 leaving on December. Do you need a rental car?

Grounding
• Conversational participants must continually establish common ground (or, mutual belief)
• Hearer must ground a speaker's utterances (by making it clear that (believed) understanding has occurred), or else indicate that a grounding problem occurred
• How do hearers do this?
  – Acknowledgement
    • continuer / backchannel / acknowledgement token (also nods if vision available)
        A: … returning on U.S. flight one.
        C: Mm hmm
    • grounds A's utterance, and also returns turn
  – Display (stronger method)
    • display all or part of utterance to be grounded verbatim
        C: OK I'll take the 5ish flight on the 11th.
        A: On the 11th?
  – Request for repair indicates lack of grounding
        C: OK I'll take the 5ish flight on the 11th.
        A: Huh?
        C: I'll take the 5ish flight on the 11th.

Detecting Grounding Automatically
• Evidence of system misconceptions is reflected in user responses (Krahmer et al ’99, ’00)
  – Responses to incorrect verifications
    • contain more words (or are empty)
    • show marked word order (especially after implicit verifications)
    • contain more disconfirmations, more repeated/corrected info
  – ‘No’ after incorrect verifications vs. after other yes/no questions
    • has higher boundary tone
    • wider pitch range
    • longer duration
    • longer pauses before and after
    • more additional words after it
• User information state is reflected in the response (Shimojima et al ’99, ’01)
  – Echoic responses repeat prior information, as acknowledgment or request for confirmation
      S1: Then go to Keage station.
      S2: Keage.
  – Experiment:
    • Identify ‘degree of integration’ and prosodic features (boundary tone, pitch range, tempo, initial pause)
    • Perception studies to elicit ‘integration’ effect
  – Results: fast tempo, little pause and low pitch signal high integration

Dialogue Acts
• Austin (1962) observed that dialogue utterances are a kind of speaker action, or speech act
  – E.g.: performative sentences
      I name this ship the Titanic.
      I second this motion.
      I bet you five dollars it will snow tomorrow.

Types of Speech Acts
• Locutionary acts: the utterance of a sentence with a particular meaning
• Illocutionary acts: the act of asking, answering, promising, etc. in uttering a sentence
• Perlocutionary acts: the (often intentional) production of certain effects upon the feelings, thoughts, or actions of the addressee in uttering a sentence
• Example: You can't do that.
  – locutionary: utterance
  – illocutionary force: protesting
  – perlocutionary effect: stopping or annoying the hearer

Types of Illocutionary Acts
• Searle (1975) classifies illocutionary acts into five types:
  – Assertives: committing the speaker to something's being the case (suggesting, putting forward, swearing, boasting, concluding)
  – Directives: attempts by the speaker to get the addressee to do something (asking, ordering, requesting, inviting, advising, begging)
  – Commissives: committing the speaker to some future course of action (e.g., promising, planning, vowing, betting, opposing)
  – Expressives: expressing the psychological state of the speaker about a state of affairs (thanking, apologizing, welcoming, deploring)
  – Declarations: bringing about a different state of the world via the utterance (including many of the performative acts above: I resign, you're fired)

Types of Initiative
• System initiative
• User initiative
• ‘Mixed’ initiative

Some Representative Spoken Dialogue Systems
[Figure: representative systems arranged by initiative type (labels: System Initiative, Mixed Initiative, User) along a timeline (1980+, 1990+, 1993+, 1995+, 1997+, 1999+), with a "Deployed" label. Systems shown: Brokerage (Schwab-Nuance); E-Mail Access (myTalk); Directory Assistant (BNR); Air Travel (UA Info-SpeechWorks); Communicator (DARPA Travel); MIT Galaxy/Jupiter; Communications (Wildfire, Portico); Customer Care (HMIHY – AT&T); Banking (ANSER); ATIS (DARPA Travel); Train Schedule (ARISE); Multimodal Maps (Trains, Quickset).]

Types of Confirmation Strategies
    U: I want to go to Baltimore.
• Explicit
    S: Did you say you want to go to Baltimore?
• Implicit
    S: Baltimore?
    S: What time do you want to leave Baltimore?

Evaluating Dialogue Systems
• PARADISE framework (Walker et al ’00)
• “Performance” of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and by how it gets accomplished
  – Maximize Task Success
  – Minimize Costs
    • Efficiency Measures
    • Qualitative Measures

Task Success
• Task goals seen as Attribute-Value Matrix (AVM)
  – ELVIS e-mail retrieval task (Walker et al ’97)
  – Example: “Find the time and place of your meeting with Kim.”

    Attribute             Value
    Selection Criterion   Kim or Meeting
    Time                  10:30 a.m.
    Place                 2D516

• Task success defined by the match between the AVM values at the end of the dialogue and the “true” values for the AVM

Metrics
• Efficiency of the Interaction: User Turns, System Turns, Elapsed Time
• Quality of the Interaction: ASR rejections, Time Out Prompts, Help Requests, Barge-Ins, Mean Recognition Score (concept accuracy), Cancellation Requests
• User Satisfaction
• Task Success: perceived completion, information extracted

7/15/2016

Experimental Procedures
• Subjects given specified tasks
• Spoken dialogues recorded
• Cost factors, states, dialog acts automatically logged; ASR accuracy, barge-in hand-labeled
• Users specify task solution via web page
• Users complete User Satisfaction surveys
• Use multiple linear regression to model User Satisfaction as a function of Task Success and Costs; test for significant predictive factors

User Satisfaction: Sum of Many Measures
• Was Annie easy to understand in this conversation? (TTS Performance)
• In this conversation, did Annie understand what you said? (ASR Performance)
• In this conversation, was it easy to find the message you wanted? (Task Ease)
• Was the pace of interaction with Annie appropriate in this conversation? (Interaction Pace)
• In this conversation, did you know what you could say at each point of the dialog? (User Expertise)
• How often was Annie sluggish and slow to reply to you in this conversation? (System Response)
• Did Annie work the way you expected her to in this conversation? (Expected Behavior)
• From your current experience with using Annie to get your email, do you think you'd use Annie regularly to access your mail when you are away from your desk?
  (Future Use)

Performance Functions from Three Systems
• ELVIS User Sat. = .21 * COMP + .47 * MRS - .15 * ET
• TOOT User Sat. = .35 * COMP + .45 * MRS - .14 * ET
• ANNIE User Sat. = .33 * COMP + .25 * MRS + .33 * Help
  – COMP: User perception of task completion (task success)
  – MRS: Mean recognition score, i.e., recognition accuracy (cost)
  – ET: Elapsed time (cost)
  – Help: Help requests (cost)

Performance Model
• Perceived task completion and mean recognition score are consistently significant predictors of User Satisfaction
• Performance model useful for system development
  – Making predictions about system modifications
  – Distinguishing ‘good’ dialogues from ‘bad’ dialogues
• But can we also tell on-line when a dialogue is ‘going wrong’?

Identifying Misrecognitions, Awares and User Corrections Automatically (Hirschberg, Litman & Swerts)
• Collect corpus from interactive voice response system
• Identify speaker ‘turns’
  – that are incorrectly recognized
  – where speakers are first aware of an error
  – that correct misrecognitions
• Identify prosodic features of turns in each category and compare to other turns
• Use machine learning techniques to train a classifier to make these distinctions automatically

Turn Types
    TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
    User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening. [misrecognition]
    TOOT: Which city do you want to go to?
    User: New York. [correction, aware site]

Results
• Reduced error in predicting misrecognized turns to 8.64%
• Error in predicting ‘awares’ (12%)
• Error in predicting corrections (18-21%)

Conclusions
• Dialogue (especially spoken) presents new problems but also new possibilities
  – Recognizing speech introduces a new source of errors
  – Additional information provided in the speech stream offers new information about users’ intended meanings and emotional state (grounding of information, speech acts, reaction to system errors)
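The ELVIS and TOOT performance functions listed earlier are plain weighted sums, so they are easy to evaluate for a given dialogue. The sketch below copies the coefficients from the slides; everything else is an assumption made for illustration: the names `WEIGHTS` and `user_satisfaction`, the idea that the inputs have already been normalized (for example as z-scores, so that measures on different scales can be combined), and the two sets of measurements, which are hypothetical and not taken from any experiment.

```python
# Coefficients copied from the ELVIS and TOOT performance functions in the slides.
# COMP = perceived task completion, MRS = mean recognition score, ET = elapsed time.
WEIGHTS = {
    "ELVIS": {"COMP": 0.21, "MRS": 0.47, "ET": -0.15},
    "TOOT": {"COMP": 0.35, "MRS": 0.45, "ET": -0.14},
}


def user_satisfaction(system, measures):
    """Weighted sum of the measures for one dialogue.

    Assumption: every value in `measures` is already normalized
    (e.g. a z-score), so the weights are directly comparable.
    """
    return sum(weight * measures[name] for name, weight in WEIGHTS[system].items())


# Hypothetical normalized measurements for two dialogues (not real data).
good = {"COMP": 1.0, "MRS": 0.8, "ET": -0.5}   # task done, well recognized, quick
bad = {"COMP": -1.0, "MRS": -1.2, "ET": 1.5}   # task failed, poorly recognized, slow

for system in WEIGHTS:
    print(system,
          round(user_satisfaction(system, good), 3),
          round(user_satisfaction(system, bad), 3))
```

Because elapsed time carries a negative weight, a shorter-than-average dialogue (negative normalized ET) raises predicted satisfaction, which is one way a model like this can separate ‘good’ dialogues from ‘bad’ ones.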