Spoken Dialogue Systems CS 4705

Spoken Dialogue Systems
CS 4705
Boston Directions Corpus
• Erratum from last week: speakers did give
directions to a silent confederate, a Harvard
student
Talking to a Machine….and
(often) Getting an Answer
• Today’s spoken dialogue systems make it possible
to accomplish real tasks without talking to a
person
– Could Eliza do this?
– What do today’s systems do better?
– Do they actually embody human intelligence?
• Key advances
– Stick to goal-directed interactions in a limited domain
– Prime users to adopt the vocabulary you can recognize
– Partition the interaction into manageable stages
– Judicious use of system vs. mixed initiative
Today
• Utterances
• Turn-taking
• Initiative Strategies
• Grounding
• Confirmation Strategies
• The Waxholm Project: Word and Topic Prediction
• Evaluating Spoken Dialogue Systems
• Predicting System Errors and User Corrections
• More Examples
Dialogue vs. Monologue
• Monologue and dialogue both involve interpreting
– Information status
– Coherence issues
– Reference resolution
– Speech acts, implicature, intentionality
• Dialogue involves managing
– Turn-taking
– Grounding and repairing misunderstandings
– Initiative and confirmation strategies
Segmenting Speech into Utterances
• What is an ‘utterance’?
– Why is EOU detection harder than EOS?
– How does speech differ from text?
– A single syntactic sentence may span several turns
A: We've got you on USAir flight 99
B: Yep
A: leaving on December 1.
– Multiple syntactic sentences may occur in a single turn
A: We've got you on USAir flight 99 leaving on December 1. Do
you need a rental car?
– Intonational definitions: intonational phrase, breath
group, intonation unit
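A naive baseline makes the difficulty concrete: segmenting on silence alone conflates pauses with utterance boundaries. The sketch below is illustrative only; the word/timestamp input format and the 0.5 s threshold are assumptions, not from the slides.

```python
# Illustrative pause-based end-of-utterance (EOU) baseline.
# Assumes ASR output as (word, start_sec, end_sec) tuples; this format
# and the threshold are assumptions for the sketch.

def segment_by_pause(words, pause_threshold=0.5):
    """Group words into pseudo-utterances at long silences."""
    utterances, current = [], []
    prev_end = None
    for word, start, end in words:
        if prev_end is not None and start - prev_end > pause_threshold:
            utterances.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        utterances.append(current)
    return utterances

asr_output = [("we've", 0.0, 0.2), ("got", 0.2, 0.4), ("you", 0.4, 0.5),
              ("on", 0.5, 0.6), ("usair", 0.6, 1.0), ("flight", 1.0, 1.3),
              ("ninety", 1.3, 1.6), ("nine", 1.6, 1.8),
              ("leaving", 2.9, 3.2), ("on", 3.2, 3.3), ("december", 3.3, 3.8)]
print(segment_by_pause(asr_output))
# A long pause splits one syntactic sentence into two segments, and a short
# pause at a real boundary can merge two sentences: pauses alone are not EOUs.
```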
Turns and Utterances
• Dialogue is characterized by turn-taking: who
should talk next, and when they should talk
• How do we identify turns in recorded speech?
– Little speaker overlap (around 5% in English, although this
depends on the domain)
– But little silence between turns either
• How do we know when a speaker is giving up or
taking a turn? Holding the floor? How do we
know when a speaker is interruptable?
Simplified Turn-Taking Rule (Sacks et al)
• At each transition-relevance place (TRP) of each
turn:
– If current speaker has selected A as next speaker, then A
must speak next
– If current speaker does not select next speaker, any
other speaker may take next turn
– If no one else takes next turn, the current speaker may
take next turn
• TRPs are where the structure of the language allows speaker
shifts to occur (see the code sketch at the end of this slide)
• Adjacency pairs set up next speaker expectations
– GREETING/GREETING
– QUESTION/ANSWER
– COMPLIMENT/DOWNPLAYER
– REQUEST/GRANT
• ‘Significant silence’ is dispreferred
A: Is there something bothering you or not? (1.0s)
A: Yes or no? (1.5s)
A: Eh?
B: No.
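The simplified rule above can be read as a small decision procedure applied at each TRP. Below is a minimal sketch; the function name and data representation are assumptions made for illustration, not from Sacks et al.

```python
# A minimal, illustrative encoding of the simplified turn-taking rule
# applied at a transition-relevance place (TRP).

def next_speaker_at_trp(current, selected_next=None, willing_takers=()):
    """Return who speaks next at a TRP.

    current        -- the current speaker
    selected_next  -- speaker selected by the current turn (e.g. the
                      addressee of a question), or None
    willing_takers -- other speakers who self-select at this TRP
    """
    if selected_next is not None:           # rule 1: selected party must speak
        return selected_next
    if willing_takers:                       # rule 2: any other may self-select
        return willing_takers[0]
    return current                           # rule 3: current speaker may continue

# Example: a question addressed to B selects B as next speaker.
print(next_speaker_at_trp(current="A", selected_next="B"))      # -> B
print(next_speaker_at_trp(current="A", willing_takers=("C",)))  # -> C
print(next_speaker_at_trp(current="A"))                         # -> A
```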
Intonational Cues to Turntaking
• Continuation rise (L-H%) holds the floor
• H-H% requests a response
– L*H-H% (ynq contour)
– H* H-H% (highrise question contour)
• Intonational contours signal dialogue acts in
adjacency pairs
Timing and Turntaking
• How should we time responses in a SDS?
– Japanese studies of aizuchi (backchannels) (Koiso et al
‘98, Takeuchi et al ‘02) in natural speech
– Lexical information: particles ne and ka ending the
preceding turn, or (in telephone shopping) product
names
– Length of the preceding utterance, f0, loudness, and the
following pause are even more important in predicting turn-taking
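As a rough illustration of how such cues could be combined in an SDS, the heuristic below scores a candidate turn boundary from a final particle, pause length, and loudness. The features, thresholds, and weights are assumptions for this sketch, not the models reported by Koiso et al. or Takeuchi et al.

```python
# Illustrative heuristic for timing responses/backchannels; all thresholds
# and weights are invented for this sketch.

def respond_decision(final_particle, pause_sec, end_loudness_db, mean_loudness_db):
    """Decide whether to take the turn, backchannel, or keep listening."""
    score = 0.0
    if final_particle in ("ne", "ka"):          # turn-final particles invite response
        score += 2.0
    if pause_sec > 0.4:                          # a longer pause suggests a TRP
        score += 1.5
    if end_loudness_db < mean_loudness_db - 3:   # trailing off in loudness
        score += 1.0
    if score >= 3.0:
        return "take turn"
    if score >= 2.0:
        return "backchannel"
    return "keep listening"

print(respond_decision("ka", pause_sec=0.6, end_loudness_db=55, mean_loudness_db=60))
# -> "take turn"
```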
Turntaking and Initiative Strategies
• System Initiative
S: Please give me your arrival city name.
U: Baltimore.
S: Please give me your departure city name….
• User Initiative
S: How may I help you?
U: I want to go from Boston to Baltimore on November 8.
• `Mixed’ initiative
S: How may I help you?
U: I want to go to Boston.
S: What day do you want to go to Boston?
Grounding (Clark & Shaefer ‘89)
• Conversational participants don’t just take turns
speaking….they try to establish common ground
(or mutual belief)
• The hearer H must ground the speaker S's utterances by making it
clear whether or not understanding has occurred
• How do hearers do this?
S: I can upgrade you to an SUV at that rate.
– Continued attention
(U gazes appreciatively at S)
– Relevant next contribution
U: Do you have a RAV4 available?
– Acknowledgement/backchannel
U: Ok/Mhmmm/Great!
– Demonstration/paraphrase
U: An SUV.
– Display/repetition
U: You can upgrade me to an SUV at the same rate?
– Request for repair
U: I beg your pardon?
Detecting Grounding Behavior
• Evidence of system misconceptions reflected in user
responses (Krahmer et al ‘99, ‘00)
– Responses to incorrect verifications
• contain more words (or are empty)
• show marked word order (especially after implicit verifications)
• contain more disconfirmations, more repeated/corrected info
– ‘No’ after incorrect verifications vs. other ynq’s
• has higher boundary tone
• wider pitch range
• longer duration
• longer pauses before and after
• more additional words after it
• User information state reflected in response
(Shimojima et al ’99, ‘01)
– Echoic responses repeat prior information – as
acknowledgment or request for confirmation
S1: Then go to Keage station.
S2: Keage.
– Experiment:
• Identify ‘degree of integration’ and prosodic features
(boundary tone, pitch range, tempo, initial pause)
• Perception studies to elicit ‘integration’ effect
– Results: fast tempo, little pause and low pitch signal
high integration
Grounding and Confirmation Strategies
U: I want to go to Baltimore.
• Explicit
S: Did you say you want to go to Baltimore?
• Implicit
S: Baltimore. (H* L- L%)
S: Baltimore? (L* H- H%)
S: What time do you want to leave Baltimore?
• No confirmation
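One common way to operationalize the choice among these strategies, which the slide itself does not prescribe, is to pick a strategy from the recognizer's confidence score. The thresholds and prompt wording below are assumptions for this sketch.

```python
# Illustrative confidence-based choice among confirmation strategies.

def confirmation_prompt(slot_value, asr_confidence):
    """Pick explicit, implicit, or no confirmation for a recognized slot value."""
    if asr_confidence < 0.5:       # low confidence: confirm explicitly
        return f"Did you say you want to go to {slot_value}?"
    if asr_confidence < 0.8:       # medium: confirm implicitly in the next question
        return f"What time do you want to leave {slot_value}?"
    return None                    # high confidence: no confirmation, just proceed

print(confirmation_prompt("Baltimore", 0.42))
print(confirmation_prompt("Baltimore", 0.71))
print(confirmation_prompt("Baltimore", 0.93))
```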
The Waxholm Project at KTH
• tourist information
• Stockholm archipelago
• time-tables, hotels, hostels, camping
and dining possibilities.
• mixed initiative dialogue
• speech recognition
• multimodal synthesis
• graphic information
• pictures, maps, charts and time-tables
• Demos at
http://www.speech.kth.se/multimodal
Dialogue control state prediction
• Dialogue grammar specified by a number of states
• Each state associated with an action (database search, system question, …)
• Probable state determined from semantic features
• Transition probabilities from one state to another
• Dialogue control design tool with a graphical interface
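A minimal sketch of this idea: combine the transition probability from the previous state with the probability of each state given the observed semantic features, and pick the most probable next state. The state names, feature names, and probabilities below are invented for illustration, not Waxholm's.

```python
# Illustrative next-state prediction for a dialogue control grammar.
# All states, features, and probabilities are made up for the sketch.

TRANSITION = {                       # P(next_state | previous_state)
    "ask_date":  {"db_search": 0.6, "ask_date": 0.2, "clarify": 0.2},
    "db_search": {"show_result": 0.8, "clarify": 0.2},
}

STATE_GIVEN_FEATURE = {              # P(state | semantic feature)
    "QUEST-WHEN": {"db_search": 0.7, "ask_date": 0.2, "clarify": 0.1},
    "PLACE":      {"db_search": 0.6, "show_result": 0.3, "clarify": 0.1},
}

def predict_state(prev_state, features):
    """Score candidate states by transition probability times feature evidence."""
    scores = {}
    for state, p_trans in TRANSITION.get(prev_state, {}).items():
        p = p_trans
        for f in features:
            p *= STATE_GIVEN_FEATURE.get(f, {}).get(state, 0.05)
        scores[state] = p
    return max(scores, key=scores.get) if scores else "clarify"

print(predict_state("ask_date", ["QUEST-WHEN", "PLACE"]))   # -> db_search
```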
Waxholm Topics
TIME_TABLE Task: get a time-table.
Example: När går båten? (When does the boat leave?)
SHOW_MAP Task: get a chart or a map displayed.
Example: Var ligger Vaxholm? (Where is Vaxholm located?)
EXIST Task: display lodging and dining possibilities.
Example: Var finns det vandrarhem? (Where are there hostels?)
OUT_OF_DOMAIN Task: the subject is out of the domain.
Example: Kan jag boka rum? (Can I book a room?)
NO_UNDERSTANDING Task: no understanding of user intentions.
Example: Jag heter Olle. (My name is Olle.)
END_SCENARIO Task : end a dialog.
Example: Tack. (Thank you.)
Topic selection
The most probable topic is chosen from the semantic features F found in the utterance:
  topic = argmax_i P(t_i | F)
[Table: estimated P(topic | feature) for the semantic features OBJECT, QUEST-WHEN,
QUEST-WHERE, FROM-PLACE, AT-PLACE, TIME, PLACE, OOD, END, HOTEL, HOSTEL, ISLAND,
PORT, and MOVE against the topics TIME_TABLE, SHOW_MAP, FACILITY (EXIST),
NO_UNDERSTANDING, and OUT_OF_DOMAIN, with example values]
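A sketch of the selection rule follows, assuming (as the table suggests) per-feature topic probabilities that are combined naively into P(t_i | F). The probability values below are placeholders, not the Waxholm table entries.

```python
# Illustrative topic selection: t_hat = argmax_i P(t_i | F), with P(t_i | F)
# approximated by multiplying per-feature estimates P(t_i | f).

P_TOPIC_GIVEN_FEATURE = {
    "QUEST-WHEN": {"TIME_TABLE": 0.7, "SHOW_MAP": 0.1, "EXIST": 0.1,
                   "OUT_OF_DOMAIN": 0.05, "NO_UNDERSTANDING": 0.05},
    "FROM-PLACE": {"TIME_TABLE": 0.6, "SHOW_MAP": 0.2, "EXIST": 0.1,
                   "OUT_OF_DOMAIN": 0.05, "NO_UNDERSTANDING": 0.05},
}
TOPICS = ["TIME_TABLE", "SHOW_MAP", "EXIST", "OUT_OF_DOMAIN", "NO_UNDERSTANDING"]

def select_topic(features):
    """Return the topic maximizing the (naively combined) probability."""
    best_topic, best_p = None, -1.0
    for topic in TOPICS:
        p = 1.0
        for f in features:
            p *= P_TOPIC_GIVEN_FEATURE.get(f, {}).get(topic, 1.0 / len(TOPICS))
        if p > best_p:
            best_topic, best_p = topic, p
    return best_topic

# "När går båten?" might yield features like QUEST-WHEN and FROM-PLACE:
print(select_topic(["QUEST-WHEN", "FROM-PLACE"]))   # -> TIME_TABLE
```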
Topic prediction results
[Bar chart: % topic prediction errors for all raw data (12.7–12.9%), utterances
without extra-linguistic sounds (8.5–8.8%), and utterances with a complete parse
(2.9–3.1%), shown with and without "no understanding" utterances excluded]
User answers to questions?
The answers to the question:
“What weekday do you want to go?”
(Vilken veckodag vill du åka?)
• 22% Friday (fredag)
• 11% I want to go on Friday (jag vill åka på fredag)
• 11% I want to go today (jag vill åka idag)
• 7% on Friday (på fredag)
• 6% I want to go a Friday (jag vill åka en fredag)
• – are there any hotels in Vaxholm? (finns det några hotell i Vaxholm)
Examples of questions
and answers
Hur ofta åker du utomlands på semestern? (How often do you go abroad on vacation?)
Hur ofta reser du utomlands på semestern? (How often do you travel abroad on vacation?)
jag åker en gång om året kanske (I go maybe once a year)
jag åker ganska sällan utomlands på semester (I go abroad on vacation quite rarely)
jag åker nästan alltid utomlands under min semester (I almost always go abroad during my vacation)
jag åker ungefär 2 gånger per år utomlands på semester (I go abroad on vacation about 2 times a year)
jag åker utomlands nästan varje år (I go abroad almost every year)
jag åker utomlands på semestern varje år (I go abroad on vacation every year)
jag åker utomlands ungefär en gång om året (I go abroad about once a year)
jag är nästan aldrig utomlands (I am almost never abroad)
en eller två gånger om året (once or twice a year)
en gång per semester (once per vacation)
kanske en gång per år (maybe once a year)
ungefär en gång per år (about once a year)
åtminståne en gång om året (at least once a year)
nästan aldrig (almost never)
jag reser en gång om året utomlands (I travel abroad once a year)
jag reser inte ofta utomlands på semester det blir mera i arbetet (I don't often travel abroad on vacation, it's more for work)
jag reser reser utomlands på semestern vartannat år (I travel abroad on vacation every other year)
jag reser utomlands en gång per semester (I travel abroad once per vacation)
jag reser utomlands på semester ungefär en gång per år (I travel abroad on vacation about once a year)
jag brukar resa utomlands på semestern åtminståne en gång i året (I usually travel abroad on vacation at least once a year)
en gång per år kanske (once a year maybe)
en gång vart annat år (once every other year)
varje år (every year)
vart tredje år ungefär (about every third year)
nu för tiden inte så ofta (not so often these days)
varje år brukar jag åka utomlands (every year I usually go abroad)
Results
[Pie chart of answer types: reuse 52%, other 24%, ellipsis 18%, no reuse and no answer (4% and 2%)]
The Waxholm system
[Screenshot of a Waxholm dialogue: the user asks about boats from Stockholm to Vaxholm and about hotels and restaurants; the system answers and displays time-tables, maps, and hotel information]
How do we evaluate Dialogue Systems?
• PARADISE framework (Walker et al ’00)
• “Performance” of a dialogue system is affected
both by what gets accomplished by the user and
the dialogue agent and how it gets accomplished
– Maximize task success
– Minimize costs: efficiency measures and qualitative measures
What metrics should we use?
• Efficiency of the Interaction:User Turns,
System Turns, Elapsed Time
• Quality of the Interaction: ASR rejections,
Time Out Prompts, Help Requests, Barge-Ins,
Mean Recognition Score (concept accuracy),
Cancellation Requests
• User Satisfaction
• Task Success: perceived completion,
information extracted
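For concreteness, below is a small sketch of computing some of these metrics from a logged dialogue. The log record format is an assumption made for this sketch, not part of the PARADISE definition.

```python
# Illustrative extraction of efficiency/quality metrics from a dialogue log.

def interaction_metrics(log):
    """log: list of dicts like {"speaker": "user"/"system", "event": ..., "t": seconds}."""
    return {
        "user_turns":     sum(1 for e in log if e["speaker"] == "user"),
        "system_turns":   sum(1 for e in log if e["speaker"] == "system"),
        "elapsed_time":   log[-1]["t"] - log[0]["t"] if log else 0.0,
        "asr_rejections": sum(1 for e in log if e.get("event") == "asr_reject"),
        "timeouts":       sum(1 for e in log if e.get("event") == "timeout"),
        "help_requests":  sum(1 for e in log if e.get("event") == "help"),
        "barge_ins":      sum(1 for e in log if e.get("event") == "barge_in"),
    }

log = [{"speaker": "system", "event": "prompt", "t": 0.0},
       {"speaker": "user", "event": "asr_reject", "t": 4.2},
       {"speaker": "system", "event": "prompt", "t": 6.0},
       {"speaker": "user", "event": "utterance", "t": 9.5}]
print(interaction_metrics(log))
```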
User Satisfaction:
Sum of Many Measures
• Was Annie easy to understand
in this conversation? (TTS
Performance)
• In this conversation, did Annie
understand what you said?
(ASR Performance)
• In this conversation, was it
easy to find the message you
wanted? (Task Ease)
• Was the pace of interaction with
Annie appropriate in this
conversation? (Interaction Pace)
• In this conversation, did you
know what you could say at
each point of the dialog?
(User Expertise)
• How often was Annie sluggish
and slow to reply to you in this
conversation? (System
Response)
• Did Annie work the way you
expected her to in this
conversation? (Expected
Behavior)
• From your current experience
with using Annie to get your
email, do you think you'd use
Annie regularly to access your
mail when you are away from
your desk? (Future Use)
Performance Model
• Weights trained for each independent factor via
multiple regression modeling: how much does
each contribute to User Satisfaction?
• Result useful for system development
– Making predictions about system modifications
– Distinguishing ‘good’ dialogues from ‘bad’ dialogues
• But … can we also tell on-line when a dialogue is
‘going wrong’?
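A minimal sketch of the regression step: regress user satisfaction on normalized task success and cost factors with ordinary least squares to obtain the weights. The per-dialogue data below are invented; z-score normalization of each factor is included, as in PARADISE.

```python
# Illustrative PARADISE-style weight estimation via multiple linear regression.
import numpy as np

# Per-dialogue factors: [task_success(kappa), user_turns, asr_rejections]
X = np.array([[0.9, 8, 0], [0.7, 12, 1], [0.4, 20, 4],
              [0.95, 6, 0], [0.5, 15, 3]], float)
y = np.array([4.5, 3.8, 2.1, 4.8, 2.6])          # user satisfaction ratings

Xz = (X - X.mean(axis=0)) / X.std(axis=0)         # normalize each factor
A = np.column_stack([np.ones(len(y)), Xz])        # add intercept
weights, *_ = np.linalg.lstsq(A, y, rcond=None)   # multiple linear regression

print("intercept:", weights[0])
print("task success weight:", weights[1])
print("cost weights (turns, rejections):", weights[2:])
# Performance for a new dialogue = weighted sum of its normalized factors.
```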
Identifying Misrecognitions, Awares and User
Corrections Automatically (Hirschberg,
Litman & Swerts)
• Collect corpus from interactive voice response
system
• Identify speaker ‘turns’
– that are incorrectly recognized
– where speakers first become aware of an error
– that correct misrecognitions
• Identify prosodic features of turns in each
category and compare to other turns
• Use Machine Learning techniques to train a
classifier to make these distinctions automatically
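As an illustration of that last step only (not the authors' actual feature set or learner), the sketch below trains a simple classifier on a few prosodic features per turn; a decision tree stands in for the rule learner used in this line of work, and the data are invented.

```python
# Illustrative classifier for flagging misrecognized turns from prosodic features.
from sklearn.tree import DecisionTreeClassifier

# Per-turn features: [f0_max, rms_max, duration_sec, preceding_pause_sec, tempo]
X = [[220, 0.70, 2.1, 0.9, 3.1],
     [180, 0.45, 1.2, 0.2, 4.0],
     [250, 0.80, 2.8, 1.1, 2.7],
     [175, 0.40, 1.0, 0.1, 4.2],
     [240, 0.75, 2.5, 0.8, 2.9],
     [185, 0.50, 1.3, 0.3, 3.9]]
y = [1, 0, 1, 0, 1, 0]   # 1 = turn was misrecognized, 0 = recognized correctly

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(clf.predict([[230, 0.72, 2.3, 0.7, 3.0]]))   # -> likely misrecognized
```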
Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT.
How may I help you?
User: Hello. I would like trains from Philadelphia to New York
leaving on Sunday at ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York. [aware site; correction]
Results
• Error in predicting misrecognized turns reduced to 8.64%
• Error in predicting ‘awares’: 12%
• Error in predicting corrections: 18-21%
Features in repetition corrections (KTH)
[Bar chart: percentage of all repetitions for adults vs. children, by feature:
more clearly articulated, increased loudness, shifting of focus]
The August system
[Screenshot of an August dialogue: the animated agent answers questions about
August Strindberg, the population of Stockholm, and KTH’s Department of Speech,
Music and Hearing]
Initial experiments
• Studies on human-human conversation
• The Higgins domain (similar to Map Task)
• Using ASR in one direction to elicit error handling
behaviour
– Setup: the user speaks → ASR → the operator reads the recognition
output; the operator speaks → vocoder → the user listens
Non-Understanding Error Recovery (Skantze
’03)
• Humans tend not to signal non-understanding:
– O: Do you see a wooden house in front of you?
– U: I pass the wooden house now (ASR: YES CROSSING ADDRESS NOW)
– O: Can you see a restaurant sign?
• This leads to
– Increased experience of task success
– Faster recovery from non-understanding
Conclusions
• Spoken dialogue systems present new problems, but also new possibilities
– Recognizing speech introduces a new source of errors
– The speech stream provides additional information about users’
intended meanings and emotional state (grounding of information,
speech acts, reaction to system errors)
• Why spoken dialogue systems rather than web-based interfaces?