Dialogue Systems
Julia Hirschberg
CS 4705
7/15/2016

Today
• Examples from English and Swedish
• Controlling the dialogue flow
  – State prediction
• Influencing user behavior
  – Entrainment
• Learning from human-human dialogue
  – User feedback
• Role of ‘personality’ in SDS
• Evaluating SDS

Issues in Designing SDS
• Coverage: functionality and vocabulary
• Dialogue control:
  – System
  – User
  – Mixed initiative
• Confirmation strategies:
  – Explicit
  – Implicit
  – None

The Waxholm Project at KTH
• Tourist information
• Stockholm archipelago
• time-tables, hotels, hostels, camping and dining possibilities
• Mixed initiative dialogue
• speech recognition
• multimodal synthesis
• Graphic information
• pictures, maps, charts and time-tables
• Demos at http://www.speech.kth.se/multimodal

The Waxholm system
[Figure: Waxholm demo screen with overlapping example utterances. Recoverable fragments: user questions about boats, hotels and restaurants in Waxholm, and system replies that the requested information is shown in a table or on a map.]

Dialogue Control: Predicting Dialogue State
• Dialog grammar specified by a number of states
  – Each state associated with an action
  – Database search, system question, …
  – Probable state determined from semantic features
• Train transition probabilities from state to state
• Dialog control design tool with a graphic interface

Waxholm Topics
TIME_TABLE
Task: get a time-table.
Example: När går båten? (When does the boat leave?)
SHOW_MAP
Task: get a chart or a map displayed.
Example: Var ligger Vaxholm? (Where is Vaxholm located?)
EXIST
Task: display lodging and dining possibilities.
Example: Var finns det vandrarhem? (Where are there hostels?)
OUT_OF_DOMAIN
Task: the subject is out of the domain.
Example: Kan jag boka rum? (Can I book a room?)
NO_UNDERSTANDING
Task: no understanding of user intentions.
Example: Jag heter Olle. (My name is Olle.)
END_SCENARIO
Task: end a dialog.
Example: Tack. (Thank you.)

Topic selection
FEATURES      TIME TABLE  SHOW MAP  FACILITY  NO UNDERSTANDING  OUT OF DOMAIN  END
OBJECT        .062        .312      .073      .091              .067           .091
QUEST-WHEN    .188        .031      .024      .091              .067           .091
QUEST-WHERE   .062        .688      .390      .091              .067           .091
FROM-PLACE    .250        .031      .024      .091              .067           .091
AT-PLACE      .062        .219      .293      .091              .067           .091
TIME          .312        .031      .024      .091              .067           .091
PLACE         .091        .200      .500      .091              .067           .091
OOD           .062        .031      .122      .091              .933           .091
END           .062        .031      .024      .091              .067           .909
HOTEL         .062        .031      .488      .091              .067           .091
HOSTEL        .062        .031      .122      .091              .067           .091
ISLAND        .333        .556      .062      .091              .067           .091
PORT          .125        .750      .244      .091              .067           .091
MOVE          .875        .031      .098      .091              .067           .091
Selected topic: $\arg\max_i \{\, p(t_i \mid F) \,\}$

Topic prediction results
[Figure: bar chart of topic prediction errors (%) for three conditions (raw data, no extralinguistic sounds, complete parse), each shown for all utterances and with “no understanding” excluded. Values shown: 12.9, 8.8, 12.7, 8.5, 3.1, 2.9.]

Entrainment (Adaptation, Accommodation, Alignment)
• Hypothesis: over time, people tend to adapt their communicative behavior to that of their conversational partner
• Issues
  – What are the dimensions of entrainment?
  – How rapidly do people adapt?
  – Does entrainment occur (on the human side) in human/computer conversations?
  – Can this be used to the system’s advantage? The user’s?

Varieties of Entrainment…
• Lexical: S and H tend over time to adopt the same method of referring to items in a discourse
A: It’s that thing that looks like a harpsichord.
B: So the harpsichord-looking thing…
....
B: The harpsichord…
• Phonological
  – Word pronunciation: voiced/voiceless /t/ in better
• Acoustic/Prosodic
  – Speaking rate, pitch range, choice of contour
• Discourse/dialogue/social
  – Marking of topic shift, turn-taking

The Vocabulary Problem
• Furnas et al ’87: the probability that 2 subjects will produce the same name for a command for common computer operations varied from .07 to .18
  – Remove a file: remove, delete, erase, kill, omit, destroy, lose, change, trash
  – With 20 synonyms for a single command, the likelihood that 2 people will choose the same one was 80%
  – With 25 commands, the likelihood that 2 people who choose the same term think it means the same command was 15%
• How can people possibly communicate?
  – They collaborate on choice of referring expressions

Early Studies of Priming Effects
• Hypothesis: Users will tend to use the vocabulary and syntax the system uses
  – Evidence from data collections in the field
• Systems should take advantage of this proclivity to prime users to speak in ways that the system can recognize well

User Responses to Vaxholm
The answers to the question: “What weekday do you want to go?” (Vilken veckodag vill du åka?)
• 22%  Friday (fredag)
• 11%  I want to go on Friday (jag vill åka på fredag)
• 11%  I want to go today (jag vill åka idag)
• 7%   on Friday (på fredag)
• 6%   I want to go a Friday (jag vill åka en fredag)
• -    are there any hotels in Vaxholm? (finns det några hotell i Vaxholm)

Verb Priming: How often do you go abroad on holiday?
(The two prompts differ only in the verb, åker vs. reser; both mean roughly ‘go’ or ‘travel’.)
Hur ofta åker du utomlands på semestern?
Hur ofta reser du utomlands på semestern?
jag åker en gång om året kanske jag åker ganska sällan utomlands på semester jag åker nästan alltid utomlands under min semester jag åker ungefär 2 gånger per år utomlands på semester jag åker utomlands nästan varje år jag åker utomlands på semestern varje år jag åker utomlands ungefär en gång om året jag är nästan aldrig utomlands en eller två gånger om året en gång per semester kanske en gång per år ungefär en gång per år åtminståne en gång om året nästan aldrig

jag reser en gång om året utomlands jag reser inte ofta utomlands på semester det blir mera i arbetet jag reser reser utomlands på semestern vartannat år jag reser utomlands en gång per semester jag reser utomlands på semester ungefär en gång per år jag brukar resa utomlands på semestern åtminståne en gång i året en gång per år kanske en gång vart annat år varje år vart tredje år ungefär nu för tiden inte så ofta varje år brukar jag åka utomlands

Results
[Figure: pie chart of response types: reuse 52%, other 24%, ellipse 18%; “no reuse” and “no answer” account for the remaining 4% and 2%.]

Lexical Entrainment in Referring Expressions
• Choice of Referring Expressions: Informativeness vs. availability (basic level or not) vs. saliency vs. recency
• Gricean prediction
  – People use descriptions that minimally but effectively distinguish among items in the discourse
• Garrod & Anderson ’87 Output/Input Principle
  – Conversational partners formulate their current utterance according to the model used to interpret their partner’s most recent utterance
• Clark, Brennan, et al’s Conceptual Pacts
  – People make Conceptual Pacts with respect to appropriate referring expressions with particular conversational partners
  – They are loath to abandon these even when shorter expressions are possible

Entrainment in Spontaneous Speech
S13: the orange M&M looking kind of scared and then a one on the bottom left and a nine on the bottom right
S12: alright I have the exact same thing I just had it's an M&M looking scared that's orange
S13: yeah the scared M&M guy yeah
S12: framed mirror and the scared M&M on the lower right
S13: and it's to the right of the scared M&M guy
S13: yeah and the iron should be on the same line as the frightened M&M kind of like an L
S12: to the left of the scared M&M to the right of the onion and above the iron

Extraterrestrial vs. Alien I
S11: okay in the middle of the card I have an extraterrestrial figure…
S11: okay middle of the card I have the extraterrestrial …
S10: I've got the blue lion with the extraterrestrial on the lower right
S11: okay I have the extraterrestrial now and then I have the eye at the bottom right corner
S10: my extraterrestrial's gone

Extraterrestrial vs. Alien II
S03: okay I have a blue lion and then the extraterrestrial at the lower right corner
S11: mm I'll pass I have the alien with an eye in the lower right corner
S03: um I have just the alien so I guess I'll match that
-----
S10: yes now I've got that extraterrestrial with the yellow lion and the money …
S12: oh now I have the blue lion in the center with our little alien buddy in the right hand corner
S10: with the alien buddy so I'm gonna match him with the single blue lion okay I've got our alien with the eye in the corner

Timing and Voice Quality
• Guitar & Marchinkoski ’01:
  – How early do we start to adapt to others’ speech?
  – Do children adapt their speaking rate to their mother’s speech?
• Study:
  – 6 mothers spoke with their own (normally speaking) 3-yr-olds (3M, 3F)
  – Mothers’ rates significantly reduced (B) or not (A) in A-B-A-B design
• Results:
  – 5/6 children reduced their rates when their mothers spoke more slowly

• Sherblom & La Riviere ’87: How are speech timing and voice quality affected by a non-familiar conversational partner?
• Study:
  – 65 pairs of undergraduates asked to discuss a ‘problem situation’ together
  – Utter a single sentence before and after the conversation
  – Sentences compared for speaking rate, utterance length and vocal jitter
• Results:
  – Substantial influence of partner on all 3 measures
  – Interpersonal uncertainty and differences in arousal influenced degree of adaptation

Amplitude and Response Latency
• Coulston et al ’02:
  – Do humans adapt to the behavior of non-human partners?
  – Do children speak more loudly to a loud animated character?
• Study:
  – 24 children aged 7-10 interacted with an extroverted, loud animated character and with an introverted, soft character (TTS voices)
  – Multiple tasks using different amplitude ranges
  – Human/TTS amplitudes and latencies compared
• Results:
  – 79-94% of children adapted their amplitude, bidirectionally
  – Also adapted their response latencies (mean 18.4%), bidirectionally

Social Status and Entrainment
• Azuma ’97: Do speakers adapt to the style of other social classes?
• Study: Emperor Hirohito visits the countryside
  – Corpus-based study of speech style of Japanese Emperor Hirohito during chihoo jyunkoo (‘visits to countryside’), 1946-54
  – Published transcripts of speeches
• Findings:
  – Emperor Hirohito converged his speech style to that of listeners lower in social status
  – Choice of verb forms, pronouns no longer those of person with highest authority
  – Perceived as like those of a (low-status) mother

Socio-Cultural Influences and Entrainment
• Co-teachers adapt teaching styles (Roth ’05)
  – Social context
    • High school in NE with predominantly African-American student body
    • Cristobal: Cuban-African-American teacher
    • Chris: new Italian-American teacher
  – Adaptation of Chris to Cristobal
    • Catch phrases (e.g. right!, really really hot) and their production: pitch and intensity contours
    • Pitch ‘matching’ across speakers
  – Mimesis vs entrainment

Conclusions for SDS
• Systems can make use of user tendency to entrain to system vocabulary
• Should systems also entrain to users?
  – CMU’s Let’s Go adapts confirmation prompts to non-native speech: finds closest match to user input in system vocabulary

Evidence from Human Performance
• Users provide explicit positive and negative feedback
• Corpus-based vs. laboratory experiments: do these tell us different things?
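The "closest match to user input in system vocabulary" idea noted for Let's Go above can be illustrated with a small sketch. This is not the Let's Go implementation: it assumes plain character-level similarity from Python's standard `difflib` module as a stand-in for whatever matching that system uses, and the vocabulary, function names, and prompt wording are all hypothetical.

```python
import difflib

# Hypothetical system vocabulary (invented for illustration).
SYSTEM_VOCABULARY = ["airport", "downtown", "university", "forbes avenue"]


def closest_system_term(user_input, vocabulary=SYSTEM_VOCABULARY, cutoff=0.6):
    """Return the vocabulary entry most similar to user_input, or None.

    Similarity is difflib's character-level ratio; entries scoring below
    `cutoff` are ignored so that unrelated input is not force-matched.
    """
    matches = difflib.get_close_matches(
        user_input.lower(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


def confirmation_prompt(user_input):
    """Build a confirmation prompt that uses the system's own wording."""
    term = closest_system_term(user_input)
    if term is None:
        return "Sorry, where would you like to go?"
    return f"Going to {term}. Is that right?"
```

Confirming with the system's own term, rather than echoing the user's form, is what gives the user a recognizable target to entrain to. A deployed system would more plausibly compare recognizer hypotheses or pronunciations than spellings.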
The August system
[Figure: the August system with overlapping example exchanges. Recoverable fragments: user questions about the agent's name, Strindberg, and Stockholm, and system replies, including that information is shown on the map.]

Adapt – demonstration of “complete” system

Feedback and ‘Grounding’: Bell & Gustafson ’00
• Positive and negative
  – Previous corpora: August system
    • 18% of users gave pos or neg feedback in subcorpus
    • Push-to-talk
• Corpus: Adapt system
  – 50 dialogues, 33 subjects, 1845 utterances
  – Feedback utterances labeled w/
    • Positive or negative
    • Explicit or implicit
    • Attention/Attitude
• Results:
  – 18% of utterances contained feedback
  – 94% of users provided feedback
  – 65% positive, 2/3 explicit, equal amounts of attention vs. attitude
  – Large variation
    • Some subjects provided feedback at almost every turn
    • Some never did
• Utility of study:
  – Use positive feedback to model the user better (preferences)
  – Use negative feedback in error detection

The HIGGINS domain
This is a 3D test environment
• The primary domain of HIGGINS is city navigation for pedestrians.
• Secondarily, HIGGINS is intended to provide simple information about the immediate surroundings.
Initial experiments
• Studies on human-human conversation
• The Higgins domain (similar to Map Task)
• Using ASR in one direction to elicit error handling behaviour
[Figure: experimental setup. The user speaks and the operator reads the ASR output; the operator speaks and the user listens through a vocoder.]

Non-Understanding Error Recovery (Skantze ’03)
• Humans tend not to signal non-understanding:
O: Do you see a wooden house in front of you?
U (ASR output): YES CROSSING ADDRESS NOW (actually said: I pass the wooden house now)
O: Can you see a restaurant sign?
• This leads to
  – Increased experience of task success
  – Faster recovery from non-understanding

Personality and Computer Systems
• Early PC-era reports that significant others were jealous of the time their partners spent with their computers
• Reeves & Nass, The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places, 1996
  – Evolution explains the anthropomorphization of the PC
    • Humans evolved over millions of years without media
    • Proper response to any stimulus was critical to survival
    • Human psychology and physiological responses well developed before media invented
    • Ergo, our bodies and minds react to media, immediately and fundamentally, as if they were real

People See ‘Personality’ Everywhere
• Humans assess personality of another (human or otherwise) quickly, with minimal clues
• Perceived computer personality strongly affects how we evaluate the computer and information it provides
• Experiments:
  – Created “dominant” and “submissive” computer interfaces and asked subjects to use them to solve hypothetical problems
  – Max (dominant) used assertive language, showed higher confidence in the information displayed (via a numeric scale), always presented its own analysis of the problem first
  – Linus (submissive) phrased information more tentatively, rated its own information at lower confidence levels, and allowed human to discuss problem first
  – Each used alternately by people whose personalities had previously been identified as being either dominant or submissive

User Reactions
• Users described Max and Linus in human terms: aggressive, assertive, authoritative vs. shy, timid, submissive
  – Users correctly identified machines more like themselves
  – Users rated machines more like themselves as better computers even though the content received was exactly the same
  – Users rated their own performance better when machine’s personality matched theirs
• People more frank when rating a computer if questionnaire presented on another machine
• Subjects thought highly of computers that praised them, even if praise clearly undeserved

Personality in SDS
• Mairesse & Walker ’07 PERSONAGE (PERSONAlity GEnerator)
  – ‘Big 5’ personality trait model: extroversion, neuroticism, agreeableness, conscientiousness, openness to experience
  – Attempts to generate “extroverted” language based on traits associated with extroversion in psychology literature
  – Demo: find your personality type

Conclusions for SDS
• Systems can be designed to convey different personalities
• Can they recognize users’ personalities and entrain to them?
• Should they?

Evaluating Dialogue Systems
• PARADISE framework (Walker et al ’00)
• “Performance” of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and how it gets accomplished
[Figure: objectives hierarchy. Maximize Task Success; Minimize Costs (Efficiency Measures, Qualitative Measures).]

Task Success
• Task goals seen as Attribute-Value Matrix
ELVIS e-mail retrieval task (Walker et al ’97)
“Find the time and place of your meeting with Kim.”
Attribute            Value
Selection Criterion  Kim or Meeting
Time                 10:30 a.m.
Place                2D516
• Task success defined by match between AVM values at end of the dialogue with “true” values for AVM

Metrics
• Efficiency of the Interaction: User Turns, System Turns, Elapsed Time
• Quality of the Interaction: ASR rejections, Time Out Prompts, Help Requests, Barge-Ins, Mean Recognition Score (concept accuracy), Cancellation Requests
• User Satisfaction
• Task Success: perceived completion, information extracted

Experimental Procedures
• Subjects given specified tasks
• Spoken dialogues recorded
• Cost factors, states, dialog acts automatically logged; ASR accuracy, barge-in hand-labeled
• Users specify task solution via web page
• Users complete User Satisfaction surveys
• Use multiple linear regression to model User Satisfaction as a function of Task Success and Costs; test for significant predictive factors

User Satisfaction: Sum of Many Measures
• Was Annie easy to understand in this conversation? (TTS Performance)
• In this conversation, did Annie understand what you said? (ASR Performance)
• In this conversation, was it easy to find the message you wanted? (Task Ease)
• Was the pace of interaction with Annie appropriate in this conversation? (Interaction Pace)
• In this conversation, did you know what you could say at each point of the dialog? (User Expertise)
• How often was Annie sluggish and slow to reply to you in this conversation? (System Response)
• Did Annie work the way you expected her to in this conversation? (Expected Behavior)
• From your current experience with using Annie to get your email, do you think you'd use Annie regularly to access your mail when you are away from your desk? (Future Use)

Performance Functions from Three Systems
• ELVIS User Sat. = .21 * COMP + .47 * MRS - .15 * ET
• TOOT User Sat. = .35 * COMP + .45 * MRS - .14 * ET
• ANNIE User Sat. = .33 * COMP + .25 * MRS + .33 * Help
  – COMP: User perception of task completion (task success)
  – MRS: Mean recognition accuracy (cost)
  – ET: Elapsed time (cost)
  – Help: Help requests (cost)

Performance Model
• Perceived task completion and mean recognition score are consistently significant predictors of User Satisfaction
• Performance model useful for system development
  – Making predictions about system modifications
  – Distinguishing ‘good’ dialogues from ‘bad’ dialogues
• But can we also tell on-line when a dialogue is ‘going wrong’?

Next
• Generation: Summarization
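Looking back at the PARADISE performance functions, the regression step from the experimental procedure can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each measure is z-score normalized before an ordinary least-squares fit, the function names are hypothetical, and the per-dialogue numbers are invented. It does not perform the significance testing the slides mention.

```python
import numpy as np


def zscore(values):
    """Scale a measure to zero mean and unit variance (assumed normalization)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def fit_performance_function(user_sat, measures):
    """Fit User Sat. = sum_k w_k * N(measure_k) + b by ordinary least squares.

    measures: dict mapping a measure name (e.g. "COMP") to per-dialogue values.
    Returns a dict of fitted weights keyed by measure name, plus "intercept".
    """
    names = list(measures)
    columns = [zscore(measures[name]) for name in names]
    columns.append(np.ones(len(user_sat)))  # intercept term
    design = np.column_stack(columns)
    target = np.asarray(user_sat, dtype=float)
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    fitted = {name: float(w) for name, w in zip(names, weights[:-1])}
    fitted["intercept"] = float(weights[-1])
    return fitted


# Hypothetical per-dialogue logs (invented for illustration, not from any study).
logs = {
    "COMP": [1, 0, 1, 1, 0, 1],                    # perceived task completion
    "MRS": [0.90, 0.60, 0.80, 0.70, 0.50, 0.95],   # mean recognition score
    "ET": [120, 300, 150, 200, 280, 100],          # elapsed time in seconds
}
survey_totals = [31, 14, 27, 22, 12, 33]           # hypothetical summed survey scores

print(fit_performance_function(survey_totals, logs))
```

Because the measures are normalized, the fitted weights are comparable across measures with different units, which is what lets a function such as the ELVIS one be read as a ranking of how much each factor contributes to User Satisfaction.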