Lecture 22 Spoken Dialogue Systems CS 4705

Talking to a Machine … and
Getting an Answer
• Today’s spoken dialogue systems make it possible
to accomplish real tasks, over the phone, without
talking to a person
– Real-time speech technology enables real-time
interaction
– Speech recognition and understanding are ‘good enough’
for limited, goal-directed interactions
– Careful dialogue design can be tailored to capabilities
of component technologies
• Limited domain
• Judicious use of system initiative vs. mixed
initiative
Why is Dialogue Different?
• Different phenomena to model
– Turn-taking
– Grounding
– Speech acts/dialogue acts
• New problems to handle
– How much flexibility to allow: initiative strategies
– How to deal with error: confirmation strategies
• How to evaluate ‘success’
Turns and Utterances
• Dialogue is characterized by turn-taking: who
should talk next, and when they should talk
• How do we identify turns in recorded speech?
– Little speaker overlap (around 5% in English, although
this depends on the domain)
– But little silence between turns either
• How do we know when a speaker is giving up or
taking a turn?
Simplified Turn-Taking Rule (Sacks et al.)
• At each transition-relevance place of each turn:
– If during this turn current speaker has selected A as the
next speaker, then A must speak next
– If current speaker does not select the next speaker, any
other speaker may take the next turn
– If no one else takes the next turn, the current speaker
may take the next turn
• Transition-relevance places are where the structure
of the language allows speaker shifts to occur.
• Adjacency pairs (set up next speaker expectations)
– GREETING / GREETING
– QUESTION / ANSWER
– COMPLIMENT / DOWNPLAYER
– REQUEST / GRANT
• Silence is significant when it follows the first element
of an adjacency pair
A: Is there something bothering you or not? (1.0s)
A: Yes or no? (1.5s)
A: Eh?
B: No.
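The simplified turn-taking rule above can be read as a small decision procedure. The following is a minimal sketch in Python, not part of the original slides: the function name `next_speaker`, its parameters, and the tie-breaking choice (the first volunteer to start talking wins) are assumptions made for illustration.

```python
from typing import Optional, Sequence


def next_speaker(current: str,
                 selected: Optional[str],
                 volunteers: Sequence[str]) -> str:
    """Apply the simplified turn-taking rule at a transition-relevance place.

    current    -- the speaker who holds the turn
    selected   -- the speaker chosen by `current` (None if nobody was selected)
    volunteers -- other speakers trying to take the turn, in order of onset
    """
    # Rule 1: a selected next speaker must speak next.
    if selected is not None:
        return selected
    # Rule 2: otherwise any other speaker may take the turn; picking the
    # first volunteer is an assumption of this sketch, not part of the rule.
    for speaker in volunteers:
        if speaker != current:
            return speaker
    # Rule 3: if no one else takes the turn, the current speaker may continue.
    return current
```

In the example dialogue above, A's question selects B, B stays silent, and A keeps talking; a system would need a timeout on top of this rule to model that fallback.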
Utterances
• Transition-relevance places typically occur at
utterance boundaries, but how should we define ‘utterance’?
– Spoken utterances typically shorter, contain more
pronouns, include repairs …, compared to written
– Cue words, ngrams, prosody
– A single sentence may span several turns
A: We've got you on USAir flight 99
B: Yep
A: leaving on December 1.
– Multiple sentences may occur in single turn
A: We've got you on USAir flight 99 leaving on December 1. Do
you need a rental car?
Grounding
• Conversational participants must continually
establish common ground (or, mutual belief)
• Hearer must ground a speaker's utterances (by
making it clear that (believed) understanding has
occurred), or else indicate that a grounding
problem occurred
• How do hearers do this?
– Acknowledgement
• continuer / backchannel / acknowledgement token
(also nods if vision available)
A: … returning on U.S. flight one.
C: Mm hmm
• grounds A's utterance, and also returns turn
– Display (stronger method)
• display all or part of utterance to be grounded
verbatim
C: OK I'll take the 5ish flight on the 11th.
A: On the 11th?
– Request for repair indicates lack of grounding
C: OK I'll take the 5ish flight on the 11th.
A: Huh?
C: I'll take the 5ish flight on the 11th.
Detecting Grounding Automatically
• Evidence of system misconceptions reflected in user
responses (Krahmer et al ‘99, ‘00)
– Responses to incorrect verifications
• contain more words (or are empty)
• show marked word order (especially after implicit verifications)
• contain more disconfirmations, more repeated/corrected info
– ‘No’ after incorrect verifications vs. after other yes/no questions
• has higher boundary tone
• wider pitch range
• longer duration
• longer pauses before and after
• more additional words after it
• User information state reflected in response
(Shimojima et al. ’99, ’01)
– Echoic responses repeat prior information – as
acknowledgment or request for confirmation
S1: Then go to Keage station.
S2: Keage.
– Experiment:
• Identify ‘degree of integration’ and prosodic features
(boundary tone, pitch range, tempo, initial pause)
• Perception studies to elicit ‘integration’ effect
– Results: fast tempo, little pause and low pitch signal
high integration
Dialogue Acts
• Austin (1962) observed that dialogue utterances
are a kind of speaker action, or speech act
– E.g.: performative sentences
I name this ship the Titanic.
I second this motion.
I bet you five dollars it will snow tomorrow.
Types of Speech Acts
• Locutionary acts: the utterance of a sentence with a
particular meaning
• Illocutionary acts: the act of asking, answering, promising,
etc. in uttering a sentence
• Perlocutionary acts: the (often intentional) production of
certain effects upon the feelings, thoughts, or actions of the
addressee in uttering a sentence
• You can't do that.
– locutionary: utterance
– illocutionary force: protesting
– perlocutionary effect: stopping or annoying the hearer
Types of Illocutionary Acts
• Searle’s (1975) classification of illocutionary acts:
– Assertives: committing the speaker to something's being the case
(suggesting, putting forward, swearing, boasting, concluding)
– Directives: attempts by the speaker to get the addressee to do
something (asking, ordering, requesting, inviting, advising, begging)
– Commissives: committing the speaker to some future course of
action (e.g., promising, planning, vowing, betting, opposing)
– Expressives: expressing the psychological state of the speaker
about a state of affairs (thanking, apologizing, welcoming,
deploring)
– Declarations: bringing about a different state of the world via the
utterance (including many of the performative acts above: I resign,
you're fired)
Types of Initiative
• System Initiative
• User Initiative
• ‘Mixed’ Initiative
Some Representative Spoken Dialogue Systems
[Figure: systems placed along a timeline (1980+, 1990+, 1993+, 1995+,
1997+, 1999+) and along an initiative axis (System Initiative, Mixed
Initiative, User), with deployed systems marked. Systems shown:]
• Brokerage (Schwab-Nuance)
• E-Mail Access (myTalk)
• Directory Assistant (BNR)
• Air Travel (UA Info-SpeechWorks)
• Communicator (DARPA Travel)
• MIT Galaxy/Jupiter
• Communications (Wildfire, Portico)
• Customer Care (HMIHY – AT&T)
• Banking (ANSER)
• ATIS (DARPA Travel)
• Train Schedule (ARISE)
• Multimodal Maps (Trains, Quickset)
Types of Confirmation Strategies
U: I want to go to Baltimore.
• Explicit
S: Did you say you want to go to Baltimore?
• Implicit
S: Baltimore?
S: What time do you want to leave Baltimore?
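The two strategies above differ only in how the system words its next prompt. The following is a minimal sketch in Python that reuses the slide's own prompts. The function names, the `confidence` input, and the 0.8 threshold are assumptions for illustration. The slides do not say how a system should choose between strategies, so the choice shown here (confirm explicitly when recognition confidence is low) is one possible policy, not the method of any system named in this lecture.

```python
def confirmation_prompt(destination: str, strategy: str) -> str:
    """Build the system's next prompt for a recognized destination."""
    if strategy == "explicit":
        # Ask a dedicated yes/no question about the recognized value.
        return f"Did you say you want to go to {destination}?"
    if strategy == "implicit":
        # Embed the recognized value in the next task question.
        return f"What time do you want to leave {destination}?"
    raise ValueError(f"unknown strategy: {strategy!r}")


def choose_strategy(confidence: float, threshold: float = 0.8) -> str:
    """Hypothetical policy: confirm explicitly when confidence is low."""
    return "explicit" if confidence < threshold else "implicit"
```

For example, `confirmation_prompt("Baltimore", choose_strategy(0.5))` produces the explicit prompt from the slide, at the cost of one extra dialogue turn.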
Evaluating Dialogue Systems
• PARADISE framework (Walker et al ’00)
• “Performance” of a dialogue system is affected
both by what gets accomplished by the user and
the dialogue agent and how it gets accomplished
• Maximize Task Success
• Minimize Costs
– Efficiency Measures
– Qualitative Measures
Task Success
• Task goals seen as Attribute-Value Matrix (AVM)
• Example: ELVIS e-mail retrieval task (Walker et al. ’97)
“Find the time and place of your meeting with Kim.”

Attribute              Value
Selection Criterion    Kim or Meeting
Time                   10:30 a.m.
Place                  2D516

• Task success defined by match between AVM values at end of
dialogue and “true” values for AVM
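The comparison between the final AVM and the “true” AVM can be sketched as follows. This is a simplified illustration, not the PARADISE measure itself: PARADISE computes a kappa coefficient over a confusion matrix of attribute values, whereas this sketch only reports the proportion of attributes that match exactly. The function name `avm_match` and the wrong value "2D517" are made up for the example.

```python
from typing import Dict


def avm_match(reference: Dict[str, str], observed: Dict[str, str]) -> float:
    """Fraction of reference attributes whose observed value matches exactly."""
    if not reference:
        raise ValueError("reference AVM must not be empty")
    hits = sum(1 for attr, value in reference.items()
               if observed.get(attr) == value)
    return hits / len(reference)


# "True" values for the ELVIS example task on the slide.
reference = {
    "Selection Criterion": "Kim or Meeting",
    "Time": "10:30 a.m.",
    "Place": "2D516",
}

# Hypothetical end-of-dialogue AVM with one wrong value (made up here).
observed = dict(reference, Place="2D517")

perfect = avm_match(reference, reference)   # all three attributes match
partial = avm_match(reference, observed)    # two of three attributes match
```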
Metrics
• Efficiency of the Interaction: User Turns,
System Turns, Elapsed Time
• Quality of the Interaction: ASR rejections,
Time Out Prompts, Help Requests, Barge-Ins,
Mean Recognition Score (concept accuracy),
Cancellation Requests
• User Satisfaction
• Task Success: perceived completion,
information extracted
Experimental Procedures
• Subjects given specified tasks
• Spoken dialogues recorded
• Cost factors, states, dialog acts automatically
logged; ASR accuracy, barge-in hand-labeled
• Users specify task solution via web page
• Users complete User Satisfaction surveys
• Use multiple linear regression to model User
Satisfaction as a function of Task Success and
Costs; test for significant predictive factors
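The regression step in the last bullet can be sketched with ordinary least squares. The code below is a minimal pure-Python illustration that solves the normal equations directly; it omits the significance tests mentioned above. The data are synthetic and noise-free, generated from a made-up function (satisfaction = 0.5 * success - 0.2 * cost) only to show that the fit recovers known weights. They are not results from any study, and the function names are assumptions.

```python
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(rows: Sequence[Sequence[float]],
               y: Sequence[float]) -> List[float]:
    """Least-squares weights [intercept, w1, ..., wk] via the normal equations."""
    xs = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(xs, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic (task success, cost) pairs; satisfaction follows the made-up
# function 0.5 * success - 0.2 * cost with no noise.
rows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0), (1.0, 2.0)]
satisfaction = [0.5 * s - 0.2 * c for s, c in rows]
weights = fit_linear(rows, satisfaction)  # [intercept, w_success, w_cost]
```

With real survey data the fit would not be exact, and each coefficient would then be tested for significance before being kept in the model.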
User Satisfaction:
Sum of Many Measures
• Was Annie easy to understand
in this conversation? (TTS
Performance)
• In this conversation, did Annie
understand what you said?
(ASR Performance)
• In this conversation, was it
easy to find the message you
wanted? (Task Ease)
• Was the pace of interaction with
Annie appropriate in this
conversation? (Interaction Pace)
• In this conversation, did you
know what you could say at
each point of the dialog?
(User Expertise)
• How often was Annie sluggish
and slow to reply to you in this
conversation? (System
Response)
• Did Annie work the way you
expected her to in this
conversation? (Expected
Behavior)
• From your current experience
with using Annie to get your
email, do you think you'd use
Annie regularly to access your
mail when you are away from
your desk? (Future Use)
Performance Functions from Three Systems
• ELVIS User Sat. = .21 * COMP + .47 * MRS - .15 * ET
• TOOT User Sat. = .35 * COMP + .45 * MRS - .14 * ET
• ANNIE User Sat. = .33 * COMP + .25 * MRS + .33 * Help
– COMP: User perception of task completion (task
success)
– MRS: Mean recognition score (cost)
– ET: Elapsed time (cost)
– Help: Help requests (cost)
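The performance functions above are weighted sums, so applying one to a new dialogue is a one-line computation once the predictors are on a common scale. In PARADISE the predictors are normalized to z-scores before the regression, which is what makes the coefficients comparable. The sketch below copies the ELVIS and TOOT weights from the slide; the helper names and the use of the population standard deviation in `zscores` are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence

# Coefficients copied from the slide; the sign is folded into the weight.
ELVIS = {"COMP": 0.21, "MRS": 0.47, "ET": -0.15}
TOOT = {"COMP": 0.35, "MRS": 0.45, "ET": -0.14}


def zscores(values: Sequence[float]) -> List[float]:
    """Normalize raw measurements to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population std. dev. (an assumption here)
    if sigma == 0:
        raise ValueError("cannot normalize a constant measure")
    return [(v - mu) / sigma for v in values]


def predicted_satisfaction(weights: Dict[str, float],
                           normalized: Dict[str, float]) -> float:
    """Weighted sum of normalized predictors for one dialogue."""
    return sum(w * normalized[name] for name, w in weights.items())


# A hypothetical dialogue that is one standard deviation above the mean
# on every predictor (values made up for illustration).
dialogue = {"COMP": 1.0, "MRS": 1.0, "ET": 1.0}
elvis_score = predicted_satisfaction(ELVIS, dialogue)
toot_score = predicted_satisfaction(TOOT, dialogue)
```

Because elapsed time carries a negative weight, a dialogue that is longer than average lowers the predicted satisfaction even when task completion and recognition are good.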
Performance Model
• Perceived task completion and mean recognition
score are consistently significant predictors of
User Satisfaction
• Performance model useful for system
development
– Making predictions about system modifications
– Distinguishing ‘good’ dialogues from ‘bad’ dialogues
• But can we also tell on-line when a dialogue is
‘going wrong’?
Identifying Misrecognitions, Awares and User
Corrections Automatically (Hirschberg,
Litman & Swerts)
• Collect corpus from interactive voice response
system
• Identify speaker ‘turns’
– that are incorrectly recognized
– where speakers first become aware of an error
– that correct misrecognitions
• Identify prosodic features of turns in each
category and compare to other turns
• Use Machine Learning techniques to train a
classifier to make these distinctions automatically
Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule
System. This is TOOT. How may I help you?
User: Hello. I would like trains from
Philadelphia to New York leaving on Sunday at
ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York. [correction, aware site]
Results
• Reduced error in predicting misrecognized turns to
8.64%
• Error in predicting ‘awares’: 12%
• Error in predicting corrections: 18-21%
Conclusions
• Dialogue (especially spoken) presents new
problems but also new possibilities
– Recognizing speech introduces a new source of errors
– Additional information provided in the speech stream
offers new information about users’ intended meanings,
emotional state (grounding of information, speech acts,
reaction to system errors)