Spoken Dialogue Systems Julia Hirschberg CS 4706 7/15/2016

Today
• Basic Conversational Agents
– ASR
– NLU
– Generation
– Dialogue Manager
• Dialogue Manager Design
– Finite State
– Frame-based
– Initiative: User, System, Mixed
• Information-State
– Dialogue-Act Detection
– Dialogue-Act Generation
– Evaluation
– Utility-based conversational agents
• MDP, POMDP
Conversational Agents
• AKA:
– Interactive Voice Response Systems
– Dialogue Systems
– Spoken Dialogue Systems
• Applications:
– Travel arrangements (Amtrak, United Airlines)
– Telephone call routing
– Tutoring
– Communicating with robots
– Anything with limited screen/keyboard
A travel dialog: Communicator
Call routing: ATT HMIHY
A tutorial dialogue: ITSPOKE
Conversational Structure
• Telephone conversations
– Stage 1: Enter a conversation
– Stage 2: Identification
– Stage 3: Establish joint willingness to converse
– Stage 4: First topic is raised, usually by caller
Why is this customer confused?
• Customer: (rings)
• Operator: Directory Enquiries, for which town
please?
• Customer: Could you give me the phone number
of um: Mrs. um: Smithson?
• Operator: Yes, which town is this at please?
• Customer: Huddleston.
• Operator: Yes. And the name again?
• Customer: Mrs. Smithson
Why is this customer confused?
• A: And, what day in May did you want to travel?
• C: OK, uh, I need to be there for a meeting that’s
from the 12th to the 15th.
• Note that client did not answer question.
• Meaning of client’s sentence:
– Meeting
• Start-of-meeting: 12th
• End-of-meeting: 15th
– Doesn’t say anything about flying!!!!!
• How does agent infer client is informing him/her of travel
dates?
Will this client be confused?
A: … there’s 3 non-stops today.
– Literally true even if there are in fact 7 non-stops today.
– But agent means: 3 and only 3.
– How can client infer that agent means:
• only 3
Grice: conversational implicature
• Implicature means a particular class of licensed
inferences.
• Grice (1975) proposed that what enables
hearers to draw correct inferences is:
• Cooperative Principle
– This is a tacit agreement by speakers and listeners to
cooperate in communication
4 Gricean Maxims
• Relevance: Be relevant
• Quantity: Do not make your contribution more or
less informative than required
• Quality: Try to make your contribution one that is
true (don’t say things that are false or for which
you lack adequate evidence)
• Manner: Avoid ambiguity and obscurity; be brief
and orderly
Relevance
• A: Is Regina here?
• B: Her car is outside.
• Implication: yes
– Hearer thinks: why would he mention the car? It must be
relevant. How could it be relevant? It could since if her car
is here she is probably here.
• Client: I need to be there for a meeting that’s from the
12th to the 15th
– Hearer thinks: Speaker is following maxims, would only have
mentioned meeting if it was relevant. How could meeting be
relevant? If client meant me to understand that he had to
depart in time for the meeting.
Quantity
• A: How much money do you have on you?
• B: I have 5 dollars
– Implication: not 6 dollars
• Similarly, 3 non-stops can’t mean 7 non-stops (hearer thinks: if speaker
meant 7 non-stops she would have said 7 non-stops)
• A: Did you do the reading for today’s class?
• B: I intended to
– Implication: No
– B’s answer would be true if B intended to do the reading AND did
the reading, but would then violate the maxim of Quantity
Dialogue System Architecture
Speech recognition
• Input: acoustic waveform
• Output: string of words
– Basic components:
• a recognizer for phones, small sound units like [k] or [ae].
• a pronunciation dictionary like cat = [k ae t]
• a grammar telling us what words are likely to follow what
words
• A search algorithm to find the best string of words
Natural Language Understanding
• Or “NLU”
• Or “Computational semantics”
• There are many ways to represent the meaning
of sentences
• For speech dialogue systems, most common is
“Frame and slot semantics”.
An example of a frame
• Show me morning flights from Boston to SF on Tuesday.
SHOW:
  FLIGHTS:
    ORIGIN:
      CITY: Boston
      DATE: Tuesday
      TIME: morning
    DEST:
      CITY: San Francisco
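A frame like this maps naturally onto a nested dictionary. The following is a minimal Python sketch, not taken from the slides: the lowercase key names and the helper `get_slot` are assumptions made for illustration.

```python
# Hypothetical in-memory form of the frame for
# "Show me morning flights from Boston to SF on Tuesday."
frame = {
    "show": {
        "flights": {
            "origin": {"city": "Boston", "date": "Tuesday", "time": "morning"},
            "dest": {"city": "San Francisco"},
        }
    }
}


def get_slot(frame, path):
    """Follow a dotted path such as 'show.flights.dest.city'.

    Returns None if any key along the path is missing (slot not filled).
    """
    node = frame
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


print(get_slot(frame, "show.flights.origin.city"))  # Boston
print(get_slot(frame, "show.flights.dest.date"))    # None
```

A missing slot returning `None` is what lets a dialogue manager notice which questions it still has to ask.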
How to generate this semantics?
• Many methods
• Simplest: “semantic grammars”
• We’ll come back to these after we’ve seen parsing.
• But a quick teaser for those of you who might have already seen parsing:
• CFG in which the LHS of rules is a semantic category:
– LIST -> show me | I want | can I see|…
– DEPARTTIME -> (after|around|before) HOUR
| morning | afternoon | evening
– HOUR -> one|two|three…|twelve (am|pm)
– FLIGHTS -> (a) flight|flights
– ORIGIN -> from CITY
– DESTINATION -> to CITY
– CITY -> Boston | San Francisco | Denver | Washington
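To make the idea concrete, here is a rough Python sketch that approximates a few of the rules above with regular expressions. It is not a real CFG parser: the `RULES` table and the `parse` function are assumptions for illustration, and the DEPARTTIME rule is simplified to the three day-part words.

```python
import re

# CITY -> Boston | San Francisco | Denver | Washington
CITY = r"(?:Boston|San Francisco|Denver|Washington)"

# Each semantic category (rule LHS) is approximated by one pattern.
RULES = {
    "LIST": r"\b(show me|I want|can I see)\b",
    "FLIGHTS": r"\b((?:a )?flight|flights)\b",
    "ORIGIN": rf"\bfrom ({CITY})\b",
    "DESTINATION": rf"\bto ({CITY})\b",
    "DEPARTTIME": r"\b(morning|afternoon|evening)\b",
}


def parse(utterance):
    """Return a dict of semantic category -> matched text for one utterance."""
    slots = {}
    for category, pattern in RULES.items():
        match = re.search(pattern, utterance, flags=re.IGNORECASE)
        if match:
            slots[category] = match.group(1)
    return slots


slots = parse("Show me flights from Boston to San Francisco on Tuesday morning")
print(slots["ORIGIN"], "->", slots["DESTINATION"], "in the", slots["DEPARTTIME"])
```

Because the categories are semantic rather than syntactic, the matched pieces can be copied straight into frame slots.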
Semantics for a sentence
LIST      FLIGHTS   ORIGIN        DESTINATION        DEPARTDATE   DEPARTTIME
Show me   flights   from Boston   to San Francisco   on Tuesday   morning
Generation and TTS
• Generation component
– Chooses concepts to express to user
– Plans out how to express these concepts in words
– Assigns any necessary prosody to the words
• TTS component
– Takes words and prosodic annotations
– Synthesizes a waveform
Generation Component
• Content Planner
– Decides what content to express to user
• (ask a question, present an answer, etc)
– Often merged with dialogue manager
• Language Generation
– Chooses syntactic structures and words to express meaning.
– Simplest method
• All words in sentence are prespecified!
• “Template-based generation”
• Can have variables:
– What time do you want to leave CITY-ORIG?
– Will you return to CITY-ORIG from CITY-DEST?
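Template-based generation can be sketched in a few lines of Python. The template names and the `realize` helper below are hypothetical; only the two prompt strings come from the slide, with CITY-ORIG and CITY-DEST written as format fields.

```python
# Prespecified sentences with variables, as in template-based generation.
TEMPLATES = {
    "ask_depart_time": "What time do you want to leave {city_orig}?",
    "confirm_return": "Will you return to {city_orig} from {city_dest}?",
}


def realize(template_name, **slots):
    """Fill the named template with slot values taken from the frame."""
    return TEMPLATES[template_name].format(**slots)


print(realize("ask_depart_time", city_orig="Boston"))
print(realize("confirm_return", city_orig="Boston", city_dest="Denver"))
```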
More sophisticated language generation
component
• Natural Language Generation
• Approach:
– Dialogue manager builds representation of meaning
of utterance to be expressed
– Passes this to a “generator”
– Generators have three components
• Sentence planner
• Surface realizer
• Prosody assigner
Architecture of a generator for a dialogue system (after Walker and Rambow 2002)
HCI constraints on generation for
dialogue: “Coherence”
• Discourse markers and pronouns (“Coherence”):
(1) Please say the date.
…
Please say the start time.
…
Please say the duration…
…
Please say the subject…
(2) First, tell me the date.
…
Next, I’ll need the time it starts.
…
Thanks. <pause> Now, how long is it supposed to last?
…
Last of all, I just need a brief description
HCI constraints on generation for dialogue: coherence (II): tapered prompts
• Prompts which get incrementally shorter:
• System: Now, what’s the first company to add to your watch list?
• Caller: Cisco
• System: What’s the next company name? (Or, you can say, “Finished”)
• Caller: IBM
• System: Tell me the next company name, or say, “Finished.”
• Caller: Intel
• System: Next one?
• Caller: America Online.
• System: Next?
• Caller: …
Dialogue Manager
• Controls the architecture and structure of dialogue
– Takes input from ASR/NLU components
– Maintains some sort of state
– Interfaces with Task Manager
– Passes output to NLG/TTS modules
Four architectures for dialogue management
• Finite State
• Frame-based
• Information State
– Markov Decision Processes
• AI Planning
Finite-State Dialogue Management
• Consider a trivial airline travel system
– Ask the user for a departure city
– For a destination city
– For a time
– Whether the trip is round-trip or not
Finite State Dialogue Manager
Finite-state Dialogue Managers
• System completely controls the conversation
with the user
• Asks the user a series of questions
• Ignores (or misinterprets) anything the user says
that is not a direct answer to the system’s
questions
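A finite-state manager of this kind can be sketched as a fixed list of states visited in order. The Python below is only an illustration: the state names, the prompt wording, and `run_dialogue` are assumptions, and recognition is replaced by a list of already-transcribed replies.

```python
# One state per question, visited in a fixed order (system initiative).
STATES = [
    ("origin", "What city are you leaving from?"),
    ("destination", "Where are you going?"),
    ("time", "What time would you like to leave?"),
    ("round_trip", "Is this a round trip?"),
]


def run_dialogue(answers):
    """Ask each question in turn and store the reply in that state's slot.

    Whatever the user says is treated as the answer to the current
    question, which is exactly the limitation described above.
    """
    filled = {}
    replies = iter(answers)
    for slot, prompt in STATES:
        print("S:", prompt)
        filled[slot] = next(replies)
    return filled


print(run_dialogue(["Boston", "Denver", "5 pm", "no"]))
```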
Dialogue Initiative
• Systems that control conversation like this are
system initiative or single initiative.
• “Initiative”: who has control of conversation
• In normal human-human dialogue, initiative
shifts back and forth between participants.
System Initiative
• Systems which completely control the conversation at all
times are called system initiative.
• Advantages:
– Simple to build
– User always knows what they can say next
– System always knows what user can say next
• Known words: Better performance from ASR
• Known topic: Better performance from NLU
– Ok for VERY simple tasks (entering a credit card, or login name
and password)
• Disadvantage:
– Too limited
User Initiative
• User directs the system
• Generally, user asks a single question, system
answers
• System can’t ask questions back, engage in
clarification dialogue, confirmation dialogue
• Used for simple database queries
• User asks question, system gives answer
• Web search is user initiative dialogue.
Problems with System Initiative
• Real dialogue involves give and take!
• In travel planning, users might want to say
something that is not the direct answer to the
question.
• For example answering more than one question
in a sentence:
– Hi, I’d like to fly from Seattle Tuesday morning
– I want a flight from Milwaukee to Orlando one way
leaving after 5 p.m. on Wednesday.
Single initiative + universals
• We can give users a little more flexibility by adding
universal commands
• Universals: commands you can say anywhere
• As if we augmented every state of FSA with these
– Help
– Start over
– Correct
• This describes many implemented systems
• But still doesn’t allow user to say what they want to say
Mixed Initiative
• Conversational initiative can shift between system and
user
• Simplest kind of mixed initiative: use the structure of the
frame itself to guide dialogue
Slot        Question
ORIGIN      What city are you leaving from?
DEST        Where are you going?
DEPT DATE   What day would you like to leave?
DEPT TIME   What time would you like to leave?
AIRLINE     What is your preferred airline?
Frames are mixed-initiative
• User can answer multiple questions at once.
• System asks questions of user, filling any slots
that user specifies
• When frame is filled, do database query
• If user answers 3 questions at once, system has
to fill slots and not ask these questions again!
• Anyhow, we avoid the strict constraints on order
of the finite-state architecture.
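The frame-driven control loop can be sketched as follows. This is an illustrative Python sketch under stated assumptions: `QUESTIONS`, `update`, and `next_question` are hypothetical names, and the NLU output is assumed to arrive as a dict of slot values.

```python
# Slot -> question to ask when that slot is still empty.
QUESTIONS = {
    "ORIGIN": "What city are you leaving from?",
    "DEST": "Where are you going?",
    "DEPT DATE": "What day would you like to leave?",
    "DEPT TIME": "What time would you like to leave?",
    "AIRLINE": "What is your preferred airline?",
}


def update(frame, understood):
    """Merge every slot the NLU extracted from the latest utterance."""
    new_frame = dict(frame)
    new_frame.update(understood)
    return new_frame


def next_question(frame):
    """Return the question for the first unfilled slot, or None when full."""
    for slot, question in QUESTIONS.items():
        if slot not in frame:
            return question
    return None


# "Hi, I'd like to fly from Seattle Tuesday morning" fills three slots at once.
frame = update({}, {"ORIGIN": "Seattle", "DEPT DATE": "Tuesday", "DEPT TIME": "morning"})
print(next_question(frame))  # Where are you going?
```

Once `next_question` returns `None`, the frame is complete and the database query can be issued.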
Multiple frames
• flights, hotels, rental cars
• Flight legs: Each flight can have multiple legs, which
might need to be discussed separately
• Presenting the flights (if there are multiple flights meeting
the user’s constraints)
– This frame has slots like 1ST_FLIGHT or 2ND_FLIGHT so the user can ask
“how much is the second one”
• General route information:
– Which airlines fly from Boston to San Francisco
• Airfare practices:
– Do I have to stay over Saturday to get a decent airfare?
Multiple Frames
• Need to be able to switch from frame to frame
• Based on what user says.
• Disambiguate which slot of which frame an input
is supposed to fill, then switch dialogue control
to that frame.
• Main implementation: production rules
– Different types of inputs cause different productions to
fire
– Each of which can flexibly fill in different frames
– Can also switch control to different frame
Defining Mixed Initiative
• Mixed Initiative could mean
– User can arbitrarily take or give up initiative in various
ways
• This is really only possible in very complex plan-based
dialogue systems
• No commercial implementations
• Important research area
– Something simpler and quite specific which we will
define in the next few slides
True Mixed Initiative
How mixed initiative is usually defined
• First we need to define two other factors
• Open prompts vs. directive prompts
• Restrictive versus non-restrictive grammar
Open vs. Directive Prompts
• Open prompt
– System gives user very few constraints
– User can respond how they please:
– “How may I help you?” “How may I direct your call?”
• Directive prompt
– Explicitly instructs user how to respond
– “Say yes if you accept the call; otherwise, say no”
Restrictive vs. Non-restrictive grammars
• Restrictive grammar
– Language model which strongly constrains the ASR
system, based on dialogue state
• Non-restrictive grammar
– Open language model which is not restricted to a
particular dialogue state
Definition of Mixed Initiative
Grammar           Open Prompt          Directive Prompt
Restrictive       Doesn’t make sense   System Initiative
Non-restrictive   User Initiative      Mixed Initiative
VoiceXML
• Voice eXtensible Markup Language
• An XML-based dialogue design language
• Makes use of ASR and TTS
• Deals well with simple, frame-based mixed initiative dialogue.
• Most common in commercial world (too limited for research systems)
• But useful to get a handle on the concepts.
Voice XML
• Each dialogue is a <form>. (Form is the
VoiceXML word for frame)
• Each <form> generally consists of a sequence of
<field>s, with other commands
Sample vxml doc
<form>
<field name="transporttype">
<prompt>
Please choose airline, hotel, or rental car. </prompt>
<grammar type="application/x-nuance-gsl">
[airline hotel "rental car"]
</grammar>
</field>
<block>
<prompt>
You have chosen <value expr="transporttype"/>. </prompt>
</block>
</form>
VoiceXML interpreter
• Walks through a VXML form in document order
• Iteratively selecting each item
• If multiple fields, visit each one in order.
• Special commands for events
Another vxml doc (1)
<noinput>
I'm sorry, I didn't hear you. <reprompt/>
</noinput>
- “noinput” means silence exceeds a timeout threshold
<nomatch>
I'm sorry, I didn't understand that. <reprompt/>
</nomatch>
- “nomatch” means confidence value for utterance is too low
- notice “reprompt” command
Another vxml doc (2)
<form>
<block> Welcome to the air travel consultant. </block>
<field name="origin">
<prompt> Which city do you want to leave from? </prompt>
<grammar type="application/x-nuance-gsl">
[(san francisco) denver (new york) barcelona]
</grammar>
<filled>
<prompt> OK, from <value expr="origin"/> </prompt>
</filled>
</field>
- “filled” tag is executed by interpreter as soon as field filled by user
Another vxml doc (3)
<field name="destination">
<prompt> And which city do you want to go to? </prompt>
<grammar type="application/x-nuance-gsl">
[(san francisco) denver (new york) barcelona]
</grammar>
<filled>
<prompt> OK, to <value expr="destination"/> </prompt>
</filled>
</field>
<field name="departdate" type="date">
<prompt> And what date do you want to leave? </prompt>
<filled>
<prompt> OK, on <value expr="departdate"/> </prompt>
</filled>
</field>
Another vxml doc (4)
<block>
<prompt> OK, I have you are departing from
<value expr="origin"/> to <value expr="destination"/>
on <value expr="departdate"/>
</prompt>
send the info to book a flight...
</block>
</form>
Summary: VoiceXML
• Voice eXtensible Markup Language
• An XML-based dialogue design language
• Makes use of ASR and TTS
• Deals well with simple, frame-based mixed initiative dialogue.
• Most common in commercial world (too limited for research systems)
• But useful to get a handle on the concepts.
Information-State and Dialogue Acts
• If we want a dialogue system to be more than just form-filling
• Needs to:
– Decide when the user has asked a question, made a proposal, or
rejected a suggestion
– Ground a user’s utterance, ask clarification questions,
suggest plans
• Suggests:
– Conversational agent needs sophisticated models of
interpretation and generation
• In terms of speech acts and grounding
• Needs more sophisticated representation of dialogue context than
just a list of slots
Information-state architecture
• Information state
• Dialogue act interpreter
• Dialogue act generator
• Set of update rules
– Update dialogue state as acts are interpreted
– Generate dialogue acts
• Control structure to select which update rules to apply
Information-state
Dialogue acts
• Also called “conversational moves”
• An act with (internal) structure related
specifically to its dialogue function
• Incorporates ideas of grounding
• Incorporates other dialogue and conversational
functions that Austin and Searle didn’t seem
interested in
Verbmobil task
• Two-party scheduling dialogues
• Speakers were asked to plan a meeting at some
future date
• Data used to design conversational agents
which would help with this task
• (cross-language, translating, scheduling
assistant)
Verbmobil Dialogue Acts
THANK             thanks
GREET             Hello Dan
INTRODUCE         It’s me again
BYE               Allright, bye
REQUEST-COMMENT   How does that look?
SUGGEST           June 13th through 17th
REJECT            No, Friday I’m booked all day
ACCEPT            Saturday sounds fine
REQUEST-SUGGEST   What is a good day of the week for you?
INIT              I wanted to make an appointment with you
GIVE_REASON       Because I have meetings all afternoon
FEEDBACK          Okay
DELIBERATE        Let me check my calendar here
CONFIRM           Okay, that would be wonderful
CLARIFY           Okay, do you mean Tuesday the 23rd?
Automatic Interpretation of Dialogue Acts
• How do we automatically identify dialogue acts?
• Given an utterance:
– Decide whether it is a QUESTION, STATEMENT,
SUGGEST, or ACK
• Recognizing illocutionary force will be crucial to
building a dialogue agent
• Perhaps we can just look at the form of the
utterance to decide?
Can we just use the surface syntactic form?
• YES-NO-Q’s have auxiliary-before-subject
syntax:
– Will breakfast be served on USAir 1557?
• STATEMENTs have declarative syntax:
– I don’t care about lunch
• COMMAND’s have imperative syntax:
– Show me flights from Milwaukee to Orlando on
Thursday night
Surface form != speech act type
Utterance                               Locutionary Force   Illocutionary Force
Can I have the rest of your sandwich?   Question            Request
I want the rest of your sandwich        Declarative         Request
Give me your sandwich!                  Imperative          Request
Dialogue act disambiguation is hard! Who’s on First?
Abbott: Well, Costello, I'm going to New York with you. Bucky Harris the Yankee's manager gave me a job as coach for as long as you're on the team.
Costello: Look Abbott, if you're the coach, you must know all the players.
Abbott: I certainly do.
Costello: Well you know I've never met the guys. So you'll have to tell me their names, and then I'll know who's playing on the team.
Abbott: Oh, I'll tell you their names, but you know it seems to me they give these ball players now-a-days very peculiar names.
Costello: You mean funny names?
Abbott: Strange names, pet names...like Dizzy Dean...
Costello: His brother Daffy
Abbott: Daffy Dean...
Costello: And their French cousin.
Abbott: French?
Costello: Goofe'
Abbott: Goofe' Dean. Well, let's see, we have on the bags, Who's on first, What's on second, I Don't Know is on third...
Costello: That's what I want to find out.
Abbott: I say Who's on first, What's on second, I Don't Know's on third.
Dialogue act ambiguity
• Who’s on first?
– INFO-REQUEST
– or
– STATEMENT
Dialogue Act ambiguity
• Can you give me a list of the flights from Atlanta
to Boston?
– This looks like an INFO-REQUEST.
– If so, the answer is:
• YES.
– But really it’s a DIRECTIVE or REQUEST, a polite
form of:
– Please give me a list of the flights…
• What looks like a QUESTION can be a
REQUEST
Dialogue Act ambiguity
• Similarly, what looks like a STATEMENT can be a QUESTION:
Us   OPEN-OPTION   I was wanting to make some arrangements for a trip that I’m going to be taking uh to LA uh beginning of the week after next
Ag   HOLD          OK uh let me pull up your profile and I’ll be right with you here. [pause]
Ag   CHECK         And you said you wanted to travel next week?
Us   ACCEPT        Uh yes.
Indirect speech acts
• Utterances which use a surface statement to ask
a question
• Utterances which use a surface question to
issue a request
DA interpretation as statistical classification
• Lots of clues in each sentence that can tell us which DA it is:
• Words and Collocations:
– Please or would you: good cue for REQUEST
– Are you: good cue for INFO-REQUEST
• Prosody:
– Rising pitch is a good cue for INFO-REQUEST
– Loudness/stress can help distinguish yeah/AGREEMENT from
yeah/BACKCHANNEL
• Conversational Structure
– Yeah following a proposal is probably AGREEMENT; yeah following an
INFORM probably a BACKCHANNEL
Statistical classifier model of dialogue act
interpretation
• Our goal is to decide for each sentence what dialogue
act it is
• This is a classification task (we are making a 1-of-N
classification decision for each sentence)
• With N classes (= number of dialog acts).
• Three probabilistic models corresponding to the 3 kinds
of cues from the input sentence.
– Conversational Structure: Probability of one dialogue act
following another P(Answer|Question)
– Words and Syntax: Probability of a sequence of words given a
dialogue act: P(“do you” | Question)
– Prosody: Probability of prosodic features given a dialogue act:
P(“rise at end of sentence” | Question)
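One simple way to combine the three models is to multiply them (add their logs) and pick the highest-scoring act. The slides do not say how the models are combined, so the independence assumption here is ours, and every probability in the tables below is made up for illustration rather than estimated from data.

```python
import math

# Made-up illustrative probabilities; a real system would estimate
# these from a corpus of labeled dialogues.
P_ACT_GIVEN_PREV = {"Question": {"Answer": 0.7, "Question": 0.3}}
P_WORDS_GIVEN_ACT = {"do you": {"Question": 0.20, "Answer": 0.02}}
P_PROSODY_GIVEN_ACT = {"final_rise": {"Question": 0.6, "Answer": 0.1}}


def classify(prev_act, word_cue, prosody_cue, acts=("Question", "Answer")):
    """Pick the act with the highest summed log-probability.

    Assumes the three cue types are conditionally independent given
    the dialogue act (a naive-Bayes-style simplification).
    """
    scores = {}
    for act in acts:
        scores[act] = (
            math.log(P_ACT_GIVEN_PREV[prev_act][act])
            + math.log(P_WORDS_GIVEN_ACT[word_cue][act])
            + math.log(P_PROSODY_GIVEN_ACT[prosody_cue][act])
        )
    return max(scores, key=scores.get)


# Conversational structure favors Answer, but the lexical and
# prosodic cues outweigh it.
print(classify("Question", "do you", "final_rise"))  # Question
```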
An example of dialogue act detection:
Correction Detection
• Despite all these clever confirmation/rejection strategies, dialogue
systems still make mistakes (Surprise!)
• If system misrecognizes an utterance, and either
– Rejects it, or
– Via confirmation, displays its misunderstanding
• Then user has a chance to make a correction by
– Repeating themselves
– Rephrasing
– Saying “no” to the confirmation question.
Corrections
• Unfortunately, corrections are harder to recognize than normal
sentences!
– Swerts et al (2000): corrections misrecognized twice as often (in terms
of WER) as non-corrections!!!
– Why?
• Prosody seems to be largest factor: hyperarticulation
• English Example from Liz Shriberg
– “NO, I am DE-PAR-TING from Jacksonville”
• A German example from Bettina Braun from a talking elevator
A Labeled dialogue (Swerts et al)
Machine Learning and Classifiers
• Given a labeled training set
• We can build a classifier to label observations
into classes
– Decision Tree
– Regression
– SVM
• I won’t introduce the algorithms here.
• But these are at the core of NLP/computational
linguistics/Speech/Dialogue
• You can learn them in:
– AI - CS 121/221
– Machine Learning CS 229
Machine learning to detect user corrections
• Build classifiers using features like
– Lexical information (words “no”, “correction”, “I don’t”,
swear words)
– Prosodic features (various increases in F0 range,
pause duration, and word duration that correlate
with hyperarticulation)
– Length
– ASR confidence
– LM probability
– Various dialogue features (repetition)
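Several of these features are easy to compute from a recognized utterance. The Python below is a hedged sketch of a feature extractor only, with no classifier attached: the function name, the feature names, and the tiny cue-word set are assumptions, and F0 and LM features are left out because they need acoustic and language-model tooling.

```python
# Hypothetical cue words; a real list would be learned or much longer.
CORRECTION_WORDS = {"no", "correction"}


def correction_features(words, prev_user_words, asr_confidence, duration_sec):
    """Build a feature dict for one user turn.

    words / prev_user_words: token lists for this turn and the user's previous turn.
    asr_confidence: recognizer confidence for this turn.
    duration_sec: length of the utterance in seconds.
    """
    lowered = [w.lower() for w in words]
    previous = {w.lower() for w in prev_user_words}
    return {
        "has_correction_word": any(w in CORRECTION_WORDS for w in lowered),
        "length": len(words),
        "asr_confidence": asr_confidence,
        # Slow speech (more seconds per word) is one rough proxy for hyperarticulation.
        "sec_per_word": duration_sec / max(len(words), 1),
        "repeats_previous": len(set(lowered) & previous) > 0,
    }


features = correction_features(
    ["No", "I", "am", "departing", "from", "Jacksonville"],
    ["from", "Jacksonville"],
    asr_confidence=0.4,
    duration_sec=3.0,
)
print(features)
```

A dict like this would be the input row for whichever classifier (decision tree, regression, SVM) is trained on the labeled corrections.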
Generating Dialogue Acts
• Confirmation
• Rejection
Confirmation
• Another reason for grounding
• Errors: Speech is a pretty errorful channel
– Even for humans; so they use grounding to confirm
that they heard correctly
• ASR is way worse than humans!
• So dialogue systems need to do even more
grounding and confirmation than humans
Explicit confirmation
• S: Which city do you want to leave from?
• U: Baltimore
• S: Do you want to leave from Baltimore?
• U: Yes
Explicit confirmation
• U: I’d like to fly from Denver Colorado to New
York City on September 21st in the morning on
United Airlines
• S: Let’s see then. I have you going from Denver
Colorado to New York on September 21st. Is
that correct?
• U: Yes
Implicit confirmation: display
• U: I’d like to travel to Berlin
• S: When do you want to travel to Berlin?
• U: Hi I’d like to fly to Seattle Tuesday morning
• S: Traveling to Seattle on Tuesday, August
eleventh in the morning. Your name?
Implicit vs. Explicit
• Complementary strengths
• Explicit: easier for users to correct system’s
mistakes (can just say “no”)
• But explicit is cumbersome and long
• Implicit: much more natural, quicker, simpler (if
system guesses right).
Implicit and Explicit
• Early systems: all-implicit or all-explicit
• Modern systems: adaptive
• How to decide?
– ASR system can give confidence metric.
– This expresses how convinced system is of its
transcription of the speech
– If high confidence, use implicit confirmation
– If low confidence, use explicit confirmation
Computing confidence
• Simplest: use acoustic log-likelihood of user’s
utterance
• More features
– Prosodic: utterances with longer pauses, F0
excursions, longer durations
– Backoff: did we have to back off in the LM?
– Cost of an error: Explicit confirmation before moving
money or booking flights
Rejection
• e.g., VoiceXML “nomatch”
• “I’m sorry, I didn’t understand that.”
• Reject when:
– ASR confidence is low
– Best interpretation is semantically ill-formed
• Might have four-tiered level of confidence:
– Below confidence threshold, reject
– Above threshold, explicit confirmation
– If even higher, implicit confirmation
– Even higher, no confirmation
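The four tiers amount to three thresholds on the confidence score. A minimal Python sketch follows; the function name and the threshold values 0.3, 0.6, and 0.9 are arbitrary placeholders, not values from the slides.

```python
def confirmation_action(confidence, reject_below=0.3, implicit_from=0.6, none_from=0.9):
    """Map an ASR confidence score to one of the four tiers.

    The three cutoffs are hypothetical and would be tuned per system.
    """
    if confidence < reject_below:
        return "reject"
    if confidence < implicit_from:
        return "explicit_confirmation"
    if confidence < none_from:
        return "implicit_confirmation"
    return "no_confirmation"


for score in (0.1, 0.5, 0.7, 0.95):
    print(score, confirmation_action(score))
```

Hand-picked cutoffs like these are a fixed policy chosen by the designer.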
Dialogue System Evaluation
• Key point about SLP.
• Whenever we design a new algorithm or build a
new application, need to evaluate it
• Two kinds of evaluation
– Extrinsic: embedded in some external task
– Intrinsic: some sort of more local evaluation.
• How to evaluate a dialogue system?
• What constitutes success or failure for a
dialogue system?
Dialogue System Evaluation
• It turns out we’ll need an evaluation metric for two
reasons
– 1) the normal reason: we need a metric to help us compare
different implementations
• can’t improve it if we don’t know where it fails
• Can’t decide between two algorithms without a goodness metric
– 2) a new reason: we will need a metric for “how good a dialogue
went” as an input to reinforcement learning:
• automatically improve our conversational agent performance via
learning
Evaluating Dialogue Systems
• PARADISE framework (Walker et al ’00)
• “Performance” of a dialogue system is affected both by
what gets accomplished by the user and the dialogue
agent and how it gets accomplished
[Figure: PARADISE objective structure: Maximize Task Success; Minimize Costs (Efficiency Measures, Qualitative Measures)]
Slide from Julia Hirschberg
PARADISE evaluation again:
• Maximize Task Success
• Minimize Costs
– Efficiency Measures
– Quality Measures
• PARADISE (PARAdigm for Dialogue System
Evaluation)
Task Success
• % of subtasks completed
• Correctness of each question/answer/error message
• Correctness of total solution
– Attribute-Value matrix (AVM)
– Kappa coefficient
• Users’ perception of whether task was
completed
Task Success
• Task goals seen as Attribute-Value Matrix
ELVIS e-mail retrieval task (Walker et al ‘97)
“Find the time and place of your meeting with Kim.”
Attribute             Value
Selection Criterion   Kim or Meeting
Time                  10:30 a.m.
Place                 2D516
• Task success can be defined by match between AVM values at end of task with “true” values for AVM
Slide from Julia Hirschberg
Efficiency Cost
• Polifroni et al. (1992), Danieli and Gerbino (1995), Hirschman and Pao (1993)
• Total elapsed time in seconds or turns
• Number of queries
• Turn correction ratio: number of system or user turns used solely to correct errors, divided by total number of turns
Quality Cost
• # of times ASR system failed to return any
sentence
• # of ASR rejection prompts
• # of times user had to barge-in
• # of time-out prompts
• Inappropriateness (verbose, ambiguous) of
system’s questions, answers, error messages
Another key quality cost
• “Concept accuracy” or “Concept error rate”
• % of semantic concepts that the NLU component returns
correctly
• I want to arrive in Austin at 5:00
– DESTCITY: Boston
– Time: 5:00
• Concept accuracy = 50%
• Average this across entire dialogue
• “How many of the sentences did the system understand
correctly”
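The 50% figure can be reproduced with a short computation. In this Python sketch the function name and the dict representation of reference and recognized concepts are assumptions; the slot values are the ones from the example above.

```python
def concept_accuracy(reference, hypothesis):
    """Fraction of reference concepts that the NLU returned with the right value."""
    correct = sum(1 for slot, value in reference.items() if hypothesis.get(slot) == value)
    return correct / len(reference)


# "I want to arrive in Austin at 5:00", with Austin misunderstood as Boston.
reference = {"DESTCITY": "Austin", "TIME": "5:00"}
hypothesis = {"DESTCITY": "Boston", "TIME": "5:00"}
print(concept_accuracy(reference, hypothesis))  # 0.5
```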
PARADISE: Regress against user
satisfaction
Regressing against user satisfaction
• Questionnaire to assign each dialogue a “user
satisfaction rating”: this is dependent measure
• Set of cost and success factors are independent
measures
• Use regression to train weights for each factor
Experimental Procedures
• Subjects given specified tasks
• Spoken dialogues recorded
• Cost factors, states, dialog acts automatically logged;
ASR accuracy, barge-in hand-labeled
• Users specify task solution via web page
• Users complete User Satisfaction surveys
• Use multiple linear regression to model User Satisfaction
as a function of Task Success and Costs; test for
significant predictive factors
Slide from Julia Hirschberg
User Satisfaction: Sum of Many Measures
Was the system easy to understand? (TTS Performance)
Did the system understand what you said? (ASR Performance)
Was it easy to find the message/plane/train you wanted? (Task Ease)
Was the pace of interaction with the system appropriate? (Interaction Pace)
Did you know what you could say at each point of the dialog? (User Expertise)
How often was the system sluggish and slow to reply to you? (System Response)
Did the system work the way you expected it to in this conversation? (Expected Behavior)
Do you think you'd use the system regularly in the future? (Future Use)
Adapted from Julia Hirschberg
Performance Functions from Three Systems
• ELVIS User Sat. = .21 * COMP + .47 * MRS - .15 * ET
• TOOT User Sat. = .35 * COMP + .45 * MRS - .14 * ET
• ANNIE User Sat. = .33 * COMP + .25 * MRS + .33 * Help
– COMP: User perception of task completion (task success)
– MRS: Mean (concept) recognition accuracy (cost)
– ET: Elapsed time (cost)
– Help: Help requests (cost)
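Each performance function is a weighted sum, so predicting satisfaction for a new dialogue is a one-line computation. The sketch below copies the weights from the three functions above; the `predicted_user_sat` helper is hypothetical, and it assumes the input measures have already been put on comparable scales, which the slide does not spell out.

```python
# Regression weights copied from the three performance functions.
WEIGHTS = {
    "ELVIS": {"COMP": 0.21, "MRS": 0.47, "ET": -0.15},
    "TOOT": {"COMP": 0.35, "MRS": 0.45, "ET": -0.14},
    "ANNIE": {"COMP": 0.33, "MRS": 0.25, "Help": 0.33},
}


def predicted_user_sat(system, measures):
    """Weighted sum of the measures that appear in the system's function."""
    return sum(weight * measures[name] for name, weight in WEIGHTS[system].items())


# Illustrative input values only, not data from any experiment.
measures = {"COMP": 1.0, "MRS": 1.0, "ET": 1.0}
print(round(predicted_user_sat("ELVIS", measures), 2))  # 0.53
```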
Slide from Julia Hirschberg
Performance Model
• Perceived task completion and mean recognition score (concept
accuracy) are consistently significant predictors of User Satisfaction
• Performance model useful for system development
– Making predictions about system modifications
– Distinguishing ‘good’ dialogues from ‘bad’ dialogues
– As part of a learning model
Now that we have a success metric
• Could we use it to help drive learning?
• In recent work we use this metric to help us
learn an optimal policy or strategy for how the
conversational agent should behave
New Idea: Modeling a dialogue system as a probabilistic agent
• A conversational agent can be characterized by:
– The current knowledge of the system
• A set of states S the agent can be in
– A set of actions A the agent can take
– A goal G, which implies
• A success metric that tells us how well the agent achieved its goal
• A way of using this metric to create a strategy or policy π for what action to take in any particular state.
What do we mean by actions A and policies π?
• Kinds of decisions a conversational agent needs
to make:
– When should I ground/confirm/reject/ask for
clarification on what the user just said?
– When should I ask a directive prompt, when
an open prompt?
– When should I use user, system, or mixed
initiative?
A threshold is a human-designed policy!
• Could we learn what the right action is
– Rejection
– Explicit confirmation
– Implicit confirmation
– No confirmation
• By learning a policy which,
– given various information about the current state,
– dynamically chooses the action which maximizes
dialogue success
Another strategy decision
• Open versus directive prompts
• When to do mixed initiative
Outline
•
•
The Linguistics of Conversation
Basic Conversational Agents
–
–
–
–
•
ASR
NLU
Generation
Dialogue Manager
Dialogue Manager Design
– Finite State
– Frame-based
– Initiative: User, System, Mixed
•
•
VoiceXML
Information-State
– Dialogue-Act Detection
– Dialogue-Act Generation
•
•
Evaluation
Utility-based conversational agents
– MDP, POMDP
END of TODAY’S LECTURE
• THE FOLLOWING SLIDES ARE AN OPTIONAL
ADVANCED DISCUSSION OF MARKOV-DECISION-PROCESS DIALOGUE SYSTEMS.
Review: Open vs. Directive Prompts
• Open prompt
– System gives user very few constraints
– User can respond how they please:
– “How may I help you?” “How may I direct your call?”
• Directive prompt
– Explicitly instructs user how to respond
– “Say yes if you accept the call; otherwise, say no”
Review: Restrictive vs. Non-restrictive
grammars
• Restrictive grammar
– Language model which strongly constrains the ASR
system, based on dialogue state
• Non-restrictive grammar
– Open language model which is not restricted to a
particular dialogue state
Kinds of Initiative
• How do I decide which of these initiatives to use
at each point in the dialogue?
Grammar           Open Prompt          Directive Prompt
Restrictive       Doesn’t make sense   System Initiative
Non-restrictive   User Initiative      Mixed Initiative
Modeling a dialogue system as a
probabilistic agent
• A conversational agent can be characterized by:
– The current knowledge of the system
• A set of states S the agent can be in
– a set of actions A the agent can take
– A goal G, which implies
• A success metric that tells us how well the agent achieved its
goal
• A way of using this metric to create a strategy or policy π for
what action to take in any particular state.
Goals are not enough
• Goal: user satisfaction
• OK, that’s all very well, but
– Many things influence user satisfaction
– We don’t know user satisfaction until after the dialogue
is done
– How do we know, state by state and action by action,
what the agent should do?
• We need a more helpful metric that can apply to
each state
Utility
• A utility function
– maps a state or state sequence
– onto a real number
– describing the goodness of that state
– I.e. the resulting “happiness” of the agent
• Principle of Maximum Expected Utility:
– A rational agent should choose an action that
maximizes the agent’s expected utility
Maximum Expected Utility
• Principle of Maximum Expected Utility:
– A rational agent should choose an action that maximizes the agent’s
expected utility
• Action A has possible outcome states Result_i(A)
• E: agent’s evidence about current state of world
• Before doing A, agent estimates prob of each outcome
– P(Result_i(A) | Do(A), E)
• Thus can compute expected utility:
$EU(A \mid E) = \sum_i P(\mathrm{Result}_i(A) \mid \mathrm{Do}(A), E)\, U(\mathrm{Result}_i(A))$
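As a rough illustration of the Maximum Expected Utility principle, the sketch below computes EU(A|E) for two candidate actions and picks the larger. The outcome names, probabilities, and utility values are all made-up assumptions, not figures from the lecture.

```python
def expected_utility(outcome_probs, utility):
    """EU(A|E): sum over outcomes of P(outcome | Do(A), E) * U(outcome)."""
    return sum(p * utility[outcome] for outcome, p in outcome_probs.items())


def best_action(action_outcomes, utility):
    """Return the action whose expected utility is highest."""
    return max(action_outcomes,
               key=lambda a: expected_utility(action_outcomes[a], utility))


# Hypothetical utilities of outcome states.
utility = {"slot_filled": 1.0, "slot_wrong": -1.0, "user_annoyed": -0.2}

# Hypothetical P(Result_i(A) | Do(A), E) for each action A.
action_outcomes = {
    "no_confirmation": {"slot_filled": 0.7, "slot_wrong": 0.3},
    "explicit_confirmation": {"slot_filled": 0.75, "user_annoyed": 0.25},
}

chosen = best_action(action_outcomes, utility)
```

With these invented numbers, confirming explicitly wins because a wrong slot is penalized much more than a mildly annoyed user.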
Utility (Russell and Norvig)
Markov Decision Processes
• Or MDP
• Characterized by:
– a set of states S an agent can be in
– a set of actions A the agent can take
– A reward r(a,s) that the agent receives for taking an
action in a state
– (+ Some other things I’ll come back to (gamma, state transition
probabilities))
A brief tutorial example
• Levin et al (2000)
• A Day-and-Month dialogue system
• Goal: fill in a two-slot frame:
– Month: November
– Day: 12th
• Via the shortest possible interaction with user
What is a state?
• In principle, MDP state could include any
possible information about dialogue
– Complete dialogue history so far
• Usually use a much more limited set
– Values of slots in current frame
– Most recent question asked to user
– User’s most recent answer
– ASR confidence
– etc.
State in the Day-and-Month example
• Values of the two slots day and month.
• Total:
– 2 special states: initial state s_i and final state s_f
– 365 states with a day and month
– 1 state for the leap-year date (February 29)
– 12 states with a month but no day
– 31 states with a day but no month
– 411 total states
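The state count can be checked by enumeration. This is a small sketch; the variable names are illustrative, and the year 2016 is picked only because it is a leap year, so that February 29 is included.

```python
import calendar

LEAP_YEAR = 2016  # any leap year works; used only to get 29 days in February

# Number of days in each month of a leap year.
days_in_month = {m: calendar.monthrange(LEAP_YEAR, m)[1] for m in range(1, 13)}

full_date_states = sum(days_in_month.values())  # 365 ordinary dates + Feb 29
month_only_states = 12                          # month known, day unknown
day_only_states = max(days_in_month.values())   # day known (1..31), month unknown
special_states = 2                              # initial state and final state

total_states = (full_date_states + month_only_states
                + day_only_states + special_states)
```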
Actions in MDP models of dialogue
• Speech acts!
– Ask a question
– Explicit confirmation
– Rejection
– Give the user some database information
– Tell the user their choices
• Do a database query
Actions in the Day-and-Month example
• a_d: a question asking for the day
• a_m: a question asking for the month
• a_dm: a question asking for the day+month
• a_f: a final action submitting the form and
terminating the dialogue
A simple reward function
• For this example, let’s use a cost function
• A cost function for entire dialogue
• Let
– N_i = number of interactions (duration of dialogue)
– N_e = number of errors in the obtained values (0-2)
– N_f = expected distance from goal
• (0 for complete date, 1 if either day or month is missing, 2 if both
missing)
• Then (weighted) cost is:
• C = w_i N_i + w_e N_e + w_f N_f
3 possible policies
[Figure: the three policies: (1) dumb, (2) open prompt, (3) directive prompt]
p1 = probability of error in open prompt
p2 = probability of error in directive prompt
3 possible policies
Strategy 3 (directive prompt) is better than strategy 2 (open prompt) when
improved error rate justifies longer interaction:
$p_1 - p_2 > \frac{w_i}{2 w_e}$
p1 = probability of error in open prompt
p2 = probability of error in directive prompt
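The comparison can be worked through numerically. The per-policy expected costs below are not given in the text; they are derived from C = w_i N_i + w_e N_e + w_f N_f under these assumptions: the dumb policy submits immediately (1 interaction, 2 slots missing), the open policy asks for day+month and then submits (2 interactions), the directive policy asks for day, then month, then submits (3 interactions), and each of the two slots is wrong with probability p1 (open) or p2 (directive).

```python
def expected_costs(p1, p2, w_i, w_e, w_f):
    """Expected cost of each policy under the assumptions stated above."""
    return {
        "dumb": w_i * 1 + w_f * 2,           # a_f only: nothing filled
        "open": w_i * 2 + w_e * 2 * p1,      # a_dm, a_f: 2*p1 expected errors
        "directive": w_i * 3 + w_e * 2 * p2, # a_d, a_m, a_f: 2*p2 expected errors
    }


def directive_beats_open(p1, p2, w_i, w_e):
    """The condition p1 - p2 > w_i / (2 * w_e)."""
    return (p1 - p2) > w_i / (2 * w_e)


# Hypothetical weights and error rates.
costs = expected_costs(p1=0.4, p2=0.1, w_i=1.0, w_e=5.0, w_f=10.0)
```

Setting the open and directive costs equal and solving for p1 - p2 gives the inequality: the extra interaction (cost w_i) pays off only if it saves more than that in expected error cost (2 w_e (p1 - p2)).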
That was an easy optimization
• Only two actions, only tiny # of policies
• In general, number of actions, states, policies is
quite large
• So finding optimal policy π* is harder
• We need reinforcement learning
• Back to MDPs:
MDP
• We can think of a dialogue as a trajectory in state space
• The best policy π* is the one with the greatest expected
reward over all trajectories
• How to compute a reward for a state sequence?
Reward for a state sequence
• One common approach: discounted rewards
• Cumulative reward Q of a sequence is
discounted sum of utilities of individual states
• Discount factor γ between 0 and 1
• Makes agent care more about current than
future rewards; the more future a reward, the
more discounted its value
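A discounted sum is short enough to write out. This is a sketch; the function name and the example reward sequence are illustrative assumptions.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t, where t = 0 is the current step."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# With gamma = 0.5, three rewards of 1 count as 1 + 0.5 + 0.25.
example_return = discounted_return([1, 1, 1], gamma=0.5)
```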
The Markov assumption
• MDP assumes that state transitions are
Markovian
$P(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0, a_t, a_{t-1}, \ldots, a_0) = P_T(s_{t+1} \mid s_t, a_t)$
Expected reward for an action
• Expected cumulative reward Q(s,a) for taking a particular
action from a particular state can be computed by
Bellman equation:
• Expected cumulative reward for a given state/action pair
is:
– immediate reward for current state
– + expected discounted utility of all possible next states s’
– Weighted by probability of moving to that state s’
– And assuming once there we take optimal action a’
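The four parts of the verbal description above match the standard textbook form of the Bellman equation. Here R(s,a) is the immediate reward, γ the discount factor, and P(s'|s,a) the transition probability; this notation is assumed to line up with the rest of the slides.

```latex
Q(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q(s', a')
```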
What we need for Bellman equation
• A model of p(s’|s,a)
• Estimate of R(s,a)
• How to get these?
• If we had labeled training data
– P(s’|s,a) = C(s,s’,a)/C(s,a)
• If we knew the final reward for whole dialogue
R(s1,a1,s2,a2,…,sn)
• Given these parameters, can use value iteration
algorithm to learn Q values (pushing back reward values
over state sequences) and hence best policy
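Given a transition model and rewards, value iteration over Q values might look roughly like the sketch below. The three-state toy dialogue ("empty", "filled", "done"), its probabilities, and its rewards are invented for illustration; they are not the Day-and-Month system or NJFun.

```python
def q_value_iteration(states, actions, P, R, gamma=0.9, n_iters=100):
    """Repeatedly apply the Bellman update to every (state, action) pair.

    P[(s, a)] is a dict {next_state: probability}; a missing entry means
    the state is terminal for that action. R[(s, a)] is the immediate reward.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        new_Q = {}
        for s in states:
            for a in actions:
                # Expected value of the best action in each possible next state.
                future = sum(
                    prob * max(Q[(s2, a2)] for a2 in actions)
                    for s2, prob in P.get((s, a), {}).items()
                )
                new_Q[(s, a)] = R.get((s, a), 0.0) + gamma * future
        Q = new_Q
    return Q


def greedy_policy(Q, states, actions):
    """Pick, in each state, the action with the highest Q value."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}


# Hypothetical toy problem: ask until the slot is filled, then submit.
states = ["empty", "filled", "done"]
actions = ["ask", "submit"]
P = {
    ("empty", "ask"): {"filled": 0.8, "empty": 0.2},
    ("empty", "submit"): {"done": 1.0},
    ("filled", "ask"): {"filled": 1.0},
    ("filled", "submit"): {"done": 1.0},
}
R = {
    ("empty", "ask"): -1.0,
    ("empty", "submit"): -10.0,
    ("filled", "ask"): -1.0,
    ("filled", "submit"): 10.0,
}

Q = q_value_iteration(states, actions, P, R)
policy = greedy_policy(Q, states, actions)
```

In this toy setup the learned policy asks while the slot is empty and submits once it is filled, which is the "pushing back" of the final reward onto earlier states.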
Final reward
• What is the final reward for whole dialogue
R(s1,a1,s2,a2,…,sn)?
• This is what our automatic evaluation metric
PARADISE computes!
• The general goodness of a whole dialogue!!!!!
How to estimate p(s’|s,a) without labeled data
• Have random conversations with real people
– Carefully hand-tune small number of states and policies
– Then can build a dialogue system which explores state space by
generating a few hundred random conversations with real
humans
– Set probabilities from this corpus
• Have random conversations with simulated people
– Now you can have millions of conversations with simulated
people
– So you can have a slightly larger state space
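Setting probabilities from a corpus of conversations, real or simulated, comes down to the relative-frequency estimate P(s'|s,a) = C(s,s',a)/C(s,a). A minimal sketch follows, assuming each logged dialogue is a list of (state, action, next_state) triples; that log format and the tiny corpus are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(dialogues):
    """Return P[(s, a)][s_next] = C(s, s_next, a) / C(s, a)."""
    pair_counts = Counter()
    triple_counts = Counter()
    for dialogue in dialogues:
        for s, a, s_next in dialogue:
            pair_counts[(s, a)] += 1
            triple_counts[(s, a, s_next)] += 1

    P = defaultdict(dict)
    for (s, a, s_next), count in triple_counts.items():
        P[(s, a)][s_next] = count / pair_counts[(s, a)]
    return dict(P)


# Two invented conversations: asking succeeds twice and fails once.
corpus = [
    [("empty", "ask", "filled"), ("filled", "submit", "done")],
    [("empty", "ask", "empty"), ("empty", "ask", "filled"),
     ("filled", "submit", "done")],
]
P_hat = estimate_transitions(corpus)
```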
An example
• Singh, S., D. Litman, M. Kearns, and M. Walker. 2002.
Optimizing Dialogue Management with Reinforcement
Learning: Experiments with the NJFun System. Journal
of AI Research.
• NJFun system, people asked questions about
recreational activities in New Jersey
• Idea of paper: use reinforcement learning to make a
small set of optimal policy decisions
Very small # of states and acts
• States: specified by values of 8 features
– Which slot in frame is being worked on (1-4)
– ASR confidence value (0-5)
– How many times a current slot question had been asked
– Restrictive vs. non-restrictive grammar
– Result: 62 states
• Actions: each state has only 2 possible actions
– Asking questions: System versus user initiative
– Receiving answers: explicit versus no confirmation.
Ran system with real users
• 311 conversations
• Simple binary reward function
– 1 if completed task (finding museums, theater, wine tasting in NJ
area)
– 0 if not
• System learned good dialogue strategy: Roughly
– Start with user initiative
– Backoff to mixed or system initiative when re-asking for an
attribute
– Confirm only at lower confidence values
State of the art
• Only a few such systems
– From (former) ATT Laboratories researchers, now dispersed
– And Cambridge UK lab
• Hot topics:
– Partially observable MDPs (POMDPs)
– We don’t REALLY know the user’s state (we only know what we
THOUGHT the user said)
– So need to take actions based on our BELIEF, i.e. a probability
distribution over states rather than the “true state”
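To show what "a probability distribution over states" means in practice, here is a simplified Bayes-rule sketch that reweights a belief by how likely the ASR output is under each candidate state. It deliberately leaves out the state-transition step of a full POMDP belief update, and the month values and likelihoods are invented.

```python
def update_belief(belief, obs_likelihood):
    """Reweight belief[s] by P(observation | s) and renormalize.

    belief:         {state: prior probability}
    obs_likelihood: {state: P(observation | state)}; missing states count as 0
    """
    unnormalized = {s: p * obs_likelihood.get(s, 0.0) for s, p in belief.items()}
    total = sum(unnormalized.values())
    if total == 0:
        # Observation is impossible under every state; keep the prior.
        return dict(belief)
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical: the recognizer output fits "November" better than "December".
prior = {"November": 0.5, "December": 0.5}
posterior = update_belief(prior, {"November": 0.8, "December": 0.2})
```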