Building and Evaluating SDS
Julia Hirschberg
LSA07 353
7/15/2016
Today
• Natural Language Understanding (NLU) in
SDS
– Meaning representations
• Evaluating SDS
The NLU Component
• What kinds of meanings do we want to capture?
– Entities, actions, events, times…
– E.g. an airline travel domain
• Entities: flights, airlines, cities
• Actions: look up, reserve, fly
• Modifiers: first, next, latest
• Events: departures, arrivals
• Temporals: dates, hours
• How should we represent these in our system?
– What does an SDS need from a meaning
representation?
– Be able to
• Answer questions
– When does the first flight from Boston to Philadelphia
leave on July 7?
• Determine truth
– Is there a flight from Boston to Philadelphia?
• Draw inferences
– Is there a connecting flight to Denver?
Most SDS use Semantic Grammars
• Robust parsing, or information extraction
• Define grammars specifically in terms of the
semantic information we want to extract
– Domain specific rules correspond to items in
the domain
I want to go from Boston to Baltimore on Thursday,
September 24th
• TripRequest → Greeting Need-specification travel-verb from City to City on Date
• Greeting → {Hello|Hi|Um|ε}
• Need-specification → {I want|I need}
• …
– Grammar elements can be mapped directly to slots in semantic frames, e.g. for the request above:

  From-City   To-City     Date      Airline
  Boston      Baltimore   9/24/07   ?
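Below is a minimal sketch, in Python, of how a rule like the TripRequest rule above might be implemented and mapped onto such a frame. The regular-expression grammar, the tiny city/date lexicon, and the function name parse_trip_request are hypothetical, for illustration only; a deployed system would use a real robust parser with far more rules.

import re

# Toy domain-specific semantic grammar: rules written directly in terms of
# the slots we want to fill (lexicon and rule are illustrative, not real).
CITY = r"(Boston|Baltimore|Philadelphia|Denver)"
DATE = r"((?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day(?:,\s*\w+\s+\d+\w*)?)"

# TripRequest -> Greeting Need-specification travel-verb from City to City on Date
TRIP_REQUEST = re.compile(
    rf"(?:hello|hi|um)?\s*(?:i want|i need)\s+to\s+(?:go|fly|travel)\s+"
    rf"from\s+{CITY}\s+to\s+{CITY}\s+on\s+{DATE}",
    re.IGNORECASE,
)

def parse_trip_request(utterance):
    """Map a matched rule directly onto the slots of a semantic frame."""
    m = TRIP_REQUEST.search(utterance)
    if m is None:
        return None                        # user went outside the grammar
    return {"From-City": m.group(1),
            "To-City": m.group(2),
            "Date": m.group(3),
            "Airline": None}               # unfilled slot: system must ask

print(parse_trip_request(
    "I want to go from Boston to Baltimore on Thursday, September 24th"))

Note how each group in the rule corresponds directly to a frame slot, and how the unfilled Airline slot tells the dialogue manager what still needs to be asked.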
Drawbacks of Semantic Grammars
• Lack of generality
– A new one for each application
– Very costly in development time
– Constant updates
• Can be very large, depending on how much
coverage you want them to have
• If users go outside the grammar, things may break disastrously
I want to leave from my house at 10 a.m.
I want to talk to a person.
Semantic HMMs
• Train sequential model on labeled corpora to
identify semantic slots from user input
I want /Need-spec   to go /Travel-verb   to Boston /To-City   on Thursday /Date.
• Relies on training corpus to provide
– Likely sequences of semantic slots
– Likely words associated with those slots
• Can be used to fill slots in template
– Or to directly generate a response while
processing input
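Below is a minimal sketch of this kind of slot tagger: transition and emission counts are estimated from a tiny hand-labeled corpus and a Viterbi search recovers the most likely slot sequence. The corpus, the label set, and the add-alpha smoothing are invented for illustration; a real system would train on a large annotated corpus.

from collections import defaultdict
import math

# Toy hand-labeled corpus of (word, slot) pairs (invented for illustration).
corpus = [
    [("i", "NEED"), ("want", "NEED"), ("to", "O"), ("go", "TRAVEL"),
     ("to", "O"), ("boston", "TO-CITY"), ("on", "O"), ("thursday", "DATE")],
    [("i", "NEED"), ("need", "NEED"), ("to", "O"), ("fly", "TRAVEL"),
     ("to", "O"), ("denver", "TO-CITY"), ("on", "O"), ("friday", "DATE")],
]

# Count likely slot sequences and likely words for each slot.
trans = defaultdict(lambda: defaultdict(int))
emit = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    prev = "<s>"
    for word, slot in sent:
        trans[prev][slot] += 1
        emit[slot][word] += 1
        prev = slot

def logp(table, key, item, alpha=0.5):
    """Crude add-alpha smoothed log probability from a count table."""
    counts = table[key]
    total = sum(counts.values())
    return math.log((counts.get(item, 0) + alpha) / (total + alpha * (len(counts) + 1)))

def viterbi(words):
    """Most likely slot sequence for the input words."""
    states = list(emit.keys())
    best = {s: (logp(trans, "<s>", s) + logp(emit, s, words[0]), [s]) for s in states}
    for w in words[1:]:
        new = {}
        for s in states:
            score, path = max((best[p][0] + logp(trans, p, s) + logp(emit, s, w),
                               best[p][1]) for p in states)
            new[s] = (score, path + [s])
        best = new
    return max(best.values())[1]

utt = "i want to fly to boston on friday".split()
print(list(zip(utt, viterbi(utt))))

Given the decoded slot sequence, the words tagged TO-CITY, DATE, etc. can be copied directly into the frame, with no hand-written grammar rules.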
Managing User Input
• Rely heavily on knowledge of task and
constraints on user actions
– Handle fairly sophisticated phenomena
I want to go to Boston next Thursday.
I want to leave from there on Friday for Baltimore.
• TripRequest → Need-spec travel-verb from City on Date for City
• Dialogue postulates and simple discourse history
handle reference resolution
• Use prior probabilities based on the training corpus to resolve ambiguities (a toy sketch follows after this slide), e.g. which city is the origin and which the destination in
Boston Baltimore on July 12
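A toy sketch of that prior-based disambiguation: given the fragment "Boston Baltimore on July 12", choose the (origin, destination) reading with the higher prior. The counts below are made up; in practice they would be estimated from the training corpus.

# Priors over (origin, destination) pairs, estimated from a training corpus
# (the counts here are invented for illustration).
from_to_counts = {("boston", "baltimore"): 130,
                  ("baltimore", "boston"): 45}

def resolve(city_a, city_b):
    """Pick the (origin, destination) order with the higher prior probability."""
    total = sum(from_to_counts.values())
    p_ab = from_to_counts.get((city_a, city_b), 0) / total
    p_ba = from_to_counts.get((city_b, city_a), 0) / total
    return (city_a, city_b) if p_ab >= p_ba else (city_b, city_a)

print(resolve("boston", "baltimore"))   # ('boston', 'baltimore')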
Priming User Input
• Hypothesis: Users tend to use the vocabulary
and syntax the system uses
– Lexical entrainment experiments (Clark &
Brennan ’96)
– Re-use of system prompt vocabulary/syntax:
Please tell me where you would like to leave/depart
from.
I would like to leave/depart from Boston…
• Evidence from KTH data collections in the field
User Responses to Vaxholm
The answers to the question:
“What weekday do you want to go?”
(Vilken veckodag vill du åka?)
• 22%  Friday (fredag)
• 11%  I want to go on Friday (jag vill åka på fredag)
• 11%  I want to go today (jag vill åka idag)
• 7%   on Friday (på fredag)
• 6%   I want to go a Friday (jag vill åka en fredag)
• -    are there any hotels in Vaxholm? (finns det några hotell i Vaxholm)
Verb Priming: How often do you go abroad on
holiday?
Two system prompt variants, differing only in the verb:
– Hur ofta åker du utomlands på semestern? (åka, 'go')
– Hur ofta reser du utomlands på semestern? (resa, 'travel')

Responses to the åka prompt:
• jag åker en gång om året kanske
• jag åker ganska sällan utomlands på semester
• jag åker nästan alltid utomlands under min semester
• jag åker ungefär 2 gånger per år utomlands på semester
• jag åker utomlands nästan varje år
• jag åker utomlands på semestern varje år
• jag åker utomlands ungefär en gång om året
• jag är nästan aldrig utomlands
• en eller två gånger om året
• en gång per semester
• kanske en gång per år
• ungefär en gång per år
• åtminståne en gång om året
• nästan aldrig

Responses to the resa prompt:
• jag reser en gång om året utomlands
• jag reser inte ofta utomlands på semester det blir mera i arbetet
• jag reser reser utomlands på semestern vartannat år
• jag reser utomlands en gång per semester
• jag reser utomlands på semester ungefär en gång per år
• jag brukar resa utomlands på semestern åtminståne en gång i året
• en gång per år kanske
• en gång vart annat år
• varje år
• vart tredje år ungefär
• nu för tiden inte så ofta
• varje år brukar jag åka utomlands

Note that full-sentence answers almost always re-use the verb of the prompt the user heard (åker vs. reser).
Results
• reuse: 52%
• ellipsis: 18%
• other: 24%
• no reuse: 4%
• no answer: 2%
Issue: Training and SDS
• Is implicit training ‘better’ than explicit?
• Is it easier to train the user than to retrain the system?
Evaluating Dialogue Systems
• PARADISE framework (Walker et al ’00)
• “Performance” of a dialogue system is affected
both by what gets accomplished by the user and
the dialogue agent and how it gets
accomplished
• Maximize Task Success
• Minimize Costs: Efficiency Measures, Qualitative Measures
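In the PARADISE papers this trade-off is made concrete as a weighted linear combination of a normalized task-success measure and normalized cost measures; a sketch of the general form (from Walker et al., not spelled out on this slide):

Performance = \alpha \cdot N(\kappa) - \sum_{i} w_i \cdot N(c_i)

where \kappa is the task-success measure (next slide), the c_i are cost measures, N is a z-score normalization, and the weights \alpha and w_i are estimated by the regression described below.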
Task Success
• Task goals seen as an Attribute-Value Matrix (AVM)
ELVIS e-mail retrieval task (Walker et al ‘97)
“Find the time and place of your meeting with
Kim.”
  Attribute             Value
  Selection Criterion   Kim or Meeting
  Time                  10:30 a.m.
  Place                 2D516
• Task success defined by the match between the AVM values at the end of the dialogue and the "true" values for the AVM
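The match is typically scored with the kappa coefficient computed over a confusion matrix of the dialogue's AVM values against the key, which corrects for chance agreement:

\kappa = \frac{P(A) - P(E)}{1 - P(E)}

where P(A) is the observed agreement with the key and P(E) the agreement expected by chance, so \kappa = 1 is perfect task success and \kappa = 0 is chance-level agreement.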
Metrics
• Efficiency of the Interaction: User Turns, System Turns, Elapsed Time
• Quality of the Interaction: ASR rejections, Time Out
Prompts, Help Requests, Barge-Ins, Mean
Recognition Score (concept accuracy), Cancellation
Requests
• User Satisfaction
• Task Success: perceived completion, information
extracted
Experimental Procedures
• Subjects given specified tasks
• Spoken dialogues recorded
• Cost factors, states, dialog acts automatically logged; ASR accuracy and barge-in hand-labeled
• Users specify task solution via web page
• Users complete User Satisfaction surveys
• Use multiple linear regression to model User
Satisfaction as a function of Task Success and
Costs; test for significant predictive factors
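Below is a toy sketch of that last regression step, assuming per-dialogue logs with a perceived-completion flag (COMP), a mean recognition score (MRS), an elapsed time (ET), and a summed satisfaction survey. All numbers are invented; a real study would also test each predictor for significance.

import numpy as np

# Invented per-dialogue measures, purely to show the fitting step.
COMP = np.array([1, 1, 0, 1, 0, 1], dtype=float)   # perceived task completion
MRS  = np.array([.92, .81, .60, .88, .55, .95])     # mean recognition score
ET   = np.array([210, 340, 500, 260, 620, 190.])    # elapsed time (seconds)
SAT  = np.array([31, 27, 15, 30, 12, 33.])          # user-satisfaction survey sum

def zscore(x):
    return (x - x.mean()) / x.std()

# Normalize predictors so the fitted weights are comparable, then fit
# SAT = b0 + b1*COMP + b2*MRS + b3*ET by ordinary least squares.
X = np.column_stack([np.ones_like(SAT), zscore(COMP), zscore(MRS), zscore(ET)])
coef, *_ = np.linalg.lstsq(X, SAT, rcond=None)
print(dict(zip(["intercept", "COMP", "MRS", "ET"], coef.round(2))))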
User Satisfaction:
Sum of Many Measures
• Was Annie easy to understand in this conversation? (TTS Performance)
• In this conversation, did Annie understand what you said? (ASR Performance)
• In this conversation, was it easy to find the message you wanted? (Task Ease)
• Was the pace of interaction with Annie appropriate in this conversation? (Interaction Pace)
• In this conversation, did you know what you could say at each point of the dialog? (User Expertise)
• How often was Annie sluggish and slow to reply to you in this conversation? (System Response)
• Did Annie work the way you expected her to in this conversation? (Expected Behavior)
• From your current experience with using Annie to get your email, do you think you'd use Annie regularly to access your mail when you are away from your desk? (Future Use)
Performance Functions from Three Systems
• ELVIS: User Sat. = .21*COMP + .47*MRS - .15*ET
• TOOT: User Sat. = .35*COMP + .45*MRS - .14*ET
• ANNIE: User Sat. = .33*COMP + .25*MRS + .33*Help
– COMP: User perception of task completion
(task success)
– MRS: Mean recognition score (cost)
– ET: Elapsed time (cost)
– Help: Help requests (cost)
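Once fitted, such a function can be applied to (normalized) measures from a new dialogue to predict user satisfaction. A toy application of the ELVIS function above, with invented input values:

def elvis_user_sat(comp, mrs, et):
    """User Sat. = .21*COMP + .47*MRS - .15*ET (normalized inputs)."""
    return 0.21 * comp + 0.47 * mrs - 0.15 * et

print(elvis_user_sat(comp=1.0, mrs=0.8, et=-0.5))   # 0.661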
Performance Model
• Perceived task completion and mean recognition
score are consistently significant predictors of
User Satisfaction
• Performance model useful for system
development
– Making predictions about system
modifications
– Distinguishing ‘good’ dialogues from ‘bad’
dialogues
• But can we also tell on-line when a dialogue is 'going wrong'?
Next Class
• J&M 22.5
• Jurafsky et al ’98
• Rosset & Lamel ‘04