Characterizing Task-Oriented Dialog using a
Simulated ASR Channel
Jason D. Williams
Machine Intelligence Laboratory
Cambridge University Engineering Department
SACTI-1 Corpus
Simulated ASR-Channel – Tourist Information
• Motivation for the data collection
• Experimental set-up
• Transcription & Annotation
• Effects of ASR error rate on…
  – Turn length / dialog length
  – Perception of error rate
  – Task completion
  – "Initiative"
  – Overall satisfaction (PARADISE)
ASR channel vs. HH channel
Properties

HH dialog:
• "Instant" communication
• Effectively perfect recognition of words
• Prosodic information carries additional information

ASR channel:
• Turns explicitly segmented
• Barge-in, end-pointed
• Prosody virtually eliminated
• ASR & parsing errors

Observations

HH dialog:
• Frequent but brief overlaps
• 80% of utterances contain fewer than 12 words; 50% < 5
• Approximately equal turn length
• Approximately equal balance of initiative
• About half of turns are ACK (often spliced)

ASR channel:
• Few overlaps
• Longer system turns; shorter user turns
• Initiative more often with system
• Virtually no turns are ACK
• Virtually no splicing
Are models of HC dialog/grounding appropriate in the presence of the ASR channel?
My approach
1. Study the ASR channel in the abstract
   – WoZ experiments using a simulated ASR channel
   – Understand how people behave with an "ideal" dialog manager
     • For example, grounding model
     • Use these insights to inform state space and action set selection
   – Note that the collected data has unique properties useful to:
     • RL-based systems
     • Hidden-state estimation
     • User modeling
2. Formulate the dialog management problem as a POMDP (see the sketch after this list)
   – Decompose state into BN nodes – for example:
     • Conversation state (grounding state)
     • User action
     • User belief (goal)
   – Train using the data collected
   – Solve using approximations
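To make the POMDP formulation concrete, here is a minimal sketch of a factored belief update over a user-goal node; the goal/action sets and the three distributions (p_goal, p_user_act, p_obs) are toy placeholders of my own, not models trained on the collected data.

    # Minimal sketch of a factored belief update for the POMDP formulation above.
    # The goal/action sets and the three distributions are toy placeholders, not
    # models trained on the SACTI-1 corpus.

    GOALS = ["hotel", "restaurant"]                                # placeholder user goals
    USER_ACTS = ["inform(hotel)", "inform(restaurant)", "affirm"]  # placeholder user actions

    def p_goal(new_goal, old_goal):
        # P(g' | g): assume the user's goal persists within a dialog
        return 1.0 if new_goal == old_goal else 0.0

    def p_user_act(user_act, goal, system_act):
        # P(a_u | g', a_m): users mostly state their own goal (invented numbers)
        return 0.6 if user_act == "inform(%s)" % goal else 0.2

    def p_obs(observation, user_act):
        # P(o | a_u): crude keyword match against the noisy ASR string
        keyword = user_act.replace("inform(", "").rstrip(")")
        return 0.8 if keyword in observation else 0.1

    def belief_update(belief, system_act, observation):
        """One step of b'(g') ∝ Σ_a_u P(o|a_u) P(a_u|g',a_m) Σ_g P(g'|g) b(g)."""
        new_belief = {}
        for g_new in GOALS:
            predicted = sum(p_goal(g_new, g_old) * p for g_old, p in belief.items())
            likelihood = sum(p_obs(observation, a) * p_user_act(a, g_new, system_act)
                             for a in USER_ACTS)
            new_belief[g_new] = predicted * likelihood
        total = sum(new_belief.values())
        return {g: p / total for g, p in new_belief.items()} if total else new_belief

    # Example: a uniform prior sharpens toward "hotel" after one noisy observation.
    belief = {g: 1.0 / len(GOALS) for g in GOALS}
    belief = belief_update(belief, "ask_goal", "i am looking for a hotel near the square")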
The paradox of “dialog data”
• To build a user model, we need to see the user's reaction to all kinds of misunderstandings
• However, most systems use a fixed policy
  – Systems typically do not take different actions in the same situation
  – Taking random actions is clearly not an option!
  – Constraining actions means building very complex systems…
• … and which actions should be in the system's repertoire?
An ideal data collection…
• …would show users' reactions to a variety of error-handling strategies (no fixed policy)
  – BUT would not consist of nonsense dialogs!
• …would use the ASR channel
• …would explore a variety of operating conditions – e.g., WER
• …would not assume a particular state space
• …would somehow "discover" the set of system actions
Data collection set-up
ASR simulation state machine
• Simple energy-based barge-in
• User interrupts wizard

[State diagram: SILENCE → USER_TALKING (user starts talking) → TYPIST_TYPING (user stops talking) → WIZARD_TALKING (typist done; reco result displayed) → SILENCE (wizard stops talking); the wizard may also start talking from SILENCE, and a user who starts talking over the wizard (barge-in) returns the machine to USER_TALKING]
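A small sketch of this turn-taking state machine; the transition table is read off the state and edge labels in the diagram above, and the exact barge-in handling is my assumption rather than a specification of the wizard tool.

    # Sketch of the turn-taking state machine above. The transition table is read
    # off the diagram's state/edge labels; exact barge-in handling is an assumption.

    SILENCE, USER_TALKING, TYPIST_TYPING, WIZARD_TALKING = (
        "SILENCE", "USER_TALKING", "TYPIST_TYPING", "WIZARD_TALKING")

    TRANSITIONS = {
        (SILENCE, "user_starts_talking"): USER_TALKING,
        (SILENCE, "wizard_starts_talking"): WIZARD_TALKING,
        (USER_TALKING, "user_stops_talking"): TYPIST_TYPING,
        (TYPIST_TYPING, "typist_done_reco_displayed"): WIZARD_TALKING,
        (WIZARD_TALKING, "wizard_stops_talking"): SILENCE,
        # Energy-based barge-in: a user who interrupts the wizard grabs the floor.
        (WIZARD_TALKING, "user_starts_talking"): USER_TALKING,
    }

    def step(state, event):
        """Return the next state; events that don't apply leave the state unchanged."""
        return TRANSITIONS.get((state, event), state)

    # Example: one exchange ending in a barge-in.
    state = SILENCE
    for event in ("user_starts_talking", "user_stops_talking",
                  "typist_done_reco_displayed", "user_starts_talking"):
        state = step(state, event)    # ends in USER_TALKING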
ASR simulation
• Simplified FSM-based recognizer
  – Weighted finite state transducer (WFST)
• Flow (a simplified word-level sketch of error simulation follows below):
  – Reference input
  – Spell-checked against full dictionary
  – Converted to phonetic string using full dictionary
  – Phonetic lattice generated based on confusion model
  – Word lattice produced
  – Language model composed to re-score lattice
  – "Decoded" to produce word strings
  – N-best list extracted
• Various free variables to induce random behavior
• Plumb N-best list for variability
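To make the idea of tunable, randomized errors concrete, here is a much-simplified word-level error simulator in the spirit of the naïve insertion/substitution/deletion baseline compared on the next slide; it is not the WFST pipeline described above, and the vocabulary and error rates are invented.

    # Word-level error simulator in the spirit of the naive insertion/substitution/
    # deletion baseline on the next slide (NOT the WFST pipeline described above).
    # The vocabulary and per-word error rate are invented free variables.
    import random

    VOCAB = ["hotel", "restaurant", "castle", "square", "near", "the", "a", "cheap"]

    def corrupt(reference, word_error_rate=0.3, rng=random):
        """Randomly substitute, delete, and insert words in a reference string."""
        hypothesis = []
        for word in reference.split():
            r = rng.random()
            if r < word_error_rate / 3:                # substitution
                hypothesis.append(rng.choice(VOCAB))
            elif r < 2 * word_error_rate / 3:          # deletion
                continue
            else:                                      # keep the word
                hypothesis.append(word)
            if rng.random() < word_error_rate / 3:     # insertion after this word
                hypothesis.append(rng.choice(VOCAB))
        return " ".join(hypothesis)

    def n_best(reference, n=5, word_error_rate=0.3):
        """Crude stand-in for the N-best list the WFST pipeline extracts."""
        return [corrupt(reference, word_error_rate) for _ in range(n)]

    # Example: simulated recognition of one user utterance.
    print(n_best("i am looking for a cheap hotel near the square", n=3))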
ASR simulation evaluation
• Hypothesis: the simulation produces errors similar to errors induced by additive noise w.r.t. concept accuracy
• Assess concept accuracy as F-measure using an automated, data-driven procedure (HVS model) – see the F-measure sketch after this list
• Plot concept accuracy of:
  – Real additive noise
  – A naïve confusion model (simple insertion, substitution, deletion)
  – WFST confusion model
• WFST appears to follow the real data much more closely than the naïve model
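For reference, concept accuracy as an F-measure can be computed as below over sets of reference vs. hypothesized concepts; the flat attribute=value concepts are a generic stand-in for illustration, not the automated HVS-model procedure used in the evaluation.

    # Concept accuracy as an F-measure over reference vs. hypothesized concept sets.
    # The flat attribute=value concepts are a generic stand-in, not the automated
    # HVS-model procedure used for the evaluation.

    def concept_f_measure(reference, hypothesis):
        """F1 over concept sets: harmonic mean of precision and recall."""
        ref, hyp = set(reference), set(hypothesis)
        correct = len(ref & hyp)
        if correct == 0:
            return 0.0
        precision = correct / len(hyp)
        recall = correct / len(ref)
        return 2 * precision * recall / (precision + recall)

    # Example: the simulated ASR channel drops one concept and inserts another.
    ref = {"type=hotel", "area=main_square", "rooms=ensuite"}
    hyp = {"type=hotel", "area=main_square", "price=cheap"}
    print(concept_f_measure(ref, hyp))    # precision 2/3, recall 2/3 -> F1 ≈ 0.667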
Scenario & Tasks
• Tourist / tourist-information scenario
  – Intentionally goal-directed
  – Intentionally simple tasks
  – Mixtures of simple information gathering and basic planning
• Wizard (information giver)
  – Access to bus times, tram times, restaurants, hotels, bars, tourist attraction information, etc.
• User given a series of tasks
• Likert scores asked at the end of each task
• 4 dialogs / user; 3 users / wizard
Example task: Finding the perfect hotel
You’re looking for a hotel for you and your travelling partner that meets a number of requirements.
You’d like the following:
• En suite rooms
• Quiet rooms
• As close to the main square as possible
Given those desires, find the least expensive hotel. You'd prefer not to compromise on your
requirements, but of course you will if you must!
Please indicate the location of the hotel on the map and fill in the boxes below.
Name of accommodation
Cost per night for 2 people
User’s Map
[User's map of the fictional town: streets (e.g., Fountain Road, Cascade Road, Nine Street, Alexander Street, North Road, Park Road, Castle Loop, West Loop, South Bridge) and landmarks (castle, fountain, cinema, museum, tourist info, post office, shopping area, main square, art square, tower, park); unlike the wizard's map, it carries no hotel, restaurant, or bar markers]
Wizard’s map
[Wizard's map: the same town map, additionally marked with hotel (H1–H6), restaurant (R1–R6), and bar (B1–B6) locations]
Likert-scale questions
• User/wizard each given 6 questions after each task
• Responses on a 7-point scale: Disagree strongly (1), Disagree (2), Disagree somewhat (3), Neither agree nor disagree (4), Agree somewhat (5), Agree (6), Strongly agree (7)

Subject example:
1. In this task, I accomplished the goal.
2. In this task, I thought the speech recognition was accurate.
3. In this task, I found it difficult to communicate because of the speech recognition.
4. In this task, I believe the other subject was very helpful.
5. In this task, the other subject found using the speech recognition difficult.
6. Overall, I was very satisfied with this past task.
Transcription
• User-side transcribed during experiments
  – Prioritized for speed
    "I NEED UH I'M LOOKING FOR A PIZZA"
• Wizard-side transcribed using a subset of LDC transcription guidelines – more detail
    "ok %uh% (()) sure you -- i can"
    epErrorEnd=true
Annotation (acts)
• Each turn is a sequence of tags
• Inspired by Traum's "Grounding Acts"
  – More detailed / easier to infer from surface words

In the tag definitions below, OS refers to the other speaker.

Tag                 Meaning
Request             Question/request requiring response
Inform              Statement/provision of task information
Greet-Farewell      "Hello", "How can I help," "that's all", "Thanks", "Goodbye", etc.
ExplAck             Explicit statement of acknowledgement, showing speaker understanding of OS
Unsolicited-Affirm  Explicit statement of acknowledgement, showing OS understands speaker
HoldFloor           Explicit request for OS to wait
ReqRepeat           Request for OS to repeat their last turn
ReqAck              Request for OS to show understanding
RspAffirm           Affirmative response to ReqAck
RspNegate           Negative response to ReqAck
StateInterp         A statement of intention of OS
DisAck              Show of lack of understanding of OS
RejOther            Display of lack of understanding of speaker's intention or desire by OS
Annotation (understanding)
• Each wizard turn was labeled to indicate whether the wizard understood the previous user turn

Label           Wizard's understanding of previous user turn
Full            All intentions understood correctly.
Partial         Some intentions understood; none misunderstood.
Non             Wizard made no guess at user intention.
Flagged-Mis     The wizard formed an incorrect hypothesis of the user's meaning, and signalled a dialog problem.
Un-Flagged-Mis  The wizard formed an incorrect hypothesis of the user's meaning, accepted it as correct, and continued with the dialog.
Corpus summary
WER target   # Wiz   # User   # Task   Completed in time limit   Per-turn WER   Per-dialog WER
None         2       6        24       83 %                      0 %            0 %
Low          4       12       48       83 %                      32 %           28 %
Med          4       12       48       77 %                      46 %           41 %
Hi           2       6        24       42 %                      63 %           60 %
Perception of ASR accuracy
• How accurately do users & wizards perceive WER?
• Perceptions of recognition quality broadly reflected actual performance, but users consistently gave higher quality scores than wizards for the same WER

[Chart: average Likert score (2–6.5) for wizards and users, by WER target (Hi, Med, Low, None)]
Average turn length (words)
• How does WER affect wizard & user turn length?
• Wizard turn length increases
• User turn length stays relatively constant

[Chart: average words per turn (0–35) for wizards and users, by WER target (Hi, Med, Low, None)]
Grounding behavior
• How does WER affect wizard grounding behavior?
• As WER increases, wizard grounding behaviors become increasingly prevalent

[Chart: % of all tags (0–100%) by WER target (Hi, Med, Low, None), broken down by tag: Inform, ExplAck, Request, ReqRepeat, ReqAck, StateInterp, DisAck, all others]
Wizard understanding
• How does WER affect wizard understanding status?
• Misunderstanding increases with WER…
• …and task completion falls (83%, 83%, 77%, 42%)

[Chart: % of wizard turns (0–100%) by WER target (Hi, Med, Low, None), broken down by understanding status: Full, Partial, Non, FlaggedMis, UnFlaggedMis]
Wizard strategies
• Classify each wizard turn into one of 5 "strategies" (a sketch of this mapping follows after the table)

Label      Meaning                                   Wiz init?   Tags
REPAIR     Attempt to repair                         Yes         ReqAck, ReqRepeat, StateInterp, DisAck, RejOther
ASKQ       Ask task question                         Yes         Request
GIVEINFO   Provide task info                         No          Inform
RSPND      Non-initiative-taking grounding actions   No          ExplAck, RspAffirm, RspNegate, Unsolicited-Affirm
OTHER      Not included in analysis                  n/a         All others
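A minimal sketch of the tag-to-strategy mapping in the table above; the precedence order used when a turn carries tags from several strategies is my own assumption, not something stated on the slide.

    # Maps a wizard turn's act tags to one of the five strategies in the table above.
    # The precedence order used when a turn carries tags from several strategies is
    # an assumption of this sketch.

    STRATEGY_TAGS = {
        "REPAIR":   {"ReqAck", "ReqRepeat", "StateInterp", "DisAck", "RejOther"},
        "ASKQ":     {"Request"},
        "GIVEINFO": {"Inform"},
        "RSPND":    {"ExplAck", "RspAffirm", "RspNegate", "Unsolicited-Affirm"},
    }
    PRECEDENCE = ["REPAIR", "ASKQ", "GIVEINFO", "RSPND"]   # assumed tie-breaking order

    def classify_wizard_turn(tags):
        """Return the strategy label for a wizard turn given its sequence of act tags."""
        tag_set = set(tags)
        for strategy in PRECEDENCE:
            if tag_set & STRATEGY_TAGS[strategy]:
                return strategy
        return "OTHER"

    # Example: a turn that acknowledges and then asks a task question -> ASKQ.
    print(classify_wizard_turn(["ExplAck", "Request"]))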
Wizard strategies
• What is the most successful strategy after known dialog trouble?
• This plot shows wizard understanding status one turn after known dialog trouble: effect of 'REPAIR' vs. 'ASKQ'
• 'S' indicates significant differences

[Chart: % of wizard turns (0–100%) one turn after known dialog trouble, for REPAIR vs. ASKQ at the Hi and Med WER targets, broken down by understanding status (Full, Partial, Non, FlaggedMis, UnFlaggedMis); 'S' marks significant differences]
User reactions to misunderstandings
• How does a user respond after being misunderstood?
• Surprisingly little explicit indication!

WER target   User turns including tag:
             DisAck    RejectOther   Request
None         N/A       N/A           N/A
Low          0.0 %     3.8 %         92.3 %
Med          2.5 %     19.0 %        75.9 %
Hi           0.0 %     12.3 %        87.0 %
Level of wizard “initiative”
• How does "initiative" vary with WER?
• Define wizard "initiative" using the strategies, above

[Chart: % of wizard turns (0–100%) by WER target (Hi, Med, Low, None), broken down by strategy: GIVEINFO, RESPOND, ASKQ, REPAIR]
Reward measures/PARADISE
• Satisfaction = Task completion + Dialog cost metrics (see the regression sketch after this list)
• 2 kinds of user satisfaction:
  – Single
  – Combi
• 3 kinds of task completion:
  – User
  – Obj
  – Hyb
• Cost metrics:
  – PerDialogWER
  – %UnFlaggedMis
  – %FlaggedMis
  – %Non
  – Turns
  – %REPAIR
  – %ASKQ
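A minimal sketch of a PARADISE-style fit over these quantities, assuming invented per-dialog values and ordinary least squares over z-normalized predictors; it illustrates the form satisfaction ≈ w0 + w_task·N(task) + Σ w_i·N(cost_i), not the original tooling or data.

    # PARADISE-style regression sketch: satisfaction ≈ w0 + w_task*N(task) + Σ w_i*N(cost_i).
    # The per-dialog numbers are invented placeholders; the fit uses ordinary least
    # squares over z-normalized predictors, not the original tooling.
    import numpy as np

    def znorm(x):
        """Normalize a predictor to zero mean / unit variance."""
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std()

    def paradise_fit(satisfaction, task_completion, cost_metrics):
        """Return (weights, R^2) for satisfaction regressed on normalized predictors."""
        y = np.asarray(satisfaction, dtype=float)
        X = np.column_stack([np.ones(len(y)), znorm(task_completion)] +
                            [znorm(c) for c in cost_metrics])
        weights, *_ = np.linalg.lstsq(X, y, rcond=None)
        fitted = X @ weights
        r_squared = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
        return weights, r_squared

    # Example with made-up per-dialog values: satisfaction, task completion,
    # turns, and %UnFlaggedMis.
    sat   = [6, 5, 3, 2, 6, 4]
    task  = [1, 1, 0, 0, 1, 1]
    turns = [20, 25, 40, 55, 22, 30]
    unflg = [0.0, 0.05, 0.20, 0.35, 0.02, 0.10]
    weights, r2 = paradise_fit(sat, task, [turns, unflg])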
Reward measures/PARADISE
• In almost all experiments using the User task completion metric, it was the only significant predictor
• The single/combi metrics almost always selected the same predictors

Dataset   Metrics (task & user sat)   R2     Significant predictors
ALL       User-S                      52 %   1.03 Task
ALL       User-C                      60 %   5.29 Task – 1.54 %UnFlagMis
ALL       Obj-S                       24 %   -0.49 Turns + 0.38 Task
ALL       Obj-C                       27 %   -2.43 Turns – 1.45 %UnFlagMis + 1.35 Task
ALL       Hyb-S                       41 %   0.74 Task – 0.36 Turns
Hi        Obj-S                       40 %   0.98 Task
Hi        Hyb-S                       48 %   1.07 Task
Med       Obj-S                       16 %   -0.62 %Non
Med       Obj-C                       37 %   -3.35 %Non – 2.94 Turns
Med       Hyb-S                       38 %   0.97 Task
Low       Obj-S                       28 %   -0.59 Turns
Low       Hyb-S                       40 %   -0.49 Turns + 0.40 Task
Reward measures/PARADISE
• What indicators best predict user satisfaction?
• When run on all data, mixtures of Task, Turns, and %UnFlaggedMis best predict user satisfaction.
  – %UnFlaggedMis serves as a better measure of understanding accuracy than WER alone, since it effectively combines recognition accuracy with a measure of confidence.
• Broadly speaking:
  – Task completion is most important at the High WER level
  – Task completion and dialog quality are most important at the Med WER level
  – Efficiency is most important at the Low WER level
• These patterns mirror findings from other PARADISE experiments using Human/Computer data
• This gives us some confidence that this data set is valid for training Human/Computer systems
Conclusions/Next steps
• At moderate WER levels, asking task-related questions appears to be more successful than direct dialog repair.
• Levels of expert "initiative" increase with WER, primarily as a result of grounding behavior.
• Users infrequently give a direct indication of having been misunderstood, with no clear correlation to WER.
• When run on all data, mixtures of Task, Turns, and %UnFlaggedMis best predict user satisfaction.
• Task completion appears to be most predictive of user satisfaction; however, efficiency shows some influence at lower WERs.
• Next… apply this corpus to statistical systems.
Thanks!
Jason D. Williams
jdw30@cam.ac.uk