Belief Updating in Task-Oriented Spoken Dialog Systems

Belief Updating in
Spoken Dialog Systems
Dan Bohus
www.cs.cmu.edu/~dbohus
dbohus@cs.cmu.edu
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA, 15217
problem
spoken language interfaces lack
robustness when faced with understanding
errors.
 stems mostly from speech recognition
 spans most domains and interaction types
more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry I’m not sure I understood what you said. What city are you leaving from ?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I’m still having trouble understanding you. To go on, I need you to answer the
following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay what day would you be departing chicago
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm,
arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at
1:40pm arrives Seoul at ………
non- and misunderstandings
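 NON-understanding: the system obtains no usable interpretation of the user’s turn (the two “Urbana Champaign” turns)
 MIS-understanding: the system obtains a wrong interpretation and acts on it (“Huntsville” heard as SEOUL)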
approaches for increasing robustness
 fix recognition
 gracefully handle errors through
interaction
 detect the problems
 develop a set of recovery strategies
 know how to choose between them (policy)
six not-so-easy pieces …
 misunderstandings: detection, strategies, policy
 non-understandings: detection, strategies, policy
belief updating
 misunderstandings: detection
 construct more accurate beliefs by
integrating information over multiple turns
S: Where would you like to go?
U: Huntsville
[SEOUL / 0.65]
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham
[THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
belief updating: problem statement
 given:
 an initial belief P_initial(C) over concept C
destination = {seoul/0.65}
 a system action SA
S: traveling to Seoul. What day did you need to travel?
 a user response R
[THE TRAVELING TO BERLIN P_M / 0.60]
 construct an updated belief:
 P_updated(C) ← f(P_initial(C), SA, R)
destination = {?}
outline
 related work
 a restricted version
 data
 user response analysis
 experiments and results
 some caveats and future work
related work : restricted version : data : user response analysis : experiment & results : caveats & future work
confidence annotation + heuristic updates
 confidence annotation
 traditionally focused on word-level errors [Chase, Cox, Bansal, Ravishankar]
 more recently: semantic confidence annotation [Walker, San-Segundo, Bohus]
 machine learning approach
 results fairly good, but not perfect
 heuristic updates
 explicit confirmation: no → don’t trust ; yes → trust
 implicit confirmation: no → don’t trust ; otherwise → trust
 suboptimal for several reasons
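The heuristic rules above can be sketched as a small function. This is only an illustration under assumptions: the slide does not say what "trust" and "don't trust" mean numerically (here 1.0 and 0.0), nor what happens after an explicit confirmation when the user says neither yes nor no (here the belief is left unchanged). The name `heuristic_update` is hypothetical.

```python
def heuristic_update(conf_top: float, action: str, response: str) -> float:
    """Heuristic belief update after a confirmation action.

    action:   "explicit" or "implicit" confirmation
    response: "yes", "no", or "other"
    Returns the updated confidence in the confirmed value.
    """
    if response == "no":
        return 0.0        # ASSUMPTION: "don't trust" is modeled as confidence 0
    if action == "explicit":
        if response == "yes":
            return 1.0    # ASSUMPTION: "trust" is modeled as confidence 1
        return conf_top   # ASSUMPTION: unchanged on "other" (not on the slide)
    if action == "implicit":
        return 1.0        # anything other than "no" is taken as acceptance
    raise ValueError(f"unknown action: {action}")


# The running example: destination = {seoul/0.65}, implicitly confirmed.
print(heuristic_update(0.65, "implicit", "other"))
```

A rule of this shape relies entirely on spotting a "no" in the recognized response, which is one way to read "suboptimal" here.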
correction detection
 detect if the user is trying to correct the system
[Litman, Swerts, Hirschberg, Krahmer, Levow]
 machine learning approach
 features from different knowledge sources in the system
 results fairly good, but not perfect
integration
 confidence annotation and correction detection
are useful tools
 but separately, neither solves the problem
 bridge together in a unified approach to
accurately track beliefs
belief updating: general form
 given:
 an initial belief P_initial(C) over concept C
 a system action SA
 a user response R
 construct an updated belief:
 P_updated(C) ← f(P_initial(C), SA, R)
restricted version: 2 simplifications
1. compact belief
 system unlikely to “hear” more than 3 or 4 values
 single vs. multiple recognition results
 in our data: max = 3 values, only 6.9% have >1 value
 confidence score of top hypothesis
2. updates after confirmation actions
reduced problem
 ConfTop_updated(C) ← f(ConfTop_initial(C), SA, R)
data
 collected with RoomLine
 a phone-based mixed-initiative spoken dialog system
 conference room reservation
 search and negotiation
 explicit and implicit confirmations
 confidence threshold model (+ some exploration)
 unplanned implicit confirmations
 I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?
corpus
 user study
 46 participants (naïve users)
 10 scenario-based interactions each
 compensated per task success
 corpus
 449 sessions, 8848 user turns
 orthographically transcribed
 rich annotation: correct concepts, corrections, etc.
user response types
 following Krahmer and Swerts
 study on Dutch train-table information system
 3 user response types
 YES: yes, right, that’s right, correct, etc.
 NO: no, wrong, etc.
 OTHER
 cross-tabulated against correctness of
confirmations
user responses to explicit confirmations
 from transcripts
            YES         NO          Other
CORRECT     94% [93%]   0% [0%]     5% [7%]
INCORRECT   1% [6%]     72% [57%]   27% [37%]
~10%
[numbers in brackets from Krahmer & Swerts]
 from decoded
            YES    NO     Other
CORRECT     87%    1%     12%
INCORRECT   1%     61%    38%
other responses to explicit confirmations
 ~70% users repeat the correct value
 ~15% users don’t address the question
 attempt to shift conversation focus
            User does not correct    User corrects
CORRECT     1159                     0
INCORRECT   29 [10% of incorrect]    250 [90% of incorrect]
user responses to implicit confirmations
 from transcripts
            YES        NO          Other
CORRECT     30% [0%]   7% [0%]     63% [100%]
INCORRECT   6% [0%]    33% [15%]   61% [85%]
[numbers in brackets from Krahmer & Swerts]
 from decoded
            YES    NO     Other
CORRECT     28%    5%     67%
INCORRECT   7%     27%    66%
ignoring errors in implicit confirmations
            User does not correct    User corrects
CORRECT     552                      2
INCORRECT   118 [51% of incorrect]   111 [49% of incorrect]
 users correct later (40% of 118)
 users interact strategically
 correct only if essential
            ~correct later   correct later
~critical   55               2
critical    14               47
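The percentages on this slide can be re-derived from the counts. The sketch below assumes the second table is read with rows "~critical / critical" and columns "~correct later / correct later", with "~" meaning "not"; the variable names are illustrative.

```python
# Incorrect implicit confirmations, counts copied from the slide.
not_corrected_now = 118
corrected_now = 111

# Breakdown of the 118 uncorrected errors: (critical?, corrected later?) -> count
later = {
    (False, False): 55,
    (False, True): 2,
    (True, False): 14,
    (True, True): 47,
}

total_incorrect = not_corrected_now + corrected_now
share_ignored = not_corrected_now / total_incorrect

corrected_later = later[(False, True)] + later[(True, True)]
share_corrected_later = corrected_later / not_corrected_now

critical_total = later[(True, False)] + later[(True, True)]
noncritical_total = later[(False, False)] + later[(False, True)]
share_later_if_critical = later[(True, True)] / critical_total
share_later_if_noncritical = later[(False, True)] / noncritical_total

print(f"errors not corrected right away: {share_ignored:.1%}")
print(f"of those, corrected later:       {share_corrected_later:.1%}")
print(f"  when critical:                 {share_later_if_critical:.1%}")
print(f"  when not critical:             {share_later_if_noncritical:.1%}")
```

Under this reading, nearly all later corrections come from the critical cases, which is what "correct only if essential" suggests.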
machine learning approach
 need good probability outputs
 low cross-entropy between model predictions
and reality
 cross-entropy = negative average log posterior
 logistic regression
 sample efficient
 stepwise approach → feature selection
 logistic model tree for each action
 root splits on response-type
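As a minimal sketch of the evaluation idea above, the function `soft_error` computes cross-entropy as the negative average log probability assigned to the true outcome. The natural logarithm, the clipping constant `eps`, and the 0.5 threshold in `hard_error` are assumptions; the slides do not specify them.

```python
import math


def hard_error(probs, labels, threshold=0.5):
    """Fraction of cases where thresholding the belief gives the wrong answer.

    ASSUMPTION: a belief counts as "value is correct" when it is >= threshold.
    """
    wrong = sum((p >= threshold) != bool(y) for p, y in zip(probs, labels))
    return wrong / len(labels)


def soft_error(probs, labels, eps=1e-12):
    """Cross-entropy: negative average log probability given to the truth."""
    total = 0.0
    for p, y in zip(probs, labels):
        p_true = p if y else 1.0 - p      # probability assigned to what happened
        total += math.log(max(p_true, eps))  # clip to avoid log(0)
    return -total / len(labels)


# Toy data (made up): predicted P(value is correct) and the true labels.
probs = [0.9, 0.6, 0.2, 0.7]
labels = [1, 0, 0, 1]
print(hard_error(probs, labels))
print(soft_error(probs, labels))
```

A model can have a low hard error and still a poor soft error if its probabilities are overconfident, which is why "good probability outputs" is a separate requirement.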
features. target.
 initial situation
 initial confidence score
 concept identity, dialog state, turn number
 system action
 other actions performed in parallel
 features of the user response
 acoustic / prosodic features
 lexical features
 grammatical features
 dialog-level features
 target: was the value correct?
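A logistic regression over such features could look roughly like the sketch below. The feature names and weights are hypothetical and made up for illustration; they are not fitted to any data and are not the models from the talk.

```python
import math


def logistic_update(features: dict, weights: dict, bias: float) -> float:
    """Return P(value is correct) from a weighted sum of features."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# HYPOTHETICAL weights and features, for illustration only.
weights = {"initial_confidence": 2.0, "user_said_no": -3.0, "barge_in": -0.5}
features = {"initial_confidence": 0.65, "user_said_no": 1.0, "barge_in": 0.0}

p = logistic_update(features, weights, bias=0.0)
print(round(p, 3))
```

With these made-up weights, a detected "no" pulls the belief well below the initial 0.65 without forcing it to zero, unlike a hard heuristic rule.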
baselines
 initial baseline
 accuracy of system beliefs before the update
 heuristic baseline
 accuracy of heuristic rule currently used in the system
 oracle baseline
 accuracy if we knew exactly when the user is correcting the
system
results: explicit confirmation
                 Initial   Heuristic   LMT     Oracle
Hard error (%)   31.15     8.41        3.57    2.71
Soft error       0.51      0.19        0.12    -
results: implicit confirmation
                 Initial   Heuristic   LMT     Oracle
Hard error (%)   30.40     23.37       16.15   15.33
Soft error       0.67      0.61        0.43    -
results: unplanned implicit confirmation
                 Initial   Heuristic   LMT     Oracle
Hard error (%)   15.40     14.36       12.64   10.37
Soft error       0.43      0.46        0.34    -
informative features
 initial confidence score
 prosody features
 barge-in
 expectation match
 repeated grammar slots
 concept id
eliminate simplification 1
 current restricted version
 belief = confidence score of top hypothesis
 only 6.9% of cases had more than 1 hypothesis
 extend to
 N hypotheses + 1 (other), where N is a small integer (2 or 3)
 approach: multinomial generalized linear model
 use information from multiple recognition hypotheses
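The "N hypotheses + 1 (other)" representation can be sketched as follows. This is an illustration under assumptions: the function name `compact_belief` and the `<other>` key are hypothetical, and "other" is taken here to hold all probability mass not assigned to the top N values.

```python
def compact_belief(belief: dict, n: int = 2) -> dict:
    """Keep the n most likely values and pool the rest into "<other>"."""
    top = sorted(belief.items(), key=lambda kv: kv[1], reverse=True)[:n]
    compact = dict(top)
    # ASSUMPTION: "<other>" absorbs everything outside the top n,
    # including values the system never heard.
    compact["<other>"] = max(0.0, 1.0 - sum(compact.values()))
    return compact


# Made-up belief over the destination concept.
belief = {"seoul": 0.50, "berlin": 0.30, "birmingham": 0.15}
print(compact_belief(belief, n=2))
```

A multinomial model would then predict an updated distribution over these N + 1 outcomes rather than a single confidence score.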
eliminate simplification 2
 current restricted version
 only updates following system confirmation actions
 users might correct the system at any point
 extend to
 updates after all system actions
shameless self promotion
 misunderstandings: detection, strategies, policy
 non-understandings
 detection: rejection threshold adaptation; non-understanding impact on performance [Interspeech-05]
 strategies: comparative analysis of 10 recovery strategies [SIGdial-05]
 policy: wizard experiment; towards learning non-understanding recovery policies [Sigdial-05]
shameless CMU promotion
 Ananlada (Moss) Chotimongkol
 automatic concept and task structure acquisition
 Antoine Raux
 turn-taking, conversation micro-management
 Jahanzeb Sherwani
 multimodal personal information management
 Satanjeev Banerjee
 meeting understanding
 Stefanie Tomko
 universal speech interface
 Thomas Harris
 multi-participant dialog
 DoD / Young Researchers’ Roundtable
thank you!
a more subtle caveat
 distribution of training data
 confidence annotator + heuristic update rules
 distribution of run-time data
 confidence annotator + learned model
 always a problem when interacting with the
world
 hopefully, distribution shift will not cause large
degradation in performance
 remains to validate empirically
 maybe a bootstrap approach?