“k hypotheses + other” belief updating in spoken dialog systems

Dialogs on Dialogs Talk, March 2006
Dan Bohus
www.cs.cmu.edu/~dbohus
dbohus@cs.cmu.edu
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
problem
spoken language interfaces lack robustness
when faced with understanding errors
 errors stem mostly from speech recognition
 typical word error rates: 20-30%
 significant negative impact on interactions
guarding against understanding errors
 use confidence scores
 machine learning approaches for detecting misunderstandings
[Walker, Litman, San-Segundo, Wright, and others]
 engage in confirmation actions
 explicit confirmation
did you say you wanted to fly to Seoul?
 yes → trust hypothesis
 no → delete hypothesis
 “other” → non-understanding
 implicit confirmation
traveling to Seoul … what day did you need to travel?
 rely on new values overwriting old values
today’s talk …
construct accurate beliefs by integrating
information over multiple turns in a conversation
S: Where would you like to go?
U: Huntsville
[SEOUL / 0.65]
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham
[THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
belief updating: problem statement
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
[THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
 given
 an initial belief Binitial(C) over concept C
 a system action SA
 a user response R
 construct an updated belief
 Bupdated(C) ← f(Binitial(C), SA, R)
outline
 proposed approach
 data
 experiments and results
 effect on dialog performance
 conclusion
belief updating: problem statement
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
[THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
 given
 an initial belief Binitial(C) over concept C
 a system action SA(C)
 a user response R
 construct an updated belief
 Bupdated(C) ← f(Binitial(C),SA(C),R)
belief representation
Bupdated(C) ← f(Binitial(C), SA(C), R)
 most accurate representation
 probability distribution over the set of possible values
 however
 system will “hear” only a small number of conflicting
values for a concept within a dialog session
 in our data
 max = 3 conflicting values heard
 more than 1 value heard in only 6.9% of cases
belief representation
Bupdated(C) ← f(Binitial(C), SA(C), R)
 compressed belief
representation
 k hypotheses + other
 at each turn, the system
retains the top m initial
hypotheses and adds n new
hypotheses from the input
(m+n=k)
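As a minimal sketch of this compression step (the function and the dict-of-confidences representation are my assumptions for illustration, not the system's code):

```python
# Sketch of the "k hypotheses + other" compression: keep the top-m initial
# hypotheses, then add up to n new hypotheses from the input (k = m + n).
def compress(initial, incoming, m, n):
    top_m = dict(sorted(initial.items(), key=lambda kv: -kv[1])[:m])
    fresh = [(v, c) for v, c in sorted(incoming.items(), key=lambda kv: -kv[1])
             if v not in top_m]
    top_m.update(dict(fresh[:n]))
    return top_m

# e.g. with m = n = 1 (so k = 2):
print(compress({"seoul": 0.65, "soul": 0.10}, {"birmingham": 0.60}, m=1, n=1))
```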
belief representation
Bupdated(C) ← f(Binitial(C), SA(C), R)
 B(C) modeled as a multinomial variable
 {h1, h2, … hk, other}
 B(C) = <ch1, ch2, …, chk, cother>

where ch1 + ch2 + … + chk + cother = 1
 belief updating can be cast as multinomial
regression problem:
Bupdated(C) ← Binitial(C) + SA(C) + R
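The multinomial vector can be illustrated as follows (a sketch; the helper name is hypothetical):

```python
# Sketch: B(C) as <c_h1, ..., c_hk, c_other>, with components summing to 1.
def to_multinomial(hyps, k):
    confs = sorted(hyps.values(), reverse=True)[:k]
    confs += [0.0] * (k - len(confs))    # pad unused hypothesis slots
    return confs + [1.0 - sum(confs)]    # remaining mass goes to "other"

b = to_multinomial({"seoul": 0.65}, k=2)   # ≈ [0.65, 0.0, 0.35]
print(abs(sum(b) - 1.0) < 1e-9)            # True: a proper distribution
```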
system action
Bupdated(C) ← f(Binitial(C), SA(C), R)
request
S: For when do you want the room?
U: Friday
[FRIDAY / 0.65]
explicit confirmation
S: Did you say you wanted a room for Friday?
U: Yes
[GUEST / 0.30]
implicit confirmation
S: a room for Friday … starting at what time?
U: starting at ten a.m.
[STARTING AT TEN A_M / 0.86]
unplanned implicit confirmation
S: I found 5 rooms available Friday from 10 until noon. Would you like a small or a large room?
U: not Friday, Thursday
[FRIDAY THURSDAY / 0.25]
no action / unexpected update
S: okay. I will complete the reservation. Please tell me your name or say ‘guest user’ if you are not a registered user.
U: guest user
[THIS TUESDAY / 0.55]
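The action taxonomy above could be encoded as, for example (a hypothetical encoding; the identifier names are mine):

```python
from enum import Enum

# Hypothetical encoding of the system-action taxonomy from the table above;
# a separate update model is later trained per action type.
class SystemAction(Enum):
    REQUEST = "request"
    EXPLICIT_CONFIRM = "explicit confirmation"
    IMPLICIT_CONFIRM = "implicit confirmation"
    UNPLANNED_IMPLICIT_CONFIRM = "unplanned implicit confirmation"
    NO_ACTION = "no action / unexpected update"

print(len(SystemAction))   # 5 action types
```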
user response
Bupdated(C) ← f(Binitial(C), SA(C), R)
acoustic / prosodic
acoustic and language scores, duration, pitch (min, max, mean, range, std. dev., min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause, etc.
lexical
number of words, lexical terms highly correlated with corrections or acknowledgements (selected via mutual information computation)
grammatical
number of slots (new and repeated), parse fragmentation, parse gaps, etc.
dialog
dialog state, turn number, expectation match, new value for concept, timeout, barge-in, concept identity
priors
priors for concept values (manually constructed by a domain expert for 3 of 29 concepts: date, start_time, end_time; uniform assumed otherwise)
confusability
empirically derived confusability scores
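A toy extractor for a few of the listed lexical and dialog features (field names are illustrative, not the system's actual feature set; the acoustic and grammatical features would need the audio and parser):

```python
# Toy sketch computing a handful of the lexical/dialog features listed above.
def extract_features(transcript, barge_in, turn_number, expected_slots, slots):
    words = transcript.lower().split()
    return {
        "num_words": len(words),
        "has_no": int("no" in words),                   # correction marker
        "barge_in": int(barge_in),
        "turn_number": turn_number,
        "expectation_match": int(bool(slots & expected_slots)),
        "new_value": int(bool(slots - expected_slots)),
    }

f = extract_features("no no I'm traveling to Birmingham", barge_in=False,
                     turn_number=3, expected_slots={"date"},
                     slots={"destination"})
print(f["num_words"], f["has_no"], f["expectation_match"])   # 6 1 0
```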
approach
Bupdated(C) ← f(Binitial(C), SA(C), R)
 problem
 <uch1, … uchk, ucoth> ← f(<ich1, … ichk, icoth>, SA(C), R)
 approach: multinomial generalized linear model
 regression model with a multinomial dependent variable
 sample efficient
 stepwise approach
 feature selection
 BIC to control over-fitting
 one model for each system action
 <uch1, … uchk, ucoth> ← fSA(C)(<ich1, … ichk, icoth>, R)
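The stepwise/BIC idea can be sketched generically (a forward-selection toy where `log_lik` is a stand-in for refitting the multinomial GLM on a feature subset; this is not the actual fitting code):

```python
import math

# Forward stepwise feature selection scored by BIC = p*ln(n) - 2*ln(L),
# used here, as on the slide, to control over-fitting.
def bic(log_lik_value, n_params, n_samples):
    return n_params * math.log(n_samples) - 2.0 * log_lik_value

def forward_select(candidates, log_lik, n_samples):
    selected, best = [], bic(log_lik([]), 0, n_samples)
    while True:
        trials = [(bic(log_lik(selected + [f]), len(selected) + 1, n_samples), f)
                  for f in candidates if f not in selected]
        if not trials or min(trials)[0] >= best:
            return selected            # no candidate lowers BIC: stop
        best, chosen = min(trials)
        selected.append(chosen)

# toy likelihood: "conf" helps a lot, "barge_in" a little, "noise" barely
def toy_ll(feats):
    return -100 + 30 * ("conf" in feats) + 5 * ("barge_in" in feats) \
           + 0.1 * ("noise" in feats)

print(forward_select(["conf", "barge_in", "noise"], toy_ll, n_samples=1000))
```

With 1000 samples, "noise" improves the likelihood by less than the BIC penalty per parameter and is rejected.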
outline
 proposed approach
 data
 experiments and results
 effect on dialog performance
 conclusion
data
 collected with RoomLine
 a phone-based mixed-initiative spoken dialog system
 conference room reservation
 explicit and implicit confirmations
 simple heuristic rules for belief updating
 explicit confirm: yes / no
 implicit confirm: new values overwrite old ones
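These heuristic rules amount to roughly the following sketch (function and representation are mine); note how "new values overwrite old ones" fails on the Seoul/Birmingham misrecognition from the earlier example:

```python
# The baseline heuristics described above: explicit confirms resolved by
# yes/no; implicit confirms let new values overwrite old ones.
def heuristic_update(belief, action, response_type=None,
                     new_value=None, new_conf=0.0):
    if action == "explicit_confirm":
        if response_type == "yes":
            return belief              # trust the confirmed hypothesis
        if response_type == "no":
            return {}                  # delete it -> concept unknown
    if action == "implicit_confirm" and new_value is not None:
        return {new_value: new_conf}   # new value overwrites the old one
    return belief

# running example: user said "Birmingham" but ASR heard "Berlin"
print(heuristic_update({"seoul": 0.65}, "implicit_confirm",
                       new_value="berlin", new_conf=0.60))
```

The overwrite rule happily installs the misrecognized "berlin", which is what the learned update models are meant to avoid.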
corpus
 user study
 46 participants (naïve users)
 10 scenario-based interactions each
 compensated per task success
 corpus
 449 sessions, 8848 user turns
 orthographically transcribed
 manually annotated for:
 misunderstandings
 corrections
 correct concept values
outline
 proposed approach
 data
 experiments and results
 effect on dialog performance
 conclusion
baselines
 initial baseline
 accuracy of system beliefs before the update
 heuristic baseline
 accuracy of heuristic update rule used by the system
 oracle baseline
 accuracy if we knew exactly when the user corrects
k=2 hypotheses + other
informative features
 priors and confusability
 initial confidence score
 concept identity
 barge-in
 expectation match
 repeated grammar slots
outline
 proposed approach
 data
 experiments and results
 effect on dialog performance
 conclusion
a question remains …
… does this really matter?
what is the effect on global dialog performance?
let’s run an experiment
guinea pigs from Speech Lab for exp: $0
getting change from guys in the lab: $2/$3/$5
real subjects for the experiment: $25
picture with advisor of the VERY last exp at CMU: priceless!!!!
[courtesy of Mohit Kumar]
a new user study …
 implemented models in RavenClaw, performed a
new user study
 40 participants, first-time users
 10 scenario-driven interactions each
 non-native speakers of North-American English
 improvements more likely at higher WER (supported by empirical evidence)
 between-subjects; 2 gender-balanced groups
 control: RoomLine using heuristic update rules
 treatment: RoomLine using runtime models
effect on task success
task success: control 73.6% vs. treatment 81.3%
even though
average user WER: control 21.9% vs. treatment 24.2%
effect on task success … a closer look
[figure: probability of task success vs. word error rate under the fitted model, control vs. treatment; at 30% WER the treatment condition sustains ~78% task success, the level control achieves at 16% WER, while control drops to ~64%]
Task Success ← 2.09 - 0.05∙WER + 0.69∙Condition (p=0.001)
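Assuming a logistic link for the fitted model (my assumption, but it reproduces the percentages annotated on the slide), with Condition = 0 for control and 1 for treatment:

```python
import math

# Task Success <- 2.09 - 0.05*WER + 0.69*Condition, through a logistic link.
def p_success(wer, condition):
    z = 2.09 - 0.05 * wer + 0.69 * condition
    return 1.0 / (1.0 + math.exp(-z))

print(round(p_success(30, 0), 2))   # control at 30% WER
print(round(p_success(30, 1), 2))   # treatment at 30% WER
print(round(p_success(16, 0), 2))   # control at 16% WER
```

The treatment condition at 30% WER matches the control condition at 16% WER, i.e. the learned updates buy back roughly 14 points of recognition error.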
improvements at different WER
[figure: absolute improvement in task success (0 to ~0.18) as a function of word error rate (0-100%)]
effect on task duration
(for successful tasks)
 ANOVA on task duration for successful tasks
Duration ← -0.21 + 0.013∙WER - 0.106∙Condition
 significant improvement, equivalent to 7.9%
absolute reduction in WER
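A quick check of the stated equivalence, using the rounded coefficients shown above (the slide's 7.9% figure presumably comes from the unrounded fit):

```python
# Treatment effect on duration (-0.106) expressed as the WER reduction with
# the same effect on the duration model: 0.106 / 0.013 WER points.
equivalent_wer_reduction = 0.106 / 0.013
print(round(equivalent_wer_reduction, 1))   # ~8.2 with rounded coefficients
```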
outline
 proposed approach
 data
 experiments and results
 effect on dialog performance
 conclusion
summary
 data-driven approach for constructing accurate
system beliefs
 integrate information across multiple turns
 bridge together detection of misunderstandings and
corrections
 significantly outperforms current heuristics
 significantly improves effectiveness and efficiency
other advantages
 sample efficient
 performs a local one-turn optimization
 good local performance leads to good global
performance
 scalable
 operates on each concept independently
 29 concepts, varying cardinalities
 portable
 decoupled from dialog task specification
 doesn’t make strong assumptions about dialog
management technology
thank you!
questions …
user study
 10 scenarios, fixed order
 presented graphically (explained during briefing)
 participants compensated per task success