Belief Updating in Spoken Dialog Systems
Dialogs on Dialogs Reading Group
June 2005
Dan Bohus
Carnegie Mellon University
Misunderstandings

Misunderstandings are an important problem in spoken dialog systems
- The system obtains an incorrect semantic interpretation of the user's utterance
- 15-40% of turns
- Significant negative impact on overall success rate
Confidence Annotation

Use confidence scores to guard against potential misunderstandings

Traditionally: scores from the speech recognition engine [Chase, Bansal, Cox, Kemp, etc.]
- Focused on WER, not tuned to the task at hand

More recently: system-specific semantic confidence scores [Carpenter, Walker, San-Segundo, etc.]
- Integrate knowledge from different levels in the system: speech recognition, language understanding, dialog management
Correction Detection

Detect whether or not the user is trying to correct the system
- Related: aware-site detection

Similar ML approaches using multiple sources of knowledge [Litman, Swerts, Krahmer, etc.]
Proposed: Belief Updating

Integrate confidence annotation and correction detection in a unified framework for continuously tracking beliefs

A "belief updating" problem:

  S: Where are you flying from?
  U: [CityName={Aspen/0.6; Austin/0.2}]              (initial belief)
  + S: Did you say you wanted to fly out of Aspen?   (system action)
  + U: [No/0.6] [CityName={Boston/0.8}]              (user response)
  = [CityName={Aspen/?; Austin/?; Boston/?}]         (updated belief)
Formally…

Given:
- An initial belief P_initial(C) over concept C
- A system action SA
- A user response R

Construct an updated belief P_updated(C)
- As "accurate" as possible

P_updated(C) ← f(P_initial(C), SA, R)
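To make the interface concrete, here is a toy Python sketch of one possible f on the Aspen/Austin example above; the function name, the discount factor, and the renormalization scheme are all invented for illustration and are not the learned model described later in the talk.

# Toy illustration of P_updated(C) <- f(P_initial(C), SA, R).
# All names and constants here are hypothetical, not the system's method.
def update_belief(initial_belief, confirmed_value, response):
    belief = dict(initial_belief)
    # If the user disconfirms, heavily discount the confirmed hypothesis.
    if response.get("disconfirm"):
        belief[confirmed_value] = belief.get(confirmed_value, 0.0) * 0.2
    # Fold in newly recognized values with their recognition confidences.
    for value, conf in response.get("new_values", {}).items():
        belief[value] = belief.get(value, 0.0) + conf
    # Renormalize; any missing mass stays with values not yet observed.
    total = sum(belief.values())
    return {v: round(p / max(total, 1.0), 3) for v, p in belief.items()}

# The system explicitly confirmed "Aspen"; the user said no and was
# recognized as "Boston" with confidence 0.8.
print(update_belief({"Aspen": 0.6, "Austin": 0.2}, "Aspen",
                    {"disconfirm": True, "new_values": {"Boston": 0.8}}))
# -> {'Aspen': 0.107, 'Austin': 0.179, 'Boston': 0.714}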
Outline

- Introduction
- Data
- A simplified version of the problem. Approach
- User behaviors
- Learning: preliminary results
- More on evaluation
- Where to from here?
Data

Collected in an experiment with RoomLine
- Phone-based, mixed-initiative system for making conference room reservations
- Equipped with explicit and implicit confirmations

Corpus statistics
- 46 participants
- 449 sessions, 8278 turns
- 13.5% misunderstandings [9.8% / 22.5%]
- 25.6% WER [19.6% / 39.5%]
- 11362 concept updates
System Actions and Concept Updates

Explicit and implicit confirmations:
- Start time: Explicit Confirmation / grounding [EC]
- Date: Implicit Confirmation / grounding [IC]
System Actions and Concept Updates (continued)

Implicit confirmations, grounding and task:
- Date: Implicit Confirmation / grounding [IC]
- Start time: Implicit Confirmation / grounding [IC]
- End time: Implicit Confirmation / task [ICT]
Number of Conflicting Hypotheses

- Below 3% of concept updates involve more than 1 hypothesis
- The system is not using multiple recognition hypotheses
- [Future work: regenerate multiple hypotheses in batch]
A Simplified Version

Since under 3% of updates involve more than one hypothesis:

Update the belief in the top hypothesis after implicit and explicit confirmations

Instead of
- P_updated(C) ← f(P_initial(C), SA, R)

Do
- ConfTop_updated(C) ← f(ConfTop_initial(C), SA, R)
- for SA ∈ {EC, IC, ICT}
Approach

Use machine learning

Dataset
- Concept updates for EC, IC, ICT

Features
- Initial confidence score ConfTop_initial(C)
- System action (SA)
- User response (R)

Target
- Updated confidence score ConfTop_updated(C)
- Data is labeled, so we have a binary target
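As a concrete picture of the learning setup, here is a hypothetical training example; the field names mirror the slide, while the values and the dictionary layout are invented for the sketch.

# One (hypothetical) training example for the belief-updating learner.
example = {
    # Features
    "conf_top_initial": 0.62,          # ConfTop_initial(C)
    "system_action": "EC",             # one of EC / IC / ICT
    "response": {                      # features of the user response R
        "contains_yes": False,
        "contains_no": True,
        "num_words": 4,
        "barge_in": False,
    },
    # Target: whether the top hypothesis was in fact correct (from labels)
    "target": 0,
}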
Outline







17
Introduction
Data
A simplified version of the problem. Approach
User behaviors
Learning: preliminary results
More on evaluation
Where to from here?
data : problem/approach : user behaviors : preliminary results : more on evaluation : what next?
User Behaviors

Study of user behaviors in response to ICs and ECs
- Can inform feature selection and feature development
- Provides insights into where the difficulties are
- Can inform potential strategy refinements
User Responses to ECs

Transcripts:
             YES     NO      Other
CORRECT      1097    8       62      [YES = 94.2% of cor]
INCORRECT    3       202     84      [NO = 69.9% of inc]

Overall, ~10% of responses fall into the "Other" category.

Decoded:
             YES     NO      Other
CORRECT      1016    11      137     [YES = 87.3% of cor]
INCORRECT    2       171     116     [NO = 59.2% of inc]
“Other” Responses to EC

"Eyeball" estimates (out of 146 responses)
- ~70% simply repeat the correct concept value
  - That should come in as a handy feature
- ~10% change conversation focus
- ~10% turn-overtaking issues
  - Maybe inhibit barge-in until Antoine finishes his thesis
- ~10% other
User Responses to ICs

Transcripts:
             YES     NO      Other
CORRECT      166     38      326     [YES = 31.3% of cor]
INCORRECT    15      75      148     [NO = 31.5% of inc]

Decoded:
             YES     NO      Other
CORRECT      151     20      369     [YES = 28.5% of cor]
INCORRECT    16      62      160     [NO = 26.1% of inc]
Users Don’t Always Correct ICs

Users corrected misunderstood ICs in only 45% of the cases:

             User does not correct   User corrects
CORRECT      557                     1
INCORRECT    126 [55% of incor]      104 [45% of incor]

Even if we knew exactly when users correct, we'd still have (126 + 1) / 788 = 16% error.

So what do users do when they don't correct?
- They may actually correct partially
- Completely ignore the error (if non-essential)
- Readjust to accommodate the task
More Questions…

Understand this "ignore" phenomenon better
- Impact on task success?
  - IC correction rate: 49% (successful tasks) vs. 41% (unsuccessful)
- Fixed vs. more "flexible" scenarios
- Impact of prompt length on P(user will correct)?
- "Essential" vs. "non-essential" concepts?
Which ML Technique?

Need good probability outputs
- Margins produced by discriminant classifiers are inadequate
- We want calibrated probability scores, i.e. conf = 0.85 means that in 85% of cases with conf = 0.85 the concept is right
- So, evaluate on a soft metric [I'll contradict myself later!]

Step-wise logistic regression
- Sample-efficient
- Feature selection
- Good soft-metric performance: optimizes for avg. log-likelihood of the data
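For reference, a minimal sketch of fitting a logistic regression and scoring it on the soft metric (average log-likelihood); scikit-learn and the synthetic data are my stand-ins here, and the talk's stepwise feature selection is omitted.

# Minimal stand-in: plain logistic regression evaluated on the SOFT metric
# (average log-likelihood). Stepwise selection omitted; data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                      # stand-in features
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
p = model.predict_proba(X)[:, 1]
print("avg log-likelihood:", -log_loss(y, p))      # closer to 0 is better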
Data. Features

For each system action {EC, IC, ICT}:
- Initial confidence score
- Other indicators about current state:
  - How well the dialog has been going
  - Which concept we are talking about
  - How far back this concept was acquired
- Features on the user response:
  - Confirmation and disconfirmation markers
  - Acoustic / prosodic: f0 (min, max, range, max slope, etc.) plus normalized versions
  - Number of words; turn length (seconds)
  - Concept information: expected / repeated / new concepts and grammar slots
  - Confidence
  - Barge-in and timeout info
  - Lexical features (preselected by mutual information with the target or with confirm/disconfirm markers)
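One way the lexical preselection could look in code; the mutual_info_classif call and the bag-of-words layout are assumptions of this sketch, not the talk's actual tooling.

# Sketch: preselect lexical features by mutual information with the target.
# The sklearn call and the word-presence matrix are assumptions here.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
bow = rng.integers(0, 2, size=(500, 300))   # word-presence matrix
target = rng.integers(0, 2, size=500)       # top hypothesis correct? (0/1)

mi = mutual_info_classif(bow, target, discrete_features=True)
keep = np.argsort(mi)[-50:]                 # indices of the 50 best words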
Results

Actually using a 1-level logistic model tree
- Split on answer_type = {yes, no, other, no_parse}
- Perform step-wise logistic regression on the 4 leaves
  - P-entry = 0.05
  - P-reject = 0.30
  - BIC stopping criterion

Also tried a full-blown model tree; results are similar, maybe marginally worse.
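A rough sketch of that 1-level logistic model tree: a single root split on answer_type, then one logistic regression per leaf. The class and method names are invented, and the stepwise selection with P-entry/P-reject and BIC is omitted.

# Rough sketch of the 1-level logistic model tree (names invented).
import numpy as np
from sklearn.linear_model import LogisticRegression

class OneLevelLMT:
    LEAVES = ("yes", "no", "other", "no_parse")

    def fit(self, answer_types, X, y):
        self.models = {}
        for leaf in self.LEAVES:
            idx = np.array([a == leaf for a in answer_types])
            if idx.any():   # guard against empty leaves in small samples
                self.models[leaf] = LogisticRegression().fit(X[idx], y[idx])
        return self

    def conf_top_updated(self, answer_type, x):
        # Updated confidence in the top hypothesis for one concept update.
        return self.models[answer_type].predict_proba(x.reshape(1, -1))[0, 1]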
Explicit Confirmation

                 HARD (error rate)   SOFT (avg. log-likelihood)
Initial          31.1%               -0.5076
Heuristic        8.6%                -0.1943
LMT (CV)         3.7%                -0.1160
LMT (training)   2.9%                -0.0851

[Bar charts: error rate (%) and avg. log-likelihood for Initial, Heuristic, LMT(CV), LMT(training)]
Implicit Confirmation

                 HARD (error rate)   SOFT (avg. log-likelihood)
Initial          31.4%               -0.6217
Heuristic        24.0%               -0.6736
LMT (CV)         19.6%               -0.4521
LMT (training)   18.8%               -0.4124
Oracle baseline  16.1%               -

[Bar charts: error rate (%) and avg. log-likelihood for Initial, Heuristic, LMT(CV), LMT(training)]
What can Logistic Regression / Avg-LL do for you?

D = {d1, d2, d3, d4, …}, with di ∈ {0, 1}
P(D) = ∏ P(di | xi)

Express the density as:
- P(d=1 | x) = 1 / (1 + exp(-w·x))
- You can actually derive this form if you start with Gaussian class-conditionals P(x | d)

Find parameters w to maximize P(D):
- argmax P(D) = argmax ∏ P(di | xi)
- argmax P(D) = argmin ∑ -log P(di | xi)
- Hence we maximize the average log-likelihood of the data

But what does that mean?
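A quick numeric illustration of the objective; the labels and probabilities below are made up.

# Average log-likelihood of the data under the model (made-up numbers).
import math

labels = [1, 1, 0, 1]
probs  = [0.9, 0.7, 0.2, 0.6]          # model's P(d=1 | x) for each case
avg_ll = sum(math.log(p if d == 1 else 1 - p)
             for d, p in zip(labels, probs)) / len(labels)
print(avg_ll)                          # ≈ -0.30; closer to 0 is better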
Loss Function in Logistic Regression

Log-likelihood loss function:
- If d=1, then P(d=1)=0.01 is ten times worse than P(d=1)=0.1, but P(d=1)=0.7 is about the same as P(d=1)=0.8
- Things are mirrored for d=0
- This does not match the "threshold" model commonly used to engage actions

[Plot: log-likelihood loss for d=1 as a function of P(d=1), marked at 0.01, 0.1, 0.7, 0.8, 1]
A New Loss Function: T2

A loss function that better matches our domain: T2 (or even T3)

[Plots: step-shaped T2 loss for d=1 and d=0, with region costs C1-C4 and thresholds t1, t2]

Optimize argmax ∑ T2(P(di=c | xi))
- Not differentiable :(
- Not convex :(
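A sketch of what such a stepwise loss could look like for this setup; the particular thresholds and costs are placeholders, not the calibrated values.

# Stepwise T2-style loss (thresholds t1, t2 and region costs are placeholders).
def t2_loss(p, d, t1=0.3, t2=0.7, costs=(1.0, 0.5, 0.0)):
    """Loss for predicting P(d=1|x)=p when the true label is d."""
    if d == 0:
        p = 1.0 - p                    # mirror for the negative class
    if p < t1:
        return costs[0]                # confidently wrong region
    elif p < t2:
        return costs[1]                # uncertain region
    return costs[2]                    # confidently right region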
Smoothed Version

A smoothed loss function that better matches our domain:

SmoothT2(p) = σ1(p) + σ2(p)
σi(p) = 1 / (1 + exp(ki (p - θi)))
with the k's and θ's chosen accordingly

[Plot: smoothed two-step loss for d=1, with costs C1, C2 and thresholds t1, t2]

Optimize argmax ∑ SmoothT2(P(di=c | xi))
- Differentiable! :)
- But still not convex :( … multiple local maxima
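Directly transcribing the slide's formula; the steepness values ki and locations θi below are illustrative choices, not the calibrated ones.

# SmoothT2(p) = sigma_1(p) + sigma_2(p), per the slide.
import math

def smooth_t2(p, k=(40.0, 40.0), theta=(0.3, 0.7)):
    def sigma(k_i, th_i):
        return 1.0 / (1.0 + math.exp(k_i * (p - th_i)))
    return sigma(k[0], theta[0]) + sigma(k[1], theta[1])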
Costs & Thresholds

Costs: where from?
- "Expert" knowledge
- Derived from data (might be tricky)

Thresholds: where from?
- Fixed
- Or optimized at the same time:
  - SmoothT2 = SmoothT2(w, th1, th2)
  - Differentiable in th1 and th2, so we can do gradient search for them
  - Calibrates both the belief updating and the thresholds in one step to minimize loss
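A sketch of that joint calibration under the assumptions above: logistic weights w and thresholds th1, th2 packed into one parameter vector and handed to a generic gradient-based optimizer (scipy and the synthetic data are my choices here, not the talk's tools).

# Joint gradient search over logistic weights and the two thresholds.
import numpy as np
from scipy.optimize import minimize

def total_smooth_t2(params, X, y, k=40.0):
    w, (th1, th2) = params[:-2], params[-2:]
    p = 1.0 / (1.0 + np.exp(-X @ w))            # current beliefs
    p_eff = np.where(y == 1, p, 1.0 - p)        # mirror for d=0 cases
    loss = (1.0 / (1.0 + np.exp(k * (p_eff - th1)))
            + 1.0 / (1.0 + np.exp(k * (p_eff - th2))))
    return loss.mean()

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(float)
x0 = np.concatenate([np.zeros(4), [0.3, 0.7]])  # initial w, th1, th2
res = minimize(total_smooth_t2, x0, args=(X, y), method="BFGS")
w_opt, th_opt = res.x[:-2], res.x[-2:]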
Questions: What Next?

- ICT: can we do anything there?
- Push for better performance
  - Optimize for the new loss function
- Push for better understanding
- More in the future: look at the full belief updating problem
  - Looks really tough
- … Add more features?
- … Debug the models more, eliminate singularities
- … Why doesn't the model tree do better?
- … What are the other interesting questions?
Thank You!
Encoding System Actions

For each concept update, define a system action signature: <IC, ICT, EC, REQ>
- IC: Implicit Confirm [grounding]
- ICT: Implicit Confirm [task]
- EC: Explicit Confirm
- REQ: Request

Each variable can take 1 of 4 values:
- 0
- C (action happens on the concept of interest)
- OC (action happens on some other concept)
- C&OC (action happens both on the concept of interest and on some other concept)

Only certain combinations are valid and appear in the data.
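A small sketch of how such a signature could be computed for one concept update; the helper and the (action, concept) input format are hypothetical, though the value encoding follows the slide.

# Hypothetical helper computing the <IC, ICT, EC, REQ> signature.
def action_signature(actions, concept):
    """actions: list of (action_type, concept_name) pairs in this turn."""
    sig = {}
    for a in ("IC", "ICT", "EC", "REQ"):
        on_c  = any(t == a and c == concept for t, c in actions)
        on_oc = any(t == a and c != concept for t, c in actions)
        sig[a] = ("C&OC" if on_c and on_oc
                  else "C" if on_c
                  else "OC" if on_oc
                  else "0")
    return sig

# e.g. action_signature([("IC", "date"), ("EC", "start_time")], "date")
#      -> {"IC": "C", "ICT": "0", "EC": "OC", "REQ": "0"}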