Modeling the Cost of Misunderstandings
in the CMU Communicator System
Dan Bohus
Alex Rudnicky
School of Computer Science,
Carnegie Mellon University,
Pittsburgh, PA, 15213
0. Abstract
We present a data-driven approach that allows us to quantitatively assess the costs of the various types of errors that a confidence annotator commits in the CMU Communicator spoken dialog system. Knowing these costs, we can determine the optimal tradeoff point between these errors and fine-tune the confidence annotator accordingly. The cost models based on net concept transfer efficiency fit our data quite well, and the relative costs of false-positives and false-negatives are in accordance with our intuitions. We also find, surprisingly, that for a mixed-initiative system such as the CMU Communicator, these errors trade off equally over a wide operating range.
1. Motivation. Problem Formulation.

Intro

In previous work [1], we have cast the problem of utterance-level confidence annotation as a binary classification task, and have trained multiple classifiers for this purpose:
• Training corpus: 131 dialogs, 4550 utterances
• 12 features from the recognition, parsing, and dialog levels
• 7 classifiers: Decision Tree, ANN, Bayesian Net, AdaBoost, Naïve Bayes, SVM, Logistic Regression
Results (mean classification error rates in 10-fold cross-validation):

  Random baseline               32%
  Previous "Garble" baseline    25%
  Classifiers*                  16%
* Most of the classifiers obtained statistically indistinguishable results (with the notable exception of Naïve Bayes). The logistic regression model obtained much better performance on a soft metric.
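For concreteness, a minimal sketch of how such a comparison could be run with scikit-learn; the feature loader and the classifier settings are our assumptions, not the original experimental setup (the Bayesian network classifier is omitted, as scikit-learn has no direct equivalent):

```python
# Sketch: comparing binary confidence classifiers with 10-fold cross-validation.
# load_features() is a hypothetical loader returning the 12 recognition/parsing/
# dialog-level features and the binary OK/BAD utterance labels.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = load_features()  # hypothetical: (4550, 12) feature matrix, 0/1 labels

classifiers = {
    "decision tree": DecisionTreeClassifier(),
    "ann":           MLPClassifier(max_iter=1000),
    "adaboost":      AdaBoostClassifier(),
    "naive bayes":   GaussianNB(),
    "svm":           SVC(),
    "logistic reg.": LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=10)  # 10-fold CV accuracy
    print(f"{name:14s} mean error rate = {1 - acc.mean():.3f}")
```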
Question: Is Classification Error Rate the Right Way to Evaluate Performance?

CER as a measure of performance implicitly assumes that the costs of false-positives and false-negatives are the same. Intuitively, however, this assumption does not hold in most dialog systems:
• On an FP, the system incorporates and will act on invalid information;
• On an FN, the system rejects a valid user utterance.
Optimally, we therefore want to build an error function that takes these costs into account, and optimize for that.
Problem Formulation

1. Develop a cost model which allows us to quantitatively assess the costs of FP and FN errors.
2. Use these costs to pick an optimal point on the classifier's operating characteristic.
Performance metrics:
• User satisfaction (5-point scale): subjective, hard to obtain
• Completion (binary): too coarse
• Concept transmission efficiency:
  - CTC = correctly transferred concepts/turn
  - ITC = incorrectly transferred concepts/turn
  - REC = relevantly expressed concepts/turn
The Dataset

• 134 dialogs (2561 utterances), collected mostly using 4 different scenarios
• User satisfaction scores obtained for only 35 dialogs
• Corpus manually labeled at the concept level:
  - 4 labels: OK/RBAD/PBAD/OOD
  - Aggregate utterance labels generated
• Confidence annotator decisions available in the logs
• We could therefore compute the counts of FPs, FNs, CTCs, and ITCs for each session.
An Example

User:    I want to fly from Pittsburgh to Boston
Decoder: I want to fly from Pittsburgh to Austin
Parse:   [I_want/OK] [Depart_Loc/OK] [Arrive_Loc/RBAD]
• Only 2 relevantly expressed concepts
• If Accept: CTC=1, ITC=1, REC=2
• If Reject: CTC=0, ITC=0, REC=2
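The bookkeeping behind these counts can be made concrete with a small sketch; this is our reading of the definitions above, not code from the Communicator system itself:

```python
# Sketch of the per-utterance CTC/ITC/REC bookkeeping (our reading of the
# metric definitions; not code from the Communicator system).

def utterance_counts(concepts, accepted):
    """concepts: list of (name, label, is_relevant); labels are OK/RBAD/PBAD/OOD.
    Returns (CTC, ITC, REC) for this utterance."""
    relevant = [label for _, label, is_relevant in concepts if is_relevant]
    rec = len(relevant)
    if not accepted:                  # If Reject: nothing is transferred
        return 0, 0, rec
    ctc = sum(1 for label in relevant if label == "OK")
    return ctc, rec - ctc, rec        # non-OK relevant concepts transfer wrongly

# The example above: "Boston" decoded as "Austin", so Arrive_Loc is RBAD,
# and I_want carries no task information (not relevant).
parse = [("I_want", "OK", False),
         ("Depart_Loc", "OK", True),
         ("Arrive_Loc", "RBAD", True)]
print(utterance_counts(parse, accepted=True))   # (CTC, ITC, REC) = (1, 1, 2)
print(utterance_counts(parse, accepted=False))  # (CTC, ITC, REC) = (0, 0, 2)
```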
2. Cost Models: The Approach

The Approach

To model the impact of FPs and FNs on system performance, we:
• Identify a suitable dialog performance metric (P) which we want to optimize for
• Build a statistical regression model over whole sessions, using P as the response variable and the counts of FPs and FNs as predictors:
  - P = f(FPs, FNs)
  - P = k + Cost_FP · FP + Cost_FN · FN (linear regression)

The problem thus translates to locating the point on the operating characteristic (by moving the classification threshold) which minimizes the total cost, and thus implicitly maximizes the chosen performance metric, rather than the classification error rate. The cost, according to model 3 (see Section 3), is:

Cost = 0.48 · FPNC + 2.12 · FPC + 1.33 · FN + 0.56 · TN
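As a sketch of how such a regression can be estimated (here in model 3's form; the per-session record layout is our assumption), ordinary least squares over the session counts suffices:

```python
# Sketch: estimating error costs by least squares over whole sessions
# (model 3's predictors; the `sessions` record layout is assumed, not
# taken from the original implementation).
import numpy as np

def fit_cost_model(sessions):
    """sessions: dicts with per-dialog counts REC, FPC, FPNC, FN, TN, CTC, ITC."""
    X = np.array([[s["REC"], s["FPC"], s["FPNC"], s["FN"], s["TN"], 1.0]
                  for s in sessions])
    y = np.array([s["CTC"] - s["ITC"] for s in sessions])  # response: CTC - ITC
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["C_REC", "C_FPC", "C_FPNC", "C_FN", "C_TN", "k"], coef))
```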
3. Cost Models: The Results
Cost Models Targeting Efficiency

Three successively refined cost models were developed, targeting efficiency as the response variable. The goodness of fit of these models (indicated by R²), both on the training data and in 10-fold cross-validation, is shown in the table below.
Model 1:
CTC = FP + FN + TN + k
Model 2:
CTC – ITC = REC + FP + FN + TN + k
• Added the ITC term, so that we also minimize the number of incorrectly transferred concepts
• REC captures a prior on the verbosity of the user
• Both changes further improve performance
Model 3:
CTC – ITC = REC + FPC + FPNC + FN + TN + k
• The FP term was split in two, since there are two different types of false positives in the system, which intuitively should have very different costs:
  - FPC = false positives with relevant concepts
  - FPNC = false positives without relevant concepts
Model                              R² all   R² train   R² test
CTC = FP+FN+TN                     0.81     0.81       0.73
CTC–ITC = FP+FN+TN                 0.86     0.86       0.78
CTC–ITC = REC+FP+FN+TN             0.89     0.89       0.83
CTC–ITC = REC+FPC+FPNC+FN+TN       0.94     0.94       0.90
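The cross-validated figures above could be reproduced with a sketch like the following (same assumed record layout as before; `load_sessions` is a hypothetical loader):

```python
# Sketch: 10-fold cross-validated R² for the model-3 cost regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

sessions = load_sessions()  # hypothetical loader of per-dialog count records
X = np.array([[s["REC"], s["FPC"], s["FPNC"], s["FN"], s["TN"]] for s in sessions])
y = np.array([s["CTC"] - s["ITC"] for s in sessions])
r2_test = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
print("R² test (10-fold):", r2_test.mean())
```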
The resulting coefficients for model 3 are given below:

  k        0.41
  C_REC    0.62
  C_FPNC  -0.48
  C_FPC   -2.12
  C_FN    -1.33
  C_TN    -0.55
Other Models

Targeting Completion (binary)
• Logistic regression model
• The estimated model does not indicate a good fit

Targeting User Satisfaction (5-point scale)
• Based on only 35 dialogs
• R² = 0.61, similar to results reported in the literature (Walker et al.)
• Explanation: subjectivity of the metric + limited dataset
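A minimal sketch of the completion model, assuming the same per-session record layout as above:

```python
# Sketch: completion as a logistic regression over per-session error counts
# (assumed record layout; the poster reports this model does not fit well).
import numpy as np
from sklearn.linear_model import LogisticRegression

sessions = load_sessions()  # hypothetical loader, as above
X = np.array([[s["FP"], s["FN"], s["TN"]] for s in sessions])
y = np.array([s["completed"] for s in sessions])  # binary: task completed?
model = LogisticRegression().fit(X, y)
print(model.coef_, model.intercept_)
```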
4. Fine-tuning the annotator

We want to find the optimal trade-off point on the operating characteristic of the classifier. By default, the annotator implicitly minimizes the classification error rate (FP + FN), which weights the two error types equally.
[Figure: false negative, false positive, and total error rates as a function of the boosting threshold]

The fact that the cost function is almost constant across a wide range of thresholds indicates that the efficiency of the dialog stays about the same, regardless of the ratio of FPs and FNs that the system makes.
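Tuning then amounts to a sweep over the threshold, scoring each operating point with the learned costs. A sketch, where `annotate(session, th)` (re-running the annotator at threshold th and returning the FPNC/FPC/FN/TN counts) and `load_sessions` are hypothetical stand-ins for the system's logs and scoring code:

```python
# Sketch: sweep the decision threshold and score each operating point with
# the model-3 costs; pick the minimum-cost threshold.
import numpy as np

def total_cost(c):
    return 0.48 * c["FPNC"] + 2.12 * c["FPC"] + 1.33 * c["FN"] + 0.56 * c["TN"]

sessions = load_sessions()               # hypothetical loader
thresholds = np.linspace(-2.0, 2.0, 41)  # boosting-score range, as in the figure
costs = [sum(total_cost(annotate(s, th)) for s in sessions) for th in thresholds]
print("minimum-cost threshold:", thresholds[int(np.argmin(costs))])
```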
5. Further Analysis

Is CTC–ITC an Adequate Metric?
• Mean = 0.71; standard deviation = 0.28
• Mean for completed dialogs = 0.82
• Mean for uncompleted dialogs = 0.57
• The difference is statistically significant at a very high level of confidence (p = 7.23 · 10⁻⁹)
Can We Reliably Extrapolate the Model to Other Areas of the ROC?

The distribution of FPs and FNs across dialogs indicates that, although the data was obtained with the confidence annotator running at a threshold of 0.5, we have enough samples to reliably estimate the model in other areas of the ROC.
How About the Impact of the Baseline Error Rate?

Cost models constructed from sessions with a low baseline error rate indicate that the optimal point is at threshold 0 (i.e., no confidence annotator).
Explanation:
• Incorrectly captured information is easy to overwrite in the CMU Communicator.
• Baseline error rates are relatively low.
6. Conclusions
• Proposed a data-driven approach to quantitatively assess the costs of the various types of errors committed by a confidence annotator.
• Models based on efficiency fit the data well; the obtained costs confirm our intuitions.
• For the CMU Communicator, the models predict that the total cost stays about the same across a large range of the operating characteristic of the confidence annotator.