Modeling the Cost of Misunderstandings in the CMU Communicator System

Dan Bohus, Alex Rudnicky
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213

Abstract

We present a data-driven approach which allows us to quantitatively assess the costs of the various types of errors that a confidence annotator commits in the CMU Communicator spoken dialog system. Knowing these costs, we can determine the optimal tradeoff point between these errors and fine-tune the confidence annotator accordingly. The cost models based on net concept transfer efficiency fit our data quite well, and the relative costs of false positives and false negatives are in accordance with our intuitions. We also find, surprisingly, that for a mixed-initiative system such as the CMU Communicator, these errors trade off equally over a wide operating range.

1. Motivation. Problem Formulation.

Intro

In previous work [1], we cast the problem of utterance-level confidence annotation as a binary classification task and trained multiple classifiers for this purpose:
- Training corpus: 131 dialogs, 4550 utterances
- 12 features from the recognition, parsing and dialog levels
- 7 classifiers: Decision Tree, ANN, Bayesian Net, AdaBoost, Naïve Bayes, SVM, Logistic Regression

Results (mean classification error rates in 10-fold cross-validation):

Random baseline              32%
Previous “Garble” baseline   25%
Classifiers*                 16%

* Most of the classifiers obtained statistically indistinguishable results (with the notable exception of Naïve Bayes).

2. Cost Models: The Approach

The Approach

To model the impact of FPs and FNs on system performance, we:
- Identify a suitable dialog performance metric (P) which we want to optimize for
- Build a statistical regression model on whole sessions, using P as the response variable and the counts of FPs and FNs as predictors:
  - P = f(FPs, FNs)
  - P = k + CostFP • FP + CostFN • FN (linear regression)
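The regression step of the approach, P = k + CostFP • FP + CostFN • FN, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-session counts and metric values are made up (generated from known coefficients so the fit can be checked), the names `solve_linear_system` and `fit_cost_model` are hypothetical, and ordinary least squares via the normal equations stands in for whatever regression tooling was actually used.

```python
from typing import List, Sequence, Tuple


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_cost_model(
    fp: Sequence[float], fn: Sequence[float], p: Sequence[float]
) -> Tuple[float, float, float]:
    """Least-squares fit of P = k + cost_fp * FP + cost_fn * FN.

    Each index is one whole session; returns (k, cost_fp, cost_fn).
    """
    rows = [[1.0, float(a), float(b)] for a, b in zip(fp, fn)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, p)) for i in range(3)]
    k, cost_fp, cost_fn = solve_linear_system(xtx, xty)
    return k, cost_fp, cost_fn


# Made-up sessions (NOT the poster's data): P is generated exactly from
# k = 1.0, cost_fp = -0.2, cost_fn = -0.1, so the fit should recover them.
FP_COUNTS = [0, 1, 2, 3, 1, 4]
FN_COUNTS = [1, 0, 2, 1, 3, 0]
P_VALUES = [1.0 - 0.2 * a - 0.1 * b for a, b in zip(FP_COUNTS, FN_COUNTS)]

if __name__ == "__main__":
    print(fit_cost_model(FP_COUNTS, FN_COUNTS, P_VALUES))
```

With a performance metric where higher is better, negative fitted coefficients correspond to positive costs for the respective error types.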
Of the classifiers trained in previous work [1], the logistic regression model obtained much better performance on a soft metric.

Question: Is Classification Error Rate (CER) the Right Way to Evaluate Performance?

CER as a measure of performance implicitly assumes that the cost of false positives and false negatives is the same. But intuitively this assumption does not hold in most dialog systems:
- On an FP, the system incorporates and will act on invalid information
- On an FN, the system will reject a valid user utterance
So optimally, we want to build an error function which takes these costs into account, and optimize for that.

Problem Formulation

1. Develop a cost model which allows us to quantitatively assess the costs of FP and FN errors
2. Use these costs to pick an optimal point on the classifier operating characteristic

So the problem translates to locating the point on the operating characteristic (by moving the classification threshold) which minimizes the total cost, and thus implicitly maximizes the chosen performance metric, rather than the point which minimizes the classification error rate.

Performance metrics:
- User satisfaction (5-point scale): subjective, hard to obtain
- Completion (binary): too coarse
- Concept transmission efficiency:
  - CTC = correctly transferred concepts/turn
  - ITC = incorrectly transferred concepts/turn
  - REC = relevantly expressed concepts/turn

The Dataset

- 134 dialogs collected using mostly 4 different scenarios
- 2561 utterances
- User satisfaction scores obtained for only 35 dialogs
- Corpus manually labeled at the concept level:
  - 4 labels: OK/RBAD/PBAD/OOD
  - Aggregate utterance labels generated
- Confidence annotator decisions available in the logs
We could therefore compute the counts of FPs, FNs, CTCs and ITCs for each session.

An Example

User: I want to fly from Pittsburgh to Boston
Decoder: I want to fly from Pittsburgh to Austin
Parse: [I_want/OK] [Depart_Loc/OK] [Arrive_Loc/OK]
Only 2 relevantly expressed concepts (the departure and arrival locations); the decoded arrival location, Austin, is not what the user said.
If Accept: CTC=1, ITC=1, REC=2
If Reject: CTC=0, ITC=0, REC=2
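The concept counts in the Pittsburgh/Boston example (CTC, ITC and REC under accept and reject) can be reproduced with a small sketch. The dictionary representation of concepts and the function name `concept_counts` are assumptions made for illustration; the poster does not describe how the counts were computed from the labeled corpus.

```python
from typing import Dict, Tuple


def concept_counts(
    said: Dict[str, str], decoded: Dict[str, str], accept: bool
) -> Tuple[int, int, int]:
    """Return (CTC, ITC, REC) for a single turn.

    said:    relevant concepts the user actually expressed
    decoded: the same concepts as captured by the system
    accept:  whether the confidence annotator accepted the utterance
    """
    rec = len(said)  # relevantly expressed concepts
    if not accept:
        # A rejected utterance transfers nothing, correct or incorrect.
        return 0, 0, rec
    ctc = sum(1 for name, value in decoded.items() if said.get(name) == value)
    itc = len(decoded) - ctc
    return ctc, itc, rec


# The example turn: "Boston" was decoded as "Austin".
SAID = {"Depart_Loc": "Pittsburgh", "Arrive_Loc": "Boston"}
DECODED = {"Depart_Loc": "Pittsburgh", "Arrive_Loc": "Austin"}

if __name__ == "__main__":
    print("accept:", concept_counts(SAID, DECODED, accept=True))
    print("reject:", concept_counts(SAID, DECODED, accept=False))
```

Accepting the utterance transfers one correct and one incorrect concept, while rejecting it transfers neither, which matches the counts given in the example.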
3. Cost Models: The Results

Cost Models Targeting Efficiency

Three successively refined cost models were developed, targeting efficiency as the response variable. The goodness of fit for these models (indicated by R2), both on the training set and in a 10-fold cross-validation process, is given in the table below.

Model 1: CTC = FP + FN + TN + k

Model 2: CTC-ITC = REC + FP + FN + TN + k
- Added the ITC term so that we also minimize the number of incorrectly transferred concepts
- REC captures a prior on the verbosity of the user
- Both changes further improve performance

Model 3: CTC-ITC = REC + FPC + FPNC + FN + TN + k
The FP term was split in two, since there are 2 different types of false positives in the system, which intuitively should have very different costs:
- FPC = false positives with relevant concepts
- FPNC = false positives without relevant concepts

Model                              R2 all   R2 train   R2 test
CTC = FP+FN+TN                     0.81     0.81       0.73
CTC-ITC = FP+FN+TN                 0.86     0.86       0.78
CTC-ITC = REC+FP+FN+TN             0.89     0.89       0.83
CTC-ITC = REC+FPC+FPNC+FN+TN       0.94     0.94       0.90

The resulting coefficients for model 3 are given below, together with their 95% confidence intervals:

k      CREC   CFPNC   CFPC    CFN     CTN
0.41   0.62   -0.48   -2.12   -1.33   -0.55

Other Models

Targeting Completion (binary):
- Logistic regression model
- Estimated model does not indicate a good fit

Targeting User Satisfaction (5-point scale):
- Based only on 35 dialogs
- R2 = 0.61, similar to the literature (Walker et al.)
- Explanation: subjectivity of the metric + limited dataset

4. Fine-tuning the annotator

We want to find the optimal trade-off point on the operating characteristic of the classifier. When the threshold is chosen by classification error rate, we are implicitly minimizing the unweighted sum FP + FN.

The cost, according to model 3, is:
Cost = 0.48 • FPNC + 2.12 • FPC + 1.33 • FN + 0.56 • TN

The fact that the cost function is almost constant across a wide range of thresholds indicates that the efficiency of the dialog stays about the same, regardless of the ratio of FPs to FNs that the system makes.
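Picking the operating point by cost rather than by error rate can be sketched as a threshold sweep using the model 3 cost weights (0.48 • FPNC + 2.12 • FPC + 1.33 • FN + 0.56 • TN). The sketch assumes that an utterance is accepted when its confidence score is at or above the threshold, and that the session-level cost can be summed utterance by utterance; the utterances and the names `total_cost` and `best_threshold` are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

# Model 3 cost weights as reported on the poster.
COST_FPNC, COST_FPC, COST_FN, COST_TN = 0.48, 2.12, 1.33, 0.56

# (confidence score, understood correctly?, contains relevant concepts?)
Utterance = Tuple[float, bool, bool]


def total_cost(utterances: Iterable[Utterance], threshold: float) -> float:
    """Sum the model 3 cost over utterances at a given accept threshold."""
    cost = 0.0
    for score, correct, has_concepts in utterances:
        accepted = score >= threshold  # assumed decision rule
        if accepted and not correct:
            # False positive: cost depends on whether concepts were affected.
            cost += COST_FPC if has_concepts else COST_FPNC
        elif not accepted and correct:
            cost += COST_FN  # false negative: valid utterance rejected
        elif not accepted and not correct:
            cost += COST_TN  # true negative: correct rejection still costs a turn
        # accepted and correct: true positive, no cost term in the model
    return cost


def best_threshold(
    utterances: Sequence[Utterance], candidates: Sequence[float]
) -> float:
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(utterances, t))


# Made-up utterances for illustration only (NOT the poster's data).
TOY_UTTERANCES = [
    (0.9, True, True), (0.8, True, True), (0.7, False, False),
    (0.6, True, True), (0.4, False, True), (0.3, True, True),
    (0.2, False, True),
]
CANDIDATES = [0.0, 0.25, 0.5, 0.75, 1.0]

if __name__ == "__main__":
    for t in CANDIDATES:
        print(t, round(total_cost(TOY_UTTERANCES, t), 2))
    print("best:", best_threshold(TOY_UTTERANCES, CANDIDATES))
```

On real data the same sweep would show how flat the cost curve is around its minimum, which is the property discussed for the CMU Communicator.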
[Figure: false negative, false positive and total error rates as a function of the boosting threshold (x axis: boosting threshold, -2 to 2; y axis: error rate, 0 to 0.7)]

5. Further Analysis

Is CTC-ITC an Adequate Metric?

Mean = 0.71; standard deviation = 0.28. Mean for completed dialogs = 0.82; mean for uncompleted dialogs = 0.57. The difference is statistically significant at a very high level of confidence (p = 7.23 • 10^-9).

Can We Reliably Extrapolate the Model to Other Areas of the ROC?

The distribution of FPs and FNs across dialogs indicates that, although the data was obtained with the confidence annotator running at a threshold of 0.5, we have enough samples to reliably estimate the other areas of the ROC.

How About the Impact of the Baseline Error Rate?

Cost models constructed from sessions with a low baseline error rate indicate that the optimal point is at a threshold of 0 (no confidence annotator).
Explanation:
- Ability to easily overwrite incorrectly captured information in the CMU Communicator
- Relatively low baseline error rates

6. Conclusions

- Proposed a data-driven approach to quantitatively assess the costs of the various types of errors committed by a confidence annotator.
- Models based on efficiency fit the data well; the obtained costs confirm the intuition.
- For the CMU Communicator, the models predict that the total cost stays the same across a large range of the operating characteristic of the confidence annotator.

School of Computer Science, Carnegie Mellon University, 2001, Pittsburgh, PA, 15213.