Evaluating Classification Performance
Data Mining

Why Evaluate?
• Multiple methods are available for classification
• For each method, multiple choices are available for settings (e.g., the value of K for KNN, the tree size in decision tree learning)
• To choose the best model, we need to assess each model's performance

Basic Performance Measure: Misclassification Error
• Error = classifying an example as belonging to one class when it belongs to another class
• Error rate = percent of misclassified examples out of the total examples in the validation data (or test data)

Naïve Classification Rule
• Naïve rule: classify all examples as belonging to the most prevalent class
• Often used as a benchmark: we hope to do better than that
• Exception: when the goal is to identify high-value but rare outcomes, we may do well by doing worse than the naïve rule (see "lift" – later)

Confusion Matrix

                      Predicted Class
  Actual Class        1            0
       1            201           85
       0             25         2689

• 201 1's correctly classified as "1"
• 25 0's incorrectly classified as "1"
• 85 1's incorrectly classified as "0"
• 2689 0's correctly classified as "0"
We use TP to denote true positives; similarly FP (false positives), FN (false negatives), and TN (true negatives).

Error Rate and Accuracy

                      Predicted Class
  Actual Class        1            0
       1          TP = 201     FN = 85
       0          FP = 25      TN = 2689

• Error rate = (FP + FN) / (TP + FP + FN + TN)
• Overall error rate = (25 + 85) / 3000 = 3.67%
• Accuracy = 1 – error rate = (201 + 2689) / 3000 = 96.33%
• With multiple classes, the error rate is (sum of misclassified records) / (total records)

Cutoff for Classification
Many classification algorithms classify via a 2-step process. For each record:
1. Compute a score or a probability of belonging to class "1"
2.
Compare the score to a cutoff value, and classify accordingly

• For example, with Naïve Bayes the default cutoff value is 0.5:
  If p(y=1|x) ≥ 0.5, classify as "1"
  If p(y=1|x) < 0.5, classify as "0"
• Different cutoff values can be used
• Typically, the error rate is lowest for cutoff = 0.5

Cutoff Table – Example

  Actual class   1      1      1      1      1      1      1      0      1      1      1      0
  Prob. of "1"   0.996  0.988  0.984  0.980  0.948  0.889  0.848  0.762  0.707  0.681  0.656  0.622

  Actual class   1      0      0      1      0      0      0      0      0      0      0      0
  Prob. of "1"   0.506  0.471  0.337  0.218  0.199  0.149  0.048  0.038  0.025  0.022  0.016  0.004

• If the cutoff is 0.50: 13 examples are classified as "1"
• If the cutoff is 0.75: 8 examples are classified as "1"
• If the cutoff is 0.25: 15 examples are classified as "1"

Confusion Matrices for Different Cutoffs

Cutoff probability = 0.25 (accuracy = 19/24):

                    Predicted class 1   Predicted class 0
  Actual class 1          11                   1
  Actual class 0           4                   8

Cutoff probability = 0.50 (matrix not shown; accuracy = 21/24)

Cutoff probability = 0.75 (accuracy = 18/24):

                    Predicted class 1   Predicted class 0
  Actual class 1           7                   5
  Actual class 0           1                  11

Lift

When One Class is More Important
• In many cases it is more important to identify members of one class:
  – Tax fraud
  – Response to a promotional offer
  – Detecting malignant tumors
• In such cases, we are willing to tolerate greater overall error in return for better identifying the important class for further attention

Alternate Accuracy Measures

We assume that the important class is "1":

                    Predicted class 1   Predicted class 0
  Actual class 1          TP                  FN
  Actual class 0          FP                  TN

• Sensitivity = % of class 1 examples correctly classified = TP / (TP + FN)
• Specificity = % of class 0 examples correctly classified = TN / (TN + FP)
• True positive rate = % of class 1 examples classified as class 1 = sensitivity = TP / (TP + FN)
• False positive rate = % of class 0 examples classified as class 1 = 1 – specificity = FP / (TN + FP)

ROC Curve
• Plot the true positive rate versus the false positive rate for various values of the cutoff threshold
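The cutoff-sweep calculations above can be sketched in Python. This is a minimal, self-contained sketch using the 24 records from the cutoff table; the function name `confusion_counts` is illustrative, not from any particular library:

```python
# Scores and actual classes from the cutoff-table example above
# (24 records, already sorted by predicted probability of class "1").
probs = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
         0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
         0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004]
actual = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
          1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

def confusion_counts(probs, actual, cutoff):
    """Return (TP, FP, FN, TN) for the given classification cutoff."""
    tp = sum(p >= cutoff and a == 1 for p, a in zip(probs, actual))
    fp = sum(p >= cutoff and a == 0 for p, a in zip(probs, actual))
    fn = sum(p < cutoff and a == 1 for p, a in zip(probs, actual))
    tn = sum(p < cutoff and a == 0 for p, a in zip(probs, actual))
    return tp, fp, fn, tn

for cutoff in (0.25, 0.50, 0.75):
    tp, fp, fn, tn = confusion_counts(probs, actual, cutoff)
    accuracy = (tp + tn) / len(actual)   # e.g. 21/24 at cutoff 0.50
    tpr = tp / (tp + fn)                 # sensitivity: one ROC y-value
    fpr = fp / (fp + tn)                 # 1 - specificity: one ROC x-value
    print(f"cutoff={cutoff}: predicted-as-1={tp + fp}, "
          f"accuracy={accuracy:.3f}, TPR={tpr:.3f}, FPR={fpr:.3f}")
```

Sweeping the cutoff from 1 down to 0 and plotting the resulting (FPR, TPR) pairs traces out the ROC curve.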
• The diagonal is the baseline – a random classifier
• Researchers sometimes use the area under the ROC curve, AUC, as a performance measure
• By definition, AUC is between 0 and 1

Lift Charts: Goal
Useful for assessing performance in terms of identifying the most important class
Helps evaluate, e.g.:
– How many tax records to examine
– How many loans to grant
– How many customers to mail an offer to

Lift Charts – Cont.
• Compare performance of the DM model to "no model, pick randomly"
• Measures the ability of the DM model to identify the important class, relative to its average prevalence
• Charts give an explicit assessment of results over a large number of cutoffs

Lift Chart – Cumulative Performance
[Figure: cumulative lift chart (training dataset). X-axis: # cases (0–30); Y-axis: cumulative # positives (0–14). One curve shows cumulative ownership (positives) when cases are sorted by predicted value; the reference line shows cumulative positives using the average prevalence.]
For example: after examining 10 cases (x-axis), 9 positive cases (y-axis) have been correctly identified.

Lift Charts: How to Compute
• Using the model's classification scores, sort examples from most likely to least likely members of the important class
• Compute lift: accumulate the correctly classified "important class" records (y-axis) and compare to the number of total records (x-axis)

Asymmetric Costs

Misclassification Costs May Differ
• The cost of making a misclassification error may be higher for one class than the other(s)
• Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)

Example – Response to Promotional Offer
• Suppose we send an offer to 1000 people, with a 1% average response rate ("1" = response, "0" = nonresponse)
• The "naïve rule" (classify everyone as "0") has an error rate of 1% and accuracy of 99% (seems good)
• Using DM we can correctly classify eight 1's as 1's
• It comes at the cost of misclassifying twenty 0's as 1's and two 1's as 0's.
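The two-step lift computation described under "Lift Charts: How to Compute" (sort by score, then accumulate the positives found) can be sketched as follows. The scores and labels here are a small hypothetical example, not the ownership dataset from the chart:

```python
def cumulative_gains(scores, labels):
    """Sort records by descending model score and accumulate the number of
    important-class ("1") records found so far - the y-values of a
    cumulative lift (gains) chart, one per case examined."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve, found = [], 0
    for i in order:
        found += labels[i]
        curve.append(found)
    return curve

# Hypothetical scores and labels for 8 cases.
scores = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1, 0.6, 0.3]
labels = [1,   0,   1,   0,   1,   0,   0,   1]

model_curve = cumulative_gains(scores, labels)
print(model_curve)  # cumulative positives after examining 1, 2, ... cases

# Baseline ("no model, pick randomly"): after k cases, expect
# k * (overall positive rate) positives.
baseline = [k * sum(labels) / len(labels) for k in range(1, len(labels) + 1)]
```

The gap between the model curve and the baseline at each depth is the model's lift at that depth.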
The Confusion Matrix

                  Actual 1   Actual 0
  Predict as 1        8          20
  Predict as 0        2         970

Error rate = (2 + 20) / 1000 = 2.2% (higher than the naïve rate of 1%)

Introducing Costs & Benefits
Suppose:
• Profit from a "1" is $10
• Cost of sending an offer is $1
Then:
• Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit
• Under the DM predictions, 28 offers are sent:
  – 8 respond, with profit of $10 each
  – 20 fail to respond, at a cost of $1 each
  – 972 receive nothing (no cost, no profit)
• Net profit = $80 – $20 = $60

Profit Matrix

                  Actual 1   Actual 0
  Predict as 1      $80        ($20)
  Predict as 0       $0          $0

Lift (Again)
• Adding costs to the mix, as above, does not change the actual classifications
• But it allows us to make a better decision (threshold): use the lift curve and change the cutoff value for "1" to maximize profit

Adding Cost/Benefit to the Lift Curve
• Sort test examples in descending probability of success
• For each case, record the cost/benefit of the actual outcome
• Also record the cumulative cost/benefit
• Plot all records:
  – X-axis is the index number (1 for the 1st case, n for the nth case)
  – Y-axis is the cumulative cost/benefit
  – Reference line from the origin to yn (yn = total net benefit)

Lift Curve May Go Negative
• If the total net benefit from all cases is negative, the reference line will have a negative slope
• Nonetheless, the goal is still to use the cutoff to select the point where the net benefit is at a maximum

[Figure: cost/benefit lift curve with a negative-slope reference line; zooming in shows the maximum profit of $60.]

Multiple Classes
• For m classes, the confusion matrix has m rows and m columns
• Theoretically, there are m(m – 1) misclassification costs, since any case could be misclassified in m – 1 ways
• Practically, that is too many to work with
• In a decision-making context, though, such complexity rarely arises – one class is usually of primary interest

Classification Using Triage
• Take into account a gray area in making classification decisions
• Instead of classifying as C1 or C0, we classify as: C1, C0, or "can't say"
• The third category might receive special human review
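The cost/benefit lift-curve procedure above (sort by probability, accumulate net benefit, pick the depth where it peaks) can be sketched as follows. It reuses the promotional-offer numbers (+$10 per responder, -$1 per contacted non-responder), but the ranking of the 28 contacted records is a hypothetical one:

```python
def best_cutoff_depth(benefits):
    """Given the net benefit of each record's actual outcome, with records
    sorted by descending predicted probability, return (k, max_profit):
    the number of top-ranked records to act on and the cumulative net
    benefit at that depth."""
    best_k, best_total, running = 0, 0, 0
    for k, b in enumerate(benefits, start=1):
        running += b
        if running > best_total:
            best_k, best_total = k, running
    return best_k, best_total

# Hypothetical ranking of the 28 contacted records from the example above:
# 8 responders (+$10 each) and 20 non-responders (-$1 each).
benefits = [10] * 6 + [-1] * 3 + [10] * 2 + [-1] * 17

k, profit = best_cutoff_depth(benefits)
print(k, profit)  # depth and maximum profit under this ordering
```

Note that acting on all 28 records yields the $60 total from the profit matrix, while under this ordering stopping at the best depth yields more: the cutoff, not the classifications, is what the cost/benefit curve optimizes.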