Classification performance measures

Evaluating Classification Performance
Data Mining
Why Evaluate?
• Multiple methods are available for classification
• For each method, multiple choices are available for settings (e.g. the value of K for KNN, or the tree size for decision tree learning)
• To choose the best model, we need to assess each model's performance
Basic performance measure: Misclassification error
• Error = classifying an example as belonging to one class when it belongs to another class
• Error rate = percent of misclassified examples out of the total examples in the validation data (or test data)
Naïve Classification Rule
• Naïve rule: classify all examples as belonging to the most prevalent class
• Often used as a benchmark: we hope to do better than that
• Exception: when the goal is to identify high-value but rare outcomes, we may do well by doing worse than the naïve rule (see "lift" – later)
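A minimal sketch of this benchmark, assuming hypothetical 0/1 label arrays y_train and y_valid; scikit-learn's DummyClassifier with strategy="most_frequent" predicts the most prevalent class, i.e. the naïve rule.

import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical 0/1 class labels (most records belong to class 0)
y_train = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
y_valid = np.array([0, 1, 0, 0, 0, 0, 0, 0, 1, 0])

# Naive rule: always predict the most prevalent class seen in training
naive = DummyClassifier(strategy="most_frequent")
naive.fit(np.zeros((len(y_train), 1)), y_train)        # features are ignored by this strategy
benchmark_accuracy = naive.score(np.zeros((len(y_valid), 1)), y_valid)
print(benchmark_accuracy)                              # accuracy we hope to beat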
Confusion Matrix
Classification Confusion Matrix
                   Predicted class 1   Predicted class 0
Actual class 1     201                 85
Actual class 0     25                  2689

201 1's correctly classified as "1" (True Positives, TP)
85 1's incorrectly classified as "0" (False Negatives, FN)
25 0's incorrectly classified as "1" (False Positives, FP)
2689 0's correctly classified as "0" (True Negatives, TN)
Error Rate and Accuracy
Classification Confusion Matrix
                   Predicted class 1   Predicted class 0
Actual class 1     TP = 201            FN = 85
Actual class 0     FP = 25             TN = 2689
Error rate = (FP + FN)/(TP + FP + FN + TN)
Overall error rate = (25+85)/3000 = 3.67%
Accuracy = 1 – error rate = (201+2689)/3000 = 96.33%
If multiple classes, error rate is:
(sum of misclassified records)/(total records)
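A minimal sketch of these formulas, plugging in the counts from the confusion matrix above:

# Counts from the confusion matrix above
TP, FN = 201, 85
FP, TN = 25, 2689

n = TP + FP + FN + TN              # 3000 records in total
error_rate = (FP + FN) / n         # (25 + 85) / 3000 = 0.0367
accuracy = 1 - error_rate          # (201 + 2689) / 3000 = 0.9633
print(error_rate, accuracy)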
Cutoff for classification
Many classification algorithms classify via a 2-step process:
For each record,
1. Compute a score or a probability of belonging to class "1"
2. Compare to the cutoff value, and classify accordingly
• For example, with Naïve Bayes the default cutoff value is 0.5:
  If p(y=1|x) >= 0.5, classify as "1"
  If p(y=1|x) < 0.5, classify as "0"
• Can use different cutoff values
• Typically, the error rate is lowest for cutoff = 0.5
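A small sketch of the two-step process, assuming p_hat holds estimated probabilities p(y=1|x) for a few hypothetical records:

import numpy as np

p_hat = np.array([0.92, 0.51, 0.49, 0.07])   # hypothetical p(y=1|x) scores

cutoff = 0.5
y_pred = (p_hat >= cutoff).astype(int)       # classify as "1" when the probability >= cutoff
print(y_pred)                                # [1 1 0 0]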
Cutoff Table - example
24 records, in descending order of the predicted probability of "1":

Actual Class   Prob. of "1"        Actual Class   Prob. of "1"
1              0.996               1              0.506
1              0.988               0              0.471
1              0.984               0              0.337
1              0.980               1              0.218
1              0.948               0              0.199
1              0.889               0              0.149
1              0.848               0              0.048
0              0.762               0              0.038
1              0.707               0              0.025
1              0.681               0              0.022
1              0.656               0              0.016
0              0.622               0              0.004
• If cutoff is 0.50: 13 examples are classified as “1”
• If cutoff is 0.75: 8 examples are classified as “1”
• If cutoff is 0.25: 15 examples are classified as “1”
Confusion Matrix for Different Cutoffs

Cutoff probability = 0.25 (Accuracy = 19/24)

                   Predicted class 1   Predicted class 0
Actual class 1     11                  1
Actual class 0     4                   8

Cutoff probability = 0.5: not shown (Accuracy = 21/24)

Cutoff probability = 0.75 (Accuracy = 18/24)

                   Predicted class 1   Predicted class 0
Actual class 1     7                   5
Actual class 0     1                   11
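As a sketch, the matrices above can be reproduced with scikit-learn from the 24 actual classes and probabilities in the cutoff table; labels=[1, 0] puts class 1 in the first row and column, matching the layout above.

import numpy as np
from sklearn.metrics import confusion_matrix

# The 24 records from the cutoff table, sorted by probability of "1"
actual = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
                   1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
prob = np.array([0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
                 0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
                 0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004])

for cutoff in (0.25, 0.5, 0.75):
    predicted = (prob >= cutoff).astype(int)
    cm = confusion_matrix(actual, predicted, labels=[1, 0])
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    print(cutoff, cm.tolist(), accuracy)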
Lift
When One Class is More Important
• In many cases it is more important to identify members of one class
  – Tax fraud
  – Response to promotional offer
  – Detecting malignant tumors
• In such cases, we are willing to tolerate greater overall error, in return for better identifying the important class for further attention
Alternate Accuracy Measures
We assume that the important class is "1".

                     Predicted class 1   Predicted class 0
Actual class = 1     TP                  FN
Actual class = 0     FP                  TN
Sensitivity = % of class 1 examples correctly classified
Sensitivity = TP / (TP + FN)
Specificity = % of class 0 examples correctly classified
Specificity = TN / (TN + FP)
True positive rate = % of class 1 examples correctly classified as class 1 = Sensitivity = TP / (TP + FN)
False positive rate = % of class 0 examples classified as class 1 = 1 – Specificity = FP / (TN + FP)
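A minimal sketch of these measures, reusing the TP/FN/FP/TN counts from the earlier confusion matrix:

TP, FN = 201, 85     # actual class 1
FP, TN = 25, 2689    # actual class 0

sensitivity = TP / (TP + FN)            # also the true positive rate
specificity = TN / (TN + FP)
false_positive_rate = FP / (TN + FP)    # equals 1 - specificity
print(sensitivity, specificity, false_positive_rate)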
ROC curve
• Plot the True Positive Rate versus the False Positive Rate for various values of the threshold
• The diagonal is the baseline – a random classifier
• Sometimes researchers use the Area Under the ROC Curve (AUC) as a performance measure
• By definition, AUC is between 0 and 1
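As a sketch, scikit-learn can compute the ROC points and the AUC directly; here actual and prob are the 24 records from the cutoff table shown earlier.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

actual = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
                   1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
prob = np.array([0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
                 0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
                 0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004])

fpr, tpr, thresholds = roc_curve(actual, prob)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(actual, prob)                # area under the ROC curve, between 0 and 1
print(auc)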
Lift Charts: Goal
Useful for assessing performance in terms of identifying the most important class
Helps evaluate, e.g.,
– How many tax records to examine
– How many loans to grant
– How many customers to mail an offer to
Lift Charts – Cont.
Compare performance of the DM model to "no model, pick randomly"
Measures the ability of the DM model to identify the important class, relative to its average prevalence
Charts give an explicit assessment of results over a large number of cutoffs
Lift Chart – cumulative performance
[Figure: lift chart (training dataset) – cumulative #positives (y-axis) vs. # cases (x-axis); one curve shows cumulative ownership when cases are sorted using predicted values, the other shows cumulative ownership using the average]
For example: after examining 10 cases (x-axis), 9 positive cases (y-axis) have been correctly identified
Lift Charts: How to Compute
• Using the model's classification scores, sort examples from most likely to least likely members of the important class
• Compute lift: accumulate the correctly classified "important class" records (y-axis) and compare to the number of total records (x-axis)
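A minimal sketch of this computation with NumPy, again using the 24 records from the cutoff table; the plotting itself (e.g. with matplotlib) is omitted.

import numpy as np

actual = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
                   1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
prob = np.array([0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
                 0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
                 0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004])

order = np.argsort(prob)[::-1]              # most likely members of class "1" first
cum_positives = np.cumsum(actual[order])    # y-axis: cumulative # of 1's identified
n_cases = np.arange(1, len(actual) + 1)     # x-axis: # of cases examined
baseline = n_cases * actual.mean()          # "no model, pick randomly" reference line
print(cum_positives)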
Asymmetric Costs
Misclassification Costs May Differ
The cost of making a misclassification error may be higher for one class than the other(s)
Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)
Example – Response to Promotional Offer
• Suppose we send an offer to 1000 people, with a 1% average response rate ("1" = response, "0" = nonresponse)
• The "naïve rule" (classify everyone as "0") has an error rate of 1% and accuracy of 99% (seems good)
• Using DM we can correctly classify eight 1's as 1's
• It comes at the cost of misclassifying twenty 0's as 1's and two 1's as 0's
The Confusion Matrix
               Actual 1   Actual 0
Predict as 1   8          20
Predict as 0   2          970

Error rate = (2 + 20)/1000 = 2.2% (higher than the naïve rate)
Introducing Costs & Benefits
Suppose:
• Profit from a “1” is $10
• Cost of sending offer is $1
Then:
• Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit
• Under DM predictions, 28 offers are sent.
8 respond with profit of $10 each
20 fail to respond, cost $1 each
972 receive nothing (no cost, no profit)
• Net profit = $60
Profit Matrix
               Actual 1   Actual 0
Predict as 1   $80        ($20)
Predict as 0   $0         $0
Lift (again)
• Adding costs to the mix, as above, does not change the actual classifications
• But it allows us to choose a better decision threshold
  – Use the lift curve and change the cutoff value for "1" to maximize profit
Adding Cost/Benefit to Lift Curve
• Sort test examples in descending probability of success
• For each case, record the cost/benefit of the actual outcome
• Also record the cumulative cost/benefit
• Plot all records:
  X-axis is the index number (1 for the 1st case, n for the nth case)
  Y-axis is the cumulative cost/benefit
  Reference line from the origin to y_n (y_n = total net benefit)
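A sketch of these steps under the assumptions of the promotional-offer example ($10 benefit for each actual "1", $1 cost for each actual "0" contacted), using a short hypothetical outcome sequence:

import numpy as np

# Hypothetical actual outcomes, already sorted by descending probability of success
actual = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])

# Cost/benefit of each actual outcome
benefit = np.where(actual == 1, 10.0, -1.0)

cum_benefit = np.cumsum(benefit)                   # y-axis: cumulative cost/benefit
index = np.arange(1, len(actual) + 1)              # x-axis: case index 1..n
reference = index * cum_benefit[-1] / len(actual)  # line from the origin to y_n (total net benefit)
print(cum_benefit)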
Lift Curve May Go Negative
If the total net benefit from all cases is negative, the reference line will have a negative slope
Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at a maximum
Negative slope to reference curve
[Figure: cost/benefit lift curve with a negatively sloping reference line; zooming in shows the maximum profit of $60]
Multiple Classes
For m classes, the confusion matrix has m rows and m columns
• Theoretically, there are m(m-1) misclassification costs, since any case could be misclassified in m-1 ways
• Practically, too many to work with
• In a decision-making context, though, such complexity rarely arises – one class is usually of primary interest
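As a sketch of the m-class case, the overall error rate is still (sum of misclassified records)/(total records); the labels below are a hypothetical three-class example.

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for m = 3 classes
y_actual = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 2])
y_pred   = np.array([0, 1, 1, 1, 2, 0, 2, 0, 2, 2])

cm = confusion_matrix(y_actual, y_pred)       # m x m confusion matrix
misclassified = cm.sum() - np.trace(cm)       # off-diagonal cells are misclassifications
error_rate = misclassified / cm.sum()
print(error_rate)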
Classification Using Triage
Take into account a gray area in making classification decisions
• Instead of classifying as C1 or C0, we classify as:
  C1
  C0
  Can't say
• The third category might receive special human review