IMBALANCED DATA
David Kauchak
CS 451 – Fall 2013
Admin
Assignment 3:
- how did it go?
- do the experiments help?
Assignment 4
Course feedback
Phishing
Setup
1. For 1 hour, Google collects 1M e-mails randomly.
2. They pay people to label them as “phishing” or “not-phishing”.
3. They give the data to you to learn to classify e-mails as phishing or not.
4. You, having taken ML, try out a few of your favorite classifiers.
5. You achieve an accuracy of 99.997%.
Should you be happy?
Imbalanced data
labeled data: 99.997% not-phishing, 0.003% phishing
The phishing problem is what is called an
imbalanced data problem
This occurs when there is a large discrepancy between the number of examples with each class label,
e.g. for our 1M example dataset only about
30 would actually represent phishing e-mails
What is probably going on with our classifier?
Imbalanced data
always predict not-phishing
99.997% accuracy
Why does the classifier learn this?
Imbalanced data
Many classifiers are designed to optimize error/accuracy
This tends to bias performance towards the majority class
Any time there is an imbalance in the data this can happen, but the effect is particularly severe when the imbalance is more pronounced
Imbalanced problem domains
Besides phishing (and spam), what are some other imbalanced problem domains?
Imbalanced problem domains
Medical diagnosis
Predicting faults/failures (e.g. hard-drive failures,
mechanical failures, etc.)
Predicting rare events (e.g. earthquakes)
Detecting fraud (credit card transactions, internet
traffic)
Imbalanced data: current classifiers
labeled data: 99.997% not-phishing, 0.003% phishing
How will our current classifiers do on this problem?
Imbalanced data: current classifiers
All will do fine if the data can be easily separated/distinguished
Decision trees:
- explicitly minimize training error
- when pruning, pick the “majority” label at leaves
- tend to do very poorly on imbalanced problems

k-NN:
- even for small k, the majority class will tend to overwhelm the vote

perceptron:
- can be reasonable since it only updates when a mistake is made
- can take a long time to learn
Part of the problem: evaluation
Accuracy is not the right measure of classifier
performance in these domains
Other ideas for evaluation measures?
“identification” tasks
View the task as trying to find/identify “positive” examples (i.e.
the rare events)
Precision: proportion of test examples predicted as positive that are correct
= # correctly predicted as positive / # examples predicted as positive

Recall: proportion of test examples labeled as positive that are correct
= # correctly predicted as positive / # positive examples in test set
“identification” tasks
Precision: proportion of test examples predicted as positive that are correct
= # correctly predicted as positive / # examples predicted as positive

Recall: proportion of test examples labeled as positive that are correct
= # correctly predicted as positive / # positive examples in test set

[diagram: the set of predicted-positive examples vs. the set of all positive examples, with precision and recall marked]
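To make the two definitions concrete, here is a minimal Python sketch (my own illustration; the function name and the convention that 1 marks the positive class are assumptions, not from the slides):

```python
def precision_recall(labels, predictions):
    """Compute precision and recall, treating 1 as the positive (rare) class."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    pred_pos = sum(1 for p in predictions if p == 1)   # examples predicted as positive
    gold_pos = sum(1 for y in labels if y == 1)        # positive examples in the test set
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return precision, recall
```

On the small example worked through on the next few slides (labels 0, 0, 1, 1, 0, 1, 0 and predictions 0, 1, 0, 1, 1, 1, 0), this returns precision 2/4 and recall 2/3.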
precision and recall
data   label   predicted
 …       0        0
 …       0        1
 …       1        0
 …       1        1
 …       0        1
 …       1        1
 …       0        0

precision = # correctly predicted as positive / # examples predicted as positive
recall = # correctly predicted as positive / # positive examples in test set
precision and recall
data   label   predicted
 …       0        0
 …       0        1
 …       1        0
 …       1        1
 …       0        1
 …       1        1
 …       0        0

precision = # correctly predicted as positive / # examples predicted as positive = 2/4
recall = # correctly predicted as positive / # positive examples in test set = 2/3
precision and recall
data   label   predicted
 …       0        0
 …       0        1
 …       1        0
 …       1        1
 …       0        1
 …       1        1
 …       0        0

precision = # correctly predicted as positive / # examples predicted as positive
recall = # correctly predicted as positive / # positive examples in test set
Why do we have both measures?
How can we maximize precision?
How can we maximize recall?
Maximizing precision
data   label   predicted
 …       0        0
 …       0        0
 …       1        0
 …       1        0
 …       0        0
 …       1        0
 …       0        0

precision = # correctly predicted as positive / # examples predicted as positive
recall = # correctly predicted as positive / # positive examples in test set
Don’t predict anything as positive!
Maximizing recall
data   label   predicted
 …       0        1
 …       0        1
 …       1        1
 …       1        1
 …       0        1
 …       1        1
 …       0        1

precision = # correctly predicted as positive / # examples predicted as positive
recall = # correctly predicted as positive / # positive examples in test set
Predict everything as positive!
precision vs. recall
Often there is a tradeoff between precision and recall: increasing one tends to decrease the other.
For our algorithms, how might we increase/decrease precision/recall?
precision/recall tradeoff
data   label   predicted   confidence
 …       0        0          0.75
 …       0        1          0.60
 …       1        0          0.20
 …       1        1          0.80
 …       0        1          0.50
 …       1        1          0.55
 …       0        0          0.90

- For many classifiers we can get some notion of the prediction confidence
- Only predict positive if the confidence is above a given threshold
- By varying this threshold, we can vary precision and recall
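A minimal sketch of the thresholding step (my own names; I assume the confidence here is a score for the positive class):

```python
def predict_with_threshold(scores, threshold):
    """Predict positive (1) only when the positive-class score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]
```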
precision/recall tradeoff
data   label   predicted   confidence
 …       1        1          0.80
 …       0        1          0.60
 …       1        1          0.55
 …       0        1          0.50
 …       1        0          0.20
 …       0        0          0.75
 …       0        0          0.90

- put the most confident positive predictions at the top
- put the most confident negative predictions at the bottom
- calculate precision/recall at each break point/threshold: classify everything above the threshold as positive and everything else as negative
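The break-point procedure can be sketched as follows (a rough illustration, not course code; for simplicity I assume a single positive-class score per example rather than the slides' two-stage ordering of positive then negative predictions):

```python
def precision_recall_points(labels, scores):
    """Rank examples by score (most confidently positive first) and compute
    precision/recall at every break point, treating the top-k as positive."""
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points, true_pos = [], 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        points.append((true_pos / k,            # precision at this break point
                       true_pos / total_pos))   # recall at this break point
    return points
```

Applied to the ranking above (labels 1, 0, 1, 0, 1, 0, 0 from top to bottom), this prefix computation reproduces the precision/recall values shown on the next few slides.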
precision/recall tradeoff
data   label   predicted   confidence
 …       1        1          0.80
 …       0        1          0.60
-------------------------------------- threshold
 …       1        1          0.55
 …       0        1          0.50
 …       1        0          0.20
 …       0        0          0.75
 …       0        0          0.90

precision = 1/2 = 0.5
recall = 1/3 = 0.33
precision/recall tradeoff
data   label   predicted   confidence
 …       1        1          0.80
 …       0        1          0.60
 …       1        1          0.55
 …       0        1          0.50
 …       1        0          0.20
-------------------------------------- threshold
 …       0        0          0.75
 …       0        0          0.90

precision = 3/5 = 0.6
recall = 3/3 = 1.0
precision/recall tradeoff
data   label   predicted   confidence
 …       1        1          0.80
 …       0        1          0.60
 …       1        1          0.55
 …       0        1          0.50
 …       1        0          0.20
 …       0        0          0.75
 …       0        0          0.90
-------------------------------------- threshold

precision = 3/7 = 0.43
recall = 3/3 = 1.0
precision/recall tradeoff
data   label   predicted   confidence   precision   recall
 …       1        1          0.80         1.0        0.33
 …       0        1          0.60         0.5        0.33
 …       1        1          0.55         0.66       0.66
 …       0        1          0.50         0.5        0.66
 …       1        0          0.20         0.6        1.0
 …       0        0          0.75         0.5        1.0
 …       0        0          0.90         0.43       1.0
precision-recall curve
[plot: Precision (y-axis, 0.0 to 1.0) vs. Recall (x-axis, 0.0 to 1.0)]
[two precision vs. recall curves, one per system]
Which system is better?
How can we quantify this?
Area under the curve
Area under the curve (AUC) is one metric that
encapsulates both precision and recall
- calculate the precision/recall values for all thresholdings of the test set (like we did before)
- then calculate the area under the curve
- can also be calculated as the average precision for all the recall points
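As a sketch of the "average precision" view (my own illustration; one common convention is to average the precision at each recall point, i.e. at each correctly ranked positive example):

```python
def average_precision(labels, scores):
    """Approximate the area under the precision-recall curve as the average of
    the precision values at the ranks where true positives appear."""
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos, precisions = 0, []
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            precisions.append(true_pos / k)   # precision at this recall point
    return sum(precisions) / total_pos if total_pos else 0.0
```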
[two precision vs. recall curves]
Area under the curve?
Any concerns/problems?
[two precision vs. recall curves, with a question mark over one region]
Area under the curve?
For real use, often only interested in performance in a particular range
Eventually, need to deploy. How do we decide what threshold to use?
[two precision vs. recall curves]
Area under the curve?
Ideas? We’d like a compromise between precision and recall
A combined measure: F
Combined measure that assesses precision/recall
tradeoff is F measure (weighted harmonic mean):
F = 1 / (α/P + (1−α)/R) = (β² + 1)PR / (β²P + R), where β² = (1−α)/α
F1-measure
Most common α=0.5: equal balance/weighting
between precision and recall:
F1 = 1 / (0.5/P + 0.5/R) = 2PR / (P + R)
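A small sketch of both formulas in Python (function names are mine):

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision (P) and recall (R): 1 / (alpha/P + (1-alpha)/R)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

def f1(precision, recall):
    """The alpha = 0.5 special case: F1 = 2PR / (P + R)."""
    return f_measure(precision, recall, alpha=0.5)
```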
A combined measure: F
Combined measure that assesses precision/recall
tradeoff is F measure (weighted harmonic mean):
F = 1 / (α/P + (1−α)/R) = (β² + 1)PR / (β²P + R), where β² = (1−α)/α
Why harmonic mean?
Why not normal mean (i.e. average)?
F1 and other averages
[chart: combined measures (minimum, maximum, arithmetic mean, geometric mean, harmonic mean) plotted against precision from 0 to 100, with recall fixed at 70%]
Harmonic mean encourages precision/recall values that are similar!
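A quick numerical check of this (my own illustration, mirroring the chart): fix recall at 0.7 and compare the arithmetic and harmonic means as precision varies; the harmonic mean stays low until precision catches up.

```python
recall = 0.7
for precision in (0.1, 0.3, 0.5, 0.7, 0.9):
    arithmetic = (precision + recall) / 2
    harmonic = 2 * precision * recall / (precision + recall)   # this is F1
    print(f"P={precision:.1f}  arithmetic={arithmetic:.2f}  harmonic={harmonic:.2f}")
```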
Evaluation summarized
Accuracy is often NOT an appropriate evaluation
metric for imbalanced data problems
precision/recall capture different characteristics of
our classifier
AUC and F1 can be used as a single metric to
compare algorithm variations (and to tune
hyperparameters)
Phishing – imbalanced data
Black box approach
Abstraction: we have a generic binary classifier; how can we use it to solve our new problem?

[diagram: binary classifier maps an input to +1 or -1, optionally also outputting a confidence/score]
Can we do some pre-processing/post-processing of our
data to allow us to still use our binary classifiers?
Idea 1: subsampling
labeled data: 99.997% not-phishing, 0.003% phishing

Create a new training data set by:
- including all k “positive” examples
- randomly picking k “negative” examples

new training set: 50% not-phishing, 50% phishing

pros/cons?
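A minimal sketch of subsampling (my own names, not the assignment code): keep all k positives and draw k negatives without replacement.

```python
import random

def subsample(examples, labels):
    """Keep all positive examples and randomly pick the same number of negatives."""
    positives = [(x, y) for x, y in zip(examples, labels) if y == 1]
    negatives = [(x, y) for x, y in zip(examples, labels) if y == 0]
    chosen = random.sample(negatives, k=len(positives))   # k negatives for k positives
    balanced = positives + chosen
    random.shuffle(balanced)
    return balanced
```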
Subsampling
Pros:
- easy to implement
- training becomes much more efficient (smaller training set)
- for some domains, can work very well

Cons:
- throwing away a lot of data/information
Idea 2: oversampling
labeled data: 99.997% not-phishing, 0.003% phishing

Create a new training data set by:
- including all m “negative” examples
- including m “positive” examples:
  - repeat each example a fixed number of times, or
  - sample with replacement

new training set: 50% not-phishing, 50% phishing

pros/cons?
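A sketch of the sample-with-replacement variant (my own names): keep all m negatives and draw m positives with replacement.

```python
import random

def oversample(examples, labels):
    """Keep all m negative examples and sample m positives with replacement."""
    positives = [(x, y) for x, y in zip(examples, labels) if y == 1]
    negatives = [(x, y) for x, y in zip(examples, labels) if y == 0]
    boosted = random.choices(positives, k=len(negatives))   # sampling with replacement
    balanced = negatives + boosted
    random.shuffle(balanced)
    return balanced
```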
oversampling
Pros:
- easy to implement
- utilizes all of the training data
- tends to perform well in a broader set of circumstances than subsampling

Cons:
- computationally expensive to train the classifier
Idea 2b: weighted examples
labeled data: 99.997% not-phishing, 0.003% phishing

Add costs/weights to the training set:
- “negative” examples get weight 1
- “positive” examples get a much larger weight: 99.997/0.003 ≈ 33332
- change the learning algorithm to optimize weighted training error

pros/cons?
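A sketch of this weighting scheme and the weighted training error it plugs into (my own illustration):

```python
def example_weights(labels):
    """Weight 1 for negatives; positives get the negative-to-positive ratio
    (about 99.997 / 0.003, i.e. roughly 33332, for the phishing data)."""
    num_pos = sum(1 for y in labels if y == 1)
    num_neg = len(labels) - num_pos
    pos_weight = num_neg / num_pos
    return [pos_weight if y == 1 else 1.0 for y in labels]

def weighted_error(labels, predictions, weights):
    """Weighted training error: total weight of the mistakes over the total weight."""
    mistakes = sum(w for y, p, w in zip(labels, predictions, weights) if y != p)
    return mistakes / sum(weights)
```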
weighted examples
Pros:
- achieves the effect of oversampling without the computational cost
- utilizes all of the training data
- tends to perform well in a broader set of circumstances

Cons:
- requires a classifier that can deal with weights
Of our three classifiers, can all be modified to handle weights?
Building decision trees
Otherwise:
- calculate the “score” for each feature if we used it to split the data
- pick the feature with the highest score, partition the data based on that feature’s value, and call recursively

We used the training error to decide which feature to choose: now use the weighted training error.

In general, any time we do a count, use the weighted count (e.g. in calculating the majority label at a leaf).
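For the "use weighted counts" change, a small sketch (hypothetical helper, not from the assignment): the majority label at a leaf becomes the label with the larger total weight.

```python
def weighted_majority_label(labels, weights):
    """Pick the leaf label with the largest total weight instead of the largest count."""
    totals = {}
    for y, w in zip(labels, weights):
        totals[y] = totals.get(y, 0.0) + w
    return max(totals, key=totals.get)
```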
Idea 3: optimize a different error metric
Train classifiers that try and optimize F1 measure or
AUC or …
or, come up with another learning algorithm designed
specifically for imbalanced problems
pros/cons?
Idea 3: optimize a different error metric
Train classifiers that try and optimize F1 measure or AUC or …
Challenge: not all classifiers are amenable to this
or, come up with another learning algorithm designed
specifically for imbalanced problems
Don’t want to reinvent the wheel!
That said, there are a number of approaches that have been
developed to specifically handle imbalanced problems