
Machine Learning
Topic 7: Boosting
(based on Rob Schapire's IJCAI '99 talk)
Bryan Pardo, Machine Learning: EECS 349, Fall 2009
Horse Race Prediction
How to Make $$$ In Horse Races?
• Ask a professional.
• Suppose:
– Professional cannot give single highly accurate rule
– …but presented with a set of races, can always generate better-than-random rules
• Can you get rich?
Idea
1) Ask expert for rule-of-thumb
2) Assemble set of cases where rule-of-thumb fails (hard cases)
3) Ask expert for a rule-of-thumb to deal with the hard cases
4) Go to step 2
• Combine all rules-of-thumb
• Expert could be a "weak" learning algorithm
Questions
• How to choose races on each round?
– concentrate on "hardest" races (those most often misclassified by previous rules of thumb)
• How to combine rules of thumb into single prediction rule?
– take (weighted) majority vote of rules of thumb
Boosting
• boosting = general method of converting rough rules of thumb into highly accurate prediction rule
• more technically:
– given a "weak" learning algorithm that can consistently find a hypothesis (classifier) with error ≤ 1/2 - γ (implicitly 0 < γ ≤ 1/2)
– a boosting algorithm can provably construct a single hypothesis with error ≤ ε
This Lecture
• Introduction to boosting (AdaBoost)
• Analysis of training error
• Analysis of generalization error based on
theory of margins
• Extensions
• Experiments
A Formal View of Boosting
• Given training set X = {(x_1, y_1), …, (x_m, y_m)}
• y_i ∈ {-1, +1} is the correct label of instance x_i ∈ X
• for timesteps t = 1, …, T:
– construct a distribution D_t on {1, …, m}   (HOW?)
– find a weak hypothesis h_t : X → {-1, +1} with small error ε_t on D_t:

    \epsilon_t = \Pr_{i \sim D_t}[\, h_t(x_i) \neq y_i \,]

• Output a final hypothesis H_final that combines the weak hypotheses in a good way   (HOW?)
Weighting the Votes
• H_final is a weighted combination of the choices from all our hypotheses:

    H_{final}(x) = \mathrm{sgn}\Big( \sum_t \alpha_t \, h_t(x) \Big)

  where α_t is how seriously we take hypothesis t, and h_t(x) is what hypothesis t guessed.
The Hypothesis Weight
• α_t determines how "seriously" we take this particular classifier's answer:

    \alpha_t = \frac{1}{2} \ln\Big( \frac{1 - \epsilon_t}{\epsilon_t} \Big)

  where ε_t is the error of h_t on the training distribution D_t.
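To get a feel for this formula (the numbers below are my own illustration, not from the slides): a weak hypothesis with error ε_t = 0.25 receives weight

    \alpha_t = \frac{1}{2} \ln\Big( \frac{0.75}{0.25} \Big) = \frac{1}{2} \ln 3 \approx 0.55

while ε_t → 1/2 gives α_t → 0 (a barely-better-than-random hypothesis gets almost no vote), and ε_t → 0 gives α_t → ∞.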
The Training Distribution
• D_t determines which elements in the training set we focus on.
• Initially, every example gets the same weight (m is the size of the training set):

    D_1(i) = \frac{1}{m}

• On later rounds, the weights are updated and renormalized (y_i is the right answer, h_t(x_i) is what we guessed, Z_t is a normalization factor):

    D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases}
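A minimal NumPy sketch of one round of this reweighting (the five toy examples and the weak hypothesis' guesses are invented purely for illustration):

    import numpy as np

    m = 5
    D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m

    y = np.array([+1, +1, -1, -1, +1])       # correct labels
    h = np.array([+1, -1, -1, +1, +1])       # weak hypothesis' guesses this round

    eps = D[h != y].sum()                    # weighted error epsilon_t on D_t
    alpha = 0.5 * np.log((1 - eps) / eps)    # hypothesis weight alpha_t

    D = D * np.exp(-alpha * y * h)           # e^{-alpha} if correct, e^{+alpha} if wrong
    D = D / D.sum()                          # divide by Z_t (renormalize)
    # The two misclassified examples now carry more weight than the three correct ones.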
The Hypothesis Weight
• Putting the pieces together (note α_t ≥ 0, since the weak hypothesis has ε_t ≤ 1/2):

    \alpha_t = \frac{1}{2} \ln\Big( \frac{1 - \epsilon_t}{\epsilon_t} \Big) \geq 0

    D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases}
AdaBoost [Freund & Schapire '97]
• constructing D_t:

    D_1(i) = \frac{1}{m}

• given D_t and h_t:

    D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases}
               = \frac{D_t(i)}{Z_t} \exp(-\alpha_t \, y_i \, h_t(x_i))

  where Z_t is a normalization constant and

    \alpha_t = \frac{1}{2} \ln\Big( \frac{1 - \epsilon_t}{\epsilon_t} \Big) \geq 0

• final hypothesis:

    H_{final}(x) = \mathrm{sgn}\Big( \sum_t \alpha_t \, h_t(x) \Big)
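To make the whole loop concrete, here is a minimal NumPy sketch of AdaBoost with decision stumps as the weak learner. The names (train_stump, adaboost, predict) and the stump representation are my own, not from the slides; treat it as an illustrative sketch rather than a reference implementation.

    import numpy as np

    def train_stump(X, y, D):
        # Weak learner: single-feature threshold rule with lowest weighted error on D_t.
        best = None
        for j in range(X.shape[1]):                   # each feature
            for thresh in np.unique(X[:, j]):         # each observed threshold
                for sign in (+1, -1):                 # predict +1 above or below it
                    pred = np.where(X[:, j] >= thresh, sign, -sign)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thresh, sign)
        return best                                   # (epsilon_t, feature, threshold, sign)

    def adaboost(X, y, T=50):
        m = len(y)
        D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
        hypotheses, alphas = [], []
        for t in range(T):
            eps, j, thresh, sign = train_stump(X, y, D)
            if eps >= 0.5:                            # not even weakly useful: stop
                break
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # alpha_t (clamped if eps = 0)
            pred = np.where(X[:, j] >= thresh, sign, -sign)
            D = D * np.exp(-alpha * y * pred)         # up-weight mistakes, down-weight hits
            D = D / D.sum()                           # divide by Z_t
            hypotheses.append((j, thresh, sign))
            alphas.append(alpha)
        return hypotheses, alphas

    def predict(hypotheses, alphas, X):
        # H_final(x) = sgn(sum_t alpha_t * h_t(x))
        votes = np.zeros(len(X))
        for (j, thresh, sign), alpha in zip(hypotheses, alphas):
            votes += alpha * np.where(X[:, j] >= thresh, sign, -sign)
        return np.sign(votes)

Given a feature matrix X and labels y in {-1, +1}, predict(*adaboost(X, y), X) returns the boosted prediction for every training example.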
Toy Example
Round 1
Round 2
Round 3
Final Hypothesis
Analyzing the Training Error
• Theorem [Freund & Schapire '97]:
  write ε_t as 1/2 - γ_t; then

    \mathrm{training\ error}(H_{final}) \leq \exp\Big( -2 \sum_t \gamma_t^2 \Big)

  so if for all t: γ_t ≥ γ > 0, then

    \mathrm{training\ error}(H_{final}) \leq e^{-2\gamma^2 T}
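For reference, the standard argument behind this bound (a proof sketch, not on the original slides): the training error is at most the product of the normalization constants Z_t, and each Z_t can be written in terms of γ_t:

    \mathrm{training\ error}(H_{final}) \;\leq\; \prod_t Z_t
      \;=\; \prod_t 2\sqrt{\epsilon_t (1 - \epsilon_t)}
      \;=\; \prod_t \sqrt{1 - 4\gamma_t^2}
      \;\leq\; \exp\Big( -2 \sum_t \gamma_t^2 \Big)

where the last step uses 1 - x ≤ e^{-x}.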
Analyzing the Training Error
So what? This means AdaBoost is adaptive:
• does not need to know γ or T a priori
• can exploit γ_t ≫ γ
• …as long as ε_t < 1/2 (i.e., γ_t > 0)
Proof Intuition
• on round t:
  increase weight of examples incorrectly classified by h_t
• if x_i incorrectly classified by H_final
  then x_i incorrectly classified by weighted majority of h_t's
  then x_i must have "large" weight under final distribution D_{T+1}
• since total weight ≤ 1:
  number of incorrectly classified examples is "small"
Analyzing Generalization Error
we expect:
• training error to continue to drop (or reach zero)
• test error to increase when H_final becomes "too complex" (Occam's razor)
A Typical Run
(figure: boosting C4.5 on the "letter" dataset)
• Test error does not increase, even after 1,000 rounds (~2,000,000 nodes)
• Test error continues to drop after training error is zero!
• Occam's razor wrongly predicts the "simpler" rule is better.
A Better Story: Margins
Key idea: consider the confidence (margin) of the prediction:
• H_final(x) = sgn(f(x)), where

    f(x) = \frac{\sum_t \alpha_t \, h_t(x)}{\sum_t \alpha_t} \in [-1, +1]

• define: margin of (x, y) = y · f(x)
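Continuing the earlier sketch (same hypothetical hypotheses/alphas representation; my own code, not from the slides), the margins of a labeled set can be computed as:

    import numpy as np

    def margins(hypotheses, alphas, X, y):
        # margin of (x_i, y_i) = y_i * f(x_i), with f the normalized weighted vote
        votes = np.zeros(len(X))
        for (j, thresh, sign), alpha in zip(hypotheses, alphas):
            votes += alpha * np.where(X[:, j] >= thresh, sign, -sign)
        f = votes / np.sum(alphas)     # f(x) lies in [-1, +1]
        return y * f                   # positive = correctly classified; magnitude = confidence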
Margins for Toy Example
The Margin Distribution

                          epoch 5    epoch 100    epoch 1000
  training error (%)        0.0         0.0          0.0
  test error (%)            8.4         3.3          3.1
  % margins ≤ 0.5           7.7         0.0          0.0
  minimum margin            0.14        0.52         0.55
Boosting Maximizes Margins
• AdaBoost can be shown to minimize

    \sum_i e^{-y_i f(x_i)} = \sum_i e^{-y_i \sum_t \alpha_t h_t(x_i)}

  where the exponent y_i \sum_t \alpha_t h_t(x_i) is proportional to the margin of (x_i, y_i).
Analyzing Boosting Using Margins
• generalization error is bounded by a function of the training-sample margins:

    \mathrm{error} \;\leq\; \widehat{\Pr}\big[\, \mathrm{margin}_f(x, y) \leq \theta \,\big] + \tilde{O}\!\left( \sqrt{\frac{\mathrm{VC}(H)}{m\,\theta^2}} \right)

• larger margin ⇒ better bound
• bound is independent of the number of epochs
• boosting tends to increase the margins of training examples by concentrating on those with smallest margin
Relation to SVMs
SVM: map x into high-dim space, separate data linearly
Relation to SVMs (cont.)
Example: the polynomial threshold rule

    H(x) = \begin{cases} +1 & \text{if } 2x^5 - 5x^2 + x \geq 10 \\ -1 & \text{otherwise} \end{cases}

becomes linear after mapping into a higher-dimensional space: with

    h(x) = (1, x, x^2, x^3, x^4, x^5) \quad \text{and} \quad \alpha = (-10, 1, -5, 0, 0, 2),

    H(x) = \begin{cases} +1 & \text{if } \alpha \cdot h(x) \geq 0 \\ -1 & \text{otherwise} \end{cases}
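A quick numerical check of that rewriting (a throwaway script; the grid of test points is arbitrary):

    import numpy as np

    alpha = np.array([-10, 1, -5, 0, 0, 2])

    def h(x):                                   # feature map (1, x, x^2, ..., x^5)
        return np.array([x**k for k in range(6)], dtype=float)

    def H_direct(x):
        return +1 if 2*x**5 - 5*x**2 + x >= 10 else -1

    def H_linear(x):
        return +1 if alpha @ h(x) >= 0 else -1

    assert all(H_direct(x) == H_linear(x) for x in np.linspace(-3, 3, 61))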
Relation to SVMs
• Both maximize the (minimum) margin over the weight vector α (the SVM's w):

    \max_{\alpha} \; \min_i \; \frac{y_i \, (\alpha \cdot h(x_i))}{\|\alpha\|}

• SVM: \|\alpha\|_2, the Euclidean norm (L2)
• AdaBoost: \|\alpha\|_1, the Manhattan norm (L1)
• This has implications for optimization and PAC bounds
  (see [Freund et al. '98] for details)
Extensions: Multiclass Problems
• Reduce to a binary problem by creating several binary questions for each example
  (see the sketch after this list):
  • "does or does not example x belong to class 1?"
  • "does or does not example x belong to class 2?"
  • "does or does not example x belong to class 3?"
  • ...
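A minimal one-vs-rest sketch built on the earlier hypothetical adaboost helper (my own code; the slides' reduction also admits other constructions):

    import numpy as np

    def adaboost_one_vs_rest(X, y, classes, T=50):
        # Train one binary booster per class: "is it class c, or not?"
        models = {}
        for c in classes:
            y_binary = np.where(y == c, +1, -1)
            models[c] = adaboost(X, y_binary, T)       # (hypotheses, alphas) for class c
        return models

    def predict_one_vs_rest(models, X):
        # Pick the class whose booster casts the largest weighted vote.
        classes = list(models)
        scores = np.zeros((len(classes), len(X)))
        for row, c in enumerate(classes):
            hypotheses, alphas = models[c]
            for (j, thresh, sign), alpha in zip(hypotheses, alphas):
                scores[row] += alpha * np.where(X[:, j] >= thresh, sign, -sign)
        return np.array(classes)[np.argmax(scores, axis=0)]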
Extensions: Confidences and Probabilities
• Prediction of hypothesis h_t:  sgn(h_t(x))
• Confidence of hypothesis h_t:  |h_t(x)|
• Probability from H_final:

    \Pr_f[\, y = +1 \mid x \,] = \frac{e^{f(x)}}{e^{f(x)} + e^{-f(x)}}

[Schapire & Singer '98], [Friedman, Hastie & Tibshirani '98]
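That expression is just a logistic squashing of f(x); a one-line sketch (mine, not the slides'):

    import numpy as np

    def prob_positive(f_x):
        # e^f / (e^f + e^-f) simplifies to the logistic function of 2*f(x)
        return 1.0 / (1.0 + np.exp(-2.0 * np.asarray(f_x, dtype=float)))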
Practical Advantages of AdaBoost
• (quite) fast
• simple + easy to program
• only a single parameter to tune (T)
• no prior knowledge
• flexible: can be combined with any classifier (neural net, C4.5, …)
• provably effective (assuming a weak learner)
• shift in mind set: goal now is merely to find hypotheses that are better than random guessing
• finds outliers
Caveats
• performance depends on data & weak learner
• AdaBoost can fail if
  – weak hypotheses too complex (overfitting)
  – weak hypotheses too weak (γ_t → 0 too quickly):
    • underfitting
    • low margins ⇒ overfitting
• empirically, AdaBoost seems especially susceptible to noise
UCI Benchmarks
Comparison with
• C4.5 (Quinlan's Decision Tree Algorithm)
• Decision Stumps (only single attribute)
Text Categorization
database: Reuters
Conclusion
• boosting is a useful tool for classification problems
• grounded in rich theory
• performs well experimentally
• often (but not always) resistant to overfitting
• many applications
• but:
  • slower classifiers
  • results are less comprehensible
  • sometimes susceptible to noise
Background
• [Valiant '84]
  introduced theoretical PAC model for studying machine learning
• [Kearns & Valiant '88]
  open problem of finding a boosting algorithm
• [Schapire '89], [Freund '90]
  first polynomial-time boosting algorithms
• [Drucker, Schapire & Simard '92]
  first experiments using boosting
Background (cont.)
• [Freund & Schapire '95]
  – introduced the AdaBoost algorithm
  – strong practical advantages over previous boosting algorithms
• experiments using AdaBoost:
  [Drucker & Cortes '95], [Jackson & Craven '96], [Freund & Schapire '96],
  [Quinlan '96], [Breiman '96], [Maclin & Opitz '97], [Bauer & Kohavi '97],
  [Schapire & Singer '98], [Schwenk & Bengio '98], [Dietterich '98]
• continuing development of theory & algorithms:
  [Schapire, Freund, Bartlett & Lee '97], [Breiman '97], [Grove & Schuurmans '98],
  [Mason, Bartlett & Baxter '98], [Friedman, Hastie & Tibshirani '98], [Schapire & Singer '98]