
Machine Learning
Topic 7: Boosting
(based on Rob Schapire's IJCAI '99 talk)
Bryan Pardo, Machine Learning: EECS 349, Fall 2009
Horse Race Prediction
How to Make $$$ In Horse Races?
• Ask a professional.
• Suppose:
– Professional cannot give single highly accurate rule
– …but presented with a set of races, can always generate better-than-random rules
• Can you get rich?
Idea
1) Ask expert for rule-of-thumb
2) Assemble set of cases where rule-of-thumb fails (hard cases)
3) Ask expert for a rule-of-thumb to deal with the hard cases
4) Go to step 2
• Combine all rules-of-thumb
• Expert could be a "weak" learning algorithm
Questions
• How to choose races on each round?
– concentrate on "hardest" races (those most often misclassified by previous rules of thumb)
• How to combine rules of thumb into single prediction rule?
– take (weighted) majority vote of rules of thumb
Boosting
• boosting = general method of converting rough rules of thumb into highly accurate prediction rule
• more technically:
– given a "weak" learning algorithm that can consistently find a hypothesis (classifier) with error ≤ 1/2 - γ (implicitly 0 < γ ≤ 1/2)
– a boosting algorithm can provably construct a single hypothesis with error ≤ ε
This Lecture
• Introduction to boosting (AdaBoost)
• Analysis of training error
• Analysis of generalization error based on
theory of margins
• Extensions
• Experiments
A Formal View of Boosting
• Given training set X = {(x_1, y_1), …, (x_m, y_m)}
• y_i ∈ {-1, +1} is the correct label of instance x_i ∈ X
• for timesteps t = 1, …, T:
– construct a distribution D_t on {1, …, m}   (HOW?)
– find a weak hypothesis h_t : X → {-1, +1} with small error ε_t on D_t:

    \epsilon_t = \Pr_{i \sim D_t}[\, h_t(x_i) \neq y_i \,]

• Output a final hypothesis H_final that combines the weak hypotheses in a good way   (HOW?)
Weighting the Votes
• H_final is a weighted combination of the choices from all our hypotheses:

    H_{final}(x) = \mathrm{sgn}\Big( \sum_t \alpha_t \, h_t(x) \Big)

  where α_t is how seriously we take hypothesis t, and h_t(x) is what hypothesis t guessed.
The Hypothesis Weight
• α_t determines how "seriously" we take this particular classifier's answer:

    \alpha_t = \frac{1}{2} \ln\Big( \frac{1 - \epsilon_t}{\epsilon_t} \Big)

  where ε_t is the error of h_t on the training distribution D_t.
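To get a feel for this formula (the numbers below are my own illustration, not from the slides): a weak hypothesis with error ε_t = 0.25 receives weight

    \alpha_t = \frac{1}{2} \ln\Big( \frac{0.75}{0.25} \Big) = \frac{1}{2} \ln 3 \approx 0.55

while ε_t → 1/2 gives α_t → 0 (a barely-better-than-random hypothesis gets almost no vote), and ε_t → 0 gives α_t → ∞.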
The Training Distribution
• D_t determines which elements in the training set we focus on.
• Initially, every example gets the same weight (m is the size of the training set):

    D_1(i) = \frac{1}{m}

• On later rounds, the weights are updated and renormalized (y_i is the right answer, h_t(x_i) is what we guessed, Z_t is a normalization factor):

    D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases}
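A minimal NumPy sketch of one round of this reweighting (the five toy examples and the weak hypothesis' guesses are invented purely for illustration):

    import numpy as np

    m = 5
    D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m

    y = np.array([+1, +1, -1, -1, +1])       # correct labels
    h = np.array([+1, -1, -1, +1, +1])       # weak hypothesis' guesses this round

    eps = D[h != y].sum()                    # weighted error epsilon_t on D_t
    alpha = 0.5 * np.log((1 - eps) / eps)    # hypothesis weight alpha_t

    D = D * np.exp(-alpha * y * h)           # e^{-alpha} if correct, e^{+alpha} if wrong
    D = D / D.sum()                          # divide by Z_t (renormalize)
    # The two misclassified examples now carry more weight than the three correct ones.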
The Hypothesis Weight
• Putting the pieces together (note α_t ≥ 0, since the weak hypothesis has ε_t ≤ 1/2):

    \alpha_t = \frac{1}{2} \ln\Big( \frac{1 - \epsilon_t}{\epsilon_t} \Big) \geq 0

    D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases}
AdaBoost [Freund & Schapire '97]
• constructing D_t:

    D_1(i) = \frac{1}{m}

• given D_t and h_t:

    D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases}
               = \frac{D_t(i)}{Z_t} \exp(-\alpha_t \, y_i \, h_t(x_i))

  where Z_t is a normalization constant and

    \alpha_t = \frac{1}{2} \ln\Big( \frac{1 - \epsilon_t}{\epsilon_t} \Big) \geq 0

• final hypothesis:

    H_{final}(x) = \mathrm{sgn}\Big( \sum_t \alpha_t \, h_t(x) \Big)
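To make the whole loop concrete, here is a minimal NumPy sketch of AdaBoost with decision stumps as the weak learner. The names (train_stump, adaboost, predict) and the stump representation are my own, not from the slides; treat it as an illustrative sketch rather than a reference implementation.

    import numpy as np

    def train_stump(X, y, D):
        # Weak learner: single-feature threshold rule with lowest weighted error on D_t.
        best = None
        for j in range(X.shape[1]):                   # each feature
            for thresh in np.unique(X[:, j]):         # each observed threshold
                for sign in (+1, -1):                 # predict +1 above or below it
                    pred = np.where(X[:, j] >= thresh, sign, -sign)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thresh, sign)
        return best                                   # (epsilon_t, feature, threshold, sign)

    def adaboost(X, y, T=50):
        m = len(y)
        D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
        hypotheses, alphas = [], []
        for t in range(T):
            eps, j, thresh, sign = train_stump(X, y, D)
            if eps >= 0.5:                            # not even weakly useful: stop
                break
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # alpha_t (clamped if eps = 0)
            pred = np.where(X[:, j] >= thresh, sign, -sign)
            D = D * np.exp(-alpha * y * pred)         # up-weight mistakes, down-weight hits
            D = D / D.sum()                           # divide by Z_t
            hypotheses.append((j, thresh, sign))
            alphas.append(alpha)
        return hypotheses, alphas

    def predict(hypotheses, alphas, X):
        # H_final(x) = sgn(sum_t alpha_t * h_t(x))
        votes = np.zeros(len(X))
        for (j, thresh, sign), alpha in zip(hypotheses, alphas):
            votes += alpha * np.where(X[:, j] >= thresh, sign, -sign)
        return np.sign(votes)

Given a feature matrix X and labels y in {-1, +1}, predict(*adaboost(X, y), X) returns the boosted prediction for every training example.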
Toy Example
Round 1
Round 2
Round 3
Final Hypothesis
Analyzing the Training Error
• Theorem [Freund & Schapire '97]:
  write ε_t as 1/2 - γ_t; then

    \mathrm{training\ error}(H_{final}) \leq \exp\Big( -2 \sum_t \gamma_t^2 \Big)

  so if for all t: γ_t ≥ γ > 0, then

    \mathrm{training\ error}(H_{final}) \leq e^{-2\gamma^2 T}
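For reference, the standard argument behind this bound (a proof sketch, not on the original slides): the training error is at most the product of the normalization constants Z_t, and each Z_t can be written in terms of γ_t:

    \mathrm{training\ error}(H_{final}) \;\leq\; \prod_t Z_t
      \;=\; \prod_t 2\sqrt{\epsilon_t (1 - \epsilon_t)}
      \;=\; \prod_t \sqrt{1 - 4\gamma_t^2}
      \;\leq\; \exp\Big( -2 \sum_t \gamma_t^2 \Big)

where the last step uses 1 - x ≤ e^{-x}.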
Analyzing the Training Error
So what? This means AdaBoost is adaptive:
• does not need to know γ or T a priori
• can exploit γ_t ≫ γ
• …as long as ε_t < 1/2 (i.e., γ_t > 0)
Proof Intuition
• on round t:
  increase weight of examples incorrectly classified by h_t
• if x_i incorrectly classified by H_final
  then x_i incorrectly classified by weighted majority of h_t's
  then x_i must have "large" weight under final distribution D_{T+1}
• since total weight ≤ 1:
  number of incorrectly classified examples is "small"
Analyzing Generalization Error
we expect:
• training error to continue to drop (or reach zero)
• test error to increase when H_final becomes "too complex" (Occam's razor)
A Typical Run
(figure: boosting C4.5 on the "letter" dataset)
• Test error does not increase, even after 1,000 rounds (~2,000,000 nodes)
• Test error continues to drop after training error is zero!
• Occam's razor wrongly predicts the "simpler" rule is better.
A Better Story: Margins
Key idea: consider the confidence (margin) of the prediction:
• H_final(x) = sgn(f(x)), where

    f(x) = \frac{\sum_t \alpha_t \, h_t(x)}{\sum_t \alpha_t} \in [-1, +1]

• define: margin of (x, y) = y · f(x)
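Continuing the earlier sketch (same hypothetical hypotheses/alphas representation; my own code, not from the slides), the margins of a labeled set can be computed as:

    import numpy as np

    def margins(hypotheses, alphas, X, y):
        # margin of (x_i, y_i) = y_i * f(x_i), with f the normalized weighted vote
        votes = np.zeros(len(X))
        for (j, thresh, sign), alpha in zip(hypotheses, alphas):
            votes += alpha * np.where(X[:, j] >= thresh, sign, -sign)
        f = votes / np.sum(alphas)     # f(x) lies in [-1, +1]
        return y * f                   # positive = correctly classified; magnitude = confidence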
Margins for Toy Example
The Margin Distribution

                          epoch 5    epoch 100    epoch 1000
  training error (%)        0.0         0.0          0.0
  test error (%)            8.4         3.3          3.1
  % margins ≤ 0.5           7.7         0.0          0.0
  minimum margin            0.14        0.52         0.55
Boosting Maximizes Margins
• AdaBoost can be shown to minimize

    \sum_i e^{-y_i f(x_i)} = \sum_i e^{-y_i \sum_t \alpha_t h_t(x_i)}

  where the exponent y_i \sum_t \alpha_t h_t(x_i) is proportional to the margin of (x_i, y_i).
Analyzing Boosting Using Margins
• generalization error is bounded by a function of the training-sample margins:

    \mathrm{error} \;\leq\; \widehat{\Pr}\big[\, \mathrm{margin}_f(x, y) \leq \theta \,\big] + \tilde{O}\!\left( \sqrt{\frac{\mathrm{VC}(H)}{m\,\theta^2}} \right)

• larger margin ⇒ better bound
• bound is independent of the number of epochs
• boosting tends to increase the margins of training examples by concentrating on those with smallest margin
Relation to SVMs
SVM: map x into high-dim space, separate data linearly
Relation to SVMs (cont.)
Example: the polynomial threshold rule

    H(x) = \begin{cases} +1 & \text{if } 2x^5 - 5x^2 + x \geq 10 \\ -1 & \text{otherwise} \end{cases}

becomes linear after mapping into a higher-dimensional space: with

    h(x) = (1, x, x^2, x^3, x^4, x^5) \quad \text{and} \quad \alpha = (-10, 1, -5, 0, 0, 2),

    H(x) = \begin{cases} +1 & \text{if } \alpha \cdot h(x) \geq 0 \\ -1 & \text{otherwise} \end{cases}
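A quick numerical check of that rewriting (a throwaway script; the grid of test points is arbitrary):

    import numpy as np

    alpha = np.array([-10, 1, -5, 0, 0, 2])

    def h(x):                                   # feature map (1, x, x^2, ..., x^5)
        return np.array([x**k for k in range(6)], dtype=float)

    def H_direct(x):
        return +1 if 2*x**5 - 5*x**2 + x >= 10 else -1

    def H_linear(x):
        return +1 if alpha @ h(x) >= 0 else -1

    assert all(H_direct(x) == H_linear(x) for x in np.linspace(-3, 3, 61))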
Relation to SVMs
• Both maximize the (minimum) margin over the weight vector α (the SVM's w):

    \max_{\alpha} \; \min_i \; \frac{y_i \, (\alpha \cdot h(x_i))}{\|\alpha\|}

• SVM: \|\alpha\|_2, the Euclidean norm (L2)
• AdaBoost: \|\alpha\|_1, the Manhattan norm (L1)
• This has implications for optimization and PAC bounds
  (see [Freund et al. '98] for details)
Extensions: Multiclass Problems
• Reduce to a binary problem by creating several binary questions for each example
  (see the sketch after this list):
  • "does or does not example x belong to class 1?"
  • "does or does not example x belong to class 2?"
  • "does or does not example x belong to class 3?"
  • ...
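A minimal one-vs-rest sketch built on the earlier hypothetical adaboost helper (my own code; the slides' reduction also admits other constructions):

    import numpy as np

    def adaboost_one_vs_rest(X, y, classes, T=50):
        # Train one binary booster per class: "is it class c, or not?"
        models = {}
        for c in classes:
            y_binary = np.where(y == c, +1, -1)
            models[c] = adaboost(X, y_binary, T)       # (hypotheses, alphas) for class c
        return models

    def predict_one_vs_rest(models, X):
        # Pick the class whose booster casts the largest weighted vote.
        classes = list(models)
        scores = np.zeros((len(classes), len(X)))
        for row, c in enumerate(classes):
            hypotheses, alphas = models[c]
            for (j, thresh, sign), alpha in zip(hypotheses, alphas):
                scores[row] += alpha * np.where(X[:, j] >= thresh, sign, -sign)
        return np.array(classes)[np.argmax(scores, axis=0)]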
Extensions: Confidences and Probabilities
• Prediction of hypothesis h_t:  sgn(h_t(x))
• Confidence of hypothesis h_t:  |h_t(x)|
• Probability from H_final:

    \Pr_f[\, y = +1 \mid x \,] = \frac{e^{f(x)}}{e^{f(x)} + e^{-f(x)}}

[Schapire & Singer '98], [Friedman, Hastie & Tibshirani '98]
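That expression is just a logistic squashing of f(x); a one-line sketch (mine, not the slides'):

    import numpy as np

    def prob_positive(f_x):
        # e^f / (e^f + e^-f) simplifies to the logistic function of 2*f(x)
        return 1.0 / (1.0 + np.exp(-2.0 * np.asarray(f_x, dtype=float)))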
Practical Advantages of AdaBoost
• (quite) fast
• simple + easy to program
• only a single parameter to tune (T)
• no prior knowledge
• flexible: can be combined with any classifier (neural net, C4.5, …)
• provably effective (assuming a weak learner)
• shift in mind set: goal now is merely to find hypotheses that are better than random guessing
• finds outliers
Caveats
• performance depends on data & weak learner
• AdaBoost can fail if
  – weak hypotheses too complex (overfitting)
  – weak hypotheses too weak (γ_t → 0 too quickly):
    • underfitting
    • low margins ⇒ overfitting
• empirically, AdaBoost seems especially susceptible to noise
UCI Benchmarks
Comparison with
• C4.5 (Quinlan's Decision Tree Algorithm)
• Decision Stumps (only single attribute)
Text Categorization
database: Reuters
Conclusion
• boosting is a useful tool for classification problems
• grounded in rich theory
• performs well experimentally
• often (but not always) resistant to overfitting
• many applications
• but:
  • slower classifiers
  • results are less comprehensible
  • sometimes susceptible to noise
Background
• [Valiant '84]
  introduced theoretical PAC model for studying machine learning
• [Kearns & Valiant '88]
  open problem of finding a boosting algorithm
• [Schapire '89], [Freund '90]
  first polynomial-time boosting algorithms
• [Drucker, Schapire & Simard '92]
  first experiments using boosting
Background (cont.)
• [Freund & Schapire '95]
  – introduced the AdaBoost algorithm
  – strong practical advantages over previous boosting algorithms
• experiments using AdaBoost:
  [Drucker & Cortes '95], [Jackson & Craven '96], [Freund & Schapire '96],
  [Quinlan '96], [Breiman '96], [Maclin & Opitz '97], [Bauer & Kohavi '97],
  [Schapire & Singer '98], [Schwenk & Bengio '98], [Dietterich '98]
• continuing development of theory & algorithms:
  [Schapire, Freund, Bartlett & Lee '97], [Breiman '97], [Grove & Schuurmans '98],
  [Mason, Bartlett & Baxter '98], [Friedman, Hastie & Tibshirani '98], [Schapire & Singer '98]