Machine Learning Topic 7: Boosting
(based on Rob Schapire's IJCAI'99 talk)
Bryan Pardo, Machine Learning: EECS 349, Fall 2009

Horse Race Prediction

How to Make $$$ in Horse Races?
• Ask a professional.
• Suppose:
  – the professional cannot give a single highly accurate rule
  – …but, presented with a set of races, can always generate better-than-random rules
• Can you get rich?

Idea
1) Ask the expert for a rule-of-thumb
2) Assemble a set of cases where the rule-of-thumb fails (the hard cases)
3) Ask the expert for a rule-of-thumb to deal with the hard cases
4) Go to step 2
• Combine all the rules-of-thumb
• The "expert" could be a weak learning algorithm

Questions
• How do we choose races on each round?
  – concentrate on the "hardest" races (those most often misclassified by previous rules of thumb)
• How do we combine the rules of thumb into a single prediction rule?
  – take a (weighted) majority vote of the rules of thumb

Boosting
• boosting = a general method of converting rough rules of thumb into a highly accurate prediction rule
• more technically:
  – given a "weak" learning algorithm that can consistently find a hypothesis (classifier) with error ≤ 1/2 − γ (implicitly, 0 < γ ≤ 1/2)
  – a boosting algorithm can provably construct a single hypothesis with error ≤ ε, for any desired ε > 0

This Lecture
• Introduction to boosting (AdaBoost)
• Analysis of training error
• Analysis of generalization error based on the theory of margins
• Extensions
• Experiments

A Formal View of Boosting
• Given a training set X = {(x_1, y_1), …, (x_m, y_m)}
• y_i ∈ {−1, +1} is the correct label of instance x_i ∈ X
• For timesteps t = 1, …, T:
  – construct a distribution D_t on {1, …, m}   (how? — see the AdaBoost slide below)
  – find a weak hypothesis h_t : X → {−1, +1} with small error ε_t on D_t:
        ε_t = Pr_{D_t}[ h_t(x_i) ≠ y_i ]
• Output a final hypothesis H_final that combines the weak hypotheses in a good way

Weighting the Votes
• H_final is a weighted combination of the choices from all our hypotheses:
        H_final(x) = sign( Σ_t α_t · h_t(x) )
  where α_t is how seriously we take hypothesis t, and h_t(x) is what hypothesis t guessed.

The Hypothesis Weight
• α_t determines how "seriously" we take this particular classifier's answer:
        α_t = (1/2) · ln( (1 − ε_t) / ε_t )
  where ε_t is the error of h_t on the training distribution D_t. (For a feel of how α_t behaves, see the short sketch below.)
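The weight formula is easy to get a feel for numerically. The following is a minimal sketch (plain Python with NumPy; not part of the original slides) of how α_t responds to the weighted error ε_t: an almost-perfect weak hypothesis gets a large vote, while one that is no better than coin-flipping gets no vote at all.

```python
import numpy as np

def hypothesis_weight(epsilon_t):
    """AdaBoost vote weight: alpha_t = 0.5 * ln((1 - eps_t) / eps_t)."""
    return 0.5 * np.log((1.0 - epsilon_t) / epsilon_t)

# The weight grows as the weak hypothesis improves on its distribution D_t,
# and vanishes when it is no better than random guessing (eps_t = 0.5).
for eps in [0.05, 0.2, 0.4, 0.49, 0.5]:
    print(f"eps_t = {eps:.2f}  ->  alpha_t = {hypothesis_weight(eps):+.3f}")
```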
The Training Distribution
• D_t determines which elements of the training set we focus on.
• Initially, every example gets equal weight (m is the size of the training set):
        D_1(i) = 1/m
• After round t, compare the right answer y_i with what we guessed, h_t(x_i):
        D_{t+1}(i) = ( D_t(i) / Z_t ) · e^{−α_t}   if y_i = h_t(x_i)
        D_{t+1}(i) = ( D_t(i) / Z_t ) · e^{+α_t}   if y_i ≠ h_t(x_i)
  where Z_t is a normalization factor (so that D_{t+1} is a distribution).

The Hypothesis Weight (revisited)
• Since ε_t ≤ 1/2, the weight
        α_t = (1/2) · ln( (1 − ε_t) / ε_t )  ≥ 0
  so e^{−α_t} shrinks the weight of correctly classified examples and e^{+α_t} grows the weight of misclassified ones.

AdaBoost [Freund & Schapire '97]
• Constructing D_t:
  – D_1(i) = 1/m
  – given D_t and h_t:
        D_{t+1}(i) = ( D_t(i) / Z_t ) · e^{−α_t}   if y_i = h_t(x_i)
        D_{t+1}(i) = ( D_t(i) / Z_t ) · e^{+α_t}   if y_i ≠ h_t(x_i)
                   = D_t(i) · exp( −α_t · y_i · h_t(x_i) ) / Z_t
    where Z_t is a normalization constant and α_t = (1/2) · ln( (1 − ε_t) / ε_t ) > 0
• Final hypothesis:
        H_final(x) = sign( Σ_t α_t · h_t(x) )
  (A small runnable sketch of this loop appears after the "A Typical Run" slide below.)

Toy Example / Round 1 / Round 2 / Round 3 / Final Hypothesis
(figures: these five slides step through AdaBoost on a small example; the images are not reproduced in this text version)

Analyzing the Training Error
• Theorem [Freund & Schapire '97]: write ε_t as 1/2 − γ_t. Then
        training error(H_final) ≤ exp( −2 · Σ_t γ_t² )
• So if, for all t, γ_t ≥ γ > 0, then
        training error(H_final) ≤ e^{−2γ²T}

Analyzing the Training Error (cont.)
• So what? This means AdaBoost is adaptive:
  – it does not need to know γ or T a priori
  – it can exploit rounds where γ_t ≫ γ
  – …as long as ε_t < 1/2 (i.e., each γ_t > 0)

Proof Intuition
• On round t: increase the weight of examples incorrectly classified by h_t.
• If x_i is incorrectly classified by H_final,
  – then x_i is incorrectly classified by a weighted majority of the h_t's,
  – then x_i must have "large" weight under the final distribution D_{T+1}.
• Since the total weight is ≤ 1, the number of incorrectly classified examples must be "small".

Analyzing Generalization Error
• We expect:
  – training error to continue to drop (or reach zero)
  – test error to increase when H_final becomes "too complex" (Occam's razor)

A Typical Run (boosting C4.5 on the "letter" dataset)
(figure: training and test error curves vs. number of boosting rounds)
• Test error does not increase even after 1,000 rounds (~2,000,000 nodes)
• Test error continues to drop even after the training error reaches zero!
• Occam's razor wrongly predicts that the "simpler" rule is better.
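Before turning to margins, here is the compact sketch of the AdaBoost loop promised above. The weak learner (a one-feature decision stump), the toy dataset, and all function names are illustrative assumptions rather than anything from the original talk; the α_t, D_{t+1}, and H_final updates are the ones given on the AdaBoost slide.

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner: pick the feature, threshold, and sign that minimize
    the weighted error under the current distribution D."""
    n, d = X.shape
    best = (np.inf, None)  # (weighted error, (feature, threshold, sign))
    for j in range(d):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thresh, sign, -sign)
                err = np.sum(D[pred != y])
                if err < best[0]:
                    best = (err, (j, thresh, sign))
    return best[1]

def stump_predict(stump, X):
    j, thresh, sign = stump
    return np.where(X[:, j] <= thresh, sign, -sign)

def adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
    ensemble = []                                 # list of (alpha_t, h_t)
    for t in range(T):
        h_t = train_stump(X, y, D)
        pred = stump_predict(h_t, X)
        eps_t = np.sum(D[pred != y])              # weighted error on D_t
        if eps_t >= 0.5:                          # weak-learning assumption violated
            break
        alpha_t = 0.5 * np.log((1 - eps_t) / max(eps_t, 1e-12))
        D = D * np.exp(-alpha_t * y * pred)       # D_{t+1}(i) ∝ D_t(i) e^{-α_t y_i h_t(x_i)}
        D /= D.sum()                              # divide by the normalizer Z_t
        ensemble.append((alpha_t, h_t))
    return ensemble

def predict(ensemble, X):
    """H_final(x) = sign( Σ_t α_t h_t(x) )"""
    votes = sum(a * stump_predict(h, X) for a, h in ensemble)
    return np.sign(votes)

# Tiny illustrative dataset (labels in {-1, +1}); no single stump separates it.
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0],
              [5.0, 2.0], [6.0, 6.0], [7.0, 4.0], [8.0, 8.0]])
y = np.array([+1, +1, -1, -1, +1, -1, +1, -1])
H = adaboost(X, y, T=20)
print("training accuracy:", np.mean(predict(H, X) == y))
```

The stump learner is deliberately brute-force to keep the sketch short; any classifier that beats random guessing on the weighted sample could take its place, which is exactly the point of the "shift in mindset" noted later.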
A Better Story: Margins
• Key idea: consider the confidence (margin) of the final hypothesis.
• Write H_final(x) = sign( f(x) ), with
        f(x) = ( Σ_t α_t · h_t(x) ) / ( Σ_t α_t )  ∈ [−1, +1]
• Define: margin of (x, y) = y · f(x)  ∈ [−1, +1]

Margins for the Toy Example
(figure: margins of the toy-example points under the final hypothesis)

The Margin Distribution
                        epoch 5   epoch 100   epoch 1000
  training error (%)       0.0        0.0         0.0
  test error (%)           8.4        3.3         3.1
  % margins ≤ 0.5          7.7        0.0         0.0
  minimum margin           0.14       0.52        0.55

Boosting Maximizes Margins
• AdaBoost can be shown to minimize
        Σ_i e^{−y_i · f(x_i)} = Σ_i exp( −y_i · Σ_t α_t · h_t(x_i) )
  where the exponent y_i · Σ_t α_t · h_t(x_i) is (up to normalization) the margin of (x_i, y_i).

Analyzing Boosting Using Margins
• Generalization error is bounded by a function of the training-sample margins: for any θ > 0,
        error ≤ P̂r[ margin_f(x, y) ≤ θ ] + Õ( sqrt( VC(H) / (m · θ²) ) )
  where P̂r is the empirical probability on the training sample.
• Larger margins → better bound
• The bound is independent of the number of epochs
• Boosting tends to increase the margins of training examples by concentrating on those with the smallest margin
  (A margin-computation sketch appears after the SVM slides below.)

Relation to SVMs
• SVM: map x into a high-dimensional space, then separate the data linearly.

Relation to SVMs (cont.)
• Example: the classifier
        H(x) = +1 if 2x⁵ − 5x² + x ≥ 10,  −1 otherwise
  can be written as a linear separator in a higher-dimensional space:
        h(x) = (1, x, x², x³, x⁴, x⁵),   α = (−10, 1, −5, 0, 0, 2)
        H(x) = +1 if α · h(x) ≥ 0,  −1 otherwise

Relation to SVMs (cont.)
• Both maximize a margin:
        max_α  min_i  y_i · ( α · h(x_i) ) / ||α||
• SVM:      ||α||₂ — Euclidean norm (L2)
• AdaBoost: ||α||₁ — Manhattan norm (L1)
• This has implications for optimization and PAC bounds. See [Freund et al. '98] for details.
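To make the margin quantities in the table and bound above concrete, here is a small self-contained sketch (with hypothetical numbers, not taken from the slides) that computes normalized margins y_i · f(x_i) directly from a matrix of weak-hypothesis predictions and their vote weights.

```python
import numpy as np

def margins(alphas, preds, y):
    """Normalized margins y_i * f(x_i), where
    f(x_i) = sum_t alpha_t * h_t(x_i) / sum_t alpha_t  lies in [-1, +1].

    alphas : shape (T,)   vote weights alpha_t
    preds  : shape (T, m) preds[t, i] = h_t(x_i) in {-1, +1}
    y      : shape (m,)   true labels in {-1, +1}
    """
    f = alphas @ preds / alphas.sum()
    return y * f

# Hypothetical example: three weak hypotheses voting on four training points.
alphas = np.array([0.8, 0.5, 0.3])
preds = np.array([[+1, +1, -1, -1],
                  [+1, -1, -1, +1],
                  [+1, +1, +1, -1]])
y = np.array([+1, +1, -1, -1])
m = margins(alphas, preds, y)
print("margins:", m, " minimum margin:", m.min())
```

Examples with small (or negative) margins are exactly the ones AdaBoost keeps re-weighting; tracking the minimum margin over rounds mirrors the bottom row of the margin-distribution table.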
Extensions: Multiclass Problems
• Reduce to a binary problem by creating several binary questions for each example:
  – "does or does not example x belong to class 1?"
  – "does or does not example x belong to class 2?"
  – "does or does not example x belong to class 3?"
  – …

Extensions: Confidences and Probabilities
• Prediction of hypothesis h_t:  sign( h_t(x) )
• Confidence of hypothesis h_t:  | h_t(x) |
• Probability from H_final:
        Pr_f[ y = +1 | x ] = e^{f(x)} / ( e^{f(x)} + e^{−f(x)} )
  (a small numeric sketch of this conversion appears at the end of these notes)
• [Schapire & Singer '98], [Friedman, Hastie & Tibshirani '98]

Practical Advantages of AdaBoost
• (quite) fast
• simple and easy to program
• only a single parameter to tune (T)
• no prior knowledge required
• flexible: can be combined with any classifier (neural net, C4.5, …)
• provably effective (assuming a weak learner)
• shift in mindset: the goal now is merely to find hypotheses that are better than random guessing
• finds outliers

Caveats
• Performance depends on the data and the weak learner.
• AdaBoost can fail if
  – the weak hypotheses are too complex → overfitting
  – the weak hypotheses are too weak (γ_t → 0 too quickly) → underfitting → low margins → overfitting
• Empirically, AdaBoost seems especially susceptible to noise.

UCI Benchmarks
(figure: error comparison across UCI datasets)
• Comparison with:
  – C4.5 (Quinlan's decision-tree algorithm)
  – decision stumps (a single attribute only)

Text Categorization
(figure: error comparison on text categorization)
• database: Reuters

Conclusion
• Boosting is a useful tool for classification problems:
  – grounded in rich theory
  – performs well experimentally
  – often (but not always) resistant to overfitting
  – many applications
• But:
  – the combined classifier is slower
  – the result is less comprehensible
  – sometimes susceptible to noise

Background
• [Valiant '84] introduced the theoretical PAC model for studying machine learning
• [Kearns & Valiant '88] posed the open problem of finding a boosting algorithm
• [Schapire '89], [Freund '90] gave the first polynomial-time boosting algorithms
• [Drucker, Schapire & Simard '92] ran the first experiments using boosting

Background (cont.)
• [Freund & Schapire '95]
  – introduced the AdaBoost algorithm
  – strong practical advantages over previous boosting algorithms
• Experiments using AdaBoost: [Drucker & Cortes '95], [Jackson & Craven '96], [Freund & Schapire '96], [Quinlan '96], [Breiman '96], [Maclin & Opitz '97], [Bauer & Kohavi '97], [Schapire & Singer '98], [Schwenk & Bengio '98], [Dietterich '98]
• Continuing development of theory and algorithms: [Schapire, Freund, Bartlett & Lee '97], [Schapire & Singer '98], [Breiman '97], [Mason, Bartlett & Baxter '98], [Grove & Schuurmans '98], [Friedman, Hastie & Tibshirani '98]
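Finally, here is the small numeric sketch promised in the "Confidences and Probabilities" slide: converting the ensemble's real-valued score f(x) = Σ_t α_t · h_t(x) into the probability estimate Pr[y = +1 | x] = e^{f(x)} / (e^{f(x)} + e^{−f(x)}), which is just a logistic function of 2·f(x). The helper name and example scores are illustrative assumptions.

```python
import numpy as np

def prob_positive(f_x):
    """Pr[y = +1 | x] = e^f / (e^f + e^{-f})  (equivalently, sigmoid(2f)),
    where f(x) = sum_t alpha_t * h_t(x) is the ensemble's real-valued score."""
    return 1.0 / (1.0 + np.exp(-2.0 * f_x))

# Large positive scores map to probabilities near 1; a zero score maps to 0.5.
for f in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"f(x) = {f:+.1f}  ->  Pr[y=+1 | x] = {prob_positive(f):.3f}")
```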