Foundations of Machine Learning
Boosting
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu

Weak Learning (Kearns and Valiant, 1994)
Definition: a concept class $C$ is weakly PAC-learnable if there exist a (weak) learning algorithm $L$ and $\gamma > 0$ such that, for all $\delta > 0$, for all $c \in C$ and all distributions $D$:
• $\Pr_{S \sim D}\big[R(h_S) \le \tfrac{1}{2} - \gamma\big] \ge 1 - \delta$,
• for samples $S$ of size $m = \mathrm{poly}(1/\delta)$ for a fixed polynomial.

Boosting Ideas
Finding simple, relatively accurate base classifiers is often not hard → weak learner.
Main ideas:
• use the weak learner to create a strong learner.
• combine the base classifiers returned by the weak learner (ensemble method).
But how should the base classifiers be combined?

AdaBoost - Preliminaries (Freund and Schapire, 1997)
Family of base classifiers: $H \subseteq \{-1,+1\}^X$.
Iterative algorithm ($T$ rounds): at each round $t \in [1, T]$, it
• maintains a distribution $D_t$ over the training sample;
• maintains a convex combination hypothesis $f_t = \sum_{s=1}^{t} \alpha_s h_s$, with $\alpha_s \ge 0$ and $h_s \in H$.
Weighted error of hypothesis $h$ at round $t$:
$\Pr_{D_t}[h(x_i) \ne y_i] = \sum_{i=1}^{m} D_t(i)\, 1_{h(x_i) \ne y_i}.$

AdaBoost (Freund and Schapire, 1997)
$H \subseteq \{-1,+1\}^X$.
AdaBoost$(S = ((x_1, y_1), \ldots, (x_m, y_m)))$
 1  for $i \leftarrow 1$ to $m$ do
 2      $D_1(i) \leftarrow \frac{1}{m}$
 3  for $t \leftarrow 1$ to $T$ do
 4      $h_t \leftarrow$ base classifier in $H$ with small error $\epsilon_t = \Pr_{D_t}[h_t(x_i) \ne y_i]$
 5      $\alpha_t \leftarrow \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$
 6      $Z_t \leftarrow 2\,[\epsilon_t (1 - \epsilon_t)]^{1/2}$   (normalization factor)
 7      for $i \leftarrow 1$ to $m$ do
 8          $D_{t+1}(i) \leftarrow \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
 9  $f \leftarrow \sum_{t=1}^{T} \alpha_t h_t$
10  return $h = \mathrm{sgn}(f)$

Notes
Distributions $D_t$ over the training sample:
• originally uniform;
• at each round, the weight of a misclassified example is increased;
• observation: $D_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, since
$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t} = \frac{D_{t-1}(i)\, e^{-\alpha_{t-1} y_i h_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t} = \frac{1}{m}\, \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{\prod_{s=1}^{t} Z_s}.$
Weight assigned to base classifier $h_t$: $\alpha_t$ directly depends on the accuracy of $h_t$ at round $t$.

Illustration
[Figures: the base classifier selected and the reweighted sample at rounds $t = 1$, $t = 2$, $t = 3$, ..., and the final weighted combination $\alpha_1(\cdot) + \alpha_2(\cdot) + \alpha_3(\cdot)$ of the resulting classifiers.]

Bound on Empirical Error (Freund and Schapire, 1997)
Theorem: the empirical error of the classifier output by AdaBoost verifies:
$\widehat{R}(h) \le \exp\Big(-2 \sum_{t=1}^{T} \big(\tfrac{1}{2} - \epsilon_t\big)^2\Big).$
• If further for all $t \in [1, T]$, $\gamma \le \big(\tfrac{1}{2} - \epsilon_t\big)$, then $\widehat{R}(h) \le \exp(-2 \gamma^2 T)$.
• $\gamma$ does not need to be known in advance: adaptive boosting.
• Proof: since, as we saw, $D_{T+1}(i) = \frac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le 0} \le \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i)) = \frac{1}{m} \sum_{i=1}^{m} \Big[m \prod_{t=1}^{T} Z_t\Big] D_{T+1}(i) = \prod_{t=1}^{T} Z_t.$
• Now, since $Z_t$ is a normalization factor,
$Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = \sum_{i:\, y_i h_t(x_i) \ge 0} D_t(i)\, e^{-\alpha_t} + \sum_{i:\, y_i h_t(x_i) < 0} D_t(i)\, e^{\alpha_t} = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} = (1 - \epsilon_t) \sqrt{\tfrac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\tfrac{1 - \epsilon_t}{\epsilon_t}} = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.$
• Thus,
$\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2 \sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_{t=1}^{T} \sqrt{1 - 4 \big(\tfrac{1}{2} - \epsilon_t\big)^2} \le \prod_{t=1}^{T} \exp\big(-2 \big(\tfrac{1}{2} - \epsilon_t\big)^2\big) = \exp\Big(-2 \sum_{t=1}^{T} \big(\tfrac{1}{2} - \epsilon_t\big)^2\Big).$
• Notes:
• $\alpha_t$ is the minimizer of $\alpha \mapsto (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}$;
• since $(1 - \epsilon_t)\, e^{-\alpha_t} = \epsilon_t\, e^{\alpha_t}$, at each round, AdaBoost assigns the same probability mass to correctly classified and misclassified instances;
• for base classifiers $h_t \colon X \to [-1, +1]$, $\alpha_t$ can be similarly chosen to minimize $Z_t$.
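To make the pseudocode and the weight-update identity above concrete, here is a minimal Python (NumPy) sketch of AdaBoost with decision stumps as base classifiers. It is an illustrative sketch, not the lecture's implementation: the helper names (adaboost, best_stump, stump_predict) and the brute-force threshold search are assumptions of this example.

import numpy as np

def stump_predict(stump, X):
    """Prediction of a decision stump (feature index j, threshold thr, sign)."""
    j, thr, sign = stump
    return sign * np.where(X[:, j] <= thr, 1.0, -1.0)

def best_stump(X, y, D):
    """Exhaustive search for the stump with smallest weighted error under D."""
    m, N = X.shape
    best, best_err = None, np.inf
    for j in range(N):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = sign * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = D[pred != y].sum()          # eps_t = Pr_{D_t}[h_t(x_i) != y_i]
                if err < best_err:
                    best, best_err = (j, thr, sign), err
    return best, best_err

def adaboost(X, y, T):
    """Returns a list of (alpha_t, stump_t) pairs; y has entries in {-1, +1}."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                       # D_1: uniform distribution
    ensemble = []
    for _ in range(T):
        stump, eps = best_stump(X, y, D)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)      # guard against eps in {0, 1}
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = (1/2) log((1-eps_t)/eps_t)
        D = D * np.exp(-alpha * y * stump_predict(stump, X))
        D = D / D.sum()                           # division by Z_t
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    """h = sgn(f) with f = sum_t alpha_t h_t."""
    f = sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)
    return np.sign(f)

The brute-force stump search rescans all thresholds at every round; the pre-sorting trick discussed later in the lecture reduces the overall cost to $O((m \log m) N + m N T)$.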
AdaBoost = Coordinate Descent
Objective function: convex and differentiable,
$F(\alpha) = \sum_{i=1}^{m} e^{-y_i f(x_i)} = \sum_{i=1}^{m} e^{-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}.$
[Figure: the exponential loss $x \mapsto e^{-x}$ upper-bounding the zero-one loss.]
• Direction: unit vector $e_t$ with
$e_t = \operatorname{argmin}_t \frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta}\Big|_{\eta=0}.$
Since $F(\alpha_{t-1} + \eta\, e_t) = \sum_{i=1}^{m} e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)}\, e^{-\eta y_i h_t(x_i)}$,
$\frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta}\Big|_{\eta=0} = -\sum_{i=1}^{m} y_i h_t(x_i)\, e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)} = -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, \Big[m \prod_{s=1}^{t-1} Z_s\Big] = -\big[(1 - \epsilon_t) - \epsilon_t\big] \Big[m \prod_{s=1}^{t-1} Z_s\Big] = \big[2 \epsilon_t - 1\big] \Big[m \prod_{s=1}^{t-1} Z_s\Big].$
Thus, the direction corresponds to the base classifier with smallest error.
• Step size: obtained via $\frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta} = 0$:
$-\sum_{i=1}^{m} y_i h_t(x_i)\, e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)}\, e^{-\eta y_i h_t(x_i)} = 0$
$\iff -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, \Big[m \prod_{s=1}^{t-1} Z_s\Big]\, e^{-\eta y_i h_t(x_i)} = 0$
$\iff -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, e^{-\eta y_i h_t(x_i)} = 0$
$\iff -\big[(1 - \epsilon_t)\, e^{-\eta} - \epsilon_t\, e^{\eta}\big] = 0$
$\iff \eta = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}.$
Thus, the step size matches the base-classifier weight of AdaBoost.

Alternative Loss Functions
[Figure: plots of the following losses as functions of the margin $x$.]
• boosting (exponential) loss: $x \mapsto e^{-x}$
• square loss: $x \mapsto (1 - x)^2\, 1_{x \le 1}$
• logistic loss: $x \mapsto \log_2(1 + e^{-x})$
• hinge loss: $x \mapsto \max(1 - x, 0)$
• zero-one loss: $x \mapsto 1_{x < 0}$

Standard Use in Practice
Base learners: decision trees, quite often just decision stumps (trees of depth one).
Boosting stumps:
• data in $\mathbb{R}^N$, e.g., $N = 2$, $(\mathrm{height}(x), \mathrm{weight}(x))$;
• associate a stump to each component;
• pre-sort each component: $O(N m \log m)$;
• at each round, find the best component and threshold;
• total complexity: $O((m \log m) N + m N T)$;
• stumps are not always weak learners: think of the XOR example!

Overfitting?
Assume that $\mathrm{VCdim}(H) = d$ and, for a fixed $T$, define
$\mathcal{F}_T = \Big\{ \mathrm{sgn}\Big(\sum_{t=1}^{T} \alpha_t h_t + b\Big) : \alpha_t, b \in \mathbb{R},\ h_t \in H \Big\}.$
$\mathcal{F}_T$ can form a very rich family of classifiers. It can be shown (Freund and Schapire, 1997) that:
$\mathrm{VCdim}(\mathcal{F}_T) \le 2 (d + 1)(T + 1) \log_2\big((T + 1) e\big).$
This suggests that AdaBoost could overfit for large values of $T$, and that is in fact observed in some cases, but in various others it is not!

Empirical Observations
Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore, the test error keeps decreasing even after the training error has reached zero.
[Figure: error curves and margin distribution graph for boosting C4.5 decision trees on the letter dataset, as reported by Schapire et al. (1998); left: training and test error (lower and upper curves, respectively) of the combined classifier versus the number of rounds; right: cumulative margin distributions.]

Rademacher Complexity of Convex Hulls
Theorem: let $H$ be a set of functions mapping from $X$ to $\mathbb{R}$, and let the convex hull of $H$ be defined as
$\mathrm{conv}(H) = \Big\{ \sum_{k=1}^{p} \mu_k h_k : p \ge 1,\ \mu_k \ge 0,\ \sum_{k=1}^{p} \mu_k \le 1,\ h_k \in H \Big\}.$
Then, for any sample $S$, $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.
Proof:
$\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{h_k \in H,\, \mu \ge 0,\, \|\mu\|_1 \le 1} \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i)\Big]$
$= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{h_k \in H}\ \sup_{\mu \ge 0,\, \|\mu\|_1 \le 1} \sum_{k=1}^{p} \mu_k \sum_{i=1}^{m} \sigma_i h_k(x_i)\Big]$
$= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{h_k \in H}\ \max_{k \in [1, p]} \sum_{i=1}^{m} \sigma_i h_k(x_i)\Big]$
$= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i)\Big] = \widehat{\mathfrak{R}}_S(H).$

Margin Bound - Ensemble Methods (Koltchinskii and Panchenko, 2002)
Corollary: let $H$ be a set of real-valued functions. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:
$R(h) \le \widehat{R}_{\rho}(h) + \frac{2}{\rho}\, \mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$
$R(h) \le \widehat{R}_{\rho}(h) + \frac{2}{\rho}\, \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$
Proof: direct consequence of the margin bound of Lecture 4 and $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.
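The empirical margin loss $\widehat{R}_{\rho}(h)$ appearing in these bounds is straightforward to compute from the base-classifier predictions. Here is a small hedged sketch (NumPy assumed; the function names and the matrix layout are illustrative, not from the lecture):

import numpy as np

def normalized_margins(base_preds, alphas, y):
    """Normalized margins y_i f(x_i) / ||alpha||_1.
    base_preds: (m, T) array with base_preds[i, t] = h_t(x_i) in {-1, +1}
    alphas:     (T,)  array of base-classifier weights
    y:          (m,)  array of labels in {-1, +1}"""
    f = base_preds @ alphas
    return y * f / np.abs(alphas).sum()

def empirical_margin_loss(base_preds, alphas, y, rho):
    """Fraction of sample points with normalized margin at most rho,
    i.e. the empirical margin loss term of the ensemble margin bound."""
    return float(np.mean(normalized_margins(base_preds, alphas, y) <= rho))

Plotting empirical_margin_loss as a function of $\rho$ gives exactly the kind of cumulative margin distribution curve shown in the empirical observations above.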
Margin Bound - Ensemble Methods (Koltchinskii and Panchenko, 2002); see also (Schapire et al., 1998)
Corollary: let $H$ be a family of functions taking values in $\{-1, +1\}$ with VC dimension $d$. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:
$R(h) \le \widehat{R}_{\rho}(h) + \frac{2}{\rho}\sqrt{\frac{2 d \log\frac{e m}{d}}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$
Proof: follows directly from the previous corollary and the VC-dimension bound on Rademacher complexity (see Lecture 3).

Notes
All of these bounds can be generalized to hold uniformly for all $\rho \in (0, 1)$, at the cost of an additional $\sqrt{\frac{\log\log_2\frac{2}{\rho}}{m}}$ term and other minor constant-factor changes (Koltchinskii and Panchenko, 2002).
For AdaBoost, the bound applies to the functions
$x \mapsto \frac{f(x)}{\|\alpha\|_1} = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} \in \mathrm{conv}(H).$
Note that $T$ does not appear in the bound.

Margin Distribution
Theorem: for any $\rho > 0$, the following holds:
$\Pr\Big[\frac{y f(x)}{\|\alpha\|_1} \le \rho\Big] \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{\,1-\rho} (1 - \epsilon_t)^{\,1+\rho}}.$
Proof: using the identity $D_{T+1}(i) = \frac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
$\frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) - \rho\|\alpha\|_1 \le 0} \le \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f(x_i) + \rho\|\alpha\|_1} = e^{\rho\|\alpha\|_1}\, \frac{1}{m} \sum_{i=1}^{m} \Big[m \prod_{t=1}^{T} Z_t\Big] D_{T+1}(i) = e^{\rho\|\alpha\|_1} \prod_{t=1}^{T} Z_t = e^{\rho \sum_{t=1}^{T} \alpha_t} \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{\,1-\rho} (1 - \epsilon_t)^{\,1+\rho}},$
using $e^{\rho \alpha_t} = \big[\tfrac{1-\epsilon_t}{\epsilon_t}\big]^{\rho/2}$ for the last step.

Notes
If for all $t \in [1, T]$, $\gamma \le \big(\tfrac{1}{2} - \epsilon_t\big)$, then the upper bound can be bounded by
$\Pr\Big[\frac{y f(x)}{\|\alpha\|_1} \le \rho\Big] \le \big[(1 - 2\gamma)^{1-\rho} (1 + 2\gamma)^{1+\rho}\big]^{T/2}.$
For $\rho < \gamma$, $(1 - 2\gamma)^{1-\rho} (1 + 2\gamma)^{1+\rho} < 1$ and the bound decreases exponentially in $T$.
For the bound to be convergent: $\rho = O(1/\sqrt{m})$, thus $\gamma = O(1/\sqrt{m})$ is roughly the condition on the edge value.

But, Does AdaBoost Maximize the Margin?
No: AdaBoost may converge to a margin that is significantly below the maximum margin (Rudin et al., 2004) (e.g., 1/3 instead of 3/8)!
Lower bound: AdaBoost can achieve asymptotically a margin that is at least $\frac{\rho_{\max}}{2}$ if the data is separable, under some conditions on the base learners (Rätsch and Warmuth, 2002).
Several boosting-type margin-maximization algorithms exist, but their performance in practice is not clear or not reported.

Outliers
AdaBoost assigns larger weights to harder examples.
• Application: detecting mislabeled examples.
• Dealing with noisy data: regularization based on the average weight assigned to a point (soft-margin idea for boosting) (Meir and Rätsch, 2003).

AdaBoost's Weak Learning Condition
Definition: the edge of a base classifier $h_t$ for a distribution $D$ over the training sample is
$\gamma(t) = \frac{1}{2} - \epsilon_t = \frac{1}{2} \sum_{i=1}^{m} y_i h_t(x_i)\, D(i).$
Condition: there exists $\gamma > 0$ such that, for any distribution $D$ over the training sample and any base classifier, $\gamma(t) \ge \gamma$.

Zero-Sum Games
Definition: payoff matrix $M = (M_{ij}) \in \mathbb{R}^{m \times n}$:
• $m$ possible actions (pure strategies) for the row player;
• $n$ possible actions for the column player;
• $M_{ij}$: payoff for the row player (= loss for the column player) when row plays $i$ and column plays $j$.
Example: rock-paper-scissors, with the payoff for the row player (rows and columns ordered rock, paper, scissors):

          rock   paper   scissors
rock        0     -1        +1
paper      +1      0        -1
scissors   -1     +1         0

Mixed Strategies (von Neumann, 1928)
Definition: the row player selects a distribution $p$ over the rows, the column player a distribution $q$ over the columns. The expected payoff for the row player is
$\mathbb{E}_{i \sim p,\, j \sim q}[M_{ij}] = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i M_{ij} q_j = p^\top M q.$
von Neumann's minimax theorem:
$\max_{p} \min_{q}\ p^\top M q = \min_{q} \max_{p}\ p^\top M q.$
• equivalent form:
$\max_{p} \min_{j \in [1, n]}\ p^\top M e_j = \min_{q} \max_{i \in [1, m]}\ e_i^\top M q.$
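As a small numerical illustration of mixed strategies (a sketch, not part of the lecture; it assumes the standard rock-paper-scissors payoff matrix reconstructed above and two arbitrarily chosen strategies):

import numpy as np

# Payoff matrix for the row player, rows/columns ordered rock, paper, scissors.
M = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

p = np.array([1/3, 1/3, 1/3])    # row player's mixed strategy
q = np.array([0.5, 0.25, 0.25])  # column player's mixed strategy

expected_payoff = p @ M @ q      # E_{i~p, j~q}[M_ij] = p^T M q
print(expected_payoff)           # 0.0: uniform play neutralizes any q

Here the value of the game is 0, achieved by the uniform strategy, which reflects the symmetry of the payoff matrix.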
John von Neumann (1903-1957) [portrait]

AdaBoost and Game Theory
Game:
• Player A: selects a point $x_i$, $i \in [1, m]$.
• Player B: selects a base learner $h_t$, $t \in [1, T]$.
• Payoff matrix $M \in \{-1, +1\}^{m \times T}$: $M_{it} = y_i h_t(x_i)$.
von Neumann's theorem: assume $H$ finite. Then
$2\gamma = \max_{D} \min_{h \in H} \sum_{i=1}^{m} D(i)\, y_i h(x_i) = \min_{\alpha} \max_{i \in [1, m]} \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\|\alpha\|_1} = \rho.$

Consequences
• Weak learning condition = non-zero margin:
  • thus, it is possible to search for a non-zero margin;
  • AdaBoost = (suboptimal) search for the corresponding $\alpha$.
• Weak learning = strong condition: the condition implies linear separability with margin $2\gamma > 0$.

Linear Programming Problem
Maximizing the margin:
$\rho = \max_{\alpha} \min_{i \in [1, m]} \frac{y_i (\alpha \cdot x_i)}{\|\alpha\|_1}.$
This is equivalent to the following convex optimization (LP) problem:
max $\rho$
subject to: $y_i (\alpha \cdot x_i) \ge \rho$ $(i \in [1, m])$, $\|\alpha\|_1 = 1$.
Note that $\frac{|\alpha \cdot x|}{\|\alpha\|_1} = d_{\infty}(x, H)$, with $H = \{x : \alpha \cdot x = 0\}$.

Advantages of AdaBoost
• Simple: straightforward implementation.
• Efficient: complexity $O(m N T)$ for stumps: when $N$ and $T$ are not too large, the algorithm is quite fast.
• Theoretical guarantees: but still many questions.
  • AdaBoost not designed to maximize the margin;
  • regularized versions of AdaBoost.

Weaker Aspects
• Parameters:
  • need to determine $T$, the number of rounds of boosting: stopping criterion;
  • need to determine the base learners: risk of overfitting or low margins.
• Noise: severely damages the accuracy of AdaBoost (Dietterich, 2000).

Other Boosting Algorithms
• arc-gv (Breiman, 1996): designed to maximize the margin, but outperformed by AdaBoost in experiments (Reyzin and Schapire, 2006).
• L1-regularized AdaBoost (Rätsch et al., 2001): outperforms AdaBoost in experiments (Cortes et al., 2014).
• DeepBoost (Cortes et al., 2014): more favorable learning guarantees, outperforms both AdaBoost and L1-regularized AdaBoost.

References
• Leo Breiman. Bagging predictors. Machine Learning, 24(2): 123-140, 1996.
• Corinna Cortes, Mehryar Mohri, and Umar Syed. Deep boosting. In ICML, pages 262-270, 2014.
• Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2): 139-158, 2000.
• Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): 119-139, 1997.
• G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, pages 447-454, 2001.
• Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.
• J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100: 295-320, 1928.
• Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: Cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5: 1557-1595, 2004.
• Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Sydney, Australia, pages 334-350, July 2002.
• Lev Reyzin and Robert E. Schapire. How boosting the margin can also boost classifier complexity. In ICML, pages 753-760, 2006.
• Robert E. Schapire. The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
• Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
• Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5): 1651-1686, 1998.

L1 Margin Definitions
Definition: the margin of a point $x$ with label $y$ is the algebraic distance of $x = [h_1(x), \ldots, h_T(x)]^\top$ to the hyperplane $\alpha \cdot x = 0$:
$\rho(x) = \frac{y f(x)}{\|\alpha\|_1} = \frac{y \sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} = y\, \frac{\alpha}{\|\alpha\|_1} \cdot x.$
Definition: the margin of the classifier for a sample $S = (x_1, \ldots, x_m)$ is the minimum margin of the points in that sample:
$\rho = \min_{i \in [1, m]} \frac{y_i\, \alpha \cdot x_i}{\|\alpha\|_1}.$
• Note:
• SVM margin: $\rho = \min_{i \in [1, m]} \frac{y_i\, w \cdot \Phi(x_i)}{\|w\|_2}$.
• Boosting margin: $\rho = \min_{i \in [1, m]} \frac{y_i\, \alpha \cdot H(x_i)}{\|\alpha\|_1}$, with $H(x) = [h_1(x), \ldots, h_T(x)]^\top$.
• Distances: the $\|\cdot\|_q$ distance of a point $x$ to the hyperplane $w \cdot x + b = 0$ is $\frac{|w \cdot x + b|}{\|w\|_p}$, with $\frac{1}{p} + \frac{1}{q} = 1$.
• Maximum-margin solutions with different norms.
[Figure: maximum-margin separating hyperplanes for the norms $\|\cdot\|_2$ and $\|\cdot\|_{\infty}$.]
• AdaBoost gives an approximate solution for the LP problem.
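To make the LP formulation of L1-margin maximization concrete, here is a minimal sketch solving it with SciPy's linprog. It is an assumption-laden illustration, not part of the lecture: it restricts to nonnegative weights $\alpha \ge 0$ (as in AdaBoost), so that $\|\alpha\|_1 = 1$ becomes the linear constraint $\sum_t \alpha_t = 1$, and the function name max_l1_margin is made up for this example.

import numpy as np
from scipy.optimize import linprog

def max_l1_margin(base_preds, y):
    """Maximize rho subject to y_i (alpha . h(x_i)) >= rho, sum(alpha) = 1, alpha >= 0.
    base_preds: (m, T) array with base_preds[i, t] = h_t(x_i)
    y:          (m,)  array of labels in {-1, +1}
    Decision variables are [alpha_1, ..., alpha_T, rho]; linprog minimizes,
    so the objective is -rho."""
    m, T = base_preds.shape
    c = np.zeros(T + 1)
    c[-1] = -1.0                                   # minimize -rho
    # Margin constraints: -y_i (alpha . h(x_i)) + rho <= 0 for each i.
    A_ub = np.hstack([-(y[:, None] * base_preds), np.ones((m, 1))])
    b_ub = np.zeros(m)
    # Normalization: sum_t alpha_t = 1 (equals ||alpha||_1 since alpha >= 0).
    A_eq = np.hstack([np.ones((1, T)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * T + [(None, None)]      # alpha_t >= 0, rho free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:T], res.x[-1]                    # (alpha, maximum margin rho)

Comparing the resulting margin $\rho$ with the normalized margin achieved by AdaBoost's $\alpha$ illustrates the point above that AdaBoost only gives an approximate solution to this LP.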