Foundations of Machine Learning
Boosting
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu

Weak Learning (Kearns and Valiant, 1994)
Definition: a concept class $C$ is weakly PAC-learnable if there exist a (weak) learning algorithm $L$ and $\gamma > 0$ such that, for all $\delta > 0$, for all $c \in C$ and all distributions $D$:
• $\Pr_{S \sim D}\big[R(h_S) \le \tfrac{1}{2} - \gamma\big] \ge 1 - \delta$,
• for samples $S$ of size $m = \mathrm{poly}(1/\delta)$ for a fixed polynomial.

Boosting Ideas
Finding simple, relatively accurate base classifiers is often not hard → weak learner.
Main ideas:
• use the weak learner to create a strong learner.
• combine the base classifiers returned by the weak learner (ensemble method).
But how should the base classifiers be combined?

AdaBoost - Preliminaries (Freund and Schapire, 1997)
Family of base classifiers: $H \subseteq \{-1,+1\}^X$.
Iterative algorithm ($T$ rounds): at each round $t \in [1, T]$, it
• maintains a distribution $D_t$ over the training sample;
• maintains a convex combination hypothesis $f_t = \sum_{s=1}^{t} \alpha_s h_s$, with $\alpha_s \ge 0$ and $h_s \in H$.
Weighted error of hypothesis $h$ at round $t$:
$\Pr_{D_t}[h(x_i) \ne y_i] = \sum_{i=1}^{m} D_t(i)\, 1_{h(x_i) \ne y_i}.$

AdaBoost (Freund and Schapire, 1997)
$H \subseteq \{-1,+1\}^X$.
AdaBoost$(S = ((x_1, y_1), \ldots, (x_m, y_m)))$
 1  for $i \leftarrow 1$ to $m$ do
 2      $D_1(i) \leftarrow \frac{1}{m}$
 3  for $t \leftarrow 1$ to $T$ do
 4      $h_t \leftarrow$ base classifier in $H$ with small error $\epsilon_t = \Pr_{D_t}[h_t(x_i) \ne y_i]$
 5      $\alpha_t \leftarrow \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$
 6      $Z_t \leftarrow 2\,[\epsilon_t (1 - \epsilon_t)]^{1/2}$   (normalization factor)
 7      for $i \leftarrow 1$ to $m$ do
 8          $D_{t+1}(i) \leftarrow \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
 9  $f \leftarrow \sum_{t=1}^{T} \alpha_t h_t$
10  return $h = \mathrm{sgn}(f)$

Notes
Distributions $D_t$ over the training sample:
• originally uniform;
• at each round, the weight of a misclassified example is increased;
• observation: $D_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, since
$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t} = \frac{D_{t-1}(i)\, e^{-\alpha_{t-1} y_i h_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t} = \frac{1}{m}\, \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{\prod_{s=1}^{t} Z_s}.$
Weight assigned to base classifier $h_t$: $\alpha_t$ directly depends on the accuracy of $h_t$ at round $t$.

Illustration
[Figures: the base classifier selected and the reweighted sample at rounds $t = 1$, $t = 2$, $t = 3$, ..., and the final weighted combination $\alpha_1(\cdot) + \alpha_2(\cdot) + \alpha_3(\cdot)$ of the resulting classifiers.]

Bound on Empirical Error (Freund and Schapire, 1997)
Theorem: the empirical error of the classifier output by AdaBoost verifies:
$\widehat{R}(h) \le \exp\Big(-2 \sum_{t=1}^{T} \big(\tfrac{1}{2} - \epsilon_t\big)^2\Big).$
• If further for all $t \in [1, T]$, $\gamma \le \big(\tfrac{1}{2} - \epsilon_t\big)$, then $\widehat{R}(h) \le \exp(-2 \gamma^2 T)$.
• $\gamma$ does not need to be known in advance: adaptive boosting.
• Proof: since, as we saw, $D_{T+1}(i) = \frac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le 0} \le \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i)) = \frac{1}{m} \sum_{i=1}^{m} \Big[m \prod_{t=1}^{T} Z_t\Big] D_{T+1}(i) = \prod_{t=1}^{T} Z_t.$
• Now, since $Z_t$ is a normalization factor,
$Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = \sum_{i:\, y_i h_t(x_i) \ge 0} D_t(i)\, e^{-\alpha_t} + \sum_{i:\, y_i h_t(x_i) < 0} D_t(i)\, e^{\alpha_t} = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} = (1 - \epsilon_t) \sqrt{\tfrac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\tfrac{1 - \epsilon_t}{\epsilon_t}} = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.$
• Thus,
$\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2 \sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_{t=1}^{T} \sqrt{1 - 4 \big(\tfrac{1}{2} - \epsilon_t\big)^2} \le \prod_{t=1}^{T} \exp\big(-2 \big(\tfrac{1}{2} - \epsilon_t\big)^2\big) = \exp\Big(-2 \sum_{t=1}^{T} \big(\tfrac{1}{2} - \epsilon_t\big)^2\Big).$
• Notes:
• $\alpha_t$ is the minimizer of $\alpha \mapsto (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}$;
• since $(1 - \epsilon_t)\, e^{-\alpha_t} = \epsilon_t\, e^{\alpha_t}$, at each round, AdaBoost assigns the same probability mass to correctly classified and misclassified instances;
• for base classifiers $h_t \colon X \to [-1, +1]$, $\alpha_t$ can be similarly chosen to minimize $Z_t$.
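To make the pseudocode and the weight-update identity above concrete, here is a minimal Python (NumPy) sketch of AdaBoost with decision stumps as base classifiers. It is an illustrative sketch, not the lecture's implementation: the helper names (adaboost, best_stump, stump_predict) and the brute-force threshold search are assumptions of this example.

import numpy as np

def stump_predict(stump, X):
    """Prediction of a decision stump (feature index j, threshold thr, sign)."""
    j, thr, sign = stump
    return sign * np.where(X[:, j] <= thr, 1.0, -1.0)

def best_stump(X, y, D):
    """Exhaustive search for the stump with smallest weighted error under D."""
    m, N = X.shape
    best, best_err = None, np.inf
    for j in range(N):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = sign * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = D[pred != y].sum()          # eps_t = Pr_{D_t}[h_t(x_i) != y_i]
                if err < best_err:
                    best, best_err = (j, thr, sign), err
    return best, best_err

def adaboost(X, y, T):
    """Returns a list of (alpha_t, stump_t) pairs; y has entries in {-1, +1}."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                       # D_1: uniform distribution
    ensemble = []
    for _ in range(T):
        stump, eps = best_stump(X, y, D)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)      # guard against eps in {0, 1}
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = (1/2) log((1-eps_t)/eps_t)
        D = D * np.exp(-alpha * y * stump_predict(stump, X))
        D = D / D.sum()                           # division by Z_t
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    """h = sgn(f) with f = sum_t alpha_t h_t."""
    f = sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)
    return np.sign(f)

The brute-force stump search rescans all thresholds at every round; the pre-sorting trick discussed later in the lecture reduces the overall cost to $O((m \log m) N + m N T)$.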
AdaBoost = Coordinate Descent
Objective function: convex and differentiable,
$F(\alpha) = \sum_{i=1}^{m} e^{-y_i f(x_i)} = \sum_{i=1}^{m} e^{-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}.$
[Figure: the exponential loss $x \mapsto e^{-x}$ upper-bounding the zero-one loss.]
• Direction: unit vector $e_t$ with
$e_t = \operatorname{argmin}_t \frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta}\Big|_{\eta=0}.$
Since $F(\alpha_{t-1} + \eta\, e_t) = \sum_{i=1}^{m} e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)}\, e^{-\eta y_i h_t(x_i)}$,
$\frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta}\Big|_{\eta=0} = -\sum_{i=1}^{m} y_i h_t(x_i)\, e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)} = -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, \Big[m \prod_{s=1}^{t-1} Z_s\Big] = -\big[(1 - \epsilon_t) - \epsilon_t\big] \Big[m \prod_{s=1}^{t-1} Z_s\Big] = \big[2 \epsilon_t - 1\big] \Big[m \prod_{s=1}^{t-1} Z_s\Big].$
Thus, the direction corresponds to the base classifier with smallest error.
• Step size: obtained via $\frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta} = 0$:
$-\sum_{i=1}^{m} y_i h_t(x_i)\, e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)}\, e^{-\eta y_i h_t(x_i)} = 0$
$\iff -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, \Big[m \prod_{s=1}^{t-1} Z_s\Big]\, e^{-\eta y_i h_t(x_i)} = 0$
$\iff -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, e^{-\eta y_i h_t(x_i)} = 0$
$\iff -\big[(1 - \epsilon_t)\, e^{-\eta} - \epsilon_t\, e^{\eta}\big] = 0$
$\iff \eta = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}.$
Thus, the step size matches the base-classifier weight of AdaBoost.

Alternative Loss Functions
[Figure: plots of the following losses as functions of the margin $x$.]
• boosting (exponential) loss: $x \mapsto e^{-x}$
• square loss: $x \mapsto (1 - x)^2\, 1_{x \le 1}$
• logistic loss: $x \mapsto \log_2(1 + e^{-x})$
• hinge loss: $x \mapsto \max(1 - x, 0)$
• zero-one loss: $x \mapsto 1_{x < 0}$

Standard Use in Practice
Base learners: decision trees, quite often just decision stumps (trees of depth one).
Boosting stumps:
• data in $\mathbb{R}^N$, e.g., $N = 2$, $(\mathrm{height}(x), \mathrm{weight}(x))$;
• associate a stump to each component;
• pre-sort each component: $O(N m \log m)$;
• at each round, find the best component and threshold;
• total complexity: $O((m \log m) N + m N T)$;
• stumps are not always weak learners: think of the XOR example!

Overfitting?
Assume that $\mathrm{VCdim}(H) = d$ and, for a fixed $T$, define
$\mathcal{F}_T = \Big\{ \mathrm{sgn}\Big(\sum_{t=1}^{T} \alpha_t h_t + b\Big) : \alpha_t, b \in \mathbb{R},\ h_t \in H \Big\}.$
$\mathcal{F}_T$ can form a very rich family of classifiers. It can be shown (Freund and Schapire, 1997) that:
$\mathrm{VCdim}(\mathcal{F}_T) \le 2 (d + 1)(T + 1) \log_2\big((T + 1) e\big).$
This suggests that AdaBoost could overfit for large values of $T$, and that is in fact observed in some cases, but in various others it is not!

Empirical Observations
Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore, the test error keeps decreasing even after the training error has reached zero.
[Figure: error curves and margin distribution graph for boosting C4.5 decision trees on the letter dataset, as reported by Schapire et al. (1998); left: training and test error (lower and upper curves, respectively) of the combined classifier versus the number of rounds; right: cumulative margin distributions.]

Rademacher Complexity of Convex Hulls
Theorem: let $H$ be a set of functions mapping from $X$ to $\mathbb{R}$, and let the convex hull of $H$ be defined as
$\mathrm{conv}(H) = \Big\{ \sum_{k=1}^{p} \mu_k h_k : p \ge 1,\ \mu_k \ge 0,\ \sum_{k=1}^{p} \mu_k \le 1,\ h_k \in H \Big\}.$
Then, for any sample $S$, $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.
Proof:
$\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{h_k \in H,\, \mu \ge 0,\, \|\mu\|_1 \le 1} \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i)\Big]$
$= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{h_k \in H}\ \sup_{\mu \ge 0,\, \|\mu\|_1 \le 1} \sum_{k=1}^{p} \mu_k \sum_{i=1}^{m} \sigma_i h_k(x_i)\Big]$
$= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{h_k \in H}\ \max_{k \in [1, p]} \sum_{i=1}^{m} \sigma_i h_k(x_i)\Big]$
$= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i)\Big] = \widehat{\mathfrak{R}}_S(H).$

Margin Bound - Ensemble Methods (Koltchinskii and Panchenko, 2002)
Corollary: let $H$ be a set of real-valued functions. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:
$R(h) \le \widehat{R}_{\rho}(h) + \frac{2}{\rho}\, \mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$
$R(h) \le \widehat{R}_{\rho}(h) + \frac{2}{\rho}\, \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$
Proof: direct consequence of the margin bound of Lecture 4 and $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.
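The empirical margin loss $\widehat{R}_{\rho}(h)$ appearing in these bounds is straightforward to compute from the base-classifier predictions. Here is a small hedged sketch (NumPy assumed; the function names and the matrix layout are illustrative, not from the lecture):

import numpy as np

def normalized_margins(base_preds, alphas, y):
    """Normalized margins y_i f(x_i) / ||alpha||_1.
    base_preds: (m, T) array with base_preds[i, t] = h_t(x_i) in {-1, +1}
    alphas:     (T,)  array of base-classifier weights
    y:          (m,)  array of labels in {-1, +1}"""
    f = base_preds @ alphas
    return y * f / np.abs(alphas).sum()

def empirical_margin_loss(base_preds, alphas, y, rho):
    """Fraction of sample points with normalized margin at most rho,
    i.e. the empirical margin loss term of the ensemble margin bound."""
    return float(np.mean(normalized_margins(base_preds, alphas, y) <= rho))

Plotting empirical_margin_loss as a function of $\rho$ gives exactly the kind of cumulative margin distribution curve shown in the empirical observations above.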
Margin Bound - Ensemble Methods (Koltchinskii and Panchenko, 2002); see also (Schapire et al., 1998)
Corollary: let $H$ be a family of functions taking values in $\{-1, +1\}$ with VC dimension $d$. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:
$R(h) \le \widehat{R}_{\rho}(h) + \frac{2}{\rho}\sqrt{\frac{2 d \log\frac{e m}{d}}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$
Proof: follows directly from the previous corollary and the VC-dimension bound on Rademacher complexity (see Lecture 3).

Notes
All of these bounds can be generalized to hold uniformly for all $\rho \in (0, 1)$, at the cost of an additional $\sqrt{\frac{\log\log_2\frac{2}{\rho}}{m}}$ term and other minor constant-factor changes (Koltchinskii and Panchenko, 2002).
For AdaBoost, the bound applies to the functions
$x \mapsto \frac{f(x)}{\|\alpha\|_1} = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} \in \mathrm{conv}(H).$
Note that $T$ does not appear in the bound.

Margin Distribution
Theorem: for any $\rho > 0$, the following holds:
$\Pr\Big[\frac{y f(x)}{\|\alpha\|_1} \le \rho\Big] \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{\,1-\rho} (1 - \epsilon_t)^{\,1+\rho}}.$
Proof: using the identity $D_{T+1}(i) = \frac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
$\frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) - \rho\|\alpha\|_1 \le 0} \le \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f(x_i) + \rho\|\alpha\|_1} = e^{\rho\|\alpha\|_1}\, \frac{1}{m} \sum_{i=1}^{m} \Big[m \prod_{t=1}^{T} Z_t\Big] D_{T+1}(i) = e^{\rho\|\alpha\|_1} \prod_{t=1}^{T} Z_t = e^{\rho \sum_{t=1}^{T} \alpha_t} \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{\,1-\rho} (1 - \epsilon_t)^{\,1+\rho}},$
using $e^{\rho \alpha_t} = \big[\tfrac{1-\epsilon_t}{\epsilon_t}\big]^{\rho/2}$ for the last step.

Notes
If for all $t \in [1, T]$, $\gamma \le \big(\tfrac{1}{2} - \epsilon_t\big)$, then the upper bound can be bounded by
$\Pr\Big[\frac{y f(x)}{\|\alpha\|_1} \le \rho\Big] \le \big[(1 - 2\gamma)^{1-\rho} (1 + 2\gamma)^{1+\rho}\big]^{T/2}.$
For $\rho < \gamma$, $(1 - 2\gamma)^{1-\rho} (1 + 2\gamma)^{1+\rho} < 1$ and the bound decreases exponentially in $T$.
For the bound to be convergent: $\rho = O(1/\sqrt{m})$, thus $\gamma = O(1/\sqrt{m})$ is roughly the condition on the edge value.

But, Does AdaBoost Maximize the Margin?
No: AdaBoost may converge to a margin that is significantly below the maximum margin (Rudin et al., 2004) (e.g., 1/3 instead of 3/8)!
Lower bound: AdaBoost can achieve asymptotically a margin that is at least $\frac{\rho_{\max}}{2}$ if the data is separable, under some conditions on the base learners (Rätsch and Warmuth, 2002).
Several boosting-type margin-maximization algorithms exist, but their performance in practice is not clear or not reported.

Outliers
AdaBoost assigns larger weights to harder examples.
• Application: detecting mislabeled examples.
• Dealing with noisy data: regularization based on the average weight assigned to a point (soft-margin idea for boosting) (Meir and Rätsch, 2003).

AdaBoost's Weak Learning Condition
Definition: the edge of a base classifier $h_t$ for a distribution $D$ over the training sample is
$\gamma(t) = \frac{1}{2} - \epsilon_t = \frac{1}{2} \sum_{i=1}^{m} y_i h_t(x_i)\, D(i).$
Condition: there exists $\gamma > 0$ such that, for any distribution $D$ over the training sample and any base classifier, $\gamma(t) \ge \gamma$.

Zero-Sum Games
Definition: payoff matrix $M = (M_{ij}) \in \mathbb{R}^{m \times n}$:
• $m$ possible actions (pure strategies) for the row player;
• $n$ possible actions for the column player;
• $M_{ij}$: payoff for the row player (= loss for the column player) when row plays $i$ and column plays $j$.
Example: rock-paper-scissors, with the payoff for the row player (rows and columns ordered rock, paper, scissors):

          rock   paper   scissors
rock        0     -1        +1
paper      +1      0        -1
scissors   -1     +1         0

Mixed Strategies (von Neumann, 1928)
Definition: the row player selects a distribution $p$ over the rows, the column player a distribution $q$ over the columns. The expected payoff for the row player is
$\mathbb{E}_{i \sim p,\, j \sim q}[M_{ij}] = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i M_{ij} q_j = p^\top M q.$
von Neumann's minimax theorem:
$\max_{p} \min_{q}\ p^\top M q = \min_{q} \max_{p}\ p^\top M q.$
• equivalent form:
$\max_{p} \min_{j \in [1, n]}\ p^\top M e_j = \min_{q} \max_{i \in [1, m]}\ e_i^\top M q.$
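As a small numerical illustration of mixed strategies (a sketch, not part of the lecture; it assumes the standard rock-paper-scissors payoff matrix reconstructed above and two arbitrarily chosen strategies):

import numpy as np

# Payoff matrix for the row player, rows/columns ordered rock, paper, scissors.
M = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

p = np.array([1/3, 1/3, 1/3])    # row player's mixed strategy
q = np.array([0.5, 0.25, 0.25])  # column player's mixed strategy

expected_payoff = p @ M @ q      # E_{i~p, j~q}[M_ij] = p^T M q
print(expected_payoff)           # 0.0: uniform play neutralizes any q

Here the value of the game is 0, achieved by the uniform strategy, which reflects the symmetry of the payoff matrix.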
John von Neumann (1903-1957) [portrait]

AdaBoost and Game Theory
Game:
• Player A: selects a point $x_i$, $i \in [1, m]$.
• Player B: selects a base learner $h_t$, $t \in [1, T]$.
• Payoff matrix $M \in \{-1, +1\}^{m \times T}$: $M_{it} = y_i h_t(x_i)$.
von Neumann's theorem: assume $H$ finite. Then
$2\gamma = \max_{D} \min_{h \in H} \sum_{i=1}^{m} D(i)\, y_i h(x_i) = \min_{\alpha} \max_{i \in [1, m]} \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\|\alpha\|_1} = \rho.$

Consequences
• Weak learning condition = non-zero margin:
  • thus, it is possible to search for a non-zero margin;
  • AdaBoost = (suboptimal) search for the corresponding $\alpha$.
• Weak learning = strong condition: the condition implies linear separability with margin $2\gamma > 0$.

Linear Programming Problem
Maximizing the margin:
$\rho = \max_{\alpha} \min_{i \in [1, m]} \frac{y_i (\alpha \cdot x_i)}{\|\alpha\|_1}.$
This is equivalent to the following convex optimization (LP) problem:
max $\rho$
subject to: $y_i (\alpha \cdot x_i) \ge \rho$ $(i \in [1, m])$, $\|\alpha\|_1 = 1$.
Note that $\frac{|\alpha \cdot x|}{\|\alpha\|_1} = d_{\infty}(x, H)$, with $H = \{x : \alpha \cdot x = 0\}$.

Advantages of AdaBoost
• Simple: straightforward implementation.
• Efficient: complexity $O(m N T)$ for stumps: when $N$ and $T$ are not too large, the algorithm is quite fast.
• Theoretical guarantees: but still many questions.
  • AdaBoost not designed to maximize the margin;
  • regularized versions of AdaBoost.

Weaker Aspects
• Parameters:
  • need to determine $T$, the number of rounds of boosting: stopping criterion;
  • need to determine the base learners: risk of overfitting or low margins.
• Noise: severely damages the accuracy of AdaBoost (Dietterich, 2000).

Other Boosting Algorithms
• arc-gv (Breiman, 1996): designed to maximize the margin, but outperformed by AdaBoost in experiments (Reyzin and Schapire, 2006).
• L1-regularized AdaBoost (Rätsch et al., 2001): outperforms AdaBoost in experiments (Cortes et al., 2014).
• DeepBoost (Cortes et al., 2014): more favorable learning guarantees, outperforms both AdaBoost and L1-regularized AdaBoost.

References
• Leo Breiman. Bagging predictors. Machine Learning, 24(2): 123-140, 1996.
• Corinna Cortes, Mehryar Mohri, and Umar Syed. Deep boosting. In ICML, pages 262-270, 2014.
• Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2): 139-158, 2000.
• Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): 119-139, 1997.
• G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, pages 447-454, 2001.
• Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.
• J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100: 295-320, 1928.
• Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: Cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5: 1557-1595, 2004.
• Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Sydney, Australia, pages 334-350, July 2002.
• Lev Reyzin and Robert E. Schapire. How boosting the margin can also boost classifier complexity. In ICML, pages 753-760, 2006.
• Robert E. Schapire. The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
• Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
• Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5): 1651-1686, 1998.

L1 Margin Definitions
Definition: the margin of a point $x$ with label $y$ is the algebraic distance of $x = [h_1(x), \ldots, h_T(x)]^\top$ to the hyperplane $\alpha \cdot x = 0$:
$\rho(x) = \frac{y f(x)}{\|\alpha\|_1} = \frac{y \sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} = y\, \frac{\alpha}{\|\alpha\|_1} \cdot x.$
Definition: the margin of the classifier for a sample $S = (x_1, \ldots, x_m)$ is the minimum margin of the points in that sample:
$\rho = \min_{i \in [1, m]} \frac{y_i\, \alpha \cdot x_i}{\|\alpha\|_1}.$
• Note:
• SVM margin: $\rho = \min_{i \in [1, m]} \frac{y_i\, w \cdot \Phi(x_i)}{\|w\|_2}$.
• Boosting margin: $\rho = \min_{i \in [1, m]} \frac{y_i\, \alpha \cdot H(x_i)}{\|\alpha\|_1}$, with $H(x) = [h_1(x), \ldots, h_T(x)]^\top$.
• Distances: the $\|\cdot\|_q$ distance of a point $x$ to the hyperplane $w \cdot x + b = 0$ is $\frac{|w \cdot x + b|}{\|w\|_p}$, with $\frac{1}{p} + \frac{1}{q} = 1$.
• Maximum-margin solutions with different norms.
[Figure: maximum-margin separating hyperplanes for the norms $\|\cdot\|_2$ and $\|\cdot\|_{\infty}$.]
• AdaBoost gives an approximate solution for the LP problem.
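To make the LP formulation of L1-margin maximization concrete, here is a minimal sketch solving it with SciPy's linprog. It is an assumption-laden illustration, not part of the lecture: it restricts to nonnegative weights $\alpha \ge 0$ (as in AdaBoost), so that $\|\alpha\|_1 = 1$ becomes the linear constraint $\sum_t \alpha_t = 1$, and the function name max_l1_margin is made up for this example.

import numpy as np
from scipy.optimize import linprog

def max_l1_margin(base_preds, y):
    """Maximize rho subject to y_i (alpha . h(x_i)) >= rho, sum(alpha) = 1, alpha >= 0.
    base_preds: (m, T) array with base_preds[i, t] = h_t(x_i)
    y:          (m,)  array of labels in {-1, +1}
    Decision variables are [alpha_1, ..., alpha_T, rho]; linprog minimizes,
    so the objective is -rho."""
    m, T = base_preds.shape
    c = np.zeros(T + 1)
    c[-1] = -1.0                                   # minimize -rho
    # Margin constraints: -y_i (alpha . h(x_i)) + rho <= 0 for each i.
    A_ub = np.hstack([-(y[:, None] * base_preds), np.ones((m, 1))])
    b_ub = np.zeros(m)
    # Normalization: sum_t alpha_t = 1 (equals ||alpha||_1 since alpha >= 0).
    A_eq = np.hstack([np.ones((1, T)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * T + [(None, None)]      # alpha_t >= 0, rho free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:T], res.x[-1]                    # (alpha, maximum margin rho)

Comparing the resulting margin $\rho$ with the normalized margin achieved by AdaBoost's $\alpha$ illustrates the point above that AdaBoost only gives an approximate solution to this LP.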