AdaBoost (Part of Lecture 6)
Jason Corso, SUNY at Buffalo, Mar. 17, 2009

Introduction

In the first part of this lecture, we looked at:
1. the lack of inherent superiority of any one particular classifier;
2. some systematic ways for selecting a particular method over another for a given scenario;
3. and, briefly, the bagging method for integrating component classifiers.

Now we turn to boosting and the AdaBoost method for integrating component classifiers into one strong classifier.

Rationale

Imagine the situation where you want to build an email filter that can distinguish spam from non-spam. The general way we would approach this problem in ML/PR follows the same scheme we have used for the other topics:
1. Gather as many examples as possible of both spam and non-spam emails.
2. Train a classifier using these examples and their labels.
3. Take the learned classifier, or prediction rule, and use it to filter your mail.
4. The goal is to train a classifier that makes the most accurate predictions possible on new test examples. (We have covered related topics on how to measure this, such as bias and variance.)

But building a highly accurate classifier is a difficult task. (You still get spam, right?!)

We could probably come up with many quick rules of thumb, even if each is only moderately accurate. Can you think of an example for this situation?
An example could be “if the subject line contains ‘buy now’ then classify as spam.” This certainly does not cover all spam, but it will be significantly better than random guessing.

Basic Idea of Boosting

Boosting refers to a general and provably effective method of producing a very accurate classifier by combining rough and moderately inaccurate rules of thumb. It is based on the observation that finding many rough rules of thumb can be a lot easier than finding a single, highly accurate classifier.

To begin, we define an algorithm for finding the rules of thumb, which we call a weak learner. The boosting algorithm repeatedly calls this weak learner, each time feeding it a different distribution over the training data (in AdaBoost). Each call generates a weak classifier, and we must combine all of these into a single classifier that, hopefully, is much more accurate than any one of the rules.

Key Questions in Defining and Analyzing Boosting
1. How should the distribution be chosen each round?
2. How should the weak rules be combined into a single rule?
3. How should the weak learner be defined?
4. How many weak classifiers should we learn?
Basic AdaBoost: Getting Started

We are given a training set

    D = {(x_i, y_i) : x_i ∈ R^d, y_i ∈ {−1, +1}, i = 1, . . . , m}.    (1)

For example, x_i could represent some encoding of an email message (say in the vector-space text model), and y_i is whether or not this message is spam.

Note that we are working in a two-class setting, and this will be the case for the majority of our discussion. Some extensions to multi-class scenarios will be presented.

We also need to define a distribution D over the training examples, with weights D(i) such that Σ_i D(i) = 1.

Weak Learners and Weak Classifiers

First, we concretely define a weak classifier:

    h_t : R^d → {−1, +1}.    (2)

A weak classifier must work better than chance. In the two-class setting this means it has less than 50% error, which is easy to arrange: if a classifier had more than 50% error, we could simply flip the sign of its output. So the only classifiers we exclude are those with exactly 50% error (since such classifiers would add no information).

The error rate of a weak classifier h_t(x) is calculated empirically over the training data:

    ε(h_t) = (1/m) Σ_{i=1}^m δ(h_t(x_i) ≠ y_i) < 1/2.    (3)
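To make the definitions concrete, here is a minimal NumPy sketch of the empirical error rate in (3) and the weighted variant used later in the lecture; the function and argument names are illustrative, not from the lecture.

```python
import numpy as np

def error_rate(h, X, y, weights=None):
    """Empirical (optionally weighted) error of a weak classifier.

    h       : callable mapping an (m, d) array of samples to predictions in {-1, +1}
    X       : (m, d) array of samples
    y       : (m,) array of labels in {-1, +1}
    weights : optional (m,) distribution D over the examples, summing to 1;
              if None, the plain empirical error of eq. (3) is returned.
    """
    mistakes = (h(X) != y).astype(float)
    if weights is None:
        return mistakes.mean()
    return float(np.dot(weights, mistakes))
```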
A Weak Learner / Weak Classifier Example for Images

Consider the case where our input data x_i are rectangular image patches. (Consider this case closely, as it is part of Project 1.)

Define a collection of Haar-like rectangle features, computed relative to an enclosing detection window: the sum of the pixels lying within the white rectangles is subtracted from the sum of the pixels in the grey rectangles, using two-, three-, and four-rectangle configurations. Since each round of boosting selects a single new weak classifier built on one such feature, each round can also be viewed as performing feature selection over this collection.

[Slide figures, from Viola and Jones: the boosting-based feature-selection algorithm; example rectangle features shown relative to the enclosing detection window; and the first two selected features overlaid on a typical training face, where the first measures the difference in intensity between the eye region and the upper cheeks, and the second compares the intensity of the eye regions to that across the bridge of the nose.]
The feature value extracted is the difference of the pixel sums in the white sub-regions and the grey sub-regions. With a base patch size of 24x24, there are over 180,000 possible such rectangle features.

Although these features are somewhat primitive in comparison to things like steerable filters, SIFT keys, etc., they do provide a rich set on which boosting can learn. And they are quite efficiently computed when using the integral image representation.

Define the integral image as the image whose value at a particular pixel (x, y) is the sum of the pixel values to the left of and above (x, y) in the original image:

    ii(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′),    (4)

where ii is the integral image and i is the original image. Using the following pair of recurrences, the integral image can be computed in just one pass over the image:

    s(x, y) = s(x, y − 1) + i(x, y),    (5)
    ii(x, y) = ii(x − 1, y) + s(x, y),    (6)

where s(x, y) is the cumulative row sum and we define s(x, −1) = 0 and ii(−1, y) = 0.

The sum of the pixels within any rectangle can then be computed with just four references into the integral image. Arrange four rectangles A, B, C, D in a 2x2 block with D at the bottom right, and let point 1 be the corner shared by all four (the bottom-right corner of A), point 2 the bottom-right corner of B, point 3 the bottom-right corner of C, and point 4 the bottom-right corner of D. The integral image value at point 1 is the sum of the pixels in rectangle A; point 2 gives A + B; point 3 gives A + C; and point 4 gives A + B + C + D. So the sum within D alone is 4 + 1 − 2 − 3.
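Here is a minimal NumPy sketch of equations (4)-(6) and the four-reference rectangle sum; the function names and the (row, column) indexing convention are my own choices, not from the lecture.

```python
import numpy as np

def integral_image(img):
    """Compute ii via the one-pass recurrences (5)-(6)."""
    H, W = img.shape
    ii = np.zeros((H, W), dtype=np.int64)
    s = np.zeros((H, W), dtype=np.int64)   # cumulative row sum s(x, y)
    for y in range(H):
        for x in range(W):
            s[y, x] = (s[y - 1, x] if y > 0 else 0) + img[y, x]
            ii[y, x] = (ii[y, x - 1] if x > 0 else 0) + s[y, x]
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels inside a rectangle using four references (the 4 + 1 - 2 - 3 rule)."""
    b, r = top + height - 1, left + width - 1
    total = ii[b, r]
    if top > 0:
        total -= ii[top - 1, r]
    if left > 0:
        total -= ii[b, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return int(total)
```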
We now have a large collection of features, and we certainly cannot use them all. So we let the boosting procedure select the best ones. But before we can do this, we need to pair these features with a simple weak learner.

Each round, the weak learner is designed to select the single rectangle feature which best separates the positive and negative examples. For that feature, the weak learner searches for the optimal threshold classification function, i.e., the threshold such that the minimum number of examples are misclassified. The weak classifier h_t(x) hence consists of a feature f_t(x), a threshold θ_t, and a parity p_t indicating the direction of the inequality sign:

    h_t(x) = +1 if p_t f_t(x) < p_t θ_t,  and −1 otherwise.    (7)
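As a sketch of how the threshold and parity in (7) might be fit for one feature under the current weight distribution, here is a brute-force stump trainer; the names are illustrative, and a more efficient implementation would sort the feature values once and sweep the threshold.

```python
import numpy as np

def train_stump(feature_vals, y, D):
    """Fit the threshold/parity weak classifier of eq. (7) for a single feature.

    feature_vals : (m,) values f(x_i) of one feature on the training set
    y            : (m,) labels in {-1, +1}
    D            : (m,) current weight distribution over the examples
    Returns (theta, parity, weighted_error).
    """
    best = (None, None, np.inf)
    vals = np.unique(feature_vals)
    # candidate thresholds: below, between, and above the observed values
    candidates = np.concatenate(([vals[0] - 1.0],
                                 (vals[:-1] + vals[1:]) / 2.0,
                                 [vals[-1] + 1.0]))
    for theta in candidates:
        for parity in (+1, -1):
            pred = np.where(parity * feature_vals < parity * theta, 1, -1)
            err = float(np.dot(D, (pred != y).astype(float)))
            if err < best[2]:
                best = (theta, parity, err)
    return best
```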
The Strong AdaBoost Classifier

Let us assume we have selected T weak classifiers and a scalar constant α_t associated with each:

    h = {h_t : t = 1, . . . , T},    (8)
    α = {α_t : t = 1, . . . , T}.    (9)

Denote the weighted vote over all weak classifiers, written as an inner product, by F:

    F(x) = Σ_{t=1}^T α_t h_t(x) = ⟨α, h(x)⟩.    (10)

Define the strong classifier as the sign of this inner product:

    H(x) = sign[F(x)] = sign[ Σ_{t=1}^T α_t h_t(x) ].    (11)

Our objective is to choose h and α to minimize the empirical classification error of the strong classifier:

    (h, α)* = argmin Err(H; D)    (12)
            = argmin (1/m) Σ_{i=1}^m δ(H(x_i) ≠ y_i).    (13)

AdaBoost does not minimize this error directly, but rather minimizes an upper bound on it.

[Slide figure: schematic of the AdaBoost classifier — the weak learner is applied to the input repeatedly, and the resulting weak classifiers are combined in a weighted sum.]

The Basic AdaBoost Algorithm

Given D = {(x_1, y_1), . . . , (x_m, y_m)} as before, initialize the distribution D_1 to be uniform: D_1(i) = 1/m. Then repeat for t = 1, . . . , T:

1. Learn a weak classifier h_t using distribution D_t. For the image example above, this requires learning the threshold and the parity of a weak classifier over each feature at each iteration, given the current distribution D_t. (Note: there are other ways of doing this step.)
   a. Compute the weighted error of each candidate weak classifier:

          ε_t(h) = Σ_{i=1}^m D_t(i) δ(h(x_i) ≠ y_i),  ∀h.    (14)

   b. Select the weak classifier with minimum weighted error:

          h_t = argmin_h ε_t(h).    (15)

2. Set the weight α_t based on the error:

       α_t = (1/2) ln[ (1 − ε_t(h_t)) / ε_t(h_t) ].    (16)

3. Update the distribution based on the performance so far:

       D_{t+1}(i) = (1/Z_t) D_t(i) exp[−α_t y_i h_t(x_i)],    (17)

   where Z_t is a normalization factor that keeps D_{t+1} a distribution.

Note the careful evaluation of the term inside the exponential based on the possible {−1, +1} values of the label: the exponent −α_t y_i h_t(x_i) is negative when h_t classifies x_i correctly and positive when it does not.
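Putting the steps together, here is a compact sketch of the training loop and the strong classifier of eq. (11); `candidates` stands in for whatever pool the weak learner searches (e.g., the thresholded features above), and the names are illustrative rather than the lecture's reference code.

```python
import numpy as np

def adaboost(X, y, candidates, T):
    """Basic AdaBoost training loop (steps 1-3 above).

    X          : (m, d) array of training samples
    y          : (m,) array of labels in {-1, +1}
    candidates : list of callables h(X) -> predictions in {-1, +1}
    T          : number of rounds
    Returns (alphas, chosen) defining F(x) = sum_t alpha_t * h_t(x).
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m
    alphas, chosen = [], []
    for _ in range(T):
        # step 1: weighted error of every candidate, keep the minimizer (eqs. 14-15)
        errors = [float(np.dot(D, (h(X) != y))) for h in candidates]
        best = int(np.argmin(errors))
        h_t, eps = candidates[best], errors[best]
        if eps >= 0.5:                             # nothing better than chance: stop early
            break
        eps = max(eps, 1e-12)                      # guard against log(0) when eps == 0
        alpha = 0.5 * np.log((1.0 - eps) / eps)    # step 2, eq. (16)
        D = D * np.exp(-alpha * y * h_t(X))        # step 3, eq. (17) ...
        D = D / D.sum()                            # ... with Z_t as the normalizer
        alphas.append(alpha)
        chosen.append(h_t)
    return alphas, chosen

def strong_classify(X, alphas, chosen):
    """H(x) = sign(sum_t alpha_t h_t(x)), eq. (11)."""
    F = sum(a * h(X) for a, h in zip(alphas, chosen))
    return np.sign(F)
```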
One chooses T based on some established error criterion, or simply fixes it in advance.

A Toy Example (From Schapire's Slides)

The training data are a handful of labeled points in the plane, initially weighted uniformly (D_1); the weak classifiers are vertical or horizontal half-planes.

Round 1: the best half-plane h_1 has weighted error ε_1 = 0.30, giving α_1 = 0.42; the updated distribution D_2 puts more weight on the points h_1 misclassified.
Round 2: h_2 has ε_2 = 0.21 under D_2, giving α_2 = 0.65, and the weights are updated again to form D_3.
Round 3: h_3 has ε_3 = 0.14 under D_3, giving α_3 = 0.92.

The final classifier is the sign of the weighted vote of the three half-planes,

    H(x) = sign[ 0.42 h_1(x) + 0.65 h_2(x) + 0.92 h_3(x) ],

and on this toy set the combined vote separates the two classes even though each individual half-plane makes mistakes.

[Slide figures: the training points and their weights at each round, the three selected half-planes, and the decision regions of the combined classifier.]
AdaBoost Analysis

Contents of the analysis:
  Facts about the weights and the normalizing functions.
  AdaBoost convergence (why it converges, and how fast).
  Why do we set the weight of each weak classifier to α_t = (1/2) ln[(1 − ε_t(h_t)) / ε_t(h_t)]?
  Why do we choose the weak classifier that has the minimum weighted error?
  Test error analysis.

Facts About the Weights

Weak classifier weights. The selected weight for each new weak classifier is always positive:

    ε_t(h_t) < 1/2  ⇒  α_t = (1/2) ln[ (1 − ε_t(h_t)) / ε_t(h_t) ] > 0.    (18)

The smaller the classification error, the bigger the weight, and the more this particular weak classifier will impact the final strong classifier:

    ε(h_A) < ε(h_B)  ⇒  α_A > α_B.    (19)

Data sample weights. The weights of the data points are multiplied by exp[−y_i α_t h_t(x_i)]:

    exp[−y_i α_t h_t(x_i)] = exp[−α_t] < 1  if h_t(x_i) = y_i,
                             exp[α_t] > 1   if h_t(x_i) ≠ y_i.    (20)

The weights of correctly classified points are reduced and the weights of incorrectly classified points are increased. Hence, the incorrectly classified points will receive more attention in the next round.

The weight distribution can be computed recursively:

    D_{t+1}(i) = (1/Z_t) D_t(i) exp[−α_t y_i h_t(x_i)]    (21)
               = (1/(Z_{t−1} Z_t)) D_{t−1}(i) exp[−y_i (α_t h_t(x_i) + α_{t−1} h_{t−1}(x_i))]
               = . . .
               = (1/(Z_1 · · · Z_t)) D_1(i) exp[−y_i (α_t h_t(x_i) + · · · + α_1 h_1(x_i))].

Facts About the Normalizing Functions

At each iteration, the weights on the data points are normalized by

    Z_t = Σ_i D_t(x_i) exp[−y_i α_t h_t(x_i)]    (22)
        = Σ_{x_i ∈ A} D_t(x_i) exp[−α_t] + Σ_{x_i ∈ Ā} D_t(x_i) exp[α_t],    (23)

where A is the set of correctly classified points, {x_i : y_i = h_t(x_i)}, and Ā is its complement. We can therefore write each normalization factor as a function of α_t:

    Z_t = Z_t(α_t).    (24)
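As a quick numerical sanity check of the recursion (21) and of the product-of-normalizers identity used next, the following sketch compares the iterative update against the closed form on assumed random toy data.

```python
import numpy as np

rng = np.random.default_rng(0)
m, T = 8, 4
y = rng.choice([-1, 1], size=m)
h_out = rng.choice([-1, 1], size=(T, m))     # assumed weak-classifier outputs h_t(x_i)
alpha = rng.uniform(0.2, 1.0, size=T)        # assumed weights alpha_t

# iterative update, eq. (17)/(21)
D = np.full(m, 1.0 / m)
Z_prod = 1.0
for t in range(T):
    w = D * np.exp(-alpha[t] * y * h_out[t])
    Z_prod *= w.sum()                        # Z_t
    D = w / w.sum()

# closed form: D_{T+1}(i) = (1/m) exp(-y_i F(x_i)) / (Z_1 ... Z_T)
F = (alpha[:, None] * h_out).sum(axis=0)
D_closed = np.exp(-y * F) / m / Z_prod
assert np.allclose(D, D_closed)
```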
Recall that the data weights can therefore be written in closed form:

    D_{t+1}(i) = (1/(Z_1 · · · Z_t)) (1/m) exp[−y_i F(x_i)],    (25)

where F here denotes the vote Σ_{s=1}^t α_s h_s accumulated so far. And, since we know the data weights must sum to one, we have

    Σ_{i=1}^m D_{t+1}(x_i) = (1/(Z_1 · · · Z_t)) (1/m) Σ_{i=1}^m exp[−y_i F(x_i)] = 1.    (26)

Therefore, we can summarize this with a new normalizing function:

    Z = Z_1 · · · Z_t = (1/m) Σ_{i=1}^m exp[−y_i F(x_i)].    (27)

AdaBoost Convergence

Key idea: AdaBoost minimizes an upper bound on the classification error.

Claim: after t steps, the error of the strong classifier is bounded above by the quantity Z as we just defined it (the product of the data-weight normalization factors):

    Err(H) ≤ Z = Z(α, h) = Z_t(α_t, h_t) · · · Z_1(α_1, h_1).    (28)

AdaBoost is a greedy algorithm that minimizes this upper bound on the classification error by choosing the optimal h_t and α_t to minimize Z_t at each step:

    (h, α)* = argmin Z(α, h),    (29)
    (h_t, α_t)* = argmin Z_t(α_t, h_t).    (30)

As Z goes to zero, the classification error goes to zero; hence AdaBoost converges. (But we need to account for the case in which no new weak classifier has an error rate better than 0.5, at which point we should stop.)

We need to show that the claim on the error bound is true:

    Err(H) = (1/m) Σ_{i=1}^m δ(H(x_i) ≠ y_i) ≤ Z = (1/m) Σ_{i=1}^m exp[−y_i F(x_i)].    (31)
Proof. Write

    F(x_i) = sign(F(x_i)) |F(x_i)|    (32)
           = H(x_i) |F(x_i)|.    (33)

There are two cases. If H(x_i) ≠ y_i, then y_i F(x_i) = −|F(x_i)|, so the left-hand side is 1 ≤ e^{+|F(x_i)|}, the right-hand side. If H(x_i) = y_i, then y_i F(x_i) = +|F(x_i)|, so the left-hand side is 0 ≤ e^{−|F(x_i)|}, the right-hand side. So the inequality holds for each term,

    δ(H(x_i) ≠ y_i) ≤ exp[−y_i F(x_i)],    (34)

and hence the bound is true.

Weak Classifier Pursuit

Now we want to explore how to solve the stepwise minimization problem

    (h_t, α_t)* = argmin Z_t(α_t, h_t).    (35)

Recall that we can separate Z_t into two parts,

    Z_t(α_t, h_t) = Σ_{x_i ∈ A} D_t(x_i) exp[−α_t] + Σ_{x_i ∈ Ā} D_t(x_i) exp[α_t],    (36)

where A is the set of correctly classified points, {x_i : y_i = h_t(x_i)}, and Ā is its complement. Take the derivative with respect to α_t and set it to zero:

    dZ_t(α_t, h_t)/dα_t = − Σ_{x_i ∈ A} D_t(x_i) exp[−α_t] + Σ_{x_i ∈ Ā} D_t(x_i) exp[α_t] = 0,    (37)

which gives

    Σ_{x_i ∈ A} D_t(x_i) = Σ_{x_i ∈ Ā} D_t(x_i) exp[2 α_t].    (39)

And, by definition, we can write the weighted error as

    ε_t(h) = Σ_{i=1}^m D_t(x_i) δ(h(x_i) ≠ y_i) = Σ_{x_i ∈ Ā} D_t(x_i),  ∀h,    (40)

so that Σ_{x_i ∈ A} D_t(x_i) = 1 − ε_t(h). Rewriting (39) and solving for α_t yields

    α_t = (1/2) ln[ (1 − ε_t(h_t)) / ε_t(h_t) ].    (41)
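A small numerical check that the closed form (41) is indeed the minimizer of Z_t(α_t) for a fixed weighted error; the grid search is purely illustrative.

```python
import numpy as np

def Z(alpha, eps):
    """Z_t as a function of alpha_t when the weights of correctly classified
    points sum to (1 - eps) and those of misclassified points sum to eps."""
    return (1.0 - eps) * np.exp(-alpha) + eps * np.exp(alpha)

eps = 0.3
alpha_star = 0.5 * np.log((1.0 - eps) / eps)          # closed form, eq. (41)
grid = np.linspace(-2.0, 4.0, 100001)
assert np.isclose(grid[np.argmin(Z(grid, eps))], alpha_star, atol=1e-3)
```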
We can plug this back into the normalization term to get the minimum:

    Z_t(α_t, h_t) = Σ_{x_i ∈ A} D_t(x_i) exp[−α_t] + Σ_{x_i ∈ Ā} D_t(x_i) exp[α_t]    (42)
                  = (1 − ε_t(h_t)) √( ε_t(h_t) / (1 − ε_t(h_t)) ) + ε_t(h_t) √( (1 − ε_t(h_t)) / ε_t(h_t) )    (43)
                  = 2 √( ε_t(h_t) (1 − ε_t(h_t)) ).    (44)

Change variables to γ_t = 1/2 − ε_t(h_t), with γ_t ∈ (0, 1/2]. Then we have the minimum value

    Z_t(α_t, h_t) = 2 √( ε_t(h_t) (1 − ε_t(h_t)) )    (45)
                  = √( 1 − 4 γ_t² )    (46)
                  ≤ exp( −2 γ_t² ).    (47)

Therefore, after T steps, the error rate of the strong classifier is bounded above by

    Err(H) ≤ Z ≤ exp[ −2 Σ_{t=1}^T γ_t² ].    (48)

Hence, each step decreases the upper bound exponentially, and a weak classifier with a small error rate leads to faster descent.
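The chain (44)-(48) is easy to verify numerically; the error values below are assumed for illustration.

```python
import numpy as np

eps = np.linspace(0.05, 0.45, 9)          # assumed weighted errors of the chosen weak classifiers
gamma = 0.5 - eps                         # the "edge" of each round
Zt = 2 * np.sqrt(eps * (1 - eps))         # eq. (44)
assert np.allclose(Zt, np.sqrt(1 - 4 * gamma**2))   # eq. (46)
assert np.all(Zt <= np.exp(-2 * gamma**2))          # eq. (47)

# cumulative bound on the training error after each round, eq. (48)
bound = np.exp(-2 * np.cumsum(gamma**2))
assert np.all(np.cumprod(Zt) <= bound)
print(np.round(bound, 4))                 # decreases exponentially with the number of rounds
```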
Summary of AdaBoost Convergence

The objective of AdaBoost is to minimize an upper bound on the classification error:

    (α, h)* = argmin Z(α, h)    (49)
            = argmin Z_t(α_t, h_t) · · · Z_1(α_1, h_1)    (50)
            = argmin Σ_{i=1}^m exp[ −y_i ⟨α, h(x_i)⟩ ].    (51)

AdaBoost takes a stepwise minimization scheme, which may not be globally optimal (it is greedy): when we calculate the parameters for the t-th weak classifier, the others remain fixed. We should stop AdaBoost if all of the weak classifiers have an error rate of 1/2.

Test Error Analysis (From Schapire's Slides)

How will the test error behave? A first guess: we expect the training error to continue to drop (or reach zero), and the test error to increase once H_final becomes "too complex" ("Occam's razor": overfitting), which would make it hard to know when to stop training. [Slide figure: sketched training and test error versus the number of rounds T.]

An actual typical run (boosting C4.5 on the "letter" dataset) looks quite different: the test error does not increase, even after 1000 rounds (a total ensemble size of more than 2,000,000 nodes), and the test error continues to drop even after the training error is zero.

    # rounds        5      100     1000
    train error     0.0    0.0     0.0
    test error      8.4    3.3     3.1

Occam's razor wrongly predicts that the "simpler" rule is better.

A Better Story: The Margins Explanation [with Freund, Bartlett & Lee]

Key idea: training error only measures whether classifications are right or wrong; we should also consider the confidence of the classifications. Recall that H_final is a weighted majority vote of weak classifiers. Measure confidence by the margin, the strength of the vote: (fraction of the weighted vote for the correct label) − (fraction of the weighted vote for the incorrect label). The margin ranges from −1 (a high-confidence incorrect prediction), through 0 (a low-confidence prediction), to +1 (a high-confidence correct prediction).

Empirical evidence: the margin distribution. [Slide figure: the train/test error curves together with the cumulative distribution of margins of the training examples after 5, 100, and 1000 rounds.]

    # rounds            5      100     1000
    train error         0.0    0.0     0.0
    test error          8.4    3.3     3.1
    % margins ≤ 0.5     7.7    0.0     0.0
    minimum margin      0.14   0.52    0.55
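To make the margin concrete, a short sketch that computes the normalized voting margins y_i F(x_i) / Σ_t α_t for a trained ensemble; it reuses the hypothetical (alphas, chosen) output of the AdaBoost sketch earlier.

```python
import numpy as np

def margins(X, y, alphas, chosen):
    """Normalized voting margins y_i * F(x_i) / sum_t alpha_t, in [-1, +1].

    A margin of +1 means every weighted vote was for the correct label;
    negative margins correspond to misclassified training points.
    """
    a = np.asarray(alphas)
    F = sum(w * h(X) for w, h in zip(a, chosen))
    return y * F / a.sum()

# e.g., the fraction of training points with margin at most 0.5:
# (margins(X, y, alphas, chosen) <= 0.5).mean()
```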
Theoretical Evidence: Analyzing Boosting Using Margins

Theorem: large margins imply a better bound on the generalization error (independent of the number of rounds). Proof idea: if all margins are large, then the final classifier can be approximated by a much smaller classifier (just as polls can predict a not-too-close election).

Theorem: boosting tends to increase the margins of the training examples (given the weak learning assumption). Proof idea: similar to the training error proof.

So: although the final classifier is getting larger, the margins are likely to be increasing, so the final classifier is actually getting closer to a simpler classifier, driving down the test error.

More technically: with high probability, for all θ > 0,

    generalization error ≤ P̂r[margin ≤ θ] + Õ( √(d/m) / θ ),

where P̂r[·] denotes empirical probability. The bound depends on m, the number of training examples; d, the "complexity" of the weak classifiers; and the entire distribution of margins of the training examples. Moreover, P̂r[margin ≤ θ] → 0 exponentially fast (in T) if the error of h_t on D_t is less than 1/2 − θ for all t. So, if the weak learning assumption holds, then all examples will quickly have "large" margins.

Summary of Basic AdaBoost

AdaBoost is a sequential algorithm that minimizes an upper bound on the empirical classification error by selecting weak classifiers and their weights. These are "pursued" one by one, with each one selected to maximally reduce the upper bound on the error.

AdaBoost defines a distribution of weights over the data samples. These weights are updated each time a new weak classifier is added, such that the samples misclassified by this new weak classifier are given more weight.
In this manner, currently misclassified samples are emphasized more during the selection of the subsequent weak classifier. The empirical error will converge to zero at an exponential rate.

Practical AdaBoost Advantages

  It is fast to evaluate (a linear, additive model) and can be fast to train (depending on the weak learner).
  T is the only parameter to tune.
  It is flexible and can be combined with any weak learner.
  It is provably effective if it can consistently find weak classifiers that do better than random.
  Since it can work with any weak learner, it can handle the full gamut of data.

AdaBoost Caveats

  Performance depends on the data and the weak learner. It can fail if the weak classifiers are too complex and overfit, or too weak, essentially underfitting.
  AdaBoost seems, empirically, to be especially susceptible to uniform noise.

The Coordinate Descent View of AdaBoost (from Schapire) [Breiman]

Let {g_1, . . . , g_N} be the space of all weak classifiers. We want to find λ_1, . . . , λ_N to minimize

    L(λ_1, . . . , λ_N) = Σ_i exp[ −y_i Σ_j λ_j g_j(x_i) ].

AdaBoost is actually doing coordinate descent on this optimization problem: initially all λ_j = 0; each round, choose one coordinate λ_j (corresponding to h_t) and update it (increment it by α_t), choosing the update that causes the biggest decrease in the loss. This is a powerful technique for minimizing over a huge space of functions.
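A tiny sketch of this coordinate-descent reading on an assumed toy pool of weak-classifier outputs: each round picks the coordinate whose optimal step shrinks the exponential loss the most, and that step is exactly the α_t of eq. (16).

```python
import numpy as np

def exp_loss(lams, G, y):
    """L(lambda) = sum_i exp(-y_i * sum_j lambda_j g_j(x_i)), with G[j, i] = g_j(x_i)."""
    return np.exp(-y * (lams @ G)).sum()

def coordinate_descent_round(lams, G, y):
    """One AdaBoost round, read as a coordinate-descent step on exp_loss."""
    w = np.exp(-y * (lams @ G))
    D = w / w.sum()                                  # AdaBoost's distribution D_t
    eps = (D * (G != y)).sum(axis=1)                 # weighted error of each g_j
    eps = np.clip(eps, 1e-12, 1 - 1e-12)
    # with the optimal step, coordinate j scales the loss by 2*sqrt(eps_j*(1 - eps_j)),
    # so pick the coordinate giving the biggest decrease
    j = int(np.argmin(eps * (1 - eps)))
    lams = lams.copy()
    lams[j] += 0.5 * np.log((1 - eps[j]) / eps[j])   # the alpha_t of eq. (16)
    return lams

rng = np.random.default_rng(1)
m, N = 20, 10
y = rng.choice([-1, 1], size=m)
G = rng.choice([-1, 1], size=(N, m))                 # assumed outputs g_j(x_i) of a toy pool
lams = np.zeros(N)
for _ in range(5):
    new = coordinate_descent_round(lams, G, y)
    assert exp_loss(new, G, y) <= exp_loss(lams, G, y) + 1e-9
    lams = new
```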
AdaBoost for Estimating Conditional Probabilities [Friedman, Hastie & Tibshirani]

We often want to estimate the probability that y = +1 given x. AdaBoost minimizes (the empirical version of)

    E_{x,y}[ e^{−y f(x)} ] = E_x[ P(y = +1 | x) e^{−f(x)} + P(y = −1 | x) e^{f(x)} ],

where x, y are drawn from the true distribution. Over all f, this is minimized when

    f(x) = (1/2) ln[ P(y = +1 | x) / P(y = −1 | x) ],  or equivalently  P(y = +1 | x) = 1 / (1 + e^{−2 f(x)}).

So, to convert the f output by AdaBoost into a probability estimate, use this same formula.

Multiclass Problems (From Schapire's Slides) [with Freund]

Say y ∈ Y = {1, . . . , k}. The direct approach (AdaBoost.M1) uses weak classifiers h_t : X → Y and the update

    D_{t+1}(i) = (D_t(i) / Z_t) × { e^{−α_t} if y_i = h_t(x_i);  e^{α_t} if y_i ≠ h_t(x_i) },

with the final classifier

    H_final(x) = argmax_{y ∈ Y} Σ_{t : h_t(x) = y} α_t.

One can prove the same bound on the error if ε_t ≤ 1/2 for all t. In practice this is not usually a problem for "strong" weak learners (e.g., C4.5), but it is a significant problem for "weak" weak learners (e.g., decision stumps); in that case, instead reduce the multiclass problem to binary.

Reducing Multiclass to Binary [with Singer]

Say the possible labels are {a, b, c, d, e}. Each training example is replaced by five {−1, +1}-labeled examples; for instance, (x, c) becomes

    ((x, a), −1), ((x, b), −1), ((x, c), +1), ((x, d), −1), ((x, e), −1).

Predict with the label receiving the most (weighted) votes.

Applications

AdaBoost for Face Detection. Viola and Jones, 2000-2002, built a real-time face detector by boosting weak classifiers over the rectangle features described earlier. [Slide figures: output of the face detector on a number of test images from the MIT+CMU test set.]

AdaBoost for Car and Pedestrian Detection. F. Moutarde, B. Stanciulescu, and A. Breheret. Real-time visual detection of vehicles and pedestrians with new efficient AdaBoost features. 2008. They define a different, pixel-level "connected control point" feature. These weak classifiers compute the absolute difference between the sums of pixel values over two sets of control points (drawn in red and blue in the paper's figures), with the rule: if |Area(A) − Area(B)| > Threshold then True, else False. On public car and pedestrian image databases, these features are reported to consistently outperform previously used features while still operating fast.
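A minimal sketch of that thresholded comparison as a weak rule; the point sets, names, and output convention are illustrative, and the actual feature in Moutarde et al. has more structure (e.g., multi-resolution support).

```python
import numpy as np

def control_point_classifier(img, points_a, points_b, threshold):
    """Weak rule in the spirit of the 'connected control point' features:
    compare the absolute difference of the pixel sums over two point sets
    against a threshold and output a {-1, +1} decision.

    points_a, points_b : lists of (row, col) pixel coordinates
    """
    area_a = sum(int(img[r, c]) for r, c in points_a)
    area_b = sum(int(img[r, c]) for r, c in points_b)
    return 1 if abs(area_a - area_b) > threshold else -1
```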
[Slide figure (Moutarde et al., Fig. 4): examples of AdaBoost-selected control-point features for car detection (top) and pedestrian detection (bottom). Some features operate at the full resolution of the detection window (80x32 in the pedestrian case), while others work at half- or even quarter-resolution.]

[Slide figure (Moutarde et al., Fig. 9): averaged ROC curves for AdaBoost pedestrian classifiers obtained with various feature families.]

Sources

These slides have made extensive use of the following sources:

Y. Freund and R. E. Schapire. A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

R. E. Schapire. The boosting approach to machine learning: an overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002.

Schapire's NIPS tutorial: http://nips.cc/Conferences/2007/Program/schedule.php?Session=Tutorials

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001.

P. Viola and M. Jones. Fast and Robust Classification Using Asymmetric AdaBoost and a Detector Cascade. In Proceedings of Neural Information Processing Systems (NIPS), 2002.

S.-C. Zhu's slides on AdaBoost (UCLA).