AdaBoost in a Nutshell

Ilker Birbil
September, 2019

In this short note, I will try to give a few additional details about the infamous AdaBoost algorithm. Consider a binary classification problem. The data is $\{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \{+1, -1\}$. Assume that we know how to train a series of weak binary classifiers $y_\ell(x) \in \{+1, -1\}$, $\ell = 1, 2, \dots$. We shall boost the compounded performance of these classifiers in an iterative manner by giving different weights to the data points. In particular, the misclassified points shall get higher weights so that the classifiers will pay more attention to those points to reduce the error. First, the algorithm.

Algorithm 1: AdaBoost
1  Initialize the weights $w_i^1 = \frac{1}{n}$ for $i = 1, \dots, n$
2  for $k = 1, \dots, K$ do
3      Fit a classifier $y_k(x)$ that minimizes the weighted classification error
       \[ \sum_{i=1}^n w_i^k \, I(y_k(x_i) \neq y_i) \]
4      Evaluate the relative weight of the misclassified data points
       \[ \varepsilon_k = \frac{\sum_{i=1}^n w_i^k \, I(y_k(x_i) \neq y_i)}{\sum_{i=1}^n w_i^k} \]
5      Set the update parameter $\alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k}$
6      Update the weights $w_i^{k+1} = w_i^k \, e^{\alpha_k I(y_k(x_i) \neq y_i)}$ for $i = 1, \dots, n$
7  Output: $\operatorname{sign}\left( \sum_{k=1}^K \alpha_k y_k(x) \right)$

There are a couple of important points in Algorithm 1. Note that in line 5, $\alpha_k$ becomes large whenever the relative weight of the misclassified points is small. Thus, the equation in line 6 makes sure that the weights of the misclassified points go up at the next iteration. Moreover, a large $\alpha_k$ value promotes the corresponding classifier $k$ in line 7.

Next, we shall try to give an interpretation to AdaBoost. Consider a classifier of the form $f_k(x) = \frac{1}{2} \sum_{\ell=1}^k \alpha_\ell y_\ell(x)$. That is, $f_k(x)$ is a linear combination of $k$ classifiers, and its response is simply $\operatorname{sign}(f_k(x))$. We want to train $f_k(x)$ by minimizing the following error function:
\[ \sum_{i=1}^n e^{-y_i f_k(x_i)} = \sum_{i=1}^n e^{-\frac{1}{2} y_i \sum_{\ell=1}^k \alpha_\ell y_\ell(x_i)}. \]
We need to choose $\alpha_1, \dots, \alpha_k$ and the parameters of the classifiers $y_\ell(x)$, $\ell = 1, \dots, k$.
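Algorithm 1 can be sketched in a few lines of Python. The choice of decision stumps as the weak learners, the function names, and the small clipping of $\varepsilon_k$ (to guard against a perfect stump) are my own illustrative choices, not part of the note; the numbered comments map each step back to Algorithm 1.

```python
import numpy as np

def fit_stump(X, y, w):
    """Search for a one-feature threshold stump minimizing the
    weighted classification error (line 3 of Algorithm 1)."""
    n, p = X.shape
    best = (np.inf, 0, 0.0, 1)           # (error, feature, threshold, polarity)
    for j in range(p):
        for t in np.unique(X[:, j]):
            for s in (1, -1):            # polarity: sign(x_j - t) or its flip
                pred = s * np.sign(X[:, j] - t)
                pred[pred == 0] = s
                err = w @ (pred != y).astype(float)
                if err < best[0]:
                    best = (err, j, t, s)
    return best[1:]

def stump_predict(X, stump):
    j, t, s = stump
    pred = s * np.sign(X[:, j] - t)
    pred[pred == 0] = s
    return pred

def adaboost(X, y, K=10):
    n = len(y)
    w = np.full(n, 1.0 / n)              # line 1: uniform initial weights
    stumps, alphas = [], []
    for _ in range(K):
        stump = fit_stump(X, y, w)       # line 3
        miss = (stump_predict(X, stump) != y).astype(float)
        eps = (w @ miss) / w.sum()       # line 4
        eps = np.clip(eps, 1e-10, 1 - 1e-10)  # avoid log(0) for a perfect stump
        alpha = np.log((1 - eps) / eps)  # line 5
        w = w * np.exp(alpha * miss)     # line 6
        stumps.append(stump)
        alphas.append(alpha)
    def predict(Xq):                     # line 7: sign of the weighted vote
        agg = sum(a * stump_predict(Xq, s) for a, s in zip(alphas, stumps))
        return np.sign(agg)
    return predict
```

For instance, on a toy one-feature data set such as $x = (0, 1, 2, 3)$ with labels $(-1, -1, +1, +1)$, a single stump already separates the classes and the boosted classifier reproduces the labels exactly.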
How about minimizing this error function incrementally rather than minimizing the whole function at once? In particular, we shall assume that the classifiers $y_1(x), \dots, y_{k-1}(x)$ and their multipliers $\alpha_1, \dots, \alpha_{k-1}$ have already been fixed. Then, we want to minimize the error function by choosing $\alpha_k$ and $y_k(x)$. The error function is given by
\[
\begin{aligned}
\sum_{i=1}^n e^{-\frac{1}{2} y_i \sum_{\ell=1}^k \alpha_\ell y_\ell(x_i)}
&= \sum_{i=1}^n e^{-y_i f_{k-1}(x_i) - \frac{1}{2} y_i \alpha_k y_k(x_i)} \\
&= \sum_{i=1}^n w_i^k \, e^{-\frac{1}{2} y_i \alpha_k y_k(x_i)}, \quad \text{with } w_i^k = e^{-y_i f_{k-1}(x_i)}, \\
&= \sum_{i=1}^n w_i^k \left( I(y_k(x_i) = y_i) \, e^{-\frac{\alpha_k}{2}} + I(y_k(x_i) \neq y_i) \, e^{\frac{\alpha_k}{2}} \right) \\
&= e^{-\frac{\alpha_k}{2}} \sum_{i=1}^n w_i^k + \left( e^{\frac{\alpha_k}{2}} - e^{-\frac{\alpha_k}{2}} \right) \sum_{i=1}^n w_i^k \, I(y_k(x_i) \neq y_i).
\end{aligned}
\]
Minimizing this error function with respect to the parameters of the classifier $y_k(x)$ is equivalent to minimizing
\[ \sum_{i=1}^n w_i^k \, I(y_k(x_i) \neq y_i), \]
as done in line 3 of Algorithm 1. On the other hand, minimizing the error function with respect to $\alpha_k$ boils down to taking the derivative and finding the root of the resulting equation.¹ This step leads to the $\alpha_k$ evaluation in line 5.

By definition, we also have
\[
\begin{aligned}
w_i^{k+1} &= e^{-y_i f_k(x_i)} \\
&= e^{-y_i f_{k-1}(x_i) - \frac{1}{2} y_i \alpha_k y_k(x_i)} \\
&= w_i^k \, e^{-\frac{1}{2} y_i \alpha_k y_k(x_i)} \\
&= w_i^k \, e^{-\frac{1}{2} \alpha_k (1 - 2 I(y_k(x_i) \neq y_i))} \\
&= w_i^k \underbrace{e^{-\frac{1}{2} \alpha_k}}_{c_k} \, e^{\alpha_k I(y_k(x_i) \neq y_i)}.
\end{aligned}
\]
Note that the term marked as $c_k$ just scales each weight by the same amount. Thus, we can safely drop it and obtain line 6 of Algorithm 1 as
\[ w_i^{k+1} = w_i^k \, e^{\alpha_k I(y_k(x_i) \neq y_i)}. \]
Finally, we output the classification response according to
\[ \operatorname{sign}(f_k(x)) = \operatorname{sign}\left( \frac{1}{2} \sum_{\ell=1}^k \alpha_\ell y_\ell(x) \right) = \operatorname{sign}\left( \sum_{\ell=1}^k \alpha_\ell y_\ell(x) \right), \]
giving line 7 of the AdaBoost algorithm.

¹ Please try this step on your own.
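Without giving away the derivative exercise, the claim can be checked numerically: the $\alpha_k$ from line 5 should coincide with the minimizer of the last expression above, $E(\alpha) = e^{-\alpha/2} W + (e^{\alpha/2} - e^{-\alpha/2}) M$, where $W = \sum_i w_i^k$ and $M = \sum_i w_i^k I(y_k(x_i) \neq y_i)$. The random weights and misclassification pattern below are arbitrary stand-ins, not data from the note.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.random(50)                       # arbitrary positive weights w_i^k
miss = rng.random(50) < 0.3              # arbitrary misclassification pattern
W, M = w.sum(), w[miss].sum()
eps = M / W                              # relative weight of mistakes (line 4)

def error(alpha):
    # E(alpha) from the last line of the derivation above
    return np.exp(-alpha / 2) * W + (np.exp(alpha / 2) - np.exp(-alpha / 2)) * M

grid = np.linspace(-5.0, 5.0, 100001)    # dense grid search for the minimizer
alpha_grid = grid[np.argmin(error(grid))]
alpha_closed = np.log((1 - eps) / eps)   # line 5 of Algorithm 1
print(alpha_grid, alpha_closed)          # the two agree up to grid resolution
```

The two values match to within the grid spacing, which is consistent with the closed form obtained by setting $E'(\alpha_k) = 0$.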