AdaBoost in a Nutshell
Ilker Birbil
September, 2019
In this short note, I will try to give a few additional details about the famous AdaBoost
algorithm. Consider a binary classification problem. The data is $\{(x_1, y_1), \ldots, (x_n, y_n)\}$,
where $x_i \in \mathbb{R}^p$ and $y_i \in \{+1, -1\}$.

Assume that we know how to train a series of weak binary classifiers $y_\ell(x) \in \{+1, -1\}$,
$\ell = 1, 2, \ldots$. We shall boost the combined performance of these classifiers in an iterative
manner by giving different weights to the data points. In particular, the misclassified points
shall get higher weights, so that the classifiers pay more attention to those points and
reduce the error. First, the algorithm.
Algorithm 1: AdaBoost
1: Initialize weights $w_i^1 = \frac{1}{n}$ for $i = 1, \ldots, n$
2: for $k = 1, \ldots, K$ do
3:   Fit a classifier $y_k(x)$ that minimizes the weighted classification error
       $\sum_{i=1}^{n} w_i^k I(y_k(x_i) \neq y_i)$
4:   Evaluate the relative weight of the misclassified data points
       $\varepsilon_k = \dfrac{\sum_{i=1}^{n} w_i^k I(y_k(x_i) \neq y_i)}{\sum_{i=1}^{n} w_i^k}$
5:   Set the update parameter $\alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k}$
6:   Update the weights $w_i^{k+1} = w_i^k \, e^{\alpha_k I(y_k(x_i) \neq y_i)}$ for $i = 1, \ldots, n$
7: Output: $\operatorname{sign}\left( \sum_{k=1}^{K} \alpha_k y_k(x) \right)$
There are a couple of important points in Algorithm 1. Note that in line 5, $\alpha_k$ becomes
large whenever the relative weight of the misclassified points is small. Thus, the equation
in line 6 makes sure that the weights of the misclassified points go up at the next iteration.
Moreover, a large $\alpha_k$ value promotes the corresponding classifier $k$ in line 7.
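As a concrete illustration, here is a minimal sketch of Algorithm 1 in Python. It assumes scikit-learn decision stumps (`DecisionTreeClassifier` with `max_depth=1`) as the weak learners; the function names `adaboost_fit` and `adaboost_predict` are illustrative and not part of the original algorithm.

```python
# A minimal sketch of Algorithm 1, assuming scikit-learn decision stumps
# (DecisionTreeClassifier with max_depth=1) as the weak learners y_k(x).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, K=50):
    """X: (n, p) array; y: (n,) array with labels in {+1, -1}."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                        # line 1: w_i^1 = 1/n
    stumps, alphas = [], []
    for _ in range(K):                             # line 2
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)           # line 3: weighted fit
        miss = stump.predict(X) != y               # indicators I(y_k(x_i) != y_i)
        eps = w[miss].sum() / w.sum()              # line 4: relative weight of errors
        alpha = np.log((1.0 - eps) / eps)          # line 5 (assumes 0 < eps < 1)
        w = w * np.exp(alpha * miss)               # line 6: boost misclassified weights
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Line 7: sign of the alpha-weighted vote of the K weak classifiers."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)
```

Passing the current weights through `sample_weight` is what turns line 3 into a weighted fit; any weak learner that accepts per-sample weights could be used in place of a stump.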
Next, we shall try to give an interpretation to AdaBoost. Consider a classifier of the form
$f_k(x) = \frac{1}{2} \sum_{\ell=1}^{k} \alpha_\ell y_\ell(x)$. That is, $f_k(x)$ is a linear combination of $k$ classifiers, and its
response is simply $\operatorname{sign}(f_k(x))$. We want to train $f_k(x)$ by minimizing the following error
function
$$\sum_{i=1}^{n} e^{-y_i f_k(x_i)} = \sum_{i=1}^{n} e^{-\frac{1}{2} y_i \sum_{\ell=1}^{k} \alpha_\ell y_\ell(x_i)}.$$
We need to choose $\alpha_1, \ldots, \alpha_k$ and the parameters of the classifiers $y_\ell(x)$, $\ell = 1, \ldots, k$.
How about minimizing this error function incrementally rather than minimizing the whole
function at once? In particular, we shall assume that the classifiers $y_1(x), \ldots, y_{k-1}(x)$ and
their multipliers $\alpha_1, \ldots, \alpha_{k-1}$ have already been fixed. Then, we want to minimize the error
function by choosing $\alpha_k$ and $y_k(x)$.
The error function is given by
\begin{align*}
\sum_{i=1}^{n} e^{-\frac{1}{2} y_i \sum_{\ell=1}^{k} \alpha_\ell y_\ell(x_i)}
&= \sum_{i=1}^{n} e^{-y_i f_{k-1}(x_i) - \frac{1}{2} y_i \alpha_k y_k(x_i)} \\
&= \sum_{i=1}^{n} w_i^k \, e^{-\frac{1}{2} y_i \alpha_k y_k(x_i)}, \qquad \text{where } w_i^k := e^{-y_i f_{k-1}(x_i)} \\
&= \sum_{i=1}^{n} w_i^k \left[ I(y_k(x_i) = y_i) \, e^{-\frac{\alpha_k}{2}} + I(y_k(x_i) \neq y_i) \, e^{\frac{\alpha_k}{2}} \right] \\
&= e^{-\frac{\alpha_k}{2}} \sum_{i=1}^{n} w_i^k + \left( e^{\frac{\alpha_k}{2}} - e^{-\frac{\alpha_k}{2}} \right) \sum_{i=1}^{n} w_i^k I(y_k(x_i) \neq y_i).
\end{align*}
Minimizing this error function with respect to the parameters of the classifier $y_k(x)$ is
equivalent to minimizing
$$\sum_{i=1}^{n} w_i^k I(y_k(x_i) \neq y_i),$$
as done in line 3 of Algorithm 1. On the other hand, minimizing the error function with
respect to $\alpha_k$ boils down to taking the derivative and finding the root of the resulting
equation$^1$. This step leads to the $\alpha_k$ evaluation in line 5.
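Rather than spoiling that derivation (see the footnote), here is a small numerical check, with made-up weights and error indicators, that the closed-form value in line 5 indeed minimizes the expanded error function above.

```python
# A quick numerical check, with made-up weights and error indicators, that the
# expanded error function is minimized at alpha_k = ln((1 - eps_k) / eps_k).
import numpy as np

rng = np.random.default_rng(0)
w = rng.random(20)                    # arbitrary positive weights w_i^k
miss = np.arange(20) % 4 == 0         # arbitrary indicators I(y_k(x_i) != y_i)

def error(alpha):
    # e^{-alpha/2} * sum_i w_i^k + (e^{alpha/2} - e^{-alpha/2}) * sum_i w_i^k I(...)
    return (np.exp(-alpha / 2) * w.sum()
            + (np.exp(alpha / 2) - np.exp(-alpha / 2)) * w[miss].sum())

eps = w[miss].sum() / w.sum()
alpha_closed = np.log((1 - eps) / eps)            # line 5 of Algorithm 1

grid = np.linspace(-5.0, 5.0, 200001)
alpha_brute = grid[np.argmin(error(grid))]        # brute-force minimizer on a grid

print(alpha_closed, alpha_brute)                  # the two values agree closely
```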
By definition, we also have
\begin{align*}
w_i^{k+1} &= e^{-y_i f_k(x_i)} \\
&= e^{-y_i f_{k-1}(x_i) - \frac{1}{2} y_i \alpha_k y_k(x_i)} \\
&= w_i^k \, e^{-\frac{1}{2} y_i \alpha_k y_k(x_i)} \\
&= w_i^k \, e^{-\frac{1}{2} \alpha_k (1 - 2 I(y_k(x_i) \neq y_i))} \\
&= w_i^k \, \underbrace{e^{-\frac{1}{2} \alpha_k}}_{c_k} \, e^{\alpha_k I(y_k(x_i) \neq y_i)}.
\end{align*}
Note that the term marked as $c_k$ just scales each weight by the same amount. Thus,
we can safely drop it and obtain line 6 of Algorithm 1 as
$$w_i^{k+1} = w_i^k \, e^{\alpha_k I(y_k(x_i) \neq y_i)}.$$
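The reason dropping $c_k$ is safe is that $\varepsilon_k$ in line 4 is a ratio of weighted sums, so a factor common to all weights cancels. A tiny check with arbitrary numbers illustrates this.

```python
# Scaling every weight by the same positive constant leaves eps_k (a ratio of
# weighted sums in line 4), and hence alpha_k, unchanged. The weights,
# indicators, and constant below are arbitrary.
import numpy as np

w = np.array([0.05, 0.20, 0.10, 0.25, 0.40])        # some weights w_i^k
miss = np.array([True, False, True, False, False])  # I(y_k(x_i) != y_i)
c = 0.37                                            # plays the role of c_k

eps = w[miss].sum() / w.sum()
eps_scaled = (c * w)[miss].sum() / (c * w).sum()

print(np.isclose(eps, eps_scaled))                  # True: the constant cancels
```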
Finally, we output the classification response according to
$$\operatorname{sign}(f_k(x)) = \operatorname{sign}\left( \frac{1}{2} \sum_{\ell=1}^{k} \alpha_\ell y_\ell(x) \right) = \operatorname{sign}\left( \sum_{\ell=1}^{k} \alpha_\ell y_\ell(x) \right),$$
giving line 7 of the AdaBoost algorithm.
$^1$ Please try this step on your own.