AdaBoost
Part of Lecture 6
Jason Corso
SUNY at Buffalo
Mar. 17 2009
Introduction
In the first part of this lecture, we looked at
1. Lack of inherent superiority of any one particular classifier;
2. Some systematic ways for selecting a particular method over another for a given scenario;
And, we talked briefly about the bagging method for integrating
component classifiers.
Now, we turn to boosting and the AdaBoost method for integrating
component classifiers into one strong classifier.
Introduction
Rationale
Imagine the situation where you want to build an email filter that can
distinguish spam from non-spam.
The general way we would approach this problem in ML/PR follows
the same scheme we have for the other topics:
1. Gathering as many examples as possible of both spam and non-spam emails.
2. Train a classifier using these examples and their labels.
3. Take the learned classifier, or prediction rule, and use it to filter your mail.
4. The goal is to train a classifier that makes the most accurate predictions possible on new test examples.
And, we've covered related topics on how to measure this, like bias and variance.
But, building a highly accurate classifier is a difficult task. (You still
get spam, right?!)
Introduction
We could probably come up with many quick rules of thumb. These
could be only moderately accurate. Can you think of an example for
this situation?
An example could be “if the subject line contains ‘buy now’ then
classify as spam.”
This certainly doesn’t cover all spams, but it will be significantly
better than random guessing.
Introduction
Basic Idea of Boosting
Boosting refers to a general and provably effective method of
producing a very accurate classifier by combining rough and
moderately inaccurate rules of thumb.
It is based on the observation that finding many rough rules of
thumb can be a lot easier than finding a single, highly accurate
classifier.
To begin, we define an algorithm for finding the rules of thumb,
which we call a weak learner.
The boosting algorithm repeatedly calls this weak learner, each time
feeding it a different distribution over the training data (in AdaBoost).
Each call generates a weak classifier and we must combine all of
these into a single classifier that, hopefully, is much more accurate
than any one of the rules.
Introduction
Key Questions Defining and Analyzing Boosting
1. How should the distribution be chosen each round?
2. How should the weak rules be combined into a single rule?
3. How should the weak learner be defined?
4. How many weak classifiers should we learn?
Basic AdaBoost
Getting Started
We are given a training set
$$\mathcal{D} = \{(x_i, y_i) : x_i \in \mathbb{R}^d,\ y_i \in \{-1, +1\},\ i = 1, \dots, m\}. \quad (1)$$
For example, xi could represent some encoding of an email message
(say in the vector-space text model), and yi is whether or not this
message is spam.
Note that we are working in a two-class setting, and this will be the
case for the majority of our discussion. Some extensions to multi-class
scenarios will be presented.
We need to define a distribution $D$ over the dataset $\mathcal{D}$ such that $\sum_i D(i) = 1$.
Basic AdaBoost
Weak Learners and Weak Classifiers
Weak Learners and Weak Classifiers
First, we concretely define a weak classifier:
$$h_t : \mathbb{R}^d \to \{-1, +1\} \quad (2)$$
A weak classifier must work better than chance. In the two-class setting, this means it has less than 50% error, which is easy to achieve: if a classifier had more than 50% error, we could simply flip its sign. So we only require a classifier that does not have exactly 50% error (since such a classifier would add no information).
The error rate of a weak classifier ht(x) is calculated empirically over the training data:
$$\epsilon(h_t) = \frac{1}{m} \sum_{i=1}^{m} \delta(h_t(x_i) \ne y_i) < \frac{1}{2}. \quad (3)$$
Basic AdaBoost
Weak Learners and Weak Classifiers
A WL/WC Example for Images
Consider the case that our input data xi are rectangular image patches. And consider this case closely, as it is part of project I.
Define a collection of Haar-like rectangle features. The feature value extracted is the difference of the pixel sum in the white sub-regions and the black sub-regions. With a base patch size of 24x24, there are over 180,000 possible such rectangle features.
As a result, each stage of the boosting process, which selects a new weak classifier, can be viewed as a feature selection process.
[Figures from Viola & Jones: example two-, three-, and four-rectangle features shown relative to the enclosing detection window; the first two features selected by AdaBoost for a face detector, overlayed on a typical training face (the first measures the difference in intensity between the region of the eyes and a region across the upper cheeks); and the boosting procedure itself, which initializes weights over the negative and positive examples and, on each round, normalizes the weights, trains a single-feature classifier for every feature, chooses the one with the lowest weighted error, and updates the weights to form the final strong classifier.]
Basic AdaBoost
Weak Learners and Weak Classifiers
Although these features are somewhat primitive in comparison to
things like steerable filters, SIFT keys, etc., they do provide a rich set
on which boosting can learn.
And, they are quite efficiently computed when using the integral
image representation.
Define the integral image as the image whose pixel value at a
particular pixel x, y is the sum of the pixel values to the left and
above x, y in the original image:
$$ii(x, y) = \sum_{x' \le x,\ y' \le y} i(x', y') \quad (4)$$
where ii is the integral image and i is the original image.
Use the following pair of recurrences to compute the integral image in just one pass:
$$s(x, y) = s(x, y - 1) + i(x, y) \quad (5)$$
$$ii(x, y) = ii(x - 1, y) + s(x, y) \quad (6)$$
where we define s(x, −1) = 0 and ii(−1, y) = 0.
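As a concrete illustration of recurrences (5)-(6), here is a minimal sketch in Python (an assumed illustration, not code from the slides; the function name and use of NumPy are my own choices):

```python
import numpy as np

def integral_image(i):
    """Integral image: ii(x, y) = sum of i over all (x', y') with x' <= x, y' <= y.
    The array i is indexed as i[x, y] to mirror the formulas above."""
    X, Y = i.shape
    s = np.zeros((X, Y))   # cumulative sums along y, with s(x, -1) = 0
    ii = np.zeros((X, Y))  # integral image, with ii(-1, y) = 0
    for x in range(X):
        for y in range(Y):
            s[x, y] = (s[x, y - 1] if y > 0 else 0.0) + i[x, y]    # eq. (5)
            ii[x, y] = (ii[x - 1, y] if x > 0 else 0.0) + s[x, y]  # eq. (6)
    return ii

# Equivalently (and faster): i.cumsum(axis=0).cumsum(axis=1)
```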
Basic AdaBoost
Weak Learners and Weak Classifiers
The sum of a particular rectangle can be computed in just 4 references using the integral image.
[Figure: rectangles A, B, C, D with the integral image evaluated at the four corner points 1-4.]
The value at point 1 is the sum of the pixels in rectangle A.
Point 2 is A+B.
Point 3 is A+C.
Point 4 is A+B+C+D.
So, the sum within D alone is 4+1-2-3.
We have a bunch of features. We certainly can't use them all. So, we let the boosting procedure select the best. But before we can do this, we need to pair these features with a simple weak learner.
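A small sketch of the four-reference lookup (assumed helper names; it presumes an integral image ii indexed as ii[x, y], as in the sketch above):

```python
def rect_sum(ii, x0, y0, x1, y1):
    """Sum of the original image over the rectangle [x0, x1] x [y0, y1] (inclusive),
    using only four lookups into the integral image ii."""
    p1 = ii[x0 - 1, y0 - 1] if (x0 > 0 and y0 > 0) else 0.0  # point 1: sum of A
    p2 = ii[x1, y0 - 1] if y0 > 0 else 0.0                   # point 2: A + B
    p3 = ii[x0 - 1, y1] if x0 > 0 else 0.0                   # point 3: A + C
    p4 = ii[x1, y1]                                          # point 4: A + B + C + D
    return p4 + p1 - p2 - p3                                 # sum within D alone

# A two-rectangle Haar-like feature is then just a difference of two such sums,
# e.g. f = rect_sum(ii, ...white region...) - rect_sum(ii, ...grey region...).
```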
Basic AdaBoost
Weak Learners and Weak Classifiers
Each run, the weak learner is designed to select the single rectangle
feature which best separates the positive and negative examples.
The weak learner searches for the optimal threshold classification
function, such that the minimum number of examples are
misclassified.
The weak classifier ht (x) hence consists of the feature ft (x), a
threshold θt , and a parity pt indicating the direction of the inequality
sign:
$$h_t(x) = \begin{cases} +1 & \text{if } p_t f_t(x) < p_t \theta_t \\ -1 & \text{otherwise.} \end{cases} \quad (7)$$
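A brute-force sketch of this threshold weak learner for a single feature (the function name and the threshold-candidate strategy are assumptions, not the slides' implementation):

```python
import numpy as np

def train_stump(feature_values, labels, weights):
    """feature_values: (m,) floats; labels: (m,) in {-1,+1}; weights: (m,) summing to 1.
    Returns (theta, parity, weighted_error) for h(x) = +1 if parity*f(x) < parity*theta else -1."""
    best = (None, None, np.inf)
    # candidate thresholds: midpoints between sorted feature values, plus the extremes
    vals = np.unique(feature_values)
    candidates = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2.0, [vals[-1] + 1.0]))
    for theta in candidates:
        for parity in (+1, -1):
            preds = np.where(parity * feature_values < parity * theta, +1, -1)
            err = np.sum(weights * (preds != labels))   # weighted error, as in eq. (3)
            if err < best[2]:
                best = (theta, parity, err)
    return best
```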
Basic AdaBoost
The AdaBoost Classifier
The Strong AdaBoost Classifier
Let’s assume we have selected T weak classifiers and a scalar
constant αt associated with each:
$$h = \{h_t : t = 1, \dots, T\} \quad (8)$$
$$\alpha = \{\alpha_t : t = 1, \dots, T\} \quad (9)$$
Denote the inner product over all weak classifiers as F:
$$F(x) = \sum_{t=1}^{T} \alpha_t h_t(x) = \langle \alpha, h(x) \rangle \quad (10)$$
Define the strong classifier as the sign of this inner product:
$$H(x) = \mathrm{sign}[F(x)] = \mathrm{sign}\left[\sum_{t=1}^{T} \alpha_t h_t(x)\right] \quad (11)$$
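A minimal sketch of evaluating (10)-(11); the weak classifiers are assumed to be callables returning +1 or -1, and alphas their learned weights:

```python
def strong_classify(x, weak_classifiers, alphas):
    F = sum(a * h(x) for h, a in zip(weak_classifiers, alphas))  # eq. (10)
    return +1 if F >= 0 else -1                                  # eq. (11), sign(0) taken as +1
```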
Basic AdaBoost
The AdaBoost Classifier
Our objective is to choose h and α to minimize the empirical classification error of the strong classifier:
$$(h, \alpha)^* = \arg\min \mathrm{Err}(H; \mathcal{D}) \quad (12)$$
$$= \arg\min \frac{1}{m} \sum_{i=1}^{m} \delta(H(x_i) \ne y_i) \quad (13)$$
AdaBoost doesn't directly minimize this error but rather minimizes an upper bound on it.
Basic AdaBoost
The AdaBoost Classifier
Illustration of AdaBoost Classifier
[Figure: schematic of the strong classifier, with the Input passed to each Weak Learner and the weighted outputs combined.]
Basic AdaBoost
The AdaBoost Algorithm
The Basic AdaBoost Algorithm
Given $\mathcal{D} = \{(x_1, y_1), \dots, (x_m, y_m)\}$ as before.
Initialize the distribution D1 to be uniform: $D_1(i) = \frac{1}{m}$.
Repeat for t = 1, . . . , T:
1. Learn weak classifier ht using distribution Dt.
   For the example given, this requires you to learn the threshold and the parity at each iteration, given the current distribution Dt, for the weak classifier h over each feature:
   1. Compute the weighted error for each weak classifier:
      $$\epsilon_t(h) = \sum_{i=1}^{m} D_t(i)\, \delta(h(x_i) \ne y_i), \quad \forall h \quad (14)$$
   2. Select the weak classifier with minimum error:
      $$h_t = \arg\min_h \epsilon_t(h) \quad (15)$$
   Note, there are other ways of doing this step...
Basic AdaBoost
The AdaBoost Algorithm
2. Set weight αt based on the error:
   $$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t(h_t)}{\epsilon_t(h_t)} \quad (16)$$
3. Update the distribution based on the performance so far:
   $$D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp\left[-\alpha_t y_i h_t(x_i)\right] \quad (17)$$
   where Zt is a normalization factor to keep Dt+1 a distribution. Note the careful evaluation of the term inside of the exp based on the possible {−1, +1} values of the label.
One chooses T based on some established error criterion or some fixed number.
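Putting these steps together, here is a compact, self-contained sketch of the whole loop (assumed names; it uses simple decision stumps over the columns of a feature matrix as the weak learners, which is one choice among many):

```python
import numpy as np

def adaboost_train(X, y, T):
    """X: (m, n) feature matrix; y: (m,) labels in {-1, +1}; T: number of rounds."""
    m, n = X.shape
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    classifiers, alphas = [], []
    for t in range(T):
        # 1. learn the weak classifier with minimum weighted error (eqs. 14-15),
        #    by brute-force search over features, thresholds, and parities
        best = None
        for j in range(n):
            for theta in np.unique(X[:, j]):
                for parity in (+1, -1):
                    preds = np.where(parity * X[:, j] < parity * theta, 1, -1)
                    err = np.sum(D * (preds != y))
                    if best is None or err < best[0]:
                        best = (err, j, theta, parity, preds)
        err, j, theta, parity, preds = best
        if err >= 0.5:                            # no weak classifier better than chance: stop
            break
        # 2. set the classifier weight (eq. 16)
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        # 3. reweight the data and renormalize (eq. 17)
        D = D * np.exp(-alpha * y * preds)
        D = D / D.sum()
        classifiers.append((j, theta, parity))
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(X, classifiers, alphas):
    F = np.zeros(X.shape[0])
    for (j, theta, parity), alpha in zip(classifiers, alphas):
        F += alpha * np.where(parity * X[:, j] < parity * theta, 1, -1)
    return np.sign(F)                             # the strong classifier H(x), eq. (11)
```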
Basic AdaBoost
A Toy Example (From Schapire’s Slides)
Toy Example
[Figure: the toy training set under the initial uniform distribution D1.]
weak classifiers = vertical or horizontal half-planes
Basic AdaBoost
A Toy Example (From Schapire’s Slides)
Round 1
[Figure: the chosen weak classifier h1 (a half-plane) and the reweighted distribution D2.]
ε1 = 0.30, α1 = 0.42
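As a quick check of equation (16) on this round, using the values shown:
$$\alpha_1 = \tfrac{1}{2}\ln\frac{1-\epsilon_1}{\epsilon_1} = \tfrac{1}{2}\ln\frac{0.70}{0.30} \approx \tfrac{1}{2}(0.847) \approx 0.42.$$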
A Toy Example (From Schapire’s Slides)
Basic AdaBoost
Round 2
[Figure: the weak classifier h2 and the reweighted distribution D3.]
ε2 = 0.21, α2 = 0.65
A Toy Example (From Schapire’s Slides)
Basic AdaBoost
Round 3
[Figure: the weak classifier h3.]
ε3 = 0.14, α3 = 0.92
A Toy Example (From Schapire’s Slides)
Basic AdaBoost
Final Classifier
$$H_{\text{final}} = \mathrm{sign}\big(0.42\, h_1 + 0.65\, h_2 + 0.92\, h_3\big)$$
[Figure: the three weighted half-plane classifiers and the resulting combined decision regions.]
AdaBoost Analysis
Contents for AdaBoost Analysis
Facts about the weights and normalizing functions.
AdaBoost Convergence (why and how fast).
Why do we calculate the weight of each weak classifier to be $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t(h_t)}{\epsilon_t(h_t)}$?
Why do we choose the weak classifier that has the minimum weighted
error?
Testing Error Analysis.
AdaBoost Analysis
Facts About the Weights
Weak Classifier Weights
The selected weight for each new weak classifier is always positive.
$$\epsilon_t(h_t) < \frac{1}{2} \;\Rightarrow\; \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t(h_t)}{\epsilon_t(h_t)} > 0 \quad (18)$$
The smaller the classification error, the bigger the weight and the
more this particular weak classifier will impact the final strong
classifier.
$$\epsilon(h_A) < \epsilon(h_B) \;\Rightarrow\; \alpha_A > \alpha_B \quad (19)$$
AdaBoost Analysis
Facts About the Weights
Data Sample Weights
The weights of the data points are multiplied by $\exp[-y_i \alpha_t h_t(x_i)]$:
$$\exp[-y_i \alpha_t h_t(x_i)] = \begin{cases} \exp[-\alpha_t] < 1 & \text{if } h_t(x_i) = y_i \\ \exp[\alpha_t] > 1 & \text{if } h_t(x_i) \ne y_i \end{cases} \quad (20)$$
The weights of correctly classified points are reduced and the weights
of incorrectly classified points are increased. Hence, the incorrectly
classified points will receive more attention in the next run.
AdaBoost Analysis
The weight distribution can be computed recursively:
$$D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp[-\alpha_t y_i h_t(x_i)] \quad (21)$$
$$= \frac{1}{Z_{t-1} Z_t} D_{t-1}(i) \exp\big[-y_i \big(\alpha_t h_t(x_i) + \alpha_{t-1} h_{t-1}(x_i)\big)\big]$$
$$= \dots = \frac{1}{Z_1 \cdots Z_t} D_1(i) \exp\big[-y_i \big(\alpha_t h_t(x_i) + \cdots + \alpha_1 h_1(x_i)\big)\big]$$
AdaBoost Analysis
Facts About the Normalizing Functions
At each iteration, the weights on the data points are normalized by
$$Z_t = \sum_{x_i} D_t(x_i) \exp[-y_i \alpha_t h_t(x_i)] \quad (22)$$
$$= \sum_{x_i \in A} D_t(x_i) \exp[-\alpha_t] + \sum_{x_i \notin A} D_t(x_i) \exp[\alpha_t] \quad (23)$$
where A is the set of correctly classified points: $\{x_i : y_i = h_t(x_i)\}$.
We can write these normalization factors as functions of αt, then:
$$Z_t = Z_t(\alpha_t) \quad (24)$$
AdaBoost Analysis
Recall the data weights can be computed recursively:
$$D_{t+1}(i) = \frac{1}{Z_1 \cdots Z_t} \cdot \frac{1}{m} \exp[-y_i F(x_i)]. \quad (25)$$
And, since we know the data weights must sum to one, we have
$$\sum_{i=1}^{m} D_{t+1}(i) = \frac{1}{Z_1 \cdots Z_t} \cdot \frac{1}{m} \sum_{i=1}^{m} \exp[-y_i F(x_i)] = 1 \quad (26)$$
Therefore, we can summarize this with a new normalizing function:
$$Z = Z_1 \cdots Z_t = \frac{1}{m} \sum_{i=1}^{m} \exp[-y_i F(x_i)]. \quad (27)$$
AdaBoost Analysis
AdaBoost Convergence
AdaBoost Convergence
Key Idea: AdaBoost minimizes an upper bound on the
classification error.
Claim: After t steps, the error of the strong classifier is bounded
above by quantity Z, as we just defined it (the product of the data
weight normalization factors):
$$\mathrm{Err}(H) \le Z = Z(\alpha, h) = Z_t(\alpha_t, h_t) \cdots Z_1(\alpha_1, h_1) \quad (28)$$
AdaBoost is a greedy algorithm that minimizes this upper bound on the classification error by choosing the optimal ht and αt to minimize Zt at each step.
$$(h, \alpha)^* = \arg\min Z(\alpha, h) \quad (29)$$
$$(h_t, \alpha_t)^* = \arg\min Z_t(\alpha_t, h_t) \quad (30)$$
As Z goes to zero, the classification error goes to zero. Hence, it
converges. (But, we need to account for the case when no new weak
classifier has an error rate better than 0.5, upon which time we should stop.)
AdaBoost Analysis
AdaBoost Convergence
We need to show the claim on the error bound is true:
$$\mathrm{Err}(H) = \frac{1}{m} \sum_{i=1}^{m} \delta(H(x_i) \ne y_i) \;\le\; Z = \frac{1}{m} \sum_{i=1}^{m} \exp[-y_i F(x_i)] \quad (31)$$
Proof:
$$F(x_i) = \mathrm{sign}(F(x_i))\, |F(x_i)| \quad (32)$$
$$= H(x_i)\, |F(x_i)| \quad (33)$$
The two cases are:
If H(xi) ≠ yi, then the LHS = 1 ≤ RHS = e^{+|F(xi)|}.
If H(xi) = yi, then the LHS = 0 ≤ RHS = e^{−|F(xi)|}.
So, the inequality holds for each term
$$\delta(H(x_i) \ne y_i) \le \exp[-y_i F(x_i)] \quad (34)$$
and hence, the inequality is true.
AdaBoost Analysis
AdaBoost Convergence
Weak Classifier Pursuit
Now, we want to explore how we solve the step-wise minimization problem:
$$(h_t, \alpha_t)^* = \arg\min Z_t(\alpha_t, h_t) \quad (35)$$
Recall, we can separate Z into two parts:
$$Z_t(\alpha_t, h_t) = \sum_{x_i \in A} D_t(x_i) \exp[-\alpha_t] + \sum_{x_i \notin A} D_t(x_i) \exp[\alpha_t] \quad (36)$$
where A is the set of correctly classified points: $\{x_i : y_i = h_t(x_i)\}$.
Take the derivative w.r.t. αt and set it to zero:
$$\frac{dZ_t(\alpha_t, h_t)}{d\alpha_t} = -\sum_{x_i \in A} D_t(x_i) \exp[-\alpha_t] + \sum_{x_i \notin A} D_t(x_i) \exp[\alpha_t] = 0 \quad (37)$$
AdaBoost Analysis
AdaBoost Convergence
$$\frac{dZ_t(\alpha_t, h_t)}{d\alpha_t} = -\sum_{x_i \in A} D_t(x_i) \exp[-\alpha_t] + \sum_{x_i \notin A} D_t(x_i) \exp[\alpha_t] = 0 \quad (38)$$
$$\sum_{x_i \in A} D_t(x_i) = \sum_{x_i \notin A} D_t(x_i) \exp[2\alpha_t] \quad (39)$$
And, by definition, we can write the error as
$$\epsilon_t(h) = \sum_{i=1}^{m} D_t(x_i)\, \delta(h(x_i) \ne y_i) = \sum_{x_i \notin A} D_t(x_i), \quad \forall h \quad (40)$$
Rewriting (39) and solving for αt yields
$$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t(h_t)}{\epsilon_t(h_t)} \quad (41)$$
AdaBoost Analysis
AdaBoost Convergence
We can plug it back into the normalization term to get the minimum:
$$Z_t(\alpha_t, h_t) = \sum_{x_i \in A} D_t(x_i) \exp[-\alpha_t] + \sum_{x_i \notin A} D_t(x_i) \exp[\alpha_t] \quad (42)$$
$$= (1 - \epsilon_t(h_t)) \sqrt{\frac{\epsilon_t(h_t)}{1 - \epsilon_t(h_t)}} + \epsilon_t(h_t) \sqrt{\frac{1 - \epsilon_t(h_t)}{\epsilon_t(h_t)}} \quad (43)$$
$$= 2 \sqrt{\epsilon_t(h_t)(1 - \epsilon_t(h_t))} \quad (44)$$
Change a variable, $\gamma_t = \frac{1}{2} - \epsilon_t(h_t)$, $\gamma_t \in (0, \frac{1}{2}]$.
Then, we have the minimum to be
$$Z_t(\alpha_t, h_t) = 2 \sqrt{\epsilon_t(h_t)(1 - \epsilon_t(h_t))} \quad (45)$$
$$= \sqrt{1 - 4\gamma_t^2} \quad (46)$$
$$\le \exp\left[-2\gamma_t^2\right] \quad (47)$$
AdaBoost Analysis
AdaBoost Convergence
Therefore, after T steps, the error rate of the strong classifier is bounded above by
$$\mathrm{Err}(H) \le Z \le \exp\left[-2 \sum_{t=1}^{T} \gamma_t^2\right] \quad (48)$$
Hence, each step decreases the upper bound exponentially.
And, a weak classifier with small error rate will lead to faster descent.
AdaBoost Analysis
AdaBoost Convergence
Summary of AdaBoost Convergence
The objective of AdaBoost is to minimize an upper bound on the classification error:
$$(\alpha, h)^* = \arg\min Z(\alpha, h) \quad (49)$$
$$= \arg\min Z_t(\alpha_t, h_t) \cdots Z_1(\alpha_1, h_1) \quad (50)$$
$$= \arg\min \sum_{i=1}^{m} \exp\left[-y_i \langle \alpha, h(x_i) \rangle\right] \quad (51)$$
AdaBoost takes a stepwise minimization scheme, which may not be optimal (it is greedy). When we calculate the parameter for the tth weak classifier, the others remain set.
We should stop AdaBoost if all of the weak classifiers have an error rate of 1/2.
AdaBoost Analysis
Test Error Analysis (From Schapire’s Slides)
How Will Test Error Behave? (A First Guess)
[Plot: expected training and test error versus the number of rounds T; training error decreases while test error eventually increases.]
expect:
• training error to continue to drop (or reach zero)
• test error to increase when Hfinal becomes "too complex"
  • "Occam's razor"
  • overfitting
• hard to know when to stop training
AdaBoost Analysis
Test Error Analysis (From Schapire’s Slides)
Actual Typical Run
(boosting C4.5 on the "letter" dataset)
[Plot: C4.5 test error, and boosted training/test error, versus the number of rounds T on a log scale.]
• test error does not increase, even after 1000 rounds (total size > 2,000,000 nodes)
• test error continues to drop even after training error is zero!

               # rounds:    5     100    1000
  train error:              0.0   0.0    0.0
  test error:               8.4   3.3    3.1

• Occam's razor wrongly predicts "simpler" rule is better
Test Error Analysis (From Schapire’s Slides)
AdaBoost Analysis
A Better Story: The Margins Explanation
[with Freund, Bartlett & Lee]
• key idea:
  • training error only measures whether classifications are right or wrong
  • should also consider confidence of classifications
• recall: Hfinal is weighted majority vote of weak classifiers
• measure confidence by margin = strength of the vote
  = (fraction voting correctly) − (fraction voting incorrectly)
[Figure: the margin axis from −1 to +1: high confidence but incorrect near −1, low confidence near 0 (Hfinal incorrect just below 0, Hfinal correct just above 0), and high confidence and correct near +1.]
AdaBoost Analysis
Test Error Analysis (From Schapire’s Slides)
Empirical Evidence: The Margin Distribution
• margin distribution = cumulative distribution of margins of training examples
[Plots: training/test error versus the number of rounds, and the cumulative margin distributions after 5, 100, and 1000 rounds.]

                     # rounds:   5      100    1000
  train error:                   0.0    0.0    0.0
  test error:                    8.4    3.3    3.1
  % margins ≤ 0.5:               7.7    0.0    0.0
  minimum margin:                0.14   0.52   0.55
AdaBoost Analysis
Test Error Analysis (From Schapire’s Slides)
Theoretical Evidence: Analyzing Boosting Using Margins
• Theorem: large margins ⇒ better bound on generalization
error (independent of number of rounds)
• proof idea: if all margins are large, then can approximate
final classifier by a much smaller classifier (just as polls
can predict not-too-close election)
• Theorem: boosting tends to increase margins of training
examples (given weak learning assumption)
• proof idea: similar to training error proof
• so:
although final classifier is getting larger,
margins are likely to be increasing,
so final classifier actually getting close to a simpler classifier,
driving down the test error
AdaBoost Analysis
Test Error Analysis (From Schapire’s Slides)
More Technically...
• with high probability, ∀θ > 0:
$$\text{generalization error} \;\le\; \hat{\Pr}[\text{margin} \le \theta] + \tilde{O}\!\left(\frac{\sqrt{d/m}}{\theta}\right)$$
($\hat{\Pr}[\cdot]$ = empirical probability)
• bound depends on
  • m = # training examples
  • d = "complexity" of weak classifiers
  • entire distribution of margins of training examples
• $\hat{\Pr}[\text{margin} \le \theta] \to 0$ exponentially fast (in T) if (error of ht on Dt) < 1/2 − θ (∀t)
• so: if weak learning assumption holds, then all examples will quickly have "large" margins
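A small sketch (assumed names) of computing the normalized margins $y_i F(x_i) / \sum_t \alpha_t$ that the cumulative distributions above are built from:

```python
import numpy as np

def margins(F_values, y, alphas):
    """F_values: (m,) real-valued scores F(x_i); y: (m,) labels in {-1, +1};
    alphas: the weak-classifier weights. Margins lie in [-1, +1]."""
    return y * F_values / np.sum(alphas)

# The empirical probability that a margin is <= theta is then
# np.mean(margins(F_values, y, alphas) <= theta).
```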
AdaBoost Recap
Summary of Basic AdaBoost
AdaBoost is a sequential algorithm that minimizes an upper bound of
the empirical classification error by selecting the weak classifiers and
their weights. These are “pursued” one-by-one with each one being
selected to maximally reduce the upper bound of error.
AdaBoost defines a distribution of weights over the data samples.
These weights are updated each time a new weak classifier is added
such that samples misclassified by this new weak classifier are given
more weight. In this manner, currently misclassified samples are
emphasized more during the selection of the subsequent weak
classifier.
The empirical error will converge to zero at an exponential rate.
AdaBoost Recap
Practical AdaBoost Advantages
It is fast to evaluate (linear-additive) and can be fast to train
(depending on weak learner).
T is the only parameter to tune.
It is flexible and can be combined with any weak learner.
It is provably effective if it can consistently find the weak classifiers
(that do better than random).
Since it can work with any weak learner, it can handle the gamut of
data.
AdaBoost Recap
AdaBoost Caveats
Performance depends on the data and the weak learner.
It can fail if
The weak classifiers are too complex and overfit.
The weak classifiers are too weak, essentially underfitting.
AdaBoost seems, empirically, to be especially susceptible to uniform
noise.
AdaBoost Recap
The Coordinate Descent View of AdaBoost (from Schapire)
Coordinate Descent
[Breiman]
• {g1, . . . , gN} = space of all weak classifiers
• want to find λ1, . . . , λN to minimize
$$L(\lambda_1, \dots, \lambda_N) = \sum_i \exp\left[-y_i \sum_j \lambda_j g_j(x_i)\right]$$
• AdaBoost is actually doing coordinate descent on this optimization problem:
  • initially, all λj = 0
  • each round: choose one coordinate λj (corresponding to ht) and update (increment by αt)
  • choose update causing biggest decrease in loss
• powerful technique for minimizing over huge space of functions
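A rough sketch of this view (assumed names; the weak-classifier outputs are precomputed into a matrix G with G[i, j] = gj(xi)). Each round forms example weights from the current loss, picks the coordinate with minimum weighted error, and increments it by the closed-form αt:

```python
import numpy as np

def exp_loss(lmbda, G, y):
    """L(lambda) = sum_i exp(-y_i * sum_j lambda_j * g_j(x_i))."""
    return np.sum(np.exp(-y * (G @ lmbda)))

def coordinate_descent_round(lmbda, G, y):
    """One AdaBoost-style round of coordinate descent on exp_loss."""
    w = np.exp(-y * (G @ lmbda))                 # unnormalized example weights
    D = w / w.sum()
    errs = np.array([np.sum(D * (G[:, j] != y)) for j in range(G.shape[1])])
    j = int(np.argmin(errs))                     # weak classifier with minimum weighted error
    alpha = 0.5 * np.log((1 - errs[j]) / max(errs[j], 1e-12))
    lmbda = lmbda.copy()
    lmbda[j] += alpha                            # increment the chosen coordinate by alpha_t
    return lmbda, j, alpha

# The decrease in loss can be checked by comparing exp_loss before and after a round.
```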
AdaBoost for Estimating Conditional Probabilities
Estimating Conditional Probabilities
[Friedman, Hastie & Tibshirani]
• often want to estimate probability that y = +1 given x
• AdaBoost minimizes (empirical version of):
$$E_{x,y}\left[e^{-y f(x)}\right] = E_x\left[P[y = +1 \mid x]\, e^{-f(x)} + P[y = -1 \mid x]\, e^{f(x)}\right]$$
where x, y random from true distribution
• over all f, minimized when
$$f(x) = \frac{1}{2} \ln \frac{P[y = +1 \mid x]}{P[y = -1 \mid x]} \qquad \text{or} \qquad P[y = +1 \mid x] = \frac{1}{1 + e^{-2f(x)}}$$
• so, to convert f output by AdaBoost to probability estimate, use same formula
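A one-line sketch of that conversion (assumed function name), where F is the real-valued output of the boosted classifier:

```python
import numpy as np

def prob_positive(F):
    """P[y = +1 | x] estimated from the boosted score F(x) via 1 / (1 + exp(-2F))."""
    return 1.0 / (1.0 + np.exp(-2.0 * F))
```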
Multiclass AdaBoost
From Schapire’s Slides
Multiclass Problems
[with Freund]
• say y ∈ Y = {1, . . . , k}
• direct approach (AdaBoost.M1): ht : X → Y
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \ne h_t(x_i) \end{cases}$$
$$H_{\text{final}}(x) = \arg\max_{y \in Y} \sum_{t : h_t(x) = y} \alpha_t$$
• can prove same bound on error if ∀t: εt ≤ 1/2
  • in practice, not usually a problem for "strong" weak learners (e.g., C4.5)
  • significant problem for "weak" weak learners (e.g., decision stumps)
• instead, reduce to binary
Multiclass AdaBoost
From Schapire’s Slides
Reducing Multiclass to Binary
[with Singer]
• say possible labels are {a, b, c, d, e}
• each training example replaced by five {−1, +1}-labeled examples:
    (x, c)  →  ((x, a), −1), ((x, b), −1), ((x, c), +1), ((x, d), −1), ((x, e), −1)
• predict with label receiving most (weighted) votes
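A small sketch (assumed names) of this label expansion:

```python
def expand_multiclass(X, y, labels=("a", "b", "c", "d", "e")):
    """Each multiclass example (x, c) becomes one {-1,+1}-labeled example per label:
    ((x, label), +1 if label == c else -1)."""
    expanded = []
    for x, c in zip(X, y):
        for label in labels:
            expanded.append(((x, label), +1 if label == c else -1))
    return expanded

# expand_multiclass(["x1"], ["c"]) yields five binary examples, with (("x1", "c"), +1)
# positive and the other four negative; at test time, the label receiving the most
# (weighted) votes is predicted.
```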
Applications
AdaBoost for Face Detection
Viola and Jones 2000-2002.
Applications
[Figure 7 (from Viola & Jones): output of the face detector on a number of test images from the MIT+CMU test set.]
Applications
AdaBoost for Car and Pedestrian Detection
F. Moutarde, B. Stanciulescu, and A. Breheret. Real-time visual detection of vehicles and pedestrians with new efficient adaBoost features. 2008.
They define a different pixel-level "connected control point" feature. These weak classifiers compute the absolute difference between the sum of pixel values in red and blue areas, with the following rule: if Area(A) − Area(B) > Threshold then True, else False.
[Fig. 1: the Viola & Jones Haar-like features, shown for comparison.]
Applications
[Fig. 4 (Moutarde et al.): some examples of adaBoost-selected Control-Points features for car detection (top) and pedestrian detection (bottom line). Some features operate at the full resolution of the detection window, while others work on half-resolution, or even at quarter-resolution.]
Applications
[Fig. 9 (Moutarde et al.): averaged ROC curves for adaBoost pedestrian classifiers obtained with various feature families.]
References
Sources
These slides have made extensive use of the following sources.
Y. Freund and R. E. Schapire. A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
R. E. Schapire. The boosting approach to machine learning: an overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002.
Schapire's NIPS Tutorial: http://nips.cc/Conferences/2007/Program/schedule.php?Session=Tutorials
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2001.
P. Viola and M. Jones. Fast and Robust Classification Using Asymmetric AdaBoost and a Detector Cascade. In Proceedings of Neural Information Processing Systems (NIPS), 2002.
S.C. Zhu's slides for AdaBoost (UCLA).