Foundations of Machine Learning
Boosting
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Weak Learning
(Kearns and Valiant, 1994)
Definition: concept class C is weakly PAC-learnable if there exists a (weak) learning algorithm L and γ > 0 such that:
• for all δ > 0, for all c ∈ C and all distributions D,
    Pr_{S∼D}[ R(h_S) ≤ 1/2 − γ ] ≥ 1 − δ,
• for samples S of size m = poly(1/δ) for a fixed polynomial.
Boosting Ideas
Finding simple relatively accurate base classifiers often not hard → weak learner.
Main ideas:
• use weak learner to create a strong learner.
• combine base classifiers returned by weak learner (ensemble method).
But, how should the base classifiers be combined?
AdaBoost - Preliminaries
(Freund and Schapire, 1997)
Family of base classifiers: H ⊆ {−1, +1}^X.
Iterative algorithm (T rounds): at each round t ∈ [1, T],
• maintains distribution D_t over the training sample.
• convex combination hypothesis f_t = Σ_{s=1}^t α_s h_s, with α_s ≥ 0 and h_s ∈ H.
Weighted error of hypothesis h at round t:
    Pr_{D_t}[h(x_i) ≠ y_i] = Σ_{i=1}^m D_t(i) 1_{h(x_i)≠y_i}.
AdaBoost
(Freund and Schapire, 1997)
H ⊆ {−1, +1}^X.

AdaBoost(S = ((x_1, y_1), ..., (x_m, y_m)))
 1  for i ← 1 to m do
 2      D_1(i) ← 1/m
 3  for t ← 1 to T do
 4      h_t ← base classifier in H with small error ε_t = Pr_{D_t}[h_t(x_i) ≠ y_i]
 5      α_t ← (1/2) log((1 − ε_t)/ε_t)
 6      Z_t ← 2[ε_t(1 − ε_t)]^{1/2}      ▹ normalization factor
 7      for i ← 1 to m do
 8          D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
 9  f ← Σ_{t=1}^T α_t h_t
10  return h = sgn(f)
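The pseudocode above translates almost line for line into NumPy. Below is a minimal sketch (not code from the slides), assuming a weak_learner(X, y, D) callable that returns a hypothesis h with h(X) ∈ {−1, +1}^m and small weighted error under D; the name and interface of that callable are illustrative assumptions.

    import numpy as np

    def adaboost(X, y, weak_learner, T):
        """Sketch of the pseudocode above; labels y are in {-1, +1}."""
        m = len(y)
        D = np.full(m, 1.0 / m)                      # D_1: uniform distribution
        alphas, hs = [], []
        for t in range(T):
            h = weak_learner(X, y, D)                # line 4: small weighted error under D_t
            pred = h(X)
            eps = D[pred != y].sum()                 # weighted error eps_t
            if eps >= 0.5:                           # no edge left: stop early
                break
            alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))   # line 5
            D = D * np.exp(-alpha * y * pred)        # lines 7-8: up-weight mistakes,
            D = D / D.sum()                          # dividing by the normalizer Z_t
            alphas.append(alpha)
            hs.append(h)
        f = lambda Xnew: sum(a * h(Xnew) for a, h in zip(alphas, hs))
        return lambda Xnew: np.sign(f(Xnew)), f      # h = sgn(f), together with f

Plugging in the stump learner sketched later in this lecture as weak_learner gives boosting stumps.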
Notes
Distributions D_t over the training sample:
• originally uniform.
• at each round, the weight of a misclassified example is increased.
• observation: D_{t+1}(i) = e^{−y_i f_t(x_i)} / (m ∏_{s=1}^t Z_s), since
    D_{t+1}(i) = D_t(i) e^{−α_t y_i h_t(x_i)} / Z_t
               = D_{t−1}(i) e^{−α_{t−1} y_i h_{t−1}(x_i)} e^{−α_t y_i h_t(x_i)} / (Z_{t−1} Z_t)
               = (1/m) e^{−y_i Σ_{s=1}^t α_s h_s(x_i)} / ∏_{s=1}^t Z_s.
Weight assigned to base classifier h_t: α_t directly depends on the accuracy of h_t at round t.
Illustration
[Figure: base classifiers selected at rounds t = 1, 2, 3, ..., combined into the final classifier α_1 h_1 + α_2 h_2 + α_3 h_3 + ···]
Bound on Empirical Error
(Freund and Schapire, 1997)
Theorem: the empirical error of the classifier output by AdaBoost verifies:
    R̂(h) ≤ exp(−2 Σ_{t=1}^T (1/2 − ε_t)²).
• If further, for all t ∈ [1, T], γ ≤ (1/2 − ε_t), then
    R̂(h) ≤ exp(−2γ²T).
• γ does not need to be known in advance: adaptive boosting.
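As a quick sanity check of the second bound (a small sketch; γ and m are arbitrary values): once exp(−2γ²T) drops below 1/m, the empirical error, which is a multiple of 1/m, must be exactly zero.

    import numpy as np

    gamma, m = 0.05, 1000
    # smallest T with exp(-2 gamma^2 T) < 1/m, i.e. zero training error
    T = int(np.ceil(np.log(m) / (2 * gamma ** 2)))
    print(T, np.exp(-2 * gamma ** 2 * T))    # 1382, ~0.000997 < 1/1000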
• Proof: since, as we saw, D_{T+1}(i) = e^{−y_i f(x_i)} / (m ∏_{t=1}^T Z_t),
    R̂(h) = (1/m) Σ_{i=1}^m 1_{y_i f(x_i) ≤ 0}
         ≤ (1/m) Σ_{i=1}^m exp(−y_i f(x_i))
         = (1/m) Σ_{i=1}^m [m ∏_{t=1}^T Z_t] D_{T+1}(i) = ∏_{t=1}^T Z_t.
• Now, since Z_t is a normalization factor,
    Z_t = Σ_{i=1}^m D_t(i) e^{−α_t y_i h_t(x_i)}
        = Σ_{i: y_i h_t(x_i) ≥ 0} D_t(i) e^{−α_t} + Σ_{i: y_i h_t(x_i) < 0} D_t(i) e^{α_t}
        = (1 − ε_t) e^{−α_t} + ε_t e^{α_t}
        = (1 − ε_t) √(ε_t/(1 − ε_t)) + ε_t √((1 − ε_t)/ε_t) = 2 √(ε_t(1 − ε_t)).
• Thus,
    ∏_{t=1}^T Z_t = ∏_{t=1}^T 2 √(ε_t(1 − ε_t)) = ∏_{t=1}^T √(1 − 4(1/2 − ε_t)²)
                  ≤ ∏_{t=1}^T exp(−2(1/2 − ε_t)²) = exp(−2 Σ_{t=1}^T (1/2 − ε_t)²).
• Notes:
• α_t is the minimizer of α ↦ (1 − ε_t)e^{−α} + ε_t e^{α}.
• since (1 − ε_t)e^{−α_t} = ε_t e^{α_t} = √(ε_t(1 − ε_t)), at each round, AdaBoost assigns the same probability mass to correctly classified and misclassified instances.
• for base classifiers x ↦ h_t(x) ∈ [−1, +1], α_t can be similarly chosen to minimize Z_t.
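A small numerical sketch of the first note (the value of ε_t is arbitrary): AdaBoost's weight α_t = (1/2) log((1 − ε_t)/ε_t) minimizes α ↦ (1 − ε_t)e^{−α} + ε_t e^{α}, and the minimum value is Z_t = 2√(ε_t(1 − ε_t)).

    import numpy as np

    eps = 0.3                                        # weighted error of h_t (arbitrary)
    alphas = np.linspace(0.0, 2.0, 2001)
    Z = (1 - eps) * np.exp(-alphas) + eps * np.exp(alphas)
    alpha_star = 0.5 * np.log((1 - eps) / eps)       # AdaBoost's choice
    print(alphas[Z.argmin()], alpha_star)            # both approximately 0.42
    print(Z.min(), 2 * np.sqrt(eps * (1 - eps)))     # both approximately 0.917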
AdaBoost = Coordinate Descent
Objective function: convex and differentiable,
    F(α) = Σ_{i=1}^m e^{−y_i f(x_i)} = Σ_{i=1}^m e^{−y_i Σ_{t=1}^T α_t h_t(x_i)}.
[Figure: the exponential loss x ↦ e^{−x} upper bounds the 0-1 loss.]
• Direction: unit vector e_t with
    e_t = argmin_t dF(ᾱ_{t−1} + η e_t)/dη |_{η=0}.
• Since F(ᾱ_{t−1} + η e_t) = Σ_{i=1}^m e^{−y_i Σ_{s=1}^{t−1} ᾱ_s h_s(x_i)} e^{−η y_i h_t(x_i)},
    dF(ᾱ_{t−1} + η e_t)/dη |_{η=0}
        = −Σ_{i=1}^m y_i h_t(x_i) e^{−y_i Σ_{s=1}^{t−1} ᾱ_s h_s(x_i)}
        = −Σ_{i=1}^m y_i h_t(x_i) D_t(i) [m ∏_{s=1}^{t−1} Z_s]
        = −[(1 − ε_t) − ε_t] [m ∏_{s=1}^{t−1} Z_s] = [2ε_t − 1] m ∏_{s=1}^{t−1} Z_s.
Thus, direction corresponding to base classifier with smallest error.
• Step size: obtained via dF(ᾱ_{t−1} + η e_t)/dη = 0:
    −Σ_{i=1}^m y_i h_t(x_i) e^{−y_i Σ_{s=1}^{t−1} ᾱ_s h_s(x_i)} e^{−η y_i h_t(x_i)} = 0
    ⇔ −Σ_{i=1}^m y_i h_t(x_i) D_t(i) [m ∏_{s=1}^{t−1} Z_s] e^{−η y_i h_t(x_i)} = 0
    ⇔ −Σ_{i=1}^m y_i h_t(x_i) D_t(i) e^{−η y_i h_t(x_i)} = 0
    ⇔ −[(1 − ε_t) e^{−η} − ε_t e^{η}] = 0
    ⇔ η = (1/2) log((1 − ε_t)/ε_t).
Thus, step size matches base classifier weight of AdaBoost.
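The derivation can also be checked numerically on a toy problem. In this sketch (all data randomly generated, not from the slides), the coordinate-wise minimizer of the exponential loss is found by grid search and compared to the closed-form step (1/2) log((1 − ε_t)/ε_t).

    import numpy as np

    rng = np.random.default_rng(0)
    m = 50
    y = rng.choice([-1.0, 1.0], size=m)        # toy labels
    f_prev = rng.normal(size=m)                # current ensemble scores f_{t-1}(x_i)
    h = rng.choice([-1.0, 1.0], size=m)        # candidate base classifier values h_t(x_i)

    w = np.exp(-y * f_prev)                    # unnormalized AdaBoost weights
    D = w / w.sum()                            # distribution D_t
    eps = D[h != y].sum()                      # weighted error eps_t

    F = lambda eta: np.exp(-y * (f_prev + eta * h)).sum()   # loss along this coordinate
    etas = np.linspace(-2, 2, 4001)
    eta_grid = etas[np.argmin([F(e) for e in etas])]
    eta_closed = 0.5 * np.log((1 - eps) / eps)
    print(eta_grid, eta_closed)                # agree up to the grid resolution (1e-3)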
Alternative Loss Functions
[Figure: the following losses plotted as functions of the margin x = y f(x).]
• boosting (exponential) loss: x ↦ e^{−x}
• logistic loss: x ↦ log₂(1 + e^{−x})
• hinge loss: x ↦ max(1 − x, 0)
• square loss: x ↦ (1 − x)² 1_{x≤1}
• zero-one loss: x ↦ 1_{x<0}
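For concreteness, a small sketch evaluating these losses at a few margin values (the values are arbitrary); each convex surrogate upper bounds the zero-one loss.

    import numpy as np

    losses = {
        "zero-one": lambda x: (x < 0).astype(float),
        "boosting": lambda x: np.exp(-x),
        "logistic": lambda x: np.log2(1 + np.exp(-x)),
        "hinge":    lambda x: np.maximum(1 - x, 0),
        "square":   lambda x: (1 - x) ** 2 * (x <= 1),
    }
    xs = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
    for name, loss in losses.items():
        print(f"{name:9s}", np.round(loss(xs), 3))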
Standard Use in Practice
Base learners: decision trees, quite often just decision stumps (trees of depth one).
Boosting stumps:
• data in R^N, e.g., N = 2, (height(x), weight(x)).
• associate a stump to each component.
• pre-sort each component: O(Nm log m).
• at each round, find best component and threshold (see the sketch below).
• total complexity: O((m log m)N + mNT).
• stumps not weak learners: think XOR example!
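A minimal sketch of such a stump weak learner (written from the description above, not taken from the slides): for each component, the values are sorted and every threshold is scored with cumulative sums, so each call costs O(Nm log m), or O(Nm) per round if the sorting orders are pre-computed once and reused as suggested above.

    import numpy as np

    def best_stump(X, y, D):
        """Stump with smallest weighted error under the distribution D (sums to 1):
        predict s if x_j > theta, else -s, with s in {-1, +1}."""
        m, N = X.shape
        best_err, best = np.inf, None
        for j in range(N):
            order = np.argsort(X[:, j], kind="mergesort")   # in practice: pre-sorted once
            xs, ds, ys = X[order, j], D[order], y[order]
            pos = np.concatenate(([0.0], np.cumsum(ds * (ys == 1))))
            neg = np.concatenate(([0.0], np.cumsum(ds * (ys == -1))))
            for k in range(m + 1):              # first k sorted points fall below the threshold
                if 0 < k < m and xs[k - 1] == xs[k]:
                    continue                    # cannot place a threshold between equal values
                err_plus = pos[k] + (neg[m] - neg[k])   # polarity s = +1: predict +1 above theta
                for s, err in ((+1, err_plus), (-1, 1.0 - err_plus)):
                    if err < best_err:
                        theta = (-np.inf if k == 0 else np.inf if k == m
                                 else 0.5 * (xs[k - 1] + xs[k]))
                        best_err, best = err, (j, theta, s)
        j, theta, s = best
        return lambda Xnew: s * np.where(Xnew[:, j] > theta, 1, -1), best_err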
Overfitting?
Assume that VCdim(H) = d and, for a fixed T, define
    F_T = { sgn(Σ_{t=1}^T α_t h_t + b) : α_t, b ∈ R, h_t ∈ H }.
F_T can form a very rich family of classifiers. It can be shown (Freund and Schapire, 1997) that:
    VCdim(F_T) ≤ 2(d + 1)(T + 1) log₂((T + 1)e).
This suggests that AdaBoost could overfit for large values of T, and that is in fact observed in some cases, but in various others it is not!
Empirical Observations
Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore:
[Figure: training and test error curves as a function of the number of rounds, and the margin distribution graph, for C4.5 decision trees on the letter dataset, as reported by Schapire et al. (1998).]
Rademacher Complexity of Convex Hulls
Theorem: let H be a set of functions mapping from X to R. Let the convex hull of H be defined as
    conv(H) = { Σ_{k=1}^p μ_k h_k : p ≥ 1, μ_k ≥ 0, Σ_{k=1}^p μ_k ≤ 1, h_k ∈ H }.
Then, for any sample S, R̂_S(conv(H)) = R̂_S(H).
Proof:
    R̂_S(conv(H)) = (1/m) E_σ[ sup_{h_k ∈ H, μ ≥ 0, ‖μ‖₁ ≤ 1} Σ_{i=1}^m σ_i Σ_{k=1}^p μ_k h_k(x_i) ]
                  = (1/m) E_σ[ sup_{h_k ∈ H} sup_{μ ≥ 0, ‖μ‖₁ ≤ 1} Σ_{k=1}^p μ_k Σ_{i=1}^m σ_i h_k(x_i) ]
                  = (1/m) E_σ[ sup_{h_k ∈ H} max_{k ∈ [1,p]} Σ_{i=1}^m σ_i h_k(x_i) ]
                  = (1/m) E_σ[ sup_{h ∈ H} Σ_{i=1}^m σ_i h(x_i) ] = R̂_S(H).
Margin Bound - Ensemble Methods
(Koltchinskii and Panchenko, 2002)
Corollary: let H be a set of real-valued functions. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ, the following holds for all h ∈ conv(H):
    R(h) ≤ R̂_ρ(h) + (2/ρ) R_m(H) + √(log(1/δ)/(2m)),
    R(h) ≤ R̂_ρ(h) + (2/ρ) R̂_S(H) + 3 √(log(2/δ)/(2m)).
Proof: direct consequence of the margin bound of Lecture 4 and R̂_S(conv(H)) = R̂_S(H).
Margin Bound - Ensemble Methods
(Koltchinskii and Panchenko, 2002); see also (Schapire et al., 1998)
Corollary: let H be a family of functions taking values in {−1, +1} with VC dimension d. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ, the following holds for all h ∈ conv(H):
    R(h) ≤ R̂_ρ(h) + (2/ρ) √(2d log(em/d) / m) + √(log(1/δ)/(2m)).
Proof: follows directly from the previous corollary and the VC-dimension bound on the Rademacher complexity (see Lecture 3).
Notes
All of these bounds can be generalized to hold uniformly for all ρ ∈ (0, 1), at the cost of an additional term √(log log₂(2/ρ) / m) and other minor constant factor changes (Koltchinskii and Panchenko, 2002).
For AdaBoost, the bound applies to the functions
    x ↦ f(x)/‖α‖₁ = Σ_{t=1}^T α_t h_t(x) / ‖α‖₁ ∈ conv(H).
Note that T does not appear in the bound.
Margin Distribution
Theorem: for any ρ > 0, the following holds:
    Pr[ y f(x)/‖α‖₁ ≤ ρ ] ≤ 2^T ∏_{t=1}^T √( ε_t^{1−ρ} (1 − ε_t)^{1+ρ} ).
Proof: using the identity D_{T+1}(i) = e^{−y_i f(x_i)} / (m ∏_{t=1}^T Z_t),
    (1/m) Σ_{i=1}^m 1_{y_i f(x_i) − ρ‖α‖₁ ≤ 0}
        ≤ (1/m) Σ_{i=1}^m exp(−y_i f(x_i) + ρ‖α‖₁)
        = (1/m) e^{ρ‖α‖₁} Σ_{i=1}^m [ m ∏_{t=1}^T Z_t ] D_{T+1}(i)
        = e^{ρ‖α‖₁} ∏_{t=1}^T Z_t = e^{ρ Σ_{t=1}^T α_t} 2^T ∏_{t=1}^T √(ε_t(1 − ε_t))
        = 2^T ∏_{t=1}^T √( ε_t^{1−ρ} (1 − ε_t)^{1+ρ} ),
using e^{ρ α_t} = [(1 − ε_t)/ε_t]^{ρ/2} in the last step.
Notes
If for all t ∈ [1, T], γ ≤ (1/2 − ε_t), then the upper bound can be bounded by
    Pr[ y f(x)/‖α‖₁ ≤ ρ ] ≤ [ (1 − 2γ)^{1−ρ} (1 + 2γ)^{1+ρ} ]^{T/2}.
For ρ < γ, (1 − 2γ)^{1−ρ}(1 + 2γ)^{1+ρ} < 1 and the bound decreases exponentially in T.
For the bound to be convergent: ρ ≫ O(1/√m); thus γ ≫ O(1/√m) is roughly the condition on the edge value.
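A quick numeric sketch of this behavior (γ, T, and the values of ρ are arbitrary): the base (1 − 2γ)^{1−ρ}(1 + 2γ)^{1+ρ} is below 1 when ρ < γ, so the fraction of points with margin at most ρ vanishes as T grows, while for ρ sufficiently larger than γ the bound becomes vacuous.

    import numpy as np

    gamma, T = 0.1, 500
    for rho in (0.05, 0.08, 0.12):
        base = (1 - 2 * gamma) ** (1 - rho) * (1 + 2 * gamma) ** (1 + rho)
        print(rho, round(base, 4), base ** (T / 2))
    # rho = 0.05 -> base ~ 0.980, bound ~ 6e-3
    # rho = 0.08 -> base ~ 0.992, bound ~ 0.12
    # rho = 0.12 -> base ~ 1.008, bound > 1 (vacuous)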
But, Does AdaBoost Maximize the Margin?
No: AdaBoost may converge to a margin that is significantly below the maximum margin (Rudin et al., 2004) (e.g., 1/3 instead of 3/8)!
Lower bound: AdaBoost can achieve asymptotically a margin that is at least ρ_max/2 if the data is separable and under some conditions on the base learners (Rätsch and Warmuth, 2002).
Several boosting-type margin-maximization algorithms: but, performance in practice not clear or not reported.
Outliers
AdaBoost assigns larger weights to harder examples.
Application:
• detecting mislabeled examples.
• dealing with noisy data: regularization based on the average weight assigned to a point (soft margin idea for boosting) (Meir and Rätsch, 2003).
AdaBoost’s Weak Learning Condition
Definition: the edge of a base classifier h_t for a distribution D over the training sample is
    γ(t) = 1/2 − ε_t = (1/2) Σ_{i=1}^m y_i h_t(x_i) D(i).
Condition: there exists γ > 0 such that, for any distribution D over the training sample and any base classifier, γ(t) ≥ γ.
Zero-Sum Games
Definition:
• payoff matrix M = (M_ij) ∈ R^{m×n}.
• m possible actions (pure strategies) for the row player.
• n possible actions for the column player.
• M_ij: payoff for the row player (= loss for the column player) when row plays i, column plays j.
Example (rock-paper-scissors, payoffs for the row player):
                rock   paper   scissors
    rock          0     -1        1
    paper         1      0       -1
    scissors     -1      1        0
Mixed Strategies
(von Neumann, 1928)
Definition: player row selects a distribution p over the rows, player column a distribution q over the columns. The expected payoff for row is
    E_{i∼p, j∼q}[M_ij] = Σ_{i=1}^m Σ_{j=1}^n p_i M_ij q_j = p⊤ M q.
von Neumann's minimax theorem:
    max_p min_q p⊤ M q = min_q max_p p⊤ M q.
• equivalent form:
    max_p min_{j∈[1,n]} p⊤ M e_j = min_q max_{i∈[1,m]} e_i⊤ M q.
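The value of a finite zero-sum game and an optimal mixed strategy can be computed with a small linear program. The sketch below (using SciPy, with the rock-paper-scissors matrix above) maximizes v subject to p⊤ M e_j ≥ v for every column j; the optimal row strategy is uniform and the value of the game is 0.

    import numpy as np
    from scipy.optimize import linprog

    M = np.array([[ 0., -1.,  1.],     # payoff matrix for the row player
                  [ 1.,  0., -1.],     # (rock, paper, scissors)
                  [-1.,  1.,  0.]])
    m, n = M.shape

    c = np.zeros(m + 1); c[-1] = -1.0            # variables (p_1..p_m, v); maximize v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])    # v - (p^T M)_j <= 0 for every column j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]   # p sums to 1
    b_eq = [1.0]
    bounds = [(0, 1)] * m + [(None, None)]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    print(np.round(res.x[:m], 3), round(res.x[-1], 6))    # [0.333 0.333 0.333] ~0.0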
John von Neumann (1903 - 1957)
AdaBoost and Game Theory
Game:
• Player A: selects point x_i, i ∈ [1, m].
• Player B: selects base learner h_t, t ∈ [1, T].
• Payoff matrix M ∈ {−1, +1}^{m×T}: M_it = y_i h_t(x_i).
von Neumann's theorem: assume finite H. Then
    2γ = min_D max_{h∈H} Σ_{i=1}^m D(i) y_i h(x_i) = max_α min_{i∈[1,m]} y_i Σ_{t=1}^T α_t h_t(x_i) / ‖α‖₁ = ρ.
Consequences
Weak learning condition = non-zero margin:
• thus, possible to search for non-zero margin.
• AdaBoost = (suboptimal) search for corresponding α.
Weak learning = strong condition:
• the condition implies linear separability with margin 2γ > 0.
Linear Programming Problem
Maximizing the margin:
    ρ = max_α min_{i∈[1,m]} y_i (α · x_i) / ‖α‖₁.
This is equivalent to the following convex optimization (LP) problem:
    max_{α,ρ} ρ
    subject to: y_i (α · x_i) ≥ ρ, i ∈ [1, m]; ‖α‖₁ = 1.
Note that |α · x| / ‖α‖₁ = d_∞(x, H), with H = {x : α · x = 0}.
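This LP can be solved directly with an off-the-shelf solver. The sketch below uses SciPy and a made-up 3×3 matrix of signed predictions B[i, t] = y_i h_t(x_i) (each base classifier errs on exactly one point); it assumes α ≥ 0, which is without loss of generality when H is closed under negation, so that ‖α‖₁ = Σ_t α_t is a linear constraint.

    import numpy as np
    from scipy.optimize import linprog

    B = np.array([[+1., +1., -1.],     # B[i, t] = y_i h_t(x_i), made-up toy values
                  [+1., -1., +1.],
                  [-1., +1., +1.]])
    m, T = B.shape

    c = np.zeros(T + 1); c[-1] = -1.0                      # variables (alpha, rho); maximize rho
    A_ub = np.hstack([-B, np.ones((m, 1))])                # rho - y_i (alpha . x_i) <= 0
    b_ub = np.zeros(m)
    A_eq = np.concatenate([np.ones(T), [0.0]])[None, :]    # ||alpha||_1 = sum_t alpha_t = 1
    b_eq = [1.0]
    bounds = [(0, None)] * T + [(None, None)]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    print(np.round(res.x[:T], 3), round(res.x[-1], 3))     # alpha = (1/3, 1/3, 1/3), rho = 1/3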
Advantages of AdaBoost
Simple: straightforward implementation.
Efficient: complexity O(mNT) for stumps:
• when N and T are not too large, the algorithm is quite fast.
Theoretical guarantees: but still many questions:
• AdaBoost not designed to maximize margin.
• regularized versions of AdaBoost.
Weaker Aspects
Parameters:
• need to determine T, the number of rounds of boosting: stopping criterion.
• need to determine base learners: risk of overfitting or low margins.
Noise: severely damages the accuracy of AdaBoost (Dietterich, 2000).
Other Boosting Algorithms
arc-gv (Breiman, 1996): designed to maximize the margin, but outperformed by AdaBoost in experiments (Reyzin and Schapire, 2006).
L1-regularized AdaBoost (Rätsch et al., 2001): outperforms AdaBoost in experiments (Cortes et al., 2014).
DeepBoost (Cortes et al., 2014): more favorable learning guarantees; outperforms both AdaBoost and L1-regularized AdaBoost.
References
• Corinna Cortes, Mehryar Mohri, and Umar Syed. Deep boosting. In ICML, pages 262-270, 2014.
• Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
• Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
• G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, pages 447-454, 2001.
• Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.
• J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295-320, 1928.
• Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2):139-158, 2000.
• Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5:1557-1595, 2004.
• Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), pages 334-350, Sydney, Australia, July 2002.
• Lev Reyzin and Robert E. Schapire. How boosting the margin can also boost classifier complexity. In ICML, pages 753-760, 2006.
• Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
• Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
• Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.
L1 Margin Definitions
Definition: the margin of a point x with label y is the ℓ_∞ algebraic distance of x = [h_1(x), ..., h_T(x)]⊤ to the hyperplane α · x = 0:
    ρ(x) = y f(x) / ‖α‖₁ = y Σ_{t=1}^T α_t h_t(x) / ‖α‖₁ = y (α · x) / ‖α‖₁.
Definition: the margin of the classifier for a sample S = (x_1, ..., x_m) is the minimum margin of the points in that sample:
    ρ = min_{i∈[1,m]} y_i (α · x_i) / ‖α‖₁.
• Note:
• SVM margin: ρ = min_{i∈[1,m]} y_i (w · Φ(x_i)) / ‖w‖₂.
• Boosting margin: ρ = min_{i∈[1,m]} y_i (α · H(x_i)) / ‖α‖₁, with H(x) = [h_1(x), ..., h_T(x)]⊤.
• Distances: the ‖·‖_q distance of a point x to the hyperplane w · x + b = 0 is |w · x + b| / ‖w‖_p, with 1/p + 1/q = 1.
• Maximum-margin solutions with different norms.
[Figure: maximum-margin separating hyperplanes for the norm ‖·‖₂ and for the norm ‖·‖∞.]
• AdaBoost gives an approximate solution for the LP problem.