Learning with Deep Cascades
Giulia DeSalvo (2), Mehryar Mohri (1,2), and Umar Syed (1)
(1) Google Research
(2) Courant Institute of Mathematical Sciences
September 28, 2015
Introduction
Decision Trees: trees with indicator functions at each
internal node and assignment functions at each leaf.

Figure: Decision Tree (node questions x > 10 and x < 20; leaf labels +1, −1, +1)
[Breiman et al., 1984; Quinlan, 1986]
Introduction
Deep Cascades are a significantly broader family of decision trees:
the node questions q_j and the leaf classifiers h_k may belong to
different hypothesis sets.

f := Σ_{k=1}^{l} ( Π_{j=1}^{d_k} q_j ) h_k

Used to tackle harder tasks.
Figure: Deep Cascade with node questions q1(x), q2(x) and leaf classifiers h1(x), h2(x), h3(x); here f = q̄1 h1 + q1 q̄2 h2 + q1 q2 h3, where q̄ denotes the complement of q.
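A minimal sketch, in Python, of how a cascade of this form could be evaluated (illustrative only, not the authors' implementation; node questions q_j are assumed to be 0/1-valued callables, leaf classifiers h_k to be ±1-valued callables, and each leaf is described by the questions along its root-leaf path):

    from typing import Callable, List, Tuple

    Question = Callable[[float], int]    # q_j(x) in {0, 1}
    Classifier = Callable[[float], int]  # h_k(x) in {-1, +1}

    def cascade_predict(leaves: List[Tuple[List[Question], Classifier]], x: float) -> int:
        """Evaluate f(x) = sum_k (prod_j q_j(x)) * h_k(x); each leaf k is given as
        (questions q_1, ..., q_{d_k} along its root-leaf path, classifier h_k)."""
        f = 0
        for path_questions, h in leaves:
            gate = 1
            for q in path_questions:
                gate *= q(x)      # product of 0/1 node questions
                if gate == 0:     # leaf k is not reached by x
                    break
            f += gate * h(x)
        return f

    # Toy example mirroring the figure (dummy leaf classifiers always predicting +1):
    q1 = lambda x: int(x > 10)
    q2 = lambda x: int(x < 20)
    h = lambda x: 1
    leaves = [([lambda x: 1 - q1(x)], h),      # leaf 1, reached when q1 = 0
              ([q1, lambda x: 1 - q2(x)], h),  # leaf 2
              ([q1, q2], h)]                   # leaf 3
    print(cascade_predict(leaves, 15.0))       # -> 1

In a well-formed cascade the path products are mutually exclusive, so exactly one leaf classifier determines the prediction for each x.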
Motivation
Challenge: H = ∪_{i=1}^{p} H_i could be very complex
→ learning is prone to overfitting.

Figure: The hypothesis set H as the union of subfamilies H_1, H_2, H_3, H_4, ..., H_p
Goal: Would it be possible to learn with questions of
varying complexity at each node and yet not overfit?
Generalization Bounds
With high probability, the following holds for all deep cascade functions f:

R(f) ≤ R̂_S(f) + Σ_{k=1}^{l} min( 4 R̂_S(H_k), m_k^+/m ) + Õ( l √(log(pl)/m) )

Definition: Deep Cascade f := Σ_{k=1}^{l} ( Π_{j=1}^{d_k} q_j ) h_k.

H_k is the Root-Leaf k hypothesis class: functions of the form Π_{j=1}^{d_k} q_j h_k.
Generalization Bounds
With high probability, the following holds for all deep cascade functions f:

R(f) ≤ R̂_S(f) + Σ_{k=1}^{l} min( 4 R̂_S(H_k), m_k^+/m ) + Õ( l √(log(pl)/m) )

R̂_S(H_k): Rademacher complexity of the Root-Leaf k class.
m_k^+/m: fraction of correctly classified points at leaf k.
Generalization Bounds
Trivial Case:
If the tree has only one node, then the bound is not interesting:

R(f) ≤ R̂_S(f) + 4 R̂_S(H)   if m^+/m ≥ 4 R̂_S(H)
R(f) ≤ 1                    if m^+/m < 4 R̂_S(H)
Generalization Bounds
If the tree has more than one node, the critical quantity is the
balance between the class complexities and the fractions of
correctly classified points at each leaf:

R(f) ≤ R̂_S(f) + Σ_{k=1}^{l} min( 4 R̂_S(H_k), m_k^+/m )
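As an illustration with made-up numbers (not from the paper), suppose a cascade has two leaves: a simple one with R̂_S(H_1) = 0.01 that correctly classifies 60% of the sample, and a complex one with R̂_S(H_2) = 0.2 that correctly classifies only 5% of the sample. Then

Σ_{k=1}^{2} min( 4 R̂_S(H_k), m_k^+/m ) = min(0.04, 0.60) + min(0.80, 0.05) = 0.04 + 0.05 = 0.09,

so a complex hypothesis class is tolerated at a leaf provided only a small fraction of the sample is correctly classified there.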
Previous work
Generalization bounds for decision trees:
Mansour & McAllester 2000, Nobel 2002, Golea et al. 1997, Scott & Nowak 2005.
Cascades in object detection:
Viola & Jones 2004, Saberian 2010, Lefakis et al. 2010, Chen et al. 2012, etc.
Decision tree algorithms that use SVMs:
Bennett & Blue 1998.
Multi-class classification: Dong 2008, Madjarov et al. 2012, Takahashi et al. 2002.
Computational speedups: Arreola et al. 2006, Chang 2010, Kumar 2010, Rodriguez-Lujan et al. 2012.
Assumptions of Algorithm
The parameter µ controls the fraction of points that proceed
deeper into the tree
⟹ find the best tradeoff between the complexity term and the
fraction of points at each node.

Figure: Topology (at each node, a µ fraction of the points proceeds to the next node and the remaining 1 − µ fraction is classified at that node's leaf)
Deep Cascade Algorithm
1. Generate different deep cascade functions by using SVMs at
each leaf of the cascade.
2. Choose the best among them by minimizing an upper bound on
the generalization guarantee.
1. Generating Deep Cascade Function
Initially, train using SVMs with a polynomial kernel.
Extract the µ fraction of points closest to the hyperplane;
at the next node, retrain on this extracted subsample using
SVMs with a polynomial kernel of another degree.

Figure: Node question q1 (branches q1 = 1 and q1 = 0), the hyperplane h1, and the µ1 fraction of points closest to the hyperplane
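A minimal sketch of this node-splitting step in Python with scikit-learn (illustrative; the function name split_node and the handling of µ are assumptions, not the authors' implementation):

    import numpy as np
    from sklearn.svm import SVC

    def split_node(X, y, degree, mu, C=1.0):
        """Train a polynomial-kernel SVM at one cascade node. Returns the
        leaf classifier h and the indices of the mu fraction of points
        closest to its hyperplane, which proceed to the next node."""
        h = SVC(kernel="poly", degree=degree, C=C).fit(X, y)
        dist = np.abs(h.decision_function(X))   # distance to the hyperplane (up to scaling)
        n_pass = int(mu * len(X))
        pass_idx = np.argsort(dist)[:n_pass]    # closest points go deeper
        return h, pass_idx

The cascade is then grown by calling such a routine repeatedly on the extracted subsample, with a different polynomial degree at each depth; the points not passed on are classified by the hyperplane at that node's leaf.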
1. Generating Deep Cascade Function
Figure: Growing the cascade: Node 1 with question q1 and Leaf 1 with classifier h1 (branches q1 = 1 and q1 = 0), then Node 2 with question q2 and Leaf 2 with classifier h2, and so on; at each node, the µ fraction of points closest to the hyperplane proceeds to the next node.
2. Choose the best Deep Cascade
Choose the deep cascade function f* that minimizes the following:

f* ∈ argmin_{f ∈ F} { R̂_S(f) + Σ_{k=1}^{l} min( 4 A_k, m_k^+/m ) },

where R̂_S(H_k) ≤ A_k = Σ_{j=1}^{d_k} √( d_{f,j} log(em/d_{f,j}) / m ) + √( d_{f,k} log(em/d_{f,k}) / m ),

with d_{f,j} = VCdim(q_j) and d_{f,k} = VCdim(h_k).
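A minimal sketch of this selection step (illustrative; it assumes each candidate cascade exposes, per leaf k, the VC dimensions along its root-leaf path and the number of correctly classified points, which is not necessarily how the authors structured their code):

    import numpy as np

    def complexity_bound(vc_dims_path, m):
        """A_k: upper bound on the Rademacher complexity of the Root-Leaf k class,
        from the VC dimensions of the node questions along the path and of the
        leaf classifier (passed as the last entry of vc_dims_path)."""
        return sum(np.sqrt(d * np.log(np.e * m / d) / m) for d in vc_dims_path)

    def bound_objective(train_error, leaves, m):
        """Quantity minimized to select a cascade: empirical error plus, per leaf,
        the smaller of 4*A_k and the fraction of correctly classified points.
        `leaves` is a list of (vc_dims_path, n_correct_at_leaf) pairs."""
        total = train_error
        for vc_dims_path, n_correct in leaves:
            A_k = complexity_bound(vc_dims_path, m)
            total += min(4.0 * A_k, n_correct / m)
        return total

The selected cascade is then the candidate with the smallest objective, e.g. f_star = min(candidates, key=lambda c: bound_objective(c.err, c.leaves, m)).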
Experimental Results
Dataset        SVM Algorithm        Deep Cascade Algorithm
breastcancer   0.0426 ± 0.0117      0.0353 ± 0.00975
german         0.297  ± 0.0193      0.256  ± 0.0324
splice         0.205  ± 0.0134      0.175  ± 0.0152
ionosphere     0.0971 ± 0.0167      0.117  ± 0.0229
a1a            0.195  ± 0.0217      0.209  ± 0.0233
Next Steps
Use more than the simplest form of the bound.
Algorithm implementation:
optimizing the regularization parameter C at each node;
searching over larger sets of µ fraction values and polynomial degrees.
Apply the bounds to algorithms that use decision trees (e.g. Random Forests).
Other algorithms could be inspired by this theoretical analysis.
Conclusion
1. Introduced a broad learning model formed by cascades of
predictors: Deep Cascades.
2. New data-dependent theoretical guarantees for Deep
Cascades in terms of:
Rademacher complexities of the families composing the
cascade
Fraction of sample points correctly classified at each leaf.
3. Novel algorithm that benefits from the guarantees.
Thank you!
Generalization Guarantee of Deep Cascades
Theorem 1
Fix ρ > 0. Assume that for all k ∈ [1, l], the functions in H_k take
values in {−1, +1}. Then, for any δ > 0, with probability at least
1 − δ over the choice of a sample S of size m ≥ 1, the following
holds for all l ≥ 1 and all cascade functions f ∈ T_l defined by (t, h):
R(f )  RbS (f ) +
k=1
+⌘
⇣
b S (Hk ), mk
min 4R
m
X h m+
k
+ min
L✓K
|L| |K|
l
X
1
⇢
k2L
where C (m, p, ⇢) =
2
⇢
q
m
log pl
m
i
b S (Hk ) + C (m, p, ⇢) +
4R
+
q
log pl
⇢2 m
log
⇥
⇢2 m ⇤
log pl
e
=O
s
⇣ q
1
⇢
log 4
,
2m
log pl
m
⌘
.
Proof Layout
1. For α ∈ int(Δ), define for all x ∈ X:
   g_α(x) = Σ_{k=1}^{l} α_k t(x, k) h_k(x)
   → sgn(f) coincides with sgn(g_α).

2. Apply the learning bound for convex ensembles [Cortes et al., 2014]:

   R(f) ≤ inf_{α ∈ int(Δ)} [ R̂_{S,ρ}(g_α) + (4/ρ) Σ_{k=1}^{l} α_k R̂_S(H_k) ] + Õ(·)

3. Crux: remove α to derive an explicit bound.

   A(α) = (1/m) Σ_{k=1}^{l} Σ_{i: t(x_i, k)=1} 1_{y_i α_k h_k(x_i) < ρ} + (4/ρ) Σ_{k=1}^{l} α_k R̂_S(H_k)

   Observe that A can be decoupled as a sum over the leaves, solve the
   optimization problem, and simplify.
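A rough sketch of the per-leaf trade-off behind step 3 (my reading of the decoupling, not the authors' exact derivation): for a single leaf k, with h_k taking values in {−1, +1}, the two extreme choices of α_k give

α_k = 0:    (1/m) Σ_{i: t(x_i,k)=1} 1_{0 < ρ} = m_k/m = m_k^-/m + m_k^+/m,
α_k ≥ ρ:    (1/m) Σ_{i: t(x_i,k)=1} 1_{y_i α_k h_k(x_i) < ρ} + (4/ρ) α_k R̂_S(H_k) ≥ m_k^-/m + 4 R̂_S(H_k),

and since R̂_S(f) already accounts for the misclassified fraction m_k^-/m at each leaf, leaf k contributes at most min( 4 R̂_S(H_k), m_k^+/m ) beyond R̂_S(f). The simplex constraint on α prevents choosing α_k ≥ ρ at every leaf, which appears to be the source of the additional minimum over subsets L ⊆ K in Theorem 1.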