Learning with Deep Cascades
Giulia DeSalvo (Courant Institute of Mathematical Sciences), Mehryar Mohri (Google Research and Courant Institute of Mathematical Sciences), and Umar Syed (Google Research)
September 28, 2015

Introduction
Decision trees: trees with indicator functions at each internal node and assignment functions at each leaf.
Figure: A decision tree with node questions x > 10 and x < 20 and leaves labeled +1 or -1. [Breiman et al., 1984; Quinlan, 1986]

Introduction
Deep cascades are a significantly broader family of decision trees: the node questions q_j and the leaf classifiers h_k may belong to different hypothesis sets,
    f := \sum_{k=1}^{l} \prod_{j=1}^{d_k} q_j \, h_k.
They can be used to tackle harder tasks.
Figure: A deep cascade with root question q_1(x), leaf classifier h_1(x) at depth one, and a second question q_2(x) leading to leaf classifiers h_2(x) and h_3(x).

Motivation
Challenge: H = \cup_{i=1}^{p} H_i could be very complex, so learning is prone to overfitting.
Figure: Hypothesis sets H_1, H_2, H_3, H_4, ..., H_p contained in H.
Goal: Would it be possible to learn with questions of varying complexity at each node and yet not overfit?

Generalization Bounds
With high probability, the following holds for all deep cascade functions f:
    R(f) \leq \hat{R}_S(f) + \sum_{k=1}^{l} \min\Big( 4 \hat{R}_S(H_k), \frac{m_k^+}{m} \Big) + \tilde{O}\Big( \sqrt{\tfrac{\log(pl)}{m}} \Big).
Definition: a deep cascade is f := \sum_{k=1}^{l} \prod_{j=1}^{d_k} q_j \, h_k, and H_k is the root-leaf k hypothesis class, consisting of the functions of the form \prod_{j=1}^{d_k} q_j \, h_k.

Generalization Bounds
In this bound, \hat{R}_S(H_k) is the empirical Rademacher complexity of the root-leaf k class, and m_k^+/m is the fraction of correctly classified points at leaf k.

Generalization Bounds
Trivial case: if the tree has a single node, the bound is not interesting:
    R(f) \leq \hat{R}_S(f) + \hat{R}_S(H)   if m^+/m \geq \hat{R}_S(H),
    R(f) \leq 1                             if m^+/m < \hat{R}_S(H).

Generalization Bounds
If the tree has more than one node, the critical measure is the balance between the class complexities and the fractions of correctly classified points at each leaf:
    R(f) \leq \hat{R}_S(f) + \sum_{k=1}^{l} \min\Big( 4 \hat{R}_S(H_k), \frac{m_k^+}{m} \Big).

Previous Work
Generalization bounds for decision trees: Mansour & McAllester 2000; Nobel 2002; Golea et al. 1997; Scott & Nowak 2005.
Cascades in object detection: Viola & Jones 2004; Saberian 2010; Lefakis et al. 2010; Chen et al. 2012; and others.
Decision tree algorithms that use SVMs: Bennett & Blue 1998.
Multi-class classification: Dong 2008; Madjarov et al. 2012; Takahashi et al. 2002.
Computational speed: Arreola et al. 2006; Chang 2010; Kumar 2010; Rodriguez-Lujan et al. 2012.

Assumptions of the Algorithm
The parameter \mu controls the fraction of points that proceed deeper into the tree, which lets the algorithm seek the best tradeoff between the complexity term and the fraction of points at each node.
Figure: Cascade topology: at each node, a \mu fraction of the points proceeds to the next node and the remaining 1 - \mu fraction is classified at the current leaf.

Deep Cascade Algorithm
1. Generate different deep cascade functions by training SVMs at each leaf of the cascade.
2. Choose the best among them by minimizing an upper bound on the generalization guarantee.

1. Generating Deep Cascade Functions
Initially, train an SVM with a polynomial kernel. Extract the \mu fraction of points closest to the hyperplane; at the next node, retrain on this extracted subsample with an SVM using a polynomial kernel of a different degree.
Figure: Node question q_1 with branches q_1 = 1 and q_1 = 0, the hyperplane h_1, and the \mu_1 fraction of points closest to it.

1. Generating Deep Cascade Functions (continued)
Figure: Node 1 with question q_1 and leaf classifier h_1, node 2 with question q_2 and leaf classifier h_2, and so on; at each node, a \mu fraction of the points proceeds deeper and the remaining 1 - \mu fraction is classified at the current leaf.
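To make the generation step more concrete, here is a minimal sketch of one way to implement it, assuming scikit-learn's SVC, NumPy, and labels in {-1, +1}. The function names build_cascade and cascade_predict, the fixed list of polynomial degrees, and the small-sample cutoff are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): generate one deep cascade by training
# a polynomial-kernel SVM at each node and forwarding the mu fraction of points
# closest to the hyperplane to the next node.
import numpy as np
from sklearn.svm import SVC

def build_cascade(X, y, degrees, mu, C=1.0):
    """Return a list of (SVC, threshold) pairs, one per cascade level.

    `degrees` gives the polynomial degree used at each level and `mu` is the
    fraction of points forwarded to the next level; both are candidate
    hyperparameters among which the bound-based criterion selects.
    """
    levels = []
    for depth, degree in enumerate(degrees):
        clf = SVC(kernel="poly", degree=degree, C=C)
        clf.fit(X, y)
        # Distance-like scores to the separating hyperplane.
        scores = np.abs(clf.decision_function(X))
        last = depth == len(degrees) - 1
        if not last:
            # Node question: q = 1 for the mu fraction closest to the hyperplane.
            threshold = np.quantile(scores, mu)
            keep = scores <= threshold
            # Stop early if too few points (or a single class) would remain.
            if keep.sum() < 10 or len(np.unique(y[keep])) < 2:
                last = True
        if last:
            levels.append((clf, None))  # final node classifies everything left
            break
        levels.append((clf, threshold))
        X, y = X[keep], y[keep]
    return levels

def cascade_predict(levels, X):
    """Classify each point at the first level whose question sends it to a leaf."""
    preds = np.zeros(len(X))
    active = np.ones(len(X), dtype=bool)
    for clf, threshold in levels:
        scores = clf.decision_function(X)
        # Points far from the hyperplane are labeled here; the rest proceed.
        leaf = active if threshold is None else active & (np.abs(scores) > threshold)
        preds[leaf] = np.where(scores[leaf] >= 0, 1.0, -1.0)
        active &= ~leaf
    return preds
```

For instance, degrees=[2, 3, 4] with mu=0.25 would build a three-level cascade in which a quarter of the remaining points is forwarded at each node; such choices of degrees and \mu are the candidates among which the next step selects.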
2. Choosing the Best Deep Cascade
Choose the deep cascade function f^* that minimizes the following objective over the set F of generated cascades:
    f^* \in \operatorname{argmin}_{f \in F} \Big\{ \hat{R}_S(f) + \sum_{k=1}^{l} \min\Big( 4 A_k, \frac{m_k^+}{m} \Big) \Big\},
where
    \hat{R}_S(H_k) \leq A_k = \sum_{j=1}^{d_k} \sqrt{\frac{d_{f,j} \log\frac{em}{d_{f,j}}}{m}} + \sqrt{\frac{d_{f,k} \log\frac{em}{d_{f,k}}}{m}},
with d_{f,j} = VCdim(q_j) and d_{f,k} = VCdim(h_k).

Experimental Results
Error rates (mean ± standard deviation):
    Dataset        SVM Algorithm       Deep Cascade Algorithm
    breastcancer   0.0426 ± 0.0117     0.0353 ± 0.00975
    german         0.297  ± 0.0193     0.256  ± 0.0324
    splice         0.205  ± 0.0134     0.175  ± 0.0152
    ionosphere     0.0971 ± 0.0167     0.117  ± 0.0229
    a1a            0.195  ± 0.0217     0.209  ± 0.0233

Next Steps
Use more than the simplest form of the bound.
Algorithm implementation: optimize the regularization parameter C at each node; search over larger sets of \mu values and polynomial degrees.
Apply the bounds to other algorithms that use decision trees (e.g., random forests).
Other algorithms could be inspired by this theoretical analysis.

Conclusion
1. Introduced a broad learning model formed by cascades of predictors: deep cascades.
2. Gave new data-dependent theoretical guarantees for deep cascades in terms of the Rademacher complexities of the families composing the cascade and the fraction of sample points correctly classified at each leaf.
3. Presented a novel algorithm that benefits from these guarantees.

Thank you!

Generalization Guarantee of Deep Cascades
Theorem 1. Fix \rho > 0. Assume that for all k \in [1, l], the functions in H_k take values in \{-1, +1\}. Then, for any \delta > 0, with probability at least 1 - \delta over the choice of a sample S of size m \geq 1, the following holds for all l \geq 1 and all cascade functions f \in T_l defined by (t, h):
    R(f) \leq \hat{R}_S(f) + \sum_{k=1}^{l} \min\Big( 4 \hat{R}_S(H_k), \frac{m_k^+}{m} \Big) + C(m, p, \rho) + \sqrt{\frac{\log\frac{4}{\delta}}{2m}},
where
    C(m, p, \rho) = \frac{2}{\rho} \sqrt{\frac{\log(pl)}{m}} + \sqrt{\frac{\log(pl)}{\rho^2 m} \log\Big( \frac{\rho^2 m}{\log(pl)} \Big)} = \tilde{O}\Big( \frac{1}{\rho} \sqrt{\frac{\log(pl)}{m}} \Big).

Proof Layout
1. For \alpha \in \operatorname{int}(\Delta), define, for all x \in X, g_\alpha(x) = \sum_{k=1}^{l} \alpha_k \, t(x, k) \, h_k(x); then sgn(f) coincides with sgn(g_\alpha).
2. Apply the learning bound for convex ensembles of Cortes et al. 2014:
    R(f) \leq \inf_{\alpha \in \operatorname{int}(\Delta)} \hat{R}_{S,\rho}(g_\alpha) + \frac{4}{\rho} \sum_{k=1}^{l} \alpha_k \hat{R}_S(H_k) + \tilde{O}(\cdot).
3. Crux: remove \alpha to derive an explicit bound. Let
    A(\alpha) = \frac{1}{m} \sum_{k=1}^{l} \sum_{i : t(x_i, k) = 1} 1_{y_i \alpha_k h_k(x_i) < \rho} + \frac{4}{\rho} \sum_{k=1}^{l} \alpha_k \hat{R}_S(H_k).
Observe that A can be decoupled as a sum over the leaves, solve the resulting optimization problem, and simplify.
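As a concrete illustration of the selection criterion from "Choosing the Best Deep Cascade", here is a minimal sketch in the same illustrative spirit as the earlier one, not the authors' code: it assumes that, for each candidate cascade, the empirical error, the per-leaf VC dimensions, and the per-leaf fractions of correctly classified points are available, and it scores each candidate by \hat{R}_S(f) + \sum_k \min(4 A_k, m_k^+/m).

```python
# Minimal sketch (not the authors' code) of the bound-based model selection:
# score each candidate deep cascade and keep the one with the smallest value.
import math

def complexity_term(vc_dims_questions, vc_dim_leaf, m):
    """A_k: sqrt(d * log(e*m/d) / m) summed over the VC dimensions d of the
    node questions on the root-to-leaf path, plus the same term for the
    leaf classifier."""
    dims = list(vc_dims_questions) + [vc_dim_leaf]
    return sum(math.sqrt(d * math.log(math.e * m / d) / m) for d in dims)

def bound_objective(empirical_error, leaves, m):
    """leaves: one (vc_dims_questions, vc_dim_leaf, correct_fraction) tuple
    per leaf, where correct_fraction is m_k^+ / m."""
    penalty = sum(
        min(4.0 * complexity_term(qdims, hdim, m), frac)
        for qdims, hdim, frac in leaves
    )
    return empirical_error + penalty

def select_best_cascade(candidates, m):
    """candidates: (cascade, empirical_error, leaves) triples; returns the
    cascade minimizing the bound-based objective."""
    return min(candidates, key=lambda c: bound_objective(c[1], c[2], m))[0]
```

The min with m_k^+/m reflects the balance the bound highlights: each leaf contributes either its complexity term 4A_k or the fraction of points it classifies correctly, whichever is smaller, so complex hypothesis sets remain affordable at leaves that correctly classify only a small fraction of the sample.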