Advanced Machine Learning & Perception
Instructor: Tony Jebara, Columbia University

Boosting
•Combining Multiple Classifiers
•Voting
•Boosting
•AdaBoost
•Based on material by Y. Freund, P. Long & R. Schapire

Combining Multiple Learners
•Have many simple learners, also called base learners or weak learners, each with a classification error below 0.5
•Combine or vote them to get a higher accuracy
•No free lunch: there is no guaranteed best approach here
•Different approaches:
  Voting: combine learners with fixed weights
  Mixture of Experts: adapt the learners and use a variable weight / gating function
  Boosting: actively search for the next base learner and vote
  Cascading, Stacking, Bagging, etc.

Voting
•Have T classifiers $h_t(x)$
•Average their predictions with weights: $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ where $\alpha_t \geq 0$ and $\sum_{t=1}^{T} \alpha_t = 1$
•Like a mixture of experts, but the weights are constant with respect to the input
[Figure: three experts, each seeing the input (x1, x2, x3), combined into the output y with fixed weights α]

Mixture of Experts
•Have T classifiers or experts $h_t(x)$ and a gating function $\alpha_t(x)$
•Average their predictions with variable weights
•But adapt the parameters of the gating function and of the experts (fixed total number T of experts)
•$f(x) = \sum_{t=1}^{T} \alpha_t(x)\, h_t(x)$ where $\alpha_t(x) \geq 0$ and $\sum_{t=1}^{T} \alpha_t(x) = 1$
[Figure: experts and a gating network, all seeing the input (x1, x2, x3); the gating network weights the experts' outputs to form the output]

Boosting
•Actively find complementary or synergistic weak learners
•Train the next learner based on the mistakes of the previous ones
•Average the predictions with fixed weights
•Find the next learner by training on weighted versions of the data:
  training data $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ with weights $\{w_1,\ldots,w_N\}$
  the weak learner returns a weak rule $h_t(x)$ whose weighted error satisfies $\frac{\sum_{i=1}^{N} w_i\,\mathrm{step}(-h_t(x_i)y_i)}{\sum_{i=1}^{N} w_i} < \frac{1}{2} - \gamma$
  the ensemble of learners $\{(\alpha_1,h_1),\ldots,(\alpha_T,h_T)\}$ gives the prediction $f(x) = \sum_{t=1}^{T}\alpha_t h_t(x)$

AdaBoost
•Most popular weighting scheme
•Define the margin for point i as $y_i \sum_{t=1}^{T}\alpha_t h_t(x_i)$
•Find an $h_t$ and a weight $\alpha_t$ to minimize the cost function, the sum of exp-margins: $\sum_{i=1}^{N}\exp\!\left(-y_i\sum_{t=1}^{T}\alpha_t h_t(x_i)\right)$
•(Same weak-learner loop as on the previous slide: weighted data in, weak rule $h_t(x)$ out, ensemble $\{(\alpha_1,h_1),\ldots,(\alpha_T,h_T)\}$, prediction $f(x)=\sum_{t=1}^{T}\alpha_t h_t(x)$.)

AdaBoost
•Choose the base learner and $\alpha_t$ by solving $\min_{\alpha_t,h_t}\sum_{i=1}^{N}\exp\!\left(-y_i\sum_{t=1}^{T}\alpha_t h_t(x_i)\right)$
•Recall that the weighted error of base classifier $h_t$ must satisfy $\epsilon_t = \frac{\sum_{i=1}^{N}w_i\,\mathrm{step}(-h_t(x_i)y_i)}{\sum_{i=1}^{N}w_i} < \frac{1}{2}-\gamma$
•For binary h, AdaBoost puts this weight on the weak learners (instead of the more general rule): $\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
•AdaBoost picks the following weights on the data for the next round (here $Z_t$ is the normalizer so the weights sum to 1): $w_i^{t+1} = \frac{w_i^t\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$

Decision Trees
[Figure: a decision tree on 2D data; the splits X>3 and Y>5 partition the plane into regions labeled +1 and -1]

Decision tree as a sum
[Figure: the same tree rewritten as the sign of a sum of real-valued predictions (e.g. -0.2, +0.1, -0.3, +0.2) attached to the root and branches]

An alternating decision tree
[Figure: an alternating decision tree adding a new splitter node Y<1 with prediction +0.7 alongside X>3 and Y>5; the output is the sign of the summed predictions along all active paths]

Example: Medical Diagnostics
•Cleve dataset from the UC Irvine database
•Heart disease diagnostics (+1 = healthy, -1 = sick)
•13 features from tests (real-valued and discrete)
•303 instances
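Before turning to the Cleve results, the AdaBoost update rules above can be summarized in a short sketch. This is a minimal illustration with decision stumps as the weak learners; the function names (`fit_stump`, `adaboost`, `predict`) are ours, not from the lecture, and no attempt is made at efficiency.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Decision stump: predicts +1 or -1 by thresholding a single feature."""
    return polarity * np.where(X[:, feature] > threshold, 1.0, -1.0)

def fit_stump(X, y, w):
    """Pick the (feature, threshold, polarity) with the lowest weighted error."""
    best, best_err = None, np.inf
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (+1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = np.sum(w * (pred != y))      # weights are kept normalized to sum to 1
                if err < best_err:
                    best_err, best = err, (feature, threshold, polarity)
    return best, best_err

def adaboost(X, y, T=20):
    """Returns the ensemble [(alpha_1, stump_1), ..., (alpha_T, stump_T)]."""
    N = len(y)
    w = np.ones(N) / N                              # uniform initial weights w_i^1 = 1/N
    ensemble = []
    for t in range(T):
        stump, eps = fit_stump(X, y, w)             # weighted error eps_t of the weak rule
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # alpha_t = 1/2 ln((1-eps_t)/eps_t)
        pred = stump_predict(X, *stump)
        w = w * np.exp(-alpha * y * pred)           # w_i^{t+1} proportional to w_i^t exp(-alpha_t y_i h_t(x_i))
        w = w / w.sum()                             # divide by the normalizer Z_t
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    """f(x) = sign( sum_t alpha_t h_t(x) )."""
    F = sum(alpha * stump_predict(X, *stump) for alpha, stump in ensemble)
    return np.sign(F)
```

On data with labels in {-1, +1} (such as the heart-disease task above), `predict(adaboost(X, y, T=50), X_test)` returns the boosted vote $\mathrm{sign}\!\left(\sum_t \alpha_t h_t(x)\right)$.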
Ad-Tree Example
[Figure: alternating decision tree learned on the Cleve heart-disease data]

Cross-validated accuracy
Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boost Stumps         16                 16.5%                0.8%

AdaBoost Convergence
•Rationale? Consider a bound on the training error:
  $R_{\mathrm{emp}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{step}(-y_i f(x_i))$   (0-1 loss: counts the mistakes)
  $\leq \frac{1}{N}\sum_{i=1}^{N}\exp(-y_i f(x_i))$   (exp is an upper bound on the step function)
  $= \frac{1}{N}\sum_{i=1}^{N}\exp\!\left(-y_i\sum_{t=1}^{T}\alpha_t h_t(x_i)\right)$   (definition of f(x))
  $= \prod_{t=1}^{T} Z_t$   (recursive use of the normalizers $Z_t$)
[Figure: losses versus margin; mistakes lie at negative margin, correct points at positive margin; the exponential loss upper-bounds the 0-1 loss, with the Logitboost and Brownboost losses shown for comparison]
•AdaBoost is essentially doing gradient descent on this bound.
•Convergence?

AdaBoost Convergence
•Convergence? Consider the binary $h_t$ case:
  $R_{\mathrm{emp}} \leq \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T}\sum_{i=1}^{N} w_i^t\exp(-\alpha_t y_i h_t(x_i))$
  $= \prod_{t=1}^{T}\sum_{i=1}^{N} w_i^t\exp\!\left(-\tfrac{1}{2}\, y_i h_t(x_i)\ln\tfrac{1-\epsilon_t}{\epsilon_t}\right)$
  $= \prod_{t=1}^{T}\left(\sum_{\text{correct } i} w_i^t\sqrt{\tfrac{\epsilon_t}{1-\epsilon_t}} + \sum_{\text{incorrect } i} w_i^t\sqrt{\tfrac{1-\epsilon_t}{\epsilon_t}}\right)$
  $= \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)} = \prod_{t=1}^{T}\sqrt{1-4\gamma_t^2} \leq \exp\!\left(-2\sum_t\gamma_t^2\right)$
  using $\alpha_t = \tfrac{1}{2}\ln\tfrac{1-\epsilon_t}{\epsilon_t}$ and $\epsilon_t = \tfrac{1}{2}-\gamma_t$
•So $R_{\mathrm{emp}} \leq \exp\!\left(-2\sum_t\gamma_t^2\right) \leq \exp(-2T\gamma^2)$: the training error of the final learner goes to zero exponentially fast in T if every weak learner is at least $\gamma$ better than chance.

Curious phenomenon
•Boosting decision trees: using <10,000 training examples we fit >2,000,000 parameters

Explanation using margins
[Figure: 0-1 loss versus margin]

Explanation using margins
[Figure: 0-1 loss versus margin; no examples with small margins!!]

Experimental Evidence
[Figure]

AdaBoost Generalization Bound
•Also, a VC analysis gives a generalization bound: $R \leq R_{\mathrm{emp}} + O\!\left(\sqrt{\tfrac{Td}{N}}\right)$ (where d is the VC dimension of the base classifier)
•But then more iterations mean overfitting!
•A margin analysis is possible; redefine the margin as $\mathrm{mar}_f(x,y) = \frac{y\sum_t\alpha_t h_t(x)}{\sum_t\alpha_t}$
•Then we have $R \leq \frac{1}{N}\sum_{i=1}^{N}\mathrm{step}\!\left(\theta - \mathrm{mar}_f(x_i,y_i)\right) + O\!\left(\sqrt{\tfrac{d}{N\theta^2}}\right)$

AdaBoost Generalization Bound
•Suggests this optimization problem:
[Figure: maximize the margin on the training data]

AdaBoost Generalization Bound
•Proof Sketch

UCI Results
% test error rates
Database    Other              Boosting   Error reduction
Cleveland   27.2 (DT)          16.5       39%
Promoters   22.0 (DT)          11.8       46%
Letter      13.8 (DT)          3.5        74%
Reuters 4   5.8, 6.0, 9.8      2.95       ~60%
Reuters 8   11.3, 12.1, 13.4   7.4        ~40%

Boosted Cascade of Stumps
•Consider classifying an image as face / not-face
•Use as weak learner a stump built on averages of pixel intensities: easy to calculate, with the sums over white areas subtracted from the sums over black ones
•A special representation of the sample, called the integral image, makes feature extraction faster

Boosted Cascade of Stumps
•Summed area tables
•A representation in which any rectangle's sum of values can be calculated with four accesses of the integral image

Boosted Cascade of Stumps
•Summed area tables
[Figure: a rectangle sum computed from the four corner values of the integral image]
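As a concrete sketch of the summed-area-table idea, assume a grayscale image stored as a 2-D array; the names `integral_image` and `rect_sum` are illustrative, not from the Viola-Jones paper.

```python
import numpy as np

def integral_image(img):
    """Integral image: ii[r, c] = sum of img[0:r, 0:c].
    Padded with a leading row/column of zeros so rectangle sums need no edge cases."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in the rectangle [top, top+height) x [left, left+width),
    using exactly four accesses of the integral image."""
    return (ii[top + height, left + width]
            - ii[top, left + width]
            - ii[top + height, left]
            + ii[top, left])
```

A two-rectangle feature value is then just the difference of two such sums, e.g. `rect_sum(ii, r, c, h, w) - rect_sum(ii, r, c + w, h, w)` for a side-by-side pair, so every rectangle feature costs a handful of array lookups regardless of its size.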
Boosted Cascade of Stumps
•The base size for a sub-window is 24 by 24 pixels
•Each of the four feature types is scaled and shifted across all possible combinations
•In a 24 by 24 pixel sub-window there are ~160,000 possible features to be calculated

Boosted Cascade of Stumps
•Viola-Jones algorithm: with K attributes (e.g., K = 160,000) we have 160,000 different decision stumps to choose from
•At each stage of boosting:
  given the reweighted data from the previous stage
  train all K (160,000) single-feature perceptrons
  select the single best classifier at this stage
  combine it with the previously selected classifiers
  reweight the data
  learn all K classifiers again, select the best, combine, reweight
  repeat until you have T classifiers selected
•Very computationally intensive!

Boosted Cascade of Stumps
•Reduction in error as boosting adds classifiers
[Figure]

Boosted Cascade of Stumps
•First (i.e. best) two features learned by boosting
[Figure]

Boosted Cascade of Stumps
•Example training data
[Figure]

Boosted Cascade of Stumps
•To find faces, scan all squares at different scales: slow
•Boosting finds an ordering on the weak learners (best ones first)
•Idea: cascade the stumps to avoid too much computation! (a rough sketch of this cascade evaluation appears at the end of this section)

Boosted Cascade of Stumps
•Training time = weeks (with 5k faces and 9.5k non-faces)
•Final detector has 38 layers in the cascade, 6060 features
•On a 700 MHz processor it can process a 384 x 288 image in 0.067 seconds (in 2003, when the paper was written)

Boosted Cascade of Stumps
•Results
[Figure]
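The cascade idea above can be sketched roughly as follows. The layer dictionaries, per-layer thresholds, and stump callables are illustrative assumptions, not the exact Viola-Jones implementation; the point is only that cheap early layers reject most non-face windows before the expensive later layers run.

```python
def evaluate_layer(layer, window_ii):
    """One boosted layer: weighted vote of its stumps compared to a per-layer threshold.
    Each stump is assumed to be a callable that computes its rectangle feature on the
    window's integral image (e.g. via rect_sum above) and returns +1 or -1."""
    score = sum(alpha * stump(window_ii) for alpha, stump in layer["stumps"])
    return score >= layer["threshold"]

def cascade_detect(layers, window_ii):
    """Attentional cascade: a window must pass every layer to be reported as a face.
    Most non-face windows are rejected by the first, cheapest layers, so the later,
    larger layers only run on a small fraction of all scanned windows."""
    for layer in layers:
        if not evaluate_layer(layer, window_ii):
            return False   # rejected early: stop spending computation on this window
    return True            # passed all layers (38 in the final Viola-Jones detector)
```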