AdaBoost Classifier
• Simplest classifier

Adaboost: Agenda
• Adaptive Boosting (R. Schapire and Y. Freund, ICML, 1996)
• Supervised classifier
• Assembling classifiers – combine many low-accuracy classifiers (weak learners) to create a high-accuracy classifier (a strong learner)

Example 1
Adaboost: Example (1/10) – (10/10) (illustrative figures in the original slides)

Adaboost
• Strong classifier = linear combination of T weak classifiers
  (1) Design of the weak classifier h_t(x)
  (2) Weight of each classifier (hypothesis weight) α_t
  (3) Update of the weight of each example (the example distribution D_t)
• Weak classifier: less than 50% error over any distribution

Adaboost: Terminology (1/2), (2/2)
Adaboost: Framework

Adaboost: Design of weak classifier (1/2), (2/2)
• Select the weak classifier with the smallest weighted error:
  h_t = argmin_{h_j ∈ H} ε_j,  where ε_j = Σ_{i=1}^{m} D_t(i) · 1[y_i ≠ h_j(x_i)]
• Prerequisite: ε_t < 1/2

Adaboost: Hypothesis weight (1/2)
• How to set α_t? Bound the training error of the strong classifier H:
  training error(H) = (1/N) Σ_i 1[y_i ≠ H(x_i)] = (1/N) Σ_i 1[y_i f(x_i) ≤ 0]
                    ≤ (1/N) Σ_i exp(−y_i f(x_i)) = Σ_i D_{T+1}(i) Π_t Z_t = Π_t Z_t
• Choosing α_t to minimize each normalizer Z_t therefore minimizes this bound on the training error.

Adaboost: Hypothesis weight (2/2)
  α_t = (1/2) ln((1 − ε_t) / ε_t)

Adaboost: Update example distribution (Reweighting)
• y_i · h_t(x_i) = 1 (correctly classified): the example's weight is decreased.
• y_i · h_t(x_i) = −1 (misclassified): the example's weight is increased.

Reweighting
In this way, AdaBoost "focuses on" the informative or "difficult" examples.

Summary
Start at t = 1 with a uniform distribution; in each round select the weak classifier with the lowest weighted error, weight it by α_t = (1/2) ln((1 − ε_t)/ε_t), and reweight the examples. A minimal code sketch of this loop follows.
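To make the summary above concrete, here is a minimal sketch of the generic AdaBoost training loop, assuming labels in {−1, +1} and using scikit-learn depth-1 decision trees as the weak learners; the weak-learner choice and the function names are illustrative, not part of the original slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    """Generic AdaBoost sketch; y must contain labels in {-1, +1}."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)                       # example distribution D_1(i) = 1/n
    stumps, alphas = [], []
    for t in range(T):
        # (1) weak classifier with the smallest weighted error under D_t
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        eps = float(np.sum(D[pred != y]))
        if eps >= 0.5:                            # prerequisite: eps_t < 1/2
            break
        # (2) hypothesis weight alpha_t = (1/2) ln((1 - eps_t) / eps_t)
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))
        # (3) reweight: correct examples shrink, misclassified examples grow
        D = D * np.exp(-alpha * y * pred)
        D = D / D.sum()                           # normalization constant Z_t
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Strong classifier H(x) = sign(sum_t alpha_t * h_t(x))."""
    f = sum(a * s.predict(np.asarray(X)) for a, s in zip(alphas, stumps))
    return np.sign(f)
```

The three commented steps mirror items (1)–(3) of the agenda: select h_t, compute its hypothesis weight α_t, and update the example distribution D_t.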
Example 2
Example (1/5): Original training set – equal weights for all training samples. (Taken from "A Tutorial on Boosting" by Yoav Freund and Rob Schapire.)
Example (2/5): ROUND 1
Example (3/5): ROUND 2
Example (4/5): ROUND 3
Example (5/5)

Example 3
Several boosting rounds, each using α_t = (1/2) ln((1 − ε_t)/ε_t) (figures in the original slides).

Example 4

Adaboost: Application

Discussion
Discrete Adaboost (DiscreteAB) (Friedman's wording)
Discrete Adaboost (DiscreteAB) (Freund and Schapire's wording)
Adaboost with Confidence-Weighted Predictions (RealAB)
Adaboost Variants Proposed by Friedman
• LogitBoost
• GentleBoost

Reference

Robust Real-time Object Detection
Keywords: feature extraction, integral image, AdaBoost, cascade

Outline
1. Introduction
2. Features
   2.1 Feature Extraction
   2.2 Integral Image
3. AdaBoost
   3.1 Training Process
   3.2 Testing Process
4. The Attentional Cascade
5. Experimental Results
6. Conclusion
7. Reference

1. Introduction
This paper brings together new algorithms and insights to construct a framework for robust and extremely rapid object detection. The frontal face detection system achieves:
1. High detection rates
2. Low false positive rates
Three main contributions:
1. The integral image
2. AdaBoost: selecting a small number of important features
3. A cascaded structure

2. Features
Classification is based on the values of simple features. Reasons:
1. A knowledge-based system is difficult to learn from a finite quantity of training data.
2. It is much faster than an image-based system.
ps. Feature-based: uses extracted features such as eye and nose patterns. Knowledge-based: uses rules about facial features. Image-based: uses face segments and predefined face patterns. [3]
[3] A. S. S. Mohamed, Ying Weng, S. S. Ipson, and Jianmin Jiang, "Face Detection based on Skin Color in Image by Neural Networks", ICIAS 2007, pp. 779-783, 2007.

2.1 Feature Extraction (1/2)
Filter: e.g., a Haar-like filter (the filter type).
Feature: a. the position of the pattern (e.g., eye, nose); b. the size of the pattern.
Feature value: feature value = filter ⊗ feature (e.g., a convolution).

2.1 Feature Extraction (2/2)
• Haar-like filter: the sum of the pixels which lie within the white rectangles is subtracted from the sum of the pixels in the grey rectangles.
Figure 1: filter types (+/− rectangle regions) applied to a 24×24 feature window to produce a feature value.

2.2 Integral Image (1/6)
Integral image:
1. Rectangle features
2. Can be computed very rapidly
ii(x, y): the sum of the pixels above and to the left of (x, y):
  ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')

2.2 Integral Image (2/6)
Known: A, B, C, D denote the sums of the pixels within rectangles A, B, C, D.
The integral-image value at location 1 is A, at location 2 is A + B, at location 3 is A + C, and at location 4 is A + B + C + D (Figure 2: integral image).
Q: The sum of the pixels within rectangle D = ?
A: D = 4 − (A + B) − C = 4 − 2 − C, and C = 3 − A = 3 − 1, so the sum within D can be computed as 4 + 1 − (2 + 3).

2.2 Integral Image (3/6)
Sum of the pixels (figure).

2.2 Integral Image (4/6)
Use the following pair of recurrences to obtain the integral image:
  s(x, y) = s(x, y − 1) + i(x, y)      (1)
  ii(x, y) = ii(x − 1, y) + s(x, y)    (2)
with s(x, −1) = 0 and ii(−1, y) = 0.
ps. i(x, y) is the original image, s(x, y) is the cumulative row sum, and ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y') is the integral image.

2.2 Integral Image (5/6)
A 3×3 numerical example showing the original image i(x, y), the cumulative row sum s(x, y) from recurrence (1), and the integral image ii(x, y) from recurrence (2) side by side.

2.2 Integral Image (6/6)
Original image i(x, y), cumulative row sum s(x, y), and integral image ii(x, y), all 3×3:
  s(0,0) = s(0,−1) + i(0,0) = 0 + 1 = 1        ii(0,0) = ii(−1,0) + s(0,0) = 0 + 1 = 1
  s(1,0) = 0 + 1 = 1                            ii(1,0) = ii(0,0) + s(1,0) = 1 + 1 = 2
  s(2,0) = 0 + 4 = 4                            ii(2,0) = ii(1,0) + s(2,0) = 2 + 4 = 6
  s(0,1) = s(0,0) + i(0,1) = 1 + 2 = 3          ii(0,1) = ii(−1,1) + s(0,1) = 0 + 3 = 3
  s(1,1) = s(1,0) + i(1,1) = 1 + 2 = 3          ii(1,1) = ii(0,1) + s(1,1) = 3 + 3 = 6
  …                                             …
  s(2,2) = s(2,1) + i(2,2) = 7 + 1 = 8          ii(2,2) = ii(1,2) + s(2,2) = 10 + 8 = 18
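A minimal sketch of the two recurrences above and of the four-corner rectangle sum; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def integral_image(i):
    """ii(x, y) = sum of i over all pixels above and to the left of (x, y), inclusive."""
    h, w = i.shape
    s = np.zeros((h, w))   # cumulative row sum:  s(x, y) = s(x, y-1) + i(x, y)
    ii = np.zeros((h, w))  # integral image:      ii(x, y) = ii(x-1, y) + s(x, y)
    for y in range(h):
        for x in range(w):
            s[y, x] = (s[y - 1, x] if y > 0 else 0) + i[y, x]
            ii[y, x] = (ii[y, x - 1] if x > 0 else 0) + s[y, x]
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the rectangle (x0, y0)..(x1, y1), inclusive,
    using the four-corner rule 4 + 1 - (2 + 3) from the slide above."""
    a = ii[y0 - 1, x0 - 1] if x0 > 0 and y0 > 0 else 0   # corner 1
    b = ii[y0 - 1, x1] if y0 > 0 else 0                   # corner 2
    c = ii[y1, x0 - 1] if x0 > 0 else 0                   # corner 3
    d = ii[y1, x1]                                        # corner 4
    return d + a - b - c

# quick check on a small test image
img = np.arange(9, dtype=float).reshape(3, 3)
ii = integral_image(img)
assert ii[2, 2] == img.sum()
assert rect_sum(ii, 1, 1, 2, 2) == img[1:, 1:].sum()
```

With the integral image precomputed, any rectangle sum (and hence any Haar-like feature value) costs only a handful of array lookups, which is what makes the feature evaluation fast.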
3. AdaBoost (1/2)
AdaBoost (Adaptive Boosting) is a machine learning algorithm. AdaBoost works by choosing and combining weak classifiers to form a more accurate strong classifier.
– Weak classifier (over the image set): compare a feature value against a threshold θ,
  h(X) = positive if filter(X) < θ, negative otherwise.

3. AdaBoost (2/2)
Subsequent classifiers are tweaked in favor of the instances misclassified by previous classifiers. [4]
The goal is to minimize the number of features that need to be computed for a new image, while still achieving high identification rates.
[4] AdaBoost - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/AdaBoost

3.1 Training Process – Flowchart
1. Input: training image set X: l face images (24×24) and m non-face images (24×24).
2. Feature Extraction: using Haar-like filters; assume each image yields N feature values, so there are N·(l + m) feature values in total, which also serve as the candidate thresholds θ.
3. AdaBoost Algorithm:
   3.0 Image weight initialization: w_i = 1/(2l) for positive images, 1/(2m) for negative images, i = 1, …, n.
   3.1 Normalize the image weights: w_{t,i} ← w_{t,i} / Σ_{j=1}^{n} w_{t,j}, i = 1, …, n.
   3.2 Error calculation: ε_{i,j} = Σ_{k=1}^{n} w_{t,k} |h_{i,j}(x_k) − y_k|, i = 1, …, n, j = 1, …, N.
   3.3 Select the weak classifier h_t with the lowest error ε_t.
   3.4 Image weight adjusting: w_{t+1,i} = w_{t,i} β_t if x_i is classified correctly, w_{t,i} otherwise.
4. Output: a strong classifier (24×24) built from T weak classifiers h_t with weak-classifier weights α_t:
   H_T(x) = Σ_{t=1}^{T} α_t h_t(x).

3.2 Training Process – Input
1. Input: the training data set is denoted X. Assume l positive images and m negative images, n = l + m images in total:
   X = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)},  y_i = 1 for positive/face, 0 for negative/non-face.
Assume each image x_i yields N local features; f_j(x_i) denotes the j-th local feature value of image x_i, so there are N·n local feature values in total.

3.2 Training Process – Feature Extraction (1/2)
2. Feature Extraction: apply the Haar-like filters to the n images by convolution to obtain the candidate feature values (n·N of them).
ps. One image yields N feature values, so N = 4·f, where f is the number of filters.

3.2 Training Process – Feature Extraction (2/2)
Define the weak classifiers h_{i,j}:
   h_{i,j}(x_k) = 1 if p_{i,j} f_j(x_k) < p_{i,j} θ_{i,j}, 0 otherwise,
where θ_{i,j} is the j-th local feature value of image i (used as a candidate threshold), f_j(x_k) is the j-th local feature value of image k, and the polarity is
   p_{i,j} = 1 if ε_{i,j} ≤ 0.5, −1 otherwise.
Ex: 3 face images and 1 non-face image, with values extracted by the 5th feature, give candidate thresholds θ_{1,5}, θ_{2,5}, θ_{3,5}, θ_{4,5}, classifiers h_{1,5}, h_{2,5}, h_{3,5}, h_{4,5}, and errors ε_{1,5}, ε_{2,5}, ε_{3,5}, ε_{4,5}.

3.2 Training Process – AdaBoost Algorithm (1/4)
3-0. Image weight initialization:
   w_i = 1/(2l) for positive/face images, 1/(2m) for negative/non-face images, i = 1, …, n,
where l is the number of positive images and m is the number of negative images.

3.2 Training Process – AdaBoost Algorithm (2/4)
Iterate for t = 1, …, T (T: the number of weak classifiers):
3-1. Normalize the image weights: w_{t,i} ← w_{t,i} / Σ_{j=1}^{n} w_{t,j}, i = 1, …, n (over the training data set X).
3-2. Error calculation for the candidate weak classifiers: ε_{i,j} = Σ_{k=1}^{n} w_{t,k} |h_{i,j}(x_k) − y_k|, i = 1, …, n, j = 1, …, N.
3-3. Select the weak classifier h_t with the lowest error rate ε_t.
3-4. Image weight adjusting:
   w_{t+1,i} = w_{t,i} β_t if x_i is classified correctly, w_{t,i} otherwise, where β_t = ε_t / (1 − ε_t).

3.2 Training Process – Output (1/2)
   H_T(x) = 1 (positive/face) if Σ_{t=1}^{T} α_t h_t(x) ≥ (1/2) Σ_{t=1}^{T} α_t, 0 (negative/non-face) otherwise,
with weak classifier weight α_t = log(1/β_t); the threshold is (1/2) Σ_{t=1}^{T} α_t.

3.2 Training Process – Output (2/2)
Fig. B: weak classifier weight (1 − ε_t)/ε_t plotted against ε. Fig. C: weak classifier weight log((1 − ε_t)/ε_t) plotted against ε.
As Fig. B shows, the same change in ε (error rate) in the 0–0.1 range and in, say, the 0.1–0.5 range produces very different changes in α (weak classifier weight); as ε approaches 0, even a tiny change in ε makes that classifier's share of the strong classifier grow drastically. Taking the log (Fig. C) compresses the gaps between the weights, so that every weak classifier keeps a reasonable share of the strong classifier:
  ε        (1 − ε)/ε    log((1 − ε)/ε)
  0.001    999          2.99
  0.005    199          2.29
  0.101    8.9          0.94
  0.105    8.52         0.93
(Near 0, a change of 0.004 in ε changes (1 − ε)/ε by 800 but log((1 − ε)/ε) by only 0.7; near 0.1 the same change gives 0.38 and 0.01.)

AdaBoost Algorithm – Image Weight Adjusting Example
If ε_{1,3} = 0.167 is the smallest error, then β_1 = ε_{1,3}/(1 − ε_{1,3}) = 0.167/(1 − 0.167) ≈ 0.2. At t = 1:
                        W1         W2         W3         W4         W5      W6
  Initial value         0.167      0.167      0.167      0.167      0.167   0.167
  After classification  O          O          O          O          X       O
  Update w_{t+1}        0.167·0.2  0.167·0.2  0.167·0.2  0.167·0.2  0.167   0.167·0.2
  Normalize             0.1        0.1        0.1        0.1        0.5     0.1
  (w_{t+1,i} = w_{t,i} β_t if x_i is classified correctly, w_{t,i} otherwise.)
In each round the weights of correctly classified images are lowered; after normalization the weights of misclassified images rise relative to the others, so images that are often misclassified end up with high weights. A high weight means the classification evaluation concentrates on that image.
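The following is a compact sketch of the training loop just described: precomputed feature values, weighted-error stump selection, β-based reweighting, and the (1/2)·Σα output threshold. It assumes the feature values have already been extracted into an n×N array; the function names are illustrative, and the exhaustive threshold search is chosen for clarity rather than speed.

```python
import numpy as np

def best_stump(F, y, w):
    """Find (feature j, threshold theta, polarity p) minimizing the weighted error of
    h(x) = 1 if p * f_j(x) < p * theta else 0 (exhaustive search over observed values)."""
    n, N = F.shape
    best = (0, 0.0, 1, np.inf)
    for j in range(N):
        for theta in np.unique(F[:, j]):          # candidate thresholds = observed feature values
            for p in (1, -1):
                pred = (p * F[:, j] < p * theta).astype(int)
                err = np.sum(w * np.abs(pred - y))
                if err < best[3]:
                    best = (j, theta, p, err)
    return best

def train_strong_classifier(F, y, T=10):
    """F: n x N matrix of precomputed feature values, y in {0, 1}.
    Returns a list of stumps (j, theta, p, alpha)."""
    y = np.asarray(y)
    l, m = int(np.sum(y == 1)), int(np.sum(y == 0))
    w = np.where(y == 1, 1.0 / (2 * l), 1.0 / (2 * m))   # 3-0. weight initialization
    stumps = []
    for t in range(T):
        w = w / w.sum()                                   # 3-1. normalize
        j, theta, p, eps = best_stump(F, y, w)            # 3-2 / 3-3. lowest weighted error
        beta = max(eps, 1e-12) / (1.0 - eps)              # beta_t (guard against eps == 0)
        pred = (p * F[:, j] < p * theta).astype(int)
        w = np.where(pred == y, w * beta, w)              # 3-4. shrink weights of correct images
        stumps.append((j, theta, p, np.log(1.0 / beta)))  # alpha_t = log(1 / beta_t)
    return stumps

def strong_classify(stumps, f):
    """Face if sum_t alpha_t * h_t(f) >= (1/2) * sum_t alpha_t."""
    score = sum(a * (1 if p * f[j] < p * theta else 0) for j, theta, p, a in stumps)
    return int(score >= 0.5 * sum(a for _, _, _, a in stumps))
```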
3.3 Testing Process – Flowchart
Test image:
1. Extract sub-windows: downsample the 360×420 test image over a pyramid of scales (360×420, 228×336, …, 24×28) and slide a 24×24 window over every scale – about 100,000 sub-windows (24×24) in total.
2. Strong classifier (24×24) detection: load the T weak classifiers h_1, h_2, h_3, …, h_T and apply the strong classifier to every sub-window:
   H_T(x) = 1 (positive) if Σ_{t=1}^{T} α_t h_t(x) ≥ (1/2) Σ_{t=1}^{T} α_t, negative otherwise.
   Accepted windows are kept; rejected windows are discarded.
3. Merge result: average the coordinates of the accepted windows to produce the result image.

4. The Attentional Cascade (1/4)
Advantage: reduces testing computation time.
Method: a cascade of stages.
Idea: reject as many negatives as possible at the earliest stage; more complex classifiers are used in later stages.
The detection process is that of a degenerate decision tree, a so-called "cascade".
Figure 4: cascade structure (each stage separates true/false positives from true/false negatives).

4. The Attentional Cascade (2/4)
Stage 1, Stage 2, Stage 3, each with its own strong-classifier threshold θ (figure).

4. The Attentional Cascade (3/4)
True positive rate (detection rate): the probability that a positive is classified as positive,
  TP rate = True Positives / (True Positives + False Negatives)
False positive rate (FP): the probability that a negative is classified as positive,
  FP rate = False Positives / (False Positives + True Negatives)
True negative rate: the probability that a negative is classified as negative,
  TN rate = True Negatives / (False Positives + True Negatives)
False negative rate (FN): the probability that a positive is classified as negative,
  FN rate = False Negatives / (True Positives + False Negatives)
FP + FN => error rate.

4. The Attentional Cascade (4/4)
Training a cascade of classifiers involves two competing goals:
1. Higher detection rates
2. Lower false positive rates
More features achieve higher detection rates and lower false positive rates, but classifiers with more features require more time to compute. Define an optimization framework over:
1. the number of stages,
2. the number of features in each stage,
3. the strong-classifier threshold in each stage.
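The sketch below illustrates the early-rejection idea: a sub-window is passed through the stages in order and is discarded as soon as any stage rejects it. The stage representation and the function name are assumptions for illustration; each stage reuses the (j, theta, p, alpha) stump format from the training sketch above, with its own (possibly lowered) threshold.

```python
def cascade_classify(stages, f):
    """stages: list of (stumps, threshold) pairs, one per cascade stage, where `stumps`
    is a list of (j, theta, p, alpha) weak classifiers and `threshold` is that stage's
    strong-classifier threshold.  f: feature vector of one 24x24 sub-window."""
    for stumps, threshold in stages:
        score = sum(a * (1 if p * f[j] < p * theta else 0)
                    for j, theta, p, a in stumps)
        if score < threshold:
            return False      # rejected by this stage: later stages are never evaluated
    return True               # accepted by every stage: report a detection
```

Because most sub-windows fail the first one or two stages, only a few features are ever computed for a typical window, which is what makes the cascade fast.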
4. The Attentional Cascade – Algorithm (1/3)

4. The Attentional Cascade – Algorithm (2/3)
f: maximum acceptable false positive rate per stage (the largest allowed fraction of negatives classified as positives).
d: minimum acceptable detection rate per stage (the smallest allowed fraction of positives detected).
F_target: target overall false positive rate (the false positive rate allowed for the final cascade).
P: the set of positive images. N: the set of negative images.
Initial values: f = 0.5, d = 0.9999, F_target = 10^-6, F_0 = 1.0 (initial false positive rate), D_0 = 1.0 (initial detection rate), Threshold = 0.5 (AdaBoost strong-classifier threshold), Threshold_EPS = 10^-4 (threshold adjustment step), i = 0 (the number of cascade stages).

4. The Attentional Cascade – Algorithm (3/3)
Iterate:
  while F_i > F_target:
      i = i + 1                                   (add a stage)
      n_i = 0;  F_i = F_{i−1}
      while F_i > f · F_{i−1}:
          n_i = n_i + 1                           (add a feature)
          (F_i, D_i) = AdaBoost(P, N, n_i)        (train a strong classifier with n_i features; get the new D_i, F_i)
          while D_i < d · D_{i−1}:
              Threshold = Threshold − Threshold_EPS
              D_i = re-compute the detection rate of the current strong classifier with the lowered Threshold
                    (this also affects F_i)
          (lowering the Threshold raises both D_i and F_i)
      if F_i > F_target:
          N = the false detections of the current cascaded detector on the non-face images (roughly F_i · |N| of them)
Legend: i: the number of cascade stages; F_i: false positive rate of the cascade after the i-th stage; D_i: detection rate after the i-th stage; n_i: the number of features in the i-th stage.

5. Experimental Results (1/3)
Face training set: extracted from the World Wide Web; face and non-face training images; 4,916 hand-labeled faces, scaled and aligned to a base resolution of 24×24 pixels. The non-face sub-windows come from 9,544 images that were manually inspected and found to contain no faces.
Fig. 5: examples of the frontal upright face images used for training.

5. Experimental Results (2/3)
The cascade training used the 4,916 training faces, 10,000 non-face sub-windows, and the AdaBoost training procedure. Evaluated on the MIT+CMU test set, an average of only about 10 features are evaluated per sub-window. This is possible because a large majority of sub-windows are rejected by the first or second stage of the cascade. On a 700 MHz Pentium III processor, the face detector can process a 384×288 pixel image in about 0.067 seconds.

5. Experimental Results (3/3)
Fig. 6: the ROC curve (receiver operating characteristic) is created by adjusting the threshold θ of the final-stage classifier from −∞ to +∞.

Reference
[1] P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features", Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 511-518, 2001.
[2] P. Viola and M. Jones, "Robust Real-time Object Detection", International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[3] A. S. S. Mohamed, Ying Weng, S. S. Ipson, and Jianmin Jiang, "Face Detection based on Skin Color in Image by Neural Networks", ICIAS 2007, pp. 779-783, 2007.
[4] AdaBoost - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/AdaBoost