Robust Real-Time Face Detection
Advisor: Prof. 萬書言
Presenter: 何炳杰
Date: 2010/11/26
Source
 Title: Robust Real-Time Face Detection
 Authors: Paul Viola, Michael J. Jones
 Publication: International Journal of Computer Vision
 Publisher: Springer
 Date: May 1, 2004
http://www.springerlink.com/content/q70v4h6715v5p152/
Outline
 Background
 Key words
 Abstract
 1. Introduction
 2. Features
 3. Learning Classification Functions
 4. The Attentional Cascade
 5. Results
 6. Conclusions
 References
Background
Chellappa, R., Sinha, P., and Phillips, P.J. 2010. Face Recognition by Computers
and Humans. IEEE Computer Society.
http://www.opencv.org.cn/index.php/Cv%E6%A8%A1%E5%BC%8F%E8%AF%86%E5%88%AB
Background
 The object detection method was first proposed by Paul Viola and later
improved by Rainer Lienhart. First, a classifier is trained on the Haar-like
features of a set of samples (roughly a few hundred sample images), producing a
cascaded boosted classifier. The training samples are divided into positive and
negative examples: positive samples contain the target to be detected (e.g.
faces or cars), while negative samples are arbitrary images that do not contain
the target. All sample images are normalised to the same size (for example,
20 x 20 pixels).
Background
 After the classifier has been trained, it can be applied to regions of
interest (of the same size as the training samples) in an input image. The
classifier outputs 1 if the region is likely to contain the target (a face or a
car) and 0 otherwise. To search a whole image, the search window is moved
across the image and every position is tested for a possible target. To find
targets of different sizes, the classifier itself is designed to be resized,
which is more efficient than resizing the input image. So, to detect objects of
unknown size, the scanning procedure usually runs over the image several times
with search windows of different scales.
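To make this scanning procedure concrete, here is a minimal usage sketch with OpenCV's Python bindings. It is not from the paper; the cascade file name haarcascade_frontalface_default.xml, the input/output file names, and the parameter values are assumptions chosen only for illustration.

import cv2

# Load a pre-trained cascade (hypothetical file name; any cascade trained in
# the way described above would do).
cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

# The rectangle features are computed on intensity values, so work in grayscale.
image = cv2.imread("input.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detectMultiScale slides a search window over the image at several scales,
# as described above; scaleFactor controls the step between scales.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                 minSize=(24, 24))

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("output.jpg", image)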
Background
 The "cascade" in the classifier means that the final classifier is built by
chaining several simple classifiers in series. During detection, each candidate
window is passed through the stages one by one; most candidate regions are
rejected by the first few stages, and only the regions that pass every stage
are reported as targets.
 "Cascade" (diagram):
Key words
 Integral image
 AdaBoost
 Cascade classifier
Abstract
Three main concepts: the integral image, AdaBoost (Adaptive Boosting), and the
cascade classifier.
Abstract
 This paper describes a face detection framework that is capable of
processing images extremely rapidly while achieving high
detection rates.
- The first is the introduction of a new image representation called the “Integral
Image” which allows the features used by our detector to be computed very
quickly.
- The second is a simple and efficient classifier which is built using the
AdaBoost learning algorithm to select a small number of critical visual features
from a very large set of potential features.
- The third contribution is a method for combining classifiers in a “cascade”
which allows background regions of the image to be quickly discarded while
spending more computation on promising face-like regions.
1. Introduction
 Preface:
- In this paper, Paul Viola et al. use three techniques to find faces quickly:
the integral image, AdaBoost, and the cascade classifier. The introduction
restates the three main contributions listed in the abstract.
1.1 Overview
 Section 2:
- It will detail the form of the features as well as a new scheme for
computing them rapidly.
 Section 3:
- It will discuss the method in which these features are combined to form
a classifier (AdaBoost).
 Section 4:
- It will describe a method for constructing a cascade of classifiers.
 Section 5:
- It will describe a number of experimental results.
 Section 6:
- It contains a discussion of this system and its relationship to related
systems.
Features
 The authors use three kinds of features. The value of a two-
rectangle feature is the difference between the sum of the
pixels within two rectangular regions.
Features
 The two regions have the same size and shape and are horizontally or
vertically adjacent.
 A three-rectangle feature computes the sum within two outside rectangles
subtracted from the sum in a center rectangle.
 Finally, a four-rectangle feature computes the difference between diagonal
pairs of rectangles.
 Given that the base resolution of the detector is 24 × 24, the exhaustive set
of rectangle features is quite large: 160,000.
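As a concrete illustration of the feature definition (not code from the paper), the sketch below evaluates one horizontally adjacent two-rectangle feature directly from pixel sums; the window size and rectangle placement are arbitrary choices.

import numpy as np

def two_rect_feature(window, top, left, height, width):
    # Sum of the left rectangle minus the sum of the horizontally adjacent
    # right rectangle of the same size and shape.
    left_rect = window[top:top + height, left:left + width]
    right_rect = window[top:top + height, left + width:left + 2 * width]
    return int(left_rect.sum()) - int(right_rect.sum())

# Example on a random 24 x 24 detection window.
rng = np.random.default_rng(0)
window = rng.integers(0, 256, size=(24, 24))
print(two_rect_feature(window, top=8, left=4, height=8, width=6))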
2.1 Integral Image
 Concept:
A simple analogy from calculus: if we often need to evaluate the definite
integral ∫_a^b f(x) dx, we first compute the antiderivative F(x) = ∫_0^x f(t) dt;
then ∫_a^b f(x) dx = F(b) − F(a). The integral image plays the same role for
sums of pixels.
(Analogy between the integral image and integration.)
2.1 Integral Image
 The integral image at location x, y contains the sum of the
pixels above and to the left of x, y, inclusive:
ii(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′)
2.1 Integral Image
- ii(x, y) is the integral image.
- i(x, y) is the original image.
2.1 Integral Image
- s(x, y) is the cumulative row sum: s(x, y) = s(x, y − 1) + i(x, y)
- ii(x, y) = ii(x − 1, y) + s(x, y)
- Boundary conditions: s(x, −1) = 0, ii(−1, y) = 0
Here s(x, y) is the sum of the original pixel value at (x, y) and of all pixels
above it in the y direction (the "cumulative row sum").
(Figure: the integral image at a point A(x, y) is defined as the sum of all
pixels in the rectangle above and to its left, shown shaded; s(x, y) is the sum
of A(x, y) and all pixels above it in the y direction.)
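The following sketch (an illustration written from the recurrences above, not the authors' code) builds the integral image in a single pass; the final assert checks it against a direct double cumulative sum.

import numpy as np

def integral_image(i):
    # ii(x, y): sum of i over all pixels above and to the left, inclusive.
    # Built with s(x, y) = s(x, y - 1) + i(x, y) and
    # ii(x, y) = ii(x - 1, y) + s(x, y), with s(x, -1) = 0 and ii(-1, y) = 0.
    i = np.asarray(i, dtype=np.int64)
    s = np.zeros_like(i)    # cumulative row sums
    ii = np.zeros_like(i)
    for x in range(i.shape[0]):
        for y in range(i.shape[1]):
            s[x, y] = (s[x, y - 1] if y > 0 else 0) + i[x, y]
            ii[x, y] = (ii[x - 1, y] if x > 0 else 0) + s[x, y]
    return ii

img = np.arange(16).reshape(4, 4)
assert np.array_equal(integral_image(img), img.cumsum(axis=0).cumsum(axis=1))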
2.1 Integral Image
 Using the integral image, any rectangular sum can be computed in four array
references (see Fig. 3): with corner points 1, 2, 3, 4 of rectangle D as in
Fig. 3, the sum within D is ii(4) + ii(1) − ii(2) − ii(3).
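A small illustrative sketch of the four-reference lookup (not the authors' code): with a zero-padded integral image, the sum over any rectangle costs four array accesses regardless of its size.

import numpy as np

def rect_sum(ii, top, left, height, width):
    # Sum of the original image over rows [top, top + height) and columns
    # [left, left + width), given a zero-padded integral image ii.
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(36).reshape(6, 6)
# Zero-padded integral image: ii[y, x] = sum of img[:y, :x].
ii = np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
assert rect_sum(ii, top=1, left=2, height=3, width=2) == img[1:4, 2:4].sum()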
2.1 Integral Image
- The authors point out that in the case of linear operations (e.g.
f · g), any invertible linear operation can be applied to f or g if
its inverse is applied to the result.
Example: in the case of convolution, if the derivative operator is applied to
both the image and the kernel, the result must then be double integrated:
f ∗ g = ∫∫ (f′ ∗ g′).
2.1 Integral Image
- The authors go on to show that convolution can be
significantly accelerated if the derivatives of f and g are sparse
(or can be made so). A similar insight is that an invertible
linear operation can be applied to f if its inverse is applied to
g.
2.1 Integral Image
- Viewed in this framework, computation of the rectangle sum can be expressed
as a dot product i · r, where i is the image and r is the box car image (with
value 1 within the rectangle of interest and 0 outside). This operation can be
rewritten as i · r = (∫∫ i) · r″, where the integral image is the double
integral of the image and r″, the second derivative of the box car, is non-zero
only at the four corners of the rectangle.
2.2 Feature Discussion
 Rectangle features are sensitive to the presence of edges, bars, and other
simple image structure, but they are quite coarse.
 The only orientations available are vertical, horizontal and
diagonal.
Learning Classification Function
 AdaBoost concept:
- AdaBoost stands for Adaptive Boosting. It is an iterative algorithm whose
core idea is to train different (weak) classifiers on the same training set and
then combine these weak classifiers into a stronger final classifier (a strong
classifier).
- In our system a variant of AdaBoost is used both to select the features and
to train the classifier. In its original form, the AdaBoost learning algorithm
is used to boost the classification performance of a simple learning algorithm.
Learning Classification Function
 AdaBoost concept:
- It does this by combining a collection of weak classification
functions to form a stronger classifier. In the language of
boosting the simple learning algorithm is called a weak
learner.
- "Weak learner": randomly guessing the answer to a yes/no question gives about
50% accuracy. If a hypothesis can raise the accuracy of the guess even
slightly, it is a weak learning algorithm, and the process of obtaining it is
called weak learning; conversely, if a hypothesis can raise the accuracy
significantly, it is called strong learning.
Learning Classification Function
 The learner is called weak because we do not expect even
the best classification function to classify the training data
well. (i.e. for a given problem the best perceptron may only
classify the training data correctly 51% of the time).
 A weak classifier h(x, f, p, θ) thus consists of a feature f, a threshold θ
and a polarity p indicating the direction of the inequality:
h(x, f, p, θ) = 1 if p f(x) < p θ, and 0 otherwise.
Here x is a 24 × 24 pixel sub-window of an image.
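The definition above is just a thresholded feature (a decision stump); a few illustrative lines, with a hypothetical feature function f standing in for one of the rectangle features:

def weak_classifier(x, f, p, theta):
    # h(x, f, p, theta) = 1 if p * f(x) < p * theta, else 0.
    # x: a 24 x 24 sub-window; f: maps the sub-window to a feature value;
    # theta: threshold; p in {+1, -1} flips the direction of the inequality.
    return 1 if p * f(x) < p * theta else 0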
Learning Classification Function
 Table 1 (see paper, p. 142):
* Given example images (x1, y1), ..., (xn, yn) where yi = 0, 1 for negative and
positive examples respectively.
* Initialize weights w_{1,i} = 1/(2m) for yi = 0 and w_{1,i} = 1/(2l) for
yi = 1, where m and l are the number of negatives and positives respectively.
* For t = 1, ..., T:
1. Normalize the weights: w_{t,i} ← w_{t,i} / Σ_{j=1}^{n} w_{t,j}
2. Select the best weak classifier with respect to the weighted error:
ε_t = min_{f, p, θ} Σ_i w_i |h(x_i, f, p, θ) − y_i|
Learning Classification Function
 Table 1 (continued):
3. Define h_t(x) = h(x, f_t, p_t, θ_t), where f_t, p_t and θ_t are the
minimizers of ε_t.
(Select the best weak classifier h_t(x), i.e. the one with the minimum
weighted error.)
4. Update the weights (according to this best weak classifier):
w_{t+1, i} = w_{t, i} β_t^{1 − e_i}
where e_i = 0 if example x_i is classified correctly, e_i = 1 if it is
classified incorrectly, and β_t = ε_t / (1 − ε_t).
Learning Classification Function
 Table 1 (continued):
* The final strong classifier is:
C(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) ≥ (1/2) Σ_{t=1}^{T} α_t, and 0 otherwise,
where α_t = log(1/β_t).
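The sketch below is written from Table 1 as summarised above; it is not the authors' code. It trains decision stumps on a matrix of precomputed feature values, and the brute-force search in best_stump stands in for the faster sorted-list selection described in Section 3.1.

import numpy as np

def best_stump(features, y, w):
    # Exhaustively pick the (feature index, threshold, polarity) with the
    # lowest weighted error. features: (n_examples, n_features) array of
    # precomputed feature values; y: 0/1 label array; w: example weights.
    best = (0, 0.0, 1, np.inf)
    for j in range(features.shape[1]):
        for theta in np.unique(features[:, j]):
            for p in (+1, -1):
                h = (p * features[:, j] < p * theta).astype(int)
                err = np.sum(w * np.abs(h - y))
                if err < best[3]:
                    best = (j, theta, p, err)
    return best

def adaboost(features, y, T):
    # Boosting loop of Table 1: returns T weak classifiers with their alphas.
    eps = 1e-12
    m, l = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))        # initial weights
    classifiers = []
    for _ in range(T):
        w = w / w.sum()                                        # 1. normalize
        j, theta, p, err = best_stump(features, y, w)          # 2. best weak clf
        beta = max(err, eps) / max(1.0 - err, eps)
        h = (p * features[:, j] < p * theta).astype(int)       # 3. h_t on the data
        e = np.abs(h - y)                                      # 0 if correct, 1 if not
        w = w * beta ** (1 - e)                                # 4. down-weight correct examples
        classifiers.append((j, theta, p, np.log(1.0 / beta)))  # store alpha_t
    return classifiers

def strong_classify(classifiers, feature_row):
    # Final strong classifier: 1 iff sum_t alpha_t h_t(x) >= 0.5 * sum_t alpha_t.
    score = sum(a * (p * feature_row[j] < p * theta)
                for j, theta, p, a in classifiers)
    return int(score >= 0.5 * sum(a for _, _, _, a in classifiers))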
3.1 Learning Discussion
 The algorithm described in Table 1 is used to select key
weak classifiers from the set of possible weak classifiers.
 Since there is one weak classifier for each distinct
feature/threshold combination, there are effectively KN
weak classifiers, where K is the number of features and N
is the number of examples.
 Therefore the total number of distinct thresholds is N.
Given a task with N = 20000 and K = 160000 there are 3.2
billion distinct binary weak classifiers.
3.1 Learning Discussion
 The weak classifier selection algorithm proceeds as follows (training and
selecting the weak classifiers):
1. For each feature, the examples are sorted based on feature
value.
2. The AdaBoost optimal threshold for that feature can then be
computed in a single pass over this sorted list.
3. For each element in the sorted list, four sums are maintained
and evaluated:
- T+: the total sum of positive example weights.
- T−: the total sum of negative example weights.
- S+: the sum of positive weights below the current example.
- S−: the sum of negative weights below the current example.
3.1 Learning Discussion
 In face detection terms, for each element in the sorted list the four sums
are:
- T+: the total weight of all face samples.
- T−: the total weight of all non-face samples.
- S+: the total weight of the face samples before the current element.
- S−: the total weight of the non-face samples before the current element.
3.1 Learning Discussion
 The weak classifier selection algorithm (continued):
- For a threshold placed between the current and previous example in the sorted
list, the classification error is e = min( S+ + (T− − S−), S− + (T+ − S+) ).
- Therefore, a single scan through the sorted list from beginning to end is
enough to select, for this feature, the threshold with the minimum
classification error (the optimal threshold), i.e. to select an optimal weak
classifier.
(Figure: algorithm for training and selecting the best classifier.)
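An illustrative single-pass implementation of this step for one feature (a sketch written from the description above, not the authors' code), maintaining the four running sums T+, T−, S+, S− over the examples sorted by feature value:

import numpy as np

def best_threshold(values, y, w):
    # values: feature values for each example; y: 0/1 label array; w: weights.
    # Returns (threshold, polarity, weighted error) of the best stump for this
    # feature, found in one pass over the sorted examples.
    order = np.argsort(values)
    t_pos = np.sum(w[y == 1])      # T+: total positive weight
    t_neg = np.sum(w[y == 0])      # T-: total negative weight
    s_pos = 0.0                    # S+: positive weight below the current example
    s_neg = 0.0                    # S-: negative weight below the current example
    best_err, best_theta, best_p = np.inf, None, None
    for idx in order:
        # e = min(S+ + (T- - S-), S- + (T+ - S+)) for a threshold placed
        # just below the current example.
        err_below_neg = s_pos + (t_neg - s_neg)   # label everything below as negative
        err_below_pos = s_neg + (t_pos - s_pos)   # label everything below as positive
        err = min(err_below_neg, err_below_pos)
        if err < best_err:
            best_err, best_theta = err, values[idx]
            best_p = +1 if err_below_pos < err_below_neg else -1
        if y[idx] == 1:
            s_pos += w[idx]
        else:
            s_neg += w[idx]
    return best_theta, best_p, best_err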
3.2 Learning Results
 Initial experiments demonstrated that a classifier constructed
from 200 features would yield reasonable results (see Fig. 4).
3.2 Learning Results
 The first feature selected seems to focus on the property that
the region of the eyes is often darker than the region of the
nose and cheeks (see Fig. 5).
 The second feature selected relies on the property that the
eyes are darker than the bridge of the nose.
3.2 Learning Results
 In summary the 200-feature classifier provides initial evidence
that a boosted classifier constructed from rectangle features is
an effective technique for face detection.
The Attentional Cascade
 Cascade concept:
- The idea of the cascade classifier is illustrated in Figure 6. The features
are divided among several classifiers. The first classifier is the least
discriminative, but it can quickly filter out a large fraction of the windows
that are not faces; the next classifier handles somewhat harder cases and
rejects fewer windows than the first; and so on, until the last classifier.
The windows that survive every stage are the face detections we keep.
The Attentional Cascade
 This section describes an algorithm for constructing a cascade of
classifiers which achieves increased detection performance while
radically reducing computation time.
 The key insight is that smaller, and therefore more efficient,
boosted classifiers can be constructed which reject many of the
negative sub-windows while detecting almost all positive instances.
The Attentional Cascade
 A positive result from the first classifier triggers the evaluation of a
second classifier which has also been adjusted to achieve very high
detection rates. A positive result from the second classifier triggers
a third classifier, and so on. A negative outcome at any point leads
to the immediate rejection of the sub-window.
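In code form this rejection logic is just an early-exit loop; the sketch below is structural only, with stages represented by hypothetical callables that return True for face-like windows.

def cascade_detect(sub_window, stages):
    # stages: list of classifier callables, cheapest and most permissive first.
    for stage in stages:
        if not stage(sub_window):
            return False      # negative at any stage: reject immediately
    return True               # positive at every stage: report a detection

# Toy usage with hypothetical stages (thresholds on mean intensity):
# stages = [lambda w: w.mean() > 10, lambda w: w.mean() > 40]
# cascade_detect(window, stages)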
4.1 Training a Cascade of Classifiers
 In order to achieve good detection rates (between 85 and 95 percent) and
extremely low false positive rates, the number of cascade stages and the size
of each stage must be sufficient to achieve similar detection performance while
minimizing computation.
- The false positive rate of the cascade is F = ∏_{i=1}^{K} f_i, where K is the
number of classifiers (stages) and f_i is the false positive rate of the ith
classifier.
- The detection rate of the cascade is D = ∏_{i=1}^{K} d_i, where d_i is the
detection rate of the ith classifier.
4.1 Training a Cascade of Classifiers
 Purpose:
- Given concrete goals for overall false positive and detection
rates, target rates can be determined for each stage in the
cascade process.
- For example, a detection rate of about 0.9 can be achieved by a 10-stage
cascade if each stage has a detection rate of about 0.99 (since 0.99^10 ≈ 0.9).
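As a quick check of this arithmetic (plain Python; the per-stage false positive rate of 0.30 is an illustrative value), the overall rates of a K-stage cascade are simply products of the per-stage rates:

K = 10
d_per_stage = 0.99   # per-stage detection rate
f_per_stage = 0.30   # per-stage false positive rate (illustrative)

print(d_per_stage ** K)   # overall detection rate, about 0.904
print(f_per_stage ** K)   # overall false positive rate, about 5.9e-06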
4.1 Training a Cascade of Classifiers
 The key measure of each classifier is its “positive rate”, the
proportion of windows which are labelled as potentially
containing a face.
 The expected number of features which are evaluated is
N = Σ_{i=1}^{K} ( n_i ∏_{j<i} p_j ),
i.e. each stage's feature count weighted by the probability of reaching that
stage (the product of the positive rates of all earlier stages).
- N: the expected number of features evaluated.
- K: the number of classifiers.
- p_i: the positive rate of the ith classifier.
- n_i: the number of features in the ith classifier.
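A worked illustration (the stage sizes and positive rates below are invented): each stage's feature count is weighted by the probability of reaching that stage.

import math

def expected_features(n, p):
    # n[i]: number of features in stage i; p[i]: positive rate of stage i.
    # Stage 0 is always evaluated; stage i > 0 is reached with probability
    # prod(p[:i]).
    return sum(n_i * math.prod(p[:i]) for i, n_i in enumerate(n))

n = [2, 10, 25, 50]             # cheap first stage, larger later stages
p = [0.5, 0.2, 0.1, 0.05]
print(expected_features(n, p))  # 2 + 5 + 2.5 + 0.5 = 10.0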
4.1 Training a Cascade of Classifiers
 Table 2 (see paper, p. 146): the detector cascade training algorithm.
* The user selects values for f, the maximum acceptable false positive rate per
layer, and d, the minimum acceptable detection rate per layer.
* The user selects a target overall false positive rate, F_target.
* P = set of positive examples; N = set of negative examples.
* F_0 = 1.0; D_0 = 1.0; i = 0
* While F_i > F_target:
1. i ← i + 1; n_i = 0; F_i = F_{i−1}
2. While F_i > f × F_{i−1}: increase n_i, use P and N to train a classifier
with n_i features using AdaBoost, evaluate the current cascaded classifier on a
validation set to determine F_i and D_i, and decrease the threshold for the ith
classifier until the current cascaded classifier has a detection rate of at
least d × D_{i−1} (this also affects F_i).
3. N ← ∅
4. If F_i > F_target, then evaluate the current cascaded detector on
the set of non-face images and put any false
detections into the set N.
4.2 Simple Experiment
 In order to explore the feasibility of the cascade approach, two simple
detectors were trained:
- A monolithic 200-feature classifier (a single, non-cascaded classifier).
- A cascade of ten 20-feature classifiers.
* The first stage:
- The first classifier in the cascade was trained using 5000 faces and 10000
non-face sub-windows randomly chosen from non-face images.
* The second stage:
- The second-stage classifier was trained on the same 5000 faces plus 5000
false positives of the first classifier.
4.2 Simple Experiment
(Diagram: in the monolithic setup, all sub-windows pass through a single
200-feature classifier, which outputs true (T) or false (F). In the cascaded
setup, all sub-windows pass through classifiers 1, 2, 3, ..., 10 in turn; a
false (F) result at any stage rejects the sub-window immediately, and only
windows passing all ten stages reach the final outcome.)
4.2 Simple Experiment
 Outcome of the two experiments:
- Type 1:
The monolithic 200-feature classifier was trained on the union of
all examples used to train all the stages of the cascaded classifier.
Note that without reference it might be difficult to select a set of
non-face training examples to train the monolithic classifier.
- Type 2: The sequential way in which the cascaded classifier is
trained effectively reduces the non-face training set by throwing
out easy examples and focusing on the “hard” ones.
4.2 Simple Experiment
 Fig 7.
Figure 7. ROC curves comparing a 200-feature classifier with a cascaded classifier containing
ten 20-feature classifiers. Accuracy is not significantly different, but the speed of the cascaded
classifier is almost 10 times faster.
Results
 Preface:
- This section describes the final face detection system. The
discussion includes details on the structure and training of the
cascaded detector as well as results on a large real-world testing
set.
Training Dataset
 The face training set consisted of 4916 hand labeled faces
scaled and aligned to a base resolution of 24 by 24 pixels.
 The face bounding box was then enlarged by 50%, cropped, and scaled to
24 by 24 pixels.
Structure of the Detector Cascade
 The final detector is a 38 layer cascade of classifiers which included
a total of 6060 features.
 The first classifier in the cascade is constructed using two features
and rejects about 50% of non-faces while correctly detecting close
to 100% of faces. The next classifier has ten features and rejects
80% of non-faces while detecting almost 100% of faces.
Experiments on a Real-World Test Set
 Authors test their system on the MIT + CMU frontal face test
set.
 This set consists of 130 images with 507 labeled frontal faces.
Experiments on a Real-World Test Set
 Images from the MIT + CMU test set.
Conclusions
 Authors have presented an approach for face detection which minimizes
computation time while achieving high detection accuracy.
 The approach was used to construct a face detection system which is
approximately 15 times faster than any previous approach.
 Main contributions:
- The first contribution is a new a technique for computing a rich set of image features
using the integral image.
- The second contribution of this paper is a simple and efficient classifier built from
computationally efficient features using AdaBoost for feature selection.
- The third contribution of this paper is a technique for constructing a cascade of
classifiers which radically reduces computation time
while improving detection accuracy.
(Key techniques: integral image → AdaBoost → cascade classifier.)
References
 Amit, Y. and Geman, D. 1999. A computational model for visual selection. Neural
Computation, 11:1691–1715.
 Crow, F. 1984. Summed-area tables for texture mapping. In Proceedings of SIGGRAPH,
18(3):207–212.
 Fleuret, F. and Geman, D. 2001. Coarse-to-fine face detection. Int. J. Computer Vision,
41:85–107.
 Freeman,W.T. and Adelson, E.H. 1991. The design and use of steerable filters. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 13(9):891–906.
 Freund, Y. and Schapire, R.E. 1995. A decision-theoretic generalization of on-line
learning and an application to boosting. In Computational Learning Theory: Eurocolt 95,
Springer-Verlag, pp. 23–37.
 Greenspan, H., Belongie, S., Goodman, R., Perona, P., Rakshit, S., and Anderson, C.
1994. Overcomplete steerable pyramid filters and rotation invariance. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition.
 Itti, L., Koch, C., and Niebur, E. 1998. A model of saliency-based visual attention for
rapid scene analysis. IEEE Patt. Anal. Mach. Intell., 20(11):1254–1259.
References
 John, G., Kohavi, R., and Pfleger, K. 1994. Irrelevant features and the subset selection
problem. In Machine Learning Conference Proceedings.
 Osuna, E., Freund, R., and Girosi, F. 1997a. Training support vector machines: An
application to face detection. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition.
 Osuna, E., Freund, R., and Girosi, F. 1997b. Training support vector machines: an
application to face detection. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition.
 Papageorgiou, C., Oren, M., and Poggio, T. 1998. A general framework for object
detection. In International Conference on Computer Vision.
 Quinlan, J. 1986. Induction of decision trees. Machine Learning, 1:81–106.
 Roth, D., Yang, M., and Ahuja, N. 2000. A snow based face detector. In Neural
Information Processing 12.
 Rowley, H., Baluja, S., and Kanade, T. 1998. Neural network-based face detection.
IEEE Patt. Anal. Mach. Intell., 20:22–38.
References
 Schapire, R.E., Freund, Y., Bartlett, P., and Lee, W.S. 1997. Boosting the margin: A new explanation
for the effectiveness of voting methods. In Proceedings of the Fourteenth International Conference
on Machine Learning.
 Schapire, R.E., Freund, Y., Bartlett, P., and Lee, W.S. 1998. Boosting the margin: A new explanation
for the effectiveness of voting methods. Ann. Stat., 26(5):1651–1686.
 Schneiderman, H. and Kanade, T. 2000. A statistical method for 3D object detection applied to
faces and cars. In International Conference on Computer Vision.
 Simard, P.Y., Bottou, L., Haffner, P., and LeCun, Y. 1999. Boxlets: A fast convolution algorithm
for signal processing and neural networks. In M. Kearns, S. Solla, and D. Cohn (Eds.), Advances in
Neural Information Processing Systems, vol. 11, pp. 571–577.
 Sung, K. and Poggio, T. 1998. Example-based learning for view based face detection. IEEE Patt.
Anal. Mach. Intell., 20:39–51.
 Tieu, K. and Viola, P. 2000. Boosting image retrieval. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition.
 Tsotsos, J., Culhane, S., Wai, W., Lai, Y., Davis, N., and Nuflo, F. 1995. Modeling visual attention
via selective tuning. Artificial Intelligence Journal, 78(1/2):507–545.
 Webb, A. 1999. Statistical Pattern Recognition. Oxford University Press: New York.
Thank You!