3.2. Multi-View Face Classification

Many face candidates are selected by the process
described in 3.1, but some of them are false positives.
Hence, face classification is required to find the true faces.
An SVM (Support Vector Machine [1]) is used as the
classifier in this stage. In our approach, the candidates are
first classified as face or nonface by a face classifier. A
view classifier is then applied to the positive results in
order to identify the view of each face. LIBSVM [2] is used
in the implementation.
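As a rough illustration of this two-stage pipeline, the sketch below uses scikit-learn's SVC (which wraps LIBSVM) in place of the LIBSVM tools themselves; extract_features() and the candidate list are placeholders for the features described later in this section and the detector output of 3.1:

```python
from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM

def classify_candidates(candidates, extract_features,
                        face_clf: SVC, view_clf: SVC):
    """Two-stage classification sketch (not the paper's code).

    Stage 1: face_clf (trained with label 1 = face, 0 = nonface)
    filters the candidates. Stage 2: view_clf assigns one of the
    five view classes to each accepted face."""
    faces = []
    for cand in candidates:
        x = extract_features(cand).reshape(1, -1)
        if face_clf.predict(x)[0] == 1:                   # stage 1
            faces.append((cand, view_clf.predict(x)[0]))  # stage 2
    return faces
```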
The training face data are collected from the web via
search engines. Keywords related to people (such as
"wedding", "movie star", and "birthday") are used to find
photos containing people, and the face regions are cropped
and resized to a resolution of 100×100 pixels. To keep the
face views balanced, 110 faces are collected for each of the
five views, for a total of 550 faces. These data are used
directly as the training data of the view classifier.
The training data of the face classifier, on the other
hand, are collected by applying the face detector and the
skin area threshold of 3.1 to photos that include people's
faces, and labeling the results as positive data (those which
are actually faces) and negative data (those which are not
faces but erroneously detected). This setting lets the face
classifier complement the weaknesses of the detection
procedure. Furthermore, since frontal faces outnumber
profile faces in typical photos, some training data of the
view classifier are added to the positive data of the face
classifier, so that the face views remain balanced in the
training data and the classifier can handle multi-view faces.
The collected training images contain the whole head,
not only the facial features from eyes to mouth, because we
want to use the hair information as a feature. Since the
training data are 100×100 images, the face candidates
selected in the previous stages are resized to 100×100
pixels before being sent to the classifiers.
For feature extraction, several types of features are
tried, and those with better cross-validation (CV) accuracy
are selected. In our final approach, three types of features
are used:
(a) Texture features: Gabor textures and phase congruency
[3]. The implementation of this feature type is based on the
code in [4].
(b) Skin area: we use the skin detector described in 3.1 [5]
to obtain a 100×100 binary skin map, then downsample it
to 20×20 as a feature vector. Each sample covers 5×5
pixels, and its feature value is the ratio of skin pixels
within it. This is an important feature for classification
because a face image must contain skin, and different views
of faces have different distributions of skin area.
(c) Block color histogram: the image is divided into 5×5
blocks, and an RGB color histogram with 4×4×4 bins is
constructed for each block. This feature plays a role similar
to the skin one; moreover, it captures hair information in
addition to skin distribution. (Features (b) and (c) are
sketched in code after this list.)
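A minimal sketch of features (b) and (c), assuming the 100×100 binary skin map and the 100×100 RGB image are already available as NumPy arrays (feature (a) follows Kovesi's published code [4] and is omitted here):

```python
import numpy as np

def skin_area_feature(skin_map):
    """(b) Downsample a 100x100 binary skin map to 20x20: for each
    5x5 block, take the ratio of skin pixels it contains."""
    blocks = skin_map.reshape(20, 5, 20, 5)
    return blocks.mean(axis=(1, 3)).ravel()      # 400 values in [0, 1]

def block_color_histogram(img):
    """(c) Divide a 100x100 RGB uint8 image into 5x5 blocks (20x20
    pixels each) and build a 4x4x4-bin RGB histogram per block."""
    feats = []
    for by in range(5):
        for bx in range(5):
            block = img[by*20:(by+1)*20, bx*20:(bx+1)*20]
            bins = (block // 64).reshape(-1, 3)  # quantize 0..255 to 4 levels
            hist = np.zeros((4, 4, 4))
            for r, g, b in bins:
                hist[r, g, b] += 1
            feats.append(hist.ravel() / bins.shape[0])
    return np.concatenate(feats)                 # 25 blocks * 64 bins = 1600
```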
The dimensionality of the features is reduced by a
feature selection algorithm. The quality of a feature f is
measured by formula (1), the signal-to-noise ratio [6]
between classes:
s2n(f) = |μ1 − μ2| / (σ1 + σ2)    (1)
where μ1, μ2 are the means of the feature values for the two
classes, and σ1, σ2 are their standard deviations. For the
multi-class case in view classification, this measure is
computed between all pairs of classes and then averaged.
The N features with the highest signal-to-noise ratios are
selected and used in training. For each type of feature,
different values of N are tried, and a parameter search of an
SVM with RBF kernel [1] is done for each N according to
5-fold CV accuracy. The best CV accuracy obtained by
each type of feature is shown in Table 1, together with the
corresponding value of N.
        Texture (172)   Skin (400)     Color (1600)
Face    97.09% (100)    87.45% (150)   96.73% (150)
View    77.45% (100)    77.27% (150)   76.18% (150)
Table 1. The best CV accuracy for the three types of features:
texture (172 features), skin area (400 features), and color
histogram (1600 features). The face classifier has two classes
(face/nonface), and the view classifier has five classes for the
five views. The number in brackets is the value of N used for
that CV accuracy.
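To make the selection-plus-tuning loop concrete, here is a minimal sketch for the two-class face case, assuming X is a samples-by-features NumPy matrix and y holds 0/1 labels; the grid ranges follow the suggestions in [1], and the multi-class view case would additionally average the score over all class pairs:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def s2n_scores(X, y):
    """Per-feature signal-to-noise ratio (1) between two classes."""
    a, b = X[y == 0], X[y == 1]
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / \
           (a.std(axis=0) + b.std(axis=0) + 1e-12)  # guard zero spread

def best_cv_accuracy(X, y, n):
    """Keep the N highest-scoring features, then grid-search an
    RBF-kernel SVM with 5-fold CV, as described in the text."""
    idx = np.argsort(s2n_scores(X, y))[::-1][:n]
    search = GridSearchCV(SVC(kernel="rbf"),
                          {"C": 2.0 ** np.arange(-5, 16, 2),
                           "gamma": 2.0 ** np.arange(-15, 4, 2)},
                          cv=5)
    search.fit(X[:, idx], y)
    return search.best_score_, search.best_params_
```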
According to the results, the skin area feature is good
for view classification but poor for face classification. The
reason is that false positives of the face detector usually
have colors similar to skin, which makes them harder to
reject. The texture and color histogram features work well
in both cases; hence the view classifier uses all three
feature types, while the face classifier uses only the texture
and color histogram ones.
The classifiers using the different feature types are
combined to form a better one. This is done by constructing
probabilistic models with LIBSVM, so that each classifier
can output a probability distribution for a given instance as
well as predict its label. The distributions are averaged over
all classifiers (two for face, three for view), and the label
with the highest probability is the final prediction for the
instance. This approach does raise the performance: the
resulting CV accuracies of the fusion models are 98.58%
and 87.96% for face and view classification, respectively.
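As a rough sketch of this fusion, assuming scikit-learn SVC models trained with probability=True (standing in for LIBSVM's probability outputs), with each classifier fed its own feature type:

```python
import numpy as np

def fused_predict(classifiers, feature_vectors):
    """Average the class-probability distributions of several
    classifiers (two for face, three for view), each fed its own
    feature vector, and return the label with the highest mean
    probability. Assumes all classifiers share the same label set,
    so their classes_ arrays line up."""
    probs = [clf.predict_proba(x.reshape(1, -1))[0]
             for clf, x in zip(classifiers, feature_vectors)]
    mean = np.mean(probs, axis=0)
    return classifiers[0].classes_[np.argmax(mean)]
```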
Although the classifiers can filter out most of the false
positives, one problem remains in the result. For one
person, the face detector may find many face candidates at
slightly different resolutions, and more than one of them
may be classified as a face. Since they represent the same
person, we merge them into a single result. If two nearby
face regions output by the face classifier overlap, with the
overlapped area greater than 50% of the smaller region,
they are considered to represent the same person, and only
the larger face region, along with its view, is kept.
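A minimal sketch of this merging rule, with faces represented as hypothetical (x, y, w, h, view) tuples:

```python
def merge_overlapping(faces):
    """Keep only the larger of any two face regions whose overlap
    exceeds 50% of the smaller region's area."""
    faces = sorted(faces, key=lambda f: f[2] * f[3], reverse=True)
    kept = []
    for x, y, w, h, view in faces:           # largest regions first
        duplicate = False
        for kx, ky, kw, kh, _ in kept:
            ox = max(0, min(x + w, kx + kw) - max(x, kx))
            oy = max(0, min(y + h, ky + kh) - max(y, ky))
            smaller = min(w * h, kw * kh)
            if ox * oy > 0.5 * smaller:
                duplicate = True             # same person; the larger
                break                        # region was already kept
        if not duplicate:
            kept.append((x, y, w, h, view))
    return kept
```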
As described above, the multi-view face classification
algorithm takes the face candidates produced in 3.1 as
input and finds the true faces among them, as well as the
views of those faces. The view information can be used in
photo thumbnailing with aesthetics in mind, which will be
described in Chapter 4.
4.1. Face Cropping
The thumbnails can be produced from the result of
multi-view face detection, since the aesthetic feeling may
depend on the view of the face in a photo. Figure 1 is an
example: a girl is looking at the flowers near her, so it is
better to include both the girl and the flowers in the
thumbnail. If the cropping were based only on head
position, the surrounding scene, which is important to the
image, might be cut away, resulting in an inferior
thumbnail.
The cropping of an image is based on the detected
views. If the detected face is frontal, the cropping area for
thumbnailing is centered on the face. For profile or
half-profile faces, the area is shifted left or right according
to the direction of the view, with profile faces causing a
larger shift than half-profile ones. If multiple faces are
detected in an image, each of them can be used to construct
an alternative thumbnail of that image.
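A minimal sketch of this view-dependent cropping; the view labels and shift fractions here are illustrative assumptions, not values from the paper:

```python
# Illustrative shift fractions (of the crop width); negative shifts the
# window left, toward the direction a left-looking face is facing.
# The exact amounts are assumptions, not the paper's values.
SHIFT = {"frontal": 0.0,
         "half-left": -0.15, "left": -0.30,   # profile shifts more
         "half-right": 0.15, "right": 0.30}

def crop_window(face_cx, face_cy, crop_w, crop_h, view, img_w, img_h):
    """Center the crop on the face, shift it toward the gaze direction
    according to the detected view, and clamp it to the image bounds."""
    cx = face_cx + SHIFT[view] * crop_w
    x0 = min(max(0.0, cx - crop_w / 2), img_w - crop_w)
    y0 = min(max(0.0, face_cy - crop_h / 2), img_h - crop_h)
    return int(x0), int(y0), crop_w, crop_h
```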
References
[1] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, "A practical
guide to support vector classification", Technical report,
Department of Computer Science and Information
Engineering, National Taiwan University, Taipei, 2003.
Available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[2] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for
support vector machines", 2001. Available at:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[3] P. D. Kovesi, "Image features from phase congruency",
Videre: Journal of Computer Vision Research, Vol. 1,
No. 3, pp. 1-26, 1999.
[4] P. D. Kovesi, "MATLAB and Octave Functions for
Computer Vision and Image Processing", School of
Computer Science & Software Engineering, The
University of Western Australia. Available at:
http://www.csse.uwa.edu.au/~pk/research/matlabfns/.
[5] R.-L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, "Face
detection in color images", IEEE Trans. Pattern Analysis
and Machine Intelligence, pp. 696-706, 2002.
[6] T. R. Golub et al., "Molecular classification of cancer:
Class discovery and class prediction by gene expression
monitoring", Science, Vol. 286, pp. 531-537, 1999.