IMAGE THUMBNAILING VIA MULTI-VIEW FACE DETECTION

CHIH-CHAU MA 1, YI-HSUAN YANG 2, WINSTON HSU 3

1 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
2 Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan
3 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan

E-mail: b91902082@ntu.edu.tw, affige@gmail.com, winston@csie.ntu.edu.tw

Abstract: We propose a multi-view face detection system for image thumbnailing. The problem is decomposed into face detection and multi-view face classification, organized as a hierarchical, simple-to-complex pipeline: an efficient face patch classifier and a skin color filter first select face candidates, which are then classified as face or non-face, and by view, using SVM classifiers fused over texture, skin area, and block color histogram features. Overlapping detections are merged, and the detected faces and their views support aesthetics-aware photo thumbnailing. The fusion models reach cross-validation accuracies of 98.58% for face classification and 87.96% for view classification.

Keywords: Multi-view face detection; view classification; support vector machine; skin color; image thumbnailing

1. Introduction

2. Related Works

3. Multi-view Face Detection

We decompose multi-view face detection into two sub-problems: face detection and multi-view face classification. For each input image, we first apply a face detector to select face candidates, and then perform multi-view face classification to decide whether each candidate is a face and, if so, which view it shows. This decomposition makes the two classifiers easier to optimize, since the useful features and parameters may differ greatly between the two sub-problems. To reduce the computational effort, we develop a hierarchical approach based on the simple-to-complex strategy [1]. The face detectors at the first and second stages are simple, yet sufficient to discard most of the non-faces. The remaining non-faces are removed by a classifier composed of several Support Vector Machines (SVMs) [8], which are known for their strong discriminating power. The relatively high computational burden of the SVMs is largely alleviated because typically only a few dozen face candidates remain to be classified. For view classification, we train another classifier that discriminates the five views of concern. The system diagram of the proposed method is shown in Figure 2; the details of each component are described in the following subsections.

Figure 2. System diagram of the proposed multi-view face detection method: face patch classification, skin color detection, face region shift, SVM face classification, SVM view classification, and face region merging.
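As a roadmap for the two subsections that follow, the sketch below outlines the staged pipeline of Figure 2 in Python. It is an illustration only: every stage function is a hypothetical placeholder for a component described in Sections 3.1 and 3.2, and only the ordering of the stages is taken from the system diagram.

```python
# Minimal sketch of the simple-to-complex pipeline of Figure 2.
# All stage functions are hypothetical placeholders.

def detect_multiview_faces(image,
                           detect_candidates,  # stage 1: fdlib-style patch detector
                           skin_filter,        # stage 2: skin-color thresholding
                           shift_region,       # enlarge + centroid shift
                           classify_face,      # SVM face/non-face classifier
                           classify_view,      # SVM five-view classifier
                           merge_regions):     # overlap-based merging
    """Run cheap, high-recall stages first and the SVMs last."""
    candidates = detect_candidates(image)          # many candidates, low precision
    candidates = skin_filter(image, candidates)    # drop candidates with little skin
    candidates = [shift_region(image, c) for c in candidates]
    faces = [c for c in candidates if classify_face(image, c)]
    views = [classify_view(image, f) for f in faces]
    return merge_regions(list(zip(faces, views)))
```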
3.1. Face Detection

Given an input image, we must detect the multi-view face regions in it. An efficient face detector is applied first, and its results are thresholded to remove some false alarms. The remaining results are enlarged and shifted to locate the head regions more precisely, and are then sent to the face classifier of Section 3.2 as face candidates.

As the first stage of face detection, we consider only the luminance part of the image and adopt the real-time face patch classifier fdlib [6] for its speed and efficiency. We tune the parameter of fdlib so that most faces can be detected. Along with the resulting high recall rate, we also obtain more false alarms and therefore a low precision rate. This is nevertheless favorable, since the face candidates have been greatly reduced in comparison with an exhaustive search, and the precision rate can be improved in the following stages. Figure 3 shows a sample result of fdlib.

Figure 3. Sample results of the face patch classifier fdlib. Red squares represent detected face candidates. Left: low threshold (high precision, low recall); right: high threshold (low precision, high recall).

Next comes the chromatic part. We compute a binary skin color map for each image with the skin color detection method proposed in [7], and discard face candidates whose proportion of skin-tone pixels does not exceed a pre-defined threshold δc. The threshold is set adaptively according to the proportion of skin-tone pixels in the whole image (denoted r):

δc = max(0.25, min(0.5, 2r)).  (1)

Two sample results are shown in Figure 4; the skin color detector removes most of the false alarms in the detection results. A sketch of this filtering step is given at the end of this subsection.

Figure 4. Sample results of the skin color detector. Red squares: face candidates of the first stage; green squares: face candidates retained by the skin color detector. The precision rate is greatly improved.

Because a face candidate from the first two stages encompasses only the face region, i.e. the area from the eyes to the mouth, we shift and scale it to include the whole head region, so that we can exploit information such as the distribution of hair, which is a useful feature for both face and view classification. The face candidates produced by the previous stages are square regions of various scales. For a face candidate with center (x, y) and side length l, we enlarge the face region to cover (x−l : x+l, y−2l/3 : y+4l/3). The proportions along the y-axis are unequal because a face usually lies low in the head area (below the hair and forehead). After scaling, the side length of the face region becomes 2l. Because a face candidate does not always lie at the center of the corresponding head area, owing to detection error (especially for profile faces), we compute the centroid of the skin-tone pixels in the enlarged face region and shift the region to center on that centroid. Two sample results are shown in Figure 5; the shifted face regions now cover almost the whole head areas.

Figure 5. Sample results of the face region shift algorithm. Blue squares represent the shifted face regions; note that they cover almost the whole head areas.
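The following is a minimal sketch of the second-stage skin filter, assuming a binary skin map as produced by the method of [7] and face candidates represented as hypothetical (x, y, l) center-and-side squares; the adaptive threshold follows Eq. (1).

```python
import numpy as np

def adaptive_skin_threshold(skin_map):
    """Eq. (1): delta_c = max(0.25, min(0.5, 2r)), with r the proportion
    of skin-tone pixels in the whole image (skin_map is binary, HxW)."""
    r = float(skin_map.mean())
    return max(0.25, min(0.5, 2.0 * r))

def filter_by_skin(skin_map, candidates):
    """Keep only candidates whose skin-pixel proportion exceeds delta_c.
    Candidates are assumed to be (x, y, l) squares: center and side length."""
    delta_c = adaptive_skin_threshold(skin_map)
    kept = []
    for x, y, l in candidates:
        top, left = max(0, y - l // 2), max(0, x - l // 2)
        patch = skin_map[top:y + l // 2, left:x + l // 2]
        if patch.size > 0 and float(patch.mean()) > delta_c:
            kept.append((x, y, l))
    return kept
```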
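The region enlargement and centroid shift can be sketched as follows. The coordinate conventions (top-left image origin, candidates as center-and-side squares, output as top-left corner and side) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def shift_face_region(skin_map, x, y, l):
    """Enlarge a candidate with center (x, y) and side l to the head region
    (x-l : x+l, y-2l/3 : y+4l/3), then re-center it on the centroid of the
    skin-tone pixels, as illustrated in Figure 5."""
    h, w = skin_map.shape
    x0, x1 = max(0, x - l), min(w, x + l)
    y0, y1 = max(0, y - 2 * l // 3), min(h, y + 4 * l // 3)
    ys, xs = np.nonzero(skin_map[y0:y1, x0:x1])
    side = 2 * l                                  # side length after scaling
    if len(xs) == 0:                              # no skin pixels: keep enlarged box
        return x0, y0, side
    cx, cy = x0 + int(xs.mean()), y0 + int(ys.mean())
    return cx - side // 2, cy - side // 2, side   # top-left corner and side
```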
3.2. Multi-View Face Classification

Many face candidates are selected by the process described in Section 3.1, but some of them are false alarms. Hence, face classification is required to find the true faces. In our approach, the candidates are first classified as face or non-face by a face classifier; a view classifier is then applied to the positive results to identify the views of the faces. Both classifiers are composed of several SVMs, and LIBSVM [9] is used in the implementation.

The training face data are collected from the web via search engines. Keywords related to people (such as "wedding", "movie star", and "birthday") are used to find photos in which people appear, and the face regions are cropped and resized to a resolution of 100×100 pixels. To keep the face views balanced, 110 faces are collected for each of the five views, for a total of 550 faces. These data are used directly as the training data of the view classifier. The training data of the face classifier, on the other hand, are collected by applying the face detector and the skin area threshold of Section 3.1 to photos containing people's faces, and labeling the results as positive data (actual faces) and negative data (erroneous detections). This setting lets the face classifier complement the weaknesses of the detection procedure. Furthermore, since frontal faces outnumber profile faces in general photos, some training data of the view classifier are added to the positive data of the face classifier, so that the views of faces remain balanced in the training data and the classifier works on multi-view faces. The training images and validation data are available on the web¹.

The collected training images contain the whole head, not only the facial features from the eyes to the mouth, because we want to use the hair information as features. Since the training data are 100×100 images, the face candidates selected in the previous stages are resized to 100×100 pixels before being sent to the classifiers.

For feature extraction, several kinds of features were tested, and those with better cross-validation (CV) accuracy were selected. Our final approach uses three feature sets:
(a) Texture features: Gabor textures and phase congruency [10]. The implementation of this feature type is based on the method in [11].
(b) Skin area: we use the skin detector described in Section 3.1 to compute a 100×100 binary skin map, and then downsample it to 20×20 as a feature vector. Each sample covers 5×5 pixels, and its feature value is the ratio of skin pixels within it. This is an important feature for classification, because a face image must contain skin, and different views of faces have different distributions of skin area.
(c) Block color histogram: the image is divided into 5×5 blocks, and an RGB color histogram with 4×4×4 bins is constructed for each block. This feature plays a role similar to the skin feature; moreover, it captures hair information in addition to the skin distribution. (Feature sets (b) and (c) are sketched in code at the end of this subsection.)

The dimension of each feature set is reduced by a feature selection algorithm. The quality of a feature f is measured by formula (2), the signal-to-noise ratio [12] of its values between two classes:

s2n(f) = |μ1 − μ2| / (σ1 + σ2),  (2)

where μ1 and μ2 are the means of the feature values for the two classes, and σ1 and σ2 are their standard deviations. In the multi-class case of view classification, this measure is computed between all pairs of classes and then averaged. The N features with the highest signal-to-noise ratios are selected and used in training. For each feature set, different values of N are tried, and a parameter search for an SVM with RBF kernel [8] is performed for each N according to 5-fold CV accuracy. More details are given in Section 5.2.

The SVM classifiers trained on the different feature sets are then combined into a better one. This is done by constructing probabilistic models with LIBSVM, so that each classifier outputs a probability distribution over the classes for a given instance as well as a predicted label. The distributions are averaged over the classifiers, and the label with the highest probability is the final prediction for the instance. In this combination, the view classifier uses all three feature sets, while the face classifier uses only the texture and color histogram sets; this choice follows the performance of the individual feature sets (see Section 5.2).
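The two block-based feature sets, (b) and (c), can be sketched as below. The 100×100 input size, the 5×5 cells, and the 25 blocks of 4×4×4 histograms follow the text; the 8-bit RGB quantization and per-block normalization are illustrative assumptions.

```python
import numpy as np

def skin_area_feature(skin_map):
    """Feature set (b): downsample a 100x100 binary skin map to 20x20 by
    taking the ratio of skin pixels in each 5x5 cell (400 values)."""
    blocks = skin_map.reshape(20, 5, 20, 5)
    return blocks.mean(axis=(1, 3)).ravel()

def block_color_histogram(rgb):
    """Feature set (c): divide a 100x100 8-bit RGB image into 5x5 blocks
    and build a 4x4x4 RGB histogram per block (25 * 64 = 1600 values)."""
    q = (rgb // 64).astype(int)                  # 4 bins per channel
    idx = q[..., 0] * 16 + q[..., 1] * 4 + q[..., 2]
    feats = []
    for by in range(5):
        for bx in range(5):
            block = idx[by * 20:(by + 1) * 20, bx * 20:(bx + 1) * 20]
            hist = np.bincount(block.ravel(), minlength=64)
            feats.append(hist / block.size)      # normalized histogram
    return np.concatenate(feats)
```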
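A sketch of the signal-to-noise feature selection of Eq. (2), averaging the per-pair scores in the multi-class case as described above; the small epsilon guarding against zero variance is an added assumption.

```python
import numpy as np
from itertools import combinations

def s2n_scores(values, labels):
    """Eq. (2) per feature: |mu1 - mu2| / (sigma1 + sigma2), averaged over
    all class pairs in the multi-class case.  values: samples x features."""
    classes = np.unique(labels)
    pairs = list(combinations(classes, 2))
    scores = np.zeros(values.shape[1])
    for c1, c2 in pairs:
        a, b = values[labels == c1], values[labels == c2]
        scores += np.abs(a.mean(0) - b.mean(0)) / (a.std(0) + b.std(0) + 1e-12)
    return scores / len(pairs)

def select_top_n(values, labels, n):
    """Keep the indices of the N features with the highest scores."""
    return np.argsort(s2n_scores(values, labels))[::-1][:n]
```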
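The fusion step reduces to averaging probability distributions. In the sketch below, each prob_fn is a hypothetical wrapper around one probabilistic LIBSVM model trained on a single feature set; the representation of the per-set feature vectors is likewise assumed.

```python
import numpy as np

def fuse_predict(prob_fns, features_per_set):
    """Average the class-probability distributions of the per-feature-set
    SVMs and predict the label with the highest mean probability."""
    probs = np.mean([fn(x) for fn, x in zip(prob_fns, features_per_set)],
                    axis=0)
    return int(np.argmax(probs)), probs
```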
Although the classifiers filter out most of the false positives, one problem remains in the result. For a single person, the face detector may find several face candidates at slightly different resolutions, and more than one of them may be classified as a face. Since they represent the same person, we merge them into a single result. If two nearby face regions returned by the face classifier overlap, and the overlapped area is greater than 50% of the smaller region, they are considered to represent the same person; only the larger face region is then kept, together with its view. Figure 6 shows a sample result of this merging algorithm.

Figure 6. A sample result of the face region merging algorithm. Left: before merging, red squares are detected faces with view categories and blue squares are detected non-faces; right: after merging, the green squares are merged into the largest red square, and the view is set according to that square.
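A sketch of the merging rule, assuming detections are hypothetical ((x0, y0, side), view) tuples with top-left coordinates; visiting regions from largest to smallest implements "keep the larger region" directly.

```python
def merge_detections(detections):
    """Merge overlapping detections: if the overlap exceeds 50% of the
    smaller region, keep only the larger region and its view."""
    def area(r):
        return r[2] * r[2]

    def overlap(a, b):
        ox = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        oy = max(0, min(a[1] + a[2], b[1] + b[2]) - max(a[1], b[1]))
        return ox * oy

    kept = []                      # visit larger regions first, so a smaller
    for det in sorted(detections,  # region overlapping a kept one is dropped
                      key=lambda d: area(d[0]), reverse=True):
        if all(overlap(det[0], k[0]) < 0.5 * area(det[0]) for k in kept):
            kept.append(det)
    return kept
```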
As described above, the multi-view face classification algorithm takes the face candidates produced in Section 3.1 as input and finds the true faces among them, as well as the views of those faces. The view information can be used in photo thumbnailing with aesthetics in mind, as described in Section 4.

4. Photo Thumbnailing

5. Experimental Result

5.1. Efficiency

5.2. Accuracy

The best CV accuracy obtained with each feature type is shown in Table 1, together with the value of N that achieves it.

Table 1. Best CV accuracy for the three feature types: texture (172 features), skin area (400 features), and color histogram (1600 features). The face classifier has two classes (face/non-face); the view classifier has five classes for the five views. The number in brackets is the value of N used for that CV accuracy.

       Texture (172)   Skin (400)     Color (1600)
Face   97.09% (100)    87.45% (150)   96.73% (150)
View   77.45% (100)    77.27% (150)   76.18% (150)

According to these results, the skin area feature is good for view classification but poor for face classification. The reason is that the false positives of the face detector usually have colors similar to skin, which makes them hard to separate with this feature. The texture and color histogram features work well in both cases. The fusion approach does raise the performance: the resulting CV accuracies of the fusion models are 98.58% and 87.96% for face and view classification, respectively.

6. Conclusion

References

[1] Z.-Q. Zhang, L. Zhu, S.-Z. Li, and H.-J. Zhang, "Real-time multi-view face detection," Proc. Int. Conf. Automatic Face and Gesture Recognition, pp. 142–147, 2002.
[2] Y. Li, S. Gong, J. Sherrah, and H. Liddell, "Support vector machine based multi-view face detection and recognition," Image and Vision Computing, pp. 413–427, 2004.
[3] P. Wang and Q. Ji, "Multi-view face detection under complex scene based on combined SVMs," Proc. Int. Conf. Pattern Recognition, 2004.
[4] K.-S. Huang and M. M. Trivedi, "Real-time multi-view face detection and pose estimation in video stream," Proc. Int. Conf. Pattern Recognition, pp. 965–968, 2004.
[5] B. Ma, W. Zhang, S. Shan, X. Chen, and W. Gao, "Robust head pose estimation using LGBP," Proc. Int. Conf. Pattern Recognition, pp. 512–515, 2006.
[6] W. Kienzle, G. Bakir, M. Franz, and B. Schölkopf, "Face detection - efficient and rank deficient," Advances in Neural Information Processing Systems 17, pp. 673–680, 2005.
[7] R.-L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, "Face detection in color images," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 696–706, 2002.
[8] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, "A practical guide to support vector classification," Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 2003.
[9] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," 2001. Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[10] P. D. Kovesi, "Image features from phase congruency," Videre: Journal of Computer Vision Research, Vol. 1, No. 3, pp. 1–26, 1999.
[11] P. D. Kovesi, "MATLAB and Octave Functions for Computer Vision and Image Processing," School of Computer Science & Software Engineering, The University of Western Australia. Available at http://www.csse.uwa.edu.au/~pk/research/matlabfns/.
[12] T. R. Golub et al., "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science, Vol. 286, pp. 531–537, 1999.

¹ http://www.csie.ntu.edu.tw/~r95007/facedata.htm