3.2. Multi-View Face Classification

Many face candidates are selected by the process described in 3.1, but some of them are false positives. Hence, face classification is required to identify the true faces. An SVM (Support Vector Machine [1]) is used as the classifier in this stage. In our approach, the candidates are first classified as face or nonface by a face classifier. A view classifier is then applied to the positive results in order to identify the different views of the faces. LIBSVM [2] is used in the implementation.

The training face data are collected from the web via search engines. Keywords related to people (such as “wedding”, “movie star”, “birthday”) are used to find photos in which people appear, and the face regions are cropped and resized to a resolution of 100×100 pixels. To keep the face views balanced, 110 faces are collected for each of the five views, for a total of 550 faces. These data are used directly as the training data of the view classifier. The training data of the face classifier, on the other hand, are collected by applying the face detector and the skin-area threshold of 3.1 to photos that contain people’s faces, and labeling the results as positive data (those which are actually faces) and negative data (those which are not faces but erroneously detected). This setting lets the face classifier compensate for the weaknesses of the detection procedure. Furthermore, since frontal faces are more common than profile faces in typical photos, some training data of the view classifier are added to the positive data of the face classifier, so that the face views remain balanced in the training data and the classifier can work on multi-view faces. The collected training images contain the whole head, not only the facial features from eyes to mouth, because we want to use hair information as a feature. Since the training data are 100×100 images, the face candidates selected in the previous stages are also resized to 100×100 pixels before being sent to the classifiers.

For feature extraction, several types of features are tried, and those with better cross-validation (CV) accuracy are selected. In our final approach, three types of features are used:

(a) Texture features: this includes Gabor textures and phase congruency [3]. The implementation of this feature type is based on the method in [4].

(b) Skin area: we use the skin detector described in 3.1 [5] to find a 100×100 binary skin map, and then downsample it to 20×20 as a feature vector. Each downsampled cell covers 5×5 pixels, and its feature value is the ratio of skin pixels within it. This is an important feature for classification because a face image must contain skin, and different views of faces have different distributions of skin area.

(c) Block color histogram: the image is divided into 5×5 blocks, and an RGB color histogram with 4×4×4 bins is constructed for each block. This feature plays a similar role to the skin-area feature. Moreover, it can capture hair information in addition to the skin distribution.

The dimensionality of the features is reduced by a feature selection algorithm. The quality of a feature f is measured by formula (1), the signal-to-noise ratio [6] between classes:

s2n(f) = |μ1 − μ2| / (σ1 + σ2)    (1)

where μ1, μ2 are the means of the feature values for the two classes, and σ1, σ2 are their standard deviations. For the multi-class case in view classification, this measure is computed between all pairs of classes and then averaged. The features with the N highest signal-to-noise ratios are selected and used in training.
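As an illustration of this selection step, the following sketch computes the signal-to-noise ratio of formula (1) for each feature and keeps the N best ones. It assumes the features are stored in a NumPy array; all names and the small epsilon added to the denominator are illustrative and not part of the original implementation.

```python
import numpy as np
from itertools import combinations

def s2n_scores(X, y):
    """Signal-to-noise ratio |mu1 - mu2| / (sigma1 + sigma2) of each feature
    for a two-class problem (labels 0 and 1), as in formula (1)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    sd0, sd1 = X0.std(axis=0), X1.std(axis=0)
    return np.abs(mu0 - mu1) / (sd0 + sd1 + 1e-12)   # epsilon avoids division by zero

def s2n_scores_multiclass(X, y):
    """For the multi-class (five-view) case: average the two-class scores
    over all pairs of classes."""
    pair_scores = []
    for a, b in combinations(np.unique(y), 2):
        mask = np.isin(y, (a, b))
        pair_scores.append(s2n_scores(X[mask], (y[mask] == b).astype(int)))
    return np.mean(pair_scores, axis=0)

def select_top_n(scores, n):
    """Indices of the N features with the highest signal-to-noise ratio."""
    return np.argsort(scores)[::-1][:n]

# e.g. keep the 150 best of the 1600 color-histogram features for the face classifier:
# selected = select_top_n(s2n_scores(X_color, labels), 150)
# X_reduced = X_color[:, selected]
```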
For each type of feature, different values of N are tried, and a parameter search for an SVM with RBF kernel [1] is performed for each N based on 5-fold CV accuracy. The best CV accuracy obtained by each feature type, together with the corresponding value of N, is shown in Table 1.

          Texture (172)    Skin (400)       Color (1600)
Face      97.09% (100)     87.45% (150)     96.73% (150)
View      77.45% (100)     77.27% (150)     76.18% (150)

Table 1. The best CV accuracy for the three types of features: texture (172 features), skin area (400 features), and color histogram (1600 features). The face classifier has two classes (face/nonface), and the view classifier has five classes for the five different views. The number in parentheses is the value of N that yielded that CV accuracy.

According to the results, the skin-area feature is good for view classification but poor for face classification. The reason is that false positives from the face detector usually have colors similar to skin, so they are harder to classify. The texture and color histogram features work well in both cases; hence the view classifier uses all three feature types, while the face classifier uses only the texture and color histogram features.

The classifiers using different types of features are combined to form a better one. This is done by constructing probabilistic models with LIBSVM, so that each classifier outputs a probability distribution for a given instance as well as a predicted label. The distributions are averaged over all classifiers (two for face, three for view), and the label with the highest probability is taken as the final prediction for an instance. This approach does raise the performance: the resulting CV accuracies of the fused models are 98.58% and 87.96% for face and view classification, respectively.

Although the classifiers can filter out most of the false positives, one problem remains in the result. For one person, many face candidates may be found at slightly different resolutions by the face detector, and more than one of them may be classified as a face. Since they represent the same person, we merge them into a single result. If two nearby face regions output by the face classifier overlap, and the overlapped area is greater than 50% of the smaller region, they are considered to represent the same person, and only the larger face region and its view are kept.

As described above, the multi-view face classification algorithm takes the face candidates produced by 3.1 as input and finds the true faces among them as well as the views of those faces. The view information can be used in photo thumbnailing for aesthetic considerations, which will be described in Chapter 4.

4.1. Face Cropping

The thumbnails can be produced from the result of multi-view face detection, as the aesthetic feeling may depend on the view of the face in a photo. Figure 1 is an example: a girl is watching the flowers near her, so it is more suitable to include both the girl and the flowers in the thumbnail. If the cropping is based only on head position, the surrounding scene, which is important to the image, may be cut off, resulting in an inferior thumbnail.

The cropping of an image is based on the detected views. If the detected face is frontal, the cropping area for thumbnailing is centered on the face. For profile or half-profile faces, the area is shifted left or right according to the direction of the view. Profile faces cause larger shifts than half-profile ones in this approach. If multiple faces are detected in an image, each of them can be used to construct a thumbnail of that image as an option.
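To make the view-dependent shift concrete, the sketch below places a fixed-size crop window around a detected face and shifts its center toward the direction the face is looking. The shift fractions and the view labels are illustrative assumptions; the paper only specifies that profile views shift the window more than half-profile views and that frontal faces keep the window centered on the face.

```python
def view_based_crop(img_w, img_h, face_box, view, crop_w, crop_h):
    """Return (left, top, width, height) of a thumbnail crop window.

    face_box -- (x, y, w, h) of the detected face region
    view     -- 'left_profile', 'left_half', 'frontal', 'right_half', 'right_profile'
                (here 'left' means the face looks toward the left side of the image)
    The shift fractions below are illustrative; the original work only states
    that profile faces shift the window more than half-profile faces.
    """
    shift_fraction = {
        'left_profile': -0.30, 'left_half': -0.15, 'frontal': 0.00,
        'right_half':    0.15, 'right_profile': 0.30,
    }[view]

    x, y, w, h = face_box
    center_x = x + w / 2 + shift_fraction * crop_w   # shift toward the gaze direction
    center_y = y + h / 2

    # Clamp the window so it stays inside the image.
    left = int(min(max(center_x - crop_w / 2, 0), max(img_w - crop_w, 0)))
    top = int(min(max(center_y - crop_h / 2, 0), max(img_h - crop_h, 0)))
    return left, top, crop_w, crop_h

# Example: a 400x300 thumbnail window for a frontal face in a 1024x768 photo
# left, top, cw, ch = view_based_crop(1024, 768, (450, 200, 100, 100), 'frontal', 400, 300)
```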
References

[1] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A practical guide to support vector classification”, Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 2003. Available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

[2] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines”, 2001. Available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

[3] P. D. Kovesi, “Image features from phase congruency”, Videre: Journal of Computer Vision Research, Vol. 1, No. 3, pp. 1-26, 1999.

[4] P. D. Kovesi, “MATLAB and Octave Functions for Computer Vision and Image Processing”, School of Computer Science & Software Engineering, The University of Western Australia. Available at: http://www.csse.uwa.edu.au/~pk/research/matlabfns/.

[5] R.-L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, “Face detection in color images”, IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 696-706, 2002.

[6] T. R. Golub et al., “Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring”, Science, Vol. 286, pp. 531-537, 1999.