Face Detection Using SURF Descriptor and SVM

Hariprasad E.N., Department of Computer Science & Engineering, Government Engineering College Thrissur, Kerala, India, hariprasaden@gmail.com
Jayasree M., Department of Computer Science & Engineering, Government Engineering College Thrissur, Kerala, India, mjayasree.arun@gmail.com

Abstract—Face detection is the necessary first step for most face analysis algorithms, for example face recognition. We present a novel method for detecting human faces in digital color images based on Speeded Up Robust Features (SURF) and Support Vector Machines (SVM). The system extracts a SURF descriptor from each detection window in the input image. The extracted feature is tested against an SVM classifier that was trained on SURF descriptors, and the window is classified as either a face or a non-face. All windows classified as faces are passed through a rule-based skin-color filter to remove false positives, which gives the final detection output. Only the SURF descriptor is used here; the SURF key-point detector is not. Recent research has shown SURF to be an efficient feature descriptor for face detection, and this ability is combined with an SVM classifier to build a practical face detector. A frontal face detector built with this approach was found to be very promising for further research.

Keywords—Face Detection; SURF; SVM; Speeded Up Robust Features; Skin Color Segmentation

I. INTRODUCTION

Face detection is one of the fundamental techniques that enable human-computer interaction and an indispensable first step for all automated face analysis algorithms. Given an arbitrary image, the goal of face detection is to determine whether or not there are any faces in the image and, if present, return the location and extent of each face [1]. The input of a face detection system is a digital image that may contain zero or more human faces of any size, appearing anywhere in the image. The output is the location and extent of each face in the input image in some manner, for instance a rectangle drawn around each detected face.

There are many challenges associated with building a face detection system [1], including facial expression, occlusion, pose, image orientation, and the presence or absence of structural components. Pose refers to out-of-plane rotation of a face, while orientation refers to in-plane rotation. Occlusion is the hiding of some facial features by other objects. The scale and number of faces present in the image are also not known in advance. These factors make it difficult to build a face detection system that unconditionally finds every face in an input image.

Most modern methods for face detection are appearance based. In appearance-based methods, face templates are learnt from data using statistical analysis and machine learning. Support Vector Machines (SVM) are one of the efficient methods used for such learning: a support vector machine constructs a hyper-plane or set of hyper-planes in a high-dimensional space, which can be used for classification, regression, or other tasks. Feature selection is a critical task in face detection. The important considerations are representational power and extraction speed: the extracted features should be able to generalize a human face, while their extraction should not slow down detection.
The Viola-Jones algorithm [6] seeded a number of promising boosted-cascade face detection schemes. Haar features are the basic feature of the Viola-Jones method, and many researchers have modified and improved these features [2]. There are still problems with Haar features, however. First, the feature is defined in the intensity space itself, which can cause problems under poor imaging conditions. Second, an enormous number of features must be extracted from a single detection window, which affects training speed. In this work, SURF features [3] are used for face detection. SURF is defined in gradient space and therefore copes better with varying lighting conditions, and only a few hundred features need to be extracted from a single detection window, which speeds up the overall feature extraction process.

This paper introduces a novel method for face detection that combines SURF features with an SVM. An SVM is trained on SURF descriptors of face and non-face samples, and the same SVM is used to detect faces. Additionally, human skin-color filtering is applied to reduce the number of false positives. The method is shown to be effective by building a frontal face detector.

II. RELATED WORKS

Based on the milestone work of Viola and Jones, a number of boosted-cascade detection systems have been introduced; most of them differ in the features extracted and the weak classifiers used. A recent work introduced SURF-cascade based face detection [4], in which SURF descriptors are used to train boosted logistic regression classifiers, achieving large improvements in training speed and detection accuracy. SURF and SVM are combined for face detection and face-component localization in [7]. In that method, the SURF key-point detector finds interest points automatically, the surrounding information is stored in SURF descriptors, and a trained SVM finally classifies each candidate as a face or a non-face. Face detection using skin-tone segmentation is realized in [8].
They introduced rules for identifying skin and non-skin pixels, and skin-colored blob regions are then used to detect faces. They showed that the use of color-model rules for skin-color detection is simple and effective.

III. PROPOSED SYSTEM

The face detection system consists of two phases: training and detection. In training, an SVM is trained with positive (face) and negative (non-face) samples to produce a structure file that distinguishes faces from non-faces. The detection phase is the actual testing of the system: a test image is given as input, and the system returns the location and extent of each face by drawing a rectangle around it. The proposed face detection system is shown in Fig. 1.

In detection, the input image is first preprocessed to remove noise. If the input is in color, it is converted into an 8-level grayscale image. After this, detection window pixels are selected; a detection window is a small subset of the original image in which exactly one human face is suspected. SURF features are extracted from the window pixels and given to a trained SVM for classification. If the input image is in color, human skin-color filtering is applied to remove false positives; otherwise the SVM output (face or non-face) is used directly. For a color image, if the detection window contains a large percentage of skin color it is reported as a face, otherwise as a non-face. This process is repeated for all detection windows, from top left to bottom right. A final decision step merges neighboring detections and detections contained within another. The final output is the original input image with rectangles drawn around each detected face.

Fig. 1. Proposed Face Detection System

The rest of this section explains the important components of the proposed system.

A. SURF Features

SURF (Speeded Up Robust Features) is a detector and descriptor of local scale- and rotation-invariant image features [3]. The descriptor has been applied successfully in many object recognition and image matching applications. This work does not need the interest point detection part of SURF, only the descriptor part, and computes the SURF descriptor in a manner similar to the SURF cascade papers [4][5]. The difference from the original SURF pipeline is that interest points are not detected; instead, the SURF descriptor is computed unconditionally over highly overlapping subregions of the detection window.

The SURF descriptor is defined in gradient space. Let dx be the horizontal gradient image obtained with the filter kernel [-1, 0, 1] and dy be the vertical gradient image obtained with the kernel [-1, 0, 1]^T. dx and dy are summed over each sub-region to form two entries of the feature vector. The sums of the absolute values |dx| and |dy| are also extracted to capture the polarity of the intensity changes, forming the next two entries. Thus each sub-region is described by a four-dimensional descriptor vector v = (∑dx, ∑dy, ∑|dx|, ∑|dy|).

A 40 × 40 detection window is densely sampled with this SURF descriptor. Within a detection window, overlapping local patches of variable size are considered: for a 40 × 40 window, patch sizes range from 12 × 12 to 40 × 40. Each patch is slid over the detection window with a fixed step of 4 pixels, and the patch size is also increased in 4-pixel steps, which ensures enough variability among local patches. The total number of patches generated from a single 40 × 40 detection window is 204. The configuration of patches and cells inside a detection window is given in Fig. 2: the largest rectangle is the detection window, and the 12 × 12 rectangle is one patch, which can grow up to the size of the detection window. Whatever its size, a patch is divided into 4 cells in a 2 × 2 configuration, so each local patch is represented by four of the 4-dimensional cell feature vectors described above. These are concatenated into a 4 × 2 × 2 = 16-dimensional patch feature, and all patch features are concatenated to form a 16 × 204 = 3264-dimensional window feature vector, which is finally L2-normalized. This is the final representation of a 40 × 40 detection window template.

Fig. 2. Patch and Cell Configuration inside a detection window

The integral image technique introduced by Viola and Jones [6] is used for fast feature extraction: using summed-area tables for dx, dy, |dx| and |dy|, the cell features are computed efficiently.
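To make the dense sampling concrete, the following is a minimal sketch of how the 3264-dimensional window descriptor described above could be computed. It is written in Python with NumPy purely for illustration (the paper's implementation is in MATLAB), and the names window_descriptor, integral and box_sum are illustrative, not taken from the original code.

import numpy as np

def integral(img):
    # Summed-area table with a zero first row/column so box sums are easy to index.
    return np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def box_sum(ii, y0, x0, y1, x1):
    # Sum of img[y0:y1, x0:x1] using its integral image ii.
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def window_descriptor(win):
    # SURF-style 3264-D descriptor of a 40x40 grayscale window.
    win = win.astype(np.float64)
    # Horizontal and vertical gradients with [-1, 0, 1] kernels.
    dx = np.zeros_like(win); dx[:, 1:-1] = win[:, 2:] - win[:, :-2]
    dy = np.zeros_like(win); dy[1:-1, :] = win[2:, :] - win[:-2, :]
    tables = [integral(g) for g in (dx, dy, np.abs(dx), np.abs(dy))]

    feats = []
    for size in range(12, 41, 4):                 # patch sizes 12x12 .. 40x40
        for y in range(0, 40 - size + 1, 4):      # 4-pixel sliding step
            for x in range(0, 40 - size + 1, 4):
                half = size // 2
                for cy in (y, y + half):          # 2x2 cells per patch
                    for cx in (x, x + half):
                        feats.extend(box_sum(t, cy, cx, cy + half, cx + half)
                                     for t in tables)
    v = np.asarray(feats)                         # 204 patches x 16 = 3264 entries
    return v / (np.linalg.norm(v) + 1e-12)        # L2 normalization

The nesting reproduces the counts given above: eight patch sizes with a 4-pixel stride yield 204 patches, each contributing 2 × 2 cells of 4 sums, i.e. 3264 entries before L2 normalization.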
B. SVM Training

The SVM is trained using positive and negative samples of size 40 × 40. Since a face is a rare event among all detection windows, more negative samples than positive samples are used for training. For each sample, the SURF descriptor is extracted and stored as a vector, and these descriptors together form a training matrix. The SVM is trained for face detection as a two-class classification problem: class 1 indicates a face window and class 0 indicates a non-face window. The SVM automatically builds a hyper-plane separating faces from non-faces, and this SVM is later used for classification.

C. Classification Process

A detection window scans the image from top left to bottom right at the given resolution. The SURF descriptor is extracted from each detection window and tested against the previously trained SVM. If the window is closer to a face than to a non-face, the SVM returns 1, indicating that the window contains a face; otherwise it returns 0. The locations of all detected faces, together with the current image resolution, are stored in a face-list structure. This process is repeated while reducing the image size, until the image becomes as small as the detection window. Neighboring detections are then merged: if two detections are close enough, or one is contained in the other, their average is stored as the location and extent of the face. After the skin color test, the final output of the detector is the input image annotated with rectangles around each face, showing the location and extent of each detection.

D. Skin Color Test

Each entry in the face-list is passed through a skin color test. The percentage of skin-colored pixels is calculated using a rule-based skin color extraction module; if the window passes the test, the face is retained as a detection, otherwise it is deleted from the face-list. The skin color segmentation algorithm employed in [8], called the RGB-HS-CbCr skin color model, is used for skin color extraction. The test image is analyzed in the RGB, HSV and YCbCr spaces, and four rules are used to detect skin color. In these rules, R, G and B are the red, green and blue components in RGB space, Cb and Cr are the blue-difference and red-difference chroma components in YCbCr space, and H and S are hue and saturation in HSV space. The numbers are the respective values in their usual units.

1) RGB rule for skin color under uniform daylight illumination:
(R > 50) & (G > 40) & (B > 20) & ((max(R, G, B) - min(R, G, B)) > 10) & (R - G ≥ 10) & (R > G) & (R > B)

2) RGB rule for skin color under flashlight or lateral daylight illumination:
(R > 220) & (G > 210) & (B > 170) & (|R - G| ≤ 15) & (R > B) & (G > B)

3) Cb-Cr rule:
(Cb ≥ 60) & (Cb ≤ 130) & (Cr ≥ 130) & (Cr ≤ 165)

4) H-S rule:
(H ≥ 0) & (H ≤ 50) & (S ≥ 0.1) & (S ≤ 0.9)

To apply these rules, each input color image is transformed into the RGB, HSV and YCbCr spaces and the rules are tested against each pixel of the candidate face window. If the majority of pixels are classified as skin (satisfying either rule 1 or rule 2, together with rules 3 and 4), the window is considered a face; otherwise it is a non-face. A non-face at this stage is treated as a false positive of the SVM and is removed from the face-list.
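The four rules translate directly into per-pixel boolean tests. The sketch below (Python with NumPy and OpenCV for the color conversions, illustrative rather than the authors' MATLAB code) uses the thresholds as listed above and combines them as (rule 1 or rule 2) and rule 3 and rule 4, following the RGB-HS-CbCr model of [8]; the skin fraction threshold min_skin_fraction is an assumed parameter.

import cv2
import numpy as np

def skin_fraction(window_bgr):
    # Fraction of pixels in a BGR face window that satisfy the skin rules.
    b, g, r = [c.astype(np.int32) for c in cv2.split(window_bgr)]

    hsv = cv2.cvtColor(window_bgr, cv2.COLOR_BGR2HSV)
    h = hsv[:, :, 0].astype(np.float32) * 2.0      # OpenCV hue is 0..179 (degrees / 2)
    s = hsv[:, :, 1].astype(np.float32) / 255.0    # saturation rescaled to 0..1

    ycrcb = cv2.cvtColor(window_bgr, cv2.COLOR_BGR2YCrCb)
    cr = ycrcb[:, :, 1].astype(np.int32)
    cb = ycrcb[:, :, 2].astype(np.int32)

    mx = np.maximum(np.maximum(r, g), b)
    mn = np.minimum(np.minimum(r, g), b)

    rule1 = (r > 50) & (g > 40) & (b > 20) & (mx - mn > 10) & (r - g >= 10) & (r > g) & (r > b)
    rule2 = (r > 220) & (g > 210) & (b > 170) & (np.abs(r - g) <= 15) & (r > b) & (g > b)
    rule3 = (cb >= 60) & (cb <= 130) & (cr >= 130) & (cr <= 165)
    rule4 = (h >= 0) & (h <= 50) & (s >= 0.1) & (s <= 0.9)

    skin = (rule1 | rule2) & rule3 & rule4
    return skin.mean()

def passes_skin_test(window_bgr, min_skin_fraction=0.5):
    # Keep a detection only if a majority of its pixels look like skin.
    return skin_fraction(window_bgr) >= min_skin_fraction

With such a helper, each entry of the face-list would be cropped from the original color image and retained only when passes_skin_test returns true.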
IV. IMPLEMENTATION DETAILS

Using the approach described above, we built a frontal face detector. The implementation was done in the MATLAB environment on an ordinary 64-bit PC running Linux (Ubuntu) with 3 GB of system memory. Frontal face images were collected from standard databases, including the GENKI dataset [9] and the FaceTracer dataset [10], as well as from the internet. All face images were cropped and scaled to 40 × 40. The SVM was trained with 7500 positive face samples and 28520 negative non-face samples. For the negative samples, images without any faces were used, and 40 × 40 templates were extracted from them. The SURF descriptor is extracted from each sample, and the vectors from all samples are combined into a training matrix. A group vector is also prepared to indicate which samples belong to the positive class and which to the negative class. An SVM is trained using the training matrix and the group vector; the output is a data structure that contains the information needed to classify a test input based on the training data. This SVM structure file is later used for detection. In the detection phase, the input image is divided into overlapping windows of size 40 × 40, the SURF descriptor is extracted from each window, and classification based on the previously stored structure file determines the presence or absence of a face in each detection window.
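A compact sketch of this training and detection flow is given below, again in Python for illustration, with scikit-learn's LinearSVC standing in for the MATLAB SVM training used by the authors. The window_descriptor helper from the earlier sketch, the pyramid scale factor, and the input lists of 40 × 40 grayscale windows are assumptions.

import numpy as np
import cv2
from sklearn.svm import LinearSVC

def train_face_svm(pos_windows, neg_windows):
    # Build the training matrix and group vector, then fit a two-class SVM.
    X = np.array([window_descriptor(w) for w in pos_windows + neg_windows])
    y = np.array([1] * len(pos_windows) + [0] * len(neg_windows))  # 1 = face, 0 = non-face
    svm = LinearSVC(C=1.0)       # plays the role of the stored SVM structure file
    svm.fit(X, y)
    return svm

def detect_faces(gray, svm, step=4, scale=0.8):
    # Multi-scale sliding-window detection with a 40x40 window.
    faces, s = [], 1.0
    img = gray.copy()
    while min(img.shape) >= 40:
        for y in range(0, img.shape[0] - 39, step):
            for x in range(0, img.shape[1] - 39, step):
                d = window_descriptor(img[y:y + 40, x:x + 40])
                if svm.predict(d.reshape(1, -1))[0] == 1:
                    # Store the detection in original-image coordinates.
                    faces.append((int(x / s), int(y / s), int(40 / s)))
        img = cv2.resize(img, None, fx=scale, fy=scale)
        s *= scale
    return faces   # merged and skin-filtered afterwards, as described above

In the actual system, the merging of neighboring detections and the skin color test described earlier are applied to the returned list before the final rectangles are drawn.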
V. TEST RESULTS

The face detector was tested using color images from standard datasets; images containing unoccluded frontal faces were considered for testing. In most of the images there were no false negatives, that is, all of the faces were correctly detected, but there was a small number of false positives. The false positives can be attributed to the fact that training was done on a personal computer with limited memory; if training were done on a workstation, the number of training windows, and in turn the detection accuracy, could be increased. Sample outputs for two images taken from the Bao dataset [11] are given in Fig. 3 and Fig. 4.

VI. CONCLUSIONS AND FUTURE WORK

A face detection system using SURF descriptors and an SVM, together with a skin color filter, has been proposed, and a frontal face detector was built using this approach. The SURF descriptor proved effective for representing facial features, and combined with an SVM it is effective for detecting human faces in natural photographs. The information contained in human skin color is used to filter out false positives, which is achieved simply and effectively using color-model based rules. Overall, the system has a high detection rate, and several research directions can be built on it.

As future work, this face detection system can be trained for profile faces along with frontal faces, and more training data will yield a more accurate detector. Another direction is to improve the SURF features by modifying the local patches. In this work, one SVM is trained on the whole detection window descriptor; instead, separate SVMs could be trained on groups of local patches and a voting scheme used to increase the detection rate under occlusion. Many future works can be initiated based on this concept.

Fig. 3. Sample Output 1 for an image from the Bao dataset [11]

Fig. 4. Sample Output 2 for an image from the Bao dataset [11]

REFERENCES

[1] M.-H. Yang, D. J. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," IEEE Trans. on PAMI, 2002.
[2] C. Zhang and Z. Zhang, "A survey of recent advances in face detection," Technical Report MSR-TR-2010-66, 2010.
[3] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," CVIU, 110:346-359, 2008.
[4] J. Li and Y. Zhang, "Face detection using SURF cascade," Proc. ICCV Workshops, 2011.
[5] J. Li and Y. Zhang, "Learning SURF cascade for fast and accurate object detection," Proc. CVPR, IEEE, 2013.
[6] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proc. CVPR, 2001.
[7] D. Kim and R. Dahyot, "Face components detection using SURF descriptors and SVMs," Proc. International Machine Vision and Image Processing Conference, IEEE, 2008.
[8] S. Thakur, S. Paul, A. Mondal, S. Das, and A. Abraham, "Face detection using skin tone segmentation," Proc. World Congress on Information and Communication Technologies, IEEE, 2011.
[9] "The MPLab GENKI Database," http://mplab.ucsd.edu.
[10] N. Kumar, P. N. Belhumeur, and S. K. Nayar, "FaceTracer: A search engine for large collections of images with faces," Proc. ECCV, 2008.
[11] "Bao Face Database," http://www.facedetection.com/downloads/BaoDataBase.zip.