Analysis and Evaluation of SURF Descriptors for Automatic 3D Facial Expression Recognition Using Different Classifiers

Amal Azazi, Syaheerah Lebai Lutfi and Ibrahim Venkat
School of Computer Sciences, Universiti Sains Malaysia, Pulau Pinang, Malaysia
Email: aaaa11 com017@student.usm.my, {syaheerah, ibrahim}@cs.usm.my

Abstract—Emotion recognition plays a vital role in the field of Human-Computer Interaction (HCI). Among the visual human emotional cues, facial expressions are the most commonly used and most understandable cues. Different machine learning techniques have been utilized to solve the expression recognition problem; however, their relative performance is still disputed. In this paper, we investigate the capability of several classification techniques to discriminate between the six universal facial expressions using Speeded-Up Robust Features (SURF). The evaluation was conducted on the BU-3DFE database with four classifiers, namely Support Vector Machine (SVM), Neural Network (NN), k-Nearest Neighbors (k-NN), and Naïve Bayes (NB). Experimental results show that the SVM succeeded in discriminating between the six universal facial expressions with an overall recognition accuracy of 79.36%, which is significantly better than the nearest accuracy, achieved by Naïve Bayes, at significance level p < 0.05.

Keywords: Human-computer interaction; 3D facial expression recognition; Support Vector Machine; Neural Network; k-Nearest Neighbors; Naïve Bayes

I. INTRODUCTION

The human face is a substantial means of human interaction. People can communicate and understand others' emotional states through faces. Facial expressions are the most common and most understandable visual emotional cues. Even during speech, they contribute up to 55% of the speaker's effect [1]. People express their feelings all the time, even when they interact with machines. They can understand and interpret others' expressions effortlessly; machines, however, lack these skills. Infusing machines with the skill of understanding human facial expressions and adapting accordingly is an active new research area in the human-computer interaction (HCI) field. Machines with such skills can benefit a wide range of domains, such as improving learning environments [2], increasing doctor awareness in tele-home care programs [3], reducing road accident risks [4], and measuring customer satisfaction [5], to name a few.

Towards non-intrusive and realistic recognition applications, several conditions should be fulfilled: automation, person independence, and reasonable accuracy. Notably, it is difficult to achieve high recognition accuracy in a fully automatic system, since the ability to correctly locate the face and extract its features strongly affects the recognition accuracy.

The overall accuracy of a recognition system depends on several factors. Head pose and illumination changes reduce the accuracy when they occur. Recently, 3D face images have been utilized as an alternative to 2D face images, as 3D face images are robust against head pose and illumination changes. Moreover, the choice of facial feature representation and classification methods contributes substantially to the system accuracy. Although feature representation and classification are done sequentially in two different phases, they are highly influenced by each other [6]. Different classifiers perform distinctively with different feature representations, and the final performance of their combination is task-dependent.
Hence, investigating the optimal combination of feature representation and classification method is a fundamental issue in the facial expression recognition field.

In our previous work [7], we proposed a feature selection method that resulted in a set of 52 features as an optimal feature set that can discriminate between the six universal expressions (i.e., anger, disgust, fear, happiness, sadness, and surprise) in 3D textured face images with reasonable accuracy. Speeded-Up Robust Features (SURF) and a Radial Basis Function Support Vector Machine (RBF-SVM) were employed for the feature extraction and classification processes. In this paper, we continue our research to find the optimal combination of feature representation and classifier by evaluating the SURF descriptors of the proposed optimal features using several state-of-the-art classification schemes, namely Support Vector Machine (SVM), Neural Network (NN), k-Nearest Neighbors (k-NN), and Naïve Bayes (NB).

The rest of this paper is organized as follows. Section 2 reviews the related studies in the 3D facial expression recognition field. Section 3 introduces our proposed recognition system and the classification methods. The results and their discussion are presented in Section 4. Finally, Section 5 concludes our work and shares some of our future directions.

[Figure 1: The general overview of the proposed 3D facial expression recognition framework: pre-processing (3D-to-2D mapping), feature extraction (optimal feature localization and SURF feature descriptor extraction), and classification (SVM, NB, NN, k-NN) into anger, disgust, fear, happiness, sadness, or surprise.]

II. RELATED WORKS

In general, 3D facial expression recognition frameworks comprise three main phases: i) face pre-processing, ii) feature extraction, and iii) expression classification. Establishing an automatic and person-independent recognition system relies highly on the choice of the feature extraction method. Feature descriptors can be extracted with a global approach [8]-[10] or a local approach [11]-[13]. The global approach may assist the automation of the system, as no facial landmarks are required. However, considering the whole 3D face as one pattern may increase the feature descriptor dimensionality, which consequently affects computation time and storage. On the other hand, in the local approach, the most relevant information is extracted only from a set of located landmarks, which helps in reducing the dimensionality. Recently, selecting the optimal facial features has become a very common phase in expression recognition frameworks, aiming at reducing the dimensionality and improving the system accuracy simultaneously [14]-[16].

The classification outcome does not depend only on the classifier parameters, but also, to a large extent, on the feature descriptor attributes, i.e., the feature vector length/dimensionality, the descriptor type, and the number of samples [6]. In the context of facial expression recognition, SVM is the classifier most frequently used in the classification phase. However, various studies found that other classification methods outperformed SVM. For example, Wang et al. [17] found that the classification outcomes of a Quadratic Discriminant Classifier (QDC), Linear Discriminant Analysis (LDA), NB, and SVM vary notably using the same feature descriptors, the face primitive features. Among the four classification methods, LDA obtained the highest overall recognition accuracy, followed by SVM. However, Bartlett et al.
[18] reported that SVM outperformed LDA in their experiments and attributed its better performance to the fact that LDA may work better than SVM only when the class distributions are Gaussian. Other studies have also compared SVM to NB, NN, and k-NN. For example, Hupont et al. [19] stated that NB yielded better, yet comparable, accuracy than SVM and NN. Khan et al. [20] reported that SVM and k-NN outperformed NB. In short, we can conclude that there is no fixed optimal classifier for the expression recognition problem. Hence, in this paper, we investigate different classification methods with the same feature descriptors in an attempt to identify the best classification method for our proposed 3D expression recognition framework.

III. 3D FACIAL EXPRESSION RECOGNITION

Motivated by the high dimensionality problem inherent in the use of 3D face images, we proposed a fully automatic facial expression recognition framework that tackles this problem using conformal mapping and a feature selection process. Fig. 1 illustrates a general overview of the proposed framework, which consists of three main steps: 3D face pre-processing, feature extraction, and classification. In the pre-processing phase, all the 3D textured face images are mapped into the 2D plane using conformal mapping [21]. The mapping process generates 2D textured face images, as shown in Fig. 2, which are used as inputs to the feature extraction phase. The following two sections introduce the feature extraction and classification phases in detail.

A. Feature extraction

After mapping the 3D face images into the 2D plane, seven main landmarks (the eye corners, nose tip, and mouth corners) are first detected using the structured output SVM introduced by Uřičář et al. [22]. Then a set of 52 facial points is located based on the main landmarks, as shown in Fig. 1. The features of the 52 facial points are considered the optimal facial features in the textured mapped face images [7].

In Azazi et al. [7], we proposed a novel differential-evolution-based feature selection algorithm to identify the optimal facial feature set in the mapped face images.

[Figure 2: 3D face images of one subject from the BU-3DFE database with the six universal expressions (anger, disgust, fear, happiness, sadness, surprise). The first row presents the 3D face images and the second row presents their corresponding 2D mapped images.]

The optimal features could discriminate between the six universal facial expressions without redundancy and with a good recognition accuracy. The selection algorithm evaluated the nominated features based on the discriminative power of their SURF descriptors. As a result, a set of 52 features was identified as the optimal feature set that could discriminate reasonably between the universal facial expressions. In this paper, we utilize this optimal feature set in the feature extraction phase. The length of the SURF descriptor for each feature is 64. For each face image, the 52 descriptors are concatenated to form a single 52 × 64 = 3328-dimensional descriptor per image. These descriptors are then fed to the different classifiers for training and testing, as sketched below.
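To make this step concrete, the following is a minimal sketch of how the descriptor extraction could be implemented, assuming OpenCV built with the contrib xfeatures2d module (SURF is patented and absent from default OpenCV builds). The keypoint scale `keypoint_size` and the function name are illustrative assumptions, not values from our implementation.

```python
# Sketch: SURF descriptor extraction at pre-located facial points,
# assuming an OpenCV build that includes the contrib xfeatures2d module.
import cv2
import numpy as np

def extract_face_descriptor(gray_face, points, keypoint_size=16.0):
    """Describe a mapped 2D face image by one concatenated SURF vector.

    gray_face     -- 2D uint8 array: the conformally mapped face image.
    points        -- 52 (x, y) coordinates of the optimal facial features.
    keypoint_size -- assumed keypoint scale; not specified in the paper.
    """
    # extended=False selects the 64-dimensional SURF descriptor variant.
    surf = cv2.xfeatures2d.SURF_create(extended=False)
    keypoints = [cv2.KeyPoint(float(x), float(y), keypoint_size)
                 for (x, y) in points]
    # compute() describes the supplied keypoints instead of detecting new ones.
    _, descriptors = surf.compute(gray_face, keypoints)
    # Concatenate the 52 rows of 64 values into one 3328-dim vector.
    return np.asarray(descriptors, dtype=np.float32).reshape(-1)
```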
B. Classification

SVM has been widely used in most of the state-of-the-art facial expression recognition frameworks. However, there is still a dispute with regard to its performance in comparison to other classification methods. In this paper, we investigated the performance of SVM along with three of the most common classifiers, i.e., NN, NB, and k-NN.

1) Support Vector Machine (SVM): SVMs are well-known classification techniques that have been applied in a wide variety of applications [23]. Their basic idea is to map the training examples into a feature space in which every class is separated from the others by a hyperplane with some margin. Let $x_i$, $i = 1, \ldots, n$ be the feature vectors of the training set and $y_i$, $i = 1, \ldots, n$ be the corresponding labels, where $y_i \in \{L_j\}$, $j = 1, \ldots, C$, and $C$ is the number of classes. SVM defines the hyperplane as a function of a subset of the training examples, called support vectors, that lie on the boundaries between classes. The decision function is given by:

$$f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{m} y_i \alpha_i\, K(x_i, x) - b\Big) \qquad (1)$$

where $m$ is the number of support vectors; $\alpha_i \in \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$, with $\alpha_i > 0$ only for the support vectors $x_i$; $b$ is the bias parameter; and $K$ is the kernel function. The Radial Basis Function (RBF) is one of the kernel mapping tricks that has the ability to deal with high dimensionality with few parameters. The kernel function is given as follows:

$$K(x_i, x) = \exp(-\gamma \|x_i - x\|^2) \qquad (2)$$

where $\gamma > 0$ is the kernel parameter. Originally, SVM was invented for binary classification and was later extended to multi-class classification. The extension can be done using the one-against-all method or the one-against-one method. In this paper, the one-against-one method has been adopted.

2) Neural Network (NN): A Neural Network (NN) is a computational model that emulates the biological structure and functionality of the nervous system in the human brain [24]. It is modeled as a large number of interconnected neurons with defined sets of input and output neurons. Like the human brain, the network learns by being presented with examples, building experience that can later be generalized to new inputs. NNs have a great ability to learn complex relationships between inputs and outputs in a supervised or unsupervised manner. The Radial Basis Function network (RBF-NN) is one of the most popular NNs; it combines supervised and unsupervised learning methods for training the network. Its hidden layer consists of neurons that use kernel functions, such as the Gaussian, as activation functions. Based on the input and the weights associated with each neuron, the positions and function parameters of these neurons are defined in an unsupervised manner. This step is followed by a supervised learning method that sets the weights between the hidden layer and the output layer by least mean square error. The final output $y$ for the input $x$ can be expressed as:

$$y(x) = \sum_{i=1}^{h} w_i \exp\Big(-\frac{\|x - \mu_i\|^2}{2\sigma^2}\Big) \qquad (3)$$

where $h$ is the number of hidden neurons, $\mu_i$ is the $i$th center in the hidden layer, and $\sigma$ is the controlling parameter of the Gaussian function.

3) k-Nearest Neighbors (k-NN): k-NN is a similarity-based supervised classification method that has proven its efficiency in solving a wide range of classification problems despite its simplicity [25]. It is considered a lazy classifier, as there is no training process, which makes it faster to build than other classification methods. Given a training dataset $X = \{x_1, x_2, \ldots, x_n\}$, where each sample $x_i$ is assigned to a specific class $c_i$, the k-NN algorithm assumes that all the samples lie in an $m$-dimensional feature space, where $m$ is the length of the samples in $X$. For any test sample $t$, its $k$ nearest neighbors in the feature space are found, and their majority vote is taken as the predicted class. Various distance metrics can be used to find the nearest neighbors in the k-NN algorithm. City Block is one of the well-known metrics; it computes the distance $d$ between a pair of points as the sum of the absolute differences of the pair's coordinates:

$$d(x, t) = \sum_{i=1}^{m} |x_i - t_i| \qquad (4)$$
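As a minimal illustration of Eq. (4) and the majority-vote rule just described, the NumPy sketch below implements a City Block k-NN predictor; the value k = 5 is an arbitrary illustrative choice, not a tuned parameter from our experiments.

```python
# Minimal NumPy sketch of k-NN with the City Block distance of Eq. (4).
import numpy as np

def knn_predict(X_train, y_train, t, k=5):
    """Predict the class of test sample t by majority vote of its
    k City-Block-nearest training samples. k = 5 is an arbitrary choice."""
    # Eq. (4): d(x, t) = sum_i |x_i - t_i|, computed for all samples at once.
    distances = np.abs(X_train - t).sum(axis=1)
    nearest = np.argsort(distances)[:k]            # indices of the k nearest
    labels, votes = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(votes)]                # majority vote
```

Note that ties in the vote are broken here in favor of the label that sorts first; a production implementation might break ties by distance instead.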
4) Naïve Bayes: Naïve Bayes is a supervised probabilistic classification technique based on Bayes' theorem [26]. It is well known for its conditional independence assumption, under which all the model parameters, i.e., the features in the feature vector, are independent given the class value. Let $F = (f_1, f_2, \ldots, f_n)$ be a feature vector of length $n$ and $C = (c_1, c_2, \ldots, c_m)$ be the class vector of length $m$. The probability of $F$ belonging to the class $c_i$ is then given by:

$$P(c_i \mid F) = \frac{P(F \mid c_i)\, P(c_i)}{P(F)} \qquad (5)$$

The Naïve Bayes classification rule is then computed as follows:

$$f(F) = \arg\max_{C}\; P(C) \prod_{i=1}^{n} P(f_i \mid C) \qquad (6)$$
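Having described all four classifiers, the following hedged sketch shows how they could be compared with scikit-learn on the concatenated SURF descriptors under 10-fold cross-validation. The hyperparameter values are illustrative, not the tuned values from our experiments; the file names are hypothetical; and, since scikit-learn provides no RBF network, a standard multilayer perceptron (MLP) is substituted as a stand-in for the RBF-NN.

```python
# Hedged sketch: 10-fold cross-validation of the four classifiers on the
# concatenated SURF descriptors. Hyperparameters are illustrative, and an
# MLP is substituted for the RBF-NN, which scikit-learn does not provide.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# X: (n_samples, 3328) SURF descriptors; y: (n_samples,) expression labels.
X = np.load("surf_descriptors.npy")    # hypothetical file names
y = np.load("expression_labels.npy")

classifiers = {
    # SVC handles multi-class problems with the one-against-one scheme.
    "RBF-SVM": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "NB": GaussianNB(),
    "MLP (RBF-NN stand-in)": MLPClassifier(hidden_layer_sizes=(100,),
                                           max_iter=1000),
    # City Block distance is called "manhattan" in scikit-learn.
    "k-NN": KNeighborsClassifier(n_neighbors=5, metric="manhattan"),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```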
IV. EXPERIMENTAL RESULTS AND DISCUSSION

A. BU-3DFE database

The Binghamton University 3D Facial Expression (BU-3DFE) database [27] is one of the most popular databases for 3D face analysis. It contains the 3D face models of 100 subjects who posed the six universal facial expressions in four intensities, plus the neutral expression. The subjects are 44 males and 56 females of different origins and ages.

Previous studies in 3D facial expression recognition used the BU-3DFE database for evaluation with different settings. In this paper, we chose to run the experiment 100 independent times for generalization. In each experiment, we randomly selected 60 subjects and conducted a 10-fold cross-validation experiment. The overall accuracy is then computed as the average of the accuracies obtained over all 100 experiments.

B. Classification results and discussion

Using the extracted SURF descriptors of the optimal features, we trained and tested all four classifiers using the same experimental setup described above in Section A. Tables I to IV present the confusion matrices of all the classifiers. The rows present the predicted expressions and the columns present the actual expressions, in percentages.

Table I: The confusion matrix of expression recognition using RBF-SVM

           Angry   Disgust  Fear    Happy   Sad     Surprise
Angry      71.33    6.00     5.67    0      16.83    0
Disgust     6.33   81.67     6.17    2.33    1.17    1.33
Fear        1.50    5.00    63.33    6.00    4.50    4.17
Happy       0       1.50    11.33   90.83    1.50    0.50
Sad        20.50    4.33    11.00    0      75.67    0.67
Surprise    0.33    1.50     2.50    0.83    0.33   93.33

Table II: The confusion matrix of expression recognition using NB

           Angry   Disgust  Fear    Happy   Sad     Surprise
Angry      66.83    4.83     2.50    0.33   18.83    0.17
Disgust     7.83   78.17     8.83    3.33    3.17    1.33
Fear        2.17    5.50    58.33   13.83    8.83    4.00
Happy       2.00    4.33    13.83   81.67    1.67    1.00
Sad        21.17    2.33    12.00    0      67.00    1.00
Surprise    0       4.83     4.50    0.83    0.50   92.50

Table III: The confusion matrix of expression recognition using RBF-NN

           Angry   Disgust  Fear    Happy   Sad     Surprise
Angry      51.00    9.00     8.33    2.83   21.17    2.17
Disgust     7.67   59.83    10.33    4.83    2.17    1.67
Fear        2.83    6.83    43.67    7.83    5.00    3.83
Happy       5.17    9.50    14.17   78.17    3.67    1.67
Sad        32.50    9.67    17.17    5.50   66.67    3.67
Surprise    0.83    5.17     6.33    0.83    1.33   87.00

Table IV: The confusion matrix of expression recognition using k-NN

           Angry   Disgust  Fear    Happy   Sad     Surprise
Angry      51.50   11.67    10.00    3.33   18.00    2.50
Disgust     5.00   50.50     7.17    1.67    2.17    2.00
Fear        7.00   11.83    37.67   10.50    8.00    5.17
Happy       1.00    4.83    16.83   77.67    1.83    0.33
Sad        33.83   13.17    21.50    6.33   67.83    5.00
Surprise    1.67    8.00     6.83    0.50    2.17   85.00

Table V: The correct recognition accuracies of the six expressions using the different classifiers

           RBF-SVM    NB     RBF-NN   k-NN
Angry       71.33    66.83    65.17   51.50
Disgust     81.67    78.17    76.50   50.50
Fear        63.33    58.33    51.83   37.67
Happy       90.83    81.67    82.50   77.67
Sad         75.67    67.00    65.83   67.83
Surprise    93.33    92.50    87.50   85.00
Average     79.36    74.08    71.56   61.69

Table V compares the recognition accuracies of the six expressions using the different classification methods. From the table, we can notice the following:

• RBF-SVM yielded the highest recognition accuracy for the six expressions. Its overall recognition accuracy (79.36%) is statistically significantly better (p < 0.001) than that of the other classifiers. This could be attributed to the capability of SVM to learn well from a small dataset, as opposed to the other classifiers.
• NB yielded the second highest overall recognition accuracy, 74.08%. Its per-expression recognition accuracies, except for happiness, are better than the accuracies obtained by RBF-NN and k-NN. Similar to SVM, NB learns very well from a small training dataset. Moreover, to some extent, NB accuracy is directly proportional to the feature vector length; in our case, the feature vector length equals 52 × 64 (optimal features × descriptor length), which is not a low-dimensional feature vector. However, NB's conditional independence assumption may not be suitable for facial expression recognition, since the expression recognition process considers the relations between facial feature movements.
• RBF-NN yielded a lower recognition accuracy of 71.56%, as it requires a larger training dataset for better recognition accuracy.
• k-NN yielded the lowest accuracy, 61.69%, which could be because k-NN accuracy is inversely proportional to the feature vector length.
• All the classifiers learned best when discriminating surprise, followed by happiness. On the contrary, fear is the expression least recognized by all the classifiers. These results are consistent with the human ability to recognize similar emotions: humans can easily recognize happiness and surprise, but not fear [28].
• All the classifiers showed the same tendency to confuse certain pairs of expressions, such as anger-sadness, happiness-fear, and surprise-fear.

V. CONCLUSION AND FUTURE WORK

In this paper, we tackled the problem of 3D facial expression recognition using four state-of-the-art classification methods. Using conformal mapping, the 3D face images are first mapped into the 2D plane to reduce the dimensionality. The SURF descriptors of the optimal facial features are then extracted and sent for classification. The four classifiers are finally utilized to find the best classification method for the SURF descriptors of the optimal features. RBF-SVM significantly outperformed the other classifiers, followed by NB. In the future, we are going to explore more combinations of feature extraction and classification methods and conduct the evaluation using a different facial database.

REFERENCES

[1] J. Segal and J. Jaffe, The Language of Emotional Intelligence. McGraw-Hill Contemporary Learning, 2008.

[2] K. Bahreini, R. Nadolski, and W. Westera, "Towards multimodal emotion recognition in e-learning environments," Interactive Learning Environments, no. ahead-of-print, pp. 1–16, 2014.
[3] R. Khosla, M.-T. Chu, R. Kachouie, K. Yamada, F. Yoshihiro, and T. Yamaguchi, "Interactive multimodal social robot for improving quality of care of elderly in Australian nursing homes," in Proceedings of the 20th ACM International Conference on Multimedia. ACM, 2012, pp. 1173–1176.

[4] A. Kolli, A. Fasih, F. Al Machot, and K. Kyamakya, "Non-intrusive car driver's emotion recognition using thermal camera," in Joint 3rd Int'l Workshop on Nonlinear Dynamics and Synchronization (INDS) & 16th Int'l Symposium on Theoretical Electrical Engineering (ISTET). IEEE, 2011, pp. 1–5.

[5] N. M. Puccinelli, S. Motyka, and D. Grewal, "Can you trust a customer's expression? Insights into nonverbal communication in the retail context," Psychology & Marketing, vol. 27, no. 10, pp. 964–988, 2010.

[6] J. Yang and V. Honavar, "Feature subset selection using a genetic algorithm," in Feature Extraction, Construction and Selection. Springer, 1998, pp. 117–136.

[7] A. Azazi, S. L. Lutfi, and I. Venkat, "Identifying universal facial emotion markers for automatic 3D facial expression recognition," in International Conference on Computer & Information Sciences (ICCOINS 2014), Kuala Lumpur, Malaysia, Jun. 2014.

[8] I. Mpiperis, S. Malassiotis, and M. G. Strintzis, "Bilinear models for 3D face and facial expression recognition," IEEE Transactions on Information Forensics and Security, vol. 3, no. 3, pp. 498–511, 2008.

[9] B. Gong, Y. Wang, J. Liu, and X. Tang, "Automatic facial expression recognition on a single 3D face by exploring shape deformation," in Proceedings of the 17th ACM International Conference on Multimedia. ACM, 2009, pp. 569–572.

[10] P. Lemaire, M. Ardabilian, L. Chen, and M. Daoudi, "Fully automatic 3D facial expression recognition using differential mean curvature maps and histograms of oriented gradients," in 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). IEEE, 2013, pp. 1–7.

[11] X. Zhao, D. Huang, E. Dellandréa, and L. Chen, "Automatic 3D facial expression recognition based on a Bayesian belief net and a statistical facial feature model," in 20th International Conference on Pattern Recognition (ICPR). IEEE, 2010, pp. 3724–3727.

[12] P. Lemaire, B. Ben Amor, M. Ardabilian, L. Chen, and M. Daoudi, "Fully automatic 3D facial expression recognition using a region-based approach," in Proceedings of the 2011 Joint ACM Workshop on Human Gesture and Behavior Understanding. ACM, 2011, pp. 53–58.

[13] S. Berretti, B. B. Amor, M. Daoudi, and A. Del Bimbo, "3D facial expression recognition using SIFT descriptors of automatically detected keypoints," The Visual Computer, vol. 27, no. 11, pp. 1021–1036, 2011.

[14] U. Tekguc, H. Soyel, and H. Demirel, "Feature selection for person-independent 3D facial expression recognition using NSGA-II," in 24th International Symposium on Computer and Information Sciences (ISCIS). IEEE, 2009, pp. 35–38.

[15] H. Soyel, U. Tekguc, and H. Demirel, "Application of NSGA-II to feature selection for facial expression recognition," Computers & Electrical Engineering, vol. 37, no. 6, pp. 1232–1240, 2011.
[16] H. Rabiu, M. I. Saripan, S. Mashohor, and M. H. Marhaban, "3D facial expression recognition using maximum relevance minimum redundancy geometrical features," EURASIP Journal on Advances in Signal Processing, vol. 2012, no. 1, pp. 1–8, 2012.

[17] J. Wang, L. Yin, X. Wei, and Y. Sun, "3D facial expression recognition based on primitive surface feature distribution," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2. IEEE, 2006, pp. 1399–1406.

[18] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, "Recognizing facial expression: Machine learning and application to spontaneous behavior," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2. IEEE, 2005, pp. 568–573.

[19] I. Hupont, S. Baldassarri, R. Del Hoyo, and E. Cerezo, "Effective emotional classification combining facial classifiers and user assessment," in Articulated Motion and Deformable Objects. Springer, 2008, pp. 431–440.

[20] R. A. Khan, A. Meyer, H. Konik, and S. Bouakaz, "Framework for reliable, real-time facial expression recognition for low resolution images," Pattern Recognition Letters, vol. 34, no. 10, pp. 1159–1168, 2013.

[21] X. D. Gu and S.-T. Yau, Computational Conformal Geometry. International Press of Boston, 2008, vol. 3.

[22] M. Uřičář, V. Franc, and V. Hlaváč, "Detector of facial landmarks learned by the structured output SVM," in Proceedings of the International Conference on Computer Vision Theory and Applications, vol. 1. SciTePress, 2012, pp. 547–556.

[23] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag, 1995.

[24] P. D. Wasserman, Advanced Methods in Neural Computing. John Wiley & Sons, Inc., 1993.

[25] E. Fix and J. L. Hodges Jr., "Discriminatory analysis. Nonparametric discrimination: Consistency properties," DTIC Document, Tech. Rep., 1951.

[26] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in AAAI, vol. 90. Citeseer, 1992, pp. 223–228.

[27] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, "A 3D facial expression database for facial behavior research," in 7th International Conference on Automatic Face and Gesture Recognition. IEEE, 2006, pp. 211–216.

[28] A. Martinez and S. Du, "A model of the perception of facial expressions of emotion by humans: Research overview and perspectives," The Journal of Machine Learning Research, vol. 13, no. 1, pp. 1589–1608, 2012.