Analysis and Evaluation of SURF Descriptors for Automatic 3D Facial Expression
Recognition Using Different Classifiers
Amal Azazi, Syaheerah Lebai Lutfi and Ibrahim Venkat
School of Computer Sciences
Universiti Sains Malaysia
Pulau Pinang, Malaysia
Email: aaaa11 com017@student.usm.my, {syaheerah, ibrahim}@cs.usm.my
Abstract—Emotion recognition plays a vital role in the field
of Human-Computer Interaction (HCI). Among the visual human emotional cues, facial expressions are the most commonly
used and understandable cues. Different machine learning
techniques have been utilized to solve the expression recognition
problem; however, their performance is still disputed. In this
paper, we investigate the capability of several classification
techniques to discriminate between the six universal facial
expressions using Speed Up Robust Features (SURF). The
evaluation was conducted on the BU-3DFE database with
four classifiers, namely, Support Vector Machine (SVM), Neural
Network (NN), k-Nearest Neighbors (k-NN), and Naïve Bayes
(NB). Experimental results show that SVM was successful
in discriminating between the six universal facial expressions
with an overall recognition accuracy of 79.36%, which is
significantly better than the nearest accuracy, achieved by Naïve
Bayes, at significance level p < 0.05.
Keywords-Human-computer interaction; 3D facial expression recognition; Support Vector Machine; Neural Network; k-Nearest Neighbors; Naïve Bayes
I. INTRODUCTION
The human face is a substantial means of human interaction. People can communicate and understand others'
emotional states through faces. Facial expressions are the
most common and understandable visual emotional cues.
Even during speech, facial expressions contribute up to 55% of the
speaker's effect [1].
People express their feelings all the time, even when they
interact with machines. They can understand and interpret
others' expressions effortlessly; machines, however, lack these skills. Infusing machines with the skill of understanding human
facial expressions and adapting accordingly is an active new
research area in the human-computer interaction (HCI) field. We
can benefit from machines with such skills in a wide range
of domains, such as improving the learning environment [2],
increasing doctors' awareness in tele-home care programs
[3], reducing the risk of road accidents [4], and measuring
customer satisfaction [5], to name a few.
Towards unintrusive and realistic recognition applications,
several conditions should be fulfilled, such as automation,
person independence, and reasonable accuracy. Notably, it
is difficult to achieve high recognition accuracy in a fully
automatic system: the ability to correctly locate the face
and extract its features strongly affects the recognition
accuracy. The overall system accuracy also depends
on several other factors; head pose and illumination changes
reduce the accuracy when they occur. Recently, 3D face
images have been utilized as an alternative to 2D face
images, as 3D face images are robust against head pose and
illumination changes.
Moreover, the choice of facial feature representation and
classification method contributes substantially to the system
accuracy. Although feature representation and classification
are performed sequentially in two different phases, they strongly
influence each other [6]. Different classifiers perform
distinctively with different feature representations, and the
final performance of their combination is task-dependent.
Hence, investigating the optimal combination of feature
representation and classification method is a fundamental
issue in the facial expression recognition field.
In our previous work [7], we proposed a feature selection method that resulted in a set of 52 features as an
optimal feature set that can discriminate between the six
universal expressions (i.e., anger, disgust, fear, happiness,
sadness, and surprise) in 3D textured face images with
reasonable accuracy. Speed Up Robust Features (SURF) and a
Radial Basis Function Support Vector Machine (RBF-SVM)
were employed for the feature extraction and classification
processes. In this paper, we continue our research towards the
optimal combination of feature representation and classifier
by evaluating the SURF descriptors of the proposed optimal
features using several state-of-the-art classification schemes,
namely Support Vector Machine (SVM), Neural Network
(NN), k-Nearest Neighbors (k-NN), and Naïve Bayes (NB).
The rest of this paper is organized as follows. Section 2 reviews related studies in the 3D facial expression recognition
field. Section 3 introduces our proposed recognition system
and the classification methods. The results and their discussion
are presented in Section 4. Finally, Section 5 concludes our
work and shares some of our future directions.
Figure 1: The general overview of the proposed 3D facial expression recognition framework. Pre-processing: 3D-to-2D mapping of the 3D faces. Feature extraction: optimal feature localization and SURF feature descriptor extraction. Classification: SVM, NB, NN, and k-NN, yielding one of anger, disgust, fear, happiness, sadness, or surprise.
II. RELATED WORKS
In general, 3D facial expression recognition frameworks
comprise three main phases: i) face pre-processing, ii) feature extraction, and iii) expression classification. Establishing an automatic and person-independent recognition system
relies heavily on the choice of feature extraction method.
Feature descriptors can be extracted using a global approach [8]–[10] or a local approach [11]–[13]. The global
approach may assist the automation of the system, as no
facial landmarks are required. However, considering the
whole 3D face as one pattern may increase the feature
descriptor dimensionality, which in turn affects the
computation time and storage.
In the local approach, on the other hand, the most
relevant information is extracted only from a set of located
landmarks, which helps reduce the dimensionality.
Recently, selecting the optimal facial features has become a
common phase in expression recognition frameworks,
aiming to reduce the dimensionality and improve the
system accuracy simultaneously [14]–[16].
The classification outcome depends not only on
the classifier parameters but also, to a large degree, on
the feature descriptor attributes, i.e., the feature vector
length/dimensionality, the descriptor type, and the number
of samples [6]. In the context of facial expression
recognition, SVM is the classifier most frequently used in the classification phase. However, various studies found that other
classification methods outperformed SVM. For example,
Wang et al. [17] found that the classification outcomes of the
Quadratic Discriminant Classifier (QDC), Linear Discriminant Analysis (LDA), NB, and SVM vary notably on the
same feature descriptors (the face primitive features). Among
the four classification methods, LDA obtained the highest
overall recognition accuracy, followed by SVM. However,
Bartlett et al. [18] reported that SVM outperformed LDA
in their experiments, attributing the better performance to
the fact that LDA may work better than SVM only when
the class distributions are Gaussian.
Other studies have also compared SVM to NB, NN, and k-NN. For example, Hupont et al. [19] stated that NB yielded
better, yet comparable, accuracy relative to SVM and NN, while Khan et
al. [20] reported that SVM and k-NN outperformed NB.
In short, we can conclude that there is no single optimal
classifier for the expression recognition problem. Hence, in this
paper, we investigate different classification methods with
the same feature descriptors in an attempt to identify the
best classification method for our proposed 3D expression
recognition framework.
III. 3D FACIAL EXPRESSION RECOGNITION
Motivated by the high dimensionality problem inherent in
the use of 3D face images, we proposed a fully automatic
facial expression recognition framework that tackles this
problem using conformal mapping and a feature selection
process. Fig. 1 illustrates a general overview of the proposed
framework, which consists of three main steps: 3D face pre-processing, feature extraction, and classification.
In the pre-processing phase, all the 3D textured face
images are mapped onto the 2D plane using conformal
mapping [21]. The mapping process generates 2D textured
face images, as shown in Fig. 2, which are used as inputs
to the feature extraction phase. The following two subsections
introduce the feature extraction and classification phases in
detail.
A. Feature extraction
After mapping the 3D face images onto the 2D plane,
seven main landmarks (the eye corners, nose tip, and mouth
corners) are first detected using the structured output SVM
introduced by Uřičář et al. [22]. A set of 52 facial
points is then located based on these main landmarks, as shown
in Fig. 1. The features of the 52 facial points are considered
the optimal facial features in the textured mapped face
images [7].
In Azazi et al. [7], we proposed a novel differential-evolution-based feature selection algorithm to identify the optimal
facial feature set in the mapped face images.
Figure 2: 3D face images of one subject from the BU-3DFE database with the six universal expressions (anger, disgust, fear, happiness, sadness, and surprise). The first row presents the 3D face images; their corresponding 2D mapped images are presented in the second row.
The optimal features could discriminate between the six universal facial
expressions without redundancy and with good recognition
accuracy. The selection algorithm evaluated the nominated
features based on the discriminative power of their SURF
descriptors. As a result, a set of 52 features was
identified as the optimal feature set that could discriminate
reasonably well between the universal facial expressions.
In this paper, we utilized this optimal feature set in the
feature extraction phase. The SURF descriptor of each
feature has a length of 64. For each face image, the 52 descriptors
are concatenated to form a single descriptor per image.
These descriptors are then fed to the different classifiers for
the training and testing process.
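To make this step concrete, the following is a minimal sketch of the descriptor-building procedure, assuming OpenCV's contrib build (opencv-contrib-python) for SURF; the keypoint size is a hypothetical choice, as the paper does not report the SURF scale used.

import cv2
import numpy as np

def build_face_descriptor(gray_image, facial_points, patch_size=20):
    # gray_image   : 2D mapped face image (uint8, grayscale).
    # facial_points: iterable of 52 (x, y) coordinates from the
    #                landmark localization step.
    # patch_size   : keypoint diameter (a hypothetical choice).
    # SURF with extended=False yields the standard 64-dimensional descriptor.
    surf = cv2.xfeatures2d.SURF_create(extended=False)
    keypoints = [cv2.KeyPoint(float(x), float(y), patch_size)
                 for (x, y) in facial_points]
    # compute() describes the image at the given keypoints (no detection).
    _, descriptors = surf.compute(gray_image, keypoints)  # shape (52, 64)
    # Concatenate the 52 per-point descriptors into one 3328-dim vector.
    return descriptors.reshape(-1)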
B. Classification
SVM has been widely used in most state-of-the-art
facial expression recognition frameworks. However, there is
still dispute with regard to its performance in comparison
to other classification methods. In this paper, we investigated
the performance of SVM along with three of the most
common classifiers, i.e., NN, NB, and k-NN.
1) Support Vector Machine (SVM): SVMs are well-known
classification techniques that have been applied in a wide
variety of applications [23]. Their basic idea is to map the
training examples into a space in which every class can be
separated from the others by a hyperplane with some margin.
Let x_i, i = 1, ..., n, be the feature vectors of the training
set and y_i, i = 1, ..., n, the corresponding labels, where
y_i ∈ {L_j}, j = 1, ..., C, and C is the number of classes.
SVM defines the separating hyperplane as a function of a subset of the training
examples, called support vectors, that lie on the boundaries
between classes. The decision function is given by:
f(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i K(x_i, x) - b \right)    (1)
where m is the number of support vectors; α_i ∈ {α_1, α_2, ..., α_m}, with α_i > 0 only for the support vectors x_i; b is
the bias parameter; and K is the kernel function.
The Radial Basis Function (RBF) is one of the kernel mapping
tricks; it can deal with high dimensionality using few parameters. The kernel function is given as follows:
K(x_i, x) = \exp\left( -\gamma \|x_i - x\|^2 \right)    (2)
where γ > 0 is the kernel parameter.
Originally, SVM was invented for binary classification
and later extended to multi-class classification. The extension
can be done using either the one-against-all method or the one-against-one method. In this paper, the one-against-one method has been
adopted.
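As an illustration, here is a minimal sketch of this setup, assuming scikit-learn; C and gamma are illustrative defaults rather than the paper's tuned parameters, and the data below is synthetic.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3328))      # stand-in for the 52 x 64 SURF vectors
y = rng.integers(0, 6, size=120)      # six expression labels (0..5)

# scikit-learn's SVC trains one-against-one internally; 'ovo' also
# exposes the pairwise decision values used for voting.
clf = SVC(kernel='rbf', C=1.0, gamma='scale', decision_function_shape='ovo')
clf.fit(X, y)
print(clf.predict(X[:5]))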
2) Neural Network (NN): A Neural Network (NN) is a computational model that emulates the biological structure and
functionality of the nervous system in the human brain [24].
It is modeled as a large number of interconnected
neurons with defined sets of input and output neurons. Like the
human brain, the network learns by being presented with
examples, building up experience that can later be generalized
to new inputs. NNs have a great ability to learn
complex relationships between inputs and outputs in a
supervised or unsupervised manner.
The Radial Basis Function network (RBF-NN) is one of the
most popular NNs; it combines supervised and unsupervised learning methods to train the network. The hidden
layer consists of neurons that use a kernel function, such as the
Gaussian, as their activation function. The positions (centers) and
function parameters of these neurons are determined in an
unsupervised manner, based on the input data and the weights
associated with each neuron. This step is followed by a
supervised learning method that sets the weights between the
hidden layer and the output layer by minimizing the
least mean square error. The final output y for the input x
can be expressed as:
y(x) = \sum_{i} w_i \exp\left( -\frac{\|x - \mu_i\|^2}{2\sigma^2} \right)    (3)
where µ_i is the i-th center in the hidden layer, w_i is its associated output weight, and σ is the
controlling parameter of the Gaussian function.
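A minimal sketch of the two-stage training described above, assuming scikit-learn's KMeans for the unsupervised stage; the number of hidden neurons and sigma are illustrative choices, not the paper's settings.

import numpy as np
from sklearn.cluster import KMeans

def train_rbf_nn(X, y_onehot, n_hidden=30, sigma=1.0):
    # Unsupervised stage: place the Gaussian centers mu_i by clustering.
    centers = KMeans(n_clusters=n_hidden, n_init=10).fit(X).cluster_centers_
    # Hidden-layer activations: exp(-||x - mu_i||^2 / (2 sigma^2)).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-d2 / (2.0 * sigma ** 2))
    # Supervised stage: least-squares fit of the output weights, i.e.
    # the least-mean-square-error solution for the hidden-to-output layer.
    W, *_ = np.linalg.lstsq(H, y_onehot, rcond=None)
    return centers, W

def predict_rbf_nn(X, centers, W, sigma=1.0):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-d2 / (2.0 * sigma ** 2))
    return H.dot(W).argmax(axis=1)   # class with the largest output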
3) k-Nearest Neighbors (k-NN): k-NN is a similarity-based
supervised classification method that has proven efficient for a wide range of classification problems
despite its simplicity [25]. It is considered a lazy classifier,
as it involves no training process, which makes it quicker to set up than
other classification methods.
Given a training dataset X = {x_1, x_2, ..., x_n}, where each
sample x_i is assigned to a specific class c_i, the k-NN
algorithm assumes that all the samples lie in an m-dimensional
feature space, where m is the length of the samples in X. For
any test sample t, its k nearest neighbors in the feature space
are found, and their majority vote is taken as the predicted
class.
Various distance metrics can be used to find the nearest
neighbors in the k-NN algorithm. City Block is one of the well-known metrics; it computes the distance d between a pair of
points as the sum of the absolute differences of their coordinates:
d(x, t) = \sum_{i=1}^{m} |x_i - t_i|    (4)
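A minimal sketch of majority-vote k-NN with the City Block metric, assuming scikit-learn; k = 5 is an illustrative value, as the paper does not report its choice of k, and the data is synthetic.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3328))        # stand-in SURF feature vectors
y = rng.integers(0, 6, size=120)        # six expression labels

# metric='manhattan' is Eq. (4): the sum of absolute coordinate differences.
knn = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn.fit(X, y)                           # "lazy": this just stores the samples
print(knn.predict(X[:5]))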
4) Naïve Bayes: Naïve Bayes is a supervised probabilistic
classification technique based on Bayes' theorem [26].
It is well known for its conditional independence assumption,
which states that all the model parameters, or features in
the feature vector, are independent given the class value.
Let F = (f_1, f_2, ..., f_n) be a feature vector of length
n and C = (c_1, c_2, ..., c_m) the class vector of length
m. The probability of F belonging to the class c_i is
then given by:
P(c_i \mid F) = \frac{P(F \mid c_i)\, P(c_i)}{P(F)}    (5)
The Naïve Bayes classification rule is then computed as
follows:
f(F) = \arg\max_{C} P(C) \prod_{i=1}^{n} P(f_i \mid C)    (6)
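A minimal sketch of the decision rule in Eq. (6), assuming Gaussian class-conditional densities (scikit-learn's GaussianNB); the paper does not state which density model was used, so this choice is illustrative.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3328))        # stand-in SURF feature vectors
y = rng.integers(0, 6, size=120)        # six expression labels

nb = GaussianNB()
nb.fit(X, y)                            # estimates P(c) and P(f_i | c)
# predict() returns argmax_c P(c) * prod_i P(f_i | c), as in Eq. (6).
print(nb.predict(X[:5]))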
IV. EXPERIMENTAL RESULTS AND DISCUSSION
A. BU-3DFE database
The Binghamton University 3D Facial Expression (BU-3DFE) database [27] is one of the most popular databases
for 3D face analysis. It contains the 3D face models of 100
subjects who posed the six universal facial expressions at
four intensity levels, plus the neutral expression. The subjects are
44 males and 56 females of different origins and ages.
Previous studies in 3D facial expression recognition used
the BU-3DFE database for evaluation with different settings. In this paper, we chose to run the experiment 100
independent times for generalization. In each experiment,
we randomly selected 60 subjects and conducted a 10-fold
cross-validation experiment. The overall accuracy is then
computed as the average of the accuracies obtained over all
100 experiments.
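A minimal sketch of this protocol, assuming scikit-learn; load_descriptors_for is a hypothetical loader standing in for the BU-3DFE preprocessing and feature extraction pipeline, and the RBF-SVM hyperparameters are left at library defaults rather than the paper's tuned values.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def evaluate(load_descriptors_for, all_subjects, n_runs=100, seed=0):
    rng = np.random.default_rng(seed)
    run_accuracies = []
    for _ in range(n_runs):
        # Randomly draw 60 of the 100 subjects for this run.
        subjects = rng.choice(all_subjects, size=60, replace=False)
        X, y = load_descriptors_for(subjects)
        # 10-fold cross-validation; the run score is the mean fold accuracy.
        scores = cross_val_score(SVC(kernel='rbf'), X, y, cv=10)
        run_accuracies.append(scores.mean())
    # Overall accuracy: average over the independent runs.
    return float(np.mean(run_accuracies))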
B. Classification results and discussion
Using the extracted SURF descriptors of the optimal
features, we trained and tested all four classifiers using
the same experimental setup described in Section IV-A.
Table I to Table IV present the confusion matrices of all the
classifiers. The rows present the actual expressions and
the columns the predicted expressions, in percentages.
Table I: The confusion matrix of expression recognition using RBF-SVM

           Angry   Disgust  Fear    Happy   Sad     Surprise
Angry      71.33    6.33    1.50    0       20.50    0.33
Disgust     6.00   81.67    5.00    1.50     4.33    1.50
Fear        5.67    6.17   63.33   11.33    11.00    2.50
Happy       0       2.33    6.00   90.83     0       0.83
Sad        16.83    1.17    4.50    1.50    75.67    0.33
Surprise    0       1.33    4.17    0.50     0.67   93.33
Table II: The confusion matrix of expression recognition using NB

           Angry   Disgust  Fear    Happy   Sad     Surprise
Angry      66.83    7.83    2.17    2.00    21.17    0
Disgust     4.83   78.17    5.50    4.33     2.33    4.83
Fear        2.50    8.83   58.33   13.83    12.00    4.50
Happy       0.33    3.33   13.83   81.67     0       0.83
Sad        18.83    3.17    8.83    1.67    67.00    0.50
Surprise    0.17    1.33    4.00    1.00     1.00   92.50
Table III: The confusion matrix of expression recognition using RBF-NN

           Angry   Disgust  Fear    Happy   Sad     Surprise
Angry      51.00    7.67    2.83    5.17    32.50    0.83
Disgust     9.00   59.83    6.83    9.50     9.67    5.17
Fear        8.33   10.33   43.67   14.17    17.17    6.33
Happy       2.83    4.83    7.83   78.17     5.50    0.83
Sad        21.17    2.17    5.00    3.67    66.67    1.33
Surprise    2.17    1.67    3.83    1.67     3.67   87.00
Table IV: The confusion matrix of expression recognition using k-NN

           Angry   Disgust  Fear    Happy   Sad     Surprise
Angry      51.50    5.00    7.00    1.00    33.83    1.67
Disgust    11.67   50.50   11.83    4.83    13.17    8.00
Fear       10.00    7.17   37.67   16.83    21.50    6.83
Happy       3.33    1.67   10.50   77.67     6.33    0.50
Sad        18.00    2.17    8.00    1.83    67.83    2.17
Surprise    2.50    2.00    5.17    0.33     5.00   85.00
Table V: The correct recognition accuracies of the six expressions using the different classifiers

           Angry   Disgust  Fear    Happy   Sad     Surprise  Average
RBF-SVM    71.33   81.67   63.33   90.83   75.67   93.33     79.36
NB         66.83   78.17   58.33   81.67   67.00   92.50     74.08
RBF-NN     65.17   76.50   51.83   82.50   65.83   87.50     71.56
k-NN       51.50   50.50   37.67   77.67   67.83   85.00     61.69
Table V compares the recognition accuracies of the six
expressions using the different classification methods. From
the table, we can observe the following:
• RBF-SVM yielded the highest recognition accuracy for
all six expressions. Its overall recognition accuracy
(79.36%) is significantly better than that of the other
classifiers (p < 0.001; see the sketch after this list). This
could be attributed to the capability of SVM to learn well
from a small dataset, as opposed to the other classifiers.
• NB yielded the second highest overall recognition accuracy,
74.08%. Its per-expression recognition accuracies are better
than those obtained by RBF-NN and k-NN, except for
happiness (RBF-NN) and sadness (k-NN). Similar to SVM,
NB learns very well from a small training dataset. Moreover,
to some extent, NB accuracy is directly proportional to
the feature vector length; in our case, the feature vector
length equals 52 × 64 = 3328 (optimal features × descriptor
length), which is not a low-dimensional feature vector.
However, NB's conditional independence assumption may
not be suitable for facial expression recognition, since the
recognition process depends on the relations between the
facial feature movements.
• RBF-NN yielded a lower recognition accuracy of 71.56%,
as it requires a larger training dataset for better recognition
accuracy.
• k-NN yielded the lowest accuracy, 61.69%, which could
be because k-NN accuracy is inversely proportional to the
feature vector length.
• All the classifiers learned best when discriminating
surprise, followed by happiness. In contrast, fear was
the least recognized expression by all the classifiers.
These results are consistent with the human ability to
recognize similar emotions: humans can easily recognize
happiness and surprise, but not fear [28].
• All the classifiers showed the same tendency of confusing
certain pairs of expressions, such as anger-sadness,
happiness-fear, and surprise-fear.
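As an illustration of the significance comparison reported above, here is a minimal sketch; the paper does not name its statistical test, so a paired t-test over the 100 per-run accuracies is assumed purely for illustration.

from scipy import stats

def compare_classifiers(acc_svm, acc_nb, alpha=0.05):
    # acc_svm, acc_nb: per-run accuracies (length-100 sequences),
    # paired because both classifiers see the same runs.
    t_stat, p_value = stats.ttest_rel(acc_svm, acc_nb)
    return p_value < alpha, p_value   # significant at level alpha?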
V. CONCLUSION AND FUTURE WORK
In this paper, we tackled the problem of 3D facial
expression recognition using four state-of-the-art classification
methods. Using conformal mapping, the 3D face images are
first mapped onto the 2D plane to reduce the dimensionality.
The SURF descriptors of the optimal facial features are then
extracted and sent for classification. The four classifiers are
finally compared to find the best classification method
for the SURF descriptors of the optimal features. RBF-SVM
significantly outperformed the other classifiers, followed by NB.
In the future, we are going to explore more combinations
of feature extraction and classification methods and conduct
the evaluation using a different facial database.
REFERENCES
[1] J. Segal and J. Jaffe, The language of emotional intelligence. McGraw-Hill Contemporary Learning, 2008.
[2] K. Bahreini, R. Nadolski, and W. Westera, “Towards multimodal emotion recognition in e-learning environments,”
Interactive Learning Environments, no. ahead-of-print, pp. 1–
16, 2014.
[3] R. Khosla, M.-T. Chu, R. Kachouie, K. Yamada, F. Yoshihiro,
and T. Yamaguchi, "Interactive multimodal social robot for
improving quality of care of elderly in Australian nursing
homes," in Proceedings of the 20th ACM international conference on Multimedia. ACM, 2012, pp. 1173–1176.
[4] A. Kolli, A. Fasih, F. Al Machot, and K. Kyamakya, “Nonintrusive car driver’s emotion recognition using thermal camera,” in Joint 3rd Int’l Workshop on Nonlinear Dynamics
and Synchronization (INDS) & 16th Int’l Symposium on
Theoretical Electrical Engineering (ISTET). IEEE, 2011,
pp. 1–5.
[5] N. M. Puccinelli, S. Motyka, and D. Grewal, "Can you trust
a customer's expression? Insights into nonverbal communication in the retail context," Psychology & Marketing, vol. 27,
no. 10, pp. 964–988, 2010.
[6] J. Yang and V. Honavar, “Feature subset selection using a
genetic algorithm,” in Feature extraction, construction and
selection. Springer, 1998, pp. 117–136.
[7] A. Azazi, S. L. Lutfi, and I. Venkat, “Identifying universal facial emotion markers for automatic 3D facial expression recognition,” in International Conference on Computer
& Information Sciences (ICCOINS2014), Kuala Lumpur,
Malaysia, Jun. 2014.
[8] I. Mpiperis, S. Malassiotis, and M. G. Strintzis, “Bilinear
models for 3D face and facial expression recognition,” IEEE
Transactions on Information Forensics and Security, vol. 3,
no. 3, pp. 498–511, 2008.
[9] B. Gong, Y. Wang, J. Liu, and X. Tang, “Automatic facial
expression recognition on a single 3D face by exploring shape
deformation,” in Proceedings of the 17th ACM international
conference on Multimedia. ACM, 2009, pp. 569–572.
[10] P. Lemaire, M. Ardabilian, L. Chen, and M. Daoudi, “Fully
automatic 3D facial expression recognition using differential
mean curvature maps and histograms of oriented gradients,”
in 10th IEEE International Conference and Workshops on
Automatic Face and Gesture Recognition (FG). IEEE, 2013,
pp. 1–7.
[11] X. Zhao, D. Huang, E. Dellandréa, and L. Chen, “Automatic
3D facial expression recognition based on a bayesian belief
net and a statistical facial feature model,” in 20th International Conference on Pattern Recognition (ICPR). IEEE,
2010, pp. 3724–3727.
[12] P. Lemaire, B. Ben Amor, M. Ardabilian, L. Chen, and
M. Daoudi, “Fully automatic 3D facial expression recognition using a region-based approach,” in Proceedings of the
2011 joint ACM workshop on Human gesture and behavior
understanding. ACM, 2011, pp. 53–58.
[13] S. Berretti, B. B. Amor, M. Daoudi, and A. Del Bimbo, “3D
facial expression recognition using sift descriptors of automatically detected keypoints,” The Visual Computer, vol. 27,
no. 11, pp. 1021–1036, 2011.
[14] U. Tekguc, H. Soyel, and H. Demirel, “Feature selection for
person-independent 3D facial expression recognition using
NSGA-II,” in 24th International Symposium on Computer and
Information Sciences (ISCIS). IEEE, 2009, pp. 35–38.
[15] H. Soyel, U. Tekguc, and H. Demirel, “Application of NSGAII to feature selection for facial expression recognition,”
Computers & Electrical Engineering, vol. 37, no. 6, pp. 1232–
1240, 2011.
[16] H. Rabiu, M. I. Saripan, S. Mashohor, and M. H. Marhaban,
“3D facial expression recognition using maximum relevance
minimum redundancy geometrical features,” EURASIP Journal on Advances in Signal Processing, vol. 2012, no. 1, pp.
1–8, 2012.
[17] J. Wang, L. Yin, X. Wei, and Y. Sun, “3D facial expression
recognition based on primitive surface feature distribution,” in
IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, vol. 2. IEEE, 2006, pp. 1399–1406.
[18] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel,
and J. Movellan, “Recognizing facial expression: machine
learning and application to spontaneous behavior,” in IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2. IEEE, 2005, pp. 568–573.
[19] I. Hupont, S. Baldassarri, R. Del Hoyo, and E. Cerezo,
"Effective emotional classification combining facial classifiers
and user assessment," in Articulated Motion and Deformable
Objects. Springer, 2008, pp. 431–440.
[20] R. A. Khan, A. Meyer, H. Konik, and S. Bouakaz, "Framework for reliable, real-time facial expression recognition for
low resolution images," Pattern Recognition Letters, vol. 34,
no. 10, pp. 1159–1168, 2013.
[21] X. D. Gu and S.-T. Yau, Computational conformal geometry.
International Press of Boston, 2008, vol. 3.
[22] M. Uřičář, V. Franc, and V. Hlaváč, "Detector of facial landmarks learned by the structured output SVM," in Proceedings
of International Conference on Computer Vision Theory and
Applications, vol. 1. SciTePress, 2012, pp. 547–556.
[23] V. N. Vapnik, The nature of statistical learning theory. New
York, NY, USA: Springer-Verlag New York, Inc., 1995.
[24] P. D. Wasserman, Advanced methods in neural computing.
John Wiley & Sons, Inc., 1993.
[25] E. Fix and J. L. Hodges Jr., "Discriminatory analysis. Nonparametric discrimination: Consistency properties," DTIC
Document, Tech. Rep., 1951.
[26] P. Langley, W. Iba, and K. Thompson, “An analysis of
bayesian classifiers,” in AAAI, vol. 90. Citeseer, 1992, pp.
223–228.
[27] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3D
facial expression database for facial behavior research,” in
7th international conference on Automatic face and gesture
recognition. IEEE, 2006, pp. 211–216.
[28] A. Martinez and S. Du, “A model of the perception of facial
expressions of emotion by humans: Research overview and
perspectives,” The Journal of Machine Learning Research,
vol. 13, no. 1, pp. 1589–1608, 2012.