Rapid Facial Expression Classification Using Artificial Neural Networks

Nathan Cantelmo
Northwestern University
2240 Campus Drive, Rm. 2-434
Evanston, IL, 60208
1-847-467-4682
n-cantelmo@northwestern.edu

ABSTRACT
Facial expression classification is a classic example of a problem that is relatively easy for humans to solve yet difficult for computers. In this paper, the author describes an artificial neural network (ANN) approach to the problem of rapid facial expression classification. Building on prior work in the field, the present approach achieved a 73.3% mean classification accuracy (across five trained nets) when compared with human annotators on an independent (never trained upon) testing set containing 30 grayscale images from the JAFFE facial expression data set (http://www.kasrl.org/jaffe.html).

Categories and Subject Descriptors
I.4.8 [Image Processing and Computer Vision]: Scene Analysis – object recognition, shape.

General Terms
Algorithms, Performance, Design, Reliability, Theory.

Keywords
ANNs, Artificial Neural Networks, Expression Classification, Computer Vision, Image Processing, Scene Analysis, Facial Recognition.

1. INTRODUCTION
From a social psychology standpoint, the ability of humans to correctly classify faces and facial expressions is a critical skill for effective face-to-face interactions. Recent work has indicated that the process itself is both rapid and automatic, and that people will even draw inferences about a person's personality traits from only a short (100 ms) exposure to a face [11]. But if the ability to recognize certain facial expressions is an important part of being human, it is an absolutely crucial part of being human-like. For example, consider a situation in which a human is surprised to see a human-like virtual agent appear on her computer screen. If the agent is able to detect the human's surprise, it will be better able to plan an appropriate greeting that explains its presence. On the other hand, if the agent can see that the computer user is visibly angry or annoyed, it may instead be best served by avoiding a face-to-face interaction altogether.

However, facial expression classification is a classic example of a problem that is relatively easy for humans to solve yet difficult for computers. Indeed, a number of factors make the implementation of an expression-recognition system a non-trivial task, including differences in facial features, lighting conditions, and poses, as well as partial occlusion and expression ambiguities [8], [12]. Earlier literature in the machine learning community speculated that the facial expression recognition process depends on a number of factors, including familiarity with the target face, experience with various expressions, attention paid to the face, and other non-visual cues [8]. More recent surveys, however, have indicated that two distinct approaches are commonly used, with varying levels of success, to attack the problem.
The first of these classes of solutions involves geometric-based templates, which are applied to faces in order to identify common features. The second class of solutions focuses on deriving a stochastic model for facial expression recognition, often using multi-layer perceptron models [12].

In this paper, the author describes an artificial neural network (ANN) approach to the problem of facial expression recognition that is capable of rapidly classifying images into one of six basic expression categories – happiness, sadness, surprise, anger, disgust, and fear. These categories are identical to those used by Zhang in his work on ANN-based expression classification [12], and reflect the six primary expressions identified by Ekman and Friesen [2]. Further, and perhaps more importantly, nearly all automatic expression classification systems used today rely on these six categories [6].

In order to test the proposed approach, the author developed an ANN similar to the one described in chapter four of Mitchell [5]. The completed classifier used a multi-layer perceptron model with a single hidden layer. For the purposes of this study, two versions of the system were trained and tested: one with five hidden units and a second with ten, both in a single layer. As in the Mitchell text, the author used the backpropagation algorithm (with a momentum coefficient of 0.3) to repeatedly cross-train the system over a portion of the data set.

2. METHOD
2.1 Training & Testing Data
In order to train and test the classification system, the author used images from the Japanese Female Facial Expression (JAFFE) database, located at http://www.kasrl.org/jaffe.html [3]. This particular dataset is free for academic use and includes 213 grayscale images, each 256x256px in size. The dataset includes facial images of ten different female models, each assuming seven distinct poses (six basic expressions and one neutral pose). For each model/pose combination, there are (on average) three different images. This results in around 21 images per model, or around 30 images per expression. Some resized sample images (with expression labels) from the JAFFE dataset are displayed below in Figure 1.

Figure 1: Sample Images from the JAFFE Database (panels labeled Happiness, Sadness, Anger, Surprise, Disgust, and Fear)

The JAFFE dataset has previously been used in similar work on automatic expression recognition [3], [12], which makes it especially appealing for this study. In particular, Zhang used the same images with an ANN approach that relied upon hand-placed facial geometry points and Gabor wavelet coefficients as inputs to a multi-layer ANN [12]. Also relevant to the present study, Lyons et al. obtained perceived expression recognition values from 60 female Japanese students on each of the 213 images in the JAFFE dataset [3]. For every image, semantic ratings for all six basic facial expressions [2] were collected using a five-point scale, with higher values indicating more of a particular expression. The six mean annotated values for each image were then calculated across all 60 students and reported. Notably, no values for expression neutrality were collected during this process. However, as the goal of the present work was to approximate the human expression recognition process, only the hand-annotated expression values (as opposed to the intended expression poses) were useful for training and evaluating the system. Thus, for the purposes of this experiment (and unlike in the Zhang study, which did not rely on hand-coded expression values), all 30 images containing neutral poses were discarded.

2.2 Implementation
As mentioned, the current work involves the use of an artificial neural network (ANN) for the task of facial expression classification. The general implementation of this system followed the example described in chapter four of Mitchell's Machine Learning text [5]. The specifics of each portion of the system are described below.

2.2.1 System Parameters
For this work, a consistent set of system parameters was selected and maintained throughout all stages of the evaluation. Most importantly, a learning rate of 0.3 was used for all training tasks. This value is argued for by Mitchell in chapter four of his text [5]. All versions of the system used a two-layer, feed-forward model with backpropagation through both layers. However, the size of the hidden layer was set to either five or ten units, in order to determine whether five units are sufficient to achieve the same accuracy level as a system with ten units. Results from both of these configurations are described in section 3.

2.2.2 Input Layer
Each input to the ANN used for this task was the intensity value averaged over a 32x32 pixel square of the source image. This method of image representation is identical to the approach described in chapter four of the Mitchell text [5].
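As an illustration of this input encoding, the sketch below tiles a 256x256 grayscale image into non-overlapping 32x32 squares and averages the intensities within each square, yielding 64 input values scaled into the 0.0-1.0 range. This is a minimal sketch rather than the implementation used in the study: the function name block_average_inputs, the use of NumPy, and the assumption of non-overlapping tiles are illustrative choices.

import numpy as np

def block_average_inputs(image, block=32):
    """Reduce a 256x256 grayscale image to coarse inputs by averaging the
    pixel intensities inside each non-overlapping block x block square.
    For block=32 this yields an 8x8 grid, i.e. 64 input values."""
    h, w = image.shape
    assert h % block == 0 and w % block == 0, "image must tile evenly"
    # Reshape so each 32x32 square occupies its own pair of axes, then average.
    tiles = image.reshape(h // block, block, w // block, block)
    means = tiles.mean(axis=(1, 3))
    # Scale 0-255 intensities into the 0.0-1.0 range used by sigmoid units.
    return (means / 255.0).ravel()

# Example with a random stand-in for one 256x256 JAFFE image.
fake_image = np.random.randint(0, 256, size=(256, 256)).astype(float)
inputs = block_average_inputs(fake_image)
print(inputs.shape)  # (64,)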
The choice of this specific multi-pixel segment size is supported by some older research into the minimum resolution needed to automatically detect human faces [9].

2.2.3 Hidden Units
As described in section 2.2.1, the number of hidden units was varied across two conditions. In condition 1, ten hidden units were used. In condition 2, five hidden units were used.

2.2.4 Output Units
The output units for the ANN were determined by the problem definition. For each of the six possible facial expression categories, one output unit was required. Each of these six units produced real values ranging from 0.0 to 1.0, which (after scaling) were comparable with the perceived facial expression values reported by the human evaluators.

2.3 Training
In order to produce generalized results, 30 randomly selected images were set aside and used only for evaluation purposes. Thus, the ANN was trained using the 150 remaining images from the JAFFE dataset. Each of the two system configurations under examination was trained five times over 20,000 epochs, and the mean classification accuracy across the runs was calculated and recorded every 50 epochs. This was done in order to avoid producing spurious training results due to a convenient initial randomization of the weights. At each epoch, the 150 training images were randomly divided into two groups. The first group, containing roughly 30% of the 150 images, was used to train the ANN. The second group, containing the other 70% of the images, was used to evaluate the updated system. The training algorithm used was the common backpropagation approach described in chapter four of Mitchell [5]. The only variation from the base algorithm was the use of a momentum coefficient (with a value of 0.3) during the weight update process [5].

2.4 Evaluation
As in other related work [12], the current system was evaluated using a relatively strict criterion for success. Specifically, the ANN was considered to have made a correct classification if and only if its highest estimated expression value corresponded to the expression with the highest mean value assigned by the human annotators.
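As a further illustration, the following minimal sketch shows one plausible realization of the network described above: a two-layer, feed-forward model with sigmoid units, a single hidden layer of five or ten units, six outputs in the 0.0-1.0 range, and stochastic backpropagation updates with a learning rate of 0.3 and a momentum coefficient of 0.3. The class name ExpressionNet, the 64-unit input size, and the small random weight-initialization range are assumptions made for illustration, not details drawn from the implementation itself.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ExpressionNet:
    """Illustrative two-layer feed-forward network: block-averaged intensity
    inputs, one hidden layer of sigmoid units, and six sigmoid outputs
    (one per basic expression), trained by backpropagation with momentum."""

    def __init__(self, n_in=64, n_hidden=10, n_out=6, lr=0.3, momentum=0.3, seed=0):
        rng = np.random.default_rng(seed)
        # Small random initial weights; the extra column holds the bias weight.
        self.W1 = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))
        self.W2 = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))
        self.V1 = np.zeros_like(self.W1)  # previous updates (momentum terms)
        self.V2 = np.zeros_like(self.W2)
        self.lr, self.momentum = lr, momentum

    def forward(self, x):
        xb = np.append(x, 1.0)             # append bias input
        h = sigmoid(self.W1 @ xb)          # hidden activations
        hb = np.append(h, 1.0)             # bias input for the output layer
        o = sigmoid(self.W2 @ hb)          # six expression outputs in [0, 1]
        return xb, h, hb, o

    def train_example(self, x, target):
        """One stochastic backpropagation step with momentum."""
        xb, h, hb, o = self.forward(x)
        # Error terms; the sigmoid derivative is y * (1 - y).
        delta_o = o * (1 - o) * (target - o)
        delta_h = h * (1 - h) * (self.W2[:, :-1].T @ delta_o)
        # Weight updates: learning-rate term plus momentum * previous update.
        self.V2 = self.lr * np.outer(delta_o, hb) + self.momentum * self.V2
        self.V1 = self.lr * np.outer(delta_h, xb) + self.momentum * self.V1
        self.W2 += self.V2
        self.W1 += self.V1

    def predict(self, x):
        return self.forward(x)[3]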
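Likewise, the per-epoch 30/70 split described in section 2.3 and the strict success criterion described above can be sketched as follows; the helper names are hypothetical, and mean_ratings is assumed to hold the six mean annotator values for a given image.

import numpy as np

def epoch_split(indices, rng, train_fraction=0.3):
    """Randomly split the 150 training-image indices at each epoch: roughly
    30% are used to update the weights, and the remaining 70% are used only
    to measure classification accuracy for that epoch."""
    shuffled = rng.permutation(indices)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def is_correct(net_outputs, mean_ratings):
    """Strict criterion: the classification counts as correct only if the
    network's highest-valued expression matches the expression with the
    highest mean rating from the human annotators; no partial credit."""
    return int(np.argmax(net_outputs)) == int(np.argmax(mean_ratings))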
No credit whatsoever was given for classification results that were close to correct, and no additional consideration was given to particularly hard cases (wherein the two highest-rated expressions differed only slightly).

Arguably, using such a strict criterion may seem a bit unreasonable, especially given the inherent ambiguity in many facial expressions. Indeed, a more sensible approach might have involved some element of confidence, or of relative error when compared with the human expression evaluations. However, the current approach was chosen in order to maintain consistency with related work, and thus must suffice for the present body of work.

3. RESULTS
Evaluation results for both of the system configurations are described below.

3.1 Configuration 1: N=10 Hidden Units
Results from the first system configuration are displayed in Figure 2 below. As shown, the present implementation was able to achieve a (generalized) mean classification accuracy of around 72% over the independent (testing) dataset when N=10 hidden perceptron units were used. Notably, the classification rate for the testing data stabilized after about 5,000 epochs and did not degrade, even as the accuracy over the training set continued to rise. This general trend held throughout the 20,000 epochs observed, indicating that overfitting was not a significant problem in this instance.

Figure 2: Classification Accuracy with N=10 Hidden Units

3.2 Configuration 2: N=5 Hidden Units
Results from the second system configuration are displayed in Figure 3 below. In this second training configuration, a much smaller hidden perceptron layer (N=5) was used. As shown, the mean classification accuracy for the system suffered as a result, stabilizing at around 65% over the independent (testing) dataset.

Figure 3: Classification Accuracy with N=5 Hidden Units

4. DISCUSSION
As described in the preceding section, configuration 1 was clearly shown to be preferable to configuration 2 for the task at hand, given the current approach. On the whole, the generalized classification results for configuration 1 were exactly in line with what other researchers have achieved using similar methods. Zhang, for instance, achieved a nearly identical outcome when training his system on a set of geometric points hand-placed on a facial expression image [12].

Unfortunately, while interesting, the generalized results for configuration 2 were ultimately less useful than the results for configuration 1. However, they do provide clear evidence in support of using larger hidden layers for ANN-based expression classification.

As mentioned in section 2.4, the meaningfulness of the reported classification accuracies is somewhat limited by the fact that they do not account for the perceived ambiguity present in many facial expression instances. Thus, future research in this area should take care to consider the potential applications when devising evaluation schemes.

5. CONCLUSION
In this paper, the author has described a stochastic solution to the problem of facial expression recognition using an artificial neural network. After motivating the problem, two configurations of the system were examined, each differing in the number of hidden units used (five vs. ten). After training and testing both configurations on a set of 180 grayscale images, the ten-unit configuration was shown to be the more effective classifier.
The final system was able to achieve a 73.3% mean classification success rate when compared with the classifications made by a group of 60 human annotators.

6. ACKNOWLEDGEMENTS
My thanks to ACM SIGCHI for allowing me to modify the templates they had developed. Also, I'd like to thank Francisco Iacobelli, Jonathan Sorg, and Bryan Pardo for their invaluable advice on neural network implementation, training, and evaluation techniques.

7. REFERENCES
[1] Chellappa, R., Wilson, C. L., & Sirohey, S. (1995). Human and machine recognition of faces: A survey. Proceedings of the IEEE, 83, 705-740.
[2] Ekman, P., & Friesen, W. V. (1977). Manual for the Facial Action Coding System. Palo Alto, CA: Consulting Psychologists Press.
[3] Lyons, M., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, IEEE Computer Society, 200-205.
[4] Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
[5] Mitchell, T. M. (1997). Machine Learning. New York, NY: McGraw-Hill Higher Education.
[6] Pantic, M., & Rothkrantz, J. M. (2000). Automatic analysis of facial expressions: The state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1424-1445.
[7] Rosenblatt, F. (1959). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386-408.
[8] Rumelhart, D., Widrow, B., & Lehr, M. (1994). The basic ideas in neural networks. Communications of the ACM, 37(3), 87-92.
[9] Samal, A. (1991). Minimum resolution for human face detection and identification. SPIE Human Vision, Visual Processing, and Digital Display II, 1453, 81-89.
[10] Samal, A., & Iyengar, P. (1992). Automatic recognition and analysis of human faces and facial expressions: A survey. Pattern Recognition, 25, 65-77.
[11] Willis, J., & Todorov, A. (2006). First impressions: Making up your mind after a 100-ms exposure to a face. Psychological Science, 17, 592-598.
[12] Zhang, Z. (1999). Feature-based facial expression recognition: Sensitivity analysis and experiments with a multilayer perceptron. International Journal of Pattern Recognition and Artificial Intelligence, 13(6), 893-911.