Moving to continuous facial expression space using the MPEG-4 facial definition parameter (FDP) set

Kostas Karpouzis, Nicolas Tsapatsoulis and Stefanos Kollias
Department of Electrical and Computer Engineering
National Technical University of Athens
Heroon Polytechniou 9, 157 73 Zographou, Greece

ABSTRACT

Research in facial expression has concluded that at least six emotions conveyed by human faces are universally associated with distinct expressions: sadness, anger, joy, fear, disgust and surprise are categories of expressions that are recognizable across cultures. In this work we relate the description of the universal expressions to the MPEG-4 Facial Definition Parameter (FDP) set. We also investigate the relation between the movement of basic FDPs and the parameters that describe emotion-related words in classic psychological studies. In particular, Whissell suggested that emotions are points in a space that, to a first approximation, occupies two dimensions: activation and evaluation. We show that some of the MPEG-4 Facial Animation Parameters (FAPs), approximated by the motion of the corresponding FDPs, can be combined by means of a fuzzy rule system to estimate the activation parameter; in this way, variations of the six archetypal emotions can be achieved. Moreover, Plutchik concluded that emotion terms are unevenly distributed through the space defined by dimensions like Whissell's; instead they tend to form an approximately circular pattern, called the "emotion wheel", modeled using an angular measure. The emotion wheel can serve as a reference for creating intermediate expressions from the universal ones, by interpolating the movement of dominant FDP points between neighboring basic expressions. By exploiting the relation between the movement of the basic FDP points and the activation and angular parameters, we can model more emotions than the primary ones and achieve efficient recognition in video sequences.
Keywords: facial expression, MPEG-4 facial definition parameters, activation, emotion wheel, 3D expression synthesis

1. INTRODUCTION

Thorough research in facial expression has concluded that at least six emotions conveyed by human faces are universally associated with distinct expressions. In particular, sadness, anger, joy, fear, disgust and surprise are categories of facial expressions that are recognizable across cultures [1]. However, very few studies [4] in the computer science literature explore non-archetypal emotions. In contrast, psychological researchers have extensively investigated a broader variety of emotions [1, 2]. Although exploiting the results obtained by psychologists is far from straightforward, computer scientists can draw useful hints from them. In this work we combine some of the results obtained by Whissell [3] and Plutchik [2] with the MPEG-4 provisions for facial animation, to model facial expressions and related emotions. First, we form a relation between the facial anatomy description of the six universal facial expressions [6] and the MPEG-4 Facial Definition Parameter (FDP) set [5]. Figure 1 shows the FDP set, while Table 1 describes the basic expressions in Facial Animation Parameter (FAP) terminology. A small subset of the FDP set can be used to estimate the FAP set. This subset consists of points in the facial area that can be automatically detected and tracked. As illustrated in Table 3, the features used to describe the FAP set are distances between these protuberant points, some of which remain constant during the expressions and are used as reference points, as well as their time derivatives.
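As an illustration of how such features might be computed from tracked point coordinates, consider the normalized eyebrow distance f5 = s(3,8)/ENSo of Table 3. The sketch below is ours: the array layout (points indexed by FDP number), the use of `np.gradient` for the time derivative, and the apex-at-maximum rule are assumptions, not the paper's implementation.

```python
import numpy as np

def distance(points, i, j):
    """Euclidean distance s(i, j) between two tracked FDP points."""
    return float(np.linalg.norm(points[i] - points[j]))

def eyebrow_feature(points, enso):
    """f5 = s(3, 8) / ENSo: normalized inner left eyebrow height (cf. Table 3)."""
    return distance(points, 3, 8) / enso

def feature_series(frames, enso):
    """Per-frame feature values, their time derivatives and the apex frame.

    A positive derivative corresponds to a positive raise_l_i_eyebrow
    intensity; the frame of maximum displacement is taken as the
    expression's apex (our assumption).
    """
    values = np.array([eyebrow_feature(p, enso) for p in frames])
    derivatives = np.gradient(values)   # frame-to-frame df5/dt
    apex = int(np.argmax(values))
    return values, derivatives, apex
```

Here `frames` is a sequence of arrays holding the image coordinates of the tracked FDP points, and `enso` is the expression-invariant vertical reference distance ENSo used for normalization.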
Distances between the reference points are used for normalization, while time derivatives serve two different purposes: first, they define the positive intensities for the FAP set and, second, they characterize the development of the expressions and are used for marking the expressions' apex. In a second step we investigate the relation between the movement of basic FDPs and the parameters used in the description of emotion-related words [3]. Whissell suggested that emotions are points in a space with a relatively small number of dimensions which, to a first approximation, seem to occupy two: activation and evaluation. Activation is the degree of arousal associated with the term, with terms like patient (at 3.3) representing a midpoint, surprised (over 6) representing high activation, and bashful (around 2) representing low activation. Evaluation is the degree of pleasantness associated with the term, with guilty (at 1.1) representing the negative extreme and delighted (at 6.6) representing the positive extreme [4]. Figure 2 and Table 2 illustrate the strong relation between the strength of the movement of some FDP points and Whissell's activation dimension (the activation for the term "delighted" is 4.2, while for "joyful" it is 5.4). Finally, the "emotion wheel" [2] is taken as a reference for creating intermediate expressions from the universal ones: by interpolating the movement of dominant FDP points between neighboring basic expressions, we construct intermediate ones. The third column in Table 2 reflects Plutchik's observation that emotion terms are unevenly distributed through the space defined by dimensions like Whissell's; instead, they tend to form an approximately circular pattern called the "emotion wheel".

Correspondence: E-mail: kkarpou@image.ntua.gr; Telephone: +301 7722491; Fax: +301 7722492
The last column in Table 2 shows empirically derived positions on the circle for selected terms, according to Plutchik's study, using an angular measure in which the midline runs from acceptance (0) to disgust (180). By exploiting the relation between the movement of the basic FDP points of Table 1 and the activation and angular parameters of Table 2, we can model and animate many more expressions than the primary ones studied so far, and achieve more efficient recognition in video sequences.

Anger:    squeeze_l_eyebrow, squeeze_r_eyebrow, raise_u_midlip, raise_l_midlip
Sadness:  raise_l_i_eyebrow, raise_r_i_eyebrow, close_upper_l_eyelid, close_upper_r_eyelid, close_lower_l_eyelid, close_lower_r_eyelid
Joy:      close_upper_l_eyelid, close_upper_r_eyelid, close_lower_l_eyelid, close_lower_r_eyelid, stretch_l_cornerlip, stretch_r_cornerlip, raise_l_m_eyebrow, raise_r_m_eyebrow
Disgust:  close_upper_l_eyelid, close_upper_r_eyelid, close_lower_l_eyelid, close_lower_r_eyelid, raise_u_midlip
Fear:     raise_l_o_eyebrow, raise_r_o_eyebrow, raise_l_m_eyebrow, raise_r_m_eyebrow, raise_l_i_eyebrow, raise_r_i_eyebrow, squeeze_l_eyebrow, squeeze_r_eyebrow, open_jaw
Surprise: raise_l_o_eyebrow, raise_r_o_eyebrow, raise_l_m_eyebrow, raise_r_m_eyebrow, raise_l_i_eyebrow, raise_r_i_eyebrow, open_jaw

Table 1: Description of the six primary expressions using the movement of some basic FDP points.

Figure 1: The feature points in the FDP set.

Figure 2: Facial expressions labeled as (a) "delighted" and (b) "joyful".

Term        Activ  Eval   Angle
Accepting     —      —      0
Afraid       4.9    3.4    70.3
Angry        4.2    2.7    212
Bashful      2      2.7    74.7
Delighted    4.2    6.4    318.6
Disgusted    5      3.2    181.3
Ecstatic     5.2    5.5    286
Guilty       4      1.1    102.3
Joyful       5.4    6.1    323.4
Patient      3.3    3.8    39.7
Sad          3.8    2.4    108.5
Surprised    6.5    5.2    146.7

Table 2: Selected emotion words from Whissell and Plutchik.
2. MPEG-4 AND FACIAL EXPRESSIONS

The establishment of the MPEG standards, and especially MPEG-4, indicates an alternative way of analyzing and modeling facial expressions and related emotions. FAPs and FDPs are utilized in the framework of MPEG-4 for facial animation purposes. On the other hand, automatic detection of particular FDPs is an active research area that can be employed within the MPEG-4 standard for analyzing facial expressions. Facial expression analysis has mainly concentrated on the six primary expressions. Can we model and analyze more than these? In this study we introduce the idea of grading the FAPs in order to capture variations of the primary expressions. In general, facial expressions and emotions can be described as a set of measurements and transformations that can be considered atomic with respect to the MPEG-4 standard; this way, one can describe both the anatomy of a human face and any animation parameters with groups of distinct tokens, eliminating the need to specify the topology of the underlying geometry. These tokens can then be mapped to automatically detected measurements and indications of motion in a video sequence, and thus help recognize the emotion or expression conveyed by the subject. This is accomplished by reversing the description of the six universal emotions in MPEG-4 tokens and by using a priori knowledge embedded within a fuzzy rule system. Interpolating and combining the tokens that describe the universal emotions makes it possible to distinguish emotions that lie between them on the emotion wheel; this reasoning can also be applied to the synthesis and animation of new expressions, such as discontent or exhilaration. Because FAPs do not correspond to specific models or topologies, this scheme can be extended to other models or characters, different from the one that was analyzed.
3. MODELING FAPS: THE ANALYSIS POINT OF VIEW

The FAPs are practical and very useful for animation purposes, but inadequate for analyzing facial expressions from video scenes or still images. To bridge the gap between analysis and animation/synthesis, we propose estimating some important FAPs using the features shown in Table 3. The feature set employs FDPs that lie in the facial area and, under some constraints, can be automatically detected and tracked. It consists of distances between these protuberant points, some of which are constant during the expressions and are used as reference points, as well as their time derivatives. Distances between the reference points are used for normalization, while time derivatives serve two purposes: first, they define the intensity for the FAP set and, second, they characterize the development of the expressions and are used for marking the expressions' apex.

FAP name                                     Feature                 Positive intensity
squeeze_l_eyebrow                            f1 = s(1,3)/ESo         df1/dt < 0
squeeze_r_eyebrow                            f2 = s(4,6)/ESo         df2/dt < 0
raise_u_midlip                               f3 = s(16,30)/ENSo      df3/dt < 0
raise_l_midlip                               f4 = s(16,33)/ENSo      df4/dt < 0
raise_l_i_eyebrow                            f5 = s(3,8)/ENSo        df5/dt > 0
raise_r_i_eyebrow                            f6 = s(6,12)/ENSo       df6/dt > 0
raise_l_o_eyebrow                            f7 = s(1,7)/ENSo        df7/dt > 0
raise_r_o_eyebrow                            f8 = s(4,11)/ENSo       df8/dt > 0
raise_l_m_eyebrow                            f9 = s(2,7)/ENSo        df9/dt > 0
raise_r_m_eyebrow                            f10 = s(5,11)/ENSo      df10/dt > 0
open_jaw                                     f11 = s(16,33)/ENSo     df11/dt > 0
close_upper_l_eyelid – close_lower_l_eyelid  f12 = s(9,10)/ENSo      df12/dt < 0
close_upper_r_eyelid – close_lower_r_eyelid  f13 = s(13,14)/ENSo     df13/dt < 0
stretch_l_cornerlip – stretch_r_cornerlip    f14 = s(28,29)/ESo      df14/dt > 0
vertical wrinkles between eyebrows           f15 = s′(3,6)           df15/dt > 0

Table 3: Description of the FAP set using a subset of the MPEG-4 FDP set. Note: s(i,j) is the Euclidean distance between FDP points i and j; ESo and ENSo are horizontal and vertical distances used for normalization; s′(3,6) is the maximum difference between pixel values along the line defined by FDP points 3 and 6.

4. COMBINING FAPS TO PRODUCE VARIATIONS OF PRIMARY EXPRESSIONS

Grading of FAPs is strongly related to the activation parameter proposed by Whissell. Since this relation is expressed differently for the particular expressions, a fuzzy rule system seems appropriate for mapping FAPs to the activation axis. Table 1 shows which FAPs are related to each expression; the contribution of each FAP to the formation of an expression, however, is unknown. Observations obtained from experiments like the ones presented in Section 5.2, as well as cues from psychological studies, can be used to form rules that describe the contribution of the particular FAPs. Since the FAP values estimated from FDP movement are affected by inaccurate computations, a fuzzy partitioning is necessary. In our implementation each FAP takes membership values for being low, medium and high. Similarly, the activation of a particular expression is also expressed using membership values, which correspond to variations of the basic underlying expression. The continuity of the emotion space, as well as the uncertainty involved in the feature estimation process, makes the use of fuzzy logic appropriate for the feature-to-expression mapping. The structure of the proposed fuzzy inference system is shown in Figure 3. The input depends on the particular primary expression; for example, capturing variations of joy requires the FAPs listed in the corresponding row of Table 1. The output also depends on the particular expression; more variations of joy can be modeled than of sadness. On the universe of discourse of each input (or output) parameter, a fuzzy linguistic partition is defined.
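To make the partitioning concrete, here is a minimal sketch of the low/medium/high fuzzification and a rule-based mapping to the activation axis. The triangular membership functions, the combined left/right FAP names and the rule set itself are illustrative assumptions; only the activation levels 4.2 ("delighted") and 5.4 ("joyful") are taken from Table 2.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b (our choice of shape)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def partition(x):
    """Low / medium / high memberships of a FAP value on a [0, 1] universe."""
    return {
        "low":    tri(x, -0.5, 0.0, 0.5),
        "medium": tri(x,  0.0, 0.5, 1.0),
        "high":   tri(x,  0.5, 1.0, 1.5),
    }

# Hypothetical rules for variations of joy: IF both FAPs are medium THEN
# activation is that of "delighted" (4.2); IF both are high THEN it is
# that of "joyful" (5.4).  The combined left/right FAP names are shorthand.
RULES = [
    ({"stretch_cornerlip": "medium", "raise_m_eyebrow": "medium"}, 4.2),
    ({"stretch_cornerlip": "high",   "raise_m_eyebrow": "high"},   5.4),
]

def activation(faps):
    """Min over the antecedents of each rule, weighted-average defuzzification."""
    num = den = 0.0
    for antecedent, level in RULES:
        w = min(partition(faps[name])[term] for name, term in antecedent.items())
        num += w * level
        den += w
    return num / den if den else None
```

For inputs midway between the two rules (e.g. both FAPs at 0.75), the weighted average yields an activation between 4.2 and 5.4, i.e. a graded variation of joy; if no rule fires, the function returns None.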
The linguistic terms of the fuzzy partitions (for example, medium open_jaw) are connected with the aid of the IF-THEN rules of the rule base. These IF-THEN rules are heuristically constructed and express the a priori knowledge of the system. The activation of the antecedents of a rule causes the activation of its consequents, i.e., the expression is concluded from the degree of the increment (or decrement) of the FAPs.

Figure 3: The structure of the fuzzy system: a FAP subset is fuzzified, processed by fuzzy inference driven by the fuzzy rule base, and defuzzified into variations of a particular expression.

5. IMPLEMENTATION ISSUES

5.1 Automatic detection of facial protuberant points

The detection of the FDP subset used to describe the involved FAPs was based on the work presented in [7]. However, for accurate detection, human assistance was necessary in many cases. The authors are working towards a fully automatic implementation of the point detection procedure.

5.2 Investigating the efficiency of the selected features

A critical point related to the features illustrated in Table 3 is their efficiency in describing the corresponding FAPs and, furthermore, their particular contribution to the classification of the primary expressions, as well as their ability to discriminate between similar expressions. In order to explore the efficiency of the selected features we set up the following experiments. First, we used sequences obtained from the MIT Media Lab, which show the standard archetypal emotions happiness, surprise, anger and disgust. Based on the technique presented in [7] we detected the relevant FDP subset; accurate detection of the FDP points was, however, assisted by human intervention in many cases. Then, for each frame or pair of subsequent frames illustrating a face in an emotional state, a feature vector corresponding to FAPs was computed. A neural network architecture was trained and then used to classify the feature vectors into one of the above categories.
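The classification step can be sketched as follows. Since the MIT sequences are not reproducible here, the sketch trains a small network on synthetic stand-in feature vectors; the single-hidden-layer architecture, the training details and the weight-based contribution measure at the end are all our assumptions — the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 15-dimensional feature vectors (one per FAP
# feature of Table 3) for four expression classes.  Real vectors would
# come from the tracked FDP points.
n_per_class, n_feat, n_class = 50, 15, 4
centers = rng.normal(0.0, 2.0, size=(n_class, n_feat))
X = np.vstack([c + rng.normal(0.0, 0.5, size=(n_per_class, n_feat)) for c in centers])
y = np.repeat(np.arange(n_class), n_per_class)
onehot = np.eye(n_class)[y]

# One hidden layer with tanh units and a softmax output, trained by
# batch gradient descent on the cross-entropy loss.
hidden, lr = 20, 0.5
W1 = rng.normal(0.0, 0.1, size=(n_feat, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0.0, 0.1, size=(hidden, n_class)); b2 = np.zeros(n_class)

for _ in range(500):
    h = np.tanh(X @ W1 + b1)
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    d_logits = (p - onehot) / len(X)          # softmax cross-entropy gradient
    d_h = (d_logits @ W2.T) * (1.0 - h**2)    # backprop through tanh
    W2 -= lr * (h.T @ d_logits); b2 -= lr * d_logits.sum(axis=0)
    W1 -= lr * (X.T @ d_h);      b1 -= lr * d_h.sum(axis=0)

accuracy = float((p.argmax(axis=1) == y).mean())

# One common heuristic for per-feature input contribution: the summed
# absolute first-layer weights, normalized to the highest value (an
# assumption; the paper does not state which saliency measure it used).
contribution = np.abs(W1).sum(axis=1)
contribution /= contribution.max()
```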
Using the capabilities of neural networks, we were able to evaluate the contribution of each particular feature of the 15-tuple feature vector to the obtained classification results. Figure 4 indicates that about eight FAPs, related to the eyebrow and lip points, contributed most to the classification of the above expressions. The wrinkle detection feature was found to be important for correct classification of the anger expression. The results also show that features sensitive to accurate detection of the FDP points, such as open_eyelid and close_eyelid, seem to be ignored by the neural classifier. From the analysis point of view this fact suggests discarding the corresponding features; for animation purposes, however, the open_eyelid and close_eyelid FAPs are still very important. A redundancy related to symmetrical features can be observed in Figure 4; only one of the components of a symmetrical pair is taken into account, e.g., the contribution of raise_l_i_eyebrow is much higher than that of raise_r_i_eyebrow. This redundancy, however, is not universal; there are expressions, such as variations of disgust, where facial symmetry is not guaranteed. Moreover, by keeping both symmetrical features we increase the robustness of the system against computation errors. Since we concentrated on the input contribution of the particular features rather than on classification performance, the obtained rates are of little importance; as a matter of fact, a higher-level combination of the feature vector elements is required, as indicated in Table 1, to describe a particular expression. The second experiment considered the possibility of subtler discriminations, involving expressions other than the primary ones. The expressions considered were amusement, happiness and excitement. Stimuli were drawn from two sources: the MIT facial database and a pilot database of selected extracts from BBC television programs.
Following the same procedure as before, we trained a neural network to classify the feature vectors into one of the three categories. The input contribution of the first 14 FAPs (the wrinkle-related feature has been left out) is shown in Figure 5. It can be seen that the eyebrow-related FAPs have increased input contribution; this indicates that by grading the corresponding features one can model more expressions than the primary ones.

Figure 4: Contribution of the particular features (df1/dt to df15/dt, normalized with respect to the highest value) to the classification of primary expressions.

Figure 5: Contribution of the particular features (df1/dt to df14/dt, normalized with respect to the highest value) to the classification of amusement, happiness and excitement.

6. SYNTHESIZING FACIAL EXPRESSIONS

In the first part of the synthesis procedure we adapt a generic face model to the static geometrical measurements computed from the video sequence (see Figure 6). These measurements correspond to the FDPs that characterize the human face and are thus locally defined. As a result, the transformations required to match the generic model to the specific subject are local as well, and use gradually descending weights so as to preserve the smoothness of the surface in the final topology [8]. This technique can also be used in combination with texture mapping in static images, in addition to animation purposes.

Figure 6: Intermediate expression synthesis: video sequence analysis yields FDPs and FAPs, which drive face model adaptation and expression synthesis.

In order to synthesize new intermediate expressions, we interpolate and combine FAPs that correspond to the six universal ones.
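As a concrete sketch of this interpolation, the FAP magnitudes of two neighbouring archetypal expressions can be blended linearly according to the target term's position on the emotion wheel. The angles (afraid 70.3, sad 108.5, guilty 102.3) come from Table 2; the FAP magnitude values and the linear blending rule itself are illustrative assumptions.

```python
def blend_weights(theta, theta_a, theta_b):
    """Weight of expression A for a target wheel angle theta lying
    between neighbours A (theta_a) and B (theta_b), in degrees."""
    return (theta_b - theta) / (theta_b - theta_a)

def intermediate_faps(theta, expr_a, expr_b, theta_a, theta_b):
    """Linearly mix the FAP magnitudes of two neighbouring expressions.

    expr_a / expr_b map FAP names to magnitudes; a FAP present in only
    one expression is simply scaled by that expression's weight.
    """
    w = blend_weights(theta, theta_a, theta_b)
    names = set(expr_a) | set(expr_b)
    return {n: w * expr_a.get(n, 0.0) + (1.0 - w) * expr_b.get(n, 0.0) for n in names}

# Example: "guilty" (102.3 deg) lies between "afraid" (70.3 deg) and
# "sad" (108.5 deg) on the wheel; the magnitudes below are hypothetical.
fear = {"raise_l_i_eyebrow": 0.8, "raise_r_i_eyebrow": 0.8, "open_jaw": 0.5}
sad  = {"raise_l_i_eyebrow": 0.6, "raise_r_i_eyebrow": 0.6, "close_upper_l_eyelid": 0.4}
guilt = intermediate_faps(102.3, fear, sad, 70.3, 108.5)
```

Since 102.3 degrees is much closer to sadness than to fear, `guilt` is dominated by the sadness magnitudes, with only a small open_jaw component inherited from fear.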
This means that magnitudes closer to or further from those of a neutral expression can be used in the transformations and FAPs, as well as mixtures of tokens that correspond to different universal emotions. This procedure actually imitates the way people express mixed emotions, by adopting amalgamated or ambiguous face poses: for example, the expression of "happiness" differs from "exhilaration" only in the extent of the transformation that results from the FAPs. On the other hand, a mixture of the FAPs that synthesize a "sad" and a "disgusted" face can be used to display guilt or discontent.

Figure 7: Synthesized expressions of "happiness" and "content".

Figure 8: "Sadness" (a universal emotion) and the synthesis of "guilt".

7. CONCLUSION

Exploitation by computer scientists of the results obtained in psychological studies of emotion recognition is possible, although not straightforward. We have shown that concepts like the emotion wheel and parameters like activation are suitable for extending the set of facial expressions that can be modeled. Accurate detection and tracking of an FDP subset can be used to approximate the MPEG-4 FAPs, which can subsequently be exploited to estimate the activation parameter. By modifying the activation parameter, variations of the archetypal expressions can be analyzed. Furthermore, interpolation between the values of the activation and angular parameters corresponding to the primary emotions provides an even broader set of expressions that can be modeled.

ACKNOWLEDGMENT

This work is funded by the project PHYSTA of the Training and Mobility of Researchers Program of the European Community. The authors are within the team of the project, where speech and psychological cues are also used for emotion classification. We would also like to thank the BBC for allowing us to use video sequences recorded from its broadcast programs.

REFERENCES

1. P. Ekman and W. Friesen, The Facial Action Coding System, Consulting Psychologists Press, San Francisco, CA, 1978.
2. R. Plutchik, Emotion: A Psychoevolutionary Synthesis, Harper and Row, New York, 1980.
3. C. M. Whissell, "The dictionary of affect in language," in R. Plutchik and H. Kellerman (eds.), Emotion: Theory, Research and Experience, vol. 4, The Measurement of Emotions, Academic Press, New York, 1989.
4. EC TMR Project PHYSTA Report, "Development of Feature Representation from Facial Signals and Speech," January 1999.
5. ISO/IEC JTC1/SC29/WG11 MPEG96/N1365, "MPEG-4 SNHC: Face and body definition and animation parameters," 1996.
6. F. Parke and K. Waters, Computer Facial Animation, A K Peters, 1996.
7. Kin-Man Lam and Hong Yan, "An Analytic-to-Holistic Approach for Face Recognition Based on a Single Frontal View," IEEE Trans. on PAMI, vol. 20, no. 7, July 1998.
8. K. Karpouzis, G. Votsis, N. Tsapatsoulis and S. Kollias, "Compact 3D Model Generation based on 2D Views of Human Faces: Application to Face Recognition," Machine Graphics and Vision, vol. 7, no. 1-2, 1998.