Moving to continuous facial expression space using the MPEG-4 facial
definition parameter (FDP) set
Kostas Karpouzis, Nicolas Tsapatsoulis and Stefanos Kollias
Department of Electrical and Computer Engineering
National Technical University of Athens
Heroon Polytechniou 9, 157 73 Zographou, Greece
ABSTRACT
Research in facial expression has concluded that at least six emotions, conveyed by human faces, are universally associated
with distinct expressions. Sadness, anger, joy, fear, disgust and surprise are categories of expressions that are recognizable
across cultures. In this work we form a relation between the description of the universal expressions and the MPEG-4 Facial
Definition Parameter Set (FDP). We also investigate the relation between the movement of basic FDPs and the parameters
that describe emotion-related words according to some classic psychological studies. In particular, Whissell suggested that
emotions are points in a space which, to a first approximation, occupies two dimensions: activation and evaluation. We show that some of the
MPEG-4 Facial Animation Parameters (FAPs), approximated by the motion of the corresponding FDPs, can be combined
by means of a fuzzy rule system to estimate the activation parameter. In this way, variations of the six archetypal emotions
can be achieved. Moreover, Plutchik concluded that emotion terms are unevenly distributed through the space defined by
dimensions like Whissell's; instead, they tend to form an approximately circular pattern, called the "emotion wheel", modeled
using an angular measure. The “emotion wheel” can be defined as a reference for creating intermediate expressions from the
universal ones, by interpolating the movement of dominant FDP points between neighboring basic expressions. By
exploiting the relation between the movement of the basic FDP points and the activation and angular parameters we can
model more emotions than the primary ones and achieve efficient recognition in video sequences.
Keywords: facial expression, MPEG-4 facial definition parameters, activation, emotion wheel, 3D expression synthesis
1. INTRODUCTION
Thorough research in facial expression has concluded that at least six emotions, conveyed by human faces, are universally
associated with distinct expressions. In particular, sadness, anger, joy, fear, disgust and surprise are categories of facial
expressions that are recognizable across cultures1. However, very few studies4 have appeared in the computer science
literature, that explore non-archetypal emotions. In the contrary psychological researchers have extensively investigated1, 2 a
broader variety of emotions. Although the exploitation of the results obtained by the psychologists is far from being
straightforward, computer scientists can use some hints to their research. In this work we combine some of the results
obtained by Whissel3 and Plutchik2 with the MPEG-4 provisions for facial animation, to model facial expressions and
related emotions. First we form a relation between the facial anatomy description of the six universal facial expressions6 and
the MPEG-4 Facial Definition Parameter Set (FDP)5. In Figure 1 the FDP set is shown while in Table 1 the basic expressions
are described according to the Facial Animation Parameter (FAP) set terminology. A small subset of the FDP set can be
used for the estimation of the FAP set. This subset consists of points in the face area that can be automatically detected and
tracked. As illustrated in Table 3, the features used for the description of the FAP set are distances between these
protuberant points within the facial area, some of which remain constant during the expressions and serve as reference
points, as well as their time derivatives. Distances between the reference points are used for normalization, while the time
derivatives serve two different purposes: first, they define the positive intensities of the FAPs and, second, they
characterize the development of the expressions and are used for marking the expressions' apex.
In a second step we investigate the relation between the movement of basic FDPs and the parameters used in the description
of emotion-related words3. Whissell suggested that emotions are points in a space with a relatively small number of
dimensions which, to a first approximation, can be reduced to two: activation and evaluation. Activation is the
__________________
Correspondence: E-mail: kkarpou@image.ntua.gr; Telephone: +301 7722491; Fax: +301 7722492
degree of arousal associated with the term, with terms like patient (at 3.3) representing a midpoint, surprised (over 6)
representing high activation, and bashful (around 2) representing low activation. Evaluation is the degree of pleasantness
associated with the term, with guilty (at 1.1) representing the negative extreme and delighted (at 6.6) representing the
positive extreme4. Figure 2 and Table 2 illustrate the strong relation between the strength of movement of some FDP points
and Whissell's activation dimension (the activation for the term "delighted" is 4.2, while for "joyful" it is 5.4).
Finally, the "emotion wheel"2 is taken as a reference for creating intermediate expressions from the universal ones. In
particular, by interpolating the movement of dominant FDP points between neighboring basic expressions we construct
intermediate ones. The last column in Table 2 reflects Plutchik's observation that emotion terms are unevenly distributed
through the space defined by dimensions like Whissell's; instead, they tend to form an approximately circular pattern called
the "emotion wheel". It lists empirically derived positions on the circle for some selected terms, according to Plutchik's
study, using an angular measure in which the midline runs from Acceptance (0) to Disgust (180). By exploiting the relation
between the movement of the basic FDP points of Table 1 and the activation and angular
parameters of Table 2, we can model and animate many more expressions than the primary ones studied so far and achieve
more efficient recognition in video sequences.
Anger     squeeze_l_eyebrow, squeeze_r_eyebrow, raise_u_midlip, raise_l_midlip
Sadness   raise_l_i_eyebrow, raise_r_i_eyebrow, close_upper_l_eyelid, close_upper_r_eyelid, close_lower_l_eyelid, close_lower_r_eyelid
Joy       close_upper_l_eyelid, close_upper_r_eyelid, close_lower_l_eyelid, close_lower_r_eyelid, stretch_l_cornerlip, stretch_r_cornerlip, raise_l_m_eyebrow, raise_r_m_eyebrow
Disgust   close_upper_l_eyelid, close_upper_r_eyelid, close_lower_l_eyelid, close_lower_r_eyelid, raise_u_midlip
Fear      raise_l_o_eyebrow, raise_r_o_eyebrow, raise_l_m_eyebrow, raise_r_m_eyebrow, raise_l_i_eyebrow, raise_r_i_eyebrow, squeeze_l_eyebrow, squeeze_r_eyebrow, open_jaw
Surprise  raise_l_o_eyebrow, raise_r_o_eyebrow, raise_l_m_eyebrow, raise_r_m_eyebrow, raise_l_i_eyebrow, raise_r_i_eyebrow, open_jaw
Table 1: Description of the six primary expressions using the movement of some basic FDP points.
Figure 1: The feature points of the MPEG-4 FDP set
Figure 2: Facial expressions labeled as (a) "delighted" and (b) "joyful"
Term        Activation  Evaluation  Angle
Accepting   –           –           0
Disgusted   5.0         3.2         181.3
Afraid      4.9         3.4         70.3
Delighted   4.2         6.4         318.6
Angry       4.2         2.7         212
Bashful     2.0         2.7         74.7
Patient     3.3         3.8         39.7
Surprised   6.5         5.2         146.7
Guilty      4.0         1.1         102.3
Joyful      5.4         6.1         323.4
Sad         3.8         2.4         108.5
Ecstatic    5.2         5.5         286
Table 2: Selected emotion words from Whissell and Plutchik.
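For reference in the later sections, the entries of Table 2 can be kept in a small lookup structure. The Python sketch below simply transcribes the table (None marks the values that the table does not provide):

```python
# Selected emotion words from Whissell (activation, evaluation) and
# Plutchik (angle on the "emotion wheel", in degrees), as in Table 2.
EMOTION_SPACE = {
    "accepting": {"activation": None, "evaluation": None, "angle": 0.0},
    "disgusted": {"activation": 5.0, "evaluation": 3.2, "angle": 181.3},
    "afraid":    {"activation": 4.9, "evaluation": 3.4, "angle": 70.3},
    "delighted": {"activation": 4.2, "evaluation": 6.4, "angle": 318.6},
    "angry":     {"activation": 4.2, "evaluation": 2.7, "angle": 212.0},
    "bashful":   {"activation": 2.0, "evaluation": 2.7, "angle": 74.7},
    "patient":   {"activation": 3.3, "evaluation": 3.8, "angle": 39.7},
    "surprised": {"activation": 6.5, "evaluation": 5.2, "angle": 146.7},
    "guilty":    {"activation": 4.0, "evaluation": 1.1, "angle": 102.3},
    "joyful":    {"activation": 5.4, "evaluation": 6.1, "angle": 323.4},
    "sad":       {"activation": 3.8, "evaluation": 2.4, "angle": 108.5},
    "ecstatic":  {"activation": 5.2, "evaluation": 5.5, "angle": 286.0},
}
```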
2. MPEG-4 AND FACIAL EXPRESSIONS
The establishment of the MPEG standards, and especially of MPEG-4, indicates an alternative way of analyzing and
modeling facial expressions and related emotions. FAPs and FDPs are utilized in the framework of MPEG-4 for facial
animation purposes. On the other hand, automatic detection of particular FDPs is an active research area, which can be
employed within the MPEG-4 standard for analyzing facial expressions. Facial expression analysis has so far concentrated
mainly on the six primary expressions. Are we able to model and analyze more than these expressions? In this study
we introduce the idea of grading the FAPs in order to capture variations of the primary expressions.
In general, facial expressions and emotions can be described as a set of measurements and transformations that can be
considered atomic with respect to the MPEG-4 standard; this way, one can describe both the anatomy of a human face and
any animation parameters with groups of distinct tokens, eliminating the need to specify the topology of the
underlying geometry. These tokens can then be mapped to automatically detected measurements and indications of motion
on a video sequence and thus help recognize the emotion or expression conveyed by the subject. This is accomplished by
reversing the description of the six universal emotions in terms of MPEG-4 tokens and using a priori knowledge embedded
within a fuzzy rule system. By interpolating and combining the tokens that describe the universal emotions, we can
distinguish emotions that lie between them with respect to the emotion wheel; the same reasoning can be applied to the
synthesis and animation of new expressions, such as discontent or exhilaration. Because FAPs do not correspond to specific
models or topologies, this scheme can be extended to other models or characters, different from the one that was analyzed.
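A concrete way to hold such token-based descriptions is a small record keyed by FAP name; a minimal sketch (the class name and the amplitude values are ours and purely illustrative, not part of the standard):

```python
from dataclasses import dataclass, field

@dataclass
class ExpressionProfile:
    """An expression described as a set of MPEG-4 FAP tokens with normalized
    amplitudes, independent of any particular face model or topology."""
    name: str
    faps: dict = field(default_factory=dict)   # e.g. {"open_jaw": 0.4, ...}

# Example: the archetypal description of anger from Table 1, with illustrative amplitudes.
ANGER = ExpressionProfile("anger", {
    "squeeze_l_eyebrow": 1.0, "squeeze_r_eyebrow": 1.0,
    "raise_u_midlip": 0.6, "raise_l_midlip": 0.6,
})
```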
3. MODELING FAPS: THE ANALYSIS POINT OF VIEW
The FAPs are practical and very useful for animation purposes but inadequate for analyzing facial expressions from video
scenes or still images. To bridge the gap between analysis and animation / synthesis we propose the estimation of some
important FAPs using the features shown in Table 3. The feature set employs FDPs that lie in the facial area and, under some
constraints, can be automatically detected and tracked. It consists of distances between these protuberant points, some of
which remain constant during the expressions and serve as reference points, as well as their time derivatives. Distances
between the reference points are used for normalization, while the time derivatives serve two different purposes: first,
they define the intensity of the FAPs and, second, they characterize the development of the expressions and are used for
marking the expressions' apex.
FAP name                                       Feature used for the description     Positive intensity
Squeeze_l_eyebrow                              f1  = s(1,3)/ESo,    df1/dt           df1/dt  < 0
Squeeze_r_eyebrow                              f2  = s(4,6)/ESo,    df2/dt           df2/dt  < 0
raise_u_midlip                                 f3  = s(16,30)/ENSo, df3/dt           df3/dt  < 0
raise_l_midlip                                 f4  = s(16,33)/ENSo, df4/dt           df4/dt  < 0
raise_l_i_eyebrow                              f5  = s(3,8)/ENSo,   df5/dt           df5/dt  > 0
raise_r_i_eyebrow                              f6  = s(6,12)/ENSo,  df6/dt           df6/dt  > 0
raise_l_o_eyebrow                              f7  = s(1,7)/ENSo,   df7/dt           df7/dt  > 0
raise_r_o_eyebrow                              f8  = s(4,11)/ENSo,  df8/dt           df8/dt  > 0
raise_l_m_eyebrow                              f9  = s(2,7)/ENSo,   df9/dt           df9/dt  > 0
raise_r_m_eyebrow                              f10 = s(5,11)/ENSo,  df10/dt          df10/dt > 0
open_jaw                                       f11 = s(16,33)/ENSo, df11/dt          df11/dt > 0
close_upper_l_eyelid – close_lower_l_eyelid    f12 = s(9,10)/ENSo,  df12/dt          df12/dt < 0
close_upper_r_eyelid – close_lower_r_eyelid    f13 = s(13,14)/ENSo, df13/dt          df13/dt < 0
stretch_l_cornerlip – stretch_r_cornerlip      f14 = s(28,29)/ESo,  df14/dt          df14/dt > 0
Vertical_wrinkles between eyebrows             f15 = s′(3,6),       df15/dt          df15/dt > 0
Table 3: Description of FAP set using a subset of the MPEG-4 FDP set. Note: s(i,j)=Euclidean distance between FDP points i and j,
{ESo, ENSo}=Horizontal and vertical distances used for normalization and s′(3,6) is the maximum difference between pixel values along
the line defined by the FDPs 3 and 6.
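The feature definitions of Table 3 are straightforward to compute once the FDP subset has been located in each frame. The Python sketch below uses hypothetical helper names; the point numbering follows Table 3, and the normalization distances ESo and ENSo are assumed to be measured on the neutral face:

```python
import math

def dist(p, q):
    """Euclidean distance s(i, j) between two FDP points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def feature_f1(fdp, ESo):
    """f1 = s(1, 3) / ESo, used for Squeeze_l_eyebrow (Table 3)."""
    return dist(fdp[1], fdp[3]) / ESo

def feature_f5(fdp, ENSo):
    """f5 = s(3, 8) / ENSo, used for raise_l_i_eyebrow (Table 3)."""
    return dist(fdp[3], fdp[8]) / ENSo

def derivative(f_curr, f_prev, dt):
    """Approximate df/dt between two consecutive frames; its sign gives the
    positive-intensity direction of the corresponding FAP (last column of Table 3)."""
    return (f_curr - f_prev) / dt

# Usage sketch: fdp is a dict mapping FDP indices to (x, y) image coordinates in the
# current frame. A negative df1/dt signals a positive squeeze_l_eyebrow intensity,
# while a positive df5/dt signals a positive raise_l_i_eyebrow intensity.
```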
4. COMBINING FAPS TO PRODUCE VARIATIONS OF PRIMARY EXPRESSIONS
Grading of the FAPs is strongly related to the activation parameter proposed by Whissell. Since this relation is expressed
differently for the particular expressions, a fuzzy rule system seems appropriate for mapping FAPs to the activation axis.
Table 1 shows which FAPs are related to each particular expression; the contribution of each FAP to the formation of an
expression, however, is unknown. Observations obtained from experiments like the ones presented in Section 5.2, as well
as cues from psychological studies, can be used to form rules that describe the contribution of the particular FAPs. Since
the FAP values estimated from the FDP movements are affected by inaccurate computations, a kind of fuzzy partitioning is
necessary. In our implementation each FAP takes membership values for being low, medium and high. In a similar way, the
activation of a particular expression is also expressed using membership values which correspond to variations of the basic
underlying expression. The continuity of the emotion space, as well as the uncertainty involved in the feature estimation
process, make the use of fuzzy logic appropriate for the feature-to-expression mapping. The structure of the proposed fuzzy
inference system is shown in Figure 3. The input depends on the particular primary expression; for example, capturing
variations of joy requires the FAPs listed in the corresponding row of Table 1. The output also depends on the particular
expression; more variations of joy can be modeled than of sadness. On the universe of discourse of each input (or output)
parameter, a fuzzy linguistic partition is defined. The linguistic terms of the fuzzy partitions (for example, medium
open_jaw) are connected with the aid of the IF-THEN rules of the rule base. These IF-THEN rules are heuristically
constructed and express the a priori knowledge of the system. The activation of the antecedents of a rule causes the
activation of the consequences, i.e. the expression is concluded from the degree of the increment (or decrement) of the
FAPs.
Figure 3: The structure of the fuzzy system: a FAP subset is fuzzified, fed to the fuzzy inference engine driven by the fuzzy rule base, and defuzzified into variations of a particular expression.
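The Python sketch below illustrates the kind of mapping Figure 3 describes. The triangular membership functions, the rule base for variations of joy and the output labels are illustrative assumptions, and the sketch simply selects the variation with the strongest rule degree instead of performing a full defuzzification:

```python
def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(intensity):
    """Partition a normalized FAP intensity (0..1) into low / medium / high."""
    return {
        "low":    tri(intensity, -0.01, 0.0, 0.5),
        "medium": tri(intensity, 0.0, 0.5, 1.0),
        "high":   tri(intensity, 0.5, 1.0, 1.01),
    }

# Illustrative rule base for variations of joy: each rule connects linguistic terms
# of the input FAPs (antecedent) with a variation of the expression (consequent).
RULES_JOY = [
    ({"stretch_l_cornerlip": "high", "open_jaw": "high"},  "exhilarated"),
    ({"stretch_l_cornerlip": "medium", "open_jaw": "low"}, "joyful"),
    ({"stretch_l_cornerlip": "low", "open_jaw": "low"},    "delighted"),
]

def infer_variation(fap_intensities, rules=RULES_JOY):
    """Mamdani-style inference: the degree of each rule is the minimum of its
    antecedent memberships; the variation with the highest degree wins."""
    memberships = {fap: fuzzify(v) for fap, v in fap_intensities.items()}
    scores = {}
    for antecedent, variation in rules:
        degree = min(memberships[fap][term] for fap, term in antecedent.items())
        scores[variation] = max(scores.get(variation, 0.0), degree)
    return max(scores, key=scores.get), scores

# Example: strongly stretched lip corners with a half-open jaw.
label, scores = infer_variation({"stretch_l_cornerlip": 0.9, "open_jaw": 0.55})
```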
5. IMPLEMENTATION ISSUES
5.1 Automatic detection of facial protuberant points
The detection of the FDP subset used to describe the involved FAPs was based on the work presented in7. However, in
many cases human assistance was necessary for accurate detection. The authors are working towards a fully automatic
implementation of the point detection procedure.
5.2 Investigating the efficiency of the selected features
A critical point regarding the features illustrated in Table 3 is their efficiency in describing the corresponding FAPs and,
furthermore, their particular contribution to the classification of the primary expressions, as well as their ability to
discriminate between similar expressions. In order to explore the efficiency of the selected features we set up the following
experiments. First, we used sequences obtained from the MIT Media Lab, which show the standard archetypal emotions
happiness, surprise, anger and disgust. Based upon the technique presented in7 we detected the relevant FDP subset.
Accurate detection of the FDP points was, however, assisted by human intervention in many cases. Then, for each frame or
pair of subsequent frames that illustrate a face in an emotional state, a feature vector corresponding to the FAPs was
computed. A neural network architecture was trained and then used to classify the feature vectors into one of the above
categories. Using capabilities of neural networks, we were able to evaluate the contribution of each particular feature of the
15-tuple feature vector to the obtained classification results. Figure 4 indicates that about eight FAPs, related to the eyebrow
and lip points, mainly contributed to the classification of the above expressions. The role of the wrinkle detection feature
was found important for correct classification of the anger expression. The results also show that features sensitive to
accurate detection of the FDP points, such as open_eyelid and close_eyelid, tend to be ignored by the neural classifier. From
the analysis point of view this fact suggests discarding the corresponding features; for animation purposes, however, the
open_eyelid and close_eyelid FAPs are still very important. A redundancy related to symmetrical features can be observed
in Figure 4: only one of the components of a symmetrical pair is taken into account, e.g. the contribution of
raise_l_i_eyebrow is much higher than that of raise_r_i_eyebrow. This redundancy, however, is not universal; there are
expressions, such as variations of disgust, where facial symmetry is not guaranteed. Moreover, by keeping both symmetrical
features we increase the robustness of the system against computation errors.
Since we concentrated on the input contribution of the particular features rather than the classification performance, the
obtained rates are of little importance; as a matter of fact, a higher-level combination of the feature vector elements is
required, as indicated in Table 1, to describe a particular expression.
The second experiment considered the possibility of subtler discriminations, involving expressions other than the primary
ones. The expressions considered were amusement, happiness and excitement. Stimuli were drawn from two sources, the
MIT facial database, and a pilot database of selected extracts from BBC television programs. Following the same procedure
as before, we trained a neural network to classify the feature vectors into one of the three categories. The input contribution
of the first 14 FAPs (the wrinkle-related feature has been left out) is shown in Figure 5. It can be seen that eyebrow-related
FAPs have an increased input contribution; this fact indicates that by grading the corresponding features one can model
more expressions than the primary ones.
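The input-contribution analysis can be approximated with any standard toolkit. The sketch below assumes scikit-learn is available and uses the summed absolute first-layer weights of a small MLP as a rough proxy for each feature's contribution; it is not the exact measure used for Figures 4 and 5:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def feature_contribution(X, y, hidden=10, seed=0):
    """Train a small MLP on the 15-dimensional FAP feature vectors and rank the
    inputs by the summed absolute weights of the first layer, normalized with
    respect to the highest value (a rough proxy for input contribution)."""
    clf = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=2000, random_state=seed)
    clf.fit(X, y)
    w = np.abs(clf.coefs_[0]).sum(axis=1)   # one value per input feature
    return w / w.max()

# X: array of shape (n_frames, 15) with the df_i/dt features of Table 3,
# y: expression labels (e.g. "happiness", "surprise", "anger", "disgust").
```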
Figure 4: Contribution of the particular features (df1/dt to df15/dt), normalized with respect to the highest value, for the classification of primary expressions.
Figure 5: Contribution of the particular features (df1/dt to df14/dt), normalized with respect to the highest value, for the classification of amusement, happiness and excitement.
6. SYNTHESIZING FACIAL EXPRESSIONS
In the first part of the synthesis procedure we adapt a generic face model to the static geometrical measurements computed
from the video sequence (see Figure 6); these measurements correspond to the FDPs that characterize the human face and,
thus, are locally defined. As a result, the transformations required to match the generic model to the specific subject are
local as well and use gradually descending weights, so as to preserve the smoothness of the surface in the final topology8.
This technique can also be used in combination with texture mapping in static images, in addition to animation purposes.
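A minimal sketch of this local adaptation step, assuming the generic model is an array of 3D vertices and using a Gaussian falloff as one possible choice of "gradually descending weights" (the actual weighting scheme is the one described in reference 8):

```python
import numpy as np

def adapt_vertices(vertices, fdp_model, fdp_target, sigma=0.05):
    """Move every model vertex by a weighted sum of the FDP displacements.

    vertices   : (N, 3) array with the generic model vertices.
    fdp_model  : (K, 3) positions of the FDP feature points on the generic model.
    fdp_target : (K, 3) measured positions of the same points on the subject.
    sigma      : controls how quickly the influence of each FDP falls off
                 (assumes normalized model coordinates), preserving the
                 smoothness of the surface away from the feature points.
    """
    displacements = fdp_target - fdp_model                  # (K, 3)
    out = vertices.copy()
    for p, d in zip(fdp_model, displacements):
        dist2 = np.sum((vertices - p) ** 2, axis=1)         # squared distance to this FDP
        w = np.exp(-dist2 / (2.0 * sigma ** 2))             # gradually descending weight
        out += w[:, None] * d
    return out
```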
Figure 6: Intermediate expression synthesis: the video sequence is analyzed into FDPs, which drive the face model adaptation and, through the FAPs, the expression synthesis.
In order to synthesize new intermediate expressions, we interpolate and combine FAPs that correspond to the six universal
ones. This means that magnitudes closer to or further from those of a neutral expression can be used in the transformations
and FAPs, as well as a mixing of tokens that correspond to different universal emotions. This procedure actually imitates the
way people express mixed emotions, by adopting amalgamated or ambiguous face poses: for example, the expression of
"happiness" differs from "exhilaration" only in the extent of the transformation that results from the FAPs. On the other
hand, the mixture of the FAPs that synthesize a "sad" and a "disgusted" face can be used to display guilt or discontent.
Figure 7: Synthesized expressions of "happiness" and "content"
Figure 8: "Sadness" (a universal emotion) and the synthesis of "Guilt"
7. CONCLUSION
The exploitation by computer scientists of results obtained in psychological studies related to emotion recognition is
possible, although not straightforward. We have shown that concepts like the emotion wheel and parameters like activation
are suitable for extending the set of facial expressions that can be modeled. Accurate detection and tracking of an FDP
subset can be used to approximate the MPEG-4 FAPs, which can subsequently be exploited for the estimation of the
activation parameter. By modifying the activation, variations of the archetypal expressions can be analyzed. Furthermore,
interpolation between the values of the activation and angular parameters corresponding to the primary emotions provides
an even broader set of expressions that can be modeled.
ACKNOWLEDGMENT
This work is funded by the project PHYSTA of the Training and Mobility of Researchers (TMR) Programme of the
European Community. The authors are part of the project team, in which speech and psychological cues are also used for
emotion classification. We would also like to thank the BBC for allowing us to use video sequences recorded from its
broadcast programs.
REFERENCES
1. P. Ekman and W. Friesen, The Facial Action Coding System, Consulting Psychologists Press, San Francisco, CA, 1978.
2. R. Plutchik, Emotion: A Psychoevolutionary Synthesis, Harper and Row, New York, 1980.
3. C. M. Whissell, "The dictionary of affect in language," in R. Plutchik and H. Kellerman (Eds.), Emotion: Theory, Research and Experience, vol. 4, The Measurement of Emotions, Academic Press, New York, 1989.
4. EC TMR Project PHYSTA Report, "Development of Feature Representation from Facial Signals and Speech," January 1999.
5. ISO/IEC JTC1/SC29/WG11 MPEG96/N1365, "MPEG-4 SNHC: Face and body definition and animation parameters," 1996.
6. F. Parke and K. Waters, Computer Facial Animation, A K Peters, 1996.
7. Kin-Man Lam and Hong Yan, "An Analytic-to-Holistic Approach for Face Recognition Based on a Single Frontal View," IEEE Trans. on PAMI, vol. 20, no. 7, July 1998.
8. K. Karpouzis, G. Votsis, N. Tsapatsoulis and S. Kollias, "Compact 3D Model Generation based on 2D Views of Human Faces: Application to Face Recognition," Machine Graphics and Vision, vol. 7, no. 1-2, 1998.