IA3: Intelligent Affective Animated Agents
T.S. Huang, I. Cohen, P. Hong, Y. Li
Beckman Institute, University of Illinois at Urbana-Champaign, USA
Email: huang@ifp.uiuc.edu
Abstract
Information systems should be human-centered. The human-computer interface needs to be improved to make
computers not only user-friendly but also enjoyable to interact with. Computers should be proactive and
take the initiative. A step in this direction is the construction of Intelligent Affective Animated Agents
(IA3). IA3 has three essential components: the agent needs to recognize human emotion; based on its
understanding of human speech and emotional state, the agent needs to reason and decide how to
respond; and the response needs to be manifested in the form of a synthetic talking face which exhibits
emotion. In this paper, we describe our preliminary research results in these three areas. We believe that,
although challenging research issues remain, effective IA3 could be constructed for restricted domains in
the near future.
1. Introduction
As we enter the age of Information
Technology, it behooves us to remember that
information systems should be human-centered
[1]. Technology should serve people, not people
technology. Thus, it is of the utmost importance
to explore new and better ways of human-computer
interaction, to make computers not
only more user-friendly, but also more enjoyable to
use. The computer should be proactive, taking
the initiative in asking the right questions, offering
encouragement, etc. In many applications, it is
highly effective to have an embodied intelligent
agent to represent the computer [2].
For
example, the agent can be manifested as a
synthetic talking face with synthesized speech. It
can exhibit emotion in terms of facial expression
and the tone of the voice. In addition to
recognizing the speech input from the human,
the computer also needs to recognize the
emotional and cognitive state of the human
(through visual and audio sensors), so that the
agent can decide on the appropriate response.
In this paper, we shall discuss some aspects
of this type of intelligent Affective Animated
Agents (IA3), and present some preliminary
results of our research. In Section 2, we describe
research in the automatic recognition of human
emotion by analyzing facial expression and the
tone of the voice. Section 3 presents our
synthetic talking face model (the iFACE) and
discusses issues related to the use of text to drive
the model.
In Section 4, we offer some
preliminary thoughts about intelligent dialog
systems. We conclude with a few remarks in
Section 5.
2. Audio/Visual Emotion Recognition
One of the main problems in trying to
recognize emotions is the fact that there is no
uniform agreement about the definition of
emotions. In general, it is agreed that emotions
are a short-term way of expressing inner feeling,
whereas moods are long term, and temperaments
or personalities are very long term [3]. Emotions
can be expressed in various ways,
through voice, facial expressions, and other
physiological means. Although there are
arguments about how to interpret these
physiological measurements, it is quite clear that
there is a strong correlation between measurable
physiological signals and the emotion of a
person.
In the past 20 years there has been much
research on recognizing emotion through facial
expressions. This research was pioneered by
Ekman and Friesen [4] who started their work
from the psychology perspective. In the early
1990s the engineering community started to use
these results to construct automatic methods of
recognizing emotions from facial expressions in
images or video [5][6][7][8][9]. Studies of vocal
emotions have been conducted for over 60 years.
Most recent studies [10][11][13][14] used
prosodic information such as the pitch, duration,
and intensity of the phrase as the features to
recognize emotions in voice. Recognition
of emotions from combined visual and audio
information has recently been studied by Chen [9],
Chen et al. [15], and De Silva et al. [16].
2.1 Automatic Facial Expression Recognition
The very basis of any recognition system is
extracting the best features to describe the
physical phenomena. As such, categorization of
the visual information revealed by facial
expression is a fundamental step before any
recognition of facial expressions can be
achieved. First a model of the facial muscle
motion corresponding to different expressions
has to be found. This model has to be generic
enough for most people if it is to be useful in any
way. The best known such model is given in the
study by Ekman and Friesen [4], known as the
Facial Action Coding System (FACS). Ekman
has since argued that emotions are linked directly
to the facial expressions, and that there are six
basic “universal facial expressions”
corresponding to happiness, surprise, sadness,
fear, anger, and disgust. The FACS codes the
facial expressions as a combination of facial
movements known as action units (AUs). The
AUs have some relation to facial muscular
motion and were defined based on anatomical
knowledge and by studying videotapes of how
the face changes its appearance when displaying
the expressions. Ekman defined 46 such action
units, each corresponding to an independent motion
of the face.
We implemented a face-tracking algorithm.
The face tracking algorithm and system are
based on the work of Tao and Huang [18] called
the Piecewise Bézier Volume Deformation
(PBVD) tracker. This system was modified to
extract the features for the emotion expression
recognition by Chen [9]. The estimated
motions are represented in terms of magnitudes
of some predefined AUs. These AUs are similar
to what Ekman and Friesen [4] proposed, but
only 12 AUs are used. Each AU corresponds to
a simple deformation on the face, defined in
terms of the Bézier volume control parameters.
In addition to the 12 AUs, the global head
motion is also determined from the motion
estimation. Figure 1 shows the 12 AUs being
measured for emotion expression recognition,
where the arrow represents the motion direction
of the AU moving away from the neutral
position of the face.
Figure 1. AUs extracted by the face tracker
Using the measurements of these action units,
two types of classifiers were constructed. The
first is a frame-based classifier [9]. The second
uses the temporal information of the entire facial
expression sequence.
The frame-based classifier makes a decision
among the seven classes (happiness, sadness,
surprise, anger, fear, disgust, and neutral) for
each time frame using a Sparse Network of
Winnows (SNoW) classifier [17]. The SNoW
classifier transforms the original AUs into a
higher-dimensional feature space, after which the
connections between the transformed feature
nodes and the output target nodes (here, the emotion
classes) are sparse. Training
uses a multiplicative update rule (Winnow), in
contrast to a neural network, which uses an
additive update rule. The advantages of
SNoW are that it does not require a large amount
of training data and that the sparseness of the
connections between the layers yields a
lower probability of error and higher speed. For testing,
the output target with the highest score is the
winning class (“winner-takes-all”).
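To make the update rule concrete, the sketch below shows a toy winner-takes-all classifier with multiplicative (Winnow) updates over binarized AU features. The feature discretization, parameter values, and thresholds are illustrative assumptions and do not reproduce the SNoW implementation of [17].

```python
# Toy winner-takes-all classifier with multiplicative (Winnow) updates over
# binarized AU features. All parameters are illustrative assumptions.
import numpy as np

EMOTIONS = ["happy", "sad", "surprise", "anger", "fear", "disgust", "neutral"]

def au_to_binary_features(aus, n_bins=8, lo=-3.0, hi=3.0):
    """Discretize each AU magnitude into one-hot bins (a crude stand-in for
    SNoW's sparse feature mapping)."""
    bins = np.linspace(lo, hi, n_bins + 1)
    feats = np.zeros(len(aus) * n_bins)
    for i, a in enumerate(aus):
        b = np.clip(np.digitize(a, bins) - 1, 0, n_bins - 1)
        feats[i * n_bins + b] = 1.0
    return feats

class WinnowWTA:
    def __init__(self, n_features, n_classes, alpha=1.5, threshold=1.0):
        self.w = np.ones((n_classes, n_features))   # one target node per class
        self.alpha = alpha                           # multiplicative step size
        self.threshold = threshold * n_features

    def predict(self, x):
        return int(np.argmax(self.w @ x))            # winner-takes-all

    def update(self, x, y):
        active = x > 0
        scores = self.w @ x
        if scores[y] < self.threshold:               # promote the correct class
            self.w[y, active] *= self.alpha
        for c in range(len(self.w)):                 # demote wrong classes that fired
            if c != y and scores[c] >= self.threshold:
                self.w[c, active] /= self.alpha

# Toy usage with random AU vectors (12 AUs per frame, as in the tracker).
rng = np.random.default_rng(0)
clf = WinnowWTA(n_features=12 * 8, n_classes=len(EMOTIONS))
for _ in range(200):
    clf.update(au_to_binary_features(rng.normal(size=12)), rng.integers(len(EMOTIONS)))
print(EMOTIONS[clf.predict(au_to_binary_features(rng.normal(size=12)))])
```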
The second classifier is a novel architecture
of a multilevel hidden Markov model (HMM)
classifier. The multilevel HMM both serves as a
classifier of the emotion sequences and does
automatic segmentation of the video to the
different emotions. The architecture is
constructed of a lower level of six emotion-specific
HMMs, trained on labeled, segmented
facial expression sequences, with the
observations being the AU measurements of the
face tracker. The state sequence of each of the
six HMMs is decoded using the Viterbi
algorithm, and this state sequence vector (six
dimensional) serves as the observation to the
high level HMM. The high level HMM consists
of seven states, one representing each emotion
and a neutral state. The state the high-level
HMM is in at each time can be interpreted
as the classification of the time sequence. The
high level HMM does both the segmentation and
the classification at the same time. Since the
observation vector is the state sequence of the
lower level HMMs, it also learns the
discrimination function between the six HMMs.
This is the main difference between this work
and the work of Otsuka and Ohya [6], who used
emotion-specific HMMs but did not attempt to
use a higher-level architecture to learn the
discrimination between the different models.
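As a rough illustration of the two-level structure (not the code used in our experiments), the sketch below assembles it with the hmmlearn library; the number of states per lower-level HMM and the Gaussian treatment of the high-level observations are assumptions.

```python
# Rough sketch of the two-level HMM idea with hmmlearn. Lower level: one
# Gaussian HMM per emotion over AU vectors; its Viterbi state sequences form a
# 6-dimensional observation stream for a higher-level HMM whose state is read
# off as the emotion label. Sizes and distributions are assumptions.
import numpy as np
from hmmlearn import hmm

EMOTIONS = ["happy", "sad", "surprise", "anger", "fear", "disgust"]
N_LOW_STATES = 5      # states per emotion-specific HMM (assumed)

def train_lower_level(segments_by_emotion):
    """segments_by_emotion: dict emotion -> list of (T_i, 12) AU arrays."""
    models = {}
    for emo, segs in segments_by_emotion.items():
        X, lengths = np.vstack(segs), [len(s) for s in segs]
        m = hmm.GaussianHMM(n_components=N_LOW_STATES, covariance_type="diag",
                            n_iter=50, random_state=0)
        m.fit(X, lengths)
        models[emo] = m
    return models

def decode_to_high_level_obs(models, au_sequence):
    """Stack the Viterbi state index of every lower-level HMM at each frame."""
    cols = [models[emo].predict(au_sequence) for emo in EMOTIONS]
    return np.stack(cols, axis=1).astype(float)   # shape (T, 6)

def train_high_level(high_level_obs_list):
    # Seven states: six emotions plus neutral. Treating the 6-dim state-index
    # vector as a continuous observation is a simplification.
    X, lengths = np.vstack(high_level_obs_list), [len(o) for o in high_level_obs_list]
    m = hmm.GaussianHMM(n_components=7, covariance_type="diag",
                        n_iter=50, random_state=0)
    m.fit(X, lengths)
    return m   # m.predict(obs) then segments and labels the video jointly
```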
These algorithms were tested on a
database collected by Chen [9], and the first was
also implemented in real time, for person-dependent recognition. The subjects in the
database were asked to express different
emotions given different stimuli. The database
consists of 100 subjects of different genders and ethnic
backgrounds. It includes sequences of facial
expressions only, as well as sequences of
emotional speech and video. Testing on this
database yielded recognition accuracy of over
90% for both methods, using a person-dependent
approach, and a much lower accuracy of around
60-70% for a person-independent approach. It
was noticed that happiness and surprise are
classified very well for both person-dependent
and person-independent cases, and the other
emotions are greatly confused with each other,
especially in the person-independent test. If the
number of classes is reduced by combining
disgust, anger, fear, and sadness into one
“negative” class, the accuracy becomes much
higher for both the person-dependent tests (about
97%) and the person-independent tests (about
90%). Figure 2 shows four examples of the real
time implementation of the first method. The
label shows the recognized emotion of the user.
Figure 2: Frames from the real-time facial expression recognizer
2.2 Emotion Recognition from Audio
Emotions are expressed through the voice as
well as through facial expressions. In the
database that was collected, the subjects were
asked to read sentences while displaying and
voicing the emotion. For example, a sentence
displaying anger is: “Computer, this is not what I
asked for, don’t you ever listen?” The audio is
processed at the phrase level to extract prosodic
features. The features are statistics of the pitch
contour and its derivative, and statistics of the RMS
energy envelope and its derivatives. A measure
of the syllabic rate (the rate of speaking) is also
extracted. The features are computed over a whole
phrase, since the emotion is unlikely to change
very rapidly within the speech. An optimal Naïve
Bayes classifier is then used. The overall accuracy
using this classifier was around 75% for a
person-dependent test, and around 58% for a
person-independent test, which shows that there is
useful information in the audio for recognizing
emotions (pure chance is 1/7 = 14.29%).
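A minimal sketch of phrase-level prosodic feature extraction and Naïve Bayes classification is given below, assuming the librosa and scikit-learn libraries; the exact feature set and parameters of our experiments are not reproduced here.

```python
# Phrase-level prosodic features (pitch contour, RMS energy, a crude syllabic
# rate proxy) followed by a Gaussian Naive Bayes classifier. Libraries and
# parameter choices are assumptions for illustration.
import numpy as np
import librosa
from sklearn.naive_bayes import GaussianNB

def stats(x):
    """Summary statistics of a contour and of its frame-to-frame derivative."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    if x.size == 0:
        x = np.zeros(1)
    d = np.diff(x) if x.size > 1 else np.zeros(1)
    return [np.mean(x), np.std(x), np.min(x), np.max(x), np.mean(d), np.std(d)]

def phrase_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    f0, _, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr)   # pitch contour
    rms = librosa.feature.rms(y=y)[0]                       # energy envelope
    # Crude proxy for syllabic rate: onset events per second (an assumption).
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    rate = len(onsets) / (len(y) / sr)
    return np.array(stats(f0) + stats(rms) + [rate])

# Usage sketch: one feature vector per emotional phrase, labels in 0..6.
# X = np.vstack([phrase_features(p) for p in wav_paths])
# clf = GaussianNB().fit(X, labels)
# predicted = clf.predict(X)
```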
2.3 Emotion Recognition from Combined Audio and Video
There are some inherent differences between
the features of the audio and video. Facial
expressions can change at a much faster rate than
the vocal emotions, which are expressed in
longer sequences of a phrase or sentence. To
account for these time differences, a classifier is
designed for each of the channels, and not a
combined classifier that has to wait for the audio
to be processed. The combination of the two
classifiers is handled using a system that can
work in three modes: audio only, video only, and
combined audio and video. The mode is set
using two detectors: an audio detector
recognizes that the user is speaking, and a video
detector determines whether the user is being tracked
and whether the tracking of the mouth region is reliable.
When the user talks, the mouth
movement is very fast, so the mouth
region is not reliable for expression recognition.
In the combined video-and-audio mode, only the top region
of the face is used for the expression recognition
because of the fast mouth movement.
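The mode-selection logic can be sketched as follows; the detector interfaces, classifier objects, and the combination rule are placeholders for illustration, not our implementation.

```python
# Illustrative sketch of the three-mode fusion logic. The classifier objects
# and the combination rule are assumed placeholders.
from enum import Enum, auto

class Mode(Enum):
    AUDIO_ONLY = auto()
    VIDEO_ONLY = auto()
    AUDIO_VIDEO = auto()

def select_mode(speech_detected, face_tracked):
    """Pick the fusion mode from the audio and video detectors."""
    if speech_detected and face_tracked:
        return Mode.AUDIO_VIDEO
    if speech_detected:
        return Mode.AUDIO_ONLY
    if face_tracked:
        return Mode.VIDEO_ONLY
    return None  # nothing to classify

def combine(video_label, audio_label):
    # Simplest possible fusion rule (an assumption): prefer agreement,
    # otherwise trust the video decision.
    return video_label if video_label == audio_label else video_label

def classify_emotion(mode, video_clf, audio_clf, upper_face_aus, all_aus, prosody):
    # video_clf / audio_clf are hypothetical objects with a .predict() method.
    if mode is Mode.VIDEO_ONLY:
        return video_clf.predict(all_aus)
    if mode is Mode.AUDIO_ONLY:
        return audio_clf.predict(prosody)
    if mode is Mode.AUDIO_VIDEO:
        # While speaking, the mouth moves too fast to be reliable, so only the
        # upper-face AUs are combined with the audio decision.
        return combine(video_clf.predict(upper_face_aus), audio_clf.predict(prosody))
    return None
```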
3. Synthetic Talking Face (iFACE)
3.1 Face Modeling
We have developed a system called iFACE,
which provides functionalities for face modeling,
editing, and animation. A realistic 3D head
model is one of the key factors of natural human
computer interaction. In recent years, researchers
have been trying to combine computer vision
and computer graphics techniques to build
realistic head models [12][20]. In our system, we
try to make the head modeling process more
systematic. The whole process is nearly
automatic, with only a few manual adjustments
necessary.
Figure 3. Face Modeling: (a) the generic face model of iFACE (wireframe and shaded); (b) the Cyberscanner texture and range data; (c) the 35 feature points selected for model fitting; (d) the customized head model with texture mapping (frontal and side views).
The generic face model we use is a
geometrical model (figure 3(a)) that consists of
all the facial accessories such as eyes, teeth,
tongue, etc. To customize the face model for a
particular person, we first obtain both the texture
data and range data of that person by scanning
his/her head using Cyberware cyberscanner. An
example of the cyberscanner data is shown in
figure 3(b). 35 feature points (figure 3(c)) are
manually selected on the scanned data. Those
feature points have their correspondences on the
generic geometry face model. We then fit the
generic model to the person by deforming it
based on the range data and the selected feature
points. Manual adjustments are required where
the scanned data are missing. Figure 3(d) shows
an example of a customized head model.
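The fitting step can be illustrated by the following simplified sketch: a least-squares similarity alignment of the generic model to the scan using the corresponding feature points, followed by a smooth interpolation of the feature-point displacements to all vertices. This is only an illustration of the idea, not the iFACE fitting algorithm.

```python
# Simplified sketch of fitting a generic head model to scanned data using
# corresponding feature points: least-squares similarity alignment followed by
# a Gaussian-weighted (Shepard-style) warp of the residuals. Sigma and the
# interpolation scheme are assumptions.
import numpy as np

def similarity_fit(src, dst):
    """Least-squares scale/rotation/translation mapping src points onto dst
    (both arrays of shape (N, 3), e.g. the 35 feature points)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    S, D = src - mu_s, dst - mu_d
    U, sig, Vt = np.linalg.svd(D.T @ S)
    d = np.ones(3)
    if np.linalg.det(U @ Vt) < 0:          # avoid reflections
        d[-1] = -1
    R = U @ np.diag(d) @ Vt
    scale = (sig * d).sum() / (S ** 2).sum()
    t = mu_d - scale * R @ mu_s
    return scale, R, t

def shepard_warp(vertices, src_pts, dst_pts, sigma=30.0):
    """Smoothly spread the feature-point displacements to all vertices using
    normalized Gaussian weights (sigma in model units, assumed)."""
    disp = dst_pts - src_pts
    d2 = ((vertices[:, None, :] - src_pts[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum(1, keepdims=True) + 1e-9
    return vertices + w @ disp

def customize(generic_vertices, generic_feat, scan_feat):
    s, R, t = similarity_fit(generic_feat, scan_feat)
    aligned = (s * (R @ generic_vertices.T)).T + t
    aligned_feat = (s * (R @ generic_feat.T)).T + t
    return shepard_warp(aligned, aligned_feat, scan_feat)
```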
3.2 Text Driven Talking Face
When text is used in communication, e.g. in
the context of text-based electronic chatting over
the Internet, visual speech synthesized from text will
greatly help deliver information. iFACE is able
to synthesize visual speech from a text stream.
The structure of the system is illustrated in
Figure 4. The system uses the Microsoft Text-To-Speech
(TTS) engine for text analysis and speech
synthesis. First, the text is parsed into a phoneme
sequence. A phoneme is a member of the set of
the smallest units of speech that serve to
distinguish one utterance from another in a
language or dialect. Each phoneme is mapped to
a viseme. A viseme is a generic facial shape that
serves to describe a particular sound. A
phoneme-to-viseme mapping is built for finding
the corresponding face shape for each phoneme.
From information on phoneme durations, we can
locate the frames at which the phonemes start.
For these frames, viseme images are generated
which are used as key frames for animation. To
synthesize the animation sequence, we adopt a
key-frame-based scheme similar to [19]. The face
shapes between key frames are decided by an
interpolation scheme.
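The key-frame scheme can be illustrated by the toy sketch below, where a viseme is represented as a small vector of mouth-shape parameters; the phoneme list, viseme table, and parameterization are invented for illustration and are not the iFACE/TTS interface.

```python
# Toy text-driven pipeline: phonemes with durations map to visemes, viseme key
# frames are placed at phoneme start times, and in-between frames are linearly
# interpolated. All entries are made-up placeholders.
import numpy as np

# (phoneme, duration in seconds) as a TTS engine might report for "hello"
PHONEMES = [("HH", 0.08), ("EH", 0.10), ("L", 0.07), ("OW", 0.15)]

# Each viseme here is a small vector of mouth-shape parameters (open, wide, round).
VISEMES = {
    "rest": np.array([0.0, 0.0, 0.0]),
    "HH":   np.array([0.2, 0.1, 0.0]),
    "EH":   np.array([0.6, 0.5, 0.0]),
    "L":    np.array([0.3, 0.2, 0.0]),
    "OW":   np.array([0.5, 0.0, 0.8]),
}

def key_frames(phonemes, fps=30):
    """Place one viseme key frame at the start frame of each phoneme."""
    keys, t = [], 0.0
    for ph, dur in phonemes:
        keys.append((int(round(t * fps)), VISEMES.get(ph, VISEMES["rest"])))
        t += dur
    keys.append((int(round(t * fps)), VISEMES["rest"]))   # return to rest
    return keys

def interpolate(keys):
    """Linear interpolation of mouth parameters between consecutive key frames."""
    frames = []
    for (f0, v0), (f1, v1) in zip(keys, keys[1:]):
        for f in range(f0, max(f1, f0 + 1)):
            a = (f - f0) / max(f1 - f0, 1)
            frames.append((1 - a) * v0 + a * v1)
    return np.array(frames)

print(interpolate(key_frames(PHONEMES)).shape)   # (n_frames, 3)
```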
3.3 Synthesizing Expressive Talking Head
A set of basic facial shapes is built by
adjusting the control points. Those basic facial
shapes are in spirit similar to the Action Units of
[4]. They are built so that all kinds of facial
expressions can be approximated by linear
combinations of them. Given a script of an
expression sequence, we use a key-frame
technique to synthesize an expression animation
sequence, such as nodding the head, blinking the eyes,
raising the eyebrows, etc. By combining the script of
expressions with the text, we can generate an expressive
talking head.
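The following minimal sketch illustrates approximating an expression as a linear combination of basic facial shapes; the shape vectors and weights are made-up placeholders.

```python
# Expression as a linear combination of basic facial shapes (in spirit similar
# to Action Units). Shapes and weights are illustrative placeholders.
import numpy as np

N_VERTICES = 4          # a real face mesh has thousands of vertices
neutral = np.zeros((N_VERTICES, 3))

# Each basic shape is stored as a displacement from the neutral face.
basic_shapes = {
    "brow_raise": np.random.default_rng(0).normal(0, 0.01, (N_VERTICES, 3)),
    "eye_blink":  np.random.default_rng(1).normal(0, 0.01, (N_VERTICES, 3)),
    "smile":      np.random.default_rng(2).normal(0, 0.01, (N_VERTICES, 3)),
}

def expression(weights):
    """face = neutral + sum_i w_i * basic_shape_i"""
    face = neutral.copy()
    for name, w in weights.items():
        face += w * basic_shapes[name]
    return face

# One frame of a "surprise"-like expression from a script (weights invented).
frame = expression({"brow_raise": 1.0, "eye_blink": 0.2})
```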
4. Intelligent Dialog Systems
4.1 Brief Introduction to Intelligent Agents and Dialog Systems
Figure 4. The structure of the text-driven talking face: the TTS engine performs text analysis (producing a phoneme sequence) and speech synthesis; the phoneme sequence is mapped to a viseme sequence, key frames are generated, and the face model is animated by interpolation, synchronized with playback of the speech signals.
Different types of information are exchanged
between computers and users in multimodal
interaction. Results of video tracking and speech
recognition must be integrated by the system to
produce proper responses. In order to build such
a system in an orderly way, an intelligent agent
should be used. Special techniques should also
be developed for intelligent agent to handle
dialogs.
An intelligent agent models after basic
functions of human minds. It includes a set of
beliefs and goals, a world model which has a set
of parameter variables, a set of primitive actions
it can take, a planning module and a reasoning
module. For the agent to think and act, there
must be a world where the agent exists. This
agent world is a simplified model of the real
world: a list of
parameters, each of which reflects some
property of the real world. At a specific time,
the parameters take specific values and the agent
is said to be in a specific world state. Like a
human, the agent has its own beliefs: which
world states are good, which world states are
bad, and what kind of ideal world states the
agent should pursue. Once the agent decides the
current goals, it will generate a plan to achieve
these goals. A plan is simply a sequence of primitive
actions. Each primitive action has
some prerequisites and causes some parameters in
the world state to change. An agent also has
reasoning ability. The agent can discover new
conclusions from current knowledge.
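The ingredients above can be summarized in a compact sketch: a world state as a list of parameters, primitive actions with prerequisites and effects, and a plan as a sequence of actions. The toy "greeting" domain and all names are invented for illustration.

```python
# World state as a parameter dictionary, primitive actions with prerequisites
# and effects, and a plan as a sequence of actions. Toy domain for illustration.
from dataclasses import dataclass
from typing import Callable

WorldState = dict   # parameter name -> value

@dataclass
class Action:
    name: str
    prereq: Callable[[WorldState], bool]     # prerequisites on the world state
    effect: Callable[[WorldState], None]     # parameter changes it causes

def execute_plan(state: WorldState, plan: list) -> WorldState:
    for act in plan:
        if not act.prereq(state):
            raise RuntimeError(f"prerequisite of '{act.name}' not met")
        act.effect(state)
    return state

# Toy world: the agent wants the user to feel greeted and engaged.
state = {"user_present": True, "user_greeted": False, "user_engaged": False}

greet = Action("greet",
               prereq=lambda s: s["user_present"] and not s["user_greeted"],
               effect=lambda s: s.update(user_greeted=True))
ask_question = Action("ask_question",
                      prereq=lambda s: s["user_greeted"],
                      effect=lambda s: s.update(user_engaged=True))

goal = lambda s: s["user_engaged"]          # an ideal world state to pursue
plan = [greet, ask_question]                # a hand-written plan achieving it
assert goal(execute_plan(state, plan))
```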
A spoken dialog system should understand
what the user says and generate correct speech
responses. A spoken dialog system is composed
with a speech recognition engine, a syntax
analysis parser, an information extraction
module, a dialogue management module, and a
text-to-speech (TTS) engine. In our case, the
speech recognition engine is IBM ViaVoice.
The TTS engine is IBM ViaVoice Outloud. The
speech recognition engine transcribes sound
waves into text. Then the syntax parser parses
the text and generates a parsing tree. An
information extraction module can then do semantic and
pragmatic analysis and extract the important
information using the semantic clues provided by
the parsing tree. However, the ability of this
type of information extraction is limited. In
order to increase the flexibility of the system,
people use semantic networks and ideas such as
finite state machines to cluster related words and
statistically model the input sentences. But the
improvements are very limited.
After the
messages are understood, correctly or not, the
dialogue management module uses various
pattern classification techniques to identify the
user’s intentions. A plan model is built for each
of such intentions. Using these plan models, the
system can decide whether to answer a question,
to accept a proposal, to ask a question, to give a
proposal, or to reject a proposal and give a
suggestion.
A response is then generated.
Current systems focus on generating the
correct responses. This is absolutely important.
However, it is also misleading, because the
understanding issue is put aside. Though a quick
solution can be established without
understanding, the ability of such a dialogue
system is always inherently limited.
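The pipeline described above can be sketched schematically as follows; the function bodies are placeholders, and the real system's IBM ViaVoice recognition and ViaVoice Outloud synthesis are not reproduced.

```python
# Schematic spoken-dialog pipeline: ASR -> parsing -> information extraction ->
# dialog management -> TTS. All bodies are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class DialogTurn:
    text: str            # output of the speech recognizer
    parse: object = None
    intent: str = ""
    response: str = ""

def recognize(audio) -> str:
    raise NotImplementedError("speech recognition engine goes here")

def parse(text: str):
    return text.split()                     # stand-in for a syntax parser

def extract_intent(parse_tree) -> str:
    # Stand-in for information extraction + intent classification.
    return "question" if parse_tree and parse_tree[-1].endswith("?") else "statement"

def manage_dialog(intent: str) -> str:
    # Plan-model lookup: decide whether to answer, ask, accept, or reject.
    return {"question": "answer", "statement": "acknowledge"}.get(intent, "clarify")

def synthesize(response: str) -> bytes:
    raise NotImplementedError("TTS engine goes here")

def handle_turn(text: str) -> DialogTurn:
    turn = DialogTurn(text=text)
    turn.parse = parse(text)
    turn.intent = extract_intent(turn.parse)
    turn.response = manage_dialog(turn.intent)
    return turn

print(handle_turn("where is the museum ?").response)   # -> "answer"
```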
4.2 Intelligent Agent With Speech Capability
In a multimodal interface, different types of
information and the complexity of tasks require
the presence of an intelligent agent.
The
intelligent agent handles not only results from
video tracking and emotion analysis, but also text
strings of natural language from the speech
recognizer. The intelligent agent produces not
only operational responses but also speech
responses. Therefore, it is necessary to study
ways to give an intelligent agent speech
capability. Actually, pure dialog systems also
benefit from an intelligent agent approach since
intelligent agents are powerful in handling
complex tasks and have the ability to learn.
Most importantly, there have been many research
results regarding various aspects of intelligent
agents.
However, achieving speech capability for
intelligent agents is not a trivial task. Until now,
there has been no research focusing on giving
intelligent agents speech capability.
The
difficulty mainly comes from natural language
understanding. The issue of natural language
understanding is not solved yet.
For an
intelligent agent to act on language input, the
agent must understand the meanings of the
sentences. In order to understand a single
sentence, a knowledge base related to the
semantic information of each word in the
sentence must be accessible to the agent. There
have been many efforts that try to build universal
methods for knowledge representation that can
be used by computers or by an intelligent agent.
However, these methods only represent the
knowledge by using a highly abstracted symbol
language. They can be easily implemented on
computers, but are not suitable for storing and
organizing information related to a specific
word. There is no method designed to represent
knowledge embedded in human speech. In order
to build a structure that can store the associated
concept of each word, we propose a word
concept model to represent knowledge in a
“speech-friendly” way. We hope to establish a
knowledge representation method that not only
can be used to store word concept, but also can
universally be used by an intelligent agent to do
reasoning and planning.
4.3 Word Concept Model
For human beings, there is an underlying
concept in one’s mind associated with each word
he knows. If we hear the word “book”, the
associated concept will appear in our mind,
consciously or unconsciously. We know a book
is something composed of pages and made of
paper. We know there are words printed in a
book. We also know we can read a book for
learning or relaxing. Furthermore, we know the
usual ranges of the size of a book. There may be
other information we know. The central idea is
that the concept of “book” should include all
this information. A human understands the word
“book” because he knows this underlying related
information. Clearly, if an intelligent agent
wants to understand human language, it must
have access to this underlying information and
be able to use it to extract information from the
sentences spoken by the users.
In the human mind, complex concepts are
explained by simple concepts. We need to
understand the concept of “door” and “window”
before we understand the concept of “room”.
The concept of “window” is further explained
by “frame”, “glass” and so on. Ultimately, all
the concepts are explained based on the input
signal patterns perceived by the human sensory
system. This layered abstraction is the key to
fast processing of information. When reasoning
is being done, only related information is pulled
out. Lower layered information may not appear
in a reasoning process. In our word concept
model, such layered concept abstraction is also
used. We want to build complex concepts by
using simple concepts. Before we can do that,
we need a basic concept space on which we can
define primitive concepts. For humans, the basic
concept space is the primitive input signal
patterns from sensors. For a computer, it may be
difficult to do so. However, for a specific
domain, it is possible to define the basic concept
space by carefully studying the application
scenarios.
Here, we borrowed ideas from
previous research on concept space. A concept
space is defined by a set of concept attributes.
Each attribute can take discrete or continuous
values. The attribute “color” can be discrete
such as “red”, “yellow”, etc. Or it can be
continuous in terms of light wavelength.
Whether to use discrete values or continuous
values should be determined by the application.
Once we have established this concept space,
primitive concepts in the system can be
defined on it. Then complex
concepts can be built using simple concepts.
A complex concept can also have a definition on
the concept space where possible.
As we started to map words into this concept
structure, we found it necessary to treat different
types of words differently. Since the most basic
types of concepts are physical objects,
we first map nouns and pronouns into the
concept space. We call these types of words
solid words. For solid words, we can find the
mapping onto the concept space quite easily, since
they refer to specific types of things in the
physical world. There may be multiple
definitions for one word, because one word can
have many meanings. The agent can decide
which definition to use by combining context
information. After we have defined the solid
words, other types of words are defined
around solid words instead of directly in the concept
space, but their definitions are closely coupled
with the concept space. For verbs, we first
define the positions where solid words can appear
around the verb. For example, a solid word can
appear both before and after the verb “eat”, and for
some verbs, more than one solid word can appear
after the verb. In all cases, the positions
where solid words can appear are numbered.
The definition of the verb is then described by how
the attributes of each solid word should change.
For example, if “Tom” is the solid word before
“go” and “home” is the solid word after “go”,
then the “position” attribute of “Tom” should be
set to “home” or the location of “home”. Of
course, there are complex situations where this
simple scheme doesn’t fit. However, powerful
techniques can be developed using the same
idea. For adjectives, definitions are described as
values of the attributes. “Big” may be defined as
“more than 2000 square feet” for the “size”
attribute if it is used for houses. Obviously,
multiple definitions are possible and should be
resolved by context. Adverbs can be defined
similarly. For prepositions, we first define a
basic set of relationships. These include “in”,
“on”, “in front of”, “at left of”, “include”,
“belong to”, etc. Then prepositions can be
defined by using these basic relationships. There
are other types of words, but most of the rest are
grammar words and can be dealt with by a
language parser.
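An illustrative sketch of these definitions is given below: solid words as attribute bindings on the concept space, a verb defined by how it changes the attributes of the solid words around it, and an adjective defined as a constraint on an attribute. The entries are invented examples, not part of the vocabulary described below.

```python
# Illustrative word concept model: solid words carry attribute bindings on the
# concept space, a verb changes attributes of its solid-word slots, and an
# adjective constrains one attribute. All entries are invented examples.
from dataclasses import dataclass, field

@dataclass
class SolidWord:                       # a noun mapped onto the concept space
    name: str
    attributes: dict = field(default_factory=dict)   # e.g. {"position": "hotel"}

def verb_go(subject: SolidWord, destination: SolidWord):
    """'X go Y' sets the "position" attribute of X to Y."""
    subject.attributes["position"] = destination.name

def adjective_big(word: SolidWord, context: str = "house") -> bool:
    """Adjective defined as an attribute value, resolved by context."""
    if context == "house":
        return word.attributes.get("size_sqft", 0) > 2000
    return False                        # other contexts would need other rules

# Usage: "Tom goes home" updates Tom's position attribute.
tom = SolidWord("Tom", {"position": "office"})
home = SolidWord("home")
verb_go(tom, home)
print(tom.attributes["position"])       # -> "home"

house = SolidWord("house", {"size_sqft": 2400})
print(adjective_big(house))             # -> True
```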
At this stage, our word concept model is not
yet complete. Though we have already
established a basic concept space and defined a
vocabulary of around two hundred words for a
travel-related content-based image retrieval
database, we expect to encounter many practical issues
when we start to build the dialogue system. We
will need to meet the challenge of building a robust
information extraction module and an intelligent
agent that can reason on the word concept model
we described. We also expect to build a core
system that can learn new words from the user.
We hope to have a small system that can grow
into a powerful one through interaction with
humans.
5. Concluding Remarks
In this paper we have described our
preliminary research results on some essential
aspects of Intelligent Affective Animated
Agents, namely: (1) Automatic recognition of
human emotion from a combination of visual
analysis of facial expression and audio analysis
of speech prosody. (2) Construction of a synthetic
talking face and how it could be driven by text
and an emotion script. (3) Intelligent dialog
systems. Challenging research issues remain in
all these three areas. However, we believe that
in constrained domains, effective Intelligent
Affective Animated Agents could be built in the
near future.
6. References
[1] NSF Workshops on “Human-Centered
Systems: Information, Interaction, and
Intelligence,” organized by J.L. Flanagan
and T. S. Huang, 1997. Final Reports
available at http://www.ifp.uiuc.edu/nsfhcs/
[2] J. Cassell, J. Sullivan, S. Prevost, and E.
Churchill (eds.), "Embodied Conversational
Agents", MIT Press, 2000.
[3] J.M. Jenkins, K. Oatley, and N.L. Stein,
eds., Human Emotions: A Reader, Malden,
MA: Blackwell Publishers, 1998.
[4] P. Ekman and W.V. Friesen, Facial Action
Coding System: Investigator’s Guide, Palo
Alto, CA: Consulting Psychologist Press,
1978.
[5] K. Mase, “Recognition of facial expression
from optical flow,” IEICE Transactions, vol.
E74, pp. 3474-3483, October 1991.
[6] T. Otsuka and J. Ohya, “Recognizing
multiple persons facial expressions using
HMM based on automatic extraction of
significant frames from image sequences,''
in Proc. Int. Conf. on Image Processing
(ICIP-97), (Santa Barbara, CA, USA), pp.
546-549, Oct. 26-29, 1997.
[7] Y. Yacoob and L. Davis, “Recognizing
human facial expressions from long image
sequences using optical flow,” IEEE
Transactions on Pattern Analysis and
Machine Intelligence, vol. 18, pp. 636-642,
June 1996.
[8] M. Rosenblum, Y. Yacoob, and L. Davis,
“Human expression recognition from motion
using a radial basis function network
architecture,” IEEE Transactions on Neural
Network, vol.7, pp.1121-1138, September
1996.
[9] L.S. Chen, “Joint processing of audio-visual
information for the recognition of emotional
expressions in human-computer interaction,”
PhD dissertation, University of Illinois at
Urbana-Champaign, Dept. of Electrical
Engineering, 2000.
[10] R. Cowie and E. Douglas-Cowie,
“Automatic statistical analysis of the signal
and prosodic signs of emotion in speech,” in
Proc. International Conf. on Spoken
Language Processing 1996, Philadelphia,
PA, USA, October 3-6, 1996, pp.1989-1992.
[11] F. Dellaert, T.Polzin, and A. Waibel,
“Recognizing emotion in speech,” in Proc.
International Conf. on Spoken Language
Processing 1996, Philadelphia, PA, USA,
October 3-6, 1996, pp.1970-1973.
[12] B. Guenter, C. Grimm, D. Wood, et al.,
“Making Faces,” in Proc. SIGGRAPH ’98,
1998.
[13] T. Johnstone, “Emotional speech elicited
using computer games,” in Proc.
International Conf. on Spoken Language
Processing 1996, Philadelphia, PA, USA,
October 3-6, 1996, pp.1985-1988.
[14] J. Sato and S. Morishima, “Emotion
modeling in speech production using
emotion space,” in Proc. IEEE Int. Workshop
on Robot and Human Communication,
Tsukuba, Japan, Nov. 1996, pp. 472-477.
[15] L.S. Chen, H. Tao, T.S. Huang, T. Miyasato,
and R. Nakatsu, “Emotion recognition from
audiovisual information,” in Proc. IEEE
Workshop on Multimedia Signal Processing,
(Los Angeles, CA, USA), pp. 83-88, Dec. 7-9, 1998.
[16] L.C. De Silva, T. Miyasato, and R. Nakatsu,
“Facial emotion recognition using
multimodal information,” in Proc. IEEE Int.
Conf. on Information, Communications and
Signal Processing (ICICS'97), Singapore,
pp.397-401, Sept. 1997.
[17] D. Roth, “Learning to resolve natural
language ambiguities: A unified approach,”
in Proc. National Conference on Artificial
Intelligence, Madison, WI, USA, pp. 806-813, 1998.
[18] H. Tao and T.S. Huang, “Bézier Volume
Deformation Model for Facial Animation
and Video Tracking,” in “Modeling and
Motion Capture Techniques for Virtual
Environments”, eds. N. Magnenat-Thalman
and D. Thalman, Springer, USA, 1998, pp.
242-253.
[19] K. Waters and T. M. Levergood, “DECface:
An Automatic Lip-Synchronization
Algorithm for Synthetic Faces,” Digital
Equipment Corporation, Cambridge
Research Lab, Technical Report CRL 93-4,
1994.
[20] V. Blanz and T. Vetter, “A Morphable Model
for the Synthesis of 3D Faces,” in Proc.
SIGGRAPH ’99, 1999.