GRAPHICAL REPRESENTATIONS OF EMOTIONAL AGENTS
by
Srividya Dantuluri
Chapter 1
Introduction
The purpose of this report is to perform a literature search in the field of graphical
representation of believable agents and to identify the basic requirements for a tool or a
theory in this area. We will identify the key issues that must be addressed in creating a
credible virtual being and analyze the current research with respect to these issues.
1.1 Why do we need believable agents?
Extensive research has shown that users interpret the actions of computer systems using
the same conventions and social rules used to interpret actions of humans [Oz, 1997].
This is more pronounced in the case of anthropomorphic software interface agents.
Software agents are expected to be believable. This does not imply that they should
always speak the truth and should be reliable. The user, when interacting with a
believable agent, should feel that he/she is dealing with a life-like character instead of a
life-less computer. If humans sympathize with and accept an agent as human, they will be
able to communicate with it better. Agents that are not believable are viewed as life-less
machines [Chandra, 1997].
Research in the implementation of believable agents is not targeted to trick the user into
believing that he is communicating with a human. Rather, it has a more benign purpose.
As designers start building agent systems with emotions, they will need techniques for
communicating these emotions to the user. This is precisely why research in believable
agents is needed.
A more remote, yet cogent, reason for pursuing this research is to build companions like
Data on Star Trek, which has been termed the AI dream [Oz, 1997]. In the words of Woody
Bledsoe, a former president of AAAI,
“Twenty-five years ago I had a dream, a daydream, if you will. A dream shared with
many of you. I dreamed of a special kind of computer, which had eyes and ears and arms
and legs, in addition to its "brain" ... my dream was filled with the wild excitement of
seeing a machine act like a human being, at least in many ways.”
Research in believable agents directly deals with building complete agents with
personality and emotion, and thus provides a new prospectus for pursuing the AI dream
[Oz, 1997].
The current literature refers to believable agents as “Virtual Humans”. Additionally,
animated life-like characters are also called avatars [Leung, 2001]. The word “avatar”
originates from the Hindu religion. It means “an incarnation of a Hindu deity, in human
or animal form, an embodiment of a quality or concept, or a temporary manifestation of a
continuing entity” [Dictionary]. For the purpose of this report, an avatar can be assumed
to be a vivid representation of a user in a virtual environment. The term “avatar” is also
used to denote the graphical representation of a software agent in a Collaborative Virtual
Environment (CVE) [Salem, 2000]. The CVE is a multi-user virtual space in which users
are represented by a 3D image.
The terms believable agents, animated life-like agents, virtual humans, and avatars are
used interchangeably in this report.
1.2 What are the requirements for believable animation?
The Oz group at Carnegie Mellon University's School of Computer Science identified
the following requirements for believability [Oz, 1997]:

- Personality: Personality is the attribute that distinguishes one character from
another. It includes everything unique and specific about the character, from the
way they talk to the way they think.
- Emotions: Emotion can be defined as a mental state that occurs involuntarily
based on the current situation [Dictionary]. The range of emotions exhibited by a
character is personality-specific. Given a situation, characters with different
personalities react with different emotions.
Personality and emotion are closely related [Gratch, 2002]. [Moffat, 1997] says that
personality remains stable over an extended period of time whereas emotions are short
term. Furthermore, while emotions focus on particular situations, events, or objects,
elements determining personality are more extended and indirect.
Apart from personality and emotions, mood is also an important attribute that has to be
considered while working with emotional agents. Mood and emotion differ in two
dimensions: duration and intensity. Emotions are short-lived and intense, whereas mood
is longer and has a lower intensity [Descamps, 2001].
Building a virtual human involves joining traditional artificial intelligence with computer
graphics and social science [Gratch, 2002]. Synthesizing a human-like body that can be
controlled in real-time includes computer graphics and animation. Once the virtual
human starts looking like a human, people expect it to behave like one too. To have a
believable intelligent human-like agent, the agent needs to possess personality and needs
to display mood and emotion [Chandra, 1997]. Thus, research in the field of building a
believable agent or a virtual human must rely immensely on psychology and
communication theory to adequately convey nonverbal behavior, emotion and
personality.
The key to realistic animation starts with creating believable avatars [Pina, 2002]. The
expressiveness of an avatar is considered to be crucial for its effective communication
capabilities [Salem, 2000]. The advantage of creating an embodiment for an agent
(avatar) is to make it anthropomorphic and to provide a more natural method of
interacting with it.
Gratch [Gratch, 2002] identifies the key issues that must be addressed in the creation of
virtual humans as face-to-face conversations, emotions and personality, and human figure
animation. The avatar can be animated to create body movements, hand gestures, facial
expressions and lip synchronization.
[Noma, 2000] specifies that the animation of a virtual human should possess:

- Natural motion: In order to be plausible, the virtual human's motion should look
as natural as possible. The virtual human must have a body language that is
human-like.
- Speech synchronization: The body motion, in particular the lip movement of the
virtual human, should be in synchronization with the speech.
- Proper controls: The user should be able to control the agent, and changes should
be allowed if needed. The user should be able to represent the basic emotions and
use them to build combinations of emotions.
- Widespread system applicability: The tools for developing applications with
animated life-like agents should be integrated into current animation or
interface systems.
It has been recognized that the non-verbal aspect of communication plays an important
role in the daily life of humans [Tosa, 1996]. Human face-to-face conversation involves
sending and receiving information through both verbal and non-verbal channels. Having
a human-like agent makes it easier to understand the aim of a conversation and provides
us with an opportunity to exploit some of the advantages of Non-Verbal Communication
(NVC) like facial expressions and gestures [Salem, 2000].
Body animation, gestures, facial expressions, and lip synchronization are all very
important for Non-Verbal Communication. The face exhibits emotions while the body
demonstrates mood [Salem, 2000]. It is true that in a few applications, only the head and
shoulders of a person may fill the screen. This does not imply that facial animation is
more important than body animation. In applications where avatars are at a relatively
larger distance from each other, facial expression can be too subtle and can be easily
missed. In such a situation, gestures are more important. Therefore, an animation
technique which provides both facial and body animation is deemed necessary.
Body gestures, facial expressions and acoustic realization act as efficient vehicles to
convey emotions [Gratch, 2002]. Animation techniques are required to encompass body
gestures, locomotion, hand movements, body pose, faces, eyes, speech, and other
physiological necessities like breathing, blinking, and perspiring [Gratch, 2002].
Additionally, [Gratch, 2002] designates the following requirements for the control
architecture of believable agents.

- Conversational support: Initiating a conversation, giving up the floor, and
acknowledging the other person are all important features of human face-to-face
communication. The architecture used to build virtual humans should provide
support for such actions. For example, looking repeatedly at the other person
might be used as a way of giving the other person a chance to speak or waiting
for the other person to speak. A quick nod as the speaker finishes a sentence acts
as an acknowledgement.
- Seamless transition: A few behaviors, like gestures, require the virtual human to
reconfigure its limbs from time to time. It would help if the architecture allowed
the transition from one posture to the other to be smooth and seamless.
In summary, it is generally agreed that techniques for building animated life-like agents
are expected to synthesize virtual humans that depict plausible body animation, gestures,
facial animation, and lip synchronization.
Based on the above requirements, we have compiled several interesting questions in
order to analyze the existing research in the graphical representation of emotional agents.
They are as follows:
- How does the technique arrive at a set of useful gestures or expressions?
- Can gestures and expressions be triggered in a more natural way than selection from a
tool bar?
- How is it ensured that all the gestures and expressions are synchronized?
- Can the user control the gestures and expressions?
- Can the technique handle multiple emotions simultaneously?
- Does the system represent the mood of the agent?
- Does the technique provide both facial animation and body animation?
- Does the technique provide conversation support (gestures or expressions that
indicate a desire to initiate a conversation, to give up the floor, and to acknowledge the
speaker)?
- How does the technique decide the mapping between emotion or mood and graphics?
- Is the technique extendable?
- How many emotions does the technique represent? Is it possible to form complicated
emotions using a combination of the represented emotions?
- Is the animation seamless?
- Is the technique evaluated? If yes, do professionals or users do the evaluation, is it
limited or extensive, and is it scientific or not?
- Is there a working model or demonstration available? If yes, does it comply with all
the claims made?
As mentioned above, body animation, gestures, facial animation and lip synchronization
are the important aspects in the animation of believable agents. This report will handle
each of them independently in Chapters 2, 3 and 4 respectively. The additional
requirements, if any, for each aspect will be identified and the current literature will be
analyzed.
A few existing tools for the creation of animated agents like the Microsoft Agent, the
NetICE project, Ken Perlin’s responsive face, DI-Guy, and the Jack animation system,
will be evaluated based on the above mentioned criteria and a few additional criteria like
license agreements and cost (Chapter 5).
We will then consider whether any of the existing techniques can be integrated together
to provide a complete and plausible animation. The possible difficulties in such
integration will be considered (Chapter 6).
This report seeks to explain key features of graphical representations. We will compare
and contrast various graphical representations of agents as reported in the current
literature. In addition to categorizing the various agents, we will offer some explanation
of why various researchers have chosen to include or omit important features and what
we see as future trends.
Chapter 2
Body Animation
Animating the human body demands more than just controlling a skeleton. A plausible
body animation needs to incorporate intelligent movement strategies and soft muscle-based body surfaces that can change shape when joints move or when an external force is
applied to them [Gratch, 2002]. The movement strategies include solid foot contact,
proper reach, grasp, and plausible interactions with the agent’s own body and the objects
in the environment. The challenge in body animation is to build a life-like animation of
the human body that has sufficient detail to make both obvious and subtle movements
believable. Also, for a realistic body animation, maintaining an accurate geometric surface
throughout the simulation is necessary [Magnenat, 2003]. This means that the shape of
the body should not change when viewed from a different angle or when the agent starts
moving.
The existing human body modeling techniques can be classified as creative,
reconstructive, and interpolated [Seo, 2003b, Magnenat, 2003 and Seo, 2003a].
The creative modeling techniques use multiple layers to mimic individual muscles, bones
and tissues of the human body. The muscles and bones are modeled as triangle meshes
and ellipsoids [Scheepers, 1997]. Muscles are designed in such a way that they change
shape when the joints move. The skin is generated by filtering and extracting a polygonal
“isosurface” [Wilhelms, 1997]. An isosurface is defined as “a surface in 3D space, along
which some function is constant” [Jean]. The isosurface representation takes a set of
inputs and draws a 3D surface corresponding to points with a single scalar value [Iso].
Put simply, the isosurface is a 3D surface whose vertices can
be coupled with vertices on any other surface so that when a particular point on the latter
moves, the corresponding point on the isosurface is also displaced in the same direction
by the same amount. In creative modeling techniques, the vertices on the isosurface are
coupled with the underlying muscles, which make the skin motion consistent with the
muscle motion and the joint motion.
The creative modeling techniques were popular in the late 90s. Although the generated
simulation looks real, it involves substantial user involvement in the form of human
models, resulting in slow production time. Modern systems prefer reconstructive or
interpolated models because of the above drawbacks in creative models.
The reconstructive approach aims at building the animation using motion capture
techniques [Gratch, 2002]. The image captured can be modified using additional
techniques. Research [Gleicher, 2001; Tolani, 2000; Lewis, 2000; and Sloan, 2001] has
shown that using motion capture techniques to produce plausible animations is a
strenuous task since it would be difficult to maintain “environmental constraints” like
proper foot contacts, grasp and interaction with other objects in the environment [Gratch,
2002].
Since motion capture deals with the modification of the existing image, it is quite a
challenge to make the animation look plausible. Chances are that the animation would
look as if a separate image was pasted in the already existing environment. In other
words, it might look as if the animation is not a part of the environment. Also, it is
difficult to modify the generated images to produce different body shapes that the user
intends. Thus, the user has little control over the animation [Magnenat, 2003].
Interpolation modeling uses the existing set of example models to build new models. The
modifications to the existing set can be done using procedural code. The procedural code
allows programmatic control over the generated image.
Procedural approaches provide kinematic and dynamic techniques to parameterize the
location of objects and the types of movements to produce a believable motion [Gratch,
2002]. In general, kinematics is used for goal-directed and controlled actions, and
dynamics is used for applications that require response to forces or impacts. Kinematics
and dynamics differ in their applicability. Human animation, in general, might require
both the approaches, but in the case of emotional agents, kinematics seems to be more
useful.
Realistic body animation is not easy to produce since many body motions result from
synchronous movements of several joints. One option for producing a plausible body
animation is to attach motion sensors to the user in order to determine the user’s posture
and movements. Zen [Tosa, 2000] uses this approach. The use of motion sensors and
other techniques to handle complex body motions will be discussed in detail under the
“Gestures” section (Chapter 3) of the report.
The remainder of this chapter deals with two of the techniques available for body
animation in the current literature: the 2.5D Video Avatar and the MIRALab
animation. Each of these techniques is analyzed based on the applicable criteria identified
in the Introduction section (Chapter 1) of the report. Additionally, we specify whether the
technique or theory uses motion capture or the procedural approach. As mentioned above,
research claims that it is not easy to generate a plausible animation using motion capture.
Hence, if motion capture is used, we will examine how the generated animation handles
this challenge.
Table 1 gives an overview of the two systems. The remaining section of the chapter
describes each of the techniques in detail.
Criteria                    | 2.5D Video Avatar                       | MIRALab's system
Animation method            | Reconstructive (motion capture)         | Interpolation
Extensibility               | Limited                                 | Extensible
Control given to the user   | Very limited                            | User can control the animation
Real-time animation         | Possible                                | Possible
Evaluation                  | Scientific                              | Scientific
Scope of evaluation         | Limited                                 | Decent
Demo                        | Pictures of the animation are available | Pictures of the animation are available
Is the animation plausible? | Not quite                               | Reasonably plausible

Table 1: Body animation techniques
2.1 The 2.5D Video Avatar
The 2.5D Video Avatar, which uses motion capture, is part of the research done by the
MVL (Multimedia Virtual Laboratory) Research Center founded at the University of
Tokyo and a research center called the Gifu Technoplaza. The 2.5D Video Avatar falls
between the 2D Video Avatar and the 3D Video Avatar. [MVL] states that a 2D video
avatar is represented using a two-dimensional image and does not have three-dimensional
information. A 3D video avatar cannot be generated in real-time, because the time
required to generate a single picture is on the order of 5 seconds [Yamada, 1999]. The
2.5D avatar is used to model only the user’s surface, which can be generated in real time
since it takes only 0.9 seconds to produce an animation.
The image generated by using this technique is transported into a system known as
Computer Augmented Booth for Image Navigation (CABIN) [Hirose, 1999]. CABIN,
developed by the Intelligent Modeling Laboratory (IML) at the University of Tokyo, uses
the immersive projection technology to build a virtual world. Immersive projection
technology is a technique used to construct virtual worlds. It is used to build multi-person, high-resolution 3D graphics, video, and audio environments [Immer]. The user is
fully immersed into the environment using special stereoscopic glasses that help him/her
to see 3D images of objects float in space. The animation developed by MVL is designed
to work with other environments like the CAVE at the University of Illinois, CoCABIN
at Tsukuba University, UNIVERS at the communication Research Laboratory of the
Ministry of Posts and Telecommunications, and COSMOS at the Gifu prefecture
[Yamada, 1999]. All of the above-mentioned environments use the immersive projection
technology.
It is not indicated if the 2.5D avatar can be customized and transferred to other virtual
environments, which do not use the immersive projection technology. Hence, we assume
that the technique has limited applicability.
The 2.5D video avatar method uses depth information from stereo cameras to capture the
subject’s image, and the image is then superimposed on the virtual environment
[Yamada, 1999]. The use of depth information makes the image look more realistic than
a traditional 2D image. However, it does not provide the effect of a 3D image.
A Triclops (Point Grey Research Inc.) stereo camera system that has three lenses and
uses two baselines (one horizontal and one vertical) is used for video capture. The subject is
photographed from three angles, 0 degrees, 5 degrees and 15 degrees. The captured video
images are sent to a PC (Pentium II, 450 MHz), which rectifies the distortions and
calculates the depth map. The depth information is computed by determining the
corresponding pixels between the images captured by the stereo cameras along an
epipolar line. Applying a triangulation algorithm to these pixels produces the depth
map. The triangulation method takes the images and calculates the distance between the
corresponding pixels using three coordinate axes x, y, and z. [Yamada, 1999] explains the
triangulation theorem in more detail.
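For a rectified stereo pair, the triangulation step reduces to the standard depth-from-disparity relation: the depth of a point is the focal length times the baseline divided by the disparity between the corresponding pixels. The following sketch illustrates only this idealised case; the focal length, baseline, and disparity values are placeholders and are not taken from the Triclops setup described in [Yamada, 1999].

# Idealised depth-from-disparity computation for a rectified stereo pair.
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    if disparity_px <= 0:
        raise ValueError("corresponding pixels must have a positive disparity")
    return focal_px * baseline_m / disparity_px

# Example: a 450-pixel focal length, a 10 cm baseline, and a 15-pixel
# disparity give a depth of 3.0 meters.
print(depth_from_disparity(450.0, 0.10, 15.0))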
The rectified color and depth images produced by the PC are used to create a triangular
mesh. Mapping each color image onto each triangular mesh as texture data generates the
2.5D video avatar [Tamagawa, 2001].
When the subject interacts with a virtual object, both the image information and the
spatial information (like the subject’s positional relationships) are also transferred. This
makes the simulation look plausible when the subject wishes to point to an object in the
environment.
Though the evaluation is performed using scientific methods, it is limited since the
developers have concentrated only on evaluating the pointing accuracy. A square lattice
with a number of balls positioned 10 cm apart at the grid positions is placed in the shared
virtual world. The video avatar is made to point at one of the balls and an observer is
asked to orally indicate which ball is pointed at [Tamagawa, 2001]. If the observer selects
the wrong ball, the positional error offset is recorded. In order to avoid parallax errors,
the observers are encouraged to look at the avatar from various directions by walking
around the display space. When an observer is looking at an object, the actual position of
the object can be determined if the line of vision is at right angles to the plane of the
object. The object appears to be at a different place if the line of vision is changed. This is
said to be a parallax error. Their evaluation shows an average error of 7.4 cm.

We assume that they chose this method of evaluation because an offset of 10 cm in a
pointing gesture can still produce a reasonably plausible animation, unless two objects
lie within 10 cm of each other.
They claim that it takes 0.9 seconds to generate the 2.5D avatar. This shows that the
technique could be used to generate real-time animation.
We have identified the following drawbacks in the technique. From the evaluation, it is
clear that the simulation looks reasonably plausible when the 2.5D video avatar tries to
point at an object in the environment, but the problem of the virtual being trying to hold
an object still remains. Another disadvantage is that it is not possible to segment the
image from the environment effectively. There has to be a method to recognize where the
subject’s image ends and where the environment starts. The technique does not offer any
special procedure for this. Also from the pictures of the simulation provided in
[Tamagawa, 2001], it is clear that the superimposed animation does not have proper foot
contact in the target environment. It looks as if the image is floating in the environment.
2.2 MIRALab
The MIRALab research group at the University of Geneva uses the interpolation
technique and the already existing captured body geometry of real people to produce
plausible body animations. The system uses the existing techniques for human body
animation to build a library of body templates. The dimensions of each of the templates
are stored in the form of a vector. These dimensions include the following details from
the body geometry of the template [Seo, 2003a]:
- The vertical distance from the crown of the head to the ground.
- The vertical distance from the center of the body to the ground.
- A set of individual distances from the shoulder line to the elbow, the elbow to the
wrist, and the wrist to the tip of the small finger.
- The girth of the neck.
- The maximum circumference at the chest, trunk, and waist.
When a new animation is needed, the requirement is specified as an input vector with
values specified for each of the dimensions mentioned above. A closest match is found in
the library of templates by comparing the input with the dimensions of the template. This
template is then modified to produce the new animation.
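The closest-match step can be pictured as a nearest-neighbour search over the dimension vectors. The sketch below is only an illustration of that idea; the actual dimension set, distance measure, and template data used by MIRALab are not published in this form.

# Illustrative nearest-template search over body-dimension vectors.
import math
from typing import Dict, List

def closest_template(query: List[float],
                     library: Dict[str, List[float]]) -> str:
    """Return the name of the template whose dimension vector is nearest."""
    def distance(a: List[float], b: List[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(library, key=lambda name: distance(query, library[name]))

# Hypothetical vectors: [height, centre height, arm length, neck girth, chest girth] in cm
library = {
    "template_a": [175.0, 90.0, 72.0, 38.0, 98.0],
    "template_b": [160.0, 82.0, 65.0, 34.0, 88.0],
}
print(closest_template([172.0, 88.0, 70.0, 37.0, 95.0], library))  # template_a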
If the animation has to represent a live person, 3D scanning is used to generate the input.
Range scanners are used to capture the sizes and shapes of real people [Magnenat, 2003].
The image is then measured and the body geometry is stored as a vector which acts as the
input.
Using this data, the most suitable template is found. It is stated that the library of templates
is formed using existing animation techniques, but the animation techniques used are not
specified.
The interpolation function determines the necessary deformation by blending the input
parameters and the existing templates, producing the displacement vector [Seo, 2003a].
However, the implementation of the interpolation function is not made public.
Appropriate shape and proportion of the human body are generated using a deformation
function which takes the displacement vector as input [Seo, 2003a]. They use a technique
called “radial basis interpolation” to generate the deformation functions by using 3D
scanned data of several human bodies.
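Radial basis interpolation itself is a standard technique: known example parameter vectors are mapped to known displacements, kernel weights are solved for, and a new parameter vector is interpolated as a weighted sum of kernels. The sketch below shows this generic form with a Gaussian kernel; it is not the MIRALab implementation, which has not been made public.

# Generic radial basis interpolation with a Gaussian kernel (illustrative).
import numpy as np

def rbf_fit(examples: np.ndarray, values: np.ndarray, width: float = 1.0) -> np.ndarray:
    """Solve for the kernel weights that reproduce the example values."""
    d = np.linalg.norm(examples[:, None, :] - examples[None, :, :], axis=2)
    phi = np.exp(-(d / width) ** 2)
    return np.linalg.solve(phi, values)

def rbf_eval(query: np.ndarray, examples: np.ndarray,
             weights: np.ndarray, width: float = 1.0) -> np.ndarray:
    d = np.linalg.norm(examples - query, axis=1)
    return np.exp(-(d / width) ** 2) @ weights

# Two hypothetical examples: a parameter vector mapped to a scalar displacement.
examples = np.array([[1.70, 0.95], [1.85, 1.05]])
values = np.array([0.02, 0.05])
weights = rbf_fit(examples, values)
print(rbf_eval(np.array([1.78, 1.00]), examples, weights))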
The advantage of using an existing template to produce a new animation is twofold. It
allows vector representation of the parameters, which is an easy way to describe the
shape needed. Also, the initial skin attachment information can be reused [Seo, 2003b].
Body geometry is assumed to consist of two distinct entities, the rigid component and the
elastic component [Seo, 2003a]. The desired physique is produced by manipulating the
rigid and elastic entities in the vector. The rigid deformation is used to specify the joint
parameters which determine the linear approximation of the physique. In other words, it
works by modifying the skeletal view. The elastic deformation depicts the shape of the
body when added to the rigid deformation [Seo, 2003a].
The problem of modeling a population of virtual humans is reduced to the problem of
generating a parameter set [Seo, 2003a]. When the animation has to represent a living
human, parameters are generated by scanning the body. If a fictitious human is to be
built, the most suitable template is selected by examining the body model of each
template, and it is modified to generate the required body animation. Since the user can
generate a new model or modify an existing one by inputting a number of sizing
parameters, we categorize the system as extensible.
If the user is not satisfied with the generated animation, he/she can make minor changes
by programmatically modifying the set of parameters using trial and error. Different
postures can be generated by applying appropriate transformations to the rigid entries in
the vector. Proper foot hold, grasp, and reach can be modeled by changing the
parameters. Hence, the system offers the user sufficient control over the generated
animation.
The system also models the clothes worn by the virtual human. Various algorithms
segment the garments into pieces depending on whether they stick to the body surface or
flow on to it [Magnenat, 2003]. The segmented pieces are then coupled with the skin
parameters.
Once the animation looks believable, the system provides a mapping function which
attempts to estimate the height and weight of the generated animation. This mapping
function takes both the rigid and elastic entities into consideration and estimates the
height and weight, respectively, of the person represented by the animation.
Implementation details of the mapping function are not made public. If these entities do
not deviate from the height and weight of the human the animation is trying to represent
by more than 0.1%, it is concluded that the animation is plausible enough. This method is
used to minimize the error in the representation and aids in generating a realistic body
animation. Although this approach is quite interesting, its feasibility is questionable since
(for example) the bone weight of different people is different even if they have the same
size of bones [Bone].
They claim that it takes less than a second on a 1.0 GHz Pentium 3 to generate the
animation after receiving the input parameters from the user, indicating that the technique
could be used to generate real-time animation.
The evaluation is scientific and is done by cross-validating using the existing templates.
As mentioned earlier, the library of templates is generated using the existing body
animation techniques which generate plausible animations but are tedious to use [Seo,
2003b]. For the evaluation, each one of the templates is removed from the library, and its
parameters are given as an input to the synthesizer. The generated output model is then
compared with the input template. If the output matches the input, it is concluded that this
technique produces plausible animations and is quite easy to use when compared to the
other available techniques. Results of the evaluation show that the difference between the
input and the output is at most 0.001 cm, so the performance of the synthesizer can be
considered good.
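The cross-validation loop itself is simple to state. The sketch below illustrates it with a stand-in 'synthesize' function; the real synthesizer and its comparison of full body geometries are, as noted above, not public, so the parameter-vector comparison used here is only an assumption made for illustration.

# Illustrative leave-one-out evaluation: remove each template, regenerate it
# from its own parameters, and measure the deviation from the original.
from typing import Callable, Dict, List

def leave_one_out(library: Dict[str, List[float]],
                  synthesize: Callable[[List[float], Dict[str, List[float]]], List[float]]
                  ) -> Dict[str, float]:
    errors = {}
    for name, params in library.items():
        reduced = {k: v for k, v in library.items() if k != name}
        generated = synthesize(params, reduced)
        errors[name] = max(abs(a - b) for a, b in zip(generated, params))
    return errors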
Chapter 3
Gestures
Gestures are a natural way of human-to-human communication. They help to show
emotions during communication and enrich the clarity of speech [Leung, 2001]. Gestures
are an integral part of human communication and are often used spontaneously and
instinctively. Hence, even in virtual humans, gestures must be made to occur in the flow
of the animation. They should not be explicitly activated by using menu or button
controls [Salem, 2000].
A few gestures have the entire message contained in them [Salem, 2000]. For example, a
nod denotes a yes. Other types of gestures, for example, a thinking gesture, are used to
complement speech [Leung, 2001]. Some other gestures like the pointing gesture are
context dependent [Yamada, 1999]. For example, a pointing gesture can be used to refer
to an object, or direction of displacement. The animation technique should be able to
represent all kinds of gestures.
Gestures can be animated by calling predefined functions from a library of gestures and
expressions [Salem, 2000].
There are several ways in which these functions can generate the required gesture. A motion
capture technique can be used, and the captured image can be modified to reflect the
required gesture. Sensors can be connected to the subject’s body and the avatar’s limbs
and face can be manipulated based on the movement of the subject [Tosa, 2000]. Key
frame-based techniques can also be used to generate the gesture functions by
“interpolating” pre-determined frames [Leung, 2001]. In key frame animation, the current
posture of the avatar is stored as the source key-frame and the desired posture is stored as
the target key-frame. The transition from the source to the target is achieved by
manipulating a few frames in the representation. Details about key-frame animation are
provided in [Cad]. Key frame animation is a popular technique in 3D animated feature
films.
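A minimal sketch of the key-frame idea (linear interpolation between a source and a target posture; the joint names and angles are hypothetical) is given below.

# Linear interpolation between two key-frame postures (illustrative).
from typing import Dict

Posture = Dict[str, float]  # joint name -> angle in degrees

def interpolate(source: Posture, target: Posture, t: float) -> Posture:
    """Blend two key frames; t runs from 0.0 (source) to 1.0 (target)."""
    return {joint: (1.0 - t) * source[joint] + t * target[joint] for joint in source}

source = {"elbow": 10.0, "shoulder": 0.0}
target = {"elbow": 90.0, "shoulder": 45.0}
frames = [interpolate(source, target, i / 10.0) for i in range(11)]
print(frames[5])  # half-way posture: {'elbow': 50.0, 'shoulder': 22.5}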
Using motion capture or sensors to animate gestures requires special equipment and
involves a heavy cost [Salem, 2000].
Once the library of gestures is formed, it is possible to determine which function has to
be called based on the frequency or the tone of the subject’s speech, the words used in the
speech, and the mood and personality of the subject. For example, if the user raises his
voice to emphasize certain words, the avatar can make a suitable gesture to show the
emphasis.
This section deals with three of the techniques available for gesture animation, a
technique used for the design of a Collaborative Virtual Environment (CVE) [Salem,
2000], the BEAT architecture [Cassell, 2001], and a technique used for building a Virtual
Human Presenter [Noma, 2000]. Each of these techniques will be analyzed based on the
applicable criteria identified in the Introduction section (Chapter 1) of the report. We will
determine if the motion capture, sensors, or the key frame based method is used to
generate the needed gesture. Additionally, we will discuss whether the technique or
theory represents all the types of gestures.
Table 2 summarizes each of the techniques. The entries in the table which say “cannot be
determined” indicate that not enough information is provided to determine whether the
system satisfies that particular criterion. The entries which say “possible” indicate that,
although the description of the system does not explicitly say anything about satisfying
that criterion, it can be inferred from the available description that the system can
possibly satisfy it.
Criteria | CVE | BEAT | Virtual Human Presenter
How are gestures triggered? | Through words in the input text | Through words in the input text | Through words in the input text
How is the mapping between words and gestures decided? | Cannot be determined | Set of rules formed from existing research | Set of rules formed from books on gestural vocabulary
Are all kinds of gestures represented? | Yes | Support for pointing gestures is not explicit | Yes
Is the system extensible? | Yes | Yes | Yes, but time-consuming
Can the system be controlled by the user? | Yes | Yes | Yes
Can the system be used in real time? | Cannot be determined | Yes | Yes
Does it include personality and mood? | Yes | Yes | Limited
Can a combination of gestures be generated? | Cannot be determined | Possible | Possible
How are the gestures represented graphically? | Not specified | Possibly using interpolation techniques | Using interpolation techniques
Can they be integrated with the existing body animation techniques? | Cannot be determined | Possible | Possible
Is the transition between gestures smooth? | Cannot be determined | No | Yes
Evaluation | Not done | Scientific and limited | User evaluation and limited
Is there a demonstration available? | No | No | Yes

Table 2: Gesture animation techniques
The remaining part of the section explains each of the techniques in detail.
3.1 The Collaborative Virtual Environment
The Collaborative Virtual Environment (CVE) was designed as an expansion of the
text-based chat room by a research group at the University of Plymouth. Avatars in the CVE
communicate using text and other non-verbal channels of communication. The non-verbal
channels involve facial expressions, eye glances, body postures, and gestures [Salem, 2000].
Input from the user is taken in the form of text and the appropriate gestures are generated
using the words in the message. The text is scanned to find abbreviations, punctuation,
emotion icons, and performative words. Performative words are used to denote words in
the input text which need a physical action to be performed. For example, wave is a
performative word.
Abbreviations like LOL (Laughed Out Loud), IMHO (In My Humble Opinion), and
CUL8R (See You Later) are recognized, and the relevant gestures (laughing, surprise,
neutral pose, and wave) are invoked.
Punctuation marks like ? and ! are recognized and interpreted, respectively, as questioning
and emphasizing a message. The gesture for questioning is animated as the head slightly
thrown back, one eyebrow raised and a hand out-stretched. The gesture for emphasizing a
message is animated as the head slightly thrown back, eyebrows raised and torso upright.
Since the technique is used to extend the already existing text based chat room, emotion
icons, which are very common in any chat environment, are also considered. For
example, :-) (smile), :-( (sad/upset), and :-* (kiss) are animated as smile, head and
shoulders drooped, and blowing a kiss respectively.
A few common words like ‘yes’ and ‘no’ are also associated with appropriate gestures
like nodding the head and shaking the head. Apart from these, phrases which are
categorized as performative words are also handled. A performative word is enclosed
between two asterisks. For example, a *wave* in the user's text indicates that the user wants to
wave. So the avatar is made to wave in the virtual environment.
The system allows the user to customize the mapping of a gesture to a keyword. The user
is provided with the ability to assign a gesture to a different keyword than the keyword
suggested by the system. Such changes can be saved as a separate file and can be loaded
whenever necessary. This makes the system extensible.
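The scanning step can be pictured as a keyword-to-gesture lookup over the tokens of a message. The sketch below uses only the example mappings quoted above; the full keyword list, the real matching rules, and the file format for saved customizations are not published, so everything beyond those examples is an assumption made for illustration.

# Illustrative keyword-to-gesture scan for a chat message.
import re

GESTURES = {
    "lol": "laugh", "cul8r": "wave", ":-)": "smile", ":-(": "droop",
    "yes": "nod", "no": "shake_head", "?": "questioning", "!": "emphasize",
}

def gestures_for(message: str, custom: dict = None) -> list:
    """Return the gesture names triggered by a chat message."""
    mapping = {**GESTURES, **(custom or {})}   # user-customized mappings take priority
    triggered = []
    for token in re.findall(r"\*\w+\*|[:;]-?[)(*]|[?!]|\w+", message.lower()):
        if token.startswith("*") and token.endswith("*"):
            triggered.append(token.strip("*"))  # performative word such as *wave*
        elif token in mapping:
            triggered.append(mapping[token])
    return triggered

print(gestures_for("LOL yes! *wave*"))  # ['laugh', 'nod', 'emphasize', 'wave']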
[Gratch, 2002] identifies the importance of conversational support. Initiating a
conversation, giving up the floor, and acknowledging the other person form an integral
part of human face-to-face communication. The system provides gestures for all three
intentions. Initiating a conversation is accomplished with a greeting which is a wave of
the hand. Giving up the floor is accomplished by forwarding the arms, offering, and then
pulling back. Acknowledging the other person is done by a nod of the head. Intent to
leave is expressed by keeping the gaze connected and a quarter turn of the body.
The information contained in the input text is used to control the movement of hands,
arms and legs. Also, it is claimed that a set of gestures for a particular avatar is generated
by taking the personality, mood and other relevant characteristics as an input and then
coupling it with a predefined library of generic gestures and expressions. As soon as the
mood of the avatar is changed, a new set of gestures is generated.
We have identified several points of the system which are difficult to evaluate. The paper
[Salem, 2000] gives a few examples for each keyword category (abbreviations,
punctuation, emotion icons, and performative words), but the entire list of keywords is
not made public. Also, it is not clear how the mapping between words and the
actions is achieved. The gestures (actions) associated with the words mentioned as
examples are obvious, but the mapping for complex gestures is not mentioned. Also it is
indicated that the gestures can be customized, but it is not clear if the user is allowed to
add new gestures. Furthermore, information about how the gestures are represented
graphically is missing. Hence, it is difficult to determine if the system can be integrated
with any of the existing body animation techniques.
Even though many aspects of the theory look promising, the unavailability of a
demonstration and lack of an evaluation makes it impossible to determine if they have
met all their claims.
3.2 BEAT
The Behavior Expression Animation Toolkit (BEAT) was developed by the Gesture and
Language Narrative Group (GNL) at the MIT Media Lab. The tool takes typed text to be
spoken by the animated human figure as input and produces the appropriate nonverbal
behavior [Cassell, 2001]. The nonverbal behavior is generated on the basis of “linguistic”
and “contextual” analysis of the input text.
The linguistic analysis is used to identify the key words in the text. Key words are the
words that represent the emotion of the speaker when he utters the word. For example, in
the sentence “I am surprised!” the word surprised is a key word.
Contextual analysis is used to estimate the context in which the given text is spoken. The
nonverbal behavior produced can then be sent to an animation system. The toolkit
automatically suggests appropriate gestures and facial expressions for a given input text.
A set of rules formed from the existing research in the field of communication is used to
map the text to the appropriate gesture. Also, the system allows animators to include their
own set of rules to work for different personalities in the form of filters and a knowledge
base, which are to be written in XSL (Extensible Stylesheet language). Filters can be used
to reflect the personality and mood of the avatar. Details about the filters and knowledge
bases are provided in the subsequent paragraphs.
[Cassell, 2001] describes the technique as follows: The system uses an “input-to-output
pipeline” approach and provides support for user generated filters and knowledge bases.
The term input-output pipeline means that each stage in the system is sequential. The
output from one stage forms the input to the next stage. The system is written in Java and
XML. The use of XML and Java makes the technique portable.
The input text is sent to a “language tagging module”, which converts it into tags. These
tags are then analyzed and coupled with a generic knowledge base and a set of behaviors
(called “suggested behaviors”) is formed.
The generic knowledge base provides common gestures that include the beat, which is a
vague flick of the hand, the deictic, which is a pointing gesture, the iconic, which is an
act of surprise, and the contrast, which is the contrastive gesture. For example, a tag
<surprise> might be mapped to a gesture which shows raised eye brows in the knowledge
base. The knowledge base, in general, is used to store some basic knowledge about the
world and is used to draw inferences from the input text. The kinds of gestures to be used
and the places where emphasis is needed are determined from these inferences. These
inferences form the set of suggested behaviors.
The user specified knowledge base and personality filters are used to filter the set of
suggested behaviors to form the selected behavior. The selected behavior contains the
name of the gesture and the command to represent it graphically. For example, to move
the right arm, the generated gesture would be
<GESTURE NAME="MOVE">
  <RIGHTARM HANDSHAPE="5"/>
</GESTURE>
Animators are allowed to design new gestures and include them into the system. This
requires a new tag to be added into the knowledge database and a corresponding gesture
command mapped to it. This makes the system extensible.
The available description of the system does not explicitly deal with integration details,
but it appears that the animator can change these gestures to
work with the chosen body animation technique. He/she can use a key frame-based
interpolation [Wu, 2001] approach which takes the body animation and the above gesture
command and moves the right arm by 5 degrees. Thus, it can be said that the system can
be integrated with the existing body animation techniques.
Similarly, even though the description of the system does not discuss representing a
combination of gestures, it can be inferred that they can be generated by appending
instructions to the generated tag. For example, the above tag can be modified to represent
a tilted head by appending HEAD = 30 to RIGHTARM HANDSHAPE = 5.
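As an illustration of this point, the combined command could be assembled programmatically; the element and attribute names below follow the example above, but treating HEAD as an extra attribute of the same element is our assumption rather than a documented part of BEAT.

# Illustrative assembly of a combined gesture command.
import xml.etree.ElementTree as ET

gesture = ET.Element("GESTURE", NAME="MOVE")
arm = ET.SubElement(gesture, "RIGHTARM", HANDSHAPE="5")
arm.set("HEAD", "30")          # append a head tilt to the same command
print(ET.tostring(gesture, encoding="unicode"))
# prints: <GESTURE NAME="MOVE"><RIGHTARM HANDSHAPE="5" HEAD="30" /></GESTURE>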
An utterance coupled with a gesture is estimated to be generated in 500-1000 ms, which
is calculated to be less than the natural pause in a dialogue [Cassell, 2001]. Hence, the
system can be used for real-time animation.
From the description, it is not clear if the theory can represent pointing gestures. Also, no
attempt is made to make the transition between gestures smooth.
[Cassell, 2001] claims that BEAT was extensively tested and a demonstration is available
at http://www.media.mit.edu/groups/gn/projects/beat. However, the specified link is no
longer active as of September 27, 2003. From the verbal description of the evaluation
provided, it can be said that the evaluation is scientific, but not very extensive. The
system was tested using an input text with two sentences and pictures of the generated
gestures are provided. The generated animation appears to be of moderate quality. The
evaluation could be improved by using input texts which require a combination of gestures;
for example, it would be interesting to see how the avatar depicts the sentence “I wonder how this
works!” From the description of the theory, the literals wonder, how and ! are separated
and the corresponding gestures surprise, questioning, and emphasizing are generated.
When a real human says this sentence, he would be showing primarily questioning with a
combination of surprise and emphasis. If the avatar can achieve the same combination of
gestures, it can be said to be plausible.
3.3 The Virtual Human Presenter
The Virtual Human Presenter was developed on the Jack animated-agent system at the
Center for Human Modeling and Simulation at the University of Pennsylvania. The
system serves as a programming toolkit for the generation of human animations [Noma,
2000].
The system takes the input text, scans it and automatically embeds gesture commands on
the basis of the words used. The virtual human is then made to speak the text and the
embedded commands produce the animation in synchronization with his speech. A
command starts with a backslash and can be followed by arguments enclosed in braces
depending on its type.
For example, an input text which says “This system supports gestures like giving and
taking, rejecting, and warning” can be modified into “This system supports gestures like
\gest_givetake giving and taking, \gest_reject rejecting, and \gest_warn warning.” The
commands in the example do not take any arguments. Other commands like
\point_idxf(),\point_back(), \point_down() and \point_move() represent pointing gestures
and take arguments. Additionally, commands like \posture_neutral and \posture_slant are
used to specify the body orientations. The gestures generated are controlled by means of
the commands.
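The embedding step can be sketched as a keyword scan that inserts a command before each trigger word. The command names below follow the examples in the text, but the keyword table and the insertion rule are hypothetical stand-ins for the published gesture vocabulary used by the actual system.

# Illustrative embedding of gesture commands into presentation text.
COMMANDS = {
    "giving": r"\gest_givetake",
    "rejecting": r"\gest_reject",
    "warning": r"\gest_warn",
}

def embed_commands(text: str) -> str:
    """Insert a gesture command immediately before each trigger word."""
    words = []
    for word in text.split():
        key = word.strip(",.").lower()
        if key in COMMANDS:
            words.append(COMMANDS[key])
        words.append(word)
    return " ".join(words)

print(embed_commands("This system supports gestures like giving and taking, rejecting, and warning"))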
A set of Parallel Transition Networks (PaT-Nets) is used to control the avatar. PaT-Nets
are parallel state machines which are easy to manipulate. They can monitor
resources used, the current state of the environment, and sequence actions [Badler, 1995].
The networks handle all the tasks from parsing the inputs to animating the joints.
Smooth transition between gestures is produced by using motion blending techniques.
The motion blending algorithm uses many motion editing algorithms and “multi-target
interpolation” to produce a smooth animation [Kovar, 2003].
It is claimed that the library of gesture commands includes all the required gestures for
presentation, and the mapping between words and actions is based on a published
convention for gestures, presentations and public speaking. The gesture commands are
generated by collecting vocabularies from psychological literature on gestures and
popular books on presentation and public speaking. The mapping between words and
gestures is achieved from a book about Delsarte [T. Shawn, Every Little Movement - A
Book about Delsarte, M. Witmark & Sons, Chicago, 1954]. However, the list of gestures
available is not specified.
It is not explicitly indicated whether a combination of gestures can be represented.
However, gesture functions can be parameterized and modified to reflect additional
gestures. Hence, it can be inferred that modifying the library of gesture functions makes a
combination of gestures possible. Since each gesture command is a call to a function, the
code in the function can be changed to reflect what the user requires. However, it is not
specified if the animator is given enough privilege to do it.
Also, if an avatar with a different personality and mood is to be built, the entire library of
commands must be reconstructed. To change the library each time the personality or the
mood changes is a strenuous task and is time consuming. Hence, extensibility of the
system is achieved at the cost of time.
The animation is produced by using key frame-based interpolation. The speed of the
animation is claimed to be 30 frames per second. Hence, the tool can be used for real-time animation.
The limited evaluation is done by users. Since the tool is used to generate a virtual human
presenter, emphasis is given to the quality of speech of the presenter rather than the
gestures generated. They only concentrate on the pointing gestures.
A demonstration of the tool is available at http://www.pluto.ai.kyutech.ac.jp/~noma/vpre-e.html
as of September 28, 2003. The gestures shown in the
demonstration are not very impressive. The movement of the avatar is not human-like.
Also, the gestures demonstrated are mainly pointing gestures. A video showing all the
different gestures is available, but it is not clear what the virtual human is trying to enact.
There is no synchronization between the gestures and speech.
Chapter 4
Facial Animation and Lip Synchronization
In human face-to-face communications, facial expressions are excellent carriers of
emotions [Salem, 2000]. Eye contact and gaze awareness play an important role in
conveying messages non-verbally [Leung, 2001]. Like gestures, facial expressions in
humans also occur spontaneously and instinctively. Thus, they should be made to occur
in the flow of the animation instead of being explicitly driven by menu or button controls.
Lip synchronization is an important component in facial animation. Some animation
techniques provide random jaw and lip movements. When speech is attached to such
animations, the resulting animation does not look plausible. In order to avoid this, many
of the facial animation techniques provide support for lip synchronization.
Facial animation can be done in three different ways [Gratch, 2002]. The first method is
to use keyframe-based interpolation techniques. These methods are called parametric
animation techniques, and they use geometric interpolation to produce the required shape
[Byun, 2002]. Geometric interpolation is similar to the keyframe-based approach. (A brief
introduction to the keyframe-based approach was provided in the “Gesture Animation”
(Chapter 3) section of the report.)
The second method is to produce facial animation from text or speech. In this method, an
algorithm used to analyze the text or the speech identifies a set of “phonemes”. Phonemes
can be defined as the smallest unit in language that is capable of conveying a distinction
in meaning [Dictionary]. The phonemes are then mapped to visemes. Visemes act as
visual phonemes. A model called the speech articulation model takes the visemes as an
input and animates the face [Gratch, 2002].
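A minimal sketch of the phoneme-to-viseme step is given below; the phoneme symbols and their grouping into visemes are illustrative only and do not reproduce any particular published table.

# Illustrative mapping from a phoneme sequence to a viseme sequence.
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_teeth",   "v": "lip_teeth",
    "o": "rounded",     "u": "rounded",
    "a": "open",        "e": "spread",     "i": "spread",
}

def visemes_for(phonemes: list) -> list:
    """Map a phoneme sequence to the visemes that drive the mouth shape."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

# A hand-made phoneme sequence (illustrative only).
print(visemes_for(["m", "o", "v", "i"]))
# ['lips_closed', 'rounded', 'lip_teeth', 'spread']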
The speech articulation model operates on a generic face, which is represented in the
form of a mesh of triangles and polygons. The animator is expected to provide this face
mesh. It uses physics-based models to simulate skin and facial muscles [Byun, 2002].
Mathematical models are used to produce changes in the skin tissues and skin surface
when the facial muscles move. Keyframe based interpolation is used to identify the key
poses and produce a smooth transition between them. Though the generated animation is
realistic, generating the initial face mesh involves a lot of manual work by the animators.
Another set of methods for facial animation, termed performance-driven methods, extract
the required facial expressions from live humans (or from videos of those humans) by
using some special electromechanical devices [Byun, 2002]. A library of the regularly
used facial expressions is made from the captured images. The required facial expression
is called from the library as needed. These methods are usually used in combination with
motion capture methods for body animation. They require special equipment and a
tremendous amount of human involvement in the form of models. The techniques can be
used to generate a library of facial expressions but it would be an involved task to
customize the expressions to work for a new face model because they are specific to a
particular person [Gratch, 2002]. Each time an animation of a different subject is
required, he or she is made to go through the entire process. This can be a time
consuming process.
Owing to the drawbacks in the other two methods, parametric animation techniques are
being widely used in the latest animation systems. They take two sets of parameters, the
Facial Action Coding System (FACS) and the Facial Animation Parameters (FAPs). The
FACS and the FAPs are explained in the remaining part of the section.
[Leung, 2001] describes the Facial Action Coding System (FACS) which was developed
by Ekman and Friesen in 1978. [FACS] is a list of all “visually distinguishable facial
movements.” The list of FACS is frequently updated. That is the reason the parametric
animation techniques take the FACS as a parameter.
The Facial Animation Parameters (FAPs) represent a facial expression in the form of a
set of distances between various facial features. Different expressions are produced by
changing these FAPs [MPEG-4]. In simpler terms, the FAPs are a set of 66 parameters
which store the distance between various facial feature points of a given face model. A
complete description of what each of these 66 parameters represent is available at [ISO,
1997]. Simply stated, 16 FAPs represent the jaws, chin and lips; 12 FAPs represent the
eyeballs, pupils and eyelids; 8 FAPs represent the eyebrows; 4 FAPs represent the
cheeks; 5 represent the tongue [MPEG-4] and so on. Each parametric animation
technique uses the FACS and FAPs in a different way.
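The idea of driving an expression through FAPs can be sketched as a set of displacements applied to named feature points. The two feature points, the FAP names, and the displacement values below are illustrative; they do not reproduce the normalised units or the full 66-parameter set defined by MPEG-4.

# Illustrative application of FAP-style displacements to facial feature points.
FEATURE_POINTS = {
    "left_mouth_corner":  [-1.0, 0.0, 0.0],
    "right_mouth_corner": [ 1.0, 0.0, 0.0],
}

# Hypothetical FAP table: FAP name -> (feature point, axis index, displacement)
FAPS = {
    "stretch_l_cornerlip": ("left_mouth_corner",  0, -0.1),
    "stretch_r_cornerlip": ("right_mouth_corner", 0,  0.1),
}

def apply_faps(points: dict, active: list) -> dict:
    """Return new feature-point positions after applying the named FAPs."""
    result = {name: list(position) for name, position in points.items()}
    for fap in active:
        point, axis, delta = FAPS[fap]
        result[point][axis] += delta
    return result

print(apply_faps(FEATURE_POINTS, ["stretch_l_cornerlip", "stretch_r_cornerlip"]))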
There are two ways in which lip synchronization can be achieved in facial animation
[Leung, 2001]. The first approach uses “energy detection techniques” to convert the input
speech into an angle for the mouth opening. The energy content in the speech is measured
and the lips of the avatar are animated accordingly. For example, an “o” in the uttered
word results in the lips of the agent forming a brief circle, whereas two “o”s result in a
more pronounced lip movement. The higher the intensity, the more pronounced is the lip
movement. The quality of the generated animation depends on the quality of the input.
The speech has to be recorded with good quality and the energy information has to be
captured accurately. This requires the presence of additional equipment. Hence, the
technique is not a preferred method.
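The energy-detection idea can nevertheless be stated compactly: the short-time energy of the speech signal is mapped to a mouth-opening angle, frame by frame. The window length, the linear mapping, and the 30-degree ceiling in the sketch below are arbitrary choices made for illustration.

# Illustrative mapping from short-time speech energy to a mouth-opening angle.
def mouth_angles(samples: list, window: int = 160, max_angle: float = 30.0) -> list:
    """Return one mouth-opening angle (degrees) per analysis window."""
    angles = []
    for start in range(0, len(samples) - window + 1, window):
        frame = samples[start:start + window]
        energy = sum(s * s for s in frame) / window
        angles.append(min(max_angle, max_angle * energy))  # clamp to the ceiling
    return angles

# A quiet frame followed by a louder one (amplitudes are illustrative).
signal = [0.05] * 160 + [0.8] * 160
print(mouth_angles(signal))  # small opening, then a much wider opening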
The second approach generates phonemes by scanning the input text. The phonemes are
then mapped to appropriate visemes, which are used to manipulate the lip movement of
the avatar. The method of generating lip movements based on the viseme information
depends on the animation technique being used. Most of the existing facial animation
techniques use the second method to provide lip synchronization.
Facial animation requires generating plausible facial models and mechanisms to move the
surface of the produced face model to reflect the required expressions and emotions
[Egges, 2003]. Lip synchronization, jaw rotation, and eye movement are some of the
important considerations in facial animations.
The generated animation should be believable, i.e., the agent should blink appropriately,
the lips, teeth, and tongue should be modeled and properly animated, and emotions must
be readable from the face [Byun, 2002]. The parameters used to control the animation
should be easy to use. The control parameters should be consistent and easily adaptable
across different face models. In other words, customizing the animation data for a
different model should require as little human involvement as possible. The generated
facial animation must be able to work with body animation and gesture animation.
We describe and analyze four of the tools available for facial animation: BEAT
[Cassell, 2001], FacMOTE [Byun, 2002], MIRALab's tool for facial animation
[Egges, 2003], and the BALDI system [Baldi]. Each of these tools will be evaluated
based on the applicable criteria from Chapter 1 of the report and the additional
requirements identified so far in this section.
Table 3 summarizes each of the techniques. The entries in the table labeled “cannot be
determined” indicate that not enough information is provided to determine whether the
system satisfies that particular criterion. The entries labeled “possible” indicate that,
although the description of the system does not explicitly say anything about satisfying
that criterion, it can be inferred from the available description that the system possibly
satisfies it.
Criteria | BEAT | FacMOTE | MIRALab | BALDI
How are facial expressions triggered? | Through words in the input text | Possibly through words in the input text | Through words in the input text | Through words in the input text
Does it include personality and mood? | Yes | Cannot be determined | Possible | Possible
Is the agent made to blink often? | Possible | Cannot be determined | Yes | Yes
Can a combination of expressions be generated? | Possible | Possible | Yes | Yes
Does the technique provide support for lip synchronization? | Yes | Yes | Yes | Yes
Can the animation be controlled by the user? | Yes | Yes | Yes | Yes
Is the system extensible? | Yes | Yes | Yes | Yes
Can the intensity of the emotion be changed? | Cannot be determined | Yes | Yes | Yes
Is it portable to other face models? | Cannot be determined | Only MPEG-4 models | Yes | Yes
Can it be integrated with the existing body animation techniques? | Possible | Possible | Possible | Possible
Can the system be used in real time? | Yes | Yes | Yes | Yes
Evaluation | Scientific and limited | Scientific and decent | User evaluated and very limited | User evaluated and scientific
Demonstration | No | No | No | Yes
Which of the above-mentioned methods of animation is used? | Viseme generation method | Parametric animation technique | Parametric animation technique | Not specified

Table 3: Facial animation tools
4.1 BEAT
The Behavior Expression Animation Toolkit (BEAT) developed by the Gesture and
Language Narrative Group (GNL) at the MIT Media Lab can be used to produce facial
expressions. The technique was described in the “Gesture Animation” section (Section
3.2) of the report.
The tool takes typed text to be spoken by the animated human figure as input and
produces the appropriate nonverbal behavior [Cassell, 2001]. The nonverbal behavior is
generated on the basis of “linguistic” and “contextual” analysis of the input text. It uses
the generic knowledge database to produce a set of “suggested behaviors”. This set is
then coupled with the user-generated filters to produce a selected behavior [Cassell,
2001].
The behavior suggestion module contains a series of facial expression generators like an
eyebrow flash generator and a gaze generator. The eyebrow flash generator signals the
raising of eyebrows when something surprising happens. This can be customized as
mentioned in the "Gesture Animation" section (Section 3.2) of the report.
The gaze generator is algorithmic and suggests gazing away from the user at the
beginning of a dialog and gazing towards the user at the end of the dialog. If the dialog
process is long, it suggests gazing at periodic intervals.
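A minimal sketch of such a rule-based gaze generator is given below; the timing thresholds and target names are our own illustrative choices, not values taken from [Cassell, 2001].

```python
def suggest_gaze(utterance_duration, glance_interval=4.0):
    """Suggest (time, gaze_target) events for one dialog turn.

    Follows the rule described above: look away when the turn begins, look back at
    the listener when it ends, and glance at the listener periodically if the turn
    is long. Times are in seconds; the thresholds are illustrative guesses.
    """
    events = [(0.0, "away")]
    t = glance_interval
    while t < utterance_duration - 1.0:              # periodic glances mid-turn
        events.append((t, "toward_listener"))
        events.append((t + 0.5, "away"))
        t += glance_interval
    events.append((utterance_duration, "toward_listener"))
    return events

print(suggest_gaze(10.0))
```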
As mentioned in the description of the BEAT in the “Gesture Animation” section
(Section 3.2), the system is extensible and can be used in real time. In addition, it can be
inferred that a combination of facial expressions can be generated.
From the description of the system, we have inferred that it uses the second method of
animation (the viseme generation method) described earlier in the section. Input text is
analyzed to produce visemes, and the animation is generated by the speech articulation
model taking these visemes as input.
From the available data, we infer that blinking of the eyes at regular intervals can be
achieved by including the corresponding call in the selected behavior. It is, however, not
specified if the intensity of the facial expression can be changed.
Lip synchronization is provided by recognizing the visemes in the input text. Lip
movements for ten distinct visemes are available. The description does not specify if the
animator can modify the existing lip movements or generate new movements.
4.2 FacMOTE
FacMOTE is a facial animation technique designed at the Department of Computer and
Information Science at the University of Pennsylvania. The technique produces the
required facial animation using a parametric animation technique. The system can work
with a facial model created by using motion capture or generated manually as long as it is
expressed in the MPEG-4 form [Byun, 2002; Garchery, 2001].
MPEG–4 is a standard that is used to produce high quality visual communication [Goto,
2001]. It defines a set of points on the face as Face Definition Parameters (FDP). Some of
these points are used to define the shape of the face. A particular position of the face
which does not show any emotion is decided to be the neutral position. A set of
parameters called the Facial Animation Parameters (FAP) specify displacements from the
neutral face position.
These FAPs are applied to the FDPs, and the required facial expression is generated.
FAPs can be used to generate visemes and expressions. As mentioned earlier, visemes
are visual phonemes used to represent lip synchronization. For example, when the avatar
has to utter a word like "hello", the visemes make sure that at the end of the utterance, the
lips of the avatar look like an "o". Fourteen distinguishable visemes are included in the
library provided by the MPEG-4 standard. Transitions from one viseme to the other can
be produced by blending the two visemes together, using a weighting factor for each of
them [Garchery, 2001]. Since the MPEG-4 deals with both the facial expressions and
visemes, it can be inferred that any animation technique which follows the MPEG-4
standard provides support for lip synchronization.
Similarly, six facial expressions (joy, sadness, anger, fear, disgust and surprise) are
provided. Each facial expression is associated with a value which specifies the intensity
of the expression. The intensity can be varied as needed. Also, it is possible to produce a
combination of expressions by blending the provided expressions with a weighting factor.
For example, 70% of fear and 30% of surprise can be blended together to show horror.
Details about how the visemes and expressions can be blended are specific to the
techniques that use the MPEG model.
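As an illustration of this weighted blending, the sketch below combines two expression FAP vectors. The vectors themselves are random stand-ins, since actual FAP values are model-specific; only the blending step reflects the idea described above.

```python
import numpy as np

N_FAPS = 66  # number of low-level MPEG-4 facial animation parameters

def blend(expressions, weights):
    """Weighted blend of expression FAP vectors (each a displacement from the
    neutral face). Weights are normalised so the result stays in range."""
    weights = np.array(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * e for w, e in zip(weights, expressions))

# Illustrative stand-ins for the FAP displacement vectors of two basic expressions.
fear = np.random.default_rng(0).uniform(-1, 1, N_FAPS)
surprise = np.random.default_rng(1).uniform(-1, 1, N_FAPS)

horror = blend([fear, surprise], [0.7, 0.3])   # 70% fear + 30% surprise
print(horror[:5])
```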
A 3D model of the person who has to be represented is obtained using a 3D laser scanner.
Facial Animation Parameters (FAPs) are then generated from this 3D model using
multiple complicated mathematical algorithms [Byun, 2002].
Since the large number of Facial Animation Parameters (66) makes it difficult to use
FAPs as a direct animation tool, FacMOTE uses a set of four higher level
parameters that drive the underlying 66 FAPs. Mapping between the FAPs and the higher
level parameters is described in the subsequent paragraphs. The use of such parameters
allows easy control over the face.
The FAPs are organized as sets of individual FAP units (FAPUs). For example, all the
FAPs used to control the eyes could be grouped into an eye FAPU, and so on. The four
higher level parameters are used to control each of the FAPUs. The set of higher level
parameters are categorized as space, weight, time, and flow. These parameters were
adapted from the effort parameters of the EMOTE system [Chi, 2000]. Each of these
parameters takes values ranging from -1 to 1. A “0” represents a neutral pose.
[Byun, 2002] offers the following examples.
Space parameters can vary between indirect, which is represented by a -1 and direct,
which is represented by a 1. For example, space parameters controlling an eye can be
described as a gaze when the value selected is -1 and a focused look when the value
selected is a 1. Similarly, space parameters are linked with other FAPUs to produce
various other expressions.
Weight parameters can vary between light and strong. When associated with speech, a
light action can be whispering and a strong action can be snarling.
Time parameters can vary between sustained and quick. When associated with the
FAPUs of the mouth, a sustained action could be yawning and a quick action could be
clearing of the throat.
Flow parameters can vary between free and bound. A free action could be laughing,
while a bound action could be chuckling. Free and bound are similar to sustained actions
and quick actions in time parameters. The only difference is that flow parameters can be
associated with speech. For example, a person can be laughing while at the same time
saying "This is very funny".
The set of FAPs is used to specify each expression. A neutral expression can be generated
by setting the four parameters to zero, which in turn sets all the FAPUs to zero and hence
all the FAPs to zero. For example, a smile can be generated by setting the weight
parameter associated with the lips FAPU to a value between -1 and 0 and the flow
parameter to a value between 0 and 1. The intensity of the smile can be varied by
changing these values. By using this approach, the animator is given a better means of
controlling the FAPs than having to deal with all 66 parameters directly.
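The following sketch illustrates the general idea of driving grouped FAPs from a few high-level parameters. The grouping of FAP indices and the mapping from the four parameters to a single gain are our own simplifications for illustration; they do not reproduce the mapping published in [Byun, 2002].

```python
import numpy as np

# Group the 66 FAP indices into named units (FAPUs); the grouping here is illustrative.
FAPUS = {
    "lips": list(range(0, 18)),
    "eyes": list(range(18, 30)),
    "eyebrows": list(range(30, 38)),
    "rest": list(range(38, 66)),
}

def apply_high_level(base_faps, fapu, space=0.0, weight=0.0, time=0.0, flow=0.0):
    """Modulate one FAPU of a base expression with four parameters in [-1, 1].

    Only a sketch of the idea: the four values are combined into one gain that
    scales the FAP displacements of that unit, so all-zero parameters leave the
    expression unchanged. The real mapping in [Byun, 2002] is more elaborate.
    """
    faps = np.array(base_faps, dtype=float)
    gain = 1.0 + 0.25 * (space + weight + time + flow)
    faps[FAPUS[fapu]] *= gain
    return faps

smile = np.zeros(66)
smile[FAPUS["lips"]] = 0.4                               # a mild lip displacement
broad_smile = apply_high_level(smile, "lips", weight=-0.5, flow=0.8)
print(broad_smile[:4])
```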
Keyframe based interpolation techniques can be used to generate the animation from the
given set of FAPs. Hence, it is possible to integrate the generated animation with the
existing body animations which are produced using interpolation techniques.
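A minimal keyframe interpolation over FAP vectors could look as follows; linear interpolation and a 25 frames-per-second rate are assumptions made for illustration.

```python
import numpy as np

def interpolate_faps(keyframes, fps=25):
    """Linearly interpolate between (time_seconds, fap_vector) keyframes,
    returning one FAP vector per rendered frame."""
    keyframes = sorted(keyframes, key=lambda k: k[0])
    frames = []
    for (t0, f0), (t1, f1) in zip(keyframes, keyframes[1:]):
        n = max(1, int((t1 - t0) * fps))
        for i in range(n):
            alpha = i / n
            frames.append((1 - alpha) * np.asarray(f0) + alpha * np.asarray(f1))
    frames.append(np.asarray(keyframes[-1][1]))
    return frames

neutral = np.zeros(66)
smile = np.zeros(66); smile[:18] = 0.4
frames = interpolate_faps([(0.0, neutral), (0.5, smile), (1.0, neutral)])
print(len(frames))   # roughly one second of animation at 25 fps
```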
Evaluation is done by trying to generate all the facial expressions mentioned in
FACS by varying the values of the four parameters in each FAPU. [Byun, 2002] shows a
snapshot of the generated animation for one set of values, and it appears reasonable.
Combinations of expressions are also represented.
They claim that since the FAPs store the distance between various facial feature points,
the same FAP data can be used on different face models and generate realistic animation.
The lack of a demonstration makes it impossible to verify this claim.
The explanation of the system does not explicitly specify how facial expressions are
triggered. They have generated a library of all the regularly used expressions (like smile,
and surprise) by changing the values of each of the four parameters for each FAPU.
Assuming that their claim that the same FAP data can be used over different face models
is true, we have inferred that the required expression can be called from the library based
on the analysis of the input text. This was the approach used in gesture animation.
The approach can be used in real–time because the facial expressions are generated by
keyframe interpolation techniques, which is quite fast [Cad].
From the description of the method, we have inferred that it is possible to make the agent
blink at specific intervals of time by embedding the appropriate trigger into the input text.
When the input text is being scanned, a call to the eye blinking expression can be
included. It is possible that there is a more systematic way of including eye blinking, but
it is not evident from the provided description.
The information provided about the technique does not explicitly state the ability of the
user to add new expressions or to modify the existing expressions. We have inferred that
the user could change or intensify an expression by changing the values of the four
parameters. New expressions can also be generated by using the FAPs of an existing
expression and changing the values of the four parameters. Once the intended expression
is produced, it can be stored in the library. The disadvantage with this approach is that the
four parameters are changed by trial and error in order to produce the required facial
expression. This can be quite time consuming and frustrating at times.
4.3 The MIRALab
This section describes and evaluates a facial animation technique designed by the
MIRALab research group at the University of Geneva. The technique operates on input
from the user in the form of text or audio. If the input is in the form of audio, it is
converted into text by using available speech-to-text software. The system produces the
facial animation in real-time and couples speech to it. The generated 3D face hence
shows facial expressions and speaks the specified text [Egges, 2003].
The text input is analyzed and tags are produced. These tags are then used to determine
the appropriate facial expression. The research group at MIRALab also uses the MPEG-4
technique to generate facial animation.
They claim that the use of Face Animation Parameters (FAPs) alone does not provide
sufficient quality of facial animation for all applications. As mentioned earlier, it is
difficult to produce an animation by controlling 66 parameters. Hence, in their animation
they use a Facial Animation Table (FAT), which is also defined by the MPEG-4 standard
[MPEG-4].
The Facial Animation Table (FAT) defines the effect of changing the set of FAPs. The
table is indexed by facial expressions, called “IndexedFaceSet”. The IndexedFaceSet
shows a facial expression graphically and points to the set of FAPs for that expression.
The FAT contains different fields like coordIndex, which contains the list of FAPs that
are to be changed to represent the current facial expression, and coordinate, which
specifies the intensity and direction of the displacement of the vertices in the coordIndex
field.
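A minimal stand-in for such a table is sketched below. The field names follow the description above, but the data layout is our own simplification, not the actual MPEG-4 FAT node structure.

```python
from dataclasses import dataclass, field

@dataclass
class FatEntry:
    """One indexed face set: which FAPs move for this expression, and how far."""
    coord_index: list            # indices of the FAPs affected by the expression
    coordinate: list             # displacement (intensity and direction) per index

@dataclass
class FacialAnimationTable:
    entries: dict = field(default_factory=dict)   # expression name -> FatEntry

    def add(self, name, coord_index, coordinate):
        self.entries[name] = FatEntry(coord_index, coordinate)

    def faps_for(self, name, n_faps=66):
        """Expand an entry into a full FAP displacement vector."""
        faps = [0.0] * n_faps
        entry = self.entries[name]
        for idx, disp in zip(entry.coord_index, entry.coordinate):
            faps[idx] = disp
        return faps

fat = FacialAnimationTable()
fat.add("smile", coord_index=[3, 4, 5], coordinate=[0.4, 0.4, 0.2])
print(fat.faps_for("smile")[:8])
```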
The person to be represented is photographed, and 3D animation algorithms use these
photographs to produce the 3D model of the person in the form of triangles. 3D graphical
tools like 3D Studio Max and Maya could be used to change the model if needed
[Garchery, 2001]. The set of FAPs is generated from the 3D models using specific
algorithms. Details about how these algorithms work can be found in [Kshirsagar,
2001].
The research group has developed tools to automatically build the Facial Animation
Table (FAT) from the 3D animated face produced using [Kshirsagar, 2001]. The tool
takes the animated face model, generates the FAPs and modifies these FAPs slightly to
produce a different facial expression. The animator is then asked if he needs this
expression. Hence, the animator can choose to store the generated expression if it is
deemed necessary. Controls are provided in the form of slider bars, and the animator can
himself/herself change a particular Facial Animation Parameters Unit (FAPU) to reflect
the needed expression. Though the generation of the FAT is a one-time task and is less
strenuous when compared to the method used by FacMOTE, it is still time consuming.
The research group claims that the FAT can be downloaded and the animator does not
have to generate a new FAT each time he wishes to create a new face animation. The
same FAT data can be used on different face models and it is still possible to generate
realistic animation. Again as in the case of FacMOTE, the lack of a demonstration makes
it impossible to determine if their claim is true.
The system can be used in real time once the FAT is formed or downloaded, since frames
of animation are generated at the rate of 3.4 frames/second. The animation is produced by
means of a keyframe based interpolation technique, which was also developed at the
MIRALab. The generated animation can be integrated with body animation which is also
generated using keyframe based methods.
The tool represents both visemes and expressions. The Face Animation Table (FAT)
provided stores the fourteen basic visemes and the six emotions identified by MPEG-4.
The animator can generate additional expressions and visemes using the tool that was
used to build the FAT from 3D face model. The intensity of the emotion can be
controlled by changing the intensity of the corresponding FAP.
Since analysis of text is used to trigger the facial expressions, we assume that blinking of
eyes can be modeled by placing the appropriate tag at regular intervals in the text. It is
claimed that the technique can be extended to any face model that follows the MPEG-4
standard. If a particular model does not follow that standard, a special algorithm which is
developed at the MIRALab [Garchery, 2001] is used to extract Facial Animation
Parameters (FAPs).
The technique was used to build a virtual tutor application which was evaluated by many
human users. Details of the criteria of evaluation and the responses of the evaluators are
missing. Hence, we have assumed that the evaluation is not scientific and is limited.
4.4 BALDI
BALDI is a conversational agent project developed for the Center for Spoken Language
Understanding (CSLU) at the University of Colorado. It is funded by a National
Science Foundation (NSF) grant. The aim of the project is to develop interactive learning
tools for the language training of deaf children. The system takes a recorded utterance
and a typed version of the same utterance as input. The input is scanned and expressions
are triggered based on the words in the input. Lip synchronization is achieved by
producing visemes from the input text.
The head of the agent is made up of a number of polygons joined and blended together to
form a smooth surface. The source code for the generation of the head, which is written
in C, is provided on request.
When a particular user has to be represented, a picture of the user is taken and projected
onto the generic head generated by this code. Special functions, called texture mapping
functions, are used to blend the projection with the generic face.
Since the system is used to train deaf children in any particular language, more
importance is given to lip synchronization. The lip synchronization is controlled by 33
parameters. However, basic facial expressions (like happiness, anger, fear, disgust,
surprise and sadness) are represented [Toolkit].
The user is provided with an interactive interface by means of which he can add new
expressions or change the intensity of the existing expressions. As part of the evaluation,
users are made to look at the animation and are asked to interpret what the agent is
saying. Success in interpreting the spoken text correctly is recorded. They state that, on
average, one utterance in 17 is misinterpreted. The evaluation is scientific and requires
user involvement.
A demonstration of the technique is available at http://www.cse.ogi.edu/CSLU/toolkit
active as of October 1, 2003. The toolkit can be installed on a PC and can be tested. The
generated animation in the demonstration produces proper lip synchronization, but does
not represent any facial expressions. The agent is realistic to look at and blinks its eyes
periodically. The animation can be generated in real time.
The technique uses interpolation techniques to produce the facial animation [Massaro,
1998]. Hence, we have inferred that it can be coupled with the body animation techniques
which use interpolation techniques.
From the provided information, it is not clear if the tool can be extended to other face
models. Also, it is not clear whether personality and mood can be integrated into the
tool.
Chapter 5
Evaluation of the Existing Tools
In this chapter, we evaluate five existing animation tools, Microsoft Agent, NetICE,
Jack, DI-Guy, and Dr. Ken Perlin's Responsive Face, based on the applicable criteria
from Chapter 1 of the report.
Table 4 gives a list of evaluation criteria and the performance of each of the tools with
respect to those criteria.
Criteria | MS Agent | NetICE | Jack | DI-Guy | Responsive face
Public/Private | Public | Private | Public | Public | Private
Body animation and gestures | Decent | Limited | Decent | Appears to be decent | Not applicable
Facial animation | Limited | Limited | Very Limited | Limited | Decent
Speech | Decent | Very limited | Not available | Claimed to be provided | Very limited
Control | Provided | Not provided | Provided | Cannot be determined | Provided
Extensible | Yes | No | Limited | Limited | Limited
Demonstration | Yes | Yes | Yes | No | Yes
System requirements | Basic | Not specified | Not specified | Basic + TNT2 graphics card and a 32MB VRAM | Basic
Examples in code | Provided | Not provided | Not provided | Cannot be determined | Not applicable
Difficulty level | Easy to test | Cannot be determined | Cannot be determined | Cannot be determined | Easy to use
Support | Provided | Not provided | Provided | Cannot be determined | Not applicable
Cost | Free | Not provided | Cannot be determined | $9000 for basic features | Not applicable

Table 4: Existing animation tools
Microsoft Agent
The Microsoft Agent (MS Agent) is publicly available software that provides a few
animated characters, which show gestures and some expressions. Two of the characters
available (Merlin and Genie) look human-like. There are other characters built by third
party developers that are MS agent compatible. Agentry [Agentry] is one of them. Some
of these agents represent both the body and the face.
The user can control the MS Agent or an MS Agent-compatible agent programmatically. There
are some examples in code that act as a demonstration. These examples are available in
languages like VC++, J++, Visual Basic, and HTML and can be downloaded from the
MS Agent web page [MS Agent]. The code is easy to understand and execute. We have
tried modifying the existing example code in Visual Basic for our evaluation of the tool.
The code works as expected and creates an agent which performs all the specified
actions.
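The same kind of driver program can be written in any language that can reach the MS Agent COM control. The sketch below, in Python, is illustrative only: it assumes a Windows machine with the pywin32 package, the MS Agent runtime, and the Merlin character file installed, and the character file path may differ on a given installation.

```python
# Illustrative only: assumes Windows, pywin32, the MS Agent runtime, and the
# Merlin character; "merlin.acs" and its location are assumptions.
import time
import win32com.client

agent = win32com.client.Dispatch("Agent.Control.2")   # the MS Agent ActiveX control
agent.Connected = True

agent.Characters.Load("Merlin", "merlin.acs")          # character file path may differ
merlin = agent.Characters.Character("Merlin")

merlin.Show()
merlin.Play("Confused")                                # one of Merlin's built-in animations
merlin.Speak("What a wonderful day!")                  # spoken and shown in a text bubble
time.sleep(8)                                          # give the agent time to finish speaking
merlin.Hide()
```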
We have tried to integrate some of the animations developed by the third party sources
into the MS agent software. We tested them, but our major evaluation was done on
Merlin, an animation provided by Microsoft. We have chosen “Merlin”, since the
character looked more human-like when compared to the others.
The input is given in the form of text. The agent analyzes the text and speaks, showing
appropriate changes in its tone. For example, "What a wonderful day!!" is said with the
needed emphasis, and "What day is it today?" is said with a hint of questioning.
Additionally, the spoken text can be displayed in a text bubble. Snapshots of the
Microsoft Agent are shown in Figures 1, 2 and 3.
We have observed that lip synchronization is available and is reasonably plausible. To
improve the lip synchronization, special software called the Linguistic sound editing tool
can be used which allows the animator to develop phoneme and word-break information.
The agents provided by Microsoft show emotions like happiness, sorrow and surprise.
Various useful gestures, such as explaining, acting helpless, and pointing to something, are also
represented. The gestures and expressions that can be represented by each character were
listed in one of the examples available on the website [MS Agent]. The third party agents,
on the other hand, show only a limited number of emotions or gestures. We have tried
generating an animation by calling the various gesture functions from a driver program
and giving different input texts. The generated animation was seamless and impressive.
The software can be downloaded to a PC. A few components (such as localization support,
agent character files, and text-to-speech engines) can be downloaded separately; they are
available on the Microsoft website.
Figure 1: Initial interface when a MS agent is run
Figure 2: Agent demonstrating the confused gesture
Figure 3: Agent demonstrating the confused gesture and speaking text simultaneously
Developers are provided with a tool called the Agent Character Editor that allows
the creation of custom agent characters. Documentation for the Agent Character Editor is
available. We have tried using the editor, but found it non-trivial to use. Support is available in the
form of a troubleshooting section and frequently asked questions section. The support
provided is helpful.
The MS agent is available royalty-free when used for the developer’s own application. A
distribution license is required to use the tool if the application is to be posted on a
server or distributed via electronic mail.
NetICE
The Networked Intelligent Virtual Environment (NetICE) is a project done by the
Advanced Multimedia Processing Lab at the Carnegie Mellon University. It aims at
providing a virtual conference setting so that people from remote places can still feel that
they are communicating in person.
NetICE uses a client-server architecture. The server distributes information to the clients.
Each client at a remote location is rendered a 3D audiovisual environment. The client can
add his/her avatar to the environment, see the virtual environment, and see the avatars
of all the other participants. He/she can change position, look around the
environment and operate his/her hands (raise and lower both hands). There is a
whiteboard available for the client to write on. Figures 4, 5 and 6 demonstrate the working of
this tool.
Figure 4: The collaborative virtual environment and the agents in it.
Figure 5: A closer look at the animation
Figure 6: Agent demonstrating the raising of a hand gesture
The website for NetICE [NetICE] provides a downloadable client side executable file.
The file can be downloaded, and a connection to the server can be established using the
specified port and IP address. This provides the client with a virtual environment
containing his/her avatar. The user can choose to use his/her own head and face model
for this avatar, or use the synthetic model provided. If the user wishes to use his/her own
face model, it has to be created by some other means; the tool currently does not provide
any support for the creation of face models. The avatar can move around the room. No
gestures are provided other than raising and lowering the hands.
The basic facial expressions - joy, anger, surprise, sadness, fear, and disgust - are
provided. The user’s face is seen well by others in the environment since the user always
sees the back of his avatar.
A demonstration of a virtual environment is available on the website. We have observed
from this demonstration that the movements of the virtual human are robot-like. In other
words, the animation is not seamless. The body of the virtual human is also robot-like.
Facial expressions are limited, and the user is offered no control over them. Also, the lip
movement is not synchronized with the speech. Though speech support is provided, the
utterances are always in the same tone, no matter what the text is. In other words, there is
no emotion shown in the speech.
No special software is needed to run the demonstration, and it is not specified whether the
full tool requires any special software. From a video presentation of the product available at
[NetICE], it is clear that a tracking system is needed to track the user's eyes and transfer
the gaze to the environment. The tracking system is used to make the avatar maintain eye
contact with the other avatars in the environment.
It is claimed that provision for the user to use his own voice is provided, but this is not
sufficiently demonstrated.
As it currently stands, the tool provides reasonably good support for virtual business
conferences, where it might not be very necessary to represent the emotions of the
participants, but it is not very useful for representing emotional agents.
JACK
Jack is a product of the Electronic Data Systems Corporation (EDS), which provides IT
services [EDS]. It is a software tool which helps developers to build virtual humans to
operate in virtual environments. These virtual humans are designed with the intent of
replacing real humans in testing and analyzing the performance of machines. A female
embodiment called Jill is also provided.
When assigned to various tasks in a virtual environment, the virtual humans can tell
engineers what they can see and reach, and when and why they get hurt. This helps
developers design safer and more efficient products.
A demonstration of the working of the virtual human is provided at [Jack]. Figures 7 and
8 show the snapshots from the demonstration. From this demonstration, we observed that
the body animation of both the virtual humans is plausible and seamless. The tool
provides a motion capture toolkit [Jack] which can be used to generate gestures. This can
be done by using either motion sensors or by using controls provided in the form of slider
bars. There is a library of movements available. If the animator needs to modify any
existing movement slightly to generate a gesture, he can use the controls provided. If a
more complicated gesture is needed, he can use the motion sensors. The sensor
attachments link the virtual human and the real human. The action of the real human
attached with a sensor is reflected in the virtual human. The required gesture can be
generated and stored in the library of available gestures.
The availability and the cost of the motion sensors are, however, not clear.
Facial animation is provided, but the virtual human does not show any expressions or
emotions.
The tool provides a template of 77 body animations that can be used as a virtual human.
It is said that this template can be modified to form a new virtual human, but there is no
description or demonstration available that shows how this can be achieved. We therefore
conclude that the extensibility of the tool is limited since it does not provide sufficient
evidence to prove its claims.
Figure 7: Agent showing the ability to hold an object.
Figure 8: Agent demonstrating the done gesture
The cost of the product and the license agreement details are not explicitly stated.
Support is provided in the form of customer service and a frequently asked questions
section on the website [Jack].
Details about the requirements of the computer to run this tool are missing. Also, there
are no examples in code demonstrating how the tool works.
DI- Guy
The DI-Guy [DI-Guy] is commercial software developed by the Boston Dynamics lab for
adding human-like characters to simulations. Though the product is used mainly to train
military personnel, from the information provided we have observed that the tool can be
used to generate body animation and gestures.
It claims to provide realistic human models, a library of 1000 behaviors and an API to
control the behaviors. It also claims that the tool is compatible across platforms like
Windows, Linux, and Solaris. To run the toolkit, the system needs to possess a TNT2
graphics card and 32 MB of VRAM.
The DI-Guy comes with a set of characters, and a set of facial expressions (like smile,
trust, distrust, conniving, head nodding, head shaking and blinking) can be represented.
The user can combine these expressions to generate new expressions. Support for lip
synchronization is also claimed.
The current version of DI-Guy forces the user to use the body models provided with the
tool; other body models cannot be added. This limits the extensibility of the system.
Also, it is not clear if the user can control the animation in any way.
Two types of licenses (called development license and runtime license) are available. The
development license allows a new DI-Guy to be built. The runtime license allows the
user to run the application on an additional computer.
The DI-Guy product costs $9000, with an additional $3500 for expressive faces. We feel
that the major disadvantage of this tool is the lack of a demonstration. We feel that a
demonstration should be provided to help the user decide whether or not to purchase the
product.
Responsive face
Responsive face is work done by Dr. Ken Perlin at the New York University Media
Research Lab. It is part of the Improv project.
A demonstration of the face is available at [Face]. Figures 9, 10 and 11 show a few
snapshots from the demonstration. The face exhibits some predefined emotions like
fright, anger, and disappointment. A panel of controls is available, which the user can
operate to produce additional expressions built from the provided expressions. Once the
required expression is formed, a snapshot can be taken and added to a time line. The time
line is represented in the form of a bar and contains a list of all the snapshots needed to
generate the required animation. After producing a series of snapshots, the animation can
be played to make the face move through all of them.
Figure 9: The face and the panel of controls
Figure 10: The timeline which does not have any snapshots
Figure 11: Timeline with snapshots
We observe that this face can represent multiple emotions. The quality of the animation is
seamless and very impressive. The transition from one snapshot to the other in the time
line is done without any discontinuity.
Dr. Perlin mentions that the responsive face has been integrated with a body animation in
the Improv project. Also, there is a control button in the panel provided which makes the
face speak. From this information, we infer that the face can be integrated with a few
existing body animation techniques and support for speech can also be provided.
However, how this can be achieved is not evident from the available information.
Chapter 6
Recommendations
We have analyzed various techniques for body animation, gesture animation, facial
animation, and lip synchronization in the previous sections of the report. In this chapter,
we summarize our analysis and, where possible, suggest improvements to the analyzed
systems. We also analyze whether any of the existing techniques can be integrated
together to produce a believable animation and what, if any, are the possible difficulties
in such an integration.
Body animation
We have considered two body animation techniques: the 2.5D video avatar and the body
animation technique designed by the MIRALab.
We have observed that the 2.5D avatar looks plausible as long as it points to an object in
the virtual environment, but other movements (like holding an object) cannot be
simulated because motion capture is used to produce the animation. As mentioned earlier
(Section 2.1), modifying a captured motion is a strenuous task. Hence, we feel that this
technique can only be used to produce animations for applications which need pointing
gestures like a virtual human presenter.
Additionally, we have identified that the system does not segment the image captured
from the original environment effectively. Because of this, the captured motion often has
a part of the environment in its background. A method to recognize where the subject's
image ends and where the environment starts would be very helpful. We observe that the
generated animation does not have proper foot contact in the target environment. It looks
as if the avatar is floating in the air. We feel that any approach which can add some
gravity to the image and make the avatar have solid foot contact in the target environment
would help immensely to increase the plausibility of the generated animation.
The animation technique designed by the MIRA Lab (Section 2.2) maintains a vector
which stores some distances from the body geometry (like the vertical distance from the
crown of the head to the ground and so on). The animation technique takes the vector as
an input and compares it with the vectors of the existing body templates (stored in a
library). The required animation is generated by blending the two vectors using special
procedures as discussed in the corresponding section of the report. If the user has to
modify the generated animation, he/she will have to manually change the values in the
vector using trial and error. We think that trial and error can be quite frustrating and time
consuming. We feel that this can be minimized by representing each of the dimensions in
a panel as slider bars. The user can change the parameters by modifying the slider bars.
From the description of the MIRA Lab’s animation technique and the evaluation done,
we feel that it is possible to produce reasonably believable body animations using this
technique.
Gesture animation
We have considered three gesture animation techniques: the Collaborative Virtual
Environment, the BEAT architecture, and the Virtual Human Presenter.
The technique used for the design of Collaborative Virtual Environment (CVE) (section
3.1) takes an input text and triggers the appropriate gestures by using the words in the
text. The list of gesture functions available is not made public, and it is not specified if
the user is allowed to add any new gestures.
We feel that it would be helpful to the animator if the list of available gestures is made
public. Also, the presence of a user-friendly interface by means of which the animator
can produce a new gesture by blending two or more available gestures would make the
tool extensible. A means of selecting the weight of each of the gestures in the final
gesture will be very helpful. For example, the animator should be able to choose a gesture
denoting surprise and blend it with a gesture denoting fear, by assigning each of them a
weight of 30% and 70%. The resulting gesture can be used to represent horror. The
animator should be able to change the existing gestures to form new gestures. Each new
gesture could be assigned a name and could be linked to a word which triggers the
gesture. Also, we strongly believe that the user must be given a chance to choose his
avatar. In other words, he should be allowed to use the existing synthetic body provided
by the CVE or add new body models generated using any of the existing body animation
techniques. This technique is used in a 3D chat environment called Outer worlds [Outer]
and is very helpful.
Another technique, called Virtual Human Presenter (Section 3.3), takes the input text,
analyzes it, and embeds gesture function calls in the text. We feel that the technique
works well to design a virtual presenter, but to represent an emotional agent, a few
features have to be included. Firstly, the animator should be provided control over the
animation. He/she should be able to add new gestures and modify the existing gestures. It
would be convenient if a user-friendly interface can be provided for this.
We have observed that the speech is not synchronized with gestures. This might be
because the agent is made to speak the text and then the appropriate gesture function is
called. For example, the text “I warn you” is modified as “I warn \gest_warn you”. We
suggest that the gesture function should show the required gesture and simultaneously
make the agent say the text. In the above example, the \gest_warn function can make the
agent utter the word, warn.
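A minimal sketch of this suggestion is given below. The tag syntax follows the example above, while the gesture and speech calls are hypothetical placeholders for the presenter's real functions.

```python
import re

def play_gesture(name):
    print(f"[gesture: {name}]")        # placeholder for the real gesture call

def speak(word):
    print(f"[speak: {word}]")          # placeholder for the real text-to-speech call

def present(annotated_text):
    """Walk text such as 'I warn \\gest_warn you' and start each gesture together
    with the word it annotates, rather than after the whole sentence."""
    pending_word = None
    for token in annotated_text.split():
        match = re.match(r"\\gest_(\w+)", token)
        if match:
            play_gesture(match.group(1))       # gesture starts with the word before it
            if pending_word:
                speak(pending_word)
                pending_word = None
            continue
        if pending_word:
            speak(pending_word)
        pending_word = token
    if pending_word:
        speak(pending_word)

present(r"I warn \gest_warn you")
```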
The Behavior Expression Animation Toolkit (BEAT) (Section 3.2), developed by the
Gesture and Language Narrative Group, produces the required gesture commands by
linguistic and contextual analysis of the input text.
We feel that the generated gesture commands can be animated using the interpolation
techniques. The advantage of using interpolation techniques is that the gestures generated
by the BEAT architecture can then be integrated to the existing body animation
techniques which also use interpolation techniques.
From the information provided, we infer that the system can be used to represent a
combination of gestures. We observe that BEAT does not provide support for pointing
gestures. However, they can be added into the system by modifying the user knowledge
base.
From the description of the technique and our analysis, we feel that BEAT is a useful
technique for producing gesture animation for emotional agents.
Facial animation and lip synchronization
We have considered four facial animation techniques: BEAT, FacMOTE, the MIRALab's
approach, and BALDI. The BEAT for facial animation (Section 4.1) operates in the
same way it operates for gesture animation. The FacMOTE (Section 4.2) uses the
MPEG-4 standard and produces the animation by modifying a set of Facial Animation
Parameters (FAP). We infer from the description of the system that it is possible to add
new expressions or to modify the existing expressions. This can be achieved by
modifying the values in the Facial Animation Parameter Units (FAPU). It would be
helpful if the user is provided with a list of basic expressions and control bars to change
these expressions. For example, a set of control bars each controlling the eyes, eye brows,
and lips can be provided. The animator can choose any existing expression and change
these control bars to generate the needed expression. Once the needed expression is
generated, it can be saved into the library of expressions. A similar approach is used in
Dr. Ken Perlin’s responsive face and is helpful.
The MIRA Lab (Section 4.3) also uses the MPEG-4 standard to produce facial
expressions and lip synchronization. It provides all the missing features in the FacMOTE
technique. The Facial Animation Parameters are modified to produce some basic
expressions and these expressions are stored in the Facial Animation Table (FAT). The
user is allowed to add new expressions and modify existing expressions. The only
drawback with this approach is that it does not specify how the personality, mood and
emotion of the character can be considered.
The system can be integrated into techniques which take the personality, mood and
emotion of the agent into consideration and decide what the expression is and what its
intensity is. This data can then be used by the MIRA Lab’s system to manipulate the
needed FAPs and produce the required expression.
The BALDI (Section 4.4), which is designed by the Center for Spoken Language
Understanding (CSLU), is used to train deaf children. From the description of the system
and the available demonstration, we feel that the tool can be used for applications in
which lip synchronization is most needed. We observe that the tool represents only the
basic emotions and does not show any complicated emotions or expressions. Hence, we
feel that the system is not well suited for the animation of emotional agents.
Possible integration of techniques
From our evaluation and analysis of the tools, it is clear that different tools were designed
with different purposes in mind. We believe that some of the tools can be integrated
together to form a meta tool. This meta tool can then be used to generate the animation of
an emotional agent.
Based on our research, we feel that MIRA Lab’s body animation system (Section 2.2),
the BEAT architecture (Section 3.2), and MIRA Lab’s facial animation system (Section
4.3) can be tied together to produce plausible animations.
The library of the existing body templates in the MIRA Lab’s body animation tool can be
organized by using logic similar to a hash table. For example, the templates of all
medium height, medium build, black-haired males can be grouped together. A table
containing all such groups can be maintained. When the animator needs to generate an
animation, he/she can choose the required group and look for the required template. This
template can be modified if needed and stored as a body model, for example, Bob’s body
model.
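A minimal sketch of such a grouped template library is given below; the attribute names and template contents are illustrative assumptions.

```python
from collections import defaultdict

class TemplateLibrary:
    """Group body templates by coarse attributes so the animator can jump
    straight to the relevant group, as suggested above."""
    def __init__(self):
        self._groups = defaultdict(list)

    def add(self, height, build, hair, template):
        self._groups[(height, build, hair)].append(template)

    def lookup(self, height, build, hair):
        return self._groups.get((height, build, hair), [])

library = TemplateLibrary()
library.add("medium", "medium", "black", {"name": "template_17", "vector": [...]})
chosen = library.lookup("medium", "medium", "black")[0]
bob_model = dict(chosen, name="Bob")     # copy and tweak the template as needed
print(bob_model["name"])
```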
Input to the animation can be taken in the form of speech and the appropriate gesture
command can be generated using the BEAT architecture. A key frame based
interpolation technique can be used to reflect the generated gesture on the selected body
model, for example Bob’s body model.
Motion blending techniques can be used to make the resulting animation look seamless.
These techniques, as mentioned in Section 3.3, use multiple editing algorithms and multi-target interpolation to produce a plausible animation. Details about the motion blending
algorithms can be found in [Kovar, 2003].
The input speech can be used by the MIRA Lab’s facial animation system to produce the
appropriate facial expressions and lip synchronization. At every stage in the animation,
the user can be provided some kind of control to change a gesture or expression.
Challenges in integration
A major challenge in integrating various different techniques to produce a believable
animation is to maintain synchronization between the generated gestures, facial
expressions, lip synchronization and speech.
Coordination between the verbal and non-verbal channels is necessary to produce a
plausible animation. In other words, it is important that speech, gaze, head movements,
expressions, lip synchronization, and gestures work together. Even if each of them work
really well independently, the animation is plausible only when all the features blend
together appropriately. For example, when the speaker wants to emphasize something,
this is done with a strong voice, eyes turned towards the listener, and appropriate hand
movements all working together. The animation is not plausible if even one of these
behaviors happens a couple of seconds too late. Hence, it can be said that synchrony is essential to
have a believable conversation. When it is destroyed, satisfaction and trust in the
conversation diminishes since the agent might appear clumsy or awkward [Gratch, 2002].
A technique called the motion graph has been proposed for synthesizing synchronized and
plausible animations [Kovar, 2002]. A database of animation clips is maintained, for
example, a clip in which the agent smiles, a clip in which the agent waves, and so on.
The motion graph is implemented as a directed graph in which the edges
represent clips of animation data. Nodes act as the points where the small pieces of motion
data join seamlessly.
The motion graphs convert the problem of synthesizing an animation into the process of
selecting sequences of nodes. The motion graph takes a database of clips as input. The
edges correspond to the clips of motion and the nodes act as decision points where it is
determined which clip is the successor to the current clip. Transitions between clips are
generated such that they can seamlessly connect the two clips. This is achieved by means of
a special algorithm which can be found in [Kovar, 2002].
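The following toy sketch illustrates the idea of synthesizing motion as a walk over such a graph. The clip names, the graph itself, and the random choice of successors are our own illustrative simplifications; the actual selection in [Kovar, 2002] is driven by a search over the graph.

```python
import random

# A toy motion graph: nodes are transition poses, edges are short clips that can
# be concatenated seamlessly. Clip names are purely illustrative.
MOTION_GRAPH = {
    "stand": [("wave", "stand"), ("smile", "stand"), ("step_forward", "walk")],
    "walk":  [("step_forward", "walk"), ("stop", "stand")],
}

def synthesize(start_node, n_clips, rng=None):
    """Synthesize an animation as a walk in the graph: at every node pick one of
    the outgoing clips, which is where user input or a search would normally
    guide the choice."""
    rng = rng or random.Random(0)
    node, clips = start_node, []
    for _ in range(n_clips):
        clip, node = rng.choice(MOTION_GRAPH[node])
        clips.append(clip)
    return clips

print(synthesize("stand", 6))
```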
The problem with the technique is that it is quite time consuming. If the database contains
F frames, it is estimated that finding the next frame requires O(F²) time. User involvement is
needed when more than one frame is recognized as a possible successor of the current
frame. On average, it is found that the time needed to produce a plausible animation is
equal to the length of the animation, together with at least 5 minutes of user time. Hence, the
applicability of the technique in real time is questionable.
We feel that in applications where the plausibility of the animation is more important than
the time consumed to generate it, the theory of motion graphs is really helpful.
Conclusions
We have identified the basic requirements for a tool or theory for the graphical
representation of emotional agents. To generate a lifelike agent, it is important to have a
plausible body animation technique, gesture animation technique, facial animation and lip
synchronization technique. We have analyzed many current theories in each of these
fields and suggested possible improvements.
We found that the MIRALab’s body animation tool can be used to produce plausible
body animations. Similarly, if the suggested improvements are made, the BEAT
architecture can evolve to be a good tool for gesture animation. Believable facial
animation and lip synchronization can be produced using the MIRALab’s technique.
We propose that three of the existing tools, MIRA Lab’s body animation system, the
BEAT architecture and MIRA Lab’s facial animation system be integrated to produce a
plausible animation. Synchronization is identified as a possible difficulty in the
integration.
We describe a technique called motion graph which aims at producing synchronized
animations. The major disadvantage with this is that it is very time consuming. We
conclude by saying that in applications where plausibility of the animation is more
important than the time taken to generate the animation, the theory of motion graphs is
really helpful.
References
[Agentry] http://www.agentry.net/ active as on October 5, 2003
[Badler, 1995] Badler, N. I., “Planning and Parallel Transition Networks: Animation's
New Frontiers”, Pacific Graphics '95
[Baldi] http://www.distance-educator.com/dnews/Article3208.phtml active as of October
1, 2003.
[Bone] http://www.ucc.ie/fcis/DHBNFbone.htm active as on October 6, 2003
[Byun, 2002] Byun, M. and Badler, N. I., “FacEMOTE: qualitative parametric modifiers
for facial animations”, July 2002.Proceedings of the 2002 ACM SIGGRAPH/
Eurographics symposium on Computer animation
[Cad] http://www.cadtutor.net/dd/bryce/anim/anim.html active as of September 29, 2003.
[Cassell, 2001] Cassell, J., Vilhjamsson, H., and Bickmore, T., “BEAT: the Behaviour
Expression Animation Toolkit”, Proceedings of SIGGRAPH 2001, pp. 477-486.
[Chandra, 1997] Chandra, A., “A computational Architecture to Model Human
Emotions”. Proceedings of the 1997 IASTED International Conference on Intelligent
Information Systems. (IIS ’97) IEEE.
[Chi, 2000] Chi, D., Costa, M., Zhao, L., and Badler, N. I., “The EMOTE model for
Effort and Shape”, In Proceedings of ACM SIGGRAPH 2000, ACM Press / ACM
SIGGRAPH, Computer Graphics Proceedings, Annual Conference Series, ACM, 173-182
[CVLab] http://cvlab.epfl.ch/index.html active as on October 5, 2003
[Descamps, 2001] Descamps, S. and Ishizuka, M., “Bringing Affective Behavior to
Presentation Agents”. Proceedings of the 21st International Conference on Distributed
Computing Systems Workshops (ICDCSW ’01). 2001.
[Dictionary] www.dictionary.com (15th September, 2003)
[DI-Guy] http://www.bdi.com/content/sec.php?section=diguy active as on October 5,
2003
[EDS] www.eds.com. Active as on October 5, 2003
[Egges, 2003] Egges, A., Zhang, X., Kshirsagar, S., and Thalmann, N. M., “Emotional
Communication with Virtual Humans", Multimedia Modeling, Taiwan, 2003
[Face] http://www.mrl.nyu.edu/projects/improv/ , 2002; active as on October 5, 2003
[FACS] http://www-2.cs.cmu.edu/afs/cs/project/face/www/facs.htm 2002; active as on
September 29, 2003.
[Garchery, 2001] Garchery, S. and Magnenat N. T., "Designing MPEG-4 Facial
Animation Tables for Web Applications", Multimedia Modeling 2001, Amsterdam, pp
39-59., May, 2001
[Gleicher, 2001] Gleicher, M., 2001. “Comparing constraint-based motion editing
methods”. Graphical Models 63(2), pp. 107-134, 2001.
[Goto, 2001] Goto T., Kshirsagar, S., and Magnenat-Thalmann, N., “Real Time Facial
Feature Tracking and Speech Acquisition for Cloned Head”, IEEE Signal Processing
Magazine, Special Issue on Immersive Interactive Technologies, 2001.
[Gratch, 2002] Gratch, J., Rickel, J., Andre, E., Badler, N., Cassell, J., and Petajan, E.,
"Creating Interactive Virtual Humans: Some Assembly Required," in IEEE Intelligent
Systems, July/August 2002, pp. 54-63.
[H-Anim] www.h-anim.org ; 1999, active as on October 5, 2003
[Hirose, 1999] Hirose, M., Ogi, T., Ishiwata, S., and Yamada, T., “Development and
evaluation of immersive multiscreen display ‘CABIN’ systems and computers in Japan,”
Scripta Technica, vol. 30, no. 1, pp. 13-22, 1999.
[Immer] http://www.ejeisa.com/nectar/fluids/bulletin/16.htm 1997; active as on October
6, 2003.
[Iso] http://www.ks.uiuc.edu/Research/vmd/vmd-1.7.1/ug/node70.html 2001; active as on
September 29, 2003.
[ISO, 1997] ISO/IEC 14496-2, Coding of Audio-Visual Objects: Visual (MPEG-4
video), Committee Draft, October 1997.
[Jack] www.plmsolutions-eds.com/products/efactory/jack 2001;active as on October 5,
2003.
[Jean] http://jean-luc.ncsa.uiuc.edu/Glossary/I/Isosurface/ active as on September 29,
2003.
[Kovar, 2002] Kovar, L., Gleicher, M., and Pighin, F., "Motion Graphs".
Proceedings of ACM SIGGRAPH 2002.
[Kovar, 2003] Kovar, L. and Gleicher, M., “Flexible automatic motion blending with
registration curves”. Proceedings of the 2003 ACM SIGGRAPH/Eurographics
Symposium on Computer Animation. Pages 214 – 224.
[Kshirsagar, 2001] Kshirsagar, S., Garchery, S. and Magnenat-Thalmann, N., “Feature
Point Based Mesh Deformation Applied to MPEG-4 Facial Animation”. Deformable
Avatars, Kluwer Academic Press, 2001, pp 24-34.
[Leung, 2001] Leung W. H. and Chen T., "Immersive Interactive Technologies towards
a Multi-User 3D Virtual Environment", IEEE Signal Processing Magazine, May 2001.
[Lewis, 2000] Lewis, J., Cordner, M. and Fong, N., “Pose space deformations: A unified
approach to shape interpolation and skeleton-driven deformation”. ACM SIGGRAPH,
July 2000, pp. 165-172.
[Magnenat, 2003] Magnenat-Thalmann N., Seo H. and Cordier F.," Automatic Modeling
of Virtual Humans and Body clothing", Proc. 3-D Digital Imaging and Modeling, IEEE
Computer Society Press, October, 2003
[Maldonado, 1998] Maldonado H., Picard A., Doyle P., and Hayes-Roth B.. “Tigrito: A
Multi-Mode Interactive Improvisational Agent”. In: Proceedings of the 1998
International Conference on Intelligent User Interfaces, San Francisco, CA, 1998, pp.
29--32.
[Massaro, 1998a] Massaro, D.W. “Perceiving Talking Faces: From Speech Perception to
a Behavioral Principle”. Cambridge, MA: MIT Press. 1998
[Massaro, 1998] Massaro, D.W. and Stork, D.G. (1998). “Speech recognition and sensory
integration”. American Scientist, 86, 236-244.
[Moffat, 1997] Moffat D., “Personality Parameters and Programs,” Creating
Personalities for Synthetic Actors, Springer Verlag, New York, 1997, pp. 120–165.
[MPEG-4] MPEG-4 SNHC. “Information technology-generic coding of audio-visual
objects part 2: Visual”, ISO/IEC 14996-2, Final draft of international standard, ISO/IEC
JTCI/SC29/WG11 N2501. 1998
[MS Agent] http://www.microsoft.com/msagent/ active as on October 5, 2003.
[MVL] The MVL research center
http://green.iml.u-tokyo.ac.jp/tetsu/PPT/ICIP99/sld008.htm active as on September 28,
2003.
[NetICE] http://amp.ece.cmu.edu/projects/NetICE active as on October 5, 2003
[Noma, 2000] Noma T., Zhao L., and Badler N., “Design of a Virtual Human Presenter”,
IEEE Journal of Computer Graphics and Applications, 20(4):79-85, July/August, 2000,
pp. 79-85
[Outer] www.outerworlds.com active as on October 6, 2003
[Oz, 1997] http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/oz/web/papers/CMU-CS97-156.html active as on September 15, 2003
[Pina, 2002] Pina, A., Serón F. J., and Gutiérrez D., “The ALVW system: an interface
for smart behavior-based 3D Computer Animation”. ACM International Conference
Proceeding Series Proceedings of the 2nd international symposium on Smart graphics
Hawthorne, New York Pages: 17 - 20.Year of Publication: 2002
[Rousseau, 1998] Rousseau, D. and Hayes-Roth, B., “A social-psychological model for
synthetic actors”, in Proceedings 2nd International Conference on Autonomous Agents
(Agents’98), pp. 165-172.
[Salem, 2000] Salem, B. and Earle, N., “Designing a Non-Verbal language for
Expressive Avatars”. Proceedings of the third international conference on Collaborative
virtual environments San Francisco, California, United States Pages: 93 - 101 Year of
Publication: 2000
[Scheepers, 1997] Scheepers, F., Parent, R. E., Carlson, W. E. and May, S. F.,
“Anatomy-based modeling of the human musculature”, Proceedings SIGGRAPH ‘97,
pp.163 - 172, 1997.
[Seo, 2003a] Seo, H., Magnenat-Thalmann, N.,"An Automatic Modeling of Human
Bodies from Sizing Parameters", ACM SIGGRAPH 2003 Symposium on Interactive 3D
Graphics, pp19-26, pp234, 2003
[Seo, 2003b] Seo, H., Cordier, F., Magnenat-Thalmann, N.,"Synthesizing Animatable
Body Models with Parameterized Shape Modifications", ACM SIGGRAPH/
Eurographics Symposium on Computer Animation, July, 2003.
[Sloan, 2001] Sloan, P., Rose, C., and Cohen, M. 2001 “Shape by Example”. Symposium
on Interactive 3D Graphics, March, 2001.
[Tamagawa, 2001] Tamagawa, K., Yamada, T., Ogi, T., and Hirose, M., “Development
of 2.5D Video Avatar for Immersive communication”, IEEE Signal Processing
Magazine, Special Issue on Immersive Interactive Technologies, 2001.
[Tolani, 2000] Tolani, D., Goswami, A., and Badler, N., “Real-time inverse
kinematics techniques for anthropomorphic limbs”. Graphical Models 62 (5), pp.
353-388.
[ToolKit] http://cslu.cse.ogi.edu/toolkit/docs/users.html active as of October 1, 2003.
[Tosa, 1996] Tosa, N., and Nakatsu R., “Life-Like Communication Agent – Emotion
Sensing Character “MIC” and Feeling Session Character “MUSE”. Proceedings of the
1996 International Conference on Multimedia Computing and Systems (ICMCS ’96)
IEEE.
[Tosa, 2000]Tosa, N. and Nakatsu, R., “Interactive Art for Zen: 'Unconscious Flow”,
International Conference on Information Visualisation (IV2000). July 19 - 21,
2000. London, England, p. 535
[Tsukahara, 2001]Tsukahara, W. and Ward, N., “Responding to Subtle, Fleeting Changes
in the User's Internal State (Make Corrections)”. Proceedings of the SIGCHI conference
on Human factors in computing systems 2001, Seattle, Washington, United States
2001
[Web3D] www.web3d.org active as on October 5, 2003
[Wilhelms, 1997] Wilhelms, J. and Van-Gelder, A., “Anatomically Based Modeling”,
Proceedings SIGGRAPH ‘97, pp. 173 - 180, 1997.
[Wu, 2001] Wu, Y. and Huang, T. S., “Human Hand Modeling, Analysis and Animation
in the Context of Human Computer Interaction”, IEEE Signal Processing Magazine,
Special Issue on Immersive InteractiveTechnologies, 2001.
[Yamada, 1999] Yamada, T., Hirose, M., Ogi, T., and Tamagawa, K., “Development of
Stereo Video Avatar in Networked Immersive Projection Environment”, Proceedings of
the 1999 International Conference on Image Processing (ICIP '99), Kobe, Japan,
October 24-28, 1999. IEEE Computer Society, 1999, ISBN 0-7803-5467-2, Volume III.