GRAPHICAL REPRESENTATIONS OF EMOTIONAL AGENTS
by Srividya Dantuluri

Chapter 1 Introduction

The purpose of this report is to perform a literature search in the field of graphical representation of believable agents and to identify the basic requirements for a tool or a theory in this area. We will identify the key issues that must be addressed in creating a credible virtual being and analyze the current research with respect to these issues.

1.1 Why do we need believable agents?

Extensive research has shown that users interpret the actions of computer systems using the same conventions and social rules used to interpret the actions of humans [Oz, 1997]. This is more pronounced in the case of anthropomorphic software interface agents. Software agents are expected to be believable. This does not imply that they should always speak the truth or be reliable. Rather, the user, when interacting with a believable agent, should feel that he/she is dealing with a life-like character instead of a lifeless computer. If humans sympathize with an agent and accept it as human-like, they will be able to communicate with it better. Agents that are not believable are viewed as lifeless machines [Chandra, 1997].

Research in the implementation of believable agents is not intended to trick the user into believing that he/she is communicating with a human. Rather, it has a more benign purpose. As designers start building agent systems with emotions, they will need techniques for communicating these emotions to the user. This is precisely why research in believable agents is needed. A more remote, yet cogent, reason for pursuing this research is to build companions like Data on Star Trek, which has been termed the AI dream [Oz, 1997]. In the words of Woody Bledsoe, a former president of AAAI, "Twenty-five years ago I had a dream, a daydream, if you will. A dream shared with many of you. I dreamed of a special kind of computer, which had eyes and ears and arms and legs, in addition to its "brain" ... my dream was filled with the wild excitement of seeing a machine act like a human being, at least in many ways." Research in believable agents deals directly with building complete agents with personality and emotion, and thus provides a new prospect for pursuing the AI dream [Oz, 1997].

The current literature also terms believable agents "virtual humans". Additionally, animated life-like characters are called avatars [Leung, 2001]. The word "avatar" originates from the Hindu religion. It means "an incarnation of a Hindu deity, in human or animal form, an embodiment of a quality or concept, or a temporary manifestation of a continuing entity" [Dictionary]. For the purpose of this report, an avatar can be assumed to be a vivid representation of a user in a virtual environment. The term "avatar" is also used to denote the graphical representation of a software agent in a Collaborative Virtual Environment (CVE) [Salem, 2000]. A CVE is a multi-user virtual space in which users are represented by 3D images. The terms believable agents, animated life-like agents, virtual humans, and avatars are used interchangeably in this report.

1.2 What are the requirements for believable animation?

The Oz group at Carnegie Mellon University's School of Computer Science identified the following requirements for believability [Oz, 1997]:

Personality: Personality is the attribute that distinguishes one character from another. It includes everything unique and specific about the character, from the way they talk to the way they think.
Emotions: Emotion can be defined as a mental state that occurs involuntarily based on the current situation [Dictionary]. The range of emotions exhibited by a character is personality-specific: given a situation, characters with different personalities react with different emotions. Personality and emotion are closely related [Gratch, 2002]. [Moffat, 1997] says that personality remains stable over an extended period of time, whereas emotions are short term. Furthermore, while emotions focus on particular situations, events, or objects, the elements determining personality are more extended and indirect.

Apart from personality and emotions, mood is also an important attribute that has to be considered when working with emotional agents. Mood and emotion differ in two dimensions: duration and intensity. Emotions are short-lived and intense, whereas moods last longer and have a lower intensity [Descamps, 2001].

Building a virtual human involves joining traditional artificial intelligence with computer graphics and social science [Gratch, 2002]. Synthesizing a human-like body that can be controlled in real time draws on computer graphics and animation. Once the virtual human starts looking like a human, people expect it to behave like one too. To be believable, an intelligent human-like agent needs to possess personality and needs to display mood and emotion [Chandra, 1997]. Thus, research in the field of building a believable agent or a virtual human must rely heavily on psychology and communication theory to adequately convey nonverbal behavior, emotion and personality.

The key to realistic animation starts with creating believable avatars [Pina, 2002]. The expressiveness of an avatar is considered crucial for its effective communication capabilities [Salem, 2000]. The advantage of creating an embodiment for an agent (an avatar) is to make it anthropomorphic and to provide a more natural method of interacting with it. Gratch [Gratch, 2002] identifies the key issues that must be addressed in the creation of virtual humans as face-to-face conversation, emotions and personality, and human figure animation. The avatar can be animated to create body movements, hand gestures, facial expressions and lip synchronization. [Noma, 2000] specifies that the animation of a virtual human should possess the following:

Natural motion: In order to be plausible, the virtual human's motion should look as natural as possible. The virtual human must have a body language that is human-like.

Speech synchronization: The body motion, in particular the lip movement of the virtual human, should be in synchronization with the speech.

Proper controls: The user should be able to control the agent, and changes should be allowed if needed. The user should be able to represent the basic emotions and use them to compose combinations of emotions.

Widespread system applicability: The tools for developing applications with animated life-like agents should be easy to integrate into current animation or interface systems.

It has been recognized that the non-verbal aspect of communication plays an important role in the daily life of humans [Tosa, 1996]. Human face-to-face conversation involves sending and receiving information through both verbal and non-verbal channels. Having a human-like agent makes it easier to understand the aim of a conversation and provides an opportunity to exploit some of the advantages of Non-Verbal Communication (NVC) like facial expressions and gestures [Salem, 2000].
Body animation, gestures, facial expressions, and lip synchronization are all very important for Non-Verbal Communication. The face exhibits emotions while the body demonstrates mood [Salem, 2000]. It is true that in a few applications only the head and shoulders of a person may fill the screen. This does not imply that facial animation is more important than body animation. In applications where avatars are at a relatively large distance from each other, facial expressions can be too subtle and easily missed; in such situations, gestures are more important. Therefore, an animation technique which provides both facial and body animation is deemed necessary. Body gestures, facial expressions and acoustic realization act as efficient vehicles to convey emotions [Gratch, 2002]. Animation techniques are required to encompass body gestures, locomotion, hand movements, body pose, faces, eyes, speech, and other physiological necessities like breathing, blinking, and perspiring [Gratch, 2002].

Additionally, [Gratch, 2002] designates the following requirements for the control architecture of believable agents.

Conversational support: Initiating a conversation, giving up the floor, and acknowledging the other person are all important features of human face-to-face communication. The architecture used to build virtual humans should provide support for such actions. For example, looking repeatedly at the other person might be used as a way of giving the other person a chance to speak or waiting for the other person to speak. A quick nod as the speaker finishes a sentence acts as an acknowledgement.

Seamless transition: A few behaviors, like gestures, require the virtual human to reconfigure its limbs from time to time. It helps if the architecture allows the transition from one posture to the other to be smooth and seamless.

In summary, it is generally agreed that techniques for building animated life-like agents are expected to synthesize virtual humans that depict plausible body animation, gestures, facial animation, and lip synchronization. Based on the above requirements, we have compiled several interesting questions in order to analyze the existing research in the graphical representation of emotional agents. They are as follows:

How does the technique arrive at a set of useful gestures or expressions?
Can gestures and expressions be triggered in a more natural way than selection from a tool bar?
How is it ensured that all the gestures and expressions are synchronized?
Can the user control the gestures and expressions?
Can the technique handle multiple emotions simultaneously?
Does the system represent the mood of the agent?
Does the technique provide both facial animation and body animation?
Does the technique provide conversational support (gestures or expressions that indicate the desire to initiate a conversation, to give up the floor, and to acknowledge the speaker)?
How does the technique decide the mapping between emotion or mood and graphics?
Is the technique extendable?
How many emotions does the technique represent? Is it possible to form complicated emotions using a combination of the represented emotions?
Is the animation seamless?
Is the technique evaluated? If yes, do professionals or users do the evaluation, is it limited or extensive, and is it scientific or not?
Is there a working model or demonstration available? If yes, does it comply with all the claims made?
As mentioned above, body animation, gestures, facial animation and lip synchronization are the important aspects in the animation of believable agents. This report will treat each of them independently in Chapters 2, 3 and 4 respectively. The additional requirements, if any, for each aspect will be identified and the current literature will be analyzed. A few existing tools for the creation of animated agents, like the Microsoft Agent, the NetICE project, Ken Perlin's responsive face, DI-Guy, and the Jack animation system, will be evaluated based on the above-mentioned criteria and a few additional criteria like license agreements and cost (Chapter 5). We will then consider whether any of the existing techniques can be integrated to provide a complete and plausible animation, and the possible difficulties in such an integration (Chapter 6).

This report seeks to explain the key features of graphical representations. We will compare and contrast various graphical representations of agents as reported in the current literature. In addition to categorizing the various agents, we will offer some explanation of why various researchers have chosen to include or omit important features and what we see as future trends.

Chapter 2 Body Animation

Animating the human body demands more than just controlling a skeleton. A plausible body animation needs to incorporate intelligent movement strategies and soft muscle-based body surfaces that can change shape when joints move or when an external force is applied to them [Gratch, 2002]. The movement strategies include solid foot contact, proper reach, grasp, and plausible interactions with the agent's own body and the objects in the environment. The challenge in body animation is to build a life-like animation of the human body that has sufficient detail to make both obvious and subtle movements believable. For a realistic body animation, maintaining an accurate geometric surface throughout the simulation is also necessary [Magnenat, 2003]. This means that the shape of the body should not change when viewed from a different angle or when the agent starts moving.

The existing human body modeling techniques can be classified as creative, reconstructive, and interpolated [Seo, 2003b; Magnenat, 2003; Seo, 2003a]. The creative modeling techniques use multiple layers to mimic the individual muscles, bones and tissues of the human body. The muscles and bones are modeled as triangle meshes and ellipsoids [Scheepers, 1997]. Muscles are designed in such a way that they change shape when the joints move. The skin is generated by filtering and extracting a polygonal "isosurface" [Wilhelms, 1997]. An isosurface is defined as "a surface in 3D space, along which some function is constant" [Jean]. The isosurface representation takes a set of inputs and draws a 3D surface corresponding to points with a single scalar value [Iso]. Put more simply, the isosurface is a 3D surface whose vertices can be coupled with vertices on any other surface, so that when a particular point on the latter moves, the corresponding point on the isosurface is displaced in the same direction by the same amount. In creative modeling techniques, the vertices on the isosurface are coupled with the underlying muscles, which makes the skin motion consistent with the muscle motion and the joint motion (a minimal sketch of this kind of coupling follows). The creative modeling techniques were popular in the late 1990s.
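As an illustration only, the vertex coupling described above can be sketched as a one-to-one binding between skin (isosurface) vertices and muscle-surface vertices, so that any muscle displacement is propagated directly to the skin. The data layout and function name below are hypothetical; real layered models weight each skin vertex against several muscles and bones.

import numpy as np

def displace_skin(skin_vertices, muscle_vertices, new_muscle_vertices, coupling):
    """Propagate muscle-surface displacements to the coupled skin (isosurface) vertices.

    skin_vertices, muscle_vertices, new_muscle_vertices: (N, 3) arrays of 3D points.
    coupling: coupling[i] is the index of the muscle vertex bound to skin vertex i.
    """
    displacement = new_muscle_vertices - muscle_vertices      # per-muscle-vertex motion
    return skin_vertices + displacement[coupling]             # move each skin vertex by the same amount

# Toy example: two skin vertices, each bound to one of two muscle vertices.
skin = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
muscle = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
flexed = np.array([[0.0, 0.2, 0.0], [1.0, 0.1, 0.0]])         # muscle bulges as a joint moves
print(displace_skin(skin, muscle, flexed, coupling=np.array([0, 1])))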
Although the generated simulation looks real, it involves substantial user involvement in building the human models, resulting in slow production time. Modern systems prefer reconstructive or interpolated models because of these drawbacks in creative models.

The reconstructive approach aims at building the animation using motion capture techniques [Gratch, 2002]. The captured image can be modified using additional techniques. Research [Gleicher, 2001; Tolani, 2000; Lewis, 2000; Sloan, 2001] has shown that using motion capture techniques to produce plausible animations is a strenuous task, since it is difficult to maintain "environmental constraints" like proper foot contact, grasp and interaction with other objects in the environment [Gratch, 2002]. Since motion capture deals with the modification of an existing image, it is quite a challenge to make the animation look plausible. Chances are that the animation will look as if a separate image was pasted into the already existing environment; in other words, it might look as if the animation is not a part of the environment. Also, it is difficult to modify the generated images to produce the different body shapes that the user intends. Thus, the user has little control over the animation [Magnenat, 2003].

Interpolation modeling uses an existing set of example models to build new models. The modifications to the existing set can be done using procedural code, which allows programmatic control over the generated image. Procedural approaches provide kinematic and dynamic techniques to parameterize the location of objects and the types of movements to produce a believable motion [Gratch, 2002]. In general, kinematics is used for goal-directed and controlled actions, and dynamics is used for applications that require a response to forces or impacts. Kinematics and dynamics differ in their applicability. Human animation, in general, might require both approaches, but in the case of emotional agents, kinematics seems to be more useful.

Realistic body animation is not easy to produce since many body motions result from synchronous movements of several joints. One option for producing a plausible body animation is to attach motion sensors to the user in order to determine the user's posture and movements. Zen [Tosa, 2000] uses this approach. The use of motion sensors and other techniques to handle complex body motions will be discussed in detail in the "Gestures" chapter (Chapter 3) of the report. The remainder of this chapter deals with two of the techniques available for body animation in the current literature, the 2.5D Video Avatar and the MIRALab animation. Each of these techniques is analyzed based on the applicable criteria identified in the Introduction (Chapter 1) of the report. Additionally, we specify whether the technique or theory uses motion capture or the procedural approach. As mentioned above, research claims that it is not easy to generate a plausible animation using motion capture. Hence, if motion capture is used, we will examine how the generated animation handles this challenge. Table 1 gives an overview of the two systems; the remainder of the chapter describes each of the techniques in detail.

Criteria | 2.5D Video Avatar | MIRALab's system
Animation method | Reconstructive (motion capture) | Interpolation
Extensibility | Limited | Extensible
Control given to the user | Very limited | User can control the animation
Real-time animation | Possible | Possible
Evaluation | Scientific | Scientific
Scope of evaluation | Limited | Decent
Demo | Pictures of the animation are available | Pictures of the animation are available
Is the animation plausible? | Not quite | Reasonably plausible

Table 1: Body animation techniques

2.1 The 2.5D Video Avatar

The 2.5D Video Avatar, which uses motion capture, is part of the research done by the MVL (Multimedia Virtual Laboratory) Research Center founded at the University of Tokyo and a research center called the Gifu Technoplaza. The 2.5D Video Avatar falls between the 2D Video Avatar and the 3D Video Avatar. [MVL] states that a 2D video avatar is represented using a two-dimensional image and does not carry three-dimensional information. A 3D video avatar cannot be generated in real time, because the time required to generate a single picture is on the order of 5 seconds [Yamada, 1999]. The 2.5D avatar models only the user's surface, which can be generated in real time since it takes only 0.9 seconds to produce the avatar. The image generated by using this technique is transported into a system known as the Computer Augmented Booth for Image Navigation (CABIN) [Hirose, 1999]. CABIN, developed by the Intelligent Modeling Laboratory (IML) at the University of Tokyo, uses immersive projection technology to build a virtual world. Immersive projection technology is a technique used to construct virtual worlds; it is used to build multi-person, high-resolution, 3D graphics video and audio environments [Immer]. The user is fully immersed in the environment using special stereoscopic glasses that help him/her see 3D images of objects floating in space. The animation developed by MVL is designed to work with other environments like the CAVE at the University of Illinois, CoCABIN at Tsukuba University, UNIVERS at the Communication Research Laboratory of the Ministry of Posts and Telecommunications, and COSMOS in Gifu prefecture [Yamada, 1999]. All of the above-mentioned environments use immersive projection technology. It is not indicated whether the 2.5D avatar can be customized and transferred to other virtual environments which do not use immersive projection technology. Hence, we assume that the technique has limited applicability.

The 2.5D video avatar method uses depth information from stereo cameras to capture the subject's image, and the image is then superimposed on the virtual environment [Yamada, 1999]. The use of depth information makes the image look more realistic than a traditional 2D image; however, it does not provide the effect of a full 3D image. A Triclops (Point Grey Research Inc.) stereo camera system that has three lenses and uses two baselines (horizontal and vertical) is used for video capture. The subject is photographed from three angles: 0 degrees, 5 degrees and 15 degrees. The captured video images are sent to a PC (Pentium II, 450 MHz), which rectifies the distortions and calculates the depth map. The depth information is computed by determining the corresponding pixels between the images captured by the stereo cameras along an epipolar line. Applying a triangulation algorithm to these pixel correspondences produces the depth map. The triangulation method takes the images and calculates, from each pair of corresponding pixels, the position of the imaged point along the three coordinate axes x, y, and z; [Yamada, 1999] explains the triangulation theorem in more detail (a simplified two-camera version is sketched below). The rectified color and depth images produced by the PC are used to create a triangular mesh.
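As a rough illustration of the triangulation step (not the actual Triclops algorithm, which uses three lenses and two baselines), the depth of a point seen by two rectified cameras can be recovered from the disparity between its pixel positions along the epipolar line; all numbers below are made up.

def triangulate(u_left, u_right, v, focal_px, baseline_m, cx, cy):
    """Recover a 3D point from one pixel correspondence in a rectified stereo pair.

    u_left, u_right: column of the same scene point in the left/right image (pixels).
    v: row of the point (same in both images after rectification).
    focal_px: focal length in pixels; baseline_m: camera separation in meters.
    cx, cy: principal point (image center) in pixels.
    """
    disparity = u_left - u_right                  # shift along the epipolar line
    z = focal_px * baseline_m / disparity         # depth: larger disparity means a closer point
    x = (u_left - cx) * z / focal_px              # back-project to camera coordinates
    y = (v - cy) * z / focal_px
    return x, y, z

# A point with 20 pixels of disparity, cameras 12 cm apart, 700-pixel focal length:
print(triangulate(u_left=340, u_right=320, v=250, focal_px=700.0, baseline_m=0.12, cx=320, cy=240))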
Mapping each color image onto the triangular mesh as texture data generates the 2.5D video avatar [Tamagawa, 2001]. When the subject interacts with a virtual object, both the image information and the spatial information (such as the subject's positional relationships) are transferred. This makes the simulation look plausible when the subject wishes to point to an object in the environment.

Though the evaluation is performed using scientific methods, it is limited, since the developers have concentrated only on evaluating pointing accuracy. A square lattice with a number of balls positioned 10 cm apart at the grid positions is placed in the shared virtual world. The video avatar is made to point at one of the balls and an observer is asked to orally indicate which ball is pointed at [Tamagawa, 2001]. If the observer selects the wrong ball, the positional error offset is recorded. In order to avoid parallax errors, the observers are encouraged to look at the avatar from various directions by walking around the display space. (When an observer is looking at an object, the actual position of the object can be determined only if the line of vision is at right angles to the plane of the object; the object appears to be at a different place if the line of vision is changed. This is called a parallax error.) Their evaluation shows an error of 7.4 cm on average. We assume that they chose this method of evaluation because an offset of under 10 cm on a pointing gesture can still produce a reasonably plausible animation, unless there are two objects within a range of less than 10 cm. They claim that it takes 0.9 seconds to generate the 2.5D avatar, which shows that the technique could be used to generate real-time animation.

We have identified the following drawbacks in the technique. From the evaluation, it is clear that the simulation looks reasonably plausible when the 2.5D video avatar tries to point at an object in the environment, but the problem of the virtual being trying to hold an object still remains. Another disadvantage is that it is not possible to segment the subject's image from the environment effectively: there has to be a method to recognize where the subject's image ends and where the environment starts, and the technique does not offer any special procedure for this. Also, from the pictures of the simulation provided in [Tamagawa, 2001], it is clear that the superimposed animation does not have proper foot contact in the target environment; it looks as if the image is floating in the environment.

2.2 MIRALab

The MIRALab research group at the University of Geneva uses an interpolation technique and already existing captured body geometry of real people to produce plausible body animations. The system uses existing techniques for human body animation to build a library of body templates. The dimensions of each of the templates are stored in the form of a vector. These dimensions include the following details from the body geometry of the template [Seo, 2003a]:

The vertical distance from the crown of the head to the ground.
The vertical distance from the center of the body to the ground.
A set of individual distances from the shoulder line to the elbow, the elbow to the wrist, and the wrist to the tip of the small finger.
The girth of the neck.
The maximum circumference at the chest, trunk and waist.

When a new animation is needed, the requirement is specified as an input vector with values for each of the dimensions mentioned above.
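Stated as data, each template can be stored as one such measurement vector. The paper does not publish how the closest template is selected (that step is described next), so the sketch below simply assumes a Euclidean nearest-neighbour search over hypothetical measurements in centimeters.

import numpy as np

# Hypothetical template library: name -> (crown height, body-center height, shoulder-elbow,
# elbow-wrist, wrist-fingertip, neck girth, chest, trunk, waist), all in cm.
templates = {
    "template_a": np.array([178, 96, 33, 26, 19, 38, 98, 90, 84]),
    "template_b": np.array([165, 88, 30, 24, 18, 34, 90, 82, 72]),
}

def closest_template(requirement):
    """Return the name of the template whose measurement vector is nearest to the requested one."""
    return min(templates, key=lambda name: np.linalg.norm(templates[name] - requirement))

# A request for a 170 cm figure: template_b is the nearer starting point.
request = np.array([170, 91, 31, 25, 18, 35, 92, 84, 76])
print(closest_template(request))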
A closest match is found in the library of templates by comparing the input with the dimensions of each template. This template is then modified to produce the new animation. If the animation has to represent a live person, 3D scanning is used to generate the input: range scanners are used to capture the sizes and shapes of real people [Magnenat, 2003]. The image is then measured and the body geometry is stored as a vector, which acts as the input. Using this data, the most suitable template is found. It is stated that the library of templates is formed using existing animation techniques, but the particular techniques used are not specified.

The interpolation function determines the necessary deformation by blending the input parameters and the existing templates, producing a displacement vector [Seo, 2003a]. However, the implementation of the interpolation function is not made public. The appropriate shape and proportion of the human body are generated using a deformation function which takes the displacement vector as input [Seo, 2003a]. They use a technique called "radial basis interpolation" to generate the deformation functions from 3D scanned data of several human bodies. The advantage of using an existing template to produce a new animation is twofold: it allows a vector representation of the parameters, which is an easy way to describe the shape needed, and the initial skin attachment information can be reused [Seo, 2003b].

Body geometry is assumed to consist of two distinct entities, a rigid component and an elastic component [Seo, 2003a]. The desired physique is produced by manipulating the rigid and elastic entities in the vector. The rigid deformation is used to specify the joint parameters which determine the linear approximation of the physique; in other words, it works by modifying the skeletal view. The elastic deformation, added on top of the rigid deformation, depicts the shape of the body [Seo, 2003a]. The problem of modeling a population of virtual humans is thus reduced to the problem of generating a parameter set [Seo, 2003a]. When the animation has to represent a living human, the parameters are generated by scanning the body. If a fictitious human is to be built, the most suitable template is selected by looking at the body model of each template, and it is modified to generate the required body animation.

Since the user can generate a new model or modify an existing one by inputting a number of sizing parameters, we categorize the system as extensible. If the user is not satisfied with the generated animation, he/she can make minor changes by programmatically modifying the set of parameters using trial and error. Different postures can be generated by applying appropriate transformations to the rigid entries in the vector. Proper foothold, grasp, and reach can be modeled by changing the parameters. Hence, the system offers the user sufficient control over the generated animation. The system also models the clothes worn by the virtual human. Various algorithms segment the garments into pieces depending on whether they stick to the body surface or flow over it [Magnenat, 2003]. The segmented pieces are then coupled with the skin parameters.

Once the animation looks believable, the system provides a mapping function which attempts to estimate the height and weight of the generated animation. This mapping function takes both the rigid and elastic entities into consideration and estimates, respectively, the height and weight of the person represented by the animation.
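The radial basis interpolation itself is not published either. Purely as an illustration of the general technique named in [Seo, 2003a], the sketch below fits a Gaussian radial basis function interpolator that maps body-measurement vectors to displacement vectors, using made-up example data; it is not MIRALab's implementation.

import numpy as np

def fit_rbf(examples, displacements, width=25.0):
    """Solve for RBF weights so that interpolate() reproduces the example displacements exactly."""
    d2 = ((examples[:, None, :] - examples[None, :, :]) ** 2).sum(-1)
    phi = np.exp(-d2 / width**2)                        # Gaussian kernel matrix between examples
    return np.linalg.solve(phi, displacements)          # one weight row per example

def interpolate(query, examples, weights, width=25.0):
    """Blend the example displacements according to the query's distance to each example."""
    d2 = ((examples - query) ** 2).sum(-1)
    return np.exp(-d2 / width**2) @ weights

# Two example bodies (height, chest in cm) with known displacements for one skin vertex.
examples = np.array([[178.0, 98.0], [165.0, 90.0]])
displacements = np.array([[0.0, 1.2, 0.4], [0.0, 0.6, 0.1]])
w = fit_rbf(examples, displacements)
print(interpolate(np.array([170.0, 93.0]), examples, w))    # displacement for an in-between body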
Implementation details of the height and weight mapping function are not made public either. If these estimates do not deviate from the height and weight of the human the animation is trying to represent by more than 0.1%, it is concluded that the animation is plausible enough. This method is used to minimize the error in the representation and aids in generating a realistic body animation. Although this approach is quite interesting, its feasibility is questionable since (for example) the bone weight of different people differs even when they have the same size of bones [Bone]. They claim that it takes less than a second on a 1.0 GHz Pentium 3 to generate the animation after receiving the input parameters from the user, indicating that the technique could be used to generate real-time animation.

The evaluation is scientific and is done by cross-validation using the existing templates. As mentioned earlier, the library of templates is generated using existing body animation techniques which generate plausible animations but are tedious to use [Seo, 2003b]. For the evaluation, each one of the templates is removed from the library, and its parameters are given as an input to the synthesizer. The generated output model is then compared with the input template. If the output matches the input, it is concluded that this technique produces plausible animations and is quite easy to use when compared to the other available techniques. Results of the evaluation show that the difference between the input and output is at most 0.001 cm. Since the deviation is at most 0.001 cm, the performance of the synthesizer can be considered good.

Chapter 3 Gestures

Gestures are a natural part of human-to-human communication. They help to show emotions during communication and enrich the clarity of speech [Leung, 2001]. Gestures are an integral part of human communication and are often used spontaneously and instinctively. Hence, even in virtual humans, gestures must be made to occur in the flow of the animation; they should not be explicitly activated by using menu or button controls [Salem, 2000]. A few gestures have the entire message contained in them [Salem, 2000]; for example, a nod denotes a yes. Other types of gestures, for example a thinking gesture, are used to complement speech [Leung, 2001]. Some other gestures, like the pointing gesture, are context dependent [Yamada, 1999]; for example, a pointing gesture can be used to refer to an object or to a direction of displacement. An animation technique should be able to represent all of these kinds of gestures.

Gestures can be animated by calling predefined functions from a library of gestures and expressions [Salem, 2000]. There are several ways in which such functions can generate the required gesture. A motion capture technique can be used, and the captured image can be modified to reflect the required gesture. Sensors can be connected to the subject's body, and the avatar's limbs and face can be manipulated based on the movement of the subject [Tosa, 2000]. Key frame-based techniques can also be used to generate the gesture functions by "interpolating" pre-determined frames [Leung, 2001]. In key frame animation, the current posture of the avatar is stored as the source key frame and the desired posture is stored as the target key frame. The transition from the source to the target is achieved by generating the in-between frames of the representation. Details about key frame animation are provided in [Cad]. Key frame animation is a popular technique in 3D animated feature films.
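A minimal sketch of this idea, assuming a pose is simply a dictionary of joint angles (the joint names and angles below are invented): the in-between frames are produced by linearly interpolating each joint angle between the source and target key frames.

def interpolate_keyframes(source, target, num_frames):
    """Generate num_frames poses that move linearly from the source to the target key frame."""
    frames = []
    for i in range(1, num_frames + 1):
        t = i / num_frames                                   # blend factor from 0 to 1
        frames.append({joint: (1 - t) * source[joint] + t * target[joint] for joint in source})
    return frames

# Raising the right arm over 5 in-between frames (angles in degrees).
source = {"right_shoulder": 0.0, "right_elbow": 10.0}
target = {"right_shoulder": 90.0, "right_elbow": 45.0}
for pose in interpolate_keyframes(source, target, 5):
    print(pose)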
Using motion capture or sensors to animate gestures requires special equipment and involves a heavy cost [Salem, 2000]. Once the library of gestures is formed, it is possible to determine which function has to be called based on the frequency or tone of the subject's speech, the words used in the speech, and the mood and personality of the subject. For example, if the user raises his voice to emphasize certain words, the avatar can make a suitable gesture to show the emphasis.

This chapter deals with three of the techniques available for gesture animation: a technique used for the design of a Collaborative Virtual Environment (CVE) [Salem, 2000], the BEAT architecture [Cassell, 2001], and a technique used for building a Virtual Human Presenter [Noma, 2000]. Each of these techniques will be analyzed based on the applicable criteria identified in the Introduction (Chapter 1) of the report. We will determine whether motion capture, sensors, or the key frame-based method is used to generate the needed gesture. Additionally, we will discuss whether the technique or theory represents all the types of gestures. Table 2 summarizes each of the techniques. The entries in the table which say "cannot be determined" indicate that not enough information is provided to determine whether the system satisfies that particular criterion. The entries which say "possible" indicate that, though the description of the system does not explicitly say anything about satisfying that criterion, it can be inferred from the available description that the system can possibly satisfy it.

Criteria | CVE | BEAT | Virtual Human Presenter
How are gestures triggered? | Through words in the input text | Through words in the input text | Through words in the input text
Are all kinds of gestures represented? | Cannot be determined | Support for pointing gestures is not explicit | Yes
How is the mapping between words and gestures decided? | Not specified | Set of rules formed from existing research | Set of rules formed from books on gestural vocabulary
Is the system extensible? | Yes | Yes | Yes, but is time consuming
Can the system be controlled by the user? | Yes | Yes | Cannot be determined
Can the system be used in real time? | Cannot be determined | Yes | Yes
Does it include personality and mood? | Yes | Yes | Limited
Can a combination of gestures be generated? | Cannot be determined | Possible | Possible
How are the gestures represented graphically? | Not specified | Possibly using interpolation techniques | Using interpolation techniques
Can they be integrated with the existing body animation techniques? | Cannot be determined | Possible | Possible
Is the transition between gestures smooth? | Cannot be determined | No | Yes
Evaluation | Not done | Scientific and limited | User evaluation and limited
Is there a demonstration available? | No | No | Yes

Table 2: Gesture animation techniques

The remainder of this chapter explains each of the techniques in detail.

3.1 The Collaborative Virtual Environment

The Collaborative Virtual Environment (CVE) was designed as an expansion of the text-based chat room by a research group at the University of Plymouth. Avatars in the CVE communicate by using text and other non-verbal channels of communication. The non-verbal channels of communication involve facial expressions, eye glances, body postures and gestures [Salem, 2000]. Input from the user is taken in the form of text, and the appropriate gestures are generated using the words in the message. The text is scanned to find abbreviations, punctuation, emotion icons and performative words (a minimal sketch of such a scanning pass is given below; the individual keyword categories are then described).
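The full keyword list and the scanning code are not published in [Salem, 2000]; the sketch below uses only the examples quoted in this section, with hypothetical gesture names, to show what a single scanning pass over a chat message might look like.

# Keyword table built only from the examples published in [Salem, 2000].
GESTURES = {
    "lol": "laugh", "cul8r": "wave",
    ":-)": "smile", ":-(": "droop_head_and_shoulders", ":-*": "blow_kiss",
    "yes": "nod_head", "no": "shake_head",
    "?": "questioning_pose", "!": "emphasis_pose",
}

def scan_message(text):
    """Return the gestures an avatar should perform for one chat message."""
    gestures = []
    for token in text.lower().split():
        if token.startswith("*") and token.endswith("*"):      # performative word, e.g. *wave*
            gestures.append(token.strip("*"))
        elif token in GESTURES:                                 # abbreviation, emoticon or keyword
            gestures.append(GESTURES[token])
        elif token[-1] in GESTURES and token[-1] in "?!":       # trailing punctuation
            gestures.append(GESTURES[token[-1]])
    return gestures

print(scan_message("lol yes :-) *wave* really?"))
# ['laugh', 'nod_head', 'smile', 'wave', 'questioning_pose']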
Performative words denote words in the input text which require a physical action to be performed; for example, "wave" is a performative word. Abbreviations such as LOL (Laughed Out Loud), IMHO (In My Humble Opinion), and CUL8R (See You Later) are recognized, and a relevant gesture (laughing, surprise, a neutral pose, or a wave) is invoked for each. Punctuation marks like ? and ! are recognized and are interpreted as questioning and emphasizing a message, respectively. The gesture for questioning is animated as the head slightly thrown back, one eyebrow raised and a hand outstretched. The gesture for emphasizing a message is animated as the head slightly thrown back, eyebrows raised and torso upright. Since the technique is used to extend an already existing text-based chat room, emotion icons, which are very common in any chat environment, are also considered. For example, :-) (smile), :-( (sad/upset), and :-* (kiss) are animated as a smile, head and shoulders drooped, and blowing a kiss, respectively. A few common words like "yes" and "no" are also associated with appropriate gestures, namely nodding the head and shaking the head. Apart from these, phrases which are categorized as performative words are also handled. A performative word is enclosed in two asterisks: for example, a *wave* in the user's text indicates that the user wants to wave, so the avatar is made to wave in the virtual environment.

The system allows the user to customize the mapping of a gesture to a keyword. The user is provided with the ability to assign a gesture to a different keyword than the keyword suggested by the system. Such changes can be saved as a separate file and loaded whenever necessary. This makes the system extensible.

[Gratch, 2002] identifies the importance of conversational support: initiating a conversation, giving up the floor, and acknowledging the other person form an integral part of human face-to-face communication. The system provides gestures for all three intentions. Initiating a conversation is accomplished with a greeting, which is a wave of the hand. Giving up the floor is accomplished by forwarding the arms, offering, and then pulling back. Acknowledging the other person is done with a nod of the head. Intent to leave is expressed by keeping the gaze connected and making a quarter turn of the body.

The information contained in the input text is used to control the movement of the hands, arms and legs. Also, it is claimed that the set of gestures for a particular avatar is generated by taking the personality, mood and other relevant characteristics as an input and then coupling it with a predefined library of generic gestures and expressions. As soon as the mood of the avatar changes, a new set of gestures is generated.

We have identified several points of the system which are difficult to evaluate. The paper [Salem, 2000] gives a few examples for each of the keyword categories (abbreviations, punctuation, emotion icons and performative words), but the entire list of keywords is not made public. Also, it is not clear how the mapping between words and actions is achieved. The gestures (actions) associated with the words mentioned as examples are obvious, but the mapping for complex gestures is not mentioned. It is also indicated that the gestures can be customized, but it is not clear whether the user is allowed to add new gestures. Furthermore, information about how the gestures are represented graphically is missing.
Because the graphical representation of the gestures is not described, it is difficult to determine whether the system can be integrated with any of the existing body animation techniques. Even though many aspects of the theory look promising, the unavailability of a demonstration and the lack of an evaluation make it impossible to determine whether all the claims have been met.

3.2 BEAT

The Behavior Expression Animation Toolkit (BEAT) was developed by the Gesture and Narrative Language Group (GNL) at the MIT Media Lab. The tool takes as input typed text to be spoken by the animated human figure and produces the appropriate nonverbal behavior [Cassell, 2001]. The nonverbal behavior is generated on the basis of "linguistic" and "contextual" analysis of the input text. The linguistic analysis is used to identify the key words in the text, that is, the words that carry the emotion of the speaker when he utters them. For example, in the sentence "I am surprised!" the word "surprised" is a key word. Contextual analysis is used to estimate the context in which the given text is spoken. The nonverbal behavior produced can then be sent to an animation system.

The toolkit automatically suggests appropriate gestures and facial expressions for a given input text. A set of rules formed from the existing research in the field of communication is used to map the text to the appropriate gesture. The system also allows animators to include their own sets of rules for different personalities, in the form of filters and a knowledge base, which are written in XSL (Extensible Stylesheet Language). Filters can be used to reflect the personality and mood of the avatar. Details about the filters and knowledge bases are provided in the subsequent paragraphs.

[Cassell, 2001] describes the technique as follows. The system uses an "input-to-output pipeline" approach and provides support for user-generated filters and knowledge bases. The term input-to-output pipeline means that each stage in the system is sequential: the output from one stage forms the input to the next stage. The system is written in Java and XML, which makes the technique portable. The input text is sent to a "language tagging module", which converts it into tags. These tags are then analyzed and coupled with a generic knowledge base, and a set of behaviors (called "suggested behaviors") is formed. The generic knowledge base provides common gestures that include the beat, which is a vague flick of the hand; the deictic, which is a pointing gesture; the iconic, which depicts some feature of the thing being talked about; and the contrast, which is the contrastive gesture. For example, a tag <surprise> might be mapped in the knowledge base to a gesture which shows raised eyebrows. The knowledge base, in general, is used to store some basic knowledge about the world and to draw inferences from the input text. The kinds of gestures to be used and the places where emphasis is needed are determined from these inferences, which form the set of suggested behaviors. The user-specified knowledge base and personality filters are then used to filter the set of suggested behaviors into the selected behaviors. A selected behavior contains the name of the gesture and the command to represent it graphically. For example, to move the right arm, the generated gesture would be

<GESTURE NAME="MOVE"> <RIGHTARM HANDSHAPE=5/> </GESTURE>

Animators are allowed to design new gestures and include them in the system. This requires a new tag to be added to the knowledge base and a corresponding gesture command mapped to it (a rough sketch of this suggest-and-filter pipeline follows).
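As an illustration of the pipeline shape only (BEAT's real modules operate on XML trees and XSL filters; the rule set and names here are hypothetical), the sketch below suggests behaviors from tagged words, filters them by a personality setting, and emits a gesture command in the style of the example above.

import xml.etree.ElementTree as ET

# Hypothetical suggestion rules: tag -> (gesture name, priority).
SUGGESTION_RULES = {"surprise": ("RAISE_EYEBROWS", 2), "new_object": ("DEICTIC_POINT", 3), "emphasis": ("BEAT", 1)}

def suggest(tags):
    """Language-tagging output (a list of tag names) -> suggested behaviors."""
    return [{"gesture": g, "priority": p} for t in tags if t in SUGGESTION_RULES for g, p in [SUGGESTION_RULES[t]]]

def personality_filter(behaviors, min_priority):
    """A reserved character keeps only high-priority behaviors; an expressive one keeps everything."""
    return [b for b in behaviors if b["priority"] >= min_priority]

def to_command(behavior):
    """Emit the selected behavior as a gesture command element."""
    return ET.tostring(ET.Element("GESTURE", NAME=behavior["gesture"]), encoding="unicode")

suggested = suggest(["surprise", "emphasis", "new_object"])
selected = personality_filter(suggested, min_priority=2)     # drop low-priority beats for a calm persona
print([to_command(b) for b in selected])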
This makes the system extensible. The published description of the system does not explicitly deal with integration details, but it appears that the animator can adapt these gesture commands to work with a chosen body animation technique. He/she could use a key frame-based interpolation approach [Wu, 2001] which takes the body animation and the above gesture command and moves the right arm accordingly. Thus, it can be said that the system can probably be integrated with the existing body animation techniques. Similarly, even though the description of the system does not discuss representing a combination of gestures, it can be inferred that combinations can be generated by appending instructions to the generated tag; for example, the above tag can be modified to represent a tilted head by appending HEAD=30 to RIGHTARM HANDSHAPE=5.

An utterance coupled with a gesture is estimated to be generated in 500-1000 ms, which is calculated to be less than the natural pause in a dialogue [Cassell, 2001]. Hence, the system can be used for real-time animation. From the description, it is not clear whether the theory can represent pointing gestures. Also, no attempt is made to make the transition between gestures smooth. [Cassell, 2001] claims that BEAT was extensively tested and that a demonstration is available at http://www.media.mit.edu/groups/gn/projects/beat. However, the specified link is no longer active as of September 27, 2003. From the verbal description of the evaluation provided, it can be said that the evaluation is scientific, but not very extensive. The system was tested using an input text with two sentences, and pictures of the generated gestures are provided. The generated animation looks to be moderate. The evaluation could be improved by using input texts which require a combination of gestures; for example, it would be interesting to see how the avatar depicts the sentence "I wonder how this works!" From the description of the theory, the literals "wonder", "how" and "!" would be separated and the corresponding gestures surprise, questioning, and emphasis would be generated. When a real human says this sentence, he shows primarily questioning with a combination of surprise and emphasis. If the avatar can achieve the same combination of gestures, it can be said to be plausible.

3.3 The Virtual Human Presenter

The Virtual Human Presenter was developed on the Jack animated-agent system at the Center for Human Modeling and Simulation at the University of Pennsylvania. The system serves as a programming toolkit for the generation of human animations [Noma, 2000]. The system takes the input text, scans it and automatically embeds gesture commands on the basis of the words used. The virtual human is then made to speak the text, and the embedded commands produce the animation in synchronization with the speech. A command starts with a backslash and can be followed by arguments enclosed in parentheses, depending on its type. For example, an input text which says "This system supports gestures like giving and taking, rejecting, and warning" can be modified into "This system supports gestures like \gest_givetake giving and taking, \gest_reject rejecting, and \gest_warn warning." The commands in this example do not take any arguments. Other commands like \point_idxf(), \point_back(), \point_down() and \point_move() represent pointing gestures and take arguments. Additionally, commands like \posture_neutral and \posture_slant are used to specify body orientations (a sketch of how such embedded commands might be parsed out of the text is given below).
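Purely as an illustration of the command format quoted above (the actual parser used in [Noma, 2000] is not published), the sketch below splits such annotated text into the spoken words and the embedded commands using a regular expression.

import re

# A backslash command is a name, optionally followed by an argument list in parentheses.
COMMAND = re.compile(r"\\(\w+)(?:\(([^)]*)\))?")

def parse_annotated_text(text):
    """Split presenter text into the words to be spoken and the embedded gesture commands."""
    commands = COMMAND.findall(text)                     # list of (command name, arguments) pairs
    speech = COMMAND.sub("", text)                       # remove commands, keep the spoken words
    return " ".join(speech.split()), commands

speech, commands = parse_annotated_text(
    r"This system supports gestures like \gest_givetake giving and taking, "
    r"\gest_reject rejecting, and \gest_warn warning. \point_down(screen) Look here."
)
print(speech)     # the sentence with the commands stripped out
print(commands)   # [('gest_givetake', ''), ('gest_reject', ''), ('gest_warn', ''), ('point_down', 'screen')]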
The gestures generated are controlled by means of these commands. A set of Parallel Transition Networks (PaT-Nets) is used to control the avatar. PaT-Nets are parallel state machines which are easy to manipulate; they can monitor the resources used and the current state of the environment, and they can sequence actions [Badler, 1995]. The networks handle all the tasks from parsing the inputs to animating the joints. Smooth transitions between gestures are produced using motion blending techniques. The motion blending algorithm uses several motion editing algorithms and "multi-target interpolation" to produce a smooth animation [Kovar, 2003].

It is claimed that the library of gesture commands includes all the gestures required for presentation, and that the mapping between words and actions is based on published conventions for gestures, presentations and public speaking. The gesture commands are generated by collecting vocabularies from the psychological literature on gestures and from popular books on presentation and public speaking. The mapping between words and gestures is derived from a book about Delsarte [T. Shawn, Every Little Movement - A Book about Delsarte, M. Witmark & Sons, Chicago, 1954]. However, the list of available gestures is not specified. It is not explicitly indicated whether a combination of gestures can be represented. Nevertheless, gesture functions can be parameterized and modified to reflect additional gestures, so it can be inferred that modifying the library of gesture functions makes a combination of gestures possible. Since each gesture command is a call to a function, the code in the function can be changed to reflect what the user requires, but it is not specified whether the animator is given enough privilege to do so. Also, if an avatar with a different personality and mood is to be built, the entire library of commands must be reconstructed. Changing the library each time the personality or the mood changes is a strenuous and time-consuming task; hence, the extensibility of the system is achieved at the cost of time. The animation is produced using key frame-based interpolation. The speed of the animation is claimed to be 30 frames per second; hence, the tool can be used for real-time animation.

The limited evaluation is done by users. Since the tool is used to generate a virtual human presenter, emphasis is given to the quality of the presenter's speech rather than to the gestures generated; only the pointing gestures are examined. A demonstration of the tool is available at http://www.pluto.ai.kyutech.ac.jp/~noma/vpre-e.html as of September 28, 2003. The gestures shown in the demonstration are not very impressive: the movement of the avatar is not human-like, and the gestures demonstrated are mainly pointing gestures. A video showing all the different gestures is available, but it is not clear what the virtual human is trying to enact, and there is no synchronization between the gestures and the speech.

Chapter 4 Facial Animation and Lip Synchronization

In human face-to-face communication, facial expressions are excellent carriers of emotions [Salem, 2000]. Eye contact and gaze awareness play an important role in conveying messages non-verbally [Leung, 2001]. Like gestures, facial expressions in humans occur spontaneously and instinctively. Thus, they should be made to occur in the flow of the animation instead of being explicitly driven by menu or button controls. Lip synchronization is an important component of facial animation.
Some animation techniques provide random jaw and lip movements; when speech is attached to such animations, the result does not look plausible. In order to avoid this, many facial animation techniques provide support for lip synchronization.

Facial animation can be done in three different ways [Gratch, 2002]. The first method is to use keyframe-based interpolation techniques. These methods are called parametric animation techniques, and they use geometric interpolation to produce the required shape [Byun, 2002]. Geometric interpolation is similar to the keyframe-based approach. (A brief introduction to the keyframe-based approach was provided in the "Gestures" chapter (Chapter 3) of the report.)

The second method is to produce facial animation from text or speech. In this method, an algorithm used to analyze the text or the speech identifies a set of "phonemes". A phoneme can be defined as the smallest unit in a language that is capable of conveying a distinction in meaning [Dictionary]. The phonemes are then mapped to visemes, which act as visual phonemes. A model called the speech articulation model takes the visemes as input and animates the face [Gratch, 2002]. The speech articulation model operates on a generic face, which is represented in the form of a mesh of triangles and polygons; the animator is expected to provide this face mesh. It uses physics-based models to simulate the skin and facial muscles [Byun, 2002], and mathematical models to produce changes in the skin tissues and skin surface when the facial muscles move. Keyframe-based interpolation is used to identify the key poses and produce a smooth transition between them. Though the generated animation is realistic, generating the initial face mesh involves a lot of manual work by the animators.

The third set of methods for facial animation, termed performance-driven methods, extract the required facial expressions from live humans (or from videos of those humans) by using special electromechanical devices [Byun, 2002]. A library of the regularly used facial expressions is made from the captured images, and the required facial expression is called from the library as needed. These methods are usually used in combination with motion capture methods for body animation. They require special equipment and a tremendous amount of human involvement in the form of models. The techniques can be used to generate a library of facial expressions, but it would be an involved task to customize the expressions to work for a new face model because they are specific to a particular person [Gratch, 2002]. Each time an animation of a different subject is required, he or she has to go through the entire process, which can be time consuming.

Owing to the drawbacks of the other two methods, parametric animation techniques are widely used in the latest animation systems. They take two sets of parameters, the Facial Action Coding System (FACS) and the Facial Animation Parameters (FAPs), which are explained in the remainder of this section. [Leung, 2001] describes the Facial Action Coding System (FACS), which was developed by Ekman and Friesen in 1978. [FACS] is a list of all "visually distinguishable facial movements." The list of FACS is frequently updated; that is the reason the parametric animation techniques take the FACS as a parameter. The Facial Animation Parameters (FAPs) represent a facial expression in the form of a set of distances between various facial features.
Different expressions are produced by changing these FAPs [MPEG-4]. In simpler terms, the FAP set consists of 66 parameters which store the displacements of various facial feature points of a given face model. A complete description of what each of these 66 parameters represents is available in [ISO, 1997]. Briefly, 16 FAPs represent the jaws, chin and lips; 12 FAPs represent the eyeballs, pupils and eyelids; 8 FAPs represent the eyebrows; 4 FAPs represent the cheeks; 5 represent the tongue [MPEG-4]; and so on. Each parametric animation technique uses the FACS and FAPs in a different way.

There are two ways in which lip synchronization can be achieved in facial animation [Leung, 2001]. The first approach uses "energy detection techniques" to convert the input speech into an angle for the mouth opening. The energy content of the speech is measured and the lips of the avatar are animated accordingly. For example, an "o" in the uttered word results in the lips of the agent briefly forming a circle, whereas two "o"s result in a more pronounced lip movement. The higher the intensity, the more pronounced the lip movement. The quality of the generated animation depends on the quality of the input: the speech has to be recorded at good quality and the energy information has to be captured accurately, which requires additional equipment. Hence, this technique is not the preferred method. The second approach generates phonemes by scanning the input text. The phonemes are then mapped to appropriate visemes, which are used to drive the lip movement of the avatar. The way lip movements are generated from the viseme information depends on the animation technique being used. Most of the existing facial animation techniques use the second method to provide lip synchronization.

Facial animation requires generating plausible facial models and mechanisms to move the surface of the produced face model to reflect the required expressions and emotions [Egges, 2003]. Lip synchronization, jaw rotation, and eye movement are some of the important considerations in facial animation. The generated animation should be believable, i.e., the agent should blink appropriately, the lips, teeth, and tongue should be modeled and properly animated, and emotions must be readable from the face [Byun, 2002]. The parameters used to control the animation should be easy to use. The control parameters should be consistent and easily adaptable across different face models; in other words, customizing the animation data for a different model should require as little human involvement as possible. The generated facial animation must also be able to work with body animation and gesture animation.

We describe and analyze four of the tools available for facial animation: BEAT [Cassell, 2001], FacEMOTE [Byun, 2002], MIRALab's tool for facial animation [Egges, 2003], and the BALDI system [Baldi]. Each of these tools will be evaluated based on the applicable criteria from Chapter 1 of the report and the additional requirements identified so far in this chapter. Table 3 summarizes each of the techniques. The entries in the table labeled "cannot be determined" indicate that not enough information is provided to determine whether the system satisfies that particular criterion.
The entries labeled "possible" indicate that although the description of the system does not explicitly say anything about satisfying that criterion, it can be inferred from the available description that the system possibly satisfies it.

Criteria | BEAT | FacEMOTE | MIRALab | BALDI
How are facial expressions triggered? | Through words in the input text | Possibly through words in the input text | Through words in the input text | Through words in the input text
Does it include personality and mood? | Yes | Cannot be determined | Yes | Possible
Is the agent made to blink often? | Possible | Possible | Yes | Possible
Can a combination of expressions be generated? | Possible | Yes | Yes | Yes
Does the technique provide support for lip synchronization? | Yes | Yes | Yes | Yes
Can the animation be controlled by the user? | Yes | Yes | Yes | Yes
Is the system extensible? | Yes | Possible | Yes | Cannot be determined
Can the intensity of the emotion be changed? | Cannot be determined | Yes | Yes | Cannot be determined
Is it portable to other face models? | Cannot be determined | Only MPEG-4 models | Yes | Cannot be determined
Can it be integrated with the existing body animation techniques? | Possible | Possible | Possible | Yes
Can the system be used in real time? | Yes | Yes | Yes | Yes
Evaluation | Scientific and limited | Scientific and decent | User evaluated and very limited | User evaluated and scientific
Demonstration | No | No | No | Yes
Which of the above-mentioned methods of animation is used? | Viseme generation method | Parametric animation technique | Parametric animation technique | Not specified

Table 3: Facial animation tools

4.1 BEAT

The Behavior Expression Animation Toolkit (BEAT), developed by the Gesture and Narrative Language Group (GNL) at the MIT Media Lab, can also be used to produce facial expressions. The technique was described in the "Gestures" chapter (Section 3.2) of the report. The tool takes as input typed text to be spoken by the animated human figure and produces the appropriate nonverbal behavior [Cassell, 2001]. The nonverbal behavior is generated on the basis of "linguistic" and "contextual" analysis of the input text. It uses the generic knowledge base to produce a set of "suggested behaviors", which is then coupled with the user-generated filters to produce the selected behavior [Cassell, 2001]. The behavior suggestion module contains a series of facial expression generators, such as an eyebrow flash generator and a gaze generator. The eyebrow flash generator signals the raising of the eyebrows when something surprising happens; this can be customized as mentioned in Section 3.2. The gaze generator is algorithmic: it suggests gazing away from the user at the beginning of a dialog and gazing towards the user at the end of the dialog, and if the dialog is long, it suggests gazing at periodic intervals.

As mentioned in the description of BEAT in Section 3.2, the system is extensible and can be used in real time. In addition, it can be inferred that a combination of facial expressions can be generated. From the description of the system, we have inferred that it uses the second method of animation (the viseme generation method) described earlier in this chapter: the input text is analyzed to produce visemes, and the animation is generated by the speech articulation model taking these visemes as input. From the available data, we infer that blinking of the eyes at regular intervals can be achieved by including the corresponding call in the selected behavior.
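The gaze rule just described can be made concrete with a small sketch; the timing thresholds and the function name are invented, since [Cassell, 2001] states only the qualitative rule.

def suggest_gaze(elapsed, total, glance_period=8.0):
    """Suggest a gaze target for the speaking avatar, following BEAT-style heuristics.

    elapsed: seconds since the utterance started; total: expected utterance length in seconds.
    """
    if elapsed < 1.0:                      # beginning of the dialog turn: look away
        return "gaze_away"
    if total - elapsed < 1.0:              # end of the turn: look back at the listener
        return "gaze_toward_user"
    if total > 10.0 and int(elapsed) % int(glance_period) == 0:
        return "gaze_toward_user"          # long utterance: periodic glances at the listener
    return "hold_current_gaze"

for t in (0.5, 8.0, 11.5):
    print(t, suggest_gaze(t, total=12.0))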
4.2 FacEMOTE
FacEMOTE is a facial animation technique designed at the Department of Computer and Information Science at the University of Pennsylvania. It produces the required facial animation using a parametric animation technique. The system can work with a facial model created using motion capture or generated manually, as long as the model is expressed in the MPEG-4 form [Byun, 2002; Garchery, 2001]. MPEG-4 is a standard used to produce high-quality visual communication [Goto, 2001]. It defines a set of points on the face called Face Definition Parameters (FDPs); some of these points are used to define the shape of the face. A particular configuration of the face that shows no emotion is designated the neutral position. A set of parameters called Facial Animation Parameters (FAPs) specify displacements from this neutral face position. These FAPs are applied to the FDPs, and the required facial expression is generated. FAPs can be used to generate both visemes and expressions. As mentioned earlier, visemes are visual phonemes used to represent lip synchronization. For example, when the avatar has to utter a word like "hello", the visemes ensure that at the end of the utterance the lips of the avatar form an "o". Fourteen distinguishable visemes are included in the library provided by the MPEG-4 standard. Transitions from one viseme to another can be produced by blending the two visemes together using a weighting factor for each of them [Garchery, 2001]. Since MPEG-4 deals with both facial expressions and visemes, it can be inferred that any animation technique which follows the MPEG-4 standard provides support for lip synchronization. Similarly, six facial expressions (joy, sadness, anger, fear, disgust and surprise) are provided. Each facial expression is associated with a value which specifies the intensity of the expression, and the intensity can be varied as needed. It is also possible to produce a combination of expressions by blending the provided expressions with a weighting factor. For example, 70% of fear and 30% of surprise can be blended together to show horror. Details about how the visemes and expressions are blended are specific to the techniques that use the MPEG-4 model.
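As a rough illustration of this blending, the sketch below combines two expressions, each given as a vector of FAP displacements, using weighting factors. The FAP vectors are made-up numbers; real expression profiles come from the MPEG-4 tables and the specific tool in use.

    # A sketch of blending two MPEG-4 style expressions by weighting factors.
    # Each expression is a vector of FAP displacements from the neutral face;
    # the values below are illustrative, not real MPEG-4 data.

    FEAR     = [0.0, 0.4, 0.4, 0.0, 0.7, 0.7]   # hypothetical FAP displacements
    SURPRISE = [0.9, 0.9, 0.0, 0.6, 0.2, 0.2]

    def blend(expr_a, weight_a, expr_b, weight_b):
        """Blend two FAP vectors, e.g. 70% fear and 30% surprise to suggest horror."""
        return [weight_a * a + weight_b * b for a, b in zip(expr_a, expr_b)]

    if __name__ == "__main__":
        horror = blend(FEAR, 0.7, SURPRISE, 0.3)
        print(horror)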
A 3D model of the person who has to be represented is obtained using a 3D laser scanner. Facial Animation Parameters (FAPs) are then generated from this 3D model using a number of mathematical algorithms [Byun, 2002]. Since the large number of Facial Animation Parameters (66) makes it difficult to use FAPs as a direct animation tool, FacEMOTE uses a set of four higher-level parameters that drive the underlying 66 FAPs. The mapping between the FAPs and the higher-level parameters is described in the subsequent paragraphs. The use of such parameters allows easy control over the face. The FAPs are organized into sets called FAP units (FAPUs). For example, all the FAPs used to control the eyes could be grouped into an eye FAPU, and so on. The four higher-level parameters are used to control each of the FAPUs, and they are categorized as space, weight, time and flow. These parameters were adapted from the effort parameters of the EMOTE system [Chi, 2000]. Each of these parameters takes values ranging from -1 to 1, with "0" representing a neutral pose. [Byun, 2002] offers the following examples. The space parameter varies between indirect, represented by -1, and direct, represented by 1. For example, a space parameter controlling an eye can describe an unfocused gaze when the value selected is -1 and a focused look when the value selected is 1. Similarly, space parameters are linked with other FAPUs to produce various other expressions. The weight parameter varies between light and strong. When associated with speech, a light action can be whispering and a strong action can be snarling. The time parameter varies between sustained and quick. When associated with the FAPUs of the mouth, a sustained action could be yawning and a quick action could be clearing of the throat. The flow parameter varies between free and bound. A free action could be laughing, while a bound action could be chuckling. Free and bound actions are similar to the sustained and quick actions of the time parameter; the difference is that the flow parameter can be associated with speech. For example, a person can be laughing while at the same time saying "This is very funny". The set of FAPs is used to specify each expression. A neutral expression can be generated by setting the four parameters to zero, which in turn sets all the FAPUs, and hence all the FAPs, to zero. A smile, for example, can be generated by setting the weight parameter associated with the lips FAPU to a value between -1 and 0 and the flow parameter to a value between 0 and 1. The intensity of the smile can be varied by changing these values. With this approach, the animator has a better means of controlling the FAPs than having to deal with 66 individual parameters. Keyframe-based interpolation techniques can be used to generate the animation from the given set of FAPs. Hence, it is possible to integrate the generated animation with the existing body animations that are produced using interpolation techniques. Evaluation is done by trying to generate all the facial expressions mentioned in FACS by varying the values of the four parameters in each FAPU. [Byun, 2002] shows a snapshot of the generated animation for a set of values, and it looks decent. Combinations of expressions are also represented. The authors claim that, since the FAPs store displacements of facial feature points rather than absolute positions, the same FAP data can be used on different face models and still generate realistic animation. The lack of a demonstration makes it impossible to determine whether this claim is true. The explanation of the system does not explicitly specify how facial expressions are triggered. The authors have generated a library of the regularly used expressions (like smile and surprise) by changing the values of each of the four parameters for each FAPU. Assuming that their claim that the same FAP data can be used over different face models is true, we have inferred that the required expression can be called from the library based on an analysis of the input text; this was the approach used in gesture animation. The approach can be used in real time because the facial expressions are generated by keyframe interpolation techniques, which are quite fast [Cad].
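A minimal sketch of this style of control is given below: a group of FAPs (a "FAPU" in the terminology above) is scaled by the four high-level values in [-1, 1]. The modulation formula is a guess for illustration only; the actual mapping is defined in [Byun, 2002].

    # A sketch of FacEMOTE-style control: four high-level parameters
    # (space, weight, time, flow), each in [-1, 1], modulate a group of FAPs.
    # The modulation rule below is illustrative, not the published mapping.

    def modulate_fapu(base_faps, space, weight, time, flow):
        """Scale a group of FAP displacements by the four high-level parameters.
        With all four parameters at 0, the group stays at the neutral pose."""
        gain = (space + weight + time + flow) / 4.0      # illustrative combination
        return [gain * fap for fap in base_faps]

    if __name__ == "__main__":
        lips_fapu = [0.2, 0.5, 0.5, 0.1]                 # hypothetical lip FAP profile
        # A light, free movement (weight < 0, flow > 0), as in the smile example above.
        smile = modulate_fapu(lips_fapu, space=0.0, weight=-0.3, time=0.0, flow=0.8)
        print(smile)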
From the description of the method, we have inferred that it is possible to make the agent blink at specific intervals of time by embedding the appropriate trigger in the input text: when the input text is scanned, a call to the eye-blinking expression can be included. There may be a more systematic way of including eye blinking, but it is not apparent from the provided description. The information provided about the technique does not explicitly state whether the user can add new expressions or modify the existing ones. We have inferred that the user could change or intensify an expression by changing the values of the four parameters. New expressions can also be generated by taking the FAPs of an existing expression and changing the values of the four parameters; once the intended expression is produced, it can be stored in the library. The disadvantage of this approach is that the four parameters are changed by trial and error in order to produce the required facial expression, which can be quite time consuming and frustrating.
4.3 The MIRALab
This section describes and evaluates a facial animation technique designed by the MIRALab research group at the University of Geneva. The technique operates on input from the user in the form of text or audio. If the input is audio, it is converted into text using available speech-to-text software. The system produces the facial animation in real time and couples speech to it. The generated 3D face thus shows facial expressions and speaks the specified text [Egges, 2003]. The text input is analyzed and tags are produced. These tags are then used to determine the appropriate facial expression. The research group at MIRALab also uses the MPEG-4 standard to generate facial animation. They claim that the use of Facial Animation Parameters (FAPs) alone does not provide sufficient quality of facial animation for all applications; as mentioned earlier, it is difficult to produce an animation by controlling 66 parameters. Hence, they use a Facial Animation Table (FAT), which is also provided by the MPEG-4 standard, in their animation [MPEG-4]. The Facial Animation Table defines the effect of changing the set of FAPs. The table is indexed by facial expressions, each called an "IndexedFaceSet". An IndexedFaceSet shows a facial expression graphically and points to the set of FAPs for that expression. The FAT contains fields such as coordIndex, which contains the list of FAPs that are to be changed to represent the current facial expression, and coordinate, which specifies the intensity and direction of the displacement of the vertices listed in the coordIndex field.
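The table just described can be pictured as a small lookup structure, sketched below with made-up field contents; the real FAT node layout is defined by the MPEG-4 specification and the MIRALab tool.

    # A sketch of a Facial Animation Table (FAT) style lookup: each expression
    # names the FAPs to move (coordIndex) and how far to move them (coordinate).
    # The entries below are illustrative placeholders, not real FAT data.

    FAT = {
        "smile": {
            "coordIndex": [51, 52, 53],            # hypothetical FAP indices for the lips
            "coordinate": [0.3, 0.3, 0.1],         # displacement per listed FAP
        },
        "surprise": {
            "coordIndex": [31, 32, 33, 34],        # hypothetical FAP indices for the eyebrows
            "coordinate": [0.6, 0.6, 0.6, 0.6],
        },
    }

    def faps_for(expression, intensity=1.0):
        """Return (FAP index, displacement) pairs for an expression, scaled by intensity."""
        entry = FAT[expression]
        return [(i, intensity * d) for i, d in zip(entry["coordIndex"], entry["coordinate"])]

    if __name__ == "__main__":
        print(faps_for("smile", intensity=0.5))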
The person to be represented is photographed, and 3D animation algorithms use these photographs to produce the 3D model of the person in the form of triangles. 3D graphical tools like 3D Studio Max and Maya can be used to change the model if needed [Garchery, 2001]. The set of FAPs is generated from the 3D model using specific algorithms; details about how these algorithms work can be found in [Kshirsagar, 2001]. The research group has developed tools to automatically build the Facial Animation Table (FAT) from the 3D animated face produced using [Kshirsagar, 2001]. The tool takes the animated face model, generates the FAPs, and modifies these FAPs slightly to produce a different facial expression. The animator is then asked whether he/she needs this expression. Hence, the animator can choose to store the generated expression if it is deemed necessary. Controls are provided in the form of slider bars, and the animator can change a particular FAP unit (FAPU) himself/herself to reflect the needed expression. Though the generation of the FAT is a one-time task and is less strenuous than the method used by FacEMOTE, it is still time consuming. The research group claims that the FAT can be downloaded, so the animator does not have to generate a new FAT each time he/she wishes to create a new face animation. The same FAT data can reportedly be used on different face models and still generate realistic animation. Again, as in the case of FacEMOTE, the lack of a demonstration makes it impossible to determine whether this claim is true. The system can be used in real time once the FAT is formed or downloaded, since frames of animation are generated at the rate of 3.4 frames/second. The animation is produced by means of a keyframe-based interpolation technique, which was also developed at the MIRALab. The generated animation can be integrated with body animation that is also generated using keyframe-based methods. The tool represents both visemes and expressions. The Facial Animation Table provided stores the fourteen basic visemes and the six emotions identified by MPEG-4. The animator can generate additional expressions and visemes using the tool that was used to build the FAT from the 3D face model. The intensity of an emotion can be controlled by changing the intensity of the corresponding FAPs. Since analysis of text is used to trigger the facial expressions, we assume that blinking of the eyes can be modeled by placing the appropriate tag at regular intervals in the text. It is claimed that the technique can be extended to any face model that follows the MPEG-4 standard. If a particular model does not follow that standard, a special algorithm developed at the MIRALab [Garchery, 2001] is used to extract the Facial Animation Parameters. The technique was used to build a virtual tutor application which was evaluated by many human users. Details of the evaluation criteria and the responses of the evaluators are missing. Hence, we have assumed that the evaluation is not scientific and is limited.
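Since both the BEAT and the MIRALab approach trigger expressions from the input text, the assumed tagging can be sketched as follows. The tag syntax and the blink interval are our own illustrative choices, not the conventions of either tool.

    # A sketch of embedding expression tags (including periodic blinks) in input text.
    # Tags of the form <expr:NAME> are an illustrative convention only.

    import re

    def insert_blinks(text, every_n_words=8):
        """Insert a blink tag after every n words so the agent blinks periodically."""
        words, out = text.split(), []
        for i, word in enumerate(words, start=1):
            out.append(word)
            if i % every_n_words == 0:
                out.append("<expr:blink>")
        return " ".join(out)

    def extract_expressions(tagged_text):
        """Return the expressions to trigger, in the order they appear in the text."""
        return re.findall(r"<expr:(\w+)>", tagged_text)

    if __name__ == "__main__":
        tagged = insert_blinks("this is a longer utterance that the virtual tutor "
                               "reads aloud while it blinks at regular intervals")
        print(tagged)
        print(extract_expressions("<expr:smile> glad to see you <expr:blink>"))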
4.4 BALDI
BALDI is a conversational agent project carried out for the Center for Spoken Language Understanding (CSLU) at the University of Colorado and is funded by a National Science Foundation (NSF) grant. The aim of the project is to develop interactive learning tools for the language training of deaf children. The system takes a recorded utterance and a typed version of the same utterance as input. The input is scanned and expressions are triggered based on the words in the input. Lip synchronization is achieved by producing visemes from the input text. The head of the agent is made up of a number of polygons joined and blended together to form a smooth surface. The source code for the generation of the head, which is written in C, is provided on request. When a particular user has to be represented, a picture of the user is taken and projected onto the generic face generated as above. Special functions, called texture mapping functions, are used to blend the projection with the generic face. Since the system is used to train deaf children in a particular language, the most importance is given to lip synchronization, which is controlled by 33 parameters. However, basic facial expressions (like happiness, anger, fear, disgust, surprise and sadness) are also represented [Toolkit]. The user is provided with an interactive interface through which he/she can add new expressions or change the intensity of the existing expressions. As part of the evaluation, users are made to watch the animation and asked to interpret what the agent is saying. Success in correctly interpreting the spoken text is recorded; the authors state that, on average, there is a miss once in 17 times. The evaluation is scientific and requires user involvement. A demonstration of the technique is available at http://www.cse.ogi.edu/CSLU/toolkit (active as of October 1, 2003). The toolkit can be installed on a PC and tested. The generated animation in the demonstration produces proper lip synchronization, but does not represent any facial expressions. The agent is realistic to look at and blinks its eyes periodically. The animation can be generated in real time. The technique uses interpolation techniques to produce the facial animation [Massaro, 1998]. Hence, we have inferred that it can be coupled with the body animation techniques that also use interpolation. From the provided information, it is not clear whether the tool can be extended to other face models. It is also not clear whether personality and mood can be integrated into the tool.
Chapter 5 Evaluation of the Existing Tools
In this chapter, we evaluate a few existing animation tools, the Microsoft Agent, the NetICE, the Jack, the DI-Guy, and Dr. Ken Perlin's Responsive Face, based on the applicable criteria from the introductory part of the report. Table 4 lists the evaluation criteria and the performance of each of the tools with respect to those criteria.
Public/Private - MS Agent: public; NetICE: private; Jack: public; DI-Guy: public; Responsive Face: private.
Body animation and gestures - MS Agent: decent; NetICE: limited; Jack: decent; DI-Guy: appears to be decent; Responsive Face: not applicable.
Facial animation - MS Agent: limited; NetICE: very limited; Jack: limited; DI-Guy: limited; Responsive Face: decent.
Speech - MS Agent: decent; NetICE: very limited; Jack: not available; DI-Guy: claimed to be provided; Responsive Face: limited.
Control - MS Agent: provided; NetICE: limited; Jack: provided; DI-Guy: cannot be determined; Responsive Face: provided.
Extensible - MS Agent: yes; NetICE: limited; Jack: limited; DI-Guy: limited; Responsive Face: yes.
Demonstration - MS Agent: yes; NetICE: yes; Jack: yes; DI-Guy: no; Responsive Face: yes.
System requirements - MS Agent: basic; NetICE: not specified; Jack: not specified; DI-Guy: basic plus a TNT2 graphics card and 32MB VRAM; Responsive Face: basic.
Examples in code - MS Agent: provided; NetICE: not provided; Jack: not provided; DI-Guy: not provided; Responsive Face: not applicable.
Difficulty level - MS Agent: easy to use; NetICE: easy to test; Jack: cannot be determined; DI-Guy: cannot be determined; Responsive Face: easy to test.
Support - MS Agent: provided; NetICE: not provided; Jack: provided; DI-Guy: cannot be determined; Responsive Face: not provided.
Cost - MS Agent: free; NetICE: not specified; Jack: not specified; DI-Guy: $9000 for basic features; Responsive Face: not applicable.
Table 4: Existing animation tools
Microsoft Agent
The Microsoft Agent (MS Agent) is publicly available software that provides a few animated characters which show gestures and some expressions. Two of the available characters (Merlin and Genie) look human-like. There are other characters, built by third-party developers, that are MS Agent compatible; Agentry [Agentry] is one of them. Some of these agents represent both the body and the face. The user can control the MS Agent or an MS-compatible agent programmatically. There are some examples in code that act as a demonstration. These examples are available in languages like VC++, J++, Visual Basic, and HTML and can be downloaded from the MS Agent web page [MS Agent]. The code is easy to understand and execute. We have tried modifying the existing example code in Visual Basic for our evaluation of the tool. The code works as expected and creates an agent which performs all the specified actions.
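Our driver programs were written against the examples shipped with the tool; the sketch below shows the same idea in Python over the MS Agent COM control rather than Visual Basic. The ProgID, character file name, and animation name follow the standard MS Agent documentation, but treat the exact identifiers as assumptions to be checked against the installed version.

    # A sketch of driving a Microsoft Agent character from a small driver program.
    # Requires Windows with MS Agent and the pywin32 package installed.

    import win32com.client

    def run_demo():
        agent = win32com.client.Dispatch("Agent.Control.2")
        agent.Connected = True                          # connect the control to the agent server
        agent.Characters.Load("Merlin", "merlin.acs")   # load the Merlin character file
        merlin = agent.Characters("Merlin")
        merlin.Show()
        merlin.Play("Confused")                         # one of the built-in gestures (Figure 2)
        merlin.Speak("What a wonderful day!")           # spoken with lip synchronization
        merlin.Speak("What day is it today?")
        input("Press Enter to exit...")                 # keep the process alive while it animates

    if __name__ == "__main__":
        run_demo()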
We have tried to integrate some of the animations developed by third-party sources into the MS Agent software. We tested them, but our main evaluation was done on Merlin, an animation provided by Microsoft. We chose Merlin because the character looked more human-like than the others. The input is given in the form of text. The agent analyzes the text and speaks it, showing appropriate changes in tone. For example, "What a wonderful day!!" is said with the needed emphasis, and "What day is it today?" is said with a hint of questioning. Additionally, the spoken text can also be displayed in a text bubble. Snapshots of the Microsoft Agent are shown in Figures 1, 2 and 3. We have observed that lip synchronization is available and is reasonably plausible. To improve the lip synchronization, special software called the Linguistic Sound Editing Tool can be used, which allows the animator to develop phoneme and word-break information. The agents provided by Microsoft show emotions like happiness, sorrow and surprise. Various useful gestures, like explaining, acting helpless, and pointing to something, are also represented. The gestures and expressions that can be represented by each character are listed in one of the examples available on the website [MS Agent]. The third-party agents, on the other hand, show only a limited number of emotions or gestures. We have tried generating an animation by calling the various gesture functions from a driver program and giving different input texts. The generated animation was seamless and impressive. The software can be downloaded to a PC. A few components (like localization support, agent character files, and text-to-speech engines) can be downloaded separately from the Microsoft website.
Figure 1: Initial interface when an MS Agent is run
Figure 2: Agent demonstrating the confused gesture
Figure 3: Agent demonstrating the confused gesture and speaking text simultaneously
Developers are provided with a tool called the Agent Character Editor that allows the creation of custom agent characters. Documentation for the Agent Character Editor is available. We have tried using the editor, but it was not trivial. Support is available in the form of a troubleshooting section and a frequently asked questions section, and the support provided is helpful. The MS Agent is available royalty-free when used for the developer's own application. A distribution license is required if the application is to be posted on a server or distributed via electronic mail.
NetICE
The Networked Intelligent Virtual Environment (NetICE) is a project of the Advanced Multimedia Processing Lab at Carnegie Mellon University. It aims at providing a virtual conference setting so that people in remote places can still feel that they are communicating in person. NetICE has a client-server architecture: the server distributes information to the clients, and each client at a remote location is presented with a 3D audiovisual environment. The client can add his/her avatar to the environment, view the virtual environment, and see the avatars of all the other participants. He/she can change position, look around the environment and operate the avatar's hands (raise and lower both hands). There is a whiteboard available for the client to write on. Figures 4, 5 and 6 demonstrate the working of this tool.
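NetICE's actual wire protocol is not documented in the material we reviewed; the sketch below only illustrates the general client-server pattern of such a CVE, with a made-up message format, port, and gesture name.

    # A sketch of the client-server pattern used by CVE systems such as NetICE:
    # each client sends its avatar's state to the server, which redistributes it
    # to the other participants. Message format and port are illustrative only.

    import json
    import socket

    def send_avatar_update(host, port, user, position, gesture):
        message = json.dumps({
            "user": user,
            "position": position,          # x, y, z in the shared virtual room
            "gesture": gesture,            # e.g. "raise_hands"
        }).encode("utf-8")
        with socket.create_connection((host, port), timeout=5) as conn:
            conn.sendall(message)

    if __name__ == "__main__":
        try:
            send_avatar_update("127.0.0.1", 9000, "alice", [1.0, 0.0, 2.5], "raise_hands")
        except OSError as err:
            print("No server running at this address:", err)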
Figure 4: The collaborative virtual environment and the agents in it
Figure 5: A closer look at the animation
Figure 6: Agent demonstrating the raising of a hand gesture
The website for NetICE [NetICE] provides a downloadable client-side executable file. The file can be downloaded, and a connection to the server can be established using the specified port and IP address. This provides the client with a virtual environment containing his/her avatar. The user can choose to use his/her own head and face model for this avatar, or use the synthetic model provided. If the user wishes to use his/her own face model, it has to be created by some other means; the tool currently does not provide any support for the creation of face models. The avatar can move around the room. No gestures are provided other than raising and lowering the hands. The basic facial expressions - joy, anger, surprise, sadness, fear, and disgust - are provided. The avatar's face is seen well by the others in the environment, since each user always views his/her own avatar from the back. A demonstration of a virtual environment is available on the website. We have observed from this demonstration that the movements of the virtual human are robot-like; in other words, the animation is not seamless. The body of the virtual human is also robot-like. Facial expressions are limited, and the user is offered no control over them. Also, the lip movement is not synchronized with the speech. Though speech support is provided, the utterances are always in the same tone, no matter what the text is; in other words, no emotion is conveyed in the speech. No special software is needed to run the demonstration, and it is not specified whether the full tool needs any special kind of software. From a video presentation of the product available at [NetICE], it is clear that a tracking system is needed to track the user's eyes and transfer them to the environment. The tracking system is used to make the avatar maintain eye contact with the other avatars in the environment. It is claimed that the user can use his/her own voice, but this is not sufficiently demonstrated. As it currently stands, the tool provides reasonably good support for virtual business conferences, where it might not be very necessary to represent the emotions of the participants, but it is not very useful for representing emotional agents.
JACK
Jack is a product of the Electronic Data Systems Corporation (EDS), which provides IT services [EDS]. It is a software tool that helps developers build virtual humans to operate in virtual environments. These virtual humans are designed with the intent of replacing real humans in testing and analyzing the performance of machines. A female embodiment called Jill is also provided. The virtual humans, when assigned to various tasks in a virtual environment, can tell engineers what they can see and reach, and when and why they are getting hurt. This helps developers design safer and more efficient products. A demonstration of the working of the virtual human is provided at [Jack]; Figures 7 and 8 show snapshots from the demonstration. From this demonstration, we observed that the body animation of both virtual humans is plausible and seamless. The tool provides a motion capture toolkit [Jack] which can be used to generate gestures. This can be done either by using motion sensors or by using controls provided in the form of slider bars. There is a library of movements available.
If the animator needs to modify an existing movement slightly to generate a gesture, he/she can use the controls provided. If a more complicated gesture is needed, he/she can use the motion sensors. The sensor attachments link the virtual human and the real human: the action of the real human wearing a sensor is reflected in the virtual human. The required gesture can be generated and stored in the library of available gestures. The availability and cost of the motion sensors are, however, not clear. Facial animation is provided, but the virtual human does not show any expressions or emotions. The tool provides a template of 77 body animations that can be used as a virtual human. It is said that this template can be modified to form a new virtual human, but there is no description or demonstration available that shows how this can be achieved. We therefore conclude that the extensibility of the tool is limited, since it does not provide sufficient evidence to support its claims.
Figure 7: Agent showing the ability to hold an object
Figure 8: Agent demonstrating the done gesture
The cost of the product and the license agreement details are not explicitly stated. Support is provided in the form of customer service and a frequently asked questions section on the website [Jack]. Details about the computer requirements to run this tool are missing. Also, there are no examples in code demonstrating how the tool works.
DI-Guy
The DI-Guy [DI-Guy] is commercial software developed by the Boston Dynamics lab for adding human-like characters to simulations. Though the product is used mainly to train military personnel, we have observed from the information provided that the tool can be used to generate body animation and gestures. It claims to provide realistic human models, a library of 1000 behaviors, and an API to control the behaviors. It also claims that the tool is compatible with platforms like Windows, Linux, and Solaris. To run the toolkit, the system needs to possess a TNT2 graphics card and 32MB of VRAM. The DI-Guy comes with a set of characters, and a set of facial expressions (like smile, trust, distrust, conniving, head nodding, head shaking and blinking) can be represented. The user can combine these expressions to generate new expressions. Support for lip synchronization is also claimed. The current version of the DI-Guy forces one to use the body models provided with the tool; other body models cannot be added, which limits the extensibility of the system. Also, it is not clear whether the user can control the animation in any way. Two types of licenses (a development license and a runtime license) are available. The development license allows a new DI-Guy to be built. The runtime license allows the user to run the application on an additional computer. The DI-Guy product costs $9000, with an additional $3500 for expressive faces. We feel that the major disadvantage of this tool is the lack of a demonstration; a demonstration should be provided to help the user decide whether or not to purchase the product.
Responsive face
The Responsive Face is the work of Dr. Ken Perlin at the New York University Media Research Lab and is part of the Improv project. A demonstration of the face is available at [Face]. Figures 9, 10 and 11 show a few snapshots from the demonstration. The face exhibits some predefined emotions like fright, anger, and disappointment.
A panel of controls is available, which the user can operate to produce additional expressions built from the provided ones. Once the required expression is formed, a snapshot of it can be taken and added to a timeline. The timeline is represented as a bar and contains the list of all the snapshots needed to generate the required animation. After producing a series of snapshots, the animation can be played back, making the face move through all of them.
Figure 9: The face and the panel of controls
Figure 10: The timeline without any snapshots
Figure 11: The timeline with snapshots
We observe that this face can represent multiple emotions. The animation is seamless and very impressive: the transition from one snapshot to the next in the timeline is made without any discontinuity. Dr. Perlin mentions that the Responsive Face has been integrated with a body animation in the Improv project. Also, there is a control button in the provided panel which makes the face speak. From this information, we infer that the face can be integrated with a few existing body animation techniques and that support for speech can also be provided; however, how this can be achieved is not obvious from the available material.
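The snapshot-and-timeline workflow lends itself to simple keyframe interpolation, sketched below with a made-up face parameter vector; the actual Responsive Face implementation is not available to us.

    # A sketch of playing back a timeline of facial snapshots by linear interpolation.
    # Each snapshot is a vector of face control values (e.g. brow raise, mouth open);
    # the numbers are illustrative only.

    def interpolate(snap_a, snap_b, t):
        """Blend two snapshots; t runs from 0.0 (first snapshot) to 1.0 (second)."""
        return [(1.0 - t) * a + t * b for a, b in zip(snap_a, snap_b)]

    def play(timeline, frames_between=4):
        """Yield one face pose per frame, moving smoothly from snapshot to snapshot."""
        for snap_a, snap_b in zip(timeline, timeline[1:]):
            for frame in range(frames_between):
                yield interpolate(snap_a, snap_b, frame / frames_between)
        yield timeline[-1]

    if __name__ == "__main__":
        neutral = [0.0, 0.0, 0.0]
        fright  = [0.9, 0.8, 0.2]
        anger   = [0.7, 0.1, 0.6]
        for pose in play([neutral, fright, anger]):
            print(pose)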
Chapter 6 Recommendations
We have analyzed various techniques for body animation, gesture animation, facial animation, and lip synchronization in the previous sections of the report. In this chapter, we summarize our analysis and, where possible, suggest improvements to the analyzed systems. We also analyze whether any of the existing techniques can be integrated to produce a believable animation and what, if any, are the possible difficulties in such an integration.
Body animation
We have considered two body animation techniques: the 2.5D video avatar and the body animation technique designed by the MIRALab. We have observed that the 2.5D avatar looks plausible as long as it points to an object in the virtual environment, but other movements (like holding an object) cannot be simulated because motion capture is used to produce the animation. As mentioned earlier (Section 2.1), modifying a captured motion is a strenuous task. Hence, we feel that this technique can only be used to produce animations for applications that need pointing gestures, such as a virtual human presenter. Additionally, we have identified that the system does not segment the image captured from the original environment effectively; because of this, the captured motion often has a part of the original environment in its background. A method to recognize where the subject's image ends and where the environment begins would be very helpful. We also observe that the generated animation does not have proper foot contact in the target environment; it looks as if the avatar is floating in the air. Any approach that can add some gravity to the image and give the avatar solid foot contact in the target environment would help immensely to increase the plausibility of the generated animation. The animation technique designed by the MIRALab (Section 2.2) maintains a vector which stores a set of distances derived from the body geometry (like the vertical distance from the crown of the head to the ground, and so on). The animation technique takes this vector as input and compares it with the vectors of the existing body templates (stored in a library). The required animation is generated by blending the two vectors using special procedures, as discussed in the corresponding section of the report. If the user has to modify the generated animation, he/she will have to manually change the values in the vector by trial and error, which can be quite frustrating and time consuming. We feel that this can be minimized by representing each of the dimensions in a panel of slider bars; the user can then change the parameters by moving the sliders. From the description of the MIRALab's animation technique and the evaluation done, we feel that it is possible to produce reasonably believable body animations using this technique.
Gesture animation
We have considered three gesture animation techniques: the Collaborative Virtual Environment, the BEAT architecture, and the Virtual Human Presenter. The technique used in the design of the Collaborative Virtual Environment (CVE) (Section 3.1) takes an input text and triggers the appropriate gestures based on the words in the text. The list of available gesture functions is not made public, and it is not specified whether the user is allowed to add any new gestures. We feel that it would be helpful to the animator if the list of available gestures were made public. Also, a user-friendly interface through which the animator can produce a new gesture by blending two or more available gestures would make the tool extensible. A means of selecting the weight of each gesture in the final gesture would be very helpful. For example, the animator should be able to choose a gesture denoting surprise and blend it with a gesture denoting fear, assigning them weights of 30% and 70% respectively; the resulting gesture can be used to represent horror. The animator should be able to change the existing gestures to form new gestures. Each new gesture could be assigned a name and linked to a word which triggers it. Also, we strongly believe that the user must be given a chance to choose his/her avatar; in other words, he/she should be allowed to use the existing synthetic body provided by the CVE or to add new body models generated using any of the existing body animation techniques. This approach is used in a 3D chat environment called Outer Worlds [Outer] and is very helpful. Another technique, the Virtual Human Presenter (Section 3.3), takes the input text, analyzes it, and embeds gesture function calls in the text. We feel that the technique works well for designing a virtual presenter, but to represent an emotional agent, a few features have to be added. First, the animator should be given control over the animation: he/she should be able to add new gestures and modify the existing gestures, and it would be convenient if a user-friendly interface were provided for this. We have also observed that the speech is not synchronized with the gestures. This might be because the agent is made to speak the text and only then is the appropriate gesture function called. For example, the text "I warn you" is modified to "I warn \gest_warn you". We suggest that the gesture function should show the required gesture and simultaneously make the agent say the corresponding text; in the above example, the \gest_warn function can make the agent utter the word "warn".
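The suggested fix, letting the gesture call also speak the word it is attached to, can be sketched as follows; the \gest_ marker syntax is taken from the example above, while the parsing and the speak/play calls are illustrative placeholders for the animation back end.

    # A sketch of the suggested fix: when the text contains a marker such as
    # "I \gest_warn warn you", the gesture both plays and the following word is
    # spoken at the same step, so speech and gesture stay synchronized.

    def speak(word):
        print("speaking:", word)

    def play_gesture(name):
        print("gesture :", name)

    def perform(marked_text):
        tokens = marked_text.split()
        i = 0
        while i < len(tokens):
            if tokens[i].startswith(r"\gest_"):
                gesture = tokens[i][len(r"\gest_"):]
                word = tokens[i + 1] if i + 1 < len(tokens) else ""
                play_gesture(gesture)     # start the gesture ...
                speak(word)               # ... and utter the word at the same step
                i += 2
            else:
                speak(tokens[i])
                i += 1

    if __name__ == "__main__":
        perform(r"I \gest_warn warn you")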
The Behavior Expression Animation Toolkit (BEAT) (Section 3.2), developed by the Gesture and Language Narrative Group, produces the required gesture commands by linguistic and contextual analysis of the input text. We feel that the generated gesture commands can be animated using interpolation techniques. The advantage of using interpolation techniques is that the gestures generated by the BEAT architecture can then be integrated with the existing body animation techniques, which also use interpolation. From the information provided, we infer that the system can be used to represent a combination of gestures. We observe that BEAT does not provide support for pointing gestures; however, they can be added to the system by modifying the user knowledge base. From the description of the technique and our analysis, we feel that BEAT is a useful technique for producing gesture animation for emotional agents.
Facial animation and lip synchronization
We have considered four facial animation techniques: the BEAT, the FacEMOTE, the MIRALab's approach, and the BALDI. The BEAT for facial animation (Section 4.1) operates in the same way it operates for gesture animation. The FacEMOTE (Section 4.2) uses the MPEG-4 standard and produces the animation by modifying a set of Facial Animation Parameters (FAPs). We infer from the description of the system that it is possible to add new expressions or to modify the existing expressions; this can be achieved by modifying the values in the FAP units (FAPUs). It would be helpful if the user were provided with a list of basic expressions and control bars to change them. For example, a set of control bars, each controlling the eyes, eyebrows, or lips, could be provided. The animator could choose any existing expression and adjust these control bars to generate the needed expression; once the needed expression is generated, it can be saved into the library of expressions. A similar approach is used in Dr. Ken Perlin's Responsive Face and is helpful. The MIRALab (Section 4.3) also uses the MPEG-4 standard to produce facial expressions and lip synchronization, and it provides all the features missing from the FacEMOTE technique. The Facial Animation Parameters are modified to produce some basic expressions, and these expressions are stored in the Facial Animation Table (FAT). The user is allowed to add new expressions and modify existing ones. The only drawback of this approach is that it does not specify how the personality, mood and emotion of the character can be taken into account. The system can be integrated with techniques which take the personality, mood and emotion of the agent into consideration and decide what the expression and its intensity should be; this data can then be used by the MIRALab's system to manipulate the needed FAPs and produce the required expression. The BALDI (Section 4.4), which was designed by the Center for Spoken Language Understanding (CSLU), is used to train deaf children. From the description of the system and the available demonstration, we feel that the tool can be used for applications in which lip synchronization is the main requirement. We observe that the tool represents only the basic emotions and does not show any complicated emotions or expressions. Hence, we feel that the system is not well suited for the animation of emotional agents.
Possible integration of techniques
From our evaluation and analysis of the tools, it is clear that different tools were designed with different purposes in mind. We believe that some of the tools can be integrated to form a meta tool. This meta tool can then be used to generate the animation of an emotional agent.
Based on our research, we feel that the MIRALab's body animation system (Section 2.2), the BEAT architecture (Section 3.2), and the MIRALab's facial animation system (Section 4.3) can be tied together to produce plausible animations. The library of existing body templates in the MIRALab's body animation tool can be organized using logic similar to a hash table. For example, the templates of all medium-height, medium-build, black-haired males can be grouped together, and a table containing all such groups can be maintained. When the animator needs to generate an animation, he/she can choose the required group and look for the required template. This template can be modified if needed and stored as a body model, for example Bob's body model. The input to the animation can be taken in the form of speech, and the appropriate gesture commands can be generated using the BEAT architecture. A keyframe-based interpolation technique can be used to reflect the generated gesture on the selected body model (for example, Bob's body model). Motion blending techniques can be used to make the resulting animation look seamless. These techniques, as mentioned in Section 3.3, use multiple editing algorithms and multitarget interpolation to produce a plausible animation; details about the motion blending algorithms can be found in [Kovar, 2003]. The input speech can be used by the MIRALab's facial animation system to produce the appropriate facial expressions and lip synchronization. At every stage in the animation, the user can be provided with some form of control to change a gesture or expression.
Challenges in integration
A major challenge in integrating the various techniques to produce a believable animation is maintaining synchronization between the generated gestures, facial expressions, lip synchronization and speech. Coordination between the verbal and non-verbal channels is necessary to produce a plausible animation. In other words, it is important that speech, gaze, head movements, expressions, lip synchronization, and gestures work together. Even if each of them works really well independently, the animation is plausible only when all the features blend together appropriately. For example, when the speaker wants to emphasize something, this is done with a strong voice, eyes turned towards the listener, and appropriate hand movements, all working together. The animation is not plausible if even one of these behaviors happens a couple of seconds late. Hence, it can be said that synchrony is essential for a believable conversation. When it is destroyed, satisfaction and trust in the conversation diminish, since the agent may appear clumsy or awkward [Gratch, 2002]. A technique called the motion graph has been proposed for synthesizing synchronized and plausible animations [Kovar, 2002]. A database of animation clips is maintained, for example a clip in which the agent smiles, a clip in which the agent waves, and so on. The motion graph is implemented as a directed graph in which the edges represent clips of animation data and the nodes act as the points where the small pieces of motion data join seamlessly. Motion graphs convert the problem of synthesizing an animation into the process of selecting a sequence of nodes. The motion graph takes a database of clips as input; the edges correspond to the clips of motion, and the nodes act as decision points where it is determined which clip succeeds the current clip.
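The structure just described can be sketched as a small directed graph. The clip names and connectivity are illustrative, and the automatic transition construction and search used in [Kovar, 2002] are not reproduced here; this sketch only picks successor clips at random.

    # A sketch of a motion graph: edges are animation clips, nodes are the points
    # where clips can join, and synthesis is a walk through the graph.

    import random

    # node -> list of (clip name, next node); each edge is a clip of motion data.
    MOTION_GRAPH = {
        "standing": [("wave", "standing"), ("smile", "standing"), ("start_walk", "walking")],
        "walking":  [("walk_cycle", "walking"), ("stop_walk", "standing")],
    }

    def synthesize(start_node, n_clips, seed=0):
        """Walk the graph and return the sequence of clips to play back."""
        random.seed(seed)
        node, clips = start_node, []
        for _ in range(n_clips):
            clip, node = random.choice(MOTION_GRAPH[node])
            clips.append(clip)
        return clips

    if __name__ == "__main__":
        print(synthesize("standing", 6))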
Transitions between clips are generated such that they seamlessly connect the two clips. This is achieved by means of a special algorithm which can be found in [Kovar, 2002]. The problem with the technique is that it is quite time consuming: if the database contains F frames, it is estimated that finding the next frame requires O(F²) operations. User involvement is needed when more than one frame is recognized as a possible successor of the current frame. On average, it is found that the time needed to produce a plausible animation is equal to the length of the animation plus at least 5 minutes of the user's time. Hence, the applicability of the technique in real time is questionable. We feel that in applications where the plausibility of the animation is more important than the time consumed to generate it, the theory of motion graphs is really helpful.
Conclusions
We have identified the basic requirements for a tool or theory for the graphical representation of emotional agents. To generate a lifelike agent, it is important to have plausible body animation, gesture animation, facial animation and lip synchronization techniques. We have analyzed many current theories in each of these fields and suggested possible improvements. We found that the MIRALab's body animation tool can be used to produce plausible body animations. Similarly, if the suggested improvements are made, the BEAT architecture can evolve into a good tool for gesture animation. Believable facial animation and lip synchronization can be produced using the MIRALab's technique. We propose that three of the existing tools, the MIRALab's body animation system, the BEAT architecture and the MIRALab's facial animation system, be integrated to produce a plausible animation. Synchronization is identified as a possible difficulty in this integration. We describe a technique called the motion graph which aims at producing synchronized animations; its major disadvantage is that it is very time consuming. We conclude by saying that in applications where the plausibility of the animation is more important than the time taken to generate it, the theory of motion graphs is really helpful.
References
[Agentry] http://www.agentry.net/ active as of October 5, 2003
[Badler, 1995] Badler, N. I., "Planning and Parallel Transition Networks: Animation's New Frontiers", Pacific Graphics '95
[Baldi] http://www.distance-educator.com/dnews/Article3208.phtml active as of October 1, 2003
[Bone] http://www.ucc.ie/fcis/DHBNFbone.htm active as of October 6, 2003
[Byun, 2002] Byun, M. and Badler, N. I., "FacEMOTE: Qualitative Parametric Modifiers for Facial Animations", Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, July 2002
[Cad] http://www.cadtutor.net/dd/bryce/anim/anim.html active as of September 29, 2003
[Cassell, 2001] Cassell, J., Vilhjalmsson, H., and Bickmore, T., "BEAT: the Behavior Expression Animation Toolkit", Proceedings of SIGGRAPH 2001, pp. 477-486
[Chandra, 1997] Chandra, A., "A Computational Architecture to Model Human Emotions", Proceedings of the 1997 IASTED International Conference on Intelligent Information Systems (IIS '97), IEEE
[Chi, 2000] Chi, D., Costa, M., Zhao, L., and Badler, N. I., "The EMOTE Model for Effort and Shape", Proceedings of ACM SIGGRAPH 2000, ACM Press / ACM SIGGRAPH, Computer Graphics Proceedings, Annual Conference Series, pp. 173-182
[CVLab] http://cvlab.epfl.ch/index.html active as of October 5, 2003
[Descamps, 2001] Descamps, S. and Ishizuka, M., "Bringing Affective Behavior to Presentation Agents", Proceedings of the 21st International Conference on Distributed Computing Systems Workshops (ICDCSW '01), 2001
[Dictionary] www.dictionary.com (15 September 2003)
[DI-Guy] http://www.bdi.com/content/sec.php?section=diguy active as of October 5, 2003
[EDS] www.eds.com active as of October 5, 2003
[Egges, 2003] Egges, A., Zhang, X., Kshirsagar, S., and Magnenat-Thalmann, N., "Emotional Communication with Virtual Humans", Multimedia Modeling, Taiwan, 2003
[Face] http://www.mrl.nyu.edu/projects/improv/ 2002; active as of October 5, 2003
[FACS] http://www-2.cs.cmu.edu/afs/cs/project/face/www/facs.htm 2002; active as of September 29, 2003
[Garchery, 2001] Garchery, S. and Magnenat-Thalmann, N., "Designing MPEG-4 Facial Animation Tables for Web Applications", Multimedia Modeling 2001, Amsterdam, pp. 39-59, May 2001
[Gleicher, 2001] Gleicher, M., "Comparing Constraint-Based Motion Editing Methods", Graphical Models 63(2), pp. 107-134, 2001
[Goto, 2001] Goto, T., Kshirsagar, S., and Magnenat-Thalmann, N., "Real Time Facial Feature Tracking and Speech Acquisition for Cloned Head", IEEE Signal Processing Magazine, Special Issue on Immersive Interactive Technologies, 2001
[Gratch, 2002] Gratch, J., Rickel, J., Andre, E., Badler, N., Cassell, J., and Petajan, E., "Creating Interactive Virtual Humans: Some Assembly Required", IEEE Intelligent Systems, July/August 2002, pp. 54-63
[H-Anim] www.h-anim.org 1999; active as of October 5, 2003
[Hirose, 1999] Hirose, M., Ogi, T., Ishiwata, S., and Yamada, T., "Development and Evaluation of Immersive Multiscreen Display 'CABIN'", Systems and Computers in Japan, Scripta Technica, vol. 30, no. 1, pp. 13-22, 1999
[Immer] http://www.ejeisa.com/nectar/fluids/bulletin/16.htm 1997; active as of October 6, 2003
[Iso] http://www.ks.uiuc.edu/Research/vmd/vmd-1.7.1/ug/node70.html 2001; active as of September 29, 2003
[ISO, 1997] ISO/IEC 14496-2, Coding of Audio-Visual Objects: Visual (MPEG-4 video), Committee Draft, October 1997
[Jack] www.plmsolutions-eds.com/products/efactory/jack 2001; active as of October 5, 2003
[Jean] http://jean-luc.ncsa.uiuc.edu/Glossary/I/Isosurface/ active as of September 29, 2003
[Kovar, 2002] Kovar, L., Gleicher, M., and Pighin, F., "Motion Graphs", Proceedings of ACM SIGGRAPH 2002
[Kovar, 2003] Kovar, L. and Gleicher, M., "Flexible Automatic Motion Blending with Registration Curves", Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 214-224
[Kshirsagar, 2001] Kshirsagar, S., Garchery, S., and Magnenat-Thalmann, N., "Feature Point Based Mesh Deformation Applied to MPEG-4 Facial Animation", Deformable Avatars, Kluwer Academic Press, 2001, pp. 24-34
[Leung, 2001] Leung, W. H. and Chen, T., "Immersive Interactive Technologies towards a Multi-User 3D Virtual Environment", IEEE Signal Processing Magazine, May 2001
[Lewis, 2000] Lewis, J., Cordner, M., and Fong, N., "Pose Space Deformations: A Unified Approach to Shape Interpolation and Skeleton-Driven Deformation", ACM SIGGRAPH, July 2000, pp. 165-172
[Magnenat, 2003] Magnenat-Thalmann, N., Seo, H., and Cordier, F., "Automatic Modeling of Virtual Humans and Body Clothing", Proc. 3-D Digital Imaging and Modeling, IEEE Computer Society Press, October 2003
[Maldonado, 1998] Maldonado, H., Picard, A., Doyle, P., and Hayes-Roth, B., "Tigrito: A Multi-Mode Interactive Improvisational Agent", Proceedings of the 1998 International Conference on Intelligent User Interfaces, San Francisco, CA, 1998, pp. 29-32
[Massaro, 1998a] Massaro, D. W., "Perceiving Talking Faces: From Speech Perception to a Behavioral Principle", Cambridge, MA: MIT Press, 1998
[Massaro, 1998] Massaro, D. W. and Stork, D. G., "Speech Recognition and Sensory Integration", American Scientist, 86, pp. 236-244, 1998
[Moffat, 1997] Moffat, D., "Personality Parameters and Programs", Creating Personalities for Synthetic Actors, Springer Verlag, New York, 1997, pp. 120-165
[MPEG-4] MPEG-4 SNHC, "Information Technology - Generic Coding of Audio-Visual Objects Part 2: Visual", ISO/IEC 14496-2, Final Draft of International Standard, ISO/IEC JTC1/SC29/WG11 N2501, 1998
[MS Agent] http://www.microsoft.com/msagent/ active as of October 5, 2003
[MVL] The MVL research center, http://green.iml.u-tokyo.ac.jp/tetsu/PPT/ICIP99/sld008.htm active as of September 28, 2003
[NetICE] http://amp.ece.cmu.edu/projects/NetICE active as of October 5, 2003
[Noma, 2000] Noma, T., Zhao, L., and Badler, N., "Design of a Virtual Human Presenter", IEEE Computer Graphics and Applications, 20(4), July/August 2000, pp. 79-85
[Outer] www.outerworlds.com active as of October 6, 2003
[Oz, 1997] http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/oz/web/papers/CMU-CS97-156.html active as of September 15, 2003
[Pina, 2002] Pina, A., Serón, F. J., and Gutiérrez, D., "The ALVW System: An Interface for Smart Behavior-Based 3D Computer Animation", Proceedings of the 2nd International Symposium on Smart Graphics, Hawthorne, New York, ACM International Conference Proceeding Series, 2002, pp. 17-20
[Rousseau, 1998] Rousseau, D. and Hayes-Roth, B., "A Social-Psychological Model for Synthetic Actors", Proceedings of the 2nd International Conference on Autonomous Agents (Agents '98), pp. 165-172
[Salem, 2000] Salem, B. and Earle, N., "Designing a Non-Verbal Language for Expressive Avatars", Proceedings of the Third International Conference on Collaborative Virtual Environments, San Francisco, California, United States, 2000, pp. 93-101
[Scheepers, 1997] Scheepers, F., Parent, R. E., Carlson, W. E., and May, S. F., "Anatomy-Based Modeling of the Human Musculature", Proceedings SIGGRAPH '97, pp. 163-172, 1997
[Seo, 2003a] Seo, H. and Magnenat-Thalmann, N., "An Automatic Modeling of Human Bodies from Sizing Parameters", ACM SIGGRAPH 2003 Symposium on Interactive 3D Graphics, pp. 19-26, 234, 2003
[Seo, 2003b] Seo, H., Cordier, F., and Magnenat-Thalmann, N., "Synthesizing Animatable Body Models with Parameterized Shape Modifications", ACM SIGGRAPH/Eurographics Symposium on Computer Animation, July 2003
[Sloan, 2001] Sloan, P., Rose, C., and Cohen, M., "Shape by Example", Symposium on Interactive 3D Graphics, March 2001
[Tamagawa, 2001] Tamagawa, K., Yamada, T., Ogi, T., and Hirose, M., "Development of 2.5D Video Avatar for Immersive Communication", IEEE Signal Processing Magazine, Special Issue on Immersive Interactive Technologies, 2001
[Tolani, 2000] Tolani, D., Goswami, A., and Badler, N., "Real-Time Inverse Kinematics Techniques for Anthropomorphic Limbs", Graphical Models 62(5), pp. 353-388
[ToolKit] http://cslu.cse.ogi.edu/toolkit/docs/users.html active as of October 1, 2003
[Tosa, 1996] Tosa, N. and Nakatsu, R., "Life-Like Communication Agent – Emotion Sensing Character 'MIC' and Feeling Session Character 'MUSE'", Proceedings of the 1996 International Conference on Multimedia Computing and Systems (ICMCS '96), IEEE
[Tosa, 2000] Tosa, N. and Nakatsu, R., "Interactive Art for Zen: 'Unconscious Flow'", International Conference on Information Visualisation (IV2000), July 19-21, 2000, London, England, p. 535
[Tsukahara, 2001] Tsukahara, W. and Ward, N., "Responding to Subtle, Fleeting Changes in the User's Internal State", Proceedings of the SIGCHI Conference on Human Factors in Computing Systems 2001, Seattle, Washington, United States, 2001
[Web3D] www.web3d.org active as of October 5, 2003
[Wilhelms, 1997] Wilhelms, J. and Van Gelder, A., "Anatomically Based Modeling", Proceedings SIGGRAPH '97, pp. 173-180, 1997
[Wu, 2001] Wu, Y. and Huang, T. S., "Human Hand Modeling, Analysis and Animation in the Context of Human Computer Interaction", IEEE Signal Processing Magazine, Special Issue on Immersive Interactive Technologies, 2001
[Yamada, 1999] Yamada, T., Hirose, M., Ogi, T., and Tamagawa, K., "Development of Stereo Video Avatar in Networked Immersive Projection Environment", Proceedings of the 1999 International Conference on Image Processing (ICIP '99), Kobe, Japan, October 24-28, 1999, IEEE Computer Society, ISBN 0-7803-5467-2, Volume III