IA3: Intelligent Affective Animated Agents

T.S. Huang, I. Cohen, P. Hong, Y. Li
Beckman Institute, University of Illinois at Urbana-Champaign, USA
Email: huang@ifp.uiuc.edu

Abstract

Information systems should be human-centered. The human-computer interface needs to be improved to make computers not only user-friendly but also enjoyable to interact with. Computers should be proactive and take the initiative. A step in this direction is the construction of Intelligent Affective Animated Agents (IA3). Three essential components of IA3 are: the agent needs to recognize human emotion; based on its understanding of human speech and emotional state, the agent needs to reason and decide how to respond; and the response needs to be manifested in the form of a synthetic talking face which exhibits emotion. In this paper, we describe our preliminary research results in these three areas. We believe that although challenging research issues remain, effective IA3 could be constructed in the near future for restricted domains.

1. Introduction

As we enter the age of Information Technology, it behooves us to remember that information systems should be human-centered [1]. Technology should serve people, not people technology. Thus, it is of the utmost importance to explore new and better ways of human-computer interaction, to make computers not only more user-friendly but also more enjoyable to use. The computer should be proactive, taking the initiative in asking the right questions, offering encouragement, and so on. In many applications, it is highly effective to have an embodied intelligent agent represent the computer [2]. For example, the agent can be manifested as a synthetic talking face with synthesized speech. It can exhibit emotion in its facial expression and the tone of its voice. In addition to recognizing the speech input from the human, the computer also needs to recognize the emotional and cognitive state of the human (through visual and audio sensors), so that the agent can decide on the appropriate response. In this paper, we discuss some aspects of this type of Intelligent Affective Animated Agent (IA3) and present some preliminary results of our research. In Section 2, we describe research in the automatic recognition of human emotion by analyzing facial expression and the tone of the voice. Section 3 presents our synthetic talking face model (iFACE) and discusses issues related to the use of text to drive the model. In Section 4, we offer some preliminary thoughts about intelligent dialog systems. We conclude with a few remarks in Section 5.

2. Audio/Visual Emotion Recognition

One of the main problems in trying to recognize emotions is that there is no uniform agreement about the definition of emotions. In general, it is agreed that emotions are a short-term way of expressing inner feeling, whereas moods are long term, and temperaments or personalities are very long term [3]. Emotions can be expressed in various ways: through voice, facial expressions, and other physiological means. Although there are arguments about how to interpret these physiological measurements, it is quite clear that there is a strong correlation between measurable physiological signals and the emotion of a person. In the past 20 years there has been much research on recognizing emotion through facial expressions. This research was pioneered by Ekman and Friesen [4], who approached it from the psychology perspective.
In the early 1990s the engineering community started to use these results to construct automatic methods of recognizing emotions from facial expressions in images or video [5][6][7][8][9]. Studies of vocal emotions have been conducted for over 60 years. Most recent studies [10][11][13][14] used prosodic information such as the pitch, duration, and intensity of the phrase as the features for recognizing emotion in the voice. Recognition of emotions from combined visual and audio information has recently been studied by Chen [9], Chen et al. [15], and De Silva et al. [16].

2.1 Automatic Facial Expression Recognition

The very basis of any recognition system is extracting the best features to describe the physical phenomenon. As such, categorization of the visual information revealed by facial expression is a fundamental step before any recognition of facial expressions can be achieved. First, a model of the facial muscle motion corresponding to different expressions has to be found. This model has to be generic enough to apply to most people if it is to be useful. The best known such model is the Facial Action Coding System (FACS), given in the study by Ekman and Friesen [4]. Ekman has since argued that emotions are linked directly to the facial expressions, and that there are six basic "universal facial expressions" corresponding to happiness, surprise, sadness, fear, anger, and disgust. FACS codes facial expressions as combinations of facial movements known as action units (AUs). The AUs are related to facial muscular motion and were defined based on anatomical knowledge and by studying videotapes of how the face changes its appearance when displaying expressions. Ekman and Friesen defined 46 such action units, each corresponding to an independent motion of the face.

We implemented a face-tracking algorithm based on the work of Tao and Huang [18], the Piecewise Bézier Volume Deformation (PBVD) tracker. This system was modified by Chen [9] to extract features for emotion expression recognition. The estimated motions are represented in terms of the magnitudes of some predefined AUs. These AUs are similar to those Ekman and Friesen [4] proposed, but only 12 AUs are used. Each AU corresponds to a simple deformation of the face, defined in terms of the Bézier volume control parameters. In addition to the 12 AUs, the global head motion is also determined from the motion estimation. Figure 1 shows the 12 AUs measured for emotion expression recognition, where each arrow represents the motion direction of the AU moving away from the neutral position of the face.

Figure 1. AUs extracted by the face tracker

Using the measurements of these action units, two types of classifiers were constructed. The first is a frame-based classifier [9]. The second uses the temporal information of the entire facial expression sequence. The frame-based classifier makes a decision among the seven classes (happiness, sadness, surprise, anger, fear, disgust, and neutral) for each time frame using a Sparse Network of Winnows (SNoW) classifier [17]. The SNoW classifier transforms the original AUs into a higher-dimensional feature space, after which the connections between the transformed feature nodes and the output target nodes (here, the emotion classes) are sparse. The training uses a multiplicative update rule (Winnow), in contrast to a neural network, which uses an additive update rule.
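To make the update rule concrete, the sketch below shows a toy winner-takes-all classifier over AU feature vectors trained with a Winnow-style multiplicative update. It only illustrates the multiplicative update just described, not the SNoW implementation of [17]; the class name, the expansion of the AUs into a feature vector, and the parameter values are our own assumptions.

```python
import numpy as np

class WinnowEmotionClassifier:
    """Toy winner-takes-all classifier with multiplicative (Winnow-style) updates.

    Hypothetical setup: each frame is a non-negative feature vector derived
    from the 12 AU magnitudes (e.g. expanded into indicator features)."""

    def __init__(self, n_features, n_classes=7, alpha=1.5, threshold=1.0):
        self.w = np.ones((n_classes, n_features))  # one target node per emotion class
        self.alpha = alpha                         # promotion/demotion factor
        self.threshold = threshold * n_features    # activation threshold per target

    def predict(self, x):
        # winner-takes-all: the target node with the highest score wins
        return int(np.argmax(self.w @ x))

    def update(self, x, y):
        # multiplicative update: promote weights of active features for the true
        # class if it fails to fire, demote them for classes that fire wrongly
        active = x > 0
        scores = self.w @ x
        if scores[y] < self.threshold:
            self.w[y, active] *= self.alpha
        for c in range(self.w.shape[0]):
            if c != y and scores[c] >= self.threshold:
                self.w[c, active] /= self.alpha
```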
The advantages of using SNoW are that it does not require a large amount of training data and that the sparseness of the connections between the layers yields a lower probability of error and faster processing. For testing, the output target with the highest score is the winning class ("winner-takes-all").

The second classifier is a novel multilevel hidden Markov model (HMM) architecture. The multilevel HMM both serves as a classifier of the emotion sequences and automatically segments the video into the different emotions. The architecture consists of a lower level of six emotion-specific HMMs, trained on labeled, segmented facial expression sequences, with the observations being the AU measurements from the face tracker. The state sequence of each of the six HMMs is decoded using the Viterbi algorithm, and this (six-dimensional) state sequence vector serves as the observation for the high-level HMM. The high-level HMM consists of seven states, one for each emotion plus a neutral state. The state the high-level HMM is in at each time can be interpreted as the classification of the time sequence. The high-level HMM thus performs the segmentation and the classification at the same time. Since its observation vector is the state sequence of the lower-level HMMs, it also learns the discrimination function between the six HMMs. This is the main difference between this work and the work of Otsuka and Ohya [6], who used emotion-specific HMMs but did not attempt to use a higher-level architecture to learn the discrimination between the different models.

These algorithms were tested on a database collected by Chen [9], and the first was also implemented in real time for person-dependent recognition. The subjects in the database were asked to express different emotions given different stimuli. The database consists of 100 subjects of different genders and ethnic backgrounds. It includes sequences of facial expressions only, as well as sequences of emotional speech and video. Testing on this database yielded recognition accuracies of over 90% for both methods using a person-dependent approach, and much lower accuracies of around 60-70% for a person-independent approach. We noticed that happiness and surprise are classified very well in both the person-dependent and person-independent cases, while the other emotions are greatly confused with each other, especially in the person-independent test. If the number of classes is reduced by combining disgust, anger, fear, and sadness into one "negative" class, the accuracy becomes much higher for both the person-dependent tests (about 97%) and the person-independent tests (about 90%). Figure 2 shows four examples of the real-time implementation of the first method. The label shows the recognized emotion of the user.

Figure 2: Frames from the real time facial expression recognizer

2.2 Emotion Recognition from Audio

Emotions are expressed through the voice as well as through facial expressions. In the database that was collected, the subjects were asked to read sentences while displaying and voicing the emotion. For example, a sentence displaying anger is: "Computer, this is not what I asked for, don't you ever listen?" The audio is processed on a phrase level to extract prosodic features. The features are statistics of the pitch contour and its derivative, and statistics of the RMS energy envelope and its derivative. A measure of the syllabic rate (expressing the rate of speaking) is also extracted. The features are computed over a whole phrase, since the emotion is unlikely to change very quickly within the speech. An optimal Naïve Bayes classifier is then used.
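As a sketch of this audio pipeline, the snippet below computes phrase-level prosodic statistics and feeds them to a Gaussian Naïve Bayes classifier (scikit-learn). The exact statistics and classifier details of our system are not spelled out here, so the specific choices (mean/std/min/max, the syllabic-rate estimate, and the function name) are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def prosodic_features(pitch, energy, n_syllables, duration):
    """Phrase-level prosodic feature vector: statistics of the pitch contour
    and its derivative, of the RMS energy envelope and its derivative,
    plus a syllabic-rate measure. pitch and energy are per-frame arrays."""
    def stats(x):
        return [np.mean(x), np.std(x), np.min(x), np.max(x)]
    return np.array(stats(pitch) + stats(np.diff(pitch)) +
                    stats(energy) + stats(np.diff(energy)) +
                    [n_syllables / duration])          # rate of speaking

# X: one feature vector per phrase, y: emotion label per phrase
# clf = GaussianNB().fit(X_train, y_train)
# predicted_emotions = clf.predict(X_test)
```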
The overall accuracy using this classifier was around 75% for a person-dependent test and around 58% for a person-independent test, which shows that there is useful information in the audio for recognizing emotions (pure chance is 1/7 = 14.29%).

2.3 Emotion Recognition from Combined Audio and Video

There are some inherent differences between the audio and video features. Facial expressions can change at a much faster rate than vocal emotions, which are expressed over longer stretches of a phrase or sentence. To account for these time differences, a classifier is designed for each channel, rather than a single combined classifier that has to wait for the audio to be processed. The combination of the two classifiers is handled by a system that can work in three modes: audio only, video only, and combined audio and video. The mode is set using two detectors: an audio detector recognizes whether the user is speaking, and a video detector determines whether the user is being tracked and whether the tracking of the mouth region is reliable. When the user talks, the mouth movement is very fast, so the mouth region is not reliable for expression recognition; therefore, in the combined audio and video mode, only the top region of the face is used for expression recognition.

3. Synthetic Talking Face (iFACE)

3.1 Face Modeling

We have developed a system called iFACE, which provides functionalities for face modeling, editing, and animation. A realistic 3D head model is one of the key factors in natural human-computer interaction. In recent years, researchers have been trying to combine computer vision and computer graphics techniques to build realistic head models [12][20]. In our system, we try to make the head modeling process more systematic. The whole process is nearly automatic, with only a few manual adjustments necessary. The generic face model we use is a geometric model (Figure 3(a)) that includes all the facial accessories such as eyes, teeth, and tongue. To customize the face model for a particular person, we first obtain both texture data and range data of that person by scanning his/her head with a Cyberware scanner. An example of the scanner data is shown in Figure 3(b). Thirty-five feature points (Figure 3(c)) are manually selected on the scanned data. These feature points have their correspondences on the generic geometric face model. We then fit the generic model to the person by deforming it based on the range data and the selected feature points. Manual adjustments are required where the scanned data are missing. Figure 3(d) shows an example of a customized head model with texture mapping.

Figure 3. Face Modeling: (a) the generic face model of iFACE (wireframe and shaded); (b) Cyberware scanner texture and range data; (c) the 35 feature points selected for model fitting; (d) the customized head model with texture mapping (frontal and side views).
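To illustrate the fitting step, the sketch below deforms the generic model toward the scan by interpolating the displacements of the feature points with Gaussian radial basis functions, a common scattered-data interpolation scheme. It is only a minimal sketch under that assumption: the actual iFACE fitting also uses the dense range data and manual adjustments, and the function name and kernel width are hypothetical.

```python
import numpy as np

def fit_generic_model(generic_vertices, src_feats, dst_feats, sigma=0.1):
    """Warp the generic head model so that its feature points (src_feats) move
    onto the corresponding scanned feature points (dst_feats), interpolating
    the displacement field with Gaussian RBFs.

    generic_vertices: (N, 3) vertices of the generic model
    src_feats, dst_feats: (K, 3) corresponding feature points (e.g. K = 35)"""
    def kernel(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    A = kernel(src_feats, src_feats)                 # (K, K) RBF matrix
    disp = dst_feats - src_feats                     # feature point displacements
    # solve for RBF weights; small ridge term for numerical stability
    w = np.linalg.solve(A + 1e-6 * np.eye(len(src_feats)), disp)
    return generic_vertices + kernel(generic_vertices, src_feats) @ w
```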
3.2 Text Driven Talking Face

When text is used in communication, e.g. in the context of text-based electronic chatting over the Internet, visual speech synthesized from the text can greatly help deliver the information. iFACE is able to synthesize visual speech from a text stream. The structure of the system is illustrated in Figure 4. The system uses the Microsoft Text-To-Speech (TTS) engine for text analysis and speech synthesis. First, the text is parsed into a phoneme sequence. A phoneme is a member of the set of the smallest units of speech that serve to distinguish one utterance from another in a language or dialect. Each phoneme is mapped to a viseme, a generic facial shape that serves to describe a particular sound. A phoneme-to-viseme mapping is built to find the corresponding face shape for each phoneme. From the phoneme durations, we can locate the frames at which the phonemes start. For these frames, viseme images are generated and used as key frames for animation. To synthesize the animation sequence, we adopt a key-frame-based scheme similar to [19]. The face shapes between key frames are determined by an interpolation scheme.

Figure 4. The structure of the text driven talking face: the TTS engine analyzes the text into a phoneme sequence and synthesizes the speech; the phonemes are mapped to visemes, which generate key frames, and the face model is animated by interpolation in synchrony with the played speech.

3.3 Synthesizing an Expressive Talking Head

A set of basic facial shapes is built by adjusting the control points. These basic facial shapes are similar in spirit to the Action Units of [4]. They are built so that all kinds of facial expressions can be approximated by linear combinations of them. Given a script of an expression sequence, we use the key frame technique to synthesize an expression animation sequence, such as nodding the head, blinking the eyes, or raising the eyebrows. By combining the expression script with the text, we can generate an expressive talking head.

4. Intelligent Dialog Systems

4.1 Brief Introduction to Intelligent Agents and Dialog Systems

Different types of information are exchanged between computers and users in multimodal interaction. Results of video tracking and speech recognition must be integrated by the system to produce proper responses. In order to build such a system in an orderly way, an intelligent agent should be used, and special techniques should be developed for the intelligent agent to handle dialogs.

An intelligent agent models the basic functions of the human mind. It includes a set of beliefs and goals, a world model with a set of parameter variables, a set of primitive actions it can take, a planning module, and a reasoning module. For the agent to think and act, there must be a world in which the agent exists. This agent world is a simplified model of the real world: a list of parameters, each of which reflects some property of the real world. At a specific time, the parameters take specific values and the agent is said to be in a specific world state. Like humans, the agent has its own beliefs: which world states are good, which are bad, and what kind of ideal world states the agent should pursue. Once the agent decides on its current goals, it generates a plan to achieve them. A plan is simply a sequence of primitive actions. Each primitive action has some prerequisites and causes some parameters in the world state to change. The agent also has reasoning ability: it can derive new conclusions from its current knowledge.
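The sketch below is one minimal way to encode such an agent: the world as a dictionary of parameter variables, goals as desired parameter values, and primitive actions with prerequisites and effects, tied together by a toy greedy planner. The class names and the example action are our own illustrations, not a description of our implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """Primitive action: prerequisites on world parameters and effects on them."""
    name: str
    preconditions: dict   # parameter -> required value
    effects: dict         # parameter -> value after the action

@dataclass
class Agent:
    world: dict                                   # current world state (parameter variables)
    goals: dict = field(default_factory=dict)     # desired parameter values
    actions: list = field(default_factory=list)   # primitive actions the agent can take

    def plan(self, max_steps=10):
        """Toy planner: greedily apply an applicable action whose effects move some
        goal parameter to its desired value; the plan is the resulting action list."""
        state, steps = dict(self.world), []
        for _ in range(max_steps):
            if all(state.get(k) == v for k, v in self.goals.items()):
                break
            candidates = [a for a in self.actions
                          if all(state.get(k) == v for k, v in a.preconditions.items())
                          and any(self.goals.get(k) == v for k, v in a.effects.items())]
            if not candidates:
                break
            state.update(candidates[0].effects)
            steps.append(candidates[0].name)
        return steps

# hypothetical usage: the agent asks a question when the user's goal is unknown
agent = Agent(world={"user_goal_known": False, "question_asked": False},
              goals={"user_goal_known": True},
              actions=[Action("ask_question", {"question_asked": False},
                              {"question_asked": True, "user_goal_known": True})])
print(agent.plan())   # ['ask_question']
```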
A spoken dialog system should understand what the user says and generate correct speech responses. Such a system is composed of a speech recognition engine, a syntax analysis parser, an information extraction module, a dialogue management module, and a text-to-speech (TTS) engine. In our case, the speech recognition engine is IBM ViaVoice and the TTS engine is IBM ViaVoice Outloud. The speech recognition engine transcribes sound waves into text. The syntax parser then parses the text and generates a parsing tree. The information extraction module performs semantic and pragmatic analysis and extracts the important information using the semantic clues provided by the parsing tree. However, the ability of this type of information extraction is limited. To increase the flexibility of the system, semantic networks and ideas such as finite state machines have been used to cluster related words and to model the input sentences statistically, but the improvements are very limited. After the messages are understood, correctly or not, the dialogue management module uses various pattern classification techniques to identify the user's intentions. A plan model is built for each such intention. Using these plan models, the system can decide whether to answer a question, accept a proposal, ask a question, give a proposal, or reject a proposal and give a suggestion. A response is then generated. Current systems focus on generating correct responses. This is certainly important, but it is also misleading because the understanding issue is set aside. Though a quick solution can be established without understanding, the capability of such a dialogue system is inherently limited.

4.2 Intelligent Agent with Speech Capability

In a multimodal interface, the different types of information and the complexity of the tasks require the presence of an intelligent agent. The intelligent agent handles not only results from video tracking and emotion analysis but also text strings of natural language from the speech recognizer, and it produces not only operational responses but also speech responses. Therefore, it is necessary to study ways to give an intelligent agent speech capability. Pure dialog systems also benefit from an intelligent agent approach, since intelligent agents are powerful in handling complex tasks and have the ability to learn. Most importantly, there have been many research results on various aspects of intelligent agents. However, achieving speech capability for intelligent agents is not a trivial task, and until now no research has focused on giving intelligent agents speech capability. The difficulty comes mainly from natural language understanding, which remains an unsolved problem. For an intelligent agent to act on language input, it must understand the meanings of the sentences. To understand a single sentence, a knowledge base containing the semantic information of each word in the sentence must be accessible to the agent. There have been many efforts to build universal methods of knowledge representation that can be used by computers or by an intelligent agent. However, these methods represent knowledge only in a highly abstracted symbolic language. They can be easily implemented on computers, but they are not suitable for storing and organizing the information related to a specific word. No method has been designed to represent the knowledge embedded in human speech.
In order to build a structure that can store the concept associated with each word, we propose a word concept model that represents knowledge in a "speech-friendly" way. We hope to establish a knowledge representation method that not only can store word concepts but also can be used universally by an intelligent agent for reasoning and planning.

4.3 Word Concept Model

For human beings, there is an underlying concept in one's mind associated with each word one knows. If we hear the word "book", the associated concept appears in our mind, consciously or unconsciously. We know a book is something composed of pages and made of paper. We know there are words printed in a book. We also know we can read a book for learning or relaxation. Furthermore, we know the usual range of sizes of a book, and there may be other information we know. The central idea is that the concept of "book" should include all this information. Humans understand the word "book" because they know this underlying information. Clearly, if an intelligent agent is to understand human language, it must have access to this underlying information and be able to use it to extract information from the sentences spoken by the users.

In the human mind, complex concepts are explained by simple concepts. We need to understand the concepts of "door" and "window" before we understand the concept of "room". The concept of "window" is further explained by "frame", "glass", and so on. Ultimately, all concepts are explained in terms of the input signal patterns perceived by the human senses. This layered abstraction is the key to fast processing of information: when reasoning is done, only the related information is pulled out, and lower-layer information may not appear in the reasoning process.

In our word concept model, such layered concept abstraction is also used: we build complex concepts from simple concepts. Before we can do that, we need a basic concept space on which we can define primitive concepts. For humans, the basic concept space is the set of primitive input signal patterns from the senses. For a computer, this is difficult to replicate; however, for a specific domain, it is possible to define the basic concept space by carefully studying the application scenarios. Here, we borrow ideas from previous research on concept spaces. A concept space is defined by a set of concept attributes, each of which can take discrete or continuous values. The attribute "color" can be discrete, with values such as "red" and "yellow", or continuous in terms of light wavelength; whether to use discrete or continuous values should be determined by the application. Once this concept space has been established, the primitive concepts in the system can be defined on it. Complex concepts can then be built from simple concepts, and a complex concept can also have a direct definition on the concept space where possible.

As we started to map words into this concept structure, we found it necessary to treat different types of words differently. Since the most basic type of concept refers to physical matter, we first map nouns and pronouns into the concept space. We call these words solid words. For solid words, we can find the mapping onto the concept space quite easily, since they refer to specific types of things in the physical world. There may be multiple definitions for one word, because one word can have many meanings; the agent can decide which definition to use by combining context information.
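A minimal sketch of how such word concepts might be stored is shown below: each word carries several definitions (senses), each expressed as attribute values over the concept space, and the agent picks a sense by simple overlap with the current context. The attribute names and the "book" entries are hypothetical, and a real system would use much richer context cues.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptDefinition:
    """One sense of a word, defined as values over concept-space attributes."""
    attributes: dict          # e.g. {"category": "object", "material": "paper"}

@dataclass
class WordConcept:
    word: str
    senses: list = field(default_factory=list)   # multiple definitions per word

    def resolve(self, context):
        """Pick the sense whose attributes best match the current context
        (here: simple overlap count)."""
        def overlap(sense):
            return sum(1 for k, v in sense.attributes.items() if context.get(k) == v)
        return max(self.senses, key=overlap) if self.senses else None

# hypothetical entries for a travel-domain vocabulary
book_noun = ConceptDefinition({"category": "object", "material": "paper", "use": "reading"})
book_verb = ConceptDefinition({"category": "action", "effect": "reservation-made"})
book = WordConcept("book", [book_noun, book_verb])
print(book.resolve({"category": "action"}).attributes["effect"])   # reservation-made
```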
After we have defined the solid words, other types of words are defined around the solid words instead of in the concept space, but their definitions are still closely coupled with the concept space. For verbs, we first define the positions where solid words can appear around the verb. For example, solid words can appear both before and after the verb "eat", and for some verbs more than one solid word can appear after the verb. In all cases, the positions where solid words can appear are numbered. The definition of the verb then describes how the attributes of each solid word should change. For example, if "Tom" is the solid word before "go" and "home" is the solid word after "go", then the "position" attribute of "Tom" should be set to "home", or to the location of "home". Of course, there are complex situations where this simple scheme does not fit; however, more powerful techniques can be developed using the same idea. For adjectives, definitions are described as values of attributes: "big" may be defined as "more than 2000 square feet" for the "size" attribute when it is used for houses. Obviously, multiple definitions are possible and should be resolved by context. Adverbs can be defined similarly. For prepositions, we first define a basic set of relationships, including "in", "on", "in front of", "at the left of", "include", and "belong to"; prepositions can then be defined using these basic relationships. There are other types of words, but most of the rest are grammar words and can be handled by a language parser.

At this stage, our word concept model is not yet complete. Though we have already established a basic concept space and defined a vocabulary of around two hundred words for a travel-related content-based image retrieval database, we expect to encounter many practical issues when we start to build the dialogue system. We must meet the challenge of building a robust information extraction module and an intelligent agent that can reason on the word concept model we have described. We also plan to build a core system that can learn new words from the user. We hope to have a small system that can grow into a powerful one through interaction with humans.

5. Concluding Remarks

In this paper we have described our preliminary research results on some essential aspects of Intelligent Affective Animated Agents, namely: (1) automatic recognition of human emotion from a combination of visual analysis of facial expression and audio analysis of speech prosody; (2) construction of a synthetic talking face and how it can be driven by text and an emotion script; and (3) intelligent dialog systems. Challenging research issues remain in all three areas. However, we believe that in constrained domains, effective Intelligent Affective Animated Agents could be built in the near future.

6. References

[1] NSF Workshops on "Human-Centered Systems: Information, Interaction, and Intelligence," organized by J.L. Flanagan and T.S. Huang, 1997. Final reports available at http://www.ifp.uiuc.edu/nsfhcs/
[2] J. Cassell, J. Sullivan, S. Prevost, and E. Churchill (eds.), Embodied Conversational Agents, MIT Press, 2000.
[3] J.M. Jenkins, K. Oatley, and N.L. Stein (eds.), Human Emotions: A Reader, Malden, MA: Blackwell Publishers, 1998.
[4] P. Ekman and W.V. Friesen, Facial Action Coding System: Investigator's Guide, Palo Alto, CA: Consulting Psychologists Press, 1978.
[5] K. Mase, "Recognition of facial expression from optical flow," IEICE Transactions, vol. E74, pp. 3474-3483, October 1991.
[6] T. Otsuka and J. Ohya, "Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences," in Proc. Int. Conf. on Image Processing (ICIP-97), Santa Barbara, CA, USA, Oct. 26-29, 1997, pp. 546-549.
[7] Y. Yacoob and L. Davis, "Recognizing human facial expressions from long image sequences using optical flow," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, pp. 636-642, June 1996.
[8] M. Rosenblum, Y. Yacoob, and L. Davis, "Human expression recognition from motion using a radial basis function network architecture," IEEE Transactions on Neural Networks, vol. 7, pp. 1121-1138, September 1996.
[9] L.S. Chen, "Joint processing of audio-visual information for the recognition of emotional expressions in human-computer interaction," PhD dissertation, University of Illinois at Urbana-Champaign, Dept. of Electrical Engineering, 2000.
[10] R. Cowie and E. Douglas-Cowie, "Automatic statistical analysis of the signal and prosodic signs of emotion in speech," in Proc. International Conf. on Spoken Language Processing 1996, Philadelphia, PA, USA, October 3-6, 1996, pp. 1989-1992.
[11] F. Dellaert, T. Polzin, and A. Waibel, "Recognizing emotion in speech," in Proc. International Conf. on Spoken Language Processing 1996, Philadelphia, PA, USA, October 3-6, 1996, pp. 1970-1973.
[12] B. Guenter, C. Grimm, D. Wood, et al., "Making faces," in Proc. SIGGRAPH '98, 1998.
[13] T. Johnstone, "Emotional speech elicited using computer games," in Proc. International Conf. on Spoken Language Processing 1996, Philadelphia, PA, USA, October 3-6, 1996, pp. 1985-1988.
[14] J. Sato and S. Morishima, "Emotion modeling in speech production using emotion space," in Proc. IEEE Int. Workshop on Robot and Human Communication, Tsukuba, Japan, Nov. 1996, pp. 472-477.
[15] L.S. Chen, H. Tao, T.S. Huang, T. Miyasato, and R. Nakatsu, "Emotion recognition from audiovisual information," in Proc. IEEE Workshop on Multimedia Signal Processing, Los Angeles, CA, USA, Dec. 7-9, 1998, pp. 83-88.
[16] L.C. De Silva, T. Miyasato, and R. Nakatsu, "Facial emotion recognition using multimodal information," in Proc. IEEE Int. Conf. on Information, Communications and Signal Processing (ICICS '97), Singapore, Sept. 1997, pp. 397-401.
[17] D. Roth, "Learning to resolve natural language ambiguities: A unified approach," in Proc. National Conference on Artificial Intelligence, Madison, WI, USA, 1998, pp. 806-813.
[18] H. Tao and T.S. Huang, "Bézier volume deformation model for facial animation and video tracking," in Modeling and Motion Capture Techniques for Virtual Environments, N. Magnenat-Thalmann and D. Thalmann (eds.), Springer, 1998, pp. 242-253.
[19] K. Waters and T.M. Levergood, "DECface: An automatic lip-synchronization algorithm for synthetic faces," Digital Equipment Corporation, Cambridge Research Lab, Technical Report CRL 93-4, 1994.
[20] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," in Proc. SIGGRAPH '99, 1999.