From: AAAI Technical Report WS-98-09. Compilation copyright © 1998, AAAI (www.aaai.org). All rights reserved.

Task-Oriented Dialogs with Animated Agents in Virtual Reality

Jeff Rickel and W. Lewis Johnson
Information Sciences Institute & Computer Science Department
University of Southern California
4676 Admiralty Way, Marina del Rey, CA 90292-6695
rickel@isi.edu, johnson@isi.edu
http://www.isi.edu/isd/VET/vet.html

Abstract

We are working towards animated agents that can carry on tutorial, task-oriented dialogs with human students. The agent's objective is to help students learn to perform physical, procedural tasks, such as operating and maintaining equipment. Although most research on such dialogs has focused on verbal communication, nonverbal communication can play many important roles as well. To allow a wide variety of interactions, the student and our agent cohabit a three-dimensional, interactive, simulated mock-up of the student's work environment. The agent, Steve, can generate and recognize speech, demonstrate actions, use gaze and gestures, answer questions, adapt domain procedures to unexpected events, and remember past actions. This paper gives a brief overview of Steve's methods for generating multi-modal behavior, contrasting our work with prior work in task-oriented dialogs and multi-modal explanation generation.

Introduction

We are working towards animated agents that can carry on tutorial, task-oriented dialogs with human students. The agent's objective is to help students learn to perform physical, procedural tasks, such as operating and maintaining equipment. Thus, like most earlier research on task-oriented dialogs, the agent (computer) serves as an expert that can provide guidance to a human novice. Research on such dialogs dates back more than twenty years (Deutsch 1974), and the subject remains an active research area (Allen et al. 1996). However, the vast bulk of this research has focused solely on verbal dialogs, even though the earliest studies clearly showed the ubiquity of nonverbal communication in human task-oriented dialogs (Deutsch 1974). To allow a wider variety of interactions among agents and human students, we use virtual reality; agents and students cohabit a three-dimensional, interactive, simulated mock-up of the student's work environment.

Virtual reality offers a rich environment for multi-modal interaction among agents and humans. As in standard desktop dialog systems, agents can communicate with humans via speech, using text-to-speech and speech recognition software. (We currently use commercial products from Entropic for these functions.) As in previous simulation-based training systems, the behavior of the virtual world is controlled by a simulator; agents can perceive the state of the virtual world via messages from the simulator, and they can take action in the world by sending messages to the simulator. However, an animated agent that cohabits a virtual world with students has a distinct advantage over previous disembodied tutors: the agent can additionally communicate nonverbally using gestures, gaze, facial expressions, and locomotion. Students also have more freedom; they can move around the virtual world, gaze around (via a head-mounted display), and interact with objects (e.g., via a data glove). Moreover, agents can perceive these human actions; the virtual reality software can inform agents of the location (in x-y-z coordinates), field of view (i.e., visible objects), and actions of humans.
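To make this interaction style concrete, the following is a minimal sketch, in Python, of the kinds of perception messages such an agent might receive and track. The message types and field names (WorldEvent, StudentPercept, SpeechAct, PerceptionBuffer) are our illustrative assumptions, not Steve's actual interface to the simulator and virtual reality software.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Illustrative message types only: the real agent receives messages from a
# world simulator, from the virtual reality software, and from speech
# recognition software. All names here are assumptions for this sketch.

@dataclass
class WorldEvent:
    """From the simulator: some attribute of the virtual world changed."""
    attribute: str            # e.g., "button1_state"
    value: str                # e.g., "depressed"

@dataclass
class StudentPercept:
    """From the virtual reality software: where the student is, what she sees and does."""
    position: Tuple[float, float, float]   # location in x-y-z coordinates
    visible_objects: List[str]             # field of view, as a list of visible objects
    action: Optional[str] = None           # e.g., "pulled_out dipstick"

@dataclass
class SpeechAct:
    """From speech recognition: a predefined phrase mapped to a speech act."""
    act: str                               # e.g., "ask_what_next"

class PerceptionBuffer:
    """Accumulates the agent's current picture of the world and the student."""
    def __init__(self) -> None:
        self.world_state: Dict[str, str] = {}
        self.student: Optional[StudentPercept] = None
        self.pending_speech: List[SpeechAct] = []

    def handle(self, message) -> None:
        if isinstance(message, WorldEvent):
            self.world_state[message.attribute] = message.value
        elif isinstance(message, StudentPercept):
            self.student = message
        elif isinstance(message, SpeechAct):
            self.pending_speech.append(message)
```

A cognitive component could consult such a buffer on each decision cycle and send motor messages (locomotion, gestures, speech) in return, as described later in the paper.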
Virtual reality is thus an important application area for multi-modal dialog research: it allows more human-like interactions among synthetic agents and humans than desktop interfaces can.

Although practically ignored until recently, nonverbal communication can play many important roles in task-oriented tutorial dialogs. The agent can demonstrate how to perform actions (Rickel & Johnson 1997a). It can use locomotion, gaze, and deictic gestures to focus the student's attention (Lester et al. 1998; Noma & Badler 1997; Rickel & Johnson 1997a). It can use gaze to regulate turn-taking in a mixed-initiative dialog (Cassell et al. 1994). Head nods and facial expressions can provide unobtrusive feedback on the student's utterances and actions without unnecessarily disrupting the student's train of thought. All of these nonverbal devices are a natural component of human dialogs. Moreover, the mere presence of a life-like agent may increase the student's arousal and motivation to perform the task well (Lester et al. 1997; Walker, Sproull, & Subramani 1994).

To explore the use of animated agents for tutorial, task-oriented dialogs, we have designed such an agent: Steve (Soar Training Expert for Virtual Environments). Steve is fully implemented and integrated with the other software components on which it relies (i.e., virtual reality software, a simulator, and commercial speech recognition and text-to-speech products). We have tested Steve on a variety of naval operating procedures; it can teach students how to operate several consoles that control the engines aboard naval ships, as well as how to perform an inspection of the air compressors on these engines. Moreover, Steve is not limited to this domain; it can provide instruction in a new domain given only the appropriate declarative domain knowledge.

Steve's Capabilities

To illustrate Steve's capabilities, suppose Steve is demonstrating how to inspect a high-pressure air compressor aboard a ship. The student's head-mounted display gives her a three-dimensional view of her shipboard surroundings, which include the compressor in front of her and Steve at her side. As she moves or turns her head, her view changes accordingly. Her head-mounted display is equipped with a microphone to allow her to speak to Steve.

After introducing the task, Steve begins the demonstration. "I will now check the oil level," Steve says, and he moves over to the dipstick. Steve looks down at the dipstick, points at it, looks back at the student, and says "First, pull out the dipstick." Steve pulls it out (see Figure 1). Pointing at the level indicator, Steve says "Now we can check the oil level on the dipstick. As you can see, the oil level is normal." To finish the subtask, Steve says "Next, insert the dipstick," and he pushes it back in. Continuing the demonstration, Steve says "Make sure all the cut-out valves are open." Looking at the cut-out valves, Steve sees that all of them are already open except one. Pointing to it, he says "Open cut-out valve three," and he opens it. Next, Steve says "I will now perform a functional test of the drain alarm light. First, check that the drain monitor is on. As you can see, the power light is illuminated, so the monitor is on" (see Figure 2). The student, realizing that she has seen this procedure before, says "Let me finish." Steve acknowledges that she can finish the task, and he shifts to monitoring her performance. The student steps forward to the relevant part of the compressor, but is unsure of what to do first.
"What Figure 1: Steve pulling out a dipstick Figure 2: Steve describing a power light 65 This example illustrates a number of Steve’s capabilities. It can generate and recognize speech, demonstrate actions, use gaze and gestures, answer questions, adapt domain procedures to unexpected events, and remember past actions. The remainder of the paper provides a brief overview of Steve’s methods for generating multi-modal communicative acts. For more technical details on this and other aspects of Steve’s capabilities, as well as a longer discussion on related work, see (Johnson et al. 1998) and (Rickel & Johnson 1998). Generating Multi-Modal Behavior Like many other autonomous agents that deal with a real or simulated world, Steve consists of two components: the first, implemented in Soar (Laird, Newell, & Rosenbloom1987), handles high-level cognitive processing, and the second handles sensorimotor processing. The cognitive component interprets the state of the virtual world, constructs and carries out plans to achieve goals, and decides howto interact with the student. The sensorimotor component serves as Steve’s interface to the virtual world, allowing the cognitive componentto perceive the state of the world and cause changes in it. It monitors messages from the simulator describing changes in the state of the world, from the virtual reality software describing actions taken by the student and the student’s position and field of view, and from speech recognition software describing the student’s requests and questions posed to Steve. 2 The sensorimotor module sends messages to the simulator to take action in the world, to text-to-speech software to generate speech,3 and to the virtual reality software to control Steve’s animated body. Steve’s high-level behavior is guided by three primary types of knowledge: a model of the current task, Steve’s current plan for completing the task, and a representation of whohas the task initiative. Steve’s model of a task is encoded in a hierarchical partialorder plan representation, which it generates automatically using task decomposition planning (Sacerdoti 1977) from its declarative domain knowledge. As the task proceeds, Steve uses the task model to maintain a plan for how to complete the task, using a variant of partial-order planning (Weld 1994) techniques. Finally, it maintains a record of whether Steve or the student is currently responsible for completing the task; this task initiative can change during the course of the Figure 3: Steve pressing a button should I do next?" she asks. Steve replies "I suggest that you press the function test button." The student asks "Why?"Steve replies "That action is relevant because we want the drain monitor in test mode." The student, wondering why the drain monitor should be in test mode, asks "Why?"again. Steve replies "That goal is relevant because it will allow us to check the alarm light." Finally, the student understands, but she is unsure which button is the function test button. "Showme how to do it" she requests. Steve moves to the function test button and pushes it (see Figure 3). The alarm light comes on, indicating to Steve and the student that it is functioning properly. Nowthe student recalls that she must extinguish the alarm light, but she pushes the wrong button, causing a different alarm light to illuminate. Flustered, she asks Steve "What should I do next?" Steve responds "I suggest that you press the reset button on the temperature monitor." 
When Steve has the task initiative, its role is to demonstrate how to perform the task. In this role, it follows its plan for completing the task, demonstrating each step. Because its plan provides only a partial order over task steps, Steve uses a discourse focus stack (Grosz & Sidner 1986) to ensure the global coherence of the demonstration. The focus stack also allows Steve to recognize digressions and resume the prior demonstration when unexpected events require a temporary deviation from the usual order of task steps.

Most of Steve's multi-modal communicative behavior arises when demonstrating a primitive task step (i.e., an action in the simulated world). For example, to demonstrate an object manipulation action, Steve would typically proceed as follows (a code sketch of this sequence appears below):

1. First, Steve moves to the location of the object it needs to manipulate by sending a locomotion motor command, along with the object to which it wants to move. It then waits for perceptual information to indicate that its body has arrived.

2. Once Steve arrives at the desired object, it explains what it is going to do. This involves describing the step while pointing at the object to be manipulated. To describe the step, Steve outputs a speech specification with three pieces of information:
- the name of the step, which is used to retrieve the associated text fragment;
- whether Steve has already demonstrated this step, which allows Steve to acknowledge the repetition as well as choose between a concise or verbose verbal description;
- a rhetorical relation indicating the relation in the task model between this step and the last one Steve demonstrated, which is used to generate an appropriate cue phrase (Grosz & Sidner 1986; Moore 1993).
Once Steve sends the motor command to generate the speech, it waits for an event from the sensorimotor component indicating that the speech is complete.

3. When the speech is complete, Steve performs the task step. This is done by sending an appropriate motor command and waiting for evidence in its perception that the command was executed. For example, if it sends a motor command to press button1, it waits for a message from the simulator indicating the resulting state: button1_state depressed.

4. If appropriate, Steve explains the results of the action, using appropriate text fragments and pointing gestures.

This sequence of events in demonstrating an action is not hardwired into Steve. Rather, Steve has a class hierarchy of action types (e.g., manipulate an object, move an object, check a condition), and each type of action is associated with an appropriate suite of communicative acts. Each suite is essentially an augmented transition network (ATN) represented as Soar productions. By representing a suite as an ATN rather than a fixed plan, Steve's demonstration of an action can be more reactive and adaptive.

Steve does not yet make extensive use of multi-modal input. However, Steve's demonstrations are sensitive to the student's field of view. When Steve references an object and points to it, it checks whether the object is in the student's field of view. If not, Steve says "Look over here!" and waits until the student is looking before proceeding with the demonstration.
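The sketch below renders the four-step sequence above as straight-line code, with the field-of-view check folded in. The helper object io and its methods (send_motor, wait_for, perceive, speech_done), as well as the fields assumed on step, are hypothetical stand-ins for Steve's sensorimotor interface; the real suites are augmented transition networks encoded as Soar productions.

```python
def demonstrate_manipulation(step, last_step, already_demonstrated, student, io):
    """Illustrative walk-through of one object-manipulation demonstration.
    'step' is assumed to carry the target object, the motor action, and the
    expected resulting world state; none of these names come from Steve itself."""

    # 1. Move to the object and wait for perception that the body has arrived.
    io.send_motor("locomote", target=step.object)
    io.wait_for(lambda: io.perceive("body_location") == step.object)

    # Before pointing, make sure the object is in the student's field of view.
    if step.object not in student.visible_objects:
        io.send_motor("speak", text="Look over here!")
        io.wait_for(lambda: step.object in student.visible_objects)

    # 2. Point at the object and describe the step. The speech specification
    #    names the step, flags repetition, and gives a rhetorical relation.
    io.send_motor("point", target=step.object)
    io.send_motor("speak", spec={
        "step": step.name,
        "repeated": step.name in already_demonstrated,
        "relation": rhetorical_relation(step, last_step),
    })
    io.wait_for(io.speech_done)

    # 3. Perform the step and wait for evidence that it was executed,
    #    e.g., button1_state = depressed after pressing button1.
    io.send_motor(step.action, target=step.object)
    io.wait_for(lambda: io.perceive(step.expected_attribute) == step.expected_value)

    # 4. If appropriate, explain the results of the action.
    if step.explain_results:
        io.send_motor("point", target=step.object)
        io.send_motor("speak", spec={"result_of": step.name})


def rhetorical_relation(step, last_step):
    """Placeholder for looking up, in the task model, how this step relates to
    the previously demonstrated one; used to choose an appropriate cue phrase."""
    return "first" if last_step is None else "next"
```

Encoding each suite as an ATN rather than as straight-line code like this is what lets Steve interrupt, adapt, and resume a demonstration at any of these points rather than replay a fixed script.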
When the student has the task initiative, Steve's primary role is to answer questions and evaluate the student's actions. Steve's answers are currently just verbal. When evaluating the student's actions, Steve accompanies negative feedback with a shake of its head, and provides positive feedback on correct actions only nonverbally, by nodding its head. Our rationale is that such positive feedback should be as unobtrusive as possible, to avoid disrupting the student, and we expect verbal comments to be more disruptive.
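A minimal sketch of that feedback policy follows; the io helper and the verbal phrasing are assumptions for this sketch rather than Steve's actual behavior.

```python
def give_feedback(action_correct: bool, io) -> None:
    """Unobtrusive positive feedback, more explicit negative feedback."""
    if action_correct:
        io.send_motor("nod_head")          # positive feedback stays nonverbal
    else:
        io.send_motor("shake_head")        # negative feedback is marked nonverbally...
        io.send_motor("speak",             # ...and accompanied by a verbal comment
                      text="That is not the step I would do next.")
```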
Discussion

Our work on multi-modal communicative behavior is still in its early stages. Nonetheless, it is informative to compare Steve's methods to previous work in multi-modal explanation generation. Most notably, although Steve employs planning to decide how to complete a task, it does not employ any communicative planning. Instead, its communicative behavior is governed by augmented transition networks that are specific to different types of task steps, similar to the schemata approach to explanation generation pioneered by McKeown (McKeown 1985). In contrast, Andre et al. (Andre, Rist, & Mueller 1998) employ a standard top-down discourse planning approach to generating the communicative behavior of their animated agent, and they compile the resulting plans into finite state machines for efficient execution. The tradeoffs between these two approaches to discourse generation are well known.

In contrast to prior work in multi-modal explanation generation (Maybury 1993), which focused mainly on combining text and graphics, media allocation seems less of an issue for animated agents. The decision between conveying information in text or graphics is particularly difficult because graphics can be used in many ways. In contrast, the nonverbal behavior of an animated agent, though important, is a far less expressive medium. Therefore, nonverbal body language serves more to complement and enhance verbal utterances, but has less ability to replace them than graphics does. (Although see (Cassell forthcoming) for a deeper discussion of this issue.) The two areas where nonverbal actions can significantly replace verbal utterances are demonstrations and facial expressions. Demonstrating an action may be far more effective than trying to describe how to perform the action, and is perhaps the biggest advantage of an animated agent. Our work in controlling Steve's facial expressions has only recently begun, but we hope to use them to give a variety of different types of feedback to students when a verbal comment would be unnecessarily obtrusive.

To handle multi-modal input in virtual reality, the techniques of Billinghurst and Savage (Billinghurst & Savage 1996) would nicely complement Steve's current capabilities. Their agent, which is designed to train medical students to perform sinus surgery, combines natural language understanding and gesture recognition. They parse both types of input into a single representation, and their early results confirm the intuitive advantages of multi-modal input: (1) different types of communication are simpler in one or the other mode, and (2) in cases where either mode alone would be ambiguous, the combination can help disambiguate.

Acknowledgments

This work is funded by the Office of Naval Research, grant N00014-95-C-0179. We are grateful for the contributions of our many collaborators: Randy Stiles and his colleagues at Lockheed Martin; Allen Munro and his colleagues at Behavioral Technologies Laboratory; and Richard Angros, Ben Moore, Behnam Salemi, Erin Shaw, and Marcus Thiebaux at ISI.

References

Allen, J. F.; Miller, B. W.; Ringger, E. K.; and Sikorski, T. 1996. Robust understanding in a dialogue system. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 62-70.

Andre, E.; Rist, T.; and Mueller, J. 1998. Integrating reactive and scripted behaviors in a life-like presentation agent. In Proceedings of the Second International Conference on Autonomous Agents. ACM Press.

Billinghurst, M., and Savage, J. 1996. Adding intelligence to the interface. In Proceedings of the IEEE Virtual Reality Annual International Symposium (VRAIS '96), 168-175. Los Alamitos, CA: IEEE Computer Society Press.

Cassell, J.; Pelachaud, C.; Badler, N.; Steedman, M.; Achorn, B.; Becket, T.; Douville, B.; Prevost, S.; and Stone, M. 1994. Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents. In Proceedings of ACM SIGGRAPH '94.

Cassell, J. Forthcoming. Embodied conversation: Integrating face and gesture into automatic spoken dialogue systems. In Luperfoy, S., ed., Automatic Spoken Dialogue Systems. MIT Press.

Deutsch, B. G. 1974. The structure of task oriented dialogs. In Proceedings of the IEEE Speech Symposium. Pittsburgh, PA: Carnegie-Mellon University. Also available as Stanford Research Institute Technical Note 90.

Grosz, B. J., and Sidner, C. L. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics 12(3):175-204.

Johnson, W. L.; Rickel, J.; Stiles, R.; and Munro, A. 1998. Integrating pedagogical agents into virtual environments. Presence. Forthcoming.

Laird, J. E.; Newell, A.; and Rosenbloom, P. S. 1987. Soar: An architecture for general intelligence. Artificial Intelligence 33(1):1-64.

Lester, J. C.; Converse, S. A.; Kahler, S. E.; Barlow, S. T.; Stone, B. A.; and Bhogal, R. S. 1997. The persona effect: Affective impact of animated pedagogical agents. In Proceedings of CHI '97, 359-366.

Lester, J. C.; Voerman, J. L.; Towns, S. G.; and Callaway, C. B. 1998. Deictic believability: Coordinating gesture, locomotion, and speech in lifelike pedagogical agents. Applied Artificial Intelligence. Forthcoming.

Maybury, M. T., ed. 1993. Intelligent Multimedia Interfaces. Menlo Park, CA: AAAI Press.

McKeown, K. R. 1985. Text Generation. Cambridge University Press.

Moore, J. D. 1993. What makes human explanations effective? In Proceedings of the 15th Annual Conference of the Cognitive Science Society, 131-136.

Noma, T., and Badler, N. I. 1997. A virtual human presenter. In Proceedings of the IJCAI Workshop on Animated Interface Agents: Making Them Intelligent, 45-51.

Rickel, J., and Johnson, W. L. 1997a. Integrating pedagogical capabilities in a virtual environment agent. In Proceedings of the First International Conference on Autonomous Agents. ACM Press.
Rickel, J., and Johnson, W. L. 1997b. Intelligent tutoring in virtual reality: A preliminary report. In Proceedings of the Eighth World Conference on Artificial Intelligence in Education, 294-301. IOS Press.

Rickel, J., and Johnson, W. L. 1998. Animated agents for procedural training in virtual reality: Perception, cognition, and motor control. Applied Artificial Intelligence. Forthcoming.

Sacerdoti, E. 1977. A Structure for Plans and Behavior. New York: Elsevier North-Holland.

Walker, J. H.; Sproull, L.; and Subramani, R. 1994. Using a human face in an interface. In Proceedings of CHI '94, 85-91.

Weld, D. S. 1994. An introduction to least commitment planning. AI Magazine 15(4):27-61.