From: AAAI Technical Report WS-98-09. Compilation copyright © 1998, AAAI (www.aaai.org). All rights reserved.
Task-Oriented Dialogs with Animated Agents in Virtual Reality

Jeff Rickel and W. Lewis Johnson
Information Sciences Institute & Computer Science Department
University of Southern California
4676 Admiralty Way, Marina del Rey, CA 90292-6695
rickel@isi.edu, johnson@isi.edu
http://www.isi.edu/isd/VET/vet.html
Abstract

We are working towards animated agents that can carry on tutorial, task-oriented dialogs with human students. The agent's objective is to help students learn to perform physical, procedural tasks, such as operating and maintaining equipment. Although most research on such dialogs has focused on verbal communication, nonverbal communication can play many important roles as well. To allow a wide variety of interactions, the student and our agent cohabit a three-dimensional, interactive, simulated mock-up of the student's work environment. The agent, Steve, can generate and recognize speech, demonstrate actions, use gaze and gestures, answer questions, adapt domain procedures to unexpected events, and remember past actions. This paper gives a brief overview of Steve's methods for generating multi-modal behavior, contrasting our work with prior work in task-oriented dialogs and multi-modal explanation generation.
Introduction

We are working towards animated agents that can carry on tutorial, task-oriented dialogs with human students. The agent's objective is to help students learn to perform physical, procedural tasks, such as operating and maintaining equipment. Thus, like most earlier research on task-oriented dialogs, the agent (computer) serves as an expert that can provide guidance to a human novice. Research on such dialogs dates back more than twenty years (Deutsch 1974), and this subject remains an active research area (Allen et al. 1996). However, the vast bulk of this research has focused solely on verbal dialogs, even though the earliest studies clearly showed the ubiquity of nonverbal communication in human task-oriented dialogs (Deutsch 1974). To allow a wider variety of interactions among agents and human students, we use virtual reality; agents and students cohabit a three-dimensional, interactive, simulated mock-up of the student's work environment.
Virtual reality offers a rich environment for multi-modal interaction among agents and humans. Like standard desktop dialog systems, agents can communicate with humans via speech, using text-to-speech and speech recognition software. (We currently use commercial products from Entropic for these functions.) As in previous simulation-based training systems, the behavior of the virtual world is controlled by a simulator; agents can perceive the state of the virtual world via messages from the simulator, and they can take action in the world by sending messages to the simulator. However, an animated agent that cohabits a virtual world with students has a distinct advantage over previous disembodied tutors: the agent can additionally communicate nonverbally using gestures, gaze, facial expressions, and locomotion. Students also have more freedom; they can move around the virtual world, gaze around (via a head-mounted display), and interact with objects (e.g., via a data glove). Moreover, agents can perceive these human actions; virtual reality software can inform agents of the location (in x-y-z coordinates), field of view (i.e., visible objects), and actions of humans. Thus, virtual reality is an important application area for multi-modal dialog research because it allows more human-like interactions among synthetic agents and humans than desktop interfaces can.
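To make the kinds of messages just described concrete, the sketch below shows one way an agent might represent them as typed percepts. It is purely illustrative: the class names, fields, and example values are our own stand-ins, not the actual message formats used by Steve's simulator, virtual reality, or speech software.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical percept types; the real message formats are not specified here.

@dataclass
class SimulationEvent:
    """From the simulator: an attribute of the virtual world changed."""
    attribute: str              # e.g., "power_light"
    value: str                  # e.g., "on"

@dataclass
class StudentAction:
    """From the virtual reality software: the student manipulated an object."""
    action: str                 # e.g., "press"
    target: str                 # e.g., "function_test_button"

@dataclass
class StudentView:
    """From the virtual reality software: the student's position and field of view."""
    position: Tuple[float, float, float]
    visible_objects: List[str]

@dataclass
class SpeechAct:
    """From the speech recognizer: a student utterance mapped to a request or question."""
    act: str                    # e.g., "ask_what_next"

def describe(percept) -> str:
    """Trivial consumer that renders a percept for a log."""
    if isinstance(percept, SimulationEvent):
        return f"world: {percept.attribute} = {percept.value}"
    if isinstance(percept, StudentAction):
        return f"student did: {percept.action} {percept.target}"
    if isinstance(percept, StudentView):
        return f"student sees: {', '.join(percept.visible_objects)}"
    return f"student said: {percept.act}"

if __name__ == "__main__":
    for p in (SimulationEvent("power_light", "on"),
              StudentView((1.2, 0.0, 3.4), ["dipstick", "power_light"]),
              SpeechAct("ask_what_next")):
        print(describe(p))
```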
Although practically ignored until recently, nonverbal communication can play many important roles in task-oriented tutorial dialogs. The agent can demonstrate how to perform actions (Rickel & Johnson 1997a). It can use locomotion, gaze, and deictic gestures to focus the student's attention (Lester et al. 1998; Noma & Badler 1997; Rickel & Johnson 1997a). It can use gaze to regulate turn-taking in a mixed-initiative dialog (Cassell et al. 1994). Head nods and facial expressions can provide unobtrusive feedback on the student's utterances and actions without unnecessarily disrupting the student's train of thought. All of these nonverbal devices are a natural component of human dialogs. Moreover, the mere presence of a life-like agent may increase the student's arousal and motivation to perform the task well (Lester et al. 1997; Walker, Sproull, & Subramani 1994).
To explore the use of animated agents for tutorial, task-oriented dialogs, we have designed such an agent: Steve (Soar Training Expert for Virtual Environments). Steve is fully implemented and integrated with the other software components on which it relies (i.e., virtual reality software, a simulator, and commercial speech recognition and text-to-speech products). We have tested Steve on a variety of naval operating procedures; it can teach students how to operate several consoles that control the engines aboard naval ships, as well as how to perform an inspection of the air compressors on these engines. Moreover, Steve is not limited to this domain; it can provide instruction in a new domain given only the appropriate declarative domain knowledge.
Steve's Capabilities

To illustrate Steve's capabilities, suppose Steve is demonstrating how to inspect a high-pressure air compressor aboard a ship. The student's head-mounted display gives her a three-dimensional view of her shipboard surroundings, which include the compressor in front of her and Steve at her side. As she moves or turns her head, her view changes accordingly. Her head-mounted display is equipped with a microphone to allow her to speak to Steve.
After introducing the task, Steve begins the demonstration. "I will now check the oil level," Steve says, and he moves over to the dipstick. Steve looks down at the dipstick, points at it, looks back at the student, and says "First, pull out the dipstick." Steve pulls it out (see Figure 1). Pointing at the level indicator, Steve says "Now we can check the oil level on the dipstick. As you can see, the oil level is normal." To finish the subtask, Steve says "Next, insert the dipstick" and he pushes it back in.

Figure 1: Steve pulling out a dipstick

Continuing the demonstration, Steve says "Make sure all the cut-out valves are open." Looking at the cut-out valves, Steve sees that all of them are already open except one. Pointing to it, he says "Open cut-out valve three," and he opens it.

Next, Steve says "I will now perform a functional test of the drain alarm light. First, check that the drain monitor is on. As you can see, the power light is illuminated, so the monitor is on" (see Figure 2). The student, realizing that she has seen this procedure before, says "Let me finish." Steve acknowledges that she can finish the task, and he shifts to monitoring her performance.

Figure 2: Steve describing a power light

The student steps forward to the relevant part of the compressor, but is unsure of what to do first. "What should I do next?" she asks. Steve replies "I suggest that you press the function test button." The student asks "Why?" Steve replies "That action is relevant because we want the drain monitor in test mode." The student, wondering why the drain monitor should be in test mode, asks "Why?" again. Steve replies "That goal is relevant because it will allow us to check the alarm light." Finally, the student understands, but she is unsure which button is the function test button. "Show me how to do it," she requests. Steve moves to the function test button and pushes it (see Figure 3). The alarm light comes on, indicating to Steve and the student that it is functioning properly. Now the student recalls that she must extinguish the alarm light, but she pushes the wrong button, causing a different alarm light to illuminate. Flustered, she asks Steve "What should I do next?" Steve responds "I suggest that you press the reset button on the temperature monitor." She presses the reset button to extinguish the second alarm light, then presses the correct button to extinguish the first alarm light. Steve looks at her and says "That completes the task. Any questions?"

Figure 3: Steve pressing a button

The student only has one question. She asks Steve why he opened the cut-out valve. "That action was relevant because I wanted to dampen oscillation of the stage three gauge," he replies. (Unlike all other communication between the student and Steve, such after-action review questions are posed via a desktop menu rather than speech: Steve generates menu items for all the actions he performed, and the student simply selects one. A speech interface for after-action review would require more sophisticated speech understanding.)
This example illustrates a number of Steve's capabilities. It can generate and recognize speech, demonstrate actions, use gaze and gestures, answer questions, adapt domain procedures to unexpected events, and remember past actions.

The remainder of the paper provides a brief overview of Steve's methods for generating multi-modal communicative acts. For more technical details on this and other aspects of Steve's capabilities, as well as a longer discussion of related work, see (Johnson et al. 1998) and (Rickel & Johnson 1998).
Generating Multi-Modal Behavior
Like many other autonomous agents that deal with a real or simulated world, Steve consists of two components: the first, implemented in Soar (Laird, Newell, & Rosenbloom 1987), handles high-level cognitive processing, and the second handles sensorimotor processing. The cognitive component interprets the state of the virtual world, constructs and carries out plans to achieve goals, and decides how to interact with the student. The sensorimotor component serves as Steve's interface to the virtual world, allowing the cognitive component to perceive the state of the world and cause changes in it. It monitors messages from the simulator describing changes in the state of the world, from the virtual reality software describing actions taken by the student and the student's position and field of view, and from speech recognition software describing the student's requests and questions posed to Steve. (Steve does not currently incorporate any natural language understanding; it simply maps predefined phrases to speech acts.) The sensorimotor module sends messages to the simulator to take action in the world, to text-to-speech software to generate speech (Steve's natural language generation is currently done using text templates), and to the virtual reality software to control Steve's animated body.
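As a rough illustration of this division of labor, the following sketch separates a cognitive component from a sensorimotor component that exchanges messages with the outside software. All class names, message sources, and payloads are invented for this example; Steve's actual cognitive component is a Soar system, not a Python class.

```python
import queue

class Sensorimotor:
    """Illustrative stand-in for Steve's interface to the simulator, VR, and speech software."""
    def __init__(self):
        self.percepts = queue.Queue()   # in: simulator, VR software, speech recognizer
        self.commands = queue.Queue()   # out: simulator, text-to-speech, body animation

    def receive(self, source, payload):
        # Called by the (stubbed) external software when something happens.
        self.percepts.put((source, payload))

    def issue(self, channel, payload):
        # Called by the cognitive component to act on the world or address the student.
        self.commands.put((channel, payload))

class Cognitive:
    """Illustrative stand-in for the Soar-based component: interprets percepts, chooses acts."""
    def __init__(self, sensorimotor):
        self.sm = sensorimotor
        self.world = {}                 # current beliefs about the state of the virtual world

    def step(self):
        while not self.sm.percepts.empty():
            source, payload = self.sm.percepts.get()
            if source == "simulator":
                attribute, value = payload
                self.world[attribute] = value
            elif source == "speech" and payload == "ask_what_next":
                self.sm.issue("tts", "I suggest that you press the function test button.")

if __name__ == "__main__":
    sm = Sensorimotor()
    agent = Cognitive(sm)
    sm.receive("simulator", ("power_light", "on"))
    sm.receive("speech", "ask_what_next")
    agent.step()
    print(agent.world)
    print(sm.commands.get())
```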
Steve's high-level behavior is guided by three primary types of knowledge: a model of the current task, Steve's current plan for completing the task, and a representation of who has the task initiative. Steve's model of a task is encoded in a hierarchical partial-order plan representation, which it generates automatically using task decomposition planning (Sacerdoti 1977) from its declarative domain knowledge. As the task proceeds, Steve uses the task model to maintain a plan for how to complete the task, using a variant of partial-order planning techniques (Weld 1994). Finally, it maintains a record of whether Steve or the student is currently responsible for completing the task; this task initiative can change during the course of the
task at the request of the student.
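A minimal sketch of this kind of partial-order task knowledge appears below. The encoding (a flat set of steps with ordering constraints, a completion record, and an initiative flag) is a deliberate simplification we invented for illustration; it omits the hierarchy and causal links of Steve's actual plan representation.

```python
from dataclasses import dataclass, field

@dataclass
class TaskModel:
    """Simplified task model: steps, ordering constraints, progress, and initiative."""
    steps: set
    before: set = field(default_factory=set)   # (a, b): step a must precede step b
    done: set = field(default_factory=set)
    initiative: str = "agent"                   # "agent" or "student"; may change on request

    def eligible(self):
        """Steps whose predecessors are all complete. A partial order typically leaves
        several candidates open, so the agent still needs a policy (such as the focus
        stack described below) to present them coherently."""
        return {s for s in self.steps - self.done
                if all(a in self.done for (a, b) in self.before if b == s)}

    def complete(self, step):
        self.done.add(step)

if __name__ == "__main__":
    oil_check = TaskModel(
        steps={"pull_out_dipstick", "check_oil_level", "insert_dipstick"},
        before={("pull_out_dipstick", "check_oil_level"),
                ("check_oil_level", "insert_dipstick")})
    print(oil_check.eligible())              # only the dipstick can be pulled out first
    oil_check.complete("pull_out_dipstick")
    print(oil_check.eligible())              # now the oil level can be checked
```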
When Steve has the task initiative, its role is to demonstrate how to perform the task. In this role, it follows its plan for completing the task, demonstrating each step. Because its plan provides only a partial order over task steps, Steve uses a discourse focus stack (Grosz & Sidner 1986) to ensure the global coherence of the demonstration. The focus stack also allows Steve to recognize digressions and resume the prior demonstration when unexpected events require a temporary deviation from the usual order of task steps.
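The sketch below shows the basic push/pop behavior such a focus stack needs for handling a digression and resuming the interrupted demonstration. It is a bare-bones illustration; the items pushed here are just strings standing in for whatever Steve actually tracks.

```python
class FocusStack:
    """Bare-bones discourse focus stack: the current focus is on top."""
    def __init__(self):
        self._stack = []

    def push(self, focus):
        self._stack.append(focus)

    def pop(self):
        return self._stack.pop()

    def current(self):
        return self._stack[-1] if self._stack else None

if __name__ == "__main__":
    focus = FocusStack()
    focus.push("functional test of the drain alarm light")
    # An unexpected event forces a temporary digression...
    focus.push("extinguish the temperature alarm")
    print("digressing to:", focus.current())
    # ...and popping it lets the agent resume the interrupted demonstration.
    focus.pop()
    print("resuming:", focus.current())
```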
Most of Steve's multi-modal communicative behavior arises when demonstrating a primitive task step (i.e., an action in the simulated world). For example, to demonstrate an object manipulation action, Steve would typically proceed as follows:

1. First, Steve moves to the location of the object it needs to manipulate by sending a locomotion motor command, along with the object to which it wants to move. Then, it waits for perceptual information to indicate that the body has arrived.

2. Once Steve arrives at the desired object, it explains what it is going to do. This involves describing the step while pointing to the object to be manipulated. To describe the step, Steve outputs a speech specification with three pieces of information:

   - the name of the step - this will be used to retrieve the associated text fragment
   - whether Steve has already demonstrated this step - this allows Steve to acknowledge the repetition, as well as choose between a concise or verbose verbal description
   - a rhetorical relation indicating the relation in the task model between this step and the last one Steve demonstrated - this is used to generate an appropriate cue phrase (Grosz & Sidner 1986; Moore 1993)

   Once Steve sends the motor command to generate the speech, it waits for an event from the sensorimotor component indicating that the speech is complete.

3. When the speech is complete, Steve performs the task step. This is done by sending an appropriate motor command and waiting for evidence in its perception that the command was executed. For example, if it sends a motor command to press button1, it waits for a message from the simulator indicating the resulting state: button1_state depressed.

4. If appropriate, Steve explains the results of the action, using appropriate text fragments and pointing gestures.
This sequence of events in demonstrating an action is not hardwired into Steve. Rather, Steve has a class hierarchy of action types (e.g., manipulate an object, move an object, check a condition), and each type of action is associated with an appropriate suite of communicative acts. Each suite is essentially an augmented transition network represented as Soar productions. By representing a suite as an ATN rather than a fixed plan, Steve's demonstration of an action can be more reactive and adaptive.
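To make the idea concrete, here is a loose analogue of such a suite for an object-manipulation step, written as a Python generator rather than an ATN of Soar productions. The speech-specification fields mirror the three pieces of information listed above; everything else (names, events, the generator protocol) is our own simplification, not Steve's implementation.

```python
from dataclasses import dataclass

@dataclass
class SpeechSpec:
    """The three pieces of information in the speech specification described above."""
    step_name: str              # used to retrieve the associated text fragment
    already_demonstrated: bool  # acknowledge repetition; choose concise vs. verbose wording
    rhetorical_relation: str    # relation to the previous step, used for a cue phrase

def demonstrate_manipulation(step_name, target, repeated, relation):
    """Yield motor/speech commands one at a time; the caller resumes the generator
    only after perceiving that the previous command took effect (arrival, end of
    speech, or the expected change in the simulated world)."""
    yield ("move_to", target)
    yield ("point_at", target)
    yield ("speak", SpeechSpec(step_name, repeated, relation))
    yield ("manipulate", target)
    yield ("describe_result", target)

if __name__ == "__main__":
    suite = demonstrate_manipulation("press-function-test", "function_test_button",
                                     repeated=False, relation="next-step")
    for command in suite:       # a real controller would wait on percepts between yields
        print(command)
```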
Steve does not yet make extensive use of multi-modal input. However, Steve's demonstrations are sensitive to the student's field of view. When Steve references an object and points to it, it checks whether the object is in the student's field of view. If not, Steve says "Look over here!" and waits until the student is looking before proceeding with the demonstration.
When the student has the task initiative, Steve's primary role is to answer questions and evaluate the student's actions. Steve's answers are currently just verbal. When evaluating the student's actions, Steve accompanies negative feedback with a shake of its head, and provides positive feedback on correct actions only nonverbally, by nodding its head. Our rationale is that such positive feedback should be as unobtrusive as possible, to avoid disrupting the student, and we expect verbal comments to be more disruptive.
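The resulting feedback policy can be summarized in a few lines. In this sketch the behavior names and the wording of the verbal comment are invented, but the split between an unobtrusive nod for correct actions and a head shake accompanying verbal negative feedback follows the description above.

```python
def feedback(action_correct: bool):
    """Return the communicative acts a Steve-like tutor might use for a student action."""
    if action_correct:
        return [("gesture", "nod")]                      # unobtrusive positive feedback only
    return [("gesture", "shake_head"),                   # head shake accompanies...
            ("speak", "That was not the right step.")]   # ...a (hypothetical) verbal comment

if __name__ == "__main__":
    print(feedback(True))
    print(feedback(False))
```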
Discussion
Our work on multi-modal communicative behavior is still in its early stages. Nonetheless, it is informative to compare Steve's methods to previous work in multi-modal explanation generation. Most notably, although Steve employs planning to decide how to complete a task, it does not employ any communicative planning. Instead, its communicative behavior is governed by augmented transition networks that are specific to different types of task steps, similar to the schemata approach to explanation generation pioneered by McKeown (McKeown 1985). In contrast, Andre et al. (Andre, Rist, & Mueller 1998) employ a standard top-down discourse planning approach to generating the communicative behavior of their animated agent, and they compile the resulting plans into finite state machines for efficient execution. The tradeoffs between these two approaches to discourse generation are well known.
In contrast to prior work in multi-modal explanation generation (Maybury 1993), which focused mainly on combining text and graphics, the issue of media allocation seems less pressing for animated agents. The decision between conveying information in text or graphics is particularly difficult because graphics can be used in many ways. In contrast, the nonverbal behavior
of an animated agent, though important, is a far less expressive medium. Therefore, nonverbal body language serves mainly to complement and enhance verbal utterances; it has less ability to replace them than graphics does. (Although see (Cassell forthcoming) for a deeper discussion of this issue.) The two areas where nonverbal actions can significantly replace verbal utterances are demonstrations and facial expressions. Demonstrating an action may be far more effective than trying to describe how to perform it, and this is perhaps the biggest advantage of an animated agent. Our work in controlling Steve's facial expressions has only recently begun, but we hope to use them to give a variety of different types of feedback to students when a verbal comment would be unnecessarily obtrusive.
To handle multi-modal input in virtual reality, the techniques of Billinghurst and Savage (Billinghurst & Savage 1996) would nicely complement Steve's current capabilities. Their agent, which is designed to train medical students to perform sinus surgery, combines natural language understanding and gesture recognition. They parse both types of input into a single representation, and their early results confirm the intuitive advantages of multi-modal input: (1) different types of communication are simpler in one or the other mode, and (2) in cases where either mode alone would be ambiguous, the combination can help disambiguate.
Acknowledgments

This work is funded by the Office of Naval Research, grant N00014-95-C-0179. We are grateful for the contributions of our many collaborators: Randy Stiles and his colleagues at Lockheed Martin; Allen Munro and his colleagues at Behavioral Technologies Laboratory; and Richard Angros, Ben Moore, Behnam Salemi, Erin Shaw, and Marcus Thiebaux at ISI.
References

Allen, J. F.; Miller, B. W.; Ringger, E. K.; and Sikorski, T. 1996. Robust understanding in a dialogue system. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 62-70.

Andre, E.; Rist, T.; and Mueller, J. 1998. Integrating reactive and scripted behaviors in a life-like presentation agent. In Proceedings of the Second International Conference on Autonomous Agents. ACM Press.

Billinghurst, M., and Savage, J. 1996. Adding intelligence to the interface. In Proceedings of the IEEE Virtual Reality Annual International Symposium (VRAIS '96), 168-175. Los Alamitos, CA: IEEE Computer Society Press.

Cassell, J.; Pelachaud, C.; Badler, N.; Steedman, M.; Achorn, B.; Becket, T.; Douville, B.; Prevost, S.; and Stone, M. 1994. Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents. In Proceedings of ACM SIGGRAPH '94.

Cassell, J. Forthcoming. Embodied conversation: Integrating face and gesture into automatic spoken dialogue systems. In Luperfoy, S., ed., Automatic Spoken Dialogue Systems. MIT Press.

Deutsch, B. G. 1974. The structure of task oriented dialogs. In Proceedings of the IEEE Speech Symposium. Pittsburgh, PA: Carnegie-Mellon University. Also available as Stanford Research Institute Technical Note 90.

Grosz, B. J., and Sidner, C. L. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics 12(3):175-204.

Johnson, W. L.; Rickel, J.; Stiles, R.; and Munro, A. 1998. Integrating pedagogical agents into virtual environments. Presence. Forthcoming.

Laird, J. E.; Newell, A.; and Rosenbloom, P. S. 1987. Soar: An architecture for general intelligence. Artificial Intelligence 33(1):1-64.

Lester, J. C.; Converse, S. A.; Kahler, S. E.; Barlow, S. T.; Stone, B. A.; and Bhogal, R. S. 1997. The persona effect: Affective impact of animated pedagogical agents. In Proceedings of CHI '97, 359-366.

Lester, J. C.; Voerman, J. L.; Towns, S. G.; and Callaway, C. B. 1998. Deictic believability: Coordinating gesture, locomotion, and speech in lifelike pedagogical agents. Applied Artificial Intelligence. Forthcoming.

Maybury, M. T., ed. 1993. Intelligent Multimedia Interfaces. Menlo Park, CA: AAAI Press.

McKeown, K. R. 1985. Text Generation. Cambridge University Press.

Moore, J. D. 1993. What makes human explanations effective? In Proceedings of the 15th Annual Conference of the Cognitive Science Society, 131-136.

Noma, T., and Badler, N. I. 1997. A virtual human presenter. In Proceedings of the IJCAI Workshop on Animated Interface Agents: Making Them Intelligent, 45-51.

Rickel, J., and Johnson, W. L. 1997a. Integrating pedagogical capabilities in a virtual environment agent. In Proceedings of the First International Conference on Autonomous Agents. ACM Press.

Rickel, J., and Johnson, W. L. 1997b. Intelligent tutoring in virtual reality: A preliminary report. In Proceedings of the Eighth World Conference on Artificial Intelligence in Education, 294-301. IOS Press.

Rickel, J., and Johnson, W. L. 1998. Animated agents for procedural training in virtual reality: Perception, cognition, and motor control. Applied Artificial Intelligence. Forthcoming.

Sacerdoti, E. 1977. A Structure for Plans and Behavior. New York: Elsevier North-Holland.

Walker, J. H.; Sproull, L.; and Subramani, R. 1994. Using a human face in an interface. In Proceedings of CHI '94, 85-91.

Weld, D. S. 1994. An introduction to least commitment planning. AI Magazine 15(4):27-61.