Action Scheduling in Humanoid Conversational Agents

by

Joey Chang

Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degrees of
Bachelor of Science in Computer Science and Engineering
and Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology

May 22, 1998

Copyright 1998 M.I.T. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 22, 1998

Certified by: Justine Cassell, AT&T Career Development Professor of Media Arts and Sciences, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses
Action Scheduling in Humanoid Agents
by
Joey Chang
Submitted to the Department of Electrical Engineering and Computer Science
on May 22, 1998 in Partial Fulfillment of the
Requirements for the Degrees of
Bachelor of Science in Computer Science and Engineering
and Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
Abstract
This paper presents an approach to action scheduling in lifelike humanoid agents. The
action scheduler constitutes a module in the agent which links the mental processes to the
physical actions of the agent. It receives requests from the agent's behavior generating
modules and schedules them to be executed at the appropriate time, handling such tasks
as speech-to-action synchronization, concurrent and overlapping behavior blending, and
behavior conflict resolution. The optimal approach to action scheduling will vary from
system to system, depending upon the characteristics of the system, since the ultimate
action scheduler would accommodate a fully functional human representation, a goal
which is out of the scope of research today. This paper presents an action scheduler for a
real-time three-dimensional multi-modal humanoid agent.
Thesis Supervisor: Justine Cassell
Title: AT&T Career Development Professor of Media Arts and Sciences
Acknowledgements
Thank you to Justine Cassell for her guidance and insight.
Thank you to Hannes Vilhjálmsson for his invaluable technical expertise and inspiration.
Thank you to the Rea Team for their hard work and devotion.
Thank you to my family for always being ready to help, whichever path I choose.
Thank you to Kae-Chy for being there for me.
And while the acknowledgements are short in description, each of you knows the great
lengths to which they extend in my heart.
Contents

1. Introduction
   1.1 Goals of This Work
   1.2 Thesis Outline
   1.3 Multi-modal Communication
   1.4 Humanoid Agents
2. Background
   2.1 The Role of the Action Scheduler
   2.2 Past Work
      2.2.1 Ymir
         2.2.1.1 The Processing Layers
         2.2.1.2 Behavior Selection
         2.2.1.3 Proper Abstraction
      2.2.2 Animated Conversations
      2.2.3 Multi-level Direction of Creatures
         2.2.3.1 A Three-Tier Architecture
         2.2.3.2 Flexibility through Modularity
         2.2.3.3 Using DOFs
         2.2.3.4 Similarities and Shortcomings
      2.2.4 Improv
         2.2.4.1 Handling Competing Actions
         2.2.4.2 Action Buffering
         2.2.4.3 Shortcomings
3. Statement of Problem
4. Architecture
   4.1 Architectural Overview
   4.2 Enabling Speech Planning
   4.3 Enabling Coarticulation
   4.4 Input Overview
   4.5 Similarities and Differences to Ymir
5. Implementation
   5.1 REA: a Multi-Modal Humanoid Conversational Agent
   5.2 A Flexible Animation Engine
   5.3 The Action Scheduler
   5.4 Speech Generation
6. Evaluation
   6.1 What Worked
   6.2 What Didn't Work
   6.3 A Comparison
7. Conclusions and Future Research
   7.1 The Validity of the Action Scheduler
   7.2 Expansions
Chapter 1
Introduction
1.1 Goals of This Work
The primary goal of this work is to examine the role of the Action Scheduler in the
development of a multi-modal humanoid agent. While there is no standard design to the
development of a multi-modal humanoid agent, most attempts at developing some
"smart" creature or agent have similarities in their decision-making process for evoking
behaviors. This work will take one such architecture for a multi-modal humanoid agent
and examine the issues that affect the Action Scheduler. It will also examine the action-evoking designs of other applications and attempt to determine the optimal method of
implementing a general multi-modal humanoid agent.
1.2 Thesis Outline
This thesis will begin with a review of the role of the Action Scheduler in a multi-modal
humanoid agent accompanied by a review of past work relevant to the Action Scheduler.
Following this will be a description of the architecture which will use this particular
Action Scheduler. Most importantly, this will include the modules which affect and are
directly affected by the Action Scheduler. The thesis will conclude with a description of the implementation of the humanoid agent and a discussion of the observations made during and after its development.
1.3 Multi-modal Communication
Humans typically engage in face-to-face communication through the use of multiple input modalities. On the lowest level, the perceptual level, these include sight and hearing. Beyond the raw perceptual layer, however, humans use social customs and behaviors evolved through generations to regulate the collaborative effort of the face-to-face exchange of information: the conversation. Some of these behaviors, such as pointing (deictics) to a requested object, are consciously performed and produce an easily recognizable purpose. However, behaviors such as backchannel listener feedback, where a listener provides feedback in the form of head movements and short utterances as a way of regulating the conversation, are not so consciously noticed as elements of the conversation. Other communicative behaviors involve the use of facial expression, gaze direction, hand gesture, head movement, and posture to maintain a smooth conversational flow.
1.4 Humanoid Agents
The continuing fast-paced advance of technological capabilities has driven a small
number of businesses and research facilities to pursue the futuristic dream of a virtual
servant. These computer-driven creatures are produced with the aim that they may fill
the duties currently filled by humans, but enjoy the advantage of sparing human
resources.
Such entities, termed agents, have encountered a rocky reception at best due
to the unfamiliarity of dealing with a software-driven agent that attempts to mimic a real-life being. Help agents such as Microsoft Word's dictating paperclip have met with wide
skepticism regarding their usefulness compared to a conventional means of help lookup.
While such applications as the Star Trek voice-interface computer are testimony
to the wide potential of agents for the facilitation of information retrieval and task
completion, the current state of research in the development of such agents is still far too
primitive to produce highly useful results. One approach to developing an effective agent
involves the development of a humanoid agent that embodies the multiple modalities of
communication inherent in human face-to-face communication. This approach strives to
create a virtual humanoid that demonstrates both the high level conscious and lower level
subconscious forms of face-to-face communicative behavior in an attempt to ease the task
of information exchange, the conversation, between the user and the agent. Such an
endeavor demands an in-depth examination of discourse research to properly design and
embed proper behaviors within a humanoid agent.
Chapter 2
Background
2.1 The Role of the Action Scheduler
The Action Scheduler's primary role involves planning what the agent will do on the
highest level.
[Thórisson 96] describes the Action Scheduler as a cerebellum that receives behavior requests from various drives within the mind of the agent. These drives represent different levels of behavior. The Action Scheduler examines the
behavior requests submitted by these other modules and performs a resource look-up to
verify the agent's ability to perform the particular action.
Such a role carries with it the responsibility of selecting the proper method to
reflect a certain behavior in light of the possibility of conflicts. If the agent needs to
scratch its head and desires to acknowledge the user by waving with the same hand, such
a conflict could be resolved by satisfying the acknowledgement
with a nod of the head
instead.
The graceful execution of actions also falls under the responsibility of the Action
Scheduler. Since the visual output of actions that are merely scheduled as is would
appear awkward in several cases, the Action Scheduler must catch and edit these cases so
they present a smooth output. Take the case where the agent points to the user and then
scratches its head thoughtfully. If these two actions occur reasonably close together, it would seem awkward for the agent to execute them as completely separate actions, complete with preparation and retraction phases, where the agent prepares for the main part of the gesture and retracts to a rest position. More graceful would be the omission of the retraction phase of the former gesture and the preparation phase of the latter gesture,
resulting in a direct transition from the outstretched arm (pointing at the user) to a scratch
of the head (in thoughtfulness). Such a blending of closely occurring gestures is known
as coarticulation.
Similarly, the reverse case, where an intermediary position is desired, also occurs
and is not immediately obvious to the execution phase unless the action scheduler can
catch it. Since three-dimensional animation engines operate by the linear translation
from one key frame to the next, a method of creating an intermediary position to alleviate
physical impossibilities, known as action buffering, becomes a necessity. For
example, if an agent needs to transition from a state where its hands are in its pockets to a
folded arm position, a direct linear translation rendering of the animation would result in
the hands seeming to pass right through the pockets and into the folded arm position.
Much more realistic would be the intermediary phase where the agent pulls its hands out
of its pockets and then moves to the folded arm position. Needless to say, such action
buffering would be crucial for most cases transitioning from a hand-in-pockets state.
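To make the action-buffering idea concrete, the following is a minimal C++ sketch; it is not drawn from the thesis implementation, and the pose names and the blocking rule are illustrative assumptions. It inserts an intermediary pose between two keyframed poses whenever a direct linear interpolation would be physically implausible.

```cpp
#include <iostream>
#include <vector>

// Hypothetical pose identifiers; a real system would use joint-angle keyframes.
enum class Pose { HandsInPockets, HandsAtSide, ArmsFolded, HandsBehindBack };

// Assumed rule: any move into or out of a "constrained" pose must pass through
// a neutral pose, except when source and target are the same.
bool isDirectTransitionBlocked(Pose from, Pose to) {
    bool fromConstrained = (from == Pose::HandsInPockets || from == Pose::HandsBehindBack);
    bool toConstrained   = (to   == Pose::HandsInPockets || to   == Pose::HandsBehindBack);
    return (fromConstrained || toConstrained) && from != to;
}

// Expand a requested transition into the keyframe sequence the Animation Engine
// should actually interpolate through (action buffering).
std::vector<Pose> bufferTransition(Pose from, Pose to) {
    if (isDirectTransitionBlocked(from, to))
        return {from, Pose::HandsAtSide, to};  // insert the intermediary rest pose
    return {from, to};
}

int main() {
    for (Pose p : bufferTransition(Pose::HandsInPockets, Pose::ArmsFolded))
        std::cout << static_cast<int>(p) << ' ';   // prints 0 1 2: pockets -> side -> folded
    std::cout << '\n';
}
```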
The responsibility of action buffering can lie in either the Action Scheduler, or the
Animation Engine, depending upon how much freedom the Animation Engine is given to
interpret action directives from the Action Scheduler. If the Animation Engine follows
strict methods of executing actions, then the Action Scheduler should resolve the proper
set of actions and instruct the Animation Engine to execute them.
If, however, the
developer chooses to allow some interpretive freedom in the Animation Engine, action
buffering could occur within the Animation Engine, with it simply reporting to the Action
Scheduler the information that the Scheduler needs, such as what joints are used and
when those joints are busy.
Another less noticeable, but drastically crucial task, is the synchronization of
speech and action. Given that the humanoid agent has the ability to speak, the Action
Scheduler must be able to negotiate the speech system and coordinate actions so that they
fall at reasonably precise moments for communicative effectiveness. This functionality
will depend largely upon the speech generation tools available. While some tools provide
specific timing information, others do not, and would demand other methods of timing
synchronization such as estimation. While the speech tools can be run by the Animation
Engine since it is a rote action, the various timing and synchronization
tasks should be
performed by the Action Scheduler so that it can manage the integration of gesture and
facial expression with the speech.
Finally, one other basic, and probably the least explored endeavor, is the
resolution of behavior conflicts. How does the conversational agent decide what to do
when the driving forces within it want more than one behavior to occur? An agent may
typically greet a user with a smile and a spoken greeting, but how does the agent handle
the acknowledgement of a second user when it is in the process of speaking with the first
user?
Such resolutions require the knowledge of which modalities are and are not
available as well as the alternatives available to the agent when a particular action request
is not executable. Sometimes, the agent won't be able to find acceptable alternatives to
express particular conflicting behaviors. In these cases, the Action Scheduler must have
the ability to decide which behaviors are of greater priority. It must then gracefully
interrupt or carry out the appropriate behaviors. The Action Scheduler represents the
final stage at which the agent can refine its desired behaviors, so it becomes crucial that
the Action Scheduler contain the intelligence to do so properly and effectively.
2.2 Past Work
A number of past works provide insight regarding the approach of an effective Action
Scheduler for a three-dimensional humanoid agent.
While some works do not
specifically target a humanoid conversational agent, all have contributions to the
development of a capable and interactive agent.
2.2.1 Ymir
The predominant humanoid conversational agent in research is Thórisson's Ymir architecture, a design for a multi-modal humanoid agent. In it, the Action Scheduler mediates between three layers, the reactive layer, the process control layer, and the content layer, each of which submits behavior requests.
2.2.1.1 The Processing Layers
The reactive layer consists of those behaviors that occur with the utmost immediacy, such
as turning the agent toward the user when the agent is addressed. This layer serves to
produce the behaviors that require little or no state knowledge and therefore the least
amount of computation. The quick turnover of a response enables the system to retain a
level of reactivity that contributes to the believability of the agent's intelligence. While
the actual decisions issued by the Reactive Layer could contain large amounts of error,
the fact that a response occurs takes precedence over whether the response is a correct
one.
The Process Control Layer manages global control of the interaction.
This
includes elements of face-to-face interaction that are common through most
conversations, such as knowing when to take the turn, when to listen because the user is
speaking, what to do when the agent does not understand the user, and how to act, depending on the agent's understanding of the user, when the user is speaking to the agent. The Process Control Layer has a handle into the activity status, inactivity status, and thresholds of state variables that affect the Reactive Layer, an important
functionality since the Reactive Layer does not have access to enough global information
to operate appropriately on an independent basis.
The third and highest level layer-the Content Layer-handles the processes that
understand the content of the user's
input and generate appropriate responses
accordingly. If the user requested to look at an object, the Content Layer would process
the content of the request and possibly spawn a behavior such as "show object."
2.2.1.2 Behavior Selection
Behavior requests are serviced in an order based upon the layer from which the request
originates. Received behavior requests are placed in a buffer that is periodically checked
for contents. If the buffer contains requests, the Action Scheduler services all Reactive
Layer requests first.
Once there are no more Reactive Layer requests, the Action
Scheduler services one Process Control Layer request.
If there are no Reactive or
Process Control Layer requests, it services a Content Layer request.
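A minimal sketch of this selection policy follows, under the assumption that each request is tagged with its originating layer; the request type and buffer below are illustrative, not Ymir's actual data structures.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

enum class Layer { Reactive, ProcessControl, Content };
struct Request { Layer layer; std::string name; };

// One pass of the layer-priority policy described above.
std::vector<Request> selectForThisCycle(std::vector<Request>& buffer) {
    std::vector<Request> chosen, remaining;
    // 1. Service every Reactive Layer request in the buffer.
    for (const Request& r : buffer)
        (r.layer == Layer::Reactive ? chosen : remaining).push_back(r);
    // 2. With no Reactive requests left, service one Process Control request, if any.
    auto pc = std::find_if(remaining.begin(), remaining.end(),
                           [](const Request& r) { return r.layer == Layer::ProcessControl; });
    if (pc != remaining.end()) {
        chosen.push_back(*pc);
        remaining.erase(pc);
    } else if (chosen.empty() && !remaining.empty()) {
        // 3. No Reactive or Process Control requests at all: service one Content request.
        chosen.push_back(remaining.front());
        remaining.erase(remaining.begin());
    }
    buffer = std::move(remaining);
    return chosen;
}

int main() {
    std::vector<Request> buffer = {{Layer::Content, "describe-house"},
                                   {Layer::Reactive, "turn-toward-user"},
                                   {Layer::ProcessControl, "take-turn"}};
    for (const Request& r : selectForThisCycle(buffer))
        std::cout << r.name << '\n';   // turn-toward-user, then take-turn
}
```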
Once a request is fulfilled, it is ballistic, in that behavior conflicts always result in the latter, the as yet unexecuted behavior, requiring an alternative or being rejected. Speech execution is performed in ballistic units roughly the size of noun or verb
phrases to accommodate a workable means of interrupting the agent mid-statement.
2.2.1.3 Proper Abstraction
The Action Scheduler in this design receives the request at the conversational
phenomenon level and translates that phenomenon into body movements, or actions.
This means that the Action Scheduler decides how to execute the particular behavior. A
behavior such as PROVIDE_FEEDBACK, which would involve the agent giving
backchannel feedback to the speaking user, would be resolved by the Action Scheduler
into several possibilities such as SAY_OK or NOD_HEAD.
The executable action
would then be chosen from the generated possibilities.
Giving such duties to the Action Scheduler makes it necessary to make it
discourse-smart, a feature that should be abstracted away from that particular module.
The complexities of humanoid conversational agent design involve a large enough set of
tasks such that the Action Scheduler should be concerned solely with receiving explicit
action requests and scheduling them, not resolving what a discourse-related behavior
should entail. Furthermore, separating the conversational phenomena-deciding module
from the action scheduling module does not incur any scope difficulties because the two
tasks, behavior generation and action scheduling, are independently executable.
Ymir's Action Scheduler does not provide feedback to the rest of the architecture,
following the belief that any necessary feedback needed by the other modules is
embodied in the user's response. This approach seems problematic when considering the
value of the agent knowing whether or not a behavior, especially one containing
information from the Content Layer, is successfully executed. If the agent plans to share
information regarding a specific object, or a goal of the user, and such an attempt fails to
execute, the knowledge of whether such a failure is due to internal (essentially invisible) or interactional shortcomings should inform the agent of whether the attempt should simply be made again or whether an alternate approach should be generated.
2.2.2 Animated Conversations
[Cassell et al. 1994] introduces a system that automatically generates and animates
conversations between multiple humanoid agents with synchronized speech, intonation,
facial expressions, and hand gestures. The system generates the phenomenon of gesture
coarticulation, in which two gestures within an utterance occur closely in time without
intermediary relaxation of the hands. Namely, the retraction phase of the former gesture
and the preparation phase of the latter gesture are replaced with a direct transfer from the
former to the latter gesture.
For instance, if the agent plans to POINT_AT(X) and utter the phrase "I would buy this item," where X is the item it plans to point at precisely when it utters the word 'this', then the agent would execute a preparation phase during some time before the word 'this' is uttered and a retraction phase for some time after the word is uttered. If, however, the agent wishes to say "I would buy this item and sell that item," calling POINT_AT(X) and POINT_AT(Y) at the words 'this' and 'that', respectively, then it would appear awkward for the agent to execute a preparation phase and a retraction phase for both POINT_AT() calls. The execution normally performed by humans involves discarding the retraction phase of POINT_AT(X) and the preparation phase of POINT_AT(Y), creating the effect of pointing to 'that' directly from having pointed at 'this'. Effectively this means that instead of the agent pointing to one object, withdrawing its hand, and pointing to the other object, the agent points to the first object, then shifts over to pointing to the other object directly. This becomes critical when
implementing closely occurring gestures.
Animated Conversations accomplishes the effect of coarticulation by keeping a
record of all pending gestures and tagging only the final gesture with a retraction phase.
This can become problematic when applying coarticulation to a humanoid conversational
agent, however, because it demands real-time generation of behaviors. Such an agent
would require a way to quickly generate or alter gesture preparation and retraction phases
to say the least.
2.2.3 Multi-level Direction of Creatures
[Blumberg, Galyean 95] present a layered architecture approach to multi-level direction
of autonomous creatures for real-time virtual environments.
2.2.3.1 A Three-Tier Architecture
Their architecture consists of a behavior (Behavior System), action (Motor Skill), and
animation (Geometry) level, separated in between by a Controller (between Behavior and
Motor) and a Degrees of Freedom (between Motor and Geometry) abstraction barrier.
The Behavior System produces general behavior directives, while the Motor Skill details
the actual actions possible and the Geometry specifies the lowest level of animation. The
Controller maps the behaviors onto appropriate actions in the Motor Skill module, and
the Degrees of Freedom constitute a resource manager, keeping track of what parts of the
Geometry are busy animating.
For example, if the Behavior System produced a canine agent behavior of
GREET, the Controller would resolve this behavior to tailor to the agent. In this case, it
might resolve the behavior into actions such as WAG-TAIL and PANT, to be sent to the
Motor Skill module. The Motor Skill module then sends the WAG-TAIL and PANT
requests through the Degrees of Freedom module, which once again resolves the action
request to tailor to the particular geometry. Such animation level directives as SWING-TAIL-CONE or OSCILLATE-TONGUE-OBJECT might be present in the Geometry to
run the animation.
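A sketch of how a behavior might be resolved down the three tiers, using the GREET example from the text; the static mapping tables below are an illustrative assumption, since the real system resolves behaviors through its Controller and Degrees of Freedom modules rather than fixed lookup tables.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Controller: maps a general behavior onto creature-specific motor skills.
std::vector<std::string> controller(const std::string& behavior) {
    static const std::map<std::string, std::vector<std::string>> table = {
        {"GREET",   {"WAG-TAIL", "PANT"}},
        {"MOVE-TO", {"WALK-TO"}}};
    auto it = table.find(behavior);
    return it != table.end() ? it->second : std::vector<std::string>{};
}

// Degrees of Freedom: maps a motor skill onto geometry-level animation directives.
std::vector<std::string> degreesOfFreedom(const std::string& motorSkill) {
    static const std::map<std::string, std::vector<std::string>> table = {
        {"WAG-TAIL", {"SWING-TAIL-CONE"}},
        {"PANT",     {"OSCILLATE-TONGUE-OBJECT"}}};
    auto it = table.find(motorSkill);
    return it != table.end() ? it->second : std::vector<std::string>{};
}

int main() {
    for (const std::string& skill : controller("GREET"))            // behavior -> motor skills
        for (const std::string& directive : degreesOfFreedom(skill)) // motor skill -> geometry
            std::cout << directive << '\n';
}
```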
2.2.3.2 Flexibility through Modularity
This architecture provides a good level of modularity in that the Controller and Motor
Skill modules can be replaced to accommodate different sets of creatures.
While a
behavior request of MOVE-TO might evoke a WALK-TO from a dog, MOVE-TO might
evoke a DRIVE-TO from a car or a TELEPORT-TO from a wizard creature. Similarly,
the Degrees of Freedom and Geometry modules dictate the animation method, and
conceivably could be replaced by any sort of animation method, given the degrees of
freedom were mapped correctly.
2.2.3.3 Using DOFs
Motor Skills utilize DOFs to produce coordinated movement by declaring which DOFs
are needed to perform the particular task. If the resources are available, it locks the
needed DOFs and performs the particular task. Most Motor Skills are "spring-loaded" in
that they are given an "on" directive to activate and, once the "on" directives are no
longer requested, will begin to move back to some neutral position and turn off within a
few time steps. This supports a sort of memory-less scheme in which only the length of
time to retain the "on" requests need be stored, as opposed to details of an "off" command. For example, if a canine agent were to initiate a WAG-TAIL action, the DOF module would seize the proper resources to wag the tail (given they are available) and initiate the wagging of the tail. Once the agent no longer wishes to wag its tail, the tail will gradually decrease its amplitude of movement, eventually turning off completely within a few time steps.
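The spring-loaded scheme can be sketched as follows; this is a simplified model in which amplitude decays linearly once "on" requests stop arriving, and the request lifetime and decay constants are invented for illustration.

```cpp
#include <iostream>

// A spring-loaded motor skill: it stays active while "on" requests keep arriving
// and decays back to a neutral, inactive state a few time steps after they stop.
class SpringLoadedSkill {
public:
    void requestOn() { onTicksRemaining_ = 3; }   // assumed: an "on" request lasts 3 steps

    void step() {
        if (onTicksRemaining_ > 0) {
            --onTicksRemaining_;
            amplitude_ = 1.0;                     // fully active while requested
        } else if (amplitude_ > 0.0) {
            amplitude_ -= 0.25;                   // decay toward the neutral position
            if (amplitude_ < 0.0) amplitude_ = 0.0;
        }
    }

    double amplitude() const { return amplitude_; }

private:
    int onTicksRemaining_ = 0;
    double amplitude_ = 0.0;
};

int main() {
    SpringLoadedSkill wagTail;
    wagTail.requestOn();                          // e.g. WAG-TAIL is requested once
    for (int t = 0; t < 8; ++t) {
        wagTail.step();
        std::cout << "t=" << t << " amplitude=" << wagTail.amplitude() << '\n';
    }
    // Output shows full amplitude while "on", then a gradual decay to zero.
}
```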
2.2.3.4 Similarities and Shortcomings
The DOF usage is similar to Thórisson's Ymir architecture in that resources are checked
and locked to initiate actions. Furthermore, actions that conflict simply cannot interrupt
ongoing actions in both systems, making them first-come-first-serve types of systems.
Both seem successful at least at the primitive level of conflict resolution. Unfortunately,
a humanoid conversational agent requires a more complex scheme to accommodate the
insertion of proper accompanying actions to the speech stream, such as time-critical
gestures and facial expressions. The system described by [Blumberg, Galyean, 95] is not
designed to undertake lengthy speech and gesture combinations which would involve the
transmission of key content. The majority of the conversational agent's actions will not
be actions that decay and eventually turn off.
They will require strict timing
specifications, such as the synchronization of a beat gesture with an emphasized word.
The agent's behaviors and actions are not driven directly by basic competing needs so
they cannot be directly linked as such. While a canine agent in the [Blumberg, Galyean,
95] system may be able to perform decision-making through primitive desires such as
HUNGER, CURIOSITY, FATIGUE, or FEAR, the conversational agent must react to
more complex driving factors which will depend upon the actions of each other. If the
conversational agent desires to sell a house and proceeds to relate information about the
house to the user, competing behaviors that would cause the agent to switch to another
task would make the behavior of the agent appear flighty or unrealistic, unless a more
complex behavior resolution and execution architecture is implemented.
2.2.4 Improv
[Perlin, Goldberg 96] present a system called Improv which enables a developer to create
real-time behavior-based animated actors. Improv utilizes an architecture based upon the
Blumberg/Galyean architecture.
2.2.4.1 Handling Competing Actions
With Improv, authors create lists of groups of designated competing actions. These
groups contain sets of actions that are physically impossible to perform simultaneously,
such as waving one's arm to greet a friend while scratching one's head with that same
arm. When two action requests fall within the same group, conflict resolution occurs by
way of a smoothly transitioning weight system. For example, if an agent is scratching its
head and then wishes to wave to a friend, it might stop scratching, wave for a bit, and then
resume scratching, all due to the smoothly transitioning weight system. At the time that
the agent waves to the friend, its WAVE weight exceeds its SCRATCH-HEAD
weight.
However, as the waving continues, the value of the weight drops below the value of the
suspended SCRATCH-HEAD action. This enables actions considered more "important"
by the author to take over conflicting ones, while allowing the conflicting ones to resume after an "important" one has completed.
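The weight-based scheme can be sketched as below; the weight curves, group contents, and easing constant are invented for illustration, since Improv authors script these rather than hard-coding them. The two actions share one competition group, and at every frame the action with the larger weight owns the arm.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct CompetingAction {
    std::string name;
    double weight = 0.0;
    double target = 0.0;   // where the weight is currently heading
};

// All actions in one group compete for the same body resources; each frame the
// weights ease toward their targets and the heaviest action wins the resources.
std::string resolveGroup(std::vector<CompetingAction>& group) {
    for (CompetingAction& a : group)
        a.weight += 0.2 * (a.target - a.weight);          // smooth transition of weights
    auto winner = std::max_element(group.begin(), group.end(),
        [](const CompetingAction& x, const CompetingAction& y) { return x.weight < y.weight; });
    return winner->name;
}

int main() {
    std::vector<CompetingAction> arm = {{"SCRATCH-HEAD", 1.0, 1.0}, {"WAVE", 0.0, 0.0}};
    arm[1].target = 2.0;                                  // a greeting raises WAVE's target weight
    for (int frame = 0; frame < 10; ++frame) {
        if (frame == 5) arm[1].target = 0.0;              // the wave winds down again
        std::cout << "frame " << frame << ": " << resolveGroup(arm) << '\n';
    }
    // SCRATCH-HEAD yields to WAVE for a few frames, then resumes as WAVE's weight drops.
}
```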
2.2.4.2 Action Buffering
Improv also performs action buffering, which handles unrealistic transitions between
actions. Since graphics typically employ a linear approach to rendering movement, some
translations of position will be physically impossible for real humans due to obstructions.
For example, moving from a "hands behind back" position to a "hands folded" position
would entail an intermediary position of "hands at side relaxed" in a linear movement
system. Improv allows authors to declare certain actions in a group to be a buffering
action for another.
2.2.4.3 Shortcomings
While Improv handles the visual display of character actions well, it falls short of serving
a humanoid conversational agent because of the simplicity of the behavior engine. The
conversational agent requires a high level of depth to encompass the understanding of the
conversation flow. Elements such as transmitted content and utterance planning cannot
be handled by systems such as Improv, a scripting tool which doesn't have the processing
power to handle heavy content-based decision-making.
Chapter 3
Statement of Problem
The task at hand involves scheduling actions that enable a three dimensional multi-modal
humanoid conversational agent to converse with a user. The specific application involves
the agent adopting the role of a real estate agent and showing the user through a virtual
house in an attempt
to sell the house.
However,
the main focus of the agent's
functionality will be its ability to hold, as realistically as possible, a conversation with the
user. Therefore, conversational behaviors and the actions that follow from them will be
the primary focus. Such actions include movements of the head, eyes, eyebrows, and
mouth for generating facial expressions and gaze behaviors; and movements of the arms
and hands for generating gesture.
The aforementioned and examined previous work contributes nicely to aspects
that would be useful in the Action Scheduler for a humanoid conversational agent.
However, individually they each lack a portion of what needs to be integrated as a whole
for the agent, obviously because the studied works were not attempting the development of a humanoid conversational agent. The closest comparison, Ymir, focused upon multi-modality, a critical feature in generating a conversational agent because of the necessity
of a rich level of input resources. However, Ymir's Action Scheduler contains a number
of elements that could be improved upon, such as the division of its tasks into two clearer
modules, separating the generation of behaviors and the execution method of those
behaviors. Another suspect assumption that Ymir makes is that the feedback loop need
only lie in the data retrieved from sensing the user. While humans do indeed receive
much feedback from others in the conversation, there are a number of realities in the
implementation of a software agent that call for the need for sender notification when an
action completes or executes. We are at least subconsciously aware of signals being sent
to observers, such as signaling the desire to speak, and their meaning and significance or
we would not engage in them. For this reason, the humanoid conversational agent should
have the capability to be aware of the execution of actions it performs itself.
My goal is to design an architecture for the Action Scheduler of a humanoid
conversational agent which will encompass all or as many of those features that are
relevant in the reviewed works. It will improve upon the Ymir architecture's design of
the Action Scheduler and attempt to incorporate features such as gesture coarticulation,
action buffering, and interruptions of actions and/or speech.
Chapter 4
Architecture
The Action Scheduler must be considered a module closely knit to the Animation Engine, the module lying at its output, within the humanoid conversational agent. While
the two modules should be abstracted away from one another, it is important for the
Action Scheduler to understand the limitations of the Animation Engine and the
functionality provided by it. For this reason, the architecture of the Action Scheduler
includes both the Action Scheduler and the Animation Engine.
On the input end of the Action Scheduler lies a Generation Module which will
resolve appropriate discourse phenomena into explicit actions and send the action
requests to the Action Scheduler, thus relieving the Action Scheduler of the task of
determining the execution of various discourse phenomena.
Dialogue will be planned as a behavior by the Generation Module, scheduled and
time-synchronized by the Action Scheduler, and initiated by the Animation Engine.
4.1 Architectural Overview
The model for the architecture of the Action Scheduler comprises a slightly redefined and
extended version of [Blumberg, Galyean 95]. An abstraction module, Degrees-of-Freedom (DOF),
separates the Action Scheduler from the Animation Engine, enabling notifications to the
Action Scheduler regarding the status of the geometry, controlled explicitly by the
Animation Engine. As requests enter the Action Scheduler, it attempts to schedule them
on an immediate basis, consulting the DOF module for current availability of motors.
Each request may carry with it a set of independent behavior requests that have been
broken down, by the generation module, into single or concurrent sets of actions.
Each
particular broken-down behavior request can have a number of alternative actions or
action sets that give the Action Scheduler some level of flexibility in the event that
conflicts occur. Should no alternatives pass the check without a conflict, the request for
that particular behavior will be discarded or queued until resources are freed.
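A compressed sketch of that scheduling step follows; the joint names, the DOF interface, and the request structure are simplified assumptions rather than the actual Rea classes.

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Degrees-of-Freedom module: tracks which joints the Animation Engine currently holds.
class DOFModule {
public:
    bool available(const std::set<std::string>& joints) const {
        for (const std::string& j : joints)
            if (busy_.count(j)) return false;
        return true;
    }
    void lock(const std::set<std::string>& joints) { busy_.insert(joints.begin(), joints.end()); }
private:
    std::set<std::string> busy_;
};

// One behavior request: a list of alternative action sets, each needing certain joints.
struct BehaviorRequest {
    std::vector<std::set<std::string>> alternatives;
    bool queueOnConflict = false;
};

enum class Outcome { Scheduled, Queued, Discarded };

Outcome schedule(const BehaviorRequest& req, DOFModule& dof) {
    for (const std::set<std::string>& actionSet : req.alternatives) {
        if (dof.available(actionSet)) {     // first non-conflicting alternative wins
            dof.lock(actionSet);
            return Outcome::Scheduled;
        }
    }
    return req.queueOnConflict ? Outcome::Queued : Outcome::Discarded;
}

int main() {
    DOFModule dof;
    dof.lock({"right_arm"});                                  // the right arm is already busy
    BehaviorRequest greet{{{"right_arm", "head"}, {"head", "face"}}, false};
    Outcome o = schedule(greet, dof);                         // falls back to the head/face set
    std::cout << (o == Outcome::Scheduled ? "scheduled" : "not scheduled") << '\n';
}
```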
4.2 Enabling Speech Planning
To this point, the architecture appears to be very similar to that of [Blumberg, Galyean 95].
However, the agent must also have the ability to plan long sets of utterances to converse
with the user. To this end, requests contain a QUEUE status bit that determines whether
the request is queued upon conflict, or whether it is discarded upon the depletion of
alternatives. This enables sets of utterances to be sent one after the other and executed in
line regardless of motor conflicts. Supplementing this feature is the ability of the request
sender to ask for an acknowledgement of execution or of completion as a part of the
action request. This would enable other modules to receive a notification of executing or
completed behaviors, useful in the event that a message is lost or never completes. Such
information would be useful on a content level where the agent should have knowledge
of whether or not it shared specific bits of data with the user.
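The acknowledgement mechanism might look like the following sketch; the callback type and request fields are assumptions, since in Rea such notifications travel back over the frame-passing network described in Section 4.4.

```cpp
#include <functional>
#include <iostream>
#include <string>

// A scheduled request can ask to be told when it starts executing and when it completes.
struct ScheduledRequest {
    std::string sender;
    std::string behavior;
    bool wantsExecutionAck = false;
    bool wantsCompletionAck = false;
};

// Notification hook: in the real system this would send a frame back to the sender module.
using Notifier = std::function<void(const std::string& sender, const std::string& message)>;

void runRequest(const ScheduledRequest& req, const Notifier& notify) {
    if (req.wantsExecutionAck)
        notify(req.sender, "EXECUTING " + req.behavior);
    // ... the Animation Engine would drive the joints and speech here ...
    if (req.wantsCompletionAck)
        notify(req.sender, "COMPLETED " + req.behavior);
}

int main() {
    Notifier printNotifier = [](const std::string& sender, const std::string& msg) {
        std::cout << "to " << sender << ": " << msg << '\n';
    };
    ScheduledRequest r{"GenerationModule", "MENTION_HOUSE_PRICE", true, true};
    runRequest(r, printNotifier);   // the sender learns the behavior was actually delivered
}
```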
4.3 Enabling Coarticulation
The Animation Engine produces the graphics by moving motors to a target position
regardless of the current position of the motors. This enables performed actions to be
flexible to a degree that they can start where they left off and not suffer from the
restriction of always having to be a particular strict start-to-end specification. In the
execution of utterances embedded with gestures, retraction phases can be tagged as such,
and discarded to make way for the preparation phase of a following gesture if the two
occur closely.
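As a sketch, the coarticulation decision reduces to comparing the gap between consecutive gestures against a threshold; the half-second threshold and the phase flags below are illustrative assumptions.

```cpp
#include <iostream>
#include <string>
#include <vector>

struct Gesture {
    std::string name;
    double strokeStart = 0.0;   // seconds, when the main (stroke) phase begins
    double strokeEnd   = 0.0;   // seconds, when the main phase ends
    bool doPreparation = true;  // move from rest into the stroke
    bool doRetraction  = true;  // return to rest after the stroke
};

// Blend gestures that occur close together: drop the first gesture's retraction and
// the second gesture's preparation so the hand moves directly between strokes.
void coarticulate(std::vector<Gesture>& gestures, double maxGapSeconds = 0.5) {
    for (size_t i = 0; i + 1 < gestures.size(); ++i) {
        if (gestures[i + 1].strokeStart - gestures[i].strokeEnd <= maxGapSeconds) {
            gestures[i].doRetraction = false;
            gestures[i + 1].doPreparation = false;
        }
    }
}

int main() {
    std::vector<Gesture> g = {{"POINT_AT(X)", 1.0, 1.4}, {"POINT_AT(Y)", 1.7, 2.1}};
    coarticulate(g);
    for (const Gesture& gest : g)
        std::cout << gest.name << " prep=" << gest.doPreparation
                  << " retract=" << gest.doRetraction << '\n';
    // POINT_AT(X) keeps its preparation but loses its retraction;
    // POINT_AT(Y) keeps its retraction but loses its preparation.
}
```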
4.4 Input Overview
The input to the Action Scheduler takes the form of a frame that is used, as well, in all
message passing for the other modules. A message frame contains several details such as
time sent, sender, receiver, and input received from agent sensors. As modules receive
incoming frames, they parse the relevant information and generate a frame to pass on to
the next module, placing within it information pertinent to the receiving module, whether
the information is passed from the first frame or generated within the sending module.
For instance, the Generation Module, which sends the Action Scheduler's input, might
receive information regarding the agent's surroundings. The module would then generate
a feasible set of reactions for the agent, package that data with the relevant information it
received in the first frame, and send it off to the Action Scheduler. The frame format
would consist of sets of types, keys, and values in an embedded format such as the
following:
(type :key1 value1 :key2 value2 ...)
where value could hold another mini-frame of the same format. This holds the possibility
for an endless set of embedded frames. Once a module receives a frame, it can parse the
frame and query the names of the types as well as check the values associated with a
particular key. An example of a frame sent to the Action Scheduler is the following:
(behavior :queued false
          :content (choice :options [ [ (eyes :state looktowards) (head :trans nod) ]
                                      [ (face :state smile) ] ]))
This frame would be enclosed within a frame that would contain other information such
as the time the frame is sent, the sender, the intended recipient, and the duration for
which the frame is valid.
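The nesting can be modelled directly; the following sketch builds a small frame and prints it in the (type :key value ...) notation. The C++ types are an illustrative assumption, not the actual message classes.

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A frame is a type name plus an ordered list of :key value slots, where a value
// is either a literal string or another (nested) frame.
struct Frame;
struct Value {
    std::string literal;              // used when child is null
    std::shared_ptr<Frame> child;     // used for a nested frame
};

struct Frame {
    std::string type;
    std::vector<std::pair<std::string, Value>> slots;
};

std::string toString(const Frame& f) {
    std::string out = "(" + f.type;
    for (const auto& slot : f.slots) {
        out += " :" + slot.first + " ";
        out += slot.second.child ? toString(*slot.second.child) : slot.second.literal;
    }
    return out + ")";
}

int main() {
    auto smile = std::make_shared<Frame>(Frame{"face", {{"state", {"smile", nullptr}}}});
    Frame behavior{"behavior", {{"queued", {"false", nullptr}},
                                {"content", {"", smile}}}};
    std::cout << toString(behavior) << '\n';
    // prints: (behavior :queued false :content (face :state smile))
}
```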
4.5 Similarities and Differences to Ymir
The proposed architecture for an Action Scheduler/Animation Engine is similar to Ymir
in a number of ways. Both utilize the upkeep of joint status to notify of conflicts. Both
also serve requests on a similar basis, using a first-come-first-serve policy. Ymir does
not, however, use any conflict resolution policies other than denying the latter request.
The proposed architecture enables actions to be queued for later execution, should the
modality currently be busy.
Ymir places a great deal of responsibility upon the Action Scheduler in terms of resolving
the behavior request from a level of intended effect to that of physical action. While
Ymir gives the Action Scheduler the responsibility of deciding the proper actions to elicit
a behavior such as ACKNOWLEDGEMENT, the proposed architecture places that
discourse-related material outside of the Action Scheduler, reasoning that action
scheduling should be the main focus, especially when the complexity is expected to grow
to large proportions. Another feature present in the proposed architecture and absent in
Ymir is the functionality of combining the speech stream and the action requests into one,
effectively enabling the timing of actions to be paired with the timing of words within the
utterance.
Overall, the proposed architecture attempts to follow the model that was
created by Ymir, but improve it in ways that can accommodate higher demands from
humanoid conversational agents in the future.
Chapter 5
Implementation
The architecture for the Action Scheduler is being implemented in a project on multi-modal conversation in a humanoid agent at the MIT Media Lab. The Action Scheduler
and Animation Engine represent the end modules of the agent design, receiving the action
requests and carrying the responsibility of formatting them into a realistically behaving
agent.
5.1 REA: a Multi-Modal Humanoid Conversational Agent
REA the Real Estate Agent has the goal of selling the user a house.
To this end, she
performs, with the user, a virtual walkthrough of a 3D representation of the targeted
house. Her interaction with the user consists of a mixed-initiative conversation where
Rea will service the user's requests, answer questions, make suggestions, and engage in
small talk. Rea's design involves five main modules apart from the Action Scheduler and
Animation Engine. The Input Manager lies at the back end of the agent design. It
collects data from various sources of input (vision cameras, speech input), and formats it
so that it can be understood at the next step. The Understanding Module examines the
data collected by the Input Manager and attempts to glean a higher level understanding of
the implications of the perceived data and the intentions of the user. The Reactive
Module generates appropriate reactive actions for the humanoid agent and passes the
request along to the Generation Module, which generates appropriate propositional
behavior for Rea, such as utterance production. The Generation Module then performs
all action requests to the Action Scheduler, which then interfaces with the Animation
Engine to produce realistic behavior in the 3D humanoid agent.
Rea occupies four computers. Two SGI O2s are utilized wholly to run the STIVE
vision system. A Hewlett-Packard is used to run speech recognition software. One SGI
Octane runs the TrueTalk speech generation software and the Action Scheduler and
Animation Engine which performs the graphical rendering.
5.2 A Flexible Animation Engine
The Animation Engine is developed with the TGS Open Inventor 3D Toolkit. Inventor's
scenegraph representation of a graphical scene enables the Animation Engine to specify a
rough skeleton of nodes which, if followed by a character developer using a higher level
3D package such as 3D Studio Max, allows VRML characters to be read in and
assembled by the Animation Engine. This allows artistic specialists to avoid the deep
coding complexities when they want to vary the physical characteristics of the characters.
Each node within the specified skeleton is then attached to a number of
Inventor engines which drive the rotation and translation along appropriate independent
axes by updating appropriate values in the graphical fields. A Ramp object is then
attached to each engine and utilized to provide the change of values over time that will
initiate the update of the fields. This variance of output values over time creates the
movement in the humanoid agent.
The Ramp object also performs a mapping of a range of input values from 0 to 1
to the appropriate possible range of movement of the agent's joints. This means that a
developer wishing to modify the physical flexibility of the character's joints need only
alter the mapping within the ramp class. Once this is done, functions to drive the ramp
through the appropriate angles or distances can be inputted with a range from 0 to 1. This
method allows the complexities of angle joints in particular to be abstracted away from
an action function designer. Instead of concerning him or herself with remembering that
the shoulder joint ranges from -3.14/2 to 3.14/4 and the elbow from 0 to 3.14/2, he or she
need only note the 0 and 1 points of the angle extremities and interpolate the desired
values.
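The normalization idea can be sketched as a small class: the developer specifies the joint's real angular limits once, and action code then works purely in the 0-to-1 range. The class name echoes the Ramp object described above, but the code is an illustrative reconstruction, not the Inventor-based implementation.

```cpp
#include <iostream>

// Maps a normalized input in [0, 1] onto a joint's real range of motion, and
// eases the joint from its current value toward a normalized target over time.
class Ramp {
public:
    Ramp(double minAngle, double maxAngle) : minAngle_(minAngle), maxAngle_(maxAngle) {}

    void setTarget(double normalized, double durationSeconds) {
        start_ = current_;
        target_ = normalized;
        duration_ = durationSeconds;
        elapsed_ = 0.0;
    }

    // Advance the ramp by dt seconds and return the joint angle to hand to the geometry.
    double step(double dt) {
        elapsed_ += dt;
        double t = duration_ > 0.0 ? elapsed_ / duration_ : 1.0;
        if (t > 1.0) t = 1.0;
        current_ = start_ + (target_ - start_) * t;                 // linear interpolation
        return minAngle_ + current_ * (maxAngle_ - minAngle_);      // 0..1 -> real angle
    }

private:
    double minAngle_, maxAngle_;
    double start_ = 0.0, current_ = 0.0, target_ = 0.0;
    double duration_ = 0.0, elapsed_ = 0.0;
};

int main() {
    Ramp elbow(0.0, 3.14 / 2);     // the elbow's real range, hidden from action designers
    elbow.setTarget(1.0, 0.5);     // "fully bent" in half a second, specified simply as 1.0
    for (int i = 0; i < 5; ++i)
        std::cout << elbow.step(0.1) << '\n';   // angles climb toward 3.14/2
}
```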
Currently, the Animation Engine is capable of executing actions which assist in
floor management between the agent and the user.
These actions include eyebrow
movement (raising and lowering), head movement (turning with six degrees of freedom
to achieve such actions as looking toward, looking away, nodding and shaking), eye
movement (looking up, down, left, and right), eyelid movement (blinking), and mouth
movement (speaking and not speaking).
5.3 The Action Scheduler
The Action Scheduler for Rea receives action requests by way of message frames sent
over a Local Area Network and placed inside a buffer. Each message frame contains a
behavior request or set of behavior requests, each of which can be served independently
but are packaged together for simplicity. Each behavior request consists of a behavior or
a set of possible behaviors (a behavior and its alternatives), each of which has been
decoded from the conversational phenomenon level to the action level before the Action
Scheduler. Each behavior request can decode from one behavior to the execution of
many actions. For instance, the Generation Module may wish to execute a GREET behavior. It would decode this to a few possibilities, each of which might consist of more than one action simultaneously. One example of a decoding might be DO(SAY("hi"), NOD, WAVE). If that conflicts, an alternative might be DO(SMILE, NOD). The Action
Scheduler examines each behavior request and checks for joint conflicts (conflict within
the geometry of the agent), checking for alternative actions upon the instance of a
conflict. If the Action Scheduler concludes that there are no non-conflicting action sets
then it discards the request. If the request's QUEUE flag has been set, then it saves the
request for later execution once the joint conflicts are freed.
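Concretely, the GREET decoding above might be represented and checked like this; the joint requirements per action and the busy-joint set are invented for illustration.

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Each alternative is a set of actions to run simultaneously; each set needs certain joints.
struct ActionSet {
    std::vector<std::string> actions;
    std::set<std::string> jointsNeeded;
};

// Pick the first alternative whose joints are all free, mirroring the GREET example:
// DO(SAY("hi"), NOD, WAVE) first, falling back to DO(SMILE, NOD) on a conflict.
const ActionSet* pickAlternative(const std::vector<ActionSet>& alternatives,
                                 const std::set<std::string>& busyJoints) {
    for (const ActionSet& alt : alternatives) {
        bool conflict = false;
        for (const std::string& joint : alt.jointsNeeded)
            if (busyJoints.count(joint)) { conflict = true; break; }
        if (!conflict) return &alt;
    }
    return nullptr;   // no alternative fits: discard, or queue if the QUEUE flag is set
}

int main() {
    std::vector<ActionSet> greet = {
        {{"SAY(\"hi\")", "NOD", "WAVE"}, {"mouth", "head", "right_arm"}},
        {{"SMILE", "NOD"},               {"face", "head"}}};
    std::set<std::string> busy = {"right_arm"};          // the arm is mid-gesture already
    if (const ActionSet* chosen = pickAlternative(greet, busy))
        for (const std::string& a : chosen->actions)
            std::cout << a << '\n';                      // SMILE, NOD
}
```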
5.4 Speech Generation
The Generation Module, which sends all of the Action Scheduler's input, generates the
speech for the agent. This includes pitch specifications and embedded gestures as well as
the words in the speech. The Action Scheduler properly formats the speech request and
passes it to the Animation Engine, which performs the final call to the speech tool,
TrueTalk. If the Generation Module then requests an interruption to the speech, the
Action Scheduler can flag the Animation Engine in much the same way as it sends a
speech directive, causing the agent to end the utterance abruptly, but gracefully.
TrueTalk handles the graceful interruption of the conversational agent.
TrueTalk allows the insertion of callbacks within the speech instruction. The
agent utilizes these callbacks to notify other modules of the completion or initiation of
certain actions, for example propositional actions which concern what knowledge the
agent has of what she has said or not said. This notification becomes useful to modules
concerned with the content level of the agent's output. If the agent has failed to execute a
content related action, such as mentioning the cost of the house, the content generating
module should be aware of the failure to pass that information to the user and avoid such
fallacies as asking whether the cost is too high, for instance. The callbacks can also be
used to initiate gesture executions at key points within the speech, such as when the agent
wishes to point at an object precisely upon uttering a particular word.
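The callback idea can be sketched independently of TrueTalk's actual API, which is not documented here; assume the speech tool reports each word as it is spoken, and the scheduler has registered gestures against specific word positions.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Gestures registered against word positions in the utterance; when the speech tool
// reports that a word is being spoken, the matching gesture callback fires.
class SpeechCallbackTable {
public:
    void registerAtWord(std::size_t wordIndex, std::function<void()> gesture) {
        callbacks_[wordIndex] = std::move(gesture);
    }
    void onWordSpoken(std::size_t wordIndex) {
        auto it = callbacks_.find(wordIndex);
        if (it != callbacks_.end()) it->second();
    }
private:
    std::map<std::size_t, std::function<void()>> callbacks_;
};

int main() {
    std::vector<std::string> words = {"I", "would", "buy", "this", "item"};
    SpeechCallbackTable table;
    table.registerAtWord(3, [] { std::cout << "  [fire POINT_AT(X) stroke]\n"; });

    // Stand-in for the speech tool's playback loop invoking per-word callbacks.
    for (std::size_t i = 0; i < words.size(); ++i) {
        std::cout << words[i] << '\n';
        table.onWordSpoken(i);          // the gesture lands exactly on "this"
    }
}
```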
Chapter 6
Evaluation
6.1 What Worked
Rea is a work in progress, so her design has not had the opportunity to be extensively tested. However, much of the back-end development of the Action Scheduler and the Animation Engine, and a modicum of testing on the integration of all of her modules, has been completed to a sufficient degree to make a number of observations.
The development of the Inventor engines to manipulate the joint values along
with the creation of Ramp classes to move the joint values through a given range over a
given time greatly facilitated the generation of action procedures. Using a convention of
ranging the possible joint values from 0 to 1 allowed quick intuitive estimations of the
proper angles needed to produce the desired effects. This allowed a careful developer to
measure the proper conversions necessary to embed in each particular engine to create a
physically realistic agent in terms of flexibility. The generation of action procedures to
add to the knowledge base of the character quickly followed with ease.
The ramp/engine combination also enables the system to constantly have an
awareness of the values of Rea's joint fields. This produces two benefits. First, the
Action Scheduler can query the joints and know to a reasonable degree where they are.
Second, movements are performed from current positions as opposed to pre-calculated
start positions. This scheme of designating only final positions instead of both final and
initial positions allows a much greater level of flexibility in the movement specifications.
Rea responds with a great level of immediacy. When there is no user present in
the user space, the agent will look to the side or glance away, appearing to be in a non-attentive state.
Once a user enters the user space and faces Rea, the agent receives a
signal carried from the input modules through to the Action Scheduler and immediately
gazes upon the user expectantly. While Rea's development has not reached a state where
she can understand language, she is able to understand when the user is or is not
speaking. This ability combined with the ability to know whether the user is attending to
her or not enables Rea to engage in turn-taking behavior. When the user completes an
utterance (which involves the stopping of gesture movement as well), Rea initiates an
utterance, only continuing as long as the user appears to be attending.
If the user
indicates that he or she wishes to speak, Rea will gracefully cease speaking and politely
wait for the user to complete the turn. Rea's ability to exhibit a sensitivity to the flow of
the conversation and react with reasonable immediacy is due largely to the Action
Scheduler's ability to interrupt executing actions at will.
6.2 What Didn't Work
We found that high-end 3D packages such as 3D Studio Max and Softimage do not have
VRML porting tools that make a fully complete conversion from a native scene to
VRML. While our intention was that developers could create highly artistic characters
with these high-end packages and then convert them to VRML characters which would
be used by the Animation Engine, the conversion process was fraught with errors and
annoyances. Problems such as mysteriously inverted normals, displaced objects, and
untranslated groupings in the hierarchy made the process extremely cumbersome. A
complete conversion tool would greatly facilitate the combination of high-level graphics
and real-time graphical animation in the future.
Initial attempts to generate callback functionality with the speech tool, TrueTalk,
failed. Given that speech generation technology still doesn't provide much in the way of
precise timing and fast generation, further problems are expected as the development of
Rea continues.
Speech recognition, as well, remains an area that takes a noticeable
amount of processing, so the Action Scheduler's ability to serve the proper reactive
behaviors given the lag between the user's action and understanding of that action is
expected to suffer.
6.3 A Comparison
The Action Scheduler/Animation Engine draws from the contributions of past works to
various elements in conversation. The gesture coarticulation of Animated Conversations,
action buffering of Improv, and the three-tier architecture of the Blumberg system are
useful features to make the agent more realistic. Thórisson's Ymir architecture is the
closest predecessor to the proposed design for a humanoid conversational agent. The
proposed system has some structural similarities to Ymir such as the use of joints to
notify the existence of conflicts. However, several differences exist as well. The manner
in which the two Action Schedulers choose incoming requests differs. While Ymir
selects particular types of behaviors to service first, and then resolves what those
behaviors mean on the action level, Rea's scheduler is not designed to need to understand
the distinctions between those types of behaviors, nor does it need to understand the
mapping from behavior to action. This makes the Action Scheduler cleaner because it
does not need any discourse knowledge.
Another distinction is the handling of
speech/action integration. Rea's Action Scheduler can combine the speech annotation
with word-oriented callbacks to fire actions at the occurrence of specific words. Ymir, however, does not connect the two calls to speech and action. While Rea's and Ymir's
action scheduler architectures have similarities between them, Rea enjoys the ability to
move to the level of a more complex agent, since the multitude of discourse behavior
resolution tasks are abstracted away to a generation module.
Chapter 7
Conclusions and Future Research
7.1 The Validity of the Action Scheduler
While the use of an Action Scheduler module seems to be successful in advancing the
performance of humanoid agents, it still remains in question whether the current
paradigm is a model that will fit further development in the future. It must be noted that
while the Action Scheduler does behave as a sort of cerebellum as observed by
[Thórisson 96], it should not necessarily be selecting and excluding actions simply on an
availability basis. Humans do not discard actions simply and solely because the modality
happens to be busy at the moment. Currently it seems sufficient to use a scheme in which
the Action Scheduler discards requests that exhaust all possible alternatives, while
allowing the possibility to perform an absolute interruption. However, a more accurate
model of human decision making would require more subtle distinctions between actions.
Perhaps the use of prioritization levels would be more appropriate, but then the task
would require a global method of prioritizing behaviors.
7.2 Expansions
The fact that our priorities shift depending upon several factors implies that a more
complex scheme for behavior selection needs to be implemented in the Action Scheduler
or that the selection process needs to be abstracted completely away from that module
and placed in its input module. While the Action Scheduler may hold promise for
developing more effective humanoid agents, key tasks such as conflict resolution need to
be handled more cleverly.
The current scheme of first-come-first-serve cannot
accommodate extensively growing demands on conversational agents. Such a scheme
greatly limits the potential intelligence that the agent can exhibit if development
undertakes an effort to correctly represent human cognition. The alternative, the creation
of the perception of intelligence, achieved in systems that aim to create actors (such as
Improv), undoubtedly will experience a ceiling effect on the potential abilities of the
agent. Conflict resolution should be driven by the intentions and needs of the agent.
Bibliography

[Blumberg, Galyean, 1995] B. Blumberg, T. Galyean. "Multi-Level Direction of Autonomous Creatures for Real-Time Virtual Environments." Computer Graphics (SIGGRAPH '95 Proceedings), 30(3):47-54, 1995.

[Cassell et al. 1994] J. Cassell, C. Pelachaud, N. I. Badler, M. Steedman, B. Achorn, T. Becket, B. Douville, S. Prevost, M. Stone. "ANIMATED CONVERSATION: Rule-based Generation of Facial Expression, Gesture and Spoken Intonation for Multiple Conversational Agents." SIGGRAPH '94, Orlando, USA, 1994.

[Perlin, Goldberg, 1996] K. Perlin, A. Goldberg. "Improv: A System for Scripting Interactive Actors in Virtual Worlds." Media Research Laboratory, New York University, 1996.

[Thórisson 1996] K. R. Thórisson. Communicative Humanoids: A Computational Model of Psychosocial Dialogue Skills. Ph.D. Thesis, Media Arts and Sciences, MIT Media Laboratory, 1996.