From: AAAI Technical Report FS-93-03. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Instructing Real World Vacuumers*

Bonnie Webber and Norman Badler
Department of Computer and Information Science
University of Pennsylvania

June 4, 1993
1 Introduction
Early views of Natural Language instructions corresponded to the early views of plans: instructions were interpreted as specifying nodes of a hierarchical plan that, when completely expanded (instructions having been recognized as only being partial specifications of activity), acts as a control structure guiding the behavior of an agent. This view, for example, enabled SHRDLU's successful response to instructions such as "Pick up the green pyramid and put it in the box" [29], a response that attracted early public attention to the emerging field of Artificial Intelligence.
That plans should not be viewed as control structures has already been well argued by others in the field, including Agre and Chapman [1], Pollack [23], and Suchman [25]. Agre and Chapman, for example, point to the fact that when people form plans in response to instructions given them before the start of a task, they appear to use those plans as resources: the actual task situation may lead them to interpolate additional actions not mentioned in the instructions, to replace actions specified in the instructions with other ones that seem better suited to the situation, and to ground referring terms as a consequence of their actions rather than as a precondition for them, all of which are very different from how computers use programs to control their operation.
Agre and Chapman do not actually distinguish between instructions and plans in [1]. It is in subsequent work that distinctions emerge. For example, looking at instructions given as advice to agents already engaged in an activity, Chapman [7] treats instructions as additional evidence for action alternatives already identified by the agent as being relevant to its current situation, but that might not be followed immediately (or ever) if other alternatives have more evidence in their favor. Chapman argues that this is how arcade game players follow advice given them by kibitzers watching them play. (Chapman also notes that negative instructions can be understood as evidence against actions, but does not really explore this idea or its consequences.)

* The authors would like to thank Brett Achorn, Breck Baldwin, Welton Becket, Barbara Di Eugenio, Christopher Geib, Moon Jung, Libby Levison, Michael Moore, Michael White, and Xinmin Zhao, all of whom have contributed greatly to the current version of AnimNL. We would also like to thank Mark Steedman for comments on earlier drafts of this paper. This research is partially supported by ARO Grant DAAL03-89-C-0031, including participation by the U.S. Army Research Laboratory (Aberdeen), Natick Laboratory, and the Institute for Simulation and Training; U.S. Air Force DEPTH contract through Hughes Missile Systems F33615-91-C-0001; DMSO through the University of Iowa; National Defense Science and Engineering Graduate Fellowship in Computer Science DAAL03-92-G-0342; NSF Grant IRI91-17110; CISE Grant CDA88-22719 and Instrumentation and Laboratory Improvement Program Grant USE-9152503; and DARPA grant N00014-90-J-186.
Turning to instructions that interrupt activity, Alterman and his colleagues [2] show how such instructions can be treated as assistance, helping agents to accommodate their existing routines to the device currently at hand. In this approach, routines evolved over many different instances of engagement help focus an agent on the details of the situation that require attention and on the decisions that must be made. Instructions may interrupt activity to call attention to other relevant details or decisions or to correct decisions already made. Neither plans nor instructions function as control structures that determine the agent's behavior.
Our own work has focussed on instructions given to agents before they undertake a task, such as procedural instructions and warnings. Such instructions are widespread, found packaged in with devices as small as fuses (i.e., instructions for removing the blown fuse and installing the new fuse contained in the package) and as large as F-16 aircraft (e.g., instructions for performing routine maintenance on the air conditioning water separator, on the flight control system power supplies, etc.). Packaged in with the Royal CANVAC™ is an Owner's Manual containing such instructions as:
• Do not pull on the cord to unplug the vacuum cleaner. Grasp the plug instead. (p.2)

• Do not run the vacuum cleaner over the cord. (p.2)

• Depress door release button to open door and expose paper bag. (p.5)

The first two instructions are from the section labelled "Warning: To reduce the risk of fire, electric shock or injury", while the third is from the operating instructions for "Paper bag removal and installation".
Our work on instructions is being done as part of the Animation and Natural Language (AnimNL) project at the University of Pennsylvania. Besides providing a rich framework in which to analyse the semantics and pragmatics of instructions, the project has the goal of enabling people to use instructions to specify what animated human figures should do. Among the potential applications of such a feature is task analysis in connection with Computer-Aided Design tools. Being able to specify the task through a sequence of Natural Language instructions, much as one would find in an instruction manual, means that a designer need only modify agent and environment in order to observe their effect on the agent's ability to perform the task. Trying to do this through direct manipulation would require a designer to manipulate each different agent through each different environment in order to observe the same effects.
AnimNL builds upon the Jack™ animation system developed at the University of Pennsylvania's Computer Graphics Research Laboratory. Animation follows from model-based simulation. Jack provides biomechanically reasonable and anthropometrically-scaled human models and a growing repertoire of behaviors such as walking, stepping, looking, reaching, turning, grasping, strength-based lifting, and collision-avoidance posture planning [3]. Each of these behaviors is environmentally reactive in the sense that incremental computations during simulation are able to adjust an agent's performance to the situation without further involvement of the higher-level processes [4] unless an exceptional failure condition is signaled. Different spatial environments can easily be constructed and modified, to enable designers to vary the situations in which the figures are acting.¹
¹ In discussing agents, we will use the pronoun "he", since we will be using a male figure in our illustrated example, i.e., a figure with male body proportions. The Jack animation system provides anthropometrically sizable female figures as well.
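To make the notion of environmental reactivity described above a little more concrete, the following is a minimal sketch in Python. It is our own illustration, not Jack's actual interface; all of the names (WalkTo, BehaviorFailure, step, blocked) are assumptions made for the example. Each simulation tick the behavior re-reads its surroundings and incrementally adjusts the agent, signalling the higher-level processes only when it cannot proceed.

    # Hypothetical sketch of an "environmentally reactive" behavior: each
    # simulation tick it re-reads the environment and incrementally adjusts the
    # agent, involving higher-level processes only on an exceptional failure.
    import math
    from dataclasses import dataclass

    class BehaviorFailure(Exception):
        """Signals an exceptional condition needing higher-level replanning."""

    @dataclass
    class Agent:
        x: float
        y: float

    class WalkTo:
        def __init__(self, agent, goal, speed=1.0, tolerance=0.1):
            self.agent, self.goal = agent, goal
            self.speed, self.tolerance = speed, tolerance

        def step(self, blocked, dt):
            """One incremental step; returns True once the goal is reached.

            `blocked` is a callable standing in for perception of the
            environment: it reports whether the current heading is obstructed."""
            dx, dy = self.goal[0] - self.agent.x, self.goal[1] - self.agent.y
            dist = math.hypot(dx, dy)
            if dist <= self.tolerance:
                return True
            if blocked(self.agent, self.goal):
                raise BehaviorFailure("no clear path to goal")
            step_len = min(self.speed * dt, dist)
            self.agent.x += step_len * dx / dist
            self.agent.y += step_len * dy / dist
            return False

    # Usage: drive the behavior until it reports completion.
    agent = Agent(0.0, 0.0)
    walk = WalkTo(agent, goal=(3.0, 4.0))
    while not walk.step(blocked=lambda a, g: False, dt=0.1):
        pass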
Trying to make a human figure move in ways that people expect a human to move in carrying out a task is a formidable problem: human models in Jack are highly articulated, with over 100 degrees of freedom [3]. While the environment through which a Jack agent moves should, and does, influence its low-level responses [4], we have found that a great many behavioral constraints can be derived through instruction understanding and planning, the latter broadly construed as adopting intentions [6, 8, 16].
Because this is to be a short position paper, we will confine our comments to two issues and how the AnimNL architecture deals with them to produce, in the end, human figure animations:

• expectations derivable from instructions and the behaviors they can lead to;

• the broader class of behavioral decisions informed by knowledge of the intention towards which one's behavior is directed.
In other papers, we discuss other issues we are having to face, such as taking intentions seriously in means-end planning [14, 15] and mediating between symbolic action specifications such as "pick up the vacuum cleaner" and "open the door" and the specific reaching, grasping, re-orienting, etc. behaviors needed to effect them in the current situation [28].
In this discussion, we will try to restrict our examples to vacuuming household floors, although we ask the reader's indulgence if we find the need to branch out to removing stains from carpets, to provide more illustrative examples.
2 Expectations from Instructions
Usually one thinks of instructions in terms of what they explicitly request or advise people to do or to avoid doing. But another role they play is to provide agents with specific expectations about actions and their consequences, expectations that can influence an agent's behavioral decisions. Several of these are discussed in more detail in [28]. Here we focus on expectations concerning event and object locations.
Because actions can effect changes in the world or changes in what the agent can perceive, instructions may evoke more than one situational context [27]. This means that part of an agent's cognitive task in understanding instructions is to determine, for each referring expression, the situation in which the agent is meant to find (or ground) its referent. Some referring expressions in an instruction may be intended to refer to objects in the currently perceivable situation, while others may be intended to refer to objects that only appear (i.e., come into existence or become perceivable) as a consequence of carrying out an action specified in the instruction.
The difference can be seen by comparing the following two instructions:

1a. Go downstairs and get me the vacuum cleaner.
1b. Go downstairs and empty the vacuum cleaner.
In (1a), "the vacuum cleaner" will generally be taken to denote one that will be found downstairs, one that the agent may not yet be aware of. Given this instruction, agents appear to develop the expectation that it is only after they perform the initial action that they will be in a context in which it makes sense to try to ground the expression and determine its referent. What is especially interesting is the strength of this expectation: a cooperative agent will look around if a vacuum cleaner isn't immediately visible when they get downstairs, including checking different rooms, opening closets, etc., until they find one. This has led one of our students, Michael
Moore, to develop procedures he calls search plans [22], following [20], that guide this type of behavior.
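A minimal sketch of what such a search plan might amount to, under our own assumptions rather than Moore's actual formulation (the names search_plan, make_perceivable, perceive and matches are all hypothetical): iterate over candidate locations, do whatever makes each location perceivable, and stop once the referent is grounded.

    # Hypothetical sketch of a search plan in the spirit of [22]: iterate over
    # candidate locations, doing whatever is needed to make each one perceivable
    # (entering a room, opening a closet), until the sought object is found or
    # the candidates are exhausted.

    def search_plan(candidate_locations, make_perceivable, perceive, matches):
        """Return a grounded referent for the description, or None on failure.

        All four arguments are stand-ins for perception and action routines
        that a real agent architecture would supply."""
        for location in candidate_locations:
            make_perceivable(location)      # e.g. walk into the room, open the closet
            for obj in perceive(location):  # objects now visible at this location
                if matches(obj):            # does it satisfy "the vacuum cleaner"?
                    return obj
        return None

    # Toy usage: the vacuum cleaner turns out to be in the hall closet.
    rooms = {"kitchen": ["mop"], "hall closet": ["vacuum cleaner"], "den": ["television"]}
    found = search_plan(
        candidate_locations=rooms,
        make_perceivable=lambda loc: None,  # no doors to open in this toy example
        perceive=lambda loc: rooms[loc],
        matches=lambda obj: obj == "vacuum cleaner",
    )
    print(found)  # -> vacuum cleaner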
Instruction (1b) ("Go downstairs and empty the vacuum cleaner") leads to different locational expectations than (1a): an agent will first try to ground "the vacuum cleaner" in the current situation. If successful (and cooperative), the agent will then take that vacuum downstairs and empty it. If unsuccessful, however, an agent will not just take the instruction to be infelicitous (as they would in the case of an instruction like "Turn on the vacuum cleaner", if there were no vacuum in the current situation). Rather, an agent will adopt the same locational expectation as in the first example, that the vacuum cleaner is downstairs (i.e., where they will be after performing the initial action).
In AnimNL, Barbara Di Eugenio has attempted to derive some of these locational expectations through plan inference techniques described in more detail in [10, 11, 12]. In this case, the inferences are of the form: if one goes to place p for the purpose of doing action α, then expect p to be the site of α. Action representations have a notion called, alternatively, applicability conditions [24], constraints [18], or use-when conditions [26]: these are conditions that must hold for an action to be relevant, and are not meant to be achieved. If α has among its applicability conditions that an argument be at p for α to even be relevant, then a locational expectation develops as in (1a). If not, a weaker expectation arises, as in (1b). (Notice that this occurs within a single clause: "Get the vacuum cleaner downstairs" leads to the same expectation as (1a), while "Empty the vacuum cleaner downstairs" leads to the same expectation as (1b).)
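To make the distinction concrete, here is a small sketch of our own, not the representations actually used in AnimNL, of how an applicability condition on an action schema could yield the two strengths of locational expectation (the schema entries and function names are assumptions for illustration only):

    # Hypothetical sketch of the locational-expectation inference: going to
    # place p in order to do action a leads one to expect a's object at p,
    # strongly if a's applicability conditions require the object to be there,
    # weakly otherwise. These schemas are a simplification of our own.

    ACTION_SCHEMAS = {
        # "get x": to even be relevant, x must be where the agent is.
        "get":   {"object_must_be_at_site": True},
        # "empty x": x can be brought along; co-location is not required.
        "empty": {"object_must_be_at_site": False},
    }

    def locational_expectation(action, obj, destination):
        """Return (strength, expectation) for 'go to destination and <action> obj'."""
        requires_colocation = ACTION_SCHEMAS[action]["object_must_be_at_site"]
        strength = "strong" if requires_colocation else "weak"
        return strength, f"expect {obj} {destination}"

    print(locational_expectation("get", "the vacuum cleaner", "downstairs"))
    # ('strong', 'expect the vacuum cleaner downstairs')   -- as in (1a)
    print(locational_expectation("empty", "the vacuum cleaner", "downstairs"))
    # ('weak', 'expect the vacuum cleaner downstairs')     -- as in (1b)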
3 Informational Import of Purpose Clauses
Comments are often made about the efficiency of Natural Language -- how many meanings can piggy-back on one form. One example of this is purpose clauses -- infinitival clauses that explicitly encode the goal that an action achieves. Not only do purpose clauses provide explicit justification for actions (evidence for why they should be done), they also convey implicitly important aspects of actions such as perceptual tests needed to monitor and/or terminate them, appropriate direction of movement, other needed actions, etc.
This efficiency, in part, makes up for the fact that instructions are partial not only in what actions are made explicit (leaving others to be inferred or motivated by the situation) but also in what aspects of actions are made explicit. Given a partial description of an action and its purpose, an agent can use the latter as a basis for inferring an effective elaboration of the former, using techniques such as those described in [9, 10, 11, 12].
To see this, consider the following instruction for vacuuming:
To finish off, vacuum the rug or carpet against the direction of the pile to leave it raised.
(This actually does improve its appearance.) To follow this instruction, an agent must first determine the direction of the pile.² A good way to do this is to vacuum a bit in various directions, observing the resulting state of the pile. When one sees the direction of sweep that leaves it raised -- i.e., that satisfies the purpose clause -- one can carry out the instruction and finish off the job.
² Through the presupposition of the noun phrase, the agent may also be informed that a woven pile has a direction!

Examples abound of information conveyed implicitly by purpose clauses: to clean wine stains from a rug or carpet, one follows instructions such as:
Blot with clean tissues to remove any liquid still standing. Sprinkle liberally with salt to extract liquid that has soaked into the fabric. Vacuum up the salt.
In the first sentence, blotting with clean tissues specifies a type of activity but not how long that activity must be pursued. (In the terminology used by Moens and Steedman [21], it is simply a process like running, not a culminated process like running a mile.) How long comes from the purpose clause "to remove any liquid still standing". This implies a perceptual test conditioning termination -- i.e., blot until no standing liquid is visible. In a somewhat different way, the purpose clause in the second sentence signals when the agent can start the final step, vacuuming up the salt. It is not the termination point of the sprinkling (which is terminated when the agent decides there is now a "liberal" amount of salt on the stain [17]), but of the subsequent waiting while the salt does its thing. The agent's perceptual test comes from the purpose clause -- when salt extracts liquid, its texture and color (in the case of red wine) change: when the agent perceives that its surface is damp, the salt has extracted as much as it can, and the agent can commence vacuuming.³
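One way to picture the role these purpose clauses play -- a sketch of our own, not the AnimNL implementation -- is as supplying the perceptual predicate that either terminates an otherwise unbounded process or gates the start of the next step. All names and the toy world state below are assumptions standing in for a real perception module.

    # Hypothetical sketch: a purpose clause contributes a perceptual test that
    # either terminates an otherwise unbounded process ("blot ... to remove any
    # liquid still standing") or gates the start of the next step ("sprinkle
    # ... to extract liquid", so wait until the salt looks damp, then vacuum).

    def run_process(do_once, purpose_satisfied, max_iterations=100):
        """Repeat an unbounded process until the purpose-derived test holds."""
        for _ in range(max_iterations):
            if purpose_satisfied():
                return True
            do_once()
        return False  # give up; in AnimNL terms, signal a failure upward

    # Toy world state standing in for perception of the stain.
    state = {"standing_liquid": 3, "salt_damp": False}

    def blot_once():
        state["standing_liquid"] = max(0, state["standing_liquid"] - 1)

    # "Blot ... to remove any liquid still standing": termination test.
    run_process(blot_once, lambda: state["standing_liquid"] == 0)

    # "Sprinkle ... to extract liquid": the purpose clause gates the NEXT step.
    state["salt_damp"] = True       # perception eventually reports dampness
    if state["salt_damp"]:
        print("commence vacuuming up the salt")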
Di Eugenio has designed and implemented the machinery used in AnimNL for computing many of these inferences that follow from understanding that the action described in the main clause is being done for the purpose described in the purpose clause. The relationship between the two actions may be one of generation or enablement, the difference between them roughly being that additional actions are needed in the latter case to accomplish the specified purpose. Her approach makes use of both linguistic knowledge and planning knowledge. A knowledge base of plan schemata (or recipes) complements a taxonomic structure of action descriptions. The latter is represented in Classic [5] and exploits classification to be in a position to find related action descriptions, each corresponding to a different node. These nodes index into the knowledge base of recipes, which indicates generation, enablement and sub-structure relationships between actions. The inference algorithms on these linked structures are described in detail in [9, 11].
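A toy sketch of the kind of linkage involved, again our own much-simplified rendering rather than the Classic-based system itself (the entries and relation names below are illustrative assumptions; the actual algorithms are those of [9, 11]):

    # Hypothetical, much-simplified sketch of the two linked structures:
    # action descriptions organized taxonomically, and a recipe knowledge base
    # recording generation / enablement relations among them. Entries are
    # illustrative only.

    ACTION_TAXONOMY = {
        "vacuum-rug-against-pile": {"isa": "vacuum-rug"},
        "vacuum-rug": {"isa": "vacuum"},
    }

    RECIPES = {
        # "vacuum the rug against the pile TO leave it raised": the main-clause
        # action generates the purpose; no further action is needed.
        ("vacuum-rug-against-pile", "leave-pile-raised"): "generation",
        # "sprinkle with salt TO extract liquid": enablement; an additional
        # waiting step is needed before the purpose is achieved.
        ("sprinkle-salt-on-stain", "extract-liquid"): "enablement",
    }

    def relation(action, purpose):
        """Look up how a main-clause action relates to its purpose clause."""
        node = action
        while node is not None:             # walk up the taxonomy if needed
            rel = RECIPES.get((node, purpose))
            if rel is not None:
                return rel
            node = ACTION_TAXONOMY.get(node, {}).get("isa")
        return None

    print(relation("vacuum-rug-against-pile", "leave-pile-raised"))  # generation
    print(relation("sprinkle-salt-on-stain", "extract-liquid"))      # enablement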
4 Summary
The problems of understanding what information instructions provide an agent and how that information can be used are far from solved. Instructions meant to serve as general policy (e.g., "Do not run the vacuum cleaner over the cord.") will have to be understood and used in different ways than instructions meant to serve as immediate advice or corrections when routines fail, or procedural instructions such as those discussed above. Some of the problems involved in using instructions will be simplified in the case of non-humanoid robots, given that people will have fewer expectations concerning their response: criteria for success will be more like "did they accomplish the task without breaking anything?" than "did they do it the way I would?" Watching an early version of our animated agent opening an under-the-counter kitchen cabinet, with its knees stretched uncomfortably wide and oscillating back and forth, makes one realize that there is a lot more guiding low-level human performance than joint constraints and final state. These attempts, though, to further our understanding of instructions and of how intentions map to activity, through the use of human figure animation to visualize theories, seem to us a valid and profitable exercise.
³ Vacuuming may reveal that not all the liquid has been extracted, and that repeating the procedure might be called for.
References
[1] Agre, P. and Chapman, D. What are Plans for? In P. Maes (ed.), Designing Autonomous Agents. Cambridge MA: MIT Press, pp. 17-34. (First published as Technical Report, MIT AI Laboratory, 1989.)

[2] Alterman, R., Zito-Wolf, R. and Carpenter, T. Interaction, Comprehension and Instruction Usage. J. Learning Sciences 1(4), 1991.

[3] Badler, N., Phillips, C. and Webber, B. Simulating Humans: Computer Graphics, Animation and Control. New York: Oxford University Press, 1993.

[4] Becket, W. and Badler, N. I. Integrated Behavioral Agent Architecture. Proc. Workshop on Computer Generated Forces and Behavior Representation, Orlando FL, March 1993.

[5] Brachman, R., McGuinness, D., Patel-Schneider, P., Resnick, L. and Borgida, A. Living with Classic: When and How to use a KL-ONE-like language. In J. Sowa (ed.), Principles of Semantic Networks. San Mateo CA: Morgan Kaufmann Publ., 1991, pp. 401-457.
[6] Bratman, M., Israel, D. and Pollack, M. Plans and Resource-bounded Practical Reasoning. Computational Intelligence 4(4), November 1988, pp. 349-355.

[7] Chapman, D. Vision, Instruction and Action. Cambridge MA: MIT Press, 1991.

[8] Cohen, P. R. and Levesque, H. Persistence, intention, and commitment. In M. Georgeff and A. Lansky (eds.), Reasoning about Actions and Plans, Proceedings of the 1986 Workshop. Los Altos CA: Morgan Kaufmann, 1986, pp. 297-340.
[9] Di Eugenio, B. Understanding Natural Language Instructions: the Case of Purpose Clauses. Proc. 30th Annual Conference of the Assoc. for Computational Linguistics, Newark DE, June 1992.

[10] Di Eugenio, B. Computing Expectations in Understanding Natural Language Instructions. Submitted to AI*IA 93, Third Conference of the Italian Association for Artificial Intelligence, Torino, Italy, August 1993.

[11] Di Eugenio, B. Understanding Natural Language Instructions: a Computational Approach to Purpose Clauses. PhD Thesis, University of Pennsylvania, August 1993.

[12] Di Eugenio, B. and Webber, B. Plan Recognition in Understanding Instructions. Proc. 1st Int'l Conference on Artificial Intelligence Planning Systems, College Park MD, June 1992.

[13] Di Eugenio, B. and White, M. On the Interpretation of Natural Language Instructions. 1992 Int. Conf. on Computational Linguistics (COLING-92), Nantes, France, July 1992.
[14] Geib, C. Intentions in Means/End Planning. Technical Report MS-CIS-92-73, Dept. of Computer & Information Science, University of Pennsylvania, 1992.

[15] Geib, C. A Consequence of Incorporating Intentions in Means-End Planning. In AAAI Spring Symposium Series: Foundations of Automatic Planning: The Classical Approach and Beyond, Working Notes. Stanford CA, March 1993.

[16] Grosz, B. and Sidner, C. Plans for Discourse. In P. Cohen, J. Morgan & M. Pollack (eds.), Intentions in Communication. Cambridge MA: MIT Press, 1990.
[17] Karlin, R. Defining the Semantics of Verbal Modifiers in the Domain of Cooking Tasks. Proc. 26th Annual Meeting, Association for Computational Linguistics, SUNY Buffalo, June 1988, pp. 61-67.

[18] Litman, D. and Allen, J. Discourse Processing and Commonsense Plans. In P. Cohen, J. Morgan & M. Pollack (eds.), Intentions in Communication. Cambridge MA: MIT Press, 1990, pp. 365-388.

[19] Levison, L. Action Composition for the Animation of Natural Language Instructions. Technical Report MS-CIS-91-28, Dept. of Computer & Information Science, University of Pennsylvania, August 1991.

[20] Miller, G., Galanter, E. and Pribram, K. Plans and the Structure of Behavior. New York: Holt, Rinehart and Winston, 1960.

[21] Moens, M. and Steedman, M. Temporal Ontology and Temporal Reference. Computational Linguistics, 14(2):15-28, 1988.
[22] Moore, M. B. Search Plans. PhD Dissertation proposal, Department of Computer and Information Science, University of Pennsylvania, May 1993. (Technical Report MS-CIS-93-55.)

[23] Pollack, M. Inferring domain plans in question-answering. PhD thesis, Department of Computer & Information Science, Technical Report MS-CIS-86-40, University of Pennsylvania, 1986.

[24] Schoppers, M. Universal Plans for Reactive Robots in Unpredictable Environments. Proc. Int'l Joint Conf. on Artificial Intelligence, Milan, Italy, 1987.

[25] Suchman, L. Plans and Situated Actions. New York: Cambridge University Press, 1987.

[26] Tate, A. Generating Project Networks. Proc. Int'l Joint Conf. on Artificial Intelligence, Milan, Italy, 1987.

[27] Webber, B. and Baldwin, B. Accommodating Context Change. Proc. 30th Annual Conference of the Assoc. for Computational Linguistics, Newark DE, June 1992.

[28] Webber, B., Badler, N., Baldwin, B., Becket, W., Di Eugenio, B., Geib, C., Jung, M., Levison, L., Moore, M. and White, M. "Doing What You're Told: Following task instructions in changing but hospitable environments". In Y. Wilks and N. Okada (eds.), Language and Vision across the Pacific, to appear 1993. (Also appears as Technical Report MS-CIS-92-74, Department of Computer & Information Science, University of Pennsylvania.)

[29] Winograd, T. Understanding Natural Language. New York: Academic Press, 1972.