From: AAAI Technical Report FS-93-03. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Instructing Real World Vacuumers*

Bonnie Webber    Norman Badler
Department of Computer & Information Science
University of Pennsylvania

June 4, 1993

* The authors would like to thank Brett Achorn, Breck Baldwin, Welton Becket, Barbara Di Eugenio, Christopher Geib, Moon Jung, Libby Levison, Michael Moore, Michael White, and Xinmin Zhao, all of whom have contributed greatly to the current version of AnimNL. We would also like to thank Mark Steedman for comments on earlier drafts of the paper. This research is partially supported by ARO Grant DAAL03-89-C-0031, including participation by the U.S. Army Research Laboratory (Aberdeen), Natick Laboratory, and the Institute for Simulation and Training; U.S. Air Force DEPTH contract through Hughes Missile Systems F33615-91-C-0001; DMSO through the University of Iowa; National Defense Science and Engineering Graduate Fellowship in Computer Science DAAL03-92-G-0342; NSF Grant IRI91-17110, CISE Grant CDA88-22719, and Instrumentation and Laboratory Improvement Program Grant USE-9152503; and DARPA grant N00014-90-J-186.

1 Introduction

Early views of Natural Language instructions corresponded to the early views of plans: instructions were interpreted as specifying nodes of a hierarchical plan that, when completely expanded (instructions having been recognized as only being partial specifications of activity), acts as a control structure guiding the behavior of an agent. This view, for example, enabled SHRDLU's successful response to instructions such as "Pick up the green pyramid and put it in the box" [29], a response that attracted early public attention to the emerging field of Artificial Intelligence.

That plans should not be viewed as control structures has already been well argued by others in the field, including Agre and Chapman [1], Pollack [23], and Suchman [25]. Agre and Chapman, for example, point to the fact that when people form plans in response to instructions given them before the start of a task, they appear to use those plans as resources: the actual task situation may lead them to interpolate additional actions not mentioned in the instructions, to replace actions specified in the instructions with other ones that seem better suited to the situation, and to ground referring terms as a consequence of their actions rather than as a precondition for them, all of which are very different from how computers use programs to control their operation.

Agre and Chapman do not actually distinguish between instructions and plans in [1]. It is in subsequent work that distinctions emerge. For example, looking at instructions given as advice to agents already engaged in an activity, Chapman [7] treats instructions as additional evidence for action alternatives already identified by the agent as being relevant to its current situation, but that might not be followed immediately (or ever) if other alternatives have more evidence in their favor. Chapman argues that this is how arcade game players follow advice given them by kibbitzers watching them play. (Chapman also notes that negative instructions can be understood as evidence against actions, but does not really explore this idea or its consequences.) Turning to instructions that interrupt activity, Alterman and his colleagues [2] show how such instructions can be treated as assistance, helping agents to accommodate their existing routines to the device currently at hand.
In this approach, routines evolved over many different instances of engagement help focus an agent on the details of the situation that require attention and on the decisions that must be made. Instructions may interrupt activity to call attention to other relevant details or decisions, or to correct decisions already made. Neither plans nor instructions function as control structures that determine the agent's behavior.

Our own work has focussed on instructions given to agents before they undertake a task, such as procedural instructions and warnings. Such instructions are widespread, found packaged in with devices as small as fuses (i.e., instructions for removing the blown fuse and installing the new fuse contained in the package) and as large as F-16 aircraft (e.g., instructions for performing routine maintenance on the air conditioning water separator, on the flight control system power supplies, etc.). Packaged in with the Royal CANVAC™ is an Owner's Manual containing such instructions as:

• Do not pull on the cord to unplug the vacuum cleaner. Grasp the plug instead. (p. 2)

• Do not run the vacuum cleaner over the cord. (p. 2)

• Depress door release button to open door and expose paper bag. (p. 5)

The first two instructions are from the section labelled "Warning: To reduce the risk of fire, electric shock or injury", while the third is from the operating instructions for "Paper bag removal and installation".

Our work on instructions is being done as part of the Animation and Natural Language (AnimNL) project at the University of Pennsylvania. Besides providing a rich framework in which to analyse the semantics and pragmatics of instructions, the project has the goal of enabling people to use instructions to specify what animated human figures should do. Among the potential applications of such a capability is task analysis in connection with Computer-Aided Design tools. Being able to specify the task through a sequence of Natural Language instructions, much as one would find in an instruction manual, means that a designer need only modify agent and environment in order to observe their effect on the agent's ability to perform the task. Trying to do this through direct manipulation would require a designer to manipulate each different agent through each different environment in order to observe the same effects.

AnimNL builds upon the Jack™ animation system developed at the University of Pennsylvania's Computer Graphics Research Laboratory. Animation follows from model-based simulation. Jack provides biomechanically reasonable and anthropometrically-scaled human models and a growing repertoire of behaviors such as walking, stepping, looking, reaching, turning, grasping, strength-based lifting, and collision-avoidance posture planning [3]. Each of these behaviors is environmentally reactive, in the sense that incremental computations during simulation are able to adjust an agent's performance to the situation without further involvement of the higher-level processes [4] unless an exceptional failure condition is signaled. Different spatial environments can easily be constructed and modified, to enable designers to vary the situations in which the figures are acting.¹

¹ In discussing agents, we will use the pronoun "he", since we will be using a male figure in our illustrated example -- i.e., a figure with male body proportions. The Jack animation system provides anthropometrically sizable female figures as well.
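To make this division of labor concrete, the following is a minimal sketch in Python of an environment-reactive behavior of the kind just described. It is an invented illustration, not Jack or AnimNL code: the names World, ReachBehavior, BehaviorFailure and tick are hypothetical. The point is simply that the behavior adjusts itself incrementally at each simulation step and involves higher-level processes only when it signals an exceptional failure.

    # A hypothetical sketch, not the Jack API: a low-level behavior adapts on its own
    # each simulation tick, escalating to higher-level processes only on failure.
    from dataclasses import dataclass

    @dataclass
    class World:
        obstacle_ahead: bool = False
        target_reachable: bool = True

    class BehaviorFailure(Exception):
        """Raised only for exceptional conditions that call for higher-level replanning."""

    class ReachBehavior:
        def __init__(self, target):
            self.target = target

        def tick(self, world):
            """One incremental simulation step; returns True when the reach is complete."""
            if not world.target_reachable:
                # Exceptional condition: hand the problem back to the planner.
                raise BehaviorFailure(f"cannot reach {self.target}")
            if world.obstacle_ahead:
                # Local adjustment made by the behavior itself, with no planner involvement.
                print("adjusting posture around obstacle")
                return False
            print(f"moving arm toward {self.target}")
            return True   # pretend the reach completes in one step

    def simulate(behavior, world, replan):
        try:
            while not behavior.tick(world):
                pass
        except BehaviorFailure as failure:
            replan(failure)   # only now do higher-level processes get involved

    simulate(ReachBehavior("plug"), World(), replan=print)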
Trying to make a human figure move in ways that people expect a human to move in carrying out a task is a formidable problem: human models in Jack are highly articulated, with over 100 degrees of freedom [3]. While the environment through which a Jack agent moves should, and does, influence its low-level responses [4], we have found that a great many behavioral constraints can be derived through instruction understanding and planning, the latter broadly construed as adopting intentions [6, 8, 16].

Because this is to be a short position paper, we will confine our comments to two issues and how the AnimNL architecture deals with them to produce, in the end, human figure animations:

• expectations derivable from instructions and the behaviors they can lead to;

• the broader class of behavioral decisions informed by knowledge of the intention towards which one's behavior is directed.

In other papers, we discuss other issues we are having to face, such as taking intentions seriously in means-end planning [14, 15] and mediating between symbolic action specifications such as "pick up the vacuum cleaner" and "open the door" and the specific reaching, grasping, re-orienting, etc. behaviors needed to effect them in the current situation [28]. In this discussion, we will try to restrict our examples to vacuuming household floors, although we ask the reader's indulgence if we find the need to branch out to removing stains from carpets, to provide more illustrative examples.

2 Expectations from Instructions

Usually one thinks of instructions in terms of what they explicitly request or advise people to do or to avoid doing. But another role they play is to provide agents with specific expectations about actions and their consequences, expectations that can influence an agent's behavioral decisions. Several of these are discussed in more detail in [28]. Here we focus on expectations concerning event and object locations.

Because actions can effect changes in the world or changes in what the agent can perceive, instructions may evoke more than one situational context [27]. This means that part of an agent's cognitive task in understanding instructions is to determine, for each referring expression, the situation in which the agent is meant to find (or ground) its referent. Some referring expressions in an instruction may be intended to refer to objects in the currently perceivable situation, while others may be intended to refer to objects that only appear (i.e., come into existence or become perceivable) as a consequence of carrying out an action specified in the instruction. The difference can be seen by comparing the following two instructions:

1a. Go downstairs and get me the vacuum cleaner.
 b. Go downstairs and empty the vacuum cleaner.

In (1a), "the vacuum cleaner" will generally be taken to denote one that will be found downstairs, one that the agent may not yet be aware of. Given this instruction, agents appear to develop the expectation that it is only after they perform the initial action that they will be in a context in which it makes sense to try to ground the expression and determine its referent. What is especially interesting is the strength of this expectation: a cooperative agent will look around if a vacuum cleaner isn't immediately visible when they get downstairs, including checking different rooms, opening closets, etc., until they find one.
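A small sketch may help fix the idea. The following Python fragment is an invented illustration, not part of the AnimNL implementation; the Agent class and its methods are hypothetical. Grounding of "the vacuum cleaner" is deferred until after the enabling action, and if the referent is not immediately visible in the resulting situation, the cooperative agent goes on examining other likely places rather than rejecting the instruction.

    # Hypothetical sketch: defer grounding until after the enabling action, then
    # search cooperatively if the referent is not in plain view.
    from dataclasses import dataclass

    @dataclass
    class Agent:
        location: str
        places: dict      # location -> places worth examining there (rooms, closets, ...)
        contents: dict    # (location, place) -> objects found there

        def go(self, destination):
            self.location = destination

        def in_plain_view(self, description):
            """Try to ground a description against what is immediately visible here."""
            here = self.contents.get((self.location, "open floor"), [])
            return description if description in here else None

        def search(self, description):
            """Examine other rooms and closets until one of them yields the referent."""
            for place in self.places.get(self.location, []):
                if description in self.contents.get((self.location, place), []):
                    return description
            return None

    def interpret_get(agent, destination, description):
        # "Go downstairs and get me the vacuum cleaner": the referent is expected
        # to be groundable only after the initial action, so go first, then ground.
        agent.go(destination)
        referent = agent.in_plain_view(description)
        if referent is None:
            referent = agent.search(description)
        return referent

    agent = Agent(location="upstairs",
                  places={"downstairs": ["hall closet", "kitchen"]},
                  contents={("downstairs", "hall closet"): ["vacuum cleaner"]})
    print(interpret_get(agent, "downstairs", "vacuum cleaner"))   # -> vacuum cleaner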
Such cooperative search behavior has led one of our students, Michael Moore, to develop procedures he calls search plans [22], following [20], that can guide it.

Instruction (1b) ("Go downstairs and empty the vacuum cleaner") leads to different locational expectations than (1a): an agent will first try to ground "the vacuum cleaner" in the current situation. If successful (and cooperative), the agent will then take that vacuum downstairs and empty it. If unsuccessful, however, an agent will not just take the instruction to be infelicitous (as they would in the case of an instruction like "Turn on the vacuum cleaner", if there were no vacuum in the current situation). Rather, an agent will adopt the same locational expectation as in the first example: that the vacuum cleaner is downstairs (i.e., where they will be after performing the initial action).

In AnimNL, Barbara Di Eugenio has attempted to derive some of these locational expectations through plan inference techniques described in more detail in [10, 11, 12]. In this case, the inferences are of the form: if one goes to place p for the purpose of doing action α, then expect p to be the site of α. Action representations have a notion called, alternatively, applicability conditions [24], constraints [18], or use-when conditions [26]: these are conditions that must hold for an action to be relevant, and are not meant to be achieved. If α has among its applicability conditions that an argument be at p for α even to be relevant, then a locational expectation develops as in (1a). If not, a weaker expectation arises, as in (1b). (Notice that this occurs within a single clause: "Get the vacuum cleaner downstairs" leads to the same expectation as (1a), while "Empty the vacuum cleaner downstairs" leads to the same expectation as (1b).)

3 Informational Import of Purpose Clauses

Comments are often made about the efficiency of Natural Language - how many meanings can piggy-back on one form. One example of this is purpose clauses -- infinitival clauses that explicitly encode the goal that an action achieves. Not only do purpose clauses provide explicit justification for actions (evidence for why they should be done), they also convey implicitly important aspects of actions such as perceptual tests needed to monitor and/or terminate them, appropriate direction of movement, other needed actions, etc. This efficiency, in part, makes up for the fact that instructions are partial not only in what actions are made explicit (leaving others to be inferred or motivated by the situation) but also in what aspects of actions are made explicit. Given a partial description of an action and its purpose, an agent can use the latter as a basis for inferring an effective elaboration of the former, using techniques such as those described in [9, 10, 11, 12]. To see this, consider the following instruction for vacuuming:

    To finish off, vacuum the rug or carpet against the direction of the pile to leave it raised. (This actually does improve its appearance.)

To follow this instruction, an agent must first determine the direction of the pile.² A good way to do this is to vacuum a bit in various directions, observing the resulting state of the pile. When one sees the direction of sweep that leaves it raised -- i.e., that satisfies the purpose clause -- one can carry out the instruction and finish off the job.

² Through the presupposition of the noun phrase, the agent may also be informed that a woven pile has a direction!
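Here the purpose clause "to leave it raised" supplies the perceptual test that the action description itself omits. The following Python fragment is a hypothetical sketch of that strategy, not AnimNL code; the candidate directions and the perception stand-ins (vacuum_a_bit, appears_raised) are invented for illustration.

    # Hypothetical sketch: treat the purpose clause as a perceptual success test for
    # choosing the sweep direction, probing before committing to the full action.
    DIRECTIONS = ["north", "south", "east", "west"]   # candidate sweep directions

    def appears_raised(pile_state):
        """Stand-in perceptual predicate derived from the purpose clause."""
        return pile_state == "raised"

    def vacuum_a_bit(direction):
        """Stand-in for a short trial sweep plus observation of the resulting pile."""
        # Pretend the pile lies toward the south, so sweeping north leaves it raised.
        return "raised" if direction == "north" else "flattened"

    def finish_off_carpet():
        # Probe each direction until the observed result satisfies the purpose clause,
        # then carry out the instruction proper in that direction.
        for direction in DIRECTIONS:
            if appears_raised(vacuum_a_bit(direction)):
                print(f"vacuuming the whole carpet toward the {direction}")
                return direction
        raise RuntimeError("no sweep direction left the pile raised")

    finish_off_carpet()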
Examples abound of information conveyed implicitly by purpose clauses: to clean wine stains from a rug or carpet, one follows instructions such as:

    Blot with clean tissues to remove any liquid still standing. Sprinkle liberally with salt to extract liquid that has soaked into the fabric. Vacuum up the salt.

In the first sentence, blotting with clean tissues specifies a type of activity but not how long that activity must be pursued. (In the terminology used by Moens and Steedman [21], it is simply a process, like running, not a culminated process, like running a mile.) How long comes from the purpose clause "to remove any liquid still standing". This implies a perceptual test conditioning termination - i.e., blot until no standing liquid is visible. In a somewhat different way, the purpose clause in the second sentence signals when the agent can start the final step, vacuuming up the salt. It is not the termination point of the sprinkling (which is terminated when the agent decides there is now a "liberal" amount of salt on the stain [17]), but of the subsequent waiting while the salt does its thing. The agent's perceptual test comes from the purpose clause -- when salt extracts liquid, its texture and color (in the case of red wine) change: when the agent perceives that its surface is damp, the salt has extracted as much as it can, and the agent can commence vacuuming.³

³ Vacuuming may reveal that not all the liquid has been extracted, and that repeating the procedure might be called for.

Di Eugenio has designed and implemented the machinery used in AnimNL for computing many of these inferences that follow from understanding that the action described in the main clause is being done for the purpose described in the purpose clause. The relationship between the two actions may be one of generation or enablement, the difference between them roughly being that additional actions are needed in the latter case to accomplish the specified purpose. Her approach makes use of both linguistic knowledge and planning knowledge. A knowledge base of plan schemata (or recipes) complements a taxonomic structure of action descriptions. The latter is represented in Classic [5] and exploits classification to be in a position to find related action descriptions, each corresponding to a different node. These nodes index into the knowledge base of recipes, which indicates generation, enablement and sub-structure relationships between actions. The inference algorithms on these linked structures are described in detail in [9, 11].
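A small data-structure sketch may make this organization easier to picture. The following Python fragment is hypothetical, not the Classic-based AnimNL implementation: the ActionDescription and Recipe classes and the example entries are invented, and the simple parent-chain subsumption check merely stands in for the classification that Classic performs. It illustrates only how taxonomically organized action descriptions can index recipes recording generation and enablement relationships.

    # Hypothetical sketch of a recipe knowledge base indexed by action descriptions.
    from dataclasses import dataclass, field

    @dataclass
    class ActionDescription:
        """A node in the taxonomy of action descriptions (a stand-in for a Classic concept)."""
        name: str
        parent: "ActionDescription | None" = None
        recipes: list = field(default_factory=list)   # recipes indexed by this node

        def subsumes(self, other):
            # Stand-in for classification: walk up the taxonomy from `other`.
            while other is not None:
                if other is self:
                    return True
                other = other.parent
            return False

    @dataclass
    class Recipe:
        """A plan schema relating a main-clause action to a purpose-clause action."""
        action: ActionDescription
        purpose: ActionDescription
        relation: str                                  # "generation" or "enablement"
        substeps: list = field(default_factory=list)   # further actions needed under enablement

    # Invented example entries: blotting generates removing the standing liquid,
    # while sprinkling salt only enables extraction (the waiting step remains).
    remove_liquid = ActionDescription("remove-standing-liquid")
    extract_liquid = ActionDescription("extract-soaked-liquid")
    blot = ActionDescription("blot-with-tissues")
    sprinkle = ActionDescription("sprinkle-with-salt")
    wait = ActionDescription("wait-until-salt-damp")

    blot.recipes.append(Recipe(blot, remove_liquid, "generation"))
    sprinkle.recipes.append(Recipe(sprinkle, extract_liquid, "enablement", substeps=[wait]))

    def relation_between(action, purpose):
        """Find how the main-clause action is related to the purpose-clause action."""
        for recipe in action.recipes:
            if recipe.purpose.subsumes(purpose):
                return recipe.relation, recipe.substeps
        return None

    print(relation_between(blot, remove_liquid))       # -> ('generation', [])
    print(relation_between(sprinkle, extract_liquid))  # -> ('enablement', [the waiting step])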
4 Summary

The problems of understanding what information instructions provide an agent and how that information can be used are far from solved. Instructions meant to serve as general policy (e.g., "Do not run the vacuum cleaner over the cord.") will have to be understood and used in different ways than instructions meant to serve as immediate advice or corrections when routines fail, or procedural instructions such as those discussed above. Some of the problems involved in using instructions will be simplified in the case of non-humanoid robots, given that people will have fewer expectations concerning their response: criteria for success will be more like "did they accomplish the task without breaking anything?" than "did they do it the way I would?"

Watching an early version of our animated agent opening an under-the-counter kitchen cabinet, with its knees stretched uncomfortably wide and oscillating back and forth, makes one realize that there is a lot more guiding low-level human performance than joint constraints and final state. Nevertheless, these attempts to further our understanding of instructions and of how intentions map to activity, through the use of human figure animation to visualize theories, seem to us a valid and profitable exercise.

References

[1] Agre, P. and Chapman, D. What are Plans for? In P. Maes (ed.), Designing Autonomous Agents. Cambridge MA: MIT Press, pp. 17-34. (First published as Technical Report, MIT AI Laboratory, 1989.)

[2] Alterman, R., Zito-Wolf, R. and Carpenter, T. Interaction, Comprehension and Instruction Usage. J. Learning Sciences 1(4), 1991.

[3] Badler, N., Phillips, C. and Webber, B. Simulating Humans: Computer Graphics, Animation and Control. New York: Oxford University Press, 1993.

[4] Becket, W. and Badler, N. I. Integrated Behavioral Agent Architecture. Proc. Workshop on Computer Generated Forces and Behavior Representation, Orlando FL, March 1993.

[5] Brachman, R., McGuinness, D., Patel-Schneider, P., Resnick, L. and Borgida, A. Living with Classic: When and How to Use a KL-ONE-like Language. In J. Sowa (ed.), Principles of Semantic Networks. San Mateo CA: Morgan Kaufmann Publ., 1991, pp. 401-457.

[6] Bratman, M., Israel, D. and Pollack, M. Plans and Resource-bounded Practical Reasoning. Computational Intelligence 4(4), November 1988, pp. 349-355.

[7] Chapman, D. Vision, Instruction, and Action. Cambridge MA: MIT Press, 1991.

[8] Cohen, P. R. and Levesque, H. Persistence, intention, and commitment. In M. Georgeff and A. Lansky (eds.), Reasoning about Actions and Plans: Proceedings of the 1986 Workshop. Los Altos CA: Morgan Kaufmann, 1986, pp. 297-340.

[9] Di Eugenio, B. Understanding Natural Language Instructions: the Case of Purpose Clauses. Proc. 30th Annual Conference of the Assoc. for Computational Linguistics, Newark DE, June 1992.

[10] Di Eugenio, B. Computing Expectations in Understanding Natural Language Instructions. Submitted to AI*IA 93, Third Conference of the Italian Association for Artificial Intelligence, Torino, Italy, August 1993.

[11] Di Eugenio, B. Understanding Natural Language Instructions: a Computational Approach to Purpose Clauses. PhD Thesis, University of Pennsylvania, August 1993.

[12] Di Eugenio, B. and Webber, B. Plan Recognition in Understanding Instructions. Proc. 1st Int'l Conference on Artificial Intelligence Planning Systems, College Park MD, June 1992.

[13] Di Eugenio, B. and White, M. On the Interpretation of Natural Language Instructions. 1992 Int. Conf. on Computational Linguistics (COLING-92), Nantes, France, July 1992.

[14] Geib, C. Intentions in Means/End Planning. Technical Report MS-CIS-92-73, Dept. of Computer & Information Science, University of Pennsylvania, 1992.

[15] Geib, C. A Consequence of Incorporating Intentions in Means-End Planning. In AAAI Spring Symposium Series: Foundations of Automatic Planning: The Classical Approach and Beyond, Working Notes. Stanford CA, March 1993.

[16] Grosz, B. and Sidner, C. Plans for Discourse. In P. Cohen, J. Morgan & M. Pollack (eds.), Intentions in Communication. Cambridge MA: MIT Press, 1990.

[17] Karlin, R. Defining the Semantics of Verbal Modifiers in the Domain of Cooking Tasks.
Proc. 26th Annual Meeting, Association for Computational Linguistics, SUNY Buffalo, June 1988, pp. 61-67.

[18] Litman, D. and Allen, J. Discourse Processing and Commonsense Plans. In P. Cohen, J. Morgan & M. Pollack (eds.), Intentions in Communication. Cambridge MA: MIT Press, 1990, pp. 365-388.

[19] Levison, L. Action Composition for the Animation of Natural Language Instructions. Technical Report MS-CIS-91-28, Dept. of Computer & Information Science, University of Pennsylvania, August 1991.

[20] Miller, G., Galanter, E. and Pribram, K. Plans and the Structure of Behavior. New York: Holt, Rinehart and Winston, 1960.

[21] Moens, M. and Steedman, M. Temporal Ontology and Temporal Reference. Computational Linguistics 14(2):15-28, 1988.

[22] Moore, M. B. Search Plans. PhD Dissertation proposal, Department of Computer and Information Science, University of Pennsylvania, May 1993. (Technical Report MS-CIS-93-55.)

[23] Pollack, M. Inferring Domain Plans in Question-Answering. PhD thesis, Department of Computer & Information Science, Technical Report MS-CIS-86-40, University of Pennsylvania, 1986.

[24] Schoppers, M. Universal Plans for Reactive Robots in Unpredictable Environments. Proc. Int'l Joint Conf. on Artificial Intelligence, Milan, Italy, 1987.

[25] Suchman, L. Plans and Situated Actions. New York: Cambridge University Press, 1987.

[26] Tate, A. Generating Project Networks. Proc. Int'l Joint Conf. on Artificial Intelligence, Milan, Italy, 1987.

[27] Webber, B. and Baldwin, B. Accommodating Context Change. Proc. 30th Annual Conference of the Assoc. for Computational Linguistics, Newark DE, June 1992.

[28] Webber, B., Badler, N., Baldwin, B., Becket, W., Di Eugenio, B., Geib, C., Jung, M., Levison, L., Moore, M. and White, M. "Doing What You're Told: Following Task Instructions in Changing but Hospitable Environments". In Y. Wilks and N. Okada (eds.), Language and Vision across the Pacific, to appear 1993. (Also appears as Technical Report MS-CIS-92-74, Department of Computer & Information Science, University of Pennsylvania.)

[29] Winograd, T. Understanding Natural Language. New York: Academic Press, 1972.