From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Learning Action Models by Observing Other Agents*

Xuemei Wang
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

* This research was sponsored by the Avionics Laboratory, Wright Research and Development Center, Aeronautical Systems Division (AFSC), U.S. Air Force, Wright-Patterson AFB, OH 45433-6543, under Contract F33615-90-C-1465, ARPA Order No. 7597. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

Abstract

Autonomous agents should have the ability to learn from the environment. The actions of other agents in the environment provide a useful source for such learning [Dent, 1990]. This paper presents a framework for learning action models by observing other agents in the context of PRODIGY [Carbonell et al., 1990], a general-purpose planner. A domain in PRODIGY is specified by a set of operators that represent actions. The system learns the operators incrementally through a closed-loop integration of observing other agents, learning, planning, and executing plans in the environment.

1 Introduction

We believe that an autonomous intelligent agent should be able to interact with the environment, to observe the actions of other agents, to learn from them, and to use the learned knowledge to solve new problems. A general-purpose planner provides a framework in which the agent can learn an action model and test the model while solving the problems it encounters. In this paper, we present a framework for learning action models for planning domains by observing other agents. Our system observes sequences of changes in the environment caused by the actions of other agents, and hypothesizes operators of the domain based on the observations. When the system is given new problems to solve, it uses the hypothesized operators, and modifies and confirms them during its own plan execution.

2 Background: The PRODIGY Problem Solver

PRODIGY [Carbonell et al., 1990] is a domain-independent problem solver that uses means-ends analysis. As in STRIPS [Fikes and Nilsson, 1971], PRODIGY's operators have preconditions and effects. The precondition expressions specify what must be satisfied in the state before the operator can be applied. The effects describe how the application of the operator changes the world. A state is represented as a conjunction of literals, where a literal is a formula or its negation. Given an initial state and a goal expression, PRODIGY searches for a sequence of operators that will transform the initial state into a state that matches the goal expression. Figure 1 shows the basic search algorithm in PRODIGY. Notice that the operators are applied in PRODIGY's mental model, not in the environment. PRODIGY backtracks if the current path fails to find a solution.

    1. Check if the goal statement is true in the current state. If it is, a solution has been found; return the plan.
    2. Choose a goal from the set of pending goals.
    3. Find an operator to achieve the goal.
    4. Apply the operator if applicable. Otherwise, add the preconditions of the operator to the set of pending goals. Go to 1.

Figure 1: Basic search algorithm in PRODIGY.
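For concreteness, the loop of Figure 1 might be rendered roughly as the following Python sketch. It is an illustration of the control structure only, under assumed representations that are not from the paper: states and goals as sets of literal tuples, `operators` as a mapping from a name to a (preconditions, added-effects) pair of sets. PRODIGY's backtracking over alternative operators, goal orderings, and loop detection are all omitted.

```python
def basic_search(state, goals, operators):
    """Minimal sketch of Figure 1: PRODIGY's means-ends loop in the mental model.

    `state` and `goals` are sets of literal tuples; `operators` maps a name to
    (preconditions, added-effects).  Backtracking is omitted in this sketch.
    """
    plan, pending = [], [g for g in goals if g not in state]
    while pending:                                   # 1. stop when all goals hold
        goal = pending[-1]                           # 2. choose a pending goal
        if goal in state:
            pending.pop()
            continue
        op = next((name for name, (pre, adds) in operators.items()
                   if goal in adds), None)           # 3. find an achieving operator
        if op is None:
            return None                              # goal cannot be achieved
        pre, adds = operators[op]
        missing = [p for p in pre if p not in state]
        if not missing:                              # 4. applicable in the mental model:
            state = state | adds                     #    apply it and record the step
            plan.append(op)
        else:
            pending.extend(missing)                  #    otherwise subgoal on preconditions
    return plan
```

The learning problem addressed in the rest of this paper is precisely that the `operators` table handed to such a loop is not given in advance.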
3 Learning by Observing Other Agents in PRODIGY

This section describes in detail the mechanism for learning by observation in the context of PRODIGY.

3.1 Learning by Observation: The Problem and the Hypothesis

Our learning system has two sources of input for learning: (a) the observed sequences of state changes caused by the actions of other agents, and (b) the observed state changes caused by its own actions during plan execution when solving new problems. The less initial knowledge is built into the system, the more general the learning mechanism; we therefore prefer systems with as little initial knowledge as possible. The initial knowledge in our system consists of the types of the objects in the domain and the predicates available for representing states. No operator model is given to the system.

We assume that everything in the state is observable. We also assume that the operators and the environment are deterministic: applying the same operator in the exact same state has the exact same outcome. Under these assumptions, the learning system learns domains whose operators have conjunctive, non-negated preconditions and no conditional effects.

3.2 Learning by Observation: The Methods

Each observation of a state change from the above two input sources gives the system an opportunity to hypothesize a new operator, to generalize an existing operator, or to recover from an incorrect generalization. The hypothesization, generalization, and recovery methods are described in detail in this section.

3.2.1 Hypothesizing operators from observations

The first source of input is the observed sequences of state changes caused by the actions of other agents. An observation consists of a pre-state and a post-state. The pre-state is the state before the action is executed by the observed agent, and the post-state is the state after the action is executed. The delta-state of an observation is computed by finding the difference between the post-state and the pre-state. Given an observation, the system matches the delta-state of the observation against each operator it has learned so far (there are no such operators initially). An operator accounts for the observation if the delta-state of the observation and the effects of the operator match.

If no operator accounts for the observation, we hypothesize a new operator whose preconditions are the pre-state of the observation and whose effects are the delta-state of the observation. We generalize the objects in the state to variables of the most specific types.¹ Since the system initially starts with no operator models, all operators are added to the system this way. For example, suppose the system is learning in the Monkey-and-Bananas domain. After the initial observation shown in Figure 2, it forms the operator OP-1 shown in Figure 3.

¹ Objects that do not change from problem to problem are treated as constants and are not generalized.

    Pre-state:
      (thirsty monkey)  (at banana p3)  (hungry monkey)  (at glass p6)
      (at box p3)  (at waterfountain p5)  (hasknife monkey)  (onbox p3)
    Post-state:
      (thirsty monkey)  (at banana p3)  (hungry monkey)  (at glass p6)
      (at box p3)  (at waterfountain p5)  (hasknife monkey)  (onbox p3)
      (hasbanana monkey)
    Delta-state:
      adds: (hasbanana monkey)

Figure 2: Observation-1.

    (Operator OP-1
      (preconds ((<x> Location) (<y> Location) (<z> Location))
        (and (thirsty monkey) (at banana <x>) (at glass <z>)
             (hungry monkey) (at box <x>) (at waterfountain <y>)
             (hasknife monkey) (onbox <x>)))
      (effects
        ((add (hasbanana monkey)))))

Figure 3: Hypothesized operator OP-1 from Observation-1.
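A minimal Python sketch of this hypothesization step is given below. It assumes literals are tuples of strings and that `typed_objects` maps each object that varies across problems (such as p3, p5, p6) to its type; the class and function names are illustrative, and the variablization and matching are deliberately crude stand-ins for PRODIGY's typed generalization and matching, not the paper's implementation.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Operator:
    name: str
    preconds: set                  # literals, e.g. ("at", "banana", "<v1>")
    adds: set                      # added literals (the learned effects)
    observations: list = field(default_factory=list)   # observations it accounts for

@dataclass(frozen=True)
class Observation:
    pre_state: frozenset           # ground literals, e.g. ("at", "banana", "p3")
    post_state: frozenset

    @property
    def delta(self):
        """Delta-state: what the observed action added to the world."""
        return self.post_state - self.pre_state

_ids = count(1)

def variablize(literals, typed_objects):
    """Crude stand-in for generalization to typed variables: each object that
    varies across problems gets its own variable; constants stay as they are."""
    names = {}
    def var(obj):
        return (names.setdefault(obj, f"<v{len(names) + 1}>")
                if obj in typed_objects else obj)
    return {tuple(var(a) for a in lit) for lit in literals}

def accounts_for(op, obs, typed_objects):
    """An operator accounts for an observation when its effects match the
    delta-state (approximated here by comparing variablized literals; a real
    matcher would unify variables properly)."""
    return variablize(obs.delta, typed_objects) == set(op.adds)

def hypothesize(obs, operators, typed_objects):
    """If no learned operator accounts for the observation, form a new one:
    preconditions = the whole pre-state, effects = the delta-state."""
    if any(accounts_for(op, obs, typed_objects) for op in operators):
        return None                # handled by the generalization step instead
    op = Operator(name=f"OP-{next(_ids)}",
                  preconds=variablize(obs.pre_state, typed_objects),
                  adds=variablize(obs.delta, typed_objects),
                  observations=[obs])
    operators.append(op)
    return op
```

Run on Observation-1, `hypothesize` would produce an operator of the same shape as OP-1 in Figure 3: the full variablized pre-state as preconditions and (hasbanana monkey) as the single add effect.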
If exactly one operator accounts for the observation, the system modifies that operator by dropping the preconditions that do not hold in the pre-state of the observation. For example, given the observation shown in Figure 4, we find that OP-1 accounts for it. We drop the preconditions (thirsty monkey) and (at glass <z>) of operator OP-1 because they are not satisfied in Observation-2. The modified operator is shown in Figure 5. The operator will be generalized further as the system sees other observations that it accounts for: its preconditions become the intersection of the pre-states of all the observations that the operator accounts for.

    Pre-state:
      (at banana p3)  (hungry monkey)  (at box p3)
      (at waterfountain p5)  (hasknife monkey)  (onbox p3)
    Post-state:
      (at banana p3)  (hungry monkey)  (at box p3)
      (at waterfountain p5)  (hasknife monkey)  (onbox p3)
      (hasbanana monkey)
    Delta-state:
      adds: (hasbanana monkey)

Figure 4: Observation-2.

    (Operator OP-1
      (preconds ((<x> Location) (<y> Location))
        (and (at banana <x>) (hungry monkey) (at box <x>)
             (at waterfountain <y>) (hasknife monkey) (onbox <x>)))
      (effects
        ((add (hasbanana monkey)))))

Figure 5: OP-1 after Observation-1 and Observation-2.

If more than one operator accounts for the observation, we cannot tell which operator the observed agent is using. We leave the operators unmodified and do not learn anything from this observation.
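Continuing the sketch above (and reusing Observation, variablize, accounts_for, and hypothesize from it), precondition dropping can be approximated as keeping only those preconditions that some literal in the new pre-state supports. The `supports` test below sidesteps the full unification and binding consistency a real matcher would need, so it is illustrative only.

```python
def supports(precond, ground_lit):
    """True when `ground_lit` can instantiate `precond`: same predicate and
    arity, and every constant argument equal; variable positions (written
    "<...>") act as wildcards."""
    return (len(precond) == len(ground_lit) and precond[0] == ground_lit[0]
            and all(a.startswith("<") or a == b
                    for a, b in zip(precond[1:], ground_lit[1:])))

def generalize_operator(op, obs):
    """Drop every precondition that nothing in the new pre-state supports.
    Over a series of observations the preconditions thus converge to the
    intersection of the pre-states the operator accounts for."""
    op.preconds = {p for p in op.preconds
                   if any(supports(p, g) for g in obs.pre_state)}
    op.observations.append(obs)
    return op

def learn_from_observation(obs, operators, typed_objects):
    """One observation: hypothesize a new operator, generalize the unique
    matching one, or learn nothing if the match is ambiguous."""
    matching = [op for op in operators if accounts_for(op, obs, typed_objects)]
    if not matching:
        return hypothesize(obs, operators, typed_objects)
    if len(matching) == 1:
        return generalize_operator(matching[0], obs)
    return None            # several operators account for it: leave them alone
```

Applied to Observation-2, this drops (thirsty monkey) and (at glass <z>) from OP-1, giving the operator of Figure 5.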
3.2.2 Planning with incomplete or incorrect operators

PRODIGY's standard planner assumes a correct action model, but the operators hypothesized from observations may be incorrect in the following ways:

1. The preconditions can be too specific. This is because we take the preconditions of an operator to be the complete state in which the operator was applied by the observed agent, yet parts of that state are irrelevant to the operator's application.
2. Different operators can have exactly the same effects. Our system would incorrectly consider them to be the same operator and generalize them into a single wrong operator.
3. The effects of an operator can be masked by the pre-state. For example, if a literal should be added by an operator but is already true in the pre-state of the observation, we cannot learn that this literal is added by the operator.

Planning with such incorrect operators presents a number of problems:

1. When PRODIGY is planning in its mental model, an operator is applicable if and only if all its preconditions are satisfied. Therefore, when an operator has unnecessary preconditions, PRODIGY has to achieve those unnecessary preconditions in order to apply the operator. Not only does this force PRODIGY to do unnecessary search, but it can also make many solvable problems unsolvable.
2. When we have an over-generalized operator, the operator may be wrongly considered applicable in PRODIGY's mental model even though it should not be applicable.
3. When PRODIGY applies an operator in its mental model that misses some effects, the state becomes incorrect.

In spite of all these problems, we would like the system to be able to solve problems with the hypothesized operators. In order to cope with these difficulties, we allow the system to experiment in the environment. We also distinguish two types of applicability: applicable in the mental model, when all the preconditions of the operator are satisfied, and applicable in the environment, when the operator is successfully applied while executing it in the environment. Operators are now applied in the environment instead of in PRODIGY's mental model.² Operators are not applied in the environment randomly for exploration, nor are they applied only when they are applicable in PRODIGY's mental model. Instead, an operator is tested in the environment whenever it is needed by the planner to achieve some goal. The operator may not be applicable in the environment when we test it, but the difference between the expectation in the mental model and the actual outcome in the environment gives the system an opportunity to confirm or modify the hypothesized operators. Figure 6 shows the modified planning algorithm that interacts with the environment and plans with incorrect operators. Only part of step 4 differs from the algorithm in Figure 1. Because the operators are now applied in the environment instead of in PRODIGY's mental model, we cannot backtrack once an operator is applied.

² Since our system is not connected to a robot, the operator is applied in a simulated environment that behaves according to the complete and correct set of operators in the domain.

    1. Check if the goal statement is true in the current state. If it is, a solution has been found; return the plan.
    2. Choose a goal from the set of pending goals.
    3. Find an operator to achieve the goal.
    4. Apply the operator in the environment. Modify the operator based on the expectation and the result of applying the operator in the environment. If the operator is not applicable, add the preconditions of the operator to the set of pending goals. Go to 1.

Figure 6: Modified basic search algorithm in PRODIGY.

3.2.3 Confirming operators during execution

When the system interacts with the environment by testing the hypothesized operators in it, one of the four scenarios shown in Figure 7 is possible, and the operator is modified accordingly in each scenario. Figure 7 gives a brief overview of the modifications; details follow.

    Applicable in    Applicable in      How to modify the operator
    mental model     the environment
    no               no                 Do not change the operator.
    no               yes                Some preconditions in the operator are
                                        unnecessary; drop them.
    yes              no                 Split the operator, because we have
                                        generalized the preconditions of two
                                        distinct operators into one.
    yes              yes                Modify the operator only if the expected
                                        effects and the real effects differ.

Figure 7: Modifying an operator in each scenario.

If the operator is applicable neither in the mental model nor in the environment, then there is no difference between our expectation and the outcome, so we do not modify the operator.

If the operator is not applicable in the mental model but is applicable in the environment when we test it, then some of the preconditions of the operator are unnecessary. The preconditions that are not satisfied in the current state are not necessary, and we drop them.

If the operator is applicable in the mental model but not applicable in the environment, then the preconditions of the operator are too general: we are missing some of the preconditions. This happens when the hypothesized operator is a generalization of two distinct operators in the domain that have the same effects. For example, suppose that in some domain we have two correct operators, REAL-OP-1 and REAL-OP-2, shown in Figure 8, and that Observation-3 and Observation-4 in Figure 9 are observed by the system. They are instances of applications of REAL-OP-1 and REAL-OP-2, respectively. Because these two observations have the same effects, the learning system considers them applications of the same operator, and forms operator OP-3 from Observation-3 and Observation-4 (Figure 9). The precondition of OP-3 is the intersection of the pre-states of Observation-3 and Observation-4. If the current state is (p0), (p1), then OP-3 is applicable in the system's mental model but not in the environment. We need to split OP-3 into two operators, each corresponding to one observation. The new operators after splitting are shown in Figure 10.

    (Operator REAL-OP-1
      (preconds ()
        (and (p0) (p1) (p2)))
      (effects ()
        ((add (p4)))))

    (Operator REAL-OP-2
      (preconds ()
        (and (p1) (p3)))
      (effects ()
        ((add (p4)))))

Figure 8: Two operators with the same effects.

    Observation-3:
      Pre-state:   (p0) (p1) (p2)
      Post-state:  (p0) (p1) (p2) (p4)
      Delta-state: adds: (p4)

    Observation-4:
      Pre-state:   (p1) (p3)
      Post-state:  (p1) (p3) (p4)
      Delta-state: adds: (p4)

    (Operator OP-3
      (preconds () (p1))
      (effects ((add (p4)))))

Figure 9: Two observations and the hypothesized operator.

    (Operator OP-4
      (preconds ()
        (and (p0) (p1) (p2)))
      (effects ()
        ((add (p4)))))

    (Operator OP-5
      (preconds ()
        (and (p1) (p3)))
      (effects ()
        ((add (p4)))))

Figure 10: Two operators resulting from splitting OP-3.

In the last case, the operator is applicable in the mental model as well as in the environment. If the effects of applying the operator in the environment are the same as expected, we do not modify the operator. But if there are unexpected effects when the operator is applied in the environment, then we have found effects that were previously masked, and we augment the effects of the operator to accommodate the new effects.

The three learning steps described above, hypothesizing operators from observations, planning with incomplete or incorrect operators, and confirming operators during plan execution, are interleaved in our system. The system learns from observations and from its own interaction with the environment during problem solving, and incrementally forms a more accurate action model of the domain.
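Tying the pieces together, the sketch below (reusing Operator, variablize, and supports from the earlier sketches) shows one way step 4 of Figure 6 and the four cases of Figure 7 could be realized. The `execute` argument is an assumed stand-in for acting in the simulated environment, returning the set of literals actually added or None when the action fails; none of these names come from the paper's implementation.

```python
def confirm_or_modify(op, state, execute, operators, typed_objects):
    """Step 4 of the modified loop (Figure 6) with the four cases of Figure 7.
    `state` is the set of ground literals currently true; returns the new state."""
    mentally_ok = all(any(supports(p, g) for g in state) for p in op.preconds)
    added = execute(op, state)            # act in the (simulated) environment
    env_ok = added is not None

    if not mentally_ok and not env_ok:
        return state                      # case 1: expectation met, change nothing

    if not mentally_ok and env_ok:
        # case 2: unnecessary preconditions -- keep only those satisfied now
        op.preconds = {p for p in op.preconds
                       if any(supports(p, g) for g in state)}
        return state | added

    if mentally_ok and not env_ok:
        # case 3: two distinct operators were merged -- split them apart again,
        # one new operator per observation, each with its own full pre-state
        operators.remove(op)
        for obs in op.observations:
            operators.append(Operator(
                name=f"{op.name}-split-{len(operators) + 1}",
                preconds=variablize(obs.pre_state, typed_objects),
                adds=variablize(obs.delta, typed_objects),
                observations=[obs]))
        return state

    # case 4: applicable in both -- unexpected additions are previously masked
    # effects; augment the operator (a full system would variablize them)
    masked = {g for g in added if not any(supports(a, g) for a in op.adds)}
    op.adds |= masked
    return state | added                  # execution cannot be backtracked
```

On the example of Figures 8-10, case 3 removes OP-3 and reintroduces one operator per stored observation, yielding operators shaped like OP-4 and OP-5.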
3.3 The Current Implementation

The learning mechanism is domain-independent, and has been tested in a few domains in PRODIGY: the blocksworld, Monkey-and-Bananas, an extended version of the stripsworld, logistics planning, etc. In each of these domains, our system can solve simple problems after seeing each operator once, and can further generalize operators during its own plan execution. But due to the lack of control knowledge in the system, it cannot solve more complicated problems.

4 Related Work

There is considerable interest in learning for planning systems. Most of this work concentrates on learning search control knowledge [Mitchell et al., 1990; Minton, 1988; Veloso, 1992]. There is also work on learning domain knowledge. EXPO [Gil, 1992] is a learning-by-experimentation module for refining incomplete domain knowledge. Learning is triggered when plan execution monitoring detects a divergence between internal expectations and external observations. Our system differs in that learning is triggered when sequences of actions of other agents are observed. The initial knowledge given to our system is also much less than that in EXPO: our system learns domain knowledge without knowing any operators initially, whereas EXPO relies on initial partial knowledge of the operators. LIVE [Shen, 1989] is another system that learns from its environment. It is designed for exploration and discovery, and can formulate new operators as well as new terms. Our system, on the other hand, concentrates on learning from other agents.

There has also been work on learning by observing users. LEAP [Mitchell et al., 1990] is a learning apprentice system in the domain of VLSI design. It is able to learn new rules for decomposing circuit modules, as well as rules for choosing among alternative decompositions. The learning method requires strong domain knowledge. CAP [Jourdan et al., 1991] is a learning apprentice for calendar management, which allows users to schedule meetings and provides advice regarding parameters such as the meeting time and duration. Both of these systems are domain-dependent; our system, on the contrary, is domain-independent.

5 Conclusions and Future Work

In this paper, we have described a framework for learning action models in the context of a general-purpose planner. The system automatically learns domain knowledge by observing other agents, and by interacting with the environment during problem solving. The action model that the system learns may be incomplete and incorrect, and planning with such a model is more difficult. In spite of this, the system is able to solve problems using the incomplete and incorrect model of the domain. For more complicated problems, the search space becomes very large, and search control knowledge becomes essential for problem solving. Currently, we are working on learning search control knowledge by observing other agents: we have developed a method for inferring the problem-solving structure of the observed agent, and we are working on learning search control knowledge by analyzing that inferred structure.

Our system has many limitations. It assumes a perfect environment in which all the actions are deterministic, sensors that can observe the complete state, and sensors that are noiseless. To apply our learning technique to the field of robotics, these assumptions need to be relaxed. Making the system more applicable to real domains is a focus of our future work.
Acknowledgments

This work was done under the supervision of Professor Jaime Carbonell. It greatly benefited from discussions with Alicia Pérez, Richard Goodwin, Manuela Veloso, and members of the PRODIGY group.

References

[Carbonell et al., 1990] Jaime G. Carbonell, Craig A. Knoblock, and Steven Minton. PRODIGY: An Integrated Architecture for Planning and Learning. In Kurt VanLehn, editor, Architectures for Intelligence. Hillsdale, New Jersey, 1990.

[Dent, 1990] Lisa Dent and Jeffrey Schlimmer. Learning from Indifferent Agents. In Proceedings of the Twelfth Annual Conference of the Cognitive Science Society, pages 780-787, Boston, Massachusetts, August 1990.

[Fikes and Nilsson, 1971] Richard E. Fikes and Nils J. Nilsson. STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artificial Intelligence, pages 189-208, 1971.

[Gil, 1992] Yolanda Gil. Acquiring Domain Knowledge for Planning by Experimentation. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1992.

[Jourdan et al., 1991] J. Jourdan, L. Dent, J. McDermott, T. Mitchell, and D. Zabowski. Interfaces that Learn: A Learning Apprentice for Calendar Management. Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1991.

[Minton, 1988] Steven N. Minton. Learning Search Control Knowledge: An Explanation-Based Approach. Kluwer Academic Publishers, 1988.

[Mitchell et al., 1990] T. Mitchell, P. Utgoff, and R. Banerji. Learning by Experimentation: Acquiring and Refining Problem-Solving Heuristics. In Machine Learning: An Artificial Intelligence Approach, Volume I. Tioga Press, Palo Alto, CA, 1983.

[Mitchell et al., 1990] T. Mitchell et al. LEAP: A Learning Apprentice System for VLSI Design. In Machine Learning: An Artificial Intelligence Approach, Volume III. Morgan Kaufmann, Irvine, CA, 1990.

[Shen, 1989] Wei-Min Shen. Learning from the Environment Based on Percepts and Actions. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1989.

[Veloso, 1992] Manuela M. Veloso. Learning by Analogical Reasoning in General Problem Solving. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1992.