Learning Action Models By Observing Other Agents *

From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Xuemei Wang
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
Abstract
Autonomous agents should have the ability to learn from the environment. The actions of other agents in the environment provide a useful source for such learning [Dent, 1990]. This paper presents a framework for learning action models by observing other agents in the context of PRODIGY [Carbonell et al., 1990], a general-purpose planner. A domain in PRODIGY is specified by a set of operators that represent actions. The system learns the operators incrementally through a closed-loop integration of observing other agents, learning, planning, and executing plans in the environment.

*This research was sponsored by the Avionics Laboratory, Wright Research and Development Center, Aeronautical Systems Division (AFSC), U.S. Air Force, Wright-Patterson AFB, OH 45433-6543 under Contract F33615-90-C-1465, ARPA Order No. 7597. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
1 Introduction
We believe that an autonomous intelligent agent should be able to interact with the environment, to observe actions of other agents, to learn from them, and to use the learned knowledge to solve new problems. A general-purpose planner provides a framework in which the agent can learn an action model and test the model while solving encountered problems. In this paper, we present a framework for learning action models for planning domains by observing other agents. Our system observes sequences of changes in the environment caused by the actions of other agents, and hypothesizes operators of the domain based on the observations. While the system is given new problems to solve, it uses the hypothesized operators, and modifies and confirms them during its own plan execution.
2 Background: The PRODIGY Problem Solver

PRODIGY [Carbonell et al., 1990] is a domain-independent problem solver that uses means-ends analysis. As in STRIPS [Fikes and Nilsson, 1971], PRODIGY's operators have preconditions and effects. The precondition expressions specify what must be satisfied in the state before the operator can be applied. The effects describe how the application of the operator changes the world. A state is represented as a conjunction of literals, where a literal is a formula or its negation. Given an initial state and a goal expression, PRODIGY searches for a sequence of operators that will transform the initial state into a state that matches the goal expression. Figure 1 shows the basic search algorithm in PRODIGY. Notice that the operators are applied in PRODIGY's mental model, not in the environment. PRODIGY backtracks if the current path fails to find a solution.

1. Check if the goal statement is true in the current state. If it is, a solution has been found; return the plan.
2. Choose a goal from the set of pending goals.
3. Find an operator to achieve the goal.
4. Apply the operator if applicable. Otherwise, add the preconditions of the operator to the set of pending goals. Go to 1.

Figure 1: Basic search algorithm in PRODIGY.
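To make the control flow of Figure 1 concrete, the following Python sketch mimics the loop over pending goals under simplifying assumptions: states are sets of ground literals, operators have only conjunctive add and delete effects, and backtracking is omitted. The names (Operator, plan) are illustrative and are not part of PRODIGY.

from typing import FrozenSet, List, NamedTuple, Optional, Tuple

Literal = Tuple[str, ...]          # e.g. ("at", "box", "p3")

class Operator(NamedTuple):
    name: str
    preconds: FrozenSet[Literal]   # conjunctive, non-negated
    adds: FrozenSet[Literal]
    dels: FrozenSet[Literal]

def plan(state: FrozenSet[Literal], goals: FrozenSet[Literal],
         operators: List[Operator], depth: int = 20) -> Optional[List[str]]:
    # Greedy sketch of Figure 1: achieve pending goals in the mental model.
    steps: List[str] = []
    pending = set(goals)
    for _ in range(depth):
        # Step 1: if the goal statement holds in the current state, return the plan.
        if goals <= state:
            return steps
        # Step 2: choose a pending goal that is not yet satisfied.
        goal = next((g for g in pending if g not in state), None)
        if goal is None:
            pending = set(goals)
            continue
        # Step 3: find an operator whose add-list achieves the goal.
        op = next((o for o in operators if goal in o.adds), None)
        if op is None:
            return None                         # no operator achieves this goal
        # Step 4: apply it if applicable, otherwise subgoal on its preconditions.
        if op.preconds <= state:
            state = (state - op.dels) | op.adds  # applied in the mental model
            steps.append(op.name)
            pending.discard(goal)
        else:
            pending |= op.preconds - state
    return None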
3 Learning by Observing Other Agents in PRODIGY

This section describes in detail the mechanism for learning by observation in the context of PRODIGY.

3.1 Learning by Observation: The Problem and the Hypothesis

Our learning system has two sources of input for learning: (a) the observed sequences of state changes caused by the actions of other agents, and (b) the observed state changes caused by its own actions during plan execution when solving new problems. Less initial knowledge built into the system means a more general learning mechanism; we therefore prefer systems with as little initial knowledge as possible. The initial knowledge in our system consists of the types of the objects in the domain and the predicates available for representing states. No operator model is given to the system.

We assume that everything in the state is observable. We also assume that the operators and the environment are deterministic: applying the same operator in the exact same state has the exact same outcome. Under such assumptions, the learning system learns domains whose operators have conjunctive, non-negated preconditions and no conditional effects.
3.2 Learning by Observation: The Methods
Each observation of a state change from the above two input sources gives the system an opportunity to hypothesize a new operator, to generalize an existing operator, or to recover from an incorrect generalization. The hypothesization, generalization, and recovery methods are described in detail in this section.
3.2.1 Hypothesizing operators from observations

The first source of input is the observed sequences of state changes caused by the actions of other agents. An observation consists of a pre-state and a post-state. The pre-state is the state before the action is executed by the observed agent, and the post-state is the state after the action is executed. The delta-state of an observation is computed by finding the difference between the post-state and the pre-state. Given an observation, the system matches the delta-state of the observation against each operator it has learned so far (there are no such operators initially). An operator accounts for the observation if the delta-state of the observation and the effects of the operator match.

If no operator accounts for the observation, we hypothesize a new operator whose precondition is the pre-state of the observation and whose effect is the delta-state of the observation. We generalize the objects in the state to variables of the most specific types¹. Since the system initially starts with no operator models, all operators are added to the system this way. For example, suppose the system is now learning in the Monkey-and-Bananas domain. After the initial observation shown in Figure 2, it forms the operator OP-1 shown in Figure 3.

¹Objects that do not change from problem to problem are treated as constants and are not generalized.
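For concreteness, a minimal Python sketch of these definitions follows. It is illustrative only: states are sets of ground literals, deleted literals are ignored, and the generalization of constants to typed variables is omitted. All names are hypothetical.

from typing import FrozenSet, NamedTuple, Tuple

Literal = Tuple[str, ...]            # e.g. ("at", "banana", "p3")

class Observation(NamedTuple):
    pre_state: FrozenSet[Literal]
    post_state: FrozenSet[Literal]

class HypothesizedOp(NamedTuple):
    name: str
    preconds: FrozenSet[Literal]     # conjunctive, non-negated
    adds: FrozenSet[Literal]         # delta-state (added literals only in this sketch)

def delta_state(obs: Observation) -> FrozenSet[Literal]:
    # Literals that appear in the post-state but not in the pre-state.
    return obs.post_state - obs.pre_state

def accounts_for(op: HypothesizedOp, obs: Observation) -> bool:
    # An operator accounts for an observation if its effects match the delta-state.
    return op.adds == delta_state(obs)

def hypothesize(obs: Observation, name: str) -> HypothesizedOp:
    # New operator: precondition = whole pre-state, effect = delta-state.
    return HypothesizedOp(name, obs.pre_state, delta_state(obs))

# Observation-1 of Figure 2, abbreviated to four literals:
obs1 = Observation(
    pre_state=frozenset({("thirsty", "monkey"), ("at", "banana", "p3"),
                         ("hungry", "monkey"), ("onbox", "p3")}),
    post_state=frozenset({("thirsty", "monkey"), ("at", "banana", "p3"),
                          ("hungry", "monkey"), ("onbox", "p3"),
                          ("hasbanana", "monkey")}))
op1 = hypothesize(obs1, "OP-1")      # op1.adds == {("hasbanana", "monkey")}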
Pre-state:
    (thirsty monkey)       (at banana p3)
    (at glass p6)          (hungry monkey)
    (at box p3)            (at waterfountain p5)
    (hasknife monkey)      (onbox p3)
Post-state:
    (thirsty monkey)       (at banana p3)
    (at glass p6)          (hungry monkey)
    (at box p3)            (at waterfountain p5)
    (hasknife monkey)      (onbox p3)
    (hasbanana monkey)
Delta-state:
    adds: (hasbanana monkey)

Figure 2: Observation-1.

(Operator OP-1
  (preconds ((<x> Location)
             (<y> Location)
             (<z> Location))
    (and (thirsty monkey)
         (at banana <x>)
         (at glass <z>)
         (hungry monkey)
         (at box <x>)
         (at waterfountain <y>)
         (hasknife monkey)
         (onbox <x>)))
  (effects ((add (hasbanana monkey)))))

Figure 3: Hypothesized operator OP-1 from observation-1.

If exactly one operator that accounts for the observation is found, the system modifies the operator by dropping the preconditions that do not hold in the pre-state of the observation. For example, given the observation shown in Figure 4, we find that OP-1 accounts for this observation. We drop the preconditions (thirsty monkey) and (at glass <z>) of operator OP-1 because they are not satisfied under observation-2. The modified operator is shown in Figure 5. The modified operator will be further generalized if the system sees other observations that are accounted for by it. The preconditions of this operator will be the intersection of the pre-states of all the observations that will be accounted for by this operator.

Pre-state:
    (at banana p3)         (hungry monkey)
    (at box p3)            (at waterfountain p5)
    (hasknife monkey)      (onbox p3)
Post-state:
    (at banana p3)         (hungry monkey)
    (at box p3)            (at waterfountain p5)
    (hasknife monkey)      (onbox p3)
    (hasbanana monkey)
Delta-state:
    adds: (hasbanana monkey)

Figure 4: Observation-2.
(Operator OP-1
  (preconds ((<x> Location)
             (<y> Location))
    (and (at banana <x>)
         (hungry monkey)
         (at box <x>)
         (at waterfountain <y>)
         (hasknife monkey)
         (onbox <x>)))
  (effects ((add (hasbanana monkey)))))

Figure 5: OP-1 after observation-1 and observation-2.
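A minimal sketch of this generalization step, again over ground literals rather than the variablized operators the system actually keeps: the surviving preconditions are the intersection of the operator's current preconditions with the new pre-state, which after several observations equals the intersection of all supporting pre-states. Names are illustrative.

from typing import FrozenSet, NamedTuple, Tuple

Literal = Tuple[str, ...]

class HypothesizedOp(NamedTuple):
    name: str
    preconds: FrozenSet[Literal]
    adds: FrozenSet[Literal]

def generalize(op: HypothesizedOp, pre_state: FrozenSet[Literal]) -> HypothesizedOp:
    # Drop the preconditions that do not hold in the new observation's pre-state.
    return op._replace(preconds=op.preconds & pre_state)

# OP-1 (Figure 3) against the pre-state of observation-2 (Figure 4), abbreviated:
op1 = HypothesizedOp(
    "OP-1",
    preconds=frozenset({("thirsty", "monkey"), ("at", "glass", "p6"),
                        ("hungry", "monkey"), ("onbox", "p3")}),
    adds=frozenset({("hasbanana", "monkey")}))
pre2 = frozenset({("hungry", "monkey"), ("onbox", "p3")})
op1 = generalize(op1, pre2)   # (thirsty monkey) and (at glass p6) are dropped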
If more than one operator accounts for the observation, we cannot tell which operator the observed agent is using. We leave the operators unmodified and do not learn anything from this observation.
3.2.2 Planning with incomplete or incorrect operators
PRODIGY's standard planner assumes a correct action model. But the hypothesized operators based on the observations may be incorrect in the following ways:

1. The preconditions can be too specific. This is because we assume that the preconditions of the operators are the complete state in which the operators were applied by the observed agent. But parts of the state are irrelevant to the operator application.
2. Different operators can have exactly the same effects. Our system would incorrectly consider them as the same operator and generalize them to a single wrong operator.
3. The effects of an operator can be masked by the pre-state. For example, if a literal should be added by an operator but is already true in the pre-state of the observation, we cannot learn that this literal should be added by the operator.

Planning with the incorrect operators presents a number of problems:

1. When PRODIGY is planning in its mental model, an operator is applicable if and only if all its preconditions are satisfied. Therefore, when an operator has unnecessary preconditions, PRODIGY has to achieve those unnecessary preconditions in order to apply the operator. Not only does this force PRODIGY to do unnecessary search, but it can also make many solvable problems unsolvable.
2. When we have an over-generalized operator, the operator may be wrongly considered applicable in PRODIGY's mental model even though it should not be applicable.
3. When PRODIGY applies an operator in its mental model that misses some effects, the state becomes incorrect.
In spite of all these problems, we would like the system to be able to solve problems with the hypothesized operators. In order to cope with these difficulties, we allow the system to experiment in the environment. We also distinguish two types of applicability: applicable in the mental model, when all the preconditions of the operator are satisfied, and applicable in the environment, when the operator is successfully applied while executing it in the environment. Operators are now applied in the environment instead of in PRODIGY's mental model². Operators are not applied in the environment randomly for exploration, nor are they applied only when they are applicable in PRODIGY's mental model. Instead, an operator is tested in the environment whenever it is needed by the planner to achieve some goal. The operator may not be applicable in the environment when we test it, but the difference between the expectation in the mental model and the actual outcome in the environment gives the system an opportunity to confirm or modify the hypothesized operators. Figure 6 shows the modified planning algorithm that interacts with the environment and plans with incorrect operators. Only part of step 4 is different from the algorithm in Figure 1. As the operators are now applied in the environment instead of in PRODIGY's mental model, we cannot backtrack once an operator is applied.

²Since our system is not connected to a robot, the operator is applied in a simulated environment that behaves according to the complete and correct set of operators in the domain.

1. Check if the goal statement is true in the current state. If it is, a solution has been found; return the plan.
2. Choose a goal from the set of pending goals.
3. Find an operator to achieve the goal.
4. Apply the operator in the environment. Modify the operator based on the expectation and result of applying the operator in the environment. If the operator is not applicable, add the preconditions of the operator to the set of pending goals. Go to 1.

Figure 6: Modified basic search algorithm in PRODIGY.
3.2.3 Confirming operators during execution

When the system interacts with the environment by testing the hypothesized operators in it, one of the four scenarios shown in Figure 7 is possible. The operator is modified accordingly for each scenario. Figure 7 gives a brief overview of the modifications. Details follow.

Operator applicable   Operator applicable   How to modify the operator
in mental model       in the environment
no                    no                    Do not change the operator.
no                    yes                   Some preconditions in the operator are unnecessary; drop them.
yes                   no                    Split the operator, because we have generalized the preconditions of two distinct operators as one.
yes                   yes                   Modify the operator only if the expected effects and the real effects differ.

Figure 7: Modifying an operator for each scenario.

If the operator is not applicable in the mental model or in the environment, then there is no difference between our expectation and the outcome. Therefore, we do not modify the operator.

If the operator is not applicable in the mental model, but applicable in the environment when we test it, then some of the preconditions in the operator are unnecessary. The preconditions that are not satisfied in the current state are not necessary, and we drop them.

If the operator is applicable in the mental model, but not applicable in the environment, then it is because the preconditions of the operator are too general. We are missing some of the preconditions. This happens if the hypothesized operator is a generalization of two distinct operators in the domain that have the same effects. For example, suppose in some domain we have two correct operators REAL-OP-1 and REAL-OP-2, shown in Figure 8. Suppose also that observation-3 and observation-4 in Figure 9 are observed by the system. They are instances of applications of operators REAL-OP-1 and REAL-OP-2, respectively. Because these two observations have the same effects, they are considered as applications of the same operator by the learning system. Operator OP-3 is formed from observation-3 and observation-4 (Figure 9). The precondition of OP-3 is the intersection of the pre-states of observation-3 and observation-4. If the current state is (p0), (p1), OP-3 is applicable in the system's mental model, but not in the environment. We need to split OP-3 into two operators, each corresponding to one observation. The new operators after splitting are shown in Figure 10.
(Operator REAL-OP-1
  (preconds ()
    (and (p0) (p1) (p2)))
  (effects ()
    ((add (p4)))))

(Operator REAL-OP-2
  (preconds ()
    (and (p1) (p3)))
  (effects ()
    ((add (p4)))))

Figure 8: Two operators with the same effects.
Observation-3:
  Pre-state:   (p0) (p1) (p2)
  Post-state:  (p0) (p1) (p2) (p4)
  Delta-state: adds: (p4)

Observation-4:
  Pre-state:   (p1) (p3)
  Post-state:  (p1) (p3) (p4)
  Delta-state: adds: (p4)

(Operator OP-3
  (preconds ()
    (p1))
  (effects ((add (p4)))))

Figure 9: Two observations and the hypothesized operator.
(Operator OP-4
  (preconds ()
    (and (p0) (p1) (p2)))
  (effects ()
    ((add (p4)))))

(Operator OP-5
  (preconds ()
    (and (p1) (p3)))
  (effects ()
    ((add (p4)))))

Figure 10: Two operators resulting from splitting OP-3.
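A minimal sketch of the splitting step follows, under the assumption (not stated explicitly here) that the learner keeps the observations each hypothesized operator has accounted for; each new operator takes the full pre-state of one supporting observation as its preconditions. All names are illustrative.

from typing import FrozenSet, List, NamedTuple, Tuple

Literal = Tuple[str, ...]

class Observation(NamedTuple):
    pre_state: FrozenSet[Literal]
    post_state: FrozenSet[Literal]

class HypothesizedOp(NamedTuple):
    name: str
    preconds: FrozenSet[Literal]
    adds: FrozenSet[Literal]

def split(op: HypothesizedOp, supporting_obs: List[Observation]) -> List[HypothesizedOp]:
    # Undo an over-generalization: rebuild one operator per supporting observation,
    # each with that observation's full pre-state as preconditions
    # (cf. OP-3 being split into OP-4 and OP-5 in Figure 10).
    return [HypothesizedOp(f"{op.name}-{i}", obs.pre_state, op.adds)
            for i, obs in enumerate(supporting_obs, start=1)]

obs3 = Observation(frozenset({("p0",), ("p1",), ("p2",)}),
                   frozenset({("p0",), ("p1",), ("p2",), ("p4",)}))
obs4 = Observation(frozenset({("p1",), ("p3",)}),
                   frozenset({("p1",), ("p3",), ("p4",)}))
op3 = HypothesizedOp("OP-3", frozenset({("p1",)}), frozenset({("p4",)}))
op4, op5 = split(op3, [obs3, obs4])   # preconds {(p0),(p1),(p2)} and {(p1),(p3)}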
In the last case, the operator is applicable in the mental model as well as in the environment. If the effects of applying the operator in the environment are the same as expected, we do not modify the operator. But if there are unexpected effects when the operator is applied in the environment, then we have found the effects that were previously masked. We augment the effects of the operator to accommodate the new effects.

The three learning steps described above: hypothesizing operators from observations, planning with incomplete or incorrect operators, and confirming operators during plan execution, are interleaved in our system. The system incrementally learns from observations and from its own interaction with the environment during problem solving. It incrementally forms a more accurate action model of the domain.
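Read together, the four scenarios of Figure 7 amount to a single update rule. The sketch below is illustrative only and works over ground literals: it assumes a mutable operator record with preconds and adds sets, an outcome that is the state returned by the environment (or None if the action failed there), and a split_fn callback like the one sketched above.

from typing import Callable, FrozenSet, Optional, Set, Tuple

Literal = Tuple[str, ...]
State = FrozenSet[Literal]

class LearnedOp:
    # Mutable operator record used only for this sketch.
    def __init__(self, name: str, preconds: Set[Literal], adds: Set[Literal]):
        self.name, self.preconds, self.adds = name, set(preconds), set(adds)

def confirm(op: LearnedOp, pre: State, outcome: Optional[State],
            split_fn: Callable[[LearnedOp], None]) -> None:
    # The four scenarios of Figure 7.
    applicable_mentally = op.preconds <= pre
    applicable_in_env = outcome is not None
    if not applicable_mentally and not applicable_in_env:
        return                                # nothing contradicts the model
    if not applicable_mentally and applicable_in_env:
        op.preconds &= pre                    # drop unnecessary preconditions
    elif applicable_mentally and not applicable_in_env:
        split_fn(op)                          # over-general: split the operator
    else:                                     # applicable in both
        op.adds |= (outcome - pre) - op.adds  # add previously masked effects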
3.3 The Current Implementation

The learning mechanism is domain-independent, and has been tested in a few domains in PRODIGY: blocksworld, Monkey-and-Bananas, an extended version of the stripsworld, logistics planning, etc. In each of these domains, our system can solve simple problems after seeing each operator once, and can further generalize operators during its plan execution. But due to the lack of control knowledge in the system, it cannot solve more complicated problems.

4 Related Work

There is considerable interest in learning for planning systems. Most of this work concentrates on learning search control knowledge [Mitchell et al., 1990; Minton, 1988; Veloso, 1992]. There is also work on learning domain knowledge. EXPO [Gil, 1992] is a learning-by-experimentation module for refining incomplete domain knowledge. Learning is triggered when plan execution monitoring detects a divergence between internal expectations and external observations. Our system differs in that learning is triggered when sequences of actions of other agents are observed. The initial knowledge that is given to our system is much less than that in EXPO. Our system learns domain knowledge without knowing any operators initially, whereas EXPO relies on an initial partial knowledge of the operators.

LIVE [Shen, 1989] is another system that learns from its environment. It is designed for exploration and discovery. It can formulate new operators as well as new terms. Our system, on the other hand, concentrates on learning from other agents.

There has also been work on learning by observing users. LEAP [Mitchell et al., 1990] is a learning apprentice system in the domain of VLSI design. It is able to learn new rules for decomposing circuit modules, as well as rules for choosing among alternative decompositions. The learning method requires strong domain knowledge. CAP [Jourdan et al., 1991] is a learning apprentice for calendar management, which allows users to schedule meetings and provides advice regarding parameters such as the meeting time and duration. Both these systems are domain-dependent. Our system, on the contrary, is domain-independent.
5 Conclusions and Future Work

In this paper, we have described a framework for learning action models in the context of a general-purpose planner. The system automatically learns domain knowledge by observing other agents, and by interacting with the environment during problem solving. The action model that the system learns may be incomplete and incorrect. Planning with such a model is more difficult. In spite of all this, the system is able to solve problems using the incomplete and incorrect model of the domain. In the case of more complicated problems, the search space becomes very large. Search control knowledge becomes essential for problem solving. Currently, we are working on learning search control knowledge by observing other agents. We have developed a method for inferring the problem-solving structure of the observed agent, and we are working on learning search control knowledge by analyzing the inferred problem-solving structure of the observed agent.

Our system has many limitations. It assumes a perfect environment in which all the actions are deterministic, sensors that can observe the complete state, and sensors that are noiseless. To apply our learning technique to the field of robotics, these assumptions need to be relaxed. Making the system more applicable to real domains is a focus of our future work.
Acknowledgments

This work was done under the supervision of Professor Jaime Carbonell. It greatly benefited from discussions with Alicia Pérez, Richard Goodwin, Manuela Veloso, and members of the PRODIGY group.
References

[Carbonell et al., 1990] Jaime G. Carbonell, Craig A. Knoblock, and Steven Minton. PRODIGY: An Integrated Architecture for Planning and Learning. In Kurt VanLehn, editor, Architectures for Intelligence. Hillsdale, New Jersey, 1990.

[Dent, 1990] Lisa Dent and Jeffrey Schlimmer. Learning from Indifferent Agents. In Proceedings of the Twelfth Annual Conference of the Cognitive Science Society, pages 780-787, Boston, Massachusetts, August 1990.

[Fikes and Nilsson, 1971] Richard E. Fikes and Nils J. Nilsson. STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artificial Intelligence, pages 189-208, 1971.

[Gil, 1992] Yolanda Gil. Acquiring Domain Knowledge for Planning by Experimentation. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1992.

[Jourdan et al., 1991] J. Jourdan, L. Dent, J. McDermott, T. Mitchell, and D. Zabowski. Interfaces that Learn: A Learning Apprentice for Calendar Management. Technical Report, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1991.

[Minton, 1988] Steven N. Minton. Learning Search Control Knowledge: An Explanation-Based Approach. Kluwer Academic Publishers, 1988.

[Mitchell et al., 1990] T. Mitchell, P. Utgoff, and R. Banerji. Learning by Experimentation: Acquiring and Refining Problem-Solving Heuristics. In Machine Learning, An Artificial Intelligence Approach, Volume I. Tioga Press, Palo Alto, CA, 1983.

[Mitchell et al., 1990] T. Mitchell et al. LEAP: A Learning Apprentice System for VLSI Design. In Machine Learning, An Artificial Intelligence Approach, Volume III. Morgan Kaufmann, Irvine, CA, 1990.

[Shen, 1989] Wei-Min Shen. Learning from Environment Based on Percepts and Actions. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1989.

[Veloso, 1992] Manuela M. Veloso. Learning by Analogical Reasoning in General Problem Solving. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1992.