Learning Continuous Perception-Action Models Through Experience

From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Ashwin Ram and Juan Carlos Santamaría
College of Computing
Georgia Institute of Technology
Atlanta, Georgia 30332-0280
{ashwin, carlos}@cc.gatech.edu
Autonomous robotic navigation is defined as the task of finding a path along which a robot can move safely from a source point to a destination point in an obstacle-ridden terrain and, often, executing the actions to carry out the movement in a real or simulated world. Several methods have been proposed for this task, ranging from high-level planning methods to reactive methods.
High-level planning methods use extensive world knowledge and inferences about the environment they interact with (see [Fikes et al., 1972; Georgeff, 1987; Maes, 1990; Sacerdoti, 1975]). Knowledge about available actions and their consequences is used to formulate a detailed plan before the actions are actually executed in the world. Such systems can successfully perform the path-finding required by the navigation task, but only if an accurate and complete representation of the world is available to the system. Considerable high-level knowledge is also needed to learn from planning experiences (see, for example, [Hammond, 1989; Minton, 1988; Mostow and Bhatnagar, 1987; Segre, 1988]). Such a representation is usually not available in real-world environments, which are complex and dynamic in nature. To build the necessary representations, a fast and accurate perception process is required to reliably map sensory inputs to high-level representations of the world. A second problem with high-level planning is the large amount of processing time required, resulting in significant slowdown and the inability to respond immediately to unexpected situations.
Situated or reactive control methods have been proposed as an alternative to high-level planning methods (see [Arkin, 1989; Brooks, 1986; Kaelbling, 1986; Payton, 1986]). In these methods, no planning is performed; instead, a simple sensory representation of the environment is used to select the next action that should be performed. Actions are represented as simple behaviors, which can be selected and executed rapidly, often in real time. These methods can cope with unknown and dynamic environmental configurations, but only those that lie within the scope of predetermined behaviors. It requires a great deal of careful design and tuning on the part of the human designer to develop the control systems that drive such robots, and even then these systems run into serious difficulties when faced with environments different from those that the designer anticipated. Furthermore, even if the designer could anticipate and model all the relevant aspects of the robot's operating environment, the dynamic nature of the real world would render parts of this model obsolete over time.
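Concretely, the behaviors in such systems are often in the style of motor schemas [Arkin, 1989], where each behavior emits a velocity vector and tunable gains control the blend. The sketch below is an illustrative assumption in that spirit, not the controller used in this paper; the two gains are exactly the kind of schema parameters discussed later.

```python
# Illustrative Arkin-style motor schema blend (an assumption, not the
# paper's controller): each behavior returns a velocity vector, and
# tunable gains weight their combination.
import math

def goal_attraction(pos, goal):
    # Unit vector pointing from the robot toward the goal.
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    d = math.hypot(dx, dy) or 1.0
    return (dx / d, dy / d)

def obstacle_avoidance(pos, obstacles, radius=2.0):
    # Sum of repulsive vectors from obstacles within the given radius.
    vx = vy = 0.0
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        if 0 < d < radius:
            vx += dx / d * (radius - d) / radius
            vy += dy / d * (radius - d) / radius
    return (vx, vy)

def combined_velocity(pos, goal, obstacles, goal_gain, avoid_gain):
    # The gains are the schema parameters a designer would hand-tune,
    # and which the learning system described below adjusts instead.
    gx, gy = goal_attraction(pos, goal)
    ax, ay = obstacle_avoidance(pos, obstacles)
    return (goal_gain * gx + avoid_gain * ax,
            goal_gain * gy + avoid_gain * ay)
```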
[Figure 1: System architecture.]
The ability to adapt to changes in the environment, and to learn from experiences, is crucial to adequate performance and survivability in the real world. We have developed a self-improving navigation system that uses reactive control for fast performance, augmented with multistrategy learning methods that allow the system to adapt to novel environments and to learn from its experiences (see figure 1). The system autonomously and progressively constructs representational structures that encapsulate its experiences into "cases" that are then used to aid the navigation task in two ways: they allow the system to dynamically select the appropriate robotic control behaviors in different situations, and they also allow the system to adapt selected behaviors to the immediate demands of the environment (see [Moorman and Ram, 1992; Ram et al., 1992; Ram and Santamaria, 1993a; Ram and Santamaria, 1993b] for further details).
The system's cases are automatically constructed using a hybrid case-based and reinforcement learning method without extensive high-level reasoning. The learning and reactive modules function in an integrated manner. The learning module is always trying to find a better model of the interaction of the system with its environment so that it can tune the reactive module to perform its function better. The reactive module provides feedback to the learning module so it can build a better model of this interaction. The behavior of the system is the result of an equilibrium point established by the learning module, which is trying to refine the model, and the environment, which is complex and dynamic in nature. This equilibrium may shift and need to be re-established if the environment changes drastically; however, the model is generic enough at any point to be able to deal with a very wide range of environments.
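One plausible shape for this integrated arrangement, sketched purely for illustration (all class and method names here are hypothetical), is a tight loop in which the reactive module acts on every cycle while the learning module continually retunes it:

```python
# Illustrative sketch of the integrated reactive/learning arrangement
# of figure 1. All class and method names are hypothetical.

class ReactiveModule:
    """Maps percepts directly to actions, with behavior controlled by
    a small set of tunable schema parameters (e.g., behavior gains)."""
    def __init__(self, schema_params):
        self.schema_params = dict(schema_params)

    def act(self, percept):
        ...  # select and execute a behavior; no planning involved

class LearningModule:
    """Watches the same interaction, refines its model of it, and
    pushes better schema parameters back into the reactive module."""
    def update(self, percept, reactive):
        ...  # refine the case library; retune reactive.schema_params

def run(robot, reactive, learner):
    # The system's behavior is the equilibrium of this loop: fast
    # reactive control on every cycle, continual tuning alongside it.
    while robot.alive():
        percept = robot.sense()
        robot.execute(reactive.act(percept))
        learner.update(percept, reactive)
```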
The learning method is based on a combination of ideas from case-based reasoning and learning, which deals with the issue of using past experiences to deal with and learn from novel situations, and from reinforcement learning, which deals with the issue of updating the content of the system's knowledge based on feedback from the environment (e.g., see [Sutton, 1992]). However, in traditional case-based planning systems (e.g., [Hammond, 1989]), learning and adaptation require a detailed model of the domain. This is exactly what reactive planning systems are trying to avoid. Earlier attempts to combine reactive control with classical planning systems (e.g., [Chien et al., 1991]) or explanation-based learning systems (e.g., [Mitchell, 1990]) also relied on deep reasoning and were typically too slow for the fast, reflexive behavior required in reactive control systems. Unlike these approaches, our method does not fall back on slow non-reactive techniques for improving reactive control.
Each case represents an observed regularity between a particular environmental configuration and the effects of different actions, and prescribes the values of the schema parameters that are most appropriate (as far as the system knows, based on its previous experience) for that environment. The learning module performs the following tasks in a cyclic manner: (1) perceive and represent the current environment; (2) retrieve a case whose input vector represents an environment most similar to the current environment; (3) adapt the schema parameter values in use by the reactive control module by installing the values recommended by the output vectors of the case; and (4) learn new associations and/or adapt existing associations represented in the case to reflect any new information gained through the use of the case in the new situation, to enhance the reliability of their predictions.
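As a minimal sketch, assuming hypothetical names and data structures (the paper does not publish its code), one iteration of this cycle might look like:

```python
# A minimal sketch of the perceive-retrieve-adapt-learn cycle; all
# names, vector contents, and structures are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Case:
    input_vec: list    # environment configuration the case encodes
    output_vec: list   # schema parameter values it recommends
    past_sims: list = field(default_factory=list)  # match history

def mse(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def learning_cycle_step(env_vec, memory, install_params, learn):
    # (1) env_vec is the perceived, vectorized current environment.
    # (2) Retrieve the case most similar to the current environment.
    case = min(memory, key=lambda c: mse(c.input_vec, env_vec))
    # (3) Adapt: install the schema parameter values the case
    #     recommends into the reactive control module.
    install_params(case.output_vec)
    # (4) Learn: update or split the case (see the learn step below).
    learn(case, env_vec)
```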
The perceive step builds a set of four input vectors, one for each sensory input described earlier, which are matched against the corresponding input vectors of the cases in the system's memory in the retrieve step. The case similarity metric is based on the mean squared difference between each of the vector values of the case over a trending window and the vector values of the environment. The best-match window is calculated using a reverse sweep over the time axis, similar to a convolution process, to find the relative position that matches best. The best matching case is handed to the adapt step, which selects the schema parameter values from the output vectors of the case and modifies the corresponding values of the reactive behaviors currently in use, using a reinforcement formula that takes the case similarity metric as a scalar reward. Thus the actual adaptations performed depend on the goodness of match between the case and the environment.
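A hedged sketch of this windowed matching follows. The single-vector simplification and all names are assumptions; the actual system matches four input vectors per case.

```python
# Sketch of the best-match window search: the case's stored series is
# swept backwards along the recent sensory history, convolution-style,
# and the alignment with the lowest mean squared difference wins.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def best_match_window(case_series, env_history):
    """Return (offset, score) of the best alignment of case_series
    over env_history; a lower score means a better match."""
    n = len(case_series)
    best = (0, float("inf"))
    # Reverse sweep over the time axis, most recent alignment first.
    for offset in range(len(env_history) - n, -1, -1):
        score = mse(case_series, env_history[offset:offset + n])
        if score < best[1]:
            best = (offset, score)
    return best
```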
[Figure 2: Typical navigational behaviors of different tunings of the reactive control module. The figure on the left shows the non-learning system with high obstacle avoidance and low goal attraction. On the right, the learning system has lowered obstacle avoidance and increased goal attraction, allowing it to "squeeze" through the obstacles and then take a relatively direct path to the goal.]
Finally, the learn step uses statistical information about prior applications of the case to determine whether information from the current application of the case should be used to modify this case, or whether a new case should be created. The vectors encoded in the cases are adapted using a reinforcement formula in which a relative similarity measure is used as a scalar reward or reinforcement signal. The relative similarity measure quantifies how similar the current environment configuration is to the environment configuration encoded by the case, relative to how similar the environment has been in previous utilizations of the case. Intuitively, if a case matches the current situation better than the previous situations it was used in, it is likely that the situation involves the very regularities that the case is beginning to capture; thus, it is worthwhile to modify the case in the direction of the current situation. Alternatively, if the match is not quite as good, the case should not be modified, because that will take it away from the regularity it was converging towards. Finally, if the current situation is a very bad fit to the case, it makes more sense to create a new case to represent what is probably a new class of situations.
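The following sketch illustrates this modify-versus-create logic under stated assumptions: the exact relative-similarity definition, the thresholds, and the update rule below are illustrative, since the paper specifies only that relative similarity serves as the scalar reward.

```python
# Illustrative sketch of the learn step's modify-versus-create logic.
# Thresholds, the relative-similarity definition, and the update rule
# are assumptions, not the published formula.
from dataclasses import dataclass, field

@dataclass
class Case:
    input_vec: list
    output_vec: list
    past_sims: list = field(default_factory=list)

def learn(case, env_vec, sim, memory, rate=0.1, min_fit=0.2):
    """sim is the current match quality in [0, 1] (1 = perfect)."""
    baseline = (sum(case.past_sims) / len(case.past_sims)
                if case.past_sims else sim)
    r = sim - baseline          # relative similarity: the reward signal
    if sim < min_fit:
        # Very bad fit: probably a new class of situations, so create
        # a new case rather than corrupt an existing regularity.
        memory.append(Case(list(env_vec), list(case.output_vec)))
    elif r > 0:
        # Better fit than the case's own history: pull the case toward
        # the current situation, scaled by the reward.
        case.input_vec = [v + rate * r * (e - v)
                          for v, e in zip(case.input_vec, env_vec)]
    # Otherwise leave the case unchanged so it keeps converging on the
    # regularity it has already begun to capture.
    case.past_sims.append(sim)
```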
A detailed description of each step would require more space than is available in this paper (see [Ram and Santamaria, 1993a; Ram and Santamaria, 1993b] for details). Here, we note that since the reinforcement formula is based on a relative similarity measure, the overall effect of the learning process is to cause the cases to converge on stable associations between environment configurations and schema parameters. Stable associations represent regularities in the world that have been identified by the system through its experience, and provide the predictive power necessary to navigate in future situations. The assumption behind this method is that the interaction between the system and the environment can be characterized by a finite set of causal patterns or associations between the sensory inputs and the actions performed by the system. The method allows the system to learn these causal patterns and to use them to modify its actions by updating its schema parameters as appropriate.
We have developed a three-dimensional interactive visualization of a robot navigating through a simulated obstacle-ridden world that allows a user to configure and test the navigational performance of the robot (see figure 2). The user can also test two robots simultaneously, comparing the performance of a robot using traditional, non-learning reactive control with another that uses our method to adapt itself to the environment to enhance navigation performance.
The system has been tested through extensive empirical simulations on a wide variety of environments using several different performance metrics (see [Moorman and Ram, 1992; Ram et al., 1992; Ram and Santamaria, 1993b] for further details). The system is very robust and can perform successfully in (and learn from) novel environments, yet it compares favorably with traditional reactive methods in terms of speed and performance. A further advantage of the method is that the system designers do not need to foresee and represent all the possibilities that might occur, since the system develops its own "understanding" of the world and its actions. Through experience, the system is able to adapt to, and perform well in, a wide range of environments without any user intervention or supervisory input. This is a primary characteristic that autonomous agents must have to interact with real-world environments.
References
[Arkin, 1989] Ronald C. Arkin. Motor schema-based mobile robot navigation. The International Journal of Robotics Research, 8(4):92-112, August 1989.

[Brooks, 1986] Rodney Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, RA-2(1):14-23, August 1986.

[Chien et al., 1991] S. A. Chien, M. T. Gervasio, and G. F. DeJong. On becoming decreasingly reactive: Learning to deliberate minimally. In Proceedings of the Eighth International Workshop on Machine Learning, pages 288-292, Chicago, IL, June 1991.

[Fikes et al., 1972] R. E. Fikes, P. E. Hart, and N. J. Nilsson. Learning and executing generalized robot plans. Artificial Intelligence, 3:251-288, 1972.

[Georgeff, 1987] M. Georgeff. Planning. Annual Review of Computer Science, 2:359-400, 1987.

[Hammond, 1989] Kristian J. Hammond. Case-Based Planning: Viewing Planning as a Memory Task. Perspectives in Artificial Intelligence. Academic Press, Boston, MA, 1989.

[Kaelbling, 1986] L. Kaelbling. An architecture for intelligent reactive systems. Technical Note 400, SRI International, October 1986.

[Maes, 1990] Pattie Maes. Situated agents can have goals. Robotics and Autonomous Systems, 6:49-70, 1990.

[Minton, 1988] Steven Minton. Learning effective search control knowledge: An explanation-based approach. PhD thesis, Carnegie-Mellon University, Computer Science Department, Pittsburgh, PA, 1988. Technical Report CMU-CS-88-133.

[Mitchell, 1990] T. M. Mitchell. Becoming increasingly reactive. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 1051-1058, Boston, MA, August 1990.

[Moorman and Ram, 1992] Kenneth Moorman and Ashwin Ram. A case-based approach to reactive control for autonomous robots. In Proceedings of the AAAI Fall Symposium on AI for Real-World Autonomous Mobile Robots, Cambridge, MA, October 1992.

[Mostow and Bhatnagar, 1987] J. Mostow and N. Bhatnagar. FAILSAFE - A floor planner that uses EBG to learn from its failures. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence, pages 249-255, Milan, Italy, August 1987.

[Payton, 1986] D. Payton. An architecture for reflexive autonomous vehicle control. In Proceedings of the IEEE Conference on Robotics and Automation, pages 1838-1845, 1986.

[Ram and Santamaria, 1993a] Ashwin Ram and Juan Carlos Santamaria. Continuous case-based reasoning. In D. B. Leake, editor, Proceedings of the AAAI Workshop on Case-Based Reasoning, Washington, DC, July 1993.

[Ram and Santamaria, 1993b] Ashwin Ram and Juan Carlos Santamaria. A multistrategy case-based and reinforcement learning approach to self-improving reactive control systems for autonomous robotic navigation. In R. S. Michalski and G. Tecuci, editors, Proceedings of the Second International Workshop on Multistrategy Learning, Harpers Ferry, WV, May 1993. Center for Artificial Intelligence, George Mason University, Fairfax, VA.

[Ram et al., 1992] Ashwin Ram, Ronald C. Arkin, Kenneth Moorman, and Russell J. Clark. Case-based reactive navigation: A case-based method for on-line selection and adaptation of reactive control parameters in autonomous robotic systems. Technical Report GIT-CC-92/57, College of Computing, Georgia Institute of Technology, Atlanta, GA, 1992.

[Sacerdoti, 1975] E. D. Sacerdoti. A structure for plans and behavior. Technical Note 109, Stanford Research Institute, Artificial Intelligence Center, 1975. Summarized in P. R. Cohen and E. A. Feigenbaum's Handbook of AI, Vol. III, pp. 541-550.

[Segre, 1988] Alberto M. Segre. Machine Learning of Robot Assembly Plans. Kluwer Academic Publishers, Norwell, MA, 1988.

[Sutton, 1992] R. S. Sutton, editor. Machine Learning, Special issue on reinforcement learning, volume 8(3/4). Kluwer Academic, Hingham, MA, 1992.