Combining Experience with Quantitative Models

From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
John J. Grefenstette
Connie Loggia Ramsey
Navy Center for Applied Research in AI
Naval Research Laboratory
Code 5514
Washington, DC 20375-5337
Abstract

This is a progress report on our efforts to design intelligent robots for complex environments. The sort of applications we have in mind include sentry robots, autonomous delivery vehicles, undersea surveillance vehicles, and automated warehouse robots. We are investigating the issues relating to machine learning, using multiple mobile robots to perform tasks such as playing hide-and-seek, tag, or competing to find hidden objects. We propose that the knowledge acquisition task for autonomous robots be viewed as a cooperative effort between the robot designers and the robot itself. The robot should have access to the best model of its world that the designer can reasonably provide. On the other hand, some aspects of the environment will be unknown in advance. For such aspects, the robot itself is in the best position to acquire the knowledge of what to expect in its world. We have implemented these ideas in an arrangement we call case-based anytime learning. This system starts with a parameterized model of its world and then learns a set of specific models that correspond to the environmental cases it actually encounters. The system uses genetic algorithms to learn high-performance reactive strategies for each environmental model.

1 Introduction

This is a progress report on our efforts to design intelligent robots for complex environments. The sort of applications we have in mind include sentry robots, autonomous delivery vehicles, undersea surveillance vehicles, and automated warehouse robots. We are investigating the issues relating to machine learning, using multiple mobile robots to perform tasks such as playing hide-and-seek, tag, or competing to find hidden objects.

An intelligent robot will need extensive knowledge to interact effectively with its external environment. This challenge represents an important opportunity for machine learning. A key issue is not necessarily how complex the environment is, but how easy it is to provide the robot with the knowledge it needs, given the available knowledge we have. Several previous studies have explored different methods for automating knowledge acquisition for intelligent robots, each approach typically depending on different assumptions about what is already known about the task environment. If the robot's task environment is well understood, it may be most efficient to depend largely on a preprogrammed model of the environment, and to use learning to improve efficiency and reactivity (Laird et al., 1991). If the effects of the robot's actions cannot be easily predicted but there are only a few important state variables that affect the robot's decisions, the robot might use an internal model to accelerate its learning of state-action mappings (Sutton, 1990). If the environment is assumed to exhibit perpetual novelty, it might be useful to enable the robot to learn a wide variety of cognitive structures based on low-level perceptual stimuli (Booker, 1988). We believe that in practical cases, there will be a mix of the above cases. Some aspects of the robot's world will be accurately known in advance, and some aspects can only be learned through the experience of the robot.

We propose that the knowledge acquisition task for autonomous robots be viewed as a cooperative effort between the robot designers and the robot itself. Some aspects of the environment will be known in great detail to the designer, for example, the size and weight of the robot, the characteristics of its sensors and effectors, and at least some of the physics of the task environment. Our first principle guiding the cooperative knowledge acquisition effort is:

1. The robot needs all the help it can get.

That is, the robot should have access to the best model of its world that the designer can reasonably provide. This is likely to include a quantitative simulation model of the robot and its environment. On the other hand, some aspects of the environment will be unknown in advance. These aspects could include such things as the frictional characteristics of the floor, the reflective attributes of the wall surfaces, the location of objects and obstacles, and the speed and maneuverability of the other agents in the environment. For such aspects, the robot itself is in the best position to acquire the knowledge of what to expect in its world. This observation leads to our second principle:

2. The robot should use its experience to fill in the missing aspects of its world model.

Again, this should be a cooperative effort. The designer's responsibility should include identifying the uncertain aspects of the environment, supplying plausible ranges of values, and making sure that the initial model provided to the robot treats these aspects as parameters that can be changed as necessary to reflect the actual environment. The robot can then modify its world model based on its observation of the actual environment. We have put these principles into practice in an arrangement we call anytime learning (Grefenstette and Ramsey, 1992; Ramsey and Grefenstette, 1993).
2 Anytime Learning
Anytime learning is a method for continuous learning in changing environments. We call the approach anytime learning to emphasize its relationship to recent work on anytime planning and scheduling (Dean and Boddy, 1988; Zweben, Deale and Gargan, 1990). The basic characteristics of anytime algorithms are: (1) the algorithm can be suspended and resumed with negligible overhead, (2) the algorithm can be terminated at any time and will return some answer, and (3) the answers returned improve over time. However, our use of the term anytime learning is meant to denote a particular way of integrating execution and learning. The basic idea is to integrate two continuously running modules: an execution module and a learning module (see Figure 1). The agent's learning module contains a simulation model with parameterized aspects of the domain. It continuously tests new strategies against the simulation model to develop improved strategies, and updates the knowledge base used by the agent (in the execution system) on the basis of the best available results. The execution module controls the agent's interaction with the environment, and includes a monitor that can detect changes in the environment, and dynamically modify the parameter ranges of the simulation model (used by the learning module) to reflect these changes. When the simulation model is modified, the learning process is restarted on the modified model. The learning system is assumed to operate indefinitely, and the execution system uses the results of learning as they become available. In short, this approach uses an existing simulation model as a base on which strategies are learned, and updates the parameters of the simulation model as new information is observed in the external environment in order to improve long-term performance.
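The integration of the two modules can be sketched as a minimal interleaved loop. This is an illustrative sketch only: the function names (`learn_step`, `observe_environment`), the toy parameter names, and the numeric thresholds are assumptions, and the real system runs the learning and execution modules concurrently rather than alternating them in one loop.

```python
import random

# Minimal sketch of the anytime-learning integration, under assumed names:
# each iteration takes one learning step against a parameterized simulation
# model, while the execution module's monitor updates that model whenever
# observations of the world diverge from it (restarting learning, in effect,
# since later steps search against the new model).

def learn_step(model_params, strategy):
    """One learning increment: keep a perturbed strategy if it scores
    better against the current simulation model (a toy fitness here)."""
    candidate = {k: v + random.uniform(-0.1, 0.1) for k, v in strategy.items()}
    fitness = lambda s: -sum(abs(s[k] - model_params[k]) for k in s)
    return candidate if fitness(candidate) > fitness(strategy) else strategy

def anytime_learning(observe_environment, steps=2000):
    # Designer-supplied initial model with parameterized (uncertain) aspects.
    model_params = {"target_speed": 0.5, "terrain_friction": 0.5}
    strategy = {"target_speed": 0.0, "terrain_friction": 0.0}  # default strategy
    for _ in range(steps):
        strategy = learn_step(model_params, strategy)   # learning module
        observed = observe_environment()                # execution module
        # Monitor: if the world no longer matches the model, update the
        # model parameters; learning then continues on the modified model.
        if any(abs(observed[k] - model_params[k]) > 0.2 for k in observed):
            model_params = dict(observed)
    return strategy
```

Because acceptance is conditioned on improvement against the current model, the best-so-far strategy is always available to the execution module, which is the anytime property the text describes.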
We are still in the process of elaborating the anytime learning model. One aspect concerns the criteria for deciding when the external environment has changed. Currently, the monitor compares measurable aspects of the environment with the parameters provided by the simulation designer. In addition, the monitor might also detect differences between the expected and actual performance of the current strategy in the environment.¹ For example, if the performance level degrades in the environment, that is a sign that the current strategy is no longer applicable. If the performance of the current strategy improves unexpectedly, it may indicate that the environment has changed, and that another strategy may perform even better.

Figure 1. Anytime Learning System
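This performance-based criterion can be sketched as follows. The class name, window size, and thresholds are illustrative assumptions, not part of the paper's system; the key point is that deviation in either direction triggers the alarm.

```python
from collections import deque

# Hypothetical sketch of the performance-based change monitor described
# above: flag an environment change when the strategy's recent real-world
# performance departs from what the simulation led us to expect.

class PerformanceMonitor:
    def __init__(self, expected_mean, tolerance, window=20):
        self.expected_mean = expected_mean  # performance predicted in simulation
        self.tolerance = tolerance          # allowed deviation before alarm
        self.recent = deque(maxlen=window)  # sliding window of observed scores

    def record(self, score):
        self.recent.append(score)

    def environment_changed(self):
        """True when recent performance deviates from expectation either way:
        degradation suggests the current strategy no longer applies, while
        unexpected improvement suggests a better strategy may now exist."""
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough evidence yet
        actual = sum(self.recent) / len(self.recent)
        return abs(actual - self.expected_mean) > self.tolerance
```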
3 Case-Based Initialization of Genetic Algorithms

While this approach could be used merely to fill in particular values for parameters that are initially unknown, it is especially useful in changing environments. Each observed set of values for the parameters can be treated as an environmental case. The learning system can store strategies for different cases, indexed by the environmental parameters that characterize that case. When the environment changes, the set of previous cases can be searched for similar cases, and the corresponding best strategies can be used as a starting point for the new case.

Our particular instantiation of anytime learning uses a learning system called SAMUEL (Cobb and Grefenstette, 1991; Gordon, 1991; Grefenstette, 1991; Grefenstette, Ramsey and Schultz, 1990; Schultz, 1991). SAMUEL is a system for learning reactive strategies expressed as symbolic condition-action rules, given a simulation model of the environment. It uses a modified genetic algorithm and reinforcement learning to generate increasingly competent strategies. SAMUEL has successfully learned strategies for a number of multi-agent tasks in (simulated) static environments, including evading attackers, tracking other agents at a distance, and dogfighting. While the basic ideas of anytime learning could be applied using a number of other learning methods, especially other reinforcement learning methods, SAMUEL has some important advantages for this approach. In particular, SAMUEL can learn rapidly from partially correct strategies and with limited fidelity simulation models (Schultz and Grefenstette, 1990; Ramsey, Schultz and Grefenstette, 1990). More importantly for this discussion, the genetic algorithms provide an effective mechanism for modifying previously learned cases. When an environmental change is detected, the genetic algorithm is restarted with a new initial population. This initial population can be "seeded" with the best strategies found for previous similar cases. We call this case-based initialization of the genetic algorithm. Our studies have shown that by using good strategies from several different past cases as well as exploratory strategies, default strategies and members of the previous population, the genetic algorithm can respond effectively to environmental changes, and can also recover gracefully from spurious false alarms (i.e., when the monitor mistakenly reports that the environment has changed).

¹ Hart, Anderson and Cohen (1990) discuss related issues concerning the design of planners that monitor differences between expected and actual progress of a plan.
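Case-based initialization can be sketched roughly as follows. The case store, the similarity metric, and the seeding mix are illustrative guesses at the scheme described above, not SAMUEL's actual data structures.

```python
# Hypothetical sketch of case-based initialization of a genetic algorithm.
# A "case" pairs observed environmental parameters with the best strategy
# learned for them; when the monitor reports a change, the new initial
# population mixes strategies from the nearest stored cases with default,
# exploratory, and carried-over individuals.

class CaseBase:
    def __init__(self):
        self.cases = []  # list of (env_params, best_strategy)

    def store(self, env_params, strategy):
        self.cases.append((dict(env_params), strategy))

    def nearest(self, env_params, k=3):
        """Return strategies for the k most similar past environments,
        using Manhattan distance over the environmental parameters."""
        def distance(case_env):
            return sum(abs(case_env[p] - env_params[p]) for p in env_params)
        ranked = sorted(self.cases, key=lambda c: distance(c[0]))
        return [strategy for _, strategy in ranked[:k]]

def seed_population(case_base, new_env, previous_pop, default_strategy,
                    pop_size=20, random_strategy=None):
    """Build the initial GA population for a newly detected environment."""
    if random_strategy is None:
        random_strategy = lambda: default_strategy  # fall back to defaults
    population = list(case_base.nearest(new_env))      # best past cases
    population.append(default_strategy)                # safe default strategy
    population.extend(previous_pop[: pop_size // 4])   # members of old population
    while len(population) < pop_size:                  # exploratory individuals
        population.append(random_strategy())
    return population[:pop_size]
```

Keeping some members of the previous population alongside the retrieved cases is what lets such a scheme recover from false alarms: if the environment did not actually change, the carried-over individuals are already well adapted.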
A case-based anytime learning system can be viewed as one that starts with an underspecified model of its world (the original parameterized quantitative model), and then learns, on the basis of experience, a set of fully specified models that correspond to the environmental cases it actually encounters. The power of this approach derives from the cooperative effort between designer and robot. The designer provides a rich set of models that can be used for learning, and the robot selects the appropriate models from that (possibly infinite) set of models. Each partner makes a significant contribution to the knowledge acquisition process.

4 Current Results

Preliminary experiments have shown the effectiveness of anytime learning in a changing environment (Grefenstette and Ramsey, 1992; Ramsey and Grefenstette, 1993). The task used in these experiments was a cat-and-mouse game in which one robot had to track another without being detected. The changing environmental parameters included the speed distribution of the target agent and the maneuverability of the target agent, as well as environmental variables that were in fact irrelevant to the performance of the task. The most promising aspect of these results is that, within each time period (epoch) after an environmental change, the case-based anytime learning system consistently showed a much more rapid improvement in the performance of the execution system than a baseline learning system (with its monitor disabled). Case-based initialization allows the system to automatically bias the search of the genetic algorithm toward relevant areas of the search space. More recent experiments show that the longer the epochs last and the longer the system runs and gathers a base of experiences of different environmental cases, the greater the expected benefit of case-based anytime learning is.

5 Future Plans

So far we have tested the anytime learning approach on simulated environments only. We have recently acquired two mobile robots from Nomadic Technologies, Inc. The robots are equipped with several sensor packages, including contact sensors, infrared, sonar, and structured light rangefinders. Our experiments will involve the same kind of cat-and-mouse tasks described above. A significant feature of the Nomadic robots is that they come equipped with a complete simulation model, allowing the designer to test control strategies in simulation before testing on the real robot. Our approach calls for putting these simulations to work as the learning test bed for the robots, exactly as described above. We will soon be able to report our initial findings on the physical robots.

The general approach described here is not limited to mobile robots, but could be used in a variety of other settings. There are many simulation models used for training human operators, for example, flight simulators, air traffic control simulators, and power plant simulators. Many aspects of such simulators, such as weather conditions, terrain characteristics, even the skill levels of the computer generated agents, are usually parameterized to allow a range of training experiences or to model probabilistic events. Besides training simulators, there are simulators that have been used to evaluate intelligent control systems prior to deployment in an operational setting (Fogarty, 1989). Both training simulators and testing simulators usually include an evaluation mechanism for the decisions made during the simulation run. These evaluation mechanisms provide feedback that can be used by autonomous learning agents. The approach described here could be used in many of these settings, allowing an automated power plant, for example, to learn from its own simulation model, and to adjust that model to reflect changing environmental conditions. We expect that the combination of experience and flexible quantitative models can reduce the overall knowledge acquisition effort required to produce effective autonomous systems.

References

Booker, L. B. (1988). Classifier systems that learn internal world models. Machine Learning 3(3), 161-192.

Cobb, H. G. and J. J. Grefenstette (1991). Learning the persistence of actions in reactive control rules. Proceedings of the Eighth International Machine Learning Workshop (pp. 293-297). Evanston, IL: Morgan Kaufmann.

Dean, T. and M. Boddy (1988). An analysis of time-dependent planning. Proceedings of the Seventh National Conference on AI (AAAI-88) (pp. 49-54). St. Paul, MN: Morgan Kaufmann.

Fogarty, T. (1989). The machine learning of rules for combustion control in multiple burner installations. Proceedings of the Fifth IEEE Conference on AI Applications (pp. 215-221).

Gordon, D. F. (1991). An enhancer for reactive plans. Proceedings of the Eighth International Machine Learning Workshop (pp. 505-508). Evanston, IL: Morgan Kaufmann.

Grefenstette, J. J. (1991). Lamarckian learning in multi-agent environments. Proceedings of the Fourth International Conference on Genetic Algorithms (pp. 303-310). San Diego, CA: Morgan Kaufmann.

Grefenstette, J. J. and C. L. Ramsey (1992). An approach to anytime learning. Proceedings of the Ninth International Conference on Machine Learning (pp. 189-195), D. Sleeman and P. Edwards (eds.), San Mateo, CA: Morgan Kaufmann.

Grefenstette, J. J., C. L. Ramsey and A. C. Schultz (1990). Learning sequential decision rules using simulation models and competition. Machine Learning 5(4), 355-381.

Hart, D. M., S. Anderson and P. R. Cohen (1990). Envelopes as a vehicle for improving the efficiency of plan execution. Proceedings of a Workshop on Innovative Approaches to Planning, Scheduling and Control (pp. 71-76). San Diego: Morgan Kaufmann.

Laird, J. E., E. S. Yager, M. Hucka and C. M. Tuck (1991). Robo-Soar: An integration of external interaction, planning, and learning using Soar. Robotics and Autonomous Systems 8(1-2), 113-129.

Ramsey, C. L. and J. J. Grefenstette (1993). Case-based initialization of genetic algorithms. To appear in the Proceedings of the Fifth International Conference on Genetic Algorithms.

Ramsey, C. L., A. C. Schultz and J. J. Grefenstette (1990). Simulation-assisted learning by competition: Effects of noise differences between training model and target environment. Proceedings of the Seventh International Conference on Machine Learning (pp. 211-215). Austin, TX: Morgan Kaufmann.

Schultz, A. C. (1991). Using a genetic algorithm to learn strategies for collision avoidance and local navigation. Seventh International Symposium on Unmanned, Untethered, Submersible Technology (pp. 213-225). Durham, NH.

Schultz, A. C. and J. J. Grefenstette (1990). Improving tactical plans with genetic algorithms. Proceedings of IEEE Conference on Tools for AI 90 (pp. 328-334). Washington, DC: IEEE.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Conference on Machine Learning (ML-90), Porter, B. W. and R. J. Mooney, eds., (pp. 216-224). Austin, TX: Morgan Kaufmann.

Zweben, M., M. Deale and R. Gargan (1990). Anytime rescheduling. Proceedings of a Workshop on Innovative Approaches to Planning, Scheduling and Control (pp. 251-259). San Diego: Morgan Kaufmann.