From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Combining Experience with Quantitative Models

John J. Grefenstette and Connie Loggia Ramsey
Navy Center for Applied Research in AI
Naval Research Laboratory, Code 5514
Washington, DC 20375-5337

Abstract

This is a progress report on our efforts to design intelligent robots for complex environments. The sort of applications we have in mind include sentry robots, autonomous delivery vehicles, undersea surveillance vehicles, and automated warehouse robots. We are investigating the issues relating to machine learning, using multiple mobile robots to perform tasks such as playing hide-and-seek, tag, or competing to find hidden objects. We propose that the knowledge acquisition task for autonomous robots be viewed as a cooperative effort between the robot designers and the robot itself. The robot should have access to the best model of its world that the designer can reasonably provide. On the other hand, some aspects of the environment will be unknown in advance. For such aspects, the robot itself is in the best position to acquire the knowledge of what to expect in its world. We have implemented these ideas in an arrangement we call case-based anytime learning. This system starts with a parameterized model of its world and then learns a set of specific models that correspond to the environmental cases it actually encounters. The system uses genetic algorithms to learn high-performance reactive strategies for each environmental model.

1 Introduction

This is a progress report on our efforts to design intelligent robots for complex environments. The sort of applications we have in mind include sentry robots, autonomous delivery vehicles, undersea surveillance vehicles, and automated warehouse robots. We are investigating the issues relating to machine learning, using multiple mobile robots to perform tasks such as playing hide-and-seek, tag, or competing to find hidden objects.

An intelligent robot will need extensive knowledge to interact effectively with its external environment. This challenge represents an important opportunity for machine learning. A key issue is not necessarily how complex the environment is, but how easy it is to provide the robot with the knowledge it needs, given the available knowledge we have. Several previous studies have explored different methods for automating knowledge acquisition for intelligent robots, each approach typically depending on different assumptions about what is already known about the task environment. If the robot's task environment is well understood, it may be most efficient to depend largely on a pre-programmed model of the environment, and to use learning to improve efficiency and reactivity (Laird et al., 1991). If the effects of the robot's actions cannot be easily predicted, but there are only a few important state variables that affect the robot's decisions, the robot might use an internal model to accelerate its learning of state-action mappings (Sutton, 1990). If the environment is assumed to exhibit perpetual novelty, it might be useful to enable the robot to learn a wide variety of cognitive structures based on low-level perceptual stimuli (Booker, 1988). We believe that in practical cases, there will be a mix of the above cases. Some aspects of the robot's world will be accurately known in advance, and some aspects can only be learned through the experience of the robot.

We propose that the knowledge acquisition task for autonomous robots be viewed as a cooperative effort between the robot designers and the robot itself. Some aspects of the environment will be known in great detail to the designer, for example, the size and weight of the robot, the characteristics of its sensors and effectors, and at least some of the physics of the task environment. Our first principle guiding the cooperative knowledge acquisition effort is:

1. The robot needs all the help it can get.

That is, the robot should have access to the best model of its world that the designer can reasonably provide. This is likely to include a quantitative simulation model of the robot and its environment. On the other hand, some aspects of the environment will be unknown in advance. These aspects could include such things as the frictional characteristics of the floor, the reflective attributes of the wall surfaces, the location of objects and obstacles, and the speed and maneuverability of the other agents in the environment. For such aspects, the robot itself is in the best position to acquire the knowledge of what to expect in its world. This observation leads to our second principle:

2. The robot should use its experience to fill in the missing aspects of its world model.

Again, this should be a cooperative effort. The designer's responsibility should include identifying the uncertain aspects of the environment, supplying plausible ranges of values, and making sure that the initial model provided to the robot treats these aspects as parameters that can be changed as necessary to reflect the actual environment. The robot can then modify its world model based on its observation of the actual environment. We have put these principles into practice in an arrangement we call anytime learning (Grefenstette and Ramsey, 1992; Ramsey and Grefenstette, 1993).

2 Anytime Learning

Anytime learning is a method for continuous learning in changing environments. We call the approach anytime learning to emphasize its relationship to recent work on anytime planning and scheduling (Dean and Boddy, 1988; Zweben, Deale and Gargan, 1990). The basic characteristics of anytime algorithms are: (1) the algorithm can be suspended and resumed with negligible overhead, (2) the algorithm can be terminated at any time and will return some answer, and (3) the answers returned improve over time. However, our use of the term anytime learning is meant to denote a particular way of integrating execution and learning. The basic idea is to integrate two continuously running modules: an execution module and a learning module (see Figure 1).

[Figure 1. Anytime Learning System design.]

The agent's learning module contains a simulation model with parameterized aspects of the domain. It continuously tests new strategies against the simulation model to develop improved strategies, and updates the knowledge base used by the agent (in the execution system) on the basis of the best available results. The execution module controls the agent's interaction with the environment, and includes a monitor that can detect changes in the environment and dynamically modify the parameter ranges of the simulation model (used by the learning module) to reflect these changes. When the simulation model is modified, the learning process is restarted on the modified model. The learning system is assumed to operate indefinitely, and the execution system uses the results of learning as they become available.
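To make the integration concrete, the following is a minimal sketch of the two-module arrangement, given here in Python purely for exposition. It is not the implementation discussed in this paper: the toy one-dimensional pursuit task, the names Simulation, learn_step, and anytime_learning, and the interleaving of learning and monitoring in a single loop (standing in for two continuously running modules) are all illustrative assumptions.

    import random

    class Simulation:
        """Parameterized world model: a target whose speed is uncertain."""
        def __init__(self, target_speed):
            self.target_speed = target_speed

        def evaluate(self, strategy):
            # A strategy here is just a guessed pursuit speed; the closer
            # it matches the model's target speed, the higher the score.
            return -abs(strategy - self.target_speed)

    def learn_step(population, sim):
        """One increment of learning: mutate candidates, keep the best half."""
        offspring = [s + random.gauss(0.0, 0.1) for s in population]
        pool = population + offspring
        pool.sort(key=sim.evaluate, reverse=True)
        return pool[:len(population)]

    def anytime_learning(true_speeds, steps_per_epoch=50):
        sim = Simulation(target_speed=1.0)       # designer's initial estimate
        population = [random.uniform(0.0, 2.0) for _ in range(10)]
        for true_speed in true_speeds:           # environment changes per epoch
            for _ in range(steps_per_epoch):     # learning module runs against
                population = learn_step(population, sim)   # the simulation model
            best = population[0]                 # knowledge base update
            # Execution/monitoring: compare the observed environment with the
            # model's parameter; on a mismatch, modify the model so that the
            # next round of learning restarts on the corrected model.
            if abs(true_speed - sim.target_speed) > 0.05:
                sim = Simulation(target_speed=true_speed)
            print("true %.2f  learned %.2f" % (true_speed, best))

    anytime_learning([1.0, 1.5, 0.7])

In this sketch the monitor's update of the simulation parameter takes effect in the following epoch, so the learned strategy tracks the environment with a one-epoch lag; in the actual arrangement, learning restarts as soon as the model is modified.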
In short, this approach uses an existing simulation model as a base on which strategies are learned, and updates the parameters of the simulation model as new information is observed in the external environment, in order to improve long-term performance.

We are still in the process of elaborating the anytime learning model. One aspect concerns the criteria for deciding when the external environment has changed. Currently, the monitor compares measurable aspects of the environment with the parameters provided by the simulation model. In addition, the monitor might also detect differences between the expected and actual performance of the current strategy in the environment.[1] For example, if the performance level degrades in the environment, that is a sign that the current strategy is no longer applicable. If the performance of the current strategy improves unexpectedly, it may indicate that the environment has changed, and that another strategy may perform even better.

[1] Hart, Anderson and Cohen (1990) discuss related issues concerning the design of planners that monitor differences between expected and actual progress of a plan.
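A minimal sketch of such a monitor test, combining both criteria, might look as follows. The function name, the tolerance tol, and the fixed comparison window are assumptions made for illustration only, not details of our monitor.

    def environment_changed(measured, model_params, perf_history,
                            tol=0.05, window=20):
        # Criterion 1: a measured aspect of the environment falls outside
        # the parameter range currently assumed by the simulation model.
        for name, (low, high) in model_params.items():
            if name in measured and not (low - tol <= measured[name] <= high + tol):
                return True
        # Criterion 2: performance of the current strategy shifts markedly
        # relative to its recent average, in either direction.
        if len(perf_history) >= 2 * window:
            recent = sum(perf_history[-window:]) / window
            earlier = sum(perf_history[-2 * window:-window]) / window
            if abs(recent - earlier) > tol * max(abs(earlier), 1e-9):
                return True
        return False

Note that the second test deliberately fires on improvement as well as degradation, following the observation above that an unexpected gain may also signal a changed environment.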
3 Case-Based Initialization of Genetic Algorithms

While this approach could be used merely to fill in particular values for parameters that are initially unknown, it is especially useful in changing environments. Each observed set of values for the parameters can be treated as an environmental case. The learning system can store strategies for different cases, indexed by the environmental parameters that characterize each case. When the environment changes, the set of previous cases can be searched for similar cases, and the corresponding best strategies can be used as a starting point for the new case.

Our particular instantiation of anytime learning uses a learning system called SAMUEL (Cobb and Grefenstette, 1991; Gordon, 1991; Grefenstette, 1991; Grefenstette, Ramsey and Schultz, 1990; Schultz, 1991). SAMUEL is a system for learning reactive strategies expressed as symbolic condition-action rules, given a simulation model of the environment. It uses a modified genetic algorithm and reinforcement learning to generate increasingly competent strategies. SAMUEL has successfully learned strategies for a number of multi-agent tasks in (simulated) static environments, including evading attackers, tracking other agents at a distance, and dogfighting. While the basic ideas of anytime learning could be applied using a number of other learning methods, especially other reinforcement learning methods, SAMUEL has some important advantages for this approach. In particular, SAMUEL can learn rapidly from partially correct strategies and with limited-fidelity simulation models (Schultz and Grefenstette, 1990; Ramsey, Schultz and Grefenstette, 1990). More importantly for this discussion, the genetic algorithms provide an effective mechanism for modifying previously learned cases. When an environmental change is detected, the genetic algorithm is restarted with a new initial population. This initial population can be "seeded" with the best strategies found for previous similar cases. We call this case-based initialization of the genetic algorithm. Our studies have shown that by using good strategies from several different past cases, as well as exploratory strategies, default strategies, and members of the previous population, the genetic algorithm can respond effectively to environmental changes, and can also recover gracefully from spurious false alarms (i.e., when the monitor mistakenly reports that the environment has changed).

A case-based anytime learning system can be viewed as one that starts with an underspecified model of its world (the original parameterized quantitative model), and then learns, on the basis of experience, a set of fully specified models that correspond to the environmental cases it actually encounters. The power of this approach derives from the cooperative effort between designer and robot. The designer provides a rich set of models that can be used for learning, and the robot selects the appropriate models from that (possibly infinite) set of models. Each partner makes a significant contribution to the knowledge acquisition process.
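As a minimal sketch of the seeding step described above, assuming strategies are represented as flat parameter vectors (rather than SAMUEL's symbolic condition-action rules), case-based initialization might look like the following. The case-memory structure, the similarity measure, and the particular mix proportions are illustrative assumptions.

    import random

    def distance(p, q):
        """Squared distance between two environmental parameter vectors."""
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def case_based_init(case_memory, new_params, prev_population,
                        default_strategies, pop_size=20, n_cases=3):
        """Seed a new GA population, mixing the four sources named above:
        best strategies from similar past cases, default strategies,
        members of the previous population, and random exploration."""
        ranked = sorted(case_memory,
                        key=lambda c: distance(c["params"], new_params))
        seeds = [c["best_strategy"] for c in ranked[:n_cases]]
        seeds += default_strategies
        seeds += prev_population[:pop_size // 4]   # carry over some survivors
        while len(seeds) < pop_size:               # exploratory strategies
            seeds.append([random.uniform(0.0, 1.0) for _ in range(4)])
        return seeds[:pop_size]

Keeping exploratory and default strategies in the mix is what lets the population recover when the retrieved cases are misleading, for example after a false alarm from the monitor.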
4 Current Results

Preliminary experiments have shown the effectiveness of anytime learning in a changing environment (Grefenstette and Ramsey, 1992; Ramsey and Grefenstette, 1993). The task used in these experiments was a cat-and-mouse game in which one robot had to track another without being detected. The changing environmental parameters included the speed distribution of the target agent and the maneuverability of the target agent, as well as environmental variables that were in fact irrelevant to the performance of the task. The most promising aspect of these results is that, within each time period (epoch) after an environmental change, the case-based anytime learning system consistently showed a much more rapid improvement in the performance of the execution system than a baseline learning system (with its monitor disabled). Case-based initialization allows the system to automatically bias the search of the genetic algorithm toward relevant areas of the search space. More recent experiments show that the longer the epochs last, and the longer the system runs and gathers a base of experiences of different environmental cases, the greater the expected benefit of case-based anytime learning.

5 Future Plans

So far we have tested the anytime learning approach on simulated environments only. We have recently acquired two mobile robots from Nomadic Technologies, Inc. The robots are equipped with several sensor packages, including contact sensors, infrared, sonar, and structured light rangefinders. Our experiments will involve the same kind of cat-and-mouse tasks described above. A significant feature of the Nomadic robots is that they come equipped with a complete simulation model, allowing the designer to test control strategies in simulation before testing on the real robot. Our approach calls for putting these simulations to work as the learning test bed for the robots, exactly as described above. We will soon be able to report our initial findings on the physical robots.

The general approach described here is not limited to mobile robots, but could be used in a variety of other settings. There are many simulation models used for training human operators, for example, flight simulators, air traffic control simulators, and power plant simulators. Many aspects of such simulators, such as weather conditions, terrain characteristics, and even the skill levels of the computer-generated agents, are usually parameterized to allow a range of training experiences or to model probabilistic events. Besides training simulators, there are simulators that have been used to evaluate intelligent control systems prior to deployment in an operational setting (Fogarty, 1989). Both training simulators and testing simulators usually include an evaluation mechanism for the decisions made during the simulation run. These evaluation mechanisms provide feedback that can be used by autonomous learning agents. The approach described here could be used in many of these settings, allowing an automated power plant, for example, to learn from its own simulation model, and to adjust that model to reflect changing environmental conditions. We expect that the combination of experience and flexible quantitative models can reduce the overall knowledge acquisition effort required to produce effective autonomous systems.

References

Booker, L. B. (1988). Classifier systems that learn internal world models. Machine Learning 3(3), 161-192.

Cobb, H. G. and J. J. Grefenstette (1991). Learning the persistence of actions in reactive control rules. Proceedings of the Eighth International Machine Learning Workshop (pp. 293-297). Evanston, IL: Morgan Kaufmann.

Dean, T. and M. Boddy (1988). An analysis of time-dependent planning. Proceedings of the Seventh National Conference on AI (AAAI-88) (pp. 49-54). St. Paul, MN: Morgan Kaufmann.

Fogarty, T. (1989). The machine learning of rules for combustion control in multiple burner installations. Proceedings of the Fifth IEEE Conference on AI Applications (pp. 215-221).

Gordon, D. F. (1991). An enhancer for reactive plans. Proceedings of the Eighth International Machine Learning Workshop (pp. 505-508). Evanston, IL: Morgan Kaufmann.

Grefenstette, J. J. (1991). Lamarckian learning in multi-agent environments. Proceedings of the Fourth International Conference on Genetic Algorithms (pp. 303-310). San Diego, CA: Morgan Kaufmann.

Grefenstette, J. J. and C. L. Ramsey (1992). An approach to anytime learning. Proceedings of the Ninth International Conference on Machine Learning (pp. 189-195), D. Sleeman and P. Edwards (eds.), San Mateo, CA: Morgan Kaufmann.

Grefenstette, J. J., C. L. Ramsey and A. C. Schultz (1990). Learning sequential decision rules using simulation models and competition. Machine Learning 5(4), 355-381.

Hart, D. M., S. Anderson and P. R. Cohen (1990). Envelopes as a vehicle for improving the efficiency of plan execution. Proceedings of a Workshop on Innovative Approaches to Planning, Scheduling and Control (pp. 71-76). San Diego: Morgan Kaufmann.

Laird, J. E., E. S. Yager, M. Hucka and C. M. Tuck (1991). Robo-Soar: An integration of external interaction, planning, and learning using Soar. Robotics and Autonomous Systems 8(1-2), 113-129.

Ramsey, C. L. and J. J. Grefenstette (1993). Case-based initialization of genetic algorithms. To appear in the Proceedings of the Fifth International Conference on Genetic Algorithms.

Ramsey, C. L., A. C. Schultz and J. J. Grefenstette (1990). Simulation-assisted learning by competition: Effects of noise differences between training model and target environment. Proceedings of the Seventh International Conference on Machine Learning (pp. 211-215). Austin, TX: Morgan Kaufmann.
Schultz, A. C. (1991). Using a genetic algorithm to learn strategies for collision avoidance and local navigation. Seventh International Symposium on Unmanned, Untethered, Submersible Technology (pp. 213-225). Durham, NH.

Schultz, A. C. and J. J. Grefenstette (1990). Improving tactical plans with genetic algorithms. Proceedings of IEEE Conference on Tools for AI 90 (pp. 328-334). Washington, DC: IEEE.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Conference on Machine Learning (ML-90), Porter, B. W. and R. J. Mooney, eds., (pp. 216-224). Austin, TX: Morgan Kaufmann.

Zweben, M., M. Deale and R. Gargan (1990). Anytime rescheduling. Proceedings of a Workshop on Innovative Approaches to Planning, Scheduling and Control (pp. 251-259). San Diego: Morgan Kaufmann.