From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Learning Continuous Perception-Action Models Through Experience

Ashwin Ram and Juan Carlos Santamaría
College of Computing
Georgia Institute of Technology
Atlanta, Georgia 30332-0280
{ashwin,carlos}@cc.gatech.edu

Autonomous robotic navigation is defined as the task of finding a path along which a robot can move safely from a source point to a destination point in an obstacle-ridden terrain and, often, executing the actions to carry out the movement in a real or simulated world. Several methods have been proposed for this task, ranging from high-level planning methods to reactive methods.

High-level planning methods use extensive world knowledge and inferences about the environment they interact with (see [Fikes et al., 1972; Georgeff, 1987; Maes, 1990; Sacerdoti, 1975]). Knowledge about available actions and their consequences is used to formulate a detailed plan before the actions are actually executed in the world. Such systems can successfully perform the path-finding required by the navigation task, but only if an accurate and complete representation of the world is available to the system. Considerable high-level knowledge is also needed to learn from planning experiences (see, for example, [Hammond, 1989; Minton, 1988; Mostow and Bhatnagar, 1987; Segre, 1988]). Such a representation is usually not available in real-world environments, which are complex and dynamic in nature. To build the necessary representations, a fast and accurate perception process is required to reliably map sensory inputs to high-level representations of the world. A second problem with high-level planning is the large amount of processing time required, resulting in significant slowdown and the inability to respond immediately to unexpected situations.

Situated or reactive control methods have been proposed as an alternative to high-level planning methods (see [Arkin, 1989; Brooks, 1986; Kaelbling, 1986; Payton, 1986]). In these methods, no planning is performed; instead, a simple sensory representation of the environment is used to select the next action that should be performed. Actions are represented as simple behaviors, which can be selected and executed rapidly, often in real-time. These methods can cope with unknown and dynamic environmental configurations, but only those that lie within the scope of predetermined behaviors. It requires a great deal of careful design and tuning on the part of the human designer to develop the control systems that drive such robots, and even then these systems run into serious difficulties when faced with environments that differ from those the designer anticipated. Furthermore, even if the designer could anticipate and model all the relevant aspects of the operating environment of the robot, the dynamic nature of the real world would render parts of this model obsolete over time. The ability to adapt to changes in the environment, and to learn from experiences, is crucial to adequate performance and survivability in the real world.

We have developed a self-improving navigation system that uses reactive control for fast performance, augmented with multistrategy learning methods that allow the system to adapt to novel environments and to learn from its experiences (see figure 1).

Figure 1: System architecture.
The system autonomously and progressively constructs representational structures that encapsulate its experiences into "cases" that are then used to aid the navigation task in two ways: they allow the system to dynamically select the appropriate robotic control behaviors in different situations, and they also allow the system to adapt selected behaviors to the immediate demands of the environment (see [Moorman and Ram, 1992; Ram et al., 1992; Ram and Santamaria, 1993a; Ram and Santamaria, 1993b] for further details). The system's cases are automatically constructed using a hybrid case-based and reinforcement learning method, without extensive high-level reasoning.

The learning and reactive modules function in an integrated manner. The learning module is always trying to find a better model of the interaction of the system with its environment so that it can tune the reactive module to perform its function better. The reactive module provides feedback to the learning module so it can build a better model of this interaction. The behavior of the system is the result of an equilibrium point established by the learning module, which is trying to refine the model, and the environment, which is complex and dynamic in nature. This equilibrium may shift and need to be re-established if the environment changes drastically; however, the model is generic enough at any point to be able to deal with a very wide range of environments.

The learning method is based on a combination of ideas from case-based reasoning and learning, which deals with the issue of using past experiences to deal with and learn from novel situations, and from reinforcement learning, which deals with the issue of updating the content of the system's knowledge based on feedback from the environment (e.g., see [Sutton, 1992]). However, in traditional case-based planning systems (e.g., [Hammond, 1989]) learning and adaptation require a detailed model of the domain. This is exactly what reactive planning systems are trying to avoid. Earlier attempts to combine reactive control with classical planning systems (e.g., [Chien et al., 1991]) or explanation-based learning systems (e.g., [Mitchell, 1990]) also relied on deep reasoning and were typically too slow for the fast, reflexive behavior required in reactive control systems. Unlike these approaches, our method does not fall back on slow non-reactive techniques for improving reactive control.

Each case represents an observed regularity between a particular environmental configuration and the effects of different actions, and prescribes the values of the schema parameters that are most appropriate (as far as the system knows based on its previous experience) for that environment. The learning module performs the following tasks in a cyclic manner: (1) perceive and represent the current environment; (2) retrieve a case whose input vectors represent an environment most similar to the current environment; (3) adapt the schema parameter values in use by the reactive control module by installing the values recommended by the output vectors of the case; and (4) learn new associations and/or adapt existing associations represented in the case to reflect any new information gained through the use of the case in the new situation, to enhance the reliability of their predictions. The perceive step builds a set of four input vectors, one for each sensory input described earlier, which are matched against the corresponding input vectors of the cases in the system's memory in the retrieve step.
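As a concrete illustration of this cycle, the sketch below shows one perceive-retrieve-adapt pass in Python. It is a minimal sketch, not the paper's implementation: the Case class, the similarity function, and the control_cycle driver are hypothetical names, and similarity is reduced here to a plain negated mean squared difference (the windowed matching actually used is described, and sketched, below).

    import numpy as np

    class Case:
        """Hypothetical case structure: sensory input vectors paired with
        the schema parameter (output) vectors that worked well there."""
        def __init__(self, input_vectors, output_vectors):
            self.input_vectors = [np.asarray(v, dtype=float) for v in input_vectors]
            self.output_vectors = np.asarray(output_vectors, dtype=float)
            self.match_history = []  # similarity scores from past uses

    def similarity(case, env_vectors):
        """Negated mean squared difference across the input vectors;
        values closer to zero indicate a better match."""
        return -float(np.mean([np.mean((c - np.asarray(e, dtype=float)) ** 2)
                               for c, e in zip(case.input_vectors, env_vectors)]))

    def control_cycle(case_memory, env_vectors, set_schema_parameters):
        """One pass of steps (1)-(3); env_vectors stands in for the four
        input vectors built by the perceive step."""
        best = max(case_memory, key=lambda c: similarity(c, env_vectors))  # retrieve
        set_schema_parameters(best.output_vectors)  # adapt the reactive module
        best.match_history.append(similarity(best, env_vectors))
        return best

A call such as control_cycle(memory, sensor_vectors, reactive.set_schema_parameters) would then run one retrieve-and-adapt pass; the learn step (4) is elaborated after the discussion of the relative similarity measure below.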
The case similarity metric is based on the mean squared difference between each of the vector values of the case over a trending window and the vector values of the environment. The best match window is calculated using a reverse sweep over the time axis, similar to a convolution process, to find the relative position that matches best. The best matching case is handed to the adapt step, which selects the schema parameter values from the output vectors of the case and modifies the corresponding values of the reactive behaviors currently in use using a reinforcement formula that uses the case similarity metric as a scalar reward. Thus the actual adaptations performed depend on the goodness of match between the case and the environment.

Finally, the learn step uses statistical information about prior applications of the case to determine whether information from the current application of the case should be used to modify this case, or whether a new case should be created. The vectors encoded in the cases are adapted using a reinforcement formula in which a relative similarity measure is used as a scalar reward or reinforcement signal. The relative similarity measure quantifies how similar the current environment configuration is to the environment configuration encoded by the case, relative to how similar the environment has been in previous utilizations of the case. Intuitively, if a case matches the current situation better than the previous situations it was used in, it is likely that the situation involves the very regularities that the case is beginning to capture; thus, it is worthwhile modifying the case in the direction of the current situation. Alternatively, if the match is not quite as good, the case should not be modified, because that would take it away from the regularity it was converging towards. Finally, if the current situation is a very bad fit to the case, it makes more sense to create a new case to represent what is probably a new class of situations.

A detailed description of each step would require more space than is available in this paper (see [Ram and Santamaria, 1993a; Ram and Santamaria, 1993b] for details). Here, we note that since the reinforcement formula is based on a relative similarity measure, the overall effect of the learning process is to cause the cases to converge on stable associations between environment configurations and schema parameters. Stable associations represent regularities in the world that have been identified by the system through its experience, and provide the predictive power necessary to navigate in future situations. The assumption behind this method is that the interaction between the system and the environment can be characterized by a finite set of causal patterns or associations between the sensory inputs and the actions performed by the system. The method allows the system to learn these causal patterns and to use them to modify its actions by updating its schema parameters as appropriate.
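Continuing the earlier sketch (and reusing its hypothetical Case and similarity definitions), the fragment below illustrates the two mechanisms described above: the convolution-like reverse sweep that finds the best-matching window, and the relative-similarity test that decides whether to modify the case, leave it unchanged, or create a new one. The learning rate, the thresholds, and the update rule are illustrative assumptions, not the paper's actual reinforcement formula.

    import numpy as np

    def best_match_offset(case_vector, env_history):
        """Reverse sweep over the time axis, similar to a convolution:
        slide the case's trending window across recent sensor history and
        return the backward offset with the smallest mean squared error."""
        case_vector = np.asarray(case_vector, dtype=float)
        env_history = np.asarray(env_history, dtype=float)
        w = len(case_vector)
        errors = [float(np.mean((env_history[s:s + w] - case_vector) ** 2))
                  for s in range(len(env_history) - w, -1, -1)]
        return int(np.argmin(errors)), min(errors)

    def relative_similarity(case, current_sim):
        """Positive when the current match is better than this case's
        historical average match quality."""
        if not case.match_history:
            return 0.0
        return current_sim - float(np.mean(case.match_history))

    def learn(case, env_vectors, observed_outputs, case_memory,
              rate=0.1, create_threshold=-0.5):
        """Step (4): modify the case, leave it alone, or create a new one,
        gated by relative similarity (the scalar reinforcement signal).
        Assumed thresholds and update rule, for illustration only."""
        sim = similarity(case, env_vectors)  # from the earlier sketch
        rel = relative_similarity(case, sim)
        if rel >= 0.0:
            # Better fit than usual: pull the case's vectors toward the
            # current situation, with a step scaled by the reward.
            step = rate * (1.0 + rel)
            for c, e in zip(case.input_vectors, env_vectors):
                c += step * (np.asarray(e, dtype=float) - c)
            case.output_vectors += step * (
                np.asarray(observed_outputs, dtype=float) - case.output_vectors)
        elif rel < create_threshold:
            # Very poor fit: probably a new class of situations.
            case_memory.append(Case(env_vectors, observed_outputs))
        # Otherwise leave the case unchanged so it keeps converging.
        case.match_history.append(sim)

Gating the update on relative rather than absolute similarity is what pulls each case toward the situations it already fits unusually well, which is the convergence toward stable associations described above.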
We have developed a three-dimensional interactive visualization of a robot navigating through a simulated obstacle-ridden world that allows a user to configure and test the navigational performance of the robot (see figure 2). The user can also test two robots simultaneously, and compare the performance of a robot using traditional, non-learning reactive control with another that uses our method to adapt itself to the environment to enhance its navigation performance.

Figure 2: Typical navigational behaviors of different tunings of the reactive control module. The figure on the left shows the non-learning system with high obstacle avoidance and low goal attraction. On the right, the learning system has lowered obstacle avoidance and increased goal attraction, allowing it to "squeeze" through the obstacles and then take a relatively direct path to the goal.

The system has been tested through extensive empirical simulations on a wide variety of environments using several different performance metrics (see [Moorman and Ram, 1992; Ram et al., 1992; Ram and Santamaria, 1993b] for further details). The system is very robust and can perform successfully in (and learn from) novel environments, yet it compares favorably with traditional reactive methods in terms of speed and performance. A further advantage of the method is that the system designers do not need to foresee and represent all the possibilities that might occur, since the system develops its own "understanding" of the world and its actions. Through experience, the system is able to adapt to, and perform well in, a wide range of environments without any user intervention or supervisory input. This is a primary characteristic that autonomous agents must have to interact with real-world environments.

References

[Arkin, 1989] Ronald C. Arkin. Motor schema-based mobile robot navigation. The International Journal of Robotics Research, 8(4):92-112, August 1989.

[Brooks, 1986] Rodney Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, RA-2(1):14-23, August 1986.

[Chien et al., 1991] S. A. Chien, M. T. Gervasio, and G. F. DeJong. On becoming decreasingly reactive: Learning to deliberate minimally. In Proceedings of the Eighth International Workshop on Machine Learning, pages 288-292, Chicago, IL, June 1991.

[Fikes et al., 1972] R. E. Fikes, P. E. Hart, and N. J. Nilsson. Learning and executing generalized robot plans. Artificial Intelligence, 3:251-288, 1972.

[Georgeff, 1987] M. Georgeff. Planning. Annual Review of Computer Science, 2:359-400, 1987.

[Hammond, 1989] Kristian J. Hammond. Case-Based Planning: Viewing Planning as a Memory Task. Perspectives in Artificial Intelligence. Academic Press, Boston, MA, 1989.

[Kaelbling, 1986] L. Kaelbling. An architecture for intelligent reactive systems. Technical Note 400, SRI International, October 1986.

[Maes, 1990] Pattie Maes. Situated agents can have goals. Robotics and Autonomous Systems, 6:49-70, 1990.

[Minton, 1988] Steven Minton. Learning effective search control knowledge: An explanation-based approach. PhD thesis, Carnegie-Mellon University, Computer Science Department, Pittsburgh, PA, 1988. Technical Report CMU-CS-88-133.

[Mitchell, 1990] T. M. Mitchell. Becoming increasingly reactive. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 1051-1058, Boston, MA, August 1990.

[Moorman and Ram, 1992] Kenneth Moorman and Ashwin Ram. A case-based approach to reactive control for autonomous robots. In Proceedings of the AAAI Fall Symposium on AI for Real-World Autonomous Mobile Robots, Cambridge, MA, October 1992.

[Mostow and Bhatnagar, 1987] J. Mostow and N. Bhatnagar. FAILSAFE: A floor planner that uses EBG to learn from its failures. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence, pages 249-255, Milan, Italy, August 1987.

[Payton, 1986] D. Payton. An architecture for reflexive autonomous vehicle control. In Proceedings of the IEEE Conference on Robotics and Automation, pages 1838-1845, 1986.

[Ram and Santamaria, 1993a] Ashwin Ram and Juan Carlos Santamaria. Continuous case-based reasoning. In D. B. Leake, editor, Proceedings of the AAAI Workshop on Case-Based Reasoning, Washington, DC, July 1993.

[Ram and Santamaria, 1993b] Ashwin Ram and Juan Carlos Santamaria. A multistrategy case-based and reinforcement learning approach to self-improving reactive control systems for autonomous robotic navigation. In R. S. Michalski and G. Tecuci, editors, Proceedings of the Second International Workshop on Multistrategy Learning, Harpers Ferry, WV, May 1993. Center for Artificial Intelligence, George Mason University, Fairfax, VA.

[Ram et al., 1992] Ashwin Ram, Ronald C. Arkin, Kenneth Moorman, and Russell J. Clark. Case-based reactive navigation: A case-based method for on-line selection and adaptation of reactive control parameters in autonomous robotic systems. Technical Report GIT-CC-92/57, College of Computing, Georgia Institute of Technology, Atlanta, GA, 1992.

[Sacerdoti, 1975] E. D. Sacerdoti. A structure for plans and behavior. Technical Note 109, Stanford Research Institute, Artificial Intelligence Center, 1975. Summarized in P. R. Cohen and E. A. Feigenbaum's Handbook of AI, Vol. III, pp. 541-550.

[Segre, 1988] Alberto M. Segre. Machine Learning of Robot Assembly Plans. Kluwer Academic Publishers, Norwell, MA, 1988.

[Sutton, 1992] R. S. Sutton, editor. Machine Learning, special issue on reinforcement learning, volume 8(3/4). Kluwer Academic, Hingham, MA, 1992.