From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Successive Refinement of Empirical Models

Ralph E. Gonzalez
Department of Mathematical Sciences
Rutgers University
Camden, NJ 08102, USA
(rgonzal@gandalf.rutgers.edu)

Abstract

An approach for automatic control using on-line learning is presented. The controller's model for the system represents the effects of control actions in various (hyperspherical) regions of state space. The partition produced by these regions is initially coarse, limiting optimization of the corresponding control actions. Regions contract periodically to enable optimization to progress further. New regions and associated optimizing elements are generated to maintain coverage of state space, with characteristics drawn from neighboring regions. During on-line operation the resulting state space partition undergoes successive refinement, with "generalization" occurring over successively smaller areas. The system model effectively grows and becomes more accurate with time.

1 Introduction

An on-line learning controller maintains a model of its environment (the action model), from which control actions are determined. The model is refined during nonsupervised operation to improve control with respect to an index of performance. A performance-adaptive controller [Saridis, 1979] learns by direct reduction of its uncertainties, rather than explicit identification of the environmental parameters. That is, the controller's understanding of its environment is represented by its control decisions, and is constructed empirically.

An example is the lookup-table controller [Waltz and Fu, 1965]. The space of system states is partitioned, and each "control situation" (partition element) is associated with an independent adaptive element. Each element attempts to determine the control action appropriate within its region of state space: for example, using a probability-reinforcement scheme. The system model is a table relating each region to the response which is most likely optimal.

This paper extends the lookup-table approach by allowing regions to contract and to recursively spawn smaller regions. The resulting state space partition undergoes successive refinement during the on-line training period, and the size of the lookup-table increases accordingly. This approach is implemented in a procedural framework, where each region is associated with an independent adaptive element whose accuracy is limited by the current region size. Beginning with large regions encourages generalization and speeds initial learning. As the regions contract, the associated adaptive elements may obtain arbitrary accuracy. Learning is accelerated in those areas of state space which arise most frequently, and the range of generalization is successively reduced. Application to the optimal control of a simulated robotic manipulator is discussed.

2 Learning Control

An automatic means of controlling a complex system is desired. The relevant aspects of the system comprise the system's state, denoted by the vector x. If the dynamics of the system are unknown or are difficult to analyze, the controller may empirically learn the appropriate control vector u for each state, according to some available index of performance (IP) f(x,u). The controller produces a model mapping the state space X onto the control actions u*(x) which optimize the IP.

Of particular interest is the problem of on-line learning, where the controller must construct a system model while controlling the system. In this case it is desirable for the controller to be initialized with or to quickly learn control actions which maintain stable operation, and to continue to improve upon these actions over time. A robust learning algorithm is required, since the selection of states is determined by the dynamics of the system rather than by a supervisory algorithm.

2.1 State Space Partitioning

If the state space X is discrete, the learning controller may approximate u*(x) for each state x ∈ X. If the cardinality of the set of states is very large or if X is continuous, then computing u*(x) for all states may require excessive (or infinite) time. In this case, the controller may partition state space into discrete regions {r_i: i=1,2,...,n}. The learning algorithm associates a single suboptimal control action u_i with each region, producing a lookup-table.

If the state space partition is very coarse (n is small), then the effectiveness of u_i may vary greatly for different states within the region r_i. On the other hand, if the partition is very fine then the overall learning time will remain large. Previous researchers have applied techniques to improve the effectiveness of a coarse partition. Waltz and Fu [1965] suggested that regions along the switching boundary of a binary controller may be partitioned once more (based on the failure of the control actions associated with these regions to converge during the training period). Albus' neural model CMAC [1975] used overlapping coarse partitions to give the effect of a finer partition while achieving greater memory efficiency. Rosen et al. [1992] allowed region boundaries to migrate to conform to the control surface of the system, which also resulted in a controller which adapted well to changing system dynamics. These techniques do not fully overcome the accuracy limitation which ultimately characterizes any fixed-size system model.

2.2 Contractible Adaptive Elements

The system model described here consists of a set C = {c_i: i=1,2,...,n(t)} of Contractible Adaptive Elements (CAEs), whose cardinality n is a nondecreasing function of time. Each CAE c_i associates a single suboptimal control action u_i with a hyperspherical region r_i of the state space X. The element c_i becomes active when the state x(t) moves into r_i (Figure 1a).
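As a concrete illustration, the selection of the active element can be sketched as below. This is a minimal Python sketch, not the paper's implementation: the containment test and the nearest-center tie-break for overlapping regions follow Section 2.2, while the record layout, names, and the linear scan over the set C are assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CAE:
    """One Contractible Adaptive Element: a hyperspherical region r_i of
    state space plus the suboptimal control action currently associated
    with it (hypothetical record layout for illustration)."""
    center: List[float]   # center of the hyperspherical region r_i
    radius: float         # current region size (contracts over time)
    u: List[float]        # suboptimal control action u_i for this region

    def contains(self, x: List[float]) -> bool:
        # x(t) lies inside r_i when within the hypersphere's radius.
        return math.dist(x, self.center) <= self.radius

def active_cae(caes: List[CAE], x: List[float]) -> Optional[CAE]:
    """Among all regions containing x, activate the CAE whose center is
    nearest (the distance-based tie-break of Section 2.2).  Returns None
    when x is uncovered, i.e. a new CAE must be generated."""
    candidates = [c for c in caes if c.contains(x)]
    if not candidates:
        return None
    return min(candidates, key=lambda c: math.dist(x, c.center))
```

A real controller would replace the linear scan with a spatial index once n(t) grows large; the scan is kept here only for clarity.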
While active, the CAE searches for an optimal control action u*(x), according to some index of performance f(x,u). Since u*(x) varies over r_i, the search may fail to converge. Under conditions of continuity, a bound on this "noise" can be obtained (generally based on the size of the region). This enables the establishment of an accuracy threshold δ_i beyond which the search becomes ineffective.

The active CAE c_i searches until its accuracy threshold δ_i is reached. At this point, the CAE reduces the size of its region r_i (Figure 1b) and reduces δ_i. As the controller is employed on-line, each CAE's region may contract repeatedly to permit arbitrary search accuracy.

When a region contracts, it is possible for a state to arise nearby which falls outside of the boundary of all existing CAEs in the set C (Figure 1c). A new CAE is generated, whose region may be centered at the current state location (Figure 1d). The size of the new region and the starting value for the search process may be "generalized" from neighboring CAEs, with consideration of the distance to the center of their associated regions.

As more nearby states arise during on-line operation, the contraction of a CAE results in repopulation of the vacated area of state space with smaller CAEs. These second-generation CAEs will themselves eventually contract, leading to still-smaller third-generation CAEs. The range of generalization contracts with each generation.

It is possible for a state to fall within the boundaries of two or more hyperspherical CAEs. Only one may be activated; e.g., by consideration of the distance from the current state to the center of the respective regions.

Figure 1. CAE regions in 2-dimensional state space: (a) at beginning of control interval, (b) after r_i contracts, (c) at beginning of next control interval, (d) after creation of new region.

2.2.1 Performance Features

Preliminary learning is accelerated by the use of a coarse initial partition whose individual elements enjoy frequent activation.
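The contraction and spawning behavior described in Section 2.2 can be sketched as below. This is an illustrative Python sketch only: the inverse-distance weighting used to "generalize" the new region's radius and starting action from its neighbors, and the fixed contraction factor, are assumptions (the paper states only that the distance to neighboring centers is taken into consideration, and fixes no contraction rate).

```python
import math

CONTRACTION = 0.5   # assumed contraction factor; the paper does not fix a value

def spawn_cae(caes, x):
    """Create a new region centered at the uncovered state x (Figure 1d).
    Radius and starting control action are "generalized" from existing
    regions, here by inverse-distance weighting of their centers
    (an assumed instantiation of Section 2.2)."""
    w = [1.0 / (1e-9 + math.dist(x, c["center"])) for c in caes]
    total = sum(w)
    radius = sum(wi * c["radius"] for wi, c in zip(w, caes)) / total
    dim = len(caes[0]["u"])
    u = [sum(wi * c["u"][k] for wi, c in zip(w, caes)) / total
         for k in range(dim)]
    new = {"center": list(x), "radius": radius, "u": u}
    caes.append(new)
    return new

def contract(cae):
    """Reduce the region size (Figure 1b) once the search reaches its
    accuracy threshold, permitting a finer threshold thereafter."""
    cae["radius"] *= CONTRACTION
    return cae
```

Weighting by inverse distance makes a new region resemble its closest neighbor most strongly, so generalization naturally narrows as the surrounding regions shrink across generations.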
Under certain conditions, it can be proven that the successive refinement approach achieves the accuracy possible with a fixed partition in less time [Gonzalez, 1989]. The CAE approach is also more flexible than that involving a fixed partition, since in many cases the desired partition resolution is not known beforehand.

During on-line control, the state space partition defined by the set C generally becomes more refined in those areas of state space which arise most frequently. This provides memory efficiency and accelerates learning in these areas.

The optimal control action may vary more quickly over some areas of state space than others. For example, there may be a discontinuity in the value of the optimum at some surface of state space. If this surface passes through a CAE's region, then that CAE's search process will fail to converge. The region should then contract, allowing the space it vacated to be repopulated with smaller CAEs. If this occurs successively, the discontinuity will be contained in regions whose combined volume can be made arbitrarily small. Thus the likelihood of a state arising in one of these critical regions will be equally small. On the other hand, if there is very little variation in the optimum over a large region, it is possible for the associated CAEs' search processes to use a finer accuracy threshold without contracting. This allows more rapid learning than would be the case with smaller CAEs.

2.2.2 Search Techniques

Assuming state space is continuous and the controller is in on-line operation, a particular state x will generally arise only once. Therefore it is not possible to accurately estimate the gradient ∇f(x,·) representing the change in performance produced by a change in control action for a given state. This prevents the use of pure steepest-descent based methods when searching for control actions. If the range of control actions is itself discrete, a probability-reinforcement scheme [Waltz and Fu, 1965] may be used in place of search. If control space is continuous, direct-search optimization is used. Direct-search methods take many forms, and generally utilize implicit estimates for the performance gradient.

In our implementation, each CAE maintains an independent direct search process. Its parameters include the current suboptimal control action, search direction, and step size. During each control interval, (1) a CAE c_i containing the current state x is activated, (2) c_i increments its suboptimal control action u_i by one step in the search direction, (3) u_i is applied to the system under control, (4) the resulting IP is evaluated, and (5) c_i modifies its search direction according to this value. (The state is assumed constant during the control interval.) The step size is reduced as the suboptimal control action nears the optimum. The accuracy threshold δ_i determines the minimum allowable step size. In some cases δ_i can be shown to be proportional to the radius of r_i. This occurs when the index of performance f is of the form f(x,u) = k||x − y(u)||; i.e., f measures the distance between a state "target" vector and an "output" vector y(u) [Gonzalez, 1989].

3 Robotics Application

We describe an application of the on-line learning controller to a dynamic simulation of a 2-degree-of-freedom robotic manipulator, including gravity and friction effects. The robot's dynamics are unknown to the controller, whose task is to determine the joint torques which optimally move the robot from an initial configuration to a target in a rectangular region in the plane of the robot (Figure 2). The index of performance is a weighted sum of the hand's position and velocity error and the energy and power requirements. It is generally difficult to analytically minimize such a function in the face of the robot's nonlinear dynamics, and in any case the computations necessary will generally preclude real-time control. Likewise, conventional adaptive methods do not produce optimal control of such systems.

The control interval is a single trial, beginning with a fixed robot configuration and presentation of a randomly-selected target. Since the robot's configuration at the beginning of each control interval is constant, the state x = [t_1 t_2]^T consists solely of the 2-dimensional coordinates of the target. The controller computes a 4-dimensional control vector u = [Γ_1a Γ_2a Γ_1d Γ_2d]^T, consisting of the acceleration and deceleration torques for each of the two joints. (The control vector could as easily have defined the electrical input signal to the joint actuators.) The acceleration torques are applied to their respective joints for the first half of the control interval (0.5 sec), and the deceleration torques are applied during the second half of the interval. The index of performance is computed at the end of the control interval. This value is used by the active CAE to adjust the direction of search.

Figure 2. Initial robot configuration and sample target.

Since the control interval coincides with the length of a trial, control during each trial is open-loop, or "blind". During on-line training, the controller constructs a lookup-table containing suboptimal open-loop torque profiles for reaching any target.

The system model consists initially of a single CAE, whose disk-shaped region is centered in the state space. Its control vector has been optimized for a target at the center of state space, and the region radius is correspondingly very small. (The optimization is accomplished by training the initial CAE with a series of trials whose targets are fixed at this location.)

3.1 Simulation Results

Figure 3 shows the CAE regions existing after 10, 100, 1000, and 10,000 trials. When a CAE c_i has a relatively large region r_i, the likelihood of a (randomly-selected) state x falling in r_i is correspondingly large. Therefore, c_i is activated frequently during on-line operation, and converges quickly to a control vector u_i satisfying its accuracy threshold.
Thereupon r_i contracts. As training progresses, the decreasing average size of the individual CAEs causes them to be activated rarely, and learning slows.

The size of a region is inversely related to the accuracy of the control action of its associated CAE. Since states are selected uniformly over state space, the variation in size of regions after 10,000 trials is related primarily to the difficulty of optimization in some areas of state space. There is also a bias toward greater accuracy at the center of state space due to the influence of the initial CAE, and a bias toward lower accuracy at the edges of state space where some regions overlap the boundary.

After 10,000 trials, the controller succeeds in placing the end-effector on any target with little error. Since energy and power usage are also included in the index of performance, the robot's motion is smooth and efficient. At this point, learning may cease. Alternately, the CAEs may remain active during on-line use of the robot to enable continually-improving performance.

After 10,000 trials, the computer memory requirement for this application is about 893 kb (each CAE requires roughly 450 bytes). This includes the lookup-table of suboptimal control actions (the system model) as well as memory for the individual CAE search processes. If learning continues during on-line use, the number of CAEs will increase and memory requirements will continue to grow.

Figure 3. CAE regions after (a) 10 trials, (b) 100 trials, (c) 1000 trials, (d) 10,000 trials.

3.1.1 Comparison with Fixed Partition

To compare the learning rate using CAEs versus that using a fixed partition of state space, another experiment is conducted. Each adaptive element is initialized with the control action which is optimum at the center of state space; i.e., the same value which initialized the original experiment. The radius of each corresponding state space region is fixed to the average radius existing after the original 10,000 trials (5 units). The same 10,000 trials are presented again. As before, adaptive elements are generated whenever a state falls outside of all existing regions. However, once state space is fully blanketed with regions, the partition remains fixed. (The same performance is achieved if this partition is fixed before training begins.)

                 Contractible regions                    Fixed region size
  Trial    Number of   Average   Average       Number of   Average   Region
  number   elements    IP        region size   elements    IP        size
  10       6           580.7     95.08         10          694.6     5
  100      21          441.2     42.52         100         687.4     5
  1000     174         243.0     22.70         881         795.2     5
  10000    2002        71.3      5.25          4025        611.3     5
  100000   10852       41.4      1.78          6390        168.3     5

Table 1. Partition statistics comparing the CAE approach versus fixed regions.

Table 1 compares the results with those obtained using the CAE approach. Following 10,000 trials, the number of fixed regions is 4025, indicating that the average adaptive element has only been in operation during 2.5 control intervals. Over the course of 100,000 trials the average index of performance of the adaptive elements begins to improve. In contrast, the CAE approach permits learning to begin immediately, and obtains superior performance after only 10,000 trials.

4 Conclusion

The benefits and limitations of this approach for low-level control are apparent from the simulation results. A lengthy training period and large computer memory are required to construct a useful system model, particularly where the state space dimensionality is high. It may be necessary to train the controller off-line using a simulation of the system. Lookup tables using fixed partitions suffer from lengthy training periods as well. The simulation results verify that the CAE approach achieves greater accuracy with a given training period.
The CAE approach is flexible, since the resolution of the partition need not be fixed beforehand and may vary throughout state space. The system model is never "complete", but improves continually during on-line operation.

Efficiency may be improved by utilizing a higher-order (non-constant) approximation to the optimum within the region associated with each search process. Another variation uses a linearized model of the system to help steer search in the appropriate direction.

In cases where the system parameters are changing gradually with time, time may itself be included in the state vector. The effect is to make the distance from the current state to "old" CAE regions greater than the distance to newly-created CAEs, so that the controller can better track time-varying dynamics. To conserve memory, very old CAEs are purged, or "forgotten".

Further improvements in efficiency can be obtained where the overall control problem can be decomposed hierarchically, such that the state spaces relevant to each subproblem have smaller dimensionality than the main state space. Optimization in each of the respective state spaces must be coordinated carefully [Gonzalez, 1989]. This approach exploits the sharing of subproblems among problems, and encourages automatic composition of models in order to model new systems. While high-level tasks generally require models which more closely represent the environment, these models themselves may be fine-tuned empirically through a meta-learning process based on lookup-table models.

References

[Albus, 1975] J. S. Albus, "A New Approach to Manipulator Control: The Cerebellar Model Articulation Controller (CMAC)", Trans. ASME, pp. 220-227, 1975.

[Gonzalez, 1989] R. E. Gonzalez, Learning by Progressive Subdivision of State Space, doctoral dissertation, Univ. of Pennsylvania, 1989.

[Rosen et al., 1992] B. Rosen, J. M. Goodwin, and J. J. Vidal, "Process Control with Adaptive Range Coding", Biol. Cybern., V. 66, no. 4, Mar. 1992.

[Saridis, 1979] G. N. Saridis, "Toward the Realization of Intelligent Controls", Proc. of the IEEE, V. 67, no. 8, p. 1122, August 1979.

[Waltz and Fu, 1965] M. D. Waltz and K. S. Fu, "A Heuristic Approach to Reinforcement Learning Control Systems", IEEE Trans. Automat. Contr., V. AC-10, pp. 390-398, 1965.