Successive Refinement of Empirical Models

From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Ralph E. Gonzalez
Department of Mathematical Sciences
Rutgers University
Camden, NJ 08102, USA
(rgonzal@gandalf.rutgers.edu)
Abstract

An approach for automatic control using on-line learning is presented. The controller's model for the system represents the effects of control actions in various (hyperspherical) regions of state space. The partition produced by these regions is initially coarse, limiting optimization of the corresponding control actions. Regions contract periodically to enable optimization to progress further. New regions and associated optimizing elements are generated to maintain coverage of state space, with characteristics drawn from neighboring regions. During on-line operation the resulting state space partition undergoes successive refinement, with "generalization" occurring over successively smaller areas. The system model effectively grows and becomes more accurate with time.

1 Introduction

An on-line learning controller maintains a model of its environment (the action model), from which control actions are determined. The model is refined during nonsupervised operation to improve control with respect to an index of performance. A performance-adaptive controller [Saridis, 1979] learns by direct reduction of its uncertainties, rather than explicit identification of the environmental parameters. That is, the controller's understanding of its environment is represented by its control decisions, and is constructed empirically.

An example is the lookup-table controller [Waltz and Fu, 1965]. The space of system states is partitioned, and each "control situation" (partition element) is associated with an independent adaptive element. Each element attempts to determine the control action appropriate within its region of state space: for example, using a probability-reinforcement scheme. The system model is a table relating each region to the response which is most likely optimal.

This paper extends the lookup-table approach by allowing regions to contract and to recursively spawn smaller regions. The resulting state space partition undergoes successive refinement during the on-line training period, and the size of the lookup-table increases accordingly. This approach is implemented in a procedural framework, where each region is associated with an independent adaptive element whose accuracy is limited by the current region size. Beginning with large regions encourages generalization and speeds initial learning. As the regions contract, the associated adaptive elements may obtain arbitrary accuracy. Learning is accelerated in those areas of state space which arise most frequently, and the range of generalization is successively reduced. Application to the optimal control of a simulated robotic manipulator is discussed.

2 Learning Control

An automatic means of controlling a complex system is desired. The relevant aspects of the system comprise the system's state, denoted by the vector x. If the dynamics of the system are unknown or are difficult to analyze, the controller may empirically learn the appropriate control vector u for each state, according to some available index of performance (IP) f(x,u). The controller produces a model mapping the state space X onto the control actions u*(x) which optimize the IP.

Of particular interest is the problem of on-line learning, where the controller must construct a system model while controlling the system. In this case it is desirable for the controller to be initialized with, or to quickly learn, control actions which maintain stable operation, and to continue to improve upon these actions over time. A robust learning algorithm is required, since the selection of states is determined by the dynamics of the system rather than by a supervisory algorithm.

2.1 State Space Partitioning

If the state space X is discrete, the learning controller may approximate u*(x) for each state x ∈ X. If the cardinality of the set of states is very large or if X is continuous, then computing u*(x) for all states may require excessive (or infinite) time. In this case, the controller may partition state space into discrete regions {ri : i = 1, 2, ..., n}. The learning algorithm associates a single suboptimal control action ui with each region, producing a lookup-table.
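The lookup-table scheme of Section 2.1 can be sketched as follows. The grid discretization, the exploration rule, and all names here are illustrative stand-ins, not details from the paper:

```python
from bisect import bisect_right

class LookupTableController:
    """Minimal lookup-table controller sketch: state space is split into a
    fixed grid of regions, and each region independently remembers the
    candidate control action with the best (lowest) observed IP so far."""

    def __init__(self, bin_edges, actions):
        self.bin_edges = bin_edges   # sorted bin edges per state dimension
        self.actions = actions       # discrete candidate control actions
        self.table = {}              # region index -> (action, best IP)

    def region(self, x):
        # Map a continuous state vector to its partition element.
        return tuple(bisect_right(edges, xi)
                     for xi, edges in zip(x, self.bin_edges))

    def act(self, x, explore_index=0):
        # Use the stored action for this region, else explore a candidate.
        entry = self.table.get(self.region(x))
        return entry[0] if entry else self.actions[explore_index]

    def update(self, x, u, ip):
        # Reinforce: keep the action with the lowest IP seen in this region.
        r = self.region(x)
        if r not in self.table or ip < self.table[r][1]:
            self.table[r] = (u, ip)
```

A probability-reinforcement scheme in the sense of Waltz and Fu would replace the hard "keep the best" rule with reinforcement of action probabilities; the table-per-region structure is the same.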
If the state space partition is very coarse (n is small), then the effectiveness of ui may vary greatly for different states within the region ri. On the other hand, if the partition is very fine then the overall learning time will remain large. Previous researchers have applied techniques to improve the effectiveness of a coarse partition. Waltz and Fu [1965] suggested that regions along the switching boundary of a binary controller may be partitioned once more (based on the failure of the control actions associated with these regions to converge during the training period). Albus' neural model CMAC [1975] used overlapping coarse partitions to give the effect of a finer partition while achieving greater memory efficiency. Rosen et al. [1992] allowed region boundaries to migrate to conform to the control surface of the system, which also resulted in a controller which adapted well to changing system dynamics. These techniques do not fully overcome the accuracy limitation which ultimately characterizes any fixed-size system model.
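CMAC's overlapping-partitions idea survives today as tile coding. A minimal 1-D sketch, assuming uniform tilings with equal offsets (parameter names are illustrative):

```python
def tile_indices(x, n_tilings=4, tile_width=1.0):
    """Return one active tile per tiling for a 1-D state x. Each tiling is
    the same coarse partition, offset by a fraction of the tile width;
    summing one weight per tiling gives the effect of a partition
    n_tilings times finer than any single tiling, with far fewer cells."""
    idxs = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings
        idxs.append((t, int((x + offset) // tile_width)))
    return idxs
```

Two states less than `tile_width / n_tilings` apart activate the same tiles; more distant states differ in at least one tiling, which is where the effective resolution gain comes from.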
2.2 Contractible Adaptive Elements

The system model described here consists of a set C = {ci : i = 1, 2, ..., n(t)} of Contractible Adaptive Elements (CAEs), whose cardinality n is a nondecreasing function of time. Each CAE ci associates a single suboptimal control action ui with a hyperspherical region ri of the state space X. The element ci becomes active when the state x(t) moves into ri (Figure 1a). While active, the CAE searches for an optimal control action u*(x), according to some index of performance f(x,u). Since u*(x) varies over ri, the search may fail to converge. Under conditions of continuity, a bound on this "noise" can be obtained (generally based on the size of the region). This enables the establishment of an accuracy threshold δi beyond which the search becomes ineffective.

The active CAE ci searches until its accuracy threshold δi is reached. At this point, the CAE reduces the size of its region ri (Figure 1b) and reduces δi. As the controller is employed on-line, each CAE's region may contract repeatedly to permit arbitrary search accuracy.

When a region contracts, it is possible for a state to arise nearby which falls outside of the boundary of all existing CAEs in the set C (Figure 1c). A new CAE is generated, whose region may be centered at the current state location (Figure 1d). The size of the new region and the starting value for the search process may be "generalized" from neighboring CAEs, with consideration of the distance to the center of their associated regions. As more nearby states arise during on-line operation, the contraction of a CAE results in repopulation of the vacated area of state space with smaller CAEs. These second-generation CAEs will themselves eventually contract, leading to still-smaller third-generation CAEs. The range of generalization contracts with each generation.

It is possible for a state to fall within the boundaries of two or more hyperspherical CAEs. Only one may be activated; e.g., by consideration of the distance from the current state to the center of the respective regions.
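The CAE mechanics just described can be sketched as follows. The contraction factor, the threshold update, and the distance-weighted generalization rule are illustrative choices; the paper leaves these details open:

```python
import math

class CAE:
    """One Contractible Adaptive Element: a hyperspherical region of state
    space plus the suboptimal control action being refined inside it."""

    def __init__(self, center, radius, u, delta):
        self.center = center    # region center in state space
        self.radius = radius    # region radius (shrinks over time)
        self.u = u              # current suboptimal control action
        self.delta = delta      # accuracy threshold for the search

    def contains(self, x):
        return math.dist(x, self.center) <= self.radius

    def contract(self, factor=0.5):
        # Shrink the region and tighten the accuracy threshold together.
        self.radius *= factor
        self.delta *= factor

def active_cae(caes, x):
    """Of all CAEs whose region contains x, activate the one whose center
    is nearest: the tie-break suggested in the text."""
    hits = [c for c in caes if c.contains(x)]
    return min(hits, key=lambda c: math.dist(x, c.center)) if hits else None

def spawn_cae(caes, x):
    """x fell outside every region: create a new CAE centered on x, with
    action, radius, and threshold generalized from distance-weighted
    neighbors (one plausible generalization rule among many)."""
    w = [1.0 / (math.dist(x, c.center) + 1e-9) for c in caes]
    total = sum(w)
    u = [sum(wi * c.u[j] for wi, c in zip(w, caes)) / total
         for j in range(len(caes[0].u))]
    radius = sum(wi * c.radius for wi, c in zip(w, caes)) / total
    delta = sum(wi * c.delta for wi, c in zip(w, caes)) / total
    new = CAE(list(x), radius, u, delta)
    caes.append(new)
    return new
```

Because nearer neighbors carry more weight, a CAE spawned in a finely refined area inherits a small radius and a nearly converged action, which is exactly the shrinking range of generalization described above.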
Figure 1. CAE regions in a 2-dimensional state space: (a) at the beginning of a control interval, (b) after ri contracts, (c) at the beginning of the next control interval, (d) after creation of a new region.
2.2.1 Performance Features

Preliminary learning is accelerated by the use of a coarse initial partition whose individual elements enjoy frequent activation. Under certain conditions, it can be proven that the successive refinement approach achieves the accuracy possible with a fixed partition in less time [Gonzalez, 1989]. The CAE approach is also more flexible than that involving a fixed partition, since in many cases the desired partition resolution is not known beforehand.

During on-line control, the state space partition defined by the set C generally becomes more refined in those areas of state space which arise most frequently. This provides memory efficiency and accelerates learning in these areas.

The optimal control action may vary more quickly over some areas of state space than others. For example, there may be a discontinuity in the value of the optimum at some surface of state space. If this surface passes through a CAE's region, then that CAE's search process will fail to converge. The region should then contract, allowing the space it vacated to be repopulated with smaller CAEs. If this occurs successively, the discontinuity will be contained in regions whose combined volume can be made arbitrarily small. Thus the likelihood of a state arising in one of these critical regions will be equally small. On the other hand, if there is very little variation in the optimum over a large region, it is possible for the associated CAEs' search processes to use a finer accuracy threshold without contracting. This allows more rapid learning than would be the case with smaller CAEs.
2.2.2 Search Techniques

Assuming state space is continuous and the controller is in on-line operation, a particular state x will generally arise only once. Therefore it is not possible to accurately estimate the gradient ∇f(x,·) representing the change in performance produced by a change in control action for a given state. This prevents the use of pure steepest-descent based methods when searching for control actions. If the range of control actions is itself discrete, a probability-reinforcement scheme [Waltz and Fu, 1965] may be used in place of search. If control space is continuous, direct-search optimization is used.

Direct-search methods take many forms, and generally utilize implicit estimates for the performance gradient. In our implementation, each CAE maintains an independent direct search process. Its parameters include the current suboptimal control action, search direction, and step size. During each control interval, (1) a CAE ci containing the current state x is activated, (2) ci increments its suboptimal control action ui by one step in the search direction, (3) ui is applied to the system under control, (4) the resulting IP is evaluated, and (5) ci modifies its search direction according to this value. (The state is assumed constant during the control interval.) The step size is reduced as the suboptimal control action nears the optimum. The accuracy threshold δi determines the minimum allowable step size. In some cases δi can be shown to be proportional to the radius of ri. This occurs when the index of performance f is of the form f(x,u) = k||x − y(u)||; i.e., f measures the distance between a state "target" vector and an "output" vector y(u) [Gonzalez, 1989].

3 Robotics Application

We describe an application of the on-line learning controller to a dynamic simulation of a 2-degree-of-freedom robotic manipulator, including gravity and friction effects. The robot's dynamics are unknown to the controller, whose task is to determine the joint torques which optimally move the robot from an initial configuration to a target in a rectangular region in the plane of the robot (Figure 2). The index of performance is a weighted sum of the hand's position and velocity error and the energy and power requirements. It is generally difficult to analytically minimize such a function in the face of the robot's nonlinear dynamics, and in any case the computations necessary will generally preclude real-time control. Likewise, conventional adaptive methods do not produce optimal control of such systems.

The control interval is a single trial, beginning with a fixed robot configuration and presentation of a randomly-selected target. Since the robot's configuration at the beginning of each control interval is constant, the state x = [t1 t2]T consists solely of the 2-dimensional coordinates of the target. The controller computes a 4-dimensional control vector u = [Γ1a Γ2a Γ1d Γ2d]T, consisting of the acceleration and deceleration torques for each of the two joints. (The control vector could as easily have defined the electrical input signal to the joint actuators.) The acceleration torques are applied to their respective joints for the first half of the control interval (0.5 sec), and the deceleration torques are applied during the second half of the interval. The index of performance is computed at the end of the control interval. This value is used by the active CAE to adjust the direction of search.

Figure 2. Initial robot configuration and sample target.
Since the control interval coincides with the length of a trial, control during each trial is open-loop, or "blind". During on-line training, the controller constructs a lookup-table containing suboptimal open-loop torque profiles for reaching any target.

The system model consists initially of a single CAE, whose disk-shaped region is centered in the state space. Its control vector has been optimized for a target at the center of state space, and the region radius is correspondingly very small. (The optimization is accomplished by training the initial CAE with a series of trials whose targets are fixed at this location.)
3.1 Simulation Results

Figure 3 shows the CAE regions existing after 10, 100, 1000, and 10,000 trials. When a CAE ci has a relatively large region ri, the likelihood of a (randomly-selected) state x falling in ri is correspondingly large. Therefore, ci is activated frequently during on-line operation, and converges quickly to a control vector ui satisfying its accuracy threshold. Thereupon ri contracts. As training progresses, the small average size of the individual CAEs causes them to be activated rarely, and learning slows.

The size of a region is inversely related to the accuracy of the control action of its associated CAE. Since states are selected uniformly over state space, the variation in size of regions after 10,000 trials is related primarily to the difficulty of optimization in some areas of state space. There is also a bias toward greater accuracy at the center of state space due to the influence of the initial CAE, and a bias toward lower accuracy at the edges of state space where some regions overlap the boundary.

After 10,000 trials, the controller succeeds in placing the end-effector on any target with little error. Since energy and power usage are also included in the index of performance, the robot's motion is smooth and efficient. At this point, learning may cease. Alternately, the CAEs may remain active during on-line use of the robot to enable continually-improving performance.

After 10,000 trials, the computer memory requirement for this application is about 893 kb (each CAE requires roughly 450 bytes). This includes the lookup-table of suboptimal control actions (the system model) as well as storage for the individual CAE search processes. If learning continues during on-line use, the number of CAEs will increase and memory requirements will continue to grow.
3.1.1 Comparison with Fixed Partition

To compare the learning rate using CAEs versus that using a fixed partition of state space, another experiment is conducted. Each adaptive element is initialized with the control action which is optimum at the center of state space;
Figure 3. CAE regions after (a) 10 trials, (b) 100 trials, (c) 1000 trials, (d) 10,000 trials.
                  Contractible regions                  Fixed regions
Trial      Number of   Average   Average      Number of   Average   Region
number     elements    IP        region size  elements    IP        size
10         6           580.7     95.08        10          694.6     5
100        21          441.2     42.52        100         687.4     5
1000       174         243.0     22.70        881         795.2     5
10,000     2002        71.3      5.25         4025        611.3     5
100,000    10852       41.4      1.78         6390        168.3     5

Table 1. Partition statistics comparing the CAE approach versus fixed regions.
i.e., the same value which initialized the original experiment. The radius of each corresponding state space region is fixed to the average radius existing after the original 10,000 trials (5 units). The same 10,000 trials are presented again. As before, adaptive elements are generated whenever a state falls outside of all existing regions. However, once state space is fully blanketed with regions, the partition remains fixed. (The same performance is achieved if this partition is fixed before training begins.)

Table 1 compares the results with those obtained using the CAE approach. Following 10,000 trials, the number of fixed regions is 4025, indicating that the average adaptive element has only been in operation during 2.5 control intervals. Over the course of 100,000 trials the average index of performance of the adaptive elements begins to improve. In contrast, the CAE approach permits learning to begin immediately, and obtains superior performance after only 10,000 trials.
4 Conclusion

The benefits and limitations of this approach for low-level control are apparent from the simulation results. A lengthy training period and a large computer memory are required to construct a useful system model, particularly where the state space dimensionality is high. It may be necessary to train the controller off-line using a simulation of the system.

Lookup tables using fixed partitions suffer from lengthy training periods as well. The simulation results verify that the CAE approach achieves greater accuracy within a given training period. The CAE approach is flexible, since the resolution of the partition need not be fixed beforehand and may vary throughout state space. The system model is never "complete", but improves continually during on-line operation.

Efficiency may be improved by utilizing a higher-order (non-constant) approximation to the optimum within the region associated with each search process. Another variation uses a linearized model of the system to help steer search in the appropriate direction.
In cases where the system parameters are changing gradually with time, time may itself be included in the state vector. The effect is to make the distance from the current state to "old" CAE regions greater than the distance to newly-created CAEs, so that the controller can better track time-varying dynamics. To conserve memory, very old CAEs are purged, or "forgotten".

Further improvements in efficiency can be obtained where the overall control problem can be decomposed hierarchically, such that the state spaces relevant to each subproblem have smaller dimensionality than the main state space. Optimization in each of the respective state spaces must be coordinated carefully [Gonzalez, 1989]. This approach exploits the sharing of subproblems among problems, and encourages automatic composition of models in order to model new systems.

While high-level tasks generally require models which more closely represent the environment, these models themselves may be fine-tuned empirically through a meta-learning process based on lookup-table models.
References

[Albus, 1975] J. S. Albus, "A New Approach to Manipulator Control: The Cerebellar Model Articulation Controller (CMAC)", Trans. ASME, pp. 220-227, 1975.

[Gonzalez, 1989] R. E. Gonzalez, Learning by Progressive Subdivision of State Space, doctoral dissertation, Univ. of Pennsylvania, 1989.

[Rosen et al., 1992] B. Rosen, J. M. Goodwin, and J. J. Vidal, "Process Control with Adaptive Range Coding", Biol. Cybern., V. 66, no. 4, Mar. 1992.

[Saridis, 1979] G. N. Saridis, "Toward the Realization of Intelligent Controls", Proc. of the IEEE, V. 67, no. 8, p. 1122, August 1979.

[Waltz and Fu, 1965] M. D. Waltz and K. S. Fu, "A Heuristic Approach to Reinforcement Learning Control Systems", IEEE Trans. Automat. Contr., V. AC-10, pp. 390-398, 1965.