Chapter 8:
Generalization and Function Approximation
Objectives of this chapter:
• Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
• Overview of function approximation (FA) methods and how they can be adapted to RL
Any FA Method?
In principle, yes:
• artificial neural networks
• decision trees
• multivariate regression methods
• etc.
But RL has some special requirements:
• usually want to learn while interacting
• ability to handle nonstationarity
• other?
Gradient Descent Methods
\vec{\theta}_t = \bigl(\theta_t(1),\ \theta_t(2),\ \ldots,\ \theta_t(n)\bigr)^T \qquad (T \text{ denotes transpose})

Assume V_t is a (sufficiently smooth) differentiable function of θ_t, for all s ∈ S.

Assume, for now, training examples of this form:

\bigl(\text{description of } s_t,\ V^\pi(s_t)\bigr)
Performance Measures
• Many are applicable, but…
• a common and simple one is the mean-squared error (MSE) over a distribution P:

\mathrm{MSE}(\vec{\theta}_t) = \sum_{s \in S} P(s)\, \bigl[ V^\pi(s) - V_t(s) \bigr]^2
• Why P?
• Why minimize MSE?
• Let us assume that P is always the distribution of states at which backups are done.
• The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
Gradient Descent
Let f be any function of the parameter space.
Its gradient at any point θ_t in this space is:

\nabla_{\vec{\theta}} f(\vec{\theta}_t) = \left( \frac{\partial f(\vec{\theta}_t)}{\partial \theta(1)},\ \frac{\partial f(\vec{\theta}_t)}{\partial \theta(2)},\ \ldots,\ \frac{\partial f(\vec{\theta}_t)}{\partial \theta(n)} \right)^T

[Figure: the two-dimensional case, with axes θ(1) and θ(2) and a point θ_t = (θ_t(1), θ_t(2))^T.]

Iteratively move down the gradient:

\vec{\theta}_{t+1} = \vec{\theta}_t - \alpha \nabla_{\vec{\theta}} f(\vec{\theta}_t)
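As a minimal illustration of this update rule (not from the slides), the sketch below iterates it on a simple quadratic f whose gradient is known in closed form; the particular f, step size, and iteration count are arbitrary choices.

import numpy as np

# Illustrative objective: f(theta) = ||theta - target||^2, with gradient 2 * (theta - target)
target = np.array([1.0, -2.0])

def grad_f(theta):
    return 2.0 * (theta - target)

theta = np.zeros(2)   # initial parameter vector
alpha = 0.1           # step size

for _ in range(100):
    theta = theta - alpha * grad_f(theta)   # theta_{t+1} = theta_t - alpha * grad f(theta_t)

print(theta)          # approaches `target`, the minimizer of f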
On-Line Gradient-Descent TD(λ)
Linear Methods
Represent states as feature vectors: for each s ∈ S,

\vec{\phi}_s = \bigl(\phi_s(1),\ \phi_s(2),\ \ldots,\ \phi_s(n)\bigr)^T

V_t(s) = \vec{\theta}_t^{\,T} \vec{\phi}_s = \sum_{i=1}^{n} \theta_t(i)\, \phi_s(i)

\nabla_{\vec{\theta}_t} V_t(s) = \ ?
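A minimal numpy sketch of this linear form (the numbers are made up): the value estimate is a dot product of the parameter and feature vectors and, as the next slide notes, its gradient with respect to θ_t is simply φ_s.

import numpy as np

theta = np.array([0.5, -0.2, 1.0])   # parameter vector theta_t (made-up values)
phi_s = np.array([1.0, 0.0, 2.0])    # feature vector phi_s for some state s

v = theta @ phi_s                    # V_t(s) = theta_t^T phi_s
grad = phi_s                         # gradient of V_t(s) with respect to theta_t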
Nice Properties of Linear FA Methods
• The gradient is very simple: ∇_θ V_t(s) = φ_s
• For MSE, the error surface is simple: a quadratic surface with a single minimum.
• Linear gradient-descent TD(λ) converges:
  • if the step size decreases appropriately, and
  • under on-line sampling (states sampled from the on-policy distribution).
• It converges to a parameter vector θ_∞ with the property

\mathrm{MSE}(\vec{\theta}_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\, \mathrm{MSE}(\vec{\theta}^{*})

where θ* is the best parameter vector (Tsitsiklis & Van Roy, 1997).
Coarse Coding
Shaping Generalization in Coarse Coding
Learning and Coarse Coding
Tile Coding
• Binary feature for each tile
• Number of features present at any one time is constant
• Binary features make the weighted sum easy to compute
• Easy to compute the indices of the features present (see the sketch below)
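A minimal sketch of that index computation for a one-dimensional state with several offset tilings (the tile width, number of tiles, and number of tilings are arbitrary illustrative choices, not from the slides):

import numpy as np

def tile_indices(x, n_tilings=4, tiles_per_tiling=10, tile_width=1.0):
    """Return one active (binary) tile index per tiling for a 1-D state x."""
    indices = []
    for k in range(n_tilings):
        offset = k * tile_width / n_tilings                 # each tiling is shifted slightly
        tile = int(np.floor((x + offset) / tile_width))
        tile = min(max(tile, 0), tiles_per_tiling - 1)      # clip to the tiling's range
        indices.append(k * tiles_per_tiling + tile)         # flat index into the weight vector
    return indices

# With binary features, the weighted sum is just the sum of the weights of the active tiles:
theta = np.zeros(4 * 10)
v = sum(theta[i] for i in tile_indices(3.7))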
Tile Coding Cont.
• Irregular tilings
• Hashing
• CMAC: “cerebellar model arithmetic computer” (Albus, 1971)
Radial Basis Functions (RBFs)
e.g., Gaussians:

\phi_s(i) = \exp\!\left( - \frac{\lVert s - c_i \rVert^2}{2\sigma_i^2} \right)

where c_i is the center of feature i and σ_i its width.
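A minimal numpy sketch of these Gaussian RBF features for a scalar state (the centers, width, and weights are made-up values):

import numpy as np

centers = np.linspace(0.0, 1.0, 5)    # c_i: the feature centers
sigma = 0.25                           # sigma_i: here a single common width
theta = np.zeros(5)                    # weights for a linear value estimate

def rbf_features(s):
    return np.exp(-((s - centers) ** 2) / (2.0 * sigma ** 2))

phi = rbf_features(0.3)                # feature vector phi_s for state s = 0.3
v = theta @ phi                        # with linear FA, V_t(s) = theta_t^T phi_s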
Can you beat the “curse of dimensionality”?
• Can you keep the number of features from going up exponentially with the dimension?
• Function complexity, not dimensionality, is the problem.
• Kanerva coding (see the sketch after this list):
  • Select a bunch of binary prototypes
  • Use Hamming distance as the distance measure
  • Dimensionality is no longer a problem, only complexity
• “Lazy learning” schemes:
  • Remember all the data
  • To get a new value, find the nearest neighbors and interpolate
  • e.g., locally weighted regression
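A minimal sketch of Kanerva coding along these lines (the number of bits, number of prototypes, and activation radius are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
n_bits, n_prototypes, radius = 32, 50, 12        # illustrative sizes

prototypes = rng.integers(0, 2, size=(n_prototypes, n_bits))   # random binary prototypes

def kanerva_features(state_bits):
    """Binary feature i is active if the state is within `radius` Hamming distance of prototype i."""
    hamming = np.sum(prototypes != state_bits, axis=1)
    return (hamming <= radius).astype(float)

state = rng.integers(0, 2, size=n_bits)          # a binary description of the current state
phi = kanerva_features(state)                    # feature vector usable with linear FA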
Control with FA
Learning state-action values
Training examples of the form:

\bigl(\text{description of } (s_t, a_t),\ v_t\bigr)

The general gradient-descent rule:

\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \bigl[ v_t - Q_t(s_t, a_t) \bigr] \nabla_{\vec{\theta}} Q_t(s_t, a_t)

Gradient-descent Sarsa(λ) (backward view):

\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \delta_t \vec{e}_t

where

\delta_t = r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)
\vec{e}_t = \gamma\lambda\, \vec{e}_{t-1} + \nabla_{\vec{\theta}} Q_t(s_t, a_t)
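A minimal sketch of these Sarsa(λ) updates with linear function approximation (the feature sizes and hyperparameters are placeholders; a full agent would also need action selection, e.g. ε-greedy, and an episode loop):

import numpy as np

n_features, n_actions = 100, 3
alpha, gamma, lam = 0.1, 0.99, 0.9

theta = np.zeros((n_actions, n_features))   # one weight vector per action: Q(s,a) = theta[a] @ phi_s
e = np.zeros_like(theta)                    # eligibility traces, same shape as theta

def q(phi, a):
    return theta[a] @ phi

def sarsa_lambda_step(phi, a, r, phi_next, a_next):
    """Backward-view Sarsa(lambda) update for one transition (s, a, r, s', a')."""
    global theta, e
    delta = r + gamma * q(phi_next, a_next) - q(phi, a)   # TD error delta_t
    e *= gamma * lam                                      # decay all traces
    e[a] += phi                                           # grad of linear Q w.r.t. theta[a] is phi
    theta += alpha * delta * e                            # theta_{t+1} = theta_t + alpha * delta_t * e_t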
GPI Linear Gradient Descent Watkins’ Q(λ)
GPI with Linear Gradient Descent Sarsa(λ)
Mountain-Car Task
Mountain Car with Radial Basis Functions
Mountain-Car Results
Baird’s Counterexample
Baird’s Counterexample Cont.
Should We Bootstrap?
Summary
• Generalization
• Adapting supervised-learning function approximation methods
• Gradient-descent methods
• Linear gradient-descent methods
• Radial basis functions
• Tile coding
• Kanerva coding
• Nonlinear gradient-descent methods? Backpropagation?
• Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction
Value Prediction with FA
As usual: Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.

In earlier chapters, value functions were stored in lookup tables. Here, the value-function estimate at time t, V_t, depends on a parameter vector θ_t, and only the parameter vector is updated.

e.g., θ_t could be the vector of connection weights of a neural network.
Adapt Supervised Learning Algorithms
Training Info = desired (target) outputs

[Diagram: Inputs → Supervised Learning System → Outputs]

Training example = {input, target output}
Error = (target output – actual output)
Backups as Training Examples
e.g., the TD(0) backup:

V(s_t) \leftarrow V(s_t) + \alpha \bigl[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \bigr]

As a training example:

\bigl(\underbrace{\text{description of } s_t}_{\text{input}},\ \underbrace{r_{t+1} + \gamma V(s_{t+1})}_{\text{target output}}\bigr)
Gradient Descent Cont.
For the MSE given above and using the chain rule:

\vec{\theta}_{t+1} = \vec{\theta}_t - \tfrac{1}{2} \alpha \nabla_{\vec{\theta}} \mathrm{MSE}(\vec{\theta}_t)
             = \vec{\theta}_t - \tfrac{1}{2} \alpha \nabla_{\vec{\theta}} \sum_{s \in S} P(s) \bigl[ V^\pi(s) - V_t(s) \bigr]^2
             = \vec{\theta}_t + \alpha \sum_{s \in S} P(s) \bigl[ V^\pi(s) - V_t(s) \bigr] \nabla_{\vec{\theta}} V_t(s)
Gradient Descent Cont.
Use just the sample gradient instead:

\vec{\theta}_{t+1} = \vec{\theta}_t - \tfrac{1}{2} \alpha \nabla_{\vec{\theta}} \bigl[ V^\pi(s_t) - V_t(s_t) \bigr]^2
             = \vec{\theta}_t + \alpha \bigl[ V^\pi(s_t) - V_t(s_t) \bigr] \nabla_{\vec{\theta}} V_t(s_t)

Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t:

E\Bigl\{ \bigl[ V^\pi(s_t) - V_t(s_t) \bigr] \nabla_{\vec{\theta}} V_t(s_t) \Bigr\} = \sum_{s \in S} P(s) \bigl[ V^\pi(s) - V_t(s) \bigr] \nabla_{\vec{\theta}} V_t(s)
But We Don’t have these Targets
Suppose we just have targets v_t instead:

\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \bigl[ v_t - V_t(s_t) \bigr] \nabla_{\vec{\theta}} V_t(s_t)

If each v_t is an unbiased estimate of V^π(s_t), i.e., E\{v_t\} = V^\pi(s_t), then gradient descent converges to a local minimum (provided α decreases appropriately).

e.g., the Monte Carlo target v_t = R_t:

\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \bigl[ R_t - V_t(s_t) \bigr] \nabla_{\vec{\theta}} V_t(s_t)
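A minimal sketch of this Monte Carlo update with linear function approximation (the episode format, feature function, and step size are placeholders for illustration):

import numpy as np

def gradient_mc_episode(theta, features, episode, alpha=0.01, gamma=1.0):
    """Update theta from one episode, given as a list of (state, reward) pairs,
    where `reward` is the reward received after leaving `state`."""
    returns = []
    G = 0.0
    for state, reward in reversed(episode):    # R_t = r_{t+1} + gamma * R_{t+1}
        G = reward + gamma * G
        returns.append((state, G))
    for state, G in reversed(returns):
        phi = features(state)                              # feature vector phi_s
        theta = theta + alpha * (G - theta @ phi) * phi    # grad of linear V_t is phi
    return theta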
What about TD(λ) Targets?

\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \bigl[ R_t^{\lambda} - V_t(s_t) \bigr] \nabla_{\vec{\theta}} V_t(s_t)

Not for λ < 1 (the λ-return is then not an unbiased estimate of V^π(s_t)).
But we do it anyway, using the backwards view:

\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \delta_t \vec{e}_t,

where:

\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), as usual, and
\vec{e}_t = \gamma\lambda\, \vec{e}_{t-1} + \nabla_{\vec{\theta}} V_t(s_t)
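A minimal sketch of one backward-view TD(λ) update with linear function approximation (the feature vectors and hyperparameters are placeholders):

import numpy as np

def td_lambda_step(theta, e, phi, r, phi_next, alpha=0.1, gamma=0.99, lam=0.8):
    """One backward-view TD(lambda) update with linear V_t(s) = theta^T phi_s."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi)   # TD error delta_t
    e = gamma * lam * e + phi                                # eligibility trace; grad of linear V is phi
    theta = theta + alpha * delta * e                        # theta_{t+1} = theta_t + alpha * delta_t * e_t
    return theta, e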