Chapter 8: Generalization and Function Approximation

Objectives of this chapter:
- Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
- Overview of function approximation (FA) methods and how they can be adapted to RL.

Any FA Method?
In principle, yes:
- artificial neural networks
- decision trees
- multivariate regression methods
- etc.
But RL has some special requirements:
- usually want to learn while interacting
- ability to handle nonstationarity
- other?

Gradient Descent Methods
Write the parameter vector as \theta_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^T  (T denotes transpose).
Assume V_t is a (sufficiently smooth) differentiable function of \theta_t, for all s \in S.
Assume, for now, training examples of this form: (description of s_t, V^\pi(s_t)).

Performance Measures
Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P:
MSE(\theta_t) = \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)]^2
Why P? Why minimize MSE?
Let us assume that P is always the distribution of states at which backups are done.
The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.

Gradient Descent
Let f be any function of the parameter space. Its gradient at any point \theta_t in this space is:
\nabla_\theta f(\theta_t) = \left( \frac{\partial f(\theta_t)}{\partial \theta(1)}, \frac{\partial f(\theta_t)}{\partial \theta(2)}, \ldots, \frac{\partial f(\theta_t)}{\partial \theta(n)} \right)^T
Iteratively move down the gradient:
\theta_{t+1} = \theta_t - \alpha \nabla_\theta f(\theta_t)

On-Line Gradient-Descent TD(λ)

Linear Methods
Represent states as feature vectors: for each s \in S,
\phi_s = (\phi_s(1), \phi_s(2), \ldots, \phi_s(n))^T
V_t(s) = \theta_t^T \phi_s = \sum_{i=1}^{n} \theta_t(i)\, \phi_s(i)
\nabla_\theta V_t(s) = ?

Nice Properties of Linear FA Methods
- The gradient is very simple: \nabla_\theta V_t(s) = \phi_s
- For MSE, the error surface is simple: a quadratic surface with a single minimum.
- Linear gradient-descent TD(λ) converges, provided the step size decreases appropriately and sampling is on-line (states sampled from the on-policy distribution), to a parameter vector \theta_\infty with the property (Tsitsiklis & Van Roy, 1997):
  MSE(\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\, MSE(\theta^*)
  where \theta^* is the best parameter vector.

Coarse Coding

Shaping Generalization in Coarse Coding

Learning and Coarse Coding

Tile Coding
- Binary feature for each tile
- Number of features present at any one time is constant
- Binary features mean the weighted sum is easy to compute
- Easy to compute the indices of the features present

Tile Coding, Cont.
- Irregular tilings
- Hashing
- CMAC: "Cerebellar Model Arithmetic Computer" (Albus, 1971)

Radial Basis Functions (RBFs)
e.g., Gaussians:
\phi_s(i) = \exp\left( -\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2} \right)
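Not part of the original slides: a minimal Python sketch of the Gaussian RBF features above, plugged into the linear form V_t(s) = \theta_t^T \phi_s from the Linear Methods slide. The 2-D state space, the 5x5 grid of centers, and the common width are illustrative assumptions.

```python
import numpy as np

def rbf_features(s, centers, sigmas):
    """Gaussian RBF features: phi_s(i) = exp(-||s - c_i||^2 / (2 * sigma_i^2))."""
    s = np.asarray(s, dtype=float)
    dists_sq = np.sum((centers - s) ** 2, axis=1)   # ||s - c_i||^2 for every center
    return np.exp(-dists_sq / (2.0 * sigmas ** 2))

def linear_value(theta, phi):
    """Linear FA: V_t(s) = theta^T phi_s, so grad_theta V_t(s) = phi_s."""
    return theta @ phi

# Hypothetical 2-D state space with a 5x5 grid of centers.
grid = np.linspace(0.0, 1.0, 5)
centers = np.array([[x, y] for x in grid for y in grid])
sigmas = np.full(len(centers), 0.25)
theta = np.zeros(len(centers))

phi = rbf_features([0.4, 0.7], centers, sigmas)
print(linear_value(theta, phi))
```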
Can You Beat the "Curse of Dimensionality"?
- Can you keep the number of features from going up exponentially with the dimension?
- Function complexity, not dimensionality, is the problem.
- Kanerva coding:
  - Select a bunch of binary prototypes
  - Use Hamming distance as the distance measure
  - Dimensionality is no longer a problem, only complexity
- "Lazy learning" schemes:
  - Remember all the data
  - To get a new value, find the nearest neighbors and interpolate
  - e.g., locally weighted regression

Control with FA
Learning state-action values.
Training examples of the form: (description of (s_t, a_t), v_t)
The general gradient-descent rule:
\theta_{t+1} = \theta_t + \alpha [v_t - Q_t(s_t, a_t)] \nabla_\theta Q_t(s_t, a_t)
Gradient-descent Sarsa(λ) (backward view):
\theta_{t+1} = \theta_t + \alpha \delta_t e_t
where \delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)
and e_t = \gamma\lambda e_{t-1} + \nabla_\theta Q_t(s_t, a_t)

GPI Linear Gradient-Descent Watkins' Q(λ)

GPI with Linear Gradient-Descent Sarsa(λ)

Mountain-Car Task

Mountain Car with Radial Basis Functions

Mountain-Car Results

Baird's Counterexample

Baird's Counterexample, Cont.

Should We Bootstrap?

Summary
- Generalization
- Adapting supervised-learning function approximation methods
- Gradient-descent methods
- Linear gradient-descent methods:
  - Radial basis functions
  - Tile coding
  - Kanerva coding
- Nonlinear gradient-descent methods? Backpropagation?
- Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction

Value Prediction with FA
As usual, Policy Evaluation (the prediction problem): for a given policy \pi, compute the state-value function V^\pi.
In earlier chapters, value functions were stored in lookup tables. Here, the value function estimate at time t, V_t, depends on a parameter vector \theta_t, and only the parameter vector is updated.
e.g., \theta_t could be the vector of connection weights of a neural network.

Adapt Supervised Learning Algorithms
Training info = desired (target) outputs.
Inputs → Supervised Learning System → Outputs
Training example = {input, target output}
Error = (target output - actual output)

Backups as Training Examples
e.g., the TD(0) backup:
V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]
As a training example:
  input: description of s_t
  target output: r_{t+1} + \gamma V(s_{t+1})

Gradient Descent, Cont.
For the MSE given above and using the chain rule:
\theta_{t+1} = \theta_t - \tfrac{1}{2} \alpha \nabla_\theta MSE(\theta_t)
            = \theta_t - \tfrac{1}{2} \alpha \nabla_\theta \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)]^2
            = \theta_t + \alpha \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)] \nabla_\theta V_t(s)

Gradient Descent, Cont.
Use just the sample gradient instead:
\theta_{t+1} = \theta_t - \tfrac{1}{2} \alpha \nabla_\theta [V^\pi(s_t) - V_t(s_t)]^2
            = \theta_t + \alpha [V^\pi(s_t) - V_t(s_t)] \nabla_\theta V_t(s_t)
Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if \alpha decreases appropriately with t:
E\{ [V^\pi(s_t) - V_t(s_t)] \nabla_\theta V_t(s_t) \} = \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)] \nabla_\theta V_t(s)
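A minimal sketch (not from the slides) of the per-sample update just shown, assuming linear FA so that \nabla_\theta V_t(s_t) = \phi_{s_t}, and assuming the target value for the visited state is available; the feature vector and step size below are made-up values.

```python
import numpy as np

def sample_gradient_update(theta, phi_s, target, alpha):
    """One stochastic gradient-descent step on the squared error, linear FA.

    theta  : parameter vector theta_t
    phi_s  : feature vector phi_{s_t} of the visited state
    target : the target value for s_t (here assumed to be V^pi(s_t))
    alpha  : step-size parameter
    """
    value = theta @ phi_s                               # V_t(s_t) = theta^T phi_s
    return theta + alpha * (target - value) * phi_s     # gradient of V_t(s_t) is phi_s

# Hypothetical usage for a single state sampled from the on-policy distribution P.
theta = np.zeros(4)
theta = sample_gradient_update(theta, np.array([1.0, 0.0, 0.5, 0.0]),
                               target=2.0, alpha=0.1)
print(theta)
```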
But We Don't Have These Targets
Suppose we just have targets v_t instead:
\theta_{t+1} = \theta_t + \alpha [v_t - V_t(s_t)] \nabla_\theta V_t(s_t)
If each v_t is an unbiased estimate of V^\pi(s_t), i.e., E\{v_t\} = V^\pi(s_t), then gradient descent converges to a local minimum (provided \alpha decreases appropriately).
e.g., the Monte Carlo target v_t = R_t:
\theta_{t+1} = \theta_t + \alpha [R_t - V_t(s_t)] \nabla_\theta V_t(s_t)

What about TD(λ) Targets?
\theta_{t+1} = \theta_t + \alpha [R_t^\lambda - V_t(s_t)] \nabla_\theta V_t(s_t)
Not for \lambda < 1: the \lambda-return then depends on V_t itself, so it is not an unbiased estimate of V^\pi(s_t). But we do it anyway, using the backwards view:
\theta_{t+1} = \theta_t + \alpha \delta_t e_t
where \delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), as usual, and
e_t = \gamma\lambda e_{t-1} + \nabla_\theta V_t(s_t)
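To make the backward view concrete, here is a minimal sketch (not from the slides) of one time step of gradient-descent TD(λ) with an accumulating trace and linear FA, following \theta_{t+1} = \theta_t + \alpha \delta_t e_t; the feature vectors, reward, and parameter values are illustrative assumptions.

```python
import numpy as np

def td_lambda_step(theta, e, phi_s, phi_next, reward, gamma, lam, alpha,
                   terminal=False):
    """One backward-view TD(lambda) update with linear function approximation.

    e is the eligibility trace e_t = gamma*lambda*e_{t-1} + grad V_t(s_t),
    and for linear FA grad V_t(s_t) = phi_s.
    """
    v_s = theta @ phi_s
    v_next = 0.0 if terminal else theta @ phi_next
    delta = reward + gamma * v_next - v_s      # delta_t
    e = gamma * lam * e + phi_s                # accumulating trace
    theta = theta + alpha * delta * e          # theta_{t+1} = theta_t + alpha*delta_t*e_t
    return theta, e

# Hypothetical two-feature example for a single transition.
theta = np.zeros(2)
e = np.zeros(2)
theta, e = td_lambda_step(theta, e,
                          phi_s=np.array([1.0, 0.0]),
                          phi_next=np.array([0.0, 1.0]),
                          reward=1.0, gamma=0.9, lam=0.8, alpha=0.1)
print(theta, e)
```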