Chapter 8: Generalization and Function Approximation

Objectives of this chapter:
- Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
- Overview of function approximation (FA) methods and how they can be adapted to RL.

Any FA Method?
In principle, yes:
- artificial neural networks
- decision trees
- multivariate regression methods
- etc.
But RL has some special requirements:
- usually want to learn while interacting
- ability to handle nonstationarity
- other?

Gradient Descent Methods
Write the parameter vector as \theta_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^T  (T denotes transpose).
Assume V_t is a (sufficiently smooth) differentiable function of \theta_t, for all s \in S.
Assume, for now, training examples of this form: (description of s_t, V^\pi(s_t)).

Performance Measures
Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P:
MSE(\theta_t) = \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)]^2
Why P? Why minimize MSE?
Let us assume that P is always the distribution of states at which backups are done.
The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.

Gradient Descent
Let f be any function of the parameter space. Its gradient at any point \theta_t in this space is:
\nabla_\theta f(\theta_t) = \left( \frac{\partial f(\theta_t)}{\partial \theta(1)}, \frac{\partial f(\theta_t)}{\partial \theta(2)}, \ldots, \frac{\partial f(\theta_t)}{\partial \theta(n)} \right)^T
Iteratively move down the gradient:
\theta_{t+1} = \theta_t - \alpha \nabla_\theta f(\theta_t)

On-Line Gradient-Descent TD(λ)

Linear Methods
Represent states as feature vectors: for each s \in S,
\phi_s = (\phi_s(1), \phi_s(2), \ldots, \phi_s(n))^T
V_t(s) = \theta_t^T \phi_s = \sum_{i=1}^{n} \theta_t(i)\, \phi_s(i)
\nabla_\theta V_t(s) = ?

Nice Properties of Linear FA Methods
- The gradient is very simple: \nabla_\theta V_t(s) = \phi_s
- For MSE, the error surface is simple: a quadratic surface with a single minimum.
- Linear gradient-descent TD(λ) converges, provided the step size decreases appropriately and sampling is on-line (states sampled from the on-policy distribution), to a parameter vector \theta_\infty with the property (Tsitsiklis & Van Roy, 1997):
  MSE(\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\, MSE(\theta^*)
  where \theta^* is the best parameter vector.

Coarse Coding

Shaping Generalization in Coarse Coding

Learning and Coarse Coding

Tile Coding
- Binary feature for each tile
- Number of features present at any one time is constant
- Binary features mean the weighted sum is easy to compute
- Easy to compute the indices of the features present

Tile Coding, Cont.
- Irregular tilings
- Hashing
- CMAC: "Cerebellar Model Arithmetic Computer" (Albus, 1971)

Radial Basis Functions (RBFs)
e.g., Gaussians:
\phi_s(i) = \exp\left( -\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2} \right)
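Not part of the original slides: a minimal Python sketch of the Gaussian RBF features above, plugged into the linear form V_t(s) = \theta_t^T \phi_s from the Linear Methods slide. The 2-D state space, the 5x5 grid of centers, and the common width are illustrative assumptions.

```python
import numpy as np

def rbf_features(s, centers, sigmas):
    """Gaussian RBF features: phi_s(i) = exp(-||s - c_i||^2 / (2 * sigma_i^2))."""
    s = np.asarray(s, dtype=float)
    dists_sq = np.sum((centers - s) ** 2, axis=1)   # ||s - c_i||^2 for every center
    return np.exp(-dists_sq / (2.0 * sigmas ** 2))

def linear_value(theta, phi):
    """Linear FA: V_t(s) = theta^T phi_s, so grad_theta V_t(s) = phi_s."""
    return theta @ phi

# Hypothetical 2-D state space with a 5x5 grid of centers.
grid = np.linspace(0.0, 1.0, 5)
centers = np.array([[x, y] for x in grid for y in grid])
sigmas = np.full(len(centers), 0.25)
theta = np.zeros(len(centers))

phi = rbf_features([0.4, 0.7], centers, sigmas)
print(linear_value(theta, phi))
```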
Can You Beat the "Curse of Dimensionality"?
- Can you keep the number of features from going up exponentially with the dimension?
- Function complexity, not dimensionality, is the problem.
- Kanerva coding:
  - Select a bunch of binary prototypes
  - Use Hamming distance as the distance measure
  - Dimensionality is no longer a problem, only complexity
- "Lazy learning" schemes:
  - Remember all the data
  - To get a new value, find the nearest neighbors and interpolate
  - e.g., locally weighted regression

Control with FA
Learning state-action values.
Training examples of the form: (description of (s_t, a_t), v_t)
The general gradient-descent rule:
\theta_{t+1} = \theta_t + \alpha [v_t - Q_t(s_t, a_t)] \nabla_\theta Q_t(s_t, a_t)
Gradient-descent Sarsa(λ) (backward view):
\theta_{t+1} = \theta_t + \alpha \delta_t e_t
where \delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)
and e_t = \gamma\lambda e_{t-1} + \nabla_\theta Q_t(s_t, a_t)

GPI Linear Gradient-Descent Watkins' Q(λ)

GPI with Linear Gradient-Descent Sarsa(λ)

Mountain-Car Task

Mountain Car with Radial Basis Functions

Mountain-Car Results

Baird's Counterexample

Baird's Counterexample, Cont.

Should We Bootstrap?

Summary
- Generalization
- Adapting supervised-learning function approximation methods
- Gradient-descent methods
- Linear gradient-descent methods:
  - Radial basis functions
  - Tile coding
  - Kanerva coding
- Nonlinear gradient-descent methods? Backpropagation?
- Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction

Value Prediction with FA
As usual, Policy Evaluation (the prediction problem): for a given policy \pi, compute the state-value function V^\pi.
In earlier chapters, value functions were stored in lookup tables. Here, the value function estimate at time t, V_t, depends on a parameter vector \theta_t, and only the parameter vector is updated.
e.g., \theta_t could be the vector of connection weights of a neural network.

Adapt Supervised Learning Algorithms
Training info = desired (target) outputs.
Inputs → Supervised Learning System → Outputs
Training example = {input, target output}
Error = (target output - actual output)

Backups as Training Examples
e.g., the TD(0) backup:
V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]
As a training example:
  input: description of s_t
  target output: r_{t+1} + \gamma V(s_{t+1})

Gradient Descent, Cont.
For the MSE given above and using the chain rule:
\theta_{t+1} = \theta_t - \tfrac{1}{2} \alpha \nabla_\theta MSE(\theta_t)
            = \theta_t - \tfrac{1}{2} \alpha \nabla_\theta \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)]^2
            = \theta_t + \alpha \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)] \nabla_\theta V_t(s)

Gradient Descent, Cont.
Use just the sample gradient instead:
\theta_{t+1} = \theta_t - \tfrac{1}{2} \alpha \nabla_\theta [V^\pi(s_t) - V_t(s_t)]^2
            = \theta_t + \alpha [V^\pi(s_t) - V_t(s_t)] \nabla_\theta V_t(s_t)
Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if \alpha decreases appropriately with t:
E\{ [V^\pi(s_t) - V_t(s_t)] \nabla_\theta V_t(s_t) \} = \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)] \nabla_\theta V_t(s)
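A minimal sketch (not from the slides) of the per-sample update just shown, assuming linear FA so that \nabla_\theta V_t(s_t) = \phi_{s_t}, and assuming the target value for the visited state is available; the feature vector and step size below are made-up values.

```python
import numpy as np

def sample_gradient_update(theta, phi_s, target, alpha):
    """One stochastic gradient-descent step on the squared error, linear FA.

    theta  : parameter vector theta_t
    phi_s  : feature vector phi_{s_t} of the visited state
    target : the target value for s_t (here assumed to be V^pi(s_t))
    alpha  : step-size parameter
    """
    value = theta @ phi_s                               # V_t(s_t) = theta^T phi_s
    return theta + alpha * (target - value) * phi_s     # gradient of V_t(s_t) is phi_s

# Hypothetical usage for a single state sampled from the on-policy distribution P.
theta = np.zeros(4)
theta = sample_gradient_update(theta, np.array([1.0, 0.0, 0.5, 0.0]),
                               target=2.0, alpha=0.1)
print(theta)
```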
But We Don't Have These Targets
Suppose we just have targets v_t instead:
\theta_{t+1} = \theta_t + \alpha [v_t - V_t(s_t)] \nabla_\theta V_t(s_t)
If each v_t is an unbiased estimate of V^\pi(s_t), i.e., E\{v_t\} = V^\pi(s_t), then gradient descent converges to a local minimum (provided \alpha decreases appropriately).
e.g., the Monte Carlo target v_t = R_t:
\theta_{t+1} = \theta_t + \alpha [R_t - V_t(s_t)] \nabla_\theta V_t(s_t)

What about TD(λ) Targets?
\theta_{t+1} = \theta_t + \alpha [R_t^\lambda - V_t(s_t)] \nabla_\theta V_t(s_t)
Not for \lambda < 1: the \lambda-return then depends on V_t itself, so it is not an unbiased estimate of V^\pi(s_t). But we do it anyway, using the backwards view:
\theta_{t+1} = \theta_t + \alpha \delta_t e_t
where \delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), as usual, and
e_t = \gamma\lambda e_{t-1} + \nabla_\theta V_t(s_t)
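To make the backward view concrete, here is a minimal sketch (not from the slides) of one time step of gradient-descent TD(λ) with an accumulating trace and linear FA, following \theta_{t+1} = \theta_t + \alpha \delta_t e_t; the feature vectors, reward, and parameter values are illustrative assumptions.

```python
import numpy as np

def td_lambda_step(theta, e, phi_s, phi_next, reward, gamma, lam, alpha,
                   terminal=False):
    """One backward-view TD(lambda) update with linear function approximation.

    e is the eligibility trace e_t = gamma*lambda*e_{t-1} + grad V_t(s_t),
    and for linear FA grad V_t(s_t) = phi_s.
    """
    v_s = theta @ phi_s
    v_next = 0.0 if terminal else theta @ phi_next
    delta = reward + gamma * v_next - v_s      # delta_t
    e = gamma * lam * e + phi_s                # accumulating trace
    theta = theta + alpha * delta * e          # theta_{t+1} = theta_t + alpha*delta_t*e_t
    return theta, e

# Hypothetical two-feature example for a single transition.
theta = np.zeros(2)
e = np.zeros(2)
theta, e = td_lambda_step(theta, e,
                          phi_s=np.array([1.0, 0.0]),
                          phi_next=np.array([0.0, 1.0]),
                          reward=1.0, gamma=0.9, lam=0.8, alpha=0.1)
print(theta, e)
```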