
An Overview of
Reinforcement Learning
Angela Yu
Cogs 118A
February 26, 2009
Outline
• A formal framework for learning from reinforcement
– Markov decision problems
– Interactions between an agent and its environment
• Dynamic programming as a formal solution
– Policy iteration
– Value iteration
• Temporal difference methods as a practical solution
– Actor-critic learning
– Q-learning
• Extensions
– Exploration vs. exploitation
– Representation and neural networks
RL as a Markov Decision Process
[Figure: agent-environment diagram. The Markov blanket for $r_t$ and $x_{t+1}$: actions $a_{t-1}, a_t$; states $x_{t-1}, x_t, x_{t+1}$; rewards $r_{t-1}, r_t, r_{t+1}$; transition probabilities $P_{xy}(a)$; reward function $r_t = r(x_t, a_t)$.]
RL as a Markov Decision Process
Goal: find the optimal policy $\pi: x \to a$ by maximizing the return:

$V^\pi(x) = \Big\langle \sum_{t=0}^{\infty} \gamma^t r(t) \Big\rangle_{x,r}$

[Figure: agent-environment diagram as above, with transition probabilities $P_{xy}(a)$ and reward function $r_t = r(x_t, a_t)$.]
RL as a Markov Decision Process
Simple case: assume transition and reward probabilities are known
[Figure: agent-environment diagram with known transition probabilities $P_{xy}(a)$ and known reward function $r_t = r(x_t, a_t)$.]
Dynamic Programming I: Policy Iteration
Policy Evaluation (system of linear equations)
$V^\pi(x) = \Big\langle \sum_{t=0}^{\infty} \gamma^t r(t) \Big\rangle_{x,r} = \hat{r}(x, \pi(x)) + \gamma \Big\langle \sum_{t=0}^{\infty} \gamma^t r(t+1) \Big\rangle_{x,r}$
$\phantom{V^\pi(x)} = \hat{r}(x, \pi(x)) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, \pi(x)) \, V^\pi(x_{t+1})$
Policy Improvement

Q ( xt , at )  rˆ( xt , at )   P( xt 1 | xt , at )V  ( xt 1 )
xt 1
Based on the values of these state-action pairs, incrementally improve policy:
$\pi'(x_t) = \arg\max_{a_t} Q^\pi(x_t, a_t)$
Guaranteed to converge on (one set of) optimal values: $V^*(x)$, $Q^*(x, a)$
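A minimal policy-iteration sketch in Python, assuming the hypothetical P, R, gamma arrays defined earlier. Policy evaluation solves the linear system for V^pi exactly; policy improvement takes the greedy argmax over Q^pi; the loop stops when the policy no longer changes.

import numpy as np

def policy_iteration(P, R, gamma):
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi  (system of linear equations)
        P_pi = P[np.arange(n_states), pi]           # P(x' | x, pi(x))
        r_pi = R[np.arange(n_states), pi]           # r(x, pi(x))
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: Q^pi(x, a) = r(x, a) + gamma * sum_y P(y | x, a) V(y)
        Q = R + gamma * P @ V
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V, Q                         # converged: optimal policy, V*, Q*
        pi = pi_new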
Dynamic Programming II: Value Iteration
Q-value Update
$Q'(x_t, a_t) = \hat{r}(x_t, a_t) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, a_t) \max_{a_{t+1}} Q(x_{t+1}, a_{t+1})$
Guaranteed to converge on (one set of) optimal values: $Q^*(x, a)$
Policy
$\pi'(x_t) = \arg\max_{a_t} Q'(x_t, a_t)$
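The corresponding value-iteration sketch, again assuming the hypothetical P, R, gamma arrays above: each sweep applies the Q-value update to every state-action pair until the values stop changing, and the policy is read off at the end.

import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Q'(x, a) = r(x, a) + gamma * sum_y P(y | x, a) * max_a' Q(y, a')
        Q_new = R + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    pi = Q.argmax(axis=1)                           # greedy policy from Q*
    return pi, Q

# usage with the toy arrays above: pi_star, Q_star = value_iteration(P, R, gamma)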
Temporal Difference Learning
Difficult (realistic) case:
transition and reward probabilities are unknown
[Figure: agent-environment diagram with unknown transition probabilities $P_{xy}(a)$? and unknown reward function $r_t = r(x_t, a_t)$?]
Solution: temporal difference (TD) learning

Actor-Critic Learning
(related to policy iteration)
Critic improves value estimation incrementally (stochastic gradient ascent):

$V^\pi(x_t) = \hat{r}(x_t, \pi(x_t)) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, \pi(x_t)) \, V^\pi(x_{t+1})$

• MC (Monte Carlo) samples for the expectation ⟨·⟩
• Boot-strapping: current estimate $V(x_{t+1})$
• Delta-rule update, with learning rate $\epsilon$ and temporal difference $\delta_t$:

$V'(x_t) = V(x_t) + \epsilon \,[\, r_t + \gamma V(x_{t+1}) - V(x_t) \,] = V(x_t) + \epsilon\, \delta_t$

Actor improves policy execution incrementally:

• Stochastic (softmax) policy: $\pi_a(x) = \dfrac{\exp(\beta M_{xa})}{\sum_b \exp(\beta M_{xb})}$
• Delta-rule update (Monte Carlo samples, learning rate $\epsilon$):

$M'_{xa} = M_{xa} + \epsilon \,[\, Q^\pi(x_t, a_t) - V^\pi(x_t) \,]$
$\phantom{M'_{xa}} = M_{xa} + \epsilon \Big( \hat{r}(x_t, a_t) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, a_t) \, V^\pi(x_{t+1}) - V^\pi(x_t) \Big)$
$\phantom{M'_{xa}} \approx M_{xa} + \epsilon \,\big( r_t + \gamma V^\pi(x_{t+1}) - V^\pi(x_t) \big) = M_{xa} + \epsilon\, \delta_t$

• Mutual dependence: $V^\pi(x) = \sum_a \pi_a(x) \, Q^\pi(x, a)$
• Convergence?
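A compact actor-critic sketch of the update loop above, assuming a hypothetical environment interface env_step(x, a) that returns (r, x_next); the constants and the tabular representation are illustrative choices, not the slides' specification.

import numpy as np

def actor_critic(env_step, n_states, n_actions, x0,
                 gamma=0.9, eps=0.1, beta=1.0, n_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)                    # critic: value estimates V(x)
    M = np.zeros((n_states, n_actions))       # actor: action preferences M[x, a]
    x = x0
    for _ in range(n_steps):
        # Softmax policy: pi_a(x) = exp(beta*M[x,a]) / sum_b exp(beta*M[x,b])
        z = beta * M[x] - np.max(beta * M[x])
        probs = np.exp(z) / np.exp(z).sum()
        a = rng.choice(n_actions, p=probs)
        r, x_next = env_step(x, a)            # one Monte Carlo sample of (r_t, x_{t+1})
        delta = r + gamma * V[x_next] - V[x]  # temporal difference delta_t
        V[x] += eps * delta                   # critic: delta-rule value update
        M[x, a] += eps * delta                # actor: preference update
        x = x_next
    return V, M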
Actor-Critic Learning
Exploration vs. Exploitation
$\pi_a(x) = \dfrac{\exp(\beta M_{xa})}{\sum_b \exp(\beta M_{xb})}$
Best annealing schedule?
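One common answer (not prescribed by the slides) is to anneal the inverse temperature beta: a small beta makes the softmax nearly uniform (explore), a large beta makes it nearly greedy (exploit). The linear schedule below is an arbitrary illustrative choice.

import numpy as np

def softmax_policy(M_x, beta):
    """pi_a(x) = exp(beta * M[x, a]) / sum_b exp(beta * M[x, b])."""
    z = beta * M_x - np.max(beta * M_x)       # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def beta_schedule(t, beta0=0.1, rate=1e-3):
    # Hypothetical annealing schedule: near-uniform early, near-greedy late.
    return beta0 + rate * t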
Q-Learning
(related to value iteration)
State-action value estimation

$Q'(x_t, a_t) = \hat{r}(x_t, a_t) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, a_t) \max_{a_{t+1}} Q(x_{t+1}, a_{t+1})$
$\phantom{Q'(x_t, a_t)} \approx r_t + \gamma \max_{a_{t+1}} Q(x_{t+1}, a_{t+1})$

• MC (Monte Carlo) samples for ⟨·⟩
• Boot-strapping: current estimate $Q(x_{t+1}, a_{t+1})$
• Proven convergence
• No explicit parameter to control explore/exploit

Policy

$\pi(x_t) = \arg\max_{a_t} Q(x_t, a_t)$
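A minimal Q-learning sketch under the same assumed env_step(x, a) -> (r, x_next) interface; since Q-learning itself has no explicit explore/exploit parameter, an epsilon-greedy rule is added here as one illustrative choice.

import numpy as np

def q_learning(env_step, n_states, n_actions, x0,
               gamma=0.9, lr=0.1, explore=0.1, n_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    x = x0
    for _ in range(n_steps):
        # epsilon-greedy action selection (illustrative; not part of the update rule itself)
        if rng.random() < explore:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[x].argmax())
        r, x_next = env_step(x, a)
        # Q(x,a) <- Q(x,a) + lr * [ r + gamma * max_a' Q(x',a') - Q(x,a) ]
        Q[x, a] += lr * (r + gamma * Q[x_next].max() - Q[x, a])
        x = x_next
    pi = Q.argmax(axis=1)                     # greedy policy read off the learned Q
    return Q, pi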
Pros and Cons of TD Learning
TD learning is practically appealing:
– no representation of sequences of states & actions
– relatively simple computations
– TD in the brain: dopamine signals the temporal difference $\delta_t$
TD suffers from several disadvantages
– local optima
– can be (exponentially) slow to converge
– actor-critic not guaranteed to converge
– no principled way to trade off exploration and exploitation
– cannot easily deal with non-stationary environments
TD in the Brain
[Figure slides: dopamine neuron responses interpreted as the temporal-difference signal $\delta_t$ (Schultz, Dayan, & Montague, 1997).]
Extensions to basic TD Learning
• A continuum of improvements possible
– more complete partial models of the effects of actions
– estimate expected reward $\langle r(x_t) \rangle$
– representing & processing longer sequences of actions & states
$V(x(t)) \leftarrow V(x(t)) + \epsilon\,[\, r(0) + \gamma r(1) + \gamma^2 V(x(t+2)) - V(x(t)) \,]$
– faster learning & more efficient use of agent’s experiences
– parameterize value function (versus look-up table); a TD sketch with linear features follows after this list:

$V(x(t)) \approx V(x) = \mathbf{w} \cdot \boldsymbol{\phi}(x)$
$\mathbf{w} \leftarrow \mathbf{w} + \epsilon\, \delta(t)\, \boldsymbol{\phi}(x(t))$
• Timing and partial observability in reward prediction

– state not (always) directly observable

– delayed payoffs
– reward-prediction only (no instrumental contingencies)
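A TD(0) sketch for the parameterized value function V(x) ≈ w · φ(x); the feature map phi, the policy function, and the environment interface env_step are hypothetical placeholders.

import numpy as np

def td0_linear(env_step, policy, phi, n_features, x0,
               gamma=0.9, eps=0.05, n_steps=10000):
    """TD(0) with a linear value function V(x) = w . phi(x)."""
    w = np.zeros(n_features)
    x = x0
    for _ in range(n_steps):
        a = policy(x)
        r, x_next = env_step(x, a)
        # delta(t) = r + gamma * w.phi(x') - w.phi(x);  w <- w + eps * delta(t) * phi(x)
        delta = r + gamma * w @ phi(x_next) - w @ phi(x)
        w += eps * delta * phi(x)
        x = x_next
    return w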
References
Sutton, RS & Barto, AG (1998). Reinforcement Learning: An Introduction.
Cambridge, MA: MIT Press.
Bellman, RE (1957). Dynamic Programming. Princeton, NJ: Princeton
University Press.
Daw, ND, Courville, AC, & Touretzky, DS (2003). Timing and partial
observability in the dopamine system. In Advances in Neural Information
Processing Systems 15. Cambridge, MA: MIT Press.
Dayan, P & Watkins, CJCH (2001). Reinforcement learning.
Encyclopedia of Cognitive Science. London, England: MacMillan Press.
Dayan, P & Abbott, LF (2001). Theoretical Neuroscience. Cambridge, MA:
MIT Press.
Gittins, JC (1979). Bandit processes and dynamic allocation indices. Journal
of the Royal Statistical Society, Series B, 41: 148-177.
Schultz, W, Dayan, P, & Montague, PR (1997). A neural substrate of prediction
and reward. Science, 275, 1593-1599.