An Overview of Reinforcement Learning
Angela Yu
Cogs 118A
February 26, 2009

Outline
• A formal framework for learning from reinforcement
  – Markov decision problems
  – Interactions between an agent and its environment
• Dynamic programming as a formal solution
  – Policy iteration
  – Value iteration
• Temporal difference methods as a practical solution
  – Actor-critic learning
  – Q-learning
• Extensions
  – Exploration vs. exploitation
  – Representation and neural networks

RL as a Markov Decision Process
[Figure: graphical model of the agent–environment interaction, showing actions $a_{t-1}, a_t$, states $x_{t-1}, x_t, x_{t+1}$, and rewards $r_{t-1}, r_t$; the Markov blanket for $r_t$ and $x_{t+1}$ consists of $x_t$ and $a_t$.]
Transition probabilities: $P_{xy}(a) = P(x_{t+1} = y \mid x_t = x, a_t = a)$
Reward function: $r_t = r(x_t, a_t)$

RL as a Markov Decision Process
Goal: find the optimal policy $\pi: x \rightarrow a$ by maximizing the expected return
$V^{\pi}(x) = \big\langle \sum_{t \ge 0} \gamma^t r(t) \big\rangle_{x,r}$

RL as a Markov Decision Process
Simple case: assume the transition and reward probabilities are known.
[Figure: the same graphical model, with alternative actions and outcomes ($a_t'$, $x_{t+1}'$, $r_{t+1}'$) indicated.]

Dynamic Programming I: Policy Iteration
Policy evaluation (a system of linear equations):
$V^{\pi}(x_t) = \big\langle \sum_{\tau \ge 0} \gamma^{\tau} r(t+\tau) \big\rangle_{x,r} = \hat{r}(x_t, \pi(x_t)) + \gamma \big\langle \sum_{\tau \ge 0} \gamma^{\tau} r(t+1+\tau) \big\rangle_{x,r}$
$= \hat{r}(x_t, \pi(x_t)) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, \pi(x_t))\, V^{\pi}(x_{t+1})$
Policy improvement:
$Q^{\pi}(x_t, a_t) = \hat{r}(x_t, a_t) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, a_t)\, V^{\pi}(x_{t+1})$
Based on the values of these state–action pairs, incrementally improve the policy:
$\pi'(x_t) = \arg\max_{a_t} Q^{\pi}(x_t, a_t)$
Guaranteed to converge on (one set of) optimal values: $V^{*}(x)$, $Q^{*}(x, a)$.

Dynamic Programming II: Value Iteration
Q-value update:
$Q'(x_t, a_t) = \hat{r}(x_t, a_t) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, a_t) \max_{a_{t+1}} Q(x_{t+1}, a_{t+1})$
Guaranteed to converge on (one set of) optimal values: $Q^{*}(x, a)$.
Policy: $\pi'(x_t) = \arg\max_{a_t} Q'(x_t, a_t)$
(A minimal code sketch of value iteration appears after the references at the end of these notes.)

Temporal Difference Learning
Difficult (realistic) case: the transition and reward probabilities are unknown: $P_{xy}(a)$?  $r_t = r(x_t, a_t)$?
Solution: temporal difference (TD) learning.

Actor-Critic Learning (related to policy iteration)
Critic: improves value estimation incrementally (stochastic gradient ascent)
Target: $V^{\pi}(x_t) = \hat{r}(x_t, \pi(x_t)) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, \pi(x_t))\, V^{\pi}(x_{t+1})$
• Monte Carlo samples for the expectation $\langle \cdot \rangle$
• Bootstrapping: use the current estimate $V(x_{t+1})$
Update: $V'(x_t) = V(x_t) + \epsilon\,\big(r_t + \gamma V(x_{t+1}) - V(x_t)\big) = V(x_t) + \epsilon\,\delta_t$
($\epsilon$: learning rate; $\delta_t = r_t + \gamma V(x_{t+1}) - V(x_t)$: temporal difference)
• Convergence?
Actor: improves policy execution incrementally
Stochastic (softmax) policy: $\pi_a(x) = \exp(\beta M_{xa}) \,/\, \sum_b \exp(\beta M_{xb})$
• Mutual dependence between actor and critic: $V^{\pi}(x) = \sum_a \pi_a(x)\, Q^{\pi}(x, a)$
Delta-rule update of the action preferences:
$M_{xa} \rightarrow M'_{xa} = M_{xa} + \epsilon\,\big(Q^{\pi}(x_t, a_t) - V^{\pi}(x_t)\big)$
$= M_{xa} + \epsilon\,\big(\hat{r}(x_t, a_t) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, a_t)\, V^{\pi}(x_{t+1}) - V^{\pi}(x_t)\big)$
$\approx M_{xa} + \epsilon\,\big(r_t + \gamma V(x_{t+1}) - V(x_t)\big) = M_{xa} + \epsilon\,\delta_t$
(Monte Carlo samples replace the expectation; $\epsilon$: learning rate.)

Actor-Critic Learning: Exploration vs. Exploitation
$\pi_a(x) = \exp(\beta M_{xa}) \,/\, \sum_b \exp(\beta M_{xb})$
The inverse temperature $\beta$ trades off exploration against exploitation. Best annealing schedule?
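A minimal actor-critic sketch in code
The critic and actor updates above can be written in a few lines. The sketch below assumes a hypothetical tabular environment object env with env.reset() -> x0 and env.step(x, a) -> (x_next, r, done); these method names, and the constants eps (learning rate), gamma (discount), and beta (inverse temperature), are illustrative assumptions, not part of the slides.

import numpy as np

def actor_critic(env, n_states, n_actions, n_episodes=500,
                 eps=0.1, gamma=0.95, beta=2.0):
    """Minimal tabular actor-critic sketch: TD(0) critic, softmax actor.

    Assumes env.reset() -> x0 and env.step(x, a) -> (x_next, r, done);
    these names are placeholders, not from the slides.
    """
    V = np.zeros(n_states)               # critic: state values V(x)
    M = np.zeros((n_states, n_actions))  # actor: action preferences M_xa

    for _ in range(n_episodes):
        x = env.reset()
        done = False
        while not done:
            # softmax (stochastic) policy: pi_a(x) proportional to exp(beta * M_xa)
            p = np.exp(beta * M[x] - np.max(beta * M[x]))
            p /= p.sum()
            a = np.random.choice(n_actions, p=p)

            x_next, r, done = env.step(x, a)

            # temporal-difference error: delta_t = r_t + gamma*V(x_{t+1}) - V(x_t)
            delta = r + (0.0 if done else gamma * V[x_next]) - V[x]

            V[x] += eps * delta      # critic: delta-rule value update
            M[x, a] += eps * delta   # actor: delta-rule preference update

            x = x_next
    return V, M

Annealing beta upward across episodes gradually shifts the softmax policy from exploration toward exploitation; which schedule works best is exactly the open question raised on the slide above.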
Q-Learning (related to value iteration)
State–action value estimation:
$Q'(x_t, a_t) = \hat{r}(x_t, a_t) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, a_t) \max_{a_{t+1}} Q(x_{t+1}, a_{t+1})$
$\approx r_t + \gamma \max_{a_{t+1}} Q(x_{t+1}, a_{t+1})$
Policy: $\pi(x_t) = \arg\max_{a_t} Q(x_t, a_t)$
• Monte Carlo samples for the expectation $\langle \cdot \rangle$
• Bootstrapping on the current estimates $Q(x_t, a_t)$
• Proven convergence
• No explicit parameter to control the explore/exploit trade-off
(A minimal tabular Q-learning sketch also appears after the references at the end of these notes.)

Pros and Cons of TD Learning
TD learning is practically appealing:
– no need to represent sequences of states & actions
– relatively simple computations
– TD in the brain: dopamine signals the temporal difference $\delta_t$
TD suffers from several disadvantages:
– local optima
– can be (exponentially) slow to converge
– actor-critic is not guaranteed to converge
– no principled way to trade off exploration and exploitation
– cannot easily deal with non-stationary environments

TD in the Brain
[Figures: dopamine-neuron recordings interpreted as a temporal-difference signal; see Schultz, Dayan & Montague (1997).]

Extensions to Basic TD Learning
• A continuum of improvements is possible
  – (more complete) partial models of the effects of actions
  – estimate the expected reward $\langle r(x_t) \rangle$
  – represent & process longer sequences of actions & states, e.g. a multi-step backup
    $V(x(t)) \rightarrow V(x(t)) + \epsilon\,\big(r(t) + \gamma\, r(t+1) + \gamma^2 V(x(t+2)) - V(x(t))\big)$
  – faster learning & more efficient use of the agent's experiences
  – parameterize the value function (versus a look-up table), e.g. $V(x) \approx \sum_i w_i\, \phi_i(x)$, with weight updates $w_i \rightarrow w_i + \epsilon\,\delta(t)\,\phi_i(x(t))$
• Timing and partial observability in reward prediction
  – state not (always) directly observable
  – delayed payoffs
  – reward prediction only (no instrumental contingencies)

References
Sutton, RS & Barto, AG (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Bellman, RE (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Daw, ND, Courville, AC, & Touretzky, DS (2003). Timing and partial observability in the dopamine system. In Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.
Dayan, P & Watkins, CJCH (2001). Reinforcement learning. In Encyclopedia of Cognitive Science. London, England: MacMillan Press.
Dayan, P & Abbott, LF (2001). Theoretical Neuroscience. Cambridge, MA: MIT Press.
Gittins, JC (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 41: 148-177.
Schultz, W, Dayan, P, & Montague, PR (1997). A neural substrate of prediction and reward. Science, 275: 1593-1599.
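Supplementary sketch: value iteration (referenced from the Dynamic Programming II slide)
A minimal sketch for the known-model case, assuming the transition probabilities and expected rewards are supplied as NumPy arrays P[a, x, y] = P(x_{t+1}=y | x_t=x, a_t=a) and R[x, a]; these array conventions and the tolerance-based stopping rule are illustrative assumptions, not part of the slides.

import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Minimal value-iteration sketch for a known MDP.

    P[a, x, y] = P(x_{t+1}=y | x_t=x, a_t=a), R[x, a] = expected reward;
    these array conventions are assumptions, not from the slides.
    """
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Q'(x,a) = r(x,a) + gamma * sum_y P(y|x,a) * max_a' Q(y,a')
        Q_new = R + gamma * np.einsum('axy,y->xa', P, Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    policy = Q.argmax(axis=1)  # pi(x) = argmax_a Q(x,a)
    return Q, policy

The max over the next action inside the backup is what distinguishes value iteration from policy evaluation, which instead averages over the current policy.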
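Supplementary sketch: tabular Q-learning (referenced from the Q-learning slide)
A minimal sketch for the unknown-model case, using the same hypothetical env interface as the actor-critic sketch. The epsilon-greedy action selection (the explore parameter) is an illustrative add-on: as noted on the Q-learning slide, the update rule itself carries no explicit explore/exploit parameter.

import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500,
               eps_lr=0.1, gamma=0.95, explore=0.1):
    """Minimal tabular Q-learning sketch.

    Same assumed interface as the actor-critic sketch:
    env.reset() -> x0, env.step(x, a) -> (x_next, r, done).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        x = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (illustrative add-on, not in the slides)
            if np.random.rand() < explore:
                a = np.random.randint(n_actions)
            else:
                a = Q[x].argmax()

            x_next, r, done = env.step(x, a)

            # sampled backup: Q(x,a) <- Q(x,a) + eps*(r + gamma*max_a' Q(x',a') - Q(x,a))
            target = r + (0.0 if done else gamma * Q[x_next].max())
            Q[x, a] += eps_lr * (target - Q[x, a])

            x = x_next
    return Q

Because the backup uses the max over next actions regardless of which action the agent actually takes next, this is the sampled, model-free counterpart of the value-iteration backup above.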