Reinforcement Learning: A Tutorial
Rich Sutton
AT&T Labs -- Research
www.research.att.com/~sutton

Outline
• Core RL Theory
  – Value functions
  – Markov decision processes
  – Temporal-Difference Learning
  – Policy Iteration
• The Space of RL Methods
  – Function Approximation
  – Traces
  – Kinds of Backups
• Lessons for Artificial Intelligence
• Concluding Remarks

RL is Learning from Interaction
[Figure: an Agent and its Environment in a loop, exchanging action, state, and reward signals.]
• complete agent
• temporally situated
• continual learning & planning
• object is to affect the environment
• environment is stochastic & uncertain

Ball-Balancing Demo
• RL applied to a real physical system
• Illustrates online learning
• An easy problem... but learns as you watch!
• Courtesy of GTE Labs: Hamid Benbrahim, Judy Franklin, John Doleac, Oliver Selfridge
• Same learning algorithm as the early pole-balancer

Markov Decision Processes (MDPs)
In RL, the environment is modeled as an MDP, defined by:
• S – set of states of the environment
• A(s) – set of actions possible in state s ∈ S
• P(s,s',a) – probability of transition from s to s' given a
• R(s,s',a) – expected reward on the transition from s to s' given a
• γ – discount rate for delayed reward
Time is discrete, t = 0, 1, 2, . . ., producing the trajectory
$\ldots\; s_t,\ a_t,\ r_{t+1},\ s_{t+1},\ a_{t+1},\ r_{t+2},\ s_{t+2},\ a_{t+2},\ r_{t+3},\ s_{t+3},\ a_{t+3},\ \ldots$

The Objective is to Maximize Long-term Total Discounted Reward
Find a policy $\pi : s \in S \to a \in A(s)$ (could be stochastic) that maximizes the value (expected future reward) of each state s:
$V^{\pi}(s) = E\{\, r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s,\ \pi \,\}$
and of each state–action pair s, a:
$Q^{\pi}(s,a) = E\{\, r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s,\ a_t = a,\ \pi \,\}$
These are called value functions – cf. evaluation functions.

Optimal Value Functions and Policies
There exist optimal value functions:
$V^*(s) = \max_{\pi} V^{\pi}(s)$   $Q^*(s,a) = \max_{\pi} Q^{\pi}(s,a)$
And corresponding optimal policies:
$\pi^*(s) = \arg\max_{a} Q^*(s,a)$
$\pi^*$ is the greedy policy with respect to $Q^*$.

1-Step Tabular Q-Learning (Watkins, 1989)
On each state transition $s_t \xrightarrow{a_t} s_{t+1}$ with reward $r_{t+1}$, update one table entry:
$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\,[\, r_{t+1} + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t) \,]$
The bracketed quantity is the TD error. Then
$\lim_{t\to\infty} Q(s,a) = Q^*(s,a)$ and $\lim_{t\to\infty} \pi_t = \pi^*$.
Optimal behavior is found without a model of the environment!
Assumes a finite MDP.

Example of Value Functions
[Figure: a small gridworld ("Grid Example") showing $V_k$ for the equiprobable random policy and the greedy policy w.r.t. $V_k$, for k = 0, 1, 2, 3, 10, and the limit; the greedy policy quickly becomes the optimal policy.]
Iterative policy evaluation: $V_{k+1}(s) = E\{\, r + \gamma V_k(s') \,\}$

Policy Iteration
Summary of RL Theory: Generalized Policy Iteration
[Diagram: the policy and the value function drive each other. Policy evaluation (value learning) makes V, Q consistent with the policy; policy improvement (greedification) makes the policy greedy with respect to V, Q. The interaction converges to $\pi^*$ and $V^*, Q^*$.]

Outline (section divider; next part: The Space of RL Methods)

Sarsa Backup (Rummery & Niranjan; Holland; Sutton)
On transition $s_t \xrightarrow{a_t} s_{t+1}$ with reward $r_{t+1}$ and next action $a_{t+1}$, update:
$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\,[\, r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \,]$
The bracketed quantity is the TD error.
Backup! The value of the pair $s_t, a_t$ is updated from the value of the next pair $s_{t+1}, a_{t+1}$.
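To make the two one-step tabular updates above concrete, here is a minimal Python sketch. It is an illustration, not code from the tutorial: the dictionary-based table, the step-size parameter alpha, and the argument names are assumptions added here.

```python
from collections import defaultdict

# Minimal sketch (not from the slides): the two one-step tabular backups above,
# with Q stored as a dictionary keyed by (state, action) and a fixed step size.
# At a terminal transition the bootstrap term would simply be dropped.

Q = defaultdict(float)  # Q(s, a), initialized to 0

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """1-step tabular Q-learning: back up the max over next actions (off-policy)."""
    td_error = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * td_error

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Sarsa: back up the value of the next pair actually taken (on-policy)."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
```

The only difference between the two is the bootstrap target: the max over next actions (Q-learning) versus the next action actually taken (Sarsa).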
Dimensions of RL Algorithms
1. Action Selection and Exploration
2. Function Approximation
3. Eligibility Traces
4. On-line Experience vs. Simulated Experience
5. Kind of Backups
. . .

1. Action Selection (Exploration)
OK, you've learned a value function $Q \approx Q^*$. How do you pick actions?
• Greedy action selection: always select the action that looks best: $\pi(s) = \arg\max_{a} Q(s,a)$
• ε-greedy action selection: be greedy most of the time; occasionally take a random action
Surprisingly, this is the state of the art. Exploitation vs. exploration!

2. Function Approximation
[Diagram: s, a are fed to a function approximator, which outputs Q(s,a) and is trained from targets or errors.]
The approximator could be:
• a table
• a backprop neural network
• a radial-basis-function network
• tile coding (CMAC)
• nearest neighbor, memory based
• a decision tree
(several of these, e.g. the neural network, RBF network, and tile coding, are gradient-descent methods)

Adjusting Network Weights
$Q(s,a) = f(s,a,\mathbf{w})$, where $\mathbf{w}$ is the weight vector.
e.g., gradient-descent Sarsa:
$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,[\, r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \,]\, \nabla_{\mathbf{w}} f(s_t,a_t,\mathbf{w})$
The bracketed difference is the target value minus the estimated value; $\nabla_{\mathbf{w}} f$ is the standard backprop gradient.

Sparse Coarse Coding
[Figure: a fixed, expansive re-representation of the input into many features, followed by a linear last layer.]
• Coarse: large receptive fields
• Sparse: few features present at any one time

The Mountain Car Problem (Moore, 1990)
[Figure: a car in a valley with the Goal at the top of the hill; gravity wins against the car's thrust.]
A minimum-time-to-goal problem.
• SITUATIONS: the car's position and velocity
• ACTIONS: three thrusts: forward, reverse, none
• REWARDS: always –1 until the car reaches the goal
• No discounting

RL Applied to the Mountain Car
1. TD method = Sarsa backup
2. Action selection = greedy, with optimistic initial values $Q_0(s,a) = 0$
3. Function approximator = CMAC
4. Eligibility traces = replacing
Video demo (Sutton, 1995).

Tile Coding applied to Mountain Car, a.k.a. CMACs (Albus, 1980)
• Shape/size of tiles ⇒ generalization
• Number of tilings ⇒ resolution of the final approximation
[Figure: overlapping 5x5 grid tilings (10) laid over the two-dimensional state space.]

The Acrobot Problem (e.g., Dejong & Spong, 1994; Sutton, 1995)
[Figure: a two-link pendulum; torque is applied at the middle joint; goal: raise the tip above the line.]
A minimum-time-to-goal problem:
• 4 state variables: 2 joint angles, 2 angular velocities
• CMAC of 48 layers
• RL the same as for Mountain Car
• Reward = –1 per time step

3. Eligibility Traces (*)
• The λ in TD(λ), Q(λ), Sarsa(λ) . . .
• A more primitive way of dealing with delayed reward
• The link to Monte Carlo methods
• Passing credit back farther than one action
• The general form: the global change in Q(s,a) at time t ∝ (TD error at time t) × (eligibility of s,a at time t); the eligibility is specific to s,a – e.g., was s,a taken at t, or shortly before t?

Eligibility Traces
[Figure: visits to s,a marked along a time axis, with the resulting accumulating trace and replacing trace plotted below.]
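Putting the last three ideas together (gradient-descent Sarsa, a sparse coarse-coded linear representation, and replacing eligibility traces, i.e. the recipe listed above for Mountain Car), here is a minimal sketch. It is an illustration only, not the tutorial's code: the feature function `active_tiles(s, a)`, which would return the indices of the binary features active for a state-action pair, is a hypothetical stand-in for a real tile coder, and the parameter values are placeholders.

```python
import numpy as np

# Minimal sketch (assumptions, not the tutorial's code): semi-gradient Sarsa(lambda)
# with a linear approximator over sparse binary features and replacing traces.
# active_tiles(s, a) is a hypothetical tile coder returning the indices of the few
# features that are "on" for the pair (s, a); then Q(s, a) = sum of w over them.

n_features = 4096            # size of the tile-coded feature space (placeholder)
w = np.zeros(n_features)     # weights; Q0 = 0 is optimistic when all rewards are -1
z = np.zeros(n_features)     # eligibility trace vector

def q_value(tiles):
    return w[tiles].sum()

def sarsa_lambda_update(tiles, r, next_tiles, alpha=0.1, gamma=1.0, lam=0.9,
                        terminal=False):
    """One step of gradient-descent Sarsa(lambda) with replacing traces."""
    global w, z
    target = r if terminal else r + gamma * q_value(next_tiles)
    td_error = target - q_value(tiles)
    z *= gamma * lam               # decay all traces
    z[tiles] = 1.0                 # replacing traces: active features reset to 1
    w += alpha * td_error * z      # credit the TD error back along the traces
    if terminal:
        z[:] = 0.0                 # start the next episode with zero traces
```

With λ = 0 this reduces to the one-step gradient-descent Sarsa update on the "Adjusting Network Weights" slide, since for a linear approximator the gradient $\nabla_{\mathbf{w}} f$ is just the feature vector.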
Complex TD Backups (*)
[Figure: within a trial, the primitive TD backups of 1, 2, 3, ... steps.]
e.g., TD(λ) incrementally computes a weighted mixture of these backups as states are visited:
• the n-step backups are weighted $(1-\lambda),\ (1-\lambda)\lambda,\ (1-\lambda)\lambda^{2},\ \ldots$ (the weights sum to 1)
• λ = 0: simple TD
• λ = 1: simple Monte Carlo

Kinds of Backups
[Diagram: a two-dimensional space of backups. One axis runs from full backups to sample backups, the other from shallow (bootstrapping) to deep backups. The corners are dynamic programming (full, shallow), exhaustive search (full, deep), temporal-difference learning (sample, shallow), and Monte Carlo (sample, deep).]

Outline (section divider; next part: Lessons for Artificial Intelligence)

World-Class Applications of RL
• TD-Gammon and Jellyfish (Tesauro, Dahl) – world's best backgammon player
• Elevator control (Crites & Barto) – world's best down-peak elevator controller
• Inventory management (Van Roy, Bertsekas, Lee & Tsitsiklis) – 10-15% improvement over industry-standard methods
• Dynamic channel assignment (Singh & Bertsekas; Nie & Haykin) – world's best assigner of radio channels to mobile telephone calls
• All these applications are large, stochastic, optimal control problems
  – Too hard for conventional methods to solve exactly
  – Conventional methods require the problem to be simplified
• RL just finds an approximate solution
  – . . . which can be much better!

Lesson #1: Approximate the Solution, Not the Problem

Backgammon (Tesauro, 1992, 1995)
• SITUATIONS: configurations of the playing board (about $10^{20}$)
• ACTIONS: moves
• REWARDS: win +1, lose –1, else 0
Pure delayed reward.

TD-Gammon (Tesauro, 1992–1995)
[Diagram: a neural network maps the board position to a value $V_t$; the TD error $V_{t+1} - V_t$ trains the network; action selection is by 2–3 ply search.]
• Start with a random network
• Play millions of games against itself
• Learn a value function from this simulated experience
This produces arguably the best player in the world.

Lesson #2: The Power of Learning from Experience
[Figure: performance against gammontool vs. number of hidden units (0, 10, 20, 40, 80). TD-Gammon, trained by self-play, reaches about 70%; Neurogammon, the same network trained from 15,000 expert-labeled examples, reaches about 50%. (Tesauro, 1992)]
• Expert examples are expensive and scarce
• Experience is cheap and plentiful!
• And it teaches the real solution

Lesson #3: The Central Role of Value Functions
...of modifiable, moment-by-moment estimates of how well things are going
• All RL methods learn value functions
• All state-space planners compute value functions
• Both are based on "backing up" value
Recognizing and reacting to the ups and downs of life is an important part of intelligence.

Lesson #4: Learning and Planning Can Be Radically Similar
Historically, planning and trial-and-error learning have been seen as opposites.
But RL treats both as processing of experience:
[Diagram: the environment yields experience through on-line interaction; an environment model yields experience through simulation; learning and planning both turn experience into values and policies.]

1-Step Tabular Q-Planning
1. Generate a state, s, and an action, a
2. Consult the model for the next state, s', and reward, r
3. Learn from the experience s, a, r, s':
   $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,]$
4. Go to 1
With function approximation and cleverness in search control (step 1), this is a viable, perhaps even superior, approach to state-space planning.

Lesson #5: Accretive Computation "Solves" Planning Dilemmas
[Diagram: experience feeds direct RL and model learning; the model feeds planning; planning and direct RL both improve the value function and policy, which drive acting.]
• The processes are only loosely coupled and can proceed in parallel, asynchronously
• The quality of the solution accumulates in the value and model memories
• The reactivity/deliberation dilemma is "solved" simply by not opposing search and memory
• The intractability of planning is "solved" by anytime improvement of the solution

Dyna Algorithm
1. s ← current state
2. Choose an action, a, and take it
3. Receive the next state, s', and reward, r
4. Apply an RL backup to s, a, s', r (e.g., the Q-learning update)
5. Update Model(s, a) with s', r
6. Repeat k times:
   - select a previously seen state-action pair s, a
   - s', r ← Model(s, a)
   - apply an RL backup to s, a, s', r
7. Go to 1
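The Dyna algorithm above translates almost line-for-line into code. Below is a minimal tabular Dyna-Q sketch; the environment interface (`env.reset()`, `env.step(a)` returning next state, reward, and a done flag), the ε-greedy action choice, and the parameter values are assumptions added for illustration, not part of the slide.

```python
import random
from collections import defaultdict

# Minimal tabular Dyna-Q sketch of the algorithm above. The env interface,
# epsilon-greedy exploration, and the constants are illustrative assumptions.

def dyna_q(env, actions, episodes=50, k=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)   # value table Q(s, a)
    model = {}               # Model(s, a) -> last observed (s', r, done)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:                                      # step 1: current state s
            if random.random() < epsilon:                    # step 2: choose an action
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)                        # step 3: next state, reward
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])        # step 4: Q-learning backup
            model[(s, a)] = (s2, r, done)                    # step 5: update the model
            for _ in range(k):                               # step 6: k planning backups
                ps, pa = random.choice(list(model.keys()))   #   previously seen pair
                ps2, pr, pdone = model[(ps, pa)]             #   simulated experience
                ptarget = pr + (0.0 if pdone
                                else gamma * max(Q[(ps2, b)] for b in actions))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2                                           # step 7: continue from s'
    return Q
```

With k = 0 this is exactly 1-step tabular Q-learning; the k simulated backups per real step are where learning and planning blur together (Lesson #4).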
Actions → Behaviors
• MDPs can seem too flat, too low-level
• We need to learn/plan/design agents at a higher level: at the level of behaviors rather than just low-level actions
  e.g., open-the-door rather than twitch-muscle-17; walk-to-work rather than 1-step-forward
• Behaviors are based on entire policies, directed toward subgoals
• Behaviors enable abstraction, hierarchy, modularity, transfer...

Rooms Example
[Figure: a gridworld of 4 rooms connected by 4 hallways; goal locations are marked G; O1 and O2 label example hallway options.]
• 4 unreliable primitive actions: up, down, left, right; each fails 33% of the time
• 8 multi-step options (one to each of a room's 2 hallways)
• Given a goal location, quickly plan the shortest route
• Goal states are given a terminal value of 1; all rewards are zero; γ = .9

Planning by Value Iteration
[Figure: snapshots of the value function after iterations 1, 2, and 3 of value iteration, starting from V(goal) = 1, once with primitive (cell-to-cell) actions and once with behaviors (room-to-room options). With behaviors, value spreads room-by-room rather than cell-by-cell.]

Behaviors are Much Like Primitive Actions
• They define a higher-level semi-MDP, overlaid on the original MDP
• They can be used with the same planning methods
  – faster planning
  – guaranteed correct results
• They are interchangeable with primitive actions
• Their models and policies can be learned by TD methods
They provide a clear semantics for multi-level, re-usable knowledge for the general MDP context.

Lesson #6: Generality is No Impediment to Working with Higher-Level Knowledge
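To make the room-to-room planning comparison above concrete, here is a minimal value-iteration sketch over a known model. It is an illustration only: the data layout (`P[s][a]` as a list of (probability, next-state) pairs, `R[s][a]` as the expected reward, `actions(s)` returning A(s)) and the stopping tolerance are assumptions, not the tutorial's code.

```python
# Minimal value-iteration sketch (assumed data layout, not the tutorial's code):
# P[s][a] is a list of (probability, next_state) pairs, R[s][a] the expected
# reward, and actions(s) returns the set A(s).

def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                        for a in actions(s))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:        # stop when no state changes by more than theta
            break
    return V
```

In the rooms example, goal states would be held fixed at their terminal value of 1 and excluded from the sweep. The same sweep can be run with behaviors in place of primitive actions by substituting each option's multi-step model (its expected discounted reward and discounted next-state distribution, with the discounting over the option's duration folded into the model), which is why the room-to-room panels in the figure converge in so few iterations.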
Summary of Lessons
1. Approximate the solution, not the problem
2. The power of learning from experience
3. The central role of value functions in finding optimal sequential behavior
4. Learning and planning can be radically similar
5. Accretive computation "solves" planning dilemmas
6. A general approach need not be flat and low-level
All stem from trying to solve MDPs.

Outline (section divider; next part: Concluding Remarks)

Strands of History of RL
[Diagram: several strands converging in modern RL – trial-and-error learning (Thorndike, psychology, 1911), temporal-difference learning (secondary reinforcement in psychology), and optimal control / learning value functions (Hamilton, physics, 1800s; Bellman/Howard, OR). Names along the strands include Shannon, Minsky, Samuel, Klopf, Holland, Witten, Werbos, Barto et al., Sutton, and Watkins.]

Radical Generality of the RL/MDP Problem
• RL/MDPs: general stochastic dynamics, general goals, uncertainty, reactive decision-making
• Classical: determinism, goals of achievement, complete knowledge, closed world, unlimited deliberation

Getting the Degree of Abstraction Right
• Actions can be low-level (e.g., voltages to motors), high-level (e.g., accept a job offer), or "mental" (e.g., shift the focus of attention)
• Situations can be direct sensor readings, symbolic descriptions, or "mental" situations (e.g., the situation of being lost)
• An RL agent is not like a whole animal or robot, which consists of many RL agents as well as other components
• The environment is not necessarily unknown to the RL agent, only incompletely controllable
• Reward computation is in the RL agent's environment, because the RL agent cannot change it arbitrarily

Some of the Open Problems
• Incomplete state information
• Exploration
• Structured states and actions
• Incorporating prior knowledge
• Using teachers
• Theory of RL with function approximators
• Modular and hierarchical architectures
• Integration with other problem-solving and planning methods

Dimensions of RL Algorithms
• TD --- Monte Carlo (λ)
• Model-based --- Model-free
• Kind of function approximator
• Exploration method
• Discounted --- Undiscounted
• Computed policy --- Stored policy
• On-policy --- Off-policy
• Action values --- State values
• Sample model --- Distribution model
• Discrete actions --- Continuous actions

A Unified View
• There is a space of methods, with tradeoffs and mixtures, rather than discrete choices
• Value functions are key
  – all the methods are about learning, computing, or estimating values
  – all the methods store a value function
• All methods are based on backups
  – different methods just have backups of different shapes and sizes, sources and mixtures

Psychology and RL
PAVLOVIAN CONDITIONING
• The single most common and important kind of learning
• Ubiquitous – from people to paramecia
• Learning to predict and anticipate; learning affect
• TD methods are a compelling match to the data
• Most modern, real-time methods are based on Temporal-Difference Learning
INSTRUMENTAL LEARNING (OPERANT CONDITIONING)
• "Actions followed by good outcomes are repeated"
• Goal-directed learning; the Law of Effect
• The biological version of reinforcement learning
• But careful comparisons to data have not yet been made
(Hawkins & Kandel; Hopfield & Tank; Klopf; Moore et al.; Byrne et al.; Sutton & Barto; Malaka)

Neuroscience and RL
RL interpretations of brain structures are being explored in several areas:
• Basal ganglia
• Dopamine neurons
• Value systems

Final Summary
• RL is about a problem:
  – goal-directed, interactive, real-time decision making
  – all hail the problem!
• Value functions seem to be the key to solving it, and TD methods the key new algorithmic idea
• There are a host of immediate applications: large-scale stochastic optimal control and scheduling
• There are also a host of scientific problems
• An exciting time

Bibliography
Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction. MIT Press, 1998.
Kaelbling, L.P., Littman, M.L., and Moore, A.W. "Reinforcement learning: A survey." Journal of Artificial Intelligence Research, 4, 1996.
Special issues on reinforcement learning: Machine Learning, 8(3/4), 1992, and 22(1/2/3), 1996.
Neuroscience: Schultz, W., Dayan, P., and Montague, P.R. "A neural substrate of prediction and reward." Science, 275:1593-1598, 1997.
Animal learning: Sutton, R.S. and Barto, A.G. "Time-derivative models of Pavlovian reinforcement." In Gabriel, M. and Moore, J., Eds., Learning and Computational Neuroscience, pp. 497-537. MIT Press, 1990.
See also web pages, e.g., starting from http://www.cs.umass.edu/~rich.