From Markov Decision Processes to Artificial Intelligence
Rich Sutton
with thanks to: Andy Barto, Satinder Singh, Doina Precup

The steady march of computing science is changing artificial intelligence
• More computation-based approximate methods
  – Machine learning, neural networks, genetic algorithms
• Machines are taking on more of the work
  – More data, more computation
  – Less reliance on handcrafted solutions and on human understandability
• More search
  – Exponential methods are still exponential… but compute-intensive methods are increasingly winning
• More general problems
  – stochastic, non-linear, optimal
  – real-time, large

The problem is to predict and control a doubly branching interaction unfolding over time, with a long-term goal
[Figure: Agent–World interaction loop, alternating states and actions]

Sequential, state-action-reward problems are ubiquitous
• Walking
• Flying a helicopter
• Playing tennis
• Logistics
• Inventory control
• Intruder detection
• Repair or replace?
• Visual search for objects
• Playing chess, Go, Poker
• Medical tests, treatment
• Conversation
• User interfaces
• Marketing
• Queue/server control
• Portfolio management
• Industrial process control
• Pipeline failure prediction
• Real-time load balancing

Markov Decision Processes (MDPs)
• Discrete time $t = 0, 1, 2, 3, \ldots$
• States $s_t \in S$
• Actions $a_t \in A$
• Policy $\pi : S \times A \to [0,1]$, with $\pi(s,a) = \Pr\{a_t = a \mid s_t = s\}$
• Transition probabilities $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
• Rewards $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$

MDPs Part II: The Objective
• “Maximize cumulative reward”
• Define the value of being in a state under a policy as
  $V^\pi(s) = E\{r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid s_0 = s, \pi\}$
  where delayed rewards are discounted by $\gamma \in [0,1]$
  (There are other possibilities...)
• Defines a partial ordering over policies, with at least one optimal policy (needs proving)

Markov Decision Processes
• Extensively studied since the 1950s
• In Optimal Control
  – Specializes to Riccati equations for linear systems
  – And to HJB equations for continuous-time systems
  – Only general, nonlinear, optimal-control framework
• In Operations Research
  – Planning, scheduling, logistics
  – Sequential design of experiments
  – Finance, marketing, inventory control, queuing, telecomm
• In Artificial Intelligence (last 15 years)
  – Reinforcement learning, probabilistic planning
• Dynamic Programming is the dominant solution method

Outline
• Markov decision processes (MDPs)
• Dynamic Programming (DP)
  – The curse of dimensionality
• Reinforcement Learning (RL)
  – TD(λ) algorithm
  – TD-Gammon example
  – Acrobot example
• RL significantly extends DP methods for solving MDPs
  – RoboCup example
• Conclusion, from the AI point of view
  – Spy plane example

The Principle of Optimality
• Dynamic Programming (DP) requires a decomposition into subproblems
• In MDPs this comes from the Independence of Path assumption
• Values can be written in terms of successor values, e.g.,
  $V^\pi(s) = E\{r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid s_0 = s, \pi\}$
  $\phantom{V^\pi(s)} = E\{r_1 + \gamma V^\pi(s_1) \mid s_0 = s, \pi\}$
  $\phantom{V^\pi(s)} = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
  (the “Bellman Equations”)

Dynamic Programming: sweeping through the states, updating an approximation to the optimal value function
For example, Value Iteration:
  Initialize: $V(s) := 0 \quad \forall s \in S$
  Do forever: $V(s) := \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V(s') \right] \quad \forall s \in S$
In the limit, $V \to V^*$. Pick any of the maximizing actions to get $\pi^*$.

DP is repeated backups, shallow lookahead searches
[Figure: backup diagram, from state s through action a, the transition probabilities $P^a_{ss'}$ and rewards $R^a_{ss'}$ lead to successor values $V(s')$, $V(s'')$]
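To make the value-iteration sweep above concrete, here is a minimal tabular sketch in Python. The three-state MDP (STATES, ACTIONS, the transition table P, the reward table R, and GAMMA) is an invented toy example, not anything from the talk; the point is only the backup $V(s) := \max_a \sum_{s'} P^a_{ss'}[R^a_{ss'} + \gamma V(s')]$ applied repeatedly in sweeps over the state set.

```python
# Minimal tabular value iteration for a tiny, made-up MDP (illustrative only).
# P[s][a] is a list of (next_state, probability); R[s][a][s'] is the expected reward.

GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = [0, 1]

# Hypothetical dynamics: from each state, action 0 is "safe", action 1 is "risky".
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 0.8), (2, 0.2)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(2, 1.0)]},
    2: {0: [(2, 1.0)],           1: [(0, 1.0)]},
}
R = {
    0: {0: {0: 0.0, 1: 1.0}, 1: {1: 2.0, 2: -1.0}},
    1: {0: {0: 0.0, 1: 0.5}, 1: {2: 5.0}},
    2: {0: {2: 0.0},         1: {0: 0.0}},
}

def value_iteration(theta=1e-8):
    V = {s: 0.0 for s in STATES}           # Initialize: V(s) := 0 for all s
    while True:                             # "Do forever" (until a sweep changes little)
        delta = 0.0
        for s in STATES:                    # one sweep through the states
            backups = [sum(p * (R[s][a][s2] + GAMMA * V[s2])
                           for s2, p in P[s][a]) for a in ACTIONS]
            v_new = max(backups)            # V(s) := max_a sum_s' P [R + gamma V(s')]
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Greedy policy: pick any maximizing action in each state
    pi = {s: max(ACTIONS, key=lambda a: sum(p * (R[s][a][s2] + GAMMA * V[s2])
                                            for s2, p in P[s][a]))
          for s in STATES}
    return V, pi

if __name__ == "__main__":
    V, pi = value_iteration()
    print(V, pi)
```

Updating V in place during the sweep, rather than from a frozen copy, is also a valid form of value iteration and usually converges a little faster.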
Dynamic Programming is the dominant solution method for MDPs
• Routinely applied to problems with millions of states
• Worst case scales polynomially in |S| and |A|
• Linear Programming has better worst-case bounds but in practice scales hundreds of times worse
• On large stochastic problems, only DP is feasible

Perennial Difficulties for DP
1. Large state spaces (“the curse of dimensionality”)
2. Difficulty calculating the dynamics, $P^a_{ss'}$ and $R^a_{ss'}$, e.g., from a simulation
3. Unknown dynamics
$V(s) := \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V(s') \right] \quad \forall s \in S$

The Curse of Dimensionality (Bellman, 1961)
• The number of states grows exponentially with dimensionality -- the number of state variables
• Thus, on large problems,
  – Can’t complete even one sweep of DP
    • Can’t enumerate states, need sampling!
  – Can’t store separate values for each state
    • Can’t store values in tables, need function approximation!

Reinforcement Learning: using experience in place of dynamics
Let $s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, \ldots$ be an observed sequence with actions selected by $\pi$.
For every time step $t$,
  $V^\pi(s_t) = E\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t, \pi\}$  (“Bellman Equation”)
which suggests the DP-like update:
  $V(s_t) := E\{r_{t+1} + \gamma V(s_{t+1}) \mid s_t, \pi\}$
We don’t know this expected value, but we know the actual $r_{t+1} + \gamma V(s_{t+1})$, an unbiased sample of it.
In RL, we take a step toward this sample, e.g., half way:
  “Tabular TD(0)”: $V(s_t) := \tfrac{1}{2} V(s_t) + \tfrac{1}{2}\left[ r_{t+1} + \gamma V(s_{t+1}) \right]$

Temporal-Difference Learning (Sutton, 1988)
Updating a prediction based on its change (temporal difference) from one moment to the next.
Tabular TD(0):
  $V(s_t) := V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right], \quad t = 0, 1, 2, \ldots$
The bracketed temporal difference compares the first prediction, $V(s_t)$, with the better, later prediction, $r_{t+1} + \gamma V(s_{t+1})$.
Or V is, e.g., a neural network with parameter $\theta$. Then use gradient-descent TD(0):
  $\theta := \theta + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] \nabla_\theta V(s_t), \quad t = 0, 1, 2, \ldots$
TD(λ), λ > 0, uses differences from later predictions as well.

TD-Gammon (Tesauro, 1992–1995)
[Figure: multilayer neural network mapping the backgammon position (162 units) to an output ≈ probability of winning, trained by the TD error]
• Start with a random network
• Play millions of games against itself
• Learn a value function from this simulated experience
• This produces arguably the best player in the world
• Action selection by 2–3 ply search

TD-Gammon vs. an Expert-Trained Net (Tesauro, 1992)
[Figure: fraction of games won against Gammontool (.45 to .70) versus number of hidden units (0 to 80), comparing TD-Gammon and TD-Gammon + features with EP (a backprop net trained from expert moves) and EP + features (“Neurogammon”)]

Examples of Reinforcement Learning
• Elevator Control (Crites & Barto)
  – (Probably) world’s best down-peak elevator controller
• Walking Robot (Benbrahim & Franklin)
  – Learned critical parameters for bipedal walking
• RoboCup Soccer Teams (e.g., Stone & Veloso, Riedmiller et al.)
  – RL is used in many of the top teams
• Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis)
  – 10–15% improvement over industry-standard methods
• Dynamic Channel Assignment (Singh & Bertsekas, Nie & Haykin)
  – World’s best assigner of radio channels to mobile telephone calls
• KnightCap and TDLeaf (Baxter, Tridgell & Weaver)
  – Improved chess play from intermediate to master in 300 games

Does function approximation beat the curse of dimensionality?
• Yes… probably
• FA makes dimensionality per se largely irrelevant
• With FA, computation seems to scale with the complexity of the solution (crinkliness of the value function) and how hard it is to find it
• If you can get FA to work!
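Before turning to how function approximation behaves in DP versus RL, here is a small Python sketch of the TD(0) updates above. The environment is a 5-state random-walk prediction task (a common TD illustration); the one-hot features, step size, and episode count are assumptions for illustration, not from the talk. With one-hot features the linear update reduces to tabular TD(0); swapping in any other fixed feature map gives gradient-descent TD(0) with linear function approximation.

```python
# TD(0) with linear function approximation on a toy 5-state random walk.
# V(s) = w . phi(s); the environment and features are illustrative assumptions.

import random

GAMMA = 1.0          # undiscounted episodic prediction task
ALPHA = 0.05         # step size
N_STATES = 5         # non-terminal states 0..4; terminate off either end

def phi(s):
    """One-hot features here; any fixed feature map (e.g. tile coding) works."""
    f = [0.0] * N_STATES
    f[s] = 1.0
    return f

def v(w, s):
    return sum(wi * fi for wi, fi in zip(w, phi(s)))

def episode(w):
    s = N_STATES // 2                        # start in the middle
    while True:
        s_next = s + random.choice((-1, 1))
        if s_next < 0:                       # left terminal, reward 0
            r, v_next, done = 0.0, 0.0, True
        elif s_next >= N_STATES:             # right terminal, reward +1
            r, v_next, done = 1.0, 0.0, True
        else:
            r, v_next, done = 0.0, v(w, s_next), False
        td_error = r + GAMMA * v_next - v(w, s)      # temporal difference
        for i, fi in enumerate(phi(s)):              # w := w + alpha * delta * grad V(s)
            w[i] += ALPHA * td_error * fi
        if done:
            return
        s = s_next

w = [0.0] * N_STATES
for _ in range(5000):
    episode(w)
print([round(x, 2) for x in w])   # approaches the true values 1/6 .. 5/6
```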
FA in DP and RL (first bit)
• Conventional DP works poorly with FA
  – Empirically [Boyan and Moore, 1995]
  – Diverges with linear FA [Baird, 1995]
  – Even for prediction (evaluating a fixed policy) [Baird, 1995]
• RL works much better
  – Empirically [many applications and Sutton, 1996]
  – TD(λ) prediction converges with linear FA [Tsitsiklis & Van Roy, 1997]
  – TD(λ) control converges with linear FA [Perkins & Precup, 2002]
• Why? Following actual trajectories in RL ensures that every state is updated at least as often as it is the basis for updating
  – DP+FA fails: more transitions can go into a state than go out
  – RL+FA works: real trajectories always leave a state after entering it

Outline
• Markov decision processes (MDPs)
• Dynamic Programming (DP)
  – The curse of dimensionality
• Reinforcement Learning (RL)
  – TD(λ) algorithm
  – TD-Gammon example
  – Acrobot example
• RL significantly extends DP methods for solving MDPs
  – RoboCup example
• Conclusion, from the AI point of view
  – Spy plane example

The Mountain Car Problem (Moore, 1990)
[Figure: a car in a valley must reach the goal at the top of a hill; its engine is weaker than gravity (“gravity wins”), so it must back up the opposite slope to gain momentum]
A minimum-time-to-goal problem:
• SITUATIONS: car’s position and velocity
• ACTIONS: three thrusts: forward, reverse, none
• REWARDS: always −1 until the car reaches the goal
• No discounting

Value Functions Learned while solving the Mountain Car problem (Sutton, 1996)
[Figure: learned value function over car position × car velocity, with the goal region marked; value = estimated time to goal, so lower is better; learned with sparse, coarse tile coding (CMAC)]

Tile Coding (CMAC) (Albus, 1980)
An example of sparse coarse-coded networks:
[Figure: a fixed, expansive re-representation of the state into features, followed by a linear last layer]
• Coarse: large receptive fields
• Sparse: few features present at one time

The Acrobot Problem (Sutton, 1996; e.g., DeJong & Spong, 1994)
A minimum-time-to-goal problem: raise the tip above the line.
[Figure: two-link acrobot with a fixed base, joint angles θ1 and θ2, torque applied at the second joint, and the tip to be raised above a target line]
• 4 state variables: 2 joint angles, 2 angular velocities
• Reward = −1 per time step
• Tile coding with 48 tilings

The RoboCup Soccer Competition

13 Continuous State Variables (for 3 vs. 2) (Stone & Sutton, 2001)
• 11 distances among the players, the ball, and the center of the field
• 2 angles to takers along passing lanes

RoboCup Feature Vectors (Stone & Sutton, 2001)
[Figure: full soccer state → 13 continuous state variables → sparse, coarse tile coding → huge binary feature vector φ_s (about 400 1’s and 40,000 0’s) → linear map θ → action values]

Learning Keepaway Results: 3v2 with handcrafted takers (Stone & Sutton, 2001)
[Figure: learning curves from multiple, independent runs of TD(λ)]
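The “sparse, coarse, tile coding” re-representation used in the Mountain Car, Acrobot, and keepaway experiments above can be sketched in a few lines of Python. The grid resolution, number of tilings, offsets, and two-dimensional unit-square state below are illustrative assumptions, not the actual feature constructions from those experiments; the sketch only shows the two defining properties from the slides: large receptive fields (coarse) and few active features at a time (sparse), feeding a linear last layer.

```python
# Toy tile coding: map a 2-D continuous state to a sparse binary feature vector.
# Grid resolution, number of tilings, and offsets are illustrative assumptions.

N_TILINGS = 8        # number of overlapping grids
TILES_PER_DIM = 10   # each tiling is a 10 x 10 grid over the unit square

def active_tiles(x, y):
    """Return the indices of the active (value 1) features for state (x, y) in [0,1)^2."""
    indices = []
    for t in range(N_TILINGS):
        offset = t / (N_TILINGS * TILES_PER_DIM)     # each tiling is shifted slightly
        col = int((x + offset) * TILES_PER_DIM) % TILES_PER_DIM
        row = int((y + offset) * TILES_PER_DIM) % TILES_PER_DIM
        indices.append(t * TILES_PER_DIM**2 + row * TILES_PER_DIM + col)
    return indices   # exactly N_TILINGS ones out of N_TILINGS * TILES_PER_DIM**2 features

def value(weights, x, y):
    """Linear last layer: the value is just the sum of the weights of the active tiles."""
    return sum(weights[i] for i in active_tiles(x, y))

weights = [0.0] * (N_TILINGS * TILES_PER_DIM**2)
print(active_tiles(0.33, 0.71))   # 8 active features out of 800
```

Because only N_TILINGS features are active at a time, each linear TD update touches only a handful of weights, which is what keeps these methods cheap on continuous state spaces.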
Hajime Kimura’s RL Robots (dynamics knowledge)
[Videos: “Before”, “Backward”, “After”]
New Robot, Same algorithm

Assessment re: DP
• RL has added some new capabilities to DP methods
  – Much larger MDPs can be addressed (approximately)
  – Simulations can be used without explicit probabilities
  – Dynamics need not be known or modeled
• Many new applications are now possible
  – Process control, logistics, manufacturing, telecomm, finance, scheduling, medicine, marketing…
• Theoretical and practical questions remain open

Outline
• Markov decision processes (MDPs)
• Dynamic Programming (DP)
  – The curse of dimensionality
• Reinforcement Learning (RL)
  – TD(λ) algorithm
  – TD-Gammon example
  – Acrobot example
• RL significantly extends DP methods for solving MDPs
  – RoboCup example
• Conclusion, from the AI point of view
  – Spy plane example

A lesson for AI: The Power of a “Visible” Goal
• In MDPs, the goal (reward) is part of the data, part of the agent’s normal operation
• The agent can tell for itself how well it is doing
• This is very powerful… we should do more of it in AI
• Can we give AI tasks visible goals?
  – Visual object recognition? Better would be active vision
  – Story understanding? Better would be dialog, e.g., call routing
  – User interfaces, personal assistants
  – Robotics… say mapping and navigation, or search
• The usual trick is to make them into long-term prediction problems
• There must be a way. If you can’t feel it, why care about it?

Assessment re: AI
• DP and RL are potentially powerful probabilistic planning methods
  – But typically don’t use logic or structured representations
• How are they as an overall model of thought?
  – Good mix of deliberation and immediate judgments (values)
  – Good for causality and prediction, not for logic and language
• The link to data is appealing… but incomplete
  – MDP-style knowledge may be learnable, tuneable, verifiable
  – But only if the “level” of the data is right
  – Sometimes seems too low-level, too flat

Ongoing and Future Directions
• Temporal abstraction [Sutton, Precup, Singh, Parr, others]
  – Generalize transitions to include macros, “options”
  – Multiple overlying MDP-like models at different levels
• State representation [Littman, Sutton, Singh, Jaeger...]
  – Eliminate the nasty assumption of observable state
  – Get really real with data
  – Work up to higher-level, yet grounded, representations
• Neuroscience of reward systems [Dayan, Schultz, Doya]
  – Dopamine reward system behaves remarkably like TD
• Theory and practice of value function approximation [everybody]

Spy Plane Example (Reconnaissance Mission Planning) (Sutton & Ravindran, 2001)
[Figure: map of observation sites annotated with rewards and mean times between weather changes; 8 options; 100 decision steps; Base]
• Mission: fly over (observe) the most valuable sites and return to base
• Stochastic weather affects observability (cloudy or clear) of sites
• Limited fuel
• Intractable with classical optimal control methods
• Temporal scales:
  – Actions: which direction to fly now
  – Options: which site to head for
• Options compress space and time
  – Reduce steps from ~600 to ~6
  – Reduce states from ~10^11 to ~10^6
$Q^*_O(s, o) = r^o_s + \sum_{s'} p^o_{ss'} V^*_O(s')$
where $s$ is any state (~10^6) and $s'$ ranges over the sites only (~6). (A small numerical sketch of this backup appears at the end of these slides.)

Spy Plane Results (Sutton & Ravindran, 2001)
[Figure: expected reward per mission, under high fuel and low fuel, for the SMDP planner, the static re-planner, and the SMDP planner with re-evaluation]
• SMDP planner:
  – Assumes options are followed to completion
  – Plans the optimal SMDP solution
• SMDP planner with re-evaluation:
  – Plans as if options must be followed to completion
  – But actually takes them for only one step
  – Re-picks a new option on every step
• Static planner:
  – Assumes weather will not change
  – Plans an optimal tour among the clear sites
  – Re-plans whenever the weather changes
Temporal abstraction with options finds a better approximation on each step than the static planner, with little more computation than the SMDP planner.

Didn’t have time for
• Action selection
  – Exploration/exploitation
  – Action values vs. search
  – How learning values leads to policy improvements
• Different returns, e.g., the undiscounted case
• Exactly how FA works, backprop
• Exactly how options work
  – How planning at a high level can affect primitive actions
• How states can be abstracted to affordances
  – And how this directly builds on the option work
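Finally, as referenced in the spy-plane section, here is a small numerical sketch of the option-value backup $Q^*_O(s, o) = r^o_s + \sum_{s'} p^o_{ss'} V^*_O(s')$. The three-state problem and its option models (the reward table r and the discounted termination-probability table p) are made-up numbers, not the mission-planning model from the talk; the sketch only shows that planning at the level of options is ordinary value iteration on an SMDP whose “actions” are the options.

```python
# Value iteration over option models (SMDP planning) on a made-up 3-site toy problem.
# r[s][o] is the expected (discounted) reward of running option o from state s;
# p[s][o] maps each possible termination state to its discounted probability.

STATES = ["base", "siteA", "siteB"]
OPTIONS = ["go_A", "go_B", "return"]

# Illustrative option models (assumed numbers, not from the talk).
r = {
    "base":  {"go_A": 25.0, "go_B": 50.0, "return": 0.0},
    "siteA": {"go_A": 0.0,  "go_B": 50.0, "return": 0.0},
    "siteB": {"go_A": 25.0, "go_B": 0.0,  "return": 0.0},
}
p = {
    "base":  {"go_A": {"siteA": 0.9},  "go_B": {"siteB": 0.85}, "return": {"base": 0.95}},
    "siteA": {"go_A": {"siteA": 0.9},  "go_B": {"siteB": 0.85}, "return": {"base": 0.9}},
    "siteB": {"go_A": {"siteA": 0.85}, "go_B": {"siteB": 0.9},  "return": {"base": 0.9}},
}
# Note: the probabilities sum to less than 1 because discounting over the option's
# (variable) duration is folded into the model, as in the options framework.

V = {s: 0.0 for s in STATES}
for _ in range(200):                        # sweeps of SMDP value iteration
    for s in STATES:
        V[s] = max(r[s][o] + sum(prob * V[s2] for s2, prob in p[s][o].items())
                   for o in OPTIONS)

# Option values and the greedy choice of option in each state.
Q = {(s, o): r[s][o] + sum(prob * V[s2] for s2, prob in p[s][o].items())
     for s in STATES for o in OPTIONS}
best = {s: max(OPTIONS, key=lambda o: Q[(s, o)]) for s in STATES}
print({s: round(V[s], 1) for s in STATES}, best)
```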