Overcoming the Curse of Dimensionality with Reinforcement Learning
Rich Sutton, AT&T Labs
With thanks to Doina Precup, Peter Stone, Satinder Singh, David McAllester, Sanjoy Dasgupta

Computers have gotten faster and bigger
• Analytic solutions are less important
• Computer-based approximate solutions
  – Neural networks
  – Genetic algorithms
• Machines take on more of the work
• More general solutions to more general problems
  – Non-linear systems
  – Stochastic systems
  – Larger systems
• Exponential methods are still exponential… but compute-intensive methods are increasingly winning

New Computers have led to a New Artificial Intelligence
• More general problems and algorithms, automation
  – Data-intensive methods, learning methods
• Less handcrafted solutions, fewer expert systems
• More probability and numbers; less logic, symbols, and human understandability
• More real-time decision-making
States, Actions, Goals, Probability => Markov Decision Processes

Markov Decision Processes
• State space $S$ (finite)
• Action space $A$ (finite)
• Discrete time $t = 0, 1, 2, \ldots$
• Episode: $s_0\ a_0\ r_1\ s_1\ a_1\ r_2\ s_2\ a_2 \ldots\ (r_T\ s_T)$
• Transition probabilities: $p^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s,\ a_t = a\}$
• Expected rewards: $r^a_{ss'} = E\{r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s'\}$
• Policy: $\pi : S \times A \to [0,1]$, with $\pi(s,a) = \Pr\{a_t = a \mid s_t = s\}$
• Return: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$, with discount rate $\gamma \in [0,1]$
• Value: $V^\pi(s) = E\{R_t \mid s_t = s\}$ (the PREDICTION problem)
• Optimal policy: $\pi^* = \arg\max_\pi V^\pi$ (the CONTROL problem)

Key Distinctions
• Control vs Prediction
• Bootstrapping/Truncation vs Full Returns
• Sampling vs Enumeration
• Function approximation vs Table lookup
• Off-policy vs On-policy
In each pair, the first is harder, more challenging and interesting; the second is easier and conceptually simpler.

Full Depth Search
Computing $\hat V^\pi(s)$ from full returns $r + \gamma r' + \gamma^2 r'' + \cdots$ is of exponential complexity $B^D$ (branching factor $B$, depth $D$).

Truncated Search
Computing $\hat V(s)$ from truncated returns $r + \gamma \hat V(s')$:
• Search truncated after one ply
• Approximate values used at the stubs
• Values computed from their own estimates! -- "Bootstrapping"

Dynamic Programming is Bootstrapping
Truncated returns: $E\{r + \gamma \hat V(s')\}$
E.g., DP policy evaluation:
$\hat V_0(s)$ arbitrary
$\hat V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} p^a_{ss'} \big[ r^a_{ss'} + \gamma \hat V_k(s') \big] \quad \forall s \in S$
$\lim_{k \to \infty} \hat V_k = V^\pi$

Bootstrapping seems to Speed Learning
[Figure: performance as a function of λ on four tasks, for accumulating and replacing traces: Mountain Car (steps per episode), Random Walk (RMS error), Puddle World (cost per episode), and Cart and Pole (failures per 100,000 steps). Intermediate amounts of bootstrapping do best.]

Bootstrapping/Truncation
• Replacing possible futures with estimates of value
• Can reduce computation and variance
• A powerful idea, but...
• Requires stored estimates of value for each state

The Curse of Dimensionality (Bellman, 1961)
• The number of states grows exponentially with dimensionality -- the number of state variables
• Thus, on large problems,
  – Can't complete even one sweep of DP
    Can't enumerate states, need sampling!
  – Can't store separate values for each state
    Can't store values in tables, need function approximation!
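To make the bootstrapping backup concrete, here is a minimal sketch of tabular DP policy evaluation, assuming the MDP is given as NumPy arrays P[s, a, s'] (transition probabilities), R[s, a, s'] (expected rewards), and pi[s, a] (the policy); these names and the toy MDP below are illustrative assumptions, not anything from the talk.

```python
import numpy as np

def dp_policy_evaluation(P, R, pi, gamma=0.9, n_sweeps=100):
    """Iterative policy evaluation:
    V_{k+1}(s) = sum_a pi(s,a) sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V_k(s')).
    """
    V = np.zeros(P.shape[0])                    # V_0 is arbitrary; zeros here
    for _ in range(n_sweeps):
        # One full sweep: every state is backed up from the current estimates
        # of its successors ("bootstrapping").
        V = np.einsum('sa,saz,saz->s', pi, P, R + gamma * V[None, None, :])
    return V

# Tiny random 3-state, 2-action MDP, purely for illustration
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))      # p^a_{ss'}
R = rng.normal(size=(3, 2, 3))                  # r^a_{ss'}
pi = np.full((3, 2), 0.5)                       # equiprobable policy
print(dp_policy_evaluation(P, R, pi))
```

Each sweep enumerates every state and every possible successor, which is exactly the step that becomes infeasible when the number of states grows exponentially with the number of state variables.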
DP Policy Evaluation
$\hat V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} p^a_{ss'} \big[ r^a_{ss'} + \gamma \hat V_k(s') \big] \quad \forall s \in S$
The same backup can be written as an update weighted by $d(s)$, some distribution over states, possibly uniform:
$\hat V_{k+1}(s) = \hat V_k(s) + d(s) \sum_a \pi(s,a) \sum_{s'} p^a_{ss'} \big[ r^a_{ss'} + \gamma \hat V_k(s') - \hat V_k(s) \big] \quad \forall s \in S$
The $d(s)$, $\pi(s,a)$, and $p^a_{ss'}$ terms can be replaced by sampling.

Tabular TD(0) (Sutton, 1988; Witten, 1974)
For each sample transition $s, a \rightarrow s', r$:
$\hat V(s) \leftarrow \hat V(s) + \alpha \big[ r + \gamma \hat V(s') - \hat V(s) \big]$
$\lim \hat V = V^\pi$ (with probability 1)

Sampling vs Enumeration
Sample returns can also be either full or truncated:
• Full: $r + \gamma r' + \gamma^2 r'' + \cdots$
• Truncated: $r + \gamma \hat V(s')$
as in the general TD(λ) algorithm.

Function Approximation
• Store values as a parameterized form: $\hat V(s) = f(s, \theta)$
• Update $\theta$, e.g., by gradient descent:
$\theta_{k+1} = \theta_k + \alpha \sum_s d(s) \sum_a \pi(s,a) \sum_{s'} p^a_{ss'} \big[ r^a_{ss'} + \gamma \hat V_k(s') - \hat V_k(s) \big] \nabla_\theta \hat V_k(s)$
cf. DP policy evaluation (rewritten to include a step size $\alpha$):
$\hat V_{k+1}(s) = \hat V_k(s) + \alpha\, d(s) \sum_a \pi(s,a) \sum_{s'} p^a_{ss'} \big[ r^a_{ss'} + \gamma \hat V_k(s') - \hat V_k(s) \big] \quad \forall s \in S$

Linear Function Approximation
$\hat V(s) = \theta^\top \phi_s$, so $\nabla_\theta \hat V(s) = \phi_s$
• Each state $s$ is represented by a feature vector $\phi_s$
• Or represent a state-action pair by $\phi_{sa}$ and approximate action values:
$Q^\pi(s,a) = E\{R_t \mid s_t = s,\ a_t = a\}$, approximated by $\hat Q(s,a) = \theta^\top \phi_{sa}$

Linear TD(λ) (Sutton, 1988)
After each episode: $\theta \leftarrow \theta + \sum_{t=0}^{T-1} \Delta\theta_t$, where
$\Delta\theta_t = \alpha \big[ R^\lambda_t - \theta^\top \phi_{s_t a_t} \big] \phi_{s_t a_t}$
"λ-return": $R^\lambda_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R^{(n)}_t$
"n-step return": $R^{(n)}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n \theta^\top \phi_{s_{t+n} a_{t+n}}$
e.g., the one-step return is $R^{(1)}_t = r_{t+1} + \gamma\, \theta^\top \phi_{s_{t+1} a_{t+1}}$
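In practice the linear TD(λ) update is usually computed incrementally with eligibility traces rather than by forming λ-returns at the end of the episode; for small step sizes the two give nearly the same total update. The sketch below is a minimal, hedged version of that incremental form; the run_episode interface yielding (phi, r, phi_next) feature vectors is an assumption for illustration, not the talk's code.

```python
import numpy as np

def linear_td_lambda(run_episode, n_features, alpha=0.01, gamma=1.0, lam=0.7, n_episodes=100):
    """Semi-gradient TD(lambda) with linear function approximation and accumulating traces.

    run_episode() is assumed to yield transitions (phi, r, phi_next), where phi is the
    feature vector of the current state-action pair, r the next reward, and phi_next
    the feature vector of the next pair (None at termination).
    """
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        e = np.zeros(n_features)                     # eligibility trace
        for phi, r, phi_next in run_episode():
            v = theta @ phi
            v_next = 0.0 if phi_next is None else theta @ phi_next
            delta = r + gamma * v_next - v           # sampled, truncated (bootstrapped) error
            e = gamma * lam * e + phi                # accumulate the gradient of theta^T phi
            theta += alpha * delta * e
    return theta
```

With the sparse binary features described next for keepaway (a few hundred ones among tens of thousands of zeros), each of these inner products reduces to summing a few hundred components of theta.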
RoboCup
An international AI and robotics research initiative
• Use soccer as a rich and realistic testbed
• Robotic and simulation leagues
  – Open-source simulator (Noda)
Research challenges:
• Multiple teammates with a common goal
• Multiple adversaries, not known in advance
• Real-time decision making necessary
• Noisy sensors and actuators
• Enormous state space, more than $2^{310}$ states

RoboCup Feature Vectors
Full soccer state → 13 continuous state variables → sparse, coarse tile coding → huge binary feature vector $\phi_s$ (about 400 1's and 40,000 0's) → linear map $\theta$ → action values

13 Continuous State Variables (for 3 vs 2)
• 11 distances among the players, the ball, and the center of the field
• 2 angles to takers along passing lanes

Sparse, Coarse, Tile Coding (CMACs)
• 32 tilings per group of state variables

Learning Keepaway Results
[Figure: 3v2 keepaway against handcrafted takers. Episode duration in seconds vs hours of training time (bins of 1000 episodes) for multiple, independent runs of TD(λ); the learned keepers come to outlast the handcoded, random, and always-hold benchmark policies. Stone & Sutton, 2001.]
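For concreteness, here is a minimal sketch of what "sparse, coarse tile coding" produces, assuming simple uniformly offset grid tilings; it is not the CMAC implementation used in the keepaway experiments, and the variable ranges, tile widths, and grouping are illustrative assumptions.

```python
import numpy as np

def tile_coding_features(x, low, high, n_tilings=32, n_tiles=8):
    """Map a vector of continuous state variables to a sparse binary feature vector.

    Each tiling partitions the rescaled inputs into a grid ((n_tiles + 1) cells per
    dimension, to cover the offsets) shifted by a different fraction of a tile width,
    so exactly one tile per tiling is active: n_tilings ones in a long binary vector.
    """
    x, low, high = (np.asarray(v, dtype=float) for v in (x, low, high))
    scaled = (x - low) / (high - low) * n_tiles          # position measured in tile widths
    dims = len(x)
    cells = (n_tiles + 1) ** dims                        # cells per tiling
    phi = np.zeros(n_tilings * cells)
    for t in range(n_tilings):
        offset = t / n_tilings                           # a different shift for each tiling
        coords = np.floor(scaled + offset).astype(int)
        index = t * cells + int(np.ravel_multi_index(coords, (n_tiles + 1,) * dims))
        phi[index] = 1.0                                 # one active (coarse) tile per tiling
    return phi

# Example: two continuous variables in [0, 1] give 32 active features
phi = tile_coding_features([0.3, 0.7], low=[0, 0], high=[1, 1])
print(int(phi.sum()), "active features out of", phi.size)
```

Tiling the 13 keepaway variables in small groups, with 32 tilings per group, is what yields roughly 400 active features out of about 40,000, so each linear value estimate is a sum of a few hundred learned weights.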
Key Distinctions
• Control vs Prediction
• Bootstrapping/Truncation vs Full Returns
• Function approximation vs Table lookup
• Sampling vs Enumeration
• Off-policy vs On-policy
  – The distribution $d(s)$

Off-Policy Instability
• Examples of diverging $\theta_k$ are known for
  – Linear FA
  – Bootstrapping
• Even for
  – Prediction
  – Enumeration
  – Uniform $d(s)$
(Baird, 1995; Gordon, 1995; Bertsekas & Tsitsiklis, 1996)
• In particular, linear Q-learning can diverge

Baird's Counterexample
A Markov chain (no actions) in which the values of the five upper states are approximated as $\theta_0 + 2\theta_1, \ldots, \theta_0 + 2\theta_5$ and the value of the lower state as $2\theta_0 + \theta_6$; every state transitions to the lower state (100%). All states are updated equally often, synchronously. An exact solution exists ($\theta = 0$), yet starting from the initial parameter vector $(1,1,1,1,1,10,1)^\top$ the parameters diverge.
[Figure: parameter values $\theta_k(i)$, on a log scale broken at ±1, over iterations $k = 0$ to 5000, growing without bound.]

On-Policy Stability
• If $d(s)$ is the stationary distribution of the MDP under policy $\pi$ (the on-policy distribution)
• Then convergence is guaranteed for
  – Linear FA
  – Bootstrapping
  – Sampling
  – Prediction
(Tsitsiklis & Van Roy, 1997; Tadic, 2000)
• Furthermore, the asymptotic mean squared error is a bounded expansion of the minimal MSE:
$\mathrm{MSE}(\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\ \min_\theta \mathrm{MSE}(\theta)$

Value Function Space
[Diagram: the space of value functions, showing the true $V^*$ and the region of the best admissible value function (the best admissible policy), with the typical behavior of different methods:]
• Sarsa, TD(λ) and other on-policy methods: chattering, without divergence but without guaranteed convergence
• Q-learning, DP and other off-policy methods: divergence possible
• The original naive hope: guaranteed convergence to a good policy
• Residual gradient et al.: guaranteed convergence to a less desirable policy

There are Two Different Problems
Chattering:
• Is due to Control + FA
• Bootstrapping not involved
• Not necessarily a problem
• Being addressed with policy-based methods
• Argmax-ing is to blame
Instability:
• Is due to Bootstrapping + FA + Off-policy
• Control not involved
• Off-policy is to blame

Yet we need Off-policy Learning
• Off-policy learning is needed in all the frameworks that have been proposed to raise reinforcement learning to a higher level
  – Macro-actions, options, HAMs, MAXQ
  – Temporal abstraction, hierarchy, modularity
  – Subgoals, goal-and-action-oriented perception
• The key idea is: we can only follow one policy, but we would like to learn about many policies, in parallel
  – To do this requires off-policy learning

On-Policy Policy Evaluation Problem
Use data (episodes) generated by $\pi$ to learn $\hat Q \approx Q^\pi$.

Off-Policy Policy Evaluation Problem
Use data (episodes) generated by the behavior policy $\pi'$ to learn $\hat Q \approx Q^\pi$ for the target policy $\pi$.

Naive Importance-Sampled TD(λ)
$\Delta\theta_t = \alpha \big[ R^\lambda_t - \theta^\top \phi_{s_t a_t} \big] \phi_{s_t a_t}\ \rho_1 \rho_2 \rho_3 \cdots \rho_{T-1}$
where $\rho_t = \dfrac{\pi(s_t, a_t)}{\pi'(s_t, a_t)}$ is the importance-sampling correction ratio for time $t$, and the product $\rho_1 \cdots \rho_{T-1}$ is the relative probability of the episode under $\pi$ and $\pi'$.
We expect this to have relatively high variance.

Per-Decision Importance-Sampled TD(λ)
$\Delta\theta_t = \alpha \big[ \bar R^\lambda_t - \theta^\top \phi_{s_t a_t} \big] \phi_{s_t a_t}\ \rho_1 \rho_2 \rho_3 \cdots \rho_t$
$\bar R^\lambda_t$ is like $R^\lambda_t$, except in terms of
$\bar R^{(n)}_t = \rho_{t+1} r_{t+1} + \gamma\, \rho_{t+1} \rho_{t+2}\, r_{t+2} + \cdots + \gamma^{n-1} \rho_{t+1} \cdots \rho_{t+n}\, r_{t+n} + \gamma^n\, \rho_{t+1} \cdots \rho_{t+n}\, \theta^\top \phi_{s_{t+n} a_{t+n}}$
so each reward and the bootstrap term carry only the corrections accumulated since time $t$.

Per-Decision Theorem (Precup, Sutton & Singh, 2000)
$E_{\pi'}\{\bar R_t \mid s_t, a_t\} = E_{\pi}\{R_t \mid s_t, a_t\}$

New Result for the Linear PD Algorithm (Precup, Sutton & Dasgupta, 2001)
$E_{\pi'}\{\Delta\theta \mid s_0, a_0\} = E_{\pi}\{\Delta\theta^{TD(\lambda)} \mid s_0, a_0\}$
The total change over an episode for the new algorithm equals, in expectation, the total change for conventional TD(λ).

Convergence Theorem
• Under natural assumptions
  – $S$ and $A$ are finite
  – All $(s,a)$ are visited under $\pi'$
  – $\pi$ and $\pi'$ are proper (terminate with probability 1)
  – Bounded rewards
  – The usual stochastic-approximation conditions on the step sizes $\alpha_k$
• And one annoying assumption:
  $\mathrm{var}\{\rho_1 \rho_2 \cdots \rho_{T-1}\} \le B \quad \forall s_1 \in S$
  (satisfied, e.g., by bounded episode length)
• Then the off-policy linear PD algorithm converges to the same $\theta_\infty$ as on-policy TD(λ)

The variance assumption is restrictive
But it can often be satisfied with "artificial" terminations.
• Consider a modified MDP with bounded episode length
  – We have data for this MDP
  – Our result assures good convergence for it
  – This solution can be made close to the solution of the original problem
  – By choosing the episode bound long relative to $\gamma$ or the mixing time
• Consider application to macro-actions
  – Here it is the macro-action that terminates
  – Termination is artificial; the real process is unaffected
  – Yet all results directly apply to learning about macro-actions
  – We can choose macro-action termination to satisfy the variance condition

Empirical Illustration
• The agent always starts at S; terminal states are marked G; actions are deterministic
• The behavior policy chooses up/down with probabilities 0.4/0.1
• The target policy chooses up/down with probabilities 0.1/0.4
• If the algorithm is successful, it should give positive weight to the rightmost feature and negative weight to the leftmost one

Trajectories of Two Components of $\theta$
[Figure: the components $\theta_{\text{rightmost,down}}$ and $\theta_{\text{leftmost,down}}$ over 500,000 episodes (λ = 0.9, α decreased), approaching their correct values $\theta^*_{\text{rightmost,down}}$ and $\theta^*_{\text{leftmost,down}}$.]
$\theta$ appears to converge as advertised.

Comparison of Naive and PD IS Algorithms
[Figure: root mean squared error after 100,000 episodes, averaged over 50 runs, as a function of $\log_2 \alpha$ from -12 to -17, with λ = 0.9 and constant α; the per-decision IS algorithm attains lower error than the naive IS algorithm. Precup, Sutton & Dasgupta, 2001.]

Can Weighted IS help the variance?
Return to the tabular case and consider two estimators, built from $R_i$, the $i$th return following $s,a$, and its importance-sampling correction product $w_i = \rho_{t+1} \rho_{t+2} \rho_{t+3} \cdots \rho_{T-1}$ (where $s,a$ occurs at time $t$):
$Q^{IS}_n(s,a) = \frac{1}{n} \sum_{i=1}^{n} R_i\, w_i$, which converges with finite variance iff the $w_i$ have finite variance;
$Q^{ISW}_n(s,a) = \frac{\sum_{i=1}^{n} R_i\, w_i}{\sum_{i=1}^{n} w_i}$, which converges with finite variance even if the $w_i$ have infinite variance.
Can this be extended to the FA case?
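To make the two estimators concrete, here is a minimal numerical sketch, assuming the returns $R_i$ and correction products $w_i$ have already been computed from recorded episodes; the numbers are invented for illustration, and this shows only the algebra of the estimators, not the full algorithm.

```python
import numpy as np

def is_estimators(returns, weights):
    """Ordinary (IS) and weighted (ISW) importance-sampling estimates of Q(s, a).

    returns : R_i, the i-th return observed after (s, a) under the behavior policy
    weights : w_i, the matching products of per-step ratios pi / pi'
    """
    returns = np.asarray(returns, dtype=float)
    weights = np.asarray(weights, dtype=float)
    q_is = np.mean(returns * weights)                    # unbiased; variance tied to the w_i
    q_isw = np.sum(returns * weights) / np.sum(weights)  # normalized; tolerates heavy-tailed w_i
    return q_is, q_isw

# Invented data: one episode with a large correction product dominates the ordinary estimate
print(is_estimators(returns=[1.0, 0.5, 2.0, 1.5], weights=[0.2, 0.1, 8.0, 0.4]))
```

The weighted estimator trades a little bias for bounded variance, which is exactly the property the slide asks whether we can carry over to the function-approximation case.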
Restarting within an Episode
• We can consider episodes to start at any time
• This alters the weighting of states,
  – But we still converge,
  – And to near the best answer (for the new weighting)

Incremental Implementation
The per-decision algorithm can be implemented incrementally, without storing the episode.
At the start of each episode: initialize the correction product $c_0$, the accumulator $g_0$, and the eligibility trace $e_0$.
On each step, from $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}$:
$\rho_{t+1} = \dfrac{\pi(s_{t+1}, a_{t+1})}{\pi'(s_{t+1}, a_{t+1})}$
$\delta_t = r_{t+1} + \gamma\, \rho_{t+1}\, \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t$
$\theta_{t+1} = \theta_t + \alpha\, \delta_t\, e_t$
$c_{t+1} = \rho_{t+1}\, c_t$
with $g_{t+1}$ and $e_{t+1}$ then updated from $\rho_{t+1}$, $c_{t+1}$, $e_t$, and $\phi_{t+1}$, for $0 \le t < T$.

Key Distinctions
• Control vs Prediction
• Bootstrapping/Truncation vs Full Returns
• Sampling vs Enumeration
• Function approximation vs Table lookup
• Off-policy vs On-policy
In each pair, the first is harder, more challenging and interesting; the second is easier and conceptually simpler.

Conclusions
• RL is beating the Curse of Dimensionality
  – FA and Sampling
• There is a broad frontier, with many open questions
• MDPs (States, Decisions, Goals, and Probability) are a rich area for mathematics and experimentation
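Finally, to make the correction products used throughout the off-policy algorithms above concrete: a small closing sketch that, for one recorded episode, computes the per-step ratios, the running per-decision products (the quantity maintained incrementally as $c_t$), and the single full-episode ratio used by the naive algorithm. The episode itself is invented; the 0.4/0.1 versus 0.1/0.4 action probabilities echo the empirical illustration.

```python
import numpy as np

def correction_products(target_probs, behavior_probs):
    """Per-step ratios rho_t = pi(s_t, a_t) / pi'(s_t, a_t) and their running products."""
    rho = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    running = np.cumprod(rho)        # c_t = rho_1 * rho_2 * ... * rho_t, maintained step by step
    return rho, running

# Invented episode of four corrected steps: the target policy prefers "down" (0.4)
# where the behavior policy prefers "up" (0.4), as in the empirical illustration.
rho, running = correction_products(target_probs=[0.1, 0.4, 0.1, 0.4],
                                   behavior_probs=[0.4, 0.1, 0.4, 0.1])
print("per-step ratios:      ", rho)          # corrections for individual decisions
print("per-decision products:", running)      # weights the per-decision return applies as it goes
print("full-episode ratio:   ", running[-1])  # the single weight the naive algorithm applies
```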