Reinforcement Learning: Dynamic Programming
Csaba Szepesvári, University of Alberta
Kioloa, MLSS'08
Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/

Reinforcement Learning
RL = "sampling-based methods to solve optimal control problems" (Rich Sutton)
Contents:
- Defining AI
- Markovian Decision Problems
- Dynamic Programming
- Approximate Dynamic Programming
- Generalizations

Literature
Books:
- Richard S. Sutton, Andrew G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998
- Dimitri P. Bertsekas, John Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, 1996
Journals: JMLR, MLJ, JAIR, AI
Conferences: NIPS, ICML, UAI, AAAI, COLT, ECML, IJCAI

Some More Books
- Martin L. Puterman: Markov Decision Processes, Wiley, 1994
- Dimitri P. Bertsekas: Dynamic Programming and Optimal Control, Athena Scientific, Vol. I (2005), Vol. II (2007)
- James S. Spall: Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, 2003

Resources
- RL-Glue: http://rlai.cs.ualberta.ca/RLBB/top.html
- RL-Library: http://rlai.cs.ualberta.ca/RLR/index.html
- The RL Toolbox 2.0: http://www.igi.tugraz.at/riltoolbox/general/overview.html
- OpenDP: http://opendp.sourceforge.net
- RL-Competition (2008): http://rl-competition.org/ (June 1st, 2008: test runs begin!)
Related fields:
- Operations research (MOR, OR)
- Control theory (IEEE TAC, Automatica, IEEE CDC, ECC)
- Simulation optimization (Winter Simulation Conference)

Abstract Control Model
[Figure: the perception-action loop. The controller (= agent) sends actions to the environment and receives sensations (and reward) in return.]

Zooming In
[Figure: inside the agent. External sensations and reward come in; memory, agent state, and internal sensations are maintained; actions go out.]

A Mathematical Model
Plant (controlled object):
- x_{t+1} = f(x_t, a_t, v_t), where x_t is the state and v_t is noise
- z_t = g(x_t, w_t), where z_t is the sensation/observation and w_t is noise
State: a sufficient statistic for the future,
- independently of what we measure ("objective state"), or
- relative to the measurements ("subjective state").
Controller: a_t = F(z_1, z_2, …, z_t), where a_t is the action (control)
=> perception-action loop, "closed-loop control"
Design problem: F = ?
Goal: Σ_{t=1}^T r(z_t, a_t) → max

A Classification of Controllers
- Feedforward: the sequence a_1, a_2, … is designed ahead of time (is this a good idea?)
- Feedback:
  - Purely reactive systems: a_t = F(z_t). Why is this bad?
  - Feedback with memory: m_t = M(m_{t-1}, z_t, a_{t-1}) ~ interpreting sensations; a_t = F(m_t) ~ decision making (deliberative vs. reactive)

Feedback Controllers
Plant: x_{t+1} = f(x_t, a_t, v_t), z_{t+1} = g(x_t, w_t)
Controller: m_t = M(m_{t-1}, z_t, a_{t-1}), a_t = F(m_t)
m_t ≈ x_t: state estimation, "filtering"; difficulties: noise, unmodelled parts.
How do we compute a_t?
- With a model (f'): model-based control (assumes some kind of state estimation).
- Without a model: model-free control.
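To make the closed-loop wiring concrete, here is a minimal Python sketch of the perception-action loop with a feedback-with-memory controller. The particular maps f, g, M, F below are toy linear stand-ins chosen for illustration and are not from the slides; only the flow x_t → z_t → m_t → a_t → x_{t+1} follows the definitions above.

```python
# A minimal sketch of the perception-action loop; f, g, M, F are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def f(x, a, v):       # plant dynamics: next state from state, action, noise
    return 0.9 * x + a + v

def g(x, w):          # sensor model: noisy observation of the state
    return x + w

def M(m, z, a_prev):  # controller memory update (~ state estimation / filtering)
    return 0.8 * m + 0.2 * z

def F(m):             # control law: action from the controller's memory
    return -0.5 * m

x, m, a = 1.0, 0.0, 0.0
for t in range(20):
    z = g(x, rng.normal(scale=0.1))      # sensation z_t
    m = M(m, z, a)                       # m_t = M(m_{t-1}, z_t, a_{t-1})
    a = F(m)                             # a_t = F(m_t)
    x = f(x, a, rng.normal(scale=0.1))   # x_{t+1} = f(x_t, a_t, v_t)
    print(f"t={t:2d}  z={z:+.3f}  a={a:+.3f}")
```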
Markovian Decision Problems

Markovian Decision Problems
An MDP is a tuple (X, A, p, r):
- X: set of states
- A: set of actions (controls)
- p: transition probabilities p(y|x,a)
- r: rewards r(x,a,y), or r(x,a), or r(x)
- γ: discount factor, 0 ≤ γ < 1

The Process View
(X_t, A_t, R_t):
- X_t: state at time t
- A_t: action at time t
- R_t: reward at time t
Laws:
- X_{t+1} ~ p(·|X_t, A_t)
- A_t ~ π(·|H_t), where π is the policy
- H_t = (X_t, A_{t-1}, R_{t-1}, …, A_1, R_1, X_0): the history
- R_t = r(X_t, A_t, X_{t+1})

The Control Problem
Value functions: V^π(x) = E^π[ Σ_{t=0}^∞ γ^t R_t | X_0 = x ]
Optimal value function: V*(x) = max_π V^π(x)
Optimal policy π*: V^{π*}(x) = V*(x)

Applications of MDPs
Fields: operations research, econometrics, control, statistics, games, AI.
Examples: optimal investments, replacement problems, option pricing, logistics and inventory management, active vision, production scheduling, dialogue control, bioreactor control, robotics (RoboCup soccer), driving, real-time load balancing, design of experiments (medical tests).

Variants of MDPs
- Discounted
- Undiscounted: stochastic shortest path
- Average reward
- Multiple criteria
- Minimax
- Games

MDP Problems
- Planning: the MDP (X, A, p, r, γ) is known. Find an optimal policy π*!
- Learning: the MDP is unknown, but you are allowed to interact with it. Find an optimal policy π*!
- Optimal learning: while interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning.

Solving MDPs – Dimensions
- Which problem? (planning, learning, optimal learning)
- Exact or approximate?
- Uses samples? Incremental?
- Uses value functions?
  - Yes: value-function based methods
    - Planning: DP, Random Discretization Method, FVI, …
    - Learning: Q-learning, actor-critic, …
  - No: policy search methods
    - Planning: Monte-Carlo tree search, likelihood-ratio methods (policy gradient), sample-path optimization (Pegasus), …
- Representation
  - Structured state: factored states, logical representation, …
  - Structured policy space: hierarchical methods

Dynamic Programming

Richard Bellman (1920-1984)
Control theory, systems analysis. Dynamic programming: RAND Corporation, 1949-1955. Bellman equation, Bellman-Ford algorithm, Hamilton-Jacobi-Bellman equation, "curse of dimensionality", invariant imbeddings, Grönwall-Bellman inequality.

Bellman Operators
Let π: X → A be a stationary policy and B(X) = { V | V: X → ℝ, ||V||_∞ < ∞ }.
T^π: B(X) → B(X), (T^π V)(x) = Σ_y p(y|x,π(x)) [ r(x,π(x),y) + γ V(y) ]
Theorem: T^π V^π = V^π.
Note: this is a linear system of equations: r^π + γ P^π V^π = V^π, hence V^π = (I - γ P^π)^{-1} r^π.

Proof of T^π V^π = V^π
What you need to know:
- Linearity of expectation: E[A + B] = E[A] + E[B].
- Law of total expectation: E[Z] = Σ_x P(X=x) E[Z | X=x], and E[Z | U=u] = Σ_x P(X=x | U=u) E[Z | U=u, X=x].
- Markov property: E[ f(X_1, X_2, …) | X_1=y, X_0=x ] = E[ f(X_1, X_2, …) | X_1=y ].
V^π(x) = E^π[ Σ_{t=0}^∞ γ^t R_t | X_0 = x ]
  = Σ_y P(X_1=y | X_0=x) E^π[ Σ_{t=0}^∞ γ^t R_t | X_0=x, X_1=y ]   (law of total expectation)
  = Σ_y p(y|x,π(x)) E^π[ Σ_{t=0}^∞ γ^t R_t | X_0=x, X_1=y ]   (since X_1 ~ p(·|X_0, π(X_0)))
  = Σ_y p(y|x,π(x)) { E^π[ R_0 | X_0=x, X_1=y ] + γ E^π[ Σ_{t=0}^∞ γ^t R_{t+1} | X_0=x, X_1=y ] }   (linearity of expectation)
  = Σ_y p(y|x,π(x)) { r(x,π(x),y) + γ V^π(y) }   (definition of r and V^π, Markov property)
  = (T^π V^π)(x).   (definition of T^π)
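As a sanity check on the identity V^π = (I - γ P^π)^{-1} r^π, here is a small NumPy sketch that evaluates a fixed policy both ways: by solving the linear system directly and by iterating T^π. The 3-state, 2-action MDP and the policy are arbitrary toy choices, not taken from the slides.

```python
# Policy evaluation for a small finite MDP: linear solve vs. iterating T^pi.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# p[a, x, y] = p(y | x, a); r[a, x, y] = r(x, a, y)   (random toy model)
p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_actions, n_states, n_states))

pi = np.array([0, 1, 0])                      # a deterministic stationary policy

P_pi = p[pi, np.arange(n_states), :]          # P_pi[x, y] = p(y | x, pi(x))
r_pi = (P_pi * r[pi, np.arange(n_states), :]).sum(axis=1)  # expected one-step reward

# Direct solve of V^pi = (I - gamma * P_pi)^{-1} r_pi
V_direct = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Iterating the policy evaluation operator T^pi (a gamma-contraction)
V = np.zeros(n_states)
for _ in range(200):
    V = r_pi + gamma * P_pi @ V

print(V_direct)
print(V)   # agrees with V_direct up to the O(gamma^k) iteration error
```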
The Banach Fixed-Point Theorem
B = (B, ||·||) is a Banach space. T: B_1 → B_2 is L-Lipschitz (L > 0) if for any U, V, ||T U - T V|| ≤ L ||U - V||. T is a contraction if B_1 = B_2 and L < 1; L is then a contraction coefficient of T.
Theorem [Banach]: Let T: B → B be a γ-contraction. Then T has a unique fixed point V and, for all V_0 ∈ B, the iterates V_{k+1} = T V_k satisfy V_k → V with ||V_k - V|| = O(γ^k).

An Algebra for Contractions
- Prop: If T_1: B_1 → B_2 is L_1-Lipschitz and T_2: B_2 → B_3 is L_2-Lipschitz, then T_2 T_1 is L_1 L_2-Lipschitz.
- Def: If T is 1-Lipschitz, T is called a non-expansion.
- Prop: M: B(X×A) → B(X), (M Q)(x) = max_a Q(x,a) is a non-expansion.
- Prop: Mul_c: B → B, Mul_c V = c V is |c|-Lipschitz.
- Prop: Add_r: B → B, Add_r V = r + V is a non-expansion.
- Prop: K: B(X) → B(X), (K V)(x) = Σ_y K(x,y) V(y) is a non-expansion if K(x,y) ≥ 0 and Σ_y K(x,y) = 1.

Policy Evaluations are Contractions
Def: ||V||_∞ = max_x |V(x)|, the supremum norm; written ||·|| below.
Theorem: Let T^π be the policy evaluation operator of some policy π. Then T^π is a γ-contraction.
Corollary: V^π is the unique fixed point of T^π; V_{k+1} = T^π V_k → V^π for all V_0 ∈ B(X), and ||V_k - V^π|| = O(γ^k).

The Bellman Optimality Operator
Let T: B(X) → B(X) be defined by (T V)(x) = max_a Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }.
Def: π is greedy w.r.t. V if T^π V = T V.
Prop: T is a γ-contraction.
Theorem (Bellman optimality equation): T V* = V*.
Proof: Let V be the unique fixed point of T. For any policy π, T^π V ≤ T V = V, so V^π ≤ V, hence V* ≤ V. Now let π be greedy w.r.t. V; then T^π V = T V = V, so V^π = V, hence V ≤ V*. Therefore V = V*, i.e., T V* = V*.

Value Iteration
Theorem: For any V_0 ∈ B(X), the iterates V_{k+1} = T V_k satisfy V_k → V*, and in particular ||V_k - V*|| = O(γ^k).
What happens when we stop "early"?
Theorem: Let π be greedy w.r.t. V. Then ||V^π - V*|| ≤ 2 ||T V - V|| / (1 - γ).
Proof: ||V^π - V*|| ≤ ||V^π - V|| + ||V - V*|| …
Corollary: In a finite MDP the number of policies is finite, so we can stop when ||V_k - T V_k|| ≤ Δ (1 - γ)/2, where Δ = min{ ||V* - V^π|| : V^π ≠ V* }; the greedy policy is then optimal.
Pseudo-polynomial complexity.

Policy Improvement [Howard '60]
Def: For U, V ∈ B(X), V ≥ U if V(x) ≥ U(x) for all x ∈ X.
Def: For U, V ∈ B(X), V > U if V ≥ U and there exists x ∈ X such that V(x) > U(x).
Theorem (policy improvement): Let π' be greedy w.r.t. V^π. Then V^{π'} ≥ V^π. If, moreover, T V^π > V^π, then V^{π'} > V^π.

Policy Iteration
PolicyIteration(π):
  V ← V^π
  do {improvement}
    V' ← V
    let π be such that T^π V = T V   {π greedy w.r.t. V}
    V ← V^π
  while (V > V')
  return π

Policy Iteration Theorem
Theorem: In a finite, discounted MDP, policy iteration stops after a finite number of steps and returns an optimal policy.
Proof: Follows from the policy improvement theorem: each iteration strictly improves the value until no improvement is possible, and there are only finitely many policies.

Linear Programming
If V ≥ T V then V ≥ V* = T V*. Hence V* is the smallest V that satisfies V ≥ T V.
V ≥ T V  ⇔  (*)  V(x) ≥ Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }  for all x, a.
LinProg(V): minimize Σ_x V(x) subject to (*).
Theorem: LinProg returns the optimal value function V*.
Corollary: pseudo-polynomial complexity.

Variations of a Theme

Approximate Value Iteration
AVI: V_{k+1} = T V_k + ε_k.
AVI Theorem: Let ε = max_k ||ε_k||. Then limsup_{k→∞} ||V_k - V*|| ≤ ε / (1 - γ).
Proof: Let a_k = ||V_k - V*||. Then a_{k+1} = ||V_{k+1} - V*|| = ||T V_k - T V* + ε_k|| ≤ γ ||V_k - V*|| + ε = γ a_k + ε. Hence (a_k) is bounded. Taking limsup of both sides: a ≤ γ a + ε; reorder. (See, e.g., [BT96].)

Fitted Value Iteration – Non-expansion Operators
FVI: Let A be a non-expansion and iterate V_{k+1} = A T V_k. Where does this converge to?
Theorem: Let U, V be such that A T U = U and T V = V. Then ||V - U|| ≤ ||A V - V|| / (1 - γ).
Proof: Let U' be the fixed point of T A. Then ||U' - V|| ≤ γ ||A V - V|| / (1 - γ). Since A U' = A T (A U'), A U' is a fixed point of A T, so U = A U'. Hence ||U - V|| = ||A U' - V|| ≤ ||A U' - A V|| + ||A V - V|| … [Gordon '95]

Application to Aggregation
Let Π be a partition of X and let S(x) denote the unique cell that x belongs to. Let A: B(X) → B(X) be (A V)(x) = Σ_z μ(z; S(x)) V(z), where μ(·; S(x)) is a distribution over S(x).
Define the aggregated model by
  p'(C|B,a) = Σ_{x ∈ B} μ(x; B) Σ_{y ∈ C} p(y|x,a),
  r'(B,a,C) = Σ_{x ∈ B} μ(x; B) Σ_{y ∈ C} p(y|x,a) r(x,a,y).
Theorem: Consider the aggregated MDP (Π, A, p', r') and let V' be its optimal value function. Let V'_E(x) = V'(S(x)). Then ||V'_E - V*|| ≤ ||A V* - V*|| / (1 - γ).
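Before moving on to action-value functions, here is a compact NumPy sketch of value iteration with the early-stopping idea discussed above: iterate V_{k+1} = T V_k until ||V_k - T V_k|| is small, then act greedily. The random MDP and the threshold eps are illustrative assumptions (in practice the gap Δ in the stopping rule is unknown, so a small fixed tolerance is used instead).

```python
# Value iteration on a random toy MDP with an early-stopping tolerance.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)

p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)             # p[a, x, y] = p(y | x, a)
r = rng.random((n_actions, n_states, n_states))

def bellman(V):
    """(T V)(x) = max_a sum_y p(y|x,a)(r(x,a,y) + gamma V(y)); also return the argmax."""
    q = (p * (r + gamma * V[None, None, :])).sum(axis=2)   # q[a, x]
    return q.max(axis=0), q.argmax(axis=0)

V = np.zeros(n_states)
eps = 1e-8                                    # stand-in for Delta * (1 - gamma) / 2
while True:
    TV, greedy = bellman(V)
    if np.max(np.abs(TV - V)) <= eps:         # stop when ||V_k - T V_k|| is small
        break
    V = TV

print("V  ~", V)                              # close to V*
print("pi ~", greedy)                         # greedy policy w.r.t. the final V
```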
Action-Value Functions
Define L: B(X) → B(X×A) by (L V)(x,a) = Σ_y p(y|x,a) { r(x,a,y) + γ V(y) } ("one-step lookahead").
Note: π is greedy w.r.t. V if (L V)(x, π(x)) = max_a (L V)(x,a).
Def: Q* = L V*.
Def: Let Max: B(X×A) → B(X), (Max Q)(x) = max_a Q(x,a). Note: Max L = T.
Corollary: Q* = L Max Q*.
Proof: Q* = L V* = L T V* = L Max L V* = L Max Q*.
The operator L Max: B(X×A) → B(X×A) is a γ-contraction, so value iteration, policy iteration, etc. carry over to action-value functions.

Changing Granularity
Asynchronous value iteration: at every time step, update only a few states.
AsyncVI Theorem: If all states are updated infinitely often, the algorithm converges to V*.
How to use this? Prioritized sweeping.
IPS [McMahan & Gordon '05]: instead of performing an update, put the state on a priority queue; when a state is picked from the queue, update it and put its predecessors on the queue.
Theorem: Equivalent to Dijkstra's algorithm on shortest path problems, provided that the rewards are non-positive.
LRTA* [Korf '90] ~ RTDP [Barto, Bradtke & Singh '95]: focusing on the parts of the state space that matter.
Constraints: the same problem is solved from several initial positions; decisions have to be fast.
Idea: update the values along the paths taken.

Changing Granularity (continued)
Generalized policy iteration: partial evaluation and partial improvement of policies; multi-step lookahead improvement.
AsyncPI Theorem: If both evaluation and improvement happen at every state infinitely often, then the process converges to an optimal policy. [Williams & Baird '93]

Variations of a Theme [SzeLi99]
- Game against nature [Heger '94]: inf_w Σ_t γ^t R_t(w), with X_0 = x.
- Risk-sensitive criterion: log E[ exp(Σ_t γ^t R_t) | X_0 = x ].
- Stochastic shortest path.
- Average reward.
- Markov games: simultaneous action choices (rock-paper-scissors), sequential action choices, zero-sum (or not).

References
[Howard '60] R.A. Howard: Dynamic Programming and Markov Processes, The MIT Press, Cambridge, MA, 1960.
[Gordon '95] G.J. Gordon: Stable function approximation in dynamic programming. ICML, pp. 261–268, 1995.
[Watkins '90] C.J.C.H. Watkins: Learning from Delayed Rewards, PhD thesis, 1990.
[McMahan, Gordon '05] H.B. McMahan and G.J. Gordon: Fast exact planning in Markov decision processes. ICAPS, 2005.
[Korf '90] R. Korf: Real-time heuristic search. Artificial Intelligence 42, 189–211, 1990.
[Barto, Bradtke & Singh '95] A.G. Barto, S.J. Bradtke and S. Singh: Learning to act using real-time dynamic programming. Artificial Intelligence 72, 81–138, 1995.
[Williams & Baird '93] R.J. Williams and L.C. Baird: Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions. Northeastern University Technical Report NU-CCS-9314, November 1993.
[SzeLi99] Cs. Szepesvári and M.L. Littman: A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation 11, 2017–2059, 1999.
[Heger '94] M. Heger: Consideration of risk in reinforcement learning. ICML, pp. 105–111, 1994.