Reinforcement Learning: Dynamic Programming
Csaba Szepesvári, University of Alberta
Kioloa, MLSS'08
Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/

Reinforcement Learning
RL = "sampling-based methods to solve optimal control problems" (Rich Sutton)
Contents:
- Defining AI
- Markovian Decision Problems
- Dynamic Programming
- Approximate Dynamic Programming
- Generalizations

Literature
Books:
- Richard S. Sutton, Andrew G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998
- Dimitri P. Bertsekas, John Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, 1996
Journals: JMLR, MLJ, JAIR, AI
Conferences: NIPS, ICML, UAI, AAAI, COLT, ECML, IJCAI

Some More Books
- Martin L. Puterman: Markov Decision Processes, Wiley, 1994
- Dimitri P. Bertsekas: Dynamic Programming and Optimal Control, Athena Scientific, Vol. I (2005), Vol. II (2007)
- James S. Spall: Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, 2003

Resources
- RL-Glue: http://rlai.cs.ualberta.ca/RLBB/top.html
- RL-Library: http://rlai.cs.ualberta.ca/RLR/index.html
- The RL Toolbox 2.0: http://www.igi.tugraz.at/riltoolbox/general/overview.html
- OpenDP: http://opendp.sourceforge.net
- RL-Competition (2008): http://rl-competition.org/ (June 1st, 2008: test runs begin!)
Related fields:
- Operations research (MOR, OR)
- Control theory (IEEE TAC, Automatica, IEEE CDC, ECC)
- Simulation optimization (Winter Simulation Conference)

Abstract Control Model
[Figure: the perception-action loop. The controller (= agent) sends actions to the environment and receives sensations (and reward) in return.]

Zooming In
[Figure: inside the agent. External sensations and reward come in; memory, agent state, and internal sensations are maintained; actions go out.]

A Mathematical Model
Plant (controlled object):
- x_{t+1} = f(x_t, a_t, v_t), where x_t is the state and v_t is noise
- z_t = g(x_t, w_t), where z_t is the sensation/observation and w_t is noise
State: a sufficient statistic for the future,
- independently of what we measure ("objective state"), or
- relative to the measurements ("subjective state").
Controller: a_t = F(z_1, z_2, …, z_t), where a_t is the action (control)
=> perception-action loop, "closed-loop control"
Design problem: F = ?
Goal: Σ_{t=1}^T r(z_t, a_t) → max

A Classification of Controllers
- Feedforward: the sequence a_1, a_2, … is designed ahead of time (is this a good idea?)
- Feedback:
  - Purely reactive systems: a_t = F(z_t). Why is this bad?
  - Feedback with memory: m_t = M(m_{t-1}, z_t, a_{t-1}) ~ interpreting sensations; a_t = F(m_t) ~ decision making (deliberative vs. reactive)

Feedback Controllers
Plant: x_{t+1} = f(x_t, a_t, v_t), z_{t+1} = g(x_t, w_t)
Controller: m_t = M(m_{t-1}, z_t, a_{t-1}), a_t = F(m_t)
m_t ≈ x_t: state estimation, "filtering"; difficulties: noise, unmodelled parts.
How do we compute a_t?
- With a model (f'): model-based control (assumes some kind of state estimation).
- Without a model: model-free control.
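To make the closed-loop wiring concrete, here is a minimal Python sketch of the perception-action loop with a feedback-with-memory controller. The particular maps f, g, M, F below are toy linear stand-ins chosen for illustration and are not from the slides; only the flow x_t → z_t → m_t → a_t → x_{t+1} follows the definitions above.

```python
# A minimal sketch of the perception-action loop; f, g, M, F are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def f(x, a, v):       # plant dynamics: next state from state, action, noise
    return 0.9 * x + a + v

def g(x, w):          # sensor model: noisy observation of the state
    return x + w

def M(m, z, a_prev):  # controller memory update (~ state estimation / filtering)
    return 0.8 * m + 0.2 * z

def F(m):             # control law: action from the controller's memory
    return -0.5 * m

x, m, a = 1.0, 0.0, 0.0
for t in range(20):
    z = g(x, rng.normal(scale=0.1))      # sensation z_t
    m = M(m, z, a)                       # m_t = M(m_{t-1}, z_t, a_{t-1})
    a = F(m)                             # a_t = F(m_t)
    x = f(x, a, rng.normal(scale=0.1))   # x_{t+1} = f(x_t, a_t, v_t)
    print(f"t={t:2d}  z={z:+.3f}  a={a:+.3f}")
```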
Markovian Decision Problems

Markovian Decision Problems
An MDP is a tuple (X, A, p, r):
- X: set of states
- A: set of actions (controls)
- p: transition probabilities p(y|x,a)
- r: rewards r(x,a,y), or r(x,a), or r(x)
- γ: discount factor, 0 ≤ γ < 1

The Process View
(X_t, A_t, R_t):
- X_t: state at time t
- A_t: action at time t
- R_t: reward at time t
Laws:
- X_{t+1} ~ p(·|X_t, A_t)
- A_t ~ π(·|H_t), where π is the policy
- H_t = (X_t, A_{t-1}, R_{t-1}, …, A_1, R_1, X_0): the history
- R_t = r(X_t, A_t, X_{t+1})

The Control Problem
Value functions: V^π(x) = E^π[ Σ_{t=0}^∞ γ^t R_t | X_0 = x ]
Optimal value function: V*(x) = max_π V^π(x)
Optimal policy π*: V^{π*}(x) = V*(x)

Applications of MDPs
Fields: operations research, econometrics, control, statistics, games, AI.
Examples: optimal investments, replacement problems, option pricing, logistics and inventory management, active vision, production scheduling, dialogue control, bioreactor control, robotics (RoboCup soccer), driving, real-time load balancing, design of experiments (medical tests).

Variants of MDPs
- Discounted
- Undiscounted: stochastic shortest path
- Average reward
- Multiple criteria
- Minimax
- Games

MDP Problems
- Planning: the MDP (X, A, p, r, γ) is known. Find an optimal policy π*!
- Learning: the MDP is unknown, but you are allowed to interact with it. Find an optimal policy π*!
- Optimal learning: while interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning.

Solving MDPs – Dimensions
- Which problem? (planning, learning, optimal learning)
- Exact or approximate?
- Uses samples? Incremental?
- Uses value functions?
  - Yes: value-function based methods
    - Planning: DP, Random Discretization Method, FVI, …
    - Learning: Q-learning, actor-critic, …
  - No: policy search methods
    - Planning: Monte-Carlo tree search, likelihood-ratio methods (policy gradient), sample-path optimization (Pegasus), …
- Representation
  - Structured state: factored states, logical representation, …
  - Structured policy space: hierarchical methods

Dynamic Programming

Richard Bellman (1920-1984)
Control theory, systems analysis. Dynamic programming: RAND Corporation, 1949-1955. Bellman equation, Bellman-Ford algorithm, Hamilton-Jacobi-Bellman equation, "curse of dimensionality", invariant imbeddings, Grönwall-Bellman inequality.

Bellman Operators
Let π: X → A be a stationary policy and B(X) = { V | V: X → ℝ, ||V||_∞ < ∞ }.
T^π: B(X) → B(X), (T^π V)(x) = Σ_y p(y|x,π(x)) [ r(x,π(x),y) + γ V(y) ]
Theorem: T^π V^π = V^π.
Note: this is a linear system of equations: r^π + γ P^π V^π = V^π, hence V^π = (I - γ P^π)^{-1} r^π.

Proof of T^π V^π = V^π
What you need to know:
- Linearity of expectation: E[A + B] = E[A] + E[B].
- Law of total expectation: E[Z] = Σ_x P(X=x) E[Z | X=x], and E[Z | U=u] = Σ_x P(X=x | U=u) E[Z | U=u, X=x].
- Markov property: E[ f(X_1, X_2, …) | X_1=y, X_0=x ] = E[ f(X_1, X_2, …) | X_1=y ].
V^π(x) = E^π[ Σ_{t=0}^∞ γ^t R_t | X_0 = x ]
  = Σ_y P(X_1=y | X_0=x) E^π[ Σ_{t=0}^∞ γ^t R_t | X_0=x, X_1=y ]   (law of total expectation)
  = Σ_y p(y|x,π(x)) E^π[ Σ_{t=0}^∞ γ^t R_t | X_0=x, X_1=y ]   (since X_1 ~ p(·|X_0, π(X_0)))
  = Σ_y p(y|x,π(x)) { E^π[ R_0 | X_0=x, X_1=y ] + γ E^π[ Σ_{t=0}^∞ γ^t R_{t+1} | X_0=x, X_1=y ] }   (linearity of expectation)
  = Σ_y p(y|x,π(x)) { r(x,π(x),y) + γ V^π(y) }   (definition of r and V^π, Markov property)
  = (T^π V^π)(x).   (definition of T^π)
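As a sanity check on the identity V^π = (I - γ P^π)^{-1} r^π, here is a small NumPy sketch that evaluates a fixed policy both ways: by solving the linear system directly and by iterating T^π. The 3-state, 2-action MDP and the policy are arbitrary toy choices, not taken from the slides.

```python
# Policy evaluation for a small finite MDP: linear solve vs. iterating T^pi.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# p[a, x, y] = p(y | x, a); r[a, x, y] = r(x, a, y)   (random toy model)
p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_actions, n_states, n_states))

pi = np.array([0, 1, 0])                      # a deterministic stationary policy

P_pi = p[pi, np.arange(n_states), :]          # P_pi[x, y] = p(y | x, pi(x))
r_pi = (P_pi * r[pi, np.arange(n_states), :]).sum(axis=1)  # expected one-step reward

# Direct solve of V^pi = (I - gamma * P_pi)^{-1} r_pi
V_direct = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Iterating the policy evaluation operator T^pi (a gamma-contraction)
V = np.zeros(n_states)
for _ in range(200):
    V = r_pi + gamma * P_pi @ V

print(V_direct)
print(V)   # agrees with V_direct up to the O(gamma^k) iteration error
```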
The Banach Fixed-Point Theorem
B = (B, ||·||) is a Banach space. T: B_1 → B_2 is L-Lipschitz (L > 0) if for any U, V, ||T U - T V|| ≤ L ||U - V||. T is a contraction if B_1 = B_2 and L < 1; L is then a contraction coefficient of T.
Theorem [Banach]: Let T: B → B be a γ-contraction. Then T has a unique fixed point V and, for all V_0 ∈ B, the iterates V_{k+1} = T V_k satisfy V_k → V with ||V_k - V|| = O(γ^k).

An Algebra for Contractions
- Prop: If T_1: B_1 → B_2 is L_1-Lipschitz and T_2: B_2 → B_3 is L_2-Lipschitz, then T_2 T_1 is L_1 L_2-Lipschitz.
- Def: If T is 1-Lipschitz, T is called a non-expansion.
- Prop: M: B(X×A) → B(X), (M Q)(x) = max_a Q(x,a) is a non-expansion.
- Prop: Mul_c: B → B, Mul_c V = c V is |c|-Lipschitz.
- Prop: Add_r: B → B, Add_r V = r + V is a non-expansion.
- Prop: K: B(X) → B(X), (K V)(x) = Σ_y K(x,y) V(y) is a non-expansion if K(x,y) ≥ 0 and Σ_y K(x,y) = 1.

Policy Evaluations are Contractions
Def: ||V||_∞ = max_x |V(x)|, the supremum norm; written ||·|| below.
Theorem: Let T^π be the policy evaluation operator of some policy π. Then T^π is a γ-contraction.
Corollary: V^π is the unique fixed point of T^π; V_{k+1} = T^π V_k → V^π for all V_0 ∈ B(X), and ||V_k - V^π|| = O(γ^k).

The Bellman Optimality Operator
Let T: B(X) → B(X) be defined by (T V)(x) = max_a Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }.
Def: π is greedy w.r.t. V if T^π V = T V.
Prop: T is a γ-contraction.
Theorem (Bellman optimality equation): T V* = V*.
Proof: Let V be the unique fixed point of T. For any policy π, T^π V ≤ T V = V, so V^π ≤ V, hence V* ≤ V. Now let π be greedy w.r.t. V; then T^π V = T V = V, so V^π = V, hence V ≤ V*. Therefore V = V*, i.e., T V* = V*.

Value Iteration
Theorem: For any V_0 ∈ B(X), the iterates V_{k+1} = T V_k satisfy V_k → V*, and in particular ||V_k - V*|| = O(γ^k).
What happens when we stop "early"?
Theorem: Let π be greedy w.r.t. V. Then ||V^π - V*|| ≤ 2 ||T V - V|| / (1 - γ).
Proof: ||V^π - V*|| ≤ ||V^π - V|| + ||V - V*|| …
Corollary: In a finite MDP the number of policies is finite, so we can stop when ||V_k - T V_k|| ≤ Δ (1 - γ)/2, where Δ = min{ ||V* - V^π|| : V^π ≠ V* }; the greedy policy is then optimal.
Pseudo-polynomial complexity.

Policy Improvement [Howard '60]
Def: For U, V ∈ B(X), V ≥ U if V(x) ≥ U(x) for all x ∈ X.
Def: For U, V ∈ B(X), V > U if V ≥ U and there exists x ∈ X such that V(x) > U(x).
Theorem (policy improvement): Let π' be greedy w.r.t. V^π. Then V^{π'} ≥ V^π. If, moreover, T V^π > V^π, then V^{π'} > V^π.

Policy Iteration
PolicyIteration(π):
  V ← V^π
  do {improvement}
    V' ← V
    let π be such that T^π V = T V   {π greedy w.r.t. V}
    V ← V^π
  while (V > V')
  return π

Policy Iteration Theorem
Theorem: In a finite, discounted MDP, policy iteration stops after a finite number of steps and returns an optimal policy.
Proof: Follows from the policy improvement theorem: each iteration strictly improves the value until no improvement is possible, and there are only finitely many policies.

Linear Programming
If V ≥ T V then V ≥ V* = T V*. Hence V* is the smallest V that satisfies V ≥ T V.
V ≥ T V  ⇔  (*)  V(x) ≥ Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }  for all x, a.
LinProg(V): minimize Σ_x V(x) subject to (*).
Theorem: LinProg returns the optimal value function V*.
Corollary: pseudo-polynomial complexity.

Variations of a Theme

Approximate Value Iteration
AVI: V_{k+1} = T V_k + ε_k.
AVI Theorem: Let ε = max_k ||ε_k||. Then limsup_{k→∞} ||V_k - V*|| ≤ ε / (1 - γ).
Proof: Let a_k = ||V_k - V*||. Then a_{k+1} = ||V_{k+1} - V*|| = ||T V_k - T V* + ε_k|| ≤ γ ||V_k - V*|| + ε = γ a_k + ε. Hence (a_k) is bounded. Taking limsup of both sides: a ≤ γ a + ε; reorder. (See, e.g., [BT96].)

Fitted Value Iteration – Non-expansion Operators
FVI: Let A be a non-expansion and iterate V_{k+1} = A T V_k. Where does this converge to?
Theorem: Let U, V be such that A T U = U and T V = V. Then ||V - U|| ≤ ||A V - V|| / (1 - γ).
Proof: Let U' be the fixed point of T A. Then ||U' - V|| ≤ γ ||A V - V|| / (1 - γ). Since A U' = A T (A U'), A U' is a fixed point of A T, so U = A U'. Hence ||U - V|| = ||A U' - V|| ≤ ||A U' - A V|| + ||A V - V|| … [Gordon '95]

Application to Aggregation
Let Π be a partition of X and let S(x) denote the unique cell that x belongs to. Let A: B(X) → B(X) be (A V)(x) = Σ_z μ(z; S(x)) V(z), where μ(·; S(x)) is a distribution over S(x).
Define the aggregated model by
  p'(C|B,a) = Σ_{x ∈ B} μ(x; B) Σ_{y ∈ C} p(y|x,a),
  r'(B,a,C) = Σ_{x ∈ B} μ(x; B) Σ_{y ∈ C} p(y|x,a) r(x,a,y).
Theorem: Consider the aggregated MDP (Π, A, p', r') and let V' be its optimal value function. Let V'_E(x) = V'(S(x)). Then ||V'_E - V*|| ≤ ||A V* - V*|| / (1 - γ).
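Before moving on to action-value functions, here is a compact NumPy sketch of value iteration with the early-stopping idea discussed above: iterate V_{k+1} = T V_k until ||V_k - T V_k|| is small, then act greedily. The random MDP and the threshold eps are illustrative assumptions (in practice the gap Δ in the stopping rule is unknown, so a small fixed tolerance is used instead).

```python
# Value iteration on a random toy MDP with an early-stopping tolerance.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)

p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)             # p[a, x, y] = p(y | x, a)
r = rng.random((n_actions, n_states, n_states))

def bellman(V):
    """(T V)(x) = max_a sum_y p(y|x,a)(r(x,a,y) + gamma V(y)); also return the argmax."""
    q = (p * (r + gamma * V[None, None, :])).sum(axis=2)   # q[a, x]
    return q.max(axis=0), q.argmax(axis=0)

V = np.zeros(n_states)
eps = 1e-8                                    # stand-in for Delta * (1 - gamma) / 2
while True:
    TV, greedy = bellman(V)
    if np.max(np.abs(TV - V)) <= eps:         # stop when ||V_k - T V_k|| is small
        break
    V = TV

print("V  ~", V)                              # close to V*
print("pi ~", greedy)                         # greedy policy w.r.t. the final V
```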
Action-Value Functions
Define L: B(X) → B(X×A) by (L V)(x,a) = Σ_y p(y|x,a) { r(x,a,y) + γ V(y) } ("one-step lookahead").
Note: π is greedy w.r.t. V if (L V)(x, π(x)) = max_a (L V)(x,a).
Def: Q* = L V*.
Def: Let Max: B(X×A) → B(X), (Max Q)(x) = max_a Q(x,a). Note: Max L = T.
Corollary: Q* = L Max Q*.
Proof: Q* = L V* = L T V* = L Max L V* = L Max Q*.
The operator L Max: B(X×A) → B(X×A) is a γ-contraction, so value iteration, policy iteration, etc. carry over to action-value functions.

Changing Granularity
Asynchronous value iteration: at every time step, update only a few states.
AsyncVI Theorem: If all states are updated infinitely often, the algorithm converges to V*.
How to use this? Prioritized sweeping.
IPS [McMahan & Gordon '05]: instead of performing an update, put the state on a priority queue; when a state is picked from the queue, update it and put its predecessors on the queue.
Theorem: Equivalent to Dijkstra's algorithm on shortest path problems, provided that the rewards are non-positive.
LRTA* [Korf '90] ~ RTDP [Barto, Bradtke & Singh '95]: focusing on the parts of the state space that matter.
Constraints: the same problem is solved from several initial positions; decisions have to be fast.
Idea: update the values along the paths taken.

Changing Granularity (continued)
Generalized policy iteration: partial evaluation and partial improvement of policies; multi-step lookahead improvement.
AsyncPI Theorem: If both evaluation and improvement happen at every state infinitely often, then the process converges to an optimal policy. [Williams & Baird '93]

Variations of a Theme [SzeLi99]
- Game against nature [Heger '94]: inf_w Σ_t γ^t R_t(w), with X_0 = x.
- Risk-sensitive criterion: log E[ exp(Σ_t γ^t R_t) | X_0 = x ].
- Stochastic shortest path.
- Average reward.
- Markov games: simultaneous action choices (rock-paper-scissors), sequential action choices, zero-sum (or not).

References
[Howard '60] R.A. Howard: Dynamic Programming and Markov Processes, The MIT Press, Cambridge, MA, 1960.
[Gordon '95] G.J. Gordon: Stable function approximation in dynamic programming. ICML, pp. 261–268, 1995.
[Watkins '90] C.J.C.H. Watkins: Learning from Delayed Rewards, PhD thesis, 1990.
[McMahan, Gordon '05] H.B. McMahan and G.J. Gordon: Fast exact planning in Markov decision processes. ICAPS, 2005.
[Korf '90] R. Korf: Real-time heuristic search. Artificial Intelligence 42, 189–211, 1990.
[Barto, Bradtke & Singh '95] A.G. Barto, S.J. Bradtke and S. Singh: Learning to act using real-time dynamic programming. Artificial Intelligence 72, 81–138, 1995.
[Williams & Baird '93] R.J. Williams and L.C. Baird: Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions. Northeastern University Technical Report NU-CCS-9314, November 1993.
[SzeLi99] Cs. Szepesvári and M.L. Littman: A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation 11, 2017–2059, 1999.
[Heger '94] M. Heger: Consideration of risk in reinforcement learning. ICML, pp. 105–111, 1994.