Introduction to AI — Stochastic Planning: Markov Decision Processes
Erez Karpas (slides by Carmel Domshlak), IE&M, Technion

Outline
- Markov Decision Processes
- Finite Horizon
- Infinite Horizon

From classical deterministic planning to stochastic (decision-theoretic) planning
Planning problems differ along several dimensions:
- dynamics: deterministic, nondeterministic, or probabilistic
- observability: full, partial, or none
- horizon: finite or infinite
- ...
This gives a spectrum of planning problems:
1. classical planning
2. conformant planning
3. conditional planning with full observability
4. conditional planning with partial observability
5. Markov decision processes (MDPs)
6. partially observable MDPs (POMDPs)

Markov Decision Processes: Transition systems

Definition (MDP). An MDP is a tuple ⟨S, A, R, T⟩ where
- S is a finite set of states (|S| = n)
- A is a finite set of actions (|A| = m)
- T is a transition function Tr : S × A × S → [0, 1] with Tr(s, a, s') = P(s' | s, a),
  the probability of going to state s' after taking action a in state s.
  How many parameters does it take to represent? m · n².
- R is a bounded, real-valued reward function R : S → [0, r_max];
  R(s) is the immediate reward we get for reaching state s.

Objective (informally)
- Select actions in order to maximize total reward.
- Example: in a goal-based domain, R(s) = 1 if s is a goal state, and R(s) = 0 otherwise.
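As a purely illustrative companion to this definition, the sketch below stores a tabular MDP in Python. The class name TabularMDP and its fields are assumptions for illustration, not notation from the slides; the transition function is an n × m × n array, which is exactly the m · n² parameters noted above, and the reward function is a length-n vector.

```python
# Minimal sketch of a tabular MDP <S, A, R, T>; names are illustrative, not from the slides.
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularMDP:
    T: np.ndarray  # shape (n, m, n); T[s, a, s2] = P(s2 | s, a)
    R: np.ndarray  # shape (n,);      R[s] = immediate reward for reaching s

    def __post_init__(self):
        n, m, n2 = self.T.shape
        assert n == n2 and self.R.shape == (n,)
        # every (s, a) slice must be a probability distribution over next states
        assert np.allclose(self.T.sum(axis=2), 1.0)
        self.n_states, self.n_actions = n, m  # |S| = n, |A| = m

# Example: a 2-state, 2-action goal-based domain where state 1 is the goal.
T = np.array([[[0.8, 0.2], [0.5, 0.5]],
              [[0.0, 1.0], [0.0, 1.0]]])   # the goal state is absorbing
R = np.array([0.0, 1.0])
mdp = TabularMDP(T, R)
```

The check on T.sum(axis=2) verifies that each (s, a) pair defines a probability distribution over successor states.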
Assumptions
- Markovian dynamics (history independence): the next state only depends on the current state and action,
  P(S^{t+1} | A^t, S^t, A^{t−1}, S^{t−1}, ..., S^0) = P(S^{t+1} | A^t, S^t).
- Markovian reward process: the reward only depends on the current state,
  P(R^t | A^t, S^t, A^{t−1}, S^{t−1}, ..., S^0) = P(R^t | S^t).
  We assume deterministic rewards.
- Stationary (time-independent) dynamics: the world dynamics do not depend on the absolute time,
  P(S^{t+1} | A^t, S^t) = P(S^{k+1} | A^k, S^k) for all t, k.
- Full observability: though we can't predict exactly which state we will reach when we execute an action, once it is executed we know what the state is.

Solution Concept
- Classical planning ⇒ a sequence of actions ⟨a_1, a_2, ..., a_k⟩.
- For MDPs, action sequences are insufficient:
  - actions have stochastic effects, so the state we end up in is uncertain;
  - we might end up in states where the remainder of the action sequence is not attractive (or doesn't even apply).
- A solution should tell us what the best action is for any possible situation that might arise.

Policies ("plans" for MDPs)
- A solution to an MDP is a policy. There are two types of policies: nonstationary and stationary.
- Nonstationary (NS) policies are used when we are given a finite planning horizon H,
  i.e., we are told how many actions we will be allowed to take.
  An NS policy is a function from states and times to actions:
  π : S × T → A, where T is the non-negative integers;
  π(s, t) tells us what action to take at state s when there are t stages-to-go.
- What if we want to continue taking actions indefinitely?
  A stationary policy is a function from states to actions:
  π : S → A; π(s) is the action to take at state s (regardless of time).
  It specifies a continuously reactive controller.
- Note that both nonstationary and stationary policies assume:
  - full observability
  - history independence
  - deterministic action choice

Value of a Policy
- How good is a policy π? How do we measure the reward "accumulated by π"?
- A value function V : S[×T] → R associates a value with each state (or with each state and time, for a nonstationary π).
- V_π(s) denotes the value of policy π at state s: it depends on the immediate reward, but also on what you achieve subsequently by following π.
- An optimal policy is one that is no worse than any other policy at any state.
- The goal of MDP planning is to compute an optimal policy (the method depends on how we define value).

Finite-Horizon Value Functions
- We first consider maximizing total reward over a finite horizon.
- This assumes the agent has H time steps to live.
- To act optimally, should the agent use a stationary or a nonstationary policy?
  If you had only one week to live, would you act the same way as if you had fifty years to live?
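To make the two policy types and the idea of "reward accumulated by π over H steps" concrete, here is an illustrative sketch, reusing the hypothetical TabularMDP/mdp object from the earlier example; rollout, pi_ns, and the seed are made-up for illustration. A stationary policy is a length-n action array, a nonstationary one an (H+1) × n array indexed by stages-to-go, and averaging H-step rollouts gives a Monte Carlo estimate of a policy's H-stage value at a state.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(mdp, pi_ns, s0, H):
    """Run one H-step episode under a nonstationary policy.

    pi_ns[t, s] is the action to take in state s with t stages-to-go.
    Returns the total reward accumulated, including R(s0).
    """
    s, total = s0, mdp.R[s0]
    for t in range(H, 0, -1):                      # t = stages-to-go
        a = pi_ns[t, s]
        s = rng.choice(mdp.n_states, p=mdp.T[s, a])
        total += mdp.R[s]
    return total

# A stationary policy is just a length-n action array, e.g. "always take action 0":
pi_stat = np.zeros(mdp.n_states, dtype=int)
# A nonstationary policy additionally depends on stages-to-go (rows 0..H):
H = 5
pi_ns = np.zeros((H + 1, mdp.n_states), dtype=int)
# Averaging rollouts gives a Monte Carlo estimate of the H-stage-to-go value of state 0:
est = np.mean([rollout(mdp, pi_ns, s0=0, H=H) for _ in range(1000)])
```

The next slides replace this sampling estimate with an exact dynamic-programming computation.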
Finite-Horizon Problems
- Value (utility) depends on stage-to-go ⇒ use a nonstationary policy.
- V_π^k(s) is the k-stage-to-go value function of π: the expected total reward for executing π starting in s for k time steps,
  V_π^k(s) = E[ Σ_{t=0..k} R^t | π, s ]
           = E[ Σ_{t=0..k} R(s^t) | a^t = π(s^t, k−t), s^0 = s ],
  where R^t and s^t are random variables.

Computational Problems
- Policy evaluation: given an MDP, a nonstationary policy π, and t ∈ T, compute the finite-horizon value function V_π^t(s).
- Policy optimization: given an MDP and a horizon H, compute an optimal finite-horizon policy π*.
- How many finite-horizon policies are there? |A|^{Hn}, so simple enumeration will not work.
- We'll show that policy optimization is equivalent to computing the "optimal value function".

Finite-Horizon Policy Evaluation
- We can use dynamic programming to compute V_π^t(s); the Markov property is critical for this:
  ∀s ∈ S: V_π^0(s) = R(s)
  V_π^t(s) = R(s) + Σ_{s'} P(s' | s, π(s, t)) · V_π^{t−1}(s')

Policy Optimization: Bellman Backups
- How can we compute V_{π*}^t(s) given V_{π*}^{t−1}(s)?
- [Figure: a Bellman backup at state s. For each action (e.g., a1 with successor probabilities 0.7/0.3 and a2 with 0.4/0.6 over successors s1..s4), compute the expectation of V_{π*}^{t−1} over successors; then take the MAX over actions to obtain V_{π*}^t(s).]

Value Iteration: Finite Horizon Case
- The Markov property allows exploitation of the DP principle for optimal policy construction: no need to enumerate |A|^{Hn} possible policies!
- DP: the optimal solution to the (t−1)-stage problem can be used without modification as part of the optimal solution to the t-stage problem:
  ∀s ∈ S: V_{π*}^0(s) = R(s)
  V_{π*}^t(s) = R(s) + max_a Σ_{s'} P(s' | s, a) · V_{π*}^{t−1}(s')
  π*(s, t) = arg max_a Σ_{s'} P(s' | s, a) · V_{π*}^{t−1}(s')
- V_{π*}^t(s) is the optimal t-stage-to-go value function; π*(s, t) is the optimal t-stage-to-go policy.

Value Iteration: Complexity
- H iterations; at each iteration, each of the n states computes an expectation for each of the |A| actions; each expectation takes O(n) time.
- Total time complexity: O(H · |A| · n²). Polynomial in the number of states. Is that good?

Summary: Finite Horizon
- Value iteration provides us with an optimal policy: V_{π*}^t(s) ≥ V_π^t(s) for all π, s, t
  (convince yourself: prove it by induction on t).
- Note: the optimal value function is unique, but the optimal policy is not.
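The two recursions above map almost line-for-line onto code. The following sketch (again against the hypothetical TabularMDP container from earlier; function names are illustrative, not the slides' implementation) evaluates a given nonstationary policy and runs finite-horizon value iteration.

```python
import numpy as np

def evaluate_policy_fh(mdp, pi_ns, horizon):
    """V_pi^t via DP: V^0 = R, V^t(s) = R(s) + sum_s' P(s'|s, pi(s,t)) * V^{t-1}(s')."""
    V = mdp.R.copy()                                          # V_pi^0
    for t in range(1, horizon + 1):
        P_pi = mdp.T[np.arange(mdp.n_states), pi_ns[t], :]    # row s = P(. | s, pi(s, t))
        V = mdp.R + P_pi @ V
    return V                                                  # V_pi^horizon

def value_iteration_fh(mdp, horizon):
    """Finite-horizon value iteration; returns V*^H and the optimal nonstationary policy."""
    V = mdp.R.copy()                                          # V*^0 = R
    pi = np.zeros((horizon + 1, mdp.n_states), dtype=int)
    for t in range(1, horizon + 1):
        Q = mdp.R[:, None] + mdp.T @ V      # Q[s, a] = R(s) + sum_s' P(s'|s,a) * V^{t-1}(s')
        pi[t] = Q.argmax(axis=1)            # optimal t-stage-to-go action per state
        V = Q.max(axis=1)                   # optimal t-stage-to-go value
    return V, pi
```

Each stage performs the O(|A| · n²) work counted above, so H stages match the O(H · |A| · n²) bound.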
Discounted Infinite Horizon MDPs
- Defining value as total reward is problematic with infinite horizons:
  many or all policies have infinite expected reward;
  some MDPs are ok (e.g., zero-cost absorbing states).
- "Trick": introduce a discount factor 0 ≤ γ < 1; future rewards are discounted by γ per time step:
  V_π(s) = E[ Σ_{t=0..∞} γ^t R^t | π, s ]
- Why? The value gets bounded:
  V_π(s) ≤ E[ Σ_{t=0..∞} γ^t R_max ] = R_max / (1 − γ)
- Motivation: economic? probability of death? convenience?

Notes: Discounted Infinite Horizon
- An optimal policy maximizes value at each state.
- Optimal policies are guaranteed to exist! (Howard, 1960)
- Furthermore, there is always an optimal stationary policy.
  Intuition: why would we change the action at s at a new time when there is always forever ahead?
- We define V*(s) = V_{π*}(s) for some optimal policy π*.

Policy Evaluation
- Value equation for a fixed policy: immediate reward + expected discounted future reward,
  V_π(s) = R(s) + γ · Σ_{s'} P(s' | s, π(s)) · V_π(s')
- How can we compute V_π? We are given R (rewards) and P (action dynamics), so this is a linear system with n variables and n constraints:
  - variables: V_π(s_1), ..., V_π(s_n)
  - constraints: one value equation (above) per state
  - use linear algebra to solve this system and get V_π.

Computing an Optimal Value Function
- Bellman equation for the optimal value function:
  V*(s) = R(s) + γ · max_a Σ_{s'} P(s' | s, a) · V*(s')
- Bellman proved this is always true for an optimal value function.
- The MAX operator makes the system non-linear, so the problem is more difficult than policy evaluation.

Value Iteration
- How can we compute V*? We can compute an optimal policy using value iteration based on Bellman backups, just like in finite-horizon problems (but including the discount term):
  V^0(s) = R(s)
  V^t(s) = R(s) + γ · max_a Σ_{s'} P(s' | s, a) · V^{t−1}(s')
- Does this converge to the optimal value function? YES: lim_{k→∞} V^k = V*.
- When should we stop in practice?

Convergence
- Notice that the optimal value function is a fixed point of the Bellman backup operator B (i.e., B[V*] = V*), where
  B[V](s) = R(s) + γ · max_a Σ_{s'} P(s' | s, a) · V(s')
- B is a contraction operator on value functions: for any V and V',
  ||B[V] − B[V']|| ≤ γ · ||V − V'||,
  where ||V|| is the max-norm: ||(0.1, 100, 5, 12)|| = 100.
- So applying a Bellman backup to any two value functions causes them to get closer together in the max-norm sense!
- Convergence is assured from any V:
  ||V* − B[V]|| = ||B[V*] − B[V]|| ≤ γ · ||V* − V||  ⇒  lim_{k→∞} V^k = V*.
- When to stop in practice? Stop when ||V^k − V^{k−1}|| ≤ ε:
  this ensures ||V* − V^k|| ≤ εγ / (1 − γ) (be able to verify that),
  so stop when V^k and V^{k−1} are similar enough.
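Here is a minimal sketch of discounted value iteration with the max-norm stopping test argued above (assuming the hypothetical TabularMDP from earlier; gamma and eps are parameter names introduced for illustration, not from the slides).

```python
import numpy as np

def value_iteration(mdp, gamma, eps=1e-6):
    """Iterate the Bellman backup B until successive iterates are eps-close in max-norm."""
    V = mdp.R.copy()                                   # V^0 = R
    while True:
        Q = mdp.R[:, None] + gamma * (mdp.T @ V)       # Q[s, a] = R(s) + gamma * E[V | s, a]
        V_new = Q.max(axis=1)                          # B[V]
        if np.max(np.abs(V_new - V)) <= eps:           # ||V^k - V^{k-1}|| <= eps (max-norm)
            return V_new
        V = V_new
```

Because B is a γ-contraction, the loop terminates, and the returned V^k is within εγ/(1 − γ) of V* in max-norm.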
How to Act?
- Given a V^k from value iteration that closely approximates V*, what should we use as our policy?
- Use the greedy policy (one-step lookahead):
  greedy[V^k](s) = arg max_a Σ_{s'} P(s' | s, a) · V^k(s')
- Note: the value of the greedy policy may not be equal to V^k.
- How close is V^greedy to V*? If V^k is ε-close to V*, then V^greedy is (2γε / (1 − γ))-close to V*.

Optimization via Policy Iteration
- Given a fixed policy, we can compute its value exactly:
  V_π(s) = E[ Σ_{t=0..∞} γ^t R^t | π, s ]
- Policy iteration exploits this: it iterates steps of policy evaluation and policy improvement.
  1. Choose a random policy π.
  2. Loop:
     2.1 Evaluate V_π.
     2.2 For each s, set π'(s) := arg max_a Σ_{s'} P(s' | s, a) · V_π(s').
     2.3 Replace π with π'.
     Until no improving action is possible at any state.

Policy Iteration Notes
- Each step of policy iteration is guaranteed to strictly improve the policy at some state when improvement is possible.
- Convergence is assured (Howard): there are no local maxima in value space, and each policy must improve value; since there is a finite number of policies, it will converge to an optimal policy.
- It gives the exact value of the optimal policy.

Value Iteration vs. Policy Iteration
- Which is faster: VI or PI? It depends on the problem instance.
- VI takes more iterations than PI, but PI requires more time on each iteration:
  PI must perform policy evaluation on each step, which involves solving a linear system.
- Complexity: there are at most exp(n) policies, so PI is no worse than exponential time in the number of states.
  Empirically, O(n) iterations are required!
  A polynomial bound on the number of PI iterations was an open problem until very recently. Closed!
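As a companion to the loop above, here is an illustrative policy iteration sketch for the same tabular setting (evaluate_policy and policy_iteration are made-up names, not the slides' code): evaluation solves the n-equation linear system from the policy-evaluation slide exactly, and improvement is the greedy one-step lookahead.

```python
import numpy as np

def evaluate_policy(mdp, pi, gamma):
    """Solve (I - gamma * P_pi) V = R exactly: one value equation per state."""
    n = mdp.n_states
    P_pi = mdp.T[np.arange(n), pi, :]                  # P_pi[s, s2] = P(s2 | s, pi(s))
    return np.linalg.solve(np.eye(n) - gamma * P_pi, mdp.R)

def policy_iteration(mdp, gamma):
    """Alternate exact policy evaluation with greedy policy improvement."""
    pi = np.zeros(mdp.n_states, dtype=int)             # start from an arbitrary policy
    while True:
        V = evaluate_policy(mdp, pi, gamma)
        pi_new = (mdp.T @ V).argmax(axis=1)            # greedy improvement per state
        if np.array_equal(pi_new, pi):                 # no improving action at any state
            return pi, V
        pi = pi_new
```

The improvement step is the same greedy[V] lookahead as on the "How to Act?" slide; applying it once to a V^k produced by value iteration yields the greedy policy discussed there.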