CS 5100: Markov Decision Processes (MDPs)
Robert Platt, Northeastern University
Material adapted from: 1. Russell & Norvig, AIMA; 2. Dan Klein & Pieter Abbeel, UC Berkeley CS 188; 3. Lawson Wong, CS 5100; 4. Chris Amato, CS 5100

Markov Decision Process (MDP): Grid-world example
MDP = (S, A, T, R, γ)
(Grid world with terminal rewards +1 and -1.)
States S:
– Each cell is a state
Actions A: left, right, up, down
– Take one action per time step
Transition function T(s, a, s') = P(s' | s, a):
– Actions are stochastic: the agent goes in the intended direction only 80% of the time (10% of the time it goes in each perpendicular direction)
– If it hits a wall, no move happens
Rewards R(s) – or R(s, a, s'):
– The agent receives these rewards in these cells
– The goal of the agent is to maximize reward
– Rewards can be negative
Discount factor γ (gamma):
– Trades off current vs. future rewards

Markov Decision Process (MDP)
Deterministic – the same action always has the same outcome (probability 1.0)
Stochastic – the same action can have different outcomes (e.g., probabilities 0.8, 0.1, 0.1)

Markov Decision Process (MDP)
The same action can have different outcomes. Transition function at s_1:
  s'           s_2   s_3   s_4
  T(s_1,a,s')  0.1   0.8   0.1

Markov Decision Process (MDP)
Technically, an MDP is a 5-tuple. An MDP defines a stochastic control problem:
State set: S
Action set: A
Transition function: T(s, a, s') = P(s' | s, a) – the probability of going from s to s' when executing action a
Reward function: R(s), or more generally R(s, a, s')
Discount factor: γ
What is the objective?
Objective: Compute a strategy for acting that maximizes the return (expected sum of future rewards)
– We will compute a policy that tells us how to act

Markov Decision Process
Markov property / assumption: Given the current state, the future and the past are independent
Markov decision process: The transition function is Markovian
Similar to the situation in search: the successor function depended only on the current state (not the history)
Can you give an example of a Markovian process? How about a non-Markovian process?
(Diagram: states at t=1 and t=2 connected by actions / transitions. Andrey Markov, 1856-1922.)

What is a policy?
A policy tells the agent what action to execute as a function of state: π: S → A
Deterministic policy:
– The agent always executes the same action from a given state
Stochastic policy:
– The agent selects an action by drawing from a probability distribution encoded by the policy
For MDPs, there always exists an optimal policy that is deterministic
– We will see stochastic policies when learning MDPs (reinforcement learning)
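To make the transition model concrete, here is a minimal Python sketch of the grid-world T(s, a, s') described above (80% intended direction, 10% each perpendicular direction, no move when hitting a wall). The grid layout, blocked cell, and function names are illustrative assumptions, not part of the slides.

# Minimal sketch of the grid-world transition function T(s, a, s') = P(s' | s, a).
# States are (col, row) cells; off-grid or blocked moves leave the state unchanged.
COLS, ROWS = 4, 3
BLOCKED = {(2, 2)}                      # assumed interior wall cell
MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
PERP = {'up': ('left', 'right'), 'down': ('left', 'right'),
        'left': ('up', 'down'), 'right': ('up', 'down')}

def move(s, a):
    """Deterministic effect of action a from state s; stay put if blocked."""
    x, y = s
    dx, dy = MOVES[a]
    nxt = (x + dx, y + dy)
    if nxt in BLOCKED or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return s                        # hit a wall: no move happens
    return nxt

def T(s, a):
    """Return {s': P(s' | s, a)}: 0.8 intended, 0.1 each perpendicular."""
    dist = {}
    for action, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        s2 = move(s, action)
        dist[s2] = dist.get(s2, 0.0) + p
    return dist

print(T((1, 1), 'up'))   # e.g. {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}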
What is a policy?
Policies are more general than plans
Plan:
– Specifies a sequence of actions to execute
– Cannot react to unexpected outcomes
Policy:
– Tells you what action to take from any state
A plan might not be optimal:
U(r,r) = 15, U(r,b) = 15, U(b,r) = 20, U(b,b) = 20
Because it can react to each stochastic outcome, the optimal policy can achieve U = 30

Policy for grid-world example
Terminal states: +1 and -1
Example: π((4,1)) = left
R(s) = -0.04 for all non-terminal s
Objective: Compute a strategy for acting that maximizes the expected return (expected sum of future rewards)

Choosing a reward function
A few possibilities:
– Place rewards on goal states / undesirable states
– Negative reward everywhere except terminal states
– Gradually increasing reward as you approach the goal (reward shaping)
In general:
– The reward can be whatever you want
Caution! Reward hacking

Decision theory
Decision theory = probability theory + utility theory
Given a single-step decision problem, choose the action that maximizes expected utility:
Expected utility: U([p_1, s_1; ...; p_n, s_n]) = Σ_i p_i U(s_i)
Max expected utility: choose a* = argmax_a Σ_{s'} P(s' | a) U(s')
What about sequential decision problems? Maximize the expected return (the expected utility of a sequence).

Utility theory strikes again
What preferences should an agent have over reward sequences?
More or less? [1, 2, 2] or [2, 3, 4]
Now or later? [0, 0, 1] or [1, 0, 0]
Intuitively:
– Prefer a larger sum of rewards
– Prefer rewards now over rewards in the future
Solution: sum of discounted rewards
2 + γ·3 + γ²·4 > 1 + γ·2 + γ²·2
1 + γ·0 + γ²·0 > 0 + γ·0 + γ²·1

Utility theory strikes again
What preferences should an agent have over reward sequences?
Theorem: If we assume stationary preferences,
[r_0, r_1, r_2, ...] ≻ [r_0, r_1', r_2', ...]  ⇔  [r_1, r_2, ...] ≻ [r_1', r_2', ...]
then there are only two ways to define utilities:
Additive utility: U([r_0, r_1, r_2, ...]) = r_0 + r_1 + r_2 + ...
Discounted utility: U([r_0, r_1, r_2, ...]) = r_0 + γ r_1 + γ² r_2 + ...
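The two preference questions above can be checked numerically with the discounted-utility definition; a quick sketch (γ = 0.9 is just an example value):

# Compare reward sequences by their discounted utility U = sum_t gamma^t * r_t.
def discounted_utility(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([2, 3, 4]) > discounted_utility([1, 2, 2]))  # True: prefer more
print(discounted_utility([1, 0, 0]) > discounted_utility([0, 0, 1]))  # True: prefer sooner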
Utility theory strikes again
What preferences should an agent have over reward sequences?
Additive utility: U([r_0, r_1, r_2, ...]) = r_0 + r_1 + r_2 + ...
Discounted utility: U([r_0, r_1, r_2, ...]) = r_0 + γ r_1 + γ² r_2 + ...
More generally, for γ ∈ [0, 1]: U([r_0, r_1, r_2, ...]) = Σ_t γ^t r_t
γ = 0: single-step utility
γ = 1: additive utility (this is why we use additive edge costs in search!)
0 < γ < 1: discounted utility (often γ = 0.9, 0.95, 0.99)

Discount factor
Given: states a through e in a row (figure shows the rewards at the two exit states)
Actions: East, West, and Exit (Exit only available in the exit states a and e)
Transitions: deterministic
Q1: For γ = 1, what is the optimal policy?
Q2: For γ = 0.1, what is the optimal policy?
Q3: For which γ are West and East equally good when in state d?

Discount factor
What preferences should an agent have over reward sequences?
For γ = 1, policies can run forever!
– Can accrue infinite utility
– Want to avoid comparing ∞ > ∞
Solutions for avoiding infinite utilities:
– Finite horizon (the episode ends after T time steps)
– Terminal states (but must ensure that the agent eventually reaches one)
– Discount factor γ < 1: U([r_0, r_1, ...]) = Σ_t γ^t r_t ≤ R_max / (1 - γ)

Value function
Utility for a specific sequence of states: U([s_0, s_1, s_2, ...]) = Σ_t γ^t R(s_t)
But the environment is stochastic!
– We cannot predict the exact future sequence of states
Solution: Maximum Expected Utility (MEU)
– Take an expectation (over future outcomes)

Value function
Value function: expected return starting in state s_0
V(s_0) = E[ Σ_t γ^t R(s_t) ]
The quantity above is ill-defined! We are missing something...
– Technically it is a conditional expectation (given s_0, a_0, a_1, a_2, ...)
– Where do the actions come from?

Value function
Value function: expected return starting in state s_0 and following policy π thereafter
V^π(s_0) = E[ Σ_t γ^t R(s_t) | s_0, π ]
– Technically a conditional expectation (given s_0, a_0 = π(s_0), a_1 = π(s_1), a_2 = π(s_2), ...)
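Because V^π(s_0) is an expectation over stochastic futures, one way to build intuition is to estimate it by averaging the discounted returns of sampled rollouts. A rough sketch on a made-up two-state MDP (the MDP, policy, horizon, and names are illustrative assumptions; truncating rollouts at T steps ignores at most γ^T · R_max / (1 - γ) of the return):

import random

# Toy 2-state MDP (illustrative): from 'a', the policy's action usually reaches 'b'.
T = {('a', 'go'): [('a', 0.2), ('b', 0.8)],   # list of (next_state, probability)
     ('b', 'go'): [('b', 1.0)]}
R = {'a': 0.0, 'b': 1.0}
policy = {'a': 'go', 'b': 'go'}
gamma = 0.9

def sample_next(s, a):
    r, cum = random.random(), 0.0
    for s2, p in T[(s, a)]:
        cum += p
        if r <= cum:
            return s2
    return s2

def rollout_return(s, horizon=100):
    """Discounted return of one sampled trajectory following the policy."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        total += discount * R[s]
        s = sample_next(s, policy[s])
        discount *= gamma
    return total

estimate = sum(rollout_return('a') for _ in range(5000)) / 5000
print(round(estimate, 2))   # Monte Carlo estimate of V^pi('a'); exact value here is ~8.78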
Value function
Value function: expected return starting in state s_0 and following policy π thereafter
Recursive definition, usually written as:
V^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') V^π(s')
– Current reward + (discounted) (expected) future return
How to find V^π if you know the policy π?
– The characterization above gives |S| linear equations (one per state) with |S| unknowns (V(s) for each state s)
This applies for any policy, including the optimal one, π*

Optimal value function
Value function: expected return starting in state s and following policy π* thereafter
The optimal value function should maximize the expected return:
V*(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') V*(s')
Bellman equation (1957)

Optimal value function
(Grid-world example: R(s) = 0 for non-terminal s, γ = 0.9.)

Optimal value function
Interpreting what the equation says using expectimax:
– Max node (you): V*(s) – obtain the reward R(s), then add the max over the children, one per action (e.g. b and c)
– Exp node (nature): for each action, compute the expected future return Σ_{s'} T(s, a, s') V*(s') over outcomes (e.g. s' = w, x for action b; s' = y, z for action c)
– Multiply the expected future return by the discount factor γ
– The leaves V*(s') are the values (expected future returns) if you act optimally from the next step onward
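The expectimax picture is one application of the Bellman equation: take the max over actions of the expected discounted next-state value, then add the current reward. A minimal sketch of that single backup (the tiny MDP and the name bellman_backup are illustrative assumptions):

# One ply of expectimax = one Bellman backup:
#   V*(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') * V*(s')
gamma = 0.9
R = {'s': 0.0, 'w': 0.0, 'x': 1.0, 'y': 0.0, 'z': 2.0}           # assumed rewards
T = {('s', 'b'): {'w': 0.5, 'x': 0.5},                            # assumed transitions
     ('s', 'c'): {'y': 0.9, 'z': 0.1}}
V = {'w': 0.0, 'x': 10.0, 'y': 1.0, 'z': 1.0}                     # assumed next-step values

def bellman_backup(s, actions):
    # expectation over outcomes (nature) for each action, then max over actions (you)
    expected = {a: sum(p * V[s2] for s2, p in T[(s, a)].items()) for a in actions}
    return R[s] + gamma * max(expected.values())

print(bellman_backup('s', ['b', 'c']))   # 0 + 0.9 * max(5.0, 1.0) = 4.5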
An algorithm for computing the value function?

Value iteration
Init: V_1(s) = R(s) for all s
Repeat until convergence: For each state s,
V_{k+1}(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') V_k(s')
This performs one ply of expectimax search
– Convergence: typically when max_s |V_{k+1}(s) - V_k(s)| < ε
– V_k(s) is equivalent to the expectimax value after (k-1) plies
– V_1(s) is equivalent to the reward function R(s) (similar to the terminal utility, rather than one ply of search)
– V_k is the optimal value function for horizon (k-1), i.e., (k-1) actions left (k states left, where s is the first state)
– Guaranteed to converge to V*
– For more general rewards R(s, a, s'), do: V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
– Complexity of each iteration: O(|S|²|A|)

Value iteration
(Grid-world example, iterated to convergence: R(s) = 0 for non-terminal s, γ = 0.9.)
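Putting the loop together, a compact sketch of value iteration using the R(s) convention from these slides (the toy two-state MDP, the ε threshold, and the variable names are illustrative assumptions):

# Value iteration: V_{k+1}(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') V_k(s')
gamma, eps = 0.9, 1e-6
S = ['a', 'b']
A = ['stay', 'go']
R = {'a': 0.0, 'b': 1.0}                                   # assumed rewards
T = {('a', 'stay'): {'a': 1.0}, ('a', 'go'): {'a': 0.2, 'b': 0.8},
     ('b', 'stay'): {'b': 1.0}, ('b', 'go'): {'a': 1.0}}   # assumed transitions

V = {s: R[s] for s in S}                                   # V_1(s) = R(s)
while True:
    V_new = {s: R[s] + gamma * max(sum(p * V[s2] for s2, p in T[(s, a)].items())
                                   for a in A)
             for s in S}                                   # one ply of expectimax per state
    converged = max(abs(V_new[s] - V[s]) for s in S) < eps
    V = V_new
    if converged:                                          # values stopped changing
        break

pi = {s: max(A, key=lambda a: sum(p * V[s2] for s2, p in T[(s, a)].items()))
      for s in S}                                          # extract the greedy policy
print(V, pi)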
Value iteration
How to obtain the policy?
π*(s) = argmax_a Σ_{s'} T(s, a, s') V*(s')
Notice anything interesting about the policy during VI? (Look at the grid-world example again)

Policy iteration
Notice anything interesting about the policy during VI? (Look at the grid-world example again)
The policy converges much faster than the value function
– Getting the exact values does not matter if the resulting strategy for acting is the same
In value iteration, we iteratively compute the value function
In policy iteration, we iteratively compute the policy

Summary
Both value iteration and policy iteration compute the same thing (the optimal values/policy)
Both are dynamic programming algorithms for MDPs
Value iteration:
– Every iteration updates both the values and (implicitly) the policy
– We do not track the policy, but it is implicitly computed by the max
Init: V_1(s) = R(s)
Repeat until convergence: For each state s, V_{k+1}(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') V_k(s')
Extract the final policy: π(s) = argmax_a Σ_{s'} T(s, a, s') V(s')

Appendix: Policy Iteration

Policy iteration
The policy converges much faster than the value function
– Getting the exact values does not matter if the resulting strategy for acting is the same
In value iteration, we iteratively compute the value function
In policy iteration, we iteratively compute the policy

Policy iteration
Init: start with an arbitrary policy π_0
Repeat until convergence:
– Policy evaluation: for each state s, compute V^{π_i}(s)
– Policy improvement: for each state s, set π_{i+1}(s) = argmax_a Σ_{s'} T(s, a, s') V^{π_i}(s')
– Convergence: the policy no longer changes
Guaranteed to converge to π*
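The policy-improvement step (and the earlier "how to obtain the policy?" question for value iteration) is the same greedy one-step lookahead in both cases; a minimal sketch under the R(s) convention (the toy transition model, values, and names are illustrative assumptions):

# Greedy policy with respect to a value function V:
#   pi(s) = argmax_a sum_{s'} T(s, a, s') * V(s')
A = ['stay', 'go']
T = {('a', 'stay'): {'a': 1.0}, ('a', 'go'): {'a': 0.2, 'b': 0.8},
     ('b', 'stay'): {'b': 1.0}, ('b', 'go'): {'a': 1.0}}   # assumed transitions
V = {'a': 8.8, 'b': 10.0}                                   # assumed current values

def greedy_policy(V):
    return {s: max(A, key=lambda a: sum(p * V[s2] for s2, p in T[(s, a)].items()))
            for s in ['a', 'b']}

print(greedy_policy(V))   # {'a': 'go', 'b': 'stay'}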
Policy evaluation
Policy evaluation: compute V^π for the current policy π

Recall: Value function
Value function: expected return starting in state s_0 and following policy π thereafter
Recursive definition, usually written as: V^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') V^π(s')
How to find V^π if you know the policy π?
– The characterization above gives |S| linear equations (one per state) with |S| unknowns (V(s) for each state s)

Policy evaluation
Policy evaluation: for each state s, compute V^π(s)
How to find V^π if you know the policy π?
– The characterization above gives |S| linear equations (one per state) with |S| unknowns (V(s) for each state s)
– This is just a standard linear system of |S| equations (MATLAB: v = A\b)

Approximate policy evaluation
Or: run a few (a small number!) simplified value iteration steps:
V_{k+1}^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') V_k^π(s')
(simplified: no max over actions, only the action chosen by the fixed policy)
Policy iteration with approximate policy evaluation is also known as modified policy iteration

Summary
Both value iteration and policy iteration compute the same thing (the optimal values/policy)
Value iteration:
– Every iteration updates both the values and (implicitly) the policy
– We do not track the policy, but it is implicitly computed by the max
Policy iteration:
– Every iteration does a quick update of the values using the fixed policy (no max, just compute for a single action)
– After the policy is evaluated, choose a new policy (max action)
– The new policy will be better
Both are dynamic programming algorithms for MDPs

Summary
Value iteration
Init: V_1(s) = R(s)
Repeat until convergence: For each state s, V_{k+1}(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') V_k(s')
Policy iteration
Init: arbitrary policy π_0
Repeat until convergence:
– Policy evaluation: for each state s, compute V^{π_i}(s)
– Policy improvement: for each state s, set π_{i+1}(s) = argmax_a Σ_{s'} T(s, a, s') V^{π_i}(s')
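For the exact policy-evaluation step (the slides' MATLAB "v = A\b"), a NumPy analogue is a single linear solve of (I - γ P_π) V = R; the toy numbers below are illustrative assumptions, not from the slides:

import numpy as np

# Exact policy evaluation: solve V = R + gamma * P_pi V, i.e. (I - gamma * P_pi) V = R.
gamma = 0.9
R = np.array([0.0, 1.0])                      # R(s) for states [a, b]
P_pi = np.array([[0.2, 0.8],                  # row s, column s': T(s, pi(s), s')
                 [0.0, 1.0]])

V = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
print(V)    # approximately [8.78, 10.0]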