Markov Decision Process (MDP)
Ruti Glick
Bar-Ilan University

Policy
A policy is similar to a plan, generated ahead of time.
Unlike a traditional plan, it is not a sequence of actions that the agent must execute: if there are failures in execution, the agent can simply continue to follow the policy.
A policy prescribes an action for every state.
It maximizes expected reward, rather than just reaching a goal state.

Utility and Policy
Utility: computed for every state. "What is the usefulness (utility) of this state for the overall task?"
Policy: a complete mapping from states to actions. "In which state should I perform which action?"
policy: state → action

The optimal policy
π*(s) = argmax_a ∑_s' T(s, a, s') U(s')
T(s, a, s') = probability of reaching state s' from state s by action a
U(s') = utility of state s'
If we know the utilities we can easily compute the optimal policy. The problem is to compute the correct utilities for all states.

Finding π*
Value iteration
Policy iteration

Value iteration
Process:
Calculate the utility of each state.
Use the values to select an optimal action.

Bellman equation
U(s) = R(s) + γ max_a ∑_s' T(s, a, s') U(s')
For example, for state (1,1) of the 4x3 grid world (terminal states +1 and -1, agent starting at START):
U(1,1) = -0.04 + γ max{ 0.8U(1,2) + 0.1U(2,1) + 0.1U(1,1),    (Up)
                        0.9U(1,1) + 0.1U(1,2),                 (Left)
                        0.9U(1,1) + 0.1U(2,1),                 (Down)
                        0.8U(2,1) + 0.1U(1,2) + 0.1U(1,1) }    (Right)

Bellman equation properties
U(s) = R(s) + γ max_a ∑_s' T(s, a, s') U(s')
n equations, one for each state; n variables (the utilities).
Problem: max is not a linear operator, so these are non-linear equations.
Solution: an iterative approach.

Value iteration algorithm
Start with arbitrary initial values for the utilities.
Update the utility of each state from its neighbors:
  U_{i+1}(s) ← R(s) + γ max_a ∑_s' T(s, a, s') U_i(s')
This iteration step is called a Bellman update.
Repeat until the values converge.

Value iteration properties
The equilibrium is a unique solution.
One can prove that the value iteration process converges.
Exact values are not needed.

Convergence
The Bellman update is a contraction: a function of one argument that, when applied to two different inputs, produces values that are closer together.
A contraction has only one fixed point, and each application of the update brings the value closer to that fixed point.
We are not going to prove the last point: value iteration converges to the correct values.

Value iteration algorithm
function VALUE_ITERATION(mdp) returns a utility function
  inputs: mdp, an MDP with states S, transition model T, reward function R, discount γ
  local variables: U, U', vectors of utilities for the states in S, initially identical to R
  repeat
    U ← U'
    for each state s in S do
      U'[s] ← R[s] + γ max_a ∑_s' T(s, a, s') U[s']
  until close-enough(U, U')
  return U

Example
A small version of our main example: a 2x2 world.
The agent is placed in (1,1).
States (2,1) and (2,2) are goal states.
If blocked by the wall, the agent stays in place.
The rewards are written on the board:
  (1,2): R=-0.04   (2,2): R=+1
  (1,1): R=-0.04   (2,1): R=-1
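To make the Bellman update concrete, here is a minimal, illustrative Python sketch of value iteration on this 2x2 world. It is not part of the original slides: the state encoding, the move/transitions helpers and the stopping threshold eps are assumptions; the transition model (0.8 in the intended direction, 0.1 for each perpendicular direction, staying in place at walls) and γ = 1 follow the example above.

```python
# Illustrative sketch only (not from the slides): value iteration on the 2x2
# example world. States are (column, row); (2,1) and (2,2) are terminal.

GAMMA = 1.0          # the example above uses gamma = 1
R = {(1, 1): -0.04, (1, 2): -0.04, (2, 1): -1.0, (2, 2): +1.0}
TERMINAL = {(2, 1), (2, 2)}
ACTIONS = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}

def move(s, d):
    """Deterministic move; if blocked by a wall, stay in place."""
    nxt = (s[0] + d[0], s[1] + d[1])
    return nxt if nxt in R else s

def transitions(s, a):
    """0.8 for the intended direction, 0.1 for each perpendicular direction."""
    dx, dy = ACTIONS[a]
    return [(0.8, move(s, (dx, dy))),
            (0.1, move(s, (dy, dx))),     # one perpendicular direction
            (0.1, move(s, (-dy, -dx)))]   # the other perpendicular direction

def value_iteration(eps=1e-4):
    U = dict(R)                           # initially identical to R
    while True:
        U_new, delta = {}, 0.0
        for s in R:
            if s in TERMINAL:
                U_new[s] = R[s]           # goal states keep their reward
                continue
            best = max(sum(p * U[s2] for p, s2 in transitions(s, a))
                       for a in ACTIONS)
            U_new[s] = R[s] + GAMMA * best          # Bellman update
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                   # "close enough"
            return U

print(value_iteration())  # approaches U(1,1) ~ 0.66, U(1,2) ~ 0.92
```

The iterations worked out by hand on the next slides follow exactly this update rule.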
Example (cont.) – first iteration
Current utilities (initially equal to the rewards):
  (1,2): U=-0.04, R=-0.04   (2,2): U=+1, R=+1
  (1,1): U=-0.04, R=-0.04   (2,1): U=-1, R=-1

U'(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),    (Up)
                          0.9U(1,1) + 0.1U(1,2),                 (Left)
                          0.9U(1,1) + 0.1U(2,1),                 (Down)
                          0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }    (Right)
        = -0.04 + 1 x max{ 0.8x(-0.04) + 0.1x(-0.04) + 0.1x(-1),
                           0.9x(-0.04) + 0.1x(-0.04),
                           0.9x(-0.04) + 0.1x(-1),
                           0.8x(-1) + 0.1x(-0.04) + 0.1x(-0.04) }
        = -0.04 + max{ -0.136, -0.04, -0.136, -0.808 }
        = -0.08

U'(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),                 (Up)
                          0.9U(1,2) + 0.1U(1,1),                 (Left)
                          0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),     (Down)
                          0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }    (Right)
        = -0.04 + 1 x max{ 0.9x(-0.04) + 0.1x1,
                           0.9x(-0.04) + 0.1x(-0.04),
                           0.8x(-0.04) + 0.1x1 + 0.1x(-0.04),
                           0.8x1 + 0.1x(-0.04) + 0.1x(-0.04) }
        = -0.04 + max{ 0.064, -0.04, 0.064, 0.792 }
        = 0.752

The goal states remain the same.

Example (cont.) – second iteration
  (1,2): U=0.752, R=-0.04   (2,2): U=+1, R=+1
  (1,1): U=-0.08, R=-0.04   (2,1): U=-1, R=-1

U'(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                          0.9U(1,1) + 0.1U(1,2),
                          0.9U(1,1) + 0.1U(2,1),
                          0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
        = -0.04 + 1 x max{ 0.8x0.752 + 0.1x(-0.08) + 0.1x(-1),
                           0.9x(-0.08) + 0.1x0.752,
                           0.9x(-0.08) + 0.1x(-1),
                           0.8x(-1) + 0.1x(-0.08) + 0.1x0.752 }
        = -0.04 + max{ 0.4936, 0.0032, -0.172, -0.7328 }
        = 0.4536

U'(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                          0.9U(1,2) + 0.1U(1,1),
                          0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                          0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
        = -0.04 + 1 x max{ 0.9x0.752 + 0.1x1,
                           0.9x0.752 + 0.1x(-0.08),
                           0.8x(-0.08) + 0.1x1 + 0.1x0.752,
                           0.8x1 + 0.1x0.752 + 0.1x(-0.08) }
        = -0.04 + max{ 0.7768, 0.6688, 0.1112, 0.8672 }
        = 0.8272

Example (cont.) – third iteration
  (1,2): U=0.8272, R=-0.04   (2,2): U=+1, R=+1
  (1,1): U=0.4536, R=-0.04   (2,1): U=-1, R=-1

U'(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                          0.9U(1,1) + 0.1U(1,2),
                          0.9U(1,1) + 0.1U(2,1),
                          0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
        = -0.04 + 1 x max{ 0.8x0.8272 + 0.1x0.4536 + 0.1x(-1),
                           0.9x0.4536 + 0.1x0.8272,
                           0.9x0.4536 + 0.1x(-1),
                           0.8x(-1) + 0.1x0.4536 + 0.1x0.8272 }
        = -0.04 + max{ 0.6071, 0.4910, 0.3082, -0.6719 }
        = 0.5671

U'(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                          0.9U(1,2) + 0.1U(1,1),
                          0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                          0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
        = -0.04 + 1 x max{ 0.9x0.8272 + 0.1x1,
                           0.9x0.8272 + 0.1x0.4536,
                           0.8x0.4536 + 0.1x1 + 0.1x0.8272,
                           0.8x1 + 0.1x0.8272 + 0.1x0.4536 }
        = -0.04 + max{ 0.8444, 0.7898, 0.5456, 0.9281 }
        = 0.8881

Example (cont.)
Continue to the next iteration…
  (1,2): U=0.8881, R=-0.04   (2,2): U=+1, R=+1
  (1,1): U=0.5671, R=-0.04   (2,1): U=-1, R=-1
Finish if "close enough". Here the largest change was 0.1135 – close enough.

"Close enough"
We will not go deeply into this issue. Different possibilities for detecting convergence:
RMS error – the root mean square error of the utility values compared to the correct values:
  RMS(U, U') = sqrt( (1/|S|) ∑_{i=1}^{|S|} (U(i) − U'(i))² )
Demand RMS(U, U') < ε, where:
  ε – the maximum error allowed in the utility of any state in an iteration.

"Close enough" (cont.)
Policy loss – the difference between the expected utility obtained by following the computed policy and the expected utility obtained by the optimal policy.
Stop when || U_{i+1} − U_i || < ε(1 − γ)/γ, where:
  ||U|| = max_s |U(s)|
  ε – the maximum error allowed in the utility of any state in an iteration
  γ – the discount factor

Finding the policy
Once the true utilities have been found, a new search gives the optimal policy:
  for each s in S do
    π[s] ← argmax_a ∑_s' T(s, a, s') U(s')
  return π
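The policy-extraction step above can be written directly as an argmax over actions. The sketch below is illustrative, not code from the slides; it reuses the hypothetical transitions, ACTIONS, TERMINAL and value_iteration helpers from the earlier value-iteration sketch.

```python
# Policy extraction: pi[s] = argmax_a sum_s' T(s, a, s') U(s')
# (reuses the hypothetical helpers from the value-iteration sketch above).

def extract_policy(U):
    pi = {}
    for s in U:
        if s in TERMINAL:
            continue                      # no action needed in goal states
        pi[s] = max(ACTIONS,
                    key=lambda a: sum(p * U[s2]
                                      for p, s2 in transitions(s, a)))
    return pi

U = value_iteration()
print(extract_policy(U))  # expected: {(1, 1): 'Up', (1, 2): 'Right'}
```

The next slide performs the same extraction by hand for the 2x2 example.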
Example (cont.) – finding the optimal policy
  (1,2): U=0.8881, R=-0.04   (2,2): U=+1, R=+1
  (1,1): U=0.5671, R=-0.04   (2,1): U=-1, R=-1

Π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),    (Up)
                    0.9U(1,1) + 0.1U(1,2),                 (Left)
                    0.9U(1,1) + 0.1U(2,1),                 (Down)
                    0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }    (Right)
       = argmax_a { 0.8x0.8881 + 0.1x0.5671 + 0.1x(-1),
                    0.9x0.5671 + 0.1x0.8881,
                    0.9x0.5671 + 0.1x(-1),
                    0.8x(-1) + 0.1x0.5671 + 0.1x0.8881 }
       = argmax_a { 0.6672, 0.5992, 0.4104, -0.6545 }
       = Up

Π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                 (Up)
                    0.9U(1,2) + 0.1U(1,1),                 (Left)
                    0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),     (Down)
                    0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }    (Right)
       = argmax_a { 0.9x0.8881 + 0.1x1,
                    0.9x0.8881 + 0.1x0.5671,
                    0.8x0.5671 + 0.1x1 + 0.1x0.8881,
                    0.8x1 + 0.1x0.8881 + 0.1x0.5671 }
       = argmax_a { 0.8993, 0.8560, 0.6425, 0.9455 }
       = Right

Summary – value iteration
1. The given environment (the 4x3 grid world with terminal states +1 and -1)
2. Calculate the utilities:
     0.812   0.868   0.912   +1
     0.762   (wall)  0.660   -1
     0.705   0.655   0.611   0.388
3. Extract the optimal policy
4. Execute the actions

Example – convergence
(Figure: convergence of the utility values over the iterations, for a given allowed error.)

Policy iteration
Pick a policy, then calculate the utility of each state given that policy (policy evaluation step).
Update the policy at each state using the utilities of the successor states (policy improvement).
Repeat until the policy stabilizes.

Policy iteration
For each state, in each step:
Policy evaluation: given the policy π_i, calculate the utility U_i of each state if π_i were to be executed.
Policy improvement: calculate a new policy π_{i+1} based on π_i:
  π_{i+1}[s] ← argmax_a ∑_s' T(s, a, s') U^{π_i}(s')

Policy iteration algorithm
function POLICY_ITERATION(mdp) returns a policy
  inputs: mdp, an MDP with states S, transition model T
  local variables: U, U', vectors of utilities for the states in S, initially identical to R
                   π, a policy, a vector indexed by states, initially random
  repeat
    U ← POLICY-EVALUATION(π, mdp)
    unchanged? ← true
    for each state s in S do
      if max_a ∑_s' T(s, a, s') U[s'] > ∑_s' T(s, π[s], s') U[s'] then
        π[s] ← argmax_a ∑_s' T(s, a, s') U[s']
        unchanged? ← false
  until unchanged?
  return π

Example
Back to our last example: the 2x2 world.
The agent is placed in (1,1).
States (2,1) and (2,2) are goal states.
If blocked by the wall, the agent stays in place.
The rewards are written on the board:
  (1,2): R=-0.04   (2,2): R=+1
  (1,1): R=-0.04   (2,1): R=-1
Initial policy: Up (for every state).

Example (cont.) – first iteration, policy evaluation
Current policy: Π(1,1) = Up, Π(1,2) = Up.
Initial utilities (equal to the rewards):
  (1,2): U=-0.04, R=-0.04   (2,2): U=+1, R=+1
  (1,1): U=-0.04, R=-0.04   (2,1): U=-1, R=-1

U(1,1) = R(1,1) + γ x (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
U(1,2) = R(1,2) + γ x (0.9U(1,2) + 0.1U(2,2))
U(2,1) = R(2,1)
U(2,2) = R(2,2)

With γ = 1:
U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
U(1,2) = -0.04 + 0.9U(1,2) + 0.1U(2,2)
U(2,1) = -1
U(2,2) = 1

As a linear system:
0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
0.04 =  0U(1,1)   − 0.1U(1,2) + 0U(2,1)   + 0.1U(2,2)
  -1 =  0U(1,1)   + 0U(1,2)   + 1U(2,1)   + 0U(2,2)
   1 =  0U(1,1)   + 0U(1,2)   + 0U(2,1)   + 1U(2,2)

In matrix form:
  [ -0.9   0.8   0.1   0   ] [ U(1,1) ]   [ 0.04 ]
  [  0    -0.1   0     0.1 ] [ U(1,2) ] = [ 0.04 ]
  [  0     0     1     0   ] [ U(2,1) ]   [ -1   ]
  [  0     0     0     1   ] [ U(2,2) ]   [  1   ]

Solution:
  U(1,1) = 0.3778
  U(1,2) = 0.6
  U(2,1) = -1
  U(2,2) = 1

(On solving linear systems: http://www.math.ncsu.edu/ma114/PDF/1.4.pdf)
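Because the policy is fixed, the max disappears and the evaluation step is just a linear system. The sketch below solves the first-iteration system above with NumPy; the 4x4 matrix is copied from the equations on this slide, while the use of numpy.linalg.solve is my illustration, not something prescribed by the slides.

```python
import numpy as np

# Policy evaluation for the fixed policy pi(1,1)=Up, pi(1,2)=Up:
# solve the 4x4 system above (variable order U(1,1), U(1,2), U(2,1), U(2,2)).
A = np.array([[-0.9,  0.8, 0.1, 0.0],   # -0.9*U11 + 0.8*U12 + 0.1*U21            = 0.04
              [ 0.0, -0.1, 0.0, 0.1],   #           -0.1*U12            + 0.1*U22 = 0.04
              [ 0.0,  0.0, 1.0, 0.0],   #  U21 = -1
              [ 0.0,  0.0, 0.0, 1.0]])  #  U22 = +1
b = np.array([0.04, 0.04, -1.0, 1.0])

U = np.linalg.solve(A, b)
print(U)  # approximately [0.3778, 0.6, -1.0, 1.0], matching the slide
```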
Example (cont.) – first iteration, policy improvement
Current policy: Π(1,1) = Up, Π(1,2) = Up.
  (1,2): U=0.6, R=-0.04      (2,2): U=+1, R=+1
  (1,1): U=0.3778, R=-0.04   (2,1): U=-1, R=-1

Π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),    (Up)
                    0.9U(1,1) + 0.1U(1,2),                 (Left)
                    0.9U(1,1) + 0.1U(2,1),                 (Down)
                    0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }    (Right)
       = argmax_a { 0.8x0.6 + 0.1x0.3778 + 0.1x(-1),
                    0.9x0.3778 + 0.1x0.6,
                    0.9x0.3778 + 0.1x(-1),
                    0.8x(-1) + 0.1x0.3778 + 0.1x0.6 }
       = argmax_a { 0.4178, 0.4, 0.24, -0.7022 }
       = Up          → no need to update

Π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                 (Up)
                    0.9U(1,2) + 0.1U(1,1),                 (Left)
                    0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),     (Down)
                    0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }    (Right)
       = argmax_a { 0.9x0.6 + 0.1x1,
                    0.9x0.6 + 0.1x0.3778,
                    0.8x0.3778 + 0.1x1 + 0.1x0.6,
                    0.8x1 + 0.1x0.6 + 0.1x0.3778 }
       = argmax_a { 0.64, 0.5778, 0.4622, 0.8978 }
       = Right       → update

New policy: Π(1,1) = Up, Π(1,2) = Right.

Example (cont.) – second iteration, policy evaluation
Current policy: Π(1,1) = Up, Π(1,2) = Right.
(Previous utilities: U(1,1) = 0.3778, U(1,2) = 0.6, U(2,1) = -1, U(2,2) = +1.)

U(1,1) = R(1,1) + γ x (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
U(1,2) = R(1,2) + γ x (0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1))
U(2,1) = R(2,1)
U(2,2) = R(2,2)

With γ = 1:
U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
U(1,2) = -0.04 + 0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1)
U(2,1) = -1
U(2,2) = 1

As a linear system:
0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
0.04 =  0.1U(1,1) − 0.9U(1,2) + 0U(2,1)   + 0.8U(2,2)
  -1 =  0U(1,1)   + 0U(1,2)   + 1U(2,1)   + 0U(2,2)
   1 =  0U(1,1)   + 0U(1,2)   + 0U(2,1)   + 1U(2,2)

In matrix form:
  [ -0.9   0.8   0.1   0   ] [ U(1,1) ]   [ 0.04 ]
  [  0.1  -0.9   0     0.8 ] [ U(1,2) ] = [ 0.04 ]
  [  0     0     1     0   ] [ U(2,1) ]   [ -1   ]
  [  0     0     0     1   ] [ U(2,2) ]   [  1   ]

Solution:
  U(1,1) = 0.6603
  U(1,2) = 0.9178
  U(2,1) = -1
  U(2,2) = 1

Example (cont.) – second iteration, policy improvement
  (1,2): U=0.9178, R=-0.04   (2,2): U=+1, R=+1
  (1,1): U=0.6603, R=-0.04   (2,1): U=-1, R=-1

Π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),    (Up)
                    0.9U(1,1) + 0.1U(1,2),                 (Left)
                    0.9U(1,1) + 0.1U(2,1),                 (Down)
                    0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }    (Right)
       = argmax_a { 0.8x0.9178 + 0.1x0.6603 + 0.1x(-1),
                    0.9x0.6603 + 0.1x0.9178,
                    0.9x0.6603 + 0.1x(-1),
                    0.8x(-1) + 0.1x0.6603 + 0.1x0.9178 }
       = argmax_a { 0.7003, 0.6861, 0.4943, -0.6422 }
       = Up          → no need to update

Π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                 (Up)
                    0.9U(1,2) + 0.1U(1,1),                 (Left)
                    0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),     (Down)
                    0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }    (Right)
       = argmax_a { 0.9x0.9178 + 0.1x1,
                    0.9x0.9178 + 0.1x0.6603,
                    0.8x0.6603 + 0.1x1 + 0.1x0.9178,
                    0.8x1 + 0.1x0.9178 + 0.1x0.6603 }
       = argmax_a { 0.9260, 0.8921, 0.7200, 0.9578 }
       = Right       → no need to update

Policy: Π(1,1) = Up, Π(1,2) = Right.

Example (cont.)
No change in the policy was found – finished.
The optimal policy:
  π(1,1) = Up
  π(1,2) = Right
Policy iteration must terminate, since the number of distinct policies is finite.

Simplified policy iteration
Can focus on a subset of the states.
Find the utilities by simplified value iteration with the policy held fixed:
  U_{i+1}(s) ← R(s) + γ ∑_s' T(s, π(s), s') U_i(s')
or by policy improvement.
Guaranteed to converge under certain conditions on the initial policy and utility values.
(A code sketch of this idea appears at the end of these notes.)

Policy iteration properties
Linear equations – easy to solve.
Fast convergence in practice.
Proven to find an optimal policy.

Value vs. policy iteration
Which to use?
Policy iteration is more expensive per iteration.
In practice, policy iteration often requires fewer iterations.

Reinforcement Learning: An Introduction
Richard S. Sutton and Andrew G. Barto
A Bradford Book, The MIT Press, Cambridge, Massachusetts / London, England
http://www.cs.ualberta.ca/%7Esutton/book/ebook/the-book.html
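Putting the pieces together, here is an illustrative sketch of the whole policy-iteration loop for the 2x2 example. It reuses the hypothetical R, GAMMA, TERMINAL, ACTIONS and transitions definitions from the value-iteration sketch earlier, and it evaluates the policy with the simplified fixed-policy update from the "Simplified policy iteration" slide (a fixed number of sweeps) rather than by solving the linear system exactly; both choices are assumptions made for illustration, not the slides' prescribed implementation.

```python
import random

# Illustrative policy-iteration sketch (reuses the hypothetical R, GAMMA,
# TERMINAL, ACTIONS and transitions definitions from the earlier sketch).
# Policy evaluation uses the simplified fixed-policy update
#   U_{i+1}(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U_i(s')
# instead of an exact linear solve.

def expected_utility(a, s, U):
    """sum_s' T(s, a, s') U(s')"""
    return sum(p * U[s2] for p, s2 in transitions(s, a))

def policy_evaluation(pi, U, sweeps=20):
    """Approximate evaluation: a fixed number of fixed-policy Bellman sweeps."""
    for _ in range(sweeps):
        U = {s: R[s] if s in TERMINAL
                else R[s] + GAMMA * expected_utility(pi[s], s, U)
             for s in R}
    return U

def policy_iteration():
    pi = {s: random.choice(list(ACTIONS)) for s in R if s not in TERMINAL}
    U = dict(R)
    while True:
        U = policy_evaluation(pi, U)
        unchanged = True
        for s in pi:
            best = max(ACTIONS, key=lambda a: expected_utility(a, s, U))
            if expected_utility(best, s, U) > expected_utility(pi[s], s, U):
                pi[s] = best              # policy improvement
                unchanged = False
        if unchanged:
            return pi, U

print(policy_iteration())  # expected policy: (1,1) -> Up, (1,2) -> Right
```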