MDP2

Markov Decision Process
(MDP)
Ruti Glick
Bar-Ilan University
1
Policy
A policy is similar to a plan
generated ahead of time
Unlike traditional plans, it is not a sequence of
actions that the agent must execute
If there are failures in execution, the agent can still continue to
execute the policy
Prescribes an action for every state
Maximizes expected reward, rather than just
reaching a goal state
2
Utility and Policy
Utility
Computed for every state
“What is the use (utility) of this state for the
overall task?”
Policy
Complete mapping from states to actions
“In which state should I perform which action?”
policy: state → action
3
The optimal Policy
π*(s) = argmax_a Σ_s' T(s, a, s') U(s')
T(s, a, s') = probability of reaching state s' from state s by taking action a
U(s') = utility of state s'.
If we know the utilities, we can easily compute
the optimal policy.
The problem is to compute the correct utilities
for all states.
4
Finding π*
Value iteration
Policy iteration
5
Value iteration
Process:
Calculate the utility of each state
Use the values to select an optimal action
6
Bellman Equation
Bellman Equation:
U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
For example, in the 4x3 grid world (terminal states +1 and -1, START at (1,1)):
U(1,1) = -0.04 + γ max { 0.8U(1,2) + 0.1U(2,1) + 0.1U(1,1),   (Up)
0.9U(1,1) + 0.1U(1,2),   (Left)
0.9U(1,1) + 0.1U(2,1),   (Down)
0.8U(2,1) + 0.1U(1,2) + 0.1U(1,1) }   (Right)
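A minimal Python sketch of a single Bellman backup like the one above; the helper names (states, actions, T, R) are assumptions for illustration, not part of the slides.

def bellman_backup(s, U, R, T, actions, states, gamma):
    # U maps each state to its current utility estimate
    best = max(sum(T(s, a, s2) * U[s2] for s2 in states)   # expected utility of action a
               for a in actions(s))
    return R(s) + gamma * best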
7
Bellman Equation
properties
U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
n equations, one for each state
n variables (the utilities)
Problem
The max operator is not a linear operator
⇒ non-linear equations
Solution
Iterative approach
8
Value iteration algorithm
Start with arbitrary initial values for the utilities
Update the utility of each state from its neighbors:
U_i+1(s) ← R(s) + γ max_a Σ_s' T(s, a, s') U_i(s')
The iteration step is called a Bellman update
Repeat until convergence
9
Value Iteration properties
The equilibrium of the Bellman updates is a unique
solution!
It can be proved that the value iteration
process converges to it
We do not need exact values (approximately correct utilities already give a good policy)
10
Convergence
Value iteration is a contraction:
a function of one argument
that, when applied to two inputs, produces values that are
“closer together”
A contraction has only one fixed point
Each application brings the value closer to that fixed
point
(we are not going to prove this last point)
⇒ value iteration converges to the correct values
11
Value Iteration Algorithm
function VALUE_ITERATION (mdp) returns a utility function
input: mdp, an MDP with states S, transition model T,
reward function R, discount γ
local variables: U, U', vectors of utilities for states in S,
initially identical to R
repeat
U ← U'
for each state s in S do
U'[s] ← R[s] + γ max_a Σ_s' T(s, a, s') U[s']
until close-enough(U, U')
return U
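A sketch of this pseudocode in Python, under an assumed encoding of the MDP (S: list of states; A(s): actions available in s, empty for terminal states; T(s, a, s2): transition probability; R(s): reward). The names are illustrative, not from the slides.

def value_iteration(S, A, T, R, gamma, epsilon=1e-4):
    U1 = {s: R(s) for s in S}                      # U', initially identical to R
    while True:
        U = dict(U1)                               # U <- U'
        delta = 0.0
        for s in S:
            if A(s):                               # Bellman update for non-terminal states
                U1[s] = R(s) + gamma * max(
                    sum(T(s, a, s2) * U[s2] for s2 in S) for a in A(s))
            delta = max(delta, abs(U1[s] - U[s]))
        # close-enough: maximum change below the allowed error
        if delta <= (epsilon * (1 - gamma) / gamma if gamma < 1 else epsilon):
            return U1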
12
Example
Small version of our main example
2x2 world
The agent is placed in (1,1)
States (2,1), (2,2) are goal states
If blocked by the wall – stay in place
The rewards are written on the board
[Board: R(1,1)=-0.04, R(1,2)=-0.04, R(2,1)=-1, R(2,2)=+1]
13
Example (cont.)
First iteration
[Board, initial utilities U = R: U(1,2)=-0.04, U(2,2)=+1, U(1,1)=-0.04, U(2,1)=-1]
U’(1,1) = R(1,1) + γ max { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
0.9U(1,1) + 0.1U(1,2),
0.9U(1,1) + 0.1U(2,1),
0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2)}
= -0.04 + 1 x max { 0.8x(-0.04) + 0.1x(-0.04) + 0.1x(-1),
0.9x(-0.04) + 0.1x(-0.04),
0.9x(-0.04) + 0.1x(-1),
0.8x(-1) + 0.1x(-0.04) + 0.1x(-0.04)}
=-0.04 + max{ -0.136, -0.04, -0.136, -0.808}
=-0.08
U’(1,2) = R(1,2) + γ max {0.9U(1,2) + 0.1U(2,2),
0.9U(1,2) + 0.1U(1,1),
0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1)}
= -0.04 + 1 x max {0.9x(-0.04)+ 0.1x1,
0.9x(-0.04) + 0.1x(-0.04),
0.8x(-0.04) + 0.1x1 + 0.1x(-0.04),
0.8x1 + 0.1x(-0.04) + 0.1x(-0.04)}
=-0.04 + max{ 0.064, -0.04, 0.064, 0.792}
=0.752
Goal states remain the same
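The arithmetic of this first iteration can be checked with a short Python script. The table below encodes the 2x2 world exactly as in the equations above (the state tuples and dictionary layout are my own choice, not from the slides).

GAMMA = 1.0
R = {(1, 1): -0.04, (1, 2): -0.04, (2, 1): -1.0, (2, 2): +1.0}
U = dict(R)                                   # utilities start out equal to the rewards

# Outcome distribution for each (state, action): 0.8 intended, 0.1 to each side;
# moves into a wall leave the agent in place.
T = {
    ((1, 1), 'Up'):    {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1},
    ((1, 1), 'Left'):  {(1, 1): 0.9, (1, 2): 0.1},
    ((1, 1), 'Down'):  {(1, 1): 0.9, (2, 1): 0.1},
    ((1, 1), 'Right'): {(2, 1): 0.8, (1, 2): 0.1, (1, 1): 0.1},
    ((1, 2), 'Up'):    {(1, 2): 0.9, (2, 2): 0.1},
    ((1, 2), 'Left'):  {(1, 2): 0.9, (1, 1): 0.1},
    ((1, 2), 'Down'):  {(1, 1): 0.8, (2, 2): 0.1, (1, 2): 0.1},
    ((1, 2), 'Right'): {(2, 2): 0.8, (1, 2): 0.1, (1, 1): 0.1},
}

def backup(s):
    return R[s] + GAMMA * max(sum(p * U[s2] for s2, p in T[(s, a)].items())
                              for a in ('Up', 'Left', 'Down', 'Right'))

print(round(backup((1, 1)), 4))   # -0.08
print(round(backup((1, 2)), 4))   # 0.752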
14
Example (cont.)
Second iteration
[Board after the first iteration: U(1,2)=0.752, U(2,2)=+1, U(1,1)=-0.08, U(2,1)=-1]
U'(1,1) = R(1,1) + γ max { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
0.9U(1,1) + 0.1U(1,2),
0.9U(1,1) + 0.1U(2,1),
0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2)}
= -0.04 + 1 x max { 0.8x(0.752) + 0.1x(-0.08) + 0.1x(-1),
0.9x(-0.08) + 0.1x(0.752),
0.9x(-0.08) + 0.1x(-1),
0.8x(-1) + 0.1x(-0.08) + 0.1x(0.752)}
=-0.04 + max{ 0.4936, 0.0032, -0.172, -0.7328}
=0.4536
U’(1,2) = R(1,2) + γ max {0.9U(1,2) + 0.1U(2,2),
0.9U(1,2) + 0.1U(1,1),
0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1)}
= -0.04 + 1 x max {0.9x(0.752)+ 0.1x1,
0.9x(0.752) + 0.1x(-0.08),
0.8x(-0.08) + 0.1x1 + 0.1x(0.752),
0.8x1 + 0.1x(0.752) + 0.1x(-0.08)}
=-0.04 + max{ 0.7768, 0.6688, 0.1112, 0.8672}
= 0.8272
15
Example (cont.)
Third iteration
[Board after the second iteration: U(1,2)=0.8272, U(2,2)=+1, U(1,1)=0.4536, U(2,1)=-1]
U'(1,1) = R(1,1) + γ max { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
0.9U(1,1) + 0.1U(1,2),
0.9U(1,1) + 0.1U(2,1),
0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2)}
= -0.04 + 1 x max { 0.8x(0.8272) + 0.1x(0.4536) + 0.1x(-1),
0.9x(0.4536) + 0.1x(0.8272),
0.9x(0.4536) + 0.1x(-1),
0.8x(-1) + 0.1x(0.4536) + 0.1x(0.8272)}
=-0.04 + max{ 0.6071, 0.491, 0.3082, -0.6719}
=0.5676
U(1,2) = R(1,2) + γ max {0.9U(1,2) + 0.1U(2,2),
0.9U(1,2) + 0.1U(1,1),
0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1)}
= -0.04 + 1 x max {0.9x(0.8272)+ 0.1x1,
0.9x(0.8272) + 0.1x(0.4536),
0.8x(0.4536) + 0.1x1 + 0.1x(0.8272),
0.8x1 + 0.1x(0.8272) + 0.1x(0.4536)}
=-0.04 + max{ 0.8444, 0.7898, 0.5456, 0.9281}
= 0.8881
16
Example (cont.)
Continue to next iteration…
[Board after the third iteration: U(1,2)=0.8881, U(2,2)=+1, U(1,1)=0.5676, U(2,1)=-1]
Finish if “close enough”
Here the largest change was 0.114 – close enough
17
“close enough”
We will not go deeply into this issue!
Different possibilities for detecting convergence:
RMS error – the root mean square error of the utility
values compared to the correct values
RMS(U, U') = √( (1/|S|) Σ_{i=1..|S|} (U(i) - U'(i))² )
Require RMS(U, U') < ε
where ε is the maximum error allowed in the utility of any
state in an iteration
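A small helper for this test, assuming utilities are stored as dictionaries keyed by state (the function name is illustrative):

from math import sqrt

def rms_error(U, U_prev):
    # root-mean-square difference between two utility vectors over the same states
    return sqrt(sum((U[s] - U_prev[s]) ** 2 for s in U) / len(U))

# stop once rms_error(U, U_prev) < epsilon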
18
“close enough” (cont.)
Policy loss:
the difference between the expected utility obtained using
the policy and the expected utility obtained by the
optimal policy
Stop when || U_i+1 – U_i || < ε (1-γ) / γ
where: ||U|| = max_s |U(s)|
ε – the maximum error allowed in the utility of any state
in an iteration
γ – the discount factor
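A corresponding stopping test in Python, assuming 0 < γ < 1 and the same dictionary representation as before (names illustrative):

def max_norm(U, U_prev):
    return max(abs(U[s] - U_prev[s]) for s in U)

def close_enough(U, U_prev, epsilon, gamma):
    # ||U_{i+1} - U_i|| < eps * (1 - gamma) / gamma
    return max_norm(U, U_prev) < epsilon * (1 - gamma) / gamma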
19
Finding the policy
The true utilities have been found
Now search for the optimal policy:
For each s in S do
π[s] ← argmax_a Σ_s' T(s, a, s') U(s')
Return π
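In Python this extraction step might look as follows (same assumed MDP encoding as earlier; terminal states get no action):

def extract_policy(U, S, A, T):
    # greedy one-step look-ahead with respect to the utilities U
    pi = {}
    for s in S:
        acts = A(s)
        if not acts:
            pi[s] = None
        else:
            pi[s] = max(acts, key=lambda a: sum(T(s, a, s2) * U[s2] for s2 in S))
    return pi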
20
Example (cont.)
Find the optimal policy
Π(1,1) = argmaxa { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
0.9U(1,1) + 0.1U(1,2),
0.9U(1,1) + 0.1U(2,1),
0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2)}
= argmaxa { 0.8x(0.8881) + 0.1x(0.5676) + 0.1x(-1),
0.9x(0.5676) + 0.1x(0.8881),
0.9x(0.5676) + 0.1x(-1),
0.8x(-1) + 0.1x(0.5676) + 0.1x(0.8881)}
= argmaxa { 0.6672, 0.5996, 0.4108, -0.6512}
= Up
Π(1,2) = argmaxa { 0.9U(1,2) + 0.1U(2,2),
0.9U(1,2) + 0.1U(1,1),
0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1)}
= argmaxa { 0.9x(0.8881)+ 0.1x1,
0.9x(0.8881) + 0.1x(0.5676),
0.8x(0.5676) + 0.1x1 + 0.1x(0.8881),
0.8x1 + 0.1x(0.8881) + 0.1x(0.5676)}
= argmaxa {0.8993, 0.8561, 0.6429, 0.9456}
= Right
(the four branches in each argmax are, in order: Up, Left, Down, Right)
[Board: U(1,2)=0.8881, U(2,2)=+1, U(1,1)=0.5676, U(2,1)=-1]
21
Summary – value iteration
1. The given environment (the 4x3 grid with terminal states +1 and -1)
2. Calculate utilities (top row: 0.812, 0.868, 0.912, +1; middle row: 0.762, 0.660, -1; bottom row: 0.705, 0.655, 0.611, 0.388)
3. Extract the optimal policy
4. Execute actions
22
Example - convergence
[Plot: error of the utility estimates vs. number of iterations; the allowed error is marked]
23
Policy iteration
Pick a policy, then calculate the utility of each state
given that policy (policy evaluation step)
Update the policy at each state using the
utilities of the successor states (policy improvement step)
Repeat until the policy stabilizes
24
Policy iteration
In each step, for each state:
Policy evaluation
Given the policy π_i,
calculate the utility U_i of each state if π_i were to be
executed
Policy improvement
Calculate a new policy π_i+1
based on π_i:
π_i+1[s] ← argmax_a Σ_s' T(s, a, s') U_πi(s')
25
Policy iteration Algorithm
function POLICY_ITERATION (mdp) returns a policy
input: mdp, an MDP with states S, transition model T
local variables: U, a vector of utilities for states in S, initially
identical to R
π, a policy, a vector indexed by states, initially random
repeat
U ← Policy-Evaluation(π, mdp)
unchanged? ← true
for each state s in S do
if max_a Σ_s' T(s, a, s') U[s'] > Σ_s' T(s, π[s], s') U[s'] then
π[s] ← argmax_a Σ_s' T(s, a, s') U[s']
unchanged? ← false
end
until unchanged?
return π
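A Python sketch of this algorithm under the same assumed MDP encoding as before. Policy-Evaluation is realized here with repeated simplified Bellman updates; an exact linear-system solve, as in the worked example that follows, would work as well.

import random

def policy_evaluation(pi, S, T, R, gamma, sweeps=50):
    # repeatedly apply the simplified Bellman update under the fixed policy pi
    U = {s: R(s) for s in S}
    for _ in range(sweeps):
        new_U = {}
        for s in S:
            if pi[s] is None:                      # terminal state
                new_U[s] = R(s)
            else:
                new_U[s] = R(s) + gamma * sum(T(s, pi[s], s2) * U[s2] for s2 in S)
        U = new_U
    return U

def policy_iteration(S, A, T, R, gamma):
    pi = {s: (random.choice(A(s)) if A(s) else None) for s in S}    # initially random
    while True:
        U = policy_evaluation(pi, S, T, R, gamma)
        unchanged = True
        for s in S:
            if not A(s):
                continue
            best = max(A(s), key=lambda a: sum(T(s, a, s2) * U[s2] for s2 in S))
            if sum(T(s, best, s2) * U[s2] for s2 in S) > sum(T(s, pi[s], s2) * U[s2] for s2 in S):
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi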
26
Example
Back to our last example…
2x2 world
The agent is placed in (1,1)
States (2,1), (2,2) are goal states
If blocked by the wall – stay in place
The rewards are written on the board
Initial policy: Up (for every state)
[Board: R(1,1)=-0.04, R(1,2)=-0.04, R(2,1)=-1, R(2,2)=+1]
27
Example (cont.)
First iteration – policy evaluation
U(1,1) = R(1,1) + γ x (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
U(1,2) = R(1,2) + γ x (0.9U(1,2) + 0.1U(2,2))
U(2,1) = R(2,1)
U(2,2) = R(2,2)
U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
U(1,2) = -0.04 + 0.9U(1,2) + 0.1U(2,2)
U(2,1) = -1
U(2,2) = 1
0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
0.04 = 0U(1,1) - 0.1U(1,2) + 0U(2,1) + 0.1U(2,2)
-1 = 0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
1 = 0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)
In matrix form:
[ -0.9   0.8   0.1   0   ] [ U(1,1) ]   [  0.04 ]
[  0    -0.1   0     0.1 ] [ U(1,2) ] = [  0.04 ]
[  0     0     1     0   ] [ U(2,1) ]   [ -1    ]
[  0     0     0     1   ] [ U(2,2) ]   [  1    ]
U(1,1) = 0.3778
U(1,2) = 0.6
U(2,1) = -1
U(2,2) = 1
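The same system can be solved directly with a linear-algebra routine, for example (assuming numpy is available):

import numpy as np

A = np.array([[-0.9,  0.8, 0.1, 0.0],
              [ 0.0, -0.1, 0.0, 0.1],
              [ 0.0,  0.0, 1.0, 0.0],
              [ 0.0,  0.0, 0.0, 1.0]])
b = np.array([0.04, 0.04, -1.0, 1.0])
print(np.round(np.linalg.solve(A, b), 4))   # approx [ 0.3778  0.6  -1.  1. ]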
[Board, initial utilities U = R: U(1,2)=-0.04, U(2,2)=+1, U(1,1)=-0.04, U(2,1)=-1]
Policy
Π(1,1) = Up
Π(1,2) = Up
http://www.math.ncsu.edu/ma114/PDF/1.4.pdf
28
Example (cont.)
First iteration – policy improvement
Π(1,1) = argmaxa { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
0.9U(1,1) + 0.1U(1,2),
0.9U(1,1) + 0.1U(2,1),
0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2)}
= argmaxa { 0.8x(0.6) + 0.1x(0.3778) + 0.1x(-1),
0.9x(0.3778) + 0.1x(0.6),
0.9x(0.3778) + 0.1x(-1),
0.8x(-1) + 0.1x(0.3778) + 0.1x(0.6)}
= argmaxa { 0.4178, 0.4, 0.24, -0.7022}
= Up → don’t have to update
Π(1,2) = argmaxa { 0.9U(1,2) + 0.1U(2,2),
0.9U(1,2) + 0.1U(1,1),
0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1)}
= argmaxa { 0.9x(0.6)+ 0.1x1,
0.9x(0.6) + 0.1x(0.3778),
0.8x(0.3778) + 0.1x1 + 0.1x(0.6),
0.8x1 + 0.1x(0.6) + 0.1x(0.3778)}
= argmaxa { 0.64, 0.5778, 0.4622, 0.8978}
= Right → update
(the four branches in each argmax are, in order: Up, Left, Down, Right)
[Board after evaluation: U(1,2)=0.6, U(2,2)=+1, U(1,1)=0.3778, U(2,1)=-1]
Policy before improvement:
Π(1,1) = Up
Π(1,2) = Up
29
Example (cont.)
Second iteration – policy evaluation
U(1,1) = R(1,1) + γ x (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
U(1,2) = R(1,2) + γ x (0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1))
U(2,1) = R(2,1)
U(2,2) = R(2,2)
[Board after the first iteration: U(1,2)=0.6, U(2,2)=+1, U(1,1)=0.3778, U(2,1)=-1]
Policy being evaluated: Π(1,1) = Up, Π(1,2) = Right
U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
U(1,2) = -0.04 + 0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1)
U(2,1) = -1
U(2,2) = 1
0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
0.04 = 0.1U(1,1) - 0.9U(1,2) + 0U(2,1) + 0.8U(2,2)
-1 = 0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
1 = 0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)
In matrix form:
[ -0.9   0.8   0.1   0   ] [ U(1,1) ]   [  0.04 ]
[  0.1  -0.9   0     0.8 ] [ U(1,2) ] = [  0.04 ]
[  0     0     1     0   ] [ U(2,1) ]   [ -1    ]
[  0     0     0     1   ] [ U(2,2) ]   [  1    ]
U(1,1) = 0.6603
U(1,2) = 0.9178
U(2,1) = -1
U(2,2) = 1
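Again, the solution can be checked with a linear solver (assuming numpy):

import numpy as np

A = np.array([[-0.9,  0.8, 0.1, 0.0],
              [ 0.1, -0.9, 0.0, 0.8],
              [ 0.0,  0.0, 1.0, 0.0],
              [ 0.0,  0.0, 0.0, 1.0]])
b = np.array([0.04, 0.04, -1.0, 1.0])
print(np.round(np.linalg.solve(A, b), 4))   # approx [ 0.6603  0.9178  -1.  1. ]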
30
Example (cont.)
Second iteration – policy improvement
Π(1,1) = argmaxa { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
0.9U(1,1) + 0.1U(1,2),
0.9U(1,1) + 0.1U(2,1),
0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2)}
= argmaxa { 0.8x(0.9178) + 0.1x(0.6603) + 0.1x(-1),
0.9x(0.6603) + 0.1x(0.9178),
0.9x(0.6603) + 0.1x(-1),
0.8x(-1) + 0.1x(0.6603) + 0.1x(0.9178)}
= argmaxa { 0.7003, 0.686, 0.4943, -0.6422}
= Up → don’t have to update
Π(1,2) = argmaxa { 0.9U(1,2) + 0.1U(2,2),
0.9U(1,2) + 0.1U(1,1),
0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1)}
= argmaxa { 0.9x(0.9178) + 0.1x1,
0.9x(0.9178) + 0.1x(0.6603),
0.8x(0.6603) + 0.1x1 + 0.1x(0.9178),
0.8x1 + 0.1x(0.9178) + 0.1x(0.6603)}
= argmaxa { 0.926, 0.892, 0.72, 0.9578}
= Right → don’t have to update
(the four branches in each argmax are, in order: Up, Left, Down, Right)
[Board after evaluation: U(1,2)=0.9178, U(2,2)=+1, U(1,1)=0.6603, U(2,1)=-1]
Policy
Π(1,1) = Up
Π(1,2) = Right
31
Example (cont.)
No change in the policy was found
⇒ finish
The optimal policy:
π(1,1) = Up
π(1,2) = Right
Policy iteration must terminate, since the number of
distinct policies is finite
32
Simplified policy iteration
Can focus on a subset of the states
Find utilities by simplified value iteration:
U_i+1(s) = R(s) + γ Σ_s' T(s, π(s), s') U_i(s')
OR
a policy improvement step
Guaranteed to converge under certain
conditions on the initial policy and utility values
33
Policy Iteration properties
Linear equations – easy to solve
Fast convergence in practice
The resulting policy is proved to be optimal
34
Value vs. Policy Iteration
Which to use:
Policy iteration is more expensive per
iteration
In practice, Policy iteration requires fewer
iterations
35
Reinforcement Learning:
An Introduction
http://www.cs.ualberta.ca/%7Esutton/book/ebook/the-book.html
Richard S. Sutton and Andrew G. Barto
A Bradford Book
The MIT Press
Cambridge, Massachusetts
London, England
36