CS 5100
MDPs
Robert Platt
Northeastern University
Material adapted from:
1. Russell & Norvig, AIMA
2. Dan Klein & Pieter Abbeel, UC Berkeley CS 188
3. Lawson Wong, CS 5100
4. Chris Amato, CS 5100
Markov Decision Process (MDP): Grid-world example
MDP = (S, A, T, R, γ)
[Figure: grid world; two terminal cells carry rewards +1 and -1]
States S:
– Each cell is a state
Actions A: Left, right, up, down
– Take one action per time step
Transition function T(s, a, s') = P(s' | s, a):
– Actions are stochastic:
the agent only goes in the intended direction 80% of the time
(10% of the time it goes in each perpendicular direction)
– If it hits a wall, no move happens
Rewards R(s) – or R(s, a, s'):
– The agent receives these rewards in these cells
– The goal of the agent is to maximize reward
– Rewards can be negative
Discount factor γ (gamma):
– Trades off current vs. future rewards
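To make the transition model concrete, here is a minimal Python sketch (a hypothetical helper, not course code; the grid size and wall placement are illustrative, while the 0.8/0.1/0.1 noise and the wall behavior follow the slide):

    # Grid-world transition model: intended direction with prob. 0.8,
    # each perpendicular direction with prob. 0.1; bumping a wall = stay put.
    MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
    PERP = {'up': ('left', 'right'), 'down': ('left', 'right'),
            'left': ('up', 'down'), 'right': ('up', 'down')}

    def transition(s, a, width=4, height=3, walls=frozenset({(1, 1)})):
        """Return a dict {s': P(s' | s, a)} for the noisy grid world."""
        probs = {}
        for direction, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
            dx, dy = MOVES[direction]
            nxt = (s[0] + dx, s[1] + dy)
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height) or nxt in walls:
                nxt = s  # hitting a wall (or the border): no move happens
            probs[nxt] = probs.get(nxt, 0.0) + p
        return probs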
Markov Decision Process (MDP)
Deterministic
– Same action always has the same outcome (probability 1.0)
Stochastic
– Same action could have different outcomes
(e.g., probability 0.8 for the intended cell,
0.1 for each perpendicular cell)
Markov Decision Process (MDP)
Same action could have different outcomes:
[Figure: from state s_1, the action reaches s_3 with probability 0.8,
and s_2 or s_4 with probability 0.1 each]
Transition function at s_1:
  s'            s_2   s_3   s_4
  T(s_1,a,s')   0.1   0.8   0.1
Markov Decision Process (MDP)
Technically, an MDP is a 5-tuple (S, A, T, R, γ)
An MDP (Markov Decision Process)
defines a stochastic control problem:
State set: S
Action set: A
Transition function: T(s, a, s') = P(s' | s, a)
– Probability of going from s to s'
when executing action a
Reward function: R(s)
or, more generally: R(s, a, s')
Discount factor: γ ∈ [0, 1]
What is the objective?
Objective: Calculate a strategy for acting that
maximizes the expected return (expected sum of future rewards)
– We will calculate a policy that will tell us how to act
Markov Decision Process
Markov property / assumption:
Given the current state,
the future and the past
are independent
Markov decision process:
Transition function is Markovian
Similar to the situation in search:
the successor function only depended
on the current state (not the history)
[Figure: states at t=1 and t=2, linked by actions / transitions;
portrait of Andrey Markov (1856-1922)]
Can you give an example of a Markovian process?
How about a non-Markovian process?
What is a policy?
A policy tells the agent what action to execute as a function of state:
Deterministic policy:
– Agent always executes the same action from a given state
Stochastic policy:
– Agent selects an action to execute by drawing from a
probability distribution encoded by the policy
For MDPs, there always exists an optimal policy that is deterministic
– We will see stochastic policies when learning MDPs
(reinforcement learning)
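As a data-structure sketch (a hypothetical representation, not from the slides), a deterministic policy can be a plain mapping from states to actions, and a stochastic policy a mapping from states to action distributions:

    import random

    # Deterministic policy: one action per state.
    det_policy = {(0, 0): 'right', (1, 0): 'right', (2, 0): 'up'}

    # Stochastic policy: a distribution over actions per state.
    sto_policy = {(0, 0): {'right': 0.9, 'up': 0.1}}

    def act(policy, s):
        choice = policy[s]
        if isinstance(choice, dict):  # stochastic: sample from the distribution
            actions, probs = zip(*choice.items())
            return random.choices(actions, weights=probs)[0]
        return choice  # deterministic: always the same action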
What is a policy?
Policies are more general than plans
Plan:
– Specifies a sequence of actions to execute
– Cannot react to an unexpected outcome
Policy:
– Tells you what action to take from any state
A plan might not be optimal:
[Figure: two-step example with a stochastic outcome; the best plans
achieve U(r,r) = 15, U(r,b) = 15, U(b,r) = 20, U(b,b) = 20,
while the optimal policy, which reacts to the stochastic outcome
of the first action, can achieve U = 30]
Policy for grid-world example
[Figure: grid world showing the optimal policy as one arrow per cell;
terminal states +1 and -1]
π((4,1)) = left
R(s) = -0.04
for all non-terminal s
Objective: Calculate a strategy for acting that
maximizes the expected return (expected sum of future rewards)
Choosing a reward function
A few possibilities:
– Place rewards on goal states /
undesirable states (e.g., +1 and -1)
– Negative reward everywhere
except terminal states
– Gradually increasing reward
as you approach the goal
(reward shaping)
In general:
– Reward can be whatever you want
– Caution! Reward hacking
Decision theory
Decision theory = Probability theory + Utility theory
Given a single-step decision problem,
choose the action that maximizes expected utility:
Expected utility:
U([p_1, s_1; ...; p_n, s_n]) = Σ_i p_i U(s_i)
Max expected utility:
a* = argmax_a Σ_i p_i U(s_i)
What about sequential decision problems?
Maximize the expected return (the expected utility of a sequence of states)
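Before moving on, a toy numeric check of the single-step MEU rule above (made-up probabilities and utilities, purely for illustration):

    def expected_utility(lottery):
        """U([p_1,s_1; ...; p_n,s_n]) = sum_i p_i * U(s_i), with U(s_i) given."""
        return sum(p * u for p, u in lottery)

    # Hypothetical lotteries: action -> [(probability, utility of outcome)]
    outcomes = {'a1': [(0.8, 10), (0.2, -5)],   # EU = 0.8*10 + 0.2*(-5) = 7.0
                'a2': [(1.0, 5)]}               # EU = 5.0
    best = max(outcomes, key=lambda a: expected_utility(outcomes[a]))
    print(best)  # 'a1'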
Utility theory strikes again
What preferences should an agent have over
reward sequences?
More or less? [1, 2, 2] or [2, 3, 4]
Now or later? [0, 0, 1] or [1, 0, 0]
Intuitively:
- Prefer a larger sum of rewards
- Prefer rewards now over rewards in the future
Solution: Sum of discounted rewards (for 0 < γ < 1)
2 + γ·3 + γ²·4 > 1 + γ·2 + γ²·2
1 + γ·0 + γ²·0 > 0 + γ·0 + γ²·1
Utility theory strikes again
What preferences should an agent have over
reward sequences?
Theorem: If we assume stationary preferences:
[r, r_1, r_2, ...] ≻ [r, r'_1, r'_2, ...]  ⇔  [r_1, r_2, ...] ≻ [r'_1, r'_2, ...]
Then there are only two ways to define utilities:
Additive utility: U([s_0, s_1, s_2, ...]) = R(s_0) + R(s_1) + R(s_2) + ...
Discounted utility: U([s_0, s_1, s_2, ...]) = R(s_0) + γ R(s_1) + γ² R(s_2) + ...
More generally: For γ ∈ [0, 1],
γ = 0: Single-step utility
γ = 1: Additive utility
(this is why we use
additive edge costs in search!)
0 < γ < 1: Discounted utility
Often γ = 0.9, 0.95, 0.99
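As a quick numeric check (a throwaway sketch, not course code), the discounted sums above can be compared for a concrete γ:

    def discounted_return(rewards, gamma):
        """Sum of discounted rewards: r_0 + γ r_1 + γ² r_2 + ..."""
        return sum(r * gamma**t for t, r in enumerate(rewards))

    gamma = 0.9
    print(discounted_return([2, 3, 4], gamma))  # ≈ 7.94
    print(discounted_return([1, 2, 2], gamma))  # ≈ 4.42
    print(discounted_return([1, 0, 0], gamma))  # 1.0
    print(discounted_return([0, 0, 1], gamma))  # ≈ 0.81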
Discount factor
Given:
[Figure: states a through e in a row; rewards available at the exit states a and e]
– Actions: East, West, and Exit (only available in exit states a, e)
– Transitions: deterministic
Q1: For γ = 1, what is the optimal policy?
Q2: For γ = 0.1, what is the optimal policy?
Q3: For which γ are West and East equally good when in state d?
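The figure's reward values are not preserved in the text, so as a purely illustrative sketch for Q3, assume exit rewards of 10 at a and 1 at e (hypothetical numbers). From d, West reaches a in 3 steps and East reaches e in 1 step, so the two are equally good when 10·γ³ = 1·γ:

    # Hypothetical exit rewards: 10 at state a, 1 at state e.
    # From d: West earns 10*γ^3, East earns 1*γ^1. Setting them equal:
    # 10*γ^3 = γ  =>  γ^2 = 0.1  =>  γ = 0.1**0.5 ≈ 0.316
    gamma = 0.1 ** 0.5
    print(10 * gamma**3, gamma)  # both sides ≈ 0.316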
Discount factor
What preferences should an agent have over
reward sequences?
For γ = 1, policies can run forever!
- Can accrue infinite utility
- Want to avoid comparing ∞ > ∞
Solutions for avoiding infinite utilities:
- Finite horizon (the episode ends after T time steps)
- Terminal states (but must ensure that the
agent eventually reaches one)
- Discount factor γ < 1:
the return Σ_t γ^t R(s_t) is bounded by R_max / (1 - γ)
Value function
Utility for a specific sequence of states:
U([s_0, s_1, s_2, ...]) = R(s_0) + γ R(s_1) + γ² R(s_2) + ...
But the environment is stochastic!
- Cannot predict the exact future sequence of states
Solution: Maximum Expected Utility (MEU)
Take an expectation (over future outcomes):
V(s_0) = E[ Σ_t γ^t R(s_t) ]
Value function
Value function: Expected return starting in state s_0
V(s_0) = E[ Σ_t γ^t R(s_t) ]
The above quantity is ill-defined!
We are missing something...
Technically it is a conditional expectation
(given s_0, a_0, a_1, a_2, ...)
Where do the actions come from?
Value function
Value function: Expected return starting in state s_0
and following policy π thereafter:
V^π(s_0) = E[ Σ_t γ^t R(s_t) ]
Technically a conditional expectation
(given s_0, a_0 = π(s_0), a_1 = π(s_1), a_2 = π(s_2), ...)
Recursive definition:
V^π(s_0) = R(s_0) + γ Σ_{s_1} T(s_0, π(s_0), s_1) V^π(s_1)
Usually written as:
V^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') V^π(s')
= current reward + (discounted) (expected) future return
How to find V^π, if you know policy π?
The characterization above gives
|S| linear equations (one per state)
with |S| unknowns (V^π(s) for each state s)
This applies for any policy, including the optimal one: π*
Optimal value function
Value function: Expected return starting in state s
and following policy π* thereafter
The optimal value function should maximize the expected return:
V*(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') V*(s')
Bellman equation (1957)
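The right-hand side of the Bellman equation is a one-step backup. A minimal sketch (hypothetical names; T(s, a) returns a dict {s': prob} as in the earlier grid-world sketch):

    def bellman_backup(s, V, R, T, actions, gamma):
        """One application of the Bellman optimality operator at state s."""
        return R(s) + gamma * max(
            sum(p * V[s2] for s2, p in T(s, a).items())
            for a in actions
        )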
Optimal value function
[Figure: optimal values V*(s) in the grid world, with
R(s) = 0 for non-terminal s and γ = 0.9]
Optimal value function
Interpreting what the equation says using expectimax:
[Figure: one ply of an expectimax tree rooted at V*(s)
- Max node (you): obtain reward R(s), add the max over the children,
  here the actions b and c
- Exp nodes (nature): compute the expected future return (value) of
  taking an action, e.g., T(s, a=b, s'=w) V*(s'=w) + T(s, a=b, s'=x) V*(s'=x)
  for action b, and likewise over s'=y, s'=z for action c
- Multiply the expected future return by the discount factor γ
- Leaf V*(s'): the value (expected future return) if you acted
  optimally from the next step onward]
An algorithm for computing the value function?
Value iteration
Init: V_0(s) ← 0 for all s
Repeat until convergence:
For each state s:
V_{k+1}(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') V_k(s')
This performs 1 ply
of expectimax search
Guaranteed to converge to V*
- Convergence: Typically max_s |V_{k+1}(s) - V_k(s)| < ε
- V_k(s) is equivalent to the expectimax value after (k-1) plys
V_1(s) is equivalent to the reward function R(s)
(similar to the terminal utility, rather than 1 ply of search)
- V_k is the optimal value function for horizon (k-1)
- i.e., (k-1) actions left
- k states left, where s is the first state
- For more general rewards R(s, a, s'), do:
V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V_k(s')]
- Complexity of each iteration: O(|S|² |A|)
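Putting the update into code, here is a self-contained sketch of value iteration (hypothetical helper names; T(s, a) returns a dict {s': P(s'|s,a)} as in the grid-world sketch above, actions(s) returns the actions available in s, and terminal-state handling is elided):

    def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
        """Iterate V_{k+1}(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') V_k(s')."""
        V = {s: 0.0 for s in states}            # V_0(s) = 0
        while True:
            V_new = {}
            for s in states:
                best = max(sum(p * V[s2] for s2, p in T(s, a).items())
                           for a in actions(s))
                V_new[s] = R(s) + gamma * best
            if max(abs(V_new[s] - V[s]) for s in states) < eps:
                return V_new                    # ||V_{k+1} - V_k||_∞ < ε
            V = V_new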
Value iteration
[Animation frames: value iteration running on the grid world with
R(s) = 0 for non-terminal s and γ = 0.9; the repeated frames show
the values being updated over successive iterations]
Value iteration
Init: V_0(s) ← 0 for all s
Repeat until convergence:
For each state s:
V_{k+1}(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') V_k(s')
This performs 1 ply
of expectimax search
How to obtain the policy?
π(s) = argmax_a Σ_{s'} T(s, a, s') V(s')
Notice anything interesting about the policy during VI?
(Look at the grid-world example again)
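A matching sketch for extracting the greedy policy from a value function (same hypothetical T/actions interface as above):

    def extract_policy(states, actions, T, V):
        """π(s) = argmax_a Σ_{s'} T(s,a,s') V(s')."""
        return {s: max(actions(s),
                       key=lambda a: sum(p * V[s2]
                                         for s2, p in T(s, a).items()))
                for s in states}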
Policy iteration
Notice anything interesting about the policy during VI?
(Look at the grid-world example again)
The policy converges much faster than the value function
- Getting the exact values does not matter
if the strategy for acting is the same
In value iteration, we iteratively compute the value function
In policy iteration, we iteratively compute the policy
Summary
Both value iteration and policy iteration
compute the same thing (optimal values/policy)
Both are dynamic programming algorithms for MDPs
Value iteration:
- Every iteration updates both the values and (implicitly) the policy
- We do not track the policy, but it is implicitly computed by the max
Init: V_0(s) ← 0 for all s
Repeat until convergence:
For each state s:
V_{k+1}(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') V_k(s')
Extract the final policy:
π(s) = argmax_a Σ_{s'} T(s, a, s') V(s')
Appendix: Policy Iteration
Policy iteration
In value iteration, we iteratively compute the value function
In policy iteration, we iteratively compute the policy
Init: π_0 ← an arbitrary policy
Repeat until convergence:
Policy evaluation:
For each state s, compute V^{π_k}(s)
Policy improvement:
For each state s,
π_{k+1}(s) ← argmax_a Σ_{s'} T(s, a, s') V^{π_k}(s')
- Convergence: Policy no longer changes
Guaranteed to converge to π*
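A compact sketch of the full loop (hedged as before: hypothetical interface, and evaluation done approximately via repeated fixed-policy backups rather than an exact linear solve):

    def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=50):
        pi = {s: next(iter(actions(s))) for s in states}   # arbitrary init
        while True:
            # Policy evaluation: V(s) = R(s) + γ Σ_{s'} T(s,π(s),s') V(s')
            V = {s: 0.0 for s in states}
            for _ in range(eval_sweeps):
                V = {s: R(s) + gamma * sum(p * V[s2]
                                           for s2, p in T(s, pi[s]).items())
                     for s in states}
            # Policy improvement: greedy one-step lookahead on V
            new_pi = {s: max(actions(s),
                             key=lambda a: sum(p * V[s2]
                                               for s2, p in T(s, a).items()))
                      for s in states}
            if new_pi == pi:      # converged: policy no longer changes
                return pi, V
            pi = new_pi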
Policy evaluation
Policy evaluation: Compute V^π(s) for all states s
Recall: Value function
Value function: Expected return starting in state s_0
and following policy π thereafter
Recursive definition:
V^π(s_0) = R(s_0) + γ Σ_{s_1} T(s_0, π(s_0), s_1) V^π(s_1)
Usually written as:
V^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') V^π(s')
How to find V^π, if you know policy π?
The characterization above gives
|S| linear equations (one per state)
with |S| unknowns (V^π(s) for each state s)
Policy evaluation
Policy evaluation: For each state s, compute
V^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') V^π(s')
How to find V^π, if you know policy π?
The characterization above gives
|S| linear equations (one per state)
with |S| unknowns (V^π(s) for each state s)
Just a standard linear system of |S| equations (MATLAB: v = A\b)
Approximate policy evaluation
Policy evaluation: For each state s, compute
V^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') V^π(s')
The characterization above gives |S| linear equations
with |S| unknowns; just a standard linear system (MATLAB: v = A\b)
Or: Run a few (a small number!) simplified value iteration steps:
V_{i+1}(s) ← R(s) + γ Σ_{s'} T(s, π(s), s') V_i(s')
(no max over actions, since the policy is fixed)
Policy iteration with approximate policy evaluation
is also known as modified policy iteration
Summary
Both value iteration and policy iteration
compute the same thing (optimal values/policy)
Value iteration:
- Every iteration updates both the values and (implicitly) the policy
- We do not track the policy, but it is implicitly computed by the max
Policy iteration:
- Every iteration does a quick update of the values using the
fixed policy (no max, just compute for a single action)
- After the policy is evaluated, choose a new policy (max action)
- The new policy will be better
Both are dynamic programming algorithms for MDPs
Summary
Value iteration
Init: V_0(s) ← 0 for all s
Repeat until convergence:
For each state s:
V_{k+1}(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') V_k(s')
Policy iteration
Init: π_0 ← an arbitrary policy
Repeat until convergence:
Policy evaluation: For each state s, compute V^{π_k}(s)
Policy improvement: For each state s,
π_{k+1}(s) ← argmax_a Σ_{s'} T(s, a, s') V^{π_k}(s')