Automated Planning
Planning under Uncertainty
Jonas Kvarnström
Automated Planning Group
Department of Computer and Information Science
Linköping University
jonas.kvarnstrom@liu.se – 2015

Restricted State Transition System

Recall the restricted state transition system Σ = (S, A, γ):
• S = { s0, s1, … }: finite set of world states
• A = { a0, a1, … }: finite set of actions
• γ: S × A → 2^S: state transition function, where |γ(s,a)| ≤ 1
  ▪ If γ(s,a) = {s'}, then whenever you are in state s, you can execute action a and you end up in state s'
  ▪ If γ(s,a) = ∅ (the empty set), then a cannot be executed in s
• Often we also add a cost function c: S × A → ℝ

Example:
• S = { s0, s1, … }, A = { take1, put1, … }, γ: S × A → 2^S
• γ(s0, take2) = { s1 }
• γ(s1, take2) = ∅
[Figure: three states s0, s1, s2 connected by take2/put2 and take3/put3 transitions]

Classical Planning Problem

Recall the classical planning problem:
• Let Σ = (S, A, γ) be a state transition system satisfying the assumptions A0 to A7
  (called a restricted state transition system in the book)
• Let s0 ∈ S be the initial state
• Let Sg ⊆ S be the set of goal states
• Then, find a sequence of transitions labeled with actions [a1, a2, …, an]
  that can be applied starting at s0,
  resulting in a sequence of states [s1, s2, …, sn] such that sn ∈ Sg
[Figure: a search tree from the start state; several leaves are goal states]

Planning with Complete Information

Assumes we know in advance:
• The state of the world when plan execution starts
• The outcome of any action, given the state where it is executed
  ▪ State + action → unique resulting state

So if there is a solution, there is an unconditional solution:
• Planning: the model says we end up in this specific state
• Execution: no new information can be relevant – just follow the unconditional plan
[Figure: a linear plan starting from the initial state; the model predicts each successor exactly]

Multiple Outcomes

In reality, actions may have multiple outcomes.
• Some outcomes can indicate faulty / imperfect execution:
  ▪ pick-up(object): intended outcome carrying(object) is true; unintended outcome carrying(object) is false
  ▪ move(100,100): intended outcome xpos(robot)=100; unintended outcome xpos(robot) != 100
  ▪ jump-with-parachute: intended outcome alive is true; unintended outcome alive is false
• Some outcomes are more random, but clearly desirable / undesirable:
  ▪ Pick a present at random – do I get the one I longed for?
  ▪ Toss a coin – do I win?
• Sometimes we have no clear idea what is desirable:
  ▪ The outcome will affect how we can continue, but in less predictable ways
To a planner, there is generally no difference between these cases!

Nondeterministic Planning

Nondeterministic planning uses Σ = (S, A, γ) where:
• S = { s0, s1, … }: finite set of world states
• A = { a0, a1, … }: finite set of actions
• γ: S × A → 2^S: state transition function, where |γ(s,a)| is finite
Planning: the model says we end up in one of several states.
Execution: will we find out more when we execute?
[Figure: action A1 applied in the start state leads to a set of possible successor states]

NOND Planning

NOND: Non-Observable Non-Deterministic
• Also called conformant non-deterministic
• Only predictions can guide us
  ▪ We have no sensors to determine what happened
Planning: the model says we end up in one of these states.
Execution: we still only know that we're in one of these states.
[Figure: after executing A1, the belief remains a set of possible states]

NOND Planning: Examples

NOND plans are still plain sequences.
• Trivial example: VCR domain
  ▪ Initial state: don't know the current tape position
  ▪ Goal: tape at 00:20
  ▪ Operators: fast-forward n minutes; rewind completely (counter := 0)
  ▪ Solution: rewind completely; fast-forward 20 minutes
• More interesting: sorting domain (sortnet)
  ▪ Initial state: don't know the order of the list
  ▪ Operators: compare-and-swap orders two elements correctly
  ▪ Plan: a sequence that can sort any list of the given length
  ▪ http://ldc.usb.ve/~bonet/ipc5/

NOND Planning: Examples (2)

Possible plan for sorting (a sorting network over positions 1–4):
[Figure: six compare-and-swap gates applied to positions 1–4; after the first gates the last element is in place, then the last two, then all elements]
• No need for the planner to know the element values
• No need to know during runtime if elements were swapped or not
A small code sketch of such a conformant sorting plan follows below.
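A minimal sketch of the sortnet idea, in Python. It executes a fixed ("conformant") sequence of compare-and-swap steps that sorts any 4-element list without ever observing whether a swap happened; the bubble-style network and the function names below are my own choice, intended to match the six Comp/Swap gates in the figure.

def compare_and_swap(xs, i, j):
    """Order positions i and j correctly (0-based indices)."""
    if xs[i] > xs[j]:
        xs[i], xs[j] = xs[j], xs[i]

# The unconditional plan: the same action sequence works for every initial order.
PLAN = [(0, 1), (1, 2), (2, 3),   # after these, the last element is in place
        (0, 1), (1, 2),           # now the last two are in place
        (0, 1)]                   # now all elements are in place

def execute(xs):
    xs = list(xs)
    for i, j in PLAN:
        compare_and_swap(xs, i, j)
    return xs

if __name__ == "__main__":
    import itertools
    # The same plan sorts every permutation -- no sensing needed.
    assert all(execute(p) == [1, 2, 3, 4]
               for p in itertools.permutations([1, 2, 3, 4]))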

FOND Planning

FOND: Fully Observable Non-Deterministic
• After executing an action, sensors determine exactly which state we are in
Planning: the model says we end up in one of these states.
Execution: sensors say we are in this state!
[Figure: after executing A1, observation narrows the belief down to a single state]

FOND Planning: Plan Structure (1)

Example state transition system:
• Initial state s0: at pos1, standing
• s1: at pos1, fallen
• Goal state s2: at pos2, standing
• s3: at pos2, fallen
• move-to(pos2) has multiple outcomes: the robot may or may not fall
[Figure: the four states connected by move-to(pos2), stand-up and wait transitions]

Intuitive strategy:
  while (not in s2) {
    move-to(pos2);
    if (fallen) stand-up;
  }

In FOND planning, action choice should depend on actual outcomes (through the current state).
There may be no upper bound on how many actions we may have to execute!

FOND Planning: Plan Structure (2)

Examples of formal plan structures:
• Conditional plans (with if/then/else statements)
• Policies π: S → A
  ▪ Defining, for each state, which action to execute whenever we end up there –
    or at least for every state that is reachable from the given initial state
    (a policy can be a partial function)
  ▪ π(s0) = move-to(pos2)
  ▪ π(s1) = stand-up
  ▪ π(s2) = wait
  ▪ π(s3) = stand-up
[Figure: the four-state system with the chosen action marked in each state]

Solution Types 1

Assume our objective remains to reach a state in Sg.
• A weak solution: for some outcomes, the goal is reached in a finite number of steps
  ▪ π(s0) = move-to(pos2)
  ▪ π(s1) = wait
  ▪ π(s2) = wait
  ▪ π(s3) = stand-up
[Figure: the four-state system with this policy's action choices marked]

Solution Types 2

Assume our objective remains to reach a state in Sg.
• A strong solution: for every outcome, the goal is reached in a finite number of steps
  ▪ Not possible for this example problem
  ▪ Could fall every time
[Figure: the four-state system]

Solution Types 3

Assume our objective remains to reach a state in Sg.
• A strong cyclic solution will reach a goal state in a finite number of steps
  given a fairness assumption – informally, "if we can exit a loop, we eventually will"
  ▪ π(s0) = move-to(pos2)
  ▪ π(s1) = stand-up
  ▪ π(s2) = wait
  ▪ π(s3) = stand-up
[Figure: the four-state system with this policy's action choices marked]

Solutions and Costs

The cost of a FOND policy is undefined:
• We don't know in advance which actions we must execute
• We have no estimate of how likely different outcomes are
We will not discuss non-deterministic planning further!

Stochastic Systems

Probabilistic planning uses a stochastic system Σ = (S, A, P):
• S = { s0, s1, … }: finite set of world states
• A = { a0, a1, … }: finite set of actions
• P(s, a, s'): given that we are in s and execute a, the probability of ending up in s' (replaces γ)
• For every state s and action a, we have ∑s' ∈ S P(s, a, s') = 1:
  the world gives us 100% probability of ending up in some state
Planning: the model says we end up in one of these states, each with a given probability.
[Figure: action A1 in the start state leads to several successor states, each labeled with its probability]
Stochastic Systems (2)

Example with a "desirable outcome":
• Action: drive-uphill, executed in state S125,203 (at location 5)
• The model says there is a 2% risk of slipping and ending up somewhere else:
  ▪ P(S125203, drive-uphill, S125204) = 0.98   (S125,204: at location 6)
  ▪ P(S125203, drive-uphill, S125222) = 0.02   (S125,222: intermediate location)
[Figure: the two outcome edges of drive-uphill, joined by an arc indicating they are outcomes of a single action]

Stochastic Systems (3)

We may have very unlikely outcomes:
• From S125,203 (at location 5) there is also a very small chance of ending up in S247,129 (broken)
• The probability sum is still 1
• A very unlikely outcome may still be important to consider
  if it has great impact on goal achievement!
[Figure: the same action with an additional low-probability "broken" outcome]

Stochastic Systems (4)

…and very many outcomes:
• It is uncertain how much fuel will be consumed:
  ▪ S125,203: at location 5, fuel level 980
  ▪ S125,204: at location 6, fuel level 750
  ▪ S125,104: at location 6, fuel level 650
  ▪ S125,222: intermediate location
  ▪ S247,129: broken
• As always, there is one state for every combination of properties
[Figure: one outcome edge per possible resulting state]

Stochastic Systems (5)

Like before, there are often many executable actions in every state:
• We choose the action; nature chooses the outcome!
• Arcs connect edges belonging to the same action
• For each action, the outcome probabilities sum to 1
  (one certain outcome for the first action; three possible outcomes each for A2 and A3)
• Searching the state space yields an AND/OR graph
[Figure: a state with three applicable actions (red, blue, green), each with its own set of outcome edges]
Stochastic System Example

Example: a single robot moving between 5 locations.
• For simplicity, states correspond directly to locations:
  ▪ s1: at(r1, l1)
  ▪ s2: at(r1, l2)
  ▪ s3: at(r1, l3)
  ▪ s4: at(r1, l4)
  ▪ s5: at(r1, l5)
• Actions: move(li,lj) between certain pairs of locations, plus wait in every state
  (we can't always move in both directions, e.g. due to terrain gradient)
• Some transitions are deterministic, some are stochastic:
  ▪ Trying to move from l2 to l3: you may end up at l5 instead (20% risk)
  ▪ Trying to move from l1 to l4: you may stay where you are instead (50% risk)
[Figure: the five states with move(l1,l2), move(l2,l1), move(l2,l3), move(l3,l2), move(l3,l4), move(l4,l3), move(l1,l4), move(l4,l1), move(l5,l4) and wait transitions]
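A sketch of this example stochastic system as plain Python dictionaries, used by the later code sketches as well. The transitions and probabilities are read off the slides; the exact action set in s5 (move(l5,l2) only shows up in the later policy-iteration calculations) is partly my assumption.

S = ["s1", "s2", "s3", "s4", "s5"]

# P[state][action] maps each possible successor state to its probability.
P = {
    "s1": {"wait": {"s1": 1.0},
           "move(l1,l2)": {"s2": 1.0},
           "move(l1,l4)": {"s4": 0.5, "s1": 0.5}},   # 50% risk of staying at l1
    "s2": {"wait": {"s2": 1.0},
           "move(l2,l1)": {"s1": 1.0},
           "move(l2,l3)": {"s3": 0.8, "s5": 0.2}},   # 20% risk of ending up at l5
    "s3": {"wait": {"s3": 1.0},
           "move(l3,l2)": {"s2": 1.0},
           "move(l3,l4)": {"s4": 1.0}},
    "s4": {"wait": {"s4": 1.0},
           "move(l4,l1)": {"s1": 1.0},
           "move(l4,l3)": {"s3": 1.0}},
    "s5": {"wait": {"s5": 1.0},
           "move(l5,l4)": {"s4": 1.0},
           "move(l5,l2)": {"s2": 1.0}},              # assumed; used on later slides
}

# Sanity check: the outcome probabilities of every applicable action sum to 1.
for s, actions in P.items():
    for a, outcomes in actions.items():
        assert abs(sum(outcomes.values()) - 1.0) < 1e-9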

Stochastic Shortest Path Problem

Closest to classical planning: the Stochastic Shortest Path Problem (SSPP).
• Let Σ = (S, A, P) be a stochastic system as defined earlier
• Let s0 ∈ S be the initial state
• Let Sg ⊆ S be the set of goal states
• Then, find a policy of minimal expected cost
  that can be applied starting at s0
  and that reaches a state in Sg with probability 1
Expected cost: a statistical concept (discussed soon).

"Reaches"? Not "ends in"? Technically, a policy never terminates!
So we can make every goal state g absorbing: for every action a,
• P(g,a,g) = 1 → we return to the same state
• c(g,a) = 0 → no more cost accumulates
Then it doesn't matter that the policy continues executing actions infinitely!
[Figure: the five-state robot example]
Overview

Two dimensions: what we know about outcomes, and what we observe during execution.
• Deterministic (exact outcome known in advance):
  ▪ Classical planning (possibly with extensions) –
    the information dimension is irrelevant; we know everything before execution
• Non-deterministic (multiple outcomes, no probabilities):
  ▪ Non-Observable (no information gained after an action): NOND – conformant planning
  ▪ Fully Observable (exact outcome known after an action): FOND – conditional / contingent planning
• Probabilistic (multiple outcomes with probabilities):
  ▪ Non-Observable: probabilistic conformant planning (a special case of POMDPs, Partially Observable MDPs)
  ▪ Fully Observable: probabilistic conditional planning – Stochastic Shortest Path Problems
    and Markov Decision Processes (MDPs) – this lecture!
Policy Example 1

• π1 = { (s1, move(l1,l2)),
         (s2, move(l2,l3)),
         (s3, move(l3,l4)),
         (s4, wait),
         (s5, wait) }
• Starting in s1, this policy reaches s4 or s5 and then waits there infinitely many times.
[Figure: the five-state robot example with π1's action choices marked]
Policy Example 2

• π2 = { (s1, move(l1,l2)),
         (s2, move(l2,l3)),
         (s3, move(l3,l4)),
         (s4, wait),
         (s5, move(l5,l4)) }
• Always reaches state s4, then waits there infinitely many times.
[Figure: the five-state robot example with π2's action choices marked]
Policy Example 3

• π3 = { (s1, move(l1,l4)),
         (s2, move(l2,l1)),
         (s3, move(l3,l4)),
         (s4, wait),
         (s5, move(l5,l4)) }
• Reaches state s4 with 100% probability "in the limit"
  (it could happen that you never reach s4, but the probability is 0).
[Figure: the five-state robot example with π3's action choices marked]

Policies and Histories

Sequential execution results in a state sequence: a history.
• Infinite, since policies do not terminate
• h = ⟨s0, s1, s2, s3, s4, … ⟩
  ▪ Here s0 (index zero) is a variable used in histories, etc.;
    s0 in the diagrams is a concrete state name – we may have s0 = s27.

For classical planning:
• A plan yields a single history (last state repeated infinitely)

For probabilistic planning:
• Initial states can be probabilistic:
  ▪ For every state s, there is a probability P(s) that we begin in state s
• Actions can have multiple outcomes
• → A policy can yield many different histories
  ▪ Which one? Only known at execution time!
History Example 1

• π1 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, wait) }
• Even if we only consider starting in s1, there are two possible histories:
  ▪ h1 = ⟨s1, s2, s3, s4, s4, … ⟩ – reaches s4, waits indefinitely
  ▪ h2 = ⟨s1, s2, s5, s5, … ⟩ – reaches s5, waits indefinitely
• How likely are these histories?
[Figure: the five-state robot example with π1's action choices marked]
Probabilities: Initial States, Transitions

Each policy induces a probability distribution over histories.
• Let h = ⟨s0, s1, s2, s3, … ⟩
• With an unknown (probabilistic) initial state:
  ▪ P(⟨s0, s1, s2, s3, … ⟩ | π) = P(s0) ∏i ≥ 0 P(si, π(si), si+1)
• The book assumes you start in a known state s0, so all histories start with the same state:
  ▪ P(⟨s0, s1, s2, s3, … ⟩ | π) = ∏i ≥ 0 P(si, π(si), si+1)  if s0 is the known initial state
  ▪ P(⟨s0, s1, s2, s3, … ⟩ | π) = 0                           if s0 is any other state
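A minimal sketch of this product, computed over a finite prefix of a history (enough for histories that end in an absorbing wait-loop, as in the examples). It reuses the transition dictionary P sketched earlier; the function name and argument layout are my own.

def history_probability(prefix, policy, P, known_initial_state=None):
    """Probability of observing the states in `prefix` when following `policy`."""
    if known_initial_state is not None and prefix[0] != known_initial_state:
        return 0.0                       # histories starting elsewhere have probability 0
    prob = 1.0
    for s, s_next in zip(prefix, prefix[1:]):
        a = policy[s]
        prob *= P[s][a].get(s_next, 0.0)
    return prob

pi1 = {"s1": "move(l1,l2)", "s2": "move(l2,l3)",
       "s3": "move(l3,l4)", "s4": "wait", "s5": "wait"}

# h1 = <s1, s2, s3, s4, ...> and h2 = <s1, s2, s5, ...> from the slides:
print(history_probability(["s1", "s2", "s3", "s4"], pi1, P, "s1"))   # 0.8
print(history_probability(["s1", "s2", "s5"], pi1, P, "s1"))         # 0.2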
History Example 1 (continued)

• π1 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, wait) }
• Two possible histories, if we always start in s1:
  ▪ h1 = ⟨s1, s2, s3, s4, s4, … ⟩ – P(h1 | π1) = 1 × 1 × 0.8 × 1 × … = 0.8
  ▪ h2 = ⟨s1, s2, s5, s5, … ⟩    – P(h2 | π1) = 1 × 1 × 0.2 × 1 × … = 0.2
  ▪ P(h | π1) = 0 for all other h
[Figure: the five-state robot example]
History Example 2

• π2 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }
• Starting in s1:
  ▪ h1 = ⟨s1, s2, s3, s4, s4, … ⟩ – P(h1 | π2) = 1 × 1 × 0.8 × 1 × … = 0.8
  ▪ h3 = ⟨s1, s2, s5, s4, s4, … ⟩ – P(h3 | π2) = 1 × 1 × 0.2 × 1 × … = 0.2
  ▪ P(h | π2) = 0 for all other h
[Figure: the five-state robot example]
History Example 3

• π3 = { (s1, move(l1,l4)), (s2, move(l2,l1)), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }
• Starting in s1, there are infinitely many possible histories:
  ▪ h4 = ⟨s1, s4, s4, … ⟩          – P(h4 | π3) = 0.5 × 1 × 1 × … = 0.5
  ▪ h5 = ⟨s1, s1, s4, s4, … ⟩      – P(h5 | π3) = 0.5 × 0.5 × 1 × … = 0.25
  ▪ h6 = ⟨s1, s1, s1, s4, s4, … ⟩  – P(h6 | π3) = 0.5 × 0.5 × 0.5 × 1 × … = 0.125
  ▪ …
  ▪ h∞ = ⟨s1, s1, s1, s1, … ⟩      – P(h∞ | π3) = 0.5 × 0.5 × 0.5 × 0.5 × … = 0
[Figure: the five-state robot example]

Generalizing from the SSPP

We have defined the Stochastic Shortest Path Problem:
• Similar to the classical planning problem
• Adapted to probabilistic outcomes

But policies allow indefinite execution.
• Can we exploit this fact to generalize from SSPPs?
  Yes – remove the goal states, assume no termination.
• But without goal states, what is the objective?

Costs and Rewards

An alternative to goal states:
• Numeric cost C(s,a) for each state s and action a
• Numeric reward R(s) for each state s

Example: Grid World
• Actions: North, South, West, East
  ▪ Associated with a cost
  ▪ 90% probability of doing what you want
  ▪ 10% probability of not moving
• Rewards in some cells
• Danger in some cells (negative reward)
[Figure: a grid with reward cells +100 and +50 and danger cells -100, -200 and -80]

States, not Locations

Important: states != locations.
• Example: a reward when a person who wants to move is on board:
  ▪ Before: elevator-at(floor3), person-at(p1, floor3), wants-to-move(p1)
  ▪ Action: pickup(p1, floor3)
  ▪ After: elevator-at(floor3), person-onboard(p1), wants-to-move(p1)
• Can't "cycle" to receive the same reward again: no path leads back to the earlier state
• Can't stay and "accumulate rewards": all actions lead to new states

SSPP using Costs and Rewards

To model the SSPP using un-discounted utilities:
• Assume every action has positive cost,
  except in goal states, which are absorbing (→ action cost 0)
• Set all rewards to 0

Example (the five-state robot problem):
• No rewards (r = 0 everywhere)
• C(s,a) = 1 for each "horizontal" action
• C(s,a) = 100 for each "vertical" action
• C(s,wait) = 1
• Goal state s4 only permits one action, returning to s4 at cost 0
[Figure: the five-state robot example annotated with these costs and zero rewards]

Utility Functions

We want to know how good a policy is – but there are many possible outcomes…

Intermediate step: utility functions.
• Suppose a policy has one particular outcome,
  i.e. results in one particular history (state sequence)
• How "useful / valuable" is this outcome to us?

One possibility: un-discounted utility (using both costs and rewards for now):
• h = ⟨s0, s1, … ⟩  →  V(h | π) = ∑i ≥ 0 (R(si) – C(si, π(si)))
  ▪ V(h | π) is the un-discounted utility of history h given policy π:
    in each step, the reward for being in state si minus the cost of the action chosen in si.

Utility Functions and SSPP

The SSPP using un-discounted utilities:
• Positive action cost except in goal states; all rewards are zero
• If history h visits a goal state:
  ▪ Finitely many actions of finite positive cost,
    followed by infinitely many actions of cost 0 (which we won't actually execute)
  ▪ → Finite cost; any policy of finite cost solves the SSPP
• If history h does not visit a goal state:
  ▪ Infinitely many actions of finite positive cost → infinite cost
But what about the cases where we want to execute infinitely or indefinitely?
Planning and Execution

An infinite horizon solution considers all possible infinite histories (as defined earlier).
• Infinite execution: never ends – unrealistic!
• Indefinite execution: will end, but we don't know when – realistic…
  ▪ "Goal-based" execution (SSPP): execute until we achieve a goal state;
    the solution guarantees finitely many actions of cost > 0
  ▪ Truly indefinite execution: no stop criterion;
    the solution can have infinitely many actions of cost > 0
Indefinite/Infinite Execution

Example:
• The robot does not "want to reach s4" – it wants to walk around gaining rewards
• Every time step it is in s5: negative reward (maybe it's in our way)
• Every time step it is in s4: positive reward (maybe it helps us and "gets a salary")
[Figure: the five-state example with R(s4) = +100, R(s5) = -100, R = 0 elsewhere, and the action costs as before]
Utility Functions (2)

Example:
• Suppose π1 happens to result in h1 = ⟨s1, s2, s3, s4, s4, … ⟩
• V(h1 | π1) = (0 – 100) + (0 – 1) + (0 – 100) + 100 + 100 + …
• The robot stays at s4 forever, executing "wait" → an infinite amount of reward!
• This is not the only history that could result from the policy –
  that's why we specify both the policy and the history to calculate a utility.
[Figure: the five-state example with rewards and costs]
Utility Functions (3)

What's the problem, given that we "like" being in state s4?
• We can't distinguish between different ways of getting there!
  ▪ s1 → s2 → s3 → s4:             -201 + ∞ = ∞
  ▪ s1 → s2 → s1 → s2 → s3 → s4:   -401 + ∞ = ∞
  ▪ Both appear equally good…
[Figure: the five-state example with rewards and costs]

Discounted Utility

Solution: use a discount factor γ, with 0 ≤ γ ≤ 1:
• To avoid infinite utilities V(…)
• To model "impatience": rewards and costs far in the future are less important to us

Discounted utility of a history:
• V(h | π) = ∑i ≥ 0 γ^i (R(si) – C(si, π(si)))
• Distant rewards/costs have less influence
• Convergence (with finite results) is guaranteed if 0 ≤ γ < 1

The examples will use γ = 0.9 – only to simplify the formulas!
In reality the discount factor should be chosen carefully.
[Figure: the five-state example with rewards and costs]
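A small sketch of evaluating this sum. Histories are infinite, so the sketch simply truncates after a fixed horizon, which is a reasonable approximation when γ < 1; the state rewards are the ones in the figure (+100 in s4, -100 in s5, 0 elsewhere), the per-step costs are passed in explicitly, and the function name is my own.

GAMMA = 0.9
STATE_REWARD = {"s1": 0, "s2": 0, "s3": 0, "s4": 100, "s5": -100}

def discounted_utility(history, step_costs, gamma=GAMMA):
    """history: a finite prefix of states; step_costs: C(s_i, pi(s_i)) for each step."""
    return sum(gamma**i * (STATE_REWARD[s] - c)
               for i, (s, c) in enumerate(zip(history, step_costs)))

# h1 = <s1, s2, s3, s4, s4, ...> under pi1, truncated to 100 steps
# (the three moves cost 100, 1, 100; waiting in the goal state s4 costs 0 here):
horizon = 100
h1 = ["s1", "s2", "s3"] + ["s4"] * (horizon - 3)
costs = [100, 1, 100] + [0] * (horizon - 3)
print(discounted_utility(h1, costs))   # roughly 547, cf. the worked example on the next slide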
Example: Expected Utility of π1

γ = 0.9 (discount factors 1, 0.9, 0.81, 0.729, 0.6561, …).
π1 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, wait) }
Given that we start in s1, π1 can lead to only two histories:
an 80% chance of history h1 and a 20% chance of history h2.
• h1 = ⟨s1, s2, s3, s4, s4, … ⟩
  V(h1 | π1) = 0.9^0·(0–100) + 0.9^1·(0–1) + 0.9^2·(0–100) + 0.9^3·100 + 0.9^4·100 + … = 547.9
• h2 = ⟨s1, s2, s5, s5, … ⟩
  V(h2 | π1) = 0.9^0·(0–100) + 0.9^1·(0–1) + 0.9^2·(–100–0) + 0.9^3·(–100–0) + … = –910.1
• E(π1) = 0.8 · 547.9 + 0.2 · (–910.1) = 256.3 – we expect a utility of 256.3 on average
[Figure: the five-state example with rewards and costs]
Example: Expected Utility of π2

γ = 0.9.
π2 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }
Given that we start in s1, there are again two different histories:
an 80% chance of history h1 and a 20% chance of history h3.
• h1 = ⟨s1, s2, s3, s4, s4, … ⟩
  V(h1 | π2) = 0.9^0·(0–100) + 0.9^1·(0–1) + 0.9^2·(0–100) + 0.9^3·100 + 0.9^4·100 + … = 547.9
• h3 = ⟨s1, s2, s5, s4, s4, … ⟩
  V(h3 | π2) = 0.9^0·(0–100) + 0.9^1·(0–1) + 0.9^2·(–100–100) + 0.9^3·100 + … = 466.9
• E(π2) = 0.8 · 547.9 + 0.2 · 466.9 = 531.7 – expected utility 531.7 (π1 gave 256.3)
[Figure: the five-state example with rewards and costs]

Costs and Rewards: More General

Rewards are often generalized:
• Numeric cost C(s,a) for each state s and action a
• Numeric reward R(s,a,s') if you are in s, execute a, and reach s'
• V(⟨s0, s1, … ⟩ | π) = ∑i ≥ 0 γ^i (R(si, π(si), si+1) – C(si, π(si)))
This is more expressive!

Simplification

An observation: every step yields one reward and one cost, R(si, π(si), si+1) – C(si, π(si)).
• In the example: r = –100 for all edges leaving s5, r = +100 for all edges leaving s4,
  r = 0 for all other edges, plus the action costs as before
• For a given transition only the sum matters (e.g. a reward of –100 and a cost of 100 sum to –200):
  any combination of reward and cost that gives us these sums is equivalent!
[Figure: the five-state example annotated with per-edge rewards and costs]
Simplification (2)

To simplify, include the cost in the reward!
• Decrease each R(si, π(si), si+1) by C(si, π(si))
• Example:
  ▪ C(s0, takeoff) = 80, R(s0, takeoff, s1) = 200, R(s0, takeoff, s2) = –100
  ▪ becomes R(s0, takeoff, s1) = 120, R(s0, takeoff, s2) = –180
• V(⟨s0, s1, … ⟩ | π) = ∑i ≥ 0 γ^i (R(si, π(si), si+1) – C(si, π(si)))
  then simplifies to V(⟨s0, s1, … ⟩ | π) = ∑i ≥ 0 γ^i R(si, π(si), si+1)
[Figure: the five-state example with combined per-edge rewards, e.g. r = –1 for horizontal moves and wait, r = +100 for waiting in s4, r = +99 for move(l4,l1), r = –100 for vertical moves, r = –200 for move(l5,l4)]
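A one-function sketch of this folding step, using dictionaries keyed by (s, a) and (s, a, s') tuples; the numbers are the takeoff example from the slide, while the dictionary layout and names are my own choice.

def fold_cost_into_reward(R_trans, cost):
    """Return R'(s,a,s') = R(s,a,s') - C(s,a)."""
    return {(s, a, s2): r - cost[(s, a)] for (s, a, s2), r in R_trans.items()}

COST = {("s0", "takeoff"): 80}
R_TRANS = {("s0", "takeoff", "s1"): 200, ("s0", "takeoff", "s2"): -100}
print(fold_cost_into_reward(R_TRANS, COST))
# {('s0', 'takeoff', 's1'): 120, ('s0', 'takeoff', 's2'): -180}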

Overview: Markov Decision Processes

Markov Decision Processes:
• Underlying world model: stochastic system
• Plan representation: policy – which action to perform in any state
• Goal representation: utility function defining "solution quality"
• Planning problem: optimization – maximize expected utility
Why "Markov"?

Markov Property (1)

If a stochastic process has the Markov property:
• It is memoryless
• The future of the process can be predicted equally well
  if we use only its current state or if we use its entire history

This is part of the definition of our stochastic systems:
• P(s, a, s') is the probability of ending up in s'
  when we are in s and execute a – nothing else matters!
Markov Property (2)

[Figure: several predecessor states lead into S125,203 (at location 5); action a leads from there to S125,204 (at location 6), S125,222 (intermediate location) or S247,129 (broken)]
• We don't need to know the states we visited before…
• Only the current state and the action…
• …to find out where we may end up, with which probability.

Remembering the Past

Essential distinction:
• Previous states in the history sequence cannot affect the transition function.
• What happened at earlier timepoints can partly be encoded into the current state –
  and can then affect the transition function.

Example:
• If you have visited the lectures, you are more likely to pass the exam
  ▪ Add a predicate visitedLectures, representing in this state what you did in the past
• This information is encoded and stored in the current state
  ▪ The state space doubles in size (and here we often treat every state separately!)
  ▪ We only have a finite number of states → we can't encode an unbounded history

Expected Utility

Now we know the utility of each history, each outcome.
• But we can only decide on a policy.
• Each outcome has a probability,
  so we can calculate an expected ("average") utility for the policy: E(π).

Expected Utility 2

A policy selects actions; the world chooses the outcome.
• So we must consider all possible outcomes / histories, but not all possible choices.
[Figure: the policy chooses the green action; its outcomes must all be handled, while the outcomes of the other actions are irrelevant to us]

Expected Utility 3

In the next step the policy again makes a choice.

Expected Utility 4

Calculating expected utility E(π), method 1: "history-based".
• Find all possible infinite histories (⟨A,B,E,…⟩, ⟨A,B,F,…⟩, ⟨A,B,G,…⟩, ⟨A,C,H,…⟩, …)
• Calculate probabilities and rewards over each entire history
• Multiply and sum:
  E(π) = ∑h P(h | π) V(h | π), where V(h | π) = ∑i ≥ 0 γ^i R(si, π(si), si+1)
• Simple conceptually, but less useful for calculations.
[Figure: an AND/OR tree over states A–K illustrating the possible histories]

Expected Utility 5

Calculating expected utility, method 2: "step-based".
• What's the probability of each immediate outcome (B, C or D)?
• What's the reward for each outcome?
• What's the expected reward of continuing from there?
  E(π) = E(π, s0)
  E(π, s) = ∑s' ∈ S P(s, π(s), s') · (R(s, π(s), s') + γ E(π, s'))
• E(π) = the expected utility "from the start";
  E(π, s) = the expected utility of "continuing after having reached s".
[Figure: the same AND/OR tree; only the first step and the subtrees below it are considered]

Expected Utility 6: "Step-Based"

If π is a policy, then
• E(π, s) = ∑s' ∈ S P(s, π(s), s') · (R(s, π(s), s') + γ E(π, s'))
• The expected utility of continuing to execute π after having reached s
• is the sum, over all possible states s' ∈ S that you might end up in, of:
  ▪ the probability P(s, π(s), s') of actually ending up in that state,
    given the action π(s) chosen by the policy, times
  ▪ the reward you get for this transition,
  ▪ plus the discount factor times the expected utility E(π, s') of continuing π from the new state s'.

Example 1

π2 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }

E(π2, s1) = the expected utility of executing π2 starting in s1:
• The (combined) reward of the first action, move(l1,l2)
• Plus the discount factor γ times…
  ▪ [ending up in s2]  100% · E(π2, s2)
[Figure: the five-state example with combined rewards]

Example 2

E(π2, s2) = the expected utility of executing π2 starting in s2:
• The (combined) reward of the first action, move(l2,l3) – which has multiple outcomes!
• Plus the discount factor γ times…
  ▪ [ending up in s3]  80% · E(π2, s3)
  ▪ plus [ending up in s5]  20% · E(π2, s5)
[Figure: the five-state example with combined rewards]

Recursive?

Seems like we could easily calculate this recursively!
• E(π2, s1) is defined in terms of E(π2, s2),
• which is defined in terms of E(π2, s3) and E(π2, s5),
• …
• Just continue until you reach the end!
• Why doesn't this work?

Not Recursive!

There isn't always an "end"!
• The modified example below is a valid policy π (different action in s5):
  ▪ E(π, s1) is defined in terms of E(π, s2)
  ▪ E(π, s2) is defined in terms of E(π, s3) and E(π, s5)
  ▪ E(π, s3) is defined in terms of E(π, s4)
  ▪ E(π, s5) is defined in terms of E(π, s2)…
[Figure: the five-state example where the policy moves from s5 back to s2, creating a cycle]

Equation System

If π is a policy, then
• E(π, s) = ∑s' ∈ S P(s, π(s), s') · (R(s, π(s), s') + γ E(π, s'))
  for every state s (the expected utility of continuing to execute π after having reached s, as before).
This is an equation system: |S| equations in |S| variables!
It requires different solution methods than simple recursion.
A small numeric sketch follows below.
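A sketch of exact policy evaluation with numpy: set up the |S| linear equations above and solve them. It reuses the transition dictionary P and the state list from the earlier sketch. The combined per-edge rewards are read off the slides (-1 for horizontal moves and wait, -100 for vertical moves, +100 for waiting in s4, +99 for move(l4,l1), -200 for move(l5,l4), -101 for move(l5,l2)); the value for move(l4,l3) is my assumption by analogy.

import numpy as np

GAMMA = 0.9
states = ["s1", "s2", "s3", "s4", "s5"]

REWARDS = {("s1","move(l1,l2)","s2"): -100, ("s1","move(l1,l4)","s4"): -1,
           ("s1","move(l1,l4)","s1"): -1,   ("s1","wait","s1"): -1,
           ("s2","move(l2,l3)","s3"): -1,   ("s2","move(l2,l3)","s5"): -1,
           ("s2","move(l2,l1)","s1"): -100, ("s2","wait","s2"): -1,
           ("s3","move(l3,l4)","s4"): -100, ("s3","move(l3,l2)","s2"): -1,
           ("s3","wait","s3"): -1,
           ("s4","wait","s4"): 100, ("s4","move(l4,l1)","s1"): 99,
           ("s4","move(l4,l3)","s3"): -100,          # assumed, by analogy with other vertical moves
           ("s5","wait","s5"): -100, ("s5","move(l5,l4)","s4"): -200,
           ("s5","move(l5,l2)","s2"): -101}

def R(s, a, s2):
    return REWARDS[(s, a, s2)]

def evaluate_policy(policy, gamma=GAMMA):
    """Solve E(pi,s) = sum_s' P(s,pi(s),s') (R(s,pi(s),s') + gamma E(pi,s'))."""
    n = len(states)
    A, b = np.eye(n), np.zeros(n)
    for i, s in enumerate(states):
        a = policy[s]
        for s2, p in P[s][a].items():
            A[i, states.index(s2)] -= gamma * p    # move gamma*P*E(pi,s') to the left side
            b[i] += p * R(s, a, s2)                # expected immediate reward
    return dict(zip(states, np.linalg.solve(A, b)))

print(evaluate_policy({s: "wait" for s in states}))
# roughly: s1, s2, s3 -> -10, s4 -> +1000, s5 -> -1000, as on the policy-iteration slides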

Repetition: Utility

Let us first revisit the definition of utility.
• We can define the actual utility given an outcome, a history:
  ▪ Given any history ⟨s0, s1, … ⟩:
    V(⟨s0, s1, … ⟩ | π) = ∑i ≥ 0 γ^i R(si, π(si), si+1)
• We can define the expected utility using the given probability distribution:
  ▪ Given that we start in state s:
    E(π, s) = ∑⟨s0, s1, …⟩ P(⟨s0, s1, … ⟩ | s0 = s) ∑i ≥ 0 γ^i R(si, π(si), si+1)
• As we saw, we can also rewrite this recursively. Given that we start in state s:
  E(π, s) = ∑s' ∈ S P(s, π(s), s') · (R(s, π(s), s') + γ E(π, s'))

Optimality

Policy π* is optimal if it maximizes expected utility,
given the specified probabilities and transition rewards:
• For all states s and policies π, E(π*, s) ≥ E(π, s)

Sometimes a given initial state s0 is assumed:
• Optimal = for all policies π, E(π*, s0) ≥ E(π, s0)
• Everywhere optimal = for all states s and policies π, E(π*, s) ≥ E(π, s)

When we solve an MDP exactly, a solution is defined as an optimal policy!
How do we find an optimal policy?
• Let's begin with an important property of optimal policies.

Principle of Optimality

Bellman's Principle of Optimality:
• An optimal policy has the property that whatever the initial state and initial decision are,
  the remaining decisions must constitute an optimal policy
  with regard to the state resulting from the first decision.
• Example: find a policy that maximizes expected utility given that we start in s1.
  Suppose that π* is an optimal policy and that π*(s1) = move(l1,l2), so the next state is always s2.
  Then π* must also maximize expected utility given that we start in s2!
[Figure: the five-state example with combined rewards]

Principle of Optimality (2)

Sounds trivial? It depends on the Markov property!
• Suppose costs depended on which states you had visited before:
  ▪ To go s5 → s4 → s1: use move(l5,l4) and move(l4,l1); reward –200 + –400 = –600
  ▪ To go s4 → s1 without having visited s5: use move(l4,l1), same as above;
    the reward for this step would be 99, not –400
• → The optimal action would have to take history into account
• This can't happen in an MDP: it is Markovian!
[Figure: the five-state example where move(l4,l1) hypothetically has r = 99 usually, but r = –400 if we visited s5]

Consequences

Suppose I have a non-optimal policy π:
• I select an arbitrary state s
• I change π(s) in a way that increases E(π, s)
• → This cannot decrease E(π, s') for any s'!
• → We can find solutions through local improvements – no need to "think globally"

Suppose a given initial state s0 is assumed,
with optimal = for all policies π, E(π*, s0) ≥ E(π, s0).
Then for all states s that are reachable from s0, E(π*, s) ≥ E(π, s).

Policy Iteration

First algorithm: policy iteration.
• General idea:
  ▪ Start out with an initial policy, maybe randomly chosen
  ▪ Calculate the expected utility of executing that policy from each state
  ▪ Update the policy by making a local decision for each state:
    "Which action should my improved policy choose in this state,
    given the expected utility of the current policy?"
  ▪ Iterate until convergence (until the policy no longer changes)

Simplification: in our example the rewards do not actually depend on the state we end up in → R(s, π(s)):
• E(π, s) = ∑s' ∈ S P(s, π(s), s') (R(s, π(s), s') + γ E(π, s'))
• E(π, s) = R(s, π(s)) + ∑s' ∈ S P(s, π(s), s') γ E(π, s')

Preliminaries 1: Single-Step Policy Changes

Suppose I have a policy, change the decision in the first step,
and keep the policy for everything else.
• Let Q(π, s, a) be the expected utility of π in a state s
  if we start by executing the given action a,
  but use the policy π from then onward.
• Expected utility of the original policy π:
  ▪ E(π, s) = R(s, π(s)) + ∑s' ∈ S P(s, π(s), s') γ E(π, s')
• Expected utility given a one-step (local!) change:
  ▪ Q(π, s, a) = R(s, a) + ∑s' ∈ S P(s, a, s') γ E(π, s')
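A minimal sketch of this one-step lookahead value, reusing P, states, R(s,a,s') and evaluate_policy() from the earlier code blocks. The function name q_value is my own; note that the policy only enters through its expected utilities E.

def q_value(s, a, E, gamma=0.9):
    """Expected utility of executing a in s once, then following the policy behind E."""
    return sum(p * (R(s, a, s2) + gamma * E[s2]) for s2, p in P[s][a].items())

E1 = evaluate_policy({s: "wait" for s in states})   # E(pi1, .) for the all-wait policy
print(q_value("s1", "move(l1,l4)", E1))
# about -1 + 0.9 * (0.5*(-10) + 0.5*1000) = 444.5, as on the policy-iteration slides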
Preliminaries 2: Example

• E(π, s1): the expected utility of following the current policy,
  starting in s1 and beginning with move(l1,l2)
• Q(π, s1, move(l1,l4)): the expected utility of first trying to move from l1 to l4,
  then following the current policy
  ▪ This does not correspond to any possible policy!
    If move(l1,l4) returns you to state s1, the next action is move(l1,l2)!
[Figure: the five-state example with combined rewards]

Preliminaries 3

Suppose you have an (everywhere) optimal policy π*.
• Then, because of the principle of optimality:
  ▪ In every state, the local choice made by the policy is locally optimal
  ▪ For all states s, E(π*, s) = maxa Q(π*, s, a)

This yields the modification step we will use!
• We have a possibly non-optimal policy π and want to create an improved policy π'
• For every state s: set π'(s) = argmaxa Q(π, s, a)
Preliminaries 4

• E(π, s1): the expected utility of following the current policy,
  starting in s1 and beginning with move(l1,l2)
• Q(π, s1, move(l1,l4)): the expected utility of first trying to move from l1 to l4,
  then following the current policy
• If doing move(l1,l4) first has a greater expected utility,
  we should modify the current policy: π(s1) := move(l1,l4)
[Figure: the five-state example with combined rewards]
Policy Iteration 2: Initial Policy π1

Policy iteration requires an initial policy.
• Let's start by choosing "wait" in every state:
  π1 = { (s1, wait), (s2, wait), (s3, wait), (s4, wait), (s5, wait) }
• Let's set a discount factor γ = 0.9
  ▪ Easy to use in the calculations on these slides,
    but in reality we might use a larger factor (we're not that short-sighted!)
• We need to know the expected utilities,
  because we will make changes according to Q(π1, s, a),
  which depends on ∑s' ∈ S P(s, a, s') E(π1, s')
[Figure: the five-state example with combined rewards]

Policy Iteration 3: Expected Utilities for π1

Calculate expected utilities for the current policy π1.
• Simple here: the chosen transitions are deterministic and return to the same state!
  E(π, s) = R(s, π(s)) + γ ∑s' ∈ S P(s, π(s), s') E(π, s')
  ▪ E(π1, s1) = R(s1, wait) + γ E(π1, s1) = –1   + 0.9 E(π1, s1)
  ▪ E(π1, s2) = R(s2, wait) + γ E(π1, s2) = –1   + 0.9 E(π1, s2)
  ▪ E(π1, s3) = R(s3, wait) + γ E(π1, s3) = –1   + 0.9 E(π1, s3)
  ▪ E(π1, s4) = R(s4, wait) + γ E(π1, s4) = +100 + 0.9 E(π1, s4)
  ▪ E(π1, s5) = R(s5, wait) + γ E(π1, s5) = –100 + 0.9 E(π1, s5)
• Simple equations to solve:
  ▪ 0.1 E(π1, s1) = –1    →  E(π1, s1) = –10
  ▪ 0.1 E(π1, s2) = –1    →  E(π1, s2) = –10
  ▪ 0.1 E(π1, s3) = –1    →  E(π1, s3) = –10
  ▪ 0.1 E(π1, s4) = +100  →  E(π1, s4) = +1000
  ▪ 0.1 E(π1, s5) = –100  →  E(π1, s5) = –1000
• Given this policy π1: high rewards if we start in s4, high costs if we start in s5.
Policy Iteration 4: Update 1a

What is the best local modification according to the expected utilities of the current policy?
E(π1, s1) = –10, E(π1, s2) = –10, E(π1, s3) = –10, E(π1, s4) = +1000, E(π1, s5) = –1000.

For every state s, let π2(s) = argmaxa ∈ A Q(π1, s, a),
i.e. find the action a that maximizes R(s, a) + γ ∑s' ∈ S P(s, a, s') E(π1, s'):
• s1: wait         –1 + 0.9 · (–10)                   = –10
      move(l1,l2)  –100 + 0.9 · (–10)                 = –109
      move(l1,l4)  –1 + 0.9 · (0.5·(–10) + 0.5·1000)  = +444.5   ← best improvement

These are not the true expected utilities for starting in state s1!
• They are only correct if we locally change the first action to execute
  and then go on to use the previous policy (in this case, always waiting).
• But they can be proven to yield good guidance,
  as long as you apply the improvements repeatedly (as policy iteration does).
Policy Iteration 5: Update 1b

Continuing with s2, using the same expected utilities for π1:
• s2: wait         –1 + 0.9 · (–10)                     = –10     ← best (no change in s2)
      move(l2,l1)  –100 + 0.9 · (–10)                   = –109
      move(l2,l3)  –1 + 0.9 · (0.8·(–10) + 0.2·(–1000)) = –188.2
Policy Iteration 6: Update 1c

Continuing with s3, s4 and s5:
• s3: wait         –1 + 0.9 · (–10)       = –10
      move(l3,l2)  –1 + 0.9 · (–10)       = –10
      move(l3,l4)  –100 + 0.9 · (+1000)   = +800    ← best
• s4: wait         +100 + 0.9 · (+1000)   = +1000   ← best
      move(l4,l1)  +99 + 0.9 · (–10)      = +90
      …
• s5: wait         –100 + 0.9 · (–1000)   = –1000
      move(l5,l2)  –101 + 0.9 · (–10)     = –110
      move(l5,l4)  –200 + 0.9 · (+1000)   = +700    ← best
Policy Iteration 7: Second Policy

This results in a new policy:
• π1 = { (s1, wait), (s2, wait), (s3, wait), (s4, wait), (s5, wait) }
  with E(π1) = (–10, –10, –10, +1000, –1000) for (s1, …, s5)
• π2 = { (s1, move(l1,l4)), (s2, wait), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }
  with utilities ≥ +444.5, ≥ –10, ≥ +800, ≥ +1000, ≥ +700 respectively
  ▪ These bounds are based on one modified action followed by π1, so the true values can't be lower!
• Now we have made use of earlier indications that s4 seems to be a good state –
  we try to go there from s1 / s3 / s5. No change in s2 yet…
[Figure: the five-state example with combined rewards]

Policy Iteration 8: Expected Utilities for π2

Calculate the true expected utilities for the new policy
π2 = { (s1, move(l1,l4)), (s2, wait), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }:
• E(π2, s1) = R(s1, move(l1,l4)) + γ (0.5 E(π2, s1) + 0.5 E(π2, s4)) = –1 + 0.9 (0.5 E(π2, s1) + 0.5 E(π2, s4))
• E(π2, s2) = R(s2, wait)        + γ E(π2, s2)                       = –1 + 0.9 E(π2, s2)
• E(π2, s3) = R(s3, move(l3,l4)) + γ E(π2, s4)                       = –100 + 0.9 E(π2, s4)
• E(π2, s4) = R(s4, wait)        + γ E(π2, s4)                       = +100 + 0.9 E(π2, s4)
• E(π2, s5) = R(s5, move(l5,l4)) + γ E(π2, s4)                       = –200 + 0.9 E(π2, s4)
Equations to solve:
• 0.1 E(π2, s2) = –1    →  E(π2, s2) = –10
• 0.1 E(π2, s4) = +100  →  E(π2, s4) = +1000
• E(π2, s3) = –100 + 0.9 · 1000 = +800
• E(π2, s5) = –200 + 0.9 · 1000 = +700
• E(π2, s1) = –1 + 0.45 E(π2, s1) + 0.45 E(π2, s4)
  →  0.55 E(π2, s1) = –1 + 450 = +449
  →  E(π2, s1) = +816.36…
Policy Iteration 9: Second Policy

Now we have the true expected utilities of the second policy:
• π1 = { (s1, wait), …, (s5, wait) }: E(π1) = (–10, –10, –10, +1000, –1000)
• π2 = { (s1, move(l1,l4)), (s2, wait), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }:
  E(π2) = (+816.36, –10, +800, +1000, +700),
  compared to the earlier lower bounds (≥ +444.5, ≥ –10, ≥ +800, ≥ +1000, ≥ +700)
• s5 wasn't so bad after all, since you can reach s4 in a single step!
• s2 seems much worse in comparison, since the benefits of s4 haven't "propagated" that far.
• s1 and s3 are even better.
[Figure: the five-state example with combined rewards]
Policy Iteration 10: Update 2a

What is the best local modification according to the expected utilities of the current policy?
E(π2, s1) = +816.36, E(π2, s2) = –10, E(π2, s3) = +800, E(π2, s4) = +1000, E(π2, s5) = +700.

For every state s, let π3(s) = argmaxa ∈ A Q(π2, s, a),
i.e. find the action a that maximizes R(s, a) + γ ∑s' ∈ S P(s, a, s') E(π2, s'):
• s1: wait         –1 + 0.9 · 816.36                  = +733.72
      move(l1,l2)  –100 + 0.9 · (–10)                 = –109
      move(l1,l4)  –1 + 0.9 · (0.5·1000 + 0.5·816.36) = +816.36   ← seems best – chosen!
• s2: wait         –1 + 0.9 · (–10)                   = –10
      move(l2,l1)  –100 + 0.9 · 816.36                = +634.72
      move(l2,l3)  –1 + 0.9 · (0.8·800 + 0.2·700)     = +701      ← best
Now we will change the action taken at s2,
since the expected utilities of the reachable states s1, s3 and s5 have increased.
Policy Iteration 11: Update 2b

Continuing with s3, s4 and s5, using the same expected utilities for π2:
• s3: wait         –1 + 0.9 · 800       = +719
      move(l3,l2)  –1 + 0.9 · (–10)     = –10
      move(l3,l4)  –100 + 0.9 · 1000    = +800     ← best
• s4: wait         +100 + 0.9 · 1000    = +1000    ← best
      move(l4,l1)  +99 + 0.9 · 816.36   = +833.72
      …
• s5: wait         –100 + 0.9 · 700     = +530
      move(l5,l2)  –101 + 0.9 · (–10)   = –110
      move(l5,l4)  –200 + 0.9 · 1000    = +700     ← best
Policy Iteration 12: Third Policy

This results in a new policy π3:
• π3 = { (s1, move(l1,l4)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }
• The true expected utilities are updated by solving an equation system
• The algorithm will iterate once more
• No changes will be made to the policy
• → Termination with an optimal policy!
[Figure: the five-state example with combined rewards]

Policy Iteration 13: Algorithm

Policy iteration is a way to find an optimal policy π*.
• Start with an arbitrary initial policy π1. Then, for i = 1, 2, …:
  ▪ Compute expected utilities E(πi, s) for every s by solving a system of equations
    (find the utilities according to the current policy):
    for all s, E(πi, s) = R(s, πi(s)) + γ ∑s' ∈ S P(s, πi(s), s') E(πi, s')
    ▪ Result: the expected utilities of the "current" policy in every state s
    ▪ Not a simple recursive calculation – the state graph is generally cyclic!
  ▪ Compute an improved policy πi+1 "locally" for every s (find the best local improvements):
    πi+1(s) := argmaxa ∈ A R(s, a) + γ ∑s' ∈ S P(s, a, s') E(πi, s')
    ▪ The best action in any given state s given the expected utilities of the old policy πi
  ▪ If πi+1 = πi then exit
    ▪ No local improvement is possible, so the solution is optimal
  ▪ Otherwise this is a new policy πi+1 – with new expected utilities!
    Iterate, calculate those utilities, …
A runnable sketch of this loop follows below.
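A compact sketch of the loop just described, reusing P, states, R(s,a,s'), evaluate_policy() and q_value() from the earlier code blocks. Under those assumptions it should reproduce the run on these slides: all-wait → π2 → π3, then no further change.

def policy_iteration(initial_policy, gamma=0.9):
    policy = dict(initial_policy)
    while True:
        E = evaluate_policy(policy, gamma)                 # solve the linear system
        improved = {s: max(P[s],                           # greedy one-step improvement
                           key=lambda a: q_value(s, a, E, gamma))
                    for s in states}
        if improved == policy:                             # no local improvement possible
            return policy, E
        policy = improved

pi_star, E_star = policy_iteration({s: "wait" for s in states})
print(pi_star)   # expected: move(l1,l4), move(l2,l3), move(l3,l4), wait, move(l5,l4)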

Convergence

Policy iteration converges in a finite number of iterations!
• We change which action to execute only if this improves the expected (pseudo-)utility for this state
  ▪ This can never decrease, and can sometimes increase, the utilities of other states
  ▪ So utilities are monotonically improving, and we only have to consider a finite number of policies

In general:
• It may take many iterations
• Each iteration can be slow, mainly because of the need to solve a large equation system!

An Alternative Policy Iteration

How can we avoid the need to solve an equation system?
• In PI we have a policy πi and want its expected utilities E(πi, s)
• A way of approximating the same thing (one step in policy iteration):
  ▪ E_{i,0}(πi, s) = 0 for all s                        (utility for 0 steps)
  ▪ Then iterate for j = 0, 1, 2, … and for all states s:
    E_{i,j+1}(πi, s) = R(s, πi(s)) + γ ∑s' ∈ S P(s, πi(s), s') E_{i,j}(πi, s')
    (utility for 1 step, 2 steps, 3 steps, …)
  ▪ Iterate until E_{i,j+1} is sufficiently close to E_{i,j}
  ▪ This converges in the limit: γ < 1 → future steps become less and less important
  ▪ Finally, the utility function determines the best actions to use (essentially maximizing Q(…)):
    πi+1(s) = argmaxa ∈ A R(s, a) + γ ∑s' ∈ S P(s, a, s') E_{i,n}(πi, s')
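A sketch of this approximate (iterative) policy evaluation, which could replace the exact linear solve in evaluate_policy(). It reuses P, states and R(s,a,s') from the earlier blocks; the tolerance value is my own choice.

def evaluate_policy_iteratively(policy, gamma=0.9, tol=1e-6):
    E = {s: 0.0 for s in states}                   # utility for 0 steps
    while True:
        E_next = {s: sum(p * (R(s, policy[s], s2) + gamma * E[s2])
                         for s2, p in P[s][policy[s]].items())
                  for s in states}
        if max(abs(E_next[s] - E[s]) for s in states) < tol:
            return E_next
        E = E_next

print(evaluate_policy_iteratively({s: "wait" for s in states}))
# converges towards the exact values (-10, -10, -10, +1000, -1000)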

An Alternative Policy Iteration (2)

This replaces the equation system:
• Not an exact solution
• Still takes a long time:
  ▪ Create an arbitrary policy π0
  ▪ Calculate E_{0,0}, E_{0,1}, E_{0,2}, … → approximate utility function
  ▪ Update the policy to π1
  ▪ Calculate E_{1,0}, E_{1,1}, E_{1,2}, … → approximate utility function
  ▪ …
• But this gives us a hint for how we can continue…

Value Iteration

Another algorithm: value iteration.
• It does not use a policy during iteration!
• Instead:
  ▪ V0(s) = 0 for all s – no policy argument;
    let's call it "V" since it does not correspond to expected utilities!
  ▪ We used to let a policy select the action:
    E_{i,j+1}(πi, s) = R(s, πi(s)) + γ ∑s' ∈ S P(s, πi(s), s') E_{i,j}(πi, s')
  ▪ Now we iterate for j = 0, 1, 2, … using the action maximizing finite-horizon utility:
    V_{j+1}(s) = maxa ∈ A [ R(s, a) + γ ∑s' ∈ S P(s, a, s') Vj(s') ]
  ▪ Finally, after convergence, extract a policy from the Q-values of the final V:
    π(s) = argmaxa ∈ A [ R(s, a) + γ ∑s' ∈ S P(s, a, s') V(s') ]
(Calculate V0, then V1, then V2, then V3, …)

Value Iteration (2)

Main difference:
• With policy iteration:
  ▪ Find a policy
  ▪ Find exact expected utilities for infinitely many steps using this policy
    (expensive, but gives the best possible basis for improvement)
  ▪ Use these to generate a new policy
  ▪ Throw away the old utilities, find exact expected utilities for the new policy
  ▪ Use these to generate a new policy, …
• With value iteration:
  ▪ Find the best utilities considering 0 steps; this implicitly defines a policy
  ▪ Find the best utilities considering 1 step; this implicitly defines a policy
  ▪ Find the best utilities considering 2 steps; this implicitly defines a policy
  ▪ …

Value Iteration (3)

Value iteration will always converge towards an optimal value function.
• It converges faster if V0(s) is close to the true value function
• It will actually converge regardless of the initial value of V0(s)
  ▪ Intuition: as n goes to infinity, the importance of V0(s) goes to zero
Value Iteration 2: Initial Guess V0

Value iteration requires an initial approximation.
• Let's start with V0(s) = 0 for each s:
  V0(s1) = V0(s2) = V0(s3) = V0(s4) = V0(s5) = 0
• This does not correspond to any actual policy!
  ▪ It does correspond to the expected reward of executing zero steps…
[Figure: the five-state example with combined rewards]
Value Iteration 3: Update 1a

What is the best local modification according to the current approximation?
V0(s) = 0 for all s.
• PI would find the action maximizing R(s, a) + γ ∑s' ∈ S P(s, a, s') E(π1, s');
  VI finds the action a that maximizes R(s, a) + γ ∑s' ∈ S P(s, a, s') V0(s'):
• s1: wait         –1 + 0.9 · 0                = –1
      move(l1,l2)  –100 + 0.9 · 0              = –100
      move(l1,l4)  –1 + 0.9 · (0.5·0 + 0.5·0)  = –1
• s2: wait         –1 + 0.9 · 0                = –1
      move(l2,l1)  –100 + 0.9 · 0              = –100
      move(l2,l3)  –1 + 0.9 · (0.8·0 + 0.2·0)  = –1
Value Iteration 4: Update 1b

Continuing with s3, s4 and s5 (still using V0 = 0):
• s3: wait         –1 + 0.9 · 0     = –1
      move(l3,l2)  –1 + 0.9 · 0     = –1
      move(l3,l4)  –100 + 0.9 · 0   = –100
• s4: wait         +100 + 0.9 · 0   = +100
      move(l4,l1)  +99 + 0.9 · 0    = +99
      …
• s5: wait         –100 + 0.9 · 0   = –100
      move(l5,l2)  –101 + 0.9 · 0   = –101
      move(l5,l4)  –200 + 0.9 · 0   = –200
Value Iteration 5: Second Policy

This results in a new approximation of the greatest expected utility:
• V1(s1) = –1, V1(s2) = –1, V1(s3) = –1, V1(s4) = +100, V1(s5) = –100
• V1 corresponds to one step of many policies, including
  π1 = { (s1, wait), (s2, wait), (s3, move(l3,l2)), (s4, wait), (s5, wait) }
• Policy iteration would now calculate the true expected utility for a chosen policy.
  Value iteration instead continues using V1, which is only a calculation guideline,
  not the true expected utility of any policy.
  ▪ For infinite execution, E(π1, s1) = –10, but this is not calculated…
[Figure: the five-state example with combined rewards]
Value Iteration 6: Update 2a

What is the best local modification according to the current approximation?
V1(s1) = –1, V1(s2) = –1, V1(s3) = –1, V1(s4) = +100, V1(s5) = –100.
• Find the action a that maximizes R(s, a) + γ ∑s' ∈ S P(s, a, s') V1(s'):
• s1: wait         –1 + 0.9 · (–1)                   = –1.9
      move(l1,l2)  –100 + 0.9 · (–1)                 = –100.9
      move(l1,l4)  –1 + 0.9 · (0.5·(–1) + 0.5·100)   = +43.55
• s2: wait         –1 + 0.9 · (–1)                   = –1.9
      move(l2,l1)  –100 + 0.9 · (–1)                 = –100.9
      move(l2,l3)  –1 + 0.9 · (0.8·(–1) + 0.2·(–1))  = –1.9
Value Iteration 7: Update 2b

Continuing with s3, s4 and s5 (still using V1):
• s3: wait         –1 + 0.9 · (–1)    = –1.9
      move(l3,l2)  –1 + 0.9 · (–1)    = –1.9
      move(l3,l4)  –100 + 0.9 · 100   = –10
• s4: wait         +100 + 0.9 · 100   = +190
      move(l4,l1)  +99 + 0.9 · (–1)   = +98.1
      …
• s5: wait         –100 + 0.9 · (–1)  = –100.9
      move(l5,l2)  –101 + 0.9 · (–1)  = –101.9
      move(l5,l4)  –200 + 0.9 · 100   = –110
Value Iteration 8: Second Policy

This results in another new approximation:
• V0 = (0, 0, 0, 0, 0); V1 = (–1, –1, –1, +100, –100);
  V2(s1) = +43.55, V2(s2) = –1.9, V2(s3) = –1.9, V2(s4) = +190, V2(s5) = –100.9
• V1 corresponded to one step of π1 = { (s1, wait), (s2, wait), (s3, move(l3,l2)), (s4, wait), (s5, wait) };
  V2 corresponds to one step of π2 = { (s1, move(l1,l4)), (s2, wait), (s3, wait), (s4, wait), (s5, wait) }
• Again, V2 doesn't represent the true expected utility of π2:
  ▪ It is the true expected utility of one step of π2, then one step of π1!
  ▪ Nor is it the true expected utility of executing two steps of π2.
  ▪ But it will converge towards the true utility…
[Figure: the five-state example with combined rewards]

Differences

Significant differences from policy iteration:
• A less accurate basis for action selection
  ▪ Based on an approximate utility, not the true expected utility
• The policy does not necessarily change in each iteration
  ▪ We may first have to iterate n times, incrementally improving the approximations,
    before another action suddenly seems better in some state
• → Requires a larger number of iterations, but each iteration is cheaper
• → Can't terminate just because the policy does not change;
  we need another termination condition…

Illustration

[Table: pseudo-utilities Vj(s) per iteration; notice that we already calculated rows 1 and 2]
• s1: wait         –1 + 0.9 · (–1)                   = –1.9
      move(l1,l2)  –100 + 0.9 · (–1)                 = –100.9
      move(l1,l4)  –1 + 0.9 · (0.5·(–1) + 0.5·100)   = +43.55

Illustration (2)

Remember, these are "pseudo-utilities"!
• For example, 324.109 = the reward of waiting once in s5,
  then continuing according to the previous 14 policies for 14 steps,
  then doing nothing (which is impossible according to the model).

Illustration (3)

The policy implicit in the value function changes incrementally…

Illustration (4)

At some point we reach the final recommendation/policy:
• In the table, the maximizing action in each state stops changing
  (moving towards s4 from s1, s3 and s5; moving towards s3 from s2; only wait in s4) –
  here the optimal policy is found in iteration 4.
• But we can't know this at that point:
  these are not true utilities; maybe one action will soon "overtake" another!

Different Discount Factors

Suppose the discount factor is 0.99 instead.
• Illustration (showing only the best pseudo-utility at each step): much slower convergence
  ▪ Change at step 20: 2% → 5%
  ▪ Change at step 50: 0.07% → 1.63%
• We care more about the future → we need to consider many more steps!

How Many Iterations?

We can find bounds!
• Let ε be the greatest change in pseudo-utility between two iterations:
  ε = maxs ∈ S |Vnew(s) – Vold(s)|
• Then if we create a policy π according to Vnew, we have a bound:
  maxs ∈ S |E(π*, s) – E(π, s)| < 2εγ/(1 – γ)
  ▪ For every state, the reward of π is at most 2εγ/(1 – γ) from the reward of an optimal policy.

The bound 2εγ/(1 – γ) for different discount factors γ and maximum differences ε:

  ε:          0.001    0.01    0.1     1       5       10      100
  γ = 0.5     0.002    0.02    0.2     2       10      20      200
  γ = 0.9     0.018    0.18    1.8     18      90      180     1800
  γ = 0.95    0.038    0.38    3.8     38      190     380     3800
  γ = 0.99    0.198    1.98    19.8    198     990     1980    19800
  γ = 0.999   1.998    19.98   199.8   1998    9990    19980   199800
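A one-liner sketch of the bound quoted above: if the largest change between two value-iteration sweeps is ε, a greedy policy extracted from the new values is within 2εγ/(1 – γ) of optimal. This reproduces the entries in the table; the function name is my own.

def value_iteration_bound(eps, gamma):
    return 2 * eps * gamma / (1 - gamma)

print(value_iteration_bound(0.1, 0.9))   # 1.8
print(value_iteration_bound(1, 0.99))    # 198.0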
How Many Iterations? Discount 0.90

Bounds are incrementally tightened!
• Quit after 2 iterations → we know V2(s1) = 43.
  Guarantee: the corresponding policy gives ≥ 43 – 1620.
• Quit after 10 iterations → we know V10(s1) = 467.
  Guarantee: the new corresponding policy gives ≥ 467 – 697 if we start in s1.
• Quit after 50 iterations → we know V50(s1) = 811.
  New guarantee: the same policy actually gives ≥ 811 – 10 if we start in s1.
[Figure: V(s1) per iteration for γ = 0.90, with the guarantees marked]

How Many Iterations? Discount 0.99

Bounds are incrementally tightened!
• Quit after 250 iterations → we know V250(s1) = 8989.
  Guarantee: the corresponding policy gives ≥ 8989 – 1621.
• Quit after 600 iterations → we know V600(s1) = 9775.
  Guarantee: ≥ 9775 – 48.
[Figure: V(s1) per iteration for γ = 0.99, with the guarantees marked]

Value Iteration: Algorithm

Value iteration to find π*:
• Start with an arbitrary value V0(s) for each s and an arbitrary ε > 0
• for k = 1, 2, …
  ▪ for each s in S do
    ▪ for each a in A do Q(s,a) := R(s,a) + γ ∑s' ∈ S Pa(s' | s) Vk–1(s')
      (a modified definition of Q(s,a): here we use the previous V)
    ▪ Vk(s) = maxa ∈ A Q(s,a)
    ▪ π(s) = argmaxa ∈ A Q(s,a)      // only needed in the final iteration
  ▪ if maxs ∈ S |Vk(s) – Vk–1(s)| < ε then exit   // almost no change!
• On an acyclic graph, the values converge in finitely many iterations
• On a cyclic graph, value convergence can take infinitely many iterations –
  that's why ε > 0 is needed
A runnable sketch follows below.
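A sketch of this loop, reusing P, states and R(s,a,s') from the earlier code blocks (so R is a per-transition reward rather than the R(s,a) used in the pseudocode; the expectation over outcomes handles the difference). The tolerance is my own choice.

def value_iteration(gamma=0.9, epsilon=1e-4):
    V = {s: 0.0 for s in states}
    while True:
        Q = {s: {a: sum(p * (R(s, a, s2) + gamma * V[s2])
                        for s2, p in P[s][a].items())
                 for a in P[s]}
             for s in states}
        V_new = {s: max(Q[s].values()) for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < epsilon:   # almost no change
            policy = {s: max(Q[s], key=Q[s].get) for s in states}
            return policy, V_new
        V = V_new

pi, V = value_iteration()
print(pi)   # should agree with the optimal policy found by policy iteration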

Discussion

Both algorithms converge in a polynomial number of iterations.
• But the variable in the polynomial is the number of states,
  and the number of states is usually huge
• We need to examine the entire state space in each iteration

Thus, these algorithms take huge amounts of time and space.
• Probabilistic set-theoretic planning is EXPTIME-complete
  ▪ Much harder than ordinary set-theoretic planning, which was "only" PSPACE-complete
  ▪ Methods exist for reducing the search space and for approximating optimal solutions –
    beyond the scope of this course
Overview B

Adding a third information level – Partially Observable: some information gained after an action.
• Deterministic (exact outcome known in advance):
  ▪ Classical planning (possibly with extensions) – the information dimension is meaningless
• Non-deterministic (multiple outcomes, no probabilities):
  ▪ Non-Observable: non-deterministic conformant planning
  ▪ Fully Observable: conditional (contingent) planning
  ▪ Partially Observable: (no specific name)
• Probabilistic (multiple outcomes with probabilities):
  ▪ Non-Observable: probabilistic conformant planning (a special case of POMDPs)
  ▪ Fully Observable: probabilistic conditional planning – Markov Decision Processes (MDPs)
  ▪ Partially Observable: Partially Observable MDPs (POMDPs)

In general:
• Full information is the easiest
• Partial information is the hardest!

Action Representations

Action representations:
• The book only deals with the underlying semantics:
  an "unstructured" probability distribution P(s, a, s')
• Several "convenient" representations are possible,
  such as Bayes networks or probabilistic operators

Representation Example: PPDDL

Probabilistic PDDL: new constructs for effects and the initial state.
• (probabilistic p1 e1 ... pk ek)
  ▪ Effect e1 takes place with probability p1, etc.
  ▪ The sum of the probabilities is <= 1 (it can be strictly less → implicit empty effect)

(define (domain bomb-and-toilet)
  (:requirements :conditional-effects :probabilistic-effects)
  (:predicates (bomb-in-package ?pkg) (toilet-clogged) (bomb-defused))
  (:action dunk-package
    :parameters (?pkg)
    :effect (and
      (when (bomb-in-package ?pkg) (bomb-defused))   ; first, a "standard" effect
      (probabilistic 0.05 (toilet-clogged)))))        ; 5% chance of toilet-clogged, 95% chance of no effect

(define (problem bomb-and-toilet)
  (:domain bomb-and-toilet)
  (:requirements :negative-preconditions)
  (:objects package1 package2)
  (:init (probabilistic 0.5 (bomb-in-package package1)    ; probabilistic initial state
                        0.5 (bomb-in-package package2)))
  (:goal (and (bomb-defused) (not (toilet-clogged)))))
Ladder 1

;; Authors: Sylvie Thiébaux and Iain Little
You are stuck on a roof because the ladder you climbed up on fell down.
There are plenty of people around; if you call out for help, someone will certainly lift the ladder up again.
Or you can try to climb down without it. You aren't a very good climber though,
so there is a 50-50 chance that you will fall and break your neck if you go it alone.
What do you do?

(define (problem climber-problem)
  (:domain climber)
  (:init (on-roof) (alive) (ladder-on-ground))
  (:goal (and (on-ground) (alive))))

Ladder 2

(define (domain climber)
  (:requirements :typing :strips :probabilistic-effects)
  (:predicates (on-roof) (on-ground)
               (ladder-raised) (ladder-on-ground) (alive))

  (:action climb-without-ladder :parameters ()
    :precondition (and (on-roof) (alive))
    :effect (and (not (on-roof))
                 (on-ground)
                 (probabilistic 0.4 (not (alive)))))

  (:action climb-with-ladder :parameters ()
    :precondition (and (on-roof) (alive) (ladder-raised))
    :effect (and (not (on-roof)) (on-ground)))

  (:action call-for-help :parameters ()
    :precondition (and (on-roof) (alive) (ladder-on-ground))
    :effect (and (not (ladder-on-ground))
                 (ladder-raised))))

Exploding Blocks World

When putting down a block:
• 30% risk that it explodes
• This destroys what you placed the block on
• Use additional blocks as potential "sacrifices"

(:action put-down-block-on-table
  :parameters (?b - block)
  :precondition (and (holding ?b)
                     (not (destroyed-table)))
  :effect (and (not (holding ?b))
               (on-top-of-table ?b)
               (when (not (detonated ?b))
                     (probabilistic .3 (and (detonated ?b)
                                            (destroyed-table))))))
Tire World

Reward/cost-based: the tire may go flat – calling AAA is expensive – so it is a good idea to load a spare.

(:action mov-car :parameters (?from - location ?to - location)
  :precondition (and (vehicle-at ?from) (road ?from ?to) (not (flattire)))
  :effect (and (vehicle-at ?to) (not (vehicle-at ?from))
               (decrease (reward) 1)
               (probabilistic .15 (flattire))))

(:action loadtire :parameters (?loc - location)
  :precondition (and (vehicle-at ?loc) (hasspare-location ?loc)
                     (not (hasspare-vehicle)))
  :effect (and (hasspare-vehicle) (not (hasspare-location ?loc))
               (decrease (reward) 1)))

(:action changetire
  :precondition (and (hasspare-vehicle) (flattire))
  :effect (and (decrease (reward) 1)
               (not (hasspare-vehicle)) (not (flattire))))

(:action callAAA
  :precondition (flattire)
  :effect (and (decrease (reward) 100)
               (not (flattire))))
Representation Example: RDDL

RDDL: Relational Dynamic Influence Diagram Language
• Based on Dynamic Bayesian Networks

domain prop_dbn {
  requirements = { reward-deterministic };

  // Define the state and action variables (not parameterized here)
  pvariables {
    p : { state-fluent, bool, default = false };
    q : { state-fluent, bool, default = false };
    r : { state-fluent, bool, default = false };
    a : { action-fluent, bool, default = false };
  };

  // Define the conditional probability function for each next
  // state variable in terms of previous state and action
  cpfs {
    p' = if (p ^ r) then Bernoulli(.9) else Bernoulli(.3);
    q' = if (q ^ r) then Bernoulli(.9)
         else if (a) then Bernoulli(.3) else Bernoulli(.8);
    r' = if (~q) then KronDelta(r) else KronDelta(r <=> q);
  };

  // Define the reward function; note that boolean functions are
  // treated as 0/1 integers in arithmetic expressions
  reward = p + q - r;
}