Introduction to AI
Stochastic Planning: Markov Decision Processes

Erez Karpas
Slides by Carmel Domshlak
IE&M, Technion

Outline
- Markov Decision Processes
- Finite Horizon
- Infinite Horizon
Classical deterministic planning vs. stochastic (decision-theoretic) planning

Dimensions along which planning problems vary:
- dynamics: deterministic, nondeterministic, or probabilistic
- observability: full, partial, or none
- horizon: finite or infinite
- ...

1. classical planning
2. conformant planning
3. conditional planning with full observability
4. conditional planning with partial observability
5. Markov decision processes (MDPs)
6. partially observable MDPs (POMDPs)
Markov Decision Processes

Definition (MDP)
An MDP is a tuple ⟨S, A, R, Tr⟩ where
- S is a finite set of states (|S| = n)
- A is a finite set of actions (|A| = m)
- Tr : S × A × S → [0, 1] is the transition function, with Tr(s, a, s') = P(s' | s, a)
  - the probability of reaching state s' after taking action a in state s
  - how many parameters does it take to represent? m · n²
- R : S → [0, r_max] is a bounded, real-valued reward function
  - R(s) is the immediate reward we get for reaching state s

Objective (informally)
- Select actions in order to maximize total reward.
- Example: in a goal-based domain, R(s) = 1 if s is a goal state, and R(s) = 0 otherwise.
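As a concrete illustration of this definition, here is a minimal Python sketch of an MDP container (not from the slides): states and actions are assumed to be indexed 0..n−1 and 0..m−1, and the names MDP, T, R as well as the small two-state example are made up for this sketch.

    import numpy as np

    # Minimal MDP container: n states, m actions,
    # T[a, s, s'] = P(s' | s, a)   (m * n^2 parameters, as noted above),
    # R[s] = immediate reward for reaching state s.
    class MDP:
        def __init__(self, T, R):
            T = np.asarray(T, dtype=float)
            R = np.asarray(R, dtype=float)
            assert T.ndim == 3 and T.shape[1] == T.shape[2] == R.shape[0]
            assert np.allclose(T.sum(axis=2), 1.0), "each T[a, s, :] must be a distribution"
            self.T, self.R = T, R
            self.m, self.n = T.shape[0], T.shape[1]

    # Illustrative goal-based example with two states and two actions:
    # state 1 is a goal state (reward 1) and is absorbing.
    T = [[[0.7, 0.3],   # action 0 from states 0 and 1
          [0.0, 1.0]],
         [[0.4, 0.6],   # action 1 from states 0 and 1
          [0.0, 1.0]]]
    R = [0.0, 1.0]
    mdp = MDP(T, R)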
Assumptions
- Markovian dynamics (history independence)
  - The next state depends only on the current state and action:
    P(S^{t+1} | A^t, S^t, A^{t-1}, S^{t-1}, ..., S^0) = P(S^{t+1} | A^t, S^t)
- Markovian reward process
  - The reward depends only on the current state:
    P(R^t | A^t, S^t, A^{t-1}, S^{t-1}, ..., S^0) = P(R^t | S^t)
  - We assume deterministic rewards
- Stationary (time-independent) dynamics
  - The world dynamics do not depend on the absolute time:
    P(S^{t+1} | A^t, S^t) = P(S^{k+1} | A^k, S^k) for all t, k
- Full observability
  - Though we cannot predict exactly which state we will reach when we execute an action, once it is executed we know what the state is

Solution Concept
- Classical planning: a solution is a sequence of actions ⟨a_1, a_2, ..., a_k⟩
- For MDPs, action sequences are insufficient
  - Actions have stochastic effects, so the state we end up in is uncertain
  - We might end up in states where the remainder of the action sequence is not attractive (or does not even apply)
- A solution should tell us what the best action is for any possible situation that might arise
Policies ("plans" for MDPs)
- A solution to an MDP is a policy
- Two types of policies: nonstationary and stationary
- Nonstationary (NS) policies are used when we are given a finite planning horizon H
  - We are told how many actions we will be allowed to take
- An NS policy is a function from states and times to actions
  - π : S × T → A, where T is the non-negative integers
  - π(s, t) tells us what action to take at state s when there are t stages-to-go

- What if we want to continue taking actions indefinitely?
- A stationary policy is a function from states to actions
  - π : S → A
  - π(s) is the action to take at state s (regardless of time)
  - specifies a continuously reactive controller
- Note that both nonstationary and stationary policies assume or have these properties:
  - full observability
  - history independence
  - deterministic action choice
Value of a Policy
- How good is a policy π? How do we measure the reward "accumulated by π"?
- A value function V : S[×T] → ℝ associates a value with each state (or with each state and time, for a nonstationary π)
- Vπ(s) denotes the value of policy π at state s
  - It depends on the immediate reward, but also on what you achieve subsequently by following π
- An optimal policy is one that is no worse than any other policy at any state
- The goal of MDP planning is to compute an optimal policy (the method depends on how we define value)

Finite Horizon

Finite-Horizon Value Functions
- We first consider maximizing total reward over a finite horizon
  - Assumes the agent has H time steps to live
- To act optimally, should the agent use a stationary or a nonstationary policy?
  - If you had only one week to live, would you act the same way as if you had fifty years to live?
Finite-Horizon Problems
- Value (utility) depends on the stage-to-go, so we use a nonstationary policy
- V_π^k(s) is the k-stage-to-go value function of π
  - the expected total reward for executing π starting in s for k time steps:

    V_π^k(s) = E[ Σ_{t=0}^{k} R^t | π, s ]
             = E[ Σ_{t=0}^{k} R(s^t) | a^t = π(s^t, k−t), s^0 = s ]

  - R^t and s^t are random variables

Computational Problems

Policy Evaluation
Given an MDP, a nonstationary policy π, and t ∈ T, compute the finite-horizon value function V_π^t(s).

Policy Optimization
Given an MDP and a horizon H, compute an optimal finite-horizon policy π*.

- How many finite-horizon policies are there? |A|^{Hn}
- Simple enumeration will not work ...
- We will show that policy optimization is equivalent to computing the "optimal value function"
Finite-Horizon Policy Evaluation
- We can use dynamic programming to compute V_π^t(s); the Markov property is critical for this:

    ∀s ∈ S:  V_π^0(s) = R(s)
             V_π^t(s) = R(s) + Σ_{s'} P(s' | s, π(s, t)) · V_π^{t-1}(s')
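The recurrence above translates directly into a short dynamic-programming loop. Below is a Python sketch reusing the MDP container from the earlier sketch; the function name and the policy encoding (an (H+1) × n integer array indexed by stages-to-go) are assumptions for illustration, not from the slides.

    import numpy as np

    def evaluate_finite_horizon(mdp, policy, H):
        """Return V with V[t, s] = t-stage-to-go value of a nonstationary policy.

        policy[t, s] is the action taken in state s with t stages-to-go.
        Implements V^0 = R and V^t(s) = R(s) + sum_s' P(s'|s, pi(s,t)) * V^{t-1}(s').
        """
        V = np.zeros((H + 1, mdp.n))
        V[0] = mdp.R                          # base case: V^0(s) = R(s)
        for t in range(1, H + 1):
            for s in range(mdp.n):
                a = policy[t, s]              # action prescribed at (s, t)
                V[t, s] = mdp.R[s] + mdp.T[a, s] @ V[t - 1]
        return V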
Policy Optimization: Bellman Backups
- How can we compute V_{π*}^t(s) given V_{π*}^{t-1}?
- [Figure: a one-step backup tree rooted at state s with value V_{π*}^t(s). Each action (a1, leading to s1/s2 with probabilities 0.7/0.3; a2, leading to s3/s4 with probabilities 0.4/0.6) points to successor states carrying V_{π*}^{t-1} values. First compute the expectation of V_{π*}^{t-1} over each action's successors, then compute the MAX over actions.]
Value Iteration: Finite Horizon Case
- The Markov property allows us to exploit the dynamic programming (DP) principle for optimal policy construction
  - no need to enumerate |A|^{Hn} possible policies!
  - DP: the optimal solution to the (t−1)-stage problem can be used without modification as part of the optimal solution to the t-stage problem

Value Iteration

    ∀s ∈ S:  V_{π*}^0(s) = R(s)
             V_{π*}^t(s) = R(s) + max_a Σ_{s'} P(s' | s, a) · V_{π*}^{t-1}(s')
             π*(s, t) = arg max_a Σ_{s'} P(s' | s, a) · V_{π*}^{t-1}(s')

- V_{π*}^t(s) is the optimal t-stage-to-go value function
- π*(s, t) is the optimal t-stage-to-go policy

Value Iteration: Complexity
- H iterations
- At each iteration, each of the n states computes an expectation for each of the |A| actions
- Each expectation takes O(n) time
- Total time complexity: O(H·|A|·n²)
  - Polynomial in the number of states. Is that good?
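Here is a Python sketch of these finite-horizon backups, again assuming the MDP container from the first sketch; the einsum-based vectorization and the returned (values, policy) pair are implementation choices, not prescribed by the slides.

    import numpy as np

    def value_iteration_finite(mdp, H):
        """Finite-horizon value iteration.

        Returns V with V[t, s] = optimal t-stage-to-go value and
        pi with pi[t, s] = optimal action at state s with t stages-to-go.
        """
        V = np.zeros((H + 1, mdp.n))
        pi = np.zeros((H + 1, mdp.n), dtype=int)
        V[0] = mdp.R                                   # V^0(s) = R(s)
        for t in range(1, H + 1):
            # Q[a, s] = sum_s' P(s' | s, a) * V^{t-1}(s')
            Q = np.einsum('asn,n->as', mdp.T, V[t - 1])
            V[t] = mdp.R + Q.max(axis=0)               # Bellman backup with MAX
            pi[t] = Q.argmax(axis=0)                   # maximizing action per state
        return V, pi

Each iteration costs O(|A|·n²) for the expectations, matching the O(H·|A|·n²) total noted above.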
Summary: Finite Horizon
- Value iteration provides us with an optimal policy:
  V_{π*}^t(s) ≥ V_π^t(s)  for all π, s, t
  - convince yourself (prove it) by induction on t
- Note: the optimal value function is unique, but the optimal policy is not

Infinite Horizon

Discounted Infinite Horizon MDPs
- Defining value as total reward is problematic with infinite horizons
  - many or all policies have infinite expected reward
  - some MDPs are OK (e.g., zero-cost absorbing states)
- "Trick": introduce a discount factor 0 ≤ γ < 1
  - future rewards are discounted by γ per time step:

    V_π(s) = E[ Σ_{t=0}^{∞} γ^t R^t | π, s ]

- Why? The value becomes bounded:

    V_π(s) ≤ E[ Σ_{t=0}^{∞} γ^t R_max ] = R_max / (1 − γ)

- Motivation: economic? probability of death? convenience?
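The bound above is just a geometric series; a one-line derivation (standard, not spelled out on the slide), in LaTeX:

    V_\pi(s) \;=\; E\Big[\sum_{t=0}^{\infty} \gamma^t R^t \,\Big|\, \pi, s\Big]
    \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max}
    \;=\; \frac{R_{\max}}{1-\gamma},
    \qquad \text{since } \sum_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma} \ \text{for } 0 \le \gamma < 1.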
Notes: Discounted Infinite Horizon
- An optimal policy maximizes value at each state
- Optimal policies are guaranteed to exist (Howard, 1960)
  - Furthermore, there is always an optimal stationary policy
  - Intuition: why would we change the action at s at a new time when there is always forever ahead?
- We define V*(s) = V_{π*}(s) for some optimal policy π*

Policy Evaluation
- Value equation for a fixed policy π
  - immediate reward + expected discounted future reward:

    V_π(s) = R(s) + γ Σ_{s'} P(s' | s, π(s)) · V_π(s')

- How can we compute V_π?
  - we are given R (rewards) and P (action dynamics)
  - this is a linear system with n variables and n constraints
    - variables: V_π(s_1), ..., V_π(s_n)
    - constraints: one value equation (above) per state
  - use linear algebra to solve this system and get V_π
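A Python sketch of this exact evaluation by solving the linear system (I − γ·P_π)·V = R, where P_π[s, s'] = P(s' | s, π(s)); it reuses the MDP container from the first sketch, and the function name is an assumption.

    import numpy as np

    def evaluate_policy(mdp, policy, gamma):
        """Exact value of a stationary policy: solves V = R + gamma * P_pi V."""
        n = mdp.n
        P_pi = mdp.T[policy, np.arange(n)]      # row s is T[policy[s], s, :]
        A = np.eye(n) - gamma * P_pi            # (I - gamma * P_pi) V = R
        return np.linalg.solve(A, mdp.R)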
Computing an Optimal Value Function
- Bellman equation for the optimal value function:

    V*(s) = R(s) + γ · max_a Σ_{s'} P(s' | s, a) · V*(s')

  - Bellman proved this is always true for an optimal value function
- The MAX operator makes the system non-linear, so the problem is more difficult than policy evaluation
- How can we compute V*?

Value Iteration
- We can compute the optimal policy using value iteration based on Bellman backups, just like in finite-horizon problems (but including the discount term):

    V^0(s) = R(s)
    V^t(s) = R(s) + γ · max_a Σ_{s'} P(s' | s, a) · V^{t-1}(s')

- Does it converge to the optimal value function? YES:  lim_{k→∞} V^k = V*
- When should we stop in practice?
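A Python sketch of discounted value iteration, using the max-norm stopping test that the next slide justifies; the tolerance name eps and the vectorization are implementation choices, not from the slides.

    import numpy as np

    def value_iteration(mdp, gamma, eps=1e-6):
        """Discounted value iteration: V <- R + gamma * max_a sum_s' P(s'|s,a) V(s')."""
        V = mdp.R.copy()                          # V^0(s) = R(s)
        while True:
            Q = np.einsum('asn,n->as', mdp.T, V)  # Q[a, s] = expected next value under a
            V_new = mdp.R + gamma * Q.max(axis=0)
            if np.max(np.abs(V_new - V)) <= eps:  # stop when ||V^k - V^{k-1}|| <= eps
                return V_new
            V = V_new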
Convergence
- The Bellman backup B is a contraction operator on value functions
  - Notice that the optimal value function is a fixed point of the Bellman backup operator B (i.e., B[V*] = V*):

    B[V](s) = R(s) + γ · max_a Σ_{s'} P(s' | s, a) · V(s')

  - For any V and V', ||B[V] − B[V']|| ≤ γ · ||V − V'||
    - ||V|| is the max-norm: ||(0.1, 100, 5, 12)|| = 100
    - So applying a Bellman backup to any two value functions brings them closer together in the max-norm sense!
- Convergence is assured from any initial V:
  ||V* − B[V]|| = ||B[V*] − B[V]|| ≤ γ · ||V* − V||
  ⇒ lim_{k→∞} V^k = V*
- When to stop in practice? When ||V^k − V^{k−1}|| ≤ ε
  - this ensures ||V* − V^k|| ≤ εγ / (1 − γ) (be able to verify that)
  - so stop when V^k and V^{k−1} are similar enough

How to Act?
- Given a V^k from value iteration that closely approximates V*, what should we use as our policy?
- Use the greedy policy (one-step lookahead):

    greedy[V^k](s) = arg max_a Σ_{s'} P(s' | s, a) · V^k(s')

- Note: the value of the greedy policy may not be equal to V^k
- How close is V^greedy to V*?
  - if V^k is ε-close to V*, then V^greedy is (2εγ / (1 − γ))-close to V*
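A Python sketch of the one-step-lookahead extraction; note that adding R(s) or the factor γ inside the argmax would not change the chosen action, since neither depends on a.

    import numpy as np

    def greedy_policy(mdp, V):
        """greedy[V](s) = argmax_a sum_s' P(s' | s, a) * V(s')."""
        Q = np.einsum('asn,n->as', mdp.T, V)  # Q[a, s] = expected value of s' under a
        return Q.argmax(axis=0)               # best action for each state

    # Typical use with the earlier sketches (illustrative):
    #   V = value_iteration(mdp, gamma=0.9)
    #   pi = greedy_policy(mdp, V)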
Optimization via Policy Iteration
- Given a fixed policy, we can compute its value exactly:

    V_π(s) = E[ Σ_{t=0}^{∞} γ^t R^t | π, s ]

- Policy iteration exploits this: it alternates steps of policy evaluation and policy improvement
  1. Choose a random policy π
  2. Loop:
     2.1 Evaluate V_π
     2.2 For each s, set π'(s) := arg max_a Σ_{s'} P(s' | s, a) · V_π(s')
     2.3 Replace π with π'
     until no improving action is possible at any state

Policy Iteration Notes
- Each step of policy iteration is guaranteed to strictly improve the policy at some state when improvement is possible
- Convergence is assured (Howard)
  - there are no local maxima in value space, and each step must improve the value; since there is a finite number of policies, the process converges to an optimal policy
- Gives the exact value of the optimal policy
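A Python sketch of this loop, reusing evaluate_policy and the MDP container from the earlier sketches; the random initial policy and the stop-when-unchanged test follow the outline above, while details such as tie-breaking in the argmax are not specified on the slides.

    import numpy as np

    def policy_iteration(mdp, gamma, seed=0):
        """Alternate exact policy evaluation with greedy policy improvement."""
        rng = np.random.default_rng(seed)
        policy = rng.integers(mdp.m, size=mdp.n)     # 1. choose a random policy
        while True:
            V = evaluate_policy(mdp, policy, gamma)  # 2.1 evaluate V_pi (linear solve)
            Q = np.einsum('asn,n->as', mdp.T, V)     # Q[a, s] = sum_s' P(s'|s,a) * V_pi(s')
            new_policy = Q.argmax(axis=0)            # 2.2 greedy improvement at every state
            if np.array_equal(new_policy, policy):   # 2.3 stop: no improving action anywhere
                return policy, V
            policy = new_policy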
Value Iteration vs. Policy Iteration
- Which is faster, VI or PI?
  - it depends on the problem instance
- VI takes more iterations than PI, but PI requires more time per iteration
  - PI must perform policy evaluation at each step, which involves solving a linear system
- Complexity
  - There are at most exp(n) policies, so PI is no worse than exponential time in the number of states
  - Empirically, O(n) iterations are required!
  - A polynomial bound on the number of PI iterations was an open problem until very recently. Closed!