Markov Decision Processes

Definitions; stationary policies; the value iteration algorithm, the policy improvement algorithm, and linear programming for discounted cost and average cost criteria.
Markov Decision Process
Let X = {X0, X1, …} be a system description process on state space E, and let D = {D0, D1, …} be a decision process with action space A. The process (X, D) is a Markov decision process if, for j ∈ E and n = 0, 1, …,

$$P\{X_{n+1} = j \mid X_n, D_n, \ldots, X_0, D_0\} = P\{X_{n+1} = j \mid X_n, D_n\}$$
Furthermore, for each k ∈ A, let fk be a cost vector and Pk be a one-step transition probability matrix. The cost fk(i) is incurred whenever Xn = i and Dn = k, and

$$P\{X_{n+1} = j \mid X_n = i, D_n = k\} = P_k(i, j)$$
The problem is to determine how to choose a sequence of
actions in order to minimize cost.
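As a concrete data layout, an MDP of this kind is fully specified by one cost vector f_k and one transition matrix P_k per action. Below is a minimal Python/numpy sketch (not from the slides); the two-state, two-action numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical two-state, two-action MDP used only for illustration.
# f[k][i]   = cost f_k(i) incurred when X_n = i and D_n = k
# P[k][i,j] = P{ X_{n+1} = j | X_n = i, D_n = k }
f = {0: np.array([1.0, 3.0]),
     1: np.array([2.0, 1.0])}
P = {0: np.array([[0.8, 0.2],
                  [0.3, 0.7]]),
     1: np.array([[0.5, 0.5],
                  [0.9, 0.1]])}

# Sanity check: each P_k must be a stochastic matrix (rows sum to 1).
for k, Pk in P.items():
    assert np.allclose(Pk.sum(axis=1), 1.0)
```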
Policies
A policy is a rule that specifies which action to take at each point
in time. Let D denote the set of all policies.
In general, the decisions specified by a policy may
– depend on the current state of the system description process
– be randomized (depend on some external random event)
– also depend on past states and/or decisions
A stationary policy is defined by a (deterministic) action function
that assigns an action to each state, independent of previous
states, previous actions, and time n.
Under a stationary policy, the MDP is a Markov chain.
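To make the last point concrete: fixing an action function a collapses the per-action data into a single cost vector and a single transition matrix, which define the resulting Markov chain. A minimal sketch, reusing the hypothetical toy MDP above:

```python
import numpy as np

# Hypothetical toy MDP (same illustrative data as before).
f = {0: np.array([1.0, 3.0]), 1: np.array([2.0, 1.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}

a = [0, 1]   # stationary policy: a(i) is the action taken whenever X_n = i

# Under policy a, the MDP is a Markov chain with cost f_a(i) = f_{a(i)}(i)
# and one-step transition matrix P_a(i, j) = P_{a(i)}(i, j).
f_a = np.array([f[a[i]][i] for i in range(len(a))])
P_a = np.array([P[a[i]][i, :] for i in range(len(a))])
print(f_a, P_a, sep="\n")
```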
Cost Minimization Criteria
Since an MDP goes on indefinitely, the total cost is likely to be infinite. In order to compare policies meaningfully, two criteria are commonly used:
1. Expected total discounted cost computes the present worth of future costs using a discount factor α < 1, such that one dollar obtained at time n = 1 has a present value of α at time n = 0. Typically, if r is the rate of return, then α = 1/(1 + r). The expected total discounted cost is

$$E\left[\sum_{n=0}^{\infty} \alpha^n f_{D_n}(X_n)\right]$$

2. The long run average cost is

$$\lim_{m \to \infty} \frac{1}{m} \sum_{n=0}^{m-1} f_{D_n}(X_n)$$
Optimization with Stationary Policies
If the state space E is finite, there exists a stationary policy
that solves the problem to minimize the discounted cost:
$$v^{\alpha}(i) = \min_{d \in D} v_d^{\alpha}(i), \qquad \text{where } v_d^{\alpha}(i) = E_d\left[\sum_{n=0}^{\infty} \alpha^n f_{D_n}(X_n) \,\middle|\, X_0 = i\right]$$
If every stationary policy results in an irreducible Markov
chain, there exists a stationary policy that solves the problem
to minimize the average cost:
$$\varphi^* = \min_{d \in D} \varphi_d, \qquad \text{where } \varphi_d = \lim_{m \to \infty} \frac{1}{m} \sum_{n=0}^{m-1} f_{D_n}(X_n)$$
Computing Expected Discounted Costs
Let X = {X0, X1, …} be a Markov chain with one-step transition probability matrix P, let f be a cost function that assigns a cost to each state of the Markov chain, and let α (0 < α < 1) be a discount factor. Then the expected total discounted cost is

$$g(i) = E\left[\sum_{n=0}^{\infty} \alpha^n f(X_n) \,\middle|\, X_0 = i\right] = \left[(I - \alpha P)^{-1} f\right](i)$$
Why? Starting from state i, the expected discounted cost can be found recursively as $g(i) = f(i) + \alpha \sum_j P_{ij}\, g(j)$, or $g = f + \alpha P g$.
Note that the expected discounted cost always depends on the
initial state, while for the average cost criterion the initial
state is unimportant.
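For example, g can be computed by solving the linear system (I − αP)g = f directly (avoiding an explicit inverse). The chain below is the one induced by the policy a = (0, 1) in the hypothetical toy MDP; the numbers are illustrative only.

```python
import numpy as np

alpha = 0.9
# Markov chain induced by a fixed stationary policy (hypothetical data).
f = np.array([1.0, 1.0])
P = np.array([[0.8, 0.2],
              [0.9, 0.1]])

# Solve g = f + alpha * P g, i.e. (I - alpha P) g = f.
g = np.linalg.solve(np.eye(len(f)) - alpha * P, f)
print(g)   # g(i) = expected total discounted cost starting from state i
```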
Solution Procedures for Discounted Costs
Let $v^{\alpha}$ be the (vector) optimal value function whose ith component is $v^{\alpha}(i) = \min_{d \in D} v_d^{\alpha}(i)$.

For each i ∈ E,

$$v^{\alpha}(i) = \min_{k \in A}\left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, v^{\alpha}(j) \right\}$$

These equations uniquely determine $v^{\alpha}$.
If we can somehow obtain the values $v^{\alpha}$ that satisfy the above equations, then the optimal policy is the vector a, where

$$a(i) = \arg\min_{k \in A}\left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, v^{\alpha}(j) \right\}$$

(“arg min” is “the argument that minimizes.”)
Value Iteration for Discounted Costs
Make an initial guess, then keep applying the optimal value equations until the fixed point is (approximately) reached.
Step 1. Choose ε > 0, set n = 0, let v0(i) = 0 for each i in E.
Step 2. For each i in E, find

$$v_{n+1}(i) = \min_{k \in A}\left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, v_n(j) \right\}$$
Step 3. Let $\delta = \max_{i \in E} \left\{ v_{n+1}(i) - v_n(i) \right\}$.
Step 4. If δ < ε, stop with vα = vn+1. Otherwise, set n = n+1
and return to Step 2.
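A minimal Python sketch of these four steps on the hypothetical toy MDP from earlier (data are illustrative, not from the slides; absolute differences are used in the stopping test):

```python
import numpy as np

# Hypothetical two-state, two-action MDP (illustration only).
f = {0: np.array([1.0, 3.0]), 1: np.array([2.0, 1.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
alpha, eps = 0.9, 1e-8

v = np.zeros(2)                       # Step 1: v_0(i) = 0 for each state
while True:
    # Step 2: one application of the optimal value equations
    q = np.array([f[k] + alpha * P[k] @ v for k in sorted(P)])  # q[k, i]
    v_new = q.min(axis=0)
    # Steps 3-4: stop when successive iterates are within eps
    if np.max(np.abs(v_new - v)) < eps:
        break
    v = v_new

a = q.argmin(axis=0)                  # greedy policy with respect to v
print(v_new, a)
```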
Policy Improvement for Discounted Costs
Start myopic, then consider longer-term consequences.
Step 1. Set n = 0 and let $a_0(i) = \arg\min_{k \in A} f_k(i)$.
Step 2. Adopt the cost vector and transition matrix of the current policy: $f(i) = f_{a_n(i)}(i)$, $P(i, j) = P_{a_n(i)}(i, j)$.
Step 3. Find the value function $v = (I - \alpha P)^{-1} f$.
Step 4. Re-optimize: $a_{n+1}(i) = \arg\min_{k \in A}\left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, v(j) \right\}$.
Step 5. If $a_{n+1}(i) = a_n(i)$ for every i, then stop with $v^{\alpha} = v$ and $a^{\alpha} = a_n$. Otherwise, set n = n + 1 and return to Step 2.
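A sketch of the five steps in Python, again on the hypothetical toy MDP; the policy evaluation in Step 3 solves the linear system rather than inverting the matrix:

```python
import numpy as np

# Hypothetical two-state, two-action MDP (illustration only).
f = {0: np.array([1.0, 3.0]), 1: np.array([2.0, 1.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
alpha = 0.9
states, actions = range(2), sorted(P)

a = np.array([min(actions, key=lambda k: f[k][i]) for i in states])  # Step 1: myopic
while True:
    # Step 2: cost vector and transition matrix of the current policy
    f_a = np.array([f[a[i]][i] for i in states])
    P_a = np.array([P[a[i]][i, :] for i in states])
    # Step 3: evaluate the policy, v = (I - alpha P)^(-1) f
    v = np.linalg.solve(np.eye(len(f_a)) - alpha * P_a, f_a)
    # Step 4: re-optimize against v
    q = np.array([f[k] + alpha * P[k] @ v for k in actions])
    a_new = q.argmin(axis=0)
    # Step 5: stop when the policy no longer changes
    if np.array_equal(a_new, a):
        break
    a = a_new

print(v, a)
```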
Linear Programming for Discounted Costs
Consider the linear program:
$$\max \sum_{i \in E} u(i) \quad \text{s.t.} \quad u(i) \le f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, u(j) \ \text{ for each } i, k$$
The optimal value of u(i) will be vα(i), and the optimal policy
is identified by the constraints that hold as equalities in the
optimal solution (slack variables equal 0).
Note: the decision variables are unrestricted in sign!
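A sketch of this LP using scipy.optimize.linprog on the hypothetical toy MDP; the bounds argument makes u unrestricted in sign, and tight constraints (zero slack) reveal the optimal actions:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical two-state, two-action MDP (illustration only).
f = {0: np.array([1.0, 3.0]), 1: np.array([2.0, 1.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
alpha, nE = 0.9, 2
I = np.eye(nE)

# One constraint row per (k, i): u(i) - alpha * sum_j P_k(i,j) u(j) <= f_k(i)
A_ub = np.vstack([(I - alpha * P[k])[i] for k in sorted(P) for i in range(nE)])
b_ub = np.array([f[k][i] for k in sorted(P) for i in range(nE)])

# Maximize sum_i u(i); u is unrestricted in sign.
res = linprog(c=-np.ones(nE), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * nE)
v_alpha = res.x
# Optimal actions correspond to constraints that are (numerically) tight.
slack = b_ub - A_ub @ v_alpha
print(v_alpha, slack.round(6))
```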
Long Run Average Cost per Period
For a given policy d, its long run average cost can be found from its cost vector fd and one-step transition probability matrix Pd:
First, find the limiting probabilities by solving

$$\pi_j = \sum_{i \in E} \pi_i P_d(i, j), \ \ j \in E; \qquad \sum_{j \in E} \pi_j = 1$$

Then

$$\varphi_d = \lim_{m \to \infty} \frac{1}{m} \sum_{n=0}^{m-1} f_d(X_n) = \sum_{j \in E} f_d(j)\, \pi_j$$
So, in principle we could simply enumerate all policies and choose the one
with the smallest average cost… not practical if A and E are large.
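For instance, for the chain induced by the policy a = (0, 1) in the hypothetical toy MDP, the limiting probabilities and the average cost can be computed as follows (data illustrative only):

```python
import numpy as np

# Markov chain induced by a fixed stationary policy (hypothetical data).
f_d = np.array([1.0, 1.0])
P_d = np.array([[0.8, 0.2],
                [0.9, 0.1]])
n = len(f_d)

# Limiting probabilities: pi P = pi together with sum(pi) = 1.
# Stack the balance equations with the normalization condition and solve
# the (overdetermined but consistent) system by least squares.
A = np.vstack([P_d.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

phi_d = pi @ f_d      # long run average cost = sum_j f_d(j) pi_j
print(pi, phi_d)
```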
Recursive Equation for Average Cost
Assume that every stationary policy yields an irreducible Markov chain. Then there exists a scalar ϕ* and a vector h such that for all states i in E,

$$\varphi^* + h(i) = \min_{k \in A}\left\{ f_k(i) + \sum_{j \in E} P_k(i, j)\, h(j) \right\}$$
The scalar ϕ∗ is the optimal average cost and the optimal policy
is found by choosing for each state the action that achieves
the minimum on the right-hand-side.
The vector h is unique up to an additive constant … as we will see, the difference h(i) − h(j) represents the increase in total cost from starting out in state i rather than in state j.
Relationships between Discounted Cost
and Long Run Average Cost
• If a cost of c is incurred each period and α is the discount factor, then the total discounted cost is $v = \sum_{n=0}^{\infty} c\,\alpha^n = \dfrac{c}{1 - \alpha}$.
• Therefore, a total discounted cost v is equivalent to an average cost of c = (1 − α)v per period, so $\lim_{\alpha \to 1} (1 - \alpha)\, v^{\alpha}(i) = \varphi^*$.
• Let $v^{\alpha}$ be the optimal discounted cost vector, ϕ* be the optimal average cost, and h be the mystery vector from the previous slide. Then

$$\lim_{\alpha \to 1} \left[ v^{\alpha}(i) - v^{\alpha}(j) \right] = h(i) - h(j)$$
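A quick numerical sanity check of these limits on the hypothetical toy MDP (the value-iteration helper below is just an illustrative sketch): as α approaches 1, (1 − α)vα(i) settles toward a common value ϕ*, and the difference vα(0) − vα(1) stabilizes toward h(0) − h(1).

```python
import numpy as np

# Hypothetical two-state, two-action MDP (illustration only).
f = {0: np.array([1.0, 3.0]), 1: np.array([2.0, 1.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}

def optimal_discounted_value(alpha, tol=1e-10):
    """Value iteration for the optimal discounted cost vector v_alpha."""
    v = np.zeros(2)
    while True:
        v_new = np.array([f[k] + alpha * P[k] @ v for k in sorted(P)]).min(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

for alpha in (0.9, 0.99, 0.999):
    v = optimal_discounted_value(alpha)
    # (1 - alpha) v_alpha(i) should approach phi* for every i,
    # and v_alpha(0) - v_alpha(1) should approach h(0) - h(1).
    print(alpha, (1 - alpha) * v, v[0] - v[1])
```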
Policy Improvement for Average Costs
Designate one state in E to be “state number 1”
Step 1. Set n = 0 and let $a_0(i) = \arg\min_{k \in A} f_k(i)$.
Step 2. Adopt the cost vector and transition matrix of the current policy: $f(i) = f_{a_n(i)}(i)$, $P(i, j) = P_{a_n(i)}(i, j)$.
Step 3. With h(1) = 0, solve $\varphi + h = f + P h$ for ϕ and h.
Step 4. Re-optimize: $a_{n+1}(i) = \arg\min_{k \in A}\left\{ f_k(i) + \sum_{j \in E} P_k(i, j)\, h(j) \right\}$ (note: no discount factor here).
Step 5. If $a_{n+1}(i) = a_n(i)$ for every i, then stop with ϕ* = ϕ and a*(i) = an(i). Otherwise, set n = n + 1 and return to Step 2.
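A sketch of the five steps in Python on the hypothetical toy MDP; Step 3 pins h at the designated state (state 0 here, playing the role of "state number 1") and solves the resulting linear system for ϕ and the remaining components of h:

```python
import numpy as np

# Hypothetical two-state, two-action MDP (illustration only).
f = {0: np.array([1.0, 3.0]), 1: np.array([2.0, 1.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
states, actions = range(2), sorted(P)
n = len(states)

a = np.array([min(actions, key=lambda k: f[k][i]) for i in states])  # Step 1: myopic
while True:
    # Step 2: cost vector and transition matrix of the current policy
    f_a = np.array([f[a[i]][i] for i in states])
    P_a = np.array([P[a[i]][i, :] for i in states])
    # Step 3: with h(state 0) = 0, solve phi + h = f_a + P_a h.
    # Unknowns are phi and h(1), ..., h(n-1); the coefficient of phi is 1 in
    # every equation and the coefficients of h come from (I - P_a).
    A = np.hstack([np.ones((n, 1)), (np.eye(n) - P_a)[:, 1:]])
    x = np.linalg.solve(A, f_a)
    phi, h = x[0], np.concatenate([[0.0], x[1:]])
    # Step 4: re-optimize against h (no discount factor)
    q = np.array([f[k] + P[k] @ h for k in actions])
    a_new = q.argmin(axis=0)
    # Step 5: stop when the policy no longer changes
    if np.array_equal(a_new, a):
        break
    a = a_new

print(phi, h, a)
```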
Linear Programming for Average Costs
Consider randomized policies: let wi(k) = P{Dn = k | Xn = i}. A stationary policy has wi(k) = 1 when k = a(i) and 0 otherwise. The decision variables are x(i, k) = wi(k)π(i).
The objective is to minimize the expected value of the average cost (expectation taken over the randomized policy):
$$\min \ \varphi = \sum_{i \in E} \sum_{k \in A} x(i, k)\, f_k(i)$$

$$\text{s.t.} \quad \sum_{k \in A} x(j, k) = \sum_{i \in E} \sum_{k \in A} x(i, k)\, P_k(i, j) \ \text{ for each } j \in E, \qquad \sum_{i \in E} \sum_{k \in A} x(i, k) = 1$$
Note that one constraint will be redundant and may be dropped.
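A sketch of this LP with scipy.optimize.linprog on the hypothetical toy MDP; one balance constraint is dropped as noted above, and a stationary policy is read off from the positive x(i, k):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical two-state, two-action MDP (illustration only).
f = {0: np.array([1.0, 3.0]), 1: np.array([2.0, 1.0])}
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
states, actions = list(range(2)), sorted(P)
pairs = [(i, k) for i in states for k in actions]   # decision variables x(i, k)

c = np.array([f[k][i] for i, k in pairs])           # objective: sum x(i,k) f_k(i)

A_eq, b_eq = [], []
# Balance constraints, one per state j (the last one is redundant and dropped).
for j in states[:-1]:
    row = [(1.0 if i == j else 0.0) - P[k][i, j] for i, k in pairs]
    A_eq.append(row); b_eq.append(0.0)
# Normalization: the x(i, k) sum to 1.
A_eq.append([1.0] * len(pairs)); b_eq.append(1.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * len(pairs))
x, phi = res.x, res.fun
# Recover a stationary policy: in state i, take the action carrying the mass.
policy = {i: max(actions, key=lambda k: x[pairs.index((i, k))]) for i in states}
print(phi, policy)
```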