Background Material: Markov Decision Process

Reference
Class notes
Further studies:
Dynamic Programming and Optimal Control
D. Bertsekas, Volume 1
Chapters 1, 4, 5, 6, 7
Discrete Time Framework
xk: system state, belongs to set Sk
uk: control action, belongs to set U(xk) ⊂ Ck
wk: random disturbance, characterized by a probability
distribution Pk(· | xk, uk) which may depend on xk, uk but not
on the values of the prior disturbances w0, ..., wk-1
xk+1 = fk(xk, uk, wk)
N: number of times control is applied
gk(xk, uk, wk): cost in slot k
gN(xN): terminal cost
Finite Horizon Objective
Choose the controls such that the additive expected cost over
N time slots is minimized, that is, minimize
E{ gN(xN) + Σ_{k=0}^{N-1} gk(xk, uk, wk) }
Control strategy:
π = { u0 = μ0(x0), ..., uN-1 = μN-1(xN-1) }
Cost associated with control strategy π and initial state x0:
Jπ(x0) = Ew{ gN(xN) + Σ_{k=0}^{N-1} gk(xk, uk, wk) }
Choose π such that Jπ(x0) is minimized for all initial states x0.
Optimal controls need only be a function of the current state
(history independence).
Type of Control
Open loop
Cannot change in response to the system state
Optimal if the disturbance is a deterministic function
of the state and the control
Closed loop
Can change in response to the system state
Illustrating Example: Inventory
Control
xk: stock available at the beginning of the kth period
Sk: set of integers
uk: stock ordered at the beginning of the kth period
U(xk) = Ck: set of nonnegative integers
wk: demand during the kth period, characterized by a
probability distribution Pk(wk); w0, ..., wN-1 are
independent
xk+1 = xk + uk - wk
Negative stock: backlogged demand
N: time horizon of the optimization
gk(xk, uk, wk): cost in slot k, consisting of two components:
a penalty for storage and unfulfilled demands, r(xk)
an ordering cost, c·uk
gk(xk, uk, wk) = c·uk + r(xk)
gN(xN) = r(xN): terminal cost for being left with inventory xN
Example Control action
uk = σk - xk if xk < σk
   = 0 otherwise
(threshold type)
Bellman's Principle of Optimality
Let the optimal strategy be π* = {μ0*, ..., μN-1*}. Assume that a
given state x occurs with positive probability at time j. Let the
system be in state x in slot j; then the truncated control sequence
{μj*, ..., μN-1*} minimizes the cost to go from slot j to N, that is,
minimizes Ew{ gN(xN) + Σ_{k=j}^{N-1} gk(xk, uk, wk) }
Dynamic Programming
Algorithm
The optimal cost is given by the following
iteration, which proceeds backwards:
JN(xN) = gN(xN)
Jk(xk) = min_{u in U(xk)} Ew{ gk(xk, uk, wk) + Jk+1(xk+1) }
       = min_{u in U(xk)} Ew{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }
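As a concrete instance, here is a minimal sketch of this backward recursion applied to the inventory example above. The demand law, the cost numbers, the stock cap, and the lost-sales simplification (unmet demand is dropped rather than backlogged, to keep the state space finite) are all illustrative assumptions, not values from the notes.

# Backward DP for a small inventory problem (illustrative parameters).
c, CAP, N = 1.0, 10, 5
demand = {0: 0.2, 1: 0.5, 2: 0.3}        # P(w_k), i.i.d. across slots
r = lambda x: 0.5 * x                    # holding penalty on stock x
shortage = 4.0                           # penalty per unit of lost demand

J = [[0.0] * (CAP + 1) for _ in range(N + 1)]
J[N] = [r(x) for x in range(CAP + 1)]    # J_N(x_N) = g_N(x_N)
policy = [[0] * (CAP + 1) for _ in range(N)]

for k in reversed(range(N)):
    for x in range(CAP + 1):
        best_u, best = 0, float('inf')
        for u in range(CAP - x + 1):     # U(x_k): keep stock within the cap
            exp_cost = 0.0
            for w, p in demand.items():
                nxt = max(x + u - w, 0)  # lost sales instead of backlog
                stage = c * u + r(x) + shortage * max(w - x - u, 0)
                exp_cost += p * (stage + J[k + 1][nxt])
            if exp_cost < best:
                best, best_u = exp_cost, u
        J[k][x], policy[k][x] = best, best_u

print(policy[0])   # typically threshold-shaped: order up to a fixed level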
Optimizing a Chess Match
Strategy
A player plays against an opponent who does not adapt
his play to the current state.
They play N games.
If the scores are tied at the end of the N games, the
players go to sudden death, where they play until
one is ahead of the other.
A draw fetches 0 points for both; a win fetches 1
point for the winner and 0 for the loser.
The player can play timid, in which case he draws a game with
probability pd and loses with probability (1 - pd).
The player can play bold, in which case he wins a game with
probability pw and loses with probability (1 - pw).
Optimal strategy in sudden death?
Play bold
Optimal Strategy in initial N
games
xk: difference between the score of the player and his
opponent after k games
Sk: integers between -k and k
uk: timid (0) or bold (1)
U(xk) = {0, 1}
wk: score increment; probability distribution for timid: {pd, 1 - pd};
probability distribution for bold: {pw, 1 - pw}
xk+1 = xk + wk
N: time horizon of the optimization
Consider maximization of reward instead of
minimization of cost.
The terminal reward gN(xN) is the probability of winning the
match given the final score difference xN:
gN(xN) = 0 if xN < 0
       = pw if xN = 0 (sudden death, playing bold)
       = 1 if xN > 0
gk(xk, uk, wk) = 0 if k < N
JN(xN) = gN(xN)
Jk(xk) = max_{u in U(xk)} Ew{ Jk+1(xk+1) }
       = max{ pd Jk+1(xk) + (1 - pd) Jk+1(xk - 1),
              pw Jk+1(xk + 1) + (1 - pw) Jk+1(xk - 1) }
Let's work it out!
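Here is one way to work it out numerically: a minimal backward-DP sketch. The values pd = 0.9 and pw = 0.45 are illustrative assumptions (timid play can only help when pd > pw), as is the horizon.

pd, pw, N = 0.9, 0.45, 4

def terminal(x):
    # g_N: probability of winning the match given final score difference x
    return 1.0 if x > 0 else (pw if x == 0 else 0.0)

J = {x: terminal(x) for x in range(-N, N + 1)}   # J_N
policy = {}
for k in range(N - 1, -1, -1):
    Jk = {}
    for x in range(-k, k + 1):
        timid = pd * J[x] + (1 - pd) * J[x - 1]       # draw or lose
        bold = pw * J[x + 1] + (1 - pw) * J[x - 1]    # win or lose
        Jk[x] = max(timid, bold)
        policy[(k, x)] = 'timid' if timid > bold else 'bold'
    J = Jk

print(J[0], policy[(0, 0)])   # win probability from an even start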
State Augmentation
What if the system state depends not only on the preceding
state and control, but also on earlier states and controls?
xk+1 = fk(xk, uk, xk-1, uk-1, wk),
x1 = f0(x0, u0, w0)
Now the state is (xk, yk, sk), with
xk+1 = fk(xk, yk, sk, uk, wk),
yk+1 = xk
sk+1 = uk
A similar augmentation handles a time lag in the cost.
Correlated Disturbances
What if w0, ..., wN-1 are
not independent?
Let wj depend on wj-1.
Augment the state to (xk, yk), where yk = wk-1:
xk+1 = fk(xk, uk, wk), with wk distributed as Pk(· | yk),
yk+1 = wk
Linear Systems and Quadratic
Cost
xk+1 = Ak xk + Bk uk + wk
gN(xN) = xNᵀ QN xN
gk(xk, uk) = xkᵀ Qk xk + ukᵀ Rk uk
Optimal policy: μk(xk) = Lk xk, where
Lk = -(Bkᵀ Kk+1 Bk + Rk)⁻¹ Bkᵀ Kk+1 Ak
KN = QN
Kk = Akᵀ (Kk+1 - Kk+1 Bk (Bkᵀ Kk+1 Bk + Rk)⁻¹ Bkᵀ Kk+1) Ak + Qk
Optimum cost:
J(x0) = x0ᵀ K0 x0 + Σ_{k=0}^{N-1} E(wkᵀ Kk+1 wk)
Let Ak = A, Bk = B, Rk = R, Qk = Q.
Then as the number of remaining stages N - k becomes large, Kk
converges to the steady-state solution of the algebraic Riccati equation
K = Aᵀ (K - K B (Bᵀ K B + R)⁻¹ Bᵀ K) A + Q
μ(x) = L x, where
L = -(Bᵀ K B + R)⁻¹ Bᵀ K A
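A minimal numpy sketch of this backward Riccati recursion; the matrices A, B, Q, R and the horizon below are illustrative choices, not values from the notes.

import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[0.1]])
N = 50

K = Q.copy()                                       # K_N = Q_N
for _ in range(N):
    BKB = B.T @ K @ B + R
    L = -np.linalg.solve(BKB, B.T @ K @ A)         # gain L_k
    K = A.T @ (K - K @ B @ np.linalg.solve(BKB, B.T @ K)) @ A + Q

print(L)   # approaches the steady-state gain -(B'KB + R)^(-1) B'KA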
Optimal Stopping Problem
One of the control actions allows the system to stop in any slot.
The decision maker can terminate the system at a certain loss,
or choose to continue at a certain cost.
The challenge is to decide when to stop so as to minimize
the total cost.
Asset selling problem
A person has an asset for which he receives quotes
w0, ..., wN-1, one in each slot.
Quotes are independent from slot to slot.
If the person accepts an offer, he can invest the proceeds at a
fixed rate of interest r > 0.
The control action is to sell or not to sell.
The state is the offer in the previous slot if the asset is not sold
yet, or T if it is sold:
xk+1 = T if sold in a previous slot
     = wk otherwise
Reward:
gN(xN) = xN if xN ≠ T
       = 0 otherwise
gk(xk, uk, wk) = (1+r)^{N-k} xk if xk ≠ T and the decision is to sell
              = 0 otherwise
JN(xN) = xN if xN ≠ T
       = 0 otherwise
Jk(xk) = max{ (1+r)^{N-k} xk, E[Jk+1(xk+1)] } if xk ≠ T
       = 0 if xk = T
Let αk = E[Jk+1(wk)] / (1+r)^{N-k}
Optimal strategy: accept the offer if xk > αk,
reject the offer if xk < αk,
act either way otherwise.
To show: αk is a non-increasing function of k.
Equivalently, we show by induction that Jk(x)/(1+r)^{N-k} ≥ Jk+1(x)/(1+r)^{N-k-1} for all x.
Base case:
JN-1(x)/(1+r) = max{ x, E[JN(w)]/(1+r) } ≥ x = JN(x),
so the base case holds.
Induction step:
Jk(x)/(1+r)^{N-k} = max{ x, E[Jk+1(w)]/(1+r)^{N-k} }
Jk+1(x)/(1+r)^{N-k-1} = max{ x, E[Jk+2(w)]/(1+r)^{N-k-1} }
By the induction hypothesis, E[Jk+1(w)]/(1+r)^{N-k} ≥ E[Jk+2(w)]/(1+r)^{N-k-1}.
The result follows.
Iterative Computation of
threshold
Let Vk(xk) = Jk(xk)/(1+r)^{N-k}
VN(xN) = xN if xN ≠ T
       = 0 otherwise
Vk(xk) = max{ xk, (1+r)⁻¹ E[Vk+1(w)] } if xk ≠ T
Let αk = E[Vk+1(w)]/(1+r); then
Vk(xk) = max(xk, αk)
αk = E[Vk+1(w)]/(1+r)
   = E[max(w, αk+1)]/(1+r)
   = ( ∫₀^{αk+1} αk+1 dP(w) + ∫_{αk+1}^∞ w dP(w) ) / (1+r)
where P is the cumulative distribution function of w.
Note that both integrals are bounded, so the αk are bounded.
Since αk is non-increasing in k, as the number of remaining stages
grows the sequence converges to a limit ᾱ satisfying
ᾱ = ( ∫₀^{ᾱ} ᾱ dP(w) + ∫_{ᾱ}^∞ w dP(w) ) / (1+r)
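A minimal sketch of this threshold recursion; the quote distribution (uniform on {0, ..., 99}), the interest rate r = 0.05, and the horizon are made-up values. The printed sequence illustrates that αk is non-increasing in k and flattens toward ᾱ for small k.

r, N = 0.05, 20
quotes = range(100)
p = 1.0 / len(quotes)

alpha = [0.0] * N
alpha[N - 1] = sum(p * w for w in quotes) / (1 + r)   # E[V_N(w)]/(1+r) = E[w]/(1+r)
for k in range(N - 2, -1, -1):
    alpha[k] = sum(p * max(w, alpha[k + 1]) for w in quotes) / (1 + r)

print([round(a, 2) for a in alpha])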
General Stopping Problem
The decision maker can terminate the system in slot k at a
certain cost t(xk).
The terminal cost is t(xN).
JN(xN) = t(xN)
Jk(xk) = min{ t(xk), min_{u in U(xk)} E{ g(xk, uk, wk) + Jk+1(f(xk, uk, wk)) } }
It is optimal to stop at time k for states x in the set
Tk = { x : t(x) ≤ min_{u in U(x)} E{ g(x, u, w) + Jk+1(f(x, u, w)) } }
We show by induction that Jk(x) is non-decreasing in k.
It follows that T0 ⊂ T1 ⊂ ... ⊂ TN-1.
Now assume that TN-1 is an absorbing set, that is, if a state is
in this set and termination is not selected, then the next
state is also in this set.
Consider a state x in TN-1.
Note that JN-1(x) = t(x), and since TN-1 is absorbing,
f(x, u, w) is in TN-1, so JN-1(f(x, u, w)) = t(f(x, u, w)). Hence
min_{u in U(x)} E{ g(x, u, w) + JN-1(f(x, u, w)) }
  = min_{u in U(x)} E{ g(x, u, w) + t(f(x, u, w)) } ≥ t(x),
where the inequality holds because x is in TN-1.
Thus JN-2(x) = t(x), so x is in TN-2, and hence TN-1 ⊂ TN-2.
Similarly TN-1 ⊂ ... ⊂ T1 ⊂ T0.
Thus TN-1 = ... = T1 = T0.
The optimal decision is to stop once the state enters a
certain stopping set, and this set does not depend
on the time index.
Modified Asset selling problem
Suppose it is possible to hold on to previous offers.
TN-1 is then the set of states where the best quote so far is
above a certain value.
Once you enter this set you always remain in it,
so the optimal decision is to accept the
offer once it is above a certain threshold, where
the threshold does not depend on the time index.
Multiaccess Communication
A bunch of terminals share a wireless medium.
Only one user can successfully transmit a packet at a time.
A terminal attempts a packet with a probability which
is a function of the total queue length in the system.
Multiple simultaneous attempts cause interference; no attempt
causes poor utilization.
A single attempt clears a packet from the system.
The objective is to choose the attempt probability so as to
maximize the number of successful transmissions,
that is, reduce the queue length.
Let the cost g(x) be an increasing function of the queue length
Disturbances are arrivals
Let every packet be attempted with probability uk in slot k.
The success probability is the probability that exactly one
packet is attempted, which is xk uk (1 - uk)^{xk - 1}. Refer to it
as p(xk, uk).
Jk(xk) = gk(xk) + min_{u in [0,1]} Ew{ p(xk, uk) Jk+1(xk + wk - 1)
         + (1 - p(xk, uk)) Jk+1(xk + wk) }
       = gk(xk) + Ew{ Jk+1(xk + wk) } + min_{u in [0,1]} Ew{ p(xk, uk)
         (Jk+1(xk + wk - 1) - Jk+1(xk + wk)) }
Jk(x) is an increasing function of x for each k, since
gk(x) is an increasing function of x.
Thus Jk+1(xk + wk) ≥ Jk+1(xk + wk - 1), so
the minimum is attained when p(xk, uk) is maximized,
which happens when uk = 1/xk (see the check below).
Every terminal needs to know the entire queue length
which is not realistic
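A quick numerical check, using nothing beyond the formula itself, that p(x, u) = x u (1 - u)^{x-1} is maximized at u = 1/x; the queue lengths tried are arbitrary.

def p(x, u):
    return x * u * (1 - u) ** (x - 1)

for x in (1, 2, 5, 10):
    grid = [i / 1000 for i in range(1, 1001)]
    best = max(grid, key=lambda u: p(x, u))
    print(x, round(best, 3), round(1 / x, 3))   # grid argmax vs. 1/x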
Imperfect State Information
The system has access to imperfect information about the state:
the observation is now zk and not xk, where
zk = hk(xk, uk-1, vk), and vk is a random observation
disturbance which may depend on the entire history.
xk+1 = fk(xk, uk, wk)
Choose the controls such that the additive expected cost over
N time slots is minimized, that is, minimize
E{ gN(xN) + Σ_{k=0}^{N-1} gk(xk, uk, wk) }
Reformulation as a perfect state
problem
Let Ik be the vector of all previous observations and controls.
Consider Ik as the system state now:
Ik+1 = (Ik, uk, zk+1)
JN-1(IN-1) = min_{uN-1} E{ gN(fN-1(xN-1, uN-1, wN-1)) + gN-1(xN-1,
uN-1, wN-1) | IN-1, uN-1 }
Jk(Ik) = min_{uk} E{ gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik,
uk }
Sufficient Statistic
The method is complex because of state space explosion.
Can the entire information in Ik be carried in a
function of Ik which has lower dimensionality?
Such a function is called a sufficient statistic.
Assume that the observation disturbance depends on the
current state, previous control, and previous disturbance only.
Then P(xk | Ik) is a sufficient statistic.
Jk(Ik) = min_{uk} E{ gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk }
The expectation is a function of P(xk, wk, zk+1 | Ik, uk).
P(xk, wk, zk+1 | Ik, uk) is the product of P(zk+1 | Ik, uk, xk, wk),
P(wk | xk, uk), and P(xk | Ik).
Thus the cost J is a function of P(xk | Ik) explicitly, as
the first probability is P(zk+1 | uk, xk, wk) and the second
is P(wk | xk, uk).
P(xk+1 | Ik+1) can be computed recursively from P(xk | Ik)
using Bayes' rule. The system state is now the conditional
probability distribution P(xk | Ik).
Examples: Treasure searching
A site may contain a treasure.
If it contains the treasure, then a search yields the treasure
with probability β.
The treasure is worth V units, each search costs C units,
and the search has to terminate within N slots.
The state is the probability pk that the site contains the treasure,
given the previous controls and observations.
If we don't search in a slot, pk stays the same, so we would
never search in the future either.
Probability recursion
pk+1 = pk if the site is not searched at time k
     = 0 if the site is searched and the treasure is found
     = pk(1 - β) / (pk(1 - β) + 1 - pk) if the site is searched and
       the treasure is not found
Jk(pk) = max[ 0, -C + pk β V + (1 - pk β) Jk+1(pk+1) ]
JN(p) = 0
Search if and only if pk β V ≥ C.
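A minimal sketch of this recursion; V, C, β, and N below are illustrative values. With these numbers the search threshold is C/(βV) = 0.2, which the printed pair illustrates.

V, C, beta, N = 10.0, 1.0, 0.5, 8

def J(k, p):
    if k == N:
        return 0.0
    p_next = p * (1 - beta) / (p * (1 - beta) + 1 - p)   # failed search
    search = -C + p * beta * V + (1 - p * beta) * J(k + 1, p_next)
    return max(0.0, search)

print(J(0, 0.3), J(0, 0.1))   # positive above the threshold, 0 below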
General Form of the Recursion
P(xk+1 | Ik+1) = P(xk+1 | Ik, uk, zk+1)
  = P(xk+1, zk+1 | Ik, uk) / P(zk+1 | Ik, uk)
  = P(xk+1 | Ik, uk) P(zk+1 | Ik, uk, xk+1) /
    ∫ P(xk+1 | Ik, uk) P(zk+1 | Ik, uk, xk+1) dxk+1
Since xk+1 = fk(xk, uk, wk),
P(xk+1 | Ik, uk) is determined by
∫ P(xk | Ik) P(wk | uk, xk) dxk
P(zk+1 | Ik, uk, xk+1) can be expressed in terms of P(vk+1 | xk, uk,
wk), P(wk | xk, uk), and P(xk | Ik).
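For finite state and observation spaces this recursion becomes a small matrix computation: predict through the transition kernel, then reweight by the observation likelihood and normalize. A minimal sketch, with made-up kernels and a single fixed control assumed:

import numpy as np

P_trans = np.array([[0.9, 0.1],    # row x: P(x_{k+1} | x_k = x, u_k)
                    [0.2, 0.8]])
P_obs = np.array([[0.7, 0.3],      # row x': P(z_{k+1} | x_{k+1} = x')
                  [0.2, 0.8]])

def belief_update(b, z):
    pred = b @ P_trans             # P(x_{k+1} | I_k, u_k)
    post = pred * P_obs[:, z]      # numerator of Bayes' rule
    return post / post.sum()       # divide by P(z_{k+1} | I_k, u_k)

b = np.array([0.5, 0.5])           # P(x_0 | I_0)
for z in (0, 1, 1):
    b = belief_update(b, z)
print(b)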
Suboptimal Control
Certainty Equivalence Control
Given the information vector Ik, compute the state estimate
x̂k(Ik).
Choose the controls such that the additive cost over
N time slots is minimized, that is, minimize
gN(xN) + Σ_{k=0}^{N-1} gk(xk, uk, wk)
where the disturbances are fixed at their expected values,
subject to the initial condition x̂k(Ik).
Deterministic optimizations are easier to solve.
Further Simplification
Choose a heuristic to solve the optimization
approximately.
Find the cost-to-go function associated with the heuristic
for every control and state, Jk(xk, uk, E(wk)).
Find the control which minimizes gk(xk, uk, E(wk)) +
Jk+1(fk(xk, uk, E(wk)))
and apply it in the kth stage.
Partially stochastic certainty
equivalence control
Applies for imperfect state information
Solve the DP assuming perfect state information
At every stage assume that the state is the
expected value given the observation and the
controls, and choose the controls accordingly.
Applications
Multiaccess communication
Hidden Markov models
Open Loop Feedback Control
Similar to the certainty equivalence controller,
except that it uses the measurements to update
the probability distribution of the state as well.
OLFC performs at least as well as the optimal
open-loop policy, but CEC does not provide
such a guarantee.
Limited Lookahead Policy
Find the control which minimizes E[ gk(xk, uk, wk) +
Jk+1(fk(xk, uk, wk)) ]
and apply it in the kth stage,
where Jk+1 is an approximation of the cost-to-go
function.
One-stage lookahead policy
Two-stage lookahead policy:
approximate Jk+2,
then compute a two-stage DP with terminal cost Jk+2.
Performance Bound
Let
Fk(xk) = min_{u in U(xk)} E[ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) ],
and suppose Fk(xk) is upper bounded by Jk(xk) for every xk and k.
Then the cost-to-go of the one-step lookahead
policy from state xk in the kth stage is upper bounded
by Fk(xk).
How to approximate?
Problem approximation:
use the cost-to-go of a related but simpler problem.
Approximation architectures:
approximate the cost-to-go function by a
parametrized function, and tune the parameters.
Rollout policy:
approximate the cost-to-go by that of a suboptimal
strategy which is expected to be reasonably close.
Problem Approximation
Example: use the CEC cost.
Vehicle routing:
There is a graph with a reward associated with each node.
There are m vehicles which traverse through the graph.
The first vehicle traversing a node collects all its reward.
Each vehicle starts at a given node and returns to another node after
a maximum of a certain number of arcs.
Find a route for each vehicle which maximizes the total reward
The approximate cost-to-go is the reward-to-go of
the following suboptimal set of paths:
fix the order of the vehicles, then
obtain the path for each vehicle in that order, reducing the
rewards of already-traversed nodes to 0 at all times.
Rollout policy
Start with a suboptimal policy,
the base policy.
One-step lookahead always improves upon the
base policy.
Example: Quiz Problem
A person is given a list of N questions.
Question j is answered correctly with probability pj.
The person receives a reward vj if he answers the jth
question correctly.
The quiz terminates at the first incorrect answer.
The optimal ordering is to answer in decreasing
order of pj vj / (1 - pj), as checked below.
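A brute-force check of the index rule on made-up data (three questions): exhaustive search over answer orders against sorting by pj vj / (1 - pj).

import itertools

p = [0.9, 0.5, 0.3]
v = [1.0, 4.0, 10.0]

def expected_reward(order):
    total, alive = 0.0, 1.0        # alive = P(still in the quiz)
    for j in order:
        total += alive * p[j] * v[j]
        alive *= p[j]
    return total

best = max(itertools.permutations(range(3)), key=expected_reward)
index_rule = tuple(sorted(range(3), key=lambda j: -p[j] * v[j] / (1 - p[j])))
print(best, index_rule)            # the two orders agree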
Variants where this solution can
be used as a base
A limit on the maximum number of questions
which can be answered.
A time window for each question within which
it can be answered.
Precedence constraints.
Infinite Horizon Problem
Problem Description
The objective is to minimize the total cost over an
infinite horizon,
lim_{N→∞} E Σ_{k=0}^{N-1} g(xk, uk, wk).
This limit need not exist!
Thus the objective is often to minimize a discounted cost
function instead,
J(x) = lim_{N→∞} E Σ_{k=0}^{N-1} α^k g(xk, uk, wk)
where x0 = x and the discount factor α is in (0, 1).
Classifications
Stochastic shortest path problems
Here the discount factor can be taken as 1.
There is a termination state such that the system stays in the
termination state once it reaches there.
The system reaches the termination state with probability 1.
The horizon is in effect finite, but its length is random.
Discounted problems with bounded cost per stage
Here the discount factor α is less than 1 and the
absolute cost per stage is upper bounded,
so lim_{N→∞} E Σ_{k=0}^{N-1} α^k g(xk, uk, wk) exists.
Discounted problems with unbounded cost per stage
The analysis is more complicated.
Average Cost Problem
Minimize lim_{N→∞} (1/N) E Σ_{k=0}^{N-1} g(xk, uk, wk).
The limit exists under certain special conditions.
lim_{α→1} (1 - α) Jα(x) is the average cost of the
optimal strategy in many cases.
Bellman's Equations
The optimal costs J(x) satisfy Bellman's equations:
J(x) = min_{u in U(x)} E{ g(x, u, w) + α J(f(x, u, w)) }
Given any initial condition J0(x), the iteration
Jk+1(x) = min_{u in U(x)} E{ g(x, u, w) + α Jk(f(x, u, w)) }
converges to the optimal discounted cost J(x)
(value iteration).
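A minimal tabular value-iteration sketch; the two-state, two-action MDP below (transition kernels P, stage costs g, discount α = 0.9) is entirely made up for illustration.

import numpy as np

alpha = 0.9
P = [np.array([[0.8, 0.2], [0.3, 0.7]]),   # P(x' | x, u) for u = 0, 1
     np.array([[0.5, 0.5], [0.1, 0.9]])]
g = [np.array([1.0, 3.0]),                 # g(x, u) for u = 0, 1
     np.array([2.0, 1.0])]

J = np.zeros(2)                            # any initial condition works
for _ in range(500):
    J = np.min([g[u] + alpha * P[u] @ J for u in (0, 1)], axis=0)
print(J)                                   # fixed point of Bellman's equations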
Cost of a Stationary
Policy
A policy is said to be stationary if it does not depend on the time
index, that is, the control action in any slot j is the same as that
in any other slot k if the state in both is the same.
The discounted cost of a stationary policy μ can be
found by solving the following equations:
Jμ(x) = E{ g(x, μ(x), w) + α Jμ(f(x, μ(x), w)) }
The solution can be obtained from the DP iteration, starting from
any initial condition:
Jk+1(x) = E{ g(x, μ(x), w) + α Jk(f(x, μ(x), w)) }
A stationary policy is optimal if and only if, for every state x,
it attains the minimum on the right side of
Bellman's equation.
There always exists an optimal stationary policy for
bounded cost per stage and discount factor less than 1.
Similar results hold for stochastic shortest path
problems with discount factor 1.
Stochastic Shortest Path
Example: battery management problem
Computational Strategies for
Solving Bellman's Equations
Value iteration
Requires an infinite number of iterations in general
Policy iteration
Terminates in a finite number of iterations
Policy Iteration
Start from a stationary policy.
Generate a sequence of new policies.
Let the policy in the kth iteration be μk.
Compute its cost by solving the following linear equations:
Jμk(x) = E{ g(x, μk(x), w) + α Jμk(f(x, μk(x), w)) }
The new policy μk+1 is obtained from the solution Jμk
as follows:
μk+1(x) = arg min_{u in U(x)} E{ g(x, u, w) + α Jμk(f(x, u, w)) }
The iteration stops when the new policy is the
same as the previous policy.
Policy iteration terminates at an optimal
policy in a finite number of iterations, and the
costs of the successive policies are non-increasing.
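A minimal tabular policy-iteration sketch on the same kind of made-up two-state MDP as before: policy evaluation solves the linear system (I - α P_μ) J = g_μ, and policy improvement takes the greedy minimizer.

import numpy as np

alpha, n = 0.9, 2
P = [np.array([[0.8, 0.2], [0.3, 0.7]]),
     np.array([[0.5, 0.5], [0.1, 0.9]])]
g = [np.array([1.0, 3.0]), np.array([2.0, 1.0])]

mu = np.zeros(n, dtype=int)                    # any stationary policy
while True:
    P_mu = np.array([P[mu[x]][x] for x in range(n)])
    g_mu = np.array([g[mu[x]][x] for x in range(n)])
    J = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)   # evaluation
    Q = np.array([g[u] + alpha * P[u] @ J for u in (0, 1)])
    new_mu = Q.argmin(axis=0)                  # improvement
    if np.array_equal(new_mu, mu):
        break
    mu = new_mu
print(mu, J)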
Continuous time MDP
Time is no longer slotted;
state transitions can occur at any time.
Markov property: the system probabilistically restarts at the
instant of every transition.
Fresh control decisions are taken at the instants of
transitions.
Discretize the system by looking at the transition epochs
only (these act like slot boundaries).
Continuous time MDP
formulation of inventory system
Unit demands arrive as a Poisson process of rate λ.
Unit orders arrive (are delivered) as a Poisson process of rate μ.
The transition epochs are the demand epochs and the inventory
arrival epochs.
Assume that any previous order and the demand arrival process
restart at a transition epoch.
The state is the inventory level and whether or not an order was
placed at the previous transition.
Penalties are charged at the transition epochs:
demands which cannot be fulfilled incur penalties,
orders are charged at delivery.
x: amount of inventory
y: indicator of whether or not fresh inventory was ordered
The next transition is a demand with probability λ/(λ+μ) and an
order arrival with probability μ/(λ+μ), so the cost recursion has the form
J(x, y) = g1(x) + g2(y) + (λ/(λ+μ)) J(x-1, ·) + (μ/(λ+μ)) J(x+y, ·)
g1(x) = 0 if x is positive
      = c otherwise
g2(y) = 0 if y = 0
      = p otherwise
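A minimal simulation sketch of the embedded chain: at each transition epoch the next event is a demand with probability λ/(λ+μ) and an order delivery with probability μ/(λ+μ). The rates and the threshold reorder rule below are illustrative assumptions.

import random

lam, mu = 2.0, 1.0            # demand rate, order-delivery rate (made up)
x, y = 3, 0                   # inventory level, outstanding-order indicator
penalty = 0.0
random.seed(0)
for _ in range(10_000):       # embedded chain at transition epochs
    if random.random() < lam / (lam + mu):
        # demand epoch: serve if stock is available, else pay a penalty
        if x > 0:
            x -= 1
        else:
            penalty += 1.0
    elif y:
        # order-delivery epoch: the outstanding unit arrives
        x, y = x + 1, 0
    if x <= 1 and not y:      # simple threshold reorder rule (assumption)
        y = 1
print(x, penalty)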