Markov Decision Processes

Definitions; stationary policies; value iteration, policy improvement, and linear programming for the discounted cost and average cost criteria.

Markov Decision Process

Let X = {X_0, X_1, ...} be a system description process on state space E and let D = {D_0, D_1, ...} be a decision process with action space A. The process (X, D) is a Markov decision process if, for j ∈ E and n = 0, 1, ...,

    P\{X_{n+1} = j \mid X_n, D_n, \ldots, X_0, D_0\} = P\{X_{n+1} = j \mid X_n, D_n\}

Furthermore, for each k ∈ A, let f_k be a cost vector and P_k be a one-step transition probability matrix. The cost f_k(i) is incurred whenever X_n = i and D_n = k, and

    P\{X_{n+1} = j \mid X_n = i, D_n = k\} = P_k(i, j)

The problem is to determine how to choose a sequence of actions in order to minimize cost.

Policies

A policy is a rule that specifies which action to take at each point in time. Let 𝒟 denote the set of all policies. In general, the decisions specified by a policy may
– depend on the current state of the system description process
– be randomized (depend on some external random event)
– also depend on past states and/or decisions
A stationary policy is defined by a (deterministic) action function that assigns an action to each state, independent of previous states, previous actions, and the time n. Under a stationary policy, the MDP is a Markov chain.

Cost Minimization Criteria

Since an MDP goes on indefinitely, the total cost is likely to be infinite. In order to compare policies meaningfully, two criteria are commonly used:
1. The expected total discounted cost computes the present worth of future costs using a discount factor α < 1, such that one dollar obtained at time n = 1 has a present value of α at time n = 0. Typically, if r is the rate of return, then α = 1/(1 + r). The expected total discounted cost is

    E\left[ \sum_{n=0}^{\infty} \alpha^n f_{D_n}(X_n) \right]

2. The long-run average cost is

    \lim_{m \to \infty} \frac{1}{m} \sum_{n=0}^{m-1} f_{D_n}(X_n)

Optimization with Stationary Policies

If the state space E is finite, there exists a stationary policy that solves the problem to minimize the discounted cost:

    v^\alpha(i) = \min_{d \in \mathcal{D}} v_d^\alpha(i), \quad \text{where } v_d^\alpha(i) = E_d\left[ \sum_{n=0}^{\infty} \alpha^n f_{D_n}(X_n) \mid X_0 = i \right]

If every stationary policy results in an irreducible Markov chain, there exists a stationary policy that solves the problem to minimize the average cost:

    \varphi^* = \min_{d \in \mathcal{D}} \varphi_d, \quad \text{where } \varphi_d = \lim_{m \to \infty} \frac{1}{m} \sum_{n=0}^{m-1} f_{D_n}(X_n)

Computing Expected Discounted Costs

Let X = {X_0, X_1, ...} be a Markov chain with one-step transition probability matrix P, let f be a cost function that assigns a cost to each state of the Markov chain, and let α (0 < α < 1) be a discount factor. Then the expected total discounted cost is

    g(i) = E\left[ \sum_{n=0}^{\infty} \alpha^n f(X_n) \mid X_0 = i \right] = \left[ (I - \alpha P)^{-1} f \right](i)

Why? Starting from state i, the expected discounted cost can be found recursively as

    g(i) = f(i) + \alpha \sum_j P_{ij}\, g(j), \quad \text{or} \quad g = f + \alpha P g

Note that the expected discounted cost always depends on the initial state, while for the average cost criterion the initial state is unimportant.
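For concreteness, the following sketch computes g = (I − αP)^{-1} f for a hypothetical three-state chain and checks it against the recursion g = f + αPg; the transition matrix, costs, and discount factor below are invented for illustration.

```python
import numpy as np

# Hypothetical 3-state Markov chain: transition matrix P, per-state cost f,
# and discount factor alpha (all values invented for illustration).
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.3, 0.7]])
f = np.array([4.0, 1.0, 2.0])
alpha = 0.9

# Expected total discounted cost: g = (I - alpha*P)^{-1} f
g = np.linalg.solve(np.eye(3) - alpha * P, f)

# Check the recursion g = f + alpha * P g
assert np.allclose(g, f + alpha * P @ g)
print(g)   # g(i) depends on the starting state i
```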
Solution Procedures for Discounted Costs

Let v^α be the (vector) optimal value function whose ith component is v^α(i) = min_{d ∈ 𝒟} v_d^α(i). For each i ∈ E,

    v^\alpha(i) = \min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i,j)\, v^\alpha(j) \right\}

These equations uniquely determine v^α. If we can somehow obtain the values v^α that satisfy the above equations, then the optimal policy is the vector a, where

    a(i) = \arg\min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i,j)\, v^\alpha(j) \right\}

("arg min" is "the argument that minimizes.")

Value Iteration for Discounted Costs

Make a guess, then keep applying the optimal value equations until the fixed point is reached.
Step 1. Choose ε > 0, set n = 0, and let v_0(i) = 0 for each i ∈ E.
Step 2. For each i ∈ E, find v_{n+1}(i) as

    v_{n+1}(i) = \min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i,j)\, v_n(j) \right\}

Step 3. Let

    \delta = \max_{i \in E} \left\{ v_{n+1}(i) - v_n(i) \right\}

Step 4. If δ < ε, stop with v^α = v_{n+1}. Otherwise, set n = n + 1 and return to Step 2.

Policy Improvement for Discounted Costs

Start myopic, then consider longer-term consequences.
Step 1. Set n = 0 and let a_0(i) = arg min_{k ∈ A} f_k(i).
Step 2. Adopt the cost vector and transition matrix of the current policy: f(i) = f_{a_n(i)}(i), P(i,j) = P_{a_n(i)}(i,j).
Step 3. Find the value function v = (I − αP)^{-1} f.
Step 4. Re-optimize:

    a_{n+1}(i) = \arg\min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i,j)\, v(j) \right\}

Step 5. If a_{n+1}(i) = a_n(i) for every i ∈ E, stop with v^α = v and a^α = a_n. Otherwise, set n = n + 1 and return to Step 2.

Linear Programming for Discounted Costs

Consider the linear program:

    \max \sum_{i \in E} u(i)
    \text{s.t. } u(i) \le f_k(i) + \alpha \sum_{j \in E} P_k(i,j)\, u(j) \quad \text{for each } i, k

The optimal value of u(i) will be v^α(i), and the optimal policy is identified by the constraints that hold as equalities in the optimal solution (slack variables equal 0). Note: the decision variables are unrestricted in sign!
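For concreteness, here is a minimal sketch of the value iteration procedure above for a hypothetical two-state, two-action problem; the cost vectors, transition matrices, α, and ε are invented for illustration, and the stopping test uses the absolute difference between successive iterates.

```python
import numpy as np

# Hypothetical two-state, two-action problem: one cost vector f_k and one
# transition matrix P_k per action k (all numbers invented for illustration).
f = {0: np.array([2.0, 0.5]),
     1: np.array([1.0, 3.0])}
P = {0: np.array([[0.8, 0.2], [0.4, 0.6]]),
     1: np.array([[0.3, 0.7], [0.9, 0.1]])}
alpha, eps = 0.9, 1e-8
actions = sorted(f)

v = np.zeros(2)                               # Step 1: v_0(i) = 0 for each i
while True:
    # Step 2: one application of the optimal value equations
    v_new = np.min([f[k] + alpha * P[k] @ v for k in actions], axis=0)
    # Steps 3-4: stop when successive iterates agree to within eps
    if np.max(np.abs(v_new - v)) < eps:
        v = v_new
        break
    v = v_new

# Recover a policy from the (approximately) optimal value function
policy = np.argmin([f[k] + alpha * P[k] @ v for k in actions], axis=0)
print(v, policy)
```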
Long-Run Average Cost per Period

For a given policy d, its long-run average cost can be found from its cost vector f_d and one-step transition probability matrix P_d. First, find the limiting probabilities by solving

    \pi_j = \sum_{i \in E} \pi_i P_d(i,j), \quad j \in E; \qquad \sum_{j \in E} \pi_j = 1

Then

    \varphi_d = \lim_{m \to \infty} \frac{1}{m} \sum_{n=0}^{m-1} f_d(X_n) = \sum_{j \in E} f_d(j)\, \pi_j

So, in principle, we could simply enumerate all policies and choose the one with the smallest average cost... but this is not practical if A and E are large.

Recursive Equation for Average Cost

Assume that every stationary policy yields an irreducible Markov chain. There exists a scalar φ* and a vector h such that for all states i in E,

    \varphi^* + h(i) = \min_{k \in A} \left\{ f_k(i) + \sum_{j \in E} P_k(i,j)\, h(j) \right\}

The scalar φ* is the optimal average cost, and the optimal policy is found by choosing for each state the action that achieves the minimum on the right-hand side. The vector h is unique up to an additive constant; as we will see, the difference h(i) − h(j) represents the increase in total cost from starting out in state i rather than state j.

Relationships between Discounted Cost and Long-Run Average Cost

• If a cost of c is incurred each period and α is the discount factor, then the total discounted cost is

    v = \sum_{n=0}^{\infty} c\,\alpha^n = \frac{c}{1 - \alpha}

• Therefore, a total discounted cost v is equivalent to an average cost of c = (1 − α)v per period, so

    \lim_{\alpha \to 1} (1 - \alpha)\, v^\alpha(i) = \varphi^*

• Let v^α be the optimal discounted cost vector, φ* the optimal average cost, and h the mystery vector from the previous slide. Then

    \lim_{\alpha \to 1} \left[ v^\alpha(i) - v^\alpha(j) \right] = h(i) - h(j)

Policy Improvement for Average Costs

Designate one state in E to be "state number 1."
Step 1. Set n = 0 and let a_0(i) = arg min_{k ∈ A} f_k(i).
Step 2. Adopt the cost vector and transition matrix of the current policy: f(i) = f_{a_n(i)}(i), P(i,j) = P_{a_n(i)}(i,j).
Step 3. With h(1) = 0, solve φ + h = f + Ph for the scalar φ and the vector h.
Step 4. Re-optimize:

    a_{n+1}(i) = \arg\min_{k \in A} \left\{ f_k(i) + \sum_{j \in E} P_k(i,j)\, h(j) \right\}

Step 5. If a_{n+1}(i) = a_n(i) for every i ∈ E, stop with φ* = φ and a*(i) = a_n(i). Otherwise, set n = n + 1 and return to Step 2.

Linear Programming for Average Costs

Consider randomized policies: let w_i(k) = P{D_n = k | X_n = i}. A stationary policy has w_i(k) = 1 for k = a(i) and 0 otherwise. The decision variables are x(i,k) = w_i(k) π(i). The objective is to minimize the expected value of the average cost (expectation taken over the randomized policy):

    \min \varphi = \sum_{i \in E} \sum_{k \in A} x(i,k)\, f_k(i)
    \text{s.t. } \sum_{k \in A} x(j,k) = \sum_{i \in E} \sum_{k \in A} x(i,k)\, P_k(i,j) \quad \text{for each } j \in E
    \sum_{i \in E} \sum_{k \in A} x(i,k) = 1

Note that one constraint will be redundant and may be dropped.
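To make the formulation concrete, the sketch below builds and solves this LP with scipy.optimize.linprog for a hypothetical two-state, two-action problem (all numbers invented). One balance constraint is dropped as redundant, and a stationary optimal policy is read off from the entries x(i, k) that are positive in the optimal solution.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical two-state, two-action problem (all numbers invented).
f = {0: np.array([2.0, 0.5]), 1: np.array([1.0, 3.0])}
P = {0: np.array([[0.8, 0.2], [0.4, 0.6]]),
     1: np.array([[0.3, 0.7], [0.9, 0.1]])}
states, actions = [0, 1], [0, 1]
pairs = [(i, k) for i in states for k in actions]    # index order of x(i, k)

# Objective: minimize sum_{i,k} x(i,k) f_k(i)
c = np.array([f[k][i] for (i, k) in pairs])

# Balance constraints sum_k x(j,k) = sum_{i,k} x(i,k) P_k(i,j); one row is
# redundant and is dropped.  The final row is the normalization sum x = 1.
rows = []
for j in states[1:]:
    rows.append([(1.0 if i == j else 0.0) - P[k][i, j] for (i, k) in pairs])
rows.append([1.0] * len(pairs))
A_eq = np.array(rows)
b_eq = np.r_[np.zeros(len(states) - 1), 1.0]

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
phi = res.fun                                        # optimal average cost
# In each state i, an optimal stationary policy uses an action k with x(i,k) > 0.
policy = {i: max(actions, key=lambda k: res.x[pairs.index((i, k))]) for i in states}
print(phi, policy)
```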