1 Introduction to Markov Decision Processes (MDP)

1. Decision Making Problems
• Multi-stage decision problems with a single decision maker
  Competitive MDPs: more than one decision maker
• Open-loop vs. closed-loop problems
  Open-loop: plan up front, then act (plan-act)
  Closed-loop: observe-act-observe; the policy depends on the observations
• Short-term vs. long-term decisions
  Myopic control / greedy strategies act only on the short term
  An MDP balances short-term and long-term objectives

2. What is an MDP?
• A process is observed, perfectly or imperfectly (imperfect observation calls for Bayesian analysis), at decision epochs (typically discrete) over a horizon (finite or infinite)
• Actions are taken after each observation, e.g., order more inventory, continue an investment, etc.
• Rewards are received, based on the state, the time of the action, and the action itself
• A probability distribution governs the state transitions, e.g., the probability of having j units at time t+1, given that we had i units at time t and ordered k units
• The state succinctly summarizes the impact of past decisions on subsequent decisions

3. Features of an MDP
• Key ingredients of a sequential decision making model:
  – A set of decision epochs
  – A set of system states
  – A set of available actions
  – A set of state- and action-dependent immediate rewards or costs
  – A set of state- and action-dependent transition probabilities
• Apart from mild separability assumptions, the dynamic programming framework is very general.
  Separable objective functions: f(x_1, x_2, x_3) = f_1(x_1) + f_2(x_2) + f_3(x_3), or f(x_1, x_2, x_3) = f_1(x_1) + f_2(x_1, x_2) + f_3(x_1, x_2, x_3)
• No assumptions on the cost functions (linear, nonlinear, etc.)
• In summary, every separable nonlinear program can be formulated as a dynamic program (DP)
• Limitation of Markov decision processes: the curse of dimensionality, i.e., exponential growth of the state space
• What questions can we answer?
  – When does an optimal policy exist?
  – When does it have a particular form or structure?
  – How can we efficiently compute the optimal policy? Value iteration, policy iteration, or linear programming

4. Markov Decision Process Applications
• Inventory management:
  (1) Determining optimal reorder points and reorder levels.
      Decision epoch: weekly review
      State: product inventory level at the time of review
      Action: amount of stock to order
      Transition probability: depends on how much is ordered and on the random demand for that week
      A decision rule specifies the quantity to be ordered as a function of the stock on hand at the time of review; a policy consists of a sequence of such restocking functions.
  (2) Determining reorder points for a complex product network.
• Maintenance and replacement problems:
  (1) Bus engine maintenance and replacement: the decision maker periodically inspects the condition of the equipment and, based on its age and condition, decides on the extent of maintenance or replacement. Costs are associated with maintenance and with operating the machine in its current condition; the objective is to balance these two (maintenance and operating costs) so as to minimize a measure of long-term operating cost.
  (2) Highway pavement maintenance: minimize long-run average cost subject to road-quality requirements.
• Communication models: a wide range of computer, manufacturing, and communication systems can be modeled as networks of interrelated queues and servers; the decisions control the channels
• Behavioral ecology:
  (1) Animal behavior — bird nestling problem: the state is the health of the female and of the brood; the objective is to maximize a weighted average of the probability of nestling survival and the probability of the female's survival to the next breeding season; the model finds the optimal behavioral strategy (stay at the nest to protect the young, hunt to supplement the food supply provided by the male, or desert the nest)
  (2) Gambling model — find the optimal digit locations to maximize the probability of winning

2 Markov Decision Process (MDP) Model Formulation

A decision maker's goal is to choose a sequence of actions which causes the system to perform optimally with respect to some predetermined criterion. An MDP has five elements: decision epochs, states, actions, transition probabilities, and rewards (a small code sketch of these components is given at the end of this section).
• Decision epochs T
• States S
• Actions A_s
• Transition probabilities p_t(j|s, a)
• Rewards/costs r_t(s, a)

1. Decision epochs: let T be the set of decision epochs.
   Discrete: finite, T = {0, 1, 2, ..., N}, or infinite, T = {0, 1, 2, ...}
   Continuous: T = [0, N] or T = [0, ∞)
   Our focus is on discrete time, primarily the infinite horizon.
2. States and actions
• We observe the system in some state at each decision epoch
• Set of all possible states: S
• In state s ∈ S, the decision maker selects an action a ∈ A_s, where A_s is the set of feasible actions in state s
• The set of all actions is A = ∪_{s∈S} A_s
• We primarily deal with discrete sets of states and actions (finite or countably infinite)
• Actions are chosen randomly or deterministically according to a selected "policy"
3. Rewards and transition probabilities
   At time t ∈ T the system is in state s ∈ S and the decision maker selects an action a ∈ A_s; then
• a reward r_t(s, a) ∈ R is received, which could be a profit or a cost;
• the system moves to state j ∈ S with conditional probability p_t(j|s, a).
4. An MDP is defined by {T, S, A_s, p_t(·|s, a), r_t(s, a)}; finite horizon vs. infinite horizon.
5. Decision Rules and Policies
   Decision rules: functions mapping the state space (or the history) to actions.
• Deterministic Markovian decision rules: d_t(·) : S → A_s, i.e., for j ∈ S, d_t(j) ∈ A_j
• History-dependent decision rules:
  History of the process: h_t = {s_1, a_1, s_2, a_2, ..., s_{t−1}, a_{t−1}, s_t}, with h_t ∈ H_t
  Given the history of the process, the mapping is d_t : H_t → A_{s_t}
• Randomized decision rules: the action is not selected with certainty; instead a distribution q_{d_t}(·) is used to select an action
• HR — set of history-dependent randomized decision rules
• HD — set of history-dependent deterministic decision rules
• MR — set of Markovian randomized decision rules
• MD — set of Markovian deterministic decision rules
Policies (also called contingency plans or strategies):
• A policy Π is a sequence of decision rules, Π = (d_1, d_2, ...), with d_t ∈ D^K for t ∈ T, where K ∈ {HR, HD, MR, MD} denotes the class of decision rules
• A policy is stationary if d_t = d for all t ∈ T: Π = (d, d, ...), written Π = d^∞
• Π^SD = set of stationary deterministic policies
• Π^SR = set of stationary randomized policies
• For finite horizon problems, stationary policies are generally not optimal; for infinite horizon problems (with stationary data), stationary policies are optimal.
Machine maintenance example; inventory control example.
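To make the five ingredients concrete, here is a minimal Python sketch (not from the notes) of a finite, stationary MDP together with a stationary deterministic decision rule; the class name FiniteMDP, the state/action labels, and all numbers are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

# A minimal finite-MDP container for the stationary (time-homogeneous) case:
# states S, feasible actions A_s, rewards r(s, a), and probabilities p(j | s, a).
@dataclass
class FiniteMDP:
    states: List[str]                              # S
    actions: Dict[str, List[str]]                  # A_s for each s in S
    reward: Dict[tuple, float]                     # r[(s, a)]
    transition: Dict[tuple, Dict[str, float]]      # p[(s, a)][j]

# Illustrative two-state example (numbers are made up):
mdp = FiniteMDP(
    states=["s1", "s2"],
    actions={"s1": ["a1", "a2"], "s2": ["a1"]},
    reward={("s1", "a1"): 5.0, ("s1", "a2"): 10.0, ("s2", "a1"): -1.0},
    transition={
        ("s1", "a1"): {"s1": 0.5, "s2": 0.5},
        ("s1", "a2"): {"s1": 0.0, "s2": 1.0},
        ("s2", "a1"): {"s1": 0.0, "s2": 1.0},
    },
)

# A stationary deterministic decision rule d: S -> A_s; repeating it at every
# epoch gives the stationary policy d^infinity.
def d(state: str) -> str:
    return {"s1": "a1", "s2": "a1"}[state]
```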
6. One-period MDP problem
• Why is it important? A multi-period problem decomposes into single-period problems.
• N = 2, T = {1, 2}, S = {1, 2, ..., n}; the goal is to maximize the sum of the immediate reward and the expected terminal reward.
• Value of a policy Π (the decision rule is denoted d(s), which can be used interchangeably with a_s):
  v(s) = r_1(s, d_1(s)) + Σ_{j=1}^{n} p_1(j|s, d_1(s)) · v(j)
• What is the best action to select in state s?
  max_{a∈A_s} { r_1(s, a) + Σ_{j=1}^{n} p_1(j|s, a) · v(j) }
  Then
  d*_1(s) = argmax_{a∈A_s} { r_1(s, a) + Σ_{j=1}^{n} p_1(j|s, a) · v(j) }
7. Example: equipment replacement
• The current machine is one year old.
• We plan over 3 years.
• Expenses by machine age:
      Age   Maintenance   Salvage
       0        $0           –
       1        $1           $8
       2        $2           $7
       3        $3           $6
       4        –            $5
• A new machine costs $10.
• Assume the machine is salvaged at the end of year 3.

3 MDP Applications and Examples

1. Two-state MDP
• Assumption: stationary rewards and stationary transition probabilities, i.e., the rewards and transition probabilities do not change with time; there are two states, s_1 and s_2.
• MDP formulation: the 5 modeling components (decision epochs, states, actions, rewards, and transition probabilities).
2. Single-product inventory control
• Assumptions:
  – No delivery lead time (instantaneous delivery)
  – All orders/demands are filled at the end of the month
  – No backlogging (no negative inventory allowed)
  – Stationary demand (no seasonality)
  – Warehouse capacity: M units
• Question to be answered: for a given stock level at month t, how much do I order?
• MDP formulation: the 5 modeling components (decision epochs, states, actions, rewards, and transition probabilities).
• State transition: S_{t+1} = [S_t + a_t − D_t]^+
• Cost of placing an order for u units: o(u)
• Inventory holding cost: h(u)
• Stochastic demand (stationary, does not change with time): P(D_t = j) = p_j
• Revenue from selling j units of product: f(j)
• Expected profit for one period:
  – First, conditional on the next state,
    r_t(S_t, a_t, S_{t+1}) = f(S_t + a_t − S_{t+1}) − o(a_t) − h(S_t + a_t)
  – Then take the expectation over S_{t+1}:
    r_t(S_t, a_t) = Σ_j [ f(S_t + a_t − max(0, S_t + a_t − j)) − o(a_t) − h(S_t + a_t) ] p_j
                  = −o(a_t) − h(S_t + a_t) + Σ_j f(S_t + a_t − max(0, S_t + a_t − j)) p_j
    If S_t + a_t > j, then f(S_t + a_t − max(0, S_t + a_t − j)) = f(j);
    if S_t + a_t ≤ j, then f(S_t + a_t − max(0, S_t + a_t − j)) = f(S_t + a_t).
  – Finite horizon terminal reward: r_N(s) = g(s)
• Transition probabilities (writing q_{s+a} = Σ_{k ≥ s+a} p_k for the probability that demand is at least s + a; see the code sketch below):
  p_t(j|s, a) = 0            if M ≥ j > s + a,
  p_t(j|s, a) = p_{s+a−j}    if M ≥ s + a ≥ j > 0,
  p_t(j|s, a) = q_{s+a}      if M ≥ s + a and j = 0.
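As a concrete illustration of the inventory transition structure above, the following Python sketch builds the row p(·|s, a) from a demand distribution. It assumes a finite demand support and that orders never exceed the capacity M; the function name transition_row and the example pmf are illustrative, not from the notes.

```python
import numpy as np

def transition_row(s: int, a: int, demand_pmf: np.ndarray, M: int) -> np.ndarray:
    """Return the row p(. | s, a) for the single-product inventory model.

    The next state is j = max(0, s + a - D) with demand D ~ demand_pmf,
    assuming the order keeps the stock within the capacity M.
    """
    u = s + a                                  # stock on hand after ordering
    assert 0 <= u <= M, "order would exceed the warehouse capacity"
    row = np.zeros(M + 1)
    # j > u is unreachable (demand cannot increase stock): probability 0.
    # 0 < j <= u happens exactly when the demand equals u - j.
    for j in range(1, u + 1):
        demand = u - j
        if demand < len(demand_pmf):
            row[j] = demand_pmf[demand]
    # j = 0 happens when the demand is at least u: q_u = sum_{k >= u} p_k.
    row[0] = demand_pmf[u:].sum()
    return row

# Example: capacity M = 3 and demand 0, 1, 2, 3 with equal probability.
pmf = np.array([0.25, 0.25, 0.25, 0.25])
print(transition_row(s=1, a=1, demand_pmf=pmf, M=3))   # p(. | s=1, a=1)
```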
3. Shortest route and critical path models; sequential allocation models
4. Optimal stopping problems

Practice problem 1: Suppose we have a machine that is either running or broken down. If it runs throughout one week, it makes a gross profit of $100. If it fails during the week, the gross profit is zero. If it is running at the start of the week and we perform preventive maintenance, the probability of failure is 0.7. However, maintenance costs $20. When the machine is broken down at the start of the week, it may either be repaired at a cost of $40, in which case it will fail during the week with probability 0.4, or it may be replaced at a cost of $150 by a new machine that is guaranteed to run through its first week of operation. Find the optimal repair, replacement, and maintenance policy that maximizes total profit over three weeks, assuming a new machine at the start of the first week.

Practice problem 2: At the start of each week, a worker receives a wage offer of w units per week. He may either accept the offer and work at that wage for the entire week, or instead seek alternative employment. If he decides to work in the current week, then at the start of the next week he will have the same wage offer available with probability p; with probability 1 − p he will be unemployed and unable to seek employment during that week. If he seeks alternative employment, he receives no income in the current week and obtains a wage offer of w' for the subsequent week according to a transition probability p_t(w'|w). Assume his utility when receiving wage w is Φ_t(w).

4 Finite Horizon MDP

1. Introduction
• The solution of a finite-horizon MDP depends on the "optimality equations".
• The solution is found by analyzing a sequence of smaller, inductively defined problems.
• Principle of optimality: "An optimal policy has the property that, whatever the initial state and decision are, the remaining decisions constitute an optimal policy with regard to the state resulting from the first decision."
  Shortest path problem example: solved using backward induction.
  Optimality equations:
  U_t(s) = max_{a∈A_s} { r_t(s, a) + Σ_{j∈S} p_t(j|s, a) U_{t+1}(j) }  for t = 1, 2, ..., N − 1,
  U_N(s) = r_N(s)  for all s.
  An infinite horizon problem cannot be solved by working backwards, since there is no terminal reward.
2. Optimality criteria: how do we select the best policy Π* = (d*_1, d*_2, ..., d*_N)?
• A fixed policy induces a stochastic process over the states visited: for a given d(·), a fixed policy generates a Markov chain over the states.
• Let x_t and y_t be the random variables describing the state and the action selected at time t. Then {(x_t, y_t)}_{t=1}^{N} is a stochastic process for a given policy Π = (d_1, d_2, ..., d_N).
  Sample path: (s, d_1(s)) → (s', d_2(s')) → · · ·, i.e., (X, Y) → (X', Y') → · · ·
  Moreover, {r_1(x_1, y_1), r_2(x_2, y_2), ..., r_{N−1}(x_{N−1}, y_{N−1}), r_N(x_N)} is also a stochastic process.
• Define R_t = r_t(x_t, y_t); a given policy Π induces a probability distribution over reward streams, p^Π(R_1, R_2, ..., R_N).
  It is typically difficult to define an order over such random reward streams, so we use the expected value as one measurement. We say that policy Π* is preferred to policy Π^1 if
  E^{Π*}[f(R_1, R_2, ..., R_N)] ≥ E^{Π^1}[f(R_1, R_2, ..., R_N)].
  Other types of measurements: mean–variance trade-offs, risk–reward measures, time-sensitive measures (time value of money). In summary, more complex measures are available.
• Expected total reward criterion: for Π ∈ Π^HD (history-dependent deterministic), the value of policy Π is
  V_N^Π(s) = E_s^Π { Σ_{t=1}^{N−1} r_t(x_t, d_t(h_t)) + r_N(x_N) },
  the value of an N-period problem starting in state s at time 1 under policy Π.
  Assume |r_t(s, a)| ≤ M for all (s, a) ∈ S × A (bounded rewards); then V_N^Π(s) exists.
• We seek an optimal policy Π* ∈ Π^HR (the most general class) such that V_N^{Π*}(s) ≥ V_N^Π(s) for all Π ∈ Π^HR. The value of the MDP is V_N^*; it may not be achievable.
3. Optimality equations and the principle of optimality
• How do we find the optimal policy?
  1. By analyzing a sequence of smaller, inductively defined problems.
  2. By using the principle of optimality: an optimal policy has the property that, whatever the initial state and decision are, the remaining decisions constitute an optimal policy with regard to the state resulting from the first decision (any subsequence of an optimal policy must itself be optimal for the corresponding subproblem).
  3. Take the N-period problem and solve a series of N one-period problems (a small backward induction sketch follows this list).
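The optimality equations above translate directly into a backward induction routine. The sketch below assumes finite states and actions, with r_t(s, a) and p_t(j|s, a) supplied as numpy arrays; the function name and data layout are my own choices, not prescribed by the notes.

```python
import numpy as np

def backward_induction(r, p, r_terminal):
    """Solve U_t(s) = max_a { r_t(s,a) + sum_j p_t(j|s,a) U_{t+1}(j) }.

    r:          (N-1, S, A) immediate rewards r_t(s, a)
    p:          (N-1, S, A, S) transition probabilities p_t(j|s, a)
    r_terminal: (S,) terminal rewards r_N(s)
    Returns the value functions U of shape (N, S) and an optimal
    Markov deterministic policy d of shape (N-1, S).
    """
    n_stages, n_states, n_actions = r.shape
    U = np.zeros((n_stages + 1, n_states))
    d = np.zeros((n_stages, n_states), dtype=int)
    U[-1] = r_terminal                       # U_N(s) = r_N(s)
    for t in range(n_stages - 1, -1, -1):    # stages N-1, ..., 1 (0-based here)
        q = r[t] + p[t] @ U[t + 1]           # Q_t(s, a), one-period lookahead
        U[t] = q.max(axis=1)
        d[t] = q.argmax(axis=1)
    return U, d
```

Each pass of the loop is exactly one of the N one-period problems from item 3 above.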
• Define
  U_t^Π(h_t) = E_{h_t}^Π { Σ_{n=t}^{N−1} r_n(x_n, y_n) + r_N(x_N) },
  given that the decision maker has seen history h_t at time t. U_t^Π(h_t) is the expected remaining reward under policy Π from time t onward, where Π = {d_1, d_2, ..., d_{t−1}, d_t, d_{t+1}, ..., d_N} and y_n = d_n(h_n). If we know h_t, then we know x_t, y_t = d_t(h_t), and r_t(x_t, y_t), so
  U_t^Π(h_t) = r_t(x_t, y_t) + E_{h_t}^Π { Σ_{n=t+1}^{N−1} r_n(x_n, y_n) + r_N(x_N) }
             = r_t(x_t, y_t) + Σ_{j∈S} p_t(j|s_t, d_t(h_t)) E_{h_{t+1}}^Π { Σ_{n=t+1}^{N−1} r_n(x_n, y_n) + r_N(x_N) },
  where h_{t+1} = {h_t, d_t(h_t), j}. Hence
  U_t^Π(h_t) = r_t(x_t, y_t) + Σ_{j∈S} p_t(j|s_t, d_t(h_t)) U_{t+1}^Π(h_{t+1}).
  This is the core of dynamic programming: the recursive way to find the value of an MDP.
• How do we compute it? By the principle of optimality, decompose to obtain the optimality equations:
  U_t(h_t) = max_{a∈A_{s_t}} { r_t(s_t, a) + Σ_{j∈S} p_t(j|s_t, a) U_{t+1}(h_{t+1}) },
  U_N(h_N) = r_N(s_N),
  and inductively compute the optimal value: U_N(h_N) is given; compute U_{N−1}(h_{N−1}) for all h_{N−1} (sometimes this is not feasible because of the large number of histories h_{N−1}); with U_{N−1}(h_{N−1}) given, compute U_{N−2}(h_{N−2}) for all h_{N−2}; and so on. This is called backward recursion.
4. Optimality of deterministic Markov policies
Conditions under which there exists an optimal policy that is deterministic and Markovian.
Use backward induction to determine the structure of an optimal policy.
When the immediate rewards and transition probabilities depend on the past only through the current state of the system (as we have assumed), the optimal value functions depend on the history only through the current state of the system.
1. Theorem 4.4.1: existence of an optimal deterministic history-dependent policy.
2. Theorem 4.4.2: existence of an optimal deterministic Markovian policy.
3. If there are K states with L actions per state, there are (L^K)^{N−1} feasible policies, each requiring (N − 1)LK multiplications to evaluate, whereas backward induction requires only (N − 1)LK² multiplications in total.
4. Reward functions may be complicated to compute; there is a lot of research on reducing the computation.
5. Backward induction: by Theorem 4.4.2 there exists an optimal Markovian deterministic policy, so instead of finding u_{h_t} we only need to find u_{s_t}.
6. Example problems revisited.

Practice problem on a stock call option: Suppose the current price of some stock is $30 per share, and its daily price increases by $0.10 with probability 0.6, remains the same with probability 0.1, and decreases by $0.10 with probability 0.3. Find the value of a call option to purchase 100 shares of this stock at $31 at any time in the next 30 days by finding an optimal policy for exercising this option. Assume a transaction cost of $50.
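One possible backward induction formulation of the call-option practice problem is sketched below. It assumes no discounting, at most one exercise over the 30 days (or letting the option expire), and that the $50 transaction cost is paid only on exercise; these modeling choices are mine, since the problem statement leaves them open.

```python
# One possible backward induction for the call-option problem (assumptions above).
def option_value(days: int = 30, shares: int = 100,
                 strike: float = 31.0, start: float = 30.0,
                 cost: float = 50.0) -> float:
    up, same, down = 0.6, 0.1, 0.3            # daily price-move probabilities

    # Track the price as an integer number of $0.10 ticks from the start.
    def payoff(k: int) -> float:
        return shares * (start + 0.1 * k - strike) - cost

    # With 0 days left: exercise if profitable, otherwise let it expire.
    V = {k: max(payoff(k), 0.0) for k in range(-days, days + 1)}
    for rem in range(1, days + 1):            # rem = days remaining
        # Exercise now, or wait and collect the expected continuation value.
        V = {k: max(payoff(k),
                    up * V[k + 1] + same * V[k] + down * V[k - 1])
             for k in range(-(days - rem), days - rem + 1)}
    return V[0]                               # option value at today's price

print(round(option_value(), 2))
```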
7. Optimality of monotone policies (4.7)
• Structured policies appeal to decision makers. A typical structure is a control-limit policy:
  d_t(s) = a_1 when s < s*, and d_t(s) = a_2 when s ≥ s*,
  where s* is the control limit (an action threshold in the state).
• We then only need to find s* for each t, instead of d*_t(s) for each t and s.
  Let X (states) and Y (actions) be partially ordered sets and g(x, y) : X × Y → R. Then g is superadditive if for x+ ≥ x− and y+ ≥ y−,
  g(x+, y+) − g(x+, y−) ≥ g(x−, y+) − g(x−, y−),
  i.e., g has monotone increasing differences; for example, g(30, E) − g(30, w) ≥ g(20, E) − g(20, w).
• Lemma 4.7.1: if g is superadditive and, for each x ∈ X, max_{y∈Y} g(x, y) exists, then
  f(x) = max { y' ∈ argmax_{y∈Y} g(x, y) }
  is monotone nondecreasing in x. Superadditivity is a sufficient but not a necessary condition.
• Conditions under which monotone policies are optimal:
  Define q_t(k|s, a) = Σ_{j=k}^{∞} p_t(j|s, a). If q_t(k|s, a) is nondecreasing in s for all k ∈ S and a ∈ A, then for any nondecreasing sequence u_{t+1} (i.e., u_{t+1}(j + 1) ≥ u_{t+1}(j) for all j ∈ S) and any decision rule d_t ∈ D^MD,
  Σ_{j=0}^{∞} p_t(j|s', d_t(s')) u_{t+1}(j) ≥ Σ_{j=0}^{∞} p_t(j|s, d_t(s)) u_{t+1}(j)
  for s' ≥ s, s ∈ S. That is, the expected future reward from s' is at least as large as that from s. If the optimal value function U_t(·) is nondecreasing and q_t(k|s, a) is nondecreasing, then we prefer to be in a "higher" state. Revisit the stock example.
• Monotone policies: if s' ≥ s, then d*(s') ≥ d*(s).
  Proposition 4.7.3: suppose the maximum in the optimality equation is attained, and
  1. r_t(s, a) is nondecreasing in s for all a ∈ A, t = 1, 2, ..., N − 1;
  2. r_N(s) is nondecreasing in s;
  3. q_t(k|s, a) (the probability, from state s under action a, of moving to a state at least k) is nondecreasing in s for all k ∈ S and a ∈ A; that is, a higher state is more likely to move to a higher state.
  Then U*_t(s) is nondecreasing in s for all t = 1, 2, ..., N − 1.
• How do we show that a monotone optimal policy exists? Structure of the optimal value function ⇒ structure of the optimal policy. Use Theorem 4.7.4: a checklist of conditions.
• Backward induction algorithm for an MDP with an optimal monotone policy.

5 Infinite Horizon MDP

1. Introduction to the infinite horizon problem
  1. Why? The horizon length is often unknown or stochastic.
  2. Assume time-homogeneous data: p_t(j|s, a) = p(j|s, a) and r_t(s, a) = r(s, a).
  3. With non-homogeneous data on an infinite horizon, one usually forecasts for T periods, implements the decision for the current planning horizon, and then re-forecasts for T more periods (Turnpike Theorem).
2. Value of a policy for the infinite horizon problem
  1. For the infinite horizon problem an optimal policy is always stationary: Π* = (d, d, d, ...) = d^∞, with d_t = d, is a stationary policy.
  2. For a given policy we receive an infinite stream of rewards r(x_t, y_t), with x_t the state and y_t the action taken under the policy d^∞.
  3. For a given policy d^∞, the process becomes a Markov reward chain/process. Define p(j|i) = p(j|i, d(i)) = p_{ij} (since the policy is stationary, we do not need to specify it in the reward function and transition probabilities). When will this process converge? How can we measure optimality?
  • Expected total reward: the limit may be +∞ or −∞, or may not exist.
  • Expected total discounted reward: the limit exists if sup_{s,a} |r(s, a)| = M < ∞.
  • Average reward/cost criterion: exists only if lim sup_{N→∞} = lim inf_{N→∞}.
3. Criteria for measuring a policy in an infinite horizon MDP
  1. Expected total reward: v^{π*}(s) ≥ v^{π}(s) for each s ∈ S and all π ∈ Π^HR.
  2. Expected total discounted reward: v_λ^{π*}(s) ≥ v_λ^{π}(s) for each s ∈ S and all π ∈ Π^HR.
  3. Average reward/cost criterion: g^{π*}(s) ≥ g^{π}(s) for each s ∈ S and all π ∈ Π^HR.
4. Markov policies
  1. We now restrict our attention from Π^HR to Π^MR, and eventually to Π^MD.
  2. Theorem 5.5.1: proof by induction that an equivalent Markovian policy exists.
5. Vector notation for MDPs

6 Discounted MDP

We study infinite-horizon Markov decision processes with the expected total discounted reward optimality criterion. This criterion is widely used and applied, for example, to account for the time value of money or for technological change.
Assumptions:
• Stationary and bounded rewards
• Stationary transition probabilities
• Discounted future rewards, with discount factor 0 ≤ λ < 1
• Discrete state space
1. How do we evaluate a policy?
For a nonstationary policy Π^1 = {d_1, d_2, ...}, we know
  v_λ^{Π^1}(s) = E_s^{Π^1} { Σ_{t=1}^{∞} λ^{t−1} r(x_t, y_t) }.
If we define Π^2 = {d_2, d_3, ...} ∈ Π^MD, or in general Π^n = {d_n, d_{n+1}, ...} ∈ Π^MD, and v_λ^{Π^1}(s) is the expected discounted reward of starting in state s under policy Π^1, then
  v_λ^{Π^1}(s) = r(s, d_1(s)) + Σ_{j∈S} p(j|s, d_1(s)) [ λ r(j, d_2(j)) + Σ_{k∈S} p(k|j, d_2(j)) [ λ² ... ] ],
or, in vector form,
  v_λ^{Π^1} = r_{d_1} + λ P_{d_1} v_λ^{Π^2}.
If the policy is not stationary, i.e., d_i ≠ d_j, this is nearly impossible to compute without special structure. But if the data are stationary, why would the decisions differ in the same state at a later time? The future looks the same, so we should follow the same decision rule. We therefore only consider stationary policies for infinite horizon problems. Define Π = d^∞ = {d, d, d, ...}; then
  v_λ^{d^∞}(s) = r_d(s) + λ Σ_{j∈S} p(j|s, d(s)) v_λ^{d^∞}(j),
or
  v_λ^{d^∞} = r_d + λ P_d v_λ^{d^∞}.
In summary, we want the solution of
  v = r_d + λ P_d v  ⇔  (I − λP_d) v = r_d.
If (I − λP_d)^{−1} exists, the system has a unique solution. From linear algebra, if lim_{n→∞} ||(λP_d)^n||^{1/n} < 1, then (I − λP_d)^{−1} exists and
  (I − λP_d)^{−1} = Σ_{t=0}^{∞} λ^t P_d^t.
If u ≥ 0, then (I − λP_d)^{−1} u ≥ 0; so if the rewards are nonnegative, the value of the policy is nonnegative, and
  v_λ^{d^∞} = (I − λP_d)^{−1} r_d = Σ_{t=0}^{∞} λ^t P_d^t r_d = r_d + λP_d r_d + (λP_d)² r_d + · · ·
Now define L_d v ≡ r_d + λ P_d v, the linear (affine) transformation defined by d; it is a function of v, an operator on v. If we find v^0 ∈ V such that
  L_d v^0 = v^0  ⇒  r_d + λ P_d v^0 = v^0  ⇒  v^0 = (I − λP_d)^{−1} r_d,
then v^0 = v_λ^{d^∞}. Therefore, to find the value of the stationary policy Π = d^∞, we need to find the fixed point of the operator L_d. (A fixed point of a function f is a point y such that f(y) = y.) A small numerical sketch of this policy-evaluation step appears at the end of this subsection.
For the finite horizon problem under the stationarity assumption:
  U_t(s) = sup_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) U_{t+1}(j) }  for t = 1, 2, ..., N − 1,  with U_N(s) = r_N(s).
What is the optimality equation for the infinite horizon problem? For the infinite horizon case,
  v(s) = sup_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v(j) }  for s ∈ S.
From now on we focus on Π^MD, the Markovian deterministic (and ultimately stationary) policies.
Define the operator Lv ≡ max_{d∈D^MD} { r_d + λ P_d v }; componentwise, for s ∈ S,
  Lv(s) = max_{d∈D^MD} { r_d(s) + λ Σ_{j∈S} p(j|s, d(s)) v(j) } = max_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v(j) }.
We want v^0 ∈ V such that Lv^0 = v^0; in other words, we need to find the fixed point of L.
Numerical example 6.1.1.
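As a small numerical sketch of the policy-evaluation equation (I − λP_d)v = r_d, the numpy snippet below solves the linear system directly and also iterates the operator L_d to the same fixed point; the transition matrix and rewards are made-up two-state numbers, not from the notes.

```python
import numpy as np

# Evaluate a fixed stationary policy d by solving (I - lam * P_d) v = r_d.
# Here P_d[i, j] = p(j | i, d(i)) and r_d[i] = r(i, d(i)); numbers are illustrative.
P_d = np.array([[0.7, 0.3],
                [0.4, 0.6]])
r_d = np.array([5.0, -1.0])
lam = 0.9

v = np.linalg.solve(np.eye(2) - lam * P_d, r_d)
print(v)                      # value vector v_lambda^{d^infinity}

# Equivalently, iterate the operator L_d v = r_d + lam * P_d v to its fixed point.
u = np.zeros(2)
for _ in range(2000):
    u = r_d + lam * P_d @ u
print(u)                      # converges to the same vector
```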
2. Optimality equation
We will show the following:
(a) The optimality equation has a unique solution in V.
(b) The value of the discounted MDP satisfies the optimality equation.
(c) The optimality equation characterizes stationary optimal policies.
(d) Optimal policies exist under reasonable conditions on the states, actions, rewards, and transition probabilities.
For infinite horizon optimality, the Bellman equation is
  v(s) = sup_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v(j) }.
Proposition 6.2.1 provides the argument for restricting the policy space from Markov randomized to Markov deterministic. Recall that in Section 5.5 we proved that, for each fixed initial state, we may restrict attention to Markov policies within the history-dependent policies.
Properties of the optimality equations (Theorem 6.2.2): for v ∈ V,
(a) if v ≥ Lv, then v ≥ v_λ^*;
(b) if v ≤ Lv, then v ≤ v_λ^*;
(c) if v = Lv, then v = v_λ^*.
If v ≥ Lv, then Lv ≥ L²v, and so v ≥ Lv ≥ L²v ≥ · · ·. We need to prove that L^k v → v_λ^* as k → ∞; in other words, that for bounded v ∈ V, v_λ^* = lim_{k→∞} L^k v, where v_λ^* is the optimal value under the operator L.
Given v, Lv is the optimal value of a one-period problem with terminal value v;
L²v is the optimal value of a two-period problem whose terminal value contributes λ²v;
· · ·
L^n v is the optimal value of an n-period problem whose terminal value contributes λ^n v.
As n → ∞, λ^n v → 0, so the terminal value becomes irrelevant, and L^∞ v ≡ v_λ^*, the solution of the infinite horizon problem.
Definition of a normed linear space: define V as the set of bounded real-valued functions on S, with norm ||v|| = max_{s∈S} |v(s)|. V is closed under addition and scalar multiplication, so (V, ||·||) is a normed linear space.
Properties of a normed linear space: if v_1, v_2, v_3 ∈ V, then
(1) v_1 + v_2 = v_2 + v_1;
(2) v_1 + (v_2 + v_3) = (v_1 + v_2) + v_3;
(3) there exists a unique zero vector 0 such that v + 0 = v;
(4) for each v ∈ V there exists a unique vector −v ∈ V such that v + (−v) = 0.
Scalar multiplication: for α, β ∈ R,
(1) α(βv) = (αβ)v;
(2) 1 · v = v for all v;
(3) (α + β)v = αv + βv;
(4) α(v_1 + v_2) = αv_1 + αv_2.
In a normed linear space the triangle inequality holds: ||v_1 + v_2|| ≤ ||v_1|| + ||v_2||.
Definition of a contraction mapping: T : U → U is a contraction mapping if there exists a λ with 0 ≤ λ < 1 such that ||Tv − Tu|| ≤ λ ||v − u|| for all u, v ∈ U.
One can show that if T is a contraction mapping, then ||T^k v − T^k v'|| ≤ λ^k ||v − v'|| → 0.
Theorem 6.2.3 (Banach fixed point theorem): let T be a contraction mapping on a Banach space V. Then
(1) there exists a unique v* ∈ V such that Tv* = v*;
(2) lim_{k→∞} ||T^k v − v*|| = 0 for any v ∈ V.
We now know that we can find the optimal value, but what is the optimal policy? We say d* is a conserving decision rule if
  r_{d*} + λ P_{d*} v_λ^* = v_λ^*,
where v_λ^* is the optimal value vector; componentwise,
  r(s, d*(s)) + λ Σ_{j∈S} p(j|s, d*(s)) v_λ^*(j) = v_λ^*(s).
In other words, d* satisfies the optimality equations. Is d* unique?
3. Value iteration
Value iteration is the most widely used and best understood algorithm for solving discounted Markov decision problems.
(1) Other names for value iteration: successive approximation, backward induction, pre-Jacobi iteration.
(2) Advantage 1: conceptual simplicity.
(3) Advantage 2: easy to code and implement.
(4) Advantage 3: similarity to methods in other areas of applied mathematics.
Bellman's equation for the infinite horizon MDP:
  V(s) = max_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) V(j) }.
Optimality equation: Lv(s) = max_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v(j) } is the component-based form; Lv = max_{d∈D} { r_d + λ P_d v } is the vector-based form.
We want to find v* such that v* = Lv*.
If v ≥ Lv, then Lv ≥ L²v, and so v ≥ Lv ≥ L²v ≥ · · ·. For two bounded vectors X ≤ Y we have LX ≤ LY (and likewise L_d X ≤ L_d Y); that is, the operator L is monotone.
Value iteration (successive approximation) is not only used for MDPs; it is used generally for optimization in function spaces.
We seek a solution of the system
  v(s) = max_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v(j) },  i.e., v = Lv.
We want to find the fixed point of the operator L; v_λ^* is the optimal value, and an optimal stationary deterministic policy (d*)^∞ is defined by
  d*(s) = argmax_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v_λ^*(j) },  Π* = (d*, d*, ...).
By the fixed point theorem, given v^0 ∈ V and setting v^n = L^n v^0, we have v^n → v_λ^* as n → ∞.
How long until convergence? What is the stopping condition?
Value iteration:
1. Select v^0 ∈ V and ε > 0; set n = 0.
2. Let v^{n+1} = Lv^n, i.e., v^{n+1}(s) = max_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v^n(j) }.
3. If ||v^{n+1} − v^n|| = max_{s∈S} |v^{n+1}(s) − v^n(s)| < ε(1 − λ)/(2λ), go to step 4; otherwise set n = n + 1 and go to step 2.
4. Set d_ε(s) = argmax_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v^{n+1}(j) }, i.e., L_{d_ε} v^{n+1} = L v^{n+1}.
Then d_ε is ε-optimal (Theorem 6.3.1).
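The value iteration algorithm above maps almost line by line onto the following numpy sketch; the array layout (rewards indexed by state and action, transition probabilities by state, action, next state) and the tiny example are illustrative assumptions.

```python
import numpy as np

def value_iteration(r, P, lam, eps=1e-6):
    """Discounted value iteration.

    r:   (S, A) immediate rewards r(s, a)
    P:   (S, A, S) transition probabilities p(j | s, a)
    lam: discount factor, 0 < lam < 1
    Returns an eps-optimal value vector and the corresponding
    stationary deterministic decision rule.
    """
    n_states, n_actions = r.shape
    v = np.zeros(n_states)                      # step 1: v^0 = 0
    while True:
        q = r + lam * (P @ v)                   # Q(s, a) = r + lam * sum_j p(j|s,a) v(j)
        v_new = q.max(axis=1)                   # step 2: v^{n+1} = L v^n
        if np.max(np.abs(v_new - v)) < eps * (1 - lam) / (2 * lam):  # step 3
            v = v_new
            break
        v = v_new
    d = (r + lam * (P @ v)).argmax(axis=1)      # step 4: eps-optimal rule
    return v, d

# Tiny illustrative example (numbers made up):
r = np.array([[5.0, 10.0], [-1.0, -1.0]])
P = np.array([[[0.5, 0.5], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
print(value_iteration(r, P, lam=0.9))
```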
Page 62 has a nice graph on the convergence of d_1 and d_2. This type of analysis is practical for finite-state or structured problems.
– Value iteration is not the best method for solving MDPs, but it is useful for proving structural properties (e.g., control-limit policies).
– Value iteration generates a sequence of value vectors in V.
4. Policy iteration algorithm
Let us explore an algorithm that generates a sequence of stationary policies in Π^MD, instead of a sequence of value vectors in V as in value iteration.
Proposition 6.4.1: let d_1 and d_2 be two stationary decision rules such that
  L_{d_2} v_λ^{d_1} = max_{d∈D^MD} { L_d v_λ^{d_1} };
then v_λ^{d_1} ≤ v_λ^{d_2}.
Hence define a sequence: let d_0 ∈ D^MD and evaluate it by solving r_{d_0} + λ P_{d_0} v = v, which can be done analytically or by applying L_{d_0} repeatedly, giving
  v_λ^{d_0} = (I − λ P_{d_0})^{−1} r_{d_0}.
Let V^0 = v_λ^{d_0}, and let V^{n+1} = (I − λ P_{d_{n+1}})^{−1} r_{d_{n+1}}, where d_{n+1} is chosen so that L_{d_{n+1}} V^n = L V^n; then V^{n+1} ≥ V^n by the proposition above. The concept behind this algorithm is that, instead of computing a sequence of values, we generate an improving sequence of policies; the search is over a discrete set and typically takes fewer steps.
Policy iteration:
1. Set n = 0 and select d_0 ∈ D = D^MD.
2. Policy evaluation: V^n is the solution of V = r_{d_n} + λ P_{d_n} V, i.e., V^n = (I − λ P_{d_n})^{−1} r_{d_n} (so V^n = v_λ^{d_n}).
3. Policy improvement: select d_{n+1} such that L_{d_{n+1}} V^n = L V^n, i.e.,
  d_{n+1}(s) ∈ argmax_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) V^n(j) },
  setting d_{n+1}(s) = d_n(s) whenever possible.
4. If d_{n+1} = d_n, stop; otherwise set n = n + 1 and go to step 2.
– With finite states and finite actions, policy iteration terminates in a finite number of steps (a small example is on page 65; a code sketch appears at the end of these notes).
Policy iteration variants:
– Policy iteration is computationally expensive.
– The exact value is not necessary; a close approximation suffices.
– Updates may be performed only for a selected subset of states S_k ⊆ S.
– We then have a sequence of value vectors and policies: (V_0, d_0), (V_1, d_1), ..., (V_k, d_k), (V_{k+1}, d_{k+1}), ...
We can generate this sequence in one of two ways:
(1) update V_k by V_{k+1}(s) = L_{d_k} V_k(s) for s ∈ S_k, keeping the other states the same;
(2) update d_k by d_{k+1}(s) = argmax_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) V_k(j) } for s ∈ S_k, keeping the other states the same.
For the variants:
(a) Let S_k = S, begin with (V_0, −), and perform (2) then (1). What is this algorithm? V_0 ⇒ apply L, find d_0 such that L_{d_0} V_0 = L V_0 ⇒ V_1 = L_{d_0} V_0 = L V_0. This is value iteration.
(b) Begin with (−, d_0) and perform (1) then (2). What is this algorithm? This becomes policy iteration.
(c) Perform m_k iterations of (1) followed by (2). This is modified policy iteration, which is so far the best method for solving MDPs. The most important issue is the selection of m_k: fixed? increasing? conditional?
5. Linear programming and MDPs
How do we find max{3, 5, 7, 9} with linear programming? Please refer to Section 6.9 in the textbook and the in-class lecture notes.
6. Optimality of structured policies
Please refer to Section 6.11 in the textbook and the in-class lecture notes.
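As referenced in the policy iteration algorithm above, here is a minimal numpy sketch of policy iteration: exact policy evaluation by a linear solve, followed by greedy improvement with the "keep the old action if possible" tie-breaking rule. The data layout matches the value iteration sketch earlier, and the example numbers are made up.

```python
import numpy as np

def policy_iteration(r, P, lam):
    """Policy iteration for a discounted MDP.

    r: (S, A) rewards r(s, a); P: (S, A, S) probabilities p(j | s, a);
    lam: discount factor, 0 <= lam < 1.
    """
    n_states, n_actions = r.shape
    d = np.zeros(n_states, dtype=int)                 # step 1: initial rule d_0
    while True:
        # Step 2 (policy evaluation): solve (I - lam * P_d) v = r_d exactly.
        P_d = P[np.arange(n_states), d]               # rows p(. | s, d(s))
        r_d = r[np.arange(n_states), d]
        v = np.linalg.solve(np.eye(n_states) - lam * P_d, r_d)
        # Step 3 (policy improvement): greedy rule with respect to v.
        q = r + lam * (P @ v)
        d_new = q.argmax(axis=1)
        # Keep the old action on ties so the stopping test is meaningful.
        d_new = np.where(np.isclose(q[np.arange(n_states), d], q.max(axis=1)),
                         d, d_new)
        if np.array_equal(d_new, d):                  # step 4: stop if unchanged
            return v, d
        d = d_new

# Tiny illustrative example (numbers made up):
r = np.array([[5.0, 10.0], [-1.0, -1.0]])
P = np.array([[[0.5, 0.5], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
print(policy_iteration(r, P, lam=0.9))
```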