Dynamic Programming Homework: System Control & Optimization

MATH 6367 Homework 1

Name and ID:

1. Consider the system

\[ x_{k+1} = x_k + u_k + w_k, \qquad k = 0, 1, 2, 3, \]

with initial state $x_0 = 5$, and the cost function

\[ \sum_{k=0}^{3} \left( x_k^2 + u_k^2 \right). \]

Apply the DP algorithm for the following two cases:

(a) The control constraint set $U_k(x_k)$ is $\{u \mid 0 \le x_k + u \le 5,\ u \text{ integer}\}$ for all $x_k$ and $k$, and the disturbance $w_k$ is equal to zero for all $k$.

(b) The control constraint is as in part (a) and the disturbance $w_k$ takes the values $-1$ and $1$ with equal probability $1/2$ for all $x_k$ and $u_k$, except if $x_k + u_k$ is equal to 0 or 5, in which case $w_k = 0$ with probability 1.

2. A game of the blackjack variety is played by two players as follows: Both players throw a die. The first player, knowing his opponent's result, may stop or may throw the die again and add the result to the result of his previous throw. He then may stop or throw again and add the result of the new throw to the sum of his previous throws. He may repeat this process as many times as he wishes. If his sum exceeds seven (i.e., he busts), he loses the game. If he stops before exceeding seven, the second player takes over and throws the die successively until the sum of his throws is four or higher. If the sum of the second player is over seven, he loses the game. Otherwise the player with the larger sum wins, and in case of a tie the second player wins. The problem is to determine a stopping strategy for the first player that maximizes his probability of winning for each possible initial throw of the second player. Formulate the problem in terms of DP and find an optimal stopping strategy for the case where the second player's initial throw is three.

Hint: Take $N = 6$ and a state space consisting of the following 14 states:

\[
\begin{array}{ll}
x^{1}: & \text{busted}, \\
x^{1+i}: & \text{already stopped at sum } i \ (1 \le i \le 7), \\
x^{8+i}: & \text{current sum is } i \text{ but the player has not yet stopped } (1 \le i \le 6).
\end{array}
\]

The optimal strategy is to throw until the sum is four or higher.

Solution: Let the state be the status of the first player.
The state space, then, consists of the states:

\[
\begin{array}{ll}
x^{1}: & \text{busted}, \\
x^{1+i}: & \text{already stopped at sum } i \ (1 \le i \le 7), \\
x^{8+i}: & \text{current sum is } i \text{ but the player has not yet stopped } (1 \le i \le 6).
\end{array}
\]

• The first player can roll at most 6 more times after the initial throw. Before he rolls the die he is in one of the above states. If in state $x^{8}$ or $x^{1}$, he has no choice. If in states $x^{2}$ to $x^{7}$, he has already stopped. If in states $x^{9}$ to $x^{14}$, he can apply a control (i.e., roll or stop), which will maximize $\text{Prob}(\text{Win} \mid x^{i})$, $i = 9, \ldots, 14$.

• For a given throw of the second player we can compute $\text{Prob}(\text{Win} \mid x^{1}), \ldots, \text{Prob}(\text{Win} \mid x^{8})$.

• Then, going backwards in time (from the 6th roll), we calculate the strategy which maximizes $\text{Prob}(\text{Win} \mid x^{i})$, $i = 14, \ldots, 9$.

Let

\[ P^*(\text{Win} \mid x^{i}) = \max_{u} \text{Prob}(\text{Win} \mid x^{i}, u). \]

Given that the initial throw of the second player is three:

\[
\begin{array}{ll}
\text{Prob}(\text{Win} \mid x^{1}) = 0, & \text{Prob}(\text{Win} \mid x^{5}) = 1/3, \\
\text{Prob}(\text{Win} \mid x^{2}) = 1/3, & \text{Prob}(\text{Win} \mid x^{6}) = 1/2, \\
\text{Prob}(\text{Win} \mid x^{3}) = 1/3, & \text{Prob}(\text{Win} \mid x^{7}) = 2/3, \\
\text{Prob}(\text{Win} \mid x^{4}) = 1/3, & \text{Prob}(\text{Win} \mid x^{8}) = 5/6.
\end{array}
\]

Let $x_i$ be the state after the $i$th roll by player 1.

Stage 6: $x_6 \in \{x^{1}, x^{8}\}$; no controls can be applied.

Stage 5: $x_5 \in \{x^{1}, x^{7}, x^{8}, x^{14}\}$; control possible only for $x^{14}$.

\[ \text{Prob}(\text{Win} \mid x^{14}, u: \text{stop}) = \text{Prob}(\text{Win} \mid x^{7}) = 2/3 \]

\[ \text{Prob}(\text{Win} \mid x^{14}, u: \text{roll}) = \frac{\text{Prob}(\text{Win} \mid x^{8})}{6} = 5/36 \]

(Note that $\mu_j(x_k)$ is independent of $j$.)
We have, then: $P^*(\text{Win} \mid x^{14}) = 2/3$, $\mu(x^{14})$: stop.

Stage 4: $x_4 \in \{x^{1}, x^{6}, x^{7}, x^{8}, x^{13}, x^{14}\}$

\[ \text{Prob}(\text{Win} \mid x^{13}, u: \text{stop}) = \text{Prob}(\text{Win} \mid x^{6}) = 1/2 \]

\[ \text{Prob}(\text{Win} \mid x^{13}, u: \text{roll}) = \frac{\text{Prob}(\text{Win} \mid x^{8}) + P^*(\text{Win} \mid x^{14})}{6} < 1/2 \]

We have, then: $P^*(\text{Win} \mid x^{13}) = 1/2$, $\mu(x^{13})$: stop.

Stage 3: $x_3 \in \{x^{1}, x^{5}, x^{6}, x^{7}, x^{8}, x^{12}, x^{13}, x^{14}\}$

\[ \text{Prob}(\text{Win} \mid x^{12}, u: \text{stop}) = \text{Prob}(\text{Win} \mid x^{5}) = 1/3 \]

\[ \text{Prob}(\text{Win} \mid x^{12}, u: \text{roll}) = \frac{\text{Prob}(\text{Win} \mid x^{8}) + P^*(\text{Win} \mid x^{13}) + P^*(\text{Win} \mid x^{14})}{6} = 1/3 \]

We have, then: $P^*(\text{Win} \mid x^{12}) = 1/3$, $\mu(x^{12})$: stop or roll.

Stage 2: $x_2 \in \{x^{1}, x^{4}, x^{5}, x^{6}, x^{7}, x^{8}, x^{11}, x^{12}, x^{13}, x^{14}\}$

\[ \text{Prob}(\text{Win} \mid x^{11}, u: \text{stop}) = \text{Prob}(\text{Win} \mid x^{4}) = 1/3 \]

\[ \text{Prob}(\text{Win} \mid x^{11}, u: \text{roll}) = \frac{\text{Prob}(\text{Win} \mid x^{8}) + P^*(\text{Win} \mid x^{12}) + \cdots + P^*(\text{Win} \mid x^{14})}{6} = 7/18 \]

We have, then: $P^*(\text{Win} \mid x^{11}) = 7/18$, $\mu(x^{11})$: roll.

Finally:

\[ P^*(\text{Win} \mid x^{10}) = 49/108, \quad \mu(x^{10}): \text{roll} \]

\[ P^*(\text{Win} \mid x^{9}) = 343/648, \quad \mu(x^{9}): \text{roll} \]

3. Assume that we have a vessel whose maximum weight capacity is $z$ and whose cargo is to consist of different quantities of $N$ different items. Let $v_i$ denote the value of the $i$th type of item, $w_i$ the weight of the $i$th type of item, and $x_i$ the number of items of type $i$ that are loaded in the vessel. The problem is to find the most valuable cargo, i.e., to maximize $\sum_{i=1}^{N} x_i v_i$ subject to the constraints $\sum_{i=1}^{N} x_i w_i \le z$ and $x_i = 0, 1, 2, \ldots$. Formulate this problem in terms of DP.

4. A farmer annually producing $x_k$ units of a certain crop stores $(1 - u_k) x_k$ units of his production, where $0 \le u_k \le 1$, and invests the remaining $u_k x_k$ units, thus increasing the next year's production to a level $x_{k+1}$ given by

\[ x_{k+1} = x_k + w_k u_k x_k, \qquad k = 0, 1, \ldots, N - 1. \]

The scalars $w_k$ are independent random variables with identical probability distributions that do not depend either on $x_k$ or $u_k$. Furthermore, $E\{w_k\} = \bar{w} > 0$.
The problem is to find the optimal investment policy that maximizes the total expected product stored over $N$ years

\[ E_{w_k,\ k = 0, 1, \ldots, N-1} \left\{ x_N + \sum_{k=0}^{N-1} (1 - u_k) x_k \right\}. \]

Show the optimality of the following policy that consists of constant functions:

(a) If $\bar{w} > 1$, $\mu_0^*(x_0) = \cdots = \mu_{N-1}^*(x_{N-1}) = 1$.

(b) If $0 < \bar{w} < 1/N$, $\mu_0^*(x_0) = \cdots = \mu_{N-1}^*(x_{N-1}) = 0$.

(c) If $1/N \le \bar{w} \le 1$,

\[ \mu_0^*(x_0) = \cdots = \mu_{N-\bar{k}-1}^*(x_{N-\bar{k}-1}) = 1, \qquad \mu_{N-\bar{k}}^*(x_{N-\bar{k}}) = \cdots = \mu_{N-1}^*(x_{N-1}) = 0, \]

where $\bar{k}$ is such that $1/(\bar{k} + 1) < \bar{w} \le 1/\bar{k}$.

Solution: The DP algorithm is:

\[ J_N(x_N) = x_N \]

\[ J_k(x_k) = \max_{0 \le u_k \le 1} \left\{ (1 - u_k) x_k + E_{w_k} \left\{ J_{k+1}\big( (1 + w_k u_k) x_k \big) \right\} \right\} \]

• Case 1: $\bar{w} > 1$

Claim:

\[ J_{N-k}(x_{N-k}) = x_{N-k} (1 + \bar{w})^k, \quad k = 1, \ldots, N, \qquad \mu_0^*(x_0) = \cdots = \mu_{N-1}^*(x_{N-1}) = 1. \]

The proof follows by induction.

\[
\begin{aligned}
J_{N-1}(x_{N-1}) &= \max_{0 \le u_{N-1} \le 1} \left\{ (1 - u_{N-1}) x_{N-1} + E_{w_{N-1}} \left\{ (1 + w_{N-1} u_{N-1}) x_{N-1} \right\} \right\} \\
&= x_{N-1} \max_{0 \le u_{N-1} \le 1} \left[ 2 + (\bar{w} - 1) u_{N-1} \right] \\
&= x_{N-1} (1 + \bar{w}),
\end{aligned}
\]

where $\mu_{N-1}^*(x_{N-1}) = 1$. Assume that $J_{N-k}(x_{N-k}) = x_{N-k} (1 + \bar{w})^k$. Then

\[
\begin{aligned}
J_{N-k-1}(x_{N-k-1}) &= \max_{0 \le u_{N-k-1} \le 1} \left\{ (1 - u_{N-k-1}) x_{N-k-1} + (1 + \bar{w} u_{N-k-1}) (1 + \bar{w})^k x_{N-k-1} \right\} \\
&= x_{N-k-1} \max_{0 \le u_{N-k-1} \le 1} \left\{ 1 + (1 + \bar{w})^k + \left[ (1 + \bar{w})^k \bar{w} - 1 \right] u_{N-k-1} \right\} \\
&= x_{N-k-1} (1 + \bar{w})^{k+1},
\end{aligned}
\]

where $\mu_{N-k-1}^*(x_{N-k-1}) = 1$.

• Case 2: $0 < \bar{w} < 1/N$

Claim:

\[ J_{N-k}(x_{N-k}) = (k + 1) x_{N-k}, \quad k = 1, \ldots, N, \qquad \mu_0^*(x_0) = \cdots = \mu_{N-1}^*(x_{N-1}) = 0. \]

The proof follows by induction.

\[ J_{N-1}(x_{N-1}) = x_{N-1} \max_{0 \le u_{N-1} \le 1} \left[ 2 + (\bar{w} - 1) u_{N-1} \right] = 2 x_{N-1}, \]

where $\mu_{N-1}^*(x_{N-1}) = 0$. Assume that $J_{N-k}(x_{N-k}) = (k + 1) x_{N-k}$. Then

\[
\begin{aligned}
J_{N-k-1}(x_{N-k-1}) &= \max_{0 \le u_{N-k-1} \le 1} \left\{ (1 - u_{N-k-1}) x_{N-k-1} + (k + 1)(1 + \bar{w} u_{N-k-1}) x_{N-k-1} \right\} \\
&= x_{N-k-1} \max_{0 \le u_{N-k-1} \le 1} \left\{ (k + 2) + \left[ (k + 1) \bar{w} - 1 \right] u_{N-k-1} \right\} \\
&= (k + 2) x_{N-k-1},
\end{aligned}
\]

where $\mu_{N-k-1}^*(x_{N-k-1}) = 0$.

• Case 3: $1/N \le \bar{w} \le 1$

Apply the DP algorithm beginning with stage $N$.
Proceed as in Case 2, setting the control equal to zero until:

\[ J_{N-\bar{k}-1}(x_{N-\bar{k}-1}) = x_{N-\bar{k}-1} \max_{0 \le u_{N-\bar{k}-1} \le 1} \left\{ (\bar{k} + 2) + \left[ (\bar{k} + 1) \bar{w} - 1 \right] u_{N-\bar{k}-1} \right\}, \]

where $N - \bar{k} - 1$ is the first stage where $\bar{w} > 1/(\bar{k} + 1)$. Since $(\bar{k} + 1) \bar{w} - 1 > 0$, take:

\[ \mu_{N-\bar{k}-1}^*(x_{N-\bar{k}-1}) = 1 \]

\[ J_{N-\bar{k}-1}(x_{N-\bar{k}-1}) = (\bar{k} + 1)(1 + \bar{w}) x_{N-\bar{k}-1} \]

From this point, proceed as in Case 1. At each iteration the power of $(1 + \bar{w})$ will be raised and the control will be set to one.

5. An unscrupulous innkeeper charges a different rate for a room as the day progresses, depending on whether he has many or few vacancies. His objective is to maximize his expected total income during the day. Let $x$ be the number of empty rooms at the start of the day, and let $y$ be the number of customers that will ask for a room in the course of the day. We assume (somewhat unrealistically) that the innkeeper knows $y$ with certainty, and upon arrival of a customer, quotes one of $m$ prices $r_i$, $i = 1, \ldots, m$, where $0 < r_1 \le r_2 \le \cdots \le r_m$. A quote of a rate $r_i$ is accepted with probability $p_i$ and is rejected with probability $1 - p_i$, in which case the customer departs, never to return during that day. Formulate this as a problem with $y$ stages and show that the maximal expected income, as a function of $x$ and $y$, satisfies the recursion

\[ J(x, y) = \max_{i = 1, \ldots, m} \left[ p_i \big( r_i + J(x - 1, y - 1) \big) + (1 - p_i) J(x, y - 1) \right] \]

for all $x \ge 1$ and $y \ge 1$, with initial conditions

\[ J(x, 0) = J(0, y) = 0 \quad \text{for all } x \text{ and } y. \]

6. An investor observes at the beginning of each period $k$ the price $x_k$ of a stock and decides whether to buy 1 unit, sell 1 unit, or do nothing. There is a transaction cost $c$ for buying or selling. The stock price can take one of $n$ different values $v^1, \ldots, v^n$ and the transition probabilities $p_{ij}^k = P\{x_{k+1} = v^j \mid x_k = v^i\}$ are known.
The investor wants to maximize the total worth of his stock at a fixed final period $N$ minus his investment costs from period 0 to period $N - 1$ (revenue from a sale is viewed as negative cost). We assume that the function

\[ P_k(x) = E\{x_N \mid x_k = x\} - x \]

is monotonically nonincreasing as a function of $x$; that is, the expected profit from a purchase is a nonincreasing function of the purchase price. Assume that the investor starts with $N$ or more units of stock and an unlimited amount of cash, so that a purchase or sale decision is possible at each period regardless of the past decisions and the current price. For every period $k$, let $\underline{x}_k$ be the largest value of $x \in \{v^1, \ldots, v^n\}$ such that $P_k(x) > c$, and let $\bar{x}_k$ be the smallest value of $x \in \{v^1, \ldots, v^n\}$ such that $P_k(x) < -c$. Show that it is optimal to buy if $x_k \le \underline{x}_k$, sell if $x_k \ge \bar{x}_k$, and do nothing otherwise.

Hint: Formulate the problem as one of maximizing

\[ E\left\{ \sum_{k=0}^{N-1} \big( u_k P_k(x_k) - c |u_k| \big) \right\}, \]

where $u_k \in \{-1, 0, 1\}$.

Solution: The total net expected profit from the (buy/sell) investment decisions after transaction costs are deducted is

\[ E\left\{ \sum_{k=0}^{N-1} \big( u_k P_k(x_k) - c |u_k| \big) \right\}, \]

where

\[
u_k =
\begin{cases}
1 & \text{if a unit of stock is bought at the } k\text{th period}, \\
-1 & \text{if a unit of stock is sold at the } k\text{th period}, \\
0 & \text{otherwise}.
\end{cases}
\]

With a policy that maximizes this expression, we simultaneously maximize the expected total worth of the stock held at time $N$ minus the investment costs (including sale revenues). The DP algorithm is given by

\[ J_k(x_k) = \max_{u_k \in \{-1, 0, 1\}} \left[ u_k P_k(x_k) - c |u_k| + E\{ J_{k+1}(x_{k+1}) \mid x_k \} \right] \]

with

\[ J_N(x_N) = 0, \]

where $J_{k+1}(x_{k+1})$ is the optimal expected profit when the stock price is $x_{k+1}$ at time $k + 1$. Since $u_k$ does not influence $x_{k+1}$ and $E\{J_{k+1}(x_{k+1}) \mid x_k\}$, a decision $u_k \in \{-1, 0, 1\}$ that maximizes $u_k P_k(x_k) - c |u_k|$ at time $k$ is optimal.
Since $P_k(x_k)$ is monotonically nonincreasing in $x_k$, it follows that it is optimal to set

\[
u_k =
\begin{cases}
1 & x_k \le \underline{x}_k, \\
-1 & x_k \ge \bar{x}_k, \\
0 & \text{otherwise},
\end{cases}
\]

where $\underline{x}_k$ and $\bar{x}_k$ are as in the problem statement. Note that the optimal expected profit $J_k(x_k)$ is given by

\[ J_k(x_k) = E\left\{ \sum_{i=k}^{N-1} \max_{u_i \in \{-1, 0, 1\}} \big( u_i P_i(x_i) - c |u_i| \big) \right\}. \]
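The threshold rule of Problem 6 can be checked numerically on a small example. The sketch below is illustrative only: the function names, the three-price chain, the horizon N = 2, and the cost c = 0.3 are all assumptions chosen for the example and do not come from the problem statement. It computes $E\{x_N \mid x_k = v^i\}$ by backward recursion over the transition matrices, forms $P_k(v^i)$, and picks the stage-wise maximizer of $u P_k - c|u|$.

```python
def expected_final_price(values, transitions):
    """Return m with m[k][i] = E{x_N | x_k = values[i]}.

    transitions[k][i][j] plays the role of p^k_ij; N = len(transitions).
    """
    n_periods = len(transitions)
    m = [None] * (n_periods + 1)
    m[n_periods] = list(values)  # at time N the price is known
    for k in range(n_periods - 1, -1, -1):
        m[k] = [sum(p * e for p, e in zip(row, m[k + 1]))
                for row in transitions[k]]
    return m


def myopic_policy(values, transitions, c):
    """For each period k and price index i, return 1 (buy), -1 (sell) or 0."""
    m = expected_final_price(values, transitions)
    policy = []
    for k in range(len(transitions)):
        # gains[i] is P_k(v^i) = E{x_N | x_k = v^i} - v^i
        gains = [m[k][i] - values[i] for i in range(len(values))]
        policy.append([1 if g > c else -1 if g < -c else 0 for g in gains])
    return policy


# Hypothetical data: a mean-reverting chain on prices 1, 2, 3, so that
# P_k is nonincreasing in the price, as the problem assumes.
values = [1.0, 2.0, 3.0]
step = [[0.5, 0.5, 0.0],
        [0.25, 0.5, 0.25],
        [0.0, 0.5, 0.5]]
transitions = [step, step]  # same matrix at k = 0 and k = 1 (an assumption)
policy = myopic_policy(values, transitions, c=0.3)
print(policy)  # [[1, 0, -1], [1, 0, -1]]
```

In this example $P_0 = (0.75, 0, -0.75)$ and $P_1 = (0.5, 0, -0.5)$, so in both periods the investor buys at the lowest price, sells at the highest, and does nothing at the middle price, which is the buy-below, sell-above structure derived in the solution.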
