University of Illinois at Urbana-Champaign, Fall 2020
ECE 555 – Control of Stochastic Systems: Midterm Exam
Instructor: Prof. R. Srikant
TAs: Joseph Lubars and Siddhartha Satpathi
Due: November 5, 12:30 pm via Gradescope

Note: You have to clearly justify your answers to receive credit.

1. (30 points) Consider the following finite-horizon LQ stochastic control problem, where both the state and control variables are scalars: the system dynamics are given by

    x_{k+1} = a x_k + b u_k + w_k,

where x_0 is a constant, the w_k are i.i.d. with mean 0 and variance \sigma^2, and the objective is to minimize

    E\left[ \sum_{k=0}^{T-1} \left( q x_k^2 + r u_k^2 + m x_k u_k \right) \right],

where r > 0, q \ge 0, qr \ge m^2/4, and m is the last digit of your UIN. Derive the optimal feedback control policy, including any recursive equations that must be solved to obtain the optimal solution.

Solution: We will show that the optimal action u_k(x_k) and value function V_k(x_k) take the following form:

    u_k(x_k) = -\frac{m + 2ab s_{k+1}}{2(r + s_{k+1} b^2)} x_k                                                          (1.1)

    V_k(x_k) = s_k x_k^2 + t_k                                                                                          (1.2)

    s_k = q + a^2 s_{k+1} - \frac{(m + 2ab s_{k+1})^2}{4(r + s_{k+1} b^2)}  and  t_k = s_{k+1}\sigma^2 + t_{k+1}        (1.3)

    s_T = 0,  s_{T-1} = q - \frac{m^2}{4r},  t_{T-1} = 0,  t_T = 0,  and  u_{T-1}(x_{T-1}) = -\frac{m}{2r} x_{T-1}      (1.4)

We will show the above result by induction.

• Base Case: When k = T - 1,

    V_{T-1}(x) = \min_u \left( q x^2 + r u^2 + m x u \right).

Since r > 0, this quadratic in u has a unique minimum, and solving it we arrive at the base-case solution in (1.4). Note the form of the base-case solution, which we can use as a reference for finding the solution in the inductive case. Alternatively, the base case s_T = t_T = 0 also works.

• Inductive Step: Assume the recursion given in (1.1)-(1.3) holds at index k+1, where k+1 \le T-1. We will prove that (1.1)-(1.3) also holds at index k. We start with the dynamic programming equation

    V_k(x) = \min_u \left( q x^2 + r u^2 + m x u + E\left[ V_{k+1}(x_{k+1}) \mid x_{k+1} = a x + b u + w_k \right] \right)
           = \min_u \left( q x^2 + r u^2 + m x u + s_{k+1} E\left[ (a x + b u + w_k)^2 \right] + t_{k+1} \right).       (1.5)

Since the w_k are i.i.d. with mean 0 and variance \sigma^2, we have E[(ax + bu + w_k)^2] = (ax + bu)^2 + \sigma^2. Substituting this into equation (1.5), we arrive at the following quadratic minimization problem:

    V_k(x) = \min_u \left( (q + s_{k+1} a^2) x^2 + (r + s_{k+1} b^2) u^2 + (m + 2ab s_{k+1}) x u \right) + s_{k+1}\sigma^2 + t_{k+1}.

The minimum of this quadratic is attained at u = -\frac{m + 2ab s_{k+1}}{2(r + s_{k+1} b^2)} x, which is exactly (1.1). Substituting this back into the quadratic, we obtain the value function (1.2) with the coefficients (1.3) at index k. This completes the inductive proof.
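For concreteness, the backward recursion (1.3)-(1.4) and the feedback law (1.1) can be implemented in a few lines of Python. The sketch below is illustrative only: the numerical values chosen for a, b, q, r, m, sigma, and T are placeholders (picked to satisfy r > 0, q >= 0, qr >= m^2/4), not part of the problem statement.

    # Sketch of the backward recursion (1.3)-(1.4) and the feedback law (1.1).
    # The parameter values used at the bottom are placeholders for illustration.
    import numpy as np

    def lq_policy(a, b, q, r, m, sigma, T):
        """Return feedback gains K (so that u_k(x) = -K[k] * x) and the
        value-function coefficients s, t with V_k(x) = s[k] * x**2 + t[k]."""
        s = np.zeros(T + 1)   # terminal condition s_T = 0
        t = np.zeros(T + 1)   # terminal condition t_T = 0
        K = np.zeros(T)
        for k in range(T - 1, -1, -1):
            K[k] = (m + 2 * a * b * s[k + 1]) / (2 * (r + b**2 * s[k + 1]))          # eq. (1.1)
            s[k] = q + a**2 * s[k + 1] \
                 - (m + 2 * a * b * s[k + 1])**2 / (4 * (r + b**2 * s[k + 1]))       # eq. (1.3)
            t[k] = s[k + 1] * sigma**2 + t[k + 1]                                    # eq. (1.3)
        return K, s, t

    # Placeholder parameters; m stands in for the last digit of a UIN.
    # The last-stage quantities should match (1.4):
    # K[T-1] = m/(2r), s[T-1] = q - m^2/(4r), t[T-1] = 0.
    K, s, t = lq_policy(a=0.9, b=0.5, q=2.0, r=1.0, m=2.0, sigma=0.5, T=10)
    print(K[-1], s[-2], t[-2])   # expected: 1.0, 1.0, 0.0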
2. (40 points) Consider a two-state controlled Markov chain with two possible control actions u1 and u2 in each state. Let the probability transition matrices corresponding to the two actions be

    P(u1) = \begin{pmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{pmatrix},    P(u2) = \begin{pmatrix} 0.8 & 0.2 \\ 0.7 & 0.3 \end{pmatrix},

and the per-stage rewards be

    r(1, u1) = 6 + m,  r(1, u2) = 4 + m,  r(2, u1) = -3 + m,  r(2, u2) = -5 + m,

where m is the last digit of your UIN. Consider the problem of maximizing the infinite-horizon discounted reward with discount factor \alpha \in [0, 1), i.e.,

    \max \lim_{T \to \infty} E\left[ \sum_{k=0}^{T-1} \alpha^k r(x_k, u_k) \,\middle|\, x_0 = i \right],

where i \in \{1, 2\}. Show that there exists a constant \bar\alpha such that the following statements hold: (a) using action u1 in both states is an optimal policy if \alpha \le \bar\alpha, and (b) using action u2 in both states is an optimal policy if \alpha \ge \bar\alpha. Calculate \bar\alpha.

Solution: First, a remark on the intuition of the problem. The solution below involves some computational tricks, so here is the general reasoning. Recall that the optimal policy solves

    \mu^*(i) = \arg\max_u \left[ r(i, u) + \alpha \sum_j P_{ij}(u) V(j) \right].

With just two states, this is characterized by the sign of

    A(i, u1) := r(i, u1) - r(i, u2) + \alpha \sum_j \left( P_{ij}(u1) - P_{ij}(u2) \right) V(j).

If A(i, u1) \ge 0, then u1 is optimal, while if A(i, u1) < 0, then u2 is optimal. A(i, u1) is an example of what is called an advantage function in reinforcement learning. Because the problem dictates that our policy is a simple threshold policy, the sign of the advantage function must not depend on the state. In checking this condition, we find that the condition satisfied by this problem is actually even stronger: the advantage function does not depend on the state at all.

The solution: Note that neither the difference in rewards nor the difference in transition probabilities depends on the state:

    r(1, u1) - r(1, u2) = r(2, u1) - r(2, u2) = 2,

    D := P(u1) - P(u2) = \begin{pmatrix} -0.3 & 0.3 \\ -0.3 & 0.3 \end{pmatrix}.

Action u1 is optimal at state i if the discounted reward for taking action u1 is at least the discounted reward for taking action u2, i.e.:

    r(i, u1) - r(i, u2) + \alpha D_{i1} V(1) + \alpha D_{i2} V(2) \ge 0
    2 - 0.3\alpha V(1) + 0.3\alpha V(2) \ge 0
    V(1) - V(2) \le \frac{20}{3\alpha}.

This condition does not depend on the state i, so the optimal policy is either \mu^* = (u1, u1) or \mu^* = (u2, u2). Either way, inspired by the form of our condition on \alpha, we subtract the two Bellman equations from each other to get an answer that is conveniently independent of the policy:

    V(1) - V(2) = 9 + 0.1\alpha (V(1) - V(2))
    V(1) - V(2) = \frac{9}{1 - 0.1\alpha}.

Thus \mu^* = (u1, u1) if \alpha \le \bar\alpha and \mu^* = (u2, u2) if \alpha \ge \bar\alpha (both policies are optimal at \alpha = \bar\alpha), where \bar\alpha is the value at which the condition holds with equality:

    \frac{9}{1 - 0.1\bar\alpha} = \frac{20}{3\bar\alpha}
    9 = \frac{20}{3\bar\alpha} - \frac{2}{3}
    \bar\alpha = \frac{20}{29}.

3. (30 points) Consider the MDP in Problem 2, but now consider the problem of maximizing the infinite-horizon average reward, i.e.,

    \max \lim_{T \to \infty} \frac{1}{T} E\left[ \sum_{k=0}^{T-1} r(x_k, u_k) \right].

(a) (10 points) Find the optimal policy.
(b) (20 points) Find the optimal infinite-horizon average reward.

Solution:

(a) Policy (u2, u2) is optimal for the discounted-reward problem whenever \alpha > 20/29, and in particular for all \alpha sufficiently close to 1. Therefore, the same policy is optimal for the average-reward problem.

(b) Using the optimal policy (u2, u2), we have the average-reward Bellman equations, where J^* is the optimal infinite-horizon average reward:

    J^* + V(1) = 4 + m + 0.8 V(1) + 0.2 V(2)    (3.6)
    J^* + V(2) = -5 + m + 0.7 V(1) + 0.3 V(2)   (3.7)

Subtracting (3.7) from (3.6), we get

    V(1) - V(2) = 9 + 0.1 (V(1) - V(2))
    V(1) - V(2) = 10.

Then, substituting this into (3.6):

    J^* + 0.2 (V(1) - V(2)) = 4 + m
    J^* = 2 + m.

Alternatively, you could fix a value for V(1) or V(2), probably 0 for easier calculations.

Alternative Solution: (Outline) Through calculation, you can find that under the policy (u2, u2) for the discounted problem, we have

    V_{u2}(1)(1 - \alpha) = 4 + m - \frac{1.8\alpha}{1 - 0.1\alpha}.

Taking \lim_{\alpha \to 1} (1 - \alpha) V_{u2}(1), we get J^* = 2 + m.

Alternative Solution: (Outline) Under policy (u2, u2), we can calculate the stationary distribution \pi of the induced Markov chain by finding the probability vector that solves \pi = \pi P(u2). This gives us \pi = (7/9, 2/9). Then the average reward is

    J^* = \pi_1 r(1, u2) + \pi_2 r(2, u2) = \frac{28 + 7m}{9} + \frac{-10 + 2m}{9} = 2 + m.
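As a numerical sanity check (not part of the required solution), the short Python sketch below verifies the threshold \bar\alpha = 20/29 by running value iteration on either side of it, and recomputes the average reward from the stationary distribution under (u2, u2). The value of m is a placeholder for the last digit of a UIN; the rest follows the problem data.

    # Numerical check of Problems 2 and 3; m below is a placeholder UIN digit.
    import numpy as np

    m = 0  # placeholder for the last digit of your UIN
    P = {1: np.array([[0.5, 0.5], [0.4, 0.6]]),   # P(u1)
         2: np.array([[0.8, 0.2], [0.7, 0.3]])}   # P(u2)
    r = {1: np.array([6 + m, -3 + m]),            # r(., u1)
         2: np.array([4 + m, -5 + m])}            # r(., u2)

    def greedy_policy(alpha, iters=2000):
        """Value iteration for the discounted problem, followed by one greedy step."""
        V = np.zeros(2)
        for _ in range(iters):
            V = np.max([r[u] + alpha * P[u] @ V for u in (1, 2)], axis=0)
        Q = np.array([r[u] + alpha * P[u] @ V for u in (1, 2)])  # rows: u1, u2
        return Q.argmax(axis=0) + 1                              # per-state action

    print(greedy_policy(20/29 - 0.01))   # expected [1 1], i.e. (u1, u1)
    print(greedy_policy(20/29 + 0.01))   # expected [2 2], i.e. (u2, u2)

    # Average reward under (u2, u2): stationary distribution solves pi = pi P(u2).
    A = np.vstack([P[2].T - np.eye(2), np.ones(2)])
    pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
    print(pi, pi @ r[2])                 # expected [7/9 2/9] and J* = 2 + m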