Dynamic Systems and Control - Exercise 1

Submission due date: 4.5.2015

Instructions: Write your answers in LaTeX and submit a PDF (LyX can help you do that). Insert code output as figures into the PDF. Zip all your files together for submission.

Questions:

1. Relations between horizon settings:

1.1. Let $M$ be an MDP and $\pi$ a time-invariant policy that induces an ergodic Markov process
\[ P_\pi(s, s') = \sum_{a \in A} \pi(a|s)\, p(s'|s, a) \qquad \forall s, s' \in S, \]
and let $\bar p_\pi(s)$ be the stationary distribution of $P_\pi$. If $V_\pi(s_0)$ is the value of policy $\pi$ from initial state $s_0$, then its value when $s_0$ is distributed according to $\bar p_\pi$ is $V_\pi = \mathbb{E}_{s_0 \sim \bar p_\pi(\cdot)}[V_\pi(s_0)]$. Show that, if $\bar p_\pi$ is the distribution of the initial state $s_0$, then the value of the policy $\pi$ is the same, up to some multiplicative constant, for a $T$-step finite horizon, an undiscounted infinite horizon, and a discounted infinite horizon. In this sense, what is the "effective finite horizon" of the discounted infinite horizon with discount $\gamma$? (That is, the value of $T$ that gives the same multiplicative constant in both settings.) What is the stationary distribution in an episodic-horizon setting?

1.2. Show that the discounted infinite-horizon setting is a special case of the episodic-horizon setting. No need to prove, just show the construction of the latter given the former. Hint: what state do you need to add? What would the new costs be? What would the new transition probabilities be to emulate the discount?

1.3. Let $\pi$ be a time-invariant policy that induces an ergodic Markov process, implying that the distribution of $s_t$ converges to $\bar p_\pi$. Let $V_\pi$ be its undiscounted infinite-horizon value, and $V_{\pi,\gamma}(s_0)$ its discounted infinite-horizon value with discount $0 < \gamma < 1$, given initial state $s_0$. Show that for any $s_0$
\[ \lim_{\gamma \to 1} (1 - \gamma)\, V_{\pi,\gamma}(s_0) = V_\pi. \]
Hint: let $t_\epsilon$ be such that for $t \ge t_\epsilon$ the distribution of $s_t$ is $\epsilon$-close to $\bar p_\pi$. What is the contribution of these time steps to the expression on the left-hand side? Be careful about the order of limits.

2. Recall that the Bellman operator for the episodic-horizon setting is
\[ (B^*[V])(s_f) = 0, \]
\[ (B^*[V])(s) = \min_{\pi(\cdot|s)} \left\{ \sum_a \pi(a|s) \left( c(s, a) + \sum_{s'} p(s'|s, a)\, V(s') \right) \right\} \qquad \forall s \in S \setminus \{s_f\}. \]
Write down the Bellman operator for the finite-horizon setting, and for the undiscounted and discounted infinite-horizon settings. In the undiscounted case, answer first using the expected total cost (the sum of the series of expected costs), and then adapt your answer for the expected average cost.

3. Implement in Matlab or Python a function named valueIteration, in a module with the same name. The function will receive as a first argument an object with the following members:

sizeS - integer, size of the set of states;
sizeA - integer, size of the set of actions;
p - sizeS × sizeA × sizeS array of real numbers (in Python: NumPy array), p[s, a, s_] is the probability of reaching state s_ after taking action a in state s;
c - sizeS × sizeA × sizeS array of real numbers, c[s, a, s_] is the cost of reaching state s_ after taking action a in state s.

The function will also receive as a second argument a real number, specifying the horizon:

<1 - (assume >0) discounted infinite horizon with the given discount;
1 - undiscounted infinite horizon;
>1 - (assume an integer) finite horizon with the given number of steps;
0 - episodic horizon.

The function will return a sizeS array V of real numbers, such that V[s] is the optimal value of state s. To keep it simple, in the episodic-horizon setting assume that $s_f$ is included in p and c. For any action, the state remains $s_f$ with probability 1, and the cost is 0. Treat $s_f$ as any other state in the algorithm. Convince yourself that the algorithm is still correct. Assume that the value function has converged when none of its components has changed by more than $10^{-6}$ in the last iteration. Submit your code in a file named valueIteration.m or valueIteration.py.
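To make the interface concrete, here is a minimal sketch in Python of the expected signature, filled in only for the discounted-infinite-horizon case; the container name model, the local variable names, and the single implemented branch are illustrative assumptions, not part of the required solution.

```python
# Minimal interface sketch (not a full solution): the expected signature and the
# discounted-infinite-horizon backup only. The names `model` and `horizon` are
# illustrative; the model members are the ones listed above.
import numpy as np

def valueIteration(model, horizon):
    """Return a length-sizeS array V, where V[s] is the optimal value of state s."""
    # Expected immediate cost of taking action a in state s:
    #   cBar[s, a] = sum_{s_} p[s, a, s_] * c[s, a, s_]
    cBar = np.sum(model.p * model.c, axis=2)

    if 0 < horizon < 1:  # discounted infinite horizon; `horizon` is the discount
        gamma = horizon
        V = np.zeros(model.sizeS)
        while True:
            # Q[s, a] = cBar[s, a] + gamma * sum_{s_} p[s, a, s_] * V[s_]
            Q = cBar + gamma * model.p.dot(V)
            Vnew = Q.min(axis=1)
            if np.max(np.abs(Vnew - V)) <= 1e-6:
                return Vnew
            V = Vnew

    # 0 (episodic), 1 (undiscounted) and an integer > 1 (finite horizon) follow
    # the same pattern, each with its Bellman operator from question 2.
    raise NotImplementedError("only the discounted case is sketched here")
```

Vectorizing the backup over states and actions, as above, keeps each iteration to a few NumPy array operations instead of explicit loops over s, a, and s_.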
4. Maze: implement a model of the following maze.

[Figure: the maze; one tile is marked $s_f$.]

Each tile is a state, and the tile marked $s_f$ is an absorbing state. The actions Right, Up, Left, Down take you to the appropriate next state, with success probability 0.8 if there is no wall, 0 if there is a wall. The action fails and remains in the same state with probability 0.2 if there is no wall, 1 if there is a wall. Each step costs 1 if it succeeds, 6 if it fails.

Run valueIteration on this model with episodic horizon. Show a heat map of the value function after 1, 2, 4, 8, and 16 iterations, and when the algorithm converges. Please use the "hot" color map, with the same range for all 6 figures.

Run valueIteration again, but with V initialized to 1 in every state, except in $s_f$ where it is 0. In each iteration, compute the difference between the value of V in the two runs. For the first 10 iterations, plot the maximum (over all states) of this difference.

Submit your code.

5. Curiosity killed the cat: implement the following model.

[Figure: two states $s_1$ and $s_2$, with transitions labeled by their costs 0, -1, and 6.]

There are 2 states and 2 deterministic actions. In $s_1$ one action remains in $s_1$ and costs nothing, the other moves to $s_2$ and costs -1 (that is, gains reward 1). In $s_2$ either action leads to $s_1$ and costs 6. (An example encoding of this model in the format of question 3 is sketched at the end of this sheet.)

Run valueIteration on this model with the undiscounted infinite horizon, and with the discounted infinite horizon with $\gamma = 0.3$ and with $\gamma = 0.9$. For each of these 3 runs, plot the value of $s_1$ for the first 10 iterations. How many iterations does each horizon setting require until convergence? Roughly how many iterations would each require if we set the convergence tolerance to $10^{-12}$? Explain intuitively why one of the settings performs so badly.

Submit your code.

6. Good things come to those who wait: implement the following model.

[Figure: a row of states; the state $s_0$ is marked, and the two extreme states are marked with dollar signs (\$\$\$ and \$).]

The agent can move Right or Left, and the actions always succeed. In the episodic-horizon setting, the two extreme states, marked with dollar signs, are absorbing. In the discounted infinite-horizon setting, moving to the extreme states resets the game to state $s_0$. Moving to the rightmost state gives reward 1 (costs -1). Moving to the leftmost state gives reward 6 (costs -6). Other actions reward nothing (cost 0).

Run valueIteration on this model with episodic horizon. What is the optimal policy?

Find theoretically the values of $\gamma$ for which the same policy is also optimal for the discounted infinite-horizon setting. Run valueIteration with this horizon setting, with $\gamma$ ranging from 0.5 to 0.99 in strides of 0.01. Plot the value of $s_0$ in each of these 50 runs.

Submit your code.
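As referenced in question 5, here is one possible encoding of that two-state model in the format expected by valueIteration. The SimpleNamespace container and the index convention ($s_1$ as index 0, $s_2$ as index 1) are assumptions; any object exposing the members listed in question 3 would do.

```python
# One possible encoding of the two-state model from question 5.
import numpy as np
from types import SimpleNamespace

sizeS, sizeA = 2, 2
p = np.zeros((sizeS, sizeA, sizeS))  # p[s, a, s_] = transition probability
c = np.zeros((sizeS, sizeA, sizeS))  # c[s, a, s_] = cost of that transition

# In s1 (index 0): action 0 stays in s1 at cost 0; action 1 moves to s2 at cost -1.
p[0, 0, 0] = 1.0
p[0, 1, 1] = 1.0
c[0, 1, 1] = -1.0

# In s2 (index 1): both actions return to s1 at cost 6.
p[1, :, 0] = 1.0
c[1, :, 0] = 6.0

model = SimpleNamespace(sizeS=sizeS, sizeA=sizeA, p=p, c=c)

# Example call for the discounted infinite horizon with discount 0.9:
# V = valueIteration(model, 0.9)
```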