Markov Decision Processes: Basic Concepts
Alan Fern

Some AI Planning Problems
- Fire & rescue response planning
- Helicopter control
- Solitaire
- Real-time strategy games
- Legged robot control
- Network security/control

More AI Planning Problems
- Health care: personalized treatment planning; hospital logistics/scheduling
- Transportation: autonomous vehicles; supply chain logistics; air traffic control
- Assistive technologies: dialog management; automated assistants for the elderly/disabled; household robots
- Sustainability: smart grid; forest fire management
- ...

Common Elements
- We have a system that changes state over time.
- We can (partially) control the system's state transitions by taking actions.
- The problem gives an objective that specifies which states (or state sequences) are more or less preferred.
- Problem: at each moment we must select an action so as to optimize the overall (long-term) objective, i.e., produce the most preferred state sequences.

Observe-Act Loop of AI Planning
The agent repeatedly observes the state of the world/system and responds with an action. Goal: maximize expected reward over its lifetime.

Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model
In the MDP model, the agent observes the world state and selects an action from a finite set. Goal: maximize expected reward over its lifetime.

Example MDP: a dice game
- The state describes all visible information about the game.
- The actions are the different choices of dice to roll (or the choice to select a category to score).
- Goal: maximize the score at the end of the game.

Markov Decision Processes
An MDP has four components, ⟨S, A, R, T⟩:
- A finite state set S.
- A finite action set A.
- A transition function T(s, a, s') = Pr(s' | s, a): the probability of going to state s' after taking action a in state s.
- A bounded, real-valued reward function R(s, a): the immediate reward we get for being in state s and taking action a.
Roughly speaking, the objective is to select actions in order to maximize total reward over time. For example, in a goal-based domain R(s, a) may equal 1 for goal states and 0 for all others (or -1 for non-goal states).

In the dice game, the actions available in a state include Roll(die1), Roll(die1, die2), ..., Roll(die1, die2, die3). Each roll is a probabilistic state transition (e.g., Roll(die1, die2) can lead to any outcome from 1,1 through 6,6). Reward is received only for "category selection" actions and equals the points gained.

What Is a Solution to an MDP?
MDP planning problem: the input is an MDP (S, A, R, T); what should the output be? Should the solution be just a sequence of actions such as (a1, a2, a3, ...)? Consider a single-player card game like Blackjack or Solitaire. No! In general an action sequence is not sufficient:
- Actions have stochastic effects, so the state we end up in is uncertain.
- This means we might end up in states where the remainder of the action sequence does not apply or is a bad choice.
A solution should tell us what the best action is for any possible situation/state that might arise.

Policies ("Plans" for MDPs)
- For this class we assume we are given a finite planning horizon H, i.e., we are told how many actions we will be allowed to take.
- A solution to an MDP is a policy that says what to do at any moment.
- Policies are functions from states and times to actions: π : S × T → A, where T is the non-negative integers (here T denotes time, not the transition function). π(s, t) tells us what action to take in state s when there are t stages-to-go.
- A policy that does not depend on t is called stationary; otherwise it is non-stationary.
A minimal code sketch of this policy representation follows.
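To make this concrete, here is a minimal Python sketch of a non-stationary policy for a small hypothetical MDP. The states, actions, transition probabilities, and rewards are invented for illustration; they are not part of the lecture's dice example.

```python
# A tiny hypothetical MDP with two states and two actions.
S = ["s0", "s1"]
A = ["stay", "move"]

# T[s][a] is the distribution over successor states: Pr(s' | s, a).
T = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "move": {"s0": 0.8, "s1": 0.2}},
}

# R[s][a] is the immediate reward for taking action a in state s.
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}

def pi(s, t):
    """Non-stationary policy pi(s, t): the action to take in state s
    with t stages-to-go."""
    if t <= 1:
        # Last stage: act greedily on immediate reward.
        return max(A, key=lambda a: R[s][a])
    # Earlier stages: head for the state with the larger reward.
    return "move"
```

A stationary policy would simply ignore the t argument.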
What Is a Solution to an MDP? (continued)
MDP planning problem: the input is an MDP (S, A, R, T); the output is a policy. But we don't want to output just any policy: we want a "good" policy, one that accumulates lots of reward.

Value of a Policy
How do we quantify the accumulated reward of a policy π? Every policy has a value function that indicates how good it is across states and time. $V_\pi^h(s)$ is the h-horizon value of policy π in state s: the expected total reward from starting in s and executing π for h steps,

$$V_\pi^h(s) = E\left[\sum_{t=0}^{h-1} R^t \;\middle|\; \pi, s\right],$$

where $R^t$ is a random variable denoting the reward received at time t when following π starting in s.

What Is a Solution to an MDP? (revised)
MDP planning problem: the input is an MDP (S, A, R, T); the output is a policy that achieves an "optimal value function". $V_\pi^H$ is optimal if, for all states s, $V_\pi^H(s)$ is no worse than the value of any other policy at s.

Theorem: for every MDP there exists an optimal policy (though perhaps not a unique one). That is, there is a single policy that is uniformly as good as any other policy across the entire state space.

Computational Problems
Two problems are of typical interest:
- Policy evaluation: given an MDP and a policy π, compute the finite-horizon value function $V_\pi^h(s)$ for any h and s.
- Policy optimization: given an MDP and a horizon H, compute the optimal finite-horizon policy.
How many finite-horizon policies are there? $|A|^{H|S|}$, so we cannot simply enumerate policies for efficient optimization.

Dynamic programming techniques can be used for both policy evaluation and optimization, in time polynomial in the number of states and actions (see http://web.engr.oregonstate.edu/~afern/classes/cs533/). Is polynomial time in the number of states and actions good? Not when these numbers are enormous, as is the case for most realistic applications: consider Klondike Solitaire, computer network control, etc. Enter Monte-Carlo planning.

Approaches for Large Worlds: Monte-Carlo Planning
Often a simulator of a planning domain is available or can be learned from data (e.g., fire & emergency response, Klondike Solitaire). Monte-Carlo planning computes a good policy for an MDP by interacting with an MDP simulator instead of the real world: the planner sends an action to the simulator and receives back a next state and reward.

Example Domains with Simulators
- Traffic simulators
- Robotics simulators
- Military campaign simulators
- Computer network simulators
- Emergency planning simulators (large-scale disaster and municipal)
- Sports domains
- Board games / video games (Go, RTS)
In many cases Monte-Carlo techniques yield state-of-the-art performance.

MDP: Simulation-Based Representation
A simulation-based representation gives ⟨S, A, R, T, I⟩:
- A finite state set S (|S| is generally very large).
- A finite action set A (|A| = m, assumed to be of reasonable size).
- A stochastic, real-valued, bounded reward function R(s, a) = r: stochastically returns a reward r given inputs s and a.
- A stochastic transition function T(s, a) = s' (i.e., a simulator): stochastically returns a state s' given inputs s and a, where the probability of returning s' is dictated by Pr(s' | s, a) of the MDP.
- A stochastic initial-state function I: stochastically returns a state according to an initial-state distribution.
These stochastic functions can be implemented in any programming language; a sketch follows.
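As a concrete illustration of the simulation-based interface, here is a Python sketch. The one-dimensional random-walk dynamics inside are invented placeholders, not a domain from the lecture; only the three-function interface (I, T, R) reflects the representation above.

```python
import random

ACTIONS = ["left", "right"]   # the action set A (|A| = m = 2 here)
NUM_STATES = 10               # states are simply 0..9 in this placeholder

def I():
    """Stochastic initial-state function: sample a state from the
    initial-state distribution (uniform in this placeholder)."""
    return random.randrange(NUM_STATES)

def T(s, a):
    """Stochastic transition simulator: sample s' ~ Pr(. | s, a).
    Placeholder dynamics: the intended move succeeds 80% of the time,
    otherwise the state is unchanged."""
    if random.random() < 0.8:
        step = -1 if a == "left" else 1
        return max(0, min(NUM_STATES - 1, s + step))
    return s

def R(s, a):
    """Stochastic, bounded reward: a noisy bonus at the right edge."""
    base = 1.0 if s == NUM_STATES - 1 else 0.0
    return base + random.uniform(-0.1, 0.1)
```

Note that a Monte-Carlo planner never inspects Pr(s' | s, a) directly; it only calls I, T, and R.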
Computational Problems (Simulation-Based Setting)
- Policy evaluation: given an MDP simulator, a policy π, a state s, and a horizon h, compute the finite-horizon value function $V_\pi^h(s)$.
- Policy optimization: given an MDP simulator, a horizon H, and a state s, compute the optimal finite-horizon policy at state s.

Trajectories
We can use the simulator to observe trajectories of any policy π from any state s. Let Traj(s, π, h) be a stochastic function that returns a length-h trajectory of π starting at s:

```
Traj(s, π, h)
  s0 = s
  For i = 1 to h-1
    si = T(s(i-1), π(s(i-1)))
  Return s0, s1, ..., s(h-1)
```

The total reward of a trajectory is

$$R(s_0, \ldots, s_{h-1}) = \sum_{i=0}^{h-1} R(s_i).$$

Policy Evaluation
Given a policy π and a state s, how can we estimate $V_\pi^h(s)$? Simply sample w trajectories starting at s and average the total reward received. Select a sampling width w and horizon h; the function EstimateV(s, π, h, w) returns an estimate of $V_\pi^h(s)$:

```
EstimateV(s, π, h, w)
  V = 0
  For i = 1 to w
    V = V + R(Traj(s, π, h))   ; add total reward of a trajectory
  Return V / w
```

How close to the true value $V_\pi^h(s)$ will this estimate be?

Sampling-Error Bound
The quantity we want is

$$V_\pi^h(s) = E\left[\sum_{t=0}^{h-1} R^t \;\middle|\; \pi, s\right] = E[R(\mathrm{Traj}(s, \pi, h))],$$

and the approximation due to sampling is

$$\mathrm{EstimateV}(s, \pi, h, w) = \frac{1}{w}\sum_{i=1}^{w} r_i, \quad r_i \sim R(\mathrm{Traj}(s, \pi, h)).$$

Note that the $r_i$ are i.i.d. samples of the random variable $R(\mathrm{Traj}(s, \pi, h))$. We can therefore apply the additive Chernoff bound, which bounds the difference between an expectation and an empirical average.

Aside: Additive Chernoff Bound
Let X be a random variable with maximum absolute value Z, and let $x_i$, i = 1, ..., w, be i.i.d. samples of X. The Chernoff bound bounds the probability that the average of the $x_i$ is far from E[X]: with probability at least $1 - \delta$,

$$\left| E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \right| \le Z\sqrt{\frac{1}{w}\ln\frac{1}{\delta}},$$

or equivalently

$$\Pr\left(\left| E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \right| \ge \epsilon\right) \le \exp\left(-w\left(\frac{\epsilon}{Z}\right)^2\right).$$

Aside: Coin-Flip Example
Suppose we have a coin with probability of heads equal to p. Let X be a random variable with X = 1 if the coin flip gives heads and 0 otherwise (so Z from the bound is 1). Then E[X] = 1·p + 0·(1-p) = p. After flipping the coin w times, we can estimate the heads probability by the average of the $x_i$. The Chernoff bound tells us that this estimate converges exponentially fast to the true mean (the coin's bias) p:

$$\Pr\left(\left| p - \frac{1}{w}\sum_{i=1}^{w} x_i \right| \ge \epsilon\right) \le \exp(-w\epsilon^2).$$

Sampling-Error Bound (continued)
Applying the Chernoff bound to the trajectory rewards, with Z = $V_{\max}$ (a bound on the absolute total reward of an h-step trajectory), we get that with probability at least $1 - \delta$,

$$\left| V_\pi^h(s) - \mathrm{EstimateV}(s, \pi, h, w) \right| \le V_{\max}\sqrt{\frac{1}{w}\ln\frac{1}{\delta}}.$$

We can increase w to get an arbitrarily good approximation; a runnable version of the estimator follows.
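The Traj and EstimateV procedures translate almost line-for-line into Python. In this sketch, T is assumed to be a transition-simulator function like the one above, pi_fn a policy function, and R_state a state-based reward function matching the trajectory-reward definition R(s0) + ... + R(s_{h-1}); the packaging as plain functions is my assumption, not a specification from the lecture.

```python
def traj(s, pi_fn, T, h):
    """Sample a length-h trajectory of policy pi_fn starting at s."""
    states = [s]
    for _ in range(h - 1):
        s = T(s, pi_fn(s))          # one simulated step of the policy
        states.append(s)
    return states

def total_reward(states, R_state):
    """Total reward R(s0) + ... + R(s_{h-1}) of a trajectory."""
    return sum(R_state(s) for s in states)

def estimate_v(s, pi_fn, T, R_state, h, w):
    """Estimate V_pi^h(s): average total reward over w sampled trajectories."""
    return sum(total_reward(traj(s, pi_fn, T, h), R_state)
               for _ in range(w)) / w
```

Per the bound above, the error shrinks as $1/\sqrt{w}$: halving the error requires roughly quadrupling the sampling width w.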
Two-Player MDPs (aka Markov Games)
- So far we have only discussed single-player MDPs/games.
- Your labs and competition will be two-player zero-sum games (zero-sum means the sum of the players' rewards is zero).
- We assume players take turns (non-simultaneous moves): the two players alternately send actions to the Markov game, each observing the resulting state and reward.

Simulators for Two-Player Games
A simulation-based representation gives ⟨S, A1, A2, R1, R2, T, I⟩:
- A finite state set S (assume the state encodes whose turn it is).
- Action sets A1 and A2 for player 1 (called max) and player 2 (called min).
- Stochastic, real-valued, bounded reward functions R1(s, a) and R2(s, a) that stochastically return rewards r1 and r2 for max and min respectively. In our zero-sum case we assume r1 = -r2, so min maximizes its reward by minimizing the reward of max.
- A stochastic transition function T(s, a) = s' (i.e., a simulator): stochastically returns a state s' given inputs s and a, where the probability of returning s' is dictated by Pr(s' | s, a) of the game. Generally s' will be a turn for the other player.
- A stochastic initial-state function I.
These stochastic functions can be implemented in any programming language.

Finite-Horizon Value of a Game
Given two policies π1 and π2, one for each player, we can define the finite-horizon value: $V_{\pi_1,\pi_2}^h(s)$ is the h-horizon value function with respect to max, i.e., the expected total reward for player 1 after executing π1 and π2 starting in s for h steps. For zero-sum games the value with respect to player 2 is just $-V_{\pi_1,\pi_2}^h(s)$. Given π1 and π2 we can easily use the simulator to estimate $V_{\pi_1,\pi_2}^h(s)$, just as we did in the single-player MDP setting (a sketch follows the summary).

Summary
- Markov Decision Processes (MDPs) are common models for sequential planning problems.
- The solution to an MDP is a policy.
- The goodness of a policy is measured by its value function: expected total reward over H steps.
- Monte-Carlo planning (MCP) is used for enormous MDPs for which we have a simulator.
- Evaluating a policy via MCP is very easy.
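Finally, here is a minimal sketch of Monte-Carlo evaluation for a fixed policy pair in a turn-taking zero-sum game. The whose_turn helper and the packaging of the simulator as plain functions are assumptions for illustration; the lecture only specifies that the state encodes whose turn it is.

```python
def game_traj_reward(s, pi1, pi2, T, R1, whose_turn, h):
    """Total max-player reward along one simulated h-step game trajectory.

    pi1, pi2: policies for max and min; T: transition simulator;
    R1(s, a): stochastic reward to max (min's reward is -R1(s, a));
    whose_turn(s): returns 1 or 2, read off the state (assumed helper)."""
    total = 0.0
    for _ in range(h):
        a = pi1(s) if whose_turn(s) == 1 else pi2(s)
        total += R1(s, a)
        s = T(s, a)
    return total

def estimate_game_value(s, pi1, pi2, T, R1, whose_turn, h, w):
    """Estimate V^h_{pi1,pi2}(s) w.r.t. max by averaging w trajectories."""
    return sum(game_traj_reward(s, pi1, pi2, T, R1, whose_turn, h)
               for _ in range(w)) / w
```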