Markov Decision Processes: Basic Concepts
Alan Fern

Some AI Planning Problems
- Fire & rescue response planning
- Helicopter control
- Solitaire
- Real-time strategy games
- Legged robot control
- Network security/control

More AI Planning Problems
- Health care: personalized treatment planning; hospital logistics/scheduling
- Transportation: autonomous vehicles; supply chain logistics; air traffic control
- Assistive technologies: dialog management; automated assistants for the elderly/disabled; household robots
- Sustainability: smart grid; forest fire management
- ...

Common Elements
- We have a system that changes state over time.
- We can (partially) control the system's state transitions by taking actions.
- The problem gives an objective that specifies which states (or state sequences) are more or less preferred.
- Problem: at each moment we must select an action so as to optimize the overall (long-term) objective, i.e., produce the most preferred state sequences.

Observe-Act Loop of AI Planning
The agent repeatedly observes the state of the world/system and responds with an action. Goal: maximize expected reward over its lifetime.

Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model
In the MDP model, the agent observes the world state and selects an action from a finite set. Goal: maximize expected reward over its lifetime.

Example MDP: a dice game
- The state describes all visible information about the game.
- The actions are the different choices of dice to roll (or the choice to select a category to score).
- Goal: maximize the score at the end of the game.

Markov Decision Processes
An MDP has four components, ⟨S, A, R, T⟩:
- A finite state set S.
- A finite action set A.
- A transition function T(s, a, s') = Pr(s' | s, a): the probability of going to state s' after taking action a in state s.
- A bounded, real-valued reward function R(s, a): the immediate reward we get for being in state s and taking action a.
Roughly speaking, the objective is to select actions in order to maximize total reward over time. For example, in a goal-based domain R(s, a) may equal 1 for goal states and 0 for all others (or -1 for non-goal states).

In the dice game, the actions available in a state include Roll(die1), Roll(die1, die2), ..., Roll(die1, die2, die3). Each roll is a probabilistic state transition (e.g., Roll(die1, die2) can lead to any outcome from 1,1 through 6,6). Reward is received only for "category selection" actions and equals the points gained.

What Is a Solution to an MDP?
MDP planning problem: the input is an MDP (S, A, R, T); what should the output be? Should the solution be just a sequence of actions such as (a1, a2, a3, ...)? Consider a single-player card game like Blackjack or Solitaire. No! In general an action sequence is not sufficient:
- Actions have stochastic effects, so the state we end up in is uncertain.
- This means we might end up in states where the remainder of the action sequence does not apply or is a bad choice.
A solution should tell us what the best action is for any possible situation/state that might arise.

Policies ("Plans" for MDPs)
- For this class we assume we are given a finite planning horizon H, i.e., we are told how many actions we will be allowed to take.
- A solution to an MDP is a policy that says what to do at any moment.
- Policies are functions from states and times to actions: π : S × T → A, where T is the non-negative integers (here T denotes time, not the transition function). π(s, t) tells us what action to take in state s when there are t stages-to-go.
- A policy that does not depend on t is called stationary; otherwise it is non-stationary.
A minimal code sketch of this policy representation follows.
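To make this concrete, here is a minimal Python sketch of a non-stationary policy for a small hypothetical MDP. The states, actions, transition probabilities, and rewards are invented for illustration; they are not part of the lecture's dice example.

```python
# A tiny hypothetical MDP with two states and two actions.
S = ["s0", "s1"]
A = ["stay", "move"]

# T[s][a] is the distribution over successor states: Pr(s' | s, a).
T = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "move": {"s0": 0.8, "s1": 0.2}},
}

# R[s][a] is the immediate reward for taking action a in state s.
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}

def pi(s, t):
    """Non-stationary policy pi(s, t): the action to take in state s
    with t stages-to-go."""
    if t <= 1:
        # Last stage: act greedily on immediate reward.
        return max(A, key=lambda a: R[s][a])
    # Earlier stages: head for the state with the larger reward.
    return "move"
```

A stationary policy would simply ignore the t argument.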
What Is a Solution to an MDP? (continued)
MDP planning problem: the input is an MDP (S, A, R, T); the output is a policy. But we don't want to output just any policy: we want a "good" policy, one that accumulates lots of reward.

Value of a Policy
How do we quantify the accumulated reward of a policy π? Every policy has a value function that indicates how good it is across states and time. $V_\pi^h(s)$ is the h-horizon value of policy π in state s: the expected total reward from starting in s and executing π for h steps,

$$V_\pi^h(s) = E\left[\sum_{t=0}^{h-1} R^t \;\middle|\; \pi, s\right],$$

where $R^t$ is a random variable denoting the reward received at time t when following π starting in s.

What Is a Solution to an MDP? (revised)
MDP planning problem: the input is an MDP (S, A, R, T); the output is a policy that achieves an "optimal value function". $V_\pi^H$ is optimal if, for all states s, $V_\pi^H(s)$ is no worse than the value of any other policy at s.

Theorem: for every MDP there exists an optimal policy (though perhaps not a unique one). That is, there is a single policy that is uniformly as good as any other policy across the entire state space.

Computational Problems
Two problems are of typical interest:
- Policy evaluation: given an MDP and a policy π, compute the finite-horizon value function $V_\pi^h(s)$ for any h and s.
- Policy optimization: given an MDP and a horizon H, compute the optimal finite-horizon policy.
How many finite-horizon policies are there? $|A|^{H|S|}$, so we cannot simply enumerate policies for efficient optimization.

Dynamic programming techniques can be used for both policy evaluation and optimization, in time polynomial in the number of states and actions (see http://web.engr.oregonstate.edu/~afern/classes/cs533/). Is polynomial time in the number of states and actions good? Not when these numbers are enormous, as is the case for most realistic applications: consider Klondike Solitaire, computer network control, etc. Enter Monte-Carlo planning.

Approaches for Large Worlds: Monte-Carlo Planning
Often a simulator of a planning domain is available or can be learned from data (e.g., fire & emergency response, Klondike Solitaire). Monte-Carlo planning computes a good policy for an MDP by interacting with an MDP simulator instead of the real world: the planner sends an action to the simulator and receives back a next state and reward.

Example Domains with Simulators
- Traffic simulators
- Robotics simulators
- Military campaign simulators
- Computer network simulators
- Emergency planning simulators (large-scale disaster and municipal)
- Sports domains
- Board games / video games (Go, RTS)
In many cases Monte-Carlo techniques yield state-of-the-art performance.

MDP: Simulation-Based Representation
A simulation-based representation gives ⟨S, A, R, T, I⟩:
- A finite state set S (|S| is generally very large).
- A finite action set A (|A| = m, assumed to be of reasonable size).
- A stochastic, real-valued, bounded reward function R(s, a) = r: stochastically returns a reward r given inputs s and a.
- A stochastic transition function T(s, a) = s' (i.e., a simulator): stochastically returns a state s' given inputs s and a, where the probability of returning s' is dictated by Pr(s' | s, a) of the MDP.
- A stochastic initial-state function I: stochastically returns a state according to an initial-state distribution.
These stochastic functions can be implemented in any programming language; a sketch follows.
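As a concrete illustration of the simulation-based interface, here is a Python sketch. The one-dimensional random-walk dynamics inside are invented placeholders, not a domain from the lecture; only the three-function interface (I, T, R) reflects the representation above.

```python
import random

ACTIONS = ["left", "right"]   # the action set A (|A| = m = 2 here)
NUM_STATES = 10               # states are simply 0..9 in this placeholder

def I():
    """Stochastic initial-state function: sample a state from the
    initial-state distribution (uniform in this placeholder)."""
    return random.randrange(NUM_STATES)

def T(s, a):
    """Stochastic transition simulator: sample s' ~ Pr(. | s, a).
    Placeholder dynamics: the intended move succeeds 80% of the time,
    otherwise the state is unchanged."""
    if random.random() < 0.8:
        step = -1 if a == "left" else 1
        return max(0, min(NUM_STATES - 1, s + step))
    return s

def R(s, a):
    """Stochastic, bounded reward: a noisy bonus at the right edge."""
    base = 1.0 if s == NUM_STATES - 1 else 0.0
    return base + random.uniform(-0.1, 0.1)
```

Note that a Monte-Carlo planner never inspects Pr(s' | s, a) directly; it only calls I, T, and R.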
Computational Problems (Simulation-Based Setting)
- Policy evaluation: given an MDP simulator, a policy π, a state s, and a horizon h, compute the finite-horizon value function $V_\pi^h(s)$.
- Policy optimization: given an MDP simulator, a horizon H, and a state s, compute the optimal finite-horizon policy at state s.

Trajectories
We can use the simulator to observe trajectories of any policy π from any state s. Let Traj(s, π, h) be a stochastic function that returns a length-h trajectory of π starting at s:

```
Traj(s, π, h)
  s0 = s
  For i = 1 to h-1
    si = T(s(i-1), π(s(i-1)))
  Return s0, s1, ..., s(h-1)
```

The total reward of a trajectory is

$$R(s_0, \ldots, s_{h-1}) = \sum_{i=0}^{h-1} R(s_i).$$

Policy Evaluation
Given a policy π and a state s, how can we estimate $V_\pi^h(s)$? Simply sample w trajectories starting at s and average the total reward received. Select a sampling width w and horizon h; the function EstimateV(s, π, h, w) returns an estimate of $V_\pi^h(s)$:

```
EstimateV(s, π, h, w)
  V = 0
  For i = 1 to w
    V = V + R(Traj(s, π, h))   ; add total reward of a trajectory
  Return V / w
```

How close to the true value $V_\pi^h(s)$ will this estimate be?

Sampling-Error Bound
The quantity we want is

$$V_\pi^h(s) = E\left[\sum_{t=0}^{h-1} R^t \;\middle|\; \pi, s\right] = E[R(\mathrm{Traj}(s, \pi, h))],$$

and the approximation due to sampling is

$$\mathrm{EstimateV}(s, \pi, h, w) = \frac{1}{w}\sum_{i=1}^{w} r_i, \quad r_i \sim R(\mathrm{Traj}(s, \pi, h)).$$

Note that the $r_i$ are i.i.d. samples of the random variable $R(\mathrm{Traj}(s, \pi, h))$. We can therefore apply the additive Chernoff bound, which bounds the difference between an expectation and an empirical average.

Aside: Additive Chernoff Bound
Let X be a random variable with maximum absolute value Z, and let $x_i$, i = 1, ..., w, be i.i.d. samples of X. The Chernoff bound bounds the probability that the average of the $x_i$ is far from E[X]: with probability at least $1 - \delta$,

$$\left| E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \right| \le Z\sqrt{\frac{1}{w}\ln\frac{1}{\delta}},$$

or equivalently

$$\Pr\left(\left| E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \right| \ge \epsilon\right) \le \exp\left(-w\left(\frac{\epsilon}{Z}\right)^2\right).$$

Aside: Coin-Flip Example
Suppose we have a coin with probability of heads equal to p. Let X be a random variable with X = 1 if the coin flip gives heads and 0 otherwise (so Z from the bound is 1). Then E[X] = 1·p + 0·(1-p) = p. After flipping the coin w times, we can estimate the heads probability by the average of the $x_i$. The Chernoff bound tells us that this estimate converges exponentially fast to the true mean (the coin's bias) p:

$$\Pr\left(\left| p - \frac{1}{w}\sum_{i=1}^{w} x_i \right| \ge \epsilon\right) \le \exp(-w\epsilon^2).$$

Sampling-Error Bound (continued)
Applying the Chernoff bound to the trajectory rewards, with Z = $V_{\max}$ (a bound on the absolute total reward of an h-step trajectory), we get that with probability at least $1 - \delta$,

$$\left| V_\pi^h(s) - \mathrm{EstimateV}(s, \pi, h, w) \right| \le V_{\max}\sqrt{\frac{1}{w}\ln\frac{1}{\delta}}.$$

We can increase w to get an arbitrarily good approximation; a runnable version of the estimator follows.
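The Traj and EstimateV procedures translate almost line-for-line into Python. In this sketch, T is assumed to be a transition-simulator function like the one above, pi_fn a policy function, and R_state a state-based reward function matching the trajectory-reward definition R(s0) + ... + R(s_{h-1}); the packaging as plain functions is my assumption, not a specification from the lecture.

```python
def traj(s, pi_fn, T, h):
    """Sample a length-h trajectory of policy pi_fn starting at s."""
    states = [s]
    for _ in range(h - 1):
        s = T(s, pi_fn(s))          # one simulated step of the policy
        states.append(s)
    return states

def total_reward(states, R_state):
    """Total reward R(s0) + ... + R(s_{h-1}) of a trajectory."""
    return sum(R_state(s) for s in states)

def estimate_v(s, pi_fn, T, R_state, h, w):
    """Estimate V_pi^h(s): average total reward over w sampled trajectories."""
    return sum(total_reward(traj(s, pi_fn, T, h), R_state)
               for _ in range(w)) / w
```

Per the bound above, the error shrinks as $1/\sqrt{w}$: halving the error requires roughly quadrupling the sampling width w.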
Two-Player MDPs (aka Markov Games)
- So far we have only discussed single-player MDPs/games.
- Your labs and competition will be two-player zero-sum games (zero-sum means the sum of the players' rewards is zero).
- We assume players take turns (non-simultaneous moves): the two players alternately send actions to the Markov game, each observing the resulting state and reward.

Simulators for Two-Player Games
A simulation-based representation gives ⟨S, A1, A2, R1, R2, T, I⟩:
- A finite state set S (assume the state encodes whose turn it is).
- Action sets A1 and A2 for player 1 (called max) and player 2 (called min).
- Stochastic, real-valued, bounded reward functions R1(s, a) and R2(s, a) that stochastically return rewards r1 and r2 for max and min respectively. In our zero-sum case we assume r1 = -r2, so min maximizes its reward by minimizing the reward of max.
- A stochastic transition function T(s, a) = s' (i.e., a simulator): stochastically returns a state s' given inputs s and a, where the probability of returning s' is dictated by Pr(s' | s, a) of the game. Generally s' will be a turn for the other player.
- A stochastic initial-state function I.
These stochastic functions can be implemented in any programming language.

Finite-Horizon Value of a Game
Given two policies π1 and π2, one for each player, we can define the finite-horizon value: $V_{\pi_1,\pi_2}^h(s)$ is the h-horizon value function with respect to max, i.e., the expected total reward for player 1 after executing π1 and π2 starting in s for h steps. For zero-sum games the value with respect to player 2 is just $-V_{\pi_1,\pi_2}^h(s)$. Given π1 and π2 we can easily use the simulator to estimate $V_{\pi_1,\pi_2}^h(s)$, just as we did in the single-player MDP setting (a sketch follows the summary).

Summary
- Markov Decision Processes (MDPs) are common models for sequential planning problems.
- The solution to an MDP is a policy.
- The goodness of a policy is measured by its value function: expected total reward over H steps.
- Monte-Carlo planning (MCP) is used for enormous MDPs for which we have a simulator.
- Evaluating a policy via MCP is very easy.
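Finally, here is a minimal sketch of Monte-Carlo evaluation for a fixed policy pair in a turn-taking zero-sum game. The whose_turn helper and the packaging of the simulator as plain functions are assumptions for illustration; the lecture only specifies that the state encodes whose turn it is.

```python
def game_traj_reward(s, pi1, pi2, T, R1, whose_turn, h):
    """Total max-player reward along one simulated h-step game trajectory.

    pi1, pi2: policies for max and min; T: transition simulator;
    R1(s, a): stochastic reward to max (min's reward is -R1(s, a));
    whose_turn(s): returns 1 or 2, read off the state (assumed helper)."""
    total = 0.0
    for _ in range(h):
        a = pi1(s) if whose_turn(s) == 1 else pi2(s)
        total += R1(s, a)
        s = T(s, a)
    return total

def estimate_game_value(s, pi1, pi2, T, R1, whose_turn, h, w):
    """Estimate V^h_{pi1,pi2}(s) w.r.t. max by averaging w trajectories."""
    return sum(game_traj_reward(s, pi1, pi2, T, R1, whose_turn, h)
               for _ in range(w)) / w
```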