Markov Decision Processes: Basic Concepts
Alan Fern

Some AI Planning Problems
- Fire & Rescue Response Planning
- Helicopter Control
- Solitaire
- Real-Time Strategy Games
- Legged Robot Control
- Network Security/Control

Some AI Planning Problems
- Health Care: personalized treatment planning, hospital logistics/scheduling
- Transportation: autonomous vehicles, supply chain logistics, air traffic control
- Assistive Technologies: dialog management, automated assistants for the elderly/disabled, household robots
- Sustainability: smart grid, forest fire management
- …

Common Elements
- We have a system that changes state over time.
- We can (partially) control the system's state transitions by taking actions.
- The problem gives an objective that specifies which states (or state sequences) are more or less preferred.
- Problem: at each moment we must select an action to optimize the overall (long-term) objective, i.e., produce the most preferred state sequences.

Observe-Act Loop of AI Planning Agent
- (Diagram: the agent observes the state of the world/system and sends actions back to it in a loop.)
- Goal: maximize expected reward over the agent's lifetime.

Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model
- (Diagram: the world is modeled as a Markov Decision Process; the agent observes the world state and chooses an action from a finite set.)
- Goal: maximize expected reward over the agent's lifetime.

Example MDP (dice game)
- The state describes all visible information about the dice.
- Actions are the different choices of dice to roll.
- Goal: maximize the score, or score more than the opponent.

Markov Decision Processes
An MDP has four components, S, A, R, T:
- finite state set S (|S| = n)
- finite action set A (|A| = m)
- transition function T(s,a,s') = Pr(s' | s,a): the probability of going to state s' after taking action a in state s
- bounded, real-valued reward function R(s,a): the immediate reward we get for being in state s and taking action a
Roughly speaking, the objective is to select actions in order to maximize total reward. For example, in a goal-based domain R(s,a) may equal 1 for goal states and 0 for all others (or -1 for non-goal states).
(Example transition diagram: from a dice state, the action "roll the two 5's" leads probabilistically to successor states 1,1; 1,2; …; 6,6.)

What is a solution to an MDP?
MDP Planning Problem:
- Input: an MDP (S,A,R,T)
- Output: ????
Should the solution to an MDP be just a sequence of actions such as (a1, a2, a3, …)? Consider a single-player card game like Blackjack/Solitaire.
No! In general an action sequence is not sufficient:
- Actions have stochastic effects, so the state we end up in is uncertain.
- This means we might end up in states where the remainder of the action sequence doesn't apply, or is a bad choice.
- A solution should tell us what the best action is for any possible situation/state that might arise.

Policies ("plans" for MDPs)
- A solution to an MDP is a policy. There are two types of policies: nonstationary and stationary.
- Nonstationary policies are used when we are given a finite planning horizon H, i.e., we are told how many actions we will be allowed to take.
- Nonstationary policies are functions from states and times to actions, π : S x T → A, where T is the non-negative integers.
- π(s,t) tells us what action to take at state s when there are t stages-to-go (note that we use the convention that t represents stages/decisions-to-go, rather than the time step).
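To make the (S, A, R, T) components and the two policy types concrete, here is a minimal sketch in Python of a tiny, made-up two-state MDP. The state names, transition probabilities, rewards, and policy choices below are purely illustrative assumptions, not part of the lecture.

```python
# A tiny, hypothetical MDP written out explicitly as (S, A, R, T).
# All concrete names and numbers below are illustrative, not from the slides.

S = ["cool", "hot"]                     # finite state set, |S| = n = 2
A = ["run", "idle"]                     # finite action set, |A| = m = 2

# T[s][a][s'] = Pr(s' | s, a): probability of reaching s' after action a in s.
T = {
    "cool": {"run":  {"cool": 0.7, "hot": 0.3},
             "idle": {"cool": 1.0, "hot": 0.0}},
    "hot":  {"run":  {"cool": 0.1, "hot": 0.9},
             "idle": {"cool": 0.5, "hot": 0.5}},
}

# R(s, a): bounded, real-valued immediate reward.
def R(s, a):
    return 2.0 if a == "run" else 1.0   # running earns more immediate reward

# Stationary policy: pi(s) depends only on the state.
def stationary_pi(s):
    return "run" if s == "cool" else "idle"

# Nonstationary policy: pi(s, t) may also depend on the stages-to-go t.
def nonstationary_pi(s, t):
    if t <= 1:                          # last decision: grab the larger reward
        return "run"
    return stationary_pi(s)
```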
Policies ("plans" for MDPs), continued
- What if we want to continue taking actions indefinitely? Then we use stationary policies.
- A stationary policy is a mapping from states to actions, π : S → A; π(s) is the action to take at state s (regardless of time). It specifies a continuously reactive controller.
- Note that both nonstationary and stationary policies assume (or have) these properties: full observability of the state, history-independence, and deterministic action choice.

What is a solution to an MDP?
MDP Planning Problem:
- Input: an MDP (S,A,R,T)
- Output: a policy such that ????
We don't want to output just any policy; we want to output a "good" policy, one that accumulates lots of reward.

Value of a Policy
- How good is a policy π? How do we measure the reward "accumulated" by π?
- A value function V : S → ℝ associates a value with each state (or with each state and time, for a nonstationary π).
- Vπ(s) denotes the value of policy π at state s. It depends on the immediate reward, but also on what is achieved subsequently by following π.
- An optimal policy is one that is no worse than any other policy at any state.
- The goal of MDP planning is to compute an optimal policy.

What is a solution to an MDP?
MDP Planning Problem:
- Input: an MDP (S,A,R,T)
- Output: a policy that achieves an "optimal value"
This depends on how we define the value of a policy. There are several choices, and the solution algorithms depend on the choice. We will consider finite-horizon value.

Finite-Horizon Value Functions
- We first consider maximizing expected total reward over a finite horizon: the agent has H time steps "to live".
- To act optimally, should the agent use a stationary or nonstationary policy? That is, should the action it takes depend on absolute time?
- Put another way: if you had only one week to live, would you act the same way as if you had fifty years to live?

Finite Horizon Problems
- Value (utility) depends on the stages-to-go, hence we use a nonstationary policy.
- $V^k_\pi(s)$ is the k-stage-to-go value function for π: the expected total reward for executing π starting in s for k time steps:

$$V^k_\pi(s) \;=\; E_\pi\!\left[\,\sum_{t=0}^{k} R^t \,\middle|\, \pi, s\right] \;=\; E\!\left[\,\sum_{t=0}^{k} R(s^t) \,\middle|\, a^t = \pi(s^t, k-t),\; s^0 = s\right]$$

- Here R^t and s^t are random variables denoting the reward received and the state at time step t when starting in s. They are random variables because the world is stochastic.

Computational Problems
There are two problems of typical interest:
- Policy evaluation: given an MDP and a policy π, compute the finite-horizon value function V^k_π(s) for any k and s.
- Policy optimization: given an MDP and a horizon H, compute the optimal finite-horizon policy.
How many finite-horizon policies are there? |A|^{Hn}. So we can't just enumerate policies for efficient optimization.

Computational Problems
- Dynamic programming techniques can be used for both policy evaluation and optimization, in time polynomial in the number of states and actions (see http://web.engr.oregonstate.edu/~afern/classes/cs533/). A sketch of finite-horizon policy evaluation by dynamic programming appears below.
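As a concrete illustration of the dynamic-programming claim above, here is a short sketch of exact finite-horizon policy evaluation. It assumes the dict-based S, T, R and the nonstationary policy pi(s, t) from the earlier illustrative sketch, and it adopts one base-case convention (V^0 matches the slide's sum over t = 0..k, so a k-stage-to-go value contains k+1 reward terms); the recurrence V^k_π(s) = R(s, π(s,k)) + Σ_{s'} Pr(s'|s,a) V^{k-1}_π(s') is the standard backup, but the variable names are my own.

```python
# Exact finite-horizon policy evaluation by dynamic programming -- a sketch only.
# S, T, R, and pi(s, t) are assumed to follow the earlier illustrative MDP sketch.

def evaluate_policy(S, T, R, pi, H):
    """Return V where V[k][s] is the k-stage-to-go value of nonstationary policy pi."""
    # Base case matching the slide's sum over t = 0..k: with 0 stages-to-go the
    # agent still takes one action and collects its immediate reward.
    V = {0: {s: R(s, pi(s, 0)) for s in S}}
    for k in range(1, H + 1):
        V[k] = {}
        for s in S:
            a = pi(s, k)                                 # action with k stages-to-go
            expected_future = sum(prob * V[k - 1][s2]    # expectation of V^{k-1} over s'
                                  for s2, prob in T[s][a].items())
            V[k][s] = R(s, a) + expected_future
    return V

# Usage (with the earlier sketch): V = evaluate_policy(S, T, R, nonstationary_pi, H=5)
# V[5]["cool"] is then the 5-stage-to-go value at state "cool".
```

Each stage touches every (state, action, successor) triple at most once, which is where the "polynomial in the number of states and actions" running time comes from.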
Is polynomial time in the number of states and actions good? Not when these numbers are enormous, as is the case for most realistic applications (consider Klondike Solitaire, computer network control, etc.). Enter Monte-Carlo planning.

Approaches for Large Worlds: Monte-Carlo Planning
- Often a simulator of a planning domain is available or can be learned from data (e.g., fire & emergency response, Klondike Solitaire).

Large Worlds: Monte-Carlo Approach
- Often a simulator of a planning domain is available or can be learned from data.
- Monte-Carlo planning: compute a good policy for an MDP by interacting with an MDP simulator.
- (Diagram: the planner sends an action to the world simulator, which stands in for the real world and returns a state and reward.)

Example Domains with Simulators
- Traffic simulators
- Robotics simulators
- Military campaign simulators
- Computer network simulators
- Emergency planning simulators (large-scale disaster and municipal)
- Sports domains
- Board games / video games (Go / RTS)
In many cases Monte-Carlo techniques yield state-of-the-art performance, even in domains where model-based planners are applicable.

MDP: Simulation-Based Representation
A simulation-based representation gives S, A, R, T, I:
- finite state set S (|S| = n, generally very large)
- finite action set A (|A| = m, assumed to be of reasonable size)
- stochastic, real-valued, bounded reward function R(s,a) = r: stochastically returns a reward r given inputs s and a
- stochastic transition function T(s,a) = s' (i.e. a simulator): stochastically returns a state s' given inputs s and a, where the probability of returning s' is dictated by Pr(s' | s,a) of the MDP
- stochastic initial-state function I: stochastically returns a state according to an initial state distribution
These stochastic functions can be implemented in any language!

Computational Problems
- Policy evaluation: given an MDP simulator, a policy π, a state s, and a horizon h, compute the finite-horizon value function V^h_π(s).
- Policy optimization: given an MDP simulator, a horizon H, and a state s, compute the optimal finite-horizon policy at state s.

Trajectories
We can use the simulator to observe trajectories of any policy π from any state s. Let Traj(s, π, h) be a stochastic function that returns a length-h trajectory of π starting at s:

  Traj(s, π, h)
    s0 = s
    For i = 1 to h-1
      si = T(s_{i-1}, π(s_{i-1}))
    Return s0, s1, …, s_{h-1}

The total reward of a trajectory is given by

$$R(s_0, \ldots, s_{h-1}) \;=\; \sum_{i=0}^{h-1} R(s_i)$$

Policy Evaluation
Given a policy π and a state s, how can we estimate V^h_π(s)? Simply sample w trajectories starting at s and average the total reward received. Select a sampling width w and horizon h; the function EstimateV(s, π, h, w) returns an estimate of V^h_π(s):

  EstimateV(s, π, h, w)
    V = 0
    For i = 1 to w
      V = V + R( Traj(s, π, h) )   ; add the total reward of a trajectory
    Return V / w

How close to the true value V^h_π(s) will this estimate be? A runnable version of these two procedures is sketched below.
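The following is a minimal runnable sketch of Traj and EstimateV. The names sample_T(s, a) and reward(s) are hypothetical stand-ins for the simulator's stochastic transition function T(s, a) and the per-state reward R(s_i) used in the trajectory sum above, and the toy simulator at the bottom is invented purely so the sketch runs end to end.

```python
import random

# A minimal runnable sketch of Traj and EstimateV from the slides.
# sample_T(s, a) and reward(s) are assumed stand-ins for the simulator's
# stochastic transition function and per-state reward.

def traj(s, pi, h, sample_T):
    """Return a length-h trajectory s0, ..., s_{h-1} of policy pi starting at s."""
    states = [s]
    for _ in range(h - 1):
        s = sample_T(s, pi(s))
        states.append(s)
    return states

def estimate_v(s, pi, h, w, sample_T, reward):
    """Average total reward over w sampled length-h trajectories of pi from s."""
    total = 0.0
    for _ in range(w):
        total += sum(reward(si) for si in traj(s, pi, h, sample_T))
    return total / w

# Purely illustrative simulator stubs (not a real domain):
def sample_T(s, a):
    # would draw s' with probability Pr(s' | s, a); here just a dummy coin flip
    return s if random.random() < 0.5 else s + 1

def reward(s):
    return 1.0 if s % 2 == 0 else 0.0

print(estimate_v(0, lambda s: "noop", h=10, w=1000,
                 sample_T=sample_T, reward=reward))
```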
Sampling-Error Bound
The quantity we want is

$$V^h_\pi(s) \;=\; E_\pi\!\left[\,\sum_{t=0}^{h-1} R^t \,\middle|\, \pi, s\right] \;=\; E\big[\,R(\mathrm{Traj}(s, \pi, h))\,\big]$$

and the approximation due to sampling is

$$\mathrm{EstimateV}(s, \pi, h, w) \;=\; \frac{1}{w}\sum_{i=1}^{w} r_i, \qquad r_i \sim R(\mathrm{Traj}(s, \pi, h))$$

- Note that the r_i are samples of the random variable R(Traj(s, π, h)).
- We can apply the additive Chernoff bound, which bounds the difference between an expectation and an empirical average.

Aside: Additive Chernoff Bound
- Let X be a random variable with maximum absolute value Z, and let x_i, i = 1, …, w, be i.i.d. samples of X.
- The Chernoff bound bounds the probability that the average of the x_i is far from E[X]: with probability at least 1 - δ we have

$$E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \;\le\; Z\sqrt{\frac{\ln\frac{1}{\delta}}{w}}$$

equivalently,

$$\Pr\!\left(E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \ge \epsilon\right) \;\le\; \exp\!\left(-w\left(\frac{\epsilon}{Z}\right)^{2}\right)$$

Aside: Coin Flip Example
- Suppose we have a coin with probability of heads equal to p.
- Let X be a random variable with X = 1 if the coin flip gives heads and 0 otherwise (so the Z from the bound is 1). Then E[X] = 1·p + 0·(1-p) = p.
- After flipping the coin w times we can estimate the heads probability by the average of the x_i.
- The Chernoff bound tells us that this estimate converges exponentially fast to the true mean (the coin bias) p:

$$\Pr\!\left(p - \frac{1}{w}\sum_{i=1}^{w} x_i \ge \epsilon\right) \;\le\; \exp\!\left(-w\epsilon^{2}\right)$$

Sampling Error Bound (continued)
Applying the Chernoff bound to the samples r_i of R(Traj(s, π, h)) above, we get that

$$V^h_\pi(s) - \mathrm{EstimateV}(s, \pi, h, w) \;\le\; V^{h}_{\max}\sqrt{\frac{\ln\frac{1}{\delta}}{w}}$$

with probability at least 1 - δ, where V^h_max bounds the absolute total reward of any length-h trajectory (the Z of the Chernoff bound).

Two Player MDP (aka Markov Games)
- So far we have only discussed single-player MDPs/games.
- Your labs and competition will be 2-player zero-sum games (zero-sum means the sum of the players' rewards is zero).
- We assume players take turns (non-simultaneous moves).
- (Diagram: player 1 and player 2 alternately send actions to the Markov game and each receives back a state and reward.)

Simulators for 2-Player Games
A simulation-based representation gives S, A1, A2, R, T, I:
- finite state set S (assume the state encodes whose turn it is)
- action sets A1 and A2 for player 1 (called max) and player 2 (called min)
- stochastic, real-valued, bounded reward function R(s,a): stochastically returns a reward r for max given input s and action a (it is assumed here that the reward for min is -r), so min maximizes its own reward by minimizing the reward of max
- stochastic transition function T(s,a) = s' (i.e. a simulator): stochastically returns a state s' given inputs s and a, where the probability of returning s' is dictated by Pr(s' | s,a) of the game; generally s' will be a turn for the other player
These stochastic functions can be implemented in any language!

Finite Horizon Value of Game
- Given two policies π1 and π2, one for each player, we can define the finite-horizon value V^h_{π1,π2}(s): the h-horizon value function with respect to player 1, i.e., the expected total reward for player 1 after executing π1 and π2 starting in s for h steps.
- For zero-sum games, the value with respect to player 2 is just -V^h_{π1,π2}(s).
- Given π1 and π2 we can easily use the simulator to estimate V^h_{π1,π2}(s), just as we did for the single-player MDP setting; a sketch is given below.
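As a rough illustration of that last point, here is a sketch of Monte-Carlo evaluation of V^h_{π1,π2}(s) for a turn-taking zero-sum game simulator. The names sample_T, reward_max, and whose_turn are hypothetical stand-ins for the game's T and R and for the turn information the state is assumed to encode; they are not defined by the lecture.

```python
# Monte-Carlo estimate of the h-horizon game value V^h_{pi1,pi2}(s) for player max.
# sample_T(s, a), reward_max(s, a), and whose_turn(s) are hypothetical stand-ins
# for the game simulator's T and R and for the turn marker encoded in the state.

def estimate_game_value(s0, pi1, pi2, h, w, sample_T, reward_max, whose_turn):
    """Average, over w rollouts, of max's total reward along length-h trajectories."""
    total = 0.0
    for _ in range(w):
        s, rollout_reward = s0, 0.0
        for _ in range(h):
            # The player to move is read off the state; min's reward is -r by assumption.
            a = pi1(s) if whose_turn(s) == "max" else pi2(s)
            rollout_reward += reward_max(s, a)
            s = sample_T(s, a)
        total += rollout_reward
    return total / w
```

The same Chernoff argument as above then bounds how far this sampled average can be from V^h_{π1,π2}(s) as a function of the sampling width w.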