Markov Decision Processes
Basic Concepts
Alan Fern
Some AI Planning Problems
• Fire & Rescue Response Planning
• Helicopter Control
• Solitaire
• Real-Time Strategy Games
• Legged Robot Control
• Network Security/Control
Some AI Planning Problems
• Health Care
  - Personalized treatment planning
  - Hospital Logistics/Scheduling
• Transportation
  - Autonomous Vehicles
  - Supply Chain Logistics
  - Air traffic control
• Assistive Technologies
  - Dialog Management
  - Automated assistants for elderly/disabled
  - Household robots
• Sustainability
  - Smart grid
  - Forest fire management
• …
Common Elements
• We have a system that changes state over time
• We can (partially) control the system's state transitions by taking actions
• The problem gives an objective that specifies which states (or state sequences) are more/less preferred
• Problem: at each moment we must select an action to optimize the overall (long-term) objective
  - i.e., produce the most preferred state sequences
Observe-Act Loop of AI Planning Agent
[Diagram: the planning agent ("????") observes the state of the world/system and sends actions back to it. Goal: maximize expected reward over the agent's lifetime.]
Stochastic/Probabilistic Planning:
Markov Decision Process (MDP) Model
[Diagram: the agent ("????") observes the world state produced by a Markov Decision Process and chooses an action from a finite set. Goal: maximize expected reward over the agent's lifetime.]
Example MDP
[Diagram of a dice game: the state describes all visible information about the game; the actions are the different choices of dice to roll (or the choice of a category to score). Goal: maximize the score at the end of the game.]
Markov Decision Processes
• An MDP has four components: S, A, R, T:
  - finite state set S
  - finite action set A
  - transition function T(s,a,s') = Pr(s' | s,a)
    - the probability of going to state s' after taking action a in state s
  - bounded, real-valued reward function R(s,a)
    - the immediate reward we get for being in state s and taking action a
    - roughly speaking, the objective is to select actions in order to maximize total reward over time
    - for example, in a goal-based domain R(s,a) may equal 1 for goal states and 0 for all others (or -1 for non-goal states)
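To make the four components concrete, here is a minimal sketch of how a toy MDP could be written down in Python; the two-state "machine repair" flavor, the state/action names, and all the numbers are invented for illustration and are not from the lecture.

    # A toy MDP written out explicitly; all names and numbers are illustrative.
    S = ["healthy", "broken"]          # finite state set S
    A = ["operate", "repair"]          # finite action set A

    # Transition function T(s, a, s') = Pr(s' | s, a)
    T = {
        ("healthy", "operate"): {"healthy": 0.9, "broken": 0.1},
        ("healthy", "repair"):  {"healthy": 1.0, "broken": 0.0},
        ("broken",  "operate"): {"healthy": 0.0, "broken": 1.0},
        ("broken",  "repair"):  {"healthy": 0.6, "broken": 0.4},
    }

    # Bounded, real-valued reward function R(s, a)
    R = {
        ("healthy", "operate"): 1.0,    # producing while the machine works
        ("healthy", "repair"): -1.0,    # unnecessary repair costs us
        ("broken",  "operate"): 0.0,
        ("broken",  "repair"): -1.0,
    }

Each outgoing transition distribution sums to 1 and the rewards are bounded, as the definition requires.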
[Diagram: from a state of the dice game, the available actions include Roll(die1), Roll(die1,die2), Roll(die1,die2,die3), ..., or selecting a category to score. An action such as Roll(die1,die2) leads probabilistically to successor states, one per dice outcome (1,1; 1,2; ...; 6,6). Reward: we only get reward for "category selection" actions, equal to the points gained.]
What is a solution to an MDP?
MDP Planning Problem:
  Input: an MDP (S,A,R,T)
  Output: ????

• Should the solution to an MDP be just a sequence of actions such as (a1, a2, a3, ...)?
  - Consider a single-player card game like Blackjack/Solitaire.
• No! In general an action sequence is not sufficient
  - Actions have stochastic effects, so the state we end up in is uncertain
  - This means we might end up in states where the remainder of the action sequence doesn't apply or is a bad choice
  - A solution should tell us what the best action is for any possible situation/state that might arise
Policies ("plans" for MDPs)
• For this class we will assume that we are given a finite planning horizon H
  - i.e., we are told how many actions we will be allowed to take
• A solution to an MDP is a policy that says what to do at any moment
• Policies are functions from states and times to actions
  - π: S × T → A, where T is the non-negative integers
  - π(s,t) tells us what action to take in state s when there are t stages-to-go
  - A policy that does not depend on t is called stationary; otherwise it is called non-stationary
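As a concrete (and purely illustrative) example of the π: S × T → A signature, here is a hypothetical non-stationary policy for the toy MDP sketched earlier; the decision rule itself is made up.

    def pi(s: str, t: int) -> str:
        """A non-stationary policy: the action depends on the state s and on
        the number of stages-to-go t (illustrative rule only)."""
        if s == "broken":
            # With at least 2 stages left, a repair still has time to pay off.
            return "repair" if t >= 2 else "operate"
        return "operate"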
What is a solution to an MDP?
MDP Planning Problem:
  Input: an MDP (S,A,R,T)
  Output: a policy such that ????

• We don't want to output just any policy
• We want to output a "good" policy
  - one that accumulates lots of reward
Value of a Policy
• How do we quantify the accumulated reward of a policy π?
• Every policy has a value function that indicates how good it is across states and time
• $V_\pi^h(s)$ is the h-horizon value of policy π in state s
  - equals the expected total reward of starting in s and executing π for h steps:

    $V_\pi^h(s) = E\left[\sum_{t=0}^{h-1} R_t \;\middle|\; \pi, s\right]$

  - $R_t$ is a random variable denoting the reward received at time t when following π starting in s
What is a solution to an MDP?
MDP Planning Problem:
  Input: an MDP (S,A,R,T)
  Output: a policy that achieves an "optimal value function"

• $V_\pi^H$ is optimal if, for all states s, $V_\pi^H(s)$ is no worse than the value of any other policy at s
• Theorem: For every MDP there exists an optimal policy (though perhaps not a unique one).
  - This says that there is one policy that is uniformly as good as any other policy across the entire state space.
Computational Problems
• There are two problems of typical interest
• Policy evaluation:
  - Given an MDP and a policy π
  - Compute the finite-horizon value function $V_\pi^h(s)$ for any h and s
• Policy optimization:
  - Given an MDP and a horizon H
  - Compute the optimal finite-horizon policy
• How many finite-horizon policies are there?
  - $|A|^{H|S|}$
  - So we can't just enumerate policies for efficient optimization
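To get a feel for how quickly this count explodes, a small worked example (numbers chosen only for illustration):

    $|A|^{H|S|} = 2^{10 \cdot 10} = 2^{100} \approx 1.3 \times 10^{30}$ policies for $|A| = 2$, $|S| = 10$, $H = 10$.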
Computational Problems
• Dynamic programming techniques can be used for both policy evaluation and optimization (see the sketch below)
  - Polynomial time in # of states and actions
  - http://web.engr.oregonstate.edu/~afern/classes/cs533/
• Is polytime in # of states and actions good?
  - Not when these numbers are enormous!
  - As is the case for most realistic applications
  - Consider Klondike Solitaire, computer network control, etc.
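As a rough illustration of the dynamic-programming idea for policy evaluation only (not the course's full treatment, which is covered at the link above), here is a backward-induction sweep written against the toy S/R/T dictionaries and the non-stationary policy pi sketched earlier:

    def evaluate_policy(S, R, T, pi, H):
        """Finite-horizon policy evaluation by backward induction.
        Returns V, where V[t][s] is the expected total reward of following
        pi from state s with t stages-to-go.  Runs in time polynomial in
        the number of states for this explicit representation."""
        V = [{s: 0.0 for s in S}]                  # V[0]: no stages left
        for t in range(1, H + 1):
            Vt = {}
            for s in S:
                a = pi(s, t)                       # action the policy picks here
                # immediate reward + expected value of the successor state
                Vt[s] = R[(s, a)] + sum(p * V[t - 1][s2]
                                        for s2, p in T[(s, a)].items())
            V.append(Vt)
        return V

For the toy model, evaluate_policy(S, R, T, pi, 10)[10]["healthy"] would give the 10-stage value of the hypothetical "healthy" state. The sweep touches every state at every stage, which is exactly what becomes infeasible when the state space is enormous.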
Enter Monte-Carlo Planning
Approaches for Large Worlds: Monte-Carlo Planning
• Often a simulator of a planning domain is available or can be learned from data
[Images: Fire & Emergency Response, Klondike Solitaire]
Large Worlds: Monte-Carlo Approach
• Often a simulator of a planning domain is available or can be learned from data
• Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator
[Diagram: the planner sends actions to a world simulator (standing in for the real world) and receives back a state and reward.]
Example Domains with Simulators
• Traffic simulators
• Robotics simulators
• Military campaign simulators
• Computer network simulators
• Emergency planning simulators
  - large-scale disaster and municipal
• Sports domains
• Board games / video games
  - Go / RTS
In many cases Monte-Carlo techniques yield state-of-the-art performance.
MDP: Simulation-Based Representation
• A simulation-based representation gives: S, A, R, T, I:
  - finite state set S (|S| is generally very large)
  - finite action set A (|A| = m, assumed to be of reasonable size)
  - stochastic, real-valued, bounded reward function R(s,a) = r
    - stochastically returns a reward r given inputs s and a
  - stochastic transition function T(s,a) = s' (i.e. a simulator)
    - stochastically returns a state s' given inputs s and a
    - the probability of returning s' is dictated by Pr(s' | s,a) of the MDP
  - stochastic initial-state function I
    - stochastically returns a state according to an initial-state distribution
These stochastic functions can be implemented in any language!
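A minimal sketch of what such a simulation-based representation could look like in Python, built on top of the explicit toy R/T dictionaries from earlier; the class and method names are my own, not part of the lecture.

    import random

    class ToySimulator:
        """Simulation-based MDP representation: R, T, and I are sampled rather
        than enumerated.  Here they are backed by the explicit toy model."""

        def __init__(self, R, T, init_states):
            self._R, self._T, self._init = R, T, init_states

        def I(self):
            """Stochastic initial-state function."""
            return random.choice(self._init)

        def R(self, s, a):
            """Returns a reward for (s, a); deterministic in this toy model."""
            return self._R[(s, a)]

        def T(self, s, a):
            """Stochastically returns a successor state s' drawn from Pr(. | s, a)."""
            succ = self._T[(s, a)]
            return random.choices(list(succ), weights=list(succ.values()))[0]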
Computational Problems
• Policy evaluation:
  - Given an MDP simulator, a policy π, a state s, and a horizon h
  - Compute the finite-horizon value function $V_\pi^h(s)$
• Policy optimization:
  - Given an MDP simulator, a horizon H, and a state s
  - Compute the optimal finite-horizon policy at state s
Trajectories
• We can use the simulator to observe trajectories of any policy π from any state s:
• Let Traj(s, π, h) be a stochastic function that returns a length-h trajectory of π starting at s.

  Traj(s, π, h)
    s0 = s
    For i = 1 to h-1
      si = T(si-1, π(si-1))
    Return s0, s1, ..., sh-1

• The total reward of a trajectory is given by

  $R(s_0, \dots, s_{h-1}) = \sum_{i=0}^{h-1} R(s_i)$
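A direct Python transcription of Traj, assuming a simulator object with R and T methods like the ToySimulator sketch above; since the simulation-based reward is defined on (s, a) pairs, the total reward is accumulated during the rollout, and for simplicity the policy here is stationary (a function of the state only).

    def sample_trajectory(sim, pi, s, h):
        """Roll out policy pi from state s, visiting h states s_0, ..., s_{h-1}.
        Returns the visited states and the total reward collected, with one
        reward term R(s_i, pi(s_i)) per visited state."""
        states, total = [s], 0.0
        for _ in range(h - 1):
            a = pi(s)                 # action chosen by the policy
            total += sim.R(s, a)      # reward for taking a in the current state
            s = sim.T(s, a)           # stochastic successor from the simulator
            states.append(s)
        total += sim.R(s, pi(s))      # reward term for the final visited state
        return states, total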
Policy Evaluation
• Given a policy π and a state s, how can we estimate $V_\pi^h(s)$?
• Simply sample w trajectories starting at s and average the total reward received
• Select a sampling width w and a horizon h.
• The function EstimateV(s, π, h, w) returns an estimate of $V_\pi^h(s)$:

  EstimateV(s, π, h, w)
    V = 0
    For i = 1 to w
      V = V + R( Traj(s, π, h) )   ; add total reward of a trajectory
    Return V / w

• How close to the true value $V_\pi^h(s)$ will this estimate be?
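The same estimator in Python, assuming the sample_trajectory helper and ToySimulator sketch above:

    def estimate_v(sim, pi, s, h, w):
        """Monte-Carlo estimate of the h-horizon value of pi at s: the average
        total reward over w independent simulated trajectories."""
        total = 0.0
        for _ in range(w):
            _, reward = sample_trajectory(sim, pi, s, h)
            total += reward
        return total / w

For example, estimate_v(ToySimulator(R, T, S), lambda s: "operate", "healthy", 10, 1000) would estimate the 10-step value of always operating from the hypothetical "healthy" state.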
Sampling-Error Bound

  $V_\pi^h(s) = E\left[\sum_{t=0}^{h-1} R_t \;\middle|\; \pi, s\right] = E\big[R(\mathrm{Traj}(s, \pi, h))\big]$

  approximation due to sampling:

  $\mathrm{EstimateV}(s, \pi, h, w) = \frac{1}{w}\sum_{i=1}^{w} r_i, \qquad r_i = R(\mathrm{Traj}(s, \pi, h))$

• Note that the r_i are samples of the random variable R(Traj(s, π, h))
• We can apply the additive Chernoff bound, which bounds the difference between an expectation and an empirical average
Aside: Additive Chernoff Bound
• Let X be a random variable with maximum absolute value Z, and let x_i, i = 1, ..., w be i.i.d. samples of X
• The Chernoff bound bounds the probability that the average of the x_i is far from E[X]:

  Let {x_i | i = 1, ..., w} be i.i.d. samples of random variable X.
  Then with probability at least $1 - \delta$,

  $E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \le Z \left(\frac{\ln\frac{1}{\delta}}{w}\right)^{1/2}$

  or equivalently

  $\Pr\left(E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \ge \epsilon\right) \le \exp\left(-\left(\frac{\epsilon}{Z}\right)^{2} w\right)$
Aside: Coin Flip Example
• Suppose we have a coin with probability of heads equal to p.
• Let X be a random variable where X = 1 if the coin flip gives heads and 0 otherwise (so Z from the bound is 1).

  E[X] = 1·p + 0·(1-p) = p

• After flipping the coin w times we can estimate the heads probability by the average of the x_i.
• The Chernoff bound tells us that this estimate converges exponentially fast to the true mean (coin bias) p:

  $\Pr\left(p - \frac{1}{w}\sum_{i=1}^{w} x_i \ge \epsilon\right) \le \exp\left(-\epsilon^{2} w\right)$
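A quick numeric check of this rate, with made-up numbers: after w = 1000 flips, the chance that the empirical average underestimates p by more than ε = 0.1 is at most

  $\exp\left(-0.1^{2} \cdot 1000\right) = e^{-10} \approx 4.5 \times 10^{-5}.$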
Sampling Error Bound

  $V_\pi^h(s) = E\left[\sum_{t=0}^{h-1} R_t \;\middle|\; \pi, s\right] = E\big[R(\mathrm{Traj}(s, \pi, h))\big]$

  approximation due to sampling:

  $\mathrm{EstimateV}(s, \pi, h, w) = \frac{1}{w}\sum_{i=1}^{w} r_i, \qquad r_i = R(\mathrm{Traj}(s, \pi, h))$

We get that, with probability at least $1 - \delta$,

  $V_\pi^h(s) - \mathrm{EstimateV}(s, \pi, h, w) \le V_{\max}\left(\frac{\ln\frac{1}{\delta}}{w}\right)^{1/2}$

where $V_{\max}$ bounds the magnitude of the total reward of a trajectory.
We can increase w to get an arbitrarily good approximation.
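Concretely, to guarantee an error of at most ε with probability at least 1 − δ, it suffices to set the right-hand side to ε and solve for the sampling width:

  $V_{\max}\left(\frac{\ln\frac{1}{\delta}}{w}\right)^{1/2} \le \epsilon \quad\Longleftrightarrow\quad w \ge \left(\frac{V_{\max}}{\epsilon}\right)^{2} \ln\frac{1}{\delta}.$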
Two Player MDP (aka Markov Games)
• So far we have only discussed single-player MDPs/games
• Your labs and competition will be 2-player zero-sum games (zero-sum means the sum of the players' rewards is zero)
• We assume players take turns (non-simultaneous moves)
[Diagram: Player 1 and Player 2 alternately send actions to the Markov Game, and each receives back a state and reward.]
Simulators for 2-Player Games
• A simulation-based representation gives: S, A1, A2, R1, R2, T, I:
  - finite state set S (assume the state encodes whose turn it is)
  - action sets A1 and A2 for player 1 (called max) and player 2 (called min)
  - stochastic, real-valued, bounded reward functions R1(s,a) and R2(s,a)
    - stochastically return rewards r1 and r2 for the two players max and min respectively
    - in our zero-sum case we assume r1 = -r2
    - so min maximizes its reward by minimizing the reward of max
  - stochastic transition function T(s,a) = s' (i.e. a simulator)
    - stochastically returns a state s' given inputs s and a
    - the probability of returning s' is dictated by Pr(s' | s,a) of the game
    - generally s' will be a turn for the other player
These stochastic functions can be implemented in any language!
Finite Horizon Value of Game
• Given two policies π1 and π2, one for each player, we can define the finite-horizon value
• $V^h_{\pi_1,\pi_2}(s)$ is the h-horizon value function with respect to max
  - the expected total reward for player 1 (max) after executing π1 and π2 starting in s for h steps
  - for zero-sum games the value with respect to player 2 is just $-V^h_{\pi_1,\pi_2}(s)$
• Given π1 and π2 we can easily use the simulator to estimate $V^h_{\pi_1,\pi_2}(s)$ (see the sketch below)
  - just as we did for the single-player MDP setting
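A minimal sketch of how the single-player estimator extends to the two-player setting; the simulator interface used here (turn, R1, and T methods) and the stationary policies are my own assumptions, not an API given in the lecture.

    def estimate_game_value(sim, pi1, pi2, s, h, w):
        """Monte-Carlo estimate of the h-horizon value of (pi1, pi2) at s with
        respect to max: average max's total reward over w simulated games."""
        total = 0.0
        for _ in range(w):
            state, reward = s, 0.0
            for _ in range(h):
                pi = pi1 if sim.turn(state) == 1 else pi2   # whose move is it?
                a = pi(state)
                reward += sim.R1(state, a)   # max's reward; min's is its negation
                state = sim.T(state, a)
            total += reward
        return total / w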
Summary
• Markov Decision Processes (MDPs) are common models for sequential planning problems
• The solution to an MDP is a policy
• The goodness of a policy is measured by its value function
  - expected total reward over H steps
• Monte-Carlo Planning (MCP) is used for enormous MDPs for which we have a simulator
• Evaluating a policy via MCP is very easy