Alan Fern

Often a simulator of a planning domain is available or can be learned from data
Example domains: conservation planning, fire & emergency response
Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator
[Diagram: the planner sends actions to a simulator of the world and receives the next state plus a reward]
MDP: Simulation-Based Representation
A simulation-based representation gives: S, A, R , T , I :
finite state set S (|S|=n and is generally very large)
finite action set A (|A|=m and will assume is of reasonable size)
Stochastic, real-valued, bounded reward function R(s,a) = r
Stochastically returns a reward r given input s and a
Stochastic transition function T(s,a ) = s’ (i.e. a simulator)
Stochastically returns a state s’ given input s and a
Probability of returning s’ is dictated by Pr (s’ | s,a) of MDP
Stochastic initial state function I.
Stochastically returns a state according to an initial state distribution
These stochastic functions can be implemented in any language!
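For concreteness, here is a minimal Python sketch of what such a simulation-based representation could look like. The class name, the toy action and state sets, and the particular distributions are illustrative assumptions, not part of any specific framework.

import random

class SimulatedMDP:
    # Illustrative simulation-based representation: R, T, I as stochastic functions.
    def __init__(self):
        self.actions = ["a1", "a2"]            # assumed toy action set A
        self.states = [0, 1, 2]                # assumed toy state set S

    def initial_state(self):                   # I: sample from the initial state distribution
        return random.choice(self.states)

    def reward(self, s, a):                    # R(s, a): stochastic, bounded reward
        return random.uniform(0.0, 1.0)

    def transition(self, s, a):                # T(s, a): sample s' according to Pr(s' | s, a)
        return (s + self.actions.index(a) + random.randint(0, 1)) % len(self.states)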
Outline
You already learned how to evaluate a policy given a simulator
Just run the policy multiple times for a finite horizon and average the rewards (a sketch follows below)
In the next two lectures we'll learn how to use the simulator to select good actions in the real world
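As a reminder of that evaluation procedure, a minimal sketch, assuming the illustrative SimulatedMDP interface above; the helper name is hypothetical.

def evaluate_policy(mdp, policy, horizon, num_episodes=100):
    # Estimate a policy's expected total reward by averaging finite-horizon simulated runs.
    total = 0.0
    for _ in range(num_episodes):
        s = mdp.initial_state()
        for _ in range(horizon):
            a = policy(s)
            total += mdp.reward(s, a)
            s = mdp.transition(s, a)
    return total / num_episodes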
Monte-Carlo Planning Outline
Today:
Single State Case (multi-armed bandits)
A basic tool for other algorithms
Monte-Carlo Policy Improvement
Policy rollout
Policy Switching
Next:
Monte-Carlo Tree Search
Sparse Sampling
UCT and variants
Single State Monte-Carlo Planning
Suppose MDP has a single state and k actions
Can sample rewards of actions using calls to simulator
Sampling action a_i is like pulling a slot machine arm with random payoff function R(s, a_i)
[Diagram: a single state s with k arms a_1, a_2, …, a_k, whose pulls return random rewards R(s,a_1), R(s,a_2), …, R(s,a_k)]
Multi-Armed Bandit Problem
Multi-Armed Bandits
We will use bandit algorithms as components for multi-state Monte-Carlo planning
But they are useful in their own right
Pure bandit problems arise in many applications
Applicable whenever:
We have a set of independent options with unknown utilities
There is a cost for sampling options or a limit on total samples
Want to find the best option or maximize utility of our samples
Multi-Armed Bandits: Examples
Clinical Trials
Arms = possible treatments
Arm Pulls = application of a treatment to an individual
Rewards = outcome of treatment
Objective = determine best treatment quickly
Online Advertising
Arms = different ads/ad-types for a web page
Arm Pulls = displaying an ad upon a page access
Rewards = click-throughs
Objective = find the best ad quickly (i.e., maximize clicks)
Simple Regret Objective
Different applications suggest different types of bandit objectives.
Today minimizing simple regret will be the objective
Simple Regret Minimization (informal): quickly identify an arm with close to optimal expected reward
[Diagram: the same k-armed bandit, state s with arms a_1, …, a_k and random rewards R(s,a_1), …, R(s,a_k)]
Simple Regret Objective: Formal Definition
Protocol: at time step n
1. Pick an "exploration" arm 𝑎_𝑛, then pull it and observe reward 𝑟_𝑛
2. Pick an "exploitation" arm index 𝑗_𝑛 that currently looks best (if the algorithm is stopped at time 𝑛, it returns 𝑗_𝑛)
(𝑎_𝑛, 𝑗_𝑛, and 𝑟_𝑛 are random variables)
Let 𝑅* be the expected reward of the truly best arm
Expected Simple Regret 𝐸[𝑆𝑅𝑒𝑔_𝑛]: the difference between 𝑅* and the expected reward of the arm 𝑗_𝑛 selected by our strategy at time n:
𝐸[𝑆𝑅𝑒𝑔_𝑛] = 𝑅* − 𝐸[𝑅(𝑎_{𝑗_𝑛})]
UniformBandit Algorithm (or Round Robin)
Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832-1852
UniformBandit Algorithm:
At round n pull the arm with index (n mod k) + 1
At round n return arm (if asked) with largest average reward
I.e. 𝑗 𝑛 is the index of arm with best average so far
Theorem: The expected simple regret of UniformBandit after n arm pulls is upper bounded by O(𝑒^{−𝑐𝑛}) for some constant c > 0.
This bound is exponentially decreasing in n!
So even this simple algorithm has a provably small simple regret.
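A minimal Python sketch of UniformBandit; each arm is assumed to be a zero-argument stochastic reward function (e.g., a wrapper around R(s, a_i)), and the function name is illustrative.

def uniform_bandit(arm_pulls, n):
    # arm_pulls: list of k zero-argument stochastic reward functions (the arms).
    # Pull arms round-robin for n rounds, then return the index with best average reward.
    k = len(arm_pulls)
    sums, counts = [0.0] * k, [0] * k
    for t in range(n):
        i = t % k                                  # round-robin arm choice
        sums[i] += arm_pulls[i]()
        counts[i] += 1
    def average(i):
        return sums[i] / counts[i] if counts[i] > 0 else float("-inf")
    return max(range(k), key=average)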
Can we do better?
Tolpin, D. & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence.
Algorithm 𝜖 -GreedyBandit : (parameter 0 < 𝜖 < 1 )
At round n, with probability 𝜖 pull arm with best average reward so far, otherwise pull one of the other arms at random.
At round n return arm (if asked) with largest average reward
Theorem: The expected simple regret of 𝜖-Greedy with 𝜖 = 0.5 after n arm pulls is upper bounded by O(𝑒^{−𝑐𝑛}) for a constant c that is larger than the constant for UniformBandit (this holds for "large enough" n).
It is often more effective than UniformBandit in practice.
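A corresponding sketch of 𝜖-GreedyBandit under the same assumptions (zero-argument stochastic arms, illustrative names); a sketch, not a definitive implementation.

import random

def epsilon_greedy_bandit(arm_pulls, n, epsilon=0.5):
    # With probability epsilon pull the arm with the best average so far,
    # otherwise pull one of the other arms uniformly at random.
    k = len(arm_pulls)
    sums, counts = [0.0] * k, [0] * k
    def average(i):
        return sums[i] / counts[i] if counts[i] > 0 else 0.0
    for _ in range(n):
        best = max(range(k), key=average)
        if random.random() < epsilon or k == 1:
            i = best                               # exploit the current best arm
        else:
            i = random.choice([j for j in range(k) if j != best])   # explore another arm
        sums[i] += arm_pulls[i]()
        counts[i] += 1
    return max(range(k), key=average)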
Monte-Carlo Planning Outline
Today:
Single State Case (multi-armed bandits)
A basic tool for other algorithms
Monte-Carlo Policy Improvement
Policy rollout
Policy Switching
Next:
Monte-Carlo Tree Search
Sparse Sampling
UCT and variants
Policy Improvement via Monte-Carlo
Now consider a very large multi-state MDP.
Suppose we have a simulator and a non-optimal policy
E.g. policy could be a standard heuristic or based on intuition
Can we somehow compute an improved policy?
[Diagram: the planner interacts with the simulator together with a base policy, sending actions and receiving the next state plus a reward]
Policy Improvement Theorem
Definition: The Q-value function 𝑄^𝜋(𝑠, 𝑎, ℎ) gives the expected future reward of starting in state 𝑠, taking action 𝑎, and then following policy 𝜋 until the horizon h.
How good is it to execute 𝜋 after taking action 𝑎 in state 𝑠?
Define: 𝜋′(𝑠) = argmax_𝑎 𝑄^𝜋(𝑠, 𝑎, ℎ)
Theorem [Howard, 1960]: For any non-optimal policy 𝜋 the policy 𝜋 ′ is strictly better than 𝜋 .
So if we can compute 𝜋 ′ 𝑠 at any state we encounter, then we can execute an improved policy
Can we use bandit algorithms to compute 𝜋 ′ 𝑠 ?
Policy Improvement via Bandits
[Diagram: state s with arms a_1, a_2, …, a_k; pulling arm a_i returns a sample of SimQ(s, a_i, π, h)]
Idea: define a stochastic function SimQ(s, a, π, h) that we can implement and whose expected value is Q^π(s, a, h)
Then use Bandit algorithm to select (approximately) the action with best Q-value (i.e. the action 𝜋′(𝑠) )
How to implement SimQ?
Policy Improvement via Bandits
SimQ(s, a, π, h):
  q = R(s, a); s = T(s, a)        // simulate a in s
  for i = 1 to h-1:               // simulate h-1 steps of policy π
    q = q + R(s, π(s))
    s = T(s, π(s))
  return q
[Diagram: from state s, each arm a_i launches a trajectory under π; the sum of rewards along the trajectory is one sample of SimQ(s, a_i, π, h)]
Policy Improvement via Bandits
SimQ(s, a, π, h):
  q = R(s, a); s = T(s, a)        // simulate a in s
  for i = 1 to h-1:               // simulate h-1 steps of policy π
    q = q + R(s, π(s))
    s = T(s, π(s))
  return q
Simply simulate taking a in s and then following the policy for h-1 steps, returning the (discounted) sum of rewards
The expected value of SimQ(s, a, π, h) is Q^π(s, a, h)
So averaging across multiple runs of SimQ quickly converges to Q^π(s, a, h)
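A direct Python transcription of the SimQ pseudocode above, assuming the illustrative simulator interface sketched earlier (mdp.reward, mdp.transition); the optional discount factor is an addition for generality to cover the discounted case.

def sim_q(mdp, s, a, policy, h, gamma=1.0):
    # One stochastic sample whose expected value is Q^pi(s, a, h).
    q = mdp.reward(s, a)                 # simulate taking a in s
    s = mdp.transition(s, a)
    discount = gamma
    for _ in range(h - 1):               # then follow the policy for h-1 steps
        a_pi = policy(s)
        q += discount * mdp.reward(s, a_pi)
        s = mdp.transition(s, a_pi)
        discount *= gamma
    return q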
Policy Improvement via Bandits
[Diagram: state s with arms a_1, …, a_k, each returning samples of SimQ(s, a_i, π, h)]
Now apply your favorite bandit algorithm for simple regret
UniformRollout : use UniformBandit
Parameters: number of trials n and horizon/height h
𝝐 -GreedyRollout : use 𝜖 -GreedyBandit
Parameters: number of trials n, 𝜖, and horizon/height h
(𝜖 = 0.5 is often a good choice)
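Putting the pieces together, a hedged sketch of UniformRollout that treats each action as a bandit arm whose pulls are SimQ samples; it reuses the illustrative sim_q and uniform_bandit sketches from earlier and is not a definitive implementation.

def uniform_rollout(mdp, s, actions, policy, h, n):
    # Return the action whose average SimQ value is best after roughly n/k samples per action.
    arm_pulls = [(lambda a=a: sim_q(mdp, s, a, policy, h)) for a in actions]
    best_index = uniform_bandit(arm_pulls, n)
    return actions[best_index]

Swapping uniform_bandit for epsilon_greedy_bandit in the sketch gives the corresponding 𝜖-GreedyRollout.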
UniformRollout
Each action is tried roughly the same number of times (approximately 𝑤 = 𝑛/𝑘 times)
[Diagram: from state s, each arm a_i generates w SimQ(s, a_i, π, h) trajectories; each trajectory simulates taking action a_i and then following π for h-1 steps, yielding samples q_{i1}, q_{i2}, …, q_{iw}]
𝜖-GreedyRollout
Allocates a non-uniform number of trials across actions (focuses on more promising actions)
For 𝜖 = 0.5 we might expect it to be better than UniformRollout for the same value of n.
[Diagram: from state s, the arms a_1, …, a_k receive different numbers of SimQ samples, e.g. q_{11} … q_{1u}, q_{21} … q_{2v}, …]
Executing Rollout in the Real World
[Diagram: real-world state/action sequence; at each real state, policy rollout is run over simulated experience (trying each of a_1, …, a_k), the selected action is executed in the real world, and the process repeats at the next state]
How much time does each decision take?
[Diagram: state s with arms a_1, …, a_k; SimQ(s, a_i, π, h) trajectories each simulate taking action a_i and then following π for h-1 steps]
• Total of n SimQ calls, each using h calls to the simulator and the policy
• Total of hn calls to the simulator and to the policy (this dominates the time to make a decision)
Practical Issues: Accuracy
Selecting number of trajectories 𝒏
n should be at least as large as the number of available actions
(so each is tried at least once)
In general n needs to be larger as the randomness of the simulator increases (so each action gets tried a sufficient number of times)
Rule-of-Thumb : start with n set so that each action can be tried approximately 5 times and then see impact of decreasing/increasing n
Selecting height/horizon h of trajectories
A common option is to just select h to be the same as the horizon of the problem being solved
Suggestion: setting h = -1 in our framework, which will run all trajectories until the simulator hits a terminal state
Using a smaller value of h can sometimes be effective if enough reward is accumulated to give a good estimate of Q-values
In general, larger values are better, but this increases time.
Practical Issues: Speed
There are three ways to speed up decision-making time
1. Use a faster policy
Practical Issues: Speed
There are three ways to speed up decision-making time
1. Use a faster policy
2. Decrease the number of trajectories n
Decreasing Trajectories:
If n is small compared to # of actions k, then performance could be poor since actions don’t get tried very often
One way to get away with a smaller n is to use an action filter
Action Filter: a function f(s) that returns a subset of the actions in state s that rollout should consider
You can use your domain knowledge to filter out obviously bad actions
Rollout decides among the remaining actions returned by f(s)
Since rollout only tries actions in f(s), we can use a smaller value of n
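A minimal sketch of how an action filter might be wired into rollout; the filter function is domain-specific and hypothetical, and uniform_rollout is the illustrative sketch from earlier.

def filtered_rollout(mdp, s, policy, h, n, action_filter):
    # Run rollout only over the subset of actions the domain-specific filter allows in s.
    candidate_actions = action_filter(s)          # f(s): actions worth considering in state s
    return uniform_rollout(mdp, s, candidate_actions, policy, h, n)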
Practical Issues: Speed
There are three ways to speed up either rollout procedure
1. Use a faster policy
2. Decrease the number of trajectories n
3. Decrease the horizon h
Decrease Horizon h:
If h is too small compared to the "real horizon" of the problem, then the Q-estimates may not be accurate
Can get away with a smaller h by using a value estimation heuristic
Heuristic function: a heuristic function v(s) returns an estimate of the value of state s
SimQ is adjusted to run policy for h steps ending in state s’ and returns the sum of rewards up until s’ added to the estimate v(s’)
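A hedged sketch of that adjusted SimQ: truncate after h steps and add a heuristic estimate v(s') of the remaining value. The heuristic function is user-supplied and assumed here; the optional discount is an addition for generality.

def sim_q_with_heuristic(mdp, s, a, policy, h, value_heuristic, gamma=1.0):
    # Truncated SimQ: rewards for h steps plus a heuristic value of the state reached.
    q = mdp.reward(s, a)
    s = mdp.transition(s, a)
    discount = gamma
    for _ in range(h - 1):
        a_pi = policy(s)
        q += discount * mdp.reward(s, a_pi)
        s = mdp.transition(s, a_pi)
        discount *= gamma
    return q + discount * value_heuristic(s)      # v(s') stands in for value beyond the horizon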
Multi-Stage Rollout
A single call to Rollout[ π,h,w](s) yields one iteration of policy improvement starting at policy π
We can use more computation time to yield multiple iterations of policy improvement via nesting calls to Rollout
Rollout[ Rollout[ π,h,w] ,h,w](s) returns the action for state s resulting from two iterations of policy improvement
Can nest this arbitrarily
Gives a way to use more time in order to improve performance
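A hedged sketch of that nesting, reusing the illustrative uniform_rollout from earlier: the rollout decision rule is wrapped up as a policy and fed back into rollout (note the multiplicative blow-up in simulator calls).

def rollout_policy(mdp, actions, base_policy, h, n):
    # Return a policy that, at every state it is queried on, runs rollout on top of base_policy.
    def improved(s):
        return uniform_rollout(mdp, s, actions, base_policy, h, n)
    return improved

# Two-stage rollout: improve the one-stage rollout policy itself.
# pi1 = rollout_policy(mdp, actions, base_policy, h, n)
# pi2 = rollout_policy(mdp, actions, pi1, h, n)    # roughly (nh)^2 simulator calls per decision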
[Diagram: state s with arms a_1, …, a_k; trajectories of SimQ(s, a_i, Rollout[π,h,w], h), where each step of the rollout policy itself requires nh simulator calls]
• Two stage: compute the rollout policy of the "rollout policy of π"
• Requires (nh)^2 calls to the simulator for 2 stages
• In general, exponential in the number of stages
Example: Rollout for Solitaire [Yan et al. NIPS’04]
Player               Success Rate   Time/Game
Human Expert         36.6%          20 min
(naïve) Base Policy  13.05%         0.021 sec
1 rollout            31.20%         0.67 sec
2 rollout            47.6%          7.13 sec
3 rollout            56.83%         1.5 min
4 rollout            60.51%         18 min
5 rollout            70.20%         1 hour 45 min
Multiple levels of rollout can pay off, but this is expensive
Rollout in 2-Player Games
SimQ simply uses the base policy 𝜋 to select moves for both players until the horizon
Rollout is biased toward playing well against 𝜋. Is this ok?
[Diagram: from state s, each arm a_i generates trajectories in which π selects moves for both players p1 and p2, yielding samples q_{11} … q_{kw}]
Another Useful Technique:
Policy Switching
Suppose you have a set of base policies {π_1, π_2, …, π_M}
Also suppose that the best policy to use can depend on the specific state of the system and we don’t know how to select.
Policy switching is a simple way to select which policy to use at a given step via a simulator
Another Useful Technique: Policy Switching
[Diagram: state s with arms π_1, π_2, …, π_M, each returning samples of Sim(s, π_1, h), Sim(s, π_2, h), …, Sim(s, π_M, h)]
The stochastic function Sim(s, π,h) simply samples the h-horizon value of π starting in state s
Implement by simply simulating π starting in s for h steps and returning discounted total reward
Use Bandit algorithm to select best policy and then select action chosen by that policy
PolicySwitching
PolicySwitch[{π_1, π_2, …, π_M}, h, n](s):
1. Define a bandit with M arms giving rewards Sim(s, π_i, h)
2. Let i* be the index of the arm/policy selected by your favorite bandit algorithm using n trials
3. Return action π_{i*}(s)
[Diagram: state s with arms π_1, π_2, …, π_M; each arm generates Sim(s, π_i, h) trajectories that simulate following π_i for h steps; the discounted cumulative rewards give samples v_{11} … v_{Mw}]
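A hedged Python sketch of PolicySwitch under the same illustrative simulator interface used earlier; Sim(s, π_i, h) is implemented inline as a finite-horizon simulation, and uniform_bandit stands in for whatever simple-regret bandit algorithm you prefer.

def sim_policy(mdp, s, policy, h, gamma=1.0):
    # Sample the (discounted) h-horizon value of following a policy from state s.
    total, discount = 0.0, 1.0
    for _ in range(h):
        a = policy(s)
        total += discount * mdp.reward(s, a)
        s = mdp.transition(s, a)
        discount *= gamma
    return total

def policy_switch(mdp, s, policies, h, n):
    # Bandit over policies: pick the policy with the best sampled value and return its action.
    arm_pulls = [(lambda p=p: sim_policy(mdp, s, p, h)) for p in policies]
    i_star = uniform_bandit(arm_pulls, n)
    return policies[i_star](s)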
Executing Policy Switching in Real World
[Diagram: real-world state/action sequence; at each real state, policy switching is run over simulated experience (evaluating π_1, …, π_M), the action chosen by the selected policy (e.g., π_2(s) at state s) is executed in the real world, and the process repeats at the next state]
Policy Switching: Quality
Let 𝜋_ps denote the ideal switching policy
Always picks the best policy index at any state
Theorem: For any state s, max_𝑖 𝑉^{𝜋_𝑖}(𝑠) ≤ 𝑉^{𝜋_ps}(𝑠).
The value of the switching policy is at least as good as the best single policy in the set
It will often perform better than any single policy in set.
For the non-ideal case, where the bandit algorithm only picks approximately the best arm, we can add an error term to the bound.
Suppose we have two sets of policies, one for each player.
Max Policies (us): {𝜋_1, 𝜋_2, …, 𝜋_M}
Min Policies (them): {𝜙_1, 𝜙_2, …, 𝜙_N}
These policy sets will often be the same, when the players have the same action sets.
Policies encode our knowledge of what the possible effective strategies might be in the game
But we might not know exactly when each strategy will be most effective.
[Diagram: from the current state s, use the game simulator to build a game matrix C(s) with one row per max policy and one column per min policy]
Each entry gives the estimated value (for the max player) of playing a policy pair against one another
Each value is estimated by averaging across w simulated games.
[Diagram: build the game matrix C(s) from the current state s using the game simulator]
MaxiMin policy: pick 𝑖* = arg max_𝑖 min_𝑗 𝐶_{𝑖,𝑗}(𝑠), then select action 𝜋_{𝑖*}(𝑠)
Can switch between policies based on the state of the game!
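A hedged sketch of MaxiMin policy switching: estimate each matrix entry by averaging w simulated games between a max policy and a min policy, then take the row maximizing the minimum column value. Here simulate_game is a hypothetical helper that plays one game from state s (to horizon h) and returns its value for the max player.

def maximin_policy_switch(s, max_policies, min_policies, w, h, simulate_game):
    # Build C[i][j] = average value of pi_i vs phi_j over w simulated games, then act maximin.
    M, N = len(max_policies), len(min_policies)
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            values = [simulate_game(s, max_policies[i], min_policies[j], h) for _ in range(w)]
            C[i][j] = sum(values) / w
    i_star = max(range(M), key=lambda i: min(C[i]))   # arg max_i min_j C[i][j]
    return max_policies[i_star](s)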
[Diagram: build the game matrix from the current state s using the game simulator]
Parameters in Library Implementation:
Policy Sets: {𝜋_1, 𝜋_2, …, 𝜋_M}, {𝜙_1, 𝜙_2, …, 𝜙_N}
Sampling Width w: number of simulations per policy pair
Height/Horizon h: horizon used for simulations
Policy Switching: Quality
MaxiMin policy switching will often do better than any single policy in practice
The theoretical guarantees for basic MaxiMin policy switching are quite weak
Tweaks to the algorithm can fix this
For single-agent MDPs, policy switching is guaranteed to improve over the best policy in the set.