Two Models of Evaluating Probabilistic Planning

Efficient Approaches for Solving Large-scale MDPs
(Slides on LRTDP and UCT are courtesy of Mausam/Kolobov)
Ideas for Efficient Algorithms..
• Use heuristic search (and reachability information)
  – LAO*, RTDP
• Use execution and/or simulation
  – “Actual execution”: Reinforcement learning (the main motivation for RL is to “learn” the model)
  – “Simulation”: simulate the given model to sample possible futures
    • Policy rollout, hindsight optimization, etc.
• Use “factored” representations
  – Factored representations for actions, reward functions, values and policies
  – Directly manipulating factored representations during the Bellman update
Real Time Dynamic Programming [Barto et al 95]
• Original motivation
  – an agent acting in the real world
• Trial
  – simulate the greedy policy starting from the start state;
  – perform a Bellman backup on visited states;
  – stop when you hit the goal
• RTDP: repeat trials forever
  – Converges in the limit as #trials → ∞

We will do the discussion in terms of SSP MDPs
  -- Recall they subsume infinite-horizon MDPs
Trial
[Figure, built up over several slides: an example state graph with start state s0, intermediate states s1–s8, and goal Sg; along the simulated greedy trajectory, heuristic estimates h are progressively replaced by backed-up values V]

start at start state
repeat
  perform a Bellman backup
  simulate greedy action
until hit the goal

RTDP: repeat trials forever, backing up all states on the trajectory.
Real Time Dynamic Programming [Barto et al 95]
• Original motivation
  – an agent acting in the real world
• Trial
  – simulate the greedy policy starting from the start state;
  – perform a Bellman backup on visited states;
  – stop when you hit the goal
• RTDP: repeat trials forever   ← No termination condition!
  – Converges in the limit as #trials → ∞
RTDP Family of Algorithms
repeat
  s ← s0
  repeat  // trials
    REVISE s; identify a_greedy
    FIND: pick s’ s.t. T(s, a_greedy, s’) > 0
    s ← s’
  until s ∈ G
until termination test
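As a concrete illustration, here is a minimal Python sketch of this generic RTDP loop for an SSP MDP. The MDP interface used here (A(s), T(s, a) returning (successor, probability) pairs, cost C(s, a), a goal test, and an admissible heuristic h) is an assumption made for the sketch; it is not part of the slides.

import random

def rtdp(mdp, s0, n_trials=1000):
    """Generic RTDP: repeat greedy trials from s0, backing up visited states."""
    V = {}                                   # value table, initialized lazily from the heuristic
    val = lambda s: V.get(s, mdp.h(s))

    def q(s, a):                             # one-step lookahead (expected cost-to-go)
        return mdp.C(s, a) + sum(p * val(s2) for s2, p in mdp.T(s, a))

    def revise(s):                           # Bellman backup; returns the greedy action
        a_greedy = min(mdp.A(s), key=lambda a: q(s, a))
        V[s] = q(s, a_greedy)
        return a_greedy

    for _ in range(n_trials):                # "repeat trials forever" (here: a fixed budget)
        s = s0
        while not mdp.is_goal(s):
            a = revise(s)                    # REVISE s; identify a_greedy
            succs, probs = zip(*mdp.T(s, a))
            s = random.choices(succs, weights=probs)[0]   # FIND: sample a successor
    return V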
Labeling
• Admissible heuristic & monotonicity
  ⇒ V(s) ≤ V*(s)
  ⇒ Q(s,a) ≤ Q*(s,a)
• Label a state s as solved
  – if V(s) has converged
[Figure: the best action from s leads directly to the goal sg]
  Res_V(s) < ε  ⇒  V(s) won’t change!  Label s as solved.
Labeling (contd)
[Figure: the best action from s leads to s’, which leads on to the goal sg]
  Res_V(s) < ε, s’ already solved  ⇒  V(s) won’t change!  Label s as solved.

[Figure: the best action from s leads to s’, whose best action leads to the goal sg]
  Res_V(s) < ε and Res_V(s’) < ε  ⇒  V(s) and V(s’) won’t change!  Label s and s’ as solved.
Labeled RTDP [Bonet & Geffner 03b]
repeat
  s ← s0
  label all goal states as solved
  repeat  // trials
    REVISE s; identify a_greedy
    FIND: sample s’ from T(s, a_greedy, ·)
    s ← s’
  until s is solved
  for all states s in the trial
    try to label s as solved
until s0 is solved
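Below is a minimal Python sketch of this trial loop, assuming the same hypothetical MDP interface as the earlier RTDP sketch. The labeling check here is a deliberately simplified residual test over a state's greedy successors; the actual CheckSolved procedure of Bonet & Geffner does a depth-first search over the greedy graph.

import random

def lrtdp(mdp, s0, eps=1e-3):
    V, solved = {}, set()
    val = lambda s: V.get(s, mdp.h(s))
    q = lambda s, a: mdp.C(s, a) + sum(p * val(s2) for s2, p in mdp.T(s, a))

    def revise(s):                                    # Bellman backup; returns greedy action + residual
        a = min(mdp.A(s), key=lambda a: q(s, a))
        residual = abs(q(s, a) - val(s))
        V[s] = q(s, a)
        return a, residual

    while s0 not in solved:
        s, trial = s0, []
        while s not in solved and not mdp.is_goal(s):
            a, _ = revise(s)
            trial.append(s)
            succs, probs = zip(*mdp.T(s, a))
            s = random.choices(succs, weights=probs)[0]
        for s in reversed(trial):                     # try to label states, deepest first
            a, residual = revise(s)
            children_ok = all(s2 in solved or mdp.is_goal(s2) for s2, _ in mdp.T(s, a))
            if residual < eps and children_ok:
                solved.add(s)
            else:
                break                                 # an unsolved descendant blocks labeling above it
    return V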
LRTDP
• terminates in finite time
– due to labeling procedure
• anytime
– focuses attention on more probable states
• fast convergence
– focuses attention on unconverged states
Picking a Successor, Take 2
• Labeled RTDP / RTDP: sample s’ ∝ T(s, a_greedy, s’)
  – Adv: more probable states are explored first
  – Labeling adv: no time wasted on converged states
  – Disadv: labeling is a hard constraint
  – Disadv: sampling ignores the “amount” of convergence
• What if we knew how much V(s’) is expected to change?
  – then sample s’ ∝ expected change
Upper Bounds in SSPs
• RTDP/LAO* maintain lower bounds
  – call it Vl
• Additionally associate an upper bound with s
  – Vu(s) ≥ V*(s)
• Define gap(s) = Vu(s) – Vl(s)
  – low gap(s): the state is closer to convergence
  – high gap(s): more expected change in its value
Backups on Bounds
• Recall monotonicity
• Backups on the lower bound
  – continue to be lower bounds
• Backups on the upper bound
  – continue to be upper bounds
• Intuitively
  – Vl will increase, converging to V*
  – Vu will decrease, converging to V*
Bounded RTDP [McMahan et al 05]
repeat
  s ← s0
  repeat  // trials
    identify a_greedy based on Vl
    FIND: sample s’ ∝ T(s, a_greedy, s’) · gap(s’)
    s ← s’
  until gap(s) < ε
  for all states s in the trial, in reverse order
    REVISE s
until gap(s0) < ε
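Here is a sketch of one BRTDP trial with the gap-weighted successor sampling, again assuming the hypothetical MDP interface from the earlier sketches plus an upper-bound initializer h_upper (an assumption made for illustration).

import random

def brtdp_trial(mdp, s0, Vl, Vu, eps=1e-3):
    """One BRTDP trial: follow the Vl-greedy policy, sampling successors
    in proportion to T(s, a, s') * gap(s'), then back up in reverse order."""
    gap = lambda s: Vu.get(s, mdp.h_upper(s)) - Vl.get(s, mdp.h(s))
    ql = lambda s, a: mdp.C(s, a) + sum(p * Vl.get(s2, mdp.h(s2)) for s2, p in mdp.T(s, a))
    qu = lambda s, a: mdp.C(s, a) + sum(p * Vu.get(s2, mdp.h_upper(s2)) for s2, p in mdp.T(s, a))

    s, trajectory = s0, []
    while gap(s) >= eps and not mdp.is_goal(s):
        trajectory.append(s)
        a = min(mdp.A(s), key=lambda a: ql(s, a))        # greedy w.r.t. the lower bound
        succs = mdp.T(s, a)
        weights = [p * gap(s2) for s2, p in succs]
        if sum(weights) == 0:                            # all successors already converged
            break
        s = random.choices([s2 for s2, _ in succs], weights=weights)[0]
    for s in reversed(trajectory):                       # REVISE in reverse order, on both bounds
        Vl[s] = min(ql(s, a) for a in mdp.A(s))
        Vu[s] = min(qu(s, a) for a in mdp.A(s))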
RTDP Trial
[Figure: one Bellman backup at s0: Q_{n+1}(s0,a) is computed for each of a1, a2, a3 from the successors’ current J_n values; J_{n+1}(s0) is their minimum, and the greedy action (here a_greedy = a2) is simulated toward the goal]

Greedy “On-Policy” RTDP without execution: using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize.
Comments
• Properties
  – if all states are visited infinitely often, then Jn → J*
  – only relevant states will be considered
    • A state is relevant if the optimal policy could visit it.
    • Notice the emphasis on “optimal policy”: just because a rough neighborhood surrounds the National Mall doesn’t mean that you will need to know what to do in that neighborhood.
• Advantages
  – Anytime: more probable states are explored quickly
• Disadvantages
  – complete convergence is slow!
  – no termination condition

Do we care about complete convergence? Think Capt. Sullenberger.
The “Heuristic”
• The value function is the SSP Bellman equation: minimize, over actions, the action’s cost plus the expected value of its successor states.
• They approximate it by replacing the expectation over outcomes with the single best (lowest-value) outcome.
• What if we instead pick the s’ corresponding to the highest probability?
• Exactly what are they relaxing? They are assuming that they can make the best outcome of the action happen..
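A hedged sketch of this relaxed backup, in the same hypothetical MDP interface as before: the expectation over outcomes in the Bellman equation is replaced by the single best outcome (or, in the most-likely variant, the single most probable one).

def relaxed_backup(mdp, s, V, mode="min-min"):
    """One relaxed Bellman backup usable as a heuristic.
       mode = "min-min":      assume the cheapest outcome of each action happens.
       mode = "most-likely":  assume the most probable outcome happens."""
    def outcome_value(a):
        outcomes = mdp.T(s, a)                              # list of (s', prob)
        if mode == "min-min":
            return min(V.get(s2, mdp.h(s2)) for s2, _ in outcomes)
        s2, _ = max(outcomes, key=lambda sp: sp[1])         # highest-probability outcome
        return V.get(s2, mdp.h(s2))
    return min(mdp.C(s, a) + outcome_value(a) for a in mdp.A(s))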
UCT: A Monte-Carlo Planning Algorithm
• UCT [Kocsis & Szepesvari, 2006] computes a solution by simulating the current best policy and improving it
  – Similar principle as RTDP
  – But action selection, value updates, and guarantees are different
  – Useful when we have
    • an enormous reachable state space
    • a high-entropy T (up to 2^|X| outcomes per action, many of them likely)
      – Building determinizations can be super-expensive
      – Doing Bellman backups can be super-expensive
• Success stories:
  – Go (thought impossible in ’05, human grandmaster level at 9x9 in ’08)
  – Klondike Solitaire (wins 40% of games)
  – General Game Playing Competition
  – Real-Time Strategy Games
  – Probabilistic Planning Competition
  – The list is growing…
Background: Multi-Armed Bandit Problem
• Select an arm that probably (with high probability) has approximately the best expected reward
• Use as few simulator calls (arm pulls) as possible
[Figure: a single state s with arms a1, a2, …, ak yielding rewards R(s,a1), R(s,a2), …, R(s,ak)]
  Just like an FH MDP with horizon 1!
Slide courtesy of A. Fern
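For concreteness, here is a minimal UCB1-style arm-selection sketch (the bandit rule UCT reuses at every tree node); the pull/reward interface is illustrative, not from the slides.

import math

def ucb1(pull, n_arms, n_pulls=1000, c=1.4):
    """UCB1: pull the arm maximizing mean reward plus an exploration bonus.
       `pull(i)` is an assumed simulator call returning a stochastic reward."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(n_pulls):
        if t < n_arms:                       # pull every arm once first
            arm = t
        else:
            arm = max(range(n_arms),
                      key=lambda i: means[i] + c * math.sqrt(math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # running average
    return max(range(n_arms), key=lambda i: means[i])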
UCT Example
• Build a state-action tree; at a leaf node, perform a random rollout. Initially the tree is a single leaf.
  [Figure: from the current world state, the rollout policy reaches a terminal state with reward = 1]
• Must select each action at a node at least once.
  [Figure: the other action’s rollout reaches a terminal state with reward = 0; returns are recorded along the sampled trajectories]
• When all actions at a node have been tried once, select actions according to the tree policy; below the tree, continue with the rollout policy.
  [Figure: the tree grows as more trajectories are sampled, and the stored averages (e.g. 1/2) are updated along the sampled paths]
• What is an appropriate tree policy? Rollout policy?
Slides courtesy of A. Fern
UCT Details
• Rollout policy:
  – Basic UCT uses random
• Tree policy:
  – Q(s,a): average reward received in current trajectories after taking action a in state s
  – n(s,a): number of times action a has been taken in s
  – n(s): number of times state s has been encountered

    π_UCT(s) = argmax_a [ Q(s,a) + c·√( ln n(s) / n(s,a) ) ]

  – The exploration-term constant c is a theoretical constant that must be selected empirically in practice. Setting it to the distance to the horizon guarantees arriving at the optimal policy eventually, provided the rewards are suitably bounded.
Slide courtesy of A. Fern
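A small Python sketch of this tree policy, with the Q and count statistics kept in dictionaries (the data structures and names are illustrative assumptions, not from the slides).

import math

def uct_tree_policy(s, actions, Q, n_s, n_sa, c=1.4):
    """pi_UCT(s) = argmax_a Q(s,a) + c * sqrt(ln n(s) / n(s,a)).
       Any untried action is selected first (its bonus is effectively infinite)."""
    untried = [a for a in actions if n_sa.get((s, a), 0) == 0]
    if untried:
        return untried[0]
    return max(actions,
               key=lambda a: Q.get((s, a), 0.0) + c * math.sqrt(math.log(n_s[s]) / n_sa[(s, a)]))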
UCT Example (contd)
When all node actions have been tried once, select the action according to the tree policy
  π_UCT(s) = argmax_a [ Q(s,a) + c·√( ln n(s) / n(s,a) ) ]
[Figure: over successive iterations, the action-value averages at the root (e.g. 1/2, then 1/3) are updated along the sampled trajectories]
Slide courtesy of A. Fern
UCT Summary & Theoretical Properties
• To select an action at a state s
  – Build a tree using N iterations of Monte-Carlo tree search
    • Default (rollout) policy is uniform random, up to level L
    • Tree policy is based on the bandit rule
  – Select the action that maximizes Q(s,a)
    (note that this final action selection does not take the exploration term into account, just the Q-value estimate)
• The more simulations, the more accurate
  – Guaranteed to pick suboptimal actions exponentially rarely after convergence (under some assumptions)
• Possible improvements
  – Initialize the state-action pairs with a heuristic (need to pick a weight)
  – Think of a better-than-random rollout policy
Slide courtesy of A. Fern
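Putting the pieces together, here is a simplified end-to-end sketch of UCT action selection for a generative simulator step(s, a) -> (s', reward, done), which is an assumed interface. For brevity it updates statistics for every state on the sampled trajectory instead of adding exactly one new tree node per iteration and running an untracked random rollout below it, which is what the example slides depict.

import math, random
from collections import defaultdict

def uct_select_action(step, actions, s_root, n_iters=1000, horizon=50, c=1.4):
    n_s = defaultdict(int)        # n(s)
    n_sa = defaultdict(int)       # n(s, a)
    Q = defaultdict(float)        # running average of returns after (s, a)

    def simulate(s, depth):
        if depth >= horizon:
            return 0.0
        untried = [a for a in actions if n_sa[(s, a)] == 0]
        if untried:                                   # try each action at a node at least once
            a = random.choice(untried)
        else:                                         # tree policy (UCB1 rule)
            a = max(actions, key=lambda a:
                    Q[(s, a)] + c * math.sqrt(math.log(n_s[s]) / n_sa[(s, a)]))
        s2, r, done = step(s, a)
        ret = r if done else r + simulate(s2, depth + 1)
        n_s[s] += 1
        n_sa[(s, a)] += 1
        Q[(s, a)] += (ret - Q[(s, a)]) / n_sa[(s, a)]  # incremental average of returns
        return ret

    for _ in range(n_iters):
        simulate(s_root, 0)
    # final choice ignores the exploration term, as noted in the summary above
    return max(actions, key=lambda a: Q[(s_root, a)])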
LRTDP or UCT?
AAAI 2012!
Online Action Selection
[Figure: offline, the entire policy is computed before execution starts; online, “select action / execute” steps alternate during execution]

Off-line policy generation
• First compute the whole policy
  – Get the initial state
  – Compute the optimal policy given the initial state and the goals
• Then just execute the policy
  – Loop
    • Do the action recommended by the policy
    • Get the next state
  – Until reaching a goal state
• Pros: Can anticipate all problems
• Cons: May take too much time to start executing

Online action selection
• Loop
  – Compute the best action for the current state
  – Execute it
  – Get the new state
• Pros: Provides a fast first response
• Cons: May paint itself into a corner..
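A minimal sketch of the online loop on the right, assuming a hypothetical plan_action(state, goal) routine (for instance backed by FF on a determinization, as in FF-Replan below) and an environment with is_goal/execute methods.

def online_action_selection(env, goal, plan_action):
    state = env.current_state()
    while not env.is_goal(state):
        a = plan_action(state, goal)   # compute the best action for the current state
        state = env.execute(a)         # execute it, get the new state
    return state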
FF-Replan
• A simple replanner
• Determinizes the probabilistic problem
  – If an action has multiple effect sets with different probabilities, either
    • select the most likely one (most-likely determinization), or
    • split the action into multiple actions, one for each effect set (all-outcomes determinization)
• Solves for a plan in the determinized problem
[Figure: a plan a1, a2, … from S to G found in the determinized problem]
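A hedged sketch of the two determinizations on a toy probabilistic-action representation (an action as a list of (probability, effects) pairs); this data structure is an illustrative assumption, not the PPDDL the competition actually used.

def most_likely_determinization(actions):
    """Keep, for each action, only its most probable effect set."""
    det = {}
    for name, outcomes in actions.items():          # outcomes: list of (prob, effects)
        prob, effects = max(outcomes, key=lambda o: o[0])
        det[name] = effects
    return det

def all_outcomes_determinization(actions):
    """Split each action into one deterministic action per effect set."""
    det = {}
    for name, outcomes in actions.items():
        for i, (_, effects) in enumerate(outcomes):
            det[f"{name}_outcome{i}"] = effects
    return det

# Hypothetical example: "move" succeeds with 0.8, slips (no effect) with 0.2
actions = {"move": [(0.8, {"at_B": True, "at_A": False}),
                    (0.2, {})]}
print(most_likely_determinization(actions))
print(all_outcomes_determinization(actions))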
All Outcome Replanning (FFRA)   [ICAPS-07]
[Figure: an action with two probabilistic effects (probability1 → effect1, probability2 → effect2) is split into two deterministic actions, action1 → effect1 and action2 → effect2]
1st IPPC & Post-Mortem..

IPPC Competitors
• Most IPPC competitors used different approaches for offline policy generation.
• One group implemented a simple online “replanning” approach in addition to offline policy generation
  – Determinize the probabilistic problem (most-likely vs. all-outcomes)
  – Loop
    • Get the state S; call a classical planner (e.g. FF) with [S,G] as the problem
    • Execute the first action of the plan

Results and Post-mortem
• To everyone’s surprise, the replanning approach wound up winning the competition.
• Lots of hand-wringing ensued..
  – Umpteen reasons why such an approach should do quite badly..
  – Maybe we should require that the planners really, really use probabilities?
  – Maybe the domains should somehow be made “probabilistically interesting”?
• Current understanding:
  – No reason to believe that off-line policy computation must dominate online action selection
  – The “replanning” approach is just a degenerate case of hindsight optimization
Reducing calls to FF..
• We can reduce calls to FF by memoizing successes
  – If we were given s0 and sG as the problem, and solved it using our determinization to get the plan s0—a0—s1—a1—s2—a2—s3…an—sG
  – Then, in addition to sending a0 to the simulator, we can memoize {si—ai} as a partial policy.
    • Whenever a new state is given by the simulator, we can check whether it is already in the partial policy
    • Additionally, FF-Replan can consider every state in the partial-policy table as a goal state (in that if it reaches one of them, it knows how to get to the goal state..)
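A sketch of this memoization idea wrapped around a hypothetical ff_plan(state, goal) call on the determinized problem; the partial policy is just a state-to-action table filled in from each plan found.

def replan_with_memoization(env, goal, ff_plan):
    partial_policy = {}                        # memoized {state: action} pairs
    state = env.current_state()
    while not env.is_goal(state):
        if state not in partial_policy:        # only call FF on states we haven't seen
            plan = ff_plan(state, goal)        # assumed to return [(s0, a0), (s1, a1), ...]
            for s, a in plan:
                partial_policy.setdefault(s, a)
        state = env.execute(partial_policy[state])
    return state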
Hindsight Optimization for Anticipatory Planning/Scheduling
• Consider a deterministic planning (scheduling) domain where the goals arrive probabilistically
  – Using up resources and/or doing greedy actions may preclude you from exploiting later opportunities
• How do you select actions to perform?
  – Answer: If you have a distribution over goal arrivals, then
    • Sample goals up to a certain horizon using this distribution
    • Now we have a deterministic planning problem with known goals
    • Solve it; do the first action from it.
  – Accuracy can be improved with multiple samples
• FF-Hop uses this idea for stochastic planning. In anticipatory planning, the uncertainty is exogenous (it is the uncertain arrival of goals). In stochastic planning, the uncertainty is endogenous (the actions have multiple outcomes).
Probabilistic Planning (goal-oriented): Maximize Goal Achievement
[Figure: a two-step lookahead tree from the initial state I with actions A1 and A2, probabilistic outcomes at Time 1 and Time 2, dead ends, and goal states; left outcomes are more likely]
Problems of FF-Replan, and a better alternative: sampling
• FF-Replan’s static determinizations don’t respect probabilities.
• We need “probabilistic and dynamic determinization”:
  – Sample future outcomes and determinize in hindsight
  – Each sampled future becomes a known-future deterministic problem
Implementation: FF-Hindsight
• Constructs a set of futures
• Solves the planning problem for each H-horizon future using FF
• Sums the rewards of each of the plans
• Chooses the action with the largest Q_HS value
Hindsight Optimization (Online Computation of V_HS)
• Pick the action a with the highest Q(s,a,H), where
  – Q(s,a,H) = R(s,a) + Σ_s’ T(s,a,s’) V*(s’,H-1)
• Compute V* by sampling
  – An H-horizon future F_H for M = [S,A,T,R] is a mapping of state, action and time (h < H) to a state: S × A × h → S
  – Common-random-number (correlated) vs. independent futures..
  – Time-independent vs. time-dependent futures
• Value of a policy π for F_H: R(s, F_H, π)
• V*(s,H) = max_π E_{F_H} [ R(s, F_H, π) ]
  – But this is still too hard to compute..
• Let’s swap the max and the expectation:
  – V_HS(s,H) = E_{F_H} [ max_π R(s, F_H, π) ]
  – max_π R(s, F_{H-1}, π) is approximated by an FF plan on the sampled (known-future) problem
• V_HS overestimates V*
  – Why? Intuitively, because V_HS can assume that it can use different policies in different futures, while V* needs to pick one policy that works best (in expectation) in all futures.
  – But then, V_FFRa overestimates V_HS
  – Viewed in terms of J*, V_HS is the more informed admissible heuristic..
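A hedged sketch of the online computation above: sample a set of futures, solve each determinized (known-future) problem with a hypothetical solve_deterministic planner standing in for FF, and average to estimate Q(s,a,H).

def hindsight_q(s, a, H, mdp, sample_future, solve_deterministic, n_samples=30):
    """Estimate Q(s,a,H) = R(s,a) + E_F[ max_pi R(s', F, pi) ] by sampling futures.
       sample_future(H) -> a mapping (state, action, h) -> next state  (one known future)
       solve_deterministic(s, future, horizon) -> best achievable reward in that future."""
    total = 0.0
    for _ in range(n_samples):
        future = sample_future(H)
        s2 = future[(s, a, 0)]                       # outcome of a in this sampled future
        total += solve_deterministic(s2, future, H - 1)
    return mdp.R(s, a) + total / n_samples

def hindsight_action(s, H, mdp, sample_future, solve_deterministic):
    return max(mdp.A(s),
               key=lambda a: hindsight_q(s, a, H, mdp, sample_future, solve_deterministic))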
Probabilistic Planning (goal-oriented): Maximize Goal Achievement
[Figure, repeated from above: the two-step lookahead tree from I with actions A1 and A2, probabilistic outcomes, dead ends, and goal states; left outcomes are more likely]
Improvement Ideas
• Reuse
– Generated futures that are still relevant
– Scoring for action branches at each step
– If expected outcomes occur, keep the plan
• Future generation
– Not just probabilistic
– Somewhat even distribution of the space
• Adaptation
– Dynamic width and horizon for sampling
– Actively detect and avoid unrecoverable failures
on top of sampling
Hindsight Sample 1: Maximize Goal Achievement
[Figure: one sampled future over the same lookahead tree (left outcomes are more likely); in this sample A1 reaches the goal and A2 does not, so the sample scores A1: 1, A2: 0]
Factored Representations of MDPs: Actions
• Actions can be represented directly in terms of their effects on the individual state variables (fluents). The CPTs of the BNs can be represented compactly too!
  – Write a Bayes network relating the values of the fluents in the states before and after the action
    • Bayes networks representing fluents at different time points are called “dynamic Bayes networks”
    • We look at 2TBNs (2-time-slice dynamic Bayes nets)
• Go further by using the STRIPS assumption
  – Fluents not affected by the action are not represented explicitly in the model
  – This is called the Probabilistic STRIPS Operator (PSO) model
[Figure: the 2TBN for an example action (CLK)]
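A toy sketch of what such a factored (2TBN/PSO-style) action might look like as a data structure: each affected fluent gets a small conditional distribution over its next value given a few parent fluents, and unaffected fluents are simply omitted (the STRIPS assumption). The fluents and the action here are made up for illustration.

import random

# A hypothetical PSO/2TBN-style encoding of one probabilistic action, "pick_up":
# only the fluents the action can change appear; each maps the values of its
# parent fluents (current slice) to a distribution over its value in the next slice.
pick_up = {
    # P(holding' | holding, gripper_dry)
    "holding": {
        (False, True):  {True: 0.95, False: 0.05},   # dry gripper: pick-up usually works
        (False, False): {True: 0.50, False: 0.50},   # wet gripper: coin flip
        (True,  True):  {True: 1.0},
        (True,  False): {True: 1.0},
    },
}

def sample_next_value(action_2tbn, fluent, parent_values, rng):
    """Sample the next value of `fluent` under the factored action model."""
    dist = action_2tbn[fluent][parent_values]
    values, probs = zip(*dist.items())
    return rng.choices(values, weights=probs)[0]

print(sample_next_value(pick_up, "holding", (False, True), random))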
Factored Representations: Reward, Value and Policy Functions
• Reward functions can be represented in factored form too. Possible representations include
  – Decision trees (made up of fluents)
  – ADDs (algebraic decision diagrams)
• Value functions are like reward functions (so they too can be represented similarly)
• Bellman updates can then be done directly on the factored representations..
  – SPUDD’s use of ADDs: direct manipulation of ADDs during the Bellman update
Ideas for Efficient Algorithms..
• Use heuristic search (and reachability information)
  – LAO*, RTDP
• Use execution and/or simulation
  – “Actual execution”: Reinforcement learning (the main motivation for RL is to “learn” the model)
  – “Simulation”: simulate the given model to sample possible futures
    • Policy rollout, hindsight optimization, etc.
• Use “factored” representations
  – Factored representations for actions, reward functions, values and policies
  – Directly manipulating factored representations during the Bellman update
A Plan is a Terrible Thing to Waste
• Suppose we have a plan
  – s0—a0—s1—a1—s2—a2—s3…an—sG
  – We realize that this tells us not just the estimated value of s0, but also of s1, s2, …, sn
  – So we don’t need to compute the heuristic for them again
• Is that all?
  – If we have states and actions in factored representation, then we can explain exactly what aspects of si are relevant for the plan’s success.
  – The “explanation” is a proof of correctness of the plan
    » It can be based on regression (if the plan is a sequence) or on a causal proof (if the plan is partially ordered).
  – The explanation will typically be just a subset of the literals making up the state
    » That means the plan suffix from si is actually relevant in many more states: all the states consistent with that explanation
Triangle Table Memoization
• Use triangle tables / memoization
[Figure: two blocks-world problems; the second (restacking A and B) is a sub-problem of the first (restacking A, B and C)]
If the above problem is solved, then we don’t need to call FF again for the one below.
Explanation-based Generalization (of Successes and Failures)
• Suppose we have a plan P that solves a problem [S, G].
• We can first find out which aspects of S this plan actually depends on
  – Explain (prove) the correctness of the plan, and see which parts of S actually contribute to this proof
  – Now you can memoize this plan for just that subset of S
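A hedged sketch of the regression idea for a sequential STRIPS-style plan: regress the goal backwards through the actions to find the subset of the initial state the plan actually depends on. The action representation (dicts with 'pre' and 'add' literal sets) is an assumption made for this sketch.

def explain_plan(plan, goal):
    """Regress `goal` through a sequential plan. Each action is a dict with
       'pre' and 'add' sets of literals (delete effects are not needed here:
       for a correct plan, no literal still needed later is clobbered).
       Returns the subset of the initial state the plan actually depends on."""
    needed = set(goal)
    for action in reversed(plan):
        needed -= action["add"]      # literals this action achieves need not hold before it
        needed |= action["pre"]      # but its preconditions must
    return needed                    # memoize the plan for any state satisfying `needed`

# Hypothetical usage: a one-action plan that achieves the goal
plan = [{"pre": {"at_door", "door_open"}, "add": {"inside"}}]
print(explain_plan(plan, goal={"inside"}))   # -> {'at_door', 'door_open'}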
Relaxations for Stochastic Planning
• Determinizations can also be used as a basis for heuristics to initialize V for value iteration [mGPT; GOTH etc.]
• Heuristics come from relaxation
• We can relax along two separate dimensions:
  – Relax –ve interactions
    • Consider +ve interactions alone, using relaxed planning graphs
  – Relax uncertainty
    • Consider determinizations
  – Or a combination of both!
Solving Determinizations
• If we relax –ve interactions
  – Then compute a relaxed plan
    • Admissible if the optimal relaxed plan is computed
    • Inadmissible otherwise
• If we keep –ve interactions
  – Then use a deterministic planner (e.g. FF/LPG)
    • Inadmissible unless the underlying planner is optimal
Dimensions of Relaxation
[Figure: a 2×2 grid with consideration of negative interactions increasing along one axis and consideration of uncertainty along the other; quadrant 1 = relaxed-plan heuristic, 2 = McLUG, 3 = FF/LPG, 4 = limited-width stochastic planning]

Reducing uncertainty: bound the number of stochastic outcomes → stochastic “width”.

Dimensions of Relaxation
  –ve interactions \ Uncertainty:  None          | Some                              | Full
  None                             Relaxed Plan  | McLUG                             |
  Some                                           |                                   |
  Full                             FF/LPG        | Limited-width stochastic planning |
(Expressiveness v. cost: the heuristics become more expressive, and more expensive to compute, as more of each dimension is considered.)
Node Expansions v. Heuristic Computation Cost
[Figure: a chart comparing nodes expanded against heuristic computation cost for h=0, FF-Replan (FFR), FF, and McLUG]