Efficient Approaches for Solving Large-scale MDPs
(Slides on LRTDP and UCT are courtesy of Mausam/Kolobov)

Ideas for Efficient Algorithms..
• Use heuristic search (and reachability information)
  – LAO*, RTDP
• Use execution and/or simulation
  – "Actual execution": Reinforcement learning (the main motivation for RL is to "learn" the model)
  – "Simulation": simulate the given model to sample possible futures
    • Policy rollout, hindsight optimization, etc.
• Use "factored" representations
  – Factored representations for actions, reward functions, values and policies
  – Directly manipulating factored representations during the Bellman update

Real Time Dynamic Programming [Barto et al. 95]
• Original motivation: an agent acting in the real world
• Trial:
  – simulate the greedy policy starting from the start state
  – perform Bellman backups on the visited states
  – stop when you hit the goal
• RTDP: repeat trials forever (no termination condition!)
  – Converges in the limit as #trials → ∞
• We will do the discussion in terms of SSP MDPs; recall that they subsume infinite-horizon MDPs

Trial
(figures omitted: a small example with states s0..s8 and goal Sg, showing heuristic values h being replaced by backed-up values V along the simulated trajectory)
  start at the start state
  repeat
    perform a Bellman backup
    simulate the greedy action
  until you hit the goal
• RTDP: repeat such trials forever, backing up all states on the trajectory

RTDP Family of Algorithms  (a runnable sketch of this trial loop appears below)
  repeat
    s ← s0
    repeat  // trials
      REVISE s; identify a_greedy
      FIND: pick s' such that T(s, a_greedy, s') > 0
      s ← s'
    until s ∈ G
  until termination test

Labeling
• Admissible heuristic & monotonicity
  ⇒ V(s) ≤ V*(s)
  ⇒ Q(s,a) ≤ Q*(s,a)
• Label a state s as solved if V(s) has converged:
  – Res_V(s) < ε ⇒ V(s) won't change! Label s as solved
• Labeling (contd.)
  – Res_V(s) < ε and the successor s' under the best action is already solved ⇒ V(s) won't change! Label s as solved
  – Res_V(s) < ε and Res_V(s') < ε ⇒ V(s) and V(s') won't change! Label s and s' as solved
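To make the trial loop concrete, here is a minimal Python sketch of RTDP for an SSP (cost-minimization) MDP. It is an illustrative sketch, not the authors' code; the MDP interface (`actions`, `cost`, `transitions`, `is_goal`) and the admissible heuristic `h` are assumed names introduced only for this example.

    import random

    class HeuristicValue(dict):
        """Value table initialized lazily from an admissible heuristic h."""
        def __init__(self, h):
            super().__init__()
            self.h = h
        def __missing__(self, s):
            self[s] = self.h(s)
            return self[s]

    def bellman_backup(V, mdp, s):
        """Revise V(s) with a Bellman backup; return the greedy action."""
        best_q, best_a = float("inf"), None
        for a in mdp.actions(s):
            q = mdp.cost(s, a) + sum(p * V[s2] for s2, p in mdp.transitions(s, a))
            if q < best_q:
                best_q, best_a = q, a
        V[s] = best_q
        return best_a

    def rtdp(mdp, s0, h, n_trials=1000):
        """Plain RTDP: repeat greedy trials from s0, backing up visited states.
        RTDP itself has no termination condition, so we simply cap the trials."""
        V = HeuristicValue(h)
        for _ in range(n_trials):
            s = s0
            while not mdp.is_goal(s):
                a = bellman_backup(V, mdp, s)
                # simulate the greedy action: sample s' ~ T(s, a, .)
                succs, probs = zip(*mdp.transitions(s, a))
                s = random.choices(succs, weights=probs)[0]
        return V

Note that only states actually visited by greedy trials ever get a value entry, which is exactly the reachability advantage the slides emphasize.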
label s, s’ as solved 16 Labeled RTDP [Bonet&Geffner 03b] repeat s à s0 label all goal states as solved repeat //trials REVISE s; identify agreedy FIND: sample s’ from T(s, agreedy, s’) s à s’ until s is solved for all states s in the trial try to label s as solved until s0 is solved 17 LRTDP • terminates in finite time – due to labeling procedure • anytime – focuses attention on more probable states • fast convergence – focuses attention on unconverged states 18 Picking a Successor Take 2 • Labeled RTDP/RTDP: sample s’ / T(s, agreedy, s’) – Adv: more probable states are explored first – Labeling Adv: no time wasted on converged states – Disadv: labeling is a hard constraint – Disadv: sampling ignores “amount” of convergence • If we knew how much V(s) is expected to change? – sample s’ / expected change 19 Upper Bounds in SSPs • RTDP/LAO* maintain lower bounds – call it Vl • Additionally associate upper bound with s – Vu(s) ¸ V*(s) • Define gap(s) = Vu(s) – Vl(s) – low gap(s): more converged a state – high gap(s): more expected change in its value 20 Backups on Bounds • Recall monotonicity • Backups on lower bound – continue to be lower bounds • Backups on upper bound – continues to be upper bounds • Intuitively – Vl will increase to converge to V* – Vu will decrease to converge to V* 21 Bounded RTDP [McMahan et al 05] repeat s à s0 repeat //trials identify agreedy based on Vl FIND: sample s’ / T(s, agreedy, s’).gap(s’) s à s’ until gap(s) < ² for all states s in trial in reverse order REVISE s until gap(s0) < ² 22 RTDP Trial Qn+1(s0,a) agreedy = a2 Min a1 Jn+1(s0) s0 a2 Jn Jn ? Jn ? Jn a3 ? Jn Jn Jn Goal Greedy “On-Policy” RTDP without execution Using the current utility values, select the action with the highest expected utility (greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back—until the values stabilize Comments • Properties – if all states are visited infinitely often then Jn → J* – Only relevant states will be considered • A state is relevant if the optimal policy could visit it. • Notice emphasis on “optimal policy”—just because a rough neighborhood surrounds National Mall doesn’t mean that you will need to know what to do in that neighborhood • Advantages – Anytime: more probable states explored quickly • Disadvantages – complete convergence is slow! – no termination condition Do we care about complete convergence? Think Cpt. Sullenberger 9/26 The “Heuristic” • The value function is • They approximate it by What if we pick the s’ corresponding to the highest P? Exactly what are they relaxing? They are assuming that they can make the best outcome of the action happen.. 
UCT: A Monte-Carlo Planning Algorithm
• UCT [Kocsis & Szepesvari, 2006] computes a solution by simulating the current best policy and improving it
  – Similar principle as RTDP
  – But the action selection, value updates, and guarantees are different
  – Useful when we have
    • an enormous reachable state space
    • a high-entropy T (2^|X| outcomes per action, many of them likely)
      – building determinizations can be super-expensive
      – doing Bellman backups can be super-expensive
• Success stories:
  – Go (thought impossible in '05, human grandmaster level at 9x9 in '08)
  – Klondike Solitaire (wins 40% of games)
  – General Game Playing Competition
  – Real-Time Strategy games
  – Probabilistic Planning Competition
  – The list is growing...

Background: Multi-Armed Bandit Problem
• Select an arm that probably (with high probability) has approximately the best expected reward
• Use as few simulator calls (arm pulls) as possible
• A bandit over arms a1..ak with rewards R(s,a1)..R(s,ak) is just like an FH MDP with horizon 1!
(Bandit and UCT example slides courtesy of A. Fern)

UCT Example
(figures omitted: a state-action tree grown from the current world state over several slides)
• Build a state-action tree rooted at the current world state; initially the tree is a single leaf
• At a leaf node, perform a random rollout with the rollout policy; the terminal reward (1 or 0 in the example) is propagated back up the sampled trajectory
• Every action at a node must be selected at least once
• Once all of a node's actions have been tried once, select actions according to the tree policy
• What is an appropriate tree policy? Rollout policy?

UCT Details
• Rollout policy:
  – Basic UCT uses a random policy
• Tree policy:
  – Q(s,a): average reward received in the current trajectories after taking action a in state s
  – n(s,a): number of times action a was taken in s
  – n(s): number of times state s was encountered

    UCT(s) = argmax_a  Q(s,a) + c · sqrt( ln n(s) / n(s,a) )

  – c is a theoretical constant that must be selected empirically in practice
  – the second term is the exploration term; setting c to the distance to the horizon guarantees arriving at the optimal policy eventually, if R ..

UCT Example (contd.)
(figures omitted: with the tree policy in place, the Q estimates at the root, e.g. 1/2 and 1/3 in the figures, are refined as more simulated trajectories are added to the tree)
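Here is a small Python sketch of the UCB-style tree policy shown in the formula above. The node statistics (`Q`, `n_sa`, `n_s`) are illustrative names introduced for this example, not from any particular MCTS library.

    import math

    def uct_select(actions, Q, n_s, n_sa, c=1.0):
        """Tree policy: pick argmax_a Q(s,a) + c * sqrt(ln n(s) / n(s,a)).
        Q maps action -> average reward, n_sa maps action -> visit count,
        n_s is the visit count of the state; every action is assumed to
        have been tried at least once (so n_sa[a] >= 1)."""
        def ucb(a):
            return Q[a] + c * math.sqrt(math.log(n_s) / n_sa[a])
        return max(actions, key=ucb)

The exploration bonus shrinks as n(s,a) grows, so an arm is revisited either because its average reward is high or because it has been tried relatively rarely.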
UCT Summary & Theoretical Properties
• To select an action at a state s
  – Build a tree using N iterations of Monte-Carlo tree search
    • The default (rollout) policy is uniform random up to level L
    • The tree policy is based on the bandit rule
  – Select the action that maximizes Q(s,a) (this final action selection does not take the exploration term into account, just the Q-value estimate)
• The more simulations, the more accurate the estimates
  – Guaranteed to pick suboptimal actions exponentially rarely after convergence (under some assumptions)
• Possible improvements
  – Initialize the state-action pairs with a heuristic (need to pick a weight)
  – Think of a better-than-random rollout policy
(Slides courtesy of A. Fern)

LRTDP or UCT? AAAI 2012!

Online Action Selection
(figure omitted: timelines contrasting one long policy-computation phase followed by pure execution vs. interleaved select/execute steps)
Off-line policy generation
• First compute the whole policy
  – Get the initial state
  – Compute the optimal policy given the initial state and the goals
• Then just execute the policy
  – Loop
    • Do the action recommended by the policy
    • Get the next state
  – Until reaching a goal state
• Pros: can anticipate all problems
• Cons: may take too much time to start executing
Online action selection
• Loop
  – Compute the best action for the current state
  – Execute it
  – Get the new state
• Pros: provides a fast first response
• Cons: may paint itself into a corner..

FF-Replan
• Simple replanner
• Determinizes the probabilistic problem
  – If an action has multiple effect sets with different probabilities, either
    • select the most likely one (most-likely determinization), or
    • split the action into multiple deterministic actions, one for each effect set (all-outcome determinization)
• Solves for a plan in the determinized problem
(figure omitted: a deterministic plan from S to G through the determinized state space)

All Outcome Replanning (FFRA) [ICAPS-07]
(figure omitted: an action with Effect 1 at Probability1 and Effect 2 at Probability2 is split into two deterministic actions, Action 1 producing Effect 1 and Action 2 producing Effect 2)

1st IPPC & Post-Mortem..
IPPC Competitors
• Most IPPC competitors used different approaches for offline policy generation.
• One group implemented a simple online "replanning" approach in addition to offline policy generation
  – Determinize the probabilistic problem (most-likely vs. all-outcomes)
  – Loop
    • Get the state S; call a classical planner (e.g., FF) with [S, G] as the problem
    • Execute the first action of the plan
Results and Post-mortem
• To everyone's surprise, the replanning approach wound up winning the competition.
• Lots of hand-wringing ensued..
  – There are umpteen reasons why such an approach should do quite badly..
  – Maybe we should require that the planners really, really use probabilities?
  – Maybe the domains should somehow be made "probabilistically interesting"?
• Current understanding:
  – No reason to believe that off-line policy computation must dominate online action selection
  – The "replanning" approach is just a degenerate case of hindsight optimization

Reducing calls to FF..  (a code sketch of this memoization follows below)
• We can reduce calls to FF by memoizing successes
  – If we were given s0 and sG as the problem, and solved it using our determinization to get the plan s0 →a0→ s1 →a1→ s2 →a2→ s3 ... →an→ sG
  – Then, in addition to sending a0 to the simulator, we can memoize the pairs {si, ai} as a partial policy.
• Whenever a new state is given by the simulator, we can check whether it is already in the partial policy
• Additionally, FF-Replan can consider every state in the partial policy table as a goal state (in that, if it reaches one of them, it already knows how to get to the goal state..)
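A sketch of the replan-with-memoization loop just described, in Python. The deterministic planner call (`classical_plan`), the simulator interface, and the determinized domain object are stand-ins introduced for illustration (FF itself is an external planner); this is not an actual FF-Replan API.

    def ff_replan_with_memo(simulator, determinized_domain, classical_plan, goal_test):
        """Online replanning: keep a partial policy {state: action} built from
        previous classical plans, and only call the planner on uncovered states."""
        partial_policy = {}
        s = simulator.initial_state()
        while not goal_test(s):
            if s not in partial_policy:
                # Call the classical planner on the determinized problem from s.
                # plan is assumed to be a list of (state, action) pairs ending at a goal.
                plan = classical_plan(determinized_domain, s)
                # Memoize every state along the plan, not just the first action.
                for si, ai in plan:
                    partial_policy.setdefault(si, ai)
            a = partial_policy[s]
            s = simulator.execute(a)   # the simulator samples the real outcome
        return partial_policy

If an unexpected outcome lands the agent in a state already covered by an earlier plan, no new planner call is needed, which is the point of the memoization.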
Hindsight Optimization for Anticipatory Planning/Scheduling
• Consider a deterministic planning (scheduling) domain where the goals arrive probabilistically
  – Using up resources and/or doing greedy actions may preclude you from exploiting later opportunities
• How do you select actions to perform?
  – Answer: if you have a distribution over goal arrivals, then
    • sample goals up to a certain horizon using this distribution
    • now we have a deterministic planning problem with known goals
    • solve it; do the first action from it
  – Accuracy can be improved with multiple samples
• FF-Hop uses this idea for stochastic planning. In anticipatory planning, the uncertainty is exogenous (the uncertain arrival of goals). In stochastic planning, the uncertainty is endogenous (the actions have multiple outcomes).

Probabilistic Planning (goal-oriented)
(figure omitted: a two-step tree over actions A1/A2 from initial state I, with left outcomes more likely; some branches reach the goal state, others dead ends; the objective is to maximize goal achievement)

Problems of FF-Replan, and a better alternative: sampling
• FF-Replan's static determinizations don't respect probabilities.
• We need "probabilistic and dynamic determinization": sample future outcomes and determinize in hindsight
• Each sampled future becomes a known-future deterministic problem

Implementation: FF-Hindsight
• Constructs a set of futures
• Solves the planning problem for the H-horizon futures using FF
• Sums the rewards of each of the plans
• Chooses the action with the largest Q_hs value

Hindsight Optimization (Online Computation of V_HS)  (a code sketch follows below)
• Pick the action a with the highest Q(s,a,H), where
  – Q(s,a,H) = R(s,a) + Σ_s' T(s,a,s') V*(s',H-1)
  – Compute V* by sampling
• An H-horizon future F_H for M = [S,A,T,R] is a mapping of state, action and time (h < H) to a state
  – S × A × h → S
• Value of a policy π for F_H: R(s, F_H, π)
• V*(s,H) = max_π E_{F_H} [ R(s, F_H, π) ]
  – Common-random-number (correlated) vs. independent futures..
  – Time-independent vs. time-dependent futures
• But this is still too hard to compute.. so let's swap the max and the expectation:
  – V_HS(s,H) = E_{F_H} [ max_π R(s, F_H, π) ]
  – max_π R(s, F_{H-1}, π) is approximated by an FF plan
• V_HS is optimistic about V*. Why?
  – Intuitively, because V_HS can assume that it can use a different policy in each future, while V* needs to pick one policy that works best (in expectation) across all futures.
  – V_FFRa (the all-outcome determinization value) is even more optimistic than V_HS
  – Viewed in terms of the optimal cost J*, V_HS is thus a more informed admissible heuristic..

Improvement Ideas
• Reuse
  – Generated futures that are still relevant
  – Scoring for action branches at each step
  – If the expected outcomes occur, keep the plan
• Future generation
  – Not just probabilistic
  – A somewhat even distribution over the space
• Adaptation
  – Dynamic width and horizon for sampling
  – Actively detect and avoid unrecoverable failures on top of sampling

Hindsight Sample 1
(figure omitted: in this sampled future, the plan starting with A1 reaches the goal while A2 does not, so this sample scores A1: 1, A2: 0)
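To make the swapped max-and-expectation concrete, here is a minimal Python sketch of the hindsight Q-value estimation described above. The helpers `sample_future` and `solve_deterministic` (e.g., a call out to a classical planner such as FF on the determinized, known-future problem) are assumed stand-ins, not an existing API.

    def q_hindsight(s, a, H, sample_future, solve_deterministic, n_samples=30):
        """Estimate Q_HS(s, a, H): average, over sampled futures, of the best
        deterministic plan value achievable after taking a first in s.
        Each sampled future fixes every action outcome up to horizon H,
        turning the stochastic problem into a deterministic one."""
        total = 0.0
        for _ in range(n_samples):
            future = sample_future(H)                     # one known future F_H
            total += solve_deterministic(s, a, future)    # approximates max_pi R(s, F_H, pi)
        return total / n_samples

    def hop_action(s, H, actions, sample_future, solve_deterministic):
        """Hindsight optimization: pick the action with the highest estimated Q_HS."""
        return max(actions,
                   key=lambda a: q_hindsight(s, a, H, sample_future, solve_deterministic))

Using common (shared) sampled futures across the candidate actions corresponds to the correlated-futures variant mentioned above and reduces the variance of the comparison.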
Factored Representations for MDPs: Actions
• Actions can be represented directly in terms of their effects on the individual state variables (fluents). The CPTs of the BNs can be represented compactly too!
  – Write a Bayes network relating the values of the fluents in the state before and after the action
    • Bayes networks representing fluents at different time points are called "Dynamic Bayes Networks" (DBNs)
    • We look at 2TBNs (2-time-slice dynamic Bayes nets)
  – Go further by using the STRIPS assumption
    • Fluents not affected by the action are not represented explicitly in the model
    • Called the Probabilistic STRIPS Operator (PSO) model
(figure omitted: a 2TBN for an example action, including a clock variable CLK)

Factored Representations: Reward, Value and Policy Functions
• Reward functions can be represented in factored form too. Possible representations include
  – Decision trees (made up of fluents)
  – ADDs (algebraic decision diagrams)
• Value functions are like reward functions, so they too can be represented similarly
• The Bellman update can then be done directly using the factored representations..
  – SPUDD's use of ADDs: direct manipulation of ADDs in SPUDD (figures omitted)

Ideas for Efficient Algorithms.. (recap)
• Use heuristic search (and reachability information): LAO*, RTDP
• Use execution and/or simulation: reinforcement learning; policy rollout, hindsight optimization, etc.
• Use "factored" representations of actions, reward functions, values and policies, and directly manipulate them during the Bellman update

A Plan is a Terrible Thing to Waste
• Suppose we have a plan
  – s0 →a0→ s1 →a1→ s2 →a2→ s3 ... →an→ sG
  – We realize that this tells us not just the estimated value of s0, but also of s1, s2, ..., sn
  – So we don't need to compute the heuristic for them again
• Is that all?
  – If we have states and actions in a factored representation, then we can explain exactly what aspects of si are relevant for the plan's success.
  – The "explanation" is a proof of correctness of the plan
    • It can be based on regression (if the plan is a sequence) or on a causal proof (if the plan is partially ordered).
  – The explanation will typically be just a subset of the literals making up the state
    • That means the plan suffix from si may actually be relevant in many more states, namely those consistent with that explanation

Triangle Table Memoization
• Use triangle tables / memoization
(figure omitted: a blocks-world example; if the problem of stacking C on B on A has been solved, we don't need to call FF again for the sub-problem of stacking B on A)

Explanation-based Generalization (of Successes and Failures)
• Suppose we have a plan P that solves a problem [S, G].
• We can first find out which aspects of S this plan actually depends on
  – Explain (prove) the correctness of the plan, and see which parts of S actually contribute to this proof
  – Now you can memoize this plan for just that subset of S

Relaxations for Stochastic Planning
• Determinizations can also be used as a basis for heuristics to initialize V for value iteration [mGPT, GOTH, etc.]
• Heuristics come from relaxation
• We can relax along two separate dimensions:
  – Relax the negative interactions
    • Consider the positive interactions alone, using relaxed planning graphs
  – Relax the uncertainty
    • Consider determinizations
  – Or a combination of both!

Solving Determinizations  (a code sketch of all-outcome determinization follows below)
• If we relax the negative interactions
  – then compute a relaxed plan
    • admissible if the optimal relaxed plan is computed
    • inadmissible otherwise
• If we keep the negative interactions
  – then use a deterministic planner (e.g., FF/LPG)
    • inadmissible unless the underlying planner is optimal
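As a concrete illustration of the determinization relaxation, here is a small Python sketch that splits a probabilistic action into one deterministic action per outcome (the all-outcome determinization used by FF-Replan-style planners), with the most-likely variant alongside. The `ProbAction`/`DetAction` structures are illustrative, not taken from an existing planner.

    from dataclasses import dataclass, field

    @dataclass
    class ProbAction:
        name: str
        preconditions: frozenset                       # literals that must hold
        outcomes: list = field(default_factory=list)   # list of (probability, effects)

    @dataclass
    class DetAction:
        name: str
        preconditions: frozenset
        effects: frozenset

    def all_outcome_determinization(actions):
        """Split each probabilistic action into one deterministic action per
        outcome, dropping the probabilities (so the classical planner may
        optimistically pick whichever outcome it likes)."""
        det_actions = []
        for act in actions:
            for i, (_prob, effects) in enumerate(act.outcomes):
                det_actions.append(DetAction(name=f"{act.name}_o{i}",
                                             preconditions=act.preconditions,
                                             effects=frozenset(effects)))
        return det_actions

    def most_likely_determinization(actions):
        """Alternative: keep only the single most likely outcome of each action."""
        det_actions = []
        for act in actions:
            _prob, effects = max(act.outcomes, key=lambda pe: pe[0])
            det_actions.append(DetAction(name=act.name,
                                         preconditions=act.preconditions,
                                         effects=frozenset(effects)))
        return det_actions

The all-outcome version preserves reachability (any state reachable in the MDP is reachable in the determinization) but ignores how likely the chosen outcomes are, which is exactly the weakness that hindsight sampling addresses.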
Dimensions of Relaxation
(figure omitted: a 2x2 grid with "negative interactions" on one axis and "uncertainty" on the other, each axis increasing in the degree to which it is considered; the quadrants are 1: Relaxed Plan heuristic, 2: McLUG, 3: FF/LPG, 4: limited-width stochastic planning?)
• Reducing uncertainty: bound the number of stochastic outcomes considered, i.e., the stochastic "width"

Dimensions of Relaxation (as a table)

  -ve interactions \ Uncertainty |  None          |  Some / Full
  None                           |  Relaxed Plan  |  McLUG
  Some / Full                    |  FF/LPG        |  Limited-width stochastic planning

• The trade-off is expressiveness vs. cost: limited-width stochastic planning keeps the most problem structure but is the most expensive to compute

Node Expansions vs. Heuristic Computation Cost
(figure omitted: a qualitative chart placing h=0, FF-Replan (FFR), FF, and McLUG along the trade-off; cheaper heuristics expand more nodes, while more informed ones cost more per node)