SARSOP: Successive Approximations of the Reachable Space under Optimal Policies
Devin Grady
4 April 2013

Outline
1. Review
   – POMDP
   – Point-based Value Iteration
   – Heuristic Search [point-based] Value Iteration
2. SARSOP
   – Add a new heuristic and pruning method to HSVI
   – Results (Thanks Ryan)

REVIEW

POMDP
• Solving a POMDP is very similar to solving an MDP.
• The similarities:
  – State transitions are still stochastic.
  – The value function is a function of our current "state".
  – We still perform Bellman backups to compute V.
• The differences:
  – We have a probability distribution over where we are.
  – We can make (stochastic) observations that update our current belief.

Let's solve a POMDP!
[Figure: two-state POMDP. States x1, x2; action u3 switches the state with probability 0.8 and stays put with probability 0.2. Measurements: p(z1 | x1) = 0.7, p(z2 | x1) = 0.3, p(z1 | x2) = 0.3, p(z2 | x2) = 0.7. Payoffs: r(x1, u1) = −100, r(x2, u1) = 100, r(x1, u2) = 100, r(x2, u2) = −50.]

Make a decision… NOW
V1(b) = max_u r(b, u) = max{ −100 p1 + 100(1 − p1), 100 p1 − 50(1 − p1) }, where p1 = b(x1)
π1(b) = u1 if p1 ≤ 3/7, u2 if p1 > 3/7

HEY! You said POMDP…
• Geez. OK. We don't know that we observed z1. We must compute the expected value over observations:
E_z[V1(b | z)] = Σ_i p(z_i) V1(b | z_i)
             = max{ −70 p1 + 30(1 − p1), 70 p1 − 15(1 − p1) } + max{ −30 p1 + 70(1 − p1), 30 p1 − 35(1 − p1) }

Belief Update (see the sketch at the end of this review)

Lather, Rinse, Repeat
• We just did a full backup for T = 1!
• Repeat for T = 2.

The POMDP Hex
• POMDPs have both the curse of dimensionality and the curse of history.
• Scholars maintain that history is the truly unmanageable part.
• Cost of one exact backup: O(|V| × |A| × |Ω| × |S|² + |A| × |S| × |V|^|Ω|)
  – Belief update (taking an action); value backups (sensor measurements).

Point-based Value Iteration
• Maintains a fixed set of belief points, B.
• The value function is computed only over B.
  – B only contains reachable beliefs.
• The number of constraints (|V|) is fixed at |B|.
  – Value updates are now PTIME.
• PBVI provides an anytime solution that converges to the optimal value function.
• The error in the value function is bounded.

What does it Mean?
• Start with a small number of reachable beliefs.
• Point-based backup instead of Bellman backup.
  – Implicitly prunes the value simplex.
• Increase the number of beliefs until timeout.
• Cost of one point-based backup: O(|V| × |A| × |Ω| × |S|² + |B| × |A| × |S| × |Ω|)

HSVI Value Function
• When the forward search ceases, perform Bellman backups at the visited beliefs in reverse order.
• The lower bound is updated by inserting the new constraint vector.
• The upper bound is updated based on the expected value.
• Both bounds are uniformly improvable.

Properties of HSVI
• The upper and lower bounds monotonically converge to the optimal value function.
• Local updates preserve improvability.
• Maximum regret is ε.
• Finite search depth.
• Finite depth → finite Bellman updates.
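To make the belief-update machinery above concrete, here is a minimal sketch in Python/NumPy. The array conventions (T[a, s, s'] = p(s' | s, a), Z[a, s', o] = p(o | s', a)) and the function name are illustrative assumptions, not from the slides; the function computes b'(s') ∝ Ω(s', a, o) Σ_s T(s, a, s') b(s), i.e. the Bayes-filter belief update used throughout.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """tau(b, a, o): belief after taking action a and then observing o.

    b : (S,)       current belief over states
    T : (A, S, S)  T[a, s, s2] = p(s2 | s, a)   (assumed convention)
    Z : (A, S, O)  Z[a, s2, o] = p(o | s2, a)   (assumed convention)
    """
    b_pred = T[a].T @ b          # sum_s T(s, a, s') b(s), indexed by s'
    b_new = Z[a, :, o] * b_pred  # weight by observation likelihood Omega(s', a, o)
    return b_new / b_new.sum()   # eta: renormalize to a proper distribution

# Two-state example from the review: take u3, then observe z1 (starting belief chosen arbitrarily).
T = np.array([[[0.2, 0.8], [0.8, 0.2]]])   # transitions under u3
Z = np.array([[[0.7, 0.3], [0.3, 0.7]]])   # p(z | x)
print(belief_update(np.array([0.6, 0.4]), a=0, o=0, T=T, Z=Z))
```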
(New stuff here!)

SARSOP

Like HSVI…
• Except, only look at reachable OPTIMAL belief points instead of all reachable points.

Algorithm Outline
1. Sample
2. Backup
3. Prune

1. Sample ("Selective Deep Sampling")
• Sample a new belief b' via the belief update
  b'(s') = τ(b, a, o) = η Ω(s', a, o) Σ_s T(s, a, s') b(s), where η is a normalizing constant.
• Select a, o as in HSVI (upper bound, max gap).
  – Lower-bound increases propagate quickly; upper-bound decreases do not.
• OR: Predict the optimal value at b; if the lower bound at the root would improve, then explore around b.
  – Cluster beliefs by initial upper bound and entropy to make this prediction.

2. Backup
• See HSVI: a standard Bellman backup at the sampled beliefs (a sketch of one such point-based backup appears after the references).

3. Prune
• Any b that, under the current bounds, will never be explored (i.e., appears not to be in B*) is pruned from the belief tree.
  – If it was incorrect to prune it and it really is in B*, it will be added again later.
• A smaller B decreases the computational cost of each iteration; this is speed-critical!

Rock-Sample
• Deterministic motions.
• Noisy sensor for rock goodness.
• +10 for sampling a good rock.
• −10 for sampling a bad rock.
• +10 for exiting.
• No other cost/reward.

New method… is worse?
• The robot position in RockSample is fully observed.
• The rock-goodness values are also completely independent.
• HSVI's search for value-function cuts that mask others is particularly well suited to this structure.

HomeCare
• Smaller |S|, larger |Ω| than RockSample.
• Follow the target in case they need help.
• Reward if they are helped in time.
  – The time at which help is needed is uncertain.
• Movement has a small cost.

SARSOP Overall
• It's a heuristic method.
  – There will always be bad corner cases.
  – It works well most of the time.
• It has become very popular in the past year in the POMDP literature.
  – Because it performs well and is fairly straightforward if you understand HSVI?

References
Kurniawati, Hanna, David Hsu, and Wee Sun Lee. "SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces." Proc. Robotics: Science and Systems, 2008.
SARSOP available for download at http://bigbird.comp.nus.edu.sg/pmwiki/farm/appl/
G. Shani, J. Pineau, and R. Kaplow. "A Survey of Point-based POMDP Solvers." Autonomous Agents and Multi-Agent Systems, 2012.
T. Smith and R. Simmons. "Heuristic Search Value Iteration for POMDPs." Uncertainty in Artificial Intelligence, 2004.
J. Pineau, G. Gordon, and S. Thrun. "Point-based Value Iteration: An anytime algorithm for POMDPs." Int'l Joint Conference on Artificial Intelligence, 2003.
S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2006.
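As a closing illustration of the backup step referenced above, the following is a minimal sketch of one point-based backup at a single belief, assuming the array conventions T[a, s, s'] = p(s' | s, a), Z[a, s', o] = p(o | s', a), and R[a, s] = immediate reward (illustrative names, not from the slides or the SARSOP code). It builds the α-vector (constraint vector, in the slides' terms) that is maximal at b, the operation PBVI, HSVI, and SARSOP all repeat over their belief sets.

```python
import numpy as np

def point_based_backup(b, Gamma, T, Z, R, gamma):
    """One point-based backup at belief b; returns the alpha-vector maximal at b.

    b     : (S,)       belief point
    Gamma : non-empty list of current alpha-vectors, each of shape (S,)
    T     : (A, S, S)  T[a, s, s2] = p(s2 | s, a)   (assumed convention)
    Z     : (A, S, O)  Z[a, s2, o] = p(o | s2, a)   (assumed convention)
    R     : (A, S)     R[a, s] = immediate reward   (assumed convention)
    gamma : discount factor
    """
    best_alpha, best_value = None, -np.inf
    for a in range(T.shape[0]):
        g_a = R[a].astype(float)
        for o in range(Z.shape[2]):
            # Project every current alpha-vector one step through (a, o) ...
            projections = [T[a] @ (Z[a, :, o] * alpha) for alpha in Gamma]
            # ... but keep only the projection that scores best at this belief.
            g_a = g_a + gamma * max(projections, key=lambda v: float(b @ v))
        if b @ g_a > best_value:
            best_alpha, best_value = g_a, float(b @ g_a)
    return best_alpha
```

Taking the max over projections at the single belief b, rather than enumerating every combination of α-vectors across observations, is what removes the |V|^|Ω| cross-sum term of the exact backup and yields the polynomial point-based cost quoted in the review.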