How Should Intelligence Be Abstracted in AI Research ... AAAI Technical Report FS-13-02

Combining a POMDP Abstraction with Replanning to Solve Complex, Position-Dependent Sensing Tasks

Devin K. Grady and Mark Moll and Lydia E. Kavraki
Dept. of Computer Science, Rice University, Houston, TX 77005, USA
{devin.grady, mmoll, kavraki}@rice.edu

Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

The Partially-Observable Markov Decision Process (POMDP) is a general framework to determine reward-maximizing action policies under noisy action and sensing conditions. However, determining an optimal policy for POMDPs is often intractable for robotic tasks due to the PSPACE-complete nature of the computation required. Several recent solvers have been introduced that expand the size of problems that can be considered.
Although these POMDP solvers can respect complex motion constraints in theory, we show that the computational cost does not provide a benefit in the eventual online execution, compared to our alternative approach that relies on a policy that ignores some of the motion constraints. We advocate using the POMDP framework where it is critical: to find a policy that provides the optimal action given all past noisy sensor observations, while abstracting some of the motion constraints to reduce solution time. However, the actions of an abstract robot are generally not executable under its true motion constraints. The problem is addressed offline with a less-constrained POMDP, and navigation under the full system constraints is handled online with replanning. We empirically demonstrate that the policy generated using this abstracted motion model is faster to compute and achieves similar or higher reward than addressing the motion constraints of the car-like robot used in our experiments directly in the POMDP.

Introduction

POMDPs are a structured representation of a planning problem where, at each discrete time point, an agent performs an action and obtains an observation through sensing. The challenge lies in addressing noise in both action and sensing. This creates uncertainty in the state of the robot and in the outcome of actions. The solution of a POMDP is an action policy for an agent that maximizes an arbitrary reward function. The POMDP encapsulates a well-defined noise model for action and sensing. Finding an optimal policy for a POMDP ("solving" the POMDP) can be thought of as a search in a continuous belief space of dimension equal to the number of states in the underlying state space. Solving POMDPs is PSPACE-complete (Papadimitriou and Tsitsiklis 1987) due to the curse of history and the curse of dimensionality. Although solving a POMDP is computationally challenging, POMDPs are useful because all computation is offline, and a globally optimal policy is computed. To describe the class of robotic tasks we would like to solve, a reward function is constructed such that an optimal policy to increase reward over time will solve the task optimally.

The computational burden prevents solving large problem instances in reasonable time or space. Recent belief point-based solvers (e.g., Smith and Simmons 2005; Kurniawati, Hsu, and Lee 2008) have extended the size of solvable POMDP instances such that they are useful for some robotic tasks involving action and sensing noise. The state spaces considered in these solvers tend to be on the order of 10 × 10 grids with some uncertain world features (e.g., Lee and Kim 2013; Hsiao, Kaelbling, and Lozano-Perez 2007); this is a small state space. This paper does not present a new POMDP solver; instead, it proposes an abstraction of a POMDP, which is combined with online planning to address the simplifications made in the abstraction. We demonstrate problem instances in which the POMDP model of a car-like robot cannot be solved with modern techniques, but a solution to the abstraction can be found. Online planning does not incorporate the observations from sensing in a globally optimal way as the POMDP policy does. Combining a POMDP abstraction with online planning yields a best-of-both-worlds solution in our experiments.

This paper considers the task of navigating a car-like robot with a partially known map and noisy sensors. We propose that the natural level of abstraction for this task is a holonomic robot operating in ℝ². This abstraction reduces the computational burden by ignoring the complex motion constraints. Intuitively, this abstraction is effective because it attacks the computational complexity by reducing the state space, and the curse of history is eased by using simpler constraints. Although the problems we consider have a similarly sized state space as prior work, we show that added motion constraints can lead to unsolvable POMDPs. Our experimental results indicate that our POMDP abstraction with replanning is an effective technique. Executing a policy constructed in an abstract motion model is not trivial, and is discussed in the Methodology section. Details connecting the theoretical method with our evaluation platform appear in Experimental Setup. Our evaluation shows that, counter-intuitively, incorporating the true constraints not only takes more time but may also generate hard-to-execute policies, reducing reward even when given time to converge to an optimal policy.

Problem Definition

The task we set out to solve is an escape problem, in which a robotic agent must avoid detection. The robot has a radio signal strength sensor that estimates the distance to a radio source emanating from an adversarial detection outpost. The problem is to determine a sensor-based policy that a car-like robot can follow, without detection, to a goal region. There are some existing, known obstacles in the workspace, as well as one possible unknown obstacle. The range to the unknown obstacle can be sensed; however, the sensor is noisy, and the noise increases with range. Additionally, only noisy position sensing is available, preventing the agent from ever fully localizing.

Our problem is similar in construction to RockSample, introduced on 7 × 7 grids (Smith and Simmons 2004) and since expanded and modified (Zhang and Chen 2010; Bai et al. 2010). Similar to RockSample, the environment is defined as a grid, and there exists an exit region that is rewarding and terminal. Unlike RockSample, costly and terminal obstacles exist. RockSample, unlike our problem, has perfect localization. In addition, our variables for sensing the environment are not boolean values of rocks with known positions but, instead, are range measurements to a terminal obstacle with unknown position. Overall, these modifications make this new problem much more difficult.

The POMDP representing this task is defined as P₀ = ⟨S, A₀, O₀, T₀, Ω₀, R, γ⟩, and it is the input to our algorithm:

S = Continuous State Space
A₀ = Continuous Action Space
O₀ = Continuous Observation Space
T₀ = Transition Probabilities (S × A₀ × S) ⇒ ℝ
Ω₀ = Observation Probabilities (S × A₀ × O₀) ⇒ ℝ
R = Reward Function (S × A) ⇒ ℝ
γ = Discount Factor (0.95)

The output is a sequence of states (an execution trace) that follows the propagation of a simulated car-like robot. An execution trace has a reward calculated using R and γ from above. We define success as a trace that ends in a terminal exit region; all other traces are failures.
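To make the scoring of an execution trace concrete, the short Python sketch below computes the discounted reward and the success flag of a trace. The per-cell reward values and γ = 0.95 are taken from the problem description and the Experimental Evaluation section; the cell labels, function names, and per-step accounting are our own illustrative assumptions, not part of the original formulation.

GAMMA = 0.95
R_FREE, R_EXIT, R_OBSTACLE = -1.0, 900.0, -1000.0  # values from the evaluation

def reward(label):
    # R depends only on the (labeled) state, not on the action.
    return {"free": R_FREE, "exit": R_EXIT, "obstacle": R_OBSTACLE}[label]

def is_terminal(label):
    # Exit regions and obstacles are terminal states.
    return label in ("exit", "obstacle")

def trace_reward(labels):
    # Discounted reward of an execution trace, given the label of each visited
    # cell; success means the trace ends in an exit cell.
    total, success = 0.0, False
    for t, label in enumerate(labels):
        total += (GAMMA ** t) * reward(label)
        if is_terminal(label):
            success = (label == "exit")
            break
    return total, success

# Example: four free-space steps followed by reaching the exit region.
print(trace_reward(["free", "free", "free", "free", "exit"]))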
Overall Approach

P₀ is useful to describe the underlying continuous problem; however, continuous action spaces have only recently been addressed using techniques from sampling-based planning to find useful action samples (Kurniawati, Bandyopadhyay, and Patrikalakis 2012). We directly define a solvable POMDP abstraction as P = ⟨S, A, O, T, Ω, R, γ⟩, where A and O are discretized approximations of A₀ and O₀. T and Ω are then functions over these discretized spaces. Modern solvers, such as Monte-Carlo Value Iteration (MCVI) (Lim, Hsu, and Lee 2011), can solve a POMDP in this form. The stealth escape problem will be solved in two stages:

1. Offline POMDP policy generation, and
2. Online replanning agent policy execution.

The task is complete, either in success or failure, when the policy update of the online execution indicates that the robot has entered a terminal state.

To represent a car-like model, the state space S is the continuous space of rigid body poses in the plane, denoted as SE(2). The action space A will be discretized for computational efficiency into 5 actions, as shown in Figure 2. Sensor observations consist of (x, y, range) tuples, discretized to integral values. T will be a Gaussian distribution centered around the expected outcome of the motion model. Ω will implement a similar Gaussian noise model for sensing. The current orientation, θ, is noise-free. We define the reward function R to be independent of A and to depend only on S. The discount factor, γ, encourages short policies, as in all known prior finite-time-horizon POMDP planning. As we will show, computing even an approximate policy for P is intractable, despite the discretization of sensing and action. Therefore, this paper proposes an abstracted POMDP model, P′ = ⟨S′, A′, O, T′, Ω, R, γ⟩. Although the motion model has changed (S′, A′, T′), the components of the task description that will not be addressed with replanning are unchanged (O, Ω, R, γ). In P′, S′ = ℝ², A′ = {North, South, East, West, Stop}, and T′ is the same Gaussian, but centered on the state returned by this simpler motion model. For comparison purposes, we solve both P and P′ using the powerful solver MCVI, and the policies generated will be used in online execution.
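As a minimal illustration of the abstraction, the sketch below samples from the transition model T′ of P′: a Gaussian centered on the successor of the simpler holonomic motion model under the action set A′. The noise magnitude σ and the interface are assumptions made for illustration; the paper specifies the Gaussian form of T and T′ but not the exact motion-noise parameters used.

import numpy as np

rng = np.random.default_rng(0)

# Abstract action set A' of P': unit moves in R^2 plus staying still.
ACTIONS = {
    "North": (0.0, 1.0),
    "South": (0.0, -1.0),
    "East": (1.0, 0.0),
    "West": (-1.0, 0.0),
    "Stop": (0.0, 0.0),
}

def sample_transition(state, action, sigma=0.1):
    # T': a Gaussian centered on the nominal successor of the holonomic step.
    # sigma is illustrative; the paper does not report the motion-noise value.
    dx, dy = ACTIONS[action]
    nominal = np.array([state[0] + dx, state[1] + dy])
    return nominal + rng.normal(0.0, sigma, size=2)

# The full model P would replace the holonomic step with a bounded-curvature
# car step over SE(2); the task components (O, Omega, R, gamma) are unchanged.
print(sample_transition((2.0, 3.0), "North"))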
Related Work

The two-phase method we propose may appear similar to hierarchical POMDP methods (Foka and Trahanias 2007; Pineau, Roy, and Thrun 2001; Pineau et al. 2003). However, these methods generate POMDP policies "all the way down" to the full system detail through decomposition of the state, action, and sensor spaces. Our proposal instead uses an abstracted POMDP, and then applies replanning to incorporate the constraints.

Point-Based Policy Transformation (PBPT) (Kurniawati and Patrikalakis 2012) and online POMDP methods (Wang and Dearden 2012; Ross et al. 2008) propose a similar idea to this paper: limiting offline computation and then fixing the policy when necessary to decrease the required offline computation time. However, we break out of the POMDP framework entirely and do not change the input policy. In our work, the input POMDP policy is for an abstracted model; therefore, replanning is necessary to execute the policy. PBPT uses a bijection S ⇔ S′ and A ⇔ A′. We operate in state spaces that differ in dimensionality, and our action spaces have very different constraints, precluding the applicability of PBPT. These changes in state and action spaces are precisely what drives the computational benefits of our proposed framework.

There has been significant interest recently in problems where a belief can be approximated by a Gaussian distribution (Bai, Hsu, and Lee 2013; Van den Berg, Patil, and Alterovitz 2012). Although we use Gaussian distributions in our sensor model, the belief state of the world cannot effectively be collapsed into a unimodal distribution due to the uncertainty of the location of obstacles. Therefore, these extremely fast and effective techniques are not applicable to our problem.

In (Candido and Hutchinson 2011), a method similar to MCVI is proposed for an explicit belief variance minimization and navigation task. Local policies are explicitly user-designed and given as input. These local policies are designed to decrease uncertainty in the current belief and drive the system to a goal state. This allows the use of high-level abstract actions similar to our method; however, our technique builds per-action policies online instead of using user-defined policies, and we use MCVI, which converges to optimal policies.

Recent efforts to improve POMDP solution times by decreasing the number of states (Agha-mohammadi, Chakravorty, and Amato 2011; Kurniawati, Bandyopadhyay, and Patrikalakis 2012; Kaplow, Atrash, and Pineau 2010) and actions (Grady, Moll, and Kavraki 2013) are also relevant to our goal of decreased computation time without sacrificing reward. In general, these methods produce abstractions of navigation and/or carve out "interesting" areas of the spaces to focus on. Unlike in this paper, the discretizations are altered while the underlying motion constraints and state spaces are unchanged.

Methodology

The proposed algorithm, Algorithm 1: POMDP+Online, operates in two phases. Line 1 is the entire offline solution phase, taking the input POMDP model and computing an optimal policy. Lines 2–4 set up the initial state of the replanning system. Line 5 initializes the output execution trace. Lines 6–12 perform the replanning loop until a terminal observation is sensed, appending to the execution trace after each action. Finally, on line 13, the execution trace is returned as output.

The policy is represented as a deterministic policy graph, where each node is the optimal action and edges are taken dependent on observations. The policy node update on line 11 thus depends on the observation (from line 10). This observation may be critical to the overall reward, as noisy sensing plays a pivotal role in the task. Therefore, the execution of each action is planned independently (line 7). This ensures that policy node updates are done correctly and the agent follows the reward-maximizing policy from line 1.

Algorithm 1 POMDP+Online
Require: P is a POMDP model
 1: policy ← Solve(P)                              ▷ Offline Planning
 2: state ← SampleInitState(P)                     ▷ Initialization
 3: policyNode ← policy.root(P)
 4: obs ← SampleObs(state, policyNode, P)
 5: trace ← {state}
 6: repeat                                         ▷ Online replanning loop
 7:   path ← Plan(obs.state(), policy, policyNode, P)
 8:   state ← state.IntegrateAlong(path)
 9:   trace.append(state)
10:   obs ← SampleObs(state, policyNode, P)
11:   policyNode ← policyNode.child(obs)
12: until TerminalObservation(obs)
13: return trace
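A compact Python rendering of Algorithm 1 may help connect the pseudocode to an implementation. The sketch below mirrors the listing line for line; the callables for solving, planning, observation sampling, state propagation, and the terminal test are placeholders, since their concrete forms (MCVI, the online planner, and the noise model) are described elsewhere in the paper.

def pomdp_online(P, solve, plan, sample_obs, integrate_along, is_terminal_obs):
    # Sketch of Algorithm 1 (POMDP+Online); all callables are placeholders.
    policy = solve(P)                              # line 1: offline planning
    state = P.sample_init_state()                  # lines 2-4: initialization
    node = policy.root()
    obs = sample_obs(state, node, P)
    trace = [state]                                # line 5: execution trace
    while True:                                    # lines 6-12: replanning loop
        path = plan(obs.state, policy, node, P)    # plan one POMDP action
        state = integrate_along(state, path)       # execute under true constraints
        trace.append(state)
        obs = sample_obs(state, node, P)
        node = node.child(obs)                     # policy-graph edge chosen by obs
        if is_terminal_obs(obs):
            return trace                           # line 13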
POMDP+Online is a high-level view and wraps up several important details in the functions called Solve(), Plan(), and SampleObs(). Many of these implementation details will be covered in the Methodology and Experimental Setup sections. Critical to our combination of an offline policy and online replanning is having the correct goal state for Plan(), and we will expand that function here.

Algorithm 2 Plan() from line 7 of POMDP+Online
Require: start is a valid continuous state for planner
Require: policyNode is a node in policy
Require: P is a POMDP model
 1: maxReward ← P.minReward()
 2: bestFuture ← NULL
 3: for i = 1 ... N do
 4:   future, reward ← Simulate(policyNode, P)
 5:   if reward ≥ maxReward then
 6:     bestFuture ← future
 7:     maxReward ← reward
 8:   end if
 9: end for
10: goal ← GoalSelect(bestFuture)
11: path ← planner.plan(start, goal)
12: return path

A brief walkthrough of Plan() starts at the initialization on lines 1–2. The loop on lines 3–9 finds the best possible future trajectory over N simulations. At the end of this loop, bestFuture is the set of states along this best possible future trajectory. It is used on line 10 to access bestFuture[1], the next state from the best future. Line 10 selects a goal; when P and planner have different state spaces, this is non-trivial and will be discussed later. Finally, on line 11 the planner is actually called to solve the motion planning problem, and the path found is returned on line 12.

The call to planner.plan on line 11 of Algorithm 2 must generate collision-free trajectories that respect the system constraints. The online system is also computationally constrained for each cycle. During online execution, the same noise model is applied as in the offline stage. Due to the computation budget and noise, even simple problems and environments fail occasionally. This occurs when Plan() returns an empty path because the planner could not connect the start and goal states given the constraints of the system.
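The following sketch mirrors Algorithm 2 in Python under the same caveats: simulate(), goal_select(), and the planner object are placeholders for components detailed in the Experimental Setup, and the default number of simulations matches the value reported there.

def plan(start, policy, node, P, planner, simulate, goal_select, n_sims=1000):
    # Sketch of Algorithm 2 (Plan()).  simulate() rolls the policy graph
    # forward from the current node under the POMDP model and returns
    # (future_states, reward); goal_select() turns the best future into a
    # concrete planner goal.  policy is carried through only to mirror the
    # call made in Algorithm 1.
    max_reward = P.min_reward()                 # line 1
    best_future = None                          # line 2
    for _ in range(n_sims):                     # lines 3-9: N simulated futures
        future, reward = simulate(node, P)
        if reward >= max_reward:
            best_future, max_reward = future, reward
    goal = goal_select(best_future)             # line 10: non-trivial when P and
                                                # the planner use different spaces
    return planner.plan(start, goal)            # lines 11-12: may be empty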
Experimental Setup

In this section, we will first give the details of the offline models, P and P′. We will proceed to provide the details of our implementation of online replanning. These details will give the algorithms presented earlier some concrete structure.

Both P and P′ share a common noise model for sensing, shown in Figure 1. Position sensing noise is applied by drawing a Gaussian sample around the true state, with σ = 0.2, and then only the integer part is returned. The range sensor observations follow the same general noise model, except that σ is multiplied by the current true range. Therefore, the range sensor samples from a discretized Gaussian centered around the true range to the unknown obstacle. Because our sensing model does not depend on orientation, P′ still captures the essential characteristics of our task, but under fewer motion constraints.

Figure 1: Position (left) and range (right) sensing noise models are Gaussian distributions truncated to integral values.

Figure 2: The action model of P.
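For concreteness, the sketch below implements the sensing noise model of Figure 1: position observations are Gaussian samples around the true position with σ = 0.2, keeping only the integer part, and range observations use a standard deviation proportional to the true range. The random generator and function names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.2  # position-sensing standard deviation given in the text

def observe_position(x, y):
    # Gaussian sample around the true position, truncated to its integer part.
    return int(rng.normal(x, SIGMA)), int(rng.normal(y, SIGMA))

def observe_range(true_range):
    # Same model, but the standard deviation scales with the true range.
    return int(rng.normal(true_range, SIGMA * true_range))

# Example (x, y, range) observation tuple; the true values are illustrative.
print(observe_position(3.4, 7.9), observe_range(6.0))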
For offline planning, MCVI is used. MCVI is a limited-time-horizon offline planner. We set the horizon to 100 actions. For our problems, the policies produced by MCVI always terminated significantly before this cutoff.

The action space A is shown in Figure 2. A is the discretized action set of a first-order robot with bounded curvature. This robot model is a discretized version of the online system that will be discussed later in this section. A does not allow the system to back up since, in the task we are solving, reverse movement is of little use at the POMDP level. A′ consists of the 4 cardinal directions with unit length, plus staying still. The reward function R is assumed constant over unit squares of ℝ², and R is exactly the definition of the environment. Obstacles and exit regions are sets of terminal states. P or P′ is solved with MCVI, and a policy mapping observations to actions is returned. This reward-maximizing policy will be followed one action at a time with an online system that does respect all the constraints of the system.

The online planner uses locally optimal Dubins (Dubins 1957) and Reeds-Shepp (Reeds and Shepp 1990) paths. These first-order, bounded-curvature paths are time-independent, kinematically valid paths for many car and UAV models. Controllers can easily track these paths under the full system dynamics. The Reeds-Shepp model is almost identical to the Dubins model, with the additional ability to go backward. Reeds-Shepp paths clearly are not valid for fixed-wing UAVs, but are more applicable to car-like UGVs. The difficulty of online planning is in satisfying the motion constraints while executing the input policy. Being able to reverse allows the Reeds-Shepp car to easily approach a target orientation and correct for position noise. The effect of this will be seen in the solution percentages.

Transition-based RRT (TRRT) (Jaillet, Cortes, and Siméon 2008) was chosen for online replanning because it plans in a space with an arbitrary reward function to produce low-cost paths. The TRRT reward is based on the executions of the policy and the POMDP reward function R. This directly couples the online replanning with the offline policy at a low level. Online replanning is implemented using the open-source OMPL package (Şucan, Moll, and Kavraki 2012). Dubins and Reeds-Shepp models with a minimum turning radius of 0.5 are instantiated. N = 1000 runs of the policy are executed from the current state, given all previous sensor measurements.
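A minimal sketch of the online planner setup, assuming the OMPL Python bindings, is shown below. It instantiates the Dubins or Reeds-Shepp state space with the 0.5 turning radius used here and selects TRRT; the workspace bounds, the trivial validity checker, and the omitted cost objective coupling TRRT to the POMDP reward R are placeholders rather than the actual implementation.

from ompl import base as ob
from ompl import geometric as og

def make_setup(use_reeds_shepp=True, turning_radius=0.5):
    # Bounded-curvature car model; Reeds-Shepp additionally allows reversing.
    if use_reeds_shepp:
        space = ob.ReedsSheppStateSpace(turning_radius)
    else:
        space = ob.DubinsStateSpace(turning_radius)
    bounds = ob.RealVectorBounds(2)
    bounds.setLow(0.0)
    bounds.setHigh(10.0)  # illustrative workspace bounds
    space.setBounds(bounds)
    ss = og.SimpleSetup(space)
    # Placeholder validity checker; the real one tests the known obstacles.
    ss.setStateValidityChecker(ob.StateValidityCheckerFn(lambda s: True))
    # TRRT produces low-cost paths; tying its cost to R is omitted here.
    ss.setPlanner(og.TRRT(ss.getSpaceInformation()))
    return ss

def plan_segment(ss, start_xyt, goal_xyt, timeout=0.5):
    # Plan one intermediate goal (x, y, theta) under the car constraints.
    space = ss.getStateSpace()
    start = ob.State(space)
    goal = ob.State(space)
    start().setX(start_xyt[0])
    start().setY(start_xyt[1])
    start().setYaw(start_xyt[2])
    goal().setX(goal_xyt[0])
    goal().setY(goal_xyt[1])
    goal().setYaw(goal_xyt[2])
    ss.setStartAndGoalStates(start, goal)
    if ss.solve(timeout):
        return ss.getSolutionPath()
    return None  # empty path: planner could not connect start and goal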
A benefit of our technique is that the same offline policy can be utilized for multiple online systems. In this paper, we will take advantage of this abstraction and compute one policy offline, and then execute it under two different vehicle models, each with its own set of constraints. This represents the potential for massive computational savings, as it indicates that a policy does not need to be re-calculated for small differences in system constraints.

The noise model in the online stage is the same as in P and P′. As we are not testing the effectiveness of online localization algorithms, we have omitted implementing such algorithms. Omitting localization at the online stage is not a limitation of the method, but will negatively affect success rates. Because it affects all experimental conditions equally, it changes the overall results by a uniform reduction in reward and success rates.

As discussed earlier, execution of the policy is done one POMDP action at a time, by computing an intermediate goal that accomplishes each action (Algorithm 2, line 10). Each intermediate goal is a state (x, y, θ), where θ is taken directly from the P policy, and is predicted with a 2-action lookahead when using the P′ policy. This prediction considers the direction of the action that joins the next and the next-next state, and then assigns that direction as the goal orientation. If there is only 1 state in the future of either the P or P′ policy, then it is assumed to be a final state and an escape goal is used that does not constrain orientation or position within the exit region.
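The 2-action lookahead just described might be realized along the following lines; the state and goal formats, and the exact handling of short futures, are illustrative assumptions rather than the implementation used in the experiments.

import math

def intermediate_goal(best_future):
    # best_future is the state sequence of the best simulated rollout, with
    # best_future[0] the current state; states carry (x, y) in their first
    # two components.  Returns an illustrative goal description.
    if len(best_future) < 3:
        # Only one future state remains: assume it is final and use an escape
        # goal that leaves orientation and position in the exit region free.
        return {"escape": True}
    nx, ny = best_future[1][:2]              # goal position: the next state
    nnx, nny = best_future[2][:2]            # next-next state for the lookahead
    theta = math.atan2(nny - ny, nnx - nx)   # direction joining the two states
    return {"escape": False, "x": nx, "y": ny, "theta": theta}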
Experimental Evaluation

The environments used are depicted in Figure 3. Most of the space has a reward of −1 (gray), terminal escape regions have a reward of +900 (white), and the terminal obstacle regions have a reward of −1000 (black). The large grey circles are the possible locations of the unknown obstacle, and the small green circle is a 95% confidence circle of the initial belief. No green circle appears in the slalom environment because this environment was run without position noise. Possible obstacle positions are placed so that, without sensing the unknown obstacle's location, reaching a state with positive reward is unlikely.

Figure 3: Environments used in trials. They are: (a) empty, (b) single stealth, (c) double stealth, and (d) slalom.

Metrics tested are total solution time, policy execution reward, and task success rate. Each metric is computed across 500 trials in each environment, and the mean value is shown with a 95% confidence interval in Figure 4. Although solution time includes both the offline POMDP solution and online replanning, the online time is negligible in all but the simplest tests. To compare the policies generated from P and P′, reward is considered at the online level. This is the reward the robot actually receives under the true constraints. Each set of bars shows a comparison of the policy generated by solving P and the policy from solving P′, when each is executed on the Dubins and Reeds-Shepp models.

Figure 4: Experimental results are presented in this summary figure. Striped bars used P, solid bars used P′.

The first test, Figure 3(a), was an empty environment, limited only by the bounds and kinematics of the system. In this test, we see that the total computation time for P is more than an order of magnitude greater than for P′ (note the log scale). The success rate and reward show that the policies for P and P′ were equally optimal.

In the stealth single passage problem in Figure 3(b), neither policy is optimal within the time limits; however, the policy for the simpler POMDP abstraction P′ achieves significantly higher reward for both robot models. This test shows that the policy for the more complex POMDP was not just somewhat worse, but performed no better than chance. We conclude that all of the time was spent in determining a policy for only the navigational aspects of the problem, without being able to build a large enough policy to localize the unknown obstacle.

For the stealth double passage in Figure 3(c), making the wrong navigation decision based on the noisy sensing is catastrophic. This makes the problem harder than the single passage problem. If P′ is used to generate a policy, the Dubins model gets negative reward. The same policy executed under the Reeds-Shepp car model achieves a very high success rate and a positive reward. From this we can conclude that the policy was good, but actually following it was difficult for the Dubins model. When simulations were visualized, it was apparent that turning corners was often problematic for the Dubins model, as it would often compute paths that were believed to be barely collision-free and, due to the position sensing noise, were actually in collision.

In the slalom environment in Figure 3(d), the noise model was too challenging for either P or P′ to produce a policy that had a non-zero success rate. We expect this is due to the large number of turns that are made, a difficulty discussed above. Results from this environment were therefore obtained without position noise. In this case, MCVI can find an optimal policy for both P and P′. Note that the system does not reduce to an MDP because the unknown obstacle is only partially observable. Due to the increased clutter, the difficulty of achieving specific orientations is highlighted by the reward and success degradation when using the policy from P.

When MCVI converges for both models, P′ converges significantly faster and the rewards of both policies are approximately equal. When MCVI times out, it is clear that the policy from P′ is closer to optimal and yields significantly higher reward. The generality of online replanning is hinted at by using a single offline policy for two online models.

Discussion

To address sensing-based robotic tasks with a complex reward structure, we propose that the correct level of abstraction is to remove complex motion constraints from a POMDP. Removed constraints can be incorporated online with replanning. Our results show that, for the problem instances we have considered, our proposed algorithm is superior. It is clear from our results that complex constraints may cause an explosion in the computation time such that the policy returned may have extremely poor success rates – 0% in some cases. Delaying the consideration of these constraints is shown to be an effective strategy to generate better policies and improve success rates.

For the future, it is clear that a localization algorithm will be required. Implementing this addition may make the results harder to analyze, but will help improve success rates. We are also interested in investigating more complex search-and-rescue tasks where the goal may be unknown.

Acknowledgements

This work was partially supported by NSF 1139011, NSF DUE 0920721, CCF 1018798, and the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-09-1-0383. Equipment was funded in part by the Data Analysis and Visualization Cyberinfrastructure funded by NSF under grant OCI-0959097. D. Grady was additionally supported by an NSF Graduate Research Fellowship.

References

Agha-mohammadi, A.-a.; Chakravorty, S.; and Amato, N. M. 2011. FIRM: Feedback controller-based information-state roadmap - A framework for motion planning under uncertainty. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, 4284–4291.
Bai, H.; Hsu, D.; Lee, W. S.; and Ngo, V. 2010. Monte Carlo value iteration for continuous-state POMDPs. In Workshop on the Algorithmic Foundations of Robotics, 175–191.
Bai, H.; Hsu, D.; and Lee, W. S. 2013. Planning how to learn. In IEEE Intl. Conf. on Robotics and Automation.
Van den Berg, J.; Patil, S.; and Alterovitz, R. 2012. Efficient approximate value iteration for continuous Gaussian POMDPs. In AAAI Conf. on Artificial Intelligence, 1832–1838.
Candido, S., and Hutchinson, S. 2011. Minimum uncertainty robot navigation using information-guided POMDP planning. In IEEE Intl. Conf. on Robotics and Automation, 6102–6108.
Dubins, L. E. 1957. On curves of minimal length with a constraint on average curvature, and with prescribed initial and terminal positions and tangents. American Journal of Mathematics 79(3):497–516.
Foka, A., and Trahanias, P. 2007. Real-time hierarchical POMDPs for autonomous robot navigation. Robotics and Autonomous Systems 55(7):561–571.
Grady, D.; Moll, M.; and Kavraki, L. E. 2013. Automated model approximation for robotic navigation with POMDPs. In IEEE Intl. Conf. on Robotics and Automation.
Hsiao, K.; Kaelbling, L. P.; and Lozano-Perez, T. 2007. Grasping POMDPs. In IEEE Intl. Conf. on Robotics and Automation, 4685–4692.
Jaillet, L.; Cortes, J.; and Siméon, T. 2008. Transition-based RRT for path planning in continuous cost spaces. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, 2145–2150.
Kaplow, R.; Atrash, A.; and Pineau, J. 2010. Variable resolution decomposition for robotic navigation under a POMDP framework. In IEEE Intl. Conf. on Robotics and Automation, 369–376.
Kurniawati, H., and Patrikalakis, N. M. 2012. Point-based policy transformation: Adapting policy to changing POMDP models. In Workshop on the Algorithmic Foundations of Robotics, 1–16.
Kurniawati, H.; Bandyopadhyay, T.; and Patrikalakis, N. M. 2012. Global motion planning under uncertain motion, sensing, and environment map. Autonomous Robots.
Kurniawati, H.; Hsu, D.; and Lee, W. S. 2008. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems.
Lee, T., and Kim, Y. J. 2013. GPU-based motion planning under uncertainties using POMDP. In IEEE Intl. Conf. on Robotics and Automation.
Lim, Z.; Hsu, D.; and Lee, W. S. 2011. Monte Carlo value iteration with macro-actions. In Advances in Neural Information Processing Systems.
Papadimitriou, C., and Tsitsiklis, J. 1987. The complexity of Markov decision processes. Mathematics of Operations Research 12(3):441–450.
Pineau, J.; Montemerlo, M.; Pollack, M.; Roy, N.; and Thrun, S. 2003. Towards robotic assistants in nursing homes: Challenges and results. Robotics and Autonomous Systems 42(3–4):271–281.
Pineau, J.; Roy, N.; and Thrun, S. 2001. A hierarchical approach to POMDP planning and execution. In Workshop on Hierarchy and Memory in Reinforcement Learning.
Reeds, J. A., and Shepp, L. A. 1990. Optimal paths for a car that goes both forwards and backwards. Pacific Journal of Mathematics 145(2):367–393.
Ross, S.; Pineau, J.; Paquet, S.; and Chaib-Draa, B. 2008. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research 32(2):663–704.
Smith, T., and Simmons, R. 2004. Heuristic search value iteration for POMDPs. In Conf. on Uncertainty in Artificial Intelligence, 520–527.
Smith, T., and Simmons, R. 2005. Point-based POMDP algorithms: Improved analysis and implementation. In Conf. on Uncertainty in Artificial Intelligence.
Şucan, I. A.; Moll, M.; and Kavraki, L. E. 2012. The Open Motion Planning Library. IEEE Robotics & Automation Magazine 19(4):72–82.
Wang, M., and Dearden, R. 2012. Improving point-based POMDP policies at run-time. In Workshop of the UK Planning And Scheduling SIG.
Zhang, Z., and Chen, X. 2010. Accelerating point-based POMDP algorithms via greedy strategies. Simulation, Modeling, and Programming for Autonomous Robots 6472:545–556.