To appear in Proc. 2013 IEEE Intl. Conf. on Robotics and Automation.
Automated Model Approximation for Robotic Navigation with POMDPs
Devin Grady, Mark Moll, Lydia E. Kavraki
Abstract— Partially-Observable Markov Decision Processes
(POMDPs) are a problem class with significant applicability
to robotics when considering the uncertainty present in the
real world; however, they quickly become intractable for large
state and action spaces. A method to create a less complex
but accurate action model approximation is proposed and
evaluated using a state-of-the-art POMDP solver. We apply
this general and powerful formulation to a robotic navigation
task under state and sensing uncertainty. Results show that
this method can provide a useful action model that yields
a policy with similar overall expected reward compared to
the true action model, often with significant computational
savings. In some cases, our reduced complexity model can solve
problems where the true model is too complex to find a policy
that accomplishes the task. We conclude that this technique
of building problem-dependent approximations can provide
significant computational advantages and can help expand the
complexity of problems that can be considered using current
POMDP techniques.
I. INTRODUCTION
Motivation: Robot motion planning has advanced considerably, but accounting for uncertainty in planning is still very
challenging [1]. A principled and general formulation that
can handle this uncertainty is that of a Partially-Observable
Markov Decision Process (POMDP). In the general case,
POMDPs are computationally intractable (PSPACE-complete
[2]), and are applied only to small problems [3]. However,
ongoing research has successfully applied POMDPs to increasingly complex robotic tasks under uncertainty, such as
grasping [4], navigation [5] and exploration [6], although the
problem sizes remain fairly small.
Even state-of-the-art POMDP solvers (some recent examples being [4], [7]–[11]) may require hours of computation
to find an optimal policy even for relatively small discrete
problems with tens of states and fewer than 10 discrete actions
that the robot can take. The policies they produce, however,
are extremely useful in the robotics domain. Specifically, they
encode an optimal solution, and can be computed off-line.
The on-line computation then is a simple sequence of calls
to a look-up table. Therefore, there is significant interest in
improving the maximum problem size that can be considered
with these methods.
Problem Definition: To reduce the computational burden
of POMDPs, we introduce a method called Automated Model
Approximation, or AMA. AMA builds a problem-specific
approximation of the state and/or action spaces, to delay
the curse of dimensionality, while attempting to maintain an
approximation that can build a near-optimal policy.
Dept. of Computer Science, Rice University, Houston, TX 77005, USA
{devin.grady, mmoll, kavraki}@rice.edu
To investigate the merit of this idea, we will focus on the
robotic navigation problem under state and sensing uncertainty.
The initial state of the robot is unknown; instead, it is represented as
a belief spread uniformly across a set of states. Sensing
uncertainty prevents the robot from ever collapsing this belief
to a single state except in highly-constrained environments.
The robot navigation task we will focus on is to take a sensor
measurement while in any one of several goal states. Sensing
is defined as its own action and therefore takes time, so an
optimal policy should avoid sensing except when necessary.
Finding the optimal balance between motion and sensing is
very computationally expensive in the presence of uncertainty.
To reduce the complexity of the discussion and implementation,
we will assume that only the action model, defined as the
product of the state and action spaces, will be modified by AMA. This
is not, however, a restriction of the method, and state space
modification will be investigated in the future. The action
model is defined this way because it represents the selection
of actions allowed in any particular region of the state space.
Therefore, action models will comprise the primary input and
the output of AMA.
AMA requires the definition of the simple action space,
as well as a method to specify the important states where the
true action complexity needs to be considered. The user will
define the true action space, a simplified version of the action
space, and an operator to augment the simple action model
with the true complexity in specific regions of the state space.
These three items implicitly define three action models: the
simplified action space everywhere, the true action space
everywhere, and the output of AMA, which is a mix of
the prior two. For example, a simple action model may be
moving with speed 1 in any direction, and the true action
model also allows moving with speed 2. Then, AMA will
build an action model where speed 2 is allowed in useful
directions in regions of the state space that are likely to
be entered, but in all other areas only allows a speed of
1. By reducing the number of options globally available, it
decreases the computational burden, and by using all options
where needed, we allow the construction of near optimal
policies for a given problem instance.
The output of AMA is a problem-specific estimation of the
true action model. To evaluate the success of AMA, we give
all three action models (simple, true, AMA) to a POMDP
solver with the same task, and evaluate the utility of the
action models based on the policies that the solver constructs.
Given a POMDP and initial belief as input, the POMDP
solver will compute a globally optimal policy with respect
to the total reward achieved. This policy maps from
observations to actions, maximizing the expected reward,
and should be valid over all possible observations. The
POMDP solver will reason in the belief space, B, which
has dimension equal to the size of the state space. A point
b ∈ B encodes the probability of being in each state. Thus
for even a 10 by 10 discrete 2D grid of states, the space that
must be reasoned over is of dimension 100. As opposed to
much of the work in solving POMDPs, we do not set out to
address the discrete nature of our robotic navigation problem
definition, and work specifically with discrete state and action
spaces. The discrete POMDP model is formally defined as
(S, A, O, T, Ω, R) where
S = state space
A = action space
O = observation space
T = conditional transition probabilities
Ω = conditional observation probabilities
R = reward function: A × S → ℝ
Many POMDP formulations include a discount factor, γ,
that provides the rate at which future rewards are discounted.
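As a concrete illustration, the tuple above can be stored in a small data structure. The following Python sketch is illustrative only; the names and container choices are our own assumptions and are not part of any solver interface.

from dataclasses import dataclass
from typing import Callable, Dict, List

State = int
Action = int
Observation = int

@dataclass
class DiscretePOMDP:
    # Finite sets S, A, O.
    states: List[State]
    actions: List[Action]
    observations: List[Observation]
    # T[s][a][s_next] = P(s_next | s, a), the conditional transition probabilities.
    T: Dict[State, Dict[Action, Dict[State, float]]]
    # Omega[a][s_next][o] = P(o | a, s_next), the conditional observation probabilities.
    Omega: Dict[Action, Dict[State, Dict[Observation, float]]]
    # R(a, s): immediate reward for taking action a in state s.
    R: Callable[[Action, State], float]
    # Discount factor for future rewards.
    gamma: float = 0.95

# A belief is a probability distribution over S; for a 10 by 10 grid it has 100 entries.
Belief = Dict[State, float]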
AMA does not require using a specific POMDP solver,
although the computational benefits are greatest if there is a
way to explicitly disallow an action from a particular state so
it will not be considered. This is not a requirement, however,
because with an exceptionally high cost, most heuristics will
avoid taking these actions.
Related Work: Because of the difficulty of solving
a POMDP problem instance exactly, modern methods are
generally point-based methods that approximate the solution
using a small set of representative belief points [5], [12]–[14].
In this work, we utilize an existing point-based method, MCVI
[7], to evaluate the action model produced using AMA. MCVI
was selected because defining different action models in it is
more direct than in the other solvers we considered. AMA is not a
new method to compute solution policies for POMDP models,
but rather a method to construct a fast, medium fidelity model
that is problem-specific and so can avoid high complexity in
unnecessary regions.
AMA builds an approximate action model that is used to
speed up POMDP computation. In doing so, it also builds
an initial guess for a policy. This initial guess is used to
bootstrap solving the AMA generated POMDP problem
instance. Although this seems similar to applying a heuristic
to the underlying POMDP solver, it only provides a better initial
guess of the lower bound on expected reward, and does not
affect the sampling strategy of the POMDP solver. Many
heuristics [8], [15]–[18] have been successfully applied to
various POMDP problem instances; however, we only utilize
the extremely general sampling heuristic built in to MCVI.
This avoids any interaction between a selected sampling
heuristic and AMA’s action model computation. There is no
reason to think that utilizing one of these sampling heuristics
to improve solution speed would be problematic, but it would
complicate the analysis of results. This is an open avenue for
future research.
Hierarchical POMDP methods [9], [19], [20] are related
to AMA in that they focus on building and solving multiple
POMDP models that in turn solve a global problem definition.
However, these methods are constructing decompositions of
the action and state spaces. AMA, on the other hand, focuses
on pruning parts of the product of the action and state spaces
to construct a useful approximation, rather than decomposing
it. Thus, AMA could be incorporated with hierarchical methods.
Because the computational complexity of solving a
POMDP problem instance grows exponentially with the size
of the state space, some recent methods try to group similar
states and actions together to maintain high simulation fidelity
where needed [11], [21]. This work is similar in concept
to our proposed method, but tends to require a strong set
of assumptions and/or user-defined mapping functions to
create these groups. In contrast, AMA requires definition of
action models and an action model refinement function as
opposed to mapping functions or the existence of stabilizing
feedback controllers. Also, AMA does not attempt to provide
an optimal approximation over the whole space, but instead
only focuses on regions found to be important for a specific
problem instance. The ideas of state and observation grouping
are used to extend a POMDP solver to continuous spaces
[11]. This modification to the underlying POMDP solver is
generally orthogonal to the thrust of AMA, where we alter the
cardinality of the spaces considered, depending on the results
of solving a low accuracy model. Integrating these ideas could
provide AMA extra information for model building, based
on the groups constructed in the low accuracy solution.
Probably the closest idea to AMA is found in [22], which
builds a decomposition of the state space with varying
resolution depending on the environment features. However,
AMA uses knowledge from an approximate solution to the
specific problem instance, rather than only the environment
specification. This means that AMA will retain a simple
action model even where the environment is complex if those
regions are not useful in the current problem instance. This is
in contrast to a variable resolution method based on building
quadtrees over the entire space, where resolution depends on
environment complexity only. Additionally, we focus on the
action model, while [22] focused on the state space.
The work on policy transformation under changing models
[23], called Point-Based Policy Transformation (PBPT), is
extremely relevant although orthogonal to our approach.
Specifically, AMA focuses on building an action model with
the minimum complexity required, while PBPT focuses on
making minimal modifications to a policy to fit a new model.
AMA does not use such a sophisticated policy transformation,
instead bootstrapping the new approximate solution with
information about how the original, simple instance was
solved. This simpler method requires fewer assumptions
on the differences between the simple action model, the
approximation of the true action model, and the true action
model. In particular, PBPT requires a bijection between the
(discrete) state, action and observation spaces between the
two models under consideration, and our approach cannot
meet this requirement because we are explicitly changing the
cardinality of these spaces between the three action models.
Contributions: We introduce AMA as a general method
to compute an approximate action model that can be used
to find fast approximate solutions to POMDPs. Our high-level AMA framework is action, observation, and transition
model agnostic, and our specific POMDP implementation
for simulation results is presented in Section III. Given a
robot model and environment that induce a belief space large
enough to be computationally infeasible for existing POMDP
solution methods, AMA probes the specific problem instance
and automatically creates an action model that estimates
where high fidelity is required to produce an optimal policy.
We investigate the performance benefit of AMA and
discuss the relative merits of the action model approximation
constructed by AMA as compared to the true action model.
The use of the action model provided by AMA is shown to
generally, although not always, improve the POMDP solution
time while not dramatically impacting total expected reward.
Simulations are performed over a range of environments to
verify the scope of our results. Finally, we apply our method
to an example system presented in the software library we
used, and our initial results show the same trend of faster
solutions without a decrease in quality.
II. METHODOLOGY
AMA Framework: We will denote the true action model
that defines the problem as M . Then let M − be the simplified
version of that model. The function M̂ = Refine(M − , R) is
defined by the user and modifies the action model of M −
by adding new options in the product of state and action
spaces, selected by analyzing the set of states R. That is,
if an action model M − and set of states R are passed into
Refine, it shall return a new action model M̂ with additional
complexity added. The analysis of R determines what actions
are made available in what states. For example, if the simple
model M − only uses the 4 cardinal directions, and R includes
turning a corner, then diagonal actions to short-cut that corner
can be added. In general, if the states are nodes in a graph
connected by action edges, then more complex models can
be built with actions that, when added, will create shortcut
paths in this graph. If all possible refinements are made to the
simple action model M − , the true action model M should
result. Thus M̂ is an action model that has at least as much
complexity as M − , but no more complexity than M , and uses
the information from the solution of the POMDP problem
instance under M − and the function Refine to increase action
model complexity in only the relevant areas.
Algorithm 1 Automated model approximation algorithm
M̂ ← M −
for i = 1 → iterations do
    policy ← Solve(M̂ )
    states ← Execute(M̂ , policy)
    M̂ ← Refine(M̂ , states)
end for
return M̂
In Algorithm 1, Solve should call any POMDP solver, and
return a policy that maximizes expected reward, as described
in Section I. states is the set of ordered lists of true world
states that the robot entered while executing policy. This is
a set of lists because the simulation must account for all
possible true initial conditions with the correct probability
distribution. To accomplish this, 1000 samples are drawn
from the initial belief and executed.
Our implementation of AMA then operates in an iterative
fashion as described in Algorithm 1. We found this to be
an intuitive way to pass information to Refine, although in
general, any function of the current robot model and policy
could work here. Then M̂ is updated with the information
that this sequence of states provides. The exact form of the
Refine function will differ between various POMDP models.
Our implementation can be found in Section III.
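Before turning to the implementation, Algorithm 1 can be summarized in the following minimal Python sketch, where solve, execute, and refine stand in for the POMDP solver call, the particle simulation of a policy, and the user-supplied refinement operator; all names are illustrative assumptions rather than the interface of our actual implementation.

def ama(simple_model, solve, execute, refine, iterations=1, num_traces=1000):
    # Start from the simplified action model M^- (Algorithm 1).
    model = simple_model
    for _ in range(iterations):
        # Solve the current approximate model to obtain a policy.
        policy = solve(model)
        # Simulate the policy from samples of the initial belief; each trace
        # is the ordered list of true world states entered during execution.
        traces = [execute(model, policy) for _ in range(num_traces)]
        # Add action-model complexity only where the traces show it is needed.
        model = refine(model, traces)
    return model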
III. IMPLEMENTATION
To implement AMA, we chose to work with a state-of-the-art POMDP solver known as MCVI [7]. The authors of MCVI
provide an implementation of their method for download,
and new problems are defined by writing C++ code. This
allowed us to implement AMA by isolating our AMA code
within the action model definition, and required less
modification to the underlying code than may be the case
with other POMDP solvers. MCVI is an iterative method, so
it constructs a series of policies. The final policy will have
the highest lower bound and lowest upper bound of expected
reward. MCVI exits when upper bound − lower bound ≤
convergence criterion. This criterion is 1 for all simulations.
In MCVI, the lower bound is found by particle filter
simulation of the model applying a policy. Thus the actual
reward of the policy may be lower than this lower bound,
depending on the exact samples that get drawn in the particle
filter. We will continue to use the term lower bound even
though it has a dependence on randomness and may not be
a strict lower bound for all samplings. (Similar arguments
apply for the upper bound, although there is less variance
here so it has a smaller effect.)
Problem Instance: In this section, we will define the
spaces that the POMDP solver will be working in. The robot
definition that we will concentrate on is a point robot living in
a grid discretization of a 2D world. The robot state space thus
is a set of square grid cells, and the action space is defined
as noise free movement between cells. In this case, a clear
choice for iterative refinement is over which diagonal edges
are available. Our simple action model, M − , can always
move in the four cardinal directions, but not on diagonal
edges. The true action model, M , is a model that can always
move in the four cardinal directions as well as the four
diagonal directions. Therefore, M − should be able to solve
all problems presented but the optimal path length could be
larger by up to a factor of √2, as long as the environment
does not include diagonal zero width corridors. We only
present results in environments that fit this requirement. The
Refine function can add diagonal actions where needed based
on a set of execution traces (Algorithm 1), and is used in
AMA to create M̂ from M − . The full complexity action
model, M , is the baseline for comparison.
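To make the three action models concrete, the following illustrative Python sketch (our own notation, not code from the MCVI package) expresses the per-cell action sets of M − , M , and M̂ as sets of unit moves on the grid.

# Unit moves on the grid, written as (dx, dy) offsets.
CARDINAL = {(0, 1), (1, 0), (0, -1), (-1, 0)}
DIAGONAL = {(1, 1), (1, -1), (-1, 1), (-1, -1)}

def actions_simple(cell):
    # M^-: the four cardinal moves, available everywhere.
    return set(CARDINAL)

def actions_true(cell):
    # M: cardinal and diagonal moves, available everywhere.
    return CARDINAL | DIAGONAL

def actions_ama(cell, extra_diagonals):
    # M^hat: cardinal moves everywhere, plus whatever diagonals Refine enabled
    # for this cell (extra_diagonals maps cell -> set of diagonal moves).
    return CARDINAL | extra_diagonals.get(cell, set())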
A POMDP model must provide costs or rewards for all
actions possible. Moving in a cardinal direction and sensing
always has cost 1, while moving diagonally has cost √2.
Sensing while in a goal region is the success condition for our
task, so in this case the robot gets a large reward. MCVI uses a
discount factor, and we leave it at the default value of γ = 0.95.
The robot gets a null observation for every action except for
the sense action. In this case, a two dimensional observation
is provided that is sampled from a Gaussian around the
current true state of the robot with variance three. The initial
position of the robot is defined as a uniform sampling of a
problem specified state and all adjacent non-obstacle states.
This uncertainty in initial position drives the challenge of
solving the problem, and the uncertainty in sensing prevents
the system from collapsing belief to a single state. There is
no penalty for attempting motion into an obstacle other than
the lost time (discount of future reward) and the cost above.
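The cost and observation model just described can be summarized in a short sketch. The numeric values follow the text (unit cost for cardinal moves and sensing, cost √2 for diagonal moves, γ = 0.95, Gaussian sensing noise with variance 3), while the specific goal-reward value and the function names are assumptions made only for illustration.

import math
import random

GOAL_REWARD = 100.0    # "large reward" for sensing inside a goal cell (value assumed)
SENSE_VARIANCE = 3.0   # variance of the Gaussian observation noise
GAMMA = 0.95           # discount factor (MCVI default)

def reward(action, state, in_goal):
    # Costs are expressed as negative rewards.
    if action == "sense":
        return GOAL_REWARD if in_goal else -1.0
    if action in ("N", "E", "S", "W"):
        return -1.0                 # cardinal move: cost 1
    return -math.sqrt(2.0)          # diagonal move: cost sqrt(2)

def observe(action, true_xy):
    # Every action except the explicit sense action yields a null observation.
    if action != "sense":
        return None
    sigma = math.sqrt(SENSE_VARIANCE)
    x, y = true_xy
    # 2D observation drawn from a Gaussian centered on the true state.
    return (random.gauss(x, sigma), random.gauss(y, sigma))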
Environments: AMA is fairly general in definition, but
to evaluate the performance of the approximate robot models
it constructs, we will define several concrete instances of
problems to solve, as depicted in Figure 1.
(a) A reasonable starting point is an empty environment,
where the goal is simply to navigate from a point in the
north to anywhere in the south. This is a useful example
because it should not require diagonal actions to be
optimal, it should run quickly, and the optimal policy
should follow our intuition. Therefore we present it as
a baseline for comparison. Solving M̂ should be fast
and not take much time beyond the time to solve M − ,
because the policy from M − should have been near
optimal already. So this environment should provide the
best possible advantage to our method.
(b,c) In the U environment, the robot starts inside a U shaped
obstacle. In this case, some diagonal actions are needed
for an optimal policy. We expect to see improvement
with M̂ as compared to M , although not as much as
the empty environment. We also test a smaller, more
constrained version of this environment.
(d) The Diagonal environment has a diagonal line cutting
from the southwest up to just beyond the center. This is
expected to be the hardest environment to solve because
not only are there two very different navigation paths to
follow, but the choice needs to be made soon without
spending too much time sensing.
(e) The Spikes environment has three narrow passages.
This environment should solve quickly, because the
narrow passages actually allow the robot to reduce the
uncertainty in position. In addition, the starting location
is fairly constrained, so the initial belief does not have as
much spread as the other environments. The spikes are
along the X and Y axes, so optimal navigation requires
many diagonal actions.
(f,g) The Maze environments require navigating four turns.
The difference between the two mazes is that one is
always as wide as the initial uncertainty, while the second
has one side as a narrow passage that should reduce the
uncertainty part way through execution.
(h) The final environment is for the underwater robot model
in the MCVI package as found in [7]. In this model, the
robot can only localize in the area on top and bottom
(shaded lightly), while the goal region is to the right.
A major difference in the model is that the true action
space of M is smaller. Only five movement directions
are available: north, northeast, east, southeast, and south.
Therefore, the difference in action model complexity is
significantly less than in our navigation model.
Iterative Refinement: In this section we will describe our
implementation of Refine(M − , R) that is used in Algorithm 1
to construct M̂ . The first step is to find a solution policy with
the simplest possible general action model, M − . The simple
action model, M − , when passed to MCVI, should be able
to solve any problem instance, although it need not be able
to create an optimal policy with respect to the true model
M . Because the policy generated from solving M − is only
used to initialize the next step, we set the convergence criterion
to 10, rather than the (arbitrary, user-defined) true convergence
criterion of 1. Once this policy is computed, it is simulated
1000 times, and each execution trace is recorded in order to
call the Refine function as described in Algorithm 1.
Because the POMDP solution under the action model M −
tends to yield policies that are optimal up to short-cutting
corners, each state in the trace is checked to see if a sequence
of diagonal actions can connect to a future state in the trace.
If so, then M̂ will have that sequence of diagonal actions
added to it. As an implementation detail, the center of the
start distribution always has all 4 diagonal actions added to
ensure that our initial belief is identical between M̂ and M ,
because it is initialized based on adjacency of regions.
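A minimal sketch of this shortcut check is given below, assuming each trace is a list of grid cells and that a helper tests whether a cell is obstacle-free. For clarity the sketch scans all future states of a trace (a quadratic look-ahead); it is illustrative and not the exact code of our implementation.

def diagonal_chain(start, end, is_free):
    # If end lies on a collision-free diagonal from start, return the unit
    # diagonal step (dx, dy) that connects them; otherwise return None.
    dx, dy = end[0] - start[0], end[1] - start[1]
    if dx == 0 or abs(dx) != abs(dy):
        return None
    step = (dx // abs(dx), dy // abs(dy))
    cell = start
    while cell != end:
        cell = (cell[0] + step[0], cell[1] + step[1])
        if not is_free(cell):
            return None
    return step

def refine(extra_diagonals, traces, is_free):
    # For each visited state, look ahead in the same trace for a future state
    # reachable by a sequence of diagonal actions; if one exists, enable that
    # diagonal action in every cell along the chain.
    for trace in traces:
        for i, s in enumerate(trace):
            for t in trace[i + 1:]:
                step = diagonal_chain(s, t, is_free)
                if step is None:
                    continue
                cell = s
                while cell != t:
                    extra_diagonals.setdefault(cell, set()).add(step)
                    cell = (cell[0] + step[0], cell[1] + step[1])
    return extra_diagonals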
Adding actions based on short-cuts tends to add exactly
the edges that might be useful in the next iteration, but
does not add extraneous edges that will increase computation
time. In addition, the transitions are combined to create an
initialization map and a new initial policy. This initial policy
follows whatever transition was most likely from a given
state in the previous iteration. In this way, the work done in
the simpler action model can be applied in the more complex
action model easily, while searching for improvements using
the new diagonal actions available in M̂ . After processing
all execution traces to create M̂ , it is then passed into the
MCVI framework as a new problem instance, and it is run
until convergence or timeout, with identical parameters to M .
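The initialization map mentioned above can be built from the same traces by counting transitions and keeping the most frequent successor of each state. The sketch below is again illustrative Python, not the exact bookkeeping of our implementation.

from collections import Counter, defaultdict

def initial_policy_from_traces(traces):
    # Count observed transitions s -> s' across all execution traces.
    counts = defaultdict(Counter)
    for trace in traces:
        for s, s_next in zip(trace, trace[1:]):
            counts[s][s_next] += 1
    # The initial policy follows the most likely successor seen from each
    # state; the solver then searches for improvements with the new actions.
    return {s: succ.most_common(1)[0][0] for s, succ in counts.items()}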
Our hypothesis is that a problem-specific action model M̂
will allow us to find a policy that is almost as good as the
policy that the true action model provides, but in significantly
less time. This iterative refinement could be applied multiple
times, but in practice we found that AMA worked well with
just one refinement step. Therefore there is no ambiguity in
the use of M − , M̂ , and M to describe the action models.
IV. RESULTS
Fig. 1. Environments used for evaluation: (a) Empty, (b) U, (c) Small U, (d) Diagonal, (e) Spikes, (f) Maze, (g) Constrained Maze, (h) Underwater. Black regions are hard obstacles, stars indicate an element of initial belief, and the light shaded area at the bottom is the goal region. The final environment is for the underwater robot model of [7], and has the goal region to the right while the shaded regions at top and bottom correspond to areas that allow localization.
Each environment was run in MCVI using our iterative
refinement model on a single core to avoid timing discrepancies
with varying levels of parallelism. All experiments
were repeated 60 times to obtain statistically significant
means to compare. All error bars represent 95% confidence
intervals. The time limit for solving M − and M̂ was 10,000
seconds, and the time limit for M was 20,000 seconds. Time
taken to perform the refinement step is less than one second
and is neglected here for clarity of plots. This time grows
asymptotically linearly in the number of execution traces
considered and the planning horizon, so it is not expected to
ever be a significant influence on the total runtime.
Performance: The solver was able to converge to a
correct solution in our test of the empty environment (Figure
1(a)), as our performance data shows in Figure 2. As expected,
M − followed by M̂ did very well, taking about 2500 seconds
less than the time to solve M . As should be expected,
increasing the size of the action space when the actions
are not needed makes a large difference in the computation
required. We see that the increase in time from solving
both M − and M̂ is smaller than the increase in time
caused by the additional actions available in M .
When the U shaped obstacle (Figure 1(b)) was tested, M
and the hybrid model M̂ did not converge within the time
limits, although the simpler M − did. Therefore it is somewhat
uninteresting to compare run-times, as they are simply pegged
at timeout for everything except M − . Thus we present a plot
of the reward in Figure 3 of the best policy that the three cases
were able to find. The results here are somewhat surprising
– the solution using M did not return any policy that got a
positive reward at all, while using the simpler M − could,
and then using M̂ was able to improve on the expected
reward. Therefore, we see a significant benefit for AMA. It
is apparent that solving with M̂ is able to provide partial
answers quickly. In this case, using M failed to provide a
reasonable solution at all. Because of the difficulty in solving
this environment, we constructed a smaller version of it with
about half the number of discrete states. However, the results
were qualitatively identical: the full complexity action model
M still did not yield a policy that got positive reward within
20,000 seconds in any of the 60 runs.
Fig. 2. In the empty environment, our approximate action model is more than an order of magnitude faster. The baseline is the true robot model M , while the comparison is the time to solve the model AMA built, M̂ , stacked on top of the time to solve the simple action model (M − ) that is used by AMA as a starting point. Note that the time for M̂ is so small as to be invisible on this plot.
Fig. 3. In the U environment the use of our action model approximation method yielded an approximate solution, while the full complexity action model was not able to achieve any positive reward.
Fig. 4. The diagonal environment of Figure 1(d) caused our approximate models to use slightly more runtime, shown in (a). The spikes environment depicted in Figure 1(e) barely completed on time, and AMA provided a significant runtime advantage, shown in (b). Here, the time for solving M − is hard to see because it is so small at the bottom of the comparison column.
Fig. 5. In the underwater setting, the convergence time was highly variable, and the means are not statistically significantly different.
The environment with a diagonal line in it (see Figure 1(d))
was found to converge in all cases. In this case, we
see that the time of solving M actually beats the time to solve
using M − and M̂ combined, although not by a large margin,
as shown in Figure 4(a). This fits our expectations, because
M − cannot provide as much information to the AMA, and
there are many diagonal actions needed for an optimal policy.
If we also look at the reward of the three methods in Figure
6(b), we can see again that using M̂ achieves a value very close
to that of M , but clearly misses some
actions that were needed to be optimal. However, the absolute
difference in reward is small.
The fourth environment with three spikes to navigate,
depicted in Figure 1(e), also converged to a solution in
all cases. The total time to solve, shown in Figure 4(b),
is significantly less when using AMA. However, based on the
number of diagonal actions required, it is reasonable to think
that there will be degradation in the reward that M̂ was able
to achieve. The results presented in Figure 6(a) indicate that
this is exactly what happens: a significant decrease in runtime
is paid for by a smaller percentage decrease in reward.
The two maze environments of Figures 1(f) and 1(g) had
varying results. As before, we will compare the reward
that their partial execution could achieve, noting that in the
constrained version of the problem, solving with the simple
M − was fast, but all other runs in these environments hit
timeouts. We believe that the unconstrained maze is able to
find a good solution with all three models (see Figure 6(c))
because the uncertainty in the start state does not matter as
much. This is because there is enough room to make progress
in an open-loop fashion without even trying to disambiguate
between the possibilities for the true start state. The true state
only significantly affects the policy in the vicinity of the goal.
The true model is favored here because each corner taken
must use diagonal actions to achieve an optimal policy; in
the unconstrained case, the reward favors the true model,
which is not bootstrapped with a straight-line policy that
can take extra time to prove suboptimal. However, the
constrained maze results (see Figure 6) are believed to be
so different because the robot needs to effectively collapse
the uncertainty in its Y position to reliably enter the narrow
passage. The simpler models are able to do this within the
time limits and therefore can get to the goal on average, while
the true model is not able to find a policy that achieves this
effective localization in the time limits provided.
Finally, when we tested AMA using a noisy underwater
action model provided in MCVI as the true M , we found
very high variance in runtime (see Figure 5), although the
overall reward was found to be very consistent across the
three models.
All trials completed and found a good policy; however,
some of them took much longer than others to converge to
that result. This was the case for both the true action model
M and our intermediate complexity action model M̂ . The
simple model, M − , was much more consistently fast. Overall,
the mean time using AMA was found to be less than the mean
time to solve the true action model, although the variance
present in the data prevents a definite result.
V. DISCUSSION
We presented a method for iterative refinement of action
models in a POMDP setting, called Automated Model
Approximation, for a class of robotic applications. AMA was
applied to a discrete robot model and tested in several environments. In all but one environment, there was a significant
computational benefit to using AMA. Either solution times
were significantly reduced, or solutions were found in cases
that the fully-refined model could not solve at all. AMA was
also applied to an underwater navigation problem as presented
by the authors of MCVI. Our method showed promise, but the
results cannot be considered statistically significant due to
high variance in runtime. We believe that through refinement,
solution times on many POMDPs, traditionally an extremely
long computation, can be significantly reduced.
Fig. 6. In the spikes environment (a), the simpler models are able to run much faster, however, they miss some areas where additional complexity in the model was required to achieve optimality. The diagonal environment (b) produced a policy that achieved slightly lower reward. In the maze environments (c,d), the constrained version of the problem could not be solved by M , but the approximate models could; when unconstrained, both M and M̂ could be solved for approximately equal reward, noting that they all timeout so the total runtime for M and M̂ + M − are approximately equal at 20,000 seconds.
In the future, we will investigate the problems of variance
associated with approximate methods such as MCVI, and try
to automate the selection of good run-time parameters. Many
other problems can be framed as a POMDP, and AMA may be
useful in sensing problems such as target classification. State
refinement is also interesting, and could be implemented in
a grid-based state model by splitting grid cells into 4 smaller
cells. Then the initial policy assumes all 4 sub-cells are equal;
the solver will correct for cases where they are not. An extension
of AMA in this manner could help increase the size of the
state spaces that are considered in POMDP problems, in
addition to the complexity of the action model.
ACKNOWLEDGEMENTS
This work was supported in part by the US Army Research
Laboratory and the US Army Research Office under grant number
W911NF-09-1-0383, NSF CCF 1018798, NSF IIS 0713623. D.G. is
also supported by an NSF Graduate Research Fellowship. Equipment
used was funded in part by the Data Analysis and Visualization
Cyberinfrastructure funded by NSF under grant OCI-0959097.
REFERENCES
[1] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. Cambridge,
MA: MIT Press, 2005.
[2] C. Papadimitriou and J. Tsitsiklis, “The complexity of Markov decision
processes,” Mathematics of Operations Research, vol. 12, no. 3, pp.
441–450, 1987.
[3] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and
acting in partially observable stochastic domains,” Artificial Intelligence,
vol. 101, no. 1-2, pp. 99–134, May 1998.
[4] K. Hsiao, L. P. Kaelbling, and T. Lozano-Perez, “Grasping POMDPs,”
in Proc. 2007 IEEE Intl. Conf. on Robotics and Automation. IEEE,
Apr. 2007, pp. 4685–4692.
[5] N. Roy and S. Thrun, “Coastal navigation with mobile robots,” in
Advances in Neural Processing Systems, 1999, pp. 1043–1049.
[6] T. Smith and R. Simmons, “Point-based POMDP algorithms: Improved
analysis and implementation,” in UAI. AUAI, 2005, pp. 542–547.
[7] Z. W. W. Lim, D. Hsu, and L. Sun, “Monte Carlo value iteration
with macro-actions,” in Advances in Neural Information Processing
Systems 24, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and
K. Weinberger, Eds., 2011, pp. 1287–1295.
[8] J. S. Dibangoye, A.-I. Mouaddib, and B. Chaib-draa, “Point-based
incremental pruning heuristic for solving finite-horizon DEC-POMDPs,”
in Intl. Conf. on Autonomous Agents and Multiagent Systems. Budapest,
Hungary: Intl. Foundation for Autonomous Agents and Multiagent
Systems, 2007, pp. 569–576.
[9] A. Foka and P. Trahanias, “Real-time hierarchical POMDPs for
autonomous robot navigation,” Robotics and Autonomous Systems,
vol. 55, no. 7, pp. 561–571, July 2007.
[10] H. Kurniawati, D. Hsu, and W. S. Lee, “SARSOP: Efficient point-based
POMDP planning by approximating optimally reachable belief spaces,”
in Robotics: Science and Systems, 2008.
[11] H. Kurniawati, T. Bandyopadhyay, and N. Patrikalakis, “Global motion
planning under uncertain motion, sensing, and environment map,”
Autonomous Robots, vol. 33, pp. 255–272, 2012.
[12] J. Pineau, G. Gordon, and S. Thrun, “Point-based value iteration:
An anytime algorithm for POMDPs,” Intl. Joint Conf. on Artificial
Intelligence, vol. 18, pp. 1025–32, 2003.
[13] H. Kurniawati, Y. Du, D. Hsu, and W. S. Lee, “Motion planning under
uncertainty for robotic tasks with long time horizons,” The Intl. Journal
of Robotics Research, 2010.
[14] H. Bai, D. Hsu, W. Lee, and V. Ngo, “Monte carlo value iteration
for continuous-state POMDPs,” in Algorithmic Foundations of Robotics
IX, ser. Springer Tracts in Advanced Robotics, D. Hsu, V. Isler, J.-C.
Latombe, and M. Lin, Eds. Springer Berlin / Heidelberg, 2011, vol. 68,
pp. 175–191.
[15] Z. Zhang and X. Chen, “Accelerating Point-Based POMDP Algorithms
via Greedy Strategies,” Simulation, Modeling, and Programming for
Autonomous Robots, vol. 6472, pp. 545–556, 2010.
[16] M. Hauskrecht, “Value-Function Approximations for Partially Observable Markov Decision Processes,” Journal of Artificial Intelligence
Research, vol. 13, pp. 33–94, June 2000.
[17] O. Madani, S. Hanks, and A. Condon, “On the undecidability of
probabilistic planning and infinite-horizon partially observable markov
decision problems,” in Proc. of the sixteenth national conference on
Artificial intelligence, ser. AAAI ’99. Menlo Park, CA, USA: American
Association for Artificial Intelligence, 1999, pp. 541–548.
[18] W. S. Lovejoy, “A survey of algorithmic methods for partially observed
Markov decision processes,” Annals of Operations Research, vol. 28,
no. 1, pp. 47–65, Dec. 1991.
[19] J. Pineau, N. Roy, and S. Thrun, “A hierarchical approach to POMDP
planning and execution,” Workshop on Hierarchy and Memory in
Reinforcement Learning, 2001.
[20] J. Pineau, M. Montemerlo, M. Pollack, N. Roy, and S. Thrun, “Towards
robotic assistants in nursing homes: Challenges and results,” Robotics
and Autonomous Systems, vol. 42, no. 3-4, pp. 271–281, Mar. 2003.
[21] A.-a. Agha-mohammadi, S. Chakravorty, and N. Amato, “FIRM:
Feedback controller-based information-state roadmap - A framework
for motion planning under uncertainty,” in Intl. Conf. on Intelligent
Robots and Systems. IEEE, Sept. 2011, pp. 4284–4291.
[22] R. Kaplow, A. Atrash, and J. Pineau, “Variable resolution decomposition
for robotic navigation under a POMDP framework,” in IEEE Intl. Conf.
on Robotics and Automation. IEEE, May 2010, pp. 369–376.
[23] H. Kurniawati and N. M. Patrikalakis, “Point-Based Policy Transformation: Adapting Policy to Changing POMDP Models,” in Algorithmic
Foundations of Robotics X, ser. Springer Tracts in Advanced Robotics,
D. Hsu, V. Isler, J.-C. Latombe, and M. Lin, Eds. Springer Berlin /
Heidelberg, 2012, pp. 1–16.