
Proceedings of the Ninth Symposium on Abstraction, Reformulation and Approximation
Planning with State Uncertainty via
Contingency Planning and Execution Monitoring
Minlue Wang and Richard Dearden
School of Computer Science, University of Birmingham,
Birmingham B15 2TT, UK
{mxw765,rwd}@cs.bham.ac.uk
Abstract

This paper proposes a fast alternative to POMDP planning for domains with deterministic state-changing actions but probabilistic observation-making actions and partial observability of the initial state. This class of planning problems, which we call quasi-deterministic problems, includes important real-world domains such as planning for Mars rovers. The approach we take is to reformulate the quasi-deterministic problem into a completely observable problem and build a contingency plan in which branches occur wherever an observational action is used to determine the value of some state variable. The plan for the completely observable problem is constructed assuming that state variables can be determined exactly at execution time using these observational actions. Since this is often not the case due to imperfect sensing of the world, we then use execution monitoring to select additional actions at execution time to determine the value of the state variable sufficiently accurately. We use a value of information calculation to decide which information-gathering actions to perform and when to stop gathering information and continue execution of the branching plan. We show empirically that, while the plans found are not optimal, they can be generated much faster and are of better quality than those of other approaches.

1 Introduction

Many robotic decision problems are characterised by worlds that are not perfectly observable but have deterministic or almost deterministic actions. An example is a Mars rover: thanks to low-level control and obstacle avoidance, rovers can be expected to reach their destinations reliably, and can collect and communicate data, but they do not know in advance which science targets are interesting and hence will provide valuable data. Similarly, robots performing tasks such as security or cognitive assistance are generally able to navigate reliably, but use unreliable vision algorithms to detect the people and objects with which they are supposed to interact. Following Besse and Chaib-draa (2009), we will refer to problems with deterministic actions but stochastic observations as quasi-deterministic problems; these differ from deterministic POMDPs (DET-POMDPs) (Bonet 2009) in that they take into account uncertainty in the observation model.

In the past, quasi-deterministic problems have been solved by assuming either that the world is perfectly known in advance (for the MER Mars rovers (Bresina and Morris 2007) this is done by attaching goals to doing experiments on targets whether or not the target turns out to be interesting) or that it will be known at execution time (Brenner and Nebel 2006; Bresina et al. 2002; Pryor and Collins 1996). However, these approaches ignore the noise in the observations, assuming that any observation made is correct. This means that they can perform arbitrarily poorly in the presence of sensor noise, and they also have no way of choosing between different ways to make observations. An alternative is to consider these problems as partially observable Markov decision problems (POMDPs) (Cassandra, Kaelbling, and Littman 1994). POMDPs allow an explicit representation of quasi-deterministic problems. However, in practice they are very hard to solve: state-of-the-art exact solvers can manage only hundreds or thousands of states for general problems (Poupart 2005). As POMDPs can represent a more general class of problem (allowing stochastic actions as well as observations), it should be possible to exploit the division between state-changing actions (deterministic) and observation-making actions (stochastic) to solve quasi-deterministic problems more efficiently (although quasi-deterministic POMDPs are easier to solve than general POMDPs, finding ε-optimal policies is still in PSPACE (Besse and Chaib-draa 2009)).
The major problem with applying POMDP approaches to
realistic planning problems like the Mars rovers is the sheer
size of the problems. Using point-based approximations and
structured representations similar to those used in classical planning (Poupart 2005), problems with tens of millions
of states can be solved approximately, but even that corresponds to a classical planning problem with only 25 binary
variables, which is quite a small problem by the standards
of classical deterministic planning. The alternative we propose in this paper is to construct a series of classical deterministic planning problems from the quasi-deterministic
problem. By solving each of these deterministic problems
we construct a contingent plan—one that contains branches
to be chosen between at run-time. However, the contingent
plan is not directly executable as the conditions that lead to
each branch may not be known. Therefore we use execution
monitoring to determine which branch to take at run time by
executing one or more observation-making actions.
To generate the classical planning problem from the original quasi-deterministic problem, we take the approach of
Yoon et al. (2007) and determinise the problem by replacing the probabilistic effects of actions and observations with
their most probable outcome (this is called single outcome
determinisation in Yoon et al.). We then pass this problem
to our classical planner (FF (Hoffmann and Nebel 2001)) to
generate a plan. For each point in this plan where an observation action (or any other non-deterministic action) is executed, and therefore a state different from the determinised one could occur, we use FF to generate another plan, which we insert as a branch. This process continues until the contingent plan is complete.
As the contingent plan is executed, the execution monitoring system maintains a belief distribution over the values
of the state variables in exactly the same way a POMDP
planner would. Whenever a branch point is reached that depends on the value of some variable x (in general, branch
points could be based on functions of many variables; for
ease of exposition we will present the single variable case),
execution monitoring repeatedly uses a value of information
calculation to greedily choose an observation action to try to
compute the value of x. Once no observation action is available that has greater than zero value, execution continues
with the branch with the highest expected value.
In Section 2 we formally present quasi-deterministic planning problems and describe how we generate a classical
planning problem from them. Section 3 then presents the
contingency planning algorithm in detail, while Section 4
presents the execution monitoring approach. We discuss related work in Section 5, present an experimental evaluation
in Section 6, and finish with our conclusions and future directions.
2 Quasi-Deterministic Planning Problems

The POMDP model of planning problems is sufficiently general to capture all the complexities of the domains we consider here (although in principle our approach will work for problems with resources and durative actions as well). Formally, a POMDP is a tuple ⟨S, A, T, Ω, O, R⟩ where:

• S is the state space of the problem. We assume all states are discrete.
• A is the set of actions available to the agent.
• T is the transition function that describes the effects of the actions. We write P(s, a, s′), where s, s′ ∈ S and a ∈ A, for the probability that executing action a in state s leaves the system in state s′.
• Ω is the set of possible observations that the agent can make.
• O is the observation function that describes what is observed when an action is performed. We write P(s, a, s′, o), where s, s′ ∈ S, a ∈ A, and o ∈ Ω, for the probability that observation o is seen when action a is executed in state s, resulting in state s′.
• R is the reward function that defines the value to the agent of particular activities. We write R(s, a), where s ∈ S and a ∈ A, for the reward the agent receives for executing action a in state s.

Quasi-deterministic planning problems can be defined as problems in which the actions are of two types: state-changing actions and observation-making actions. We define state-changing actions as those where ∀s ∃s′: P(s, a, s′) = 1 and P(s, a, s″) = 0 for s″ ≠ s′, and ∀s, a, s′ ∃o: P(s, a, s′, o) = 1 and P(s, a, s′, o′) = 0 for o′ ≠ o. That is, for every state they are performed in there is exactly one state they transition to, and their observations are uninformative. In contrast, for observation-making actions the observation function O is unconstrained and the state does not change: ∀s, a: P(s, a, s) = 1 and P(s, a, s′) = 0 for s′ ≠ s. This is the point at which the model differs from the DET-POMDP formulation (Bonet 2009), where the observation function must also be deterministic. (We note that our approach does not require that all state-changing actions have uninformative observations. However, the efficiency of the approach we describe depends on the number of observations in the plan, as this determines the number of branches.)

In practice, the problems we are interested in are unlikely to be specified in a completely flat, general POMDP form. Rather, we expect that, just as in classical planning, they will be specified using state variables, for example in a dynamic Bayesian network as in symbolic Perseus (Poupart 2005), which we will use for comparison purposes. Similarly, we use factored-observable models (Besse and Chaib-draa 2009) to simplify the representation of the observation space.

In addition to the definition of the POMDP itself, we also need an initial state and an optimality criterion. For quasi-deterministic POMDPs where S is represented using state variables as in classical planning, the state variables can be divided into completely observable variables So (such as the location of the rover) and partially observable ones Sp (such as whether a particular rock is of interest). Since all the state-changing actions are deterministic, the partially observable variables are exactly those that are not known with certainty in the initial state. We represent this by an initial belief state and, for all s ∈ Sp, write b(s) for the probability that s is true (for the purposes of exposition, we will assume all variables are Boolean). For the optimality criterion, so as to be able to compare performance directly between our approach and a POMDP solver (symbolic Perseus), we will use total reward.

We illustrate this using the RockSample domain (Smith and Simmons 2004), in which a robot can visit and collect samples from a number of rocks at locations in a rectangular grid. Some of the rocks are "good", indicating that the robot will get a reward for sampling them. Others are "bad", and the robot gets a penalty for sampling them. The robot's state-changing actions are to move north, south, east or west in the grid, or to sample a rock, and the observational actions are to check each rock, which returns a noisy observation of whether the rock is good or not.
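To make this structure concrete, here is a minimal sketch, not taken from the paper, of how a factored quasi-deterministic RockSample-style model might be encoded in Python; the class and method names are illustrative assumptions, while the 0.8 sensor accuracy and the 0.6 prior for rock0 mirror values used in the paper's examples.

# Illustrative sketch (assumed names) of a factored quasi-deterministic model.
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class QuasiDetProblem:
    # Completely observable variables So, e.g. the rover's grid position.
    position: Tuple[int, int]
    # Initial belief over the partially observable Boolean variables Sp:
    # belief[s] = probability that rock s is good.
    belief: Dict[str, float]

    def move(self, dx: int, dy: int) -> None:
        """State-changing action: deterministic effect, uninformative observation."""
        x, y = self.position
        self.position = (x + dx, y + dy)

    def prob_observe_good(self, rock: str, accuracy: float = 0.8) -> float:
        """Observation-making action: the state is unchanged; this returns the
        probability that the noisy check of the given rock reports 'good'."""
        p = self.belief[rock]
        return accuracy * p + (1.0 - accuracy) * (1.0 - p)

problem = QuasiDetProblem(position=(0, 0), belief={"rock0": 0.6, "rock1": 0.5})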
Classical Planning Representation
To use a contingency planner to solve this problem, we translate it into the probabilistic planning domain definition language (PPDDL) (Younes and Littman 2004). PPDDL
is designed for problems that can be represented as completely observable Markov decision problems (known state
but stochastic actions). To use it for a quasi-deterministic
problem we need to represent the effects of the observation-making actions. Following Wyatt et al. (2010), we do this by adding a knowledge predicate kval() to indicate the agent's knowledge about an observation variable. For instance, in the RockSample problem (Smith and Simmons 2004), we use kval(rover0, rock0, good) to represent that rover0 knows that rock0 has good scientific value. The knowledge predicate is included in the effects of the observation-making action and also appears in the goals to ensure that the
agent has to find out the value. Because state-changing actions produce uninformative observations, knowledge predicates will only appear in the effects of observation-making
actions. As an example, the checkRock action from the
RockSample domain might appear as follows:
(:action checkRock
 :parameters (?r - rover ?rock - rocksample ?value - rockvalue)
 :precondition (not (measured ?r ?rock ?value))
 :effect (and
   (when (and (rock_value ?rock ?value) (= ?value good))
     (probabilistic
       0.8 (and (measured ?r ?rock good) (kval ?r ?rock good))
       0.2 (and (measured ?r ?rock bad) (kval ?r ?rock bad))))
   (when (and (rock_value ?rock ?value) (= ?value bad))
     (probabilistic
       0.8 (and (measured ?r ?rock bad) (kval ?r ?rock bad))
       0.2 (and (measured ?r ?rock good) (kval ?r ?rock good))))))

For example, for a RockSample problem with a single rock rock0, which is a good rock with probability 0.6, we determinise the initial state to produce one where the rock is good, and the goal becomes to sample a rock where kval(rover0, rock0, good) is true. The action checkRock is then determinised (in PDDL) as:

(:action checkRock
 :parameters (?r - rover ?rock - rocksample ?value - rockvalue)
 :precondition (not (measured ?r ?rock ?value))
 :effect (and
   (when (and (rock_value ?rock ?value) (= ?value good))
     (and (measured ?r ?rock good) (kval ?r ?rock good)))
   (when (and (rock_value ?rock ?value) (= ?value bad))
     (and (measured ?r ?rock bad) (kval ?r ?rock bad)))))
3 Generating Contingency Plans

Given a quasi-deterministic planning problem as described in the previous section, we seek to generate contingent plans where each branch point in a plan is associated with one possible outcome of an observation action. We use the simple approach of Warplan-C (Warren 1976) to generate contingent plans. There are two steps in generating contingency plans for a quasi-deterministic planning problem. First, we determinise the problem according to single-outcome determinisation (Yoon, Fern, and Givan 2007). For each observation action, only the most likely probabilistic effect is chosen. Similarly, only the most likely state from the initial belief state is used to define the initial state of the determinised problem. This converts a quasi-deterministic planning problem into a standard classical deterministic model. We then pass this determinised problem to the classical planner FF (Hoffmann and Nebel 2001), which generates a plan to achieve the goal from the determinised initial state. Since the goal includes knowledge predicates, the plan by necessity includes some observation-making actions.

Since in reality each observation-making action in the plan could have an outcome other than the one selected in the determinisation, we then traverse the plan, updating the initial belief state as we go, until an observation-making action is encountered. This forms a branch point in the plan. For each possible value of the observed state variable apart from the one already planned for, we generate a new initial state using the belief state from the existing plan and the value of the variable, and then call FF again to generate a new branch, which is attached to the plan at this point. This process repeats until all observation-making actions have branches for every possible value of the observed variable. The full algorithm is displayed in Algorithm 1.

Algorithm 1 Generating the contingent plan using FF
  plan = FF(initial-state, goal)
  while plan contains observation actions without branches do
    Let o be the first observation-making action in plan without a branch, observing variable v with determinised value v1
    Let s be the belief state after executing all actions preceding o from the initial state
    for each value vi, i ≠ 1, of v with non-zero probability in s do
      branch = FF(s ∪ (v = vi), goal)
      Insert branch as a branch at o
    end for
  end while
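As a rough illustration of how Algorithm 1 can be realised, the following Python sketch assumes a planner stub ff(belief, goal) standing in for FF and simple data classes for actions and plan steps; these names and structures are assumptions made for illustration, not the authors' implementation.

# Illustrative sketch of Algorithm 1 with an assumed planner stub ff(belief, goal).
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Action:
    name: str
    observes: Optional[str] = None               # observed variable, or None for a state-changing action
    determinised_value: Optional[object] = None  # most likely value assumed by the determinisation

@dataclass
class PlanStep:
    action: Action
    # Branches keyed by observed value; each branch is its own list of steps.
    branches: Dict[object, List["PlanStep"]] = field(default_factory=dict)

def build_contingent_plan(belief: Dict[str, Dict[object, float]],
                          goal: object,
                          ff: Callable[..., List[Action]]) -> List[PlanStep]:
    """Plan for the determinised problem, then recursively add a branch for
    every other possible value of each observed variable (Algorithm 1)."""
    steps = [PlanStep(a) for a in ff(belief, goal)]
    belief = {v: dict(d) for v, d in belief.items()}         # local copy we can update
    for step in steps:
        a = step.action
        if a.observes is None:
            continue
        v = a.observes
        for value, prob in belief[v].items():
            if value != a.determinised_value and prob > 0.0:
                branch_belief = {**belief, v: {value: 1.0}}  # new initial state for this branch
                step.branches[value] = build_contingent_plan(branch_belief, goal, ff)
        belief[v] = {a.determinised_value: 1.0}              # main branch keeps the determinised value
    return steps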
Since the approach in Algorithm 1 enumerates all the possible contingencies that could happen during execution, the number of branches in the contingent plan is exponential in the number of observation-making actions in the plan. This is precisely why it is useful that the state-changing actions do not generate observations: it keeps the number of branches as low as possible. Also, since the determinised problem assumes the observation-making actions are perfectly accurate, in any branch of the plan at most one observation-making action will appear for each state variable in Sp. Thus in practice, we expect there to be a relatively small number of branches. In the RockSample domain, for example, there is one branch per rock in the problem. This is illustrated in Figure 1. On the left is an example problem from the RockSample domain with a 4x4 grid and two rocks, while the right-hand side shows the plan generated by the contingency planner. In this version of the domain, the robot gets a large positive reward if it moves to the exit with a sample of a good rock, zero reward if it moves to the exit with no sample, and a large negative reward if it moves to the exit with a sample of a bad rock.

Figure 1: An example of the RockSample(4,2) domain and a contingent plan generated for that problem. The rectangles in the plan are state-changing (mostly moving) actions and the circles are observation-making actions for the specified rock.

4 Execution Monitoring

The approach we described in Section 3 for generating branching plans relies on a relaxation of the uncertainty in the initial state and the observation actions. The result is plans that account for every possible state the world might be in, but do not account for the observations needed to discover that state. That is, they are executable only if we assume complete observability at execution time (or, equivalently, that the observation actions are perfectly reliable, as in DET-POMDPs). If, as is the case in the RockSample domain, the sensing is not perfectly reliable and therefore the state is not known with certainty, they may perform arbitrarily badly. To overcome this problem we propose a novel execution monitoring and plan modification approach to increase the quality of the plan that is actually executed. During execution, we keep track of the agent's belief state after each selected action via a straightforward application of Bayes' rule, just as a POMDP planner would. To select actions to perform when we reach an observation-making action in the plan, we utilise a value of information calculation (Howard 1966). Suppose the plan consists of a state-changing action sequence a1, followed by observation action o1, which measures state variable c. If c is true, branch T1 will be executed, and if c is false, branch T2 will be executed. When execution reaches o1, execution monitoring calculates the expected utility of the current best branch T∗ based on the belief state b(c) over the value of c after a1 as follows:

  Ub(T∗) = max_{Ti} U(Ti, b)    (1)

where Ub(T∗) represents the value, in belief state b, of making no observations and simply executing the best branch, and U(Ti, b) is the expected value of executing branch Ti in belief state b.

Next we examine the value of performing an observation-making action o (not necessarily the same o1 as planned) that gives information about c. Performing o will change the belief state depending on the observation that is returned. Let B be the set of all possible such belief states, one for each possible observation returned by o, and let P(b′) be the probability of getting an observation that produces belief state b′ ∈ B. Let cost(o) be the cost of performing action o. The value of the information gained by performing o is the value of the best branch to take in each b′, weighted by the probability of b′, less the cost of performing o and the value of the current best branch:

  VG(o) = Σ_{b′ ∈ B} P(b′) Ub′(T^b′) − Ub(T∗) − cost(o)    (2)

where T^b′ is the best branch to take given belief state b′:

  Ub′(T^b′) = max_{Ti} U(Ti, b′)    (3)
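To make Equations 1 and 2, and the Bayes-rule belief update used during execution, concrete, here is a minimal Python sketch for a single Boolean variable observed by a noisy sensor. The helper names and the linear branch utilities in the example are illustrative assumptions; the 0.8 accuracy, the rewards of 20 and −40, and the belief of 0.6 mirror values used elsewhere in the paper.

# Illustrative sketch: belief update and value of information (Equations 1-2)
# for one Boolean variable c observed by a sensor with an assumed accuracy.
from typing import Callable, Dict

def bayes_update(b_c: float, observed_true: bool, accuracy: float = 0.8) -> float:
    """Posterior P(c = true) after one noisy observation of c."""
    like_true = accuracy if observed_true else 1.0 - accuracy    # P(obs | c true)
    like_false = 1.0 - accuracy if observed_true else accuracy   # P(obs | c false)
    evidence = like_true * b_c + like_false * (1.0 - b_c)
    return like_true * b_c / evidence

def best_branch_value(b_c: float, branch_utility: Dict[str, Callable[[float], float]]) -> float:
    """Equation 1: Ub(T*) = max over branches Ti of U(Ti, b)."""
    return max(u(b_c) for u in branch_utility.values())

def value_gain(b_c: float, cost_o: float,
               branch_utility: Dict[str, Callable[[float], float]],
               accuracy: float = 0.8) -> float:
    """Equation 2: expected best-branch value over the possible posteriors,
    minus the current best-branch value and the cost of observing."""
    p_obs_true = accuracy * b_c + (1.0 - accuracy) * (1.0 - b_c)
    vg = -best_branch_value(b_c, branch_utility) - cost_o
    for observed_true, p_obs in ((True, p_obs_true), (False, 1.0 - p_obs_true)):
        posterior = bayes_update(b_c, observed_true, accuracy)
        vg += p_obs * best_branch_value(posterior, branch_utility)
    return vg

# Example: branch T1 samples the rock (reward 20 if good, -40 if bad), T2 skips it.
branches = {"T1": lambda b: 20.0 * b - 40.0 * (1.0 - b), "T2": lambda b: 0.0}
print(value_gain(b_c=0.6, cost_o=1.0, branch_utility=branches))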
Both Equation 1 and Equation 3 rely on the ability to compute the utility of executing a branch of the plan, U(T, b). Building the complete contingent plan allows us to estimate this value when deciding what observation actions to perform. We do this by a straightforward backup of the expected rewards of each plan branch given our current belief state. The value of U(T, b) (the utility of branch T in belief state b) is computed as follows:

• if T is an empty branch, then U(T, b) is the reward achieved by that branch of the plan.
• if T consists of a state-changing action a followed by the rest of the branch T′, then U(T, b) = U(T′, b) − cost(a); that is, we subtract the cost of this action from the utility of the rest of the branch.
• if T consists of an observation-making action o on some variable d (observation-making actions for each variable appear at most once), with branch Ti corresponding to value di of d, then U(T, b) = Σ_i b(di) U(Ti, b) − cost(o); that is, we weight the value of each branch at o by our current belief about d.
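The backup just described can be sketched in Python as follows, under an assumed representation of a branch as a list of steps with per-value sub-branches at observation actions; the data structures and names are illustrative, not the paper's implementation.

# Illustrative sketch of the recursive backup of U(T, b) over a contingent branch.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Step:
    cost: float
    observes: Optional[str] = None                     # variable d, or None for a state-changing action
    branches: Dict[object, "Branch"] = field(default_factory=dict)

@dataclass
class Branch:
    steps: List[Step]
    final_reward: float = 0.0                          # reward achieved when the branch completes

def utility(T: Branch, b: Dict[str, Dict[object, float]]) -> float:
    """U(T, b): backed-up expected reward of executing branch T in belief state b."""
    if not T.steps:                                    # empty branch: its achieved reward
        return T.final_reward
    step, rest = T.steps[0], Branch(T.steps[1:], T.final_reward)
    if step.observes is None:                          # state-changing action a
        return utility(rest, b) - step.cost
    d = step.observes                                  # observation-making action o on variable d
    return sum(b[d][di] * utility(Ti, b) for di, Ti in step.branches.items()) - step.cost

# Example: a branch that samples a rock (cost 1) and then ends with reward 20.
print(utility(Branch([Step(cost=1.0)], final_reward=20.0), b={}))   # 19.0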
This ability to estimate the value of each branch is in contrast to the alternative approach of replanning (e.g. see (Goebelbecker, Gretton, and Dearden 2011), which we discuss in more detail in Section 5), where the utility of the future plan is impossible to determine, since you cannot be sure what plan will actually be executed until the replanning has occurred. Even in our case, we cannot compute this value exactly, as we do not know what additional actions execution monitoring will add to the plan. However, since FF will choose the minimum-cost observational action (the determinised versions of the observation-making actions are identical apart from their costs), we can be sure that the cost we estimate for the tree by the procedure described above will be an underestimate, thus ensuring that execution monitoring will never perform fewer observational actions than are needed to determine which branch to execute.

For the plan in Figure 1, assume we get rewards of V+, 0, and V− for sampling a good rock, taking no sample, and sampling a bad rock respectively, and costs of Co for observation actions, Cs for sampling actions and Cm for moving actions. When we reach the observation action for rock R1 in a belief state b, the value of the "good" branch is:

  b(R1 = good)V+ + b(R1 = bad)V− − 4Cm − Cs

while the value of the "bad" branch is:

  max{ b(R2 = good)V+ + b(R2 = bad)V− − 3Cm − Cs, −3Cm } − (2Cm + Co)

Here the first term inside the max is the value of taking the left branch at R2, the second term is the value of the right branch, which is simply the cost of moving to the exit without sampling any rocks, and the final term is the penalty for moving to R2 and observing its value, which applies to both branches. Note that if the "bad" branch is taken at R1, then in neither case is any reward gained from R1, so the belief we have in that rock becomes irrelevant to the plan value. To compute the value gain for an action o, we compare the value of the best branch given our current belief state with the value of the best branch given each possible outcome of o, weighted by the probability, according to our current belief state, of getting that outcome.

The execution monitoring algorithm is given in Algorithm 2. The equations above are used to select an action to perform in each iteration, and the process repeats until no action with a positive value can be found. At that point, execution selects the best branch and continues by executing it. We might expect that in some circumstances this greedy approach to observation-making action selection might be sub-optimal. However, the action selection problem clearly satisfies the requirements for submodularity (Krause and Guestrin 2007), which guarantees that the greedy approach is close to optimal.

Algorithm 2 Execution monitoring at observation-making action o
  Let c be the variable being observed by o
  Let A be the set of actions that provide information about c
  repeat
    Let VG(a) be the value gain for a ∈ A according to Equation 2
    Let a∗ = arg max_a VG(a)
    if VG(a∗) > 0 then
      execute a∗ and update the belief state b based on the observation returned
    end if
  until VG(a∗) ≤ 0
  Execute the best branch given the new belief state b according to Equation 3

The restriction that execution monitoring can only choose among the observation-making actions is important (if we allowed state-changing actions to have non-trivial observations, they might have positive value of information). If execution monitoring were allowed to select actions that changed the state, the rest of the plan might not be executable from the changed state. This fact limits the applicability of this approach in general POMDPs.

One thing worth noting is that the value of information approach has the ability to choose between multiple observation actions, by looking at the value gained by every observation action and picking the one with the highest value. After that action is executed, we continue to choose and execute the best observation-making action until there is no action o with VG(o) > 0.
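Finally, a minimal Python sketch of the greedy loop in Algorithm 2, assuming helper functions value_gain (Equation 2) and execute_and_update (the Bayes-rule update from an actual observation) are supplied by the caller; the function and parameter names are assumptions made for illustration.

# Illustrative sketch of Algorithm 2: greedily observe while some action has a
# positive value of information, then commit to the best branch.
from typing import Callable, Dict, List

def monitor_at_branch_point(observation_actions: List[str],
                            belief: Dict[str, float],
                            value_gain: Callable[[str, Dict[str, float]], float],
                            execute_and_update: Callable[[str, Dict[str, float]], Dict[str, float]],
                            branch_utility: Dict[str, Callable[[Dict[str, float]], float]]) -> str:
    """Return the name of the branch to execute after greedy information gathering."""
    while observation_actions:
        # Pick the observation action with the highest value of information (Equation 2).
        best_action = max(observation_actions, key=lambda a: value_gain(a, belief))
        if value_gain(best_action, belief) <= 0.0:
            break                                   # no action has positive value gain
        # Execute it and update the belief state from the returned observation.
        belief = execute_and_update(best_action, belief)
    # Commit to the branch with the highest expected utility (Equation 3).
    return max(branch_utility, key=lambda name: branch_utility[name](belief))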
5 Related Work

Many execution monitoring approaches (Fikes, Hart, and Nilsson 1972; Giacomo, Reiter, and Soutchanski 1998; Veloso, Pollack, and Cox 1998) have been developed to detect discrepancies between the actual world state and the agent's knowledge of the world, and incorporate plan modification or replanning to recover from any such unexpected state. In most cases the discrepancies that these approaches are trying to detect result from exogenous events or action failures. In addition, most of these approaches focus only on monitoring the execution of straight-line plans. A good survey of these approaches can be found in (Pettersson 2005). However, as none of them addresses the same problem of partial observability that we investigate, their approaches are not comparable with ours.

The closest piece of related work to ours from the execution monitoring literature is (Boutilier 2000), which similarly used classical planning plus execution monitoring to solve problems that could be represented as POMDPs. In that work the plans are non-branching and the problem is to decide when to observe the preconditions of actions and determine if they are true, as opposed to using execution monitoring to determine which branch to take. In common with our approach, they use value of information to measure whether monitoring is worthwhile, but then formulate the monitoring decision problem as a set of POMDPs, rather than using value of information directly to select observational actions.

The other most closely related approach is that of (Fritz 2009). They are interested in monitoring plan optimality and identifying when a discrepancy in the plan is relevant to the quality of the plan. To show that the current plan is still optimal, they need to record the criteria that make the current plan optimal (i.e. the conditions that make it better than the next best plan). This allows them to monitor just those relevant conditions, and, in an approach similar to ours, directly compare the quality of two candidate plans from the current state.

An alternative to execution monitoring for solving quasi-deterministic problems efficiently is described in (Goebelbecker, Gretton, and Dearden 2011). There they use a classical planner and a decision-theoretic (DT) planner to solve these problems, switching between them as they generate a plan. The approach is similar to ours in that they use FF to plan in a determinisation of the original problem augmented with actions to determine the values of state variables, which they call assumption actions. However, they build linear plans with FF and switch to the DT planner to improve the plan whenever an observation action is executed. The DT planner looks for a plan either to reach the goal or to disprove one of the assumptions; if this occurs, replanning is triggered. The advantage of their approach is that it can find more general plans using the DT planner, so we might expect it to produce slightly better quality plans overall. However, the DT planner is much more computationally expensive than the simple value of information calculation we use.

Classical planners have also been applied in fully observable MDP domains. By far the most successful of these approaches has been FF-Replan (Yoon, Fern, and Givan 2007), from which we have taken the determinisation ideas discussed above. Because these approaches rely on being able to determine the state after each action, they cannot easily be applied in POMDPs. Our approach can be thought of as FF-Replan (although we actually build the entire contingent plan rather than a single branch) where we use execution monitoring to determine, with sufficient probability, the relevant parts of the state.

6 Experimental Evaluation

We tested our approach on the classical POMDP problem RockSample (Smith and Simmons 2004). As described in Section 2, there are five state-changing actions in the domain: four moving actions and one sampling action. Each rock has an observation action which is not perfectly reliable. A reward of 20 is given if the rover samples a good rock and goes to the exit, and a reward of −40 if a bad rock is sampled. A large penalty is given if there is no rock at the position of the rover when sampling, or if the rover moves out of the grid except to go to the exit. We write RockSample(n, k) for an n by n grid with k rocks; the size of the state space is n² × 2^k. To capture the idea that observation-making actions are usually faster and cheaper than state-changing actions, we use a version of RockSample with costs on the actions: the movement actions have cost two, while all others have cost one. An example of RockSample(4, 2) is given in Figure 1.

For comparison purposes we wanted to use an optimal plan as computed by a POMDP solver. The only one we could find that would solve RockSample problems of interesting size was symbolic Perseus (Poupart 2005), which is a point-based solver that uses a structured representation. This algorithm is only approximately optimal, and the quality of the policies found depends strongly on how the points for the approximation are selected; however, having run the algorithm with a range of different parameters, we are fairly confident that the policies found are very close to optimal. Unfortunately, incompatibilities between the PDDL and symbolic Perseus domain specification languages meant that we couldn't use the standard version of the RockSample problem (Smith and Simmons 2004) directly. To overcome this we added the exit to the grid and made the goal to go to the exit with a sample from a good rock, and in addition we removed the ability to observe rocks from a distance, leaving only a single observe action that works when the rover is in the same square as the rock. When the robot goes to the exit, the problem resets.

As well as symbolic Perseus, we also compared with the same contingent plan without the execution monitoring. For evaluation we need a plan quality measure which can easily be computed for both the POMDP and classical planners. To that end, we compare all the planners in terms of their total reward averaged over 200 runs, each of 200 steps. Figure 2 shows the total reward of these three approaches. Any problems larger than size (8,8) cannot be solved by symbolic Perseus within reasonable time (one hour) and memory usage. To aid comparison, we show the total reward for the contingency planning approaches as a fraction of the expected reward for the optimal policy computed by symbolic Perseus. The error bars are one standard deviation each side of the mean. The addition of execution monitoring produced a significant improvement over contingency planning alone, although neither classical planning approach reached optimal performance for most problems.

Figure 2: Ratio of plan quality to optimal as computed by Symbolic Perseus, for RockSample problems from (4,4) to (8,8). The curves are: Optimal (symbolic Perseus), FF plus execution monitoring, and FF without execution monitoring.

The key argument for this work is that the classical planning approach has a significant speed advantage over POMDP planning. Figure 3 shows this on the same set of problems, with time shown on a log scale. Time for generating policies in Symbolic Perseus grows exponentially, while generation time for contingent plans is orders of magnitude less, especially when the domain size is large. For instance, Symbolic Perseus needs about 50 minutes to compute an optimal policy for RockSample(8,8), while FF only needs 0.09s and is still within 80% of the optimal policy.

Figure 3: Plan generation time in seconds as a function of domain size, for Symbolic Perseus, FF with execution monitoring, and FF without execution monitoring. Note that the y-axis is log scaled.

The RockSample domain we have used only has a single observation action, so the actions selected by the execution monitoring are rather uninteresting, mostly consisting of repeated tries at the action until the same observation is made twice in a row. However, we have also tested the approach on a variant where multiple observation actions are available with different performance in terms of reliability and cost. In this case we find that the actions selected by execution monitoring vary over the course of the contingency plan. For example, on early rocks, since there are lots of other untested rocks available, execution monitoring will only use a small number of actions, and prefers cheaper actions, before deciding that the rock isn't worth sampling and moving on to the next. When few untested rocks remain, the algorithm is willing to put a lot more effort into finding out whether they are good, since it may otherwise not find any rock to sample. This illustrates the value of taking into account the branches in the remainder of the plan when deciding what observations to make.

7 Conclusion and Future Work

We have presented an approach to solving quasi-deterministic POMDPs by converting them into a contingency planning problem and using execution monitoring to repair the plans at run-time. The monitoring approach differs from most other execution monitoring algorithms in that we are monitoring beliefs about state variables rather than whether the plan is still executable. The monitoring approach selects actions to make the belief state more certain, using a value of information-based heuristic. The approach is orders of magnitude faster than using a POMDP solver, and our initial experiments suggest that the plans found are not too far below optimal, and significantly better than without execution monitoring. We are currently working to apply the approach in a wider set of domains, as well as to compare it with other approaches, in particular that of (Goebelbecker, Gretton, and Dearden 2011).

Other future work we are planning includes adding some of the features of more traditional execution monitoring, such as consideration of exogenous events and action failures. We would also like to add more of the richness of recent classical planning domains, including durative actions, uncertainty about action costs and durations, etc. We believe that many important real-world domains can be represented as quasi-deterministic problems, and that cheap and fast solutions such as the one we have described here will be key to solving them on-board autonomous robots.

Acknowledgements

This research was partly supported by EU FP7 IST Project CogX FP7-IST-215181.

References
Besse, C., and Chaib-draa, B. 2009. Quasi-deterministic partially observable Markov decision processes. In Leung, C.-S.; Lee, M.; and Chan, J. H., eds., ICONIP (1), volume 5863 of Lecture Notes in Computer Science, 237–246. Springer.

Bonet, B. 2009. Deterministic POMDPs revisited. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, 59–66. Arlington, Virginia, United States: AUAI Press.

Boutilier, C. 2000. Approximately optimal monitoring of plan preconditions. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI-00), 54–62. Morgan Kaufmann.

Brenner, M., and Nebel, B. 2006. Continual planning and acting in dynamic multiagent environments. In Proceedings of the 2006 International Symposium on Practical Cognitive Agents and Robots, PCAR '06, 15–26. New York, NY, USA: ACM.

Bresina, J. L., and Morris, P. H. 2007. Mixed-initiative planning in space mission operations. AI Magazine 28(2):75–88.

Bresina, J.; Dearden, R.; Meuleau, N.; Ramkrishnan, S.; Smith, D.; and Washington, R. 2002. Planning under continuous time and resource uncertainty: A challenge for AI. In Proc. of UAI-02, 77–84. Morgan Kaufmann.

Cassandra, A. R.; Kaelbling, L. P.; and Littman, M. L. 1994. Acting optimally in partially observable stochastic domains. In AAAI'94: Proceedings of the Twelfth National Conference on Artificial Intelligence (vol. 2), 1023–1028. Menlo Park, CA, USA: American Association for Artificial Intelligence.

Fikes, R. E.; Hart, P. E.; and Nilsson, N. J. 1972. Learning and executing generalized robot plans. Artificial Intelligence 3:251–288.

Fritz, C. 2009. Monitoring the Generation and Execution of Optimal Plans. Ph.D. Dissertation, University of Toronto.

Giacomo, G. D.; Reiter, R.; and Soutchanski, M. 1998. Execution monitoring of high-level robot programs. In KR, 453–465.

Goebelbecker, M.; Gretton, C.; and Dearden, R. 2011. A switching planner for combined task and observation planning. In Proceedings of the 25th Conference on Artificial Intelligence (AAAI), to appear.

Hoffmann, J., and Nebel, B. 2001. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research 14:263–302.

Howard, R. 1966. Information value theory. IEEE Transactions on Systems Science and Cybernetics 2(1):22–26.

Krause, A., and Guestrin, C. 2007. Near-optimal observation selection using submodular functions. In National Conference on Artificial Intelligence (AAAI), Nectar track.

Pettersson, O. 2005. Execution monitoring in robotics: A survey. Robotics and Autonomous Systems 53(2):73–88.

Poupart, P. 2005. Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. Ph.D. Dissertation, Department of Computer Science, University of Toronto.

Pryor, L., and Collins, G. 1996. Planning for contingencies: A decision-based approach. Journal of Artificial Intelligence Research 4:287–339.

Smith, T., and Simmons, R. 2004. Heuristic search value iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI '04, 520–527. Arlington, Virginia, United States: AUAI Press.

Veloso, M. M.; Pollack, M. E.; and Cox, M. T. 1998. Rationale-based monitoring for planning in dynamic environments. In Proceedings of the Fourth International Conference on Artificial Intelligence Planning Systems, 171–179. AAAI Press.

Warren, D. H. D. 1976. Generating conditional plans and programs. In AISB (ECAI) '76, 344–354.

Wyatt, J. L.; Aydemir, A.; Brenner, M.; Hanheide, M.; Hawes, N.; Jensfelt, P.; Kristan, M.; Kruijff, G.-J. M.; Lison, P.; Pronobis, A.; Sjöö, K.; Skočaj, D.; Vrečko, A.; Zender, H.; and Zillich, M. 2010. Self-understanding and self-extension: A systems and representational approach. IEEE Transactions on Autonomous Mental Development 2(4):282–303.

Yoon, S. W.; Fern, A.; and Givan, R. 2007. FF-Replan: A baseline for probabilistic planning. In Proceedings of the Seventeenth International Conference on Automated Planning and Scheduling, 352–.

Younes, H. L. S., and Littman, M. 2004. PPDDL1.0: The language for the probabilistic part of IPC-4. In Proceedings of the International Planning Competition.