From: AAAI-97 Proceedings. Copyright © 1997, AAAI (www.aaai.org). All rights reserved.

Structured Solution Methods for Non-Markovian Decision Processes*

Fahiem Bacchus
Dept. of Computer Science
University of Waterloo
Waterloo, Ontario
Canada, N2L 3G1
fbacchus@logos.uwaterloo.ca

Craig Boutilier
Dept. of Computer Science
University of British Columbia
Vancouver, B.C.
Canada, V6T 1Z4
cebly@cs.ubc.ca

Adam Grove
NEC Research Institute
4 Independence Way
Princeton, NJ 08540, USA
grove@research.nj.nec.com

*The work of Fahiem Bacchus and Craig Boutilier was supported by the Canadian government through their NSERC and IRIS programs.
Abstract
Markov Decision Processes (MDPs), currently a popular
method for modeling and solving decision theoretic planning problems, are limited by the Markovian assumption: rewards and dynamics depend on the current state only, and
not on previous history. Non-Markovian decision processes
(NMDPs) can also be defined, but then the more tractable solution techniques developed for MDPs cannot be directly applied. In this paper, we show how an NMDP, in which temporal logic is used to specify history dependence, can be automatically converted into an equivalent MDP by adding appropriate temporal variables. The resulting MDP can be represented in a structured fashion and solved using structured
policy construction methods. In many cases, this offers significant computational advantages over previous proposals for
solving NMDPs.
1 Introduction
Markov decision processes (MDPs) have proven to be an effective modeling and computational paradigm for decision-theoretic planning (DTP). MDPs allow one to deal with planning problems that involve uncertainty, multiple objectives,
and nonterminating (process-oriented) behavior. A fundamental assumption of this model is that the reward function
and system dynamics of the underlying process are Markovian: all the information needed to determine the value of a
particular state or the effects of an action at that state must
be encoded within the state itself. This allows computationally effective dynamic programming techniques to be used to
solve decision problems [Put94].
Nevertheless, the Markovian requirement is often not met
by planning problems that are encoded in the “obvious” way.
For instance, it is often natural to specify desirable behaviors by referring to trajectory properties (properties of the sequence of states passed through, i.e., the system’s history)
in addition to just the current state. This has shown up in
work on planning [HH92, Dru89, Kab90, GK91] (e.g., in
the use of maintenance goals); and in [BBG96] we have argued that many reward functions for process-oriented problems are most appropriately viewed in this light (e.g., responses to requests, maintaining desirable system properties, etc.). For instance, rewarding an agent for achieving a
goal within k steps of a request being issued is a natural, yet
history-dependent, specification of desirable behavior. Similarly, process dynamics (action effects) are sometimes most
naturally expressed in a history-dependent fashion.
In [BBG96] we examined non-Markovian decision processes (NMDPs) and identified two key issues, namely, the specification of non-Markovian properties and the solution of NMDPs.² A temporal logic called PLTL was used as a mechanism for specifying the non-Markovian aspects of a system, and we will adopt the same approach here. However, we propose a very different way of solving the NMDPs so specified.

²In [BBG96] we considered non-Markovian reward functions only, although the approach could easily be extended to deal with history dependence in the system dynamics as well. In this paper we consider both rewards and dynamics explicitly.
As in the earlier paper, we develop a method for automatically converting an NMDP into an equivalent MDP, solutions
to which can be re-interpreted to yield solutions to the original NMDP. The key difference between the two papers is that
[BBG96] presents a state-based construction, while this paper works with structured representations of (N)MDPs. Each
approach has advantages and disadvantages.
In either case, one can think of each state in the original
NMDP as leading to multiple states in the resulting MDP, the
new states being distinguished by various relevant histories.
The difference, in essence, is how the new MDP is represented. The algorithm in [BBG96] takes an NMDP and a reward function specified using PLTL formulas, and produces
a new MDP whose states are listed explicitly. One advantage
of this approach (which will not be true of the proposal in the
current paper) is that it is possible to produce a minimal expanded MDP (i.e., one with the fewest states necessary to capture the required historical distinctions). However, as we now discuss, there are also important disadvantages inherent in state-based constructions.
To understand these disadvantages, note that many DTP
problems are described in terms of a set of variables or propositions (particularly in AI where logical representations are
common). The resultant MDPs have as states all possible assignments of values to these variables. In such cases, a major difficulty with any state-based algorithm for MDPs is the fact that the state space grows exponentially with the number of problem variables. A recent focus in DTP research has
been the development of MDP representation and solution
techniques that do not require an explicit enumeration of the
state space. For instance, the use of STRIPS [BD94, DF95,
KHW94] or Bayes nets [BDG95, BD96] to represent actions
in MDPs, and structured policy construction (SPC) methods
that exploit such representations [DF95, BDG95, BD96] to
avoid explicit state-based computations when solving MDPs,
promise to make MDPs more effective for such DTP problems.
For such problems, the NMDP conversion algorithm proposed in [BBG96] has some obvious drawbacks. First, being state-based, its complexity is exponential in the number
of problem variables, because even if the original NMDP has
a compact representation in terms of variables the final MDP
will not. Second, since the MDP output by that algorithm is
represented as a set of states, any structure present in the original (structured) NMDP will be lost or obscured, preventing
the use of SPC algorithms. Finally, as we will see, the history
that must be encoded itself exhibits significant structure, but
this is not exploited in [BBG96].
Our approach to NMDP conversion in this paper assumes
the existence of a compact variable-based description of the
given NMDP, and works by adding temporal variables to this
description. We thus expand the state space without regard to
minimality, but this expansion is implicit: we never enumerate all states. The conversion is thus very efficient, and
retains the structured and compact representation we started
with. The output, being a structured MDP, can be used directly by SPC algorithms. This last consideration has an interesting and fortunate consequence. Unlike the state-based
approach, our proposal here does not produce an MDP whose
(implicit) state space is minimal; but to compensate for this,
we can exploit the fact that SPC algorithms have some ability to automatically (and dynamically) detect the relevance of
particular variables at various points in policy construction.
As such, unlike [BBG96], we need not be directly concerned
about only adding the “relevant” history. To a significant extent, relevance can be detected during optimization by appropriate SPC algorithms.
This points to a related advantage of our technique, which
is that relevant history is dynamically determined in SPC algorithms. The algorithm of [BBG96] is based on a static
analysis of the NMDP, before optimal policy construction is
attempted. That algorithm must encode enough history at a
given state s to determine the reward at any future reachable
state t. This ensures that any policy, and not just the optimal
one, will have the same total reward in the constructed MDP
as in the original NMDP. But in fact we are generally only
interested in the optimal policy. As we construct this policy, we may notice that many a priori reachable states t (i.e.,
reachable with non-zero probability through some sequence
of actions) are, in fact, not reachable when the optimal policy
is adopted. Consequently, some of the history we added (in
order to determine the reward at state t) may end up being
“irrelevant”. Our algorithm, in conjunction with SPC algorithms used for optimization, implements a dynamic analysis of the NMDP. Once it is determined that t is not reachable
during policy construction, history relevant only to t (and related states) will be ignored. Though the state space is larger,
only relevant distinctions are, for the most part, considered.
We conjecture that such “dynamic irrelevance” is especially likely to arise in circumstances involving temporally
dependent rewards. For example, a temporally-extended reward function may associate a reward with delivering an item
within 5 time stages of an order being placed. Under certain
conditions (e.g., inclement weather) the risk associated with
any attempt at delivery may be too great for such an undertaking. During policy construction, this will be recognized
and the history relevant to determining whether this goal can
be achieved (i.e., keeping track of when in the past an order
was placed) can be ignored in such states. The “irrelevance”
of the required history cannot be detected a priori.
A final contrast with [BBG96] is the relative simplicity of
our proposal in the current paper, which is based on well-known properties of temporal logic and the observation that
these can be made to integrate well with SPC. Although the
technical aspects of our contribution are straightforward, it
has considerable potential for improving our ability to solve
history-dependent decision problems.
We begin with a brief description of MDPs, NMDPs and
the temporal logic used in [BBG96] to specify trajectory
properties. We also give an overview of the SPC algorithm
of [BDG95]. Then we present our technique of adding temporal variables to convert an NMDP to an MDP. We close the
paper with some further observations.
2 Markov Decision Processes

A fully observable Markov Decision Process [How60, Put94] can be characterized by a finite set of states S, a set of actions A, and a reward function R. The actions are characterized by probability distributions, and we write Pr(s₁, a, s₂) = p to denote that s₂ is reached with probability p when action a is performed in state s₁. Full observability entails that the agent always knows what state it is in. A real-valued reward function R reflects the objectives, tasks and goals to be accomplished by the agent, with R(s) denoting the (immediate) utility of being in state s. Thus, an MDP consists of S, A, R and the set of transition distributions {Pr(s, a, ·) | a ∈ A, s ∈ S}.
A stationary Markovian policy is a mapping π : S → A, where π(s) denotes the action an agent should perform whenever it is in state s. We adopt expected total discounted reward over an infinite horizon as our optimality criterion: the current value of future rewards is discounted by some factor β (0 < β < 1), and we maximize the expected accumulated discounted rewards over an infinite time period. The expected value of a fixed policy π at any state s can be shown to satisfy [How60]:

$$V_\pi(s) = R(s) + \beta \sum_{t \in S} \Pr(s, \pi(s), t) \cdot V_\pi(t)$$

The value of π at any state s can be computed by solving this system of linear equations. A policy π is optimal if V_π(s) ≥ V_π′(s) for all s ∈ S and policies π′. V_π is called the value
function. We refer to [Put94] for an excellent treatment of MDPs and associated computational methods. See [DW91, BD94, BDG95] on the use of MDPs for DTP.
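To make the evaluation equation concrete, the following Python sketch (illustrative only, not part of the formalism above) computes the value of a fixed policy on a small explicit MDP by solving the linear system V_π = R + βP_πV_π; the function name policy_value and the toy numbers are assumptions made purely for the example.

```python
# Illustrative sketch (not from the paper): evaluating a fixed policy on a
# small explicit MDP by solving V = R + beta * P_pi V as a linear system.
import numpy as np

def policy_value(R, P, policy, beta):
    """R: reward vector (n,); P: dict action -> (n, n) transition matrix;
    policy: list of action labels, one per state; beta: discount factor."""
    n = len(R)
    # Transition matrix induced by the policy: row s is Pr(s, policy[s], .)
    P_pi = np.array([P[policy[s]][s] for s in range(n)])
    # Solve (I - beta * P_pi) V = R
    return np.linalg.solve(np.eye(n) - beta * P_pi, R)

# Tiny two-state example with actions 'a' and 'b'.
R = np.array([0.0, 1.0])
P = {'a': np.array([[0.9, 0.1], [0.0, 1.0]]),
     'b': np.array([[0.2, 0.8], [0.5, 0.5]])}
print(policy_value(R, P, ['b', 'a'], beta=0.9))
```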
2.2 Structured Representations
For DTP problems, we are often interested in MDPs that can
be represented concisely in terms of a set of features or variables. We adapt the Bayes net/decision tree representation
used by [DK89, BDG95], representing an MDP with a set of
variables P, a reward decision tree Tree_R, the set of actions A, and a set of action decision trees {Tree(a, p) | a ∈ A, p ∈ P}.
P is a set of n variables that implicitly specifies the state
space S. For simplicity, we assume that all variables take two
values, {⊤ (true), ⊥ (false)}; consequently we can identify each variable p ∈ P as a boolean proposition. The state space S is then the set of all possible truth assignments to these variables (i.e., S is the product space of the variable domains). The number of states is exponential in the number of variables (thus, in our case, 2^n).
The rest of the representation relies on the use of decision
trees whose internal nodes are tests on individual variables in
P. The leaves of these trees are labeled with different types
of values, depending on the type of tree. Any of these trees,
say Tree, can be applied to a state s to yield a value Tree[s].
This value is computed by traversing the tree using the truth
assignments specified by s to determine the direction taken
at internal nodes. The value returned is the label of the leaf
node reached. That is, Tree[s] is the label of the leaf associated with the (unique) branch of the tree whose variable labels are consistent with the values in s.
Tree_R is a decision tree that specifies the (immediate) reward of each state in S. In particular, the leaves of Tree_R are labeled with real values, and the reward R(s) assigned to any state s is Tree_R[s]. The set of action trees determines transition probabilities. These trees have their leaves labeled with real numbers in [0, 1]. Tree(a, p)[s] specifies the probability that p will be true in the next state given that action a was executed in state s. Thus for each variable p′ ∈ P we can determine the probability of it being true in the next state from Tree(a, p′)[s]. These events are assumed to be independent, so we can obtain the entire distribution over the successor states, Pr(s, a, ·), by simply multiplying these probabilities.³

³In particular, we have
$$\Pr(s, a, s') = \prod_{p\,:\,s' \models p} \mathit{Tree}(a,p)[s] \;\cdot \prod_{p\,:\,s' \models \neg p} \bigl(1 - \mathit{Tree}(a,p)[s]\bigr).$$
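As a concrete illustration of this representation (the tuple encoding and function names below are our own choices, not anything prescribed above), the following sketch evaluates a decision tree at a state and assembles Pr(s, a, ·) by multiplying the per-variable probabilities under the independence assumption.

```python
# Illustrative sketch (our own encoding, not the paper's): decision trees as
# nested tuples (var, true_subtree, false_subtree) with values at the leaves.
from itertools import product

def tree_value(tree, state):
    """Evaluate Tree[s]: descend using the truth values in `state` (a dict)."""
    while isinstance(tree, tuple):
        var, if_true, if_false = tree
        tree = if_true if state[var] else if_false
    return tree

def successor_distribution(action_trees, state, variables):
    """Pr(s, a, .) under the independence assumption: for each next state s',
    multiply Tree(a,p)[s] if s' makes p true, else 1 - Tree(a,p)[s]."""
    probs = {p: tree_value(action_trees[p], state) for p in variables}
    dist = {}
    for values in product([True, False], repeat=len(variables)):
        s_next = dict(zip(variables, values))
        pr = 1.0
        for p in variables:
            pr *= probs[p] if s_next[p] else (1.0 - probs[p])
        dist[tuple(values)] = pr
    return dist

# Reward tree: reward 10 if p and q both hold, else 0.
reward_tree = ('p', ('q', 10.0, 0.0), 0.0)
# Action 'a': makes p true with probability 0.9 if q holds, 0.3 otherwise; leaves q unchanged.
trees_a = {'p': ('q', 0.9, 0.3), 'q': ('q', 1.0, 0.0)}
s = {'p': False, 'q': True}
print(tree_value(reward_tree, s))                  # 0.0
print(successor_distribution(trees_a, s, ['p', 'q']))
```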
Although [BDG95] describes their representation in terms
of Bayes nets, it is nevertheless equivalent to ours, including
the independence assumption. One advantage of the Bayes
net representation is that it suggests an easy way of relaxing this assumption. (This possibility was not explored in
[BDG95], but see [Bou97] for details.) Thus, SPC algorithms can be modified to deal with dependence between
present variables, although they become somewhat more
complex. Such modifications are entirely compatible with
our proposals here.
2.3 Structured Policy Construction
The basic idea behind the SPC algorithms described in
[BDG95, BD96] is the use of tree-structured representations
during policy construction. In particular, a policy π can be represented as a decision tree Tree_π whose leaves are labeled by actions. That is, Tree_π[s] specifies the action to execute in state s. Similarly, a value function V can be represented as a tree with real-valued labels.
The SPC algorithm is based on the observation that some
of the traditional algorithms for constructing optimal policies
can be reformulated as tree-manipulation algorithms. In particular, at all points in the algorithm the current value function and current policy are represented as trees. Intuitively,
these algorithms dynamically detect the relevance of particular variables, under specific conditions, to the current value
function or policy. The particular tree-manipulation steps
necessary are not trivial, but the details are not directly relevant here (they are presented in [BDG95]).
We have already alluded to some of SPC’s properties. If
the dynamics and reward function of the MDP are simple, the
optimal policy very often has a simple structure as well (see
[BDG95] for examples). SPC will find this policy in just as
many iterations as modified policy iteration (a popular and
time-efficient algorithm), but often without ever considering
“large” trees-even though it is (implicitly) optimizing over
exponentially many states. Variables that are not relevant, or
are only relevant in some contexts, will not necessarily have
an adverse effect on complexity; in
many cases, they just do
not appear as decision nodes in the trees when they are not
needed. We note that SPC offers another advantage, namely
its amenability to approximation; see [BD96].
2.4 Non-Markovian Decision Processes
Following [BBG96], we use a temporal logic called PLTL (Past Linear Temporal Logic) to specify the history dependence of rewards and action effects. PLTL is a past version of LTL [Eme90]. We assume an underlying finite set of propositional constants P, the usual truth functional connectives, and the following temporal operators: S (since), □ (always in the past), ◇ (once, or sometime in the past) and ⊖ (previously).⁴ The formulas φ₁ S φ₂, □φ₁, ◇φ₁ and ⊖φ₁ are well-formed when φ₁ and φ₂ are.⁵ The semantics of PLTL is described with respect to models of the form T = (s₀, …, sₙ), n ≥ 0, where each sᵢ is a state or truth assignment over the variables in P. For any trajectory T = (s₀, …, sₙ), and any 0 ≤ i ≤ n, let T(i) denote the initial segment T(i) = (s₀, …, sᵢ).

A temporal formula is true of T = (s₀, …, sₙ) if it is true at the last (or current) state with respect to the history reflected in the trajectory. We define the truth of formulas inductively as follows:
• T ⊨ p iff sₙ ⊨ p, for p ∈ P
• T ⊨ ¬φ iff T ⊭ φ
• T ⊨ φ₁ S φ₂ iff there is some i ≤ n s.t. T(i) ⊨ φ₂ and for all i < j ≤ n, T(j) ⊨ φ₁ (that is, φ₂ was true sometime in the past and φ₁ has been true since then)
• T ⊨ □φ iff for all 0 ≤ i ≤ n, T(i) ⊨ φ (φ has been true at each point in the past)
• T ⊨ ◇φ iff for some 0 ≤ i ≤ n, T(i) ⊨ φ (φ was true at some point in the past)
• T ⊨ ⊖φ iff n > 0 and T(n−1) ⊨ φ (φ was true at the previous state)

⁴These are the backward analogs of the LTL operators until, always, eventually and next, respectively.

⁵We use the abbreviation ⊖^k for k iterations of the ⊖ modality (e.g., ⊖^3φ ≡ ⊖⊖⊖φ), and ⊖^{≤k} to stand for the disjunction of ⊖^i for 1 ≤ i ≤ k (e.g., ⊖^{≤2}φ ≡ ⊖φ ∨ ⊖⊖φ).
If, for example, we wanted to reward behaviors that
achieve a condition G immediately after it is requested by
a command C, we could specify a reward function that rewards states satisfying the PLTL formula G ∧ ⊖C. Similarly, rewarding states satisfying G ∧ ⊖^{≤k}C would reward behaviors that achieve G within k steps of its request. [BBG96]
gives further examples of how the logic can be used to specify useful historical dependencies. Using a logic for making such specifications provides all of the usual advantages
gained from logical representations. In particular, we have a
compositional language that can be used to express arbitrarily
complex specifications, and at the same time we have a precise
semantics for our specification no matter how complex.
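The truth conditions above translate directly into a recursive check over a finite trajectory. The sketch below uses an ad hoc tuple encoding of PLTL formulas (our own, purely illustrative) and includes the ⊖^{≤k} abbreviation from footnote 5; it is meant only to make the semantics concrete.

```python
# Illustrative sketch of the PLTL truth conditions over a finite trajectory
# T = (s_0, ..., s_n), given as a list of dicts. The tuple encoding of
# formulas and the helper prev_within are our own, not notation from the paper.
def holds(T, f):
    n = len(T) - 1
    op = f[0]
    if op == 'atom':
        return T[n][f[1]]
    if op == 'not':
        return not holds(T, f[1])
    if op == 'and':
        return holds(T, f[1]) and holds(T, f[2])
    if op == 'or':
        return holds(T, f[1]) or holds(T, f[2])
    if op == 'prev':    # previously: n > 0 and T(n-1) satisfies f[1]
        return n > 0 and holds(T[:n], f[1])
    if op == 'once':    # sometime in the past (including now)
        return any(holds(T[:i + 1], f[1]) for i in range(n + 1))
    if op == 'always':  # at every point in the past (including now)
        return all(holds(T[:i + 1], f[1]) for i in range(n + 1))
    if op == 'since':   # f[2] held at some i <= n, f[1] held at all j with i < j <= n
        return any(holds(T[:i + 1], f[2]) and
                   all(holds(T[:j + 1], f[1]) for j in range(i + 1, n + 1))
                   for i in range(n + 1))
    raise ValueError(op)

def prev_within(k, f):
    """The abbreviation from footnote 5: disjunction of i-fold 'prev' for 1 <= i <= k."""
    def prev_i(i, g):
        for _ in range(i):
            g = ('prev', g)
        return g
    disj = prev_i(1, f)
    for i in range(2, k + 1):
        disj = ('or', disj, prev_i(i, f))
    return disj

# Reward condition from the text: G achieved within k = 3 steps of a command C.
reward_formula = ('and', ('atom', 'G'), prev_within(3, ('atom', 'C')))
traj = [{'G': False, 'C': True}, {'G': False, 'C': False}, {'G': True, 'C': False}]
print(holds(traj, reward_formula))   # True: C was issued two steps ago and G holds now
```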
Our proposal for using PLTL formulas within the tree-structured representational framework of Section 2.2 is
straightforward. Recall that we represented the reward function and dynamics using decision trees, whose tests were on
the value of individual variables. We simply extend this by
allowing an internal node in the decision tree to also test the
value of an arbitrary PLTL formula. In this way, both rewards and dynamics can depend on history, and we can represent any NMDP, whose history dependence is specified using
PLTL, as a structured NMDP.
3 Temporal Variables

Given some structured NMDP N, as described above, we want to find an equivalent⁶ structured MDP M, so that the SPC algorithm can be used to find optimal policies. To do this, we introduce temporal variables into the domain that refer to relevant properties of the current trajectory. Each temporal variable is used to track the truth of some purely temporal formula of PLTL (i.e., a formula whose main connective is temporal). Temporal variables are thus boolean.

⁶Intuitively, equivalence implies that an optimal policy for the constructed MDP can be re-interpreted as an optimal policy for the original NMDP. See [BBG96] for a formal definition.

3.1 Conditions on Temporal Variable Sets

In the new MDP M, the state must contain sufficient information to determine rewards and the dynamics; unlike N, it will not be able to refer to history explicitly. Consider any temporal formula φ appearing as a decision node in one of the decision trees of N. It follows that any state in M must contain enough information to decide if φ is true. For this reason, we must at least add temporal variables to the state description that provide sufficient history so that this information can be determined. Let T be the added set of temporal variables. Each of these variables is associated with a formula of PLTL, and we will often treat these variables as if they were formulas.

Definition 3.1 A set of temporal variables T is sufficient for φ iff there is some boolean function b_φ such that φ ≡ b_φ(T, P).

In other words, T is sufficient for φ if the truth of φ is determined by some subset of the elements of T, together with knowledge of certain state variables.

Since any formula in PLTL has the form b(t_1, …, t_k) for some boolean function b and purely temporal formulas t_1, …, t_k, finding a sufficient set of temporal formulas is trivial: we can simply strip the main boolean connectives from a temporal formula φ until only purely temporal formulas remain, then make each of these a temporal variable. However, this set of temporal variables turns out not to be suitable, because sufficiency is not the only requirement for the set of variables T needed to build a suitable MDP. Since the temporal variables we choose will be incorporated into the new MDP M, we must be able to determine the dynamics of these new variables. That is, just as we have trees Tree(a, p) specifying the dynamics of the state variables p ∈ P, we must be able to construct trees Tree(a, t) that tell us how to update the temporal variables t ∈ T. Furthermore, these dynamics must be Markovian.

Definition 3.2 A set of temporal variables T is dynamically closed iff, for each t ∈ T, there is some boolean function b_t such that t ≡ ⊖b_t(T, P).

Given dynamic closure, the truth of any variable in T at a given state can be determined from the values of variables in T and P at the previous state. Hence, dynamic closure ensures that we can update the temporal variables while still satisfying the Markov assumption.

Let T be dynamically closed, and consider b_t as defined above. Like any boolean formula, b_t can be evaluated using some decision tree T_t; the internal nodes of T_t test the values of variables in T ∪ P and the leaves are labeled ⊤ or ⊥.⁷ The dynamics of any temporal variable t are now easy to define: we can set Tree(a, t), for all actions a, to be T_t′, where T_t′ is the tree exactly like T_t except that where T_t evaluates to ⊤ (resp., ⊥), T_t′ evaluates to 1 (resp., 0); that is, the truth values are translated into probabilities. We note that the dynamics of the temporal variables are independent of the action executed. Furthermore, the determinism of Tree(a, t) ensures that our independence assumptions on action effects (Section 2.2) remain valid.

⁷In general, there are many trees that can be used to evaluate the formula b_t; an arbitrary tree can be chosen. However, some may be more compact (hence, lead to greater efficiency) than others.

We have shown that, if we augment the set of variables of N with a dynamically closed set of temporal variables T and define the dynamics as just discussed, then the resulting
NMDP is equivalent to the original with the exception of the
added variables, and the dynamics of the temporal variables
are Markovian. However, an even more important feature
of the construction is the following: because of the way we
have defined the dynamics, in any valid trajectory all temporal variables in T will faithfully reflect their intended meaning. More precisely, consider any trajectory through the state
space of the new MDP. Since each t ∈ T is one of the variables defining the state space, then at each point t will have some value (⊤ or ⊥). But t is also a logical formula,
and so we can also ask whether (formula) t is true or false
of the trajectory we have followed. Our construction ensures
that (variable) t is true at any state iff the corresponding formula is true of the trajectory followed (assuming all variables
are given correct values at time 0, the initial state). In other
words, a variable’s value coincides with its truth.
Suppose T is both dynamically closed, and sufficient for every temporal formula φ that appears as a decision node either in Tree_R or in any Tree(a, p) in N; for concreteness, imagine that φ appears in Tree_R. Sufficiency implies we can replace the test of φ by a test on the values of T ∪ P, using b_φ as defined in Definition 3.1. In fact, we can find some decision tree that evaluates b_φ, and substitute this tree for the internal nodes in N that test φ. Doing this for all such φ, the result is a different tree for evaluating rewards (or in the case of Tree(a, p), dynamics) that is semantically equivalent to the original, and tests only variables in T ∪ P. Thus, by using the enlarged set of variables, we have removed all tests that depend explicitly on history: this construction has delivered an MDP M, equivalent to N, as required. Of course, the key to the correctness of this construction is the faithfulness property.
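A small toy example may help illustrate the faithfulness property (the example and code are ours, not from the construction above). Take the single temporal variable t = ⊖◇p, which is dynamically closed since ⊖◇p ≡ ⊖(p ∨ ⊖◇p); its deterministic update reads only the previous values of p and t, and the assertion below checks that the variable's value tracks the formula's truth along a trajectory, given correct initialization at time 0.

```python
# Illustrative sketch (our example, not the paper's): the temporal variable
# t = ⊖◇p ("p held at some strictly earlier point") is dynamically closed,
# since ⊖◇p ≡ ⊖(p ∨ ⊖◇p); its update reads only previous values of p and t.
def update_t(prev_state, prev_t):
    # b_t(T, P) evaluated at the previous stage: p ∨ t
    return prev_state['p'] or prev_t

trajectory = [{'p': False}, {'p': True}, {'p': False}, {'p': False}]

t = False  # at time 0 the strict past is empty, so ⊖◇p is false
for i, state in enumerate(trajectory):
    # Faithfulness: the variable's value should equal the formula's truth,
    # i.e., whether p held at some point strictly before stage i.
    assert t == any(s['p'] for s in trajectory[:i])
    print(f"stage {i}: t = {t}")
    t = update_t(state, t)   # value of t carried into the next stage
```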
3.2 Construction of Temporal Variable Sets
It remains to show that a suitable set T exists; in other words, that for any set of PLTL formulas Φ, we can construct a dynamically closed set of temporal variables that is sufficient for all φ ∈ Φ.

To begin, we define a subformula of a PLTL formula φ to be any well-formed PLTL formula contained as a substring of φ. For example, if ψ = p ∧ □(q ∨ ⊖¬p) then the subformulas of ψ are ψ itself, p, □(q ∨ ⊖¬p), q ∨ ⊖¬p, q, ⊖¬p, and ¬p. A first proposal for T might be to consider the set of all purely temporal subformulas of the φ ∈ Φ, denoted PTSub(Φ). For ψ above these are the subformulas □(q ∨ ⊖¬p) and ⊖¬p. This ensures sufficiency simply because any φ ∈ Φ is a boolean combination of propositional symbols and its purely temporal subformulas. (In fact, we would not need all subformulas: for ψ above, only □(q ∨ ⊖¬p) would be needed if sufficiency were all we cared about.)
The problem with this is that it does not satisfy the dynamic closure property. For instance, the truth of □(q ∨ ⊖¬p) depends not only on what happened in the past, but also on whether q ∨ ⊖¬p is true now. (Recall that the semantics of □ are, informally, "... has been true always in the past, up to and including the present.") In this case, the truth of ⊖¬p in the present must be known. Our representation, however, requires that a variable's value can be determined by the previous values of other variables.⁸

⁸Again we note that, in principle, dependence on present variables can be used; but we retain the original formulation for reasons discussed in Section 2.2.
Instead, we will construct a set of temporal variables that have ⊖ as their main connective, because the truth of such a formula depends only on the past. In fact, as we now show, we can define T by prepending ⊖ to each subformula in PTSub(Φ) except those that already begin with ⊖ (these are retained without change). That is:

$$T = \{\,\ominus\psi \mid \psi \in \mathrm{PTSub}(\Phi),\ \psi \text{ not of the form } \ominus\xi\,\} \;\cup\; \{\,\ominus\psi \mid \ominus\psi \in \mathrm{PTSub}(\Phi)\,\}$$
To show sufficiency, we use the following equivalences of PLTL that allow us to insert preceding ⊖ operators before any other modal operator:

1. α S β ≡ β ∨ (α ∧ ⊖(α S β)).
2. □α ≡ α ∧ ⊖(□α).
3. ◇α ≡ α ∨ ⊖(◇α).
Consider any φ ∈ Φ; φ is, of course, a boolean combination of primitive propositions in P and purely temporal subformulas of PLTL. If any of the temporal subformulas used in this expression do not begin with ⊖, we can replace them by the equivalent expressions according to the above. This replacement process may need to be repeated several times, but eventually it will terminate, giving us a boolean combination over P ∪ T that is equivalent to φ, as required. Note that the same construction allows us to represent any subformula of φ as a boolean combination over P ∪ T.
Dynamic closure is shown very simply. Let ⊖ψ ∈ T. By definition ψ must be a subformula of some φ ∈ Φ and so, as we have just seen, is equivalent to some boolean combination b_ψ over P ∪ T. Thus ⊖ψ ≡ ⊖b_ψ, as required for dynamic closure.
As an illustration, consider φ = ◇(p S (q ∨ ⊖r)). From this, we form the temporal variable set

$$T = \{\; t_1 = \ominus\Diamond(p \,S\, (q \vee \ominus r)),\;\; t_2 = \ominus(p \,S\, (q \vee \ominus r)),\;\; t_3 = \ominus r \;\}.$$

Then φ can be decomposed as follows:

$$\Diamond(p \,S\, (q \vee \ominus r)) \;\equiv\; (p \,S\, (q \vee \ominus r)) \vee \ominus\Diamond(p \,S\, (q \vee \ominus r)), \quad\text{i.e., } (p \,S\, (q \vee \ominus r)) \vee t_1$$
$$\equiv\; ((q \vee \ominus r) \vee (p \wedge \ominus(p \,S\, (q \vee \ominus r)))) \vee t_1, \quad\text{i.e., } ((q \vee t_3) \vee (p \wedge t_2)) \vee t_1.$$
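The same decomposition can be run mechanically: the sketch below (our own code, with illustrative helper names) maintains t_1, t_2 and t_3 by evaluating their b_t expressions at the previous stage, recovers φ from the current state variables plus the temporal variables, and compares the result against a direct evaluation of the PLTL semantics on a sample trajectory.

```python
# Illustrative sketch of the worked example: maintaining t1, t2, t3 and
# evaluating φ = ◇(p S (q ∨ ⊖r)) from current state variables plus the
# temporal variables, as in the decomposition above. The helper names are ours.
def phi_from_vars(s, t1, t2, t3):
    # φ ≡ ((q ∨ t3) ∨ (p ∧ t2)) ∨ t1
    return ((s['q'] or t3) or (s['p'] and t2)) or t1

def update_vars(s, t1, t2, t3):
    # Each variable begins with ⊖, so its new value is b_t evaluated at the
    # (now previous) stage: t1 tracks ⊖◇(pS(q∨⊖r)), t2 tracks ⊖(pS(q∨⊖r)), t3 tracks ⊖r.
    since_now = (s['q'] or t3) or (s['p'] and t2)   # p S (q ∨ ⊖r) at this stage
    return since_now or t1, since_now, s['r']

def phi_direct(states):
    # Direct check of ◇(p S (q ∨ ⊖r)) against the PLTL semantics, for comparison.
    def since_at(i):
        return any((states[m]['q'] or (m > 0 and states[m - 1]['r']))
                   and all(states[j]['p'] for j in range(m + 1, i + 1))
                   for m in range(i + 1))
    return any(since_at(i) for i in range(len(states)))

traj = [{'p': False, 'q': False, 'r': True},
        {'p': False, 'q': False, 'r': False},
        {'p': True,  'q': False, 'r': False}]

t1 = t2 = t3 = False      # the strict past is empty at time 0
for i, s in enumerate(traj):
    assert phi_from_vars(s, t1, t2, t3) == phi_direct(traj[:i + 1])
    t1, t2, t3 = update_vars(s, t1, t2, t3)
print("temporal-variable evaluation matches the PLTL semantics")
```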
4 Concluding Observations
We have shown how, given a structured NMDP consisting of reward and action representations involving PLTL formulas, one can construct a set of temporal variables that is dynamically closed and sufficient for those formulas, and use these to specify an equivalent MDP in compact form. We conclude
with some observations about our proposal and some directions for future work.
We first note that the new MDP can be constructed efficiently. Let Φ be the collection of PLTL formulas appearing in the NMDP specification. It is easy to see that the number of temporal variables added is at most equal to the number of purely temporal subformulas contained in Φ, which is bounded by the total number of temporal operators in Φ. Thus, we are adding only a modest number of new variables. The time required to do so is O(T·|Φ|), where T is the total size of the reward and action trees and |Φ| is the sum of the lengths of the formulas in Φ. Of course, one must keep
in mind that the implicit state space grows exponentially in
the number of added variables. But as we discussed in Sections 1 and 2.3, the size of the implicit state space is not necessarily the key factor in the complexity of SPC algorithms.
We do not claim that our construction adds the minimal
number of extra variables to achieve its purpose. In fact, it
is easy to see that minimal state space size is sometimes not
achievable by adding variables at all. The reason is that the
required history can vary from state to state (indeed, this fact
motivates much of [BBG96]). Our approach cannot make
such fine distinctions: every state has the same set of variables “added” to it. On the other hand, the “dynamic irrelevance” feature of SPC may compensate for this. That is,
although the temporal variables can potentially
cause us to
make unnecessary distinctions between histories at certain
states (as compared to [BBG96]), SPC attempts to focus only
on the variables whose values influence the choices made by
the optimal policy, and so will avoid some of these irrelevant
distinctions. Furthermore, SPC can avoid “dynamic irrelevances” that cannot be detected by the approach of [BBG96],
and is much more amenable to approximation in the solution of the resulting MDP. Empirical studies are, of course,
needed to quantify this tradeoff. We suspect that there will be
a range of domains where the SPC approach we are suggesting here will be superior to the state-space based approach of
[BBG96] and vice versa. The interesting question will be to
attempt to characterize domain features that tend to favor one
approach over the other.
Not surprisingly, NMDPs (even in the structured framework we are considering) can be much harder to solve than a
comparably sized MDP. As a simple example, suppose one
gets a reward each time q ∧ ⊖^n p is true. Suppose also that
there is an action a that, if taken, is likely to lead to q becoming true at some (unpredictable) time within the next n steps.
The optimal policy could well need to keep track of exactly
when p was true among the previous n steps (because it has
to know whether and when to try to achieve q). There may
be no sub-exponential (in n) decision tree representation of
such a policy. Hence, SPC is not guaranteed to be efficient.
Finally, in this paper, we have omitted any discussion of
the various optimizations that are possible. For instance, one
might gain by carefully choosing the decision tree used to
evaluate b_φ and b_t (Section 3). These issues deserve careful study. More important for future work, however, is the
need for empirical studies to explore the actual performance
that might be seen in practical cases.
References
[BBG96] Fahiem Bacchus, Craig Boutilier, and Adam Grove. Rewarding behaviors. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1160-1167, Portland, OR, 1996.
[BD94] Craig Boutilier and Richard Dearden. Using abstractions for decision-theoretic planning with time constraints. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 1016-1022, Seattle, 1994.
[BD96] Craig Boutilier and Richard Dearden. Approximating value trees in structured dynamic programming. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 54-62, Bari, Italy, 1996.
[BDG95] Craig Boutilier, Richard Dearden, and Moisés Goldszmidt. Exploiting structure in policy construction. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1104-1111, Montreal, 1995.
[Bou97] Craig Boutilier. Correlated action effects in decision-theoretic regression. (manuscript), 1997.
[DF95] Thomas G. Dietterich and Nicholas S. Flann. Explanation-based learning and reinforcement learning: A unified approach. In Proceedings of the Twelfth International Conference on Machine Learning, pages 176-184, Lake Tahoe, 1995.
[DK89] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142-150, 1989.
[Dru89] M. Drummond. Situated control rules. In Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning, pages 103-113, Toronto, 1989.
[DW91] Thomas Dean and Michael Wellman. Planning and Control. Morgan Kaufmann, San Mateo, 1991.
[Eme90] E. A. Emerson. Temporal and modal logic. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, Volume B, chapter 16, pages 997-1072. MIT, 1990.
[GK91] P. Godefroid and F. Kabanza. An efficient reactive planner for synthesizing reactive plans. In Proceedings of the Ninth National Conference on Artificial Intelligence, pages 640-645, 1991.
[HH92] Peter Haddawy and Steve Hanks. Representations for decision-theoretic planning: Utility functions for deadline goals. In Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning, pages 71-82, Cambridge, 1992.
[How60] Ronald A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, 1960.
[Kab90] F. Kabanza. Synthesis of reactive plans for multi-path environments. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 164-169, 1990.
[KHW94] Nicholas Kushmerick, Steve Hanks, and Daniel Weld. An algorithm for probabilistic least-commitment planning. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 1073-1078, Seattle, 1994.
[Put94] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, 1994.