Cost Sensitive Reachability Heuristics for Handling State Uncertainty
Daniel Bryce & Subbarao Kambhampati
Department of Computer Science and Engineering, Arizona State University, Brickyard Suite 501, 699 South Mill Avenue, Tempe, AZ 85281
{dan.bryce, rao}@asu.edu

Abstract
While POMDPs provide a general platform for non-deterministic conditional planning under a variety of quality metrics, they have limited scalability. On the other hand, non-deterministic conditional planners scale very well, but many lack the ability to optimize plan quality metrics. We present a novel generalization of planning graph based heuristics that helps conditional planners both scale and generate high quality plans when using actions with non-uniform costs. We make empirical comparisons with two state of the art planners to show the benefit of our techniques.

1 Introduction
When agents have uncertainty about their state, they need to formulate conditional plans, which attempt to resolve state uncertainty with sensing actions. This problem has received attention in both the uncertainty in AI (UAI) and automated planning communities. From the UAI perspective, finding such conditional plans is a special case of finding policies for Markov Decision Processes (MDPs) in the fully observable case, and Partially Observable MDPs (POMDPs) in the partially observable case. The latter is of more practical use, although much harder computationally [Madani et al., 1999; Littman et al., 1998]. The emphasis in this community has been on finding optimal policies under fairly general conditions, but the scalability of the approaches has been very limited. In the planning community, conditional planning has been modelled as search in the space of uniform probability belief states (i.e. every belief state is a set of equally possible states and the space is finite). Several planners have been developed, e.g. MBP [Bertoli et al., 2001] and PKSPlan [Petrick and Bacchus, 2002], which model conditional plan construction as an and/or search. These approaches are more scalable partly because their complexity is only 2-EXP-complete [Rintanen, 2004], as against POMDPs which are in general undecidable [Madani et al., 1999]. However, planning approaches are often insensitive to cost/quality information. Indeed, in the presence of actions with differing costs, planners such as MBP can generate plans of arbitrarily low quality, attempting to insert sensing actions without taking their cost into consideration. We focus our attention on finding strong plans (i.e. plans that succeed with probability 1) given an uncertain initial state (with uniform probability over possible states). Sensing actions give partial observations, causative actions have deterministic conditional effects, all actions have associated costs, and the model uses a factored representation. In this paper, we describe a way of extending state of the art non-deterministic conditional planners to make them more sensitive to cost/quality information. Our idea is to adapt the type of cost-sensitive reachability heuristics that have proven to be useful in classical and temporal planning [Do and Kambhampati, 2003]. Straightforward adaptation unfortunately proves to be infeasible: in the presence of state uncertainty, we are forced to generate multiple planning graphs (one for each possible state) and reason about reachability across all those graphs [Bryce and Kambhampati, 2004].
This can get prohibitively expensive, especially for forward search, where we need to do this analysis at each search node. The main contribution of this paper is a way to resolve this dilemma. In particular, we propose a novel way of generating reachability information with respect to belief states without computing multiple graphs. Our approach, called the labelled uncertainty graph (LUG), symbolically represents multiple planning graphs, one for each state in our belief, within a single planning graph. Loosely speaking, this single graph unions the support information present in the explicit multiple graphs and pushes the disjunction, describing sets of possible worlds (states in a belief), into "labels" (ℓ). The planning graph is built using labels, for sets of worlds, to annotate the vertices (literals and actions). A label on a vertex signifies the states of our belief that can reach the vertex.

To take cost into account, we describe a method for propagating cost information over the LUG in an extension called the CLUG. The (previously mentioned) labels tell us when graph vertices (e.g. literals) are reachable, but they do not indicate the associated reachability cost. We could track a single cost for the entire set of worlds represented by a label, but this would lose information about differing costs for subsets of the worlds. Tracking a cost for each subset of worlds is also problematic because the subsets are exponential in the number of worlds. Even tracking the cost of individual worlds can be costly because their number is exponential in the number of fluents (state variables). Instead, we track cost over a fixed partition of world sets. The size of the partition (the number of costs tracked) is bounded by the number of planning graph levels: each disjoint set contains the worlds in which a literal or action is newly reached at a level. The CLUG is used as the basis for doing reachability analysis. In particular, we extract relaxed plans from it (as described in [Bryce et al., 2004]), using the cost information to select low cost relaxed plans. Our results show that cost-sensitive heuristics improve plan quality and scalability.¹

We proceed by describing our representation and our planner, called POND. We then introduce our planning graph generalizations, the LUG and the CLUG, and describe the relaxed plan extraction procedure. We present an empirical study of the techniques within our planner and compare with two state of the art conditional planners, MBP [Bertoli et al., 2001] and GPT [Bonet and Geffner, 2000]. We end by providing a comparison to related work, a conclusion, and directions for future work, with emphasis on non-uniform uncertainty.

2 Representation & Search
The planning formulation in our planner POND uses progression search to find strong plans, under the assumption of partial observability. A strong plan guarantees that after a finite number of actions executed from any of the possible initial states, all resulting states will satisfy the goals. We represent strong plans as directed acyclic graphs (where a node with out-degree greater than one is a sensory action). We assume that every plan path is equally likely, so our plan quality metric is the mean of the path costs. The cost of a plan path is the sum of the costs of its edges (which correspond to outcomes of actions). We will use the following as a motivating, as well as running, example to illustrate our techniques:

Example 1.
A patient goes to the doctor complaining of feeling unrested (¬r), but he is unsure if he is actually sick (s ∨ ¬s). The doctor has two treatment plans: 1) give the patient drug B to cure the sickness if he is sick, and have him rest, R, for a week to become rested, or 2) do a blood test, S, to determine if he is sick; if so, he takes drug C with no need to rest, otherwise he rests for a week. Both treatments will ensure that he is not sick and rested (¬s ∧ r). The patient may have one of two insurance providers (cost models). We show a transition diagram (Figure 1) with annotations on edges for the two cost models. The optimal plan for the first model is the first plan, at cost 10+7 = 17, compared to the second at cost ((9+7)+(9+20))/2 = 22.5. The optimal plan for the second model is the second plan, with cost ((12+7)+(12+10))/2 = 20.5, because the first has cost 15+7 = 22.

[Figure 1: The example's AO* graph with two cost models. Edge annotations give the costs under each model: B (Drug B): {10,15}, C (Drug C): {20,10}, R (Rest): {7,7}, S (Blood Test): {9,12}. The belief states are ¬r, s∧¬r, ¬s∧¬r, and ¬s∧r, where r denotes rested and s denotes sick.]

¹A solution for a larger test instance contained nearly 200 belief states among 13 plan paths, of lengths between 18 and 30 actions.

POND searches in the space of belief states, a technique first described by Bonet and Geffner [2000]. The planning problem P is defined as the tuple ⟨D, BS_I, BS_G⟩, where D is a domain, BS_I is the initial belief state, and BS_G is the goal belief state. The domain D is a tuple ⟨F, A⟩, where F is a set of all fluents and A is a set of actions.

Belief State Representation: A state S is a complete interpretation over fluents. A belief state BS is a set of states, symbolically represented as a propositional formula over F, and is also referred to as a set of possible worlds. A state S is in the set of states represented by a belief state BS if S is a model of BS (S ∈ M(BS)). In this work we assume the goal belief state is a conjunctive formula to simplify the later presentation.

Action Representation: We represent actions as having strictly causative or observational effects, respectively termed causative or sensory actions. An action a consists of an execution precondition ρ_e(a), a set of effects Φ(a), and a cost c(a). The execution precondition, ρ_e(a), is a conjunctive formula that must hold to execute the action. Causative actions have a set of deterministic conditional effects Φ(a) = {ϕ^0(a), ..., ϕ^m(a)}, where each conditional effect ϕ^j(a) is of the form ρ^j(a) ⟹ ε^j(a), and the antecedent and consequent are conjunctions. Sensory actions have a set Φ(a) = {o^0(a), ..., o^n(a)} of observational effect formulas. Each observational effect formula, o^i(a), defines an outcome of the sensor. The actions in our example are:
B: ρ_e(B) = ⊤, Φ(B) = {s ⟹ ¬s}, c(B) = {10, 15}
C: ρ_e(C) = s, Φ(C) = {⊤ ⟹ ¬s ∧ r}, c(C) = {20, 10}
R: ρ_e(R) = ¬s, Φ(R) = {⊤ ⟹ r}, c(R) = {7, 7}
S: ρ_e(S) = ⊤, Φ(S) = {s, ¬s}, c(S) = {9, 12}
We list two numbers in the cost of each action because our example uses the first number for cost model one, and the second for cost model two.

POND Search: We use top down AO* search [Nilsson, 1980] in the POND planner to generate conditional plans. In the search graph, the nodes are belief states and the hyper-edges are actions. We need AO* because using a sensing action in essence partitions the current belief state; we use a hyper-edge to represent the collection of outcomes of an action. Sensory actions have several outcomes, all of which must be included in a solution.
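To make this representation concrete, here is a minimal Python sketch of Example 1's causative actions under cost model one. POND itself represents belief states and labels as BDDs; the explicit-set encoding, the `Action` class, and all names below are illustrative assumptions rather than the planner's actual data structures.

```python
from dataclasses import dataclass

# A state is a complete interpretation over fluents; we encode a state as
# the frozenset of its true fluents, so frozenset() means the state ¬s ∧ ¬r.
State = frozenset

def lit(fluent, positive=True):
    return (fluent, positive)          # a literal, e.g. lit('s', False) is ¬s

def holds(literal, state):
    fluent, positive = literal
    return (fluent in state) == positive

@dataclass
class Action:
    name: str
    precond: frozenset                 # conjunction of literals (rho_e)
    effects: tuple                     # conditional effects (antecedent, consequent)
    cost: float                        # c(a) under one cost model

# Example 1 under cost model one (B=10, C=20, R=7).
B = Action('B', frozenset(), ((frozenset({lit('s')}), frozenset({lit('s', False)})),), 10)
C = Action('C', frozenset({lit('s')}), ((frozenset(), frozenset({lit('s', False), lit('r')})),), 20)
R = Action('R', frozenset({lit('s', False)}), ((frozenset(), frozenset({lit('r')})),), 7)

BS_I = {State(), State({'s'})}                  # the two possible worlds: ¬s∧¬r and s∧¬r
BS_G = frozenset({lit('s', False), lit('r')})   # the goal ¬s ∧ r
```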
The AO* search consists of two repeated steps: expand the current partial solution, and then revise the current partial solution. Search ends when every leaf node of the current solution is a belief state that satisfies the goal belief and no better solution exists (given our heuristic function). Expansion involves following the current solution to an unexpanded leaf node and generating its children. Revision is essentially a dynamic programming update at each node in the current solution that selects a best hyper-edge (action). The update assigns the action with minimum cost to start the best solution rooted at the given node. The cost of a node is the cost of its best action plus the average cost of its children (the nodes connected through the hyper-edge). When expanding a leaf node, the children of all applied actions are given a heuristic value to indicate their estimated cost.

3 Labelled Uncertainty Graph (LUG)
To guide search, we use a relaxation of conditional planning to obtain a lookahead estimate of the conditional plan's suffix, rooted at each search node. The relaxation measures the cost to support the goal when ignoring mutexes between actions and ignoring sensory actions. We reason about the cost of sensing in a local manner through the search itself,² but do not reason about sensing in a lookahead fashion. Our heuristic reasons about the conformant transition cost between two sets of states, a belief state cost measure [Bryce and Kambhampati, 2004]. We review our previous work that uses multiple planning graphs to calculate belief state distances, and then discuss our generalization, called the LUG, which performs the same task at a lower cost.

Classical planning graph based relaxed plans tend not to capture the information needed for belief state to belief state distance measures because they assume perfect state information. In [Bryce and Kambhampati, 2004] we studied the use of classical planning graphs for belief state distance measures, but found that using multiple planning graphs is more effective for estimating belief state distances. The approach constructs several classical planning graphs, each with respect to a state in our current belief state. Then a classical relaxed plan is extracted from each graph. We transform the resulting set of relaxed plans into a unioned relaxed plan, where each layer is the union over the vertices in the same level of the individual relaxed plans. The number of action vertices in the unioned relaxed plan is used as the heuristic estimate (we sketch this computation below). The heuristic measures both the positive interaction and independence in action sequences that are needed to individually transition each state in our belief state to a state in the goal belief state.

²That is, we can reason about the cost of applying a sensing action at the current search node by adding the cost of the action to the average cost of its children (whose costs are determined by the heuristic).

The obvious downfall of the multiple graph approach is that the number of planning graphs and relaxed plans is exponential in the size of belief states. Among the multiple planning graphs there is quite a bit of repeated structure, and computing a heuristic on each can take a lot of time. With the LUG, our intent is twofold: (i) we would like to obtain the same heuristic as with multiple graphs, but lower the representation and heuristic extraction overhead, and (ii) we also wish to extend the relaxed plan heuristic measure to reflect non-uniform action costs.
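As a concrete illustration of the unioned relaxed plan value described above, the following sketch unions the per-state relaxed plans level by level and counts action vertices. It assumes each relaxed plan is given as a list of per-level sets of action names; this is a simplification of the multiple-graph heuristic, not POND's implementation.

```python
def unioned_relaxed_plan_value(relaxed_plans):
    """Union the action layers of per-state relaxed plans level by level
    and return the number of action vertices in the unioned relaxed plan.
    relaxed_plans: one relaxed plan per state of the source belief, each a
    list of sets of action names (one set per level)."""
    depth = max(len(rp) for rp in relaxed_plans)
    value = 0
    for k in range(depth):
        level_union = set()
        for rp in relaxed_plans:
            if k < len(rp):
                level_union |= rp[k]
        value += len(level_union)   # actions shared by several worlds count once
    return value

# E.g., per-world plans [{'B'}, {'R'}] and [{'R'}] union to
# [{'B', 'R'}, {'R'}], giving a heuristic value of 3.
```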
3.1 LUG & CLUG
We present the LUG and its extension to handle costs, the CLUG. The LUG is a single planning graph that uses an annotation on vertices (actions and literals) to reflect assumptions about how a vertex is reached. Specifically, we use a label, ℓ_k(·), to denote the models of our current (source) belief BS_s that reach the vertex in level k. In the CLUG we additionally use a cost vector, c_k(·), to estimate the cost of reaching the vertex from different models of the source belief. These annotations help us implicitly represent the vertices common to several of the multiple planning graphs in a single planning graph.

Figure 2 illustrates the CLUG built for the initial belief in our example. The initial layer literal labels are used to label the actions and effects they support, which in turn label the literals they support. Label propagation is based on the intuition that (i) actions and effects are applicable in the possible worlds in which their conditions are reachable, and (ii) a literal is reachable in all possible worlds where it is affected.

Definition 1 (LUG). A LUG is a levelled graph, where a level k contains three layers: the literal layer L_k, action layer A_k, and effect layer E_k. The LUG is constructed with respect to the actions in A and a source belief state BS_s. Each LUG vertex v_k(·) in level k is a pair ⟨·, ℓ_k(·)⟩, where the "·" is an action a, effect ϕ^j(a), or literal l, and ℓ_k(·) is its label.

Definition 2 (CLUG). A CLUG extends a LUG by associating a triple ⟨·, ℓ_k(·), c_k(·)⟩ with each vertex, where c_k(·) is a cost vector.

Definition 3 (Label). A label ℓ_k(·) is a propositional formula that describes a set of possible worlds. Every model of a label is also a model of the source belief, implying ℓ_k(·) |= BS_s. For any model S_s ∈ M(BS_s), if S_s ∈ M(ℓ_k(·)), then the classical relaxed planning graph built from S_s contains "·" as a vertex in level k.

[Figure 2: A LUG for our example problem; each literal, action, and effect has its cost vector listed. World labels: 1 = ¬s∧¬r, 2 = s∧¬r. For instance, ¬s at level one carries the cost vector {⟨{1},0⟩, ⟨{2},min(B,C)⟩}, and r at level two carries {⟨{1,2}, min(C+R, min(B,C)+R)⟩}.]

Definition 4 (Extended Label). An extended label ℓ*_k(f) for a propositional formula f is defined as the formula that results from substituting the label ℓ_k(l) of each literal l for the literal in f:
ℓ*_k(f ∧ f′) = ℓ*_k(f) ∧ ℓ*_k(f′),
ℓ*_k(f ∨ f′) = ℓ*_k(f) ∨ ℓ*_k(f′),
ℓ*_k(¬(f ∧ f′)) = ℓ*_k(¬f ∨ ¬f′),
ℓ*_k(¬(f ∨ f′)) = ℓ*_k(¬f ∧ ¬f′),
ℓ*_k(⊤) = BS_s, ℓ*_k(⊥) = ⊥, ℓ*_k(l) = ℓ_k(l)

Labels and Reachability: A literal l is (optimistically) reachable from a set of states, described by BS_s, after k steps if BS_s |= ℓ_k(l). A propositional formula f is reachable from BS_s after k steps if BS_s |= ℓ*_k(f).

Definition 5 (Cost Vectors). A cost vector c_k(·) is a set of pairs ⟨f^i(·), c^i(·)⟩, where f^i(·) is a propositional formula over F and c^i(·) is a rational number. Every c^i(·) is an estimate of the cost of reaching the vertex from all models S_s ∈ M(f^i(·)).

Cost propagation on planning graphs, similar to that used in the Sapa planner [Do and Kambhampati, 2003], computes the estimated cost of reaching literals at time points.
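The following sketch shows how extended labels (Definition 4) can be evaluated, with labels simplified to explicit sets of world indices instead of the BDD-represented formulas POND uses; the NNF tuple encoding of formulas is an assumption for illustration.

```python
# Worlds of the example's source belief: 1 = ¬s∧¬r, 2 = s∧¬r.
BS_s = frozenset({1, 2})

def extended_label(f, label, k):
    """Evaluate l*_k(f) for a formula f in negation normal form.

    f is 'true', 'false', a literal, or a tuple ('and', f1, f2) /
    ('or', f1, f2); label(literal, k) returns l_k(literal) as the set of
    worlds reaching that literal at level k (the empty set stands for
    the false label).
    """
    if f == 'true':
        return BS_s                       # l*_k(true) = BS_s
    if f == 'false':
        return frozenset()                # l*_k(false) = false
    if isinstance(f, tuple) and f[0] == 'and':
        return extended_label(f[1], label, k) & extended_label(f[2], label, k)
    if isinstance(f, tuple) and f[0] == 'or':
        return extended_label(f[1], label, k) | extended_label(f[2], label, k)
    return label(f, k)                    # base case: l*_k(l) = l_k(l)

# A formula f is reachable from BS_s after k steps iff BS_s <= l*_k(f);
# e.g., with l_1(¬s) and l_1(r) both covering worlds {1, 2}, the goal
# ¬s ∧ r is reachable at level one.
```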
Since we track whether a literal is reached in more than one possible world, it is possible that the cost of reaching a literal is different for every subset of these worlds. Instead of tracking costs for an exponential number of subsets, or even each individual world, we partition the models of BS_s into fixed sets to track cost over (i.e. the elements of the cost vectors c_k(·)). A cost vector c_k(·) is a partition of the worlds represented by the label ℓ_k(·) that assigns a cost to each of the disjoint sets. As we will show, the partitions are different for each vertex because we partition with respect to the new worlds that reach the given action, effect, or literal in each level. Our reason for defining the partitions this way is that the size of the partition is bounded by the number of CLUG levels.

The LUG and CLUG construction requires first defining the initial literal layer, and then an inductive step to construct each graph level. For each layer of the LUG and CLUG, we compute the label ℓ_k(·) of each vertex. In the CLUG, we additionally update the cost vector of the vertex. In the following we combine definitions for the LUG and CLUG layers, but it is easy to see that we obtain the former by omitting cost vectors and the latter by computing them.

Initial Literal Layer: The initial layer of the LUG is defined as:
L_0 = {v_0(l) | ℓ_0(l) ≠ ⊥},
where each label is defined as ℓ_0(l) = l ∧ BS_s, and each cost vector is defined as c_0(l) = {⟨ℓ_0(l), 0⟩}.
The LUG has an initial layer, L_0, where the label ℓ_0(l) of each literal l represents the states of BS_s in which l holds. In the cost vector, we store a cost of zero for the entire group of worlds in which each literal is initially reachable (i.e. ⟨ℓ_0(l), 0⟩).

We illustrate the case where BS_s = BS_I from the example. In Figure 2 we graphically represent the LUG and index the models of BS_s as worlds {1,2}. We show the cost vector c_k(·) for each vertex. Note, we show worlds as indexed models, but implement them using a BDD [Bryant, 1986] representation of propositional formulas. In the figure we do not explicitly show the propositional labels of the elements, but do in the text. The labels for the initial literal layer are:
ℓ_0(s) = s ∧ ¬r, ℓ_0(¬s) = ¬s ∧ ¬r, ℓ_0(¬r) = ¬r
As shown in Figure 2, the literals in the zeroth literal layer have cost zero in their initial worlds.

Action Layer: The kth action layer of the LUG is defined as:
A_k = {v_k(a) | ℓ_k(a) ≠ ⊥},
where each label is defined as ℓ_k(a) = ℓ*_k(ρ_e(a)), each cost vector is defined as c_k(a) = {⟨f^i(a), c^i(a)⟩ | f^i(a) ≠ ⊥}, each cost vector partition is defined as f^i(a) = ℓ_{k′}(a) ∧ ¬ℓ_{k′−1}(a), for k′ ≤ k, and each partition cost is computed as:
c^i(a) = Σ_{l ∈ ρ_e(a)} Cover(f^i(a), c_k(l))
Based on the previous literal layer L_k, the action layer A_k contains all non-⊥ labelled causative actions from the action set A, plus all literal persistences. Persistence for a literal l, denoted by l_p, is represented as an action where ρ_e(l_p) = ε^0(l_p) = l. The label of the action at level k is equivalent to the extended label of its execution precondition. We partition the cost vector based on the worlds that newly support the vertex in each level: if there are new worlds supporting a at level k, we add a formula-cost pair to the cost vector with the formula equal to ℓ_k(a) ∧ ¬ℓ_{k−1}(a). When k = 0 we take ℓ_{−1}(a) = ⊥. We then update the cost for each element of the cost vector. We find c^i(a) by summing the costs of the execution precondition literals in the worlds described by f^i(a).
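The partitioning step just described can be sketched as follows: when a vertex's label grows at level k, the worlds newly reaching it form one more partition element, keeping the vector's size bounded by the number of levels. World sets again stand in for BDD labels, and the function name is our own.

```python
def update_cost_vector(cost_vector, label_k, label_prev, new_cost):
    """Add the partition element for worlds that newly reach a vertex.

    cost_vector: list of (world_set, cost) pairs partitioning the old
    label; label_k, label_prev: l_k(.) and l_{k-1}(.) as world sets;
    new_cost: estimated cost of reaching the vertex from the new worlds
    (e.g., the summed Cover costs of an action's precondition literals).
    """
    new_worlds = label_k - label_prev    # f^i = l_k ∧ ¬l_{k-1}
    if new_worlds:
        cost_vector.append((frozenset(new_worlds), new_cost))
    return cost_vector
```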
The cost of each literal is determined by covering the worlds f^i(a) with the cost vector of the literal. In general, cost vectors do not have a specific formula-cost pair for the exact set of worlds we care about; rather, those worlds are partitioned over several formula-cost pairs. To get a cost for the set of worlds we care about, we compute a cover with the disjoint world sets in the cost vector. We try to find a minimum cost cover because planning graphs typically represent an optimistic projection of reachability.

Cover(f, c): A Cover of a formula f with a set of formula-cost pairs c = {⟨f^1(·), c^1(·)⟩, ..., ⟨f^n(·), c^n(·)⟩} is equivalent to a weighted set cover problem [Cormen et al., 1990], where the set of models of f must be covered with the weighted sets of models defined by the formula-cost pairs in c. A set of formula-cost pairs c′ ⊆ c covers f with cost Σ_{i: ⟨f^i(·),c^i(·)⟩ ∈ c′} c^i(·) when f |= ∨_{i: ⟨f^i(·),c^i(·)⟩ ∈ c′} f^i(·). Finding a minimum cover is an NP-complete problem, following from set cover. We solve it using a greedy algorithm that at each step chooses the least cost formula-cost pair that covers a new world of our set of worlds. Fortunately, in the action and effect layers the Cover operation is done with (non-overlapping) partitions, meaning there is only one possible cover. This is not the case in the literal layer construction and relaxed plan extraction, because there the cover is over a set of possibly overlapping sets. We show an example of using Cover after the literal layer definition.

The zeroth action layer has the following labels:
ℓ_0(B) = ℓ_0(¬r_p) = ¬r, ℓ_0(C) = ℓ_0(s_p) = s ∧ ¬r, ℓ_0(R) = ℓ_0(¬s_p) = ¬s ∧ ¬r
The action B is reachable in both worlds at a cost of zero because it has no execution precondition, whereas C has a cost of zero in world two because its execution precondition holds in world two at a cost of zero.

Effect Layer: The kth effect layer of the LUG is defined as:
E_k = {v_k(ϕ^j(a)) | ℓ_k(ϕ^j(a)) ≠ ⊥},
where each label is defined as ℓ_k(ϕ^j(a)) = ℓ*_k(ρ^j(a)) ∧ ℓ_k(a), each cost vector is defined as c_k(ϕ^j(a)) = {⟨f^i(·), c^i(·)⟩ | f^i(·) ≠ ⊥}, each cost vector partition is defined as f^i(·) = ℓ_{k′}(ϕ^j(a)) ∧ ¬ℓ_{k′−1}(ϕ^j(a)), for k′ ≤ k, and each partition cost is computed as:
c^i(·) = c(a) + Cover(f^i(·), c_k(a)) + Σ_{l ∈ ρ^j(a)} Cover(f^i(·), c_k(l))
An effect ϕ^j(a) is included in E_k when it is reachable in some world of BS_s, i.e. ℓ_k(ϕ^j(a)) ≠ ⊥, which only happens when both the associated action and the antecedent are reachable in at least one world together. The cost c^i(·) of world set f^i(·) of an effect at level k is found by adding the execution cost of the associated action, the support cost of the action in the worlds of f^i(·), and the support cost of the antecedent in f^i(·) (found by summing over the cost of each literal of ρ^j(a) in f^i(·)).

The zeroth effect layer for our example has the labels:
ℓ_0(ϕ^0(B)) = ℓ_0(ϕ^0(C)) = ℓ_0(ϕ^0(s_p)) = s ∧ ¬r, ℓ_0(ϕ^0(R)) = ℓ_0(ϕ^0(¬s_p)) = ¬s ∧ ¬r, ℓ_0(ϕ^0(¬r_p)) = ¬r
The effect of action B has the cost of B in world two, even though B could be executed in both worlds, because the effect is only enabled in world 2 by its antecedent s. Likewise, the effects of C and R have the cost of executing C and R respectively. While not shown, the persistence effects have cost zero in the worlds of the previous level.
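Below is a sketch of the greedy Cover operation described above, over the explicit world-set encoding of the earlier sketches. At each step it takes the least cost pair that covers at least one uncovered world; on the disjoint partitions of action and effect cost vectors this reproduces the unique cover.

```python
def cover(worlds, cost_vector):
    """Greedily cover `worlds` (a set of world indices) with the
    (world_set, cost) pairs of `cost_vector`; return the summed cost,
    or None if some world is unreachable (no pair covers it)."""
    uncovered = set(worlds)
    total = 0
    while uncovered:
        usable = [(c, ws) for ws, c in cost_vector if ws & uncovered]
        if not usable:
            return None                  # the worlds cannot all be covered
        c, ws = min(usable, key=lambda pair: pair[0])
        total += c                       # cheapest pair adding a new world
        uncovered -= ws
    return total

# E.g., cover({1, 2}, [(frozenset({1}), 0), (frozenset({2}), 10)]) == 10,
# matching the level-one cost of ¬s in cost model one.
```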
Literal Layer: The kth literal layer of the LUG is defined as:
L_k = {v_k(l) | ℓ_k(l) ≠ ⊥},
where the label of each literal is defined as ℓ_k(l) = ∨_{ϕ^j(a): l ∈ ε^j(a), v_{k−1}(ϕ^j(a)) ∈ E_{k−1}} ℓ_{k−1}(ϕ^j(a)), each cost vector is defined as c_k(l) = {⟨f^i(·), c^i(·)⟩ | f^i(·) ≠ ⊥}, each cost vector partition is defined as f^i(·) = ℓ_{k′}(l) ∧ ¬ℓ_{k′−1}(l), for k′ ≤ k, and each partition cost is computed as:
c^i(·) = Cover(f^i(·), ∪_{ϕ^j(a): l ∈ ε^j(a), v_{k−1}(ϕ^j(a)) ∈ E_{k−1}} c_{k−1}(ϕ^j(a)))
The literal layer, L_k, contains all literals with non-⊥ labels. The label of a literal, ℓ_k(l), depends on E_{k−1} and is the disjunction of the labels of each effect that causes the literal. The cost c^i(·) in a set of worlds f^i(·) for a literal at level k is found by covering the worlds f^i(·) with the union of all formula-cost pairs of effects that support the literal.

The first literal layer for our example has the labels:
ℓ_1(s) = s ∧ ¬r, ℓ_1(¬s) = ℓ_1(r) = ℓ_1(¬r) = ¬r
In our example, we want to update the formula-cost pairs of ¬s at level one. There are three supporters of ¬s: the persistence of ¬s in world 1, the effect of action B in world 2, and the effect of action C in world 2. The formula-cost pairs for ¬s at level 1 are for {1} and {2}. We group the worlds this way because ¬s was originally reachable in world 1, but is newly supported in world 2. For the formula-cost pair with world 1 we use the persistence in the cover. For the formula-cost pair with world 2, the supporters are B and C, and we choose one for the cover. In Figure 2 we assign a cost of min(B, C) because we discuss two different cost models. We must also assign a cost to the formula-cost pair for r in worlds 1 and 2. The cover for r in these worlds must use both the effect of C and the effect of R, because each covers only one world, hence its cost is C + R.

Level off: The graph levels off when L_k = L_{k−1} and, in the CLUG, c_k(l) = c_{k−1}(l) for every literal l. In our example we level off (terminate construction) at level two with the LUG and at level three with the CLUG; we show up to level two because level three is identical. We can say that the goal is reachable after one step from the initial belief because BS_I = ¬r |= ℓ*_1(BS_G) = ¬r.

3.2 Relaxed Plans
The relaxed plan heuristic we extract from the LUG and the CLUG is similar to the multiple graph relaxed plan heuristic [Bryce and Kambhampati, 2004]. As previously described, the multiple graph heuristic uses a planning graph for every possible world of our source belief state to extract a relaxed plan to achieve the goal belief state. The LUG and CLUG relaxed plan heuristics are similar in accounting for positive interaction and independence across source states in achieving the goals. The advantage is that we find the relaxed plans by using only one planning graph to extract a single, albeit more complicated, relaxed plan.

In a relaxed plan we find a line of causal support for the goal from every state in BS_s. Since many possible worlds use some of the same vertices to support the goal, we label relaxed plan vertices with the worlds that use them. There may be several paths used to support a subgoal in the same worlds because no single path is used by all worlds. For example, notice that in Figure 2 it takes both C and R to support r in both worlds because each action supports only one world. One challenge in extracting the relaxed plan is in tracking which worlds use which paths to support subgoals.
Another challenge is in extracting cost-sensitive relaxed plans, for which the propagated cost vectors help. The multiple graph, LUG, and CLUG relaxed plans are inadmissible (i.e. they will not guarantee optimal plans with AO* search). Admissible heuristics are lower bounds that enable search to find optimal solutions, but in practice most are very ineffective. In the next section we demonstrate that although our heuristics are inadmissible, they guide our planner toward high quality solutions. We describe relaxed plan construction by first defining relaxed plans for the LUG and CLUG (pointing out differences), then how the last literal layer is built, followed by the inductive step to construct a level.

Definition 6 (Relaxed Plans). A relaxed plan extracted from the LUG or CLUG for BS_s is defined with respect to the goal belief state BS_G. The relaxed plan is a subgraph with b levels (see below), where each relaxed plan level k has three layers: the literal layer L_k^RP, action layer A_k^RP, and effect layer E_k^RP. Each vertex v_k^RP(·) in the relaxed plan is a pair ⟨·, ℓ_k^RP(·)⟩. For the LUG, the level b is the earliest level where BS_s |= ℓ*_b(BS_G); for the CLUG,
b = argmin_k Σ_{S_d ∈ M(BS_G)} Σ_{l ∈ S_d} Cover(ℓ_k(l), c_k(l)),
meaning every model of BS_s is able to reach a model of BS_G and the cost of reaching BS_G is minimal.

Last Relaxed Plan Literal Layer: The final literal layer L_b^RP of the relaxed plan contains all literals that are in models of the destination belief BS_G. The final literal layer is a subset of the vertices in L_b. Each literal l has a label equivalent to its label at level b, i.e. ℓ_b^RP(l) = ℓ_b(l).

Relaxed Plan Effect Layer: The kth effect layer E_k^RP contains all the effects needed to support the literals in L_{k+1}^RP. The label ℓ_k^RP(ϕ^j(a)) of an effect is the disjunction of all worlds where the effect is used to support a literal. The literals in L_{k+1}^RP are supported by E_k^RP when, for every v_{k+1}^RP(l) ∈ L_{k+1}^RP:
ℓ_{k+1}^RP(l) |= ∨_{ϕ^j(a): v_k^RP(ϕ^j(a)) ∈ E_k^RP, l ∈ ε^j(a)} ℓ_k^RP(ϕ^j(a))
The above formula states that each vertex in the literal layer must have effects chosen for the supporting effect layer such that for all worlds where the literal must be supported, there is an effect that gives support. We construct the effect layer by using a greedy minimum cover operation for each literal to pick the effects that support the worlds where the literal needs support. In the LUG, we use a technique that does not rely on cost vectors and at each step chooses the effect that covers the literal in the most new worlds; the intuition is that we will include fewer effects (and actions) if each supports more worlds. In the CLUG, we use a technique that at each step chooses an effect that can contribute support in new worlds at the lowest cost. We insert the chosen effects into the effect layer and label them to indicate the worlds where they were used for support.

Relaxed Plan Action Layer: The kth action layer A_k^RP contains all actions whose effects were used in E_k^RP. The associated label ℓ_k^RP(a) for each action a is the disjunction of the labels of each of its effects that are elements of E_k^RP.

Relaxed Plan Literal Layer: The kth literal layer L_k^RP contains all literals that appear in the execution preconditions of actions in A_k^RP, or in the antecedents of effects in E_k^RP. The associated label ℓ_k^RP(l) for each literal l is the disjunction of the labels of each action in A_k^RP or effect in E_k^RP in whose execution precondition or antecedent the literal appears. We support literals with effects, insert actions, and insert literals until we have supported all literals in L_1^RP.
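To illustrate the greedy effect selection for a relaxed plan effect layer, here is a sketch covering one literal's needed worlds; the cost-guided branch corresponds to the CLUG extraction and the most-new-worlds branch to the LUG extraction. The function and its inputs are illustrative assumptions, continuing the world-set encoding of the earlier sketches.

```python
def choose_supporters(needed_worlds, supporters, use_cost=True):
    """Pick effects to support one literal in all of `needed_worlds`.

    supporters: dict mapping an effect name to (world_set, cost), the
    worlds where the effect can support the literal and its propagated
    cost. With use_cost=True (CLUG) we repeatedly take the cheapest
    effect contributing support in new worlds; with use_cost=False (LUG)
    we take the effect covering the most new worlds. Returns a dict
    mapping each chosen effect to the worlds it supports (its RP label).
    """
    chosen = {}
    uncovered = set(needed_worlds)
    while uncovered:
        def new_worlds(e):
            return supporters[e][0] & uncovered
        usable = [e for e in supporters if new_worlds(e)]
        if not usable:
            raise ValueError('literal is unsupportable in some needed world')
        if use_cost:
            best = min(usable, key=lambda e: supporters[e][1])
        else:
            best = max(usable, key=lambda e: len(new_worlds(e)))
        chosen[best] = new_worlds(best)   # label the effect with these worlds
        uncovered -= supporters[best][0]
    return chosen

# E.g., r at level two needs worlds {1, 2}: C's effect supports {2} and
# R's effect supports {1}, so both are chosen, mirroring the C + R cover.
```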
Once we have a relaxed plan, the relaxed plan heuristic is the sum of the selected action costs. In Figure 3 we show three relaxed plans to support BS_G from BS_I: the first two are for the two cost models we presented, using the CLUG; the third is for both cost models, using the LUG.

[Figure 3: Illustration of CLUG and LUG relaxed plans for two cost models. The CLUG relaxed plan for cost model 1 uses B and R, with h = B + R = 17; the CLUG relaxed plan for cost model 2 uses C and R, with h = C + R = 17; the LUG relaxed plan uses B and R for both models, with value 17 under cost model 1 and 27 under cost model 2.]

All relaxed plans need to support the goal literals in worlds 1 and 2 (the worlds of BS_I). We find that BS_G is reachable at level one at a cost of min(B, C) + C + R, and at level two at a cost of min(B, C) + min(C + R, min(B, C) + R). In the first cost model, level one costs 37 and level two costs 27, so we extract starting at level 2; with the second cost model, level one costs 27 and level two costs 27, so we extract at level 1 as there is no drop in cost at level 2. Using the LUG we choose level 1 because it is the first level where the goals are reachable.

To extract a relaxed plan in the first cost model from the CLUG, we support ¬s in both worlds with a persistence rather than B, because the persistence covers both worlds with a propagated cost of 10, as opposed to 20 with B. Likewise, r is supported with R at a propagated cost of 10, as opposed to 27 for the persistence. Next, we support ¬s at level one in world 2 with B because it is cheaper than C, and in world 1 with the only choice, persistence. The relaxed plan has value 17 because it chose B and R. We leave the second cost model as an exercise. In the LUG relaxed plan we could extract B and R for either cost scenario, because R covers r in the most worlds and B is chosen for supporting ¬s. The LUG relaxed plan extraction is not sensitive to cost, but the relaxed plan value reflects action cost.

4 Empirical Comparisons
Our main intent is to evaluate the effectiveness of the LUG and the CLUG in improving the quality of plans generated by POND. Additionally, we also compare with two state of the art planners, GPT [Bonet and Geffner, 2000] and MBP [Bertoli et al., 2001]. Even though MBP does not plan with costs, we show the cost of MBP's plans for each problem's cost model. GPT uses heuristics based on relaxing the problem to full observability (whereas our relaxation is to no observability while ignoring action mutexes), and MBP uses a belief state's size as its heuristic merit.

Our test setup involves two domains: Medical-Specialist and Rovers. Each problem had a timeout of 20 minutes and a memory limit of 1GB on a 2.8GHz P4 Linux machine. We provide our planner and domain encodings at http://rakaposhi.eas.asu.edu/belief-search/. POND is implemented in C and uses several existing technologies: it employs AO* search code from Eric Hansen, planning graph construction code from Joerg Hoffmann, and the CUDD BDD package from Fabio Somenzi for representing belief states, actions, and labels.

Medical-Specialist: We developed an extension of the medical domain [Weld et al., 1998] where, in addition to staining, counting of white blood cells, and medicating, one can go to a specialist for medication and there is no chance of dying – effectively allowing conformant (non-sensing) plans.
We assigned costs as follows: c(stain) = 5, c(count white cells) = 10, c(inspect stain) = X, c(analyze white cell count) = X, c(medicate) = 5, and c(specialist medicate) = 10. We generated ten problems, each with the respective number of diseases (1-10), in two sets where X = {15, 25}. Plans in this domain must treat a patient by either performing some combination of staining, counting white cells, and sensing actions to diagnose the exact disease and apply the proper medicate action, or using the specialist medicate action without knowing the exact disease. Plans can use hybrid strategies, using specialist medicate for some diseases and diagnosis plus medicate for others; the strategy depends on cost and the number of diseases.

Our results in the first two columns of Figures 4, 5, and 6 show the average plan path cost, the number of plan nodes (belief states) in the solution, and the total time for the two cost models; the x-axis reflects different problem instances. Extracting relaxed plans from the CLUG instead of the LUG enables POND to be more cost-sensitive. The plans returned by the CLUG method tend to have fewer nodes and a lower average path cost than the LUG. The LUG heuristic does not measure sensing cost, but as sensing cost changes, the search is able to locally gauge the cost of sensing and adapt. Since MBP is insensitive to cost, its plans are proportionately costlier as the sensor cost increases. GPT returns better plans, but tends to take significantly more time as the cost of sensing increases; this may be attributed to how the heuristic is computed by relaxing the problem to full observability. Our heuristics measure the cost of co-achieving the goal from a set of states, whereas GPT takes the average cost of reaching the goal from the states.

[Figure 4: Quality (average plan path cost) for POND (LUG and CLUG), MBP, and GPT for Medical-Specialist and Rovers.]
[Figure 5: Number of plan nodes for POND (LUG and CLUG), MBP, and GPT for Medical-Specialist and Rovers.]
[Figure 6: Total time (ms) for POND (LUG and CLUG), MBP, and GPT for Medical-Specialist and Rovers.]

Rovers: We use an adaptation of the Rovers domain from the Third International Planning Competition [Long and Fox, 2003], where there are several locations with possible science data (images, rocks, and soil). We added sensory actions to determine the availability of scientific data and conditional actions that conformantly collect data. Our action cost model is: c(sense visibility) = X, c(sense rock) = Y, c(sense soil) = Z, c(navigate) = 50, c(calibrate) = 10, c(take image) = 20, c(communicate data) = 40, c(sample soil) = 30, c(sample rock) = 60, and c(drop) = 5. The two versions have costs (X,Y,Z) = {(35, 55, 45), (100, 120, 110)}. Plans in the Rovers domain can involve sensing at locations to identify whether data can be collected, or simply going to every possible location and trying to collect data.
The number of locations varies between four and eight, and the number of possible locations for collecting up to three types of data is between one and four. The last two columns of Figures 4, 5, and 6 show the average path cost, number of nodes in the solution, and total time for the two cost models. We found that the LUG and CLUG relaxed plan extraction guide POND toward similar plans, in terms of cost and number of nodes. The lack of difference between the heuristics may be attributed to the domain structure: good solutions have a lot of positive interaction (i.e. the heuristics extract similar relaxed plans because low cost actions also support subgoals in many possible worlds), as opposed to Medical-Specialist, where solutions are fairly independent for different possible worlds. MBP, making no use of action costs, returns plans with considerably (an order of magnitude) higher average path costs and numbers of solution nodes. GPT fares better than MBP in terms of plan cost, but both are limited in scalability due to weaker heuristics. In summary, the experiments show that the LUG and CLUG heuristics help with scalability and that using the CLUG to extract relaxed plans can help find better solutions. We also found that planners not reasoning about action cost can return arbitrarily poor solutions, and that planners whose heuristic relaxes uncertainty do not scale as well.

5 Related Work
The idea of cost propagation on planning graphs was first presented by Do and Kambhampati [2003] to cope with metric-temporal planning. The first work on using planning graphs in conditional planning was in the CGP [Smith and Weld, 1998] and SGP [Weld et al., 1998] planners. Recently, planning graph heuristics have proven useful in conformant planning [Bryce and Kambhampati, 2004; Brafman and Hoffmann, 2004] and conditional planning [Cushing and Bryce, 2005; Hoffmann and Brafman, 2005]. They have also proven useful in reachability analysis for MDPs [Boutilier et al., 1998]; our work could be extended for POMDPs. Also related is the work on sensor planning, such as Koenig and Liu [1999]. The authors investigate the frequency of sensing as the plan optimization criterion changes (from minimizing the worst case cost to minimizing the expected cost). We investigate the frequency of sensing while minimizing average plan cost under different cost models. The work on optimal limited contingency planning [Meuleau and Smith, 2003] stated that adjusting sensory action cost, as we have, is an alternative to their approach for reducing plan branches.

6 Conclusion & Future Work
With our motivation toward conditional planning approaches that can scale like classical planners, but still reason with quality metrics like POMDPs, we have presented a novel planning graph generalization called the LUG and an associated cost propagated version called the CLUG. With the CLUG we extract cost-sensitive relaxed plans that are effective in guiding our planner POND toward high-quality conditional plans. We have shown with an empirical comparison that our approach improves the quality of conditional plans over conditional planners that do not account for cost information, and that we can out-scale approaches that consider cost information and uncertainty in a weaker fashion. While our relaxation of conditional planning ignores sensory actions, we have explored techniques to include observations in heuristic estimates.
The basic idea is to extract a relaxed plan and then add sensory actions that reduce cost by removing mutexes (sensing to place conflicting actions in different branches) or by reducing average path cost (ensuring costly actions are not executed in all paths). The major reason we do not report on using sensory relaxed plans here is that the scalability of these techniques is somewhat limited, despite their ability to further improve plan quality. We are investigating ways to reduce their computation cost.

Given our ability to propagate numeric information on the LUG, we are currently adapting these heuristics and our planner to handle non-uniform probabilities. The extension involves adding probabilities to labels by using ADDs instead of BDDs, and redefining the propagation semantics. The propagation semantics replaces conjunctions with products, and disjunctions with summations. A label then represents a probability distribution over possible worlds, the probability of reaching a vertex is a summation over the possible world probabilities, and the expected cost of a vertex is the sum of products between cost vector partitions and the label. Relaxed plans, which previously involved weighted set covers with a single objective (minimizing cost), become multi-objective by trading off cost and probability.

In addition to cost propagation, we have also extended the LUG within the framework of state agnostic planning graphs [Cushing and Bryce, 2005]. The LUG seeks to avoid redundancy across the multiple planning graphs built for states in the same belief state; we extended this notion to avoid redundancy in planning graphs built for every belief state. We have shown that the state agnostic LUG (SLUG), which is built once per search episode (as opposed to a LUG at each node), can reduce heuristic computation cost without sacrificing informedness.

Acknowledgements: This research is supported in part by the NSF grant IIS-0308139 and an IBM Faculty Award to Subbarao Kambhampati.

References
P. Bertoli, A. Cimatti, M. Roveri, and P. Traverso. Planning in nondeterministic domains under partial observability via symbolic model checking. In Proceedings of IJCAI'01, 2001.
B. Bonet and H. Geffner. Planning with incomplete information as heuristic search in belief space. In Proceedings of AIPS'00, 2000.
C. Boutilier, R. Brafman, and C. Geib. Structured reachability analysis for Markov decision processes. In Proceedings of UAI'98, 1998.
R. Brafman and J. Hoffmann. Conformant planning via heuristic forward search: A new approach. In Proceedings of ICAPS'04, 2004.
R. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, C-35(8):677–691, August 1986.
D. Bryce and S. Kambhampati. Heuristic guidance measures for conformant planning. In Proceedings of ICAPS'04, 2004.
D. Bryce, S. Kambhampati, and D. Smith. Planning in belief space with a labelled uncertainty graph. Technical report, AAAI Workshop TR WS-04-08, 2004.
T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. McGraw-Hill, 1990.
W. Cushing and D. Bryce. State agnostic planning graphs. In Proceedings of AAAI'05, 2005.
M.B. Do and S. Kambhampati. Sapa: A scalable multi-objective heuristic metric temporal planner. JAIR, 2003.
J. Hoffmann and R. Brafman. Contingent planning via heuristic forward search with implicit belief states. In Proceedings of ICAPS'05, 2005.
S. Koenig and Y. Liu. Sensor planning with non-linear utility functions. In Proceedings of ECP'99, 1999.
M. Littman, J. Goldsmith, and M. Mundhenk. The computational complexity of probabilistic planning. JAIR, 9:1–36, 1998.
D. Long and M. Fox. The 3rd International Planning Competition: Results and analysis. JAIR, 20:1–59, 2003.
O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In Proceedings of AAAI'99, 1999.
N. Meuleau and D. Smith. Optimal limited contingency planning. In Proceedings of UAI'03, 2003.
N. Nilsson. Principles of Artificial Intelligence. Morgan Kaufmann, 1980.
R. Petrick and F. Bacchus. A knowledge-based approach to planning with incomplete information and sensing. In Proceedings of AIPS'02, 2002.
J. Rintanen. Complexity of planning with partial observability. In Proceedings of ICAPS'04, 2004.
D. Smith and D. Weld. Conformant Graphplan. In Proceedings of AAAI'98, 1998.
D. Weld, C. Anderson, and D. Smith. Extending Graphplan to handle uncertainty and sensing actions. In Proceedings of AAAI'98, 1998.