Journal of Artificial Intelligence Research 31 (2008) 33-82. Submitted 02/07; published 01/08.

Planning with Durative Actions in Stochastic Domains

Mausam (mausam@cs.washington.edu)
Daniel S. Weld (weld@cs.washington.edu)
Dept of Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98195 USA

Abstract

Probabilistic planning problems are typically modeled as Markov Decision Processes (MDPs). MDPs, while an otherwise expressive model, allow only for sequential, non-durative actions. This poses severe restrictions in modeling and solving real-world planning problems. We extend the MDP model to incorporate 1) simultaneous action execution, 2) durative actions, and 3) stochastic durations. We develop several algorithms to combat the computational explosion introduced by these features. The key theoretical ideas used in building these algorithms are: modeling a complex problem as an MDP in an extended state/action space, pruning of irrelevant actions, sampling of relevant actions, using informed heuristics to guide the search, hybridizing different planners to achieve the benefits of both, and approximating the problem and replanning. Our empirical evaluation illuminates the different merits of the various algorithms, viz., optimality, empirical closeness to optimality, theoretical error bounds, and speed.

1. Introduction

Recent progress achieved by planning researchers has yielded new algorithms which relax, individually, many of the classical assumptions. For example, successful temporal planners like SGPlan and SAPA (Chen, Wah, & Hsu, 2006; Do & Kambhampati, 2003) are able to model actions that take time, and probabilistic planners like GPT, LAO* and SPUDD (Bonet & Geffner, 2005; Hansen & Zilberstein, 2001; Hoey, St-Aubin, Hu, & Boutilier, 1999) can deal with actions that have probabilistic outcomes. However, in order to apply automated planning to many real-world domains we must eliminate larger groups of these assumptions in concert. For example, NASA researchers note that optimal control for a NASA Mars rover requires reasoning about uncertain, concurrent, durative actions and a mixture of discrete and metric fluents (Bresina, Dearden, Meuleau, Smith, & Washington, 2002). While today's planners can handle large problems with deterministic concurrent durative actions, and MDPs provide a clear framework for non-concurrent durative actions in the face of uncertainty, few researchers have considered concurrent, uncertain, durative actions — the focus of this paper.

As an example consider the NASA Mars rovers, Spirit and Opportunity. They have the goal of gathering data from different locations with various instruments (color and infrared cameras, microscopic imager, Mössbauer spectrometers, etc.) and transmitting this data back to Earth. Concurrent actions are essential since instruments can be turned on, warmed up and calibrated while the rover is moving, using other instruments or transmitting data. Similarly, uncertainty must be explicitly confronted, as the rover's movement, arm control and other actions cannot be accurately predicted. Furthermore, all of their actions, e.g., moving between locations and setting up experiments, take time. In fact, these temporal durations are themselves uncertain — the rover might lose its way and take a long time to reach another location, etc.
To be able to solve the planning problems encountered by a rover, our planning framework needs to explicitly model all these domain constructs — concurrency, actions with uncertain outcomes, and uncertain durations. In this paper we present a unified formalism that models all these domain features together. Concurrent Markov Decision Processes (CoMDPs) extend MDPs by allowing multiple actions per decision epoch. We use CoMDPs as the base to model all planning problems involving concurrency. Problems with durative actions, concurrent probabilistic temporal planning (CPTP), are formulated as CoMDPs in an extended state space. The formulation is also able to incorporate uncertainty in durations in the form of probability distributions.

Solving these planning problems poses several computational challenges: concurrency, extended durations, and uncertainty in those durations all lead to explosive growth in the state space, action space and branching factor. We develop two techniques, Pruned RTDP and Sampled RTDP, to address the blowup from concurrency. We also develop the "DUR" family of algorithms to handle stochastic durations. These algorithms explore different points in the running time vs. solution quality tradeoff. The different algorithms propose several speedup mechanisms, such as: 1) pruning of provably sub-optimal actions in a Bellman backup, 2) intelligent sampling from the action space, 3) admissible and inadmissible heuristics computed by solving non-concurrent problems, 4) hybridizing two planners to obtain a hybridized planner that finds good quality solutions in intermediate running times, 5) approximating stochastic durations by their mean values and replanning, and 6) exploiting the structure of multi-modal duration distributions to achieve higher quality approximations.

The rest of the paper is organized as follows: In Section 2 we discuss the fundamentals of MDPs and the real-time dynamic programming (RTDP) solution method. In Section 3 we describe the model of Concurrent MDPs. Section 4 investigates the theoretical properties of the temporal problems. Section 5 explains our formulation of the CPTP problem for deterministic durations. The algorithms are extended to the case of stochastic durations in Section 6. Each section is supported with an empirical evaluation of the techniques presented in that section. In Section 7 we survey the related work in the area. We conclude with future directions of research in Sections 8 and 9.

2. Background

Planning problems under probabilistic uncertainty are often modeled using Markov Decision Processes (MDPs). Different research communities have looked at slightly different formulations of MDPs. These versions typically differ in objective functions (maximizing reward vs. minimizing cost), horizons (finite, infinite, indefinite) and action representations (DBN vs. parametrized action schemata). All these formulations are very similar in nature, and so are the algorithms to solve them. Though the methods proposed in this paper are applicable to all the variants of these models, for clarity of explanation we assume a particular formulation, known as the stochastic shortest path problem (Bertsekas, 1995).

We define a Markov decision process (M) as a tuple ⟨S, A, Ap, Pr, C, G, s0⟩ in which

• S is a finite set of discrete states. We use factored MDPs, i.e., S is compactly represented in terms of a set of state variables.

• A is a finite set of actions.
State variables: x1, x2, x3, x4, p12

Action       Precondition   Effect        Probability
toggle-x1    ¬p12           x1 ← ¬x1      1
toggle-x2    p12            x2 ← ¬x2      1
toggle-x3    true           x3 ← ¬x3      0.9
                            no change     0.1
toggle-x4    true           x4 ← ¬x4      0.9
                            no change     0.1
toggle-p12   true           p12 ← ¬p12    1

Goal: x1 = 1, x2 = 1, x3 = 1, x4 = 1

Figure 1: Probabilistic STRIPS definition of a simple MDP with potential parallelism

• Ap defines an applicability function. Ap : S → P(A) denotes the set of actions that can be applied in a given state (P represents the power set).

• Pr : S × A × S → [0, 1] is the transition function. We write Pr(s′|s, a) to denote the probability of arriving at state s′ after executing action a in state s.

• C : S × A × S → ℝ+ is the cost model. We write C(s, a, s′) to denote the cost incurred when the state s′ is reached after executing action a in state s.

• G ⊆ S is a set of absorbing goal states, i.e., the process ends once one of these states is reached.

• s0 is a start state.

We assume full observability, i.e., the execution system has complete access to the new state after an action has been performed. We seek to find an optimal, stationary policy — i.e., a function π : S → A that minimizes the expected cost (over an indefinite horizon) incurred to reach a goal state. Note that any cost function, J : S → ℝ, mapping states to the expected cost of reaching a goal state defines a policy as follows:

\[ \pi_J(s) = \arg\min_{a \in Ap(s)} \sum_{s' \in S} \Pr(s'|s,a) \bigl[ C(s,a,s') + J(s') \bigr] \tag{1} \]

The optimal policy derives from the optimal cost function, J*, which satisfies the following pair of Bellman equations:

\[ J^*(s) = 0 \ \text{if } s \in G; \ \text{else} \quad J^*(s) = \min_{a \in Ap(s)} \sum_{s' \in S} \Pr(s'|s,a) \bigl[ C(s,a,s') + J^*(s') \bigr] \tag{2} \]

For example, Figure 1 defines a simple MDP where four state variables (x1, ..., x4) need to be set using toggle actions. Some of the actions, e.g., toggle-x3, are probabilistic.

Various algorithms have been developed to solve MDPs. Value iteration is a dynamic programming approach in which the optimal cost function (the solution to Equations 2) is calculated as the limit of a series of approximations, each considering increasingly long action sequences. If Jn(s) is the cost of state s in iteration n, then the cost of state s in the next iteration is calculated with a process called a Bellman backup as follows:

\[ J_{n+1}(s) = \min_{a \in Ap(s)} \sum_{s' \in S} \Pr(s'|s,a) \bigl[ C(s,a,s') + J_n(s') \bigr] \tag{3} \]

Value iteration terminates when ∀s ∈ S, |Jn(s) − Jn−1(s)| ≤ ε, and this termination is guaranteed for ε > 0. Furthermore, in the limit, the sequence of {Ji} is guaranteed to converge to the optimal cost function, J*, regardless of the initial values, as long as a goal can be reached from every reachable state with non-zero probability.

Unfortunately, value iteration tends to be quite slow, since it explicitly updates every state, and |S| is exponential in the number of domain features. One optimization restricts search to the part of the state space reachable from the initial state s0. Two algorithms exploiting this reachability analysis are LAO* (Hansen & Zilberstein, 2001) and our focus: RTDP (Barto, Bradtke, & Singh, 1995). RTDP, conceptually, is a lazy version of value iteration in which the states get updated in proportion to the frequency with which they are visited by the repeated executions of the greedy policy.
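To make Equations 2 and 3 concrete, the following is a minimal value-iteration sketch (not the authors' implementation). The dictionaries P, C, applicable, and goals are hypothetical stand-ins for Pr, C, Ap, and G.

```python
# Minimal value-iteration sketch for a flat MDP (Equations 2-3).
# P[s][a] is a list of (s2, prob) pairs, C[(s, a, s2)] is the cost,
# applicable[s] lists the actions in Ap(s), goals is the set G.
# All names here are illustrative, not from the paper's code.

def bellman_backup(s, J, P, C, applicable, goals):
    """Return min_a sum_s' Pr(s'|s,a) [C(s,a,s') + J(s')] for state s."""
    if s in goals:
        return 0.0
    return min(
        sum(prob * (C[(s, a, s2)] + J[s2]) for s2, prob in P[s][a])
        for a in applicable[s]
    )

def value_iteration(states, P, C, applicable, goals, eps=1e-4):
    J = {s: 0.0 for s in states}          # any initialization works in the limit
    while True:
        residual = 0.0
        for s in states:
            new_val = bellman_backup(s, J, P, C, applicable, goals)
            residual = max(residual, abs(new_val - J[s]))
            J[s] = new_val
        if residual <= eps:               # termination test from the text
            return J
```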
An RTDP trial is a path starting from s0, following the greedy policy and updating the costs of the states visited using Bellman backups; the trial ends when a goal is reached or the number of updates exceeds a threshold. RTDP repeats these trials until convergence. Note that common states are updated frequently, while RTDP wastes no time on states that are unreachable given the current policy. RTDP's strength is its ability to quickly produce a relatively good policy; however, complete convergence (at every relevant state) is slow because less likely (but potentially important) states get updated infrequently. Furthermore, RTDP is not guaranteed to terminate. Labeled RTDP (LRTDP) fixes these problems with a clever labeling scheme that focuses attention on states where the value function has not yet converged (Bonet & Geffner, 2003). Labeled RTDP is guaranteed to terminate, and is guaranteed to converge to the ε-approximation of the optimal cost function (for states reachable using the optimal policy) if the initial cost function is admissible, all costs (C) are positive, and a goal is reachable from all reachable states with non-zero probability.

MDPs are a powerful framework to model stochastic planning domains. However, MDPs make two unrealistic assumptions: 1) all actions need to be executed sequentially, and 2) all actions are instantaneous. Unfortunately, there are many real-world domains where these assumptions are unrealistic. For example, concurrent actions are essential for a Mars rover, since instruments can be turned on, warmed up and calibrated while the rover is moving, using other instruments, or transmitting data. Moreover, the action durations are non-zero and stochastic — the rover might lose its way while navigating and may take a long time to reach its destination; it may make multiple attempts before finding the accurate arm placement. In this paper we successively relax these two assumptions and build models and algorithms that can scale up in spite of the additional complexities imposed by the more general models.

3. Concurrent Markov Decision Processes

We define a new model, Concurrent MDP (CoMDP), which allows multiple actions to be executed in parallel. This model is different from semi-MDPs and generalized semi-MDPs (Younes & Simmons, 2004b) in that it does not incorporate action durations explicitly. CoMDPs focus on adding concurrency in an MDP framework. The input to a CoMDP is slightly different from that of an MDP: ⟨S, A, Ap∥, Pr∥, C∥, G, s0⟩. The new applicability function, probability model and cost model (Ap∥, Pr∥ and C∥, respectively) encode the distinction between allowing sequential executions of single actions versus the simultaneous executions of sets of actions.

3.1 The Model

The set of states (S), set of actions (A), goals (G) and the start state (s0) follow the input of an MDP. The difference lies in the fact that instead of executing only one action at a time, we may execute multiple of them. Let us define an action combination, A, as a set of one or more actions to be executed in parallel. With an action combination as the new unit operator available to the agent, the CoMDP takes the following new inputs:

• Ap∥ defines the new applicability function. Ap∥ : S → P(P(A)) denotes the set of action combinations that can be applied in a given state.

• Pr∥ : S × P(A) × S → [0, 1] is the transition function.
We write Pr∥(s′|s, A) to denote the probability of arriving at state s′ after executing action combination A in state s.

• C∥ : S × P(A) × S → ℝ+ is the cost model. We write C∥(s, A, s′) to denote the cost incurred when the state s′ is reached after executing action combination A in state s.

In essence, a CoMDP takes an action combination as a unit operator instead of a single action. Our approach is to convert a CoMDP into an equivalent MDP (M∥) that can be specified by the tuple ⟨S, P(A), Ap∥, Pr∥, C∥, G, s0⟩ and solve it using the known MDP algorithms.

3.2 Case Study: CoMDP over Probabilistic STRIPS

In general a CoMDP could require an exponentially larger input than does an MDP, since the transition model, cost model and the applicability function are all defined in terms of action combinations as opposed to actions. A compact input representation for a general CoMDP is an interesting, open research question for the future. In this work, we consider a special class of compact CoMDP — one that is defined naturally via a domain description very similar to the probabilistic STRIPS representation for MDPs (Boutilier, Dean, & Hanks, 1999). Given a domain encoded in probabilistic STRIPS we can compute a safe set of co-executable actions. Under this safe semantics, the probabilistic dynamics is defined in a consistent way, as we describe below.

3.2.1 Applicability Function

We first discuss how to compute the sets of actions that can be executed in parallel, since some actions may conflict with each other. We adopt the classical planning notion of mutual exclusion (Blum & Furst, 1997) and apply it to the factored action representation of probabilistic STRIPS. Two distinct actions are mutex (may not be executed concurrently) if in any state one of the following occurs:

1. they have inconsistent preconditions

2. an outcome of one action conflicts with an outcome of the other

3. the precondition of one action conflicts with the (possibly probabilistic) effect of the other

4. the effect of one action possibly modifies a feature upon which another action's transition function is conditioned.

Additionally, an action is never mutex with itself. In essence, the non-mutex actions do not interact — the effects of executing the sequence a1; a2 equal those of a2; a1 — and so the semantics for parallel executions is clear.

Example: Continuing with Figure 1, toggle-x1, toggle-x3 and toggle-x4 can execute in parallel, but toggle-x1 and toggle-x2 are mutex as they have conflicting preconditions. Similarly, toggle-x1 and toggle-p12 are mutex as the effect of toggle-p12 interferes with the precondition of toggle-x1. If toggle-x4's outcomes depended on x1 then toggle-x4 and toggle-x1 would be mutex too, due to point 4 above. For example, toggle-x4 and toggle-x1 will be mutex if the effect of toggle-x4 were as follows: "if x1 then the probability of x4 ← ¬x4 is 0.9, else 0.1". □

The applicability function is defined as the set of action combinations, A, such that each action in A is independently applicable in s and all of the actions are pairwise non-mutex with each other. Note that pairwise non-mutexness is sufficient to ensure problem-free concurrency of all actions in A. Formally, Ap∥ can be defined in terms of our original definition Ap as follows:

\[ Ap_{\|}(s) = \{ A \subseteq \mathcal{A} \mid \forall a, a' \in A:\ a, a' \in Ap(s) \wedge \neg mutex(a, a') \} \tag{4} \]

3.2.2 Transition Function

Let A = {a1, a2, ..., ak} be an action combination applicable in s.
Since none of the actions are mutex, the transition function may be calculated by choosing any arbitrary order in which to apply them, as follows:

\[ \Pr_{\|}(s' \mid s, A) = \sum_{s_1, s_2, \ldots, s_{k-1} \in S} \Pr(s_1 \mid s, a_1)\Pr(s_2 \mid s_1, a_2)\cdots\Pr(s' \mid s_{k-1}, a_k) \tag{5} \]

While we define the applicability function and the transition function by allowing only a consistent set of actions to be executable concurrently, alternative definitions are possible. For instance, one might be willing to allow executing two actions together if the probability that they conflict is very small. A conflict may be defined as two actions asserting contradictory effects or one negating the precondition of the other. In such a case, a new state called failure could be created such that the system transitions to this state in case of a conflict, and the transition function may be computed to reflect a low-probability transition to this failure state. Although we impose that the model be conflict-free, most of our techniques don't actually depend on this assumption explicitly and extend to general CoMDPs.

3.2.3 Cost Model

We make a small change to the probabilistic STRIPS representation. Instead of defining a single cost (C) for each action, we define it additively as a sum of resource and time components as follows:

• Let t be the durative cost, i.e., the cost due to the time taken to complete the action.

• Let r be the resource cost, i.e., the cost of resources used for the action.

Assuming additivity, we can think of the cost of an action, C(s, a, s′) = t(s, a, s′) + r(s, a, s′), as the sum of its time and resource usage. Hence, the cost model for a combination of actions in terms of these components may be defined as:

\[ C_{\|}(s, \{a_1, a_2, \ldots, a_k\}, s') = \sum_{i=1}^{k} r(s, a_i, s') + \max_{i=1..k} \{\, t(s, a_i, s') \,\} \tag{6} \]

For example, a Mars rover might incur lower cost when it preheats an instrument while changing locations than if it executes the actions sequentially, because the total time is reduced while the energy consumed does not change.

3.3 Solving a CoMDP with MDP Algorithms

We have taken a concurrent MDP that allowed concurrency in actions and formulated it as an equivalent MDP, M∥, in an extended action space. For the rest of the paper we will use the term CoMDP to also refer to the equivalent MDP M∥.

3.3.1 Bellman Equations

We extend Equations 2 to a set of equations representing the solution to a CoMDP:

\[ J^*_{\|}(s) = 0 \ \text{if } s \in G; \ \text{else} \quad J^*_{\|}(s) = \min_{A \in Ap_{\|}(s)} \sum_{s' \in S} \Pr_{\|}(s'|s, A) \bigl[ C_{\|}(s, A, s') + J^*_{\|}(s') \bigr] \tag{7} \]

These equations are the same as in a traditional MDP, except that instead of considering single actions for backup in a state, we need to consider all applicable action combinations. Thus, only this small change must be made to traditional algorithms (e.g., value iteration, LAO*, Labeled RTDP). However, since the number of action combinations is worst-case exponential in |A|, efficiently solving a CoMDP requires new techniques. Unfortunately, there is no structure to exploit easily, since an optimal action for a state from a classical MDP solution may not even appear in the optimal action combination for the associated concurrent MDP.

Theorem 1 All actions in an optimal combination for a CoMDP (M∥) may be individually suboptimal for the MDP M.

Proof: In the domain of Figure 1 let us have an additional action toggle-x34 that toggles both x3 and x4 with probability 0.5 and toggles exactly one of x3 and x4 with probability 0.25 each.
Let all the actions take one time unit each, so that the cost of any action combination is also one. Let the start state be x1 = 1, x2 = 1, x3 = 0, x4 = 0 and p12 = 1. For the MDP M the only optimal action for the start state is toggle-x34. However, for the CoMDP M∥ the optimal combination is {toggle-x3, toggle-x4}. □

3.4 Pruned Bellman Backups

Recall that during a trial, Labeled RTDP performs Bellman backups in order to calculate the costs of applicable actions (or in our case, action combinations) and then chooses the best action (combination); we now describe two pruning techniques that reduce the number of backups to be computed.

Let Qn(s, A) be the expected cost incurred by executing an action combination A in state s and then following the greedy policy, i.e.,

\[ Q_n(s, A) = \sum_{s' \in S} \Pr_{\|}(s'|s, A) \bigl[ C_{\|}(s, A, s') + J_{n-1}(s') \bigr] \tag{8} \]

A Bellman update can thus be rewritten as:

\[ J_n(s) = \min_{A \in Ap_{\|}(s)} Q_n(s, A) \tag{9} \]

3.4.1 Combo-Skipping

Since the number of applicable action combinations can be exponential, we would like to prune suboptimal combinations. The following theorem imposes a lower bound on Q(s, A) in terms of the costs and the Q-values of single actions. For this theorem the costs of the actions may depend only on the action and not on the starting or ending state, i.e., for all states s, s′: C(s, a, s′) = C(a).

Theorem 2 Let A = {a1, a2, ..., ak} be an action combination which is applicable in state s. For a CoMDP over probabilistic STRIPS, if costs are dependent only on actions and the Qn values are monotonically non-decreasing, then

\[ Q_n(s, A) \;\geq\; \max_{i=1..k} Q_n(s, \{a_i\}) + C_{\|}(A) - \sum_{i=1}^{k} C_{\|}(\{a_i\}) \]

Proof: Since costs depend only on the combination,

\[ Q_n(s, A) = C_{\|}(A) + \sum_{s'} \Pr_{\|}(s'|s, A)\, J_{n-1}(s') \qquad \text{(using Eqn. 8)} \]
\[ \Rightarrow \quad \sum_{s'} \Pr_{\|}(s'|s, A)\, J_{n-1}(s') = Q_n(s, A) - C_{\|}(A) \tag{10} \]

\begin{align*}
Q_n(s, \{a_1\}) &= C_{\|}(\{a_1\}) + \sum_{s'} \Pr(s'|s, a_1)\, J_{n-1}(s') \\
&\leq C_{\|}(\{a_1\}) + \sum_{s'} \Pr(s'|s, a_1) \Bigl[ C_{\|}(\{a_2\}) + \sum_{s''} \Pr(s''|s', a_2)\, J_{n-2}(s'') \Bigr] && \text{(using Eqns. 8 and 9)} \\
&= C_{\|}(\{a_1\}) + C_{\|}(\{a_2\}) + \sum_{s''} \Pr_{\|}(s''|s, \{a_1, a_2\})\, J_{n-2}(s'') \\
&\leq \sum_{i=1}^{k} C_{\|}(\{a_i\}) + \sum_{s'} \Pr_{\|}(s'|s, A)\, J_{n-k}(s') && \text{(repeating for all actions in } A\text{)} \\
&= \sum_{i=1}^{k} C_{\|}(\{a_i\}) + \bigl[ Q_{n-k+1}(s, A) - C_{\|}(A) \bigr] && \text{(using Eqn. 10)}
\end{align*}

Replacing n by n + k − 1 and rearranging:

\begin{align*}
Q_n(s, A) &\geq Q_{n+k-1}(s, \{a_1\}) + C_{\|}(A) - \sum_{i=1}^{k} C_{\|}(\{a_i\}) \\
&\geq Q_n(s, \{a_1\}) + C_{\|}(A) - \sum_{i=1}^{k} C_{\|}(\{a_i\}) && \text{(monotonicity of } Q_n\text{)} \\
&\geq \max_{i=1..k} Q_n(s, \{a_i\}) + C_{\|}(A) - \sum_{i=1}^{k} C_{\|}(\{a_i\}) && \text{(the ordering of actions in } A \text{ is arbitrary)} \qquad \square
\end{align*}

The proof above assumes Equation 5 from probabilistic STRIPS. The following corollary can be used to prune suboptimal action combinations:

Corollary 3 Let J̄n(s) be an upper bound of Jn(s). If

\[ \bar{J}_n(s) < \max_{i=1..k} Q_n(s, \{a_i\}) + C_{\|}(A) - \sum_{i=1}^{k} C_{\|}(\{a_i\}) \]

then A cannot be optimal for state s in this iteration.

Proof: Let A*n = {a1, a2, ..., ak} be the optimal combination for state s in this iteration n. Then J̄n(s) ≥ Jn(s) and Jn(s) = Qn(s, A*n). Combining with Theorem 2,

\[ \bar{J}_n(s) \geq \max_{i=1..k} Q_n(s, \{a_i\}) + C_{\|}(A^*_n) - \sum_{i=1}^{k} C_{\|}(\{a_i\}) \]

Hence any combination A that violates this inequality (i.e., satisfies the corollary's condition) cannot be A*n. □

Corollary 3 justifies a pruning rule, combo-skipping, that preserves optimality in any iterative algorithm that maintains cost function monotonicity. This is powerful because all Bellman-backup based algorithms preserve monotonicity when started with an admissible cost function. To apply combo-skipping, one must compute all the Q(s, {a}) values for single actions a that are applicable in s. To calculate J̄n(s), one may use the optimal combination for state s from the previous iteration (Aopt) and compute Qn(s, Aopt). This value gives an upper bound on the value Jn(s).

Example: Consider Figure 1.
Let a single action incur unit cost, and let the cost of an action combination be C∥(A) = 0.5 + 0.5|A|. Let state s = (1,1,0,0,1) represent the ordered values x1 = 1, x2 = 1, x3 = 0, x4 = 0, and p12 = 1. Suppose, after the nth iteration, the cost function assigns the values: Jn(s) = 1, Jn(s1=(1,0,0,0,1)) = 2, Jn(s2=(1,1,1,0,1)) = 1, Jn(s3=(1,1,0,1,1)) = 1. Let Aopt for state s be {toggle-x3, toggle-x4}. Now, Qn+1(s, {toggle-x2}) = C∥({toggle-x2}) + Jn(s1) = 3 and Qn+1(s, Aopt) = C∥(Aopt) + 0.81×0 + 0.09×Jn(s2) + 0.09×Jn(s3) + 0.01×Jn(s) = 1.69. So we can apply Corollary 3 to skip the combination {toggle-x2, toggle-x3} in this iteration, since, using toggle-x2 as a1, we have J̄n+1(s) = Qn+1(s, Aopt) = 1.69 < 3 + 1.5 − 2 = 2.5. □

Experiments show that combo-skipping yields considerable savings. Unfortunately, combo-skipping has a weakness: it prunes a combination for only a single iteration. In contrast, our second rule, combo-elimination, prunes irrelevant combinations altogether.

3.4.2 Combo-Elimination

We adapt the action elimination theorem from traditional MDPs (Bertsekas, 1995) to prove a similar theorem for CoMDPs.

Theorem 4 Let A be an action combination which is applicable in state s. Let Q̲*(s, A) denote a lower bound of Q∥*(s, A). If Q̲*(s, A) > J∥*(s) then A is never the optimal combination for state s.

Proof: Because a CoMDP is an MDP in a new action space, the original proof for MDPs (Bertsekas, 1995) holds after replacing an action by an 'action combination'. □

In order to apply the theorem for pruning, one must be able to evaluate the upper and lower bounds. By using an admissible cost function when starting the RTDP search (or value iteration, LAO*, etc.), the current cost Jn(s) is guaranteed to be a lower bound of the optimal cost; thus, Qn(s, A) will also be a lower bound of Q∥*(s, A). Thus, it is easy to compute the left hand side of the inequality. To calculate an upper bound of the optimal J∥*(s), one may solve the MDP M, i.e., the traditional MDP that forbids concurrency. This is much faster than solving the CoMDP, and yields an upper bound on cost, because forbidding concurrency restricts the policy to use a strict subset of the legal action combinations. Notice that combo-elimination can be used for all general CoMDPs and is not restricted to only CoMDPs over probabilistic STRIPS.

Example: Continuing with the previous example, let A = {toggle-x2}; then Qn+1(s, A) = C∥(A) + Jn(s1) = 3 and J*(s) = 2.222 (from solving the MDP M). As 3 > 2.222, A can be eliminated for state s in all remaining iterations. □

Used in this fashion, combo-elimination requires the additional overhead of optimally solving the single-action MDP M. Since algorithms like RTDP exploit state-space reachability to limit computation to relevant states, we do this computation incrementally, as new states are visited by our algorithm. Combo-elimination also requires computation of the current value of Qn(s, A) (for the lower bound of Q∥*(s, A)); this differs from combo-skipping, which avoids this computation. However, once combo-elimination prunes a combination, it never needs to be reconsidered. Thus, there is a tradeoff: should one perform an expensive computation, hoping for long-term pruning, or try a cheaper pruning rule with fewer benefits? Since Q-value computation is the costly step, we adopt the following heuristic: "First, try combo-skipping; if it fails to prune the combination, attempt combo-elimination; if that succeeds, never consider the combination again."
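A schematic sketch of how the two pruning rules might sit inside a single backup, under the stated assumptions (combination-only costs and monotone, admissibly initialized values). The helpers combos, actions_of, q_value, cost and serial_optimal_cost are hypothetical stand-ins, not the authors' code.

```python
# Sketch of a pruned Bellman backup combining combo-skipping (Corollary 3)
# and combo-elimination (Theorem 4). combos(s) enumerates applicable
# combinations (hashable, e.g. frozensets); q_value evaluates Equation 8;
# cost(A) is the combination cost; serial_optimal_cost(s) is J*(s) of the
# single-action MDP M; upper_bound is e.g. the Q value of the previous
# iteration's best combination; eliminated[s] persists across iterations.

def pruned_backup(s, J, combos, actions_of, q_value, cost,
                  serial_optimal_cost, upper_bound, eliminated):
    single_q = {a: q_value(s, frozenset({a}), J) for a in actions_of(s)}
    best_val, best_combo = float("inf"), None
    for A in combos(s):
        if A in eliminated[s]:
            continue                     # pruned forever in an earlier pass
        # Combo-skipping: a cheap lower bound built from single-action Q values.
        skip_bound = (max(single_q[a] for a in A) + cost(A)
                      - sum(cost(frozenset({a})) for a in A))
        if upper_bound < skip_bound:
            continue                     # provably suboptimal this iteration
        q = q_value(s, A, J)             # skipping failed, so pay for Q
        # Combo-elimination: with an admissible J, q lower-bounds Q*(s, A),
        # and the serial MDP's optimal cost upper-bounds the CoMDP's J*(s).
        if q > serial_optimal_cost(s):
            eliminated[s].add(A)         # never reconsider this combination
            continue
        if q < best_val:
            best_val, best_combo = q, A
    return best_val, best_combo
```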
We also tried implementing some other heuristics, such as: 1) if some combination is being skipped repeatedly, then try to prune it altogether with combo-elimination; 2) in every state, try combo-elimination with probability p. Neither alternative performed significantly better, so we kept our original (lower overhead) heuristic.

Since combo-skipping does not change any step of Labeled RTDP and combo-elimination removes provably sub-optimal combinations, pruned Labeled RTDP maintains convergence, termination, optimality and efficiency when used with an admissible heuristic.

3.5 Sampled Bellman Backups

Since the fundamental challenge posed by CoMDPs is the explosion of action combinations, sampling is a promising method to reduce the number of Bellman backups required per state. We describe a variant of RTDP, called sampled RTDP, which performs backups on a random set of action combinations,¹ choosing from a distribution that favors combinations that are likely to be optimal. We generate our distribution by:

1. using combinations that were previously discovered to have low Q-values (recorded by memoizing the best combinations per state, after each iteration)

2. calculating the Q-values of all applicable single actions (using the current cost function) and then biasing the sampling of combinations to choose the ones that contain actions with low Q-values.

1. A similar action sampling approach was also used in the context of space shuttle scheduling to reduce the number of actions considered during value function computation (Zhang & Dietterich, 1995).

Algorithm 1: Sampled-Bellman-Backup(state, m)
1:  //returns the best combination found
2:  list l = ∅  //a list of all applicable actions with their values
3:  for all action ∈ A do
4:      compute Q(state, {action})
5:      insert ⟨action, 1/Q(state, {action})⟩ in l
6:  for all i ∈ [1..m] do
7:      newcomb = SampleComb(state, i, l); compute Q(state, newcomb)
8:  clear memoizedlist[state]
9:  compute Qmin as the minimum of all Q values computed in line 7
10: store all combinations A with Q(state, A) = Qmin in memoizedlist[state]
11: return the first entry in memoizedlist[state]

Function 2: SampleComb(state, i, l)
1:  //returns the ith combination for the sampled backup
2:  if i ≤ size(memoizedlist[state]) then
3:      return the ith entry in memoizedlist[state]  //the combination memoized in the previous iteration
4:  newcomb = ∅
5:  repeat
6:      randomly sample an action a from l proportional to its value
7:      insert a in newcomb
8:      remove all actions mutex with a from l
9:      if l is empty then done = true
10:     else if |newcomb| == 1 then done = false  //sample at least 2 actions per combination
11:     else done = true with prob. |newcomb|/(|newcomb| + 1)
12: until done
13: return newcomb

This approach exposes an exploration/exploitation trade-off. Exploration, here, refers to testing a wide range of action combinations to improve understanding of their relative merit. Exploitation, on the other hand, advocates performing backups on the combinations that have previously been shown to be the best. We manage the tradeoff by carefully maintaining the distribution over combinations. First, we only memoize the best combinations per state; these are always backed up in a Bellman update.
Other combinations are constructed by an incremental probabilistic process, which builds a combination by first randomly choosing an initial action (weighted by its individual Q-value), then deciding whether to add a non-mutex action or stop growing the combination. There are many possible implementations of this high-level idea. We tried several of them and found the results to be very similar in all cases. Algorithm 1 describes the implementation used in our experiments. The algorithm takes a state and a total number of combinations m as input and returns the best combination obtained so far. It also memoizes all the best combinations for the state in memoizedlist. Function 2 is a helper function that returns the ith combination, which is either one of the best combinations memoized in the previous iteration or a new sampled combination. Also notice line 10 in Function 2: it forces the sampled combinations to have size at least 2, since all individual actions have already been backed up (line 3 of Algo 1).

3.5.1 Termination and Optimality

Since the system does not consider every possible action combination, sampled RTDP is not guaranteed to choose the best combination to execute at each state. As a result, even when started with an admissible heuristic, the algorithm may assign Jn(s) a cost that is greater than the optimal J∥*(s) — i.e., the Jn(s) values are no longer admissible. If a better combination is chosen in a subsequent iteration, Jn+1(s) might be set to a lower value than Jn(s); thus sampled RTDP is not monotonic. This is unfortunate, since admissibility and monotonicity are important properties required for termination² and optimality in Labeled RTDP; indeed, sampled RTDP loses these important theoretical properties. The good news is that it is extremely useful in practice. In our experiments, sampled RTDP usually terminates quickly, and returns costs that are extremely close to the optimal.

2. To ensure termination we implemented the following policy: if the number of trials exceeds a threshold, force monotonicity on the cost function. This will achieve termination but may reduce the quality of the solution.

3.5.2 Improving Solution Quality

We have investigated several heuristics in order to improve the quality of the solutions found by sampled RTDP. Our heuristics compensate for the errors due to partial search and lack of admissibility.

• Heuristic 1: Whenever sampled RTDP asserts convergence of a state, do not immediately label it as converged (which would preclude further exploration (Bonet & Geffner, 2003)); instead, first run a complete backup phase, using all the applicable combinations, to rule out any easy-to-detect inconsistencies.

• Heuristic 2: Run sampled RTDP to completion, and use the cost function it produces, J^s(·), as the initial heuristic estimate, J0(·), for a subsequent run of pruned RTDP. Usually, such a heuristic, though inadmissible, is highly informative. Hence, pruned RTDP terminates quite quickly.

• Heuristic 3: Run sampled RTDP before pruned RTDP, as in Heuristic 2, except instead of using the J^s(·) cost function directly as an initial estimate, scale it linearly downward — i.e., use J0(·) := c·J^s(·) for some constant c ∈ (0, 1). While there are no guarantees, we hope that this lies on the admissible side of the optimal. In our experience this is often the case for c = 0.9, and the run of pruned RTDP then yields the optimal policy very quickly.
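A sketch of how Heuristics 2 and 3 chain the two planners; sampled_rtdp and pruned_rtdp are hypothetical entry points to the planners described above, each taking an initial cost function and returning the cost function it converges to.

```python
# Heuristics 2 and 3: run sampled RTDP first and reuse its (inadmissible but
# informative) cost function, optionally scaled down by c in (0, 1), to seed
# pruned RTDP. With c = 0.9 the scaled values were usually admissible in the
# authors' experiments. The two planner functions are hypothetical stand-ins.

def hybrid_solve(problem, h0, c=0.9):
    J_sampled = sampled_rtdp(problem, h0)              # fast, near-optimal
    J_seed = {s: c * v for s, v in J_sampled.items()}  # Heuristic 3 scaling
    return pruned_rtdp(problem, J_seed)                # optimal if seed admissible
```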
Experiments showed that Heuristic 1 returns a cost function that is close to optimal. Adding Heuristic 2 improves this value moderately, and a combination of Heuristics 1 and 3 returns the optimal solution in our experiments.

3.6 Experiments: Concurrent MDP

The Concurrent MDP is a fundamental formulation, modeling concurrent actions in a general planning domain. We first compare the various techniques to solve CoMDPs, viz., pruned and sampled RTDP. In the following sections we use these techniques to model problems with durative actions.

We tested our algorithms on problems in three domains. The first domain was a probabilistic variant of the NASA Rover domain from the 2002 AIPS Planning Competition (Long & Fox, 2003), in which there are multiple objects to be photographed and various rocks to be tested, with the resulting data communicated back to the base station. Cameras need to be focused, and arms need to be positioned before usage. Since the rover has multiple arms and multiple cameras, the domain is highly parallel. The cost function includes both resource and time components, so executing multiple actions in parallel is cheaper than executing them sequentially. We generated problems with 20-30 state variables, having up to 81,000 reachable states; the average number of applicable combinations per state, Avg(Ap(s)), which measures the amount of concurrency in a problem, is up to 2735.

We also tested on a probabilistic version of a machine-shop domain with multiple subtasks (e.g., roll, shape, paint, polish, etc.), which need to be performed on different objects using different machines. Machines can perform in parallel, but not all are capable of every task. We tested on problems with 26-28 state variables and around 32,000 reachable states. Avg(Ap(s)) ranged between 170 and 2640 on the various problems.

Finally, we tested on an artificial domain similar to the one shown in Figure 1 but much more complex. In this domain, some Boolean variables need to be toggled; however, toggling is probabilistic in nature. Moreover, certain pairs of actions have conflicting preconditions and thus, by varying the number of mutex actions, we may control the domain's degree of parallelism. All the problems in this domain had 19 state variables and about 32,000 reachable states, with Avg(Ap(s)) between 1024 and 12287.

We used Labeled RTDP, as implemented in GPT (Bonet & Geffner, 2005), as the base MDP solver. It is implemented in C++. We implemented³ various algorithms: unpruned RTDP (U-RTDP), pruned RTDP using only combo-skipping (Ps-RTDP), pruned RTDP using both combo-skipping and combo-elimination (Pse-RTDP), sampled RTDP using Heuristic 1 (S-RTDP), and sampled RTDP using both Heuristics 1 and 3, with value functions scaled by 0.9 (S3-RTDP). We tested all of these algorithms on a number of problem instantiations from our three domains, generated by varying the number of objects, degrees of parallelism, and distances to the goal. The experiments were performed on a 2.8 GHz Pentium processor with 2 GB of RAM.

We observe (Figure 2(a,b)) that pruning significantly speeds the algorithm. But the comparison of Pse-RTDP with S-RTDP and S3-RTDP (Figure 3(a,b)) shows that sampling yields a dramatic speedup with respect to the pruned versions. In fact, pure sampling, S-RTDP, converges extremely quickly, and S3-RTDP is slightly slower. However, S3-RTDP is still much faster than Pse-RTDP. The comparison of the qualities of solutions produced by S-RTDP and S3-RTDP w.r.t.
the optimal is shown in Table 1. We observe that the solutions produced by S-RTDP are always nearly optimal.

3. The code may be downloaded at http://www.cs.washington.edu/ai/comdp/comdp.tgz

Figure 2: (a,b): Pruned vs. Unpruned RTDP for the Rover and MachineShop domains respectively. Pruning non-optimal combinations achieves significant speedups on larger problems.

Figure 3: (a,b): Sampled vs. Pruned RTDP for the Rover and MachineShop domains respectively. Random sampling of action combinations yields dramatic improvements in running times.

Figure 4: (a,b): Comparison of different algorithms with the size of the problems for the Rover and Artificial domains. As the problem size increases, the gap between sampled and pruned approaches widens considerably.

Figure 5: (a): Relative speed vs. concurrency for the Artificial domain. (b): Variation of the quality of solution and efficiency of the algorithm (with 95% confidence intervals) with the number of samples in Sampled RTDP, for one particular problem from the Rover domain. As the number of samples increases, the quality of solution approaches optimal and the time still remains better than Pse-RTDP (which takes 259 sec. for this problem).
Problem        J(s0) (S-RTDP)   J*(s0) (Optimal)   Error
Rover1         10.7538          10.7535            <0.01%
Rover2         10.7535          10.7535            0
Rover3         11.0016          11.0016            0
Rover4         12.7490          12.7461            0.02%
Rover5         7.3163           7.3163             0
Rover6         10.5063          10.5063            0
Rover7         12.9343          12.9246            0.08%
Artificial1    4.5137           4.5137             0
Artificial2    6.3847           6.3847             0
Artificial3    6.5583           6.5583             0
MachineShop1   15.0859          15.0338            0.35%
MachineShop2   14.1414          14.0329            0.77%
MachineShop3   16.3771          16.3412            0.22%
MachineShop4   15.8588          15.8588            0
MachineShop5   9.0314           8.9844             0.56%

Table 1: Quality of solutions produced by Sampled RTDP

Since the error of S-RTDP is small, scaling it by 0.9 makes it an admissible initial cost function for the pruned RTDP; indeed, in all experiments, S3-RTDP produced the optimal solution.

Figure 4(a,b) demonstrates how running times vary with problem size. We use the product of the number of reachable states and the average number of applicable action combinations per state as an estimate of the size of the problem (the number of reachable states in all artificial domains is the same, hence the x-axis for Figure 4(b) is Avg(Ap(s))). From these figures, we verify that the number of applicable combinations plays a major role in the running times of the concurrent MDP algorithms.

In Figure 5(a), we fix all other factors and vary the degree of parallelism. We observe that the speedups obtained by S-RTDP increase as concurrency increases. This is a very encouraging result, and we can expect S-RTDP to perform well on large problems involving high concurrency, even if the other approaches fail. In Figure 5(b), we present another experiment in which we vary the number of action combinations sampled in each backup. While solution quality is inferior when sampling only a few combinations, it quickly approaches optimal as the number of samples increases. In all other experiments we sample 40 combinations per state.

4. Challenges for Temporal Planning

While the CoMDP model is powerful enough to model concurrency in actions, it still assumes each action to be instantaneous. We now incorporate actual action durations in modeling the problem. This is essential to increase the scope of current models to real-world domains. Before we present our model and the algorithms, we discuss several new theoretical challenges imposed by explicit action durations. Note that the results in this section apply to a wide range of planning problems:

• regardless of whether durations are uncertain or fixed

• regardless of whether effects are stochastic or deterministic.

Actions of uncertain duration are modeled by associating a distribution (possibly conditioned on the outcome of stochastic effects) over execution times. We focus on problems whose objective is to achieve a goal state while minimizing total expected time (make-span), but our results extend to cost functions that combine make-span and resource usage. This raises the question of when a goal counts as achieved. We require that:

Assumption 1 All executing actions terminate before the goal is considered achieved.

Assumption 2 An action, once started, cannot be terminated prematurely.

We start by asking the question: "Is there a restricted set of time points such that optimality is preserved even if actions are started only at these points?"

Definition 1 Any time point when a new action is allowed to start execution is called a decision epoch.
A time point is a pivot if it is either 0 or a time when a new effect might occur (e.g., the end of an action's execution), or a new precondition may be needed, or an existing precondition may no longer be needed. A happening is either 0 or a time when an effect actually occurs, or a new precondition is definitely needed, or an existing precondition is no longer needed.

Intuitively, a happening is a point where a change in the world state or action constraints actually "happens" (e.g., by a new effect or a new precondition). When execution crosses a pivot (a possible happening), information is gained by the agent's execution system (e.g., did or didn't the effect occur), which may "change the direction" of future action choices. Clearly, if action durations are deterministic, then the set of pivots is the same as the set of happenings.

Example: Consider an action a whose duration follows a uniform integer distribution between 1 and 10. If it is started at time 0 then all timepoints 0, 1, 2, ..., 10 are pivots. If in a certain execution it finishes at time 4, then 4 (and 0) is a happening (for this execution). □

Definition 2 An action is a PDDL2.1 action (Fox & Long, 2003) if the following hold:

• The effects are realized instantaneously either (at start) or (at end), i.e., at the beginning or at the completion of the action (respectively).

• The preconditions may need to hold instantaneously before the start (at start), before the end (at end), or over the complete execution of the action (over all).

    (:durative-action a
      :duration (= ?duration 4)
      :condition (and (over all P) (at end Q))
      :effect (at end Goal))

    (:durative-action b
      :duration (= ?duration 2)
      :effect (and (at start Q) (at end (not P))))

Figure 6: A domain to illustrate that an expressive action model may require arbitrary decision epochs for a solution. In this example, b needs to start 3 units into a's execution to reach Goal.

Theorem 5 For a PDDL2.1 domain, restricting decision epochs to pivots causes incompleteness (i.e., a problem may be incorrectly deemed unsolvable).

Proof: Consider the deterministic temporal planning domain in Figure 6, which uses PDDL2.1 notation (Fox & Long, 2003). If the initial state is P = true and Q = false, then the only way to reach Goal is to start a at some time t (e.g., 0), and b at some timepoint in the open interval (t + 2, t + 4). Clearly, no new information is gained at any of the time points in this interval and none of them is a pivot. Still, they are required for solving the problem. □

Intuitively, the instantaneous start and end effects of two PDDL2.1 actions may require a certain relative alignment between them to achieve the goal. This alignment may force one action to start somewhere (possibly at a non-pivot point) in the midst of the other's execution, thus requiring intermediate decision epochs to be considered.

Temporal planners may be classified as having one of two architectures: constraint-posting approaches, in which the times of action execution are gradually constrained during planning (e.g., Zeno and LPG, see Penberthy and Weld, 1994; Gerevini and Serina, 2002), and extended state-space methods (e.g., TP4 and SAPA, see Haslum and Geffner, 2001; Do and Kambhampati, 2001). Theorem 5 holds for both architectures but has strong computational implications for state-space planners because limiting attention to a subset of decision epochs can speed these planners. The theorem also shows that planners like SAPA and Prottle (Little, Aberdeen, & Thiebaux, 2005) are incomplete.
Fortunately, an assumption restricts the set of decision epochs considerably.

Definition 3 An action is a TGP-style action⁴ if all of the following hold:

• The effects are realized at some unknown point during action execution, and thus can be used only once the action has completed.

• The preconditions must hold at the beginning of the action.

• The preconditions (and the features on which its transition function is conditioned) must not be changed during the action's execution, except by an effect of the action itself.

4. While the original TGP (Smith & Weld, 1999) considered only deterministic actions of fixed duration, we use the phrase "TGP-style" in a more general way, without these restrictions.

Thus, two TGP-style actions may not execute concurrently if they clobber each other's preconditions or effects. For the case of TGP-style actions, the set of happenings is simply the set of time points when some action terminates, and the TGP pivots are the set of points when an action might terminate. (Of course, both these sets additionally include zero.)

Theorem 6 If all actions are TGP-style, then the set of decision epochs may be restricted to pivots without sacrificing completeness or optimality.

Proof Sketch: By contradiction. Suppose that no optimal policy satisfies the theorem; then there must exist a path through the optimal policy in which one must start an action, a, at time t even though there is no action which could have terminated at t. Since the planner hasn't gained any information at t, a case analysis (which requires actions to be TGP-style) shows that one could have started a earlier in the execution path without increasing the make-span. The detailed proof is discussed in the Appendix. □

In the case of deterministic durations, the set of happenings is the same as the set of pivots; hence the following corollary holds:

Corollary 7 If all actions are TGP-style with deterministic durations, then the set of decision epochs may be restricted to happenings without sacrificing completeness or optimality.

Figure 7: Pivot decision epochs are necessary for optimal planning in the face of nonmonotonic continuation. In this domain, Goal can be achieved by {a0, a1}, by a2, or by b0; a0 has duration 2 or 9 (each with probability 0.5); and b0 is mutex with a1. The optimal policy starts a0 and then, if a0 does not finish at time 2, it starts b0 (otherwise it starts a1).

When planning with uncertain durations there may be a huge number of pivots; it is useful to further constrain the range of decision epochs.

Definition 4 An action has independent duration if there is no correlation between its probabilistic effects and its duration.

Definition 5 An action has monotonic continuation if the expected time until action termination is nonincreasing during execution.

Actions without probabilistic effects, by nature, have independent duration. Actions with monotonic continuations are common, e.g., those with uniform, exponential, Gaussian, and many other duration distributions. However, actions with bimodal or multi-modal distributions don't have monotonic continuations. For example, an action whose duration is uniformly distributed over the integers 1 to 3 has a monotonic continuation: the expected time until completion, evaluated at times 0, 1, and 2 (given that the action has not yet terminated), is 2, 1.5, and 1 respectively, which is monotonically decreasing. For an example of non-monotonic continuation see Figure 18.
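The monotonic-continuation property of Definition 5 is easy to check for a discrete duration distribution. The following is a small illustrative sketch, not from the paper; the uniform example reproduces the 2, 1.5, 1 sequence from the text, and the bimodal distribution is the duration of a0 in Figure 7.

```python
# Expected remaining time E[T - t | T > t] for a discrete duration
# distribution, used to check the monotonic-continuation property
# (Definition 5). Distributions are dicts {duration: probability}.

def expected_remaining(dist, t):
    """Expected time until completion at time t, given no termination yet."""
    alive = {d: p for d, p in dist.items() if d > t}
    total = sum(alive.values())
    return sum(p * (d - t) for d, p in alive.items()) / total

def has_monotonic_continuation(dist):
    """True if the expected remaining time never increases while executing."""
    horizon = max(dist)
    values = [expected_remaining(dist, t) for t in range(horizon)]
    return all(b <= a for a, b in zip(values, values[1:]))

uniform = {1: 1/3, 2: 1/3, 3: 1/3}      # the uniform example from the text
bimodal = {2: 0.5, 9: 0.5}              # duration of a0 in Figure 7
print([expected_remaining(uniform, t) for t in range(3)])  # [2.0, 1.5, 1.0]
print(has_monotonic_continuation(uniform))                 # True
print([expected_remaining(bimodal, t) for t in range(3)])  # [5.5, 4.5, 7.0]
print(has_monotonic_continuation(bimodal))                 # False
```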
Conjecture 8 If all actions are TGP-style, have independent duration and monotonic continuation, then the set of decision epochs may be restricted to happenings without sacrificing completeness or optimality.

If an action's continuation is nonmonotonic, then failure to terminate can increase the expected time remaining and cause another sub-plan to be preferred (see Figure 7). Similarly, if an action's duration isn't independent, then failure to terminate changes the probability of its eventual effects, and this may prompt new actions to be started.

By exploiting these theorems and this conjecture we may significantly speed planning, since we are able to limit the number of decision epochs needed for decision-making. We use this theoretical understanding in our models. First, for simplicity, we consider only the case of TGP-style actions with deterministic durations. In Section 6, we relax this restriction by allowing stochastic durations, both unimodal as well as multimodal.

Figure 8: A sample execution demonstrating a conflict due to interfering preconditions and effects. (The actions are shaded to disambiguate them from preconditions and effects.)

5. Temporal Planning with Deterministic Durations

We use the abbreviation CPTP (short for Concurrent Probabilistic Temporal Planning) to refer to the probabilistic planning problem with durative actions. A CPTP problem has an input model similar to that of CoMDPs except that action costs, C(s, a, s′), are replaced by their deterministic durations, Δ(a), i.e., the input is of the form ⟨S, A, Pr, Δ, G, s0⟩. We study the objective of minimizing the expected time (make-span) of reaching a goal. For the rest of the paper we make the following assumptions:

Assumption 3 All action durations are integer-valued.

This assumption has a negligible effect on expressiveness, because one can convert a problem with rational durations into one that satisfies Assumption 3 by scaling all durations by the l.c.m. of the denominators. In case of irrational durations, one can always find an arbitrarily close approximation to the original problem by approximating the irrational durations with rational numbers.

For the reasons discussed in the previous section we adopt the TGP temporal action model of Smith and Weld (1999), rather than the more complex PDDL2.1 (Fox & Long, 2003). Specifically:

Assumption 4 All actions follow the TGP model.

These restrictions are consistent with our previous definition of concurrency. Specifically, the mutex definitions (of CoMDPs over probabilistic STRIPS) hold and are required under these assumptions. As an illustration, consider Figure 8. It describes a situation in which two actions with interfering preconditions and effects cannot be executed concurrently. To see why not, suppose initially p12 was false and the two actions toggle-x1 and toggle-p12 were started at times 2 and 4, respectively. As ¬p12 is a precondition of toggle-x1, whose duration is 5, p12 needs to remain false until time 7. But toggle-p12 may produce its effects anytime between 4 and 9, which may conflict with the preconditions of the other executing action. Hence, we forbid the concurrent execution of toggle-x1 and toggle-p12 to ensure a completely predictable outcome distribution. Because of this definition of concurrency, the dynamics of our model remains consistent with Equation 5. Thus the techniques developed for CoMDPs derived from probabilistic STRIPS actions may be used.
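A small sketch of this concurrency restriction for TGP-style actions with known start times: two actions whose execution windows overlap must be non-mutex. The mutex test is assumed to be the pairwise test of Section 3.2.1, and the schedule shown is the Figure 8 scenario; names are illustrative.

```python
# Under the TGP model, any two actions whose execution windows overlap in
# time must be non-mutex; otherwise the schedule is rejected.

def overlaps(start_a, dur_a, start_b, dur_b):
    """True if the half-open execution windows [start, start+dur) intersect."""
    return start_a < start_b + dur_b and start_b < start_a + dur_a

def may_execute(schedule, mutex):
    """schedule: list of (action, start, duration). Checks pairwise safety."""
    for i, (a, sa, da) in enumerate(schedule):
        for b, sb, db in schedule[i + 1:]:
            if overlaps(sa, da, sb, db) and mutex(a, b):
                return False
    return True

# The Figure 8 scenario: toggle-x1 at time 2 (duration 5), toggle-p12 at
# time 4 (duration 5). They overlap and are mutex, so the schedule is rejected.
mutex_pairs = {frozenset({"toggle-x1", "toggle-p12"})}

def mutex(a, b):
    return frozenset({a, b}) in mutex_pairs

print(may_execute([("toggle-x1", 2, 5), ("toggle-p12", 4, 5)], mutex))  # False
```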
[Figure 9 shows two execution timelines: "An Aligned Epoch policy execution (takes 9 units)" and "An Interwoven Epoch policy execution (takes 5 units)".]

Figure 9: Comparison of the times taken in a sample execution of an interwoven-epoch policy and an aligned-epoch policy. In both trajectories the toggle-x3 (t3) action fails four times before succeeding. Because the aligned policy must wait for all actions to complete before starting any more, it takes more time than the interwoven policy, which can start more actions in the middle.

5.1 Formulation as a CoMDP

We can model a CPTP problem as a CoMDP, and thus as an MDP, in more than one way. We list the two prominent formulations below. Our first formulation, the aligned-epoch CoMDP, models the problem approximately but solves it quickly. The second formulation, interwoven epochs, models the problem exactly but results in a larger state space and hence takes longer to solve using existing techniques. In subsequent subsections we explore ways to speed up policy construction for the interwoven-epoch formulation.

5.1.1 Aligned Epoch Search Space

A simple way to formulate CPTP is to model it as a standard CoMDP over probabilistic STRIPS, in which action costs are set to their durations and the cost of a combination is the maximum duration of the constituent actions (as in Equation 6). This formulation introduces a substantial approximation to the CPTP problem. While this is true for deterministic domains too, we illustrate it using our example involving stochastic effects. Figure 9 compares two trajectories in which the toggle-x3 (t3) action fails four consecutive times before succeeding. In the figure, "f" and "s" denote failure and success of uncertain actions, respectively. The vertical dashed lines represent the time-points when an action is started.

Consider the actual executions of the resulting policies. In the aligned-epoch case (Figure 9, top), once a combination of actions is started in a state, the next decision can be taken only when the effects of all the actions have been observed (hence the name aligned epochs). In contrast, Figure 9 (bottom) shows that at a decision epoch in the optimal execution for a CPTP problem, many actions may be midway through their execution. We have to explicitly take into account these actions and their remaining execution times when making a subsequent decision. Thus, the actual state space for CPTP decision making is substantially different from that of the simple aligned-epoch model.

Note that due to Corollary 7 it is sufficient to consider a new decision epoch only at a happening, i.e., a time-point when one or more actions complete. Thus, using Assumption 3 we infer that these decision epochs will be discrete (integer). Of course, not all optimal policies will have this property. But it is easy to see that there exists at least one optimal policy in which each action begins at a happening. Hence our search space reduces considerably.

State variables: x1, x2, x3, x4, p12

Action       Δ(a)   Precondition   Effect        Probability
toggle-x1    5      ¬p12           x1 ← ¬x1      1
toggle-x2    5      p12            x2 ← ¬x2      1
toggle-x3    1      true           x3 ← ¬x3      0.9
                                   no change     0.1
toggle-x4    1      true           x4 ← ¬x4      0.9
                                   no change     0.1
toggle-p12   5      true           p12 ← ¬p12    1

Goal: x1 = 1, x2 = 1, x3 = 1, x4 = 1

Figure 10: The domain of Example 1 extended with action durations.
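As a concrete illustration of the aligned-epoch formulation just described, the cost charged for a combination is simply the duration of its longest action (Equation 6 with the resource term dropped, since the CPTP objective is pure make-span). The sketch below uses the durations of Figure 10; the dictionary name is illustrative.

```python
# Aligned-epoch approximation: durations become costs, and the cost of a
# combination is the duration of its longest action. Durations from Figure 10.

DURATION = {"toggle-x1": 5, "toggle-x2": 5, "toggle-x3": 1,
            "toggle-x4": 1, "toggle-p12": 5}

def aligned_epoch_cost(combination):
    """Cost of starting `combination` and waiting for all of it to finish."""
    return max(DURATION[a] for a in combination)

print(aligned_epoch_cost({"toggle-x1", "toggle-x3"}))  # 5: t3 idles while x1 runs
```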
5.1.2 Interwoven Epoch Search Space

We adapt the search space representation of Haslum and Geffner (2001), which is similar to that used in other research (Bacchus & Ady, 2001; Do & Kambhampati, 2001). Our original state space S of Section 2 is augmented by including the set of actions currently executing and the times elapsed since they were started. Formally, let the new interwoven state⁵ s ∈ S_↔ be an ordered pair ⟨X, Y⟩ where:

• X ∈ S
• Y = {(a, δ) | a ∈ A, 0 ≤ δ < Δ(a)}

Here X represents the values of the state variables (i.e., X is a state in the original state space) and Y denotes the set of ongoing actions a together with the times δ elapsed since their start. Thus the overall interwoven-epoch search space is

S_↔ = S × ∏_{a∈A} ({a} × Z_Δ(a)),

where Z_Δ(a) represents the set {0, 1, ..., Δ(a) − 1} and ∏ denotes the Cartesian product over multiple sets. Also define A_s to be the set of actions already in execution in s; in other words, A_s is the projection of Y that ignores the elapsed execution times:

A_s = {a | (a, δ) ∈ Y ∧ s = ⟨X, Y⟩}

5. We use the subscript ↔ to denote the interwoven state space (S_↔), value function (J_↔), etc.

Example: Continuing our example with the domain of Figure 10, suppose state s1 has all state variables false, and suppose the action toggle-x1 was started 3 units before the current time. Such a state is represented as ⟨X1, Y1⟩ with X1 = (F, F, F, F, F) and Y1 = {(toggle-x1, 3)} (the five state variables are listed in the order x1, x2, x3, x4, p12). The set A_{s1} is {toggle-x1}.

To allow the possibility of simply waiting for some action to complete execution, that is, deciding at a decision epoch not to start any additional action, we augment the set A with a no-op action, which is applicable in all states s = ⟨X, Y⟩ where Y ≠ ∅ (i.e., states in which some action is still being executed). For a state s, the no-op action is mutex with all non-executing actions, i.e., those in A \ A_s. In other words, at any decision epoch either a no-op will be started or some combination not involving no-op. We define no-op to have a variable duration⁶ equal to the time after which another already executing action completes (δ_next(s, A), as defined below). The interwoven applicability set can be defined as:

Ap_↔(s) = Ap(X)   if Y = ∅
Ap_↔(s) = {noop} ∪ {A | A ∪ A_s ∈ Ap(X) and A ∩ A_s = ∅}   otherwise

6. A precise definition of the model would create multiple no-op actions with different constant durations t; the no-op applicable in an interwoven state is the one with t = δ_next(s, A).

Transition Function: We also need to define the probability transition function, Pr_↔, for the interwoven state space. At some decision epoch let the agent be in state s = ⟨X, Y⟩, and suppose the agent decides to execute an action combination A. Define Y_new as the set analogous to Y but consisting of the actions just starting; formally, Y_new = {(a, 0) | a ∈ A}. In this system, the next decision epoch is the next time at which an executing action terminates. Let us call this time δ_next(s, A). Notice that δ_next(s, A) depends on both the executing and the newly started actions. Formally,

δ_next(s, A) = min_{(a,δ) ∈ Y ∪ Y_new} (Δ(a) − δ)

Moreover, multiple actions may complete simultaneously. Define A_next(s, A) ⊆ A ∪ A_s to be the set of actions that will complete exactly δ_next(s, A) time-steps from now. The Y-component of the state at the decision epoch δ_next(s, A) ahead is

Y_next(s, A) = {(a, δ + δ_next(s, A)) | (a, δ) ∈ Y ∪ Y_new, Δ(a) − δ > δ_next(s, A)}
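The following sketch shows this bookkeeping for the deterministic-duration case; the tuple encoding of Y and the DURATION table are illustrative assumptions.

```python
# delta_next, A_next and Y_next for deterministic durations (Section 5.1.2).
DURATION = {"toggle-x1": 5, "toggle-x4": 1}   # Delta(a) for the actions used below

def delta_next(Y, combo):
    """Time until the first executing (or newly started) action completes."""
    running = list(Y) + [(a, 0) for a in combo]      # Y_new: just-started actions, elapsed 0
    return min(DURATION[a] - d for (a, d) in running)

def next_components(Y, combo):
    """Return (delta_next, A_next, Y_next) for state component Y and new combination combo."""
    running = list(Y) + [(a, 0) for a in combo]
    step = delta_next(Y, combo)
    A_next = {a for (a, d) in running if DURATION[a] - d == step}
    Y_next = {(a, d + step) for (a, d) in running if DURATION[a] - d > step}
    return step, A_next, Y_next

# The running example: toggle-x1 started 3 units ago, toggle-x4 started now.
print(next_components({("toggle-x1", 3)}, {"toggle-x4"}))
# -> (1, {'toggle-x4'}, {('toggle-x1', 4)})
```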
Let s = ⟨X, Y⟩ and s' = ⟨X', Y'⟩. The transition function for CPTP can now be defined as:

Pr_↔(s' | s, A) = Pr(X' | X, A_next(s, A))   if Y' = Y_next(s, A)
Pr_↔(s' | s, A) = 0                          otherwise

In other words, executing an action combination A in state s = ⟨X, Y⟩ takes the agent to a decision epoch δ_next(s, A) ahead in time, specifically to the first time at which some combination A_next(s, A) completes. This lets us calculate Y_next(s, A): the new set of actions still executing, together with their elapsed times. Also, because of TGP-style actions, the probability distributions of the different state variables are modified independently. Thus the probability transition function of the CoMDP over probabilistic STRIPS can be used to determine the new distribution over state variables, as if the combination A_next(s, A) were taken in state X.

Example: Continuing with the previous example, let the agent in state s1 execute the action combination A = {toggle-x4}. Then δ_next(s1, A) = 1, since toggle-x4 will finish first. Thus A_next(s1, A) = {toggle-x4} and Y_next(s1, A) = {(toggle-x1, 4)}. Hence, the probability distribution over states after executing the combination A in state s1 is

• ⟨(F, F, F, T, F), Y_next(s1, A)⟩ with probability 0.9
• ⟨(F, F, F, F, F), Y_next(s1, A)⟩ with probability 0.1

Start and Goal States: In the interwoven space, the start state is ⟨s0, ∅⟩ and the new set of goal states is G_↔ = {⟨X, ∅⟩ | X ∈ G}.

By redefining the start and goal states, the applicability function, and the probability transition function, we have finished modeling a CPTP problem as a CoMDP in the interwoven state space. Now we can use the techniques of CoMDPs (and MDPs as well) to solve our problem. In particular, we can use the Bellman equations described below.

Bellman Equations: The set of equations for the solution of a CPTP problem can be written as:

J*_↔(s) = 0,   if s ∈ G_↔; else
J*_↔(s) = min_{A ∈ Ap_↔(s)} { δ_next(s, A) + Σ_{s' ∈ S_↔} Pr_↔(s' | s, A) J*_↔(s') }      (11)

We will use DUR_samp to refer to the sampled RTDP algorithm over this search space. The main bottleneck in naively inheriting algorithms like DUR_samp is the huge size of the interwoven state space. In the worst case (when all actions can be executed concurrently) the size of the state space is |S| × ∏_{a∈A} Δ(a). We get this bound by observing that for each action a there are Δ(a) possibilities: either a is not executing, or it is executing with remaining time 1, 2, ..., Δ(a) − 1. Thus we need to reduce or abstract/aggregate our state space in order to make the problem tractable. We now present several heuristics which can be used to speed the search.
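To illustrate how Equation 11 drives a single RTDP-style backup over the interwoven space, here is a hedged sketch; the callbacks applicable, delta_next and transitions (the latter returning (probability, successor) pairs from Pr_↔) are assumed interfaces, not the paper's code.

```python
# One Bellman backup over the interwoven space (Equation 11).
def bellman_backup(s, J, applicable, delta_next, transitions, goals, heuristic):
    """J is a dict of state values, seeded lazily with an (ideally admissible) heuristic."""
    if s in goals:
        J[s] = 0.0
        return 0.0, None
    best_value, best_combo = float("inf"), None
    for A in applicable(s):
        q = delta_next(s, A) + sum(p * J.get(t, heuristic(t)) for (p, t) in transitions(s, A))
        if q < best_value:
            best_value, best_combo = q, A
    J[s] = best_value
    return best_value, best_combo
```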
5.2 Heuristics

We present both an admissible and an inadmissible heuristic that can be used as the initial cost function for the DUR_samp algorithm. The first heuristic (maximum concurrency) solves the underlying serial MDP and is thus quite efficient to compute. The second heuristic (average concurrency) is inadmissible, but tends to be more informed than the maximum concurrency heuristic.

5.2.1 Maximum Concurrency Heuristic

We prove that the optimal expected cost in a traditional (serial) MDP, divided by the maximum number of actions that can be executed in parallel, is a lower bound on the expected make-span of reaching a goal in a CPTP problem.

Let J(X) denote the value of a state X ∈ S in a traditional MDP with the cost of each action equal to its duration. Let Q(X, A) denote the expected cost to reach the goal if initially all actions in the combination A are executed and the greedy serial policy is followed thereafter; formally, Q(X, A) = Σ_{X'∈S} Pr(X' | X, A) J(X'). Let J_↔(s) be the value of the equivalent CPTP problem, with s a state in our interwoven-epoch state space. Let the concurrency of a state be the maximum number of actions that could be executed concurrently in that state. We define the maximum concurrency of a domain, c, as the maximum number of actions that can be executed concurrently in any world state of the domain. The following theorem can be used to provide an admissible heuristic for CPTP problems.

Theorem 9 Let s = ⟨X, Y⟩. Then

J*_↔(s) ≥ J*(X) / c          for Y = ∅
J*_↔(s) ≥ Q*(X, A_s) / c     for Y ≠ ∅          (12)

Proof Sketch: Consider any trajectory of make-span L (from a state s = ⟨X, ∅⟩ to a goal state) in a CPTP problem using its optimal policy. We can serialize all concurrent actions by executing them in the chronological order in which they were started. As all concurrent actions are non-interacting, the outcomes at each stage have the same probabilities. The maximum make-span of this sequential trajectory is cL (assuming c actions executing at every point of the semi-MDP trajectory). Hence J(X) under this (possibly non-stationary) policy is at most c J*_↔(s). Thus J*(X) ≤ c J*_↔(s). The second inequality can be proven in a similar way. □

There are cases where these bounds are tight. For example, consider a deterministic planning problem in which the optimal plan concurrently executes c actions, each of unit duration (make-span = 1). In the sequential version, the same actions would be taken sequentially (make-span = c).

Following this theorem, the maximum concurrency (MC) heuristic for a state s = ⟨X, Y⟩ is defined as follows:

H_MC(s) = J*(X) / c          if Y = ∅
H_MC(s) = Q*(X, A_s) / c     otherwise

The maximum concurrency c can be calculated by a static analysis of the domain and is a one-time expense. The complete heuristic function could be evaluated by solving the MDP for all states. However, many of these states may never be visited. In our implementation, we do this calculation on demand, as more states are visited, by starting the MDP from the current state. Each RTDP run is seeded with the previous value function, so no computation is thrown away and only the relevant part of the state space is explored. We refer to DUR_samp initialized with the MC heuristic as DUR^MC_samp.

5.2.2 Average Concurrency Heuristic

Instead of using the maximum concurrency c in the above heuristic, we use the average concurrency in the domain (c_a) to get the average concurrency (AC) heuristic. We call the resulting algorithm DUR^AC_samp. The AC heuristic is not admissible, but in our experiments it is typically more informed. Moreover, in the case where all actions have the same duration, the AC heuristic equals the MC heuristic.
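Both heuristics share the same form and differ only in the concurrency constant used; the sketch below assumes J_serial and Q_serial are (on-demand) value tables for the serial MDP whose action costs are durations.

```python
# MC and AC heuristics of Section 5.2 (illustrative data structures).
def h_concurrency(state, J_serial, Q_serial, concurrency):
    """Use concurrency = c for the admissible MC heuristic, or concurrency = c_a
    (the domain's average concurrency) for the more informed, inadmissible AC heuristic."""
    X, Y = state
    if not Y:                               # no action currently in execution
        return J_serial[X] / concurrency
    A_s = frozenset(a for (a, _) in Y)      # executing actions, elapsed times ignored
    return Q_serial[(X, A_s)] / concurrency
```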
5.3 Hybridized Algorithm

We present an approximate method for solving CPTP problems. While many kinds of approximations are possible, our technique exploits the intuition that it is best to focus computation on the most probable branches in the current policy's reachable space. The danger of this approach is the chance that, during execution, the agent ends up in an unlikely branch which has been poorly explored; indeed, it might blunder into a dead-end in such a case. This is undesirable, because such an apparently attractive policy might have a true expected make-span of infinity. Since we wish to avoid dead-ends, we formalize the desirable notion of propriety.

Definition 6 Propriety: A policy is proper at a state if it is guaranteed to lead, eventually, to the goal state, i.e., it avoids all dead-ends and cycles (Barto et al., 1995). We call a planning algorithm proper if it always produces a policy that is proper (when one exists) for the initial state.

We now describe an anytime approximation algorithm which quickly generates a proper policy and uses any additional available computation time to improve the policy, focusing on the most likely trajectories.

5.3.1 Hybridized Planner

Our algorithm, DUR_hyb, is created by hybridizing two other policy creation algorithms. Indeed, our notion of hybridization is both general and powerful, applying to many MDP-like problems; however, in this paper we focus on its use for CPTP. Hybridization uses an anytime algorithm like RTDP to create a policy for the frequently visited states, and uses a faster (and presumably suboptimal) algorithm for the infrequent states. For the case of CPTP, our algorithm hybridizes the RTDP algorithms for the interwoven-epoch and aligned-epoch models. With aligned epochs, RTDP converges relatively quickly, because the state space is smaller, but the resulting policy is suboptimal for the CPTP problem, because it waits for all currently executing actions to terminate before starting any new ones. In contrast, RTDP for interwoven epochs generates the optimal policy, but takes much longer to converge. Our insight is to run RTDP on the interwoven space long enough to generate a policy which is good on the common states, but to stop well before it converges on every state. Then, to ensure that the rarely explored states have a proper policy, we substitute the aligned policy, returning this hybridized policy.

Algorithm 3 Hybridized Algorithm DUR_hyb(r, k, m)
1: for all s ∈ S_↔ do
2:   initialize J_↔(s) with an admissible heuristic
3: repeat
4:   perform m RTDP trials
5:   compute the hybridized policy (π_hyb) using the interwoven-epoch policy for k-familiar states and the aligned-epoch policy otherwise
6:   clean π_hyb by removing all dead-ends and cycles
7:   J^π_↔(⟨s0, ∅⟩) ← evaluation of π_hyb from the start state
8: until (J^π_↔(⟨s0, ∅⟩) − J_↔(⟨s0, ∅⟩)) / J_↔(⟨s0, ∅⟩) < r
9: return hybridized policy π_hyb

The key question is how to decide which states are well explored and which are not. We define the familiarity of a state s as the number of times it has been visited in previous RTDP trials. Any reachable state whose familiarity is less than a constant, k, has an aligned policy created for it. Furthermore, if a dead-end state is reached using the greedy interwoven policy, then we create an aligned policy for the immediate precursors of that state. If a cycle is detected⁷, then we compute an aligned policy for all the states which are part of the cycle.

7. In our implementation cycles are detected using simulation.
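A sketch of the policy-assembly step (line 5 of Algorithm 3) and of the termination test (line 8) is given below; the visit counter and the policy dictionaries are illustrative assumptions.

```python
# Hybrid policy assembly and termination test for Algorithm 3.
def hybrid_policy(reachable_states, visits, pi_inter, pi_aligned, k):
    pi_hyb = {}
    for s in reachable_states:                # s = (X, Y)
        if visits.get(s, 0) >= k and s in pi_inter:
            pi_hyb[s] = pi_inter[s]           # k-familiar: use the interwoven greedy policy
        else:
            # Rarely visited: fall back to the proper aligned-epoch policy, which waits
            # for running actions to finish and then acts on the world-state component X.
            pi_hyb[s] = pi_aligned[s[0]]
    return pi_hyb

def terminated(J_pi_start, J_lb_start, r):
    """Stop when the simulated make-span of pi_hyb and the admissible lower bound
    at the start state are within the optimality ratio r."""
    return (J_pi_start - J_lb_start) / J_lb_start < r
```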
We have not yet said how the hybridized algorithm terminates. The use of RTDP lets us define a very simple termination condition, with a parameter that can be varied to achieve the desired closeness to optimality. The intuition is simple. Consider, first, optimal labeled RTDP. It starts with an admissible heuristic and guarantees that the value of the start state, J_↔(⟨s0, ∅⟩), remains admissible (thus less than or equal to the optimal). In contrast, the hybridized policy's make-span is always greater than or equal to the optimal. Thus, as time progresses, these two values approach the optimal make-span from opposite sides. Whenever they are within an optimality ratio (r), we know that the algorithm has found a solution which is close to optimal.

Finally, evaluation of the hybridized policy is done using simulation, which we perform after every m RTDP trials. Algorithm 3 summarizes the details of the algorithm. One can see that the combined policy is proper for two reasons: 1) if the policy at a state comes from the aligned policy, then it is proper because RTDP for the aligned-epoch model was run to convergence, and 2) for the rest of the states we have explicitly ensured that there are no cycles or dead-ends.

5.4 Experiments: Planning with Deterministic Durations

Continuing from Section 3.6, in this set of experiments we evaluate the various techniques for solving problems involving explicit deterministic durations. We compare the computation time and solution quality of five methods: interwoven Sampled RTDP with no heuristic (DUR_samp), with the maximum concurrency (DUR^MC_samp) and average concurrency (DUR^AC_samp) heuristics, the hybridized algorithm (DUR_hyb), and Sampled RTDP on the aligned-epoch model (DUR_AE). We test on our Rover, MachineShop and Artificial domains. We also use the Artificial domain to see whether the relative performance of the techniques varies with the amount of concurrency in the domain.

5.4.1 Experimental Setup

We modify the domains used in Section 3.6 by additionally including action durations. For the NASA Rover and MachineShop domains, we generate problems with 17-26 state variables and 12-18 actions, whose durations range between 1 and 20. The problems have between 15,000 and 700,000 reachable states in the interwoven-epoch state space, S_↔. We use the Artificial domain for control experiments that study the effect of the degree of parallelism. All the problems in this domain have 14 state variables, 17,000-40,000 reachable states, and action durations between 1 and 3.

We use our implementation of Sampled RTDP⁸ and implement both heuristics, maximum concurrency (H_MC) and average concurrency (H_AC), for the initialization of the value function. We calculate these heuristics on demand for the states visited, instead of computing the complete heuristic for the whole state space at once. We also implement the hybridized algorithm, with the initial value function set to the H_MC heuristic. The parameters r, k, and m are kept at 0.05, 100 and 500, respectively. We test each of these algorithms on a number of problem instances from the three domains, generated by varying the number of objects, the degree of parallelism, the durations of the actions, and the distance to the goal.

5.4.2 Comparison of Running Times

Figures 11(a, b) and 12(a) show the running times of the algorithms on different problems in the Rover, MachineShop and Artificial domains, respectively. The first three bars represent the base Sampled RTDP without any heuristic, with H_MC, and with H_AC, respectively.
The fourth bar represents the hybridized algorithm (using the H_MC heuristic) and the fifth bar the aligned-epoch Sampled RTDP with costs set to the maximum action duration. The white region in the fourth bar represents the time taken by the aligned-epoch RTDP computations within the hybridized algorithm. The error bars represent 95% confidence intervals on the running times. Note that the plots are on a log scale.

8. Note that policies returned by DUR_samp are not guaranteed to be optimal. Thus all the implemented algorithms are approximate. We can replace DUR_samp by pruned RTDP (DUR_prun) if optimality is desired.

Figure 11 (a, b): Running times (on a log scale) for the Rover domain (Rover11-Rover16) and the MachineShop domain (Mach11-Mach16), respectively. For each problem the five bars represent the times taken by DUR_samp (0), DUR^MC_samp (M), DUR^AC_samp (AC), DUR_hyb (H), and DUR_AE (AE), respectively. The white bar on DUR_hyb denotes the portion of time taken by the aligned-epoch RTDP.

                 Speedup compared with DUR_samp
Algos           Rover       Machineshop   Artificial   Average
DUR^MC_samp     3.016764    1.545418      1.071645     1.877942
DUR^AC_samp     3.585993    2.173809      1.950643     2.570148
DUR_hyb         10.53418    2.154863      16.53159     9.74021
DUR_AE          135.2841    16.42708      241.8623     131.1911

Table 2: The ratio of the time taken by S_↔ Sampled RTDP with no heuristics to that of each algorithm. Our heuristics produce 2-3x speedups. The hybridized algorithm produces about a 10x speedup. Aligned-epoch search produces a 100x speedup, but sacrifices solution quality.

We notice that DUR_AE solves the problems extremely quickly; this is natural since the aligned-epoch space is much smaller. Use of both H_MC and H_AC always speeds the search in the S_↔ model. Comparing the heuristics against each other, we find that the average concurrency heuristic is usually faster than maximum concurrency, presumably because H_AC is a more informed heuristic in practice, albeit an inadmissible one. We find a couple of cases in which H_AC does not perform better; this could be because it focuses the search in the wrong region, given its inadmissible nature.

For the Rover domain, the hybridized algorithm performs fastest. In fact, the speedups are dramatic compared to the other methods. In the other domains, the results are more comparable for small problems. However, for large problems in these two domains, the hybridized algorithm outperforms the others by a huge margin. In fact, for the largest problem in the Artificial domain, none of the heuristic-initialized planners is able to converge (within a day), and only DUR_hyb and DUR_AE converge to a solution.

Figure 12 (a, b): Comparison of the different algorithms (running times and solution quality, respectively) for the Artificial domain (Art11-Art17; the parenthesized numbers next to the problem names give the average number of applicable action combinations per state, ranging from 68 to 1023).
As the degree of parallelism increases, the problems become harder; the largest problem is solved only by DUR_hyb and DUR_AE. Table 2 shows the speedups obtained by the various algorithms relative to the basic DUR_samp. In the Rover and Artificial domains the speedups obtained by DUR_hyb and DUR_AE are much more prominent than in the Machineshop domain. Averaging over all domains, DUR_hyb produces a 10x speedup and DUR_AE more than a 100x speedup.

5.4.3 Comparison of Solution Quality

Figures 13(a, b) and 12(b) show the quality of the policies obtained by the same five methods on the same domains. We measure quality by simulating the generated policy across multiple trials and reporting the average time taken to reach the goal. We plot the ratio of the so-measured expected make-span to the optimal expected make-span.⁹ Table 3 presents the solution quality of each method, averaged over all problems in a domain. We note that the aligned-epoch policies usually yield significantly longer make-spans (e.g., 25% longer); thus one must sacrifice quality for their speedy policy construction. In contrast, the hybridized algorithm exacts only a small sacrifice in quality in exchange for its speed.

5.4.4 Variation with Concurrency

Figure 12(a) represents our attempt to see whether the relative performance of the algorithms changes with increasing concurrency. Along the top of the figure, next to the problem names, are numbers in brackets; these list the average number of applicable combinations per state, Avg_{s∈S_↔} |Ap(s)|, and range from 68 to 1023. Note that for the difficult problems with a lot of parallelism, DUR_samp slows dramatically, regardless of heuristic. In contrast, DUR_hyb is still able to quickly produce a policy, and at almost no loss in quality (Figure 12(b)).

9. In some large problems the optimal algorithm did not converge. For those, we take as optimal the best policy found in our runs.

Figure 13 (a, b): Comparison of the make-spans of the solutions found with the optimal (plotted as 1 on the y-axes) for the Rover (Rover11-Rover16) and Machineshop (Mach11-Mach16) domains, respectively. All algorithms except DUR_AE produce solutions quite close to the optimal.

                       Average Quality
Algos           Rover       Machineshop   Artificial   Average
DUR_samp        1.059625    1.065078      1.042561     1.055704
DUR^MC_samp     1.018405    1.062564      1.013465     1.031478
DUR^AC_samp     1.017141    1.046391      1.020523     1.028019
DUR_hyb         1.059349    1.075534      1.059201     1.064691
DUR_AE          1.257205    1.244862      1.254407     1.252158

Table 3: Overall solution quality (ratio of expected make-span to optimal) produced by all algorithms. Note that all algorithms except DUR_AE produce policies whose quality is quite close to optimal. On average DUR_AE produces make-spans that are about 125% of the optimal.

6. Optimal Planning with Uncertain Durations

We now extend the techniques of the previous section to the case in which action durations are not deterministic. As before, we consider TGP-style actions and a discrete temporal model. We assume independent durations and monotonic continuations, but Section 6.3 relaxes the latter assumption, extending our algorithms to handle multi-modal duration distributions. As before, we aim to minimize the expected time required to reach a goal.
6.1 Formulating as a CoMDP

We now formulate our planning problem as a CoMDP, as in Section 5.1. While some components of the CoMDP can be used directly from our work on deterministic durations, we need to recompute the transition function.

State Space: Both the aligned-epoch state space and the interwoven-epoch space, as defined in Section 5.1, are adequate for modeling this planning problem. To determine the size of the interwoven space, we replace the duration of an action by its maximum duration. Let Δ_M(a) denote the maximum time within which action a will complete. The overall interwoven-epoch search space is S_↔ = S × ∏_{a∈A} ({a} × Z_{Δ_M(a)}), where Z_{Δ_M(a)} represents the set {0, 1, ..., Δ_M(a) − 1} and ∏ denotes the Cartesian product over multiple sets.

Action Space: At any state we may apply a combination of actions, with the applicability function reflecting the fact that the combination is safe with respect to itself (and with respect to the already executing actions, in the case of the interwoven space), as in the previous sections.

While the previous state space and action space carry over to our problem, the transition function needs to change, since we must now take the uncertainty in durations into account.

Transition Function: Uncertain durations require significant changes to the probability transition function (Pr_↔) for the interwoven space, relative to the definitions of Section 5.1.2. Since our assumptions justify Conjecture 8, we need only consider happenings when choosing decision epochs.

Algorithm 4 ComputeTransitionFunc(s = ⟨X, Y⟩, A)
1: Y ← Y ∪ {(a, 0)} ∀a ∈ A
2: mintime ← min over (a, δ) ∈ Y of the minimum remaining time for a
3: maxtime ← min over (a, δ) ∈ Y of the maximum remaining time for a
4: for all integer t ∈ [mintime, maxtime] do
5:   A_t ← set of actions that could possibly terminate after t
6:   for all non-empty subsets Asub_t ⊆ A_t do
7:     p_c ← probability that exactly Asub_t terminates after t (see Equation 13)
8:     W ← {(X_t, p_w) | X_t is a world state; p_w is the probability that Asub_t terminates yielding X_t}
9:     for all (X_t, p_w) ∈ W do
10:      Y_t ← {(a, δ + t) | (a, δ) ∈ Y, a ∉ Asub_t}
11:      insert (⟨X_t, Y_t⟩, p_w × p_c) into output
12: return output

The computation of the transition function is described in Algorithm 4. Although the next decision epoch is determined by a happening, we still need to consider all pivots for the next-state calculation, since all of them are potential happenings. mintime is the minimum time at which an executing action could terminate; maxtime is the minimum time by which at least one action is guaranteed to terminate. For all times between mintime and maxtime we compute the possible combinations that could terminate then and the resulting next interwoven states. The probability p_c (line 7) may be computed using the following formula:

p_c = ∏_{(a,δ_a)∈Y, a∈Asub_t} P(a terminates at δ_a + t | a hasn't terminated by δ_a)
      × ∏_{(b,δ_b)∈Y, b∉Asub_t} P(b doesn't terminate at δ_b + t | b hasn't terminated by δ_b)      (13)

Considering all pivots makes the algorithm computationally intensive, because there may be many pivots, many action combinations may end at each one, and each combination has many possible outcomes. In our implementation, we cache the transition function so that we do not have to recompute this information for any state.
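The following sketch spells out Equation 13 for discrete duration distributions; the pmf dictionaries and the assumption that Y already contains the just-started actions (line 1 of Algorithm 4) are ours.

```python
# Equation 13: probability that exactly the subset `finishing` terminates t units from now.
def p_exactly_terminates(Y, finishing, t, dur_pmf):
    """Y: set of (action, elapsed) pairs, including just-started actions with elapsed 0.
    dur_pmf[a]: dict mapping each possible duration of action a to its probability."""
    prob = 1.0
    for (a, elapsed) in Y:
        pmf = dur_pmf[a]
        survived = sum(p for d, p in pmf.items() if d > elapsed)   # hasn't terminated yet
        ends_now = pmf.get(elapsed + t, 0.0) / survived            # conditional on survival
        prob *= ends_now if a in finishing else (1.0 - ends_now)
    return prob
```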
Start and Goal States: The start state and goal set that we developed for deterministic durations carry over unchanged when durations are stochastic. The start state is ⟨s0, ∅⟩ and the goal set is G_↔ = {⟨X, ∅⟩ | X ∈ G}.

Thus we have modeled our problem as a CoMDP in the interwoven state space: we have redefined the start and goal states and the probability transition function. Now we can use the techniques of CoMDPs to solve our problem. In particular, we can use the Bellman equations below.

Bellman Equations for the Interwoven-Epoch Space: Define δ_el(s, A, s') as the time that elapses between two interwoven states s and s' when combination A is executed in s. The set of equations for the solution of our problem can be written as:

J*_↔(s) = 0,   if s ∈ G_↔; else
J*_↔(s) = min_{A ∈ Ap_↔(s)} Σ_{s' ∈ S_↔} Pr_↔(s' | s, A) [ δ_el(s, A, s') + J*_↔(s') ]      (14)

Compare these equations with Equation 11. Besides the new transition function, there is one difference: the elapsed time is now inside the summation, because it also depends on the next interwoven state. Having modeled this problem as a CoMDP, we again use our algorithms of Section 5. We use ΔDUR to denote the family of algorithms for CPTP problems involving stochastic durations. The main bottleneck in solving these problems, besides the size of the interwoven state space, is the high branching factor.

6.1.1 Policy Construction: RTDP & Hybridized Planning

Since we have modeled our problem as a CoMDP in the new interwoven space, we may use pruned RTDP (ΔDUR_prun) and sampled RTDP (ΔDUR_samp) for policy construction. Since the cost function in our problem (δ_el) also depends on the current and next states, combo-skipping does not apply; thus ΔDUR_prun refers to RTDP with only combo-elimination. Furthermore, only small adaptations are necessary to incrementally compute the (admissible) maximum concurrency (MC) and the (more informed, but inadmissible) average concurrency (AC) heuristics. For example, for the serial MDP (on the right-hand side of Equation 12) we now compute the average duration of an action and use it as the action's cost.

Likewise, we can further speed planning by hybridizing (ΔDUR_hyb) the RTDP algorithms for the interwoven-epoch and aligned-epoch CoMDPs to produce a near-optimal policy in significantly less time. The dynamics of the aligned-epoch space is the same as in Section 5, with one exception. The cost of a combination, in the case of deterministic durations, was simply the maximum duration of the constituent actions. The novel twist is that uncertain durations require computing the cost of an action combination as the expected time at which the last action in the combination terminates. For example, suppose two actions, both with uniform duration distributions over [1,3], are started concurrently. The probabilities that both actions will have finished by time 1, 2 or 3 (and no earlier) are 1/9, 3/9 and 5/9, respectively. Thus the expected completion time of the combination (call it Δ_AE) is 1×1/9 + 2×3/9 + 3×5/9 ≈ 2.44.
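The aligned-epoch cost Δ_AE is simply the expectation of the maximum of the independent completion times; the sketch below checks the 2.44 figure, with illustrative pmf dictionaries.

```python
# Expected completion time of a combination under stochastic durations (Delta_AE).
from itertools import product

def expected_completion(pmfs):
    """pmfs: one {duration: probability} dict per concurrently started action."""
    exp = 0.0
    for outcome in product(*[list(p.items()) for p in pmfs]):
        prob = 1.0
        for (_, p) in outcome:
            prob *= p
        exp += prob * max(d for (d, _) in outcome)
    return exp

u13 = {1: 1/3, 2: 1/3, 3: 1/3}
print(round(expected_completion([u13, u13]), 2))   # -> 2.44
```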
6.2 Expected-Duration Planner

When modeled as a CoMDP in the full-blown interwoven space, stochastic durations cause an explosive growth in the branching factor. In general, if n actions are started, each with m possible durations and each having r probabilistic effects, then there are (m − 1)[(r + 1)^n − r^n − 1] + r^n potential successors. This number may be computed as follows: for each duration between 1 and m − 1, any subset of the actions could complete, and each completing action could result in r outcomes. Hence the total number of successors per duration is Σ_{i=1}^{n} C(n, i) r^i = (r + 1)^n − r^n − 1. Moreover, if none of the actions finishes until time m − 1, then at the last step all actions terminate, leading to r^n outcomes. So the total number of successors is (m − 1)[(r + 1)^n − r^n − 1] + r^n. Thus the branching factor is multiplicative in the duration uncertainty and exponential in the concurrency.

To manage this extravagant computation we must curb the branching factor. One method is to ignore the duration distributions: we can assign each action a constant duration equal to the mean of its distribution and then apply a deterministic-duration planner such as DUR_samp. However, when the deterministic-duration policy is executed in a setting where durations are actually stochastic, an action will likely terminate at a time different from its mean, expected duration. The ΔDUR_exp planner addresses this problem by augmenting the deterministic-duration policy to account for these unexpected outcomes.

6.2.1 Online Version

The procedure is easiest to understand in its online version (Algorithm 5): wait until the unexpected happens, pause execution, and re-plan. If the original estimate of an action's duration becomes implausible, we compute a revised deterministic estimate in terms of E_a(min), the expected value of a's duration given that it has not terminated by time min. Thus, E_a(0) is simply the expected duration of a.

Algorithm 5 Online ΔDUR_exp
1: build a deterministic-duration policy from the start state s0
2: repeat
3:   execute the action combination specified by the policy
4:   wait for interrupt
5:     case: action a terminated as expected {// do nothing}
6:     case: action a terminated early
7:       extend the policy from the current state
8:     case: action a didn't terminate as expected
9:       extend the policy from the current state, revising a's duration as follows:
10:        δ ← time elapsed since a started executing
11:        nextexp ← E_a(0)
12:        while nextexp < δ do
13:          nextexp ← E_a(nextexp)
14:        endwhile
15:        a's revised duration ← nextexp − δ
16:   endwait
17: until goal is reached

Example: Let the duration of an action a follow a uniform distribution between 1 and 15. The expected value assigned in the first run of the algorithm, E_a(0), is 8. While running the algorithm, suppose the action does not terminate by 8 and we reach a state where a has been running for, say, 9 time units. In that case, the revised expected duration for a is E_a(8) = 12. Similarly, if it does not terminate by 12 either, then the next expected duration is 14, and finally 15. In other words, for all states in which a has been executing for between 0 and 8 time units, it is expected to terminate at 8; for all times between 8 and 12 the expected completion is at 12; for 12 to 14 it is 14; and if it does not terminate at 14 then it is 15. □
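A short sketch of the E_a(·) computation used on lines 11-13 of Algorithm 5, for a discrete duration pmf (an illustrative representation); it reproduces the revision points 8, 12, 14 and 15 of the example.

```python
# E_a(elapsed): expected duration of a, conditioned on not having terminated by `elapsed`.
def expected_duration_given_survival(pmf, elapsed):
    tail = {d: p for d, p in pmf.items() if d > elapsed}
    mass = sum(tail.values())
    return sum(d * p for d, p in tail.items()) / mass

uniform_1_15 = {d: 1 / 15 for d in range(1, 16)}
for elapsed in (0, 8, 12, 14):
    print(elapsed, expected_duration_given_survival(uniform_1_15, elapsed))
# -> 8.0, 12.0, 14.0, 15.0
```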
6.2.2 Offline Version

The algorithm also has an offline version, in which re-planning for all contingencies is done ahead of time; for fairness we used this version in the experiments. Although the offline algorithm plans for all possible action durations, it is still much faster than the other algorithms. The reason is that each of the planning problems solved is now significantly smaller (smaller branching factor, smaller reachable state space), and all previous computation can be succinctly stored in the form of (interwoven state, value) pairs and thus reused. Algorithm 6 describes the offline planner and the subsequent example illustrates the savings.

Algorithm 6 Offline ΔDUR_exp
1: build a deterministic-duration policy from the start state s0; get current J_↔ and π_↔ values
2: insert s0 into the queue open
3: repeat
4:   state ← open.pop()
5:   for all currstate s.t. Pr_↔(currstate | state, π*_↔(state)) > 0 do
6:     if currstate is not a goal and currstate is not in the set visited then
7:       visited.insert(currstate)
8:       if J_↔(currstate) has not converged then
9:         if required, change the expected durations of the actions currently executing in currstate
10:        solve a deterministic-duration planning problem with the start state currstate
11:        insert currstate into the queue open
12: until open is empty

Line 9 of Algorithm 6 assigns a new expected duration to every action that is currently running in the current state and has not completed by its previously expected termination point. This reassignment follows the corresponding case in the online version (line 13).

Example: Consider a domain with two state variables, x1 and x2, and two actions, set-x1 and set-x2. The task is to set both variables (initially both are false). Assume that set-x2 always succeeds, whereas set-x1 succeeds with only 0.5 probability. Moreover, let both actions have a uniform duration distribution over 1, 2 or 3. In such a case a complete interwoven-epoch search could touch 36 interwoven states (each state variable could be true or false; each action could be "not running", "running for 1 unit", or "running for 2 units"). Instead, if we build a deterministic-duration policy, then each action's deterministic duration is 2, and so the states touched come from only 16 interwoven states (each action can now only be "not running" or "running for 1 unit").

Now, suppose the deterministic planner decides to execute both actions in the start state. Having committed to this combination, it is easy to see that certain states can never be reached. For example, the state ⟨(¬x1, ¬x2), {(set-x1, 2)}⟩ can never be visited, since once set-x2 completes it is guaranteed that x2 will be set. In fact, in our example, only 3 new states will initiate offline re-planning (line 10 of Algorithm 6), viz., ⟨(x1, ¬x2), {(set-x2, 2)}⟩, ⟨(¬x1, ¬x2), {(set-x2, 2)}⟩, and ⟨(¬x1, x2), {(set-x1, 2)}⟩. □

Figure 14: An example of a domain where the ΔDUR_exp algorithm does not compute an optimal solution. (The figure shows, on a timeline from 0 to 12, the problem, the optimal solution's two trajectories with probability 0.5 each and make-spans 9 and 5, and the ΔDUR_exp solution with make-span 8.)

6.2.3 Properties

Unfortunately, our ΔDUR_exp algorithm is not guaranteed to produce an optimal policy. How bad are the policies generated by the expected-duration planner? The experiments show that ΔDUR_exp typically generates policies which are extremely close to optimal. Even the worst-case pathological domain we are able to construct leads to an expected make-span which is only 50% longer than optimal (in the limit). This example is illustrated below.

Example: We consider a domain with actions A_{2:n}, B_{2:n}, C_{2:n} and D. Each A_i and B_i takes time 2^i. Each C_i has a probabilistic duration: with probability 0.5, C_i takes 1 unit of time, and with the remaining probability it takes 2^{i+1} + 1 units. Thus the expected duration of C_i is 2^i + 1. D takes 4 units. In sub-problem SP_i, the goal may be reached by executing A_i followed by B_i.
Alternatively, the goal may be reached by first executing C_i and then recursively solving the sub-problem SP_{i−1}. In this domain, the ΔDUR_exp algorithm will always compute A_i; B_i as the best solution. However, the optimal policy starts both {A_i, C_i}. If C_i terminates at 1, the policy executes the solution for SP_{i−1}; otherwise, it waits until A_i terminates and then executes B_i. Figure 14 illustrates the sub-problem SP_2, in which the optimal policy has an expected make-span of 7 (vs. ΔDUR_exp's make-span of 8). In general, the expected make-span of the optimal policy on SP_n is (1/3)[2^{n+2} − 2^{4−n}] + 2^{2−n} + 2. Thus, lim_{n→∞} exp/opt = 3/2. □

6.3 Multi-Modal Duration Distributions

The planners of the previous two sections benefited from considering the small set of happenings instead of all pivots, an approach licensed by Conjecture 8. Unfortunately, this simplification is not warranted in the case of actions with multi-modal duration distributions, which can be common in complex domains where all factors cannot be modeled explicitly. For example, the amount of time for a Mars rover to transmit data might have a bimodal distribution: normally it would take little time, but if a dust storm were in progress (unmodeled) it could take much longer. To handle these cases we model durations with a mixture of Gaussians, parameterized by triples ⟨amplitude, mean, variance⟩.

6.3.1 CoMDP Formulation

Although we cannot restrict decision epochs to happenings, we need not consider all pivots; they are required only for actions with multi-modal distributions. In fact, it suffices to consider pivots in regions of the distribution where the expected time to completion increases. In all other cases we need consider only happenings. Two changes are required to the transition function of Algorithm 4. In line 3, the maxtime computation now also considers the time until the next pivot in the increasing-remaining-time region of every action with a multi-modal distribution (thus forcing us to take a decision at those points, even when no action terminates). The other change (in line 6) allows the empty subset Asub_t for t = maxtime; that is, a next state is computed even without any action terminating. By making these changes to the transition function we reformulate our problem as a CoMDP in the interwoven space, and thus solve it using our previous methods: pruned/sampled RTDP, the hybridized algorithm, or the expected-duration algorithm.

6.3.2 Archetypal-Duration Planner

We also develop a multi-modal variation of the expected-duration planner, called ΔDUR_arch. Instead of assigning an action a single deterministic duration equal to its expected value, this planner assigns it a probabilistic duration whose outcomes are the means of the different modes of the distribution and whose probabilities are the probability mass in each mode. This enhancement reflects our intuitive understanding of multi-modal distributions, and the experiments confirm that ΔDUR_arch produces solutions with shorter make-spans than those of ΔDUR_exp.
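As a sketch of this construction (with our illustrative encoding of a mixture as (weight, mean, variance) triples, rounded to the paper's integer durations):

```python
# Archetypal durations (Section 6.3.2): one outcome per mode, weighted by the mode's mass.
def archetypal_durations(mixture):
    """mixture: list of (weight, mean, variance) triples whose weights sum to 1."""
    return [(max(1, round(mean)), weight) for (weight, mean, _var) in mixture]

# e.g., a bimodal 'fetch paint' duration: usually ~2 units, occasionally ~20 (out of stock).
fetch_paint = [(0.8, 2.0, 0.25), (0.2, 20.0, 4.0)]
print(archetypal_durations(fetch_paint))   # -> [(2, 0.8), (20, 0.2)]
```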
6.4 Experiments: Planning with Stochastic Durations

We now evaluate our techniques for solving planning problems involving stochastic durations. We compare the computation time and solution quality (make-span) of our five planners on domains with and without multi-modal duration distributions. We also re-evaluate the effectiveness of the maximum concurrency (MC) and average concurrency (AC) heuristics for these domains.

6.4.1 Experimental Setup

We modify our Rover, MachineShop and Artificial domains by additionally including uncertainty in the action durations. For this set of experiments, our largest problem has 4 million world states, of which 65,536 are reachable. Our algorithms explored up to 1,000,000 distinct states in the interwoven state space during planning. The domains contain as many as 18 actions, and some actions have as many as 13 possible durations. For more details on the domains please refer to the longer version (Mausam, 2007).

Figure 15: Planning time comparisons (in seconds) for the Rover and MachineShop domains (problems 21-30): variation across the algorithms ΔDUR_prun, ΔDUR_samp, ΔDUR_hyb and ΔDUR_exp when all are initialized with the average concurrency (AC) heuristic; ΔDUR_exp performs the best.

Algos        Rover    MachineShop   Artificial
ΔDUR_samp    1.001    1.000         1.001
ΔDUR_hyb     1.022    1.011         1.019
ΔDUR_exp     1.008    1.015         1.046

Table 4: All three planners produce near-optimal policies, as shown by this table of ratios to the optimal make-span.¹¹

11. If the optimal algorithm doesn't converge, we use the best solution found across all runs as "optimal".

6.4.2 Comparing Running Times

We compare all algorithms with and without heuristics and reaffirm that the heuristics significantly speed up the computation on all problems; indeed, some problems are too large to be solved without heuristics. Comparing the heuristics against each other, we find that AC beats MC regardless of the planning algorithm; this is not surprising, since AC sacrifices admissibility.

Figure 15 reports the running times of the various algorithms (initialized with the AC heuristic) on the Rover and MachineShop domains when all durations are unimodal. ΔDUR_exp outperforms the other planners by substantial margins. As this algorithm solves a comparatively simpler problem, fewer states are expanded, and thus the approximation scales better than the others, solving, for example, two MachineShop problems that were too large for most other planners. In most cases hybridization speeds planning by significant amounts, but it performs better than ΔDUR_exp only in the Artificial domain.

6.4.3 Comparing Solution Quality

We measure quality by simulating the generated policy across multiple trials. We report the ratio of the average expected make-span to the optimal expected make-span, for domains with all-unimodal distributions, in Table 4. We find that the make-spans obtained with the inadmissible heuristic AC are on par with those of the admissible heuristic MC.

Figure 16: Comparisons in the MachineShop domain with multi-modal distributions (problems 31-36). (a) Computation time comparisons (log scale): ΔDUR_exp and ΔDUR_arch perform much better than the other algorithms (ΔDUR_prun, ΔDUR_samp, ΔDUR_hyb). (b) Make-spans, J*(s0), returned by the different algorithms: solutions returned by ΔDUR_samp are almost optimal. Overall, ΔDUR_arch finds a good balance between running time and solution quality.

The hybridized planner is approximate with a user-defined bound.
In our experiments, we set the bound to 5% and find that the make-spans returned by the algorithm are quite close to the optimal, usually differing by less than the allowed 5%. ΔDUR_exp has no quality guarantees, yet the solutions it returned on the problems we tested are nearly as good as those of the other algorithms. Thus, we believe that this approximation will be quite useful in scaling to larger problems without losing solution quality.

6.4.4 Multimodal Domains

We develop multi-modal variants of our domains; e.g., in the MachineShop domain, the time for fetching paint is bimodal (if in stock, paint can be fetched quickly; otherwise it needs to be ordered). There is an alternative, but costly, paint action that does not require fetching paint. Solutions produced by ΔDUR_samp made use of pivots as decision epochs by starting the costly paint action in case the fetch action did not terminate within the first mode of the bimodal distribution (i.e., paint was out of stock).

The running time comparisons are shown in Figure 16(a) on a log scale. We find that ΔDUR_exp terminates extremely quickly and ΔDUR_arch is not far behind. However, the make-span comparisons in Figure 16(b) clearly illustrate the approximations made by these methods in order to achieve their planning times. ΔDUR_arch exhibits a good balance of planning time and solution quality.

7. Related Work

This paper extends our prior work, originally reported in several conference publications (Mausam & Weld, 2004, 2005, 2006a, 2006b).

Temporal planners may be classified as using constraint-posting or extended state-space methods (discussed earlier in Section 4). While the constraint approach is promising, few (if any) probabilistic planners have been implemented using this architecture; one exception is Buridan (Kushmerick, Hanks, & Weld, 1995), which performed poorly. In contrast, the MDP community has proven the state-space approach successful. Since the powerful deterministic temporal planners, which have won the various planning competitions, also use the state-space approach, we adopt it for our algorithms that combine temporal planning with MDPs. It may be interesting to incorporate constraint-based approaches in a probabilistic paradigm and compare against the techniques of this paper.

                      durative                                 non-durative
concurrent
  stochastic          ΔDUR, Tempastic, GSMDP,                  Concurrent MDP, Factorial MDP,
                      Prottle, FPG, Aberdeen et al.            Paragraph
  deterministic       Temporal Planning                        Step-optimal planning
                      (TP4, Sapa, MIPS, TLPlan, etc.)          (GraphPlan, SATPlan)
non-concurrent
  stochastic          Time Dependent MDP, IxTeT,               MDP (RTDP, LAO*, etc.)
                      CIRCA, Foss & Onder
  deterministic       Planning with Numerical Resources        Classical Planning (HSP, FF, etc.)
                      (Sapa, Metric-FF, CPT)

Figure 17: A table listing various planners that implement different subsets of concurrent, stochastic, durative actions.

7.1 Comparison with Semi-MDPs

A Semi-Markov Decision Process (semi-MDP) is an extension of MDPs that allows durative actions taking variable time. A discrete-time semi-MDP can be solved by solving a set of equations that is a direct extension of Equations 2, and the techniques for solving discrete-time semi-MDPs are natural generalizations of those for MDPs. The main distinction between a semi-MDP and our formulation of concurrent probabilistic temporal planning with stochastic durations concerns the presence of concurrently executing actions in our model. A semi-MDP does not allow for concurrent actions and assumes one executing action at a time.
By allowing concurrency in actions and intermediate decision epochs, our algorithms need to deal with large state and action spaces, which is not encountered by semi-MDPs. Furthermore, Younes and Simmons have shown that in the general case, semi-MDPs are incapable of modeling concurrency. A problem with concurrent actions and stochastic continuous durations needs another model known as Generalized Semi-Markov Decision Process (GSMDP) for a precise mathematical formulation (Younes & Simmons, 2004b). 7.2 Concurrency and Stochastic, Durative Actions Tempastic (Younes & Simmons, 2004a) uses a rich formalism (e.g. continuous time, exogenous events, and expressive goal language) to generate concurrent plans with stochastic durative actions. Tempastic uses a completely non-probabilistic planner to generate a plan which is treated as a candidate policy and repaired as failure points are identified. This method does not guarantee completeness or proximity to the optimal. Moreover, no attention was paid towards heuristics or search control making the implementation impractical. GSMDPs (Younes & Simmons, 2004b) extend continuous-time MDPs and semi-Markov MDPs, modeling asynchronous events and processes. Both of Younes and Simmons’s approaches handle 71 M AUSAM & W ELD a strictly more expressive model than ours due to their modeling of continuous time. They solve GSMDPs by approximation with a standard MDP using phase-type distributions. The approach is elegant, but its scalability to realistic problems is yet to be demonstrated. In particular, the approximate, discrete MDP model can require many states yet still behave very differently than the continuous original. Prottle (Little et al., 2005) also solves problems with an action language more expressive than ours: effects can occur in the middle of action execution and dependent durations are supported. Prottle uses an RTDP-type search guided by heuristics computed from a probabilistic planning graph; however, it plans for a finite horizon — and thus for an acyclic state space. It is difficult to compare Prottle with our approach because Prottle optimizes a different objective function (probability of reaching a goal), outputs a finite-length conditional plan as opposed to a cyclic plan or policy, and is not guaranteed to reach the goal. FPG (Aberdeen & Buffet, 2007) learns a separate neural network for each action individually based on the current state. In the execution phase the decision, i.e., whether an action needs to be executed or not, is taken independently of decisions regarding other actions. In this way FPG is able to effectively sidestep the blowup caused by exponential combinations of actions. In practice it is able to very quickly compute high quality solutions. Rohanimanesh and Mahadevan (2001) investigate concurrency in a hierarchical reinforcement learning framework, where abstract actions are represented by Markov options. They propose an algorithm based on value-iteration, but their focus is calculating joint termination conditions and rewards received, rather than speeding policy construction. Hence, they consider all possible Markov option combinations in a backup. Aberdeen et al. (2004) plan with concurrent, durative actions with deterministic durations in a specific military operations domain. They apply various domain-dependent heuristics to speed the search in an extended state space. 7.3 Concurrency and Stochastic, Non-durative Actions Meuleau et al. 
and Singh & Cohn deal with a special type of MDP (called a factorial MDP) that can be represented as a set of smaller weakly coupled MDPs — the separate MDPs are completely independent except for some common resource constraints, and the reward and cost models are purely additive (Meuleau, Hauskrecht, Kim, Peshkin, Kaelbling, Dean, & Boutilier, 1998; Singh & Cohn, 1998). They describe solutions in which these sub-MDPs are independently solved and the sub-policies are merged to create a global policy. Thus, concurrency of actions of different sub-MDPs is a by-product of their work. Singh & Cohn present an optimal algorithm (similar to combo-elimination used in DURprun ), whereas domain specific heuristics in Meuleau et al. have no such guarantees. All of the work in Factorial MDPs assumes that a weak coupling exists and has been identified, but factoring an MDP is a hard problem in itself. Paragraph (Little & Thiebaux, 2006) formulates the planning with concurrency as a regression search over the probabilistic planning graph. It uses techniques like nogood learning and mutex reasoning to speed policy construction. Guestrin et al. solve the multi-agent MDP problem by using a linear programming (LP) formulation and expressing the value function as a linear combination of basis functions. By assuming that these basis functions depend only on a few agents, they are able to reduce the size of the LP (Guestrin, Koller, & Parr, 2001). 72 P LANNING WITH D URATIVE ACTIONS IN S TOCHASTIC D OMAINS 7.4 Stochastic, Non-concurrent, Durative Actions Many researchers have studied planning with stochastic, durative actions in absence of concurrency. For example, Foss and Onder (2005) use simple temporal networks to generate plans in which the objective function has no time component. Simple Temporal Networks allow effective temporal constraint reasoning and their methods can generate temporally contingent plans. Boyan and Littman (2000) propose Time-dependent MDPs to model problems with actions that are not concurrent and have time-dependent, stochastic durations; their solution generates piecewise linear value functions. NASA researchers have developed techniques for generating non-concurrent plans with uncertain continuous durations using a greedy algorithm which incrementally adds branches to a straightline plan (Bresina et al., 2002; Dearden, Meuleau, Ramakrishnan, Smith, & Washington, 2003). While they handle continuous variables and uncertain continuous effects, their solution is heuristic and the quality of their policies is unknown. Also, since they consider only limited contingencies, their solutions are not guaranteed to reach the goal. IxTeT is a temporal planner that uses constraint based reasoning within partial order planning (Laborie & Ghallab, 1995). It embeds temporal properties of actions as constraints and does not optimize make-span. CIRCA is an example of a system that plans with uncertain durations where each action is associated with an unweighted set of durations (Musliner, Murphy, & Shin, 1991). 7.5 Deterministic, Concurrent, Durative Actions Planning with deterministic actions is a comparitively simpler problem and much of the work in planning under uncertainty is based on the previous, deterministic planning research. For instance, our interwoven state representation and transition function are extensions of the extended state representations in TP4, SAPA, and TLPlan (Haslum & Geffner, 2001; Do & Kambhampati, 2003; Bacchus & Ady, 2001). 
Other planners, like MIPS and AltAltp, have also investigated fast generation of parallel plans in deterministic settings (Edelkamp, 2003; Nigenda & Kambhampati, 2003), and Jensen and Veloso (2000) extend this to problems with disjunctive uncertainty.

8. Future Work

Having presented a comprehensive set of techniques to handle probabilistic outcomes, concurrent and durative actions in a single formalism, we now direct our attention to different relaxations and extensions of the proposed model. In particular, we explore other objective functions, infinite horizon problems, continuous-valued duration distributions, temporally expressive action models, degrees of goal satisfaction, and interruptibility of actions.

8.1 Extension to Other Cost Functions

For the planning problems with durative actions (Sections 4 and beyond) we focused on make-span minimization. However, our techniques are quite general and are applicable (directly or with minor variations) to a variety of cost metrics. As an illustration, consider the mixed cost optimization problem in which, in addition to the duration of each action, we are also given the amount of resource consumed per action, and we wish to minimize the sum of make-span and total resource usage. Assuming that the resource consumption is unaffected by concurrent execution, we can easily compute a new maximum concurrency heuristic. The mixed-cost counterpart of Equations 12 is:

J*_↔(s) ≥ J*_t(X)/c + J*_r(X)               for Y = ∅
J*_↔(s) ≥ Q*_t(X, A_s)/c + Q*_r(X, A_s)     for Y ≠ ∅          (15)

Here, J_t is the value function of the single-action MDP whose costs are durations, and J_r is that of the single-action MDP whose costs are resource consumptions. A more informed average concurrency heuristic can be computed similarly, by replacing the maximum concurrency with the average concurrency. The hybridized algorithm follows in the same fashion, with the fast algorithm being a CoMDP solved using the techniques of Section 3. Along the same lines, if the objective is to minimize make-span subject to a maximum resource usage, then the total amount of resource remaining can be included in the state space of all the CoMDPs and underlying single-action MDPs, and the same techniques may be used.

8.2 Infinite Horizon Problems

Until now this paper has defined the techniques for indefinite horizon problems, in which an absorbing goal state is defined and is reachable. For other problems an alternative formulation is preferred, which allows infinite execution but discounts future costs by multiplying them by a discount factor at each step. Again, our techniques can be suitably extended to such a scenario. For example, Theorem 2 is modified to the following:

Q(s, A) ≥ γ^{1−k} Q(s, {a1}) + C(A) − Σ_{i=1}^{k} γ^{i−k} C({ai})

Recall that this theorem provides us with the pruning rule combo-skipping. Thus, we can use Pruned RTDP with the new pruning rule.

8.3 Extensions to Continuous Duration Distributions

Until now we have confined ourselves to actions with discrete durations (refer to Assumption 3). We now investigate the effects of dealing directly with continuous uncertainty in the duration distributions. Let f_i^T(t)dt be the probability of action a_i completing between times t + T and t + T + dt, conditioned on a_i not finishing until time T. Similarly, define F_i^T(t) to be the probability of the action finishing after time t + T. Let us consider the extended state ⟨X, {(a1, T)}⟩, which denotes that action a1 started T units ago in the world state X.
8.3 Extensions to Continuous Duration Distributions

Until now we have confined ourselves to actions with discrete durations (refer to Assumption 3). We now investigate the effects of dealing directly with continuous uncertainty in the duration distributions. Let $f_i^T(t)\,dt$ be the probability of action $a_i$ completing between times $t + T$ and $t + T + dt$, conditioned on action $a_i$ not finishing until time $T$. Similarly, define $F_i^T(t)$ to be the probability of the action finishing after time $t + T$. Let us consider the extended state $\langle X, \{(a_1, T)\}\rangle$, which denotes that action $a_1$ started $T$ units ago in the world state $X$. Let $a_2$ be an applicable action that is started in this extended state. Define $M = \min(\Delta_M(a_1) - T, \Delta_M(a_2))$, where $\Delta_M$ denotes the maximum possible duration of execution for each action. Intuitively, $M$ is the time by which at least one action will complete. Then

$Q_{n+1}(\langle X, \{(a_1, T)\}\rangle, a_2) \;=\; \int_0^M f_1^T(t)\, F_2^0(t)\, \big[ t + J_n(\langle X_1, \{(a_2, t)\}\rangle) \big]\, dt \;+\; \int_0^M F_1^T(t)\, f_2^0(t)\, \big[ t + J_n(\langle X_2, \{(a_1, t + T)\}\rangle) \big]\, dt$    (16)

[Figure 18: three panels showing (a) the duration distribution of a0, (b) the expected remaining time for action a0, and (c) the expected time to reach the goal, each plotted against time. If durations are continuous (real-valued) rather than discrete, there may be an infinite number of potentially important decision epochs. In this domain, a crucial decision epoch could be required at any time in (0, 1], depending on the length of possible alternate plans.]

Here $X_1$ and $X_2$ are the world states obtained by applying the deterministic actions $a_1$ and $a_2$, respectively, to $X$. Recall that $J_{n+1}(s) = \min_a Q_{n+1}(s, a)$. For a fixed-point computation of this form, we desire that $J_{n+1}$ and $J_n$ have the same functional form (this idea has been exploited in order to plan with continuous resources; Feng, Dearden, Meuleau, & Washington, 2004). Going by the equation above, this seems very difficult to achieve, except perhaps for very specific action distributions in some special planning problems. For example, if all distributions are constant or if there is no concurrency in the domain, then these equations are easily solvable. But for more interesting cases, solving these equations is a challenging open question.

Furthermore, dealing with continuous multi-modal distributions worsens the decision-epoch explosion. We illustrate this with the help of an example.

Example: Consider the domain of Figure 7, except that action a0 now has a bimodal distribution, the two modes being uniform over 0-1 and 9-10 respectively, as shown in Figure 18(a). Also let a1 have a very small duration. Figure 18(b) shows the expected remaining termination time of a0 (which must terminate by time 10). Notice that, due to the bimodality, this expected remaining execution time increases between 0 and 1. The expected time to reach the goal using the plan {a0, a1}; a2 is shown in the third graph. Now suppose we have started {a0, a1} and we need to choose the next decision epoch. It is easy to see that the optimal decision epoch could be any point between 0 and 1 and would depend on the alternative routes to the goal. For example, if the duration of b0 is 7.75, then the optimal time point to start the alternative route is 0.5 (right after the expected time to reach the goal using the first plan exceeds 7.75). Thus, the choice of decision epochs depends on the expected durations of the alternative routes. But these values are not known in advance; in fact, these are exactly the values being calculated in the planning phase. Therefore, choosing decision epochs ahead of time does not seem possible. This makes the optimal continuous multi-modal distribution planning problem mostly intractable for any reasonably sized problem.
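Although a closed form is elusive, the backup in Equation 16 can at least be approximated numerically. The following is a minimal sketch that discretizes both integrals with the trapezoidal rule; f1, F1, f2, F2 stand for the conditioned density and survival functions defined above, and J, X1, X2, a1, a2 are hypothetical stand-ins for the current value function, the successor world states, and the two actions. It only illustrates the backup itself; as discussed above, discretizing the integral does not by itself resolve the problem of choosing decision epochs.

```python
import numpy as np

# Hedged sketch: trapezoidal discretization of the continuous-duration backup
# in Equation 16.  All arguments are illustrative callables/values standing in
# for the quantities of Section 8.3; nothing here is the paper's implementation.

def continuous_backup(T, f1, F1, f2, F2, J, X1, X2, a1, a2, M, n_points=1001):
    """Approximate Q_{n+1}(<X, {(a1, T)}>, a2).

    f1(T, t), F1(T, t) -- density / survival of a1's remaining duration,
                          conditioned on a1 having already run for T time units
    f2(0, t), F2(0, t) -- the same for a2, which starts now
    J(X, (a, t))       -- current value of the extended state <X, {(a, t)}>
    M                  -- min(max remaining duration of a1, max duration of a2)
    """
    ts = np.linspace(0.0, M, n_points)
    vals = []
    for t in ts:
        # a1 finishes first at time t; a2 keeps executing and has run for t units.
        term1 = f1(T, t) * F2(0.0, t) * (t + J(X1, (a2, t)))
        # a2 finishes first at time t; a1 keeps executing and has run for T + t units.
        term2 = F1(T, t) * f2(0.0, t) * (t + J(X2, (a1, T + t)))
        vals.append(term1 + term2)
    return np.trapz(vals, ts)
```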
8.4 Generalizing the TGP Action Model

The assumption of TGP-style actions enables us to compute optimal policies, since we can prune the number of decision epochs. In the case of complex action models like PDDL2.1 (Fox & Long, 2003), all existing deterministic state-space planners are incomplete. For the same reasons, our algorithms are incomplete for problems in PPDDL2.1. Recently, Cushing et al. have introduced Tempo, a state-space planner which uses lifting over time to achieve completeness (Cushing, Kambhampati, Mausam, & Weld, 2007). In pursuit of a complete, state-space, probabilistic planner for complex action models, a natural step is to consider a Tempo-like representation in a probabilistic setting. While working out the details seems relatively straightforward, the important research challenge will be to find the right heuristics to streamline the search so that the algorithm can scale.

8.5 Other Extensions

There are several other extensions to the basic framework that we have suggested. Each different construct introduces additional structure, and we need to exploit that structure in order to design fast algorithms. Often the basic algorithms proposed in this paper can be easily adapted to such situations; sometimes they cannot. We list two of the important extensions below.

• Notion of Goal Satisfaction: Different problems may require slightly different notions of when a goal is reached. For example, we have assumed thus far that a goal is not "officially achieved" until all executed actions have terminated. Alternatively, one might consider a goal to be achieved if a satisfactory world state is reached, even though some actions may be in the midst of execution. There are intermediate possibilities in which a goal requires some specific actions to necessarily end. Just by changing the definition of the goal set, these problems can be modeled as a CoMDP. The hybridized algorithm and the heuristics can be easily adapted for this case.

• Interruptible Actions: We have assumed that, once started, an action cannot be terminated. However, a richer model may allow preemptions, as well as the continuation of an interrupted action. Problems in which all actions can be interrupted at will have a significantly different flavor. Interrupting an action is a new kind of decision and requires a full study of when an action termination might be useful. To a large extent, planning with such actions is similar to finding different concurrent paths to the goal and starting all of them together, since one can always interrupt all the executing paths as soon as the goal is reached. For instance, the example in Figure 7 no longer holds, since b0 can be started at time 1 and later terminated as needed to shorten the make-span.

8.6 Effect of Large Durations

A weakness of all extended-state-space approaches, in both deterministic and probabilistic settings, is their dependence on absolute durations (or, to be more accurate, on the greatest common divisor of the action durations). For instance, if the domain has an action a with a large duration, say 100, and another concurrently executable action with duration 1, then all world states will be explored with the tuples (a, 1), (a, 2), ..., (a, 98), (a, 99). In general, many of these states will behave similarly and there will be certain decision boundaries that are important. "Start b if a has been executing for 50 units and c otherwise" is one example of such a decision boundary. Instead of representing all these flat discrete states individually, planning in an aggregate space in which each state represents several extended states would help alleviate this inefficiency. However, it is not obvious how to achieve such an aggregation automatically, since straightforward adaptations of the well-known aggregation methods do not work in our case.
For instance, SPUDD (Hoey et al., 1999) uses algebraic decision diagrams to represent abstract states that have the same J-value. Aggregating states with equal value may not be enough for us, since the expected time of completion depends linearly on the amount of time left for the longest executing action. So, states which differ only by the amount of time an action has been executing cannot be aggregated together. In a similar way, Feng et al. (2004) use piecewise constant and piecewise linear representations to adaptively discretize continuous variables. In our case, we have |A| such variables. While only the few actions that are executing are active at a given time, modeling a sparse high-dimensional value function is not easy either. Being able to exploit this structure due to action durations is an essential future direction in order to scale the algorithms to complex real-world domains.

9. Conclusions

Although concurrent and durative actions with stochastic effects characterize many real-world domains, few planners can handle all these challenges in concert. This paper proposes a unified state-space based framework to model and solve such problems. State-space formulations are popular both in deterministic temporal planning and in probabilistic planning. However, each of these features brings additional complexity to the formulation and affords new solution techniques. We develop the "DUR" family of algorithms to alleviate these complexities. We evaluate the techniques on the running times and the qualities of the solutions produced. Moreover, we study the theoretical properties of these domains and also identify key conditions under which fast, optimal algorithms are possible. We make the following contributions:

1. We define Concurrent MDPs (CoMDPs), an extension of the MDP model that formulates stochastic planning problems with concurrent actions. A CoMDP can be cast back into a new MDP with an extended action space. Because this action space is possibly exponential in the number of actions, solving the new MDP naively may incur a huge performance hit. We develop the general notions of pruning and sampling to speed up the algorithms. Pruning refers to eliminating provably sub-optimal action combinations for each state, thus performing less computation while still guaranteeing optimal solutions. Sampling-based solutions rely on an intelligent sampling of action combinations to avoid dealing with their exponential number; a minimal sketch of this idea appears after this list. This method converges orders of magnitude faster than the other methods and produces near-optimal solutions.

2. We formulate planning with concurrent, durative actions as a CoMDP in two modified state spaces: aligned epoch and interwoven epoch. While aligned-epoch based solutions run very fast, interwoven-epoch algorithms yield much higher quality solutions. We also define two heuristic functions, maximum concurrency (MC) and average concurrency (AC), to guide the search. MC is an admissible heuristic, whereas AC, while inadmissible, is typically more informed, leading to better computational gains. We call our algorithms the "DUR" family of algorithms. The subscripts samp or prun refer to sampling and pruning respectively, optional superscripts AC or MC refer to the heuristic employed, if any, and an optional "Δ" before DUR denotes a problem with stochastic durations.
For example, Labeled RTDP for a deterministic-duration problem employing sampling and started with the AC heuristic is abbreviated as DUR^AC_samp.

3. We also develop the general technique of hybridizing two planners. Hybridizing interwoven-epoch and aligned-epoch CoMDPs yields a much more efficient algorithm, DURhyb. The algorithm has a parameter which can be varied to trade off speed against optimality. In our experiments, DURhyb quickly produces near-optimal solutions. For larger problems, the speedups over other algorithms are quite significant. The hybridized algorithm can also be used in an anytime fashion, thus producing good-quality proper policies (policies that are guaranteed to reach the goal) within a desired time. Moreover, the idea of hybridizing two planners is a general notion; recently it has been applied to solving general stochastic planning problems (Mausam, Bertoli, & Weld, 2007).

4. Uncertainty in durations leads to more complexity because, in addition to the state and action spaces, there is also a blowup in the branching factor and in the number of decision epochs. We bound the space of decision epochs in terms of pivots (times when actions may potentially terminate) and conjecture further restrictions, thus making the problem tractable. We also propose two algorithms, the expected duration planner (ΔDURexp) and the archetypal duration planner (ΔDURarch), which successively solve small planning problems with no or limited duration uncertainty, respectively. ΔDURarch is also able to make use of the additional structure offered by multi-modal duration distributions. These algorithms run much faster than other techniques. Moreover, ΔDURarch offers a good balance in the planning time vs. solution quality tradeoff.

5. Besides our focus on stochastic actions, we expose important theoretical issues related to durative actions which have repercussions for deterministic temporal planners as well. In particular, we prove that all common state-space temporal planners are incomplete in the face of expressive action models, e.g., PDDL2.1, a result that may have a strong impact on future temporal planning research (Cushing et al., 2007).
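The sampling idea of contribution 1 can be pictured with the following minimal sketch. It is deliberately generic: the uniform random choice of combinations (always retaining the singletons) is only a placeholder, whereas Sampled RTDP biases its samples using information gathered during search, and applicability and mutex filtering of combinations is omitted here.

```python
import random

# Minimal, hedged sketch of a sampled Bellman backup over action combinations.
# q_value(state, combo) is a hypothetical callable returning the current
# Q-value of a combination; the real planner also discards combinations whose
# actions are mutex or inapplicable, which is omitted in this illustration.

def sampled_backup(state, applicable, q_value, num_samples=20, max_size=3):
    """Approximate min over combinations A of Q(state, A) by sampling."""
    applicable = list(applicable)
    candidates = [frozenset([a]) for a in applicable]       # always keep the singletons
    if len(applicable) >= 2:
        for _ in range(num_samples):                         # plus a few sampled combinations
            size = random.randint(2, min(max_size, len(applicable)))
            candidates.append(frozenset(random.sample(applicable, size)))
    best = min(candidates, key=lambda combo: q_value(state, combo))
    return best, q_value(state, best)
```

Pruning, by contrast, enumerates combinations but discards those that a bound such as combo-skipping proves sub-optimal, preserving optimality at a higher computational cost.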
Overall, this paper proposes a large set of techniques that are useful in modeling and solving planning problems with stochastic effects, concurrent executions, and durative actions with duration uncertainty. The algorithms range from fast but suboptimal to relatively slow but optimal, and various algorithms that explore intermediate points in this spectrum are also presented. We hope that our techniques will be useful in scaling planning to real-world problems in the future.

Acknowledgments

We thank Blai Bonet for providing the source code of GPT as well as for comments in the course of this work. We are thankful to Sumit Sanghai for his theorem-proving skills and advice at various stages of this research. We are grateful to Derek Long and the anonymous reviewers of this paper, who gave several thoughtful suggestions for generalizing the theory and improving the clarity of the text. We also thank Subbarao Kambhampati, Daniel Lowd, Parag, David Smith and all others who provided useful comments on drafts of parts of this research. This work was performed at the University of Washington between 2003 and 2007 and was supported by generous grants from the National Aeronautics and Space Administration (Award NAG 2-1538), the National Science Foundation (Award IIS-0307906), the Office of Naval Research (Awards N00014-02-1-0932, N00014-06-1-0147), and the WRF / TJ Cable Professorship.

References

Aberdeen, D., Thiebaux, S., & Zhang, L. (2004). Decision-theoretic military operations planning. In ICAPS'04.
Aberdeen, D., & Buffet, O. (2007). Concurrent probabilistic temporal planning with policy-gradients. In ICAPS'07.
Bacchus, F., & Ady, M. (2001). Planning with resources and concurrency: A forward chaining approach. In IJCAI'01, pp. 417–424.
Barto, A., Bradtke, S., & Singh, S. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72, 81–138.
Bertsekas, D. (1995). Dynamic Programming and Optimal Control. Athena Scientific.
Blum, A., & Furst, M. (1997). Fast planning through planning graph analysis. Artificial Intelligence, 90(1–2), 281–300.
Bonet, B., & Geffner, H. (2003). Labeled RTDP: Improving the convergence of real-time dynamic programming. In ICAPS'03, pp. 12–21.
Bonet, B., & Geffner, H. (2005). mGPT: A probabilistic planner based on heuristic search. JAIR, 24, 933.
Boutilier, C., Dean, T., & Hanks, S. (1999). Decision theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11, 1–94.
Boyan, J. A., & Littman, M. L. (2000). Exact solutions to time-dependent MDPs. In NIPS'00, p. 1026.
Bresina, J., Dearden, R., Meuleau, N., Smith, D., & Washington, R. (2002). Planning under continuous time and resource uncertainty: A challenge for AI. In UAI'02.
Chen, Y., Wah, B. W., & Hsu, C. (2006). Temporal planning using subgoal partitioning and resolution in SGPlan. JAIR, 26, 323.
Cushing, W., Kambhampati, S., Mausam, & Weld, D. S. (2007). When is temporal planning really temporal? In IJCAI'07.
Dearden, R., Meuleau, N., Ramakrishnan, S., Smith, D. E., & Washington, R. (2003). Incremental contingency planning. In ICAPS'03 Workshop on Planning under Uncertainty and Incomplete Information.
Do, M. B., & Kambhampati, S. (2001). Sapa: A domain-independent heuristic metric temporal planner. In ECP'01.
Do, M. B., & Kambhampati, S. (2003). Sapa: A scalable multi-objective metric temporal planner. JAIR, 20, 155–194.
Edelkamp, S. (2003). Taming numbers and duration in the model checking integrated planning system. Journal of Artificial Intelligence Research, 20, 195–238.
Feng, Z., Dearden, R., Meuleau, N., & Washington, R. (2004). Dynamic programming for structured continuous Markov decision processes. In UAI'04, p. 154.
Foss, J., & Onder, N. (2005). Generating temporally contingent plans. In IJCAI'05 Workshop on Planning and Learning in A Priori Unknown or Dynamic Domains.
Fox, M., & Long, D. (2003). PDDL2.1: An extension to PDDL for expressing temporal planning domains. JAIR Special Issue on the 3rd International Planning Competition, 20, 61–124.
Gerevini, A., & Serina, I. (2002). LPG: A planner based on local search for planning graphs with action graphs. In AIPS'02, p. 281.
Guestrin, C., Koller, D., & Parr, R. (2001). Max-norm projections for factored MDPs. In IJCAI'01, pp. 673–682.
Hansen, E., & Zilberstein, S. (2001). LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129, 35–62.
Haslum, P., & Geffner, H. (2001). Heuristic planning with time and resources. In ECP'01.
Hoey, J., St-Aubin, R., Hu, A., & Boutilier, C. (1999). SPUDD: Stochastic planning using decision diagrams. In UAI'99, pp. 279–288.
Jensen, R. M., & Veloso, M. (2000). OBDD-based universal planning for synchronized agents in non-deterministic domains. Journal of Artificial Intelligence Research, 13, 189.
Kushmerick, N., Hanks, S., & Weld, D. (1995). An algorithm for probabilistic planning. Artificial Intelligence, 76(1–2), 239–286.
Laborie, P., & Ghallab, M. (1995). Planning with sharable resource constraints. In IJCAI'95, p. 1643.
Little, I., Aberdeen, D., & Thiebaux, S. (2005). Prottle: A probabilistic temporal planner. In AAAI'05.
Little, I., & Thiebaux, S. (2006). Concurrent probabilistic planning in the graphplan framework. In ICAPS'06.
Long, D., & Fox, M. (2003). The 3rd international planning competition: Results and analysis. JAIR, 20, 1–59.
Mausam (2007). Stochastic planning with concurrent, durative actions. Ph.D. dissertation, University of Washington.
Mausam, Bertoli, P., & Weld, D. (2007). A hybridized planner for stochastic domains. In IJCAI'07.
Mausam, & Weld, D. (2004). Solving concurrent Markov decision processes. In AAAI'04.
Mausam, & Weld, D. (2005). Concurrent probabilistic temporal planning. In ICAPS'05, pp. 120–129.
Mausam, & Weld, D. (2006a). Challenges for temporal planning with uncertain durations. In ICAPS'06.
Mausam, & Weld, D. (2006b). Probabilistic temporal planning with uncertain durations. In AAAI'06.
Meuleau, N., Hauskrecht, M., Kim, K.-E., Peshkin, L., Kaelbling, L., Dean, T., & Boutilier, C. (1998). Solving very large weakly coupled Markov decision processes. In AAAI'98, pp. 165–172.
Musliner, D., Murphy, D., & Shin, K. (1991). World modeling for the dynamic construction of real-time control plans. Artificial Intelligence, 74, 83–127.
Nigenda, R. S., & Kambhampati, S. (2003). AltAlt-p: Online parallelization of plans with heuristic state search. Journal of Artificial Intelligence Research, 19, 631–657.
Penberthy, J., & Weld, D. (1994). Temporal planning with continuous change. In AAAI'94, p. 1010.
Rohanimanesh, K., & Mahadevan, S. (2001). Decision-theoretic planning with concurrent temporally extended actions. In UAI'01, pp. 472–479.
Singh, S., & Cohn, D. (1998). How to dynamically merge Markov decision processes. In NIPS'98.
Smith, D., & Weld, D. (1999). Temporal graphplan with mutual exclusion reasoning. In IJCAI'99, pp. 326–333.
Vidal, V., & Geffner, H. (2006). Branching and pruning: An optimal temporal POCL planner based on constraint programming. Artificial Intelligence, 170(3), 298–335.
Younes, H. L. S., & Simmons, R. G. (2004a). Policy generation for continuous-time stochastic domains with concurrency. In ICAPS'04, p. 325.
Younes, H. L. S., & Simmons, R. G. (2004b). Solving generalized semi-Markov decision processes using continuous phase-type distributions. In AAAI'04, p. 742.
Zhang, W., & Dietterich, T. G. (1995). A reinforcement learning approach to job-shop scheduling. In IJCAI'95, pp. 1114–1120.

Appendix A. Proof of Theorem 6

We now prove the statement of Theorem 6, i.e., if all actions are TGP-style then the set of pivots suffices for optimal planning. In the proof we make use of the fact that if all actions are TGP-style, then a consistent execution of any concurrent plan requires that any two executing actions be non-mutex (refer to Section 5 for an explanation).
In particular, none of their effects conflict, and a precondition of one does not conflict with the effects of the other. We prove our theorem by contradiction. Let us assume that for some problem each optimal solution requires at least one action to start at a non-pivot. Let us consider one of those optimal plans in which the first non-pivot time point at which an action starts is minimized. Let us name this time point t and let the action that starts at that point be a. We now prove by a case analysis that we may just as well start a at time t − 1 without changing the nature of the plan. If t − 1 is also a non-pivot, then we contradict the hypothesis that t is the minimum first non-pivot point. If t − 1 is a pivot, then our hypothesis is contradicted because a does not "need to" start at a non-pivot.

To prove that a can be left-shifted by 1 unit, we take up one trajectory at a time (recall that actions could have several durations) and consider all actions playing a role at t − 1, t, t + Δ(a) − 1, and t + Δ(a), where Δ(a) refers to the duration of a in this trajectory. Considering these points suffices, since the system state does not change at any other point on the trajectory. We prove that the execution of none of these actions is affected by this left shift. There are the following twelve cases:

1. ∀ actions b that start at t − 1: b can't end at t (t is a non-pivot). Thus a and b execute concurrently after t, which implies that a and b are non-mutex. Thus a and b may as well start together.

2. ∀ actions b that continue execution at t − 1: Use an argument similar to case 1 above.

3. ∀ actions b that end at t − 1: Because b is TGP-style, its effects are realized in the open interval ending at t − 1. Therefore, the start of a does not conflict with the end of b.

4. ∀ actions b that start at t: a and b start together and hence are not dependent on each other for preconditions. Also, they are non-mutex, so their starting times can be shifted in any direction.

5. ∀ actions b that continue execution at t: If b was started at t − 1, refer to case 1 above. If not, t and t − 1 are both similar points for b.

6. ∀ actions b that end at t: This case is not possible, by the assumption that t is a non-pivot.

7. ∀ actions b that start at t + Δ(a) − 1: Since a continued execution at this point, a and b are non-mutex. Thus a's effects do not clobber b's preconditions. Hence, b can still be executed after realizing a's effects.

8. ∀ actions b that continue execution at t + Δ(a) − 1: a and b are non-mutex, so a may end earlier without any effect on b.

9. ∀ actions b that end at t + Δ(a) − 1: a and b were executing concurrently. Thus they are non-mutex. So they may end together.

10. ∀ actions b that start at t + Δ(a): b may still start at t + Δ(a), since the state at t + Δ(a) does not change.

11. ∀ actions b that continue execution at t + Δ(a): If b was started at t + Δ(a) − 1, refer to case 7 above; else there is no state change at t + Δ(a) to cause any effect on b.

12. ∀ actions b that end at t + Δ(a): a and b are non-mutex because they were executing concurrently. Thus, a's effects don't clobber b's preconditions. Hence, a may end earlier.

Since a can be left-shifted in all the trajectories, the left shift is legal. Also, if there are multiple actions that start at t, they may each be shifted one by one using the same argument. Hence proved. ∎