From: AAAI Technical Report SS-94-06. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

When to Plan and When to Act?

Richard Goodwin
rich@cs.cmu.edu
School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213-3890
(412) 268-8102

Abstract

Faced with a complicated task, some initial planning can significantly increase the likelihood of success and increase efficiency, but planning for too long before starting to act can reduce efficiency. This paper explores the question of when a resource-bounded agent should begin acting. Limitations of an idealized algorithm suggested in the literature are presented and illustrated in the context of a robot courier. A revised idealized algorithm is given and justified. The revised idealized algorithm is then used as the basis for developing a new "step choice" algorithm that makes on-the-fly decisions for a simplified version of the robot courier task. A set of experiments illustrates the relative advantage of the new strategy over always-act, always-compute and anytime-algorithm-based strategies for deciding when to begin execution.

Introduction

Faced with a complicated task, some initial planning can significantly increase the likelihood of success and increase efficiency. Conversely, sitting on one's hands, considering all possible consequences of all possible actions, does not accomplish anything. At some point, action is called for. But when is this point reached? The question of when to start executing the current best action is crucial for creating agents that perform tasks efficiently. The more time and resources spent creating and optimizing a plan, the longer the actions of the plan are delayed. The delay is justified if the improvement in the plan more than offsets the cost of delaying execution. This tradeoff is basic to creating resource-bounded rational agents. An agent demonstrates resource-bounded rationality if it performs optimally given its computational limits.

Deciding when to begin execution is complex. There is no way to know how much a plan will improve with a given amount of planning. The best that can be achieved is to have some type of statistical model, as is done in the anytime planning framework [Dean and Boddy, 1988]. In addition, the decision should take into account the fact that it may be possible to execute one part of the plan while planning another.

Approaches

The classical AI planning approach to deciding when to begin execution is to run the planner until it produces a plan that satisfies its goals and then to execute the plan. This approach recognizes that finding the optimal plan may take too long and that using the first plan that satisfies all the constraints may produce the best results [Simon and Kadane, 1974]. It also relies on the heuristic that, in searching for a plan, a planner tends to produce simple plans first, and simple plans tend to be more efficient and robust than unnecessarily complex plans.

In contrast to the classical AI approach, the reactive approach is to pre-compile all plans and to always perform the action suggested by the rules applicable to the current situation. The answer to the question of when to begin acting is always "now". This approach is applicable when it is possible to pre-compile plans and in dynamic environments where quick responses are required.

A third approach is to explicitly reason about when to start execution while planning is taking place.
Using information about the task, the current state of the plan and the expected performance of the planner, the decision to execute the current best action is made on the fly. This allows execution to begin before a complete plan is created and facilitates overlapping of planning and execution. This approach also allows execution to be delayed in favour of optimizing the plan, if the expected improvement so warrants.

Idealized Algorithm

Making on-the-fly decisions about when to begin execution can be modeled as a control problem where a meta-level controller allocates resources and time to the planner and decides when to send actions to the execution controller for execution. The control problem is to select actions and computations that maximize the expected utility. An idealized algorithm that has been suggested by a number of researchers selects the item from the set {α, C1, ..., Ck} that has the highest expected utility [Russell and Wefald, 1991]. In this set, α is the current best action and the Ci represent possible computations. The utility of each computation must take into account the duration of the computation, since any act it recommends is delayed by that amount. Of course, this ideal algorithm is intractable, and in any real implementation some approximations have to be made.

Revised Idealized Algorithm

The idealized algorithm ignores the fact that most agents can act and compute concurrently. A better control question to ask is whether the agent should compute, or compute and act. The corresponding control problem is to select from the list {(α, C1), ..., (α, Ck), (φ, C1), ..., (φ, Ck)}, where each ordered pair represents an action to perform and a computation to do. φ represents the null action, corresponding to computing only. Considering pairs of actions and computations allows the controller to accept the current action and focus computation on planning future actions. This can be advantageous even if more computation is expected to improve the current action. For instance, I may be able to improve the path I'm taking to get to my next meeting, saving more time than the path planning takes. However, I might be better off going with the current plan and thinking instead about what I'm going to say when I get to the meeting.

It is not the case that an agent should always consider both acting and computing. The original idealized algorithm is appropriate for agents that play chess. In such a context, it is not a good idea to sacrifice expected improvements in the current move in order to compute while moving: the utility of computing after a move, before the opponent has responded, is relatively low. Considering both acting and computing only complicates the control mechanism and can reduce the overall performance of the agent. One property of chess that contributes to the low utility of computing while acting is that each action is irrevocable; there is, in effect, an infinite cost for undoing a move.

Using the revised idealized algorithm, I have been investigating the question of when an agent should begin to act. In the rest of this paper, I discuss the results in the context of a simple robot courier domain. I first present a situation where there are improvements to be had by judiciously overlapping planning and execution. I then present a more formal analysis of a particular problem and use the analysis as a basis for generating some preliminary empirical results.
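To make the selection step concrete, here is a minimal Python sketch of the revised idealized control choice. It assumes actions and computations are simple labels and that the caller supplies an expected-utility estimator; the names and types are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List, Optional, Tuple

# phi, the null action (compute only), is represented here by None.
Action = Optional[str]
Computation = str

def build_candidates(best_action: str,
                     computations: List[Computation]) -> List[Tuple[Action, Computation]]:
    """Form the candidate set {(alpha, C1), ..., (alpha, Ck), (phi, C1), ..., (phi, Ck)}."""
    return [(best_action, c) for c in computations] + [(None, c) for c in computations]

def choose(best_action: str,
           computations: List[Computation],
           expected_utility: Callable[[Action, Computation], float]) -> Tuple[Action, Computation]:
    """Revised idealized control step: pick the (action, computation) pair with the
    highest expected utility.  The utility estimate must already account for how
    long each computation delays any action it eventually recommends."""
    return max(build_candidates(best_action, computations),
               key=lambda pair: expected_utility(*pair))
```

The original idealized algorithm is the special case in which only the pairs (φ, Ci) and the bare current best action are considered.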
Robot Courier

In his thesis, Boddy introduces a robot courier task in which a robot is required to visit a set of locations [Boddy, 1991]. The robot's environment is represented as a grid where each grid cell is either free or blocked. The robot is located in one cell and can attempt to move to one of the four adjacent cells. If the neighbouring cell is free, the move succeeds; otherwise it fails and the robot remains in the same cell. Both successful and unsuccessful moves take time. The robot is provided with a map that can be used for route planning. The robot's objective is to visit the given locations, in any order, in the least amount of time possible.

The planning problem consists of two parts: ordering the visits and planning paths between locations. The purpose of planning is to reduce the amount of time needed to complete the set of visits. The initial ordering of locations could be used as the tour plan; tour planning makes the tour more efficient. Similarly, the agent has an execution controller that can dead-reckon between locations and can eventually make its way to any reachable location. The controller can also make use of partial plans to improve its efficiency.

For Boddy's robot, planning is done using a pair of anytime planning algorithms. Path planning is done by a heuristic search that yields increasingly complete paths between locations. Tour improvement is done using two-opt [Lin and Kernighan, 1973]. The expected improvement for each algorithm is a function of the amount of time spent planning. The performance of the two-opt algorithm can be characterized by a curve that exponentially approaches an asymptote (Equation 1). The path planner has a constant expected rate of improvement until a cutoff point, after which the rate becomes zero (Equation 2).

E(dist(n, planTime_n)) = dist_initial - nA(1 - e^{-B planTime_n})        (1)

E(travelTime(dist_{a,b}, planTime_{a,b})) =
    (3.17 dist_{a,b} - 1.50 planTime_{a,b}/T_pp) / v,    if planTime_{a,b} < 1.33 T_pp dist_{a,b}
    1.17 dist_{a,b} / v,                                 if planTime_{a,b} >= 1.33 T_pp dist_{a,b}        (2)

Let us examine the first example presented in the thesis, where the robot must go from its current location, l1, and visit two locations, l2 and l3, in sequence. The distances from l1 to l2 and from l2 to l3 are both 100 units. The problem is to create a deliberation schedule that allocates path-planning time to each of the two paths. In creating deliberation schedules, the assumption is made that the planner can only plan a path between two locations before the robot begins to move between them. This assumption avoids the problem of how to do communication and co-ordination between the planner and the execution controller when both are working on getting the robot to the next location. It also reduces the complexity of the deliberation scheduler.

The deliberation plan given in the thesis has the robot initially doing path planning for the first leg of the trip and then overlapping path planning for the second leg with execution of the first leg. The plan is parameterized by the time spent path planning each leg of the trip: planTime_{1,2} and planTime_{2,3}. The plan is optimal when the time to follow the first path equals the expected time to plan the best path for the second part of the trip. The expected travel time for a path, as a function of the distance between its end points and the amount of time spent planning, is given in Equation 2.
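The two performance profiles can be written directly as functions. The sketch below encodes Equations 1 and 2, under the assumption that the printed constants (3.17, 1.50, 1.17, 1.33) are the truncated values reported in the paper; the function and parameter names are mine.

```python
import math

def expected_tour_dist(n: int, plan_time: float, dist_initial: float,
                       A: float, B: float) -> float:
    """Equation 1: expected tour length after plan_time units of two-opt
    improvement on an n-location tour (A and B are fitted empirically)."""
    return dist_initial - n * A * (1.0 - math.exp(-B * plan_time))

def expected_travel_time(dist: float, plan_time: float,
                         t_pp: float = 1.0, v: float = 1.0) -> float:
    """Equation 2: expected time to travel a leg of length dist after spending
    plan_time on path planning; improvement is linear up to the cutoff point."""
    if plan_time < 1.33 * t_pp * dist:
        return (3.17 * dist - 1.50 * plan_time / t_pp) / v
    return 1.17 * dist / v
```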
Figure 1 gives the set of parameter values used in the thesis. The total expected time for visiting both locations, given an optimal amount of initial planning, is given in Equation 4.¹

    T_pp = 1            Time to perform one planning step.
    dist_{1,2} = 100    Distance between the first two locations.
    dist_{2,3} = 100    Distance between the second two locations.
    v = 1               Speed of the robot.
    2/v = 2             The time cost for taking a step that fails.
    P(occupied) = 0.25  The probability that a given location is occupied.
    planTime_{a,b}      Time spent planning the trip from a to b.
    planTime*_{a,b}     Expected time to plan the optimal trip from a to b.

    Figure 1: Robot Courier parameters.

E(travelTime(dist_{1,2}, planTime*_{1,2})) = planTime*_{2,3} = 133.0        (3)

E(Time(plan_1)) = planTime*_{1,2} + planTime*_{2,3} + E(travelTime(dist_{2,3}, planTime*_{2,3}))
                = 122.6 + 133.0 + 117.5
                = 373.16        (4)

E(Time(plan_2)) = P(occupied) (E(Time(plan_1)) + 2/v)
                  + (1 - P(occupied)) (planTime'_{1,2} + planTime*_{2,3} + E(travelTime(dist_{2,3}, planTime*_{2,3})))
                = 0.25 (373.16 + 2) + 0.75 (120.553 + 133.0 + 117.5)
                = 372.08        (5)

¹These numbers are slightly different from those in the thesis, which were truncated.

Suppose we maintain the assumption that the path planner cannot work on the path that the execution controller is currently executing, but remove the implicit assumption that the path planner can only be used to plan paths between the end points of a path. Making use of the fact that an anytime algorithm can be interrupted at any time and its results either used or thrown away, we can construct other possible deliberation schedules. Consider a plan where the robot attempts to take one step towards the first location and uses a deliberation schedule that assumes the robot started that one step closer. If the step succeeds, the robot ends up one step closer to the location and the size of the path-planning problem has been reduced. Of course, the step may fail, in which case the planner can be interrupted and restarted using the original deliberation schedule. The cost of failure is the time it takes to discover that the step failed. Equation 5 gives the expected time for this modified plan. For the parameters used, the expected time for the modified plan to visit both sites is about one time unit less than that of the original plan.
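For readers who want to reproduce the comparison, the following self-contained sketch evaluates Equations 3-5 with the Figure 1 parameters. Because it uses the truncated constants, it yields roughly 372.7 versus 371.6 rather than the paper's 373.16 versus 372.08, but the roughly one-unit advantage of the modified plan is reproduced; the variable names are mine.

```python
# A small worked check of Equations 3-5.  The constants 3.17 / 1.50 / 1.17 / 1.33
# are the truncated values printed in the paper, so the totals land near, not
# exactly at, 373.16 and 372.08.  Parameter names follow Figure 1.
T_PP, V, DIST, P_OCC, FAIL_COST = 1.0, 1.0, 100.0, 0.25, 2.0

PLAN_TIME_23 = 1.33 * T_PP * DIST            # time to plan the optimal second leg
TRAVEL_23 = 1.17 * DIST / V                  # optimal second-leg travel time

def first_leg_plan_time(dist: float) -> float:
    # Equation 3: pick planTime_{1,2} so that the first leg's travel time,
    # (3.17*dist - 1.50*planTime/T_pp)/v, equals PLAN_TIME_23.
    return (3.17 * dist - PLAN_TIME_23 * V) * T_PP / 1.50

def plan1() -> float:
    # Equation 4: plan leg 1, travel leg 1 while planning leg 2, travel leg 2.
    return first_leg_plan_time(DIST) + PLAN_TIME_23 + TRAVEL_23

def plan2() -> float:
    # Equation 5: gamble on one immediate step.  On success the first leg is one
    # unit shorter; on failure pay 2/v and fall back to the original schedule.
    success = first_leg_plan_time(DIST - 1.0) + PLAN_TIME_23 + TRAVEL_23
    return P_OCC * (plan1() + FAIL_COST) + (1.0 - P_OCC) * success

print(plan1(), plan2())   # roughly 372.7 and 371.6 with the truncated constants
```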
Why does the modified plan do better? One contributing factor is that the environment is relatively benign: even if the robot tries a faulty step, the penalty for recovery is relatively low, two unit step times. Another important factor is that the rate of execution is the same order of magnitude as the rate at which the planner can improve the plan. If the robot were infinitely fast, then always acting would be the right thing to do. On the other hand, if the rate of computation were infinitely fast, doing full planning would always be the best thing to do. In between these two extremes, overlapping planning and execution offers potential efficiency improvements.

However, blind overlapping of computation and execution, even in a benign environment, is not always the best thing to do. Suppose the processor on the robot were slightly faster: there is a critical point beyond which it makes sense to wait for the planner to finish rather than attempt the first step. In the example above, if the time for one planning step, T_pp, falls below 0.315 time units, then the original plan has the lower expected time to completion. Similarly, if the penalty for a bad step rises to 6.34/v, the original plan is better. Finally, note that for this example the modified deliberation plan is not optimal. In most cases, there are two directions on the grid that reduce the distance to the goal. We could modify the plan again so that if the first step fails, the robot tries the other direction that leads towards the goal. This would improve the expected time slightly. In the limit, a system in which the planner and the execution controller were more tightly coupled would likely produce the best result for this example. See [Dean et al., 1991] for an example.

Tour Planning

How should an agent decide when to begin acting? For a more formal analysis, let us consider only the problem of ordering the locations to visit for the robot courier. To simplify the analysis, consider the case where the distance between locations is proportional to the actual time to travel between them. Let us also assume that utility is linear in the time needed to complete all the visits. In this situation, the control choice for the robot is whether to execute the next step in its current plan while continuing to optimize the rest of the plan, or to just do the optimization.

The optimization that is available is to run the two-opt algorithm on the unexecuted portion of the plan. The expected improvement in the tour as a function of computation time for the two-opt algorithm is described by Equation 6. In the equation, t is the computation time, n is the number of locations, and A and B are parameters that can be determined empirically.

E(Improvement(t, n)) = I(t, n) = nA(1 - e^{-Bt})        (6)

One approach to making the meta-level control algorithm efficient is to use a simple greedy algorithm, as suggested by Russell and Wefald [Russell and Wefald, 1991]. This algorithm selects the option that gives the highest expected rate of task accomplishment. For the robot courier, there are two ways of making progress on the task: optimizing the tour and moving.
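As a point of reference, a single improvement pass of the kind Equation 6 models looks like the sketch below; in the courier setting it would be applied only to the unexecuted portion of the tour. The point representation and distance function are illustrative assumptions, not the simulator's data structures.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def dist(a: Point, b: Point) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])

def two_opt_pass(tour: List[Point]) -> bool:
    """One sweep of two-opt over an open tour: reverse any segment whose
    reversal shortens the tour.  Returns True if an improving exchange was
    found; each exchange considered is part of an O(n^2) sweep."""
    improved = False
    n = len(tour)
    for i in range(n - 2):
        for j in range(i + 2, n - 1):
            # Edges (i, i+1) and (j, j+1) before and after reversing i+1 .. j.
            before = dist(tour[i], tour[i + 1]) + dist(tour[j], tour[j + 1])
            after = dist(tour[i], tour[j]) + dist(tour[i + 1], tour[j + 1])
            if after < before - 1e-9:
                tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                improved = True
    return improved
```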
Anytime Approach

Given the characterization of the two-opt tour improvement algorithm in Equation 6, an anytime deliberation schedule can be created. At first, it might seem obvious that the robot should compute until the expected rate of tour improvement falls below the rate at which the robot can execute the plan. This is not optimal, though. Such a deliberation schedule is based on the initial idealized algorithm and ignores the fact that the robot can execute a step while optimizing the rest of the plan. Basing the deliberation schedule on the revised idealized algorithm, the robot should wait at most until the rate of improvement falls below twice the rate of execution before it begins to act. This allows the robot to better overlap planning and execution. A mathematical justification is given in the next section.

Step Choice Algorithm

For on-the-fly decision making, an action should be taken whenever the expected rate of task accomplishment for both performing the action and optimizing the rest of the plan exceeds the expected rate of task accomplishment for only optimizing.

Computing with an anytime algorithm can be treated as a continuous process, but moving between locations is more discrete: each move is treated as an atomic action that, once started, must be completed. The rate of tour improvement for computing alone, I'(t, n) = dI(t, n)/dt, is given in Equation 7. When the agent chooses to act and compute, it is committing to that choice for the duration of the action. The appropriate rate to use when evaluating this option is therefore not the instantaneous rate but the average rate over the duration of the action. Equation 8 gives the average expected rate of accomplishment for acting and computing; in this equation, Δt is the duration of the action.

Compute-only rate:      I'(t, n) = nABe^{-Bt}        (7)

Compute-and-act rate:   R_exe + (1/Δt) ∫_t^{t+Δt} I'(τ, n-1) dτ        (8)

Equating Equations 7 and 8 and solving for Δt gives an expression for the duration of a move the agent would be willing to accept as a function of the time spent computing. Unfortunately, the resulting expression has no closed-form solution.² If instead the integral in Equation 8 is replaced by a linear approximation, an expression for Δt can be found (Equation 9).

Δt = -t - (1/B) ln[ ((n+1)ABe^{-Bt} - 2R_exe) / ((n-1)AB) ]        (9)

²The exact result involves solving the omega function.

The linear approximation overestimates the rate of computational improvement for the move-and-compute option and biases the agent towards moving. Equation 8 itself only approximates the rate of optimization after taking a step: it underestimates the effect of reducing the size of the optimization problem. Since two-opt is an O(n²) algorithm per exchange, reducing the problem size by one has a more than linear effect. This tends to cancel out the first approximation.

Examining the form of Equation 9, it is clear that Δt is infinite when the argument of the ln function reaches zero. This corresponds to an expected rate of tour improvement of about twice the rate of path execution (Equation 10).

Δt = ∞  ⟺  (n+1)ABe^{-Bt} - 2R_exe = 0  ⟺  I'(t, n) = 2R_exe n/(n+1) ≈ 2R_exe        (10)

At this point, the marginal benefit of only computing is zero and the robot should execute a step of any size remaining in the plan. For this reason, the anytime strategy should wait only until the rate of improvement falls below twice the rate of execution. In fact, the wait need only be long enough for Δt to become larger than the time for the next step. In the range where Δt is defined, increasing the rate of execution (R_exe) increases the size of an acceptable step. Similarly, increasing the rate of computation (B) decreases the size of the acceptable action. In the limit, the acceptable step correctly favours either always computing or always acting.

Does it make sense to perform an action even when the optimizer is currently improving the plan more than twice as fast as the robot can execute it? It does, if the action is small enough. The amount of potential optimization lost by taking an action is limited by the size of the action: at best, the next step in the tour could be reduced to zero time (or cost). Performing small actions does not present a large opportunity cost in terms of potential optimizations lost, and it allows the optimizer to focus its efforts on a smaller problem. Taking actions that fall below the acceptable action size also allows the agent to be opportunistic. If at any point in the tour improvement process the current next move is smaller than the acceptable action size, then the agent can begin moving immediately. In this way, the agent takes advantage of the optimizer's actual performance on the given problem rather than only its expected performance.
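A minimal sketch of the resulting step choice test follows, using the Equation 9 expression as reconstructed above; the variable names are mine and n >= 2 is assumed.

```python
import math

def acceptable_step(t: float, n: int, A: float, B: float, r_exe: float) -> float:
    """Equation 9: the longest move duration worth accepting after t time units
    of two-opt improvement on an n-location tour (n >= 2 assumed).  Returns
    infinity once the improvement rate has fallen to about 2 * r_exe (Eq. 10)."""
    arg = ((n + 1) * A * B * math.exp(-B * t) - 2.0 * r_exe) / ((n - 1) * A * B)
    if arg <= 0.0:
        return math.inf
    return -t - math.log(arg) / B

def take_next_move(next_move_time: float, t: float, n: int,
                   A: float, B: float, r_exe: float) -> bool:
    """Step choice rule: act whenever the next move is no longer than the
    acceptable step size; otherwise keep optimizing the remaining tour."""
    return next_move_time <= acceptable_step(t, n, A, B, r_exe)
```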
[Figure 2: Performance relative to Anytime 2 control; less time is better. X-axis: rate of execution (unit/msec), log scale. Curves: Always Compute, Always Act, Anytime 1, Anytime 2, Anytime Expected, Step Choice.]

Always Compute: Complete the entire two-opt algorithm and then begin execution.
Always Act: Always perform the next action and run the two-opt algorithm in parallel.
Anytime 1: Run the two-opt algorithm and start execution when the expected rate of tour improvement falls below the rate at which the robot can execute the plan.
Anytime 2: Same as Anytime 1, except start execution when the expected rate of tour improvement falls below twice the rate at which the robot can execute the plan.
Anytime Expected: Same as Anytime 1, except start execution when the acceptable step size equals the expected distance between cities.
Step Choice: Run the two-opt algorithm. Whenever the next move is smaller than the acceptable step size, take it.

Figure 3: Tour improvement meta-level control algorithms.

Implementation

In order to empirically evaluate the utility of the control strategy given above, I have implemented a version of the simplified robot courier simulator and performed an initial set of experiments. The six control strategies given in Figure 3 were each run on a sample of 100 randomly generated problems, each with 200 locations to be visited. The experiments were performed on a DECstation 3100 running the Mach 2.6 operating system. Each exchange in the two-opt algorithm is O(n²), where n is the number of locations considered. When optimization and execution are overlapped, the size of the optimization problem shrinks as execution proceeds, so each optimization step cannot be treated as a constant-time operation as was done in [Boddy, 1991]. For these experiments, I measured the actual CPU time used.

The graph in Figure 2 shows the total time of each strategy relative to the Anytime 2 control strategy; points below the x-axis indicate better performance. The graph covers a range of robot speeds where sometimes always acting dominates always computing and sometimes the converse is true. The performance of the always-act and always-compute strategies at either extreme is as expected. The anytime strategies and the step choice algorithm perform well over the entire range of speeds. At relatively fast robot speeds, these strategies correctly emulate the always-act strategy; at relatively slow speeds, they correctly emulate the always-compute strategy. Between these extremes, the Anytime 2 strategy outperforms the Anytime 1 strategy by starting execution earlier and better overlapping execution and optimization. For some ranges of execution speed, the always-act strategy does better than either of these anytime strategies, which do not take full advantage of the potential for overlapping planning with execution. The Anytime Expected strategy does better because it starts execution sooner.

The step choice algorithm produces the best results over the full range of execution speeds. It outperforms the anytime strategies by opportunistically taking advantage of small initial steps in the tour plan to begin execution earlier. It is interesting to note that even at relatively fast robot speeds, the behaviour of the always-act and step choice strategies is not identical. The step choice algorithm will occasionally pause if the next action to be taken is large. This gives the optimizer an opportunity to improve the next step before it is taken. The effect is to reduce the average length of the tour slightly while maintaining the same performance.

None of the algorithms made full use of the computational resources available. For the methods that overlap computation and acting, the computation completed before the robot reached the end of the path; the computational resource was then unused for the remainder of the run. This extra computational resource could have been used to apply more complex optimization routines to the remaining tour to further improve performance.
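For comparison, the start/act criteria of the four non-trivial strategies in Figure 3 can be expressed as simple predicates over the improvement rate (Equation 7) and the acceptable step size (Equation 9). This is an illustrative sketch, reusing acceptable_step from the earlier sketch, and is not the code used in the experiments; Always Compute and Always Act need no test.

```python
import math

def improvement_rate(t: float, n: int, A: float, B: float) -> float:
    """Equation 7: expected rate of two-opt improvement after t time units."""
    return n * A * B * math.exp(-B * t)

# Each predicate answers "should the robot act now?".  A uniform signature is
# used so a simulator could swap strategies; some arguments go unused.
# acceptable_step is the Equation 9 helper defined in the previous sketch.
def anytime1(t, n, A, B, r_exe, next_move_time, expected_move_time):
    return improvement_rate(t, n, A, B) < r_exe

def anytime2(t, n, A, B, r_exe, next_move_time, expected_move_time):
    return improvement_rate(t, n, A, B) < 2.0 * r_exe

def anytime_expected(t, n, A, B, r_exe, next_move_time, expected_move_time):
    return acceptable_step(t, n, A, B, r_exe) >= expected_move_time

def step_choice(t, n, A, B, r_exe, next_move_time, expected_move_time):
    return acceptable_step(t, n, A, B, r_exe) >= next_move_time
```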
Future Work

The greedy meta-level control algorithm works for the simplified robot courier because the cost of undoing an action is the same as the cost of performing the action. In fact, because of the triangle inequality, it is rare that an action has to be fully undone: the shortest route proceeds directly to the next location. At the other end of the spectrum are domains, such as chess playing, where it is impossible to undo a move. In between, there are many domains with a distribution of recovery costs. Consider, as an example, leaving for the office in the morning. Assume it is your first day at work, so there is no pre-compiled, optimized plan yet. The first step in your plan is to walk out the door and then to the car. Suppose that while you are walking, your plan optimizer discovers that it would be better to bring your briefcase. The cost of recovery is relatively cheap: just walk back into the house and get it. On the other hand, if you forgot your keys, you may now be locked out of the house. In this sort of environment, a greedy control algorithm may not be the best choice for deciding when to begin acting.

Related Work

In recent work, Tom Dean, Leslie Pack Kaelbling and others have been modeling actions and domains using Markov models [Dean et al., 1991]. Plans in these representations are policies that map states to actions. Planning consists of modifying the envelope of the current policy and optimizing the current policy. Deliberation scheduling is done while acting, in an attempt to provide the highest utility for the agent.

Work has also been done on deliberation scheduling when the agent has a fixed amount of time before it must begin acting. Such work has not directly addressed the question of when the agent should begin acting; it is assumed that the agent is given an arbitrary deadline for starting to execute. Techniques developed in this paper could be adapted to decide when to start execution, eliminating the need for an arbitrary start deadline.

Other related work includes Russell and Wefald's work on Decision Theoretic A* [Russell and Wefald, 1991]. This algorithm makes use of the initial idealized algorithm and is applicable to domains such as chess.

Acknowledgements

This research was supported in part by NASA under contract NAGW-1175. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of NASA or the U.S. government.

References

[Boddy, 1991] Mark Boddy. Solving time-dependent problems: A decision-theoretic approach to planning in dynamic environments. Technical Report CS-91-06, Department of Computer Science, Brown University, May 1991.

[Dean and Boddy, 1988] Thomas Dean and Mark Boddy. An analysis of time-dependent planning. In Proceedings AAAI-88, pages 49-54. AAAI, 1988.

[Dean et al., 1991] Thomas Dean, Leslie Pack Kaelbling, Jack Kirman, and Ann Nicholson. Planning with deadlines in stochastic domains. In Proceedings, Ninth National Conference on Artificial Intelligence. AAAI, July 1991.

[Lin and Kernighan, 1973] S. Lin and B. W. Kernighan. An effective heuristic for the travelling salesman problem. Operations Research, 21:498-516, 1973.

[Russell and Wefald, 1991] Stuart Russell and Eric Wefald. Do the Right Thing. MIT Press, 1991.
[Simon and Kadane, 1974] Herbert Simon and Joseph Kadane. Optimal problem-solving search: All-or-nothing solutions. Technical Report CMU-CS-74-41, Carnegie Mellon University, Pittsburgh, PA, USA, December 1974.