
From: AAAI Technical Report SS-94-06. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.
Integrating Planning and Execution in Stochastic Domains
Richard Dearden and Craig Boutilier
Department of Computer Science, University of British Columbia
Vancouver, BC, CANADA, V6T 1Z4
{dearden, cebly}@cs.ubc.ca
Abstract
We investigate planning in time-critical domains represented as Markov Decision Processes. To reduce the computational cost of the algorithm we execute actions as we construct the plan, and sacrifice optimality by searching to a fixed depth and using a heuristic function to estimate the value of states. Although this paper concentrates on the search procedure, we also discuss ways of constructing heuristic functions that are suitable for this approach.

1 Introduction

An optimal solution to a decision-theoretic planning problem requires the formulation of a sequence of actions that maximizes the expected value of the sequence of world states through which the planning agent progresses by executing that plan. Dean et al. (1993b; 1993a) have suggested that many such problems can be represented as Markov decision processes (MDPs). This allows the use of dynamic programming techniques such as value or policy iteration (Howard 1971) to compute optimal policies or courses of action. Indeed, such policies solve the more general problem of determining the best action for every state. Unfortunately, this optimality and generality comes at great computational expense.

Dean et al. (1993b; 1993a) have proposed a planning method that relaxes these requirements. An envelope or subset of states that might be relevant to the planning problem at hand (e.g., given particular initial and goal states) is constructed, and an optimal policy is computed for this restricted space in an anytime fashion. Clearly, optimality is sacrificed since important states might lie outside the envelope, as is generality, for the policy makes no mention of these ignored states. In (Dean et al. 1993b) it is suggested that domain-specific heuristics will aid in initial envelope selection and envelope alteration.

We propose an alternative method for dealing with Markov decision models in a real-time environment. We suggest that MDPs be explicitly viewed as search problems. Real-time constraints can be incorporated by restricting the search horizon. This is the basic idea behind, for example, Korf's real-time heuristic search algorithm. In stochastic domains there is another important reason for interleaving execution into the planning process, namely, to restrict the search space to the actual outcomes of probabilistic actions. In particular, once a certain action is deemed best (for a given state) it should be executed and its outcome observed. Subsequent search for the best next action can proceed from the actual outcome, ignoring other unrealized outcomes of that action.

In general, a fixed-depth search will tend to be greedy, choosing actions that provide immediate reward at the expense of long-term gain. To alleviate this problem we assume a heuristic function that estimates the value of each state, accounting for future states that might be reached in addition to that state's immediate reward. This prevents (to some extent) the problem of globally suboptimal choices due to finite horizon effects. Knowledge of certain properties of the heuristic function allows the search tree to be pruned. We describe one method of constructing heuristic functions that allows this information to be easily determined. This construction also produces default actions for each state, in essence generating a reactive policy. Our search procedure can be viewed as using deliberation to refine the reactive strategy.

In the next section we describe MDPs and a sample representation for this decision model. Section 3 describes the basic algorithm, including the search algorithm and the interleaving of search and execution, as well as possible pruning methods. Section 4 discusses some ways of constructing the heuristic evaluation functions. Section 5 examines the computational cost of the algorithm and describes some preliminary experimental results.

2 The Decision Model

Let S be a finite set of world states. In many domains the states will be the models (or worlds) associated with some logical language, so |S| will be exponential in the number of atoms generating this language. Let A be a finite set of actions available to an agent. An action takes the agent from one world to another, but the result of an action is known only with some probability. An action may then be viewed as a mapping from S into probability distributions over S. We write Pr(s1, a, s2) to denote the probability that s2 is reached given that action a is performed in state s1 (embodying the usual Markov assumption). We assume that an agent, once it has performed an action, can observe the resulting state; hence the process is completely observable. Uncertainty in this model results only from the outcomes of actions being probabilistic, not from uncertainty about the state of the world. We assume a real-valued reward function R, with R(s) denoting the (immediate) utility of being in state s. For our purposes an MDP consists of S, A, R and the set of transition distributions {Pr(·, a, ·) : a ∈ A}.
Figure 1: An example domain presented as STRIPS-style action descriptions. Note that HUC and HRC are HasUserCoffee and HasRobotCoffee respectively. (The table lists, for each of the actions Move, BuyCoffee, GetUmbrella and DelCoffee, its discriminants, the probability of each effect, and the effect itself.)

Figure 2: An example of a reward function for the coffee delivering robot domain.

    Conditions                   Reward
    HasUserCoffee, ¬Wet          1.0
    HasUserCoffee, Wet           0.8
    ¬HasUserCoffee, ¬Wet         0.2
    ¬HasUserCoffee, Wet          0.0

A control policy π is a function π : S → A. If this policy is adopted, π(s) is the action an agent will perform whenever it finds itself in state s. Given an MDP, an agent ought to adopt an optimal policy that maximizes the expected rewards accumulated as it performs the specified actions. We concentrate here on discounted infinite horizon problems: the value of a reward is discounted by some factor β (0 < β < 1) for each step in the future, and we want to maximize the expected accumulated discounted rewards over an infinite time period. Intuitively, a DTP problem can be viewed as finding a good (or optimal) policy.

The expected value of a fixed policy π at any given state s is specified by

    V_π(s) = R(s) + β Σ_t Pr(s, π(s), t) V_π(t)

Since the factors V_π(s) are mutually dependent, the value of π at any initial state s can be computed by solving this system of linear equations. A policy π is optimal if V_π(s) ≥ V_π'(s) for all s ∈ S and policies π'.
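Because the system above is linear, V_π can be computed directly. The fragment below is a minimal illustration of that computation, not part of the paper; it assumes the transition matrix under π has already been assembled as a |S| x |S| array, and the names are ours.

    import numpy as np

    def evaluate_policy(R, P_pi, beta=0.9):
        """Solve V = R + beta * P_pi V for a fixed policy pi.

        R    : length-|S| vector of immediate rewards R(s)
        P_pi : |S| x |S| array with P_pi[s, t] = Pr(s, pi(s), t)
        beta : discount factor, 0 < beta < 1
        """
        n = len(R)
        # (I - beta * P_pi) V = R is the system of linear equations in the text.
        return np.linalg.solve(np.eye(n) - beta * P_pi, R)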
Although we represent actions as sets of stochastic transitions from state to state, we expect that domains and actions will usually be specified in a more traditional form for planning purposes. Figure 1 shows a stochastic variation of STRIPS rules (Kushmerick, Hanks and Weld 1993) for a domain in which the robot must deliver coffee to the user. An effect E is a set of literals. If we apply E to some state s, the resulting state satisfies all the literals in E and agrees with s for all other literals. The probabilistic effect of an action is a finite set E1, ..., En of effects, with associated probabilities p1, ..., pn where Σ pi = 1.

Since actions may have different results in different contexts, we associate with each action a finite set D1, ..., Dn of mutually exclusive and exhaustive sentences called discriminants, with probabilistic effects E1, ..., En. If the action is performed in a state s satisfying Di, then a random effect from Ei is applied to s. For example, in Figure 1, if the agent carries out the GetUmbrella action in a state where Office is true, then with probability 0.9 Umbrella will be true, and every other proposition will remain unchanged, and with probability 0.1 there will be no change of state. For convenience, we may also write actions as sets of action aspects, as illustrated for the Move action in Figure 1 (Boutilier and Dearden 1994). The action has two descriptions which represent two independent sets of discriminants; the cross product of the aspects is used to determine the actual effects. For example, if Rain and Office are true and a Move action is performed, then with probability 0.81 ¬Office, Wet will result, and so on. This representation of domains in terms of propositions also provides a natural way of expressing rewards. Figure 2 shows a representation of rewards for this domain. Only the propositions HasUserCoffee and Wet affect the reward for any given state.

This framework is flexible enough to allow a wide variety of different reward functions. One important situation is that in which there is some set Sg ⊆ S of goal states, and the agent tries to reach a goal state in as few moves as possible.¹ Since we are interleaving plan construction and plan execution, the time required to plan is significant when measuring success; but as a first approximation we can represent this type of situation with the following reward function (Dean et al. 1993b): R(s) = 0 if s ∈ Sg and R(s) = -1 otherwise.

¹If a "final" state stops the process, we may use self-absorbing states.
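To make the representation concrete, the sketch below encodes the GetUmbrella action from the example above and the reward function of Figure 2 in a small Python form. The encoding (dictionaries of propositions, lambda discriminants) is ours, chosen only for illustration; the ¬Office entry for GetUmbrella is an assumption, since Figure 1 is not fully legible in this copy.

    import random

    # A state maps proposition names to truth values. Each action pairs a
    # discriminant (a test on the state) with a list of (probability, effect)
    # entries; an effect is a set of literals, and all other propositions persist.
    GET_UMBRELLA = [
        (lambda s: s["Office"], [(0.9, {"Umbrella": True}), (0.1, {})]),
        (lambda s: not s["Office"], [(1.0, {})]),   # assumed: no effect outside the office
    ]

    def apply_action(state, action):
        """Sample a successor: find the matching discriminant, draw one of its effects."""
        for discriminant, effects in action:
            if discriminant(state):
                r, acc = random.random(), 0.0
                for p, effect in effects:
                    acc += p
                    if r <= acc:
                        return {**state, **effect}   # effect literals override, rest persists
        return dict(state)                           # discriminants are exhaustive in principle

    def reward(state):
        """The reward function of Figure 2: only HasUserCoffee and Wet matter."""
        table = {(True, False): 1.0, (True, True): 0.8,
                 (False, False): 0.2, (False, True): 0.0}
        return table[(state["HasUserCoffee"], state["Wet"])]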
3 The Algorithm

Our algorithm for integrating planning and execution proceeds by searching for a best action, executing that action, observing the result of this execution, and iterating. The underlying search algorithm constructs a partial decision tree to determine the best action for the current state (the root of this tree). We assume the existence of a heuristic function that estimates the value of each state (such heuristics are described in Section 4). The search tree may be pruned if certain properties of the heuristic function are known. This search can be terminated when the tree has been expanded to some specified depth, when real-time pressures are brought to bear, or when the best action is known (e.g., due to complete pruning, or because the best action has been cached for this state).

Once the search algorithm selects a best action for the current state, the action is executed and the resulting state is observed. By observing the new state, we establish which of the possible action outcomes actually occurred. Without this information, the search for the best next action would be forced to account for every possible outcome of the previous action. By interleaving execution and observation with search, we need only search from the actual resulting state.
Figure 3: An example of a two-level search for the best action from state s. (The tree lists, for each branch, the initial state, first action, first state, second action, second state, the utility of the second action given the first state, and the best action and value for each state. With rewards R(t) = R(u) = 0.5 and a discount factor of 0.9, the search assigns V(t) = 2.39 and V(u) = 1.58 under action A, V(v) = 2.62 and V(w) = 3.25, and selects action B at s with V(s) = 2.64.)
In skeletal form, the algorithm is as follows. We denote by s the current state, and by A*(t) the best action for state t.²

1. If state s has not been previously visited, build a partial decision tree of all possible actions and their outcomes beginning at state s, using some criteria to decide when to stop expanding the leaves of the tree. Using the partial tree and the heuristic function, calculate the best action A*(s) ∈ A to perform in state s. (This value may be cached in case s is revisited.)

2. Execute A*(s).

3. Observe the actual outcome of A*(s). Update s to be this observed state (the state is known with certainty, given the assumption of complete observability).

4. Repeat.

The point at which the algorithm stops depends on the characteristics of the domain. For example, if there are goal states, and the agent's task is to reach one, planning may continue until a goal state is reached. In process-oriented domains, the algorithm continues indefinitely. In our experiments, we have typically run the algorithm until a goal state is reached (if one exists) or for a constant number of steps.

²Initially A*(t) might be undefined for all t. However, if the heuristic function provides default reactions (see Section 4), it is useful to think of these as the best actions determined by a depth-0 search.
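The four steps can be phrased directly as a loop. The sketch below is only an illustration of that control structure; choose_action, execute, and is_goal are assumed interfaces (the first would be a depth-limited search such as the one described in Section 3.1), and none of the names come from the paper.

    def plan_and_execute(choose_action, execute, is_goal, s0, max_steps=100):
        """Interleave search with execution, caching A*(s) for revisited states.

        choose_action : s -> best action A*(s), e.g. a depth-limited search
        execute       : action -> the actually observed successor state
        is_goal       : s -> bool; always False in process-oriented domains
        """
        best = {}                        # cache of A*(s) for states already planned for
        s = s0
        for _ in range(max_steps):       # or loop until a goal state is reached
            if is_goal(s):
                break
            if s not in best:            # step 1: search only if s has not been visited
                best[s] = choose_action(s)
            s = execute(best[s])         # steps 2-3: execute A*(s), observe the outcome
        return s                         # step 4 (repeat) is the loop itself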
3.1 Action Selection

Here we discuss step one of the high-level algorithm given in Section 3. To select the best action for a given state, the agent needs to estimate the value of performing each action. In order to do this, it builds a partial decision tree of actions and
resulting states, and uses the tree to approximate the expected utility of each action. This search technique is related to the *-minimax algorithm of Ballard (1983). As we shall see in Section 3.2, there are similarities in the way we can prune the search tree as well. Figure 3 shows a partial tree of actions two levels deep. From the initial state s, if we perform action A, we reach state t with probability 0.8, and state u with probability 0.2. The agent expands these states with a second action and reaches the set of second states. To determine the action to perform in a given state, the agent estimates the expected utility of each action. If s and t are states, β is the factor by which the reward for future states is discounted, and Ṽ(t) is the heuristic function at state t, the estimated expected utility of action Ai is:

    U(Ai | s) = Σ_{t ∈ S} Pr(s, Ai, t) V(t)

where

    V(s) = Ṽ(s)                                  if s is a leaf node
    V(s) = R(s) + β max{U(Aj | s) : Aj ∈ A}      otherwise

Figure 3 illustrates the process with a discounting factor of 0.9. The utility of performing action A if the world were in state t is the weighted sum of the values of being in states x and y, which is 2.1. Since the utility of action B is 0.3, we select action A as the best (given our current information) for state t, and make V(t) = R(t) + β U(A | t) = 2.39. The utility of action A in state s is Pr(s, A, t) V(t) + Pr(s, A, u) V(u), giving U(A | s) = 2.23. This is lower than U(B | s), so we select B as the best action for state s, record the fact that A*(s) is B, and execute B. By observing the world, the agent now knows whether state v or w is the new state, and can build on its previous tree, expanding the appropriate branch to two levels and determining the best action for the new state. Notice that if (say) v results from action B, the tree rooted at state w can safely be ignored: the unrealized possibility can have no further impact on updated expected utility (unless w is revisited via some path).
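Read operationally, the two equations above give a small recursive procedure. The following sketch is our illustration of that recursion, not the authors' code; mdp.actions, mdp.outcomes(s, a) (returning (t, Pr(s, a, t)) pairs), mdp.reward(s), and heuristic(s) are assumed interfaces.

    def utility(mdp, heuristic, s, a, depth, beta=0.9):
        """U(a|s): probability-weighted value of the outcomes of a in s."""
        return sum(p * value(mdp, heuristic, t, depth - 1, beta)
                   for t, p in mdp.outcomes(s, a))

    def value(mdp, heuristic, s, depth, beta=0.9):
        """V(s): the heuristic at the leaves, R(s) + beta * max_a U(a|s) above them."""
        if depth == 0:
            return heuristic(s)
        return mdp.reward(s) + beta * max(utility(mdp, heuristic, s, a, depth, beta)
                                          for a in mdp.actions)

    def best_action(mdp, heuristic, s, depth=2, beta=0.9):
        """A*(s): the action whose estimated expected utility is highest."""
        return max(mdp.actions, key=lambda a: utility(mdp, heuristic, s, a, depth, beta))

With depth=2 this recursion reproduces the worked numbers of Figure 3, for example V(t) = 0.5 + 0.9 x 2.1 = 2.39 and U(A | s) = 0.8 x 2.39 + 0.2 x 1.58 ≈ 2.23.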
If the agent finds itself in a state visited earlier, it may use the previously calculated and cached best action A*(s). This allows it to avoid recalculating visited states, and will considerably speed planning if the same or related problems must be solved multiple times, or if actions naturally lead to "cycles" of states. Eventually, A*(s) could contain a policy for every reachable state in S, removing the need for any computation.

In Figure 3, the tree is expanded to depth two. The depth can obviously vary depending on the available time for computation. The deeper the tree is expanded, the more accurate the estimates of the utilities of each action tend to be, and hence the more confidence we should have that the action selected approaches optimality.

If there are m actions, and the number of states that could result from executing an action is on average b, then a tree of depth one will require O(mb) steps, two levels will require O(m^2 b^2), and so on. The potentially improved performance of a deeper search has to be weighed against the time required to perform the search (Russell and Wefald 1991). Rather than expand to a constant depth, the agent could instead keep expanding the tree until the probability of reaching the state being considered drops below a certain threshold. This approach may work well in domains where there are extreme probabilities or utilities. Pruning of the search tree may also exploit this information.

Figure 4: Two kinds of pruning where Ṽ(s) ≤ 10 and is accurate to ±1. In (a), utility pruning, the trees at U and V need not be searched, while in (b), expectation pruning, the trees below T and U are ignored, although the states themselves are evaluated.

3.2 Techniques for Limiting the Search

As it stands, the search algorithm performs in a very similar way to minimax search. Determining the value of a state is analogous to the MAX step in minimax, while calculating the value of an action can be thought of as an AVERAGE step, which replaces the MIN step (see also (Ballard 1983)). When the search tree is constructed, we can use techniques similar to those of Alpha-Beta search to prune the tree and reduce the number of states that must be expanded. There are two applicable pruning techniques. To make our description clearer, we will treat a single ply of search as consisting of two steps: MAX, in which all the possible actions from a state are compared, and AVERAGE, where the outcomes of a particular action are combined. Two sorts of cuts can be made in the search tree. If we know bounds on the maximum and/or the minimum values of the heuristic function, utility cuts (much like α and β cuts in minimax search) can be used. If the heuristic function is reasonable, the maximum and minimum values for any state can be bounded easily using knowledge of the underlying decision process. In particular, with maximum and minimum immediate rewards of R+ and R-, the maximum and minimum expected values for any state are bounded by R+/(1 - β) and R-/(1 - β), respectively. If we have bounds on the error associated with the heuristic function, expectation cuts may be applied. These are illustrated with examples.

Utility Pruning: We can prune the search at an AVERAGE step if we know that no matter what the value of the remaining outcomes of this action, we can never exceed the utility of some other action at the preceding MAX step. For example, consider the search tree in Figure 4(a). We assume that the maximum value the heuristic function can take is 10. When evaluating action b, since we know that the value of the subtree rooted at T is 5, and the best that the subtrees below U and V could be is 0.1 x 10 + 0.2 x 10 = 3, the total cannot be larger than 3.5 + 1 + 2 = 6.5, so neither the tree below U nor that below V is worth expanding. This type of pruning requires that we know in advance the maximum value of the heuristic function. The minimum value can be used in a more restricted fashion.

Expectation Pruning: For this type of pruning, we need to know the maximum error associated with the heuristic function (see (Boutilier and Dearden 1994) for a way of estimating this value). If we are at a maximizing step and, even taking into account the error in the heuristic function, the action we are investigating cannot be as good as some other action, then we do not need to expand this action further. For example, consider Figure 4(b), where we assume that Ṽ(S) is within ±1 of its true (optimal) value. We have determined that U(a | S) = 7, therefore any potentially better action must have a value greater than 6. Since p(S, b, T)Ṽ(T) + p(S, b, U)Ṽ(U) < 4, even if b is as good as possible (given these estimates), it cannot achieve this threshold, so there is no need to search further below T and U.
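Both cuts can be added to the AVERAGE step of the recursion sketched in Section 3.1. The version below is our own rendering under the bounds just described (v_max = R+/(1 - β) for utility cuts, a heuristic error bound eps for expectation cuts); the threshold used for the expectation cut is one conservative choice, not necessarily the paper's exact rule.

    def utility_with_cuts(outcomes, child_value, heuristic, best_so_far,
                          v_max=None, eps=None):
        """U(a|s) for one action with utility and expectation cuts.

        outcomes    : list of (t, p) pairs for the action being evaluated
        child_value : t -> backed-up value of t (e.g. the value() routine above)
        heuristic   : t -> heuristic estimate of t's value
        best_so_far : utility of the best competing action at this MAX step
        """
        outcomes = sorted(outcomes, key=lambda tp: -tp[1])    # most probable first
        # Expectation cut: if even optimistic heuristic estimates of every outcome
        # cannot make this action competitive, settle for the heuristic estimates
        # and do not expand below the outcome states at all.
        if eps is not None:
            optimistic = sum(p * (heuristic(t) + eps) for t, p in outcomes)
            if optimistic < best_so_far - eps:
                return sum(p * heuristic(t) for t, p in outcomes)
        total, remaining = 0.0, 1.0
        for t, p in outcomes:
            # Utility cut: even if every unevaluated outcome achieved v_max, this
            # action could not beat the best one already found, so stop early.
            if v_max is not None and total + remaining * v_max <= best_so_far:
                return total + remaining * v_max              # an upper bound; never selected
            total += p * child_value(t)
            remaining -= p
        return total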
7~inducesa partition of the state spaceinto sets of states,
or clusters whichagree on the truth values of propositions
in 7~. Furthermore,the actions fromthe original ’concrete’
state spaceapply directly to these clusters. This is due to
Although
wehavenot yet empiricallyinvestigatedthe effects
the fact that eachactioneither mapsall the states in a cluster
of pruningonthe size of the searchtree, utility pruningcanbe to the samenewcluster, or changesthe state, but leaves the
expectedto produceconsiderablesavings.It will beespecially cluster unchanged.Thesetwofacts allowus to performpolicy
valuable whennodesare orderedfor expansionaccordingto
iteration onthe abstractstate space.Thealgorithmis:
their probabilityof actuallyoccurringgivena specificaction.
1. Constructthe set of relevantpropositions7~. Theactions
Expectationpruningrequires a modificationof the search
are left unchanged,
but effects on propositionsnot in
algorithmto checkall outcomesof an action to see if the
are ignored.
weightedaverageof their estimatedvaluesis sufficient to justify continuednodeexpansion.This meansthat the heuristic
2. Use~ to partition the state spaceinto clusters.
value of sibling nodes mustbe checkedbefore expandinga
3. Usethe policyiteration algorithmto generatean abstract
givennode.Takinginto accountthe cost of this, andthe diffipolicy for the abstract state space. For details of this
culty of producingtight boundson V, this typeof pruningmay
algorithmsee (Howard1971;Deanet al. 1993b).
not be cost-effective in somedomains.However,the method
of generatingheuristic functions in (Boutilier andDearden
Byaltering the numberof reward-changingpropositions
1994)(see the next section for a brief discussion)produces in ~, wecan vary its size, and hencethe granularity and
just such bounds.Expectationpruningis closely related to
accuracyof the abstract policy. This allowsus to investigate
whatKoff(1990)calls alpha-pruning.Thedifferenceis that
the tradeoff betweentime spent buildingthe abstract policy
whileKorfreliesona propertyof the heuristicthat it is always andits degreeof optimality. Thepolicy iteration algorithm
increasing,werely on an estimateof the actual error in the
also computes
the valueof eachcluster in the abstract space.
heuristic.
This value can be usedas a heuristic estimate of the value
of the cluster’s constituent states. Oneadvantageof this
approachis that it allowsus to accuratelydeterminebounds
4 GeneratingHeuristic Functions
on the differencebetweenthe heuristic value for anystate,
Wehaveassumedthe existenceof a heuristic function above. andits valueaccordingto an optimalpolicy -- see (Boutilier
Wenowbriefly describe somepossible methodsfor generating and Dearden1994)for details. As shownabove,this fact is
these heuristics. Theproblem
is to build a heuristic function very usefulfor pruningthe searchtree. Asecondadvantage
of
whichestimatesthe valueof eachstate as accuratelyas pos- this methodfor generatingheuristic valuesis that it provides
sible with a minimum
of computation.In somecases such a
defaultreactionsfor eachstate.
heuristic mayalready be available. Here wewill sketch an
approach
for domainswith certain characteristics, andsuggest 4.2 Other Approaches
ideas for other domains.
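The fixed-point definition of R translates directly into code. The sketch below assumes a simplified encoding in which each discriminant of each action is given as a pair of proposition sets (discriminant, effect); the encoding and names are ours. For the coffee domain with IR = {HasUserCoffee}, a closure of this kind is what produces the set R = {HasUserCoffee, Office, HasRobotCoffee, HasRobotSnack} used in Section 5.

    def relevant_propositions(immediately_relevant, action_rules):
        """Smallest R with IR ⊆ R that is closed under the rule: if a proposition
        in R appears in an effect list, the whole discriminant joins R.

        action_rules: iterable of (discriminant_props, effect_props) set pairs,
                      one entry per discriminant of each action.
        """
        R = set(immediately_relevant)
        changed = True
        while changed:
            changed = False
            for discriminant, effect in action_rules:
                if (effect & R) and not (discriminant <= R):
                    R |= discriminant
                    changed = True
        return R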
4.2 Other Approaches

The algorithm described above for building the heuristic function is certainly not appropriate in all domains. Certain domains are more naturally represented by other means (navigation is one example). In some cases abstractions of actions and states may already be available (Tenenberg 1991).

For robot navigation tasks, an obvious method for clustering states is based on geographic features. Nearby locations can be clustered together into states that represent regions of the map, but providing actions that operate on these regions is more complex. One approach is to assume some probability distribution over locations in each region, and build abstract actions as weighted averages over all locations in the region of the corresponding concrete action. The difficulty with this approach is that it is computationally expensive, requiring that every action in every state be accounted for when constructing the abstract actions.

If abstract actions (possibly macro-operators (Fikes and Nilsson 1971)) are already available, we need to find clusters to which the actions apply. In many cases this may be easy, as the abstract actions may treat many states in exactly the same way, hence generating a clustering scheme. In other domains, a similar weighted average approach may be needed.
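One reading of the weighted-average construction for navigation domains is sketched below; every name here (the region map, the within-region distribution) is an assumption for illustration, since the paper gives no implementation, and the loop over every concrete location is exactly the expense the text warns about.

    def abstract_transition(region, outcomes, region_of, weight):
        """Aggregate one concrete action into a transition distribution over regions.

        region    : iterable of concrete locations forming one cluster
        outcomes  : loc -> list of (next_loc, prob) pairs for the concrete action
        region_of : location -> the region (cluster) containing it
        weight    : loc -> assumed probability of being at loc within the region
        """
        dist = {}
        for loc in region:
            w = weight(loc)                           # within-region distribution
            for nxt, p in outcomes(loc):
                r = region_of(nxt)
                dist[r] = dist.get(r, 0.0) + w * p    # aggregate mass by destination region
        return dist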
Figure 5: Timing (a) and value (b) of policy for a 256 state, six action domain for various depths of search. Table (c) shows a comparison of the policies induced by search to various depths with the optimal policy. Panel (a) gives the time per action in seconds as a function of search depth: 0.015 at depth 1, 0.081 at depth 2, 0.662 at depth 3, and 5.111 at depth 4. Panel (b) plots the reward accumulated over 20 actions against search depth, with the value of the optimal policy shown for comparison. Table (c):

                             Abstract   1 Step   2 Step   3 Step   4 Step   5 Step
                             Policy     Search   Search   Search   Search   Search
    No. of Errors            200        144      136      136      21       21
    Total Error              753.773    212.528  161.068  161.068  11.160   8.834
    Max. Error               5.4076     3.9607   2.7372   2.7372   2.1045   2.1045
    Average Error            2.956      0.830    0.629    0.629    0.044    0.034
    Average non-zero Error   3.769      1.476    1.184    1.184    0.531    0.421

5 Some Preliminary Results

We are currently exploring, both theoretically and experimentally, the tradeoffs involved in the interleaving of planning
and execution in this framework. We can measure the complexity of the algorithm as presented. Let m = |A| be the number of actions. We will assume that when constructing the search tree for a state, we explore to depth d, and that the branching factor for each action (the maximum number of outcomes for the action in any given state) is at most b.³

³We ignore preconditions for actions here, assuming that an action can be "attempted" in any circumstance. However, preconditions may play a useful role by capturing user-supplied heuristics that filter out actions in situations in which they ought not (rather than cannot) be attempted. This will effectively reduce the branching factor of the search tree.

The cost of calculating the best action for a single state is m^d b^d. The cost per state is slightly less than this since we can reuse our calculations, but the overall complexity is O(m^d b^d). The actual size of the state space has no effect on the algorithm; rather it is the number of states visited in the execution of the plan that affects the cost. This is clearly domain dependent, but in most domains should be considerably lower than the total number of states. Most importantly, the complexity of the algorithm is constant and execution time (per action) can be bounded for a fixed branching and search depth. By interleaving execution with search, the search space can be drastically reduced. When planning for a sequence of n actions the execution algorithm is linear in n (with respect to the factor m^d b^d); a straightforward search without execution for the same number of actions is O(b^n).

Some experimental tests provide a preliminary indication that this framework may be quite valuable. To generate the results discussed in this section, we used a domain based on the one described in Figures 1 and 2, but with another item
(snack) that the robot must deliver, and a robot that only carries one thing at a time. We constructed the heuristic function using the procedure described in Section 4, with HasUserCoffee as the only immediately relevant proposition, ignoring the proposition HasUserSnack; thus, R = {HasUserCoffee, Office, HasRobotCoffee, HasRobotSnack}.

The domain contains 256 states and six actions. All timing results were produced on a Sun SPARCstation 1. Computing an optimal policy by policy iteration required 130.86 seconds, while computing a sixteen state abstract policy (again using policy iteration) for the heuristic function required 0.22 seconds. Figure 5(a) shows the time required for searching as a function of search depth, while (b) shows the average accumulated reward after performing twenty actions from the state (¬HasRobotCoffee, ¬HasUserCoffee, ¬HasRobotSnack, ¬HasUserSnack, Rain, ¬Umbrella, Office, ¬Wet).⁴

As the graphs show, a look-ahead of four was necessary to produce an optimal policy from the abstract policy for this particular starting point, but even this much look-ahead only required 5.111 seconds to plan for each action, for a total (13 step) planning time of 66.44 seconds, about half that of computing the optimal policy using policy iteration. While policy iteration is clearly superior if plans are required for all possible starting states, by sacrificing this generality to solve a specific problem the search method provides considerable computational saving. We should point out that no pruning has been performed in this example. Furthermore, as mentioned above, policy iteration will require more time as the state space grows, whereas our search method will not.

⁴This is the state that requires the longest sequence of actions to reach a state with maximal utility; i.e., the state requiring the "longest optimal plan."
Figure 5(c) compares the values of the policies that each depth of searching produced with the value of the optimal policy over the entire state space. Since the range of expected values of states in this domain is 0 to 10, the average errors produced by the algorithm are quite small, even with relatively little search. The table suggests that searching to depth n is at least as good as searching to depth n - 1 (although not always better), and in this domain is considerably better than following the abstract policy. In none of the domains we have tested has searching deeper produced a worse policy, although this may not be the case in general (see (Pearl 1984) for a proof of this for minimax search).
6 Conclusions
We have proposed a framework for planning in stochastic domains. Further experimental work needs to be done to demonstrate the utility of this model. In particular, we require an experimental comparison of the search-execution method and straightforward search, as well as further comparison to exact methods like policy iteration and heuristic methods like the envelope approach of Dean et al. (1993b). We intend to use the framework to explore a number of tradeoffs (e.g., as in (Russell and Wefald 1991)). In particular, we will look at the advantages of a deeper search tree, and balance this with the cost of building such a tree, and at the tradeoff between computation time and improved results when building the heuristic function. To illustrate these ideas we observe that if the depth of the search tree is 0, this corresponds to a reactive system where the best action for each state is obtained from the abstract policy. If each cluster for the abstract policy contains a single state, we have optimal policy planning. The usefulness of these tradeoffs will vary when planning in different domains.

Some of the characteristics of domains that will affect our choices are:
• Time: for time-critical domains it may be better to limit time spent deliberating (perhaps adopting a reactive strategy based on the heuristic function). A more detailed heuristic function and a smaller search tree may be appropriate.

• Continuity: if actions have similar effects in large classes of states and most of the goal states are fairly similar, we can use a less detailed heuristic function (more abstract policy).

• Fan-out: if there are relatively few actions, and each action has a small number of outcomes, we can afford to increase the depth of the search tree.

• Plausible goals: if goal states are hard to reach, a deeper search tree and a more detailed heuristic function may be necessary.

• Extreme probabilities: with extreme probabilities it may be worth only expanding the tree for the most probable outcomes of each action. This seems to bear some relationship to the envelope reconstruction phase of the recurrent deliberation model of Dean et al. (1993a).
In the future we hope to continue our experimental investigation of the algorithm to look at the efficacy of pruning the search tree, as well as experimenting with variable-depth search and the possibility of improving the heuristic function by recording newly computed values of states, rather than best actions. This last idea will allow us to investigate the tradeoff between speed (when the agent is in a previously visited state) and accuracy when selecting actions. We also hope to investigate the performance of this approach in other types of domains, including high-level robot navigation and scheduling problems, and to further investigate the theoretical properties of the algorithm, especially through analysis of the value of deeper searching in producing better plans (Pearl 1984). Our model can be extended by relaxing some of the assumptions incorporated into the decision model. Semi-Markov processes as well as partially observable processes will require interesting modifications of our model. Finally, we must investigate the degree to which the restricted envelope approach may be meshed with our model.
Acknowledgments
Discussions with Moisés Goldszmidt have considerably influenced our view and use of abstraction for MDPs.
References
Ballard, B. W. 1983. The *-minimax search procedure for trees containing chance nodes. Artificial Intelligence, 21:327-350.

Boutilier, C. and Dearden, R. 1994. Using abstractions for decision-theoretic planning with time constraints. (submitted).

Dean, T., Kaelbling, L. P., Kirman, J., and Nicholson, A. 1993a. Deliberation scheduling for time-critical decision making. In Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence, pages 309-316, Washington, D.C.

Dean, T., Kaelbling, L. P., Kirman, J., and Nicholson, A. 1993b. Planning with deadlines in stochastic domains. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pages 574-579, Washington, D.C.

Fikes, R. E. and Nilsson, N. J. 1971. Strips: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2:189-208.

Howard, R. A. 1971. Dynamic Probabilistic Systems. Wiley, New York.

Korf, R. E. 1990. Real-time heuristic search. Artificial Intelligence, 42:189-211.

Kushmerick, N., Hanks, S., and Weld, D. 1993. An algorithm for probabilistic planning. Technical Report 93-06-04, University of Washington, Seattle.

Pearl, J. 1984. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Reading, Massachusetts.

Russell, S. J. and Wefald, E. 1991. Do the Right Thing: Studies in Limited Rationality. MIT Press, Cambridge.

Tenenberg, J. D. 1991. Abstraction in planning. In Allen, J. F., Kautz, H. A., Pelavin, R. N., and Tenenberg, J. D., editors, Reasoning about Plans, pages 213-280. Morgan-Kaufmann, San Mateo.