Two Techniques for Tractable Decision-Theoretic Planning*†

From: AAAI Technical Report SS-94-06. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

Gary H. Ogasawara
Stuart J. Russell
Computer Science Division
University of California
Berkeley, CA 94720
garyo@cs.berkeley.edu, russell@cs.berkeley.edu
Abstract

Two techniques to provide inferential tractability for decision-theoretic planning are discussed. Knowledge compilation is a speedup technique that pre-computes intermediate quantities of a domain model to yield a more efficiently executable model. The basic options for compiling decision-theoretic models are defined, leading to the most compiled model in the form of parameterized, condition-action decision rules. We describe a general compilation algorithm that can be applied to dynamic influence diagrams for sequential decision-making. Decision-theoretic metareasoning is a control technique that regulates online replanning. This algorithm focuses on the most promising areas to replan and considers the time cost of replanning.
1 Introduction

Applying decision-theoretic reasoning with action sequences rather than single actions can be used to find an optimal plan. However, the standard method of computing the expected utility of each action sequence and then choosing the sequence of maximum utility as the optimal plan is generally intractable. This paper describes two techniques that we have been investigating to ensure inferential tractability. The first technique, knowledge compilation, is done at the pre-processing stage to reduce the complexity of the model; the second technique, decision-theoretic metareasoning, exerts control upon the planning and execution during run-time.

Knowledge compilation pre-computes intermediate quantities of a domain model to yield a more efficiently executable model (e.g., [2, 10, 7]). We describe a knowledge compilation algorithm that generates a compiled model that is parameterized by prior probabilities of the uncompiled model. Because it is parameterized, the same compiled model can be used repeatedly in sequential decision-making.

Decision-theoretic metareasoning is used to control the planning process. At each decision point, a default plan is available. Metareasoning computes an estimate of the expected value of replanning. If the value of replanning is positive, the replanning algorithm is invoked; otherwise, the next step of the default plan is executed. Because a utility measure of the time cost of replanning is included, timely responses can be ensured.

*This research was supported by NSF grants IRI-8903146, INT-9207213, and CDA-8722788, and the Lockheed AI Center.
†Also affiliated with the Lockheed AI Center, Palo Alto, CA.
2 Knowledge Compilation
The terms will be clarified below, but in brief, the knowledge compilation problem is shown in Table 1.

Given: an uncompiled model with the following properties:
• It is represented by a dynamic influence diagram that allows temporal projection from time 0 to time k.
• Prior probabilities of time 0 are parameterized for generality.
• Evidence variables are specified.
• The utility node of the dynamic influence diagram is specified.

Determine: a decision policy with the following properties:
• Completeness: There is a set of IF-THEN, "condition-action" rules where a rule applies to every possible world state.
• Behavioral equivalence: The recommended action is the same as the best action that would be recommended by the uncompiled model.
• Parameterized conditions: The conditions are a conjunction of evidence values and/or probability-inequality conditions on the prior probabilities of the uncompiled model.

Table 1: The knowledge compilation problem
2.1 Input: Uncompiled Model

Dynamic influence diagrams We adopt the influence diagram as the basic representational tool. An influence diagram [6] (e.g., Figure 1) is a graphical knowledge representation that accounts for the uncertainty of states and provides inference algorithms to update beliefs given new evidence. An influence diagram represents actions as decision nodes, outcome states (including goals) as probabilistic state nodes, and preferences using value nodes. Depending on the node types which they connect, the arcs in the influence diagram represent probabilistic, informational, or value dependence.
Figure 1: A sample influence diagram. (A: information about the current state; B: effects of actions; C: utilities of the projected state; D: best action choice from the current state; E: utilities of (current state, action) pairs; F: goal; MEU: maximum expected utility principle.)
In our applications, several influence diagrams representing a single time slice have been chained together to form a dynamic influence diagram (DID). The DID forms a rolling-horizon planning window that modifies itself as time passes. Figure 1 is a DID of 2 time slices where the suffix of 0 or 1 specifies the node's time slice. Assuming the Markov property of successive states, as time passes, the window is expanded by adding a new time slice, and the oldest time slice is integrated out to preserve the same amount of lookahead. In the DIDs we have used, the earliest time slice has been used to represent the current time t0, and a constant k amount of temporal projection has been implemented by having a fixed number k + 1 of time slices. On each decision cycle, the observations are entered as evidence into the evidence nodes of the time slice t0. The overall utility of the current plan (i.e., the current settings of the decision nodes) is then measured by examining the value node U.k. For this reason, though other nodes and links are eliminated during the compilation, the current time's evidence nodes and the top-level value node cannot be removed.
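As a concrete illustration, one decision cycle of the rolling window (condition on new evidence, then integrate out the oldest slice) can be sketched in a few lines. The two-state prior, likelihood, and transition table below are invented for the example, not taken from the paper's domain.

```python
# Hypothetical sketch of one decision cycle of the rolling-horizon DID:
# enter evidence at slice t0 (Bayes rule), then integrate out the oldest
# slice (Markov property) so the amount of lookahead stays fixed.

def condition(belief, likelihood):
    """Bayes rule at the current slice: posterior is prior * likelihood, normalized."""
    post = [b * l for b, l in zip(belief, likelihood)]
    z = sum(post)
    return [p / z for p in post]

def integrate_out(belief, transition):
    """p(X.t1) = sum over x.t0 of p(X.t1 | x.t0) p(x.t0)."""
    n = len(transition[0])
    return [sum(belief[i] * transition[i][j] for i in range(len(belief)))
            for j in range(n)]

belief = [0.9, 0.1]                      # parameterized prior over two states
T = [[0.8, 0.2], [0.3, 0.7]]             # assumed transition p(X.t1 | X.t0)
belief = condition(belief, [0.95, 0.2])  # sensor evidence entered at t0
belief = integrate_out(belief, T)        # oldest slice integrated out
```

The marginalization in `integrate_out` is the same operation used later to remove nodes during compilation (Equation 1).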
Prior probabilities Prior probabilities are probabilities that are not conditioned on other variables. For example, from Figure 1, Location.0 is represented by the prior probability p(Location.0), but MapError.1 is a conditioned variable since it is represented by the probability function p(MapError.1 | MapError.0). Prior probabilities are parameterized. For example, if Location.0 has two possible values, home and ¬home, the probability is set to p(Location.0 = home) = p_l and p(Location.0 = ¬home) = 1 − p_l, where p_l is a user-controlled parameter. With a set of such parameters for the initial state of the model, the compiled model can be reused for different settings. For one domain, one might set p_l = 0.9, while for another mission the setting would be p_l = 0.99.
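To make the parameterization concrete, here is a minimal sketch (the function name and mission values are illustrative only): the prior of a two-valued node is carried as the pair (p, 1 − p), so the same compiled model can be instantiated per mission.

```python
# Illustrative only: a two-valued prior parameterized by p_l, instantiated
# with different settings for different domains/missions.
def location_prior(p_l):
    return {"home": p_l, "not_home": 1.0 - p_l}

mission_a = location_prior(0.9)    # one domain: p_l = 0.9
mission_b = location_prior(0.99)   # another mission: p_l = 0.99
```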
Evidence variables and the Utility node On each decision cycle, new evidence from sensors is entered into the DID by setting the corresponding nodes to a specific value. For example, in a mine-mapping mission, on each cycle, the DID receives information on whether a new mine was detected, true or false. Based on this evidence, the DID is used to compute the action with the highest expected utility. The utility is measured at a specific value node (e.g., U.1 in Figure 1).

2.2 Output: Decision Policy

The antecedent conditions of the generated rules are algebraic expressions, parameterized by the variables representing the prior probabilities. Shown below is the general form of the rules specifying the optimal decision d* ∈ D when there are evidence variables evid_j (1 ≤ j ≤ n) with values value_jl (1 ≤ l ≤ v) for evidence variable evid_j:

ruleset ← rule_1, ..., rule_v
rule_i ← IF conds THEN d*
conds ← ∧_{1 ≤ j ≤ n} cond_j
cond_j ← (evid_j = value_jl) | (p rel func(P)) | (p rel p')
rel ← ≤ | = | ≥

P is the set of parameters p representing prior probability values. func(P) is an algebraic function on P. An example rule would look like

IF (MNM = F) ∧ (p_l > 0.9 p_m) THEN WSEARCH

The following sections will show how such rules are generated from the uncompiled model.

2.3 Defining Knowledge Compilation

Knowledge compilation can be defined using the multiple execution architectures framework introduced by Russell [14]. Briefly, the idea is that an agent can be defined as a function f : P* → D that receives percepts P as input from the environment and outputs a decision d* ∈ D. From Figure 2 it can be seen that there are multiple inference paths from P to d* using different knowledge types (A, B, C, D, E, F, and MEU, the maximum expected utility principle). The four different paths from P to d* describe different implementations of f, the multiple execution architectures. Each execution architecture operates on a different set of knowledge types to make its decision choice:

• Decision-Theoretic EA (path A B C MEU)
• Goal-Based EA (path A B F)
• Action-Utility EA (path A E MEU)
• Condition-Action EA (path A D)

Figure 2: Knowledge forms for multiple execution architectures.

The various compilation routes can be identified by enumerating all possible combinations of arcs in Figure 2 that are adjacent, head to tail. Table 2 shows the eight possible compilation routes where two pieces of some type of knowledge can be converted into a single piece of knowledge. For example, B + F → D denotes the combination of type B knowledge and type F knowledge to yield type D knowledge.

HOMOGENEOUS: A + A → A, A + B → B, A + D → D, A + E → E
HETEROGENEOUS (projection): B + C → E, B + F → D
HETEROGENEOUS (utility computation): C + MEU → F, E + MEU → D

Table 2: Listing of all compilation routes. The left column shows the 4 types of homogeneous compilation. The middle column shows heterogeneous compilations using projection. The right column shows heterogeneous compilations using utility computation.

The eight types of knowledge compilation can be used to convert an execution architecture from one type to another (Condition-Action, Goal-Based, Action-Utility, or Decision-Theoretic). The most "uncompiled" EA is the Decision-Theoretic EA, which can be converted via compilation transformations to any of the other three EAs. The Condition-Action EA is the "most compiled" EA since no compilation transformation can convert it to another EA (Figure 3).

Figure 3: Compilations to convert execution architectures.

In the compilation C + MEU → F, the value function of type C, v_C, is converted into a value function of type F, v_F, by using the Maximum Expected Utility principle:

v_C(U.t1 | X.t1) + MEU → v_F(U.t1 | X.t1)

This is done by assigning u_⊤ to the outcome state configuration(s) of X.t1 that maximize v_C; u_⊥ is assigned to all other state configurations. The Goal-Based execution architecture expects only output states with utilities u_⊤ or u_⊥. Those output states with utility u_⊤ are "goals" in that a decision sequence leading to a goal is optimal.

2.4 Implementing Compilation

With the decision-theoretic representation of dynamic influence diagrams, there are two classes of operations to implement compilation:

1. Integrating out a chance variable:
A + A → A, A + B → B, A + D → D, A + E → E, B + C → E, B + F → D

For example, from Figure 1, the compilation A + B → B is done by averaging out a probabilistic variable, in this case, Location.0:

p(Loc.1 | D.0) = ∑_{Loc.0} p(Loc.1 | D.0, Loc.0) p(Loc.0)    (1)

The result is a reduced model with the probability distribution of Location.0 incorporated into the new distribution p(Location.1 | D.0).

2. Pre-computing expected utilities to fix policy:
C + MEU → F, E + MEU → D

2.5 A Compilation Algorithm

This section describes a general knowledge compilation algorithm for the problem as stated in Table 1. The compilation algorithm is in three steps:

1. Represent the prior probabilities as parameters.
By using parameters for the probabilities (e.g., for a two-valued variable, p and 1 − p can be used), the compiled network can be re-used many times by instantiating the prior probability variables to different values for different domains/missions.

2. Propagate the variables through influence diagram reductions.
By applying Bayes Rule and marginalization, nodes of the influence diagram can be removed by integrating their influences into the rest of the network.

• Remove unneeded barren nodes.
Barren nodes are nodes that do not have any children. Barren nodes can be removed from an influence diagram without affecting the distributions of any of the other nodes, and therefore no changes to existing probability and utility functions need to be made. The barren node removal can be done recursively, so that if a node's only children are barren, it, too, can be removed. The evidence nodes, the nodes of time 0 which are parameterized, and the value node of interest cannot be removed, because these nodes are used to enter observations and to examine the utility of decisions.

• Remove nodes of each time slice in order.
The extraneous nodes of each time slice are removed, starting with Time Slice 0 and progressing to Time Slice k. The node removals are done by applying Bayes Rule and probability variable marginalization.

3. Generate parameterized, if-then, condition-action rules.
IF condition, THEN action rules can be generated from the Type D knowledge of the Condition-Action EA. This procedure is in two steps:

• Convert utilities to fixed decision policy.
Once all extraneous chance variables are removed, the utilities (which are now Type E knowledge) can be converted to a fixed decision policy using the E + MEU → D compilation.

• Convert policy to if-then rules.
The fixed policy can be converted to IF-THEN rules where the condition of the IF statement is one of the possible evidence/percept vectors and a function of the prior probability variables, and the consequent is a decision recommendation.

2.6 Example

Figure 4 displays four views of a DID of two time slices in different stages of the knowledge compilation process. MNewMine.0 is an evidence node that receives sensor information at every time slice. The value node of interest is U.1. Location.0 and MapError.0 are prior probabilities; they are modeled as binary-valued nodes and parameterized with p_l and p_m:

p(Loc.0 = home) = p_l    p(Loc.0 = ¬home) = 1 − p_l
p(MapErr.0 = T) = p_m    p(MapErr.0 = F) = 1 − p_m

In (A), the barren nodes U.0, RecoverVeh.0, and MNewMine.1 are identified and removed directly, without any changes to the probability or utility tables of other nodes. In (B), the nodes of time 0 to be removed are highlighted. Location.0 can be simply removed via the reduction of Equation 1. The parameter p_l is carried through the multiplications and summation. The removal of MapError.0 is more complicated, requiring the application of Bayes Rule to reverse the arc between MapError.0 and MNewMine.0. In (C), the nodes of time 1 are removed in a similar manner to the previous step. Figure 4D shows the final graphical form after the B + C → E compilation is applied to convert the Decision-Theoretic EA into an Action-Utility EA. Type E knowledge gives utilities of state-action pairs as shown in Table 3. In this case, the state corresponds to the evidence vector MNewMine.0, and the possible decisions of D.0 are NOOP, WS, BS, GO. The utilities are represented as functions f_i(p_l, p_m) of the prior probabilities.

Figure 4: DID undergoing knowledge compilation.

MNewMine.0    NOOP    WS     BS     GO
F             f_1     f_2    f_3    f_4
T             f_5     f_6    f_7    f_8

Table 3: Type E knowledge: f_i(p_l, p_m)

The decision policy is to select max(f_1, f_2, f_3, f_4) if MNewMine = F and max(f_5, f_6, f_7, f_8) if MNewMine = T. Finally, to create the Condition-Action EA, the Type E utilities are converted to best decision choices directly using the E + MEU → D compilation. For example,

IF (MNM = F) ∧ (p_l > 0.9 p_m) THEN WSEARCH
IF (MNM = F) ∧ (p_l < 0.9 p_m) THEN BSEARCH
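The E + MEU → D step can be sketched as follows. The utility functions below are invented stand-ins for the Table 3 entries f_i(p_l, p_m) (the paper does not give their actual forms); they are chosen so that the compiled rule has the same shape, p_l > 0.9 p_m, as the rule in the text.

```python
# Hypothetical Type E knowledge for the evidence value MNewMine = F:
# utility of each action as a function of the prior parameters p_l, p_m.
fs = {
    "NOOP":    lambda p_l, p_m: 0.10,         # stand-in for f_1
    "WSEARCH": lambda p_l, p_m: 0.60 * p_l,   # stand-in for f_2
    "BSEARCH": lambda p_l, p_m: 0.54 * p_m,   # stand-in for f_3
}

def e_plus_meu(p_l, p_m):
    """E + MEU -> D: fix the decision to the action of maximum utility
    for a given instantiation of the prior parameters."""
    return max(fs, key=lambda a: fs[a](p_l, p_m))
```

With these stand-ins, WSEARCH beats BSEARCH exactly when 0.60 p_l > 0.54 p_m, i.e. p_l > 0.9 p_m, which is how the algebraic comparison of the f_i yields a parameterized probability-inequality condition rather than a fixed decision.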
2.7 Discussion

Previous work In a decision-theoretic framework, Heckerman et al. [3] compare two strategies: "Compute" and "Compile". Whereas "Compute" uses all of the evidence to make a decision, "Compile" pre-computes and stores the effect of some subset of the evidence variables. The "Compile" strategy can be implemented as a "situation-action tree" where a path from the root to a leaf is the sequence of evidence examined, branching on different evidence values and terminating at a leaf node that indicates the best decision choice. Lehner and Sadigh [8] follow up on the situation-action tree method, providing procedures for generating and analyzing such trees.

The generated IF-THEN rules of the compilation algorithm described in this paper are similar to the situation-action tree in that both the condition-action rules and the situation-action tree provide a decision recommendation based on current evidence. The major difference is that the IF-THEN rules are parameterized by prior probabilities of the model, not only the evidence. In addition, rather than having only Boolean conditions on evidence, the IF-THEN rules also admit probability-inequality conditions on the prior probabilities of the model. These generalizations significantly broaden the reapplicability of the compiled model. Another difference is that the situation-action tree heuristically orders the sequential examination of evidence, potentially identifying the best decision before examining all evidence. The condition-action rules can likewise be processed to yield an efficient implementation (e.g., a situation-action tree or Rete network).

D'Ambrosio [1] and Shachter et al. [16] have done work on a goal-directed inference system that performs only those calculations required by the query, rather than being data-directed and updating the value of all possible queries upon receipt of new evidence. Intermediate results of computation are cached for possible future use.
Approximate compilation It is desirable to maintain behavioral equivalence as knowledge compilation is done. Two systems that are behaviorally equivalent will each select the same decisions given the same percept history. However, knowledge compilation will still be useful if the system can execute much faster and result in only a small reduction in decision quality. Depending on the utility model of the domain, an approximate, compiled model may be preferable. For example, if there is a high degree of time pressure, the compiled model may be able to compensate in speed of execution for any decision errors due to approximate compilation. The obvious question is whether an optimal tradeoff point between the "accuracy" of the compilation and the time benefit of using the compilation can be determined. For example, we can make approximations so that all probabilities ε-close to 0 or 1 are set to 0 or 1, respectively, and save the compiled probabilities in a compressed form.

Figure 5: A Decision-Theoretic EA influence diagram with a 3-step planning window.
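The ε-rounding approximation just described can be sketched directly; the threshold value and probability table below are illustrative assumptions.

```python
# Snap probabilities eps-close to 0 or 1 to exactly 0 or 1; the resulting
# compiled tables contain long runs of 0.0/1.0 and compress well.
def snap(p, eps=0.01):
    if p <= eps:
        return 0.0
    if p >= 1.0 - eps:
        return 1.0
    return p

table = [0.995, 0.004, 0.62, 1.0]
approx = [snap(p) for p in table]   # -> [1.0, 0.0, 0.62, 1.0]
```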
Controlling compilation Because the knowledge compilation procedure maintains behavioral equivalence between the uncompiled and compiled models, there is no cost with respect to loss of model accuracy. However, two types of costs that do apply are (1) the computational resources required to carry out the compilation and (2) the cost of maintaining multiple compiled models (i.e., compiled and uncompiled versions of the same domain).

The first cost of compilation can be ignored if the knowledge compilation is done off-line, before the models are used for decision-making. If compilation is done on-line, it can be put under metalevel control in much the same way that the marginal utility of replanning is analyzed in the second part of this paper. The efficacy of such metalevel control is directly related to the speed of the metalevel analysis and the availability of knowledge about the time-parameterized effects of on-line compilation, specifically, probability distributions on the relevant variables used for decision-making.

There is a cost of maintaining multiple compiled models when new evidence (in the form of assertions that a variable has been observed to take on a specific value) occurs for a node that has been "compiled out". If the original uncompiled network is saved as well as different levels of compiled networks, then the cost is only the memory cost of maintaining multiple models.

3 Decision-theoretic Metareasoning

3.1 Overview

The objective of planning in a decision-theoretic framework is to select the action sequence that has the highest expected utility, and then to execute the first action of that sequence. To lend tractability to the computation, we don't compute the utility of every action sequence at every decision point, but rather compare the effect of continuing with the current plan versus doing some sort of replanning. Replanning is invoked if and only if the estimated expected utility of replanning is greater than the estimated expected utility of executing the plan. The value of replanning may be positive, for example, if there is new evidence since the plan was constructed or if the plan can be easily improved. On each decision cycle, a best action must be selected, and it is either to execute the next step of the default plan or to do replanning.

The planning window is a dynamic influence diagram initialized with a nominal plan that specifies a default decision for each D.i in the planning window. The expected utility if the next plan step were executed is the current utility of the top-level utility node. This is because the nominal plan steps are "programmed" in by initializing the prior probabilities of the corresponding decision nodes to near 1 for the nominal plan's actions.

As the possible computational actions, replanning decisions affecting different portions of the utility model in the influence diagram are considered. For example, in Figure 5, the direct influences on the utility node U.3 are the DataAccrued.3, DataRecovered.3, and FuelGauge.3 nodes. The plan repair focuses on the attribute nodes that will yield the maximum net utility gain, that is, the nodes with the maximum value of replanning.

Each node is the root of a subtree of other nodes that influence it. In general, the nodes of an influence diagram can form a directed acyclic graph rather than just a subtree. There are various methods to convert a DAG network into a tree network (e.g., cutset conditioning [13]).

Suppose we determine that replanning the FuelGauge.3 subtree in the influence diagram has the highest estimated value of replanning. Recursively, the values of replanning the subtrees of FuelGauge.3 are computed, and continuing in this manner the influence diagram is examined until (1) a decision node is reached and modified to the action value that maximizes utility or (2) all examined subtrees have a negative estimated value of replanning.

3.2 Value of replanning

To show how the value of replanning a subtree is determined, let X be the subtree being considered, where "X" is the name of the subtree's root. Assuming that we can separate the time-dependent utility, TC(X), from the "intrinsic" utility, EU_I(X), the estimated value of information (VI) gained from replanning subtree X is

V̂I(replan(X)) = ÊU_I(replan(X)) − T̂C(replan(X))

where replan(X) denotes the action of replanning subtree X and ˆ indicates an estimated quantity.

Subtrees like X are evaluated and the corresponding parent nodes are inserted into a list of nodes ordered by the value of replanning the subtree rooted at that node. The value of replanning is a type of "value of information" [5, 4, 15, 9] used in decision-theoretic metareasoning. If the node is controllable, its value is changed; otherwise, the node is recursively expanded to further narrow the area subject to replanning. Using the ordered list allows "best-first" search where the best subgraph to examine can jump around in the influence diagram.
When the time cost, T̂C, is greater than the expected benefit, ÊU_I, the value of replanning is negative, and it should not be done. By varying the TC function, we can get different degrees of time-criticality of the mission. Currently, we define TC to be proportional to the number of cpu seconds used by the replanner: TC = r · ŝ, where r is a time-criticality parameter and ŝ is the estimated number of cpu seconds required to replan the subtree. To estimate s, we maintain statistics over many runs and use Bayesian prediction to incorporate prior knowledge and incrementally update the distribution of s. Note that if TC is 0, there is no reason not to continue to compute without acting.
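The TC = r · ŝ estimate can be sketched as follows; the running-mean update with prior pseudo-observations is a simplified stand-in for the Bayesian prediction the text describes, and the prior values are invented.

```python
# s_hat: estimated cpu seconds to replan a subtree, maintained across
# replanning episodes; the prior acts as pseudo-observations.
class TimeCost:
    def __init__(self, prior_mean=0.1, prior_weight=2.0):
        self.total = prior_mean * prior_weight
        self.count = prior_weight

    def update(self, observed_seconds):
        """Fold one replanning episode into the estimate."""
        self.total += observed_seconds
        self.count += 1.0

    def s_hat(self):
        return self.total / self.count

    def tc(self, r):
        """TC = r * s_hat, with r the time-criticality parameter."""
        return r * self.s_hat()

model = TimeCost()
model.update(0.2)
tc = model.tc(r=1.0)   # (0.1*2 + 0.2) / 3, about 0.13
```

Setting r = 0 recovers the no-time-pressure case noted above, where TC is 0 and deliberation is free.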
Utility of replanning a subtree ÊU_I(replan(X)), of Equation 2, is the estimated expected utility of the state resulting from replanning node X, independent of time. It is computed by taking the utility difference between the subtree after replanning, X', and the current subtree, X. To compute the utility for X', we must know the probability that different X's will occur as a result of replanning. For a decision node, because it is completely controllable, the probability is 1 that any particular value can be achieved, and simply the decision value that maximizes utility is selected. But for chance nodes, we need to estimate the probability and update it based on replanning experiences.
As an example, suppose X is the subtree rooted at FuelGauge.3. The utility of the current probability distribution of the FuelGauge.3 node can be computed directly. But to compute the utility of the same node FuelGauge.3' after replanning decisions that might change its probability distribution, we must estimate how likely it is that the replanning will alter actions in the plan to achieve a new distribution for FuelGauge.3'.

In the example, if the initial probability distribution for FuelGauge.3 has a high probability for the value empty, the utility of that subtree will be relatively low according to our utility model (an empty fuel tank is bad). But if we know that there is a high probability that replanning (perhaps by omitting actions to lower fuel consumption) will change the probability distribution to more heavily weight half or full, the utility of the FuelGauge.3' subtree will be higher. Thus ÊU_I(replan(X)) would be positive.
To make ÊU_I(replan(X)) explicit, let φ be the probability distribution of the root of the subtree X (i.e., φ = p(X)), and let φ' be the probability distribution of X after replanning (i.e., φ' = p(X')). Then the expected utility of replan(X) is the average utility over all possible values of φ' (i.e., all probability distributions of the node):

EU_I(replan(X)) = ∫ p(φ') u(replan(X)) dφ'    (2)

where the utility of replanning a node is the difference between the new state resulting from replanning and the old state:

u(replan(X)) = u(φ') − u(φ)    (3)

EU_I(replan(X)) = ∫ p(φ') (u(φ') − u(φ)) dφ'    (4)
To find where to replan in the influence diagram, we start at the top-level U node and work backwards through the arcs to the base-level actions. For example, in Figure 5, U.3 is the child of its three parents, DataAccrued.3, DataRecovered.3, and FuelGauge.3. Informally, the replanner wants to set a parent node's value so that its children have a value with maximum expected utility. For example, the replanner wants to set FuelGauge.3 to the value such that U.3 is at its optimal value, 1. Once the optimal distribution for FuelGauge.3 is determined, recursively, the same procedure is invoked so that the propagation of evidence continues to the parents of FuelGauge.3.
Let Y be the child node of node X. With the assumption that the network is a tree, each node has at most one child. The utility of a particular distribution, u(φ) for X, is the expected utility of the probability distribution of X's child, φ_Y = p(Y), given φ and the rest of the belief network ξ. Letting ξ be implicit yields Equation 6:

u(φ) = u(φ, ξ)    (5)
     = ∫ p(φ_Y | φ, ξ) u(φ_Y) dφ_Y    (6)

Since φ is the current probability distribution for X, it can be read directly from the current network with no need for computation. As the base case of the recursive inference, the top-level node U's utilities are defined as u(U = 1) = 1 and u(U = 0) = 0. The other utilities of φ are determined from previous inferences as the belief net is examined recursively.

Specifying u(φ') is more difficult because every possible probability distribution of X must be considered. Here is where we estimate ÊU_I(replan(X)) of Equation 4, by ranging only over the possible discrete state values of X:

ÊU_I(replan(X)) = ∑_i p(X = x_i) (u(X = x_i) − u(φ))    (7)

where, substituting into Equation 6:

u(X = x_i) = ∑_j p(Y = y_j | X = x_i) u(Y = y_j)    (8)
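Equations 7 and 8 amount to one level of propagation plus a weighted average, which can be sketched directly. The numbers follow the FuelGauge.3 example of Section 3.3; the conditional probabilities p(U.3 | FuelGauge.3 = x_i) are reverse-engineered from the utilities stated there and are therefore assumptions.

```python
def u_of_state(p_y_given_x, u_y):
    # Eq. 8: u(X = x_i) = sum_j p(Y = y_j | X = x_i) u(Y = y_j)
    return sum(p * u for p, u in zip(p_y_given_x, u_y))

def eu_replan(p_x, u_x, u_phi):
    # Eq. 7: EU_I(replan(X)) = sum_i p(X = x_i) (u(X = x_i) - u(phi))
    return sum(p * (u - u_phi) for p, u in zip(p_x, u_x))

u_y = [0.0, 1.0]   # base case: u(U = 0) = 0, u(U = 1) = 1
# assumed p(U.3 | FuelGauge.3 = zero/low/high), matching u = 0.01/0.75/0.78
u_x = [u_of_state(py, u_y)
       for py in ([0.99, 0.01], [0.25, 0.75], [0.22, 0.78])]
eu = eu_replan([1/3, 1/3, 1/3], u_x, u_phi=0.29)
vi = eu - 0.15     # subtract TC; the value of replanning
```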
The probability that the probability distribution X = x_i will result from replanning, p(X = x_i), must also be provided. This probability can be viewed as a measure of the controllability of the node, or whether Howard's "wizard" is successful [6]. For decision nodes, which are perfectly controllable, the probability for the value x* that maximizes utility can be set to 1: p(X = x*) = 1. For other nodes, we must estimate p(X = x_i). In our implementation, we start with a uniform prior distribution and use Bayesian updating based on replanning episodes.

We can show that if the algorithm considers all subtrees with no approximations, the use of the paths from the utility node to the decision node by the above procedure leads to the determination of the optimal policy [11].
3.3 Example

Using the influence diagram of Figure 5, we can compute V̂I(replan(FuelGauge.3)). Looking up the ŝ for the FuelGauge.3 subtree, suppose we find 0.15 and use r = 1, so T̂C = 0.15. Computing the ÊU_I, we start with the given utilities u(U.3 = 1) = 1 and u(U.3 = 0) = 0. Therefore,

u(FuelGauge.3)
= u(U.3 = 1) p(U.3 = 1 | FuelG.3, DataAcc.3, DataRec.3) + u(U.3 = 0) p(U.3 = 0 | FuelG.3, DataAcc.3, DataRec.3)
= 1 · p(U.3 = 1 | FuelG.3, DataAcc.3, DataRec.3) + 0 · p(U.3 = 0 | FuelG.3, DataAcc.3, DataRec.3)
= p(U.3 = 1 | FuelG.3, DataAcc.3, DataRec.3)
= 0.29

Setting FuelGauge.3 to each of its states and propagating the change to its child U.3, we get new values for p(U.3 = 1 | FuelGauge.3', DataAccrued, DataRecovered) and the utility of specific fuel gauge values:

u(FuelGauge.3 = zero) = 0.01
u(FuelGauge.3 = low) = 0.75
u(FuelGauge.3 = high) = 0.78

Assuming that we have not replanned this subtree before, the probabilities are at their default uniform values:

p(FuelGauge.3 = zero) = 1/3
p(FuelGauge.3 = low) = 1/3
p(FuelGauge.3 = high) = 1/3

Putting everything together,

ÊU_I(replan(FuelGauge.3)) = (1/3)(0.01 − 0.29) + (1/3)(0.75 − 0.29) + (1/3)(0.78 − 0.29) = 0.22
V̂I(replan(FuelGauge.3)) = 0.22 − 0.15 = 0.07

The estimated value of replanning the FuelGauge.3 subtree is 0.07. This value is compared to replanning other subtrees (viz., DataRecovered.3 and DataAccrued.3) and, since it is positive, the best subtree is replanned. If the root of the subtree is a decision node, it is set to its optimal value. Otherwise, recursively, the lower levels of the best subtree are examined until the controllable decision nodes to replan are found. Then, the ŝ and p(X = x_i) numbers are updated using the new data from the replanning episode.

3.4 Discussion

Complexity The time complexity of the V̂I calculation is low since only one level of evidence propagation in the network, from the node of interest to its children, is ever done. The u(φ), ŝ, and p(X = x_i) values are obtained using constant-time table lookups. The calculation of the u(X = x_i) terms involves the single-level propagation. If there are at most n states each for X and its child Y, the total time for a single V̂I calculation is O(n²). At each time step, each parent of the child must be evaluated to find the maximum V̂I; therefore the total time for one time step is O(kn²), where k is the indegree of the node.

The space requirements are constant in the number of nodes. For the Gaussians ŝ and p(X = x_i), the sufficient statistics are the mean and variance, which are stored with each node.

4 Conclusion

Knowledge compilation and metalevel control of replanning were the two techniques discussed to aid in achieving real-time performance without sacrificing too much decision quality. For both techniques, assumptions about the structure of the domain are exploited. For example, by knowing which variables are evidence variables, we can compile out all other internal state nodes. As for future work, interleaving compilation with execution is a crucial ability if the domain model changes significantly with time. For the replanning work, we are investigating more flexible methods to do temporal projection (e.g., variable amount of lookahead, and non-Markov domains) and exploring the advantages and disadvantages of different execution architectures for different domains (see also [12]).

References

[1] B. D'Ambrosio. Incremental probabilistic inference. In Proc. of the 9th Conf. on Uncertainty in AI, 1993.
[2] R. E. Fikes, P. E. Hart, and N. J. Nilsson. Learning and executing generalized robot plans. Artificial Intelligence, 4, 1972.
[3] D. E. Heckerman, J. S. Breese, and E. J. Horvitz. The compilation of decision models. In Proc. of the 5th Conf. on Uncertainty in AI, 1989.
[4] E. J. Horvitz. Reasoning under varying and uncertain resource constraints. In Proc. of the 7th Natl. Conf. on AI, 1988.
[5] R. A. Howard. Information value theory. IEEE Trans. on Systems, Man, and Cybernetics, 2, 1965.
[6] R. A. Howard. Influence diagrams. In R. A. Howard and J. E. Matheson, editors, Readings on the Principles and Applications of Decision Analysis, vol. 2, Strategic Decisions Group, Menlo Park, CA, 1984.
[7] J. E. Laird, P. S. Rosenbloom, and A. Newell. Chunking in Soar. Machine Learning, 1(1), 1986.
[8] P. E. Lehner and A. Sadigh. Two procedures for compiling influence diagrams. In Proc. of the 9th Conf. on Uncertainty in AI, 1993.
[9] J. E. Matheson. Using influence diagrams to value information and control. In R. M. Oliver and J. Q. Smith, eds., Influence Diagrams, Belief Nets and Decision Analysis. John Wiley & Sons, Chichester, UK, 1990.
[10] T. M. Mitchell, R. M. Keller, and S. T. Kedar-Cabelli. Explanation-based generalization: A unifying view. Machine Learning, 1:47–80, 1986.
[11] G. H. Ogasawara. RALPH-MEA: A Real-Time Decision-Theoretic Agent Architecture. PhD thesis, UC Berkeley, CS Division, 1993. Report UCB/CSD-93-777.
[12] G. H. Ogasawara and S. J. Russell. Planning using multiple execution architectures. In Proc. of the 13th Intl. Joint Conf. on AI, 1993.
[13] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.
[14] S. J. Russell. Execution architectures and compilation. In Proc. of the 11th Intl. Joint Conf. on AI, 1989.
[15] S. J. Russell and E. H. Wefald. Principles of metareasoning. In Proc. of the 1st Intl. Conf. on Principles of Knowledge Representation and Reasoning, 1989.
[16] R. D. Shachter, B. D'Ambrosio, and B. A. Del Favero. Symbolic probabilistic inference in belief networks. In Proc. of the 8th Natl. Conf. on AI, 1990.