Two Techniques for Tractable Decision-Theoretic Planning*†

From: AAAI Technical Report SS-94-06. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

Gary H. Ogasawara
Stuart J. Russell
Computer Science Division
University of California
Berkeley, CA 94720
garyo@cs.berkeley.edu, russell@cs.berkeley.edu
Abstract

Two techniques to provide inferential tractability for decision-theoretic planning are discussed. Knowledge compilation is a speedup technique that pre-computes intermediate quantities of a domain model to yield a more efficiently executable model. The basic options for compiling decision-theoretic models are defined, leading to the most compiled model in the form of parameterized, condition-action decision rules. We describe a general compilation algorithm that can be applied to dynamic influence diagrams for sequential decision-making. Decision-theoretic metareasoning is a control technique that regulates online replanning. This algorithm focuses on the most promising areas to replan and considers the time cost of replanning.
1 Introduction

Applying decision-theoretic reasoning with action sequences rather than single actions can be used to find an optimal plan. However, the standard method of computing the expected utility of each action sequence and then choosing the sequence of maximum utility as the optimal plan is generally intractable. This paper describes two techniques that we have been investigating to ensure inferential tractability. The first technique, knowledge compilation, is done at the pre-processing stage to reduce the complexity of the model; the second technique, decision-theoretic metareasoning, exerts control upon the planning and execution during run-time.

Knowledge compilation pre-computes intermediate quantities of a domain model to yield a more efficiently executable model (e.g., [2, 10, 7]). We describe a knowledge compilation algorithm that generates a compiled model that is parameterized by prior probabilities of the uncompiled model. Because it is parameterized, the same compiled model can be used repeatedly in sequential decision-making.

Decision-theoretic metareasoning is used to control the planning process. At each decision point, a default plan is available. Metareasoning computes an estimate of the expected value of replanning. If the value of replanning is positive, the replanning algorithm is invoked; otherwise, the next step of the default plan is executed. Because a utility measure of the time cost of replanning is included, timely responses can be ensured.

*This research was supported by NSF grants IRI-8903146, INT-9207213, and CDA-8722788, and the Lockheed AI Center.
†Also affiliated with the Lockheed AI Center, Palo Alto, CA.
2 Knowledge Compilation
The terms will be clarified below, but in brief, the knowledge compilation problem is shown in Table 1.

Given: an uncompiled model with the following properties:
• It is represented by a dynamic influence diagram that allows temporal projection from time 0 to time k.
• Prior probabilities of time 0 are parameterized for generality.
• Evidence variables are specified.
• The utility node of the dynamic influence diagram is specified.

Determine: a decision policy with the following properties:
• Completeness: There is a set of IF-THEN, "condition-action" rules where a rule applies to every possible world state.
• Behavioral equivalence: The recommended action is the same as the best action that would be recommended by the uncompiled model.
• Parameterized conditions: The conditions are a conjunction of evidence values and/or probability-inequality conditions on the prior probabilities of the uncompiled model.

Table 1: The knowledge compilation problem
2.1 Input: Uncompiled Model

Dynamic influence diagrams We adopt the influence diagram as the basic representational tool. An influence diagram [6] (e.g., Figure 1) is a graphical knowledge representation that accounts for the uncertainty of states and provides inference algorithms to update beliefs given new evidence. An influence diagram represents actions as decision nodes, outcome states (including goals) as probabilistic state nodes, and preferences using value nodes. Depending on the node types which they connect, the arcs in the influence diagram represent probabilistic, informational, or value dependence.
Figure 1: A sample influence diagram. (A: information about the current state; B: effects of actions; C: utilities of the projected state; D: best action choice from the current state; E: utilities of (current state, action) pairs; F: goal; MEU: maximum expected utility principle.)
In our applications, several influence diagrams representing a single time slice have been chained together to form a dynamic influence diagram (DID). The DID forms a rolling-horizon planning window that modifies itself as time passes. Figure 1 is a DID of 2 time slices where the suffix of 0 or 1 specifies the node's time slice. Assuming the Markov property of successive states, as time passes, the window is expanded by adding a new time slice, and the oldest time slice is integrated out to preserve the same amount of lookahead. In the DIDs we have used, the earliest time slice has been used to represent the current time t0, and a constant k amount of temporal projection has been implemented by having a fixed number k + 1 of time slices. On each decision cycle, the observations are entered as evidence into the evidence nodes of the time slice t0. The overall utility of the current plan (i.e., the current settings of the decision nodes) is then measured by examining the value node U.k. For this reason, though other nodes and links are eliminated during the compilation, the current time's evidence nodes and the top-level value node cannot be removed.
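As a concrete illustration, one decision cycle of the rolling window (condition on new evidence, then integrate out the oldest slice) can be sketched in a few lines. The two-state prior, likelihood, and transition table below are invented for the example, not taken from the paper's domain.

```python
# Hypothetical sketch of one decision cycle of the rolling-horizon DID:
# enter evidence at slice t0 (Bayes rule), then integrate out the oldest
# slice (Markov property) so the amount of lookahead stays fixed.

def condition(belief, likelihood):
    """Bayes rule at the current slice: posterior is prior * likelihood, normalized."""
    post = [b * l for b, l in zip(belief, likelihood)]
    z = sum(post)
    return [p / z for p in post]

def integrate_out(belief, transition):
    """p(X.t1) = sum over x.t0 of p(X.t1 | x.t0) p(x.t0)."""
    n = len(transition[0])
    return [sum(belief[i] * transition[i][j] for i in range(len(belief)))
            for j in range(n)]

belief = [0.9, 0.1]                      # parameterized prior over two states
T = [[0.8, 0.2], [0.3, 0.7]]             # assumed transition p(X.t1 | X.t0)
belief = condition(belief, [0.95, 0.2])  # sensor evidence entered at t0
belief = integrate_out(belief, T)        # oldest slice integrated out
```

The marginalization in `integrate_out` is the same operation used later to remove nodes during compilation (Equation 1).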
Prior probabilities Prior probabilities are probabilities that are not conditioned on other variables. For example, from Figure 1, Location.0 is represented by the prior probability p(Location.0), but MapError.1 is a conditioned variable since it is represented by the probability function p(MapError.1 | MapError.0). Prior probabilities are parameterized. For example, if Location.0 has two possible values, home and ¬home, the probability is set to p(Location.0 = home) = p_l and p(Location.0 = ¬home) = 1 − p_l, where p_l is a user-controlled parameter. With a set of such parameters for the initial state of the model, the compiled model can be reused for different settings. For one domain, one might set p_l = 0.9, while for another mission the setting would be p_l = 0.99.
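To make the parameterization concrete, here is a minimal sketch (the function name and mission values are illustrative only): the prior of a two-valued node is carried as the pair (p, 1 − p), so the same compiled model can be instantiated per mission.

```python
# Illustrative only: a two-valued prior parameterized by p_l, instantiated
# with different settings for different domains/missions.
def location_prior(p_l):
    return {"home": p_l, "not_home": 1.0 - p_l}

mission_a = location_prior(0.9)    # one domain: p_l = 0.9
mission_b = location_prior(0.99)   # another mission: p_l = 0.99
```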
Evidence variables and the Utility node On each decision cycle, new evidence from sensors is entered into the DID by setting the corresponding nodes to a specific value. For example, in a mine-mapping mission, on each cycle, the DID receives information on whether a new mine was detected, true or false. Based on this evidence, the DID is used to compute the action with the highest expected utility. The utility is measured at a specific value node (e.g., U.1 in Figure 1).

2.2 Output: Decision Policy

The antecedent conditions of the generated rules are algebraic expressions, parameterized by the variables representing the prior probabilities. Shown below is the general form of the rules specifying the optimal decision d* ∈ D when there are evidence variables evid_j (1 ≤ j ≤ n) with values value_jl (1 ≤ l ≤ v) for evidence variable evid_j:

ruleset ← rule_1, ..., rule_v
rule_i ← IF conds THEN d*
conds ← ∧_{1 ≤ j ≤ n} cond_j
cond_j ← (evid_j = value_jl) | (p rel func(P)) | (p rel p')
rel ← ≤ | = | ≥

P is the set of parameters p representing prior probability values. func(P) is an algebraic function on P. An example rule would look like

IF (MNM = F) ∧ (p_l > 0.9 p_m) THEN WSEARCH

The following sections will show how such rules are generated from the uncompiled model.

2.3 Defining Knowledge Compilation

Knowledge compilation can be defined using the multiple execution architectures framework introduced by Russell [14]. Briefly, the idea is that an agent can be defined as a function f : P* → D that receives percepts P as input from the environment and outputs a decision d* ∈ D. From Figure 2 it can be seen that there are multiple inference paths from P to d* using different knowledge types (A, B, C, D, E, F, and MEU, the maximum expected utility principle). The four different paths from P to d* describe different implementations of f, the multiple execution architectures. Each execution architecture operates on a different set of knowledge types to make its decision choice:

• Decision-Theoretic EA (path A B C MEU)
• Goal-Based EA (path A B F)
• Action-Utility EA (path A E MEU)
• Condition-Action EA (path A D)

Figure 2: Knowledge forms for multiple execution architectures.

The various compilation routes can be identified by enumerating all possible combinations of arcs in Figure 2 that are adjacent, head to tail. Table 2 shows the eight possible compilation routes where two pieces of some type of knowledge can be converted into a single piece of knowledge. For example, B + F → D denotes the combination of type B knowledge and type F knowledge to yield type D knowledge.

HOMOGENEOUS: A + A → A, A + B → B, A + D → D, A + E → E
HETEROGENEOUS (projection): B + C → E, B + F → D
HETEROGENEOUS (utility computation): C + MEU → F, E + MEU → D

Table 2: Listing of all compilation routes. The left column shows the 4 types of homogeneous compilation. The middle column shows heterogeneous compilations using projection. The right column shows heterogeneous compilations using utility computation.

The eight types of knowledge compilation can be used to convert an execution architecture from one type to another (Condition-Action, Goal-Based, Action-Utility, or Decision-Theoretic). The most "uncompiled" EA is the Decision-Theoretic EA, which can be converted via compilation transformations to any of the other three EAs. The Condition-Action EA is the "most compiled" EA since no compilation transformation can convert it to another EA (Figure 3).

Figure 3: Compilations to convert execution architectures.

In the compilation C + MEU → F, the value function of type C, v_C, is converted into a value function of type F, v_F, by using the Maximum Expected Utility principle:

v_C(U.t1 | X.t1) + MEU → v_F(U.t1 | X.t1)

This is done by assigning u_⊤ to the outcome state configuration(s) of X.t1 that maximize v_C; u_⊥ is assigned to all other state configurations. The Goal-Based execution architecture expects only output states with utilities u_⊤ or u_⊥. Those output states with utility u_⊤ are "goals" in that a decision sequence leading to a goal is optimal.

2.4 Implementing Compilation

With the decision-theoretic representation of dynamic influence diagrams, there are two classes of operations to implement compilation:

1. Integrating out a chance variable:
A + A → A, A + B → B, A + D → D, A + E → E, B + C → E, B + F → D

For example, from Figure 1, the compilation A + B → B is done by averaging out a probabilistic variable, in this case, Location.0:

p(Loc.1 | D.0) = ∑_{Loc.0} p(Loc.1 | D.0, Loc.0) p(Loc.0)    (1)

The result is a reduced model with the probability distribution of Location.0 incorporated into the new distribution p(Location.1 | D.0).

2. Pre-computing expected utilities to fix policy:
C + MEU → F, E + MEU → D

2.5 A Compilation Algorithm

This section describes a general knowledge compilation algorithm for the problem as stated in Table 1. The compilation algorithm is in three steps:

1. Represent the prior probabilities as parameters.
By using parameters for the probabilities (e.g., for a two-valued variable, p and 1 − p can be used), the compiled network can be re-used many times by instantiating the prior probability variables to different values for different domains/missions.

2. Propagate the variables through influence diagram reductions.
By applying Bayes Rule and marginalization, nodes of the influence diagram can be removed by integrating their influences into the rest of the network.

• Remove unneeded barren nodes.
Barren nodes are nodes that do not have any children. Barren nodes can be removed from an influence diagram without affecting the distributions of any of the other nodes, and therefore no changes to existing probability and utility functions need to be made. The barren node removal can be done recursively, so that if a node's only children are barren, it, too, can be removed. The evidence nodes, the nodes of time 0 which are parameterized, and the value node of interest cannot be removed, because these nodes are used to enter observations and to examine the utility of decisions.

• Remove nodes of each time slice in order.
The extraneous nodes of each time slice are removed, starting with Time Slice 0 and progressing to Time Slice k. The node removals are done by applying Bayes Rule and probability variable marginalization.

3. Generate parameterized, if-then, condition-action rules.
IF condition, THEN action rules can be generated from the Type D knowledge of the Condition-Action EA. This procedure is in two steps:

• Convert utilities to fixed decision policy.
Once all extraneous chance variables are removed, the utilities (which are now Type E knowledge) can be converted to a fixed decision policy using the E + MEU → D compilation.

• Convert policy to if-then rules.
The fixed policy can be converted to IF-THEN rules where the condition of the IF statement is one of the possible evidence/percept vectors and a function of the prior probability variables, and the consequent is a decision recommendation.

2.6 Example

Figure 4 displays four views of a DID of two time slices in different stages of the knowledge compilation process. MNewMine.0 is an evidence node that receives sensor information at every time slice. The value node of interest is U.1. Location.0 and MapError.0 are prior probabilities; they are modeled as binary-valued nodes and parameterized with p_l and p_m:

p(Loc.0 = home) = p_l    p(Loc.0 = ¬home) = 1 − p_l
p(MapErr.0 = T) = p_m    p(MapErr.0 = F) = 1 − p_m

In (A), the barren nodes U.0, RecoverVeh.0, and MNewMine.1 are identified and removed directly, without any changes to the probability or utility tables of other nodes. In (B), the nodes of time 0 to be removed are highlighted. Location.0 can be simply removed via the reduction of Equation 1. The parameter p_l is carried through the multiplications and summation. The removal of MapError.0 is more complicated, requiring the application of Bayes Rule to reverse the arc between MapError.0 and MNewMine.0. In (C), the nodes of time 1 are removed in a similar manner to the previous step. Figure 4D shows the final graphical form after the B + C → E compilation is applied to convert the Decision-Theoretic EA into an Action-Utility EA. Type E knowledge gives utilities of state-action pairs as shown in Table 3. In this case, the state corresponds to the evidence vector MNewMine.0, and the possible decisions of D.0 are NOOP, WS, BS, GO. The utilities are represented as functions f_i(p_l, p_m) of the prior probabilities.

Figure 4: DID undergoing knowledge compilation.

MNewMine.0    NOOP    WS     BS     GO
F             f_1     f_2    f_3    f_4
T             f_5     f_6    f_7    f_8

Table 3: Type E knowledge: f_i(p_l, p_m)

The decision policy is to select max(f_1, f_2, f_3, f_4) if MNewMine = F and max(f_5, f_6, f_7, f_8) if MNewMine = T. Finally, to create the Condition-Action EA, the Type E utilities are converted to best decision choices directly using the E + MEU → D compilation. For example,

IF (MNM = F) ∧ (p_l > 0.9 p_m) THEN WSEARCH
IF (MNM = F) ∧ (p_l < 0.9 p_m) THEN BSEARCH
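The E + MEU → D step can be sketched as follows. The utility functions below are invented stand-ins for the Table 3 entries f_i(p_l, p_m) (the paper does not give their actual forms); they are chosen so that the compiled rule has the same shape, p_l > 0.9 p_m, as the rule in the text.

```python
# Hypothetical Type E knowledge for the evidence value MNewMine = F:
# utility of each action as a function of the prior parameters p_l, p_m.
fs = {
    "NOOP":    lambda p_l, p_m: 0.10,         # stand-in for f_1
    "WSEARCH": lambda p_l, p_m: 0.60 * p_l,   # stand-in for f_2
    "BSEARCH": lambda p_l, p_m: 0.54 * p_m,   # stand-in for f_3
}

def e_plus_meu(p_l, p_m):
    """E + MEU -> D: fix the decision to the action of maximum utility
    for a given instantiation of the prior parameters."""
    return max(fs, key=lambda a: fs[a](p_l, p_m))
```

With these stand-ins, WSEARCH beats BSEARCH exactly when 0.60 p_l > 0.54 p_m, i.e. p_l > 0.9 p_m, which is how the algebraic comparison of the f_i yields a parameterized probability-inequality condition rather than a fixed decision.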
2.7 Discussion

Previous work In a decision-theoretic framework, Heckerman et al. [3] compare two strategies: "Compute" and "Compile". Whereas "Compute" uses all of the evidence to make a decision, "Compile" pre-computes and stores the effect of some subset of the evidence variables. The "Compile" strategy can be implemented as a "situation-action tree" where a path from the root to a leaf is the sequence of evidence examined, branching on different evidence values and terminating at a leaf node that indicates the best decision choice. Lehner and Sadigh [8] follow up on the situation-action tree method, providing procedures for generating and analyzing such trees.

The generated IF-THEN rules of the compilation algorithm described in this paper are similar to the situation-action tree in that both the condition-action rules and the situation-action tree provide a decision recommendation based on current evidence. The major difference is that the IF-THEN rules are parameterized by prior probabilities of the model, not only the evidence. In addition, rather than having only Boolean conditions on evidence, the IF-THEN rules also admit probability-inequality conditions on the prior probabilities of the model. These generalizations significantly broaden the reapplicability of the compiled model. Another difference is that the situation-action tree heuristically orders the sequential examination of evidence, potentially identifying the best decision before examining all evidence. The condition-action rules can likewise be processed to yield an efficient implementation (e.g., a situation-action tree or Rete network).

D'Ambrosio [1] and Shachter et al. [16] have done work on a goal-directed inference system that performs only those calculations required by the query, rather than being data-directed and updating the value of all possible queries upon receipt of new evidence. Intermediate results of computation are cached for possible future use.
Approximate compilation It is desirable to maintain behavioral equivalence as knowledge compilation is done. Two systems that are behaviorally equivalent will each select the same decisions given the same percept history. However, knowledge compilation will still be useful if the system can execute much faster and result in only a small reduction in decision quality. Depending on the utility model of the domain, an approximate, compiled model may be preferable. For example, if there is a high degree of time pressure, the compiled model may be able to compensate in speed of execution for any decision errors due to approximate compilation. The obvious question is whether an optimal tradeoff point between the "accuracy" of the compilation and the time benefit of using the compilation can be determined. For example, we can make approximations so that all probabilities ε-close to 0 or 1 are set to 0 or 1, respectively, and save the compiled probabilities in a compressed form.

Figure 5: A Decision-Theoretic EA influence diagram with a 3-step planning window.
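The ε-rounding approximation just described can be sketched directly; the threshold value and probability table below are illustrative assumptions.

```python
# Snap probabilities eps-close to 0 or 1 to exactly 0 or 1; the resulting
# compiled tables contain long runs of 0.0/1.0 and compress well.
def snap(p, eps=0.01):
    if p <= eps:
        return 0.0
    if p >= 1.0 - eps:
        return 1.0
    return p

table = [0.995, 0.004, 0.62, 1.0]
approx = [snap(p) for p in table]   # -> [1.0, 0.0, 0.62, 1.0]
```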
Controlling compilation Because the knowledge compilation procedure maintains behavioral equivalence between the uncompiled and compiled models, there is no cost with respect to loss of model accuracy. However, two types of costs that do apply are (1) the computational resources required to carry out the compilation and (2) the cost of maintaining multiple compiled models (i.e., compiled and uncompiled versions of the same domain).

The first cost of compilation can be ignored if the knowledge compilation is done off-line, before the models are used for decision-making. If compilation is done on-line, it can be put under metalevel control in much the same way that the marginal utility of replanning is analyzed in the second part of this paper. The efficacy of such metalevel control is directly related to the speed of the metalevel analysis and the availability of knowledge about the time-parameterized effects of on-line compilation, specifically, probability distributions on the relevant variables used for decision-making.

There is a cost of maintaining multiple compiled models when new evidence (in the form of assertions that a variable has been observed to take on a specific value) occurs for a node that has been "compiled out". If the original uncompiled network is saved as well as different levels of compiled networks, then the cost is only the memory cost of maintaining multiple models.

3 Decision-theoretic Metareasoning

3.1 Overview

The objective of planning in a decision-theoretic framework is to select the action sequence that has the highest expected utility, and then to execute the first action of that sequence. To lend tractability to the computation, we don't compute the utility of every action sequence at every decision point, but rather compare the effect of continuing with the current plan versus doing some sort of replanning. Replanning is invoked if and only if the estimated expected utility of replanning is greater than the estimated expected utility of executing the plan. The value of replanning may be positive, for example, if there is new evidence since the plan was constructed or if the plan can be easily improved. On each decision cycle, a best action must be selected, and it is either to execute the next step of the default plan or to do replanning.

The planning window is a dynamic influence diagram initialized with a nominal plan that specifies a default decision for each D.i in the planning window. The expected utility if the next plan step were executed is the current utility of the top-level utility node. This is because the nominal plan steps are "programmed" in by initializing the prior probabilities of the corresponding decision nodes to near 1 for the nominal plan's actions.

As the possible computational actions, replanning decisions affecting different portions of the utility model in the influence diagram are considered. For example, in Figure 5, the direct influences on the utility node U.3 are the DataAccrued.3, DataRecovered.3, and FuelGauge.3 nodes. The plan repair focuses on the attribute nodes that will yield the maximum net utility gain, that is, the nodes with the maximum value of replanning.

Each node is the root of a subtree of other nodes that influence it. In general, the nodes of an influence diagram can form a directed acyclic graph rather than just a subtree. There are various methods to convert a DAG network into a tree network (e.g., cutset conditioning [13]).

Suppose we determine that replanning the FuelGauge.3 subtree in the influence diagram has the highest estimated value of replanning. Recursively, the values of replanning the subtrees of FuelGauge.3 are computed, and continuing in this manner the influence diagram is examined until (1) a decision node is reached and modified to the action value that maximizes utility or (2) all examined subtrees have a negative estimated value of replanning.

3.2 Value of replanning

To show how the value of replanning a subtree is determined, let X be the subtree being considered, where "X" is the name of the subtree's root. Assuming that we can separate the time-dependent utility, TC(X), from the "intrinsic" utility, EU_I(X), the estimated value of information (VI) gained from replanning subtree X is

V̂I(replan(X)) = ÊU_I(replan(X)) − T̂C(replan(X))

where replan(X) denotes the action of replanning subtree X and ˆ indicates an estimated quantity.

Subtrees like X are evaluated and the corresponding parent nodes are inserted into a list of nodes ordered by the value of replanning the subtree rooted at that node. The value of replanning is a type of "value of information" [5, 4, 15, 9] used in decision-theoretic metareasoning. If the node is controllable, its value is changed; otherwise, the node is recursively expanded to further narrow the area subject to replanning. Using the ordered list allows "best-first" search where the best subgraph to examine can jump around in the influence diagram.
When the time cost, T̂C, is greater than the expected benefit, ÊU_I, the value of replanning is negative, and it should not be done. By varying the TC function, we can get different degrees of time-criticality of the mission. Currently, we define TC to be proportional to the number of cpu seconds used by the replanner: TC = r · ŝ, where r is a time-criticality parameter and ŝ is the estimated number of cpu seconds required to replan the subtree. To estimate s, we maintain statistics over many runs and use Bayesian prediction to incorporate prior knowledge and incrementally update the distribution of s. Note that if TC is 0, there is no reason not to continue to compute without acting.
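The TC = r · ŝ estimate can be sketched as follows; the running-mean update with prior pseudo-observations is a simplified stand-in for the Bayesian prediction the text describes, and the prior values are invented.

```python
# s_hat: estimated cpu seconds to replan a subtree, maintained across
# replanning episodes; the prior acts as pseudo-observations.
class TimeCost:
    def __init__(self, prior_mean=0.1, prior_weight=2.0):
        self.total = prior_mean * prior_weight
        self.count = prior_weight

    def update(self, observed_seconds):
        """Fold one replanning episode into the estimate."""
        self.total += observed_seconds
        self.count += 1.0

    def s_hat(self):
        return self.total / self.count

    def tc(self, r):
        """TC = r * s_hat, with r the time-criticality parameter."""
        return r * self.s_hat()

model = TimeCost()
model.update(0.2)
tc = model.tc(r=1.0)   # (0.1*2 + 0.2) / 3, about 0.13
```

Setting r = 0 recovers the no-time-pressure case noted above, where TC is 0 and deliberation is free.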
Utility of replanning a subtree ÊU_I(replan(X)), of Equation 2, is the estimated expected utility of the state resulting from replanning node X, independent of time. It is computed by taking the utility difference between the subtree after replanning, X', and the current subtree, X. To compute the utility for X', we must know the probability that different X's will occur as a result of replanning. For a decision node, because it is completely controllable, the probability is 1 that any particular value can be achieved, and simply the decision value that maximizes utility is selected. But for chance nodes, we need to estimate the probability and update it based on replanning experiences.
As an example, suppose X is the subtree rooted at FuelGauge.3. The utility of the current probability distribution of the FuelGauge.3 node can be computed directly. But to compute the utility of the same node FuelGauge.3' after replanning decisions that might change its probability distribution, we must estimate how likely it is that the replanning will alter actions in the plan to achieve a new distribution for FuelGauge.3'.

In the example, if the initial probability distribution for FuelGauge.3 has a high probability for the value empty, the utility of that subtree will be relatively low according to our utility model (an empty fuel tank is bad). But if we know that there is a high probability that replanning (perhaps by omitting actions to lower fuel consumption) will change the probability distribution to more heavily weight half or full, the utility of the FuelGauge.3' subtree will be higher. Thus ÊU_I(replan(X)) would be positive.
To make ÊU_I(replan(X)) explicit, let φ be the probability distribution of the root of the subtree X (i.e., φ = p(X)), and let φ' be the probability distribution of X after replanning (i.e., φ' = p(X')). Then the expected utility of replan(X) is the average utility over all possible values of φ' (i.e., all probability distributions of the node):

EU_I(replan(X)) = ∫ p(φ') u(replan(X)) dφ'    (2)

where the utility of replanning a node is the difference between the new state resulting from replanning and the old state:

u(replan(X)) = u(φ') − u(φ)    (3)

EU_I(replan(X)) = ∫ p(φ') (u(φ') − u(φ)) dφ'    (4)
To find where to replan in the influence diagram, we start at the top-level U node and work backwards through the arcs to the base-level actions. For example, in Figure 5, U.3 is the child of its three parents, DataAccrued.3, DataRecovered.3, and FuelGauge.3. Informally, the replanner wants to set a parent node's value so that its children have a value with maximum expected utility. For example, the replanner wants to set FuelGauge.3 to the value such that U.3 is at its optimal value, 1. Once the optimal distribution for FuelGauge.3 is determined, recursively, the same procedure is invoked so that the propagation of evidence continues to the parents of FuelGauge.3.
Let Y be the child node of node X. With the assumption that the network is a tree, each node has at most one child. The utility of a particular distribution, u(φ) for X, is the expected utility of the probability distribution of X's child, φ_Y = p(Y), given φ and the rest of the belief network ξ. Letting ξ be implicit yields Equation 6:

u(φ) = u(φ, ξ)    (5)
     = ∫ p(φ_Y | φ, ξ) u(φ_Y) dφ_Y    (6)

Since φ is the current probability distribution for X, it can be read directly from the current network with no need for computation. As the base case of the recursive inference, the top-level node U's utilities are defined as u(U = 1) = 1 and u(U = 0) = 0. The other utilities of φ are determined from previous inferences as the belief net is examined recursively.

Specifying u(φ') is more difficult because every possible probability distribution of X must be considered. Here is where we estimate ÊU_I(replan(X)) of Equation 4, by ranging only over the possible discrete state values of X:

ÊU_I(replan(X)) = ∑_i p(X = x_i) (u(X = x_i) − u(φ))    (7)

where, substituting into Equation 6:

u(X = x_i) = ∑_j p(Y = y_j | X = x_i) u(Y = y_j)    (8)
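Equations 7 and 8 amount to one level of propagation plus a weighted average, which can be sketched directly. The numbers follow the FuelGauge.3 example of Section 3.3; the conditional probabilities p(U.3 | FuelGauge.3 = x_i) are reverse-engineered from the utilities stated there and are therefore assumptions.

```python
def u_of_state(p_y_given_x, u_y):
    # Eq. 8: u(X = x_i) = sum_j p(Y = y_j | X = x_i) u(Y = y_j)
    return sum(p * u for p, u in zip(p_y_given_x, u_y))

def eu_replan(p_x, u_x, u_phi):
    # Eq. 7: EU_I(replan(X)) = sum_i p(X = x_i) (u(X = x_i) - u(phi))
    return sum(p * (u - u_phi) for p, u in zip(p_x, u_x))

u_y = [0.0, 1.0]   # base case: u(U = 0) = 0, u(U = 1) = 1
# assumed p(U.3 | FuelGauge.3 = zero/low/high), matching u = 0.01/0.75/0.78
u_x = [u_of_state(py, u_y)
       for py in ([0.99, 0.01], [0.25, 0.75], [0.22, 0.78])]
eu = eu_replan([1/3, 1/3, 1/3], u_x, u_phi=0.29)
vi = eu - 0.15     # subtract TC; the value of replanning
```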
The probability that the probability distribution X = x_i will result from replanning, p(X = x_i), must also be provided. This probability can be viewed as a measure of the controllability of the node, or whether Howard's "wizard" is successful [6]. For decision nodes, which are perfectly controllable, the probability for the value x* that maximizes utility can be set to 1: p(X = x*) = 1. For other nodes, we must estimate p(X = x_i). In our implementation, we start with a uniform prior distribution and use Bayesian updating based on replanning episodes.

We can show that if the algorithm considers all subtrees with no approximations, the use of the paths from the utility node to the decision node by the above procedure leads to the determination of the optimal policy [11].
3.3 Example

Using the influence diagram of Figure 5, we can compute V̂I(replan(FuelGauge.3)). Looking up the ŝ for the FuelGauge.3 subtree, suppose we find 0.15 and use r = 1, so T̂C = 0.15. Computing the ÊU_I, we start with the given utilities u(U.3 = 1) = 1 and u(U.3 = 0) = 0. Therefore,

u(FuelGauge.3)
= u(U.3 = 1) p(U.3 = 1 | FuelG.3, DataAcc.3, DataRec.3) + u(U.3 = 0) p(U.3 = 0 | FuelG.3, DataAcc.3, DataRec.3)
= 1 · p(U.3 = 1 | FuelG.3, DataAcc.3, DataRec.3) + 0 · p(U.3 = 0 | FuelG.3, DataAcc.3, DataRec.3)
= p(U.3 = 1 | FuelG.3, DataAcc.3, DataRec.3)
= 0.29

Setting FuelGauge.3 to each of its states and propagating the change to its child U.3, we get new values for p(U.3 = 1 | FuelGauge.3', DataAccrued, DataRecovered) and the utility of specific fuel gauge values:

u(FuelGauge.3 = zero) = 0.01
u(FuelGauge.3 = low) = 0.75
u(FuelGauge.3 = high) = 0.78

Assuming that we have not replanned this subtree before, the probabilities are at their default uniform values:

p(FuelGauge.3 = zero) = 1/3
p(FuelGauge.3 = low) = 1/3
p(FuelGauge.3 = high) = 1/3

Putting everything together,

ÊU_I(replan(FuelGauge.3)) = (1/3)(0.01 − 0.29) + (1/3)(0.75 − 0.29) + (1/3)(0.78 − 0.29) = 0.22
V̂I(replan(FuelGauge.3)) = 0.22 − 0.15 = 0.07

The estimated value of replanning the FuelGauge.3 subtree is 0.07. This value is compared to replanning other subtrees (viz., DataRecovered.3 and DataAccrued.3) and, since it is positive, the best subtree is replanned. If the root of the subtree is a decision node, it is set to its optimal value. Otherwise, recursively, the lower levels of the best subtree are examined until the controllable decision nodes to replan are found. Then, the ŝ and p(X = x_i) numbers are updated using the new data from the replanning episode.

3.4 Discussion

Complexity The time complexity of the V̂I calculation is low since only one level of evidence propagation in the network, from the node of interest to its children, is ever done. The u(φ), ŝ, and p(X = x_i) values are obtained using constant-time table lookups. The calculation of the u(X = x_i) terms involves the single-level propagation. If there are at most n states each for X and its child Y, the total time for a single V̂I calculation is O(n²). At each time step, each parent of the child must be evaluated to find the maximum V̂I; therefore the total time for one time step is O(kn²), where k is the indegree of the node.

The space requirements are constant in the number of nodes. For the Gaussians ŝ and p(X = x_i), the sufficient statistics are the mean and variance, which are stored with each node.

4 Conclusion

Knowledge compilation and metalevel control of replanning were the two techniques discussed to aid in achieving real-time performance without sacrificing too much decision quality. For both techniques, assumptions about the structure of the domain are exploited. For example, by knowing which variables are evidence variables, we can compile out all other internal state nodes. As for future work, interleaving compilation with execution is a crucial ability if the domain model changes significantly with time. For the replanning work, we are investigating more flexible methods to do temporal projection (e.g., variable amount of lookahead, and non-Markov domains) and exploring the advantages and disadvantages of different execution architectures for different domains (see also [12]).

References

[1] B. D'Ambrosio. Incremental probabilistic inference. In Proc. of the 9th Conf. on Uncertainty in AI, 1993.
[2] R. E. Fikes, P. E. Hart, and N. J. Nilsson. Learning and executing generalized robot plans. Artificial Intelligence, 4, 1972.
[3] D. E. Heckerman, J. S. Breese, and E. J. Horvitz. The compilation of decision models. In Proc. of the 5th Conf. on Uncertainty in AI, 1989.
[4] E. J. Horvitz. Reasoning under varying and uncertain resource constraints. In Proc. of the 7th Natl. Conf. on AI, 1988.
[5] R. A. Howard. Information value theory. IEEE Trans. on Systems, Man, and Cybernetics, 2, 1965.
[6] R. A. Howard. Influence diagrams. In R. A. Howard and J. E. Matheson, editors, Readings on the Principles and Applications of Decision Analysis, vol. 2, Strategic Decisions Group, Menlo Park, CA, 1984.
[7] J. E. Laird, P. S. Rosenbloom, and A. Newell. Chunking in Soar. Machine Learning, 1(1), 1986.
[8] P. E. Lehner and A. Sadigh. Two procedures for compiling influence diagrams. In Proc. of the 9th Conf. on Uncertainty in AI, 1993.
[9] J. E. Matheson. Using influence diagrams to value information and control. In R. M. Oliver and J. Q. Smith, eds., Influence Diagrams, Belief Nets and Decision Analysis. John Wiley & Sons, Chichester, UK, 1990.
[10] T. M. Mitchell, R. M. Keller, and S. T. Kedar-Cabelli. Explanation-based generalization: A unifying view. Machine Learning, 1:47–80, 1986.
[11] G. H. Ogasawara. RALPH-MEA: A Real-Time Decision-Theoretic Agent Architecture. PhD thesis, UC Berkeley, CS Division, 1993. Report UCB/CSD-93-777.
[12] G. H. Ogasawara and S. J. Russell. Planning using multiple execution architectures. In Proc. of the 13th Intl. Joint Conf. on AI, 1993.
[13] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.
[14] S. J. Russell. Execution architectures and compilation. In Proc. of the 11th Intl. Joint Conf. on AI, 1989.
[15] S. J. Russell and E. H. Wefald. Principles of metareasoning. In Proc. of the 1st Intl. Conf. on Principles of Knowledge Representation and Reasoning, 1989.
[16] R. D. Shachter, B. D'Ambrosio, and B. A. Del Favero. Symbolic probabilistic inference in belief networks. In Proc. of the 8th Natl. Conf. on AI, 1990.