
BUCKET ELIMINATION: A UNIFYING FRAMEWORK FOR
PROBABILISTIC INFERENCE
R. DECHTER
Department of Information and Computer Science
University of California, Irvine
dechter@ics.uci.edu
Abstract.
Probabilistic inference algorithms for belief updating, finding the most probable explanation, the maximum a posteriori hypothesis, and the maximum expected utility are reformulated within the bucket-elimination framework. This emphasizes the principles common to many of the algorithms appearing in the probabilistic inference literature and clarifies the relationship of such algorithms to nonserial dynamic programming algorithms. A general method for combining conditioning and bucket elimination is also presented. For all the algorithms, bounds on complexity are given as a function of the problem's structure.
1. Overview
Bucket elimination is a unifying algorithmic framework that generalizes dynamic programming to accommodate algorithms for many complex problem-solving and reasoning activities, including directional resolution for propositional satisfiability (Davis and Putnam, 1960), adaptive consistency for constraint satisfaction (Dechter and Pearl, 1987), Fourier and Gaussian elimination for linear equalities and inequalities, and dynamic programming for combinatorial optimization (Bertele and Brioschi, 1972). Here, after presenting the framework, we demonstrate that a number of algorithms for probabilistic inference can also be expressed as bucket-elimination algorithms.
The main virtues of the bucket-elimination framework are simplicity and generality. By simplicity, we mean that a complete specification of bucket-elimination algorithms is feasible without introducing extensive terminology (e.g., graph concepts such as triangulation and arc-reversal), thus making the algorithms accessible to researchers in diverse areas. More important, the uniformity of the algorithms facilitates understanding, which encourages cross-fertilization and technology transfer between disciplines. Indeed, all bucket-elimination algorithms are similar enough for any improvement to a single algorithm to be applicable to all others expressed in this framework. For example, expressing probabilistic inference algorithms as bucket-elimination methods clarifies the former's relationship to dynamic programming and to constraint satisfaction, such that the knowledge accumulated in those areas may be utilized in the probabilistic framework.
The generality of bucket elimination can be illustrated with an algorithm in the area of deterministic reasoning. Consider the following algorithm for deciding satisfiability. Given a set of clauses (a clause is a disjunction of propositional variables or their negations) and an ordering of the propositional variables, $d = Q_1, \ldots, Q_n$, algorithm directional resolution (DR) (Dechter and Rish, 1994) is the core of the well-known Davis-Putnam algorithm for satisfiability (Davis and Putnam, 1960). The algorithm is described using buckets partitioning the given set of clauses, such that all the clauses containing $Q_i$ that do not contain any symbol higher in the ordering are placed in the bucket of $Q_i$, denoted bucket_i.
The algorithm (see Figure 1) processes the buckets in the reverse order of $d$. When processing bucket_i, it resolves over $Q_i$ all possible pairs of clauses in the bucket and inserts the resolvents into the appropriate lower buckets. It was shown that if the empty clause is not generated in this process, then the theory is satisfiable and a satisfying truth assignment can be generated in time linear in the size of the resulting theory. The complexity of the algorithm is exponentially bounded (time and space) in a graph parameter called the induced width (also called tree-width) of the interaction graph of the theory, where a node is associated with a proposition and an arc connects any two nodes appearing in the same clause (Dechter and Rish, 1994).
The belief-network algorithms we present in this paper have much in common with the resolution procedure above. They all possess the property of compiling a theory into one from which answers can be extracted easily, and their complexity is dependent on the same induced-width graph parameter. The algorithms are variations on known algorithms and, for the most part, are not new, in the sense that the basic ideas have existed for some time (Cannings et al., 1978; Pearl, 1988; Lauritzen and Spiegelhalter, 1988; Tatman and Shachter, 1990; Jensen et al., 1990; R.D. Shachter and Favro, 1990; Bacchus and van Run, 1995; Shachter, 1986; Shachter, 1988; Shimony and Charniak, 1991; Shenoy, 1992). What we are presenting here is a syntactic and uniform exposition emphasizing these algorithms' form
Algorithm directional resolution
Input: A set of clauses $\varphi$, an ordering $d = Q_1, \ldots, Q_n$.
Output: A decision of whether $\varphi$ is satisfiable. If it is, an equivalent output theory; else, an empty output theory.
1. Initialize: Generate an ordered partition of the clauses, bucket_1, ..., bucket_n, where bucket_i contains all the clauses whose highest literal is $Q_i$.
2. For $p = n$ to 1, do
   if bucket_p contains a unit clause, perform only unit resolution. Put each resolvent in the appropriate bucket.
   else, resolve each pair $\{(\alpha \lor Q_p), (\beta \lor \neg Q_p)\} \subseteq$ bucket_p. If $\gamma = \alpha \lor \beta$ is empty, $\varphi$ is not satisfiable; else, add $\gamma$ to the appropriate bucket.
3. Return: $\bigcup_i$ bucket_i.

Figure 1. Algorithm directional resolution
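To make the bucket machinery concrete, the following is a minimal Python sketch of directional resolution. The clause representation, the function name, and the omission of the unit-resolution shortcut of step 2 are our own simplifications, not part of the algorithm's specification in Figure 1.

```python
from itertools import product

def directional_resolution(clauses, order):
    """Directional resolution (Figure 1), a sketch. A clause is a set of
    signed integers: 2 stands for Q2 and -2 for its negation. `order`
    lists the variables Q1..Qn; buckets are processed in reverse."""
    pos = {q: i for i, q in enumerate(order)}
    buckets = [set() for _ in order]
    for cl in clauses:
        # a clause goes into the bucket of its highest-ordered variable
        buckets[max(pos[abs(l)] for l in cl)].add(frozenset(cl))
    for p in range(len(order) - 1, -1, -1):
        q = order[p]
        pos_cl = [c for c in buckets[p] if q in c]
        neg_cl = [c for c in buckets[p] if -q in c]
        for c1, c2 in product(pos_cl, neg_cl):
            resolvent = (c1 - {q}) | (c2 - {-q})
            if not resolvent:
                return False, []          # empty clause: unsatisfiable
            if any(-l in resolvent for l in resolvent):
                continue                  # skip tautologies
            buckets[max(pos[abs(l)] for l in resolvent)].add(frozenset(resolvent))
    return True, [c for b in buckets for c in b]

# e.g. (Q1 v ~Q2), (Q2 v ~Q3), (Q3), (~Q1) is reported unsatisfiable:
# directional_resolution([{1, -2}, {2, -3}, {3}, {-1}], [1, 2, 3])
```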
as a straightforward elimination algorithm. The main virtue of this presentation, beyond uniformity, is that it allows ideas and techniques to flow across the boundaries between areas of research. In particular, having noted that elimination algorithms and clustering algorithms are very similar in the context of constraint processing (Dechter and Pearl, 1989), we find that this similarity carries over to all other tasks. We also show that the idea of conditioning, which is as universal as that of elimination, can be incorporated and exploited naturally and uniformly within the elimination framework.
Conditioning is a generic name for algorithms that search the space of partial value assignments, or partial conditionings. Conditioning means splitting a problem into subproblems based on a certain condition. Algorithms such as backtracking and branch and bound may be viewed as conditioning algorithms. The time complexity of conditioning algorithms is exponential in the size of the conditioning set; however, their space complexity is only linear. Our resulting hybrids of conditioning with elimination, which trade off time for space (see also (Dechter, 1996b; R. D. Shachter and Solovitz, 1991)), are applicable to all algorithms expressed within this framework.
The work we present here also fits into the framework developed by Arnborg and Proskourowski (Arnborg, 1985; Arnborg and Proskourowski, 1989). They present table-based reductions for various NP-hard graph problems, such as the independent-set problem, network reliability, vertex cover, graph k-colorability, and Hamiltonian circuits. Here and elsewhere (Dechter and van Beek, 1995; Dechter, 1997) we extend the approach to a different set of problems.
Following preliminaries (section 2), we present the bucket-elimination algorithm for belief updating and analyze its performance (section 3). We then extend the algorithm to the tasks of finding the most probable explanation (section 4), finding the maximum a posteriori hypothesis (section 5), and finding the maximum expected utility (section 6). Section 7 relates the algorithms to Pearl's poly-tree algorithm and to join-tree clustering. We then describe schemes for combining the conditioning method with elimination (section 8). Related work is noted in section 9, and conclusions are given in section 10.
2. Preliminaries
Belief networks provide a formalism for reasoning about partial beliefs under conditions of uncertainty. A belief network is defined by a directed acyclic graph over nodes representing random variables that take values from given domains. The arcs signify the existence of direct causal influences between the linked variables, and the strengths of these influences are quantified by conditional probabilities. A belief network relies on the notion of a directed graph.

A directed graph is a pair $G = (V, E)$, where $V = \{X_1, \ldots, X_n\}$ is a set of elements and $E = \{(X_i, X_j) \mid X_i, X_j \in V, i \neq j\}$ is the set of edges. If $(X_i, X_j) \in E$, we say that $X_i$ points to $X_j$. For each variable $X_i$, the set of parent nodes of $X_i$, denoted $pa(X_i)$, comprises the variables pointing to $X_i$ in $G$, while the set of child nodes of $X_i$, denoted $ch(X_i)$, comprises the variables that $X_i$ points to. Whenever no confusion can arise, we abbreviate $pa(X_i)$ by $pa_i$ and $ch(X_i)$ by $ch_i$. The family of $X_i$, $F_i$, includes $X_i$ and its child variables. A directed graph is acyclic if it has no directed cycles. In an undirected graph, the directions of the arcs are ignored: $(X_i, X_j)$ and $(X_j, X_i)$ are identical.
Let $X = \{X_1, \ldots, X_n\}$ be a set of random variables over multivalued domains $D_1, \ldots, D_n$. A belief network is a pair $(G, P)$ where $G$ is a directed acyclic graph and $P = \{P_i\}$, where the $P_i$ denote probabilistic relationships between $X_i$ and its parents, namely conditional probability matrices $P_i = \{P(X_i \mid pa_i)\}$. The belief network represents a probability distribution over $X$ having the product form
\[
P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_{pa_i})
\]
where an assignment $(X_1 = x_1, \ldots, X_n = x_n)$ is abbreviated to $x = (x_1, \ldots, x_n)$ and where $x_S$ denotes the projection of a tuple $x$ over a subset of variables $S$. An evidence set $e$ is an instantiated subset of variables. $A = a$ denotes a partial assignment to a subset of variables $A$ from their respective domains. We use upper-case letters for variables and nodes in a graph and lower-case letters for values in a variable's domain.
Figure 2. Belief network $P(g,f,d,c,b,a) = P(g|f)P(f|c,b)P(d|b,a)P(b|a)P(c|a)P(a)$: (a) the directed acyclic graph; (b) its moral graph.
Example 2.1 Consider the belief network defined by
\[
P(g,f,d,c,b,a) = P(g|f)P(f|c,b)P(d|b,a)P(b|a)P(c|a)P(a).
\]
Its acyclic graph is given in Figure 2a. In this case, $pa(F) = \{B, C\}$.
The following queries are defined over belief networks: 1. belief updating, namely, given a set of observations, computing the posterior probability of each proposition; 2. finding the most probable explanation (mpe), that is, given some observed variables, finding a maximum-probability assignment to the rest of the variables; 3. finding the maximum a posteriori hypothesis (map), that is, given some evidence, finding an assignment to a subset of hypothesis variables that maximizes their probability; and finally, 4. given also a utility function, finding an assignment to a subset of decision variables that maximizes the expected utility of the problem (meu).
It is known that these tasks are NP-hard. Nevertheless, they all permit a polynomial propagation algorithm for singly-connected networks (Pearl, 1988). The two main approaches to extending this propagation algorithm to multiply-connected networks are the cycle-cutset approach, also called conditioning, and tree-clustering (Pearl, 1988; Lauritzen and Spiegelhalter, 1988; Shachter, 1986). These methods work well for sparse networks with small cycle-cutsets or small clusters. In subsequent sections, bucket-elimination algorithms for each of the above tasks will be presented, and their relationship with existing methods will be discussed.
We conclude this section with some notational conventions. Let $u$ be a partial tuple, $S$ a subset of variables, and $X_p$ a variable not in $S$. We use $(u_S, x_p)$ to denote the tuple $u_S$ appended by a value $x_p$ of $X_p$.
Definition 2.2 (elimination functions) Given a function $h$ defined over a subset of variables $S$, where $X \in S$, the functions $(\min_X h)$, $(\max_X h)$, $(\text{mean}_X h)$, and $(\sum_X h)$ are defined over $U = S - \{X\}$ as follows. For every $U = u$, $(\min_X h)(u) = \min_x h(u, x)$, $(\max_X h)(u) = \max_x h(u, x)$, $(\sum_X h)(u) = \sum_x h(u, x)$, and $(\text{mean}_X h)(u) = \sum_x \frac{h(u, x)}{|X|}$, where $|X|$ is the cardinality of $X$'s domain. Given a set of functions $h_1, \ldots, h_j$ defined over the subsets $S_1, \ldots, S_j$, the product function $(\prod_j h_j)$ and $\sum_j h_j$ are defined over $U = \bigcup_j S_j$. For every $U = u$, $(\prod_j h_j)(u) = \prod_j h_j(u_{S_j})$, and $(\sum_j h_j)(u) = \sum_j h_j(u_{S_j})$.
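To make these operators concrete, here is a minimal Python sketch; the tabular factor representation and the function name are our own assumptions, not the paper's. A function over a scope $S$ is stored as a table mapping value tuples to numbers, and eliminating $X$ with sum, max, min, or mean yields a function over $S - \{X\}$.

```python
from itertools import product

def eliminate(factors, var, domains, op=sum):
    """Combine `factors` by product and eliminate `var` with `op`.
    A factor is a pair (scope, table): scope is a tuple of variable
    names, and table maps value tuples (ordered as in scope) to numbers."""
    scope = tuple(sorted({v for s, _ in factors for v in s} - {var}))
    table = {}
    for vals in product(*(domains[v] for v in scope)):
        asg = dict(zip(scope, vals))
        entries = []
        for x in domains[var]:
            asg[var] = x
            p = 1.0
            for s, t in factors:          # product of all the functions
                p *= t[tuple(asg[v] for v in s)]
            entries.append(p)
        table[vals] = op(entries)         # sum_X by default
    return scope, table
```

Here $(\max_X h)$ is `eliminate([h], X, domains, op=max)`, and the mean operator corresponds to `op=lambda es: sum(es) / len(es)`.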
3. An Elimination Algorithm for Belief Assessment
Belief updating is the primary inference task over belief networks. The task is to maintain the probability of singleton propositions once new evidence arrives. Following Pearl's propagation algorithm for singly-connected networks (Pearl, 1988), researchers have investigated various approaches to belief updating. We will now present a step-by-step derivation of a general variable-elimination algorithm for belief updating. This process is typical for any derivation of elimination algorithms.

Let $X_1 = x_1$ be an atomic proposition. The problem is to assess and update the belief in $x_1$ given some evidence $e$. Namely, we wish to compute $P(X_1 = x_1 \mid e) = \alpha P(X_1 = x_1, e)$, where $\alpha$ is a normalization constant. We will develop the algorithm using Example 2.1 (Figure 2). Assume we have the evidence $g = 1$. Consider the variables in the order $d_1 = A, C, B, F, D, G$. By definition we need to compute
\[
P(a, g{=}1) = \sum_{c,b,f,d,g=1} P(g|f)\, P(f|b,c)\, P(d|a,b)\, P(c|a)\, P(b|a)\, P(a)
\]
We can now apply some simple symbolic manipulation, migrating each conditional probability table to the left of the summations over variables that it does not reference. We get
\[
= P(a) \sum_c P(c|a) \sum_b P(b|a) \sum_f P(f|b,c) \sum_d P(d|b,a) \sum_{g=1} P(g|f) \quad (1)
\]
Carrying the computation from right to left (from $G$ to $A$), we first compute the rightmost summation, which generates a function over $f$, $\lambda_G(f)$, defined by $\lambda_G(f) = \sum_{g=1} P(g|f)$, and place it as far to the left as possible, yielding
\[
P(a) \sum_c P(c|a) \sum_b P(b|a) \sum_f P(f|b,c)\, \lambda_G(f) \sum_d P(d|b,a) \quad (2)
\]
bucket_G = P(g|f), g = 1
bucket_D = P(d|b,a)
bucket_F = P(f|b,c)
bucket_B = P(b|a)
bucket_C = P(c|a)
bucket_A = P(a)

Figure 3. Initial partitioning into buckets using d_1 = A, C, B, F, D, G
Summing next over $d$ (generating a function denoted $\lambda_D(a,b)$, defined by $\lambda_D(a,b) = \sum_d P(d|a,b)$), we get
\[
P(a) \sum_c P(c|a) \sum_b P(b|a)\, \lambda_D(a,b) \sum_f P(f|b,c)\, \lambda_G(f) \quad (3)
\]
Next, summing over $f$ (generating $\lambda_F(b,c) = \sum_f P(f|b,c)\, \lambda_G(f)$), we get
\[
P(a) \sum_c P(c|a) \sum_b P(b|a)\, \lambda_D(a,b)\, \lambda_F(b,c) \quad (4)
\]
Summing over $b$ (generating $\lambda_B(a,c)$), we get
\[
P(a) \sum_c P(c|a)\, \lambda_B(a,c) \quad (5)
\]
Finally, summing over $c$ (generating $\lambda_C(a)$), we get
\[
P(a)\, \lambda_C(a) \quad (6)
\]
The answer to the query $P(a|g{=}1)$ can be computed by evaluating the last product and then normalizing.
The bucket-elimination algorithm mimics the above algebraic manipulation using a simple organizational device we call buckets, as follows. First, the conditional probability tables (CPTs, for short) are partitioned into buckets relative to the order used, $d_1 = A, C, B, F, D, G$, as follows (going from the last variable to the first): in the bucket of $G$ we place all functions mentioning $G$. From the remaining CPTs we place all those mentioning $D$ in the bucket of $D$, and so on. The partitioning rule can be alternatively stated as follows: in the bucket of variable $X_i$ we put all functions that mention $X_i$ but do not mention any variable having a higher index. The resulting initial partitioning for our example is given in Figure 3. Note that observed variables are also placed in their corresponding buckets.
This initialization step corresponds to deriving the expression in Eq.
(1). Now we process the buckets from top to bottom, implementing the
bucket_G = P(g|f), g = 1
bucket_D = P(d|b,a)
bucket_F = P(f|b,c), λ_G(f)
bucket_B = P(b|a), λ_D(b,a), λ_F(b,c)
bucket_C = P(c|a), λ_B(a,c)
bucket_A = P(a), λ_C(a)

Figure 4. Bucket elimination along ordering d_1 = A, C, B, F, D, G.
right to left computation of Eq. (1). Bucket_G is processed first. Processing a bucket amounts to eliminating the bucket's variable from subsequent computation. To eliminate $G$, we sum over all values of $g$. Since in this case we have an observed value $g = 1$, the summation is over a singleton value. Namely, $\lambda_G(f) = \sum_{g=1} P(g|f)$ is computed and placed in bucket_F (this corresponds to deriving Eq. (2) from Eq. (1)). New functions are placed in lower buckets using the same placement rule.

Bucket_D is processed next. We sum out $D$, getting $\lambda_D(b,a) = \sum_d P(d|b,a)$, which is computed and placed in bucket_B (this corresponds to deriving Eq. (3) from Eq. (2)). The next variable is $F$. Bucket_F contains two functions, $P(f|b,c)$ and $\lambda_G(f)$, and thus, following Eq. (4), we generate the function $\lambda_F(b,c) = \sum_f P(f|b,c)\, \lambda_G(f)$, which is placed in bucket_B (this corresponds to deriving Eq. (4) from Eq. (3)). In processing the next bucket, bucket_B, the function $\lambda_B(a,c) = \sum_b P(b|a)\, \lambda_D(b,a)\, \lambda_F(b,c)$ is computed and placed in bucket_C (deriving Eq. (5) from Eq. (4)). In processing the next bucket, bucket_C, $\lambda_C(a) = \sum_c P(c|a)\, \lambda_B(a,c)$ is computed (which corresponds to deriving Eq. (6) from Eq. (5)). Finally, the belief in $a$ can be computed in bucket_A: $P(a|g{=}1) = \alpha\, P(a)\, \lambda_C(a)$. Figure 4 summarizes the flow of computation of the bucket-elimination algorithm for our example. Note that since throughout this process we recorded two-dimensional functions at most, the complexity of the algorithm using ordering $d_1$ is (roughly) time and space quadratic in the domain sizes.
What will occur if we use a different variable ordering? For example, let's apply the algorithm using $d_2 = A, F, D, C, B, G$. Applying algebraic manipulation from right to left along $d_2$ yields the following sequence of derivations:
\[
\begin{aligned}
P(a, g{=}1) &= P(a) \sum_f \sum_d \sum_c P(c|a) \sum_b P(b|a)\, P(d|a,b)\, P(f|b,c) \sum_{g=1} P(g|f) \\
&= P(a) \sum_f \lambda_G(f) \sum_d \sum_c P(c|a) \sum_b P(b|a)\, P(d|a,b)\, P(f|b,c) \\
&= P(a) \sum_f \lambda_G(f) \sum_d \sum_c P(c|a)\, \lambda_B(a,d,c,f) \\
&= P(a) \sum_f \lambda_G(f) \sum_d \lambda_C(a,d,f) \\
&= P(a) \sum_f \lambda_G(f)\, \lambda_D(a,f) \\
&= P(a)\, \lambda_F(a)
\end{aligned}
\]

bucket_G = P(g|f), g = 1
bucket_B = P(f|b,c), P(d|a,b), P(b|a)
bucket_C = P(c|a), λ_B(f,c,a,d)
bucket_D = λ_C(a,f,d)
bucket_F = λ_D(a,f), λ_G(f)
bucket_A = P(a), λ_F(a)

Figure 5. The buckets output when processing along d_2 = A, F, D, C, B, G: (a) the buckets; (b) the induced graph.
The bucket-elimination process for ordering $d_2$ is summarized in Figure 5a. Each bucket contains the initial CPTs, denoted by $P$'s, and the functions generated throughout the process, denoted by $\lambda$'s.
We summarize with a general derivation of the bucket-elimination algorithm, called elim-bel. Consider an ordering of the variables $X = (X_1, \ldots, X_n)$. Using the notation $\bar{x}_i = (x_1, \ldots, x_i)$ and $x_i^j = (x_i, x_{i+1}, \ldots, x_j)$, where $F_i$ is the family of variable $X_i$, we want to compute:
\[
P(x_1, e) = \sum_{x = x_2^{n}} \prod_i P(x_i, e \mid x_{pa_i}) = \sum_{x_2^{(n-1)}} \sum_{x_n} \prod_i P(x_i, e \mid x_{pa_i})
\]
Separating $X_n$ from the rest of the variables we get:
\[
= \sum_{x_2^{(n-1)}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}) \sum_{x_n} P(x_n, e \mid x_{pa_n}) \prod_{X_i \in ch_n} P(x_i, e \mid x_{pa_i})
= \sum_{x_2^{(n-1)}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}) \; \lambda_n(x_{U_n})
\]
where
\[
\lambda_n(x_{U_n}) = \sum_{x_n} P(x_n, e \mid x_{pa_n}) \prod_{X_i \in ch_n} P(x_i, e \mid x_{pa_i}) \quad (7)
\]
where $U_n$ denotes the variables appearing with $X_n$ in a probability component, excluding $X_n$ itself. The process continues recursively with $X_{n-1}$.

Algorithm elim-bel
Input: A belief network $BN = \{P_1, \ldots, P_n\}$; an ordering of the variables, $d = X_1, \ldots, X_n$; evidence $e$.
Output: The belief in $X_1 = x_1$.
1. Initialize: Generate an ordered partition of the conditional probability matrices, bucket_1, ..., bucket_n, where bucket_i contains all matrices whose highest variable is $X_i$. Put each observed variable in its bucket. Let $S_1, \ldots, S_j$ be the subsets of variables in the processed bucket on which matrices (new or old) are defined.
2. Backward: For $p = n$ downto 1, do
   for all the matrices $\lambda_1, \lambda_2, \ldots, \lambda_j$ in bucket_p, do
   (bucket with observed variable) if $X_p = x_p$ appears in bucket_p, assign $X_p = x_p$ to each $\lambda_i$ and then put each resulting function in the appropriate bucket.
   else, $U_p \leftarrow \bigcup_{i=1}^{j} S_i - \{X_p\}$. Generate $\lambda_p = \sum_{X_p} \prod_{i=1}^{j} \lambda_i$ and add $\lambda_p$ to the bucket of the largest-index variable in $U_p$.
3. Return: $Bel(x_1) = \alpha P(x_1) \prod_i \lambda_i(x_1)$ (where the $\lambda_i$ are in bucket_1 and $\alpha$ is a normalizing constant).

Figure 6. Algorithm elim-bel
Thus, the computation performed in the bucket of $X_n$ is captured by Eq. (7). Given ordering $X_1, \ldots, X_n$, where the queried variable appears first, the CPTs are partitioned using the rule described earlier. To process each bucket, all the bucket's functions, denoted $\lambda_1, \ldots, \lambda_j$ and defined over subsets $S_1, \ldots, S_j$, are multiplied, and then the bucket's variable is eliminated by summation. The computed function is $\lambda_p : U_p \rightarrow R$, $\lambda_p = \sum_{X_p} \prod_{i=1}^{j} \lambda_i$, where $U_p = \bigcup_i S_i - \{X_p\}$. This function is placed in the bucket of its largest-index variable in $U_p$. The procedure continues recursively with the bucket of the next variable, going from last variable to first variable. Once all the buckets are processed, the answer is available in the first bucket. Algorithm elim-bel is described in Figure 6.
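A compact Python rendering of Figure 6 follows. It is a sketch, not the paper's implementation: it reuses the `eliminate` helper sketched after Definition 2.2 and, for brevity, handles an observation by clamping the observed variable's domain to a singleton rather than by the special observed-variable rule described in section 3.2.

```python
def elim_bel(cpts, order, evidence, domains):
    """Bucket elimination for belief updating (Figure 6, sketched).
    `cpts` is a list of (scope, table) factors and `order` lists the
    variables with the query variable first; returns the belief table."""
    doms = {v: ([evidence[v]] if v in evidence else list(d))
            for v, d in domains.items()}
    pos = {v: i for i, v in enumerate(order)}
    buckets = {v: [] for v in order}
    for f in cpts:                        # initialize: partition the CPTs
        buckets[max(f[0], key=pos.get)].append(f)
    for v in reversed(order[1:]):         # backward: last variable first
        if not buckets[v]:
            continue
        scope, table = eliminate(buckets[v], v, doms)  # section 2 sketch
        dest = max(scope, key=pos.get) if scope else order[0]
        buckets[dest].append((scope, table))
    query = order[0]                      # multiply bucket_1 and normalize
    bel = {x: 1.0 for x in doms[query]}
    for s, t in buckets[query]:
        for x in bel:
            bel[x] *= t[(x,) if s else ()]
    alpha = sum(bel.values())
    return {x: p / alpha for x, p in bel.items()}
```

For the network of Figure 2 with evidence $g = 1$ and ordering $d_1$, such a call would return the updated belief $P(a \mid g{=}1)$.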
Theorem 3.1 Algorithm elim-bel computes the posterior belief $P(x_1|e)$ for any given ordering of the variables. □
Both the peeling algorithm for genetic trees (Cannings et al., 1978), and
Zhang and Poole's recent algorithm (Zhang and Poole, 1996) are variations
of elim-bel.
Figure 7. Two orderings of the moral graph of our example problem: (a) the ordered graph along d_1 (identical to its induced graph); (b) the ordered graph along d_2; (c) the induced graph along d_2.
3.1. COMPLEXITY
We see that although elim-bel can be applied using any ordering, its complexity varies considerably. Using ordering $d_1$ we recorded functions on pairs of variables only, while using $d_2$ we had to record functions on four variables (see bucket_C in Figure 5a). The arity of the function recorded in a bucket equals the number of variables appearing in that processed bucket, excluding the bucket's variable. Since recording a function of arity $r$ is time and space exponential in $r$, we conclude that the complexity of the algorithm is exponential in the size of the largest bucket, which depends on the order of processing.
Fortunately, for any variable ordering, bucket sizes can be easily read in advance from an ordered graph associated with the elimination process. Consider the moral graph of a given belief network. This graph has a node for each variable, and any two variables appearing in the same CPT are connected in the graph. The moral graph of the network in Figure 2a is given in Figure 2b. Let us take this moral graph and impose an ordering on its nodes. Figures 7a and 7b depict the ordered moral graph using the two orderings $d_1 = A, C, B, F, D, G$ and $d_2 = A, F, D, C, B, G$. The ordering is pictured from bottom up.
The width of each variable in the ordered graph is the number of its earlier neighbors in the ordering. Thus the width of $G$ in the ordered graph along $d_1$ is 1, and the width of $F$ is 2. Notice now that, using ordering $d_1$, the numbers of variables in the initial buckets of $G$ and $F$ are also 1 and 2, respectively. Indeed, in the initial partitioning, the number of variables mentioned in a bucket (excluding the bucket's variable) is always identical to the width of that node in the corresponding ordered moral graph.
During processing we wish to maintain the correspondence that any two nodes in the graph are connected iff there is a function (new or old) defined on both. Since, during processing, a function is recorded on all the variables appearing in a bucket, we should connect the corresponding nodes in the graph; namely, we should connect all the earlier neighbors of a processed variable. If we perform this graph operation recursively from the last node to the first (for each node, connecting its earlier neighbors), we get the induced graph. The width of each node in this induced graph is identical to the sizes of the buckets generated during the elimination process (see Figure 5b).
Example 3.2 The induced moral graph of Figure 2b, relative to ordering $d_1 = A, C, B, F, D, G$, is depicted in Figure 7a. In this case the ordered graph and its induced ordered graph are identical, since all earlier neighbors of each node are already connected. The maximum induced width is 2. Indeed, in this case, the maximum arity of functions recorded by the elimination algorithms is 2. For $d_2 = A, F, D, C, B, G$ the induced graph is depicted in Figure 7c. The width of $C$ is initially 1 (see Figure 7b), while its induced width is 3. The maximum induced width over all variables for $d_2$ is 4, and so is the recorded function's dimensionality.
A formal definition of all the above graph concepts is given next.

Definition 3.3 An ordered graph is a pair $(G, d)$ where $G$ is an undirected graph and $d = X_1, \ldots, X_n$ is an ordering of the nodes. The width of a node in an ordered graph is the number of the node's neighbors that precede it in the ordering. The width of an ordering $d$, denoted $w(d)$, is the maximum width over all nodes. The induced width of an ordered graph, $w^*(d)$, is the width of the induced ordered graph obtained as follows: nodes are processed from last to first; when node $X$ is processed, all its preceding neighbors are connected. The induced width of a graph, $w^*$, is the minimal induced width over all its orderings. The tree-width of a graph is the minimal induced width plus one (Arnborg, 1985).
The established connection between bucket sizes and induced width motivates finding an ordering with a smallest induced width. While it is known that finding an ordering with the smallest induced width is hard (Arnborg, 1985), useful greedy heuristics as well as approximation algorithms are available (Dechter, 1992; Becker and Geiger, 1996).
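For a given ordering, the induced width is easy to compute directly from Definition 3.3; the following small sketch (our own code, not from the paper) does so by connecting earlier neighbors from last node to first:

```python
def induced_width(edges, order):
    """Induced width w*(d) of ordering `order` for the undirected graph
    given by `edges`: process nodes from last to first, connecting the
    earlier neighbors of each processed node (Definition 3.3)."""
    pos = {v: i for i, v in enumerate(order)}
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    width = 0
    for v in reversed(order):
        earlier = {u for u in adj[v] if pos[u] < pos[v]}
        width = max(width, len(earlier))
        for a in earlier:                 # connect all earlier neighbors
            adj[a] |= earlier - {a}
    return width

# For the moral graph of Figure 2 this returns 2 along d1 = A,C,B,F,D,G
# and 4 along d2 = A,F,D,C,B,G, matching Example 3.2.
```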
In summary, the complexity of algorithm elim-bel is dominated by the time and space needed to process a bucket. Recording a function on all the bucket's variables is time and space exponential in the number of variables mentioned in the bucket. As we have seen, the induced width bounds the arity of the functions recorded; the variables appearing in a bucket coincide with the earlier neighbors of the corresponding node in the ordered induced moral graph. In conclusion,
Theorem 3.4 Given an ordering $d$, the complexity of elim-bel is (time and space) exponential in the induced width $w^*(d)$ of the network's ordered moral graph. □
3.2. HANDLING OBSERVATIONS
Evidence should be handled in a special way during the processing of buckets. Continuing with our example using elimination on order $d_1$, suppose we wish to compute the belief in $A = a$ having observed $b = 1$. This observation is relevant only when processing bucket_B. When the algorithm arrives at that bucket, the bucket contains the three functions $P(b|a)$, $\lambda_D(b,a)$, and $\lambda_F(b,c)$, as well as the observation $b = 1$ (see Figure 4).

The processing rule dictates computing $\lambda_B(a,c) = P(b{=}1|a)\, \lambda_D(b{=}1, a)\, \lambda_F(b{=}1, c)$. Namely, we would generate and record a two-dimensional function. It would be more effective, however, to apply the assignment $b = 1$ to each function in the bucket separately and then put the resulting functions into lower buckets. In other words, we can generate $P(b{=}1|a)$ and $\lambda_D(b{=}1, a)$, each of which will be placed in the bucket of $A$, and $\lambda_F(b{=}1, c)$, which will be placed in the bucket of $C$. By so doing, we avoid increasing the dimensionality of the recorded functions. Processing buckets containing observations in this manner automatically exploits the cutset-conditioning effect (Pearl, 1988). Therefore, the algorithm has a special rule for processing buckets with observations: the observed value is assigned to each function in the bucket, and each resulting function is moved individually to a lower bucket.
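In code, the observed-variable rule amounts to restricting each function in the bucket separately, which shrinks scopes instead of combining functions. A minimal sketch over the tabular factors used earlier (the function name is our own):

```python
def restrict(factor, var, value):
    """Assign var = value inside one factor, dropping var from its scope
    (the observed-variable rule); the arity never grows."""
    scope, table = factor
    if var not in scope:
        return factor
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    new_table = {k[:i] + k[i + 1:]: p
                 for k, p in table.items() if k[i] == value}
    return new_scope, new_table
```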
Note that, if the bucket of $B$ had been at the top of our ordering, as in $d_2$, the virtue of conditioning on $B$ could have been exploited earlier. When bucket_B is processed, it contains $P(b|a)$, $P(d|b,a)$, $P(f|c,b)$, and $b = 1$ (see Figure 5a). The special rule for processing buckets holding observations will place $P(b{=}1|a)$ in bucket_A, $P(d|b{=}1, a)$ in bucket_D, and $P(f|c, b{=}1)$ in bucket_F. In subsequent processing, only one-dimensional functions will be recorded.
We see that the presence of observations reduces complexity. Since the buckets of observed variables are processed in linear time, and the recorded functions do not create functions on new subsets of variables, the corresponding new arcs should not be added when computing the induced graph. Namely, earlier neighbors of observed variables should not be connected. To capture this refinement we use the notion of the adjusted induced graph, which is defined recursively as follows. Given an ordering and a set of observed nodes, the adjusted induced graph is generated by processing the nodes from last to first, connecting the earlier neighbors of unobserved nodes only. The adjusted induced width is the width of the adjusted induced graph.
Theorem 3.5 Given a belief network having $n$ variables, algorithm elim-bel using ordering $d$ and evidence $e$ is (time and space) exponential in the adjusted induced width $w^*(d, e)$ of the network's ordered moral graph. □
3.3. FOCUSING ON RELEVANT SUBNETWORKS
We will now present an improvement to elim-bel whose essence is restricting the computation to relevant portions of the belief network. Such restrictions are already available in the literature in the context of existing algorithms (Geiger et al., 1990; Shachter, 1990).

Since the summation over all values of a probability function is 1, the recorded functions of some buckets will degenerate to the constant 1. If we could recognize such cases in advance, we could avoid needless computation by skipping some buckets. If we use a topological ordering of the belief network's acyclic graph (where parents precede their child nodes), and assuming that the queried variable starts the ordering¹, we can recognize skippable buckets dynamically, during the elimination process.
Proposition 3.6 Given a belief network and a topological ordering $X_1, \ldots, X_n$, algorithm elim-bel can skip a bucket if, at the time of processing, the bucket contains no evidence variable, no query variable, and no newly computed function. □
Proof: If a topological ordering is used, each bucket (that does not contain the queried variable) initially contains at most one function, describing the probability of its variable conditioned on all its parents. Clearly, if there is no evidence, summation will yield the constant 1. □
Example 3.7 Consider again the belief network whose acyclic graph is given in Figure 2a and the ordering $d_1 = A, C, B, F, D, G$, and assume we want to update the belief in variable $A$ given evidence on $F$. Clearly the buckets of $G$ and $D$ can be skipped, and processing should start with bucket_F. Once the bucket of $F$ is processed, none of the remaining buckets are skippable.
Alternatively, the relevant portion of the network can be precomputed by using a recursive marking procedure applied to the ordered moral graph (see also (Zhang and Poole, 1996)).
Definition 3.8 Given an acyclic graph and a topological ordering that starts with the queried variable, and given evidence $e$, the marking process works as follows. An evidence node is marked, a neighbor of the query variable is marked, and then any earlier neighbor of a marked node is marked.
¹ Otherwise, the queried variable can be moved to the top of the ordering.
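The marking procedure of Definition 3.8 is easy to state over the moral graph's adjacency lists; a small sketch (our own naming and representation, not the paper's):

```python
def mark_relevant(adj, order, query, evidence_vars):
    """Definition 3.8: mark evidence nodes and the query's neighbors,
    then recursively mark every earlier neighbor of a marked node.
    `adj` maps each node to its neighbors in the moral graph."""
    pos = {v: i for i, v in enumerate(order)}
    marked = {query} | set(evidence_vars) | set(adj[query])
    stack = list(marked)
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if pos[u] < pos[v] and u not in marked:  # earlier neighbor
                marked.add(u)
                stack.append(u)
    return marked   # elim-bel can be run on the marked subnetwork
```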
Algorithm elim-bel
...
2. Backward: For $p = n$ downto 1, do
   for all the matrices $\lambda_1, \lambda_2, \ldots, \lambda_j$ in bucket_p, do
   (bucket with observed variable) if $X_p = x_p$ appears in bucket_p, then substitute $X_p = x_p$ in each matrix $\lambda_i$ and put each in the appropriate bucket.
   else, if bucket_p is NOT skippable, then $U_p \leftarrow \bigcup_{i=1}^{j} S_i - \{X_p\}$ and $\lambda_p = \sum_{X_p} \prod_{i=1}^{j} \lambda_i$. Add $\lambda_p$ to the bucket of the largest-index variable in $U_p$.
...

Figure 8. Improved algorithm elim-bel
The marked belief subnetwork, obtained by deleting all unmarked nodes, can now be processed by elim-bel to answer the belief-updating query. It is easy to see that

Theorem 3.9 The complexity of algorithm elim-bel given evidence $e$ is exponential in the adjusted induced width of the marked ordered moral subgraph.

Proof: Deleting the unmarked nodes from the belief network results in a belief subnetwork whose distribution is identical to the marginal distribution over the marked variables. □
4. An Elimination Algorithm for mpe
In this section we focus on the task of finding the most probable explanation. This task appears in applications such as diagnosis and abduction. For example, it can suggest the disease from which a patient suffers given data on clinical findings. Researchers have investigated various approaches to finding the mpe in a belief network (see, e.g., (Pearl, 1988; Cooper, 1984; Peng and Reggia, 1986; Peng and Reggia, 1989)). Recent proposals include best-first search algorithms (Shimony and Charniak, 1991) and algorithms based on linear programming (Santos, 1991).
The problem is to find $x^o$ such that $P(x^o) = \max_x \prod_i P(x_i, e \mid x_{pa_i})$, where $x = (x_1, \ldots, x_n)$ and $e$ is a set of observations. Namely, we compute, for a given ordering $X_1, \ldots, X_n$,
\[
M = \max_{\bar{x}_n} P(\bar{x}_n) = \max_{\bar{x}_{n-1}} \max_{x_n} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}) \quad (8)
\]
This can be accomplished as before by performing the maximization operation along the ordering from right to left, while migrating to the left, at each step, all components that do not mention the maximizing variable. We get
\[
M = \max_{\bar{x}_{n-1}} \max_{x_n} P(\bar{x}_n, e) = \max_{\bar{x}_{n-1}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}) \; \max_{x_n} P(x_n, e \mid x_{pa_n}) \prod_{X_i \in ch_n} P(x_i, e \mid x_{pa_i})
\]
\[
= \max_{\bar{x}_{n-1}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}) \; h_n(x_{U_n})
\]
where
\[
h_n(x_{U_n}) = \max_{x_n} P(x_n, e \mid x_{pa_n}) \prod_{X_i \in ch_n} P(x_i, e \mid x_{pa_i})
\]
where $U_n$ are the variables appearing in components defined over $X_n$. Clearly, the algebraic manipulation of the above expressions is the same as the algebraic manipulation for belief assessment, where summation is replaced by maximization. Consequently, the bucket-elimination procedure elim-mpe is identical to elim-bel except for this change. Given ordering $X_1, \ldots, X_n$, the conditional probability tables are partitioned as before. To process each bucket, we multiply all the bucket's matrices, which in this case are denoted $h_1, \ldots, h_j$ and defined over subsets $S_1, \ldots, S_j$, and then eliminate the bucket's variable by maximization. The computed function in this case is $h_p : U_p \rightarrow R$, $h_p = \max_{X_p} \prod_{i=1}^{j} h_i$, where $U_p = \bigcup_i S_i - \{X_p\}$. The function obtained by processing a bucket is placed in the bucket of its largest-index variable in $U_p$. In addition, a function $x_p^o(u) = \arg\max_{X_p} h_p(u)$, which associates an optimizing value of $X_p$ with each tuple of $U_p$, is recorded and placed in the bucket of $X_p$.
The procedure continues recursively, processing the bucket of the next variable while going from last variable to first variable. Once all buckets are processed, the mpe value can be extracted in the first bucket. When this backward phase terminates, the algorithm initiates a forward phase to compute an mpe tuple.

Forward phase: Once all the variables are processed, an mpe tuple is computed by assigning values along the ordering from $X_1$ to $X_n$, consulting the information recorded in each bucket. Specifically, once the partial assignment $x = (x_1, \ldots, x_{i-1})$ is selected, the value of $X_i$ appended to this tuple is $x_i^o(x)$, where $x^o$ is the function recorded in the backward phase. The algorithm is presented in Figure 9. Observed variables are handled as in elim-bel.
Example 4.1 Consider again the belief network of Figure 2. Given the ordering $d = A, C, B, F, D, G$ and the evidence $g = 1$, we process the variables from last to first after partitioning the conditional probability matrices into buckets, such that bucket_G = {P(g|f), g = 1}, bucket_D = {P(d|b,a)},
Algorithm elim-mpe
Input: A belief network $BN = \{P_1, \ldots, P_n\}$; an ordering of the variables, $d$; observations $e$.
Output: The most probable assignment.
1. Initialize: Generate an ordered partition of the conditional probability matrices, bucket_1, ..., bucket_n, where bucket_i contains all matrices whose highest variable is $X_i$. Put each observed variable in its bucket. Let $S_1, \ldots, S_j$ be the subsets of variables in the processed bucket on which matrices (new or old) are defined.
2. Backward: For $p = n$ downto 1, do
   for all the matrices $h_1, h_2, \ldots, h_j$ in bucket_p, do
   (bucket with observed variable) if bucket_p contains $X_p = x_p$, assign $X_p = x_p$ to each $h_i$ and put each in the appropriate bucket.
   else, $U_p \leftarrow \bigcup_{i=1}^{j} S_i - \{X_p\}$. Generate the functions $h_p = \max_{X_p} \prod_{i=1}^{j} h_i$ and $x_p^o = \arg\max_{X_p} h_p$. Add $h_p$ to the bucket of the largest-index variable in $U_p$.
3. Forward: Assign values in the ordering $d$ using the recorded functions $x^o$ in each bucket.

Figure 9. Algorithm elim-mpe
bucket_F = {P(f|b,c)}, bucket_B = {P(b|a)}, bucket_C = {P(c|a)}, and bucket_A = {P(a)}. To process $G$, assign $g = 1$, get $h_G(f) = P(g{=}1|f)$, and place the result in bucket_F. The function $G^o(f) = \arg\max_G h_G(f)$ is placed in bucket_G as well. Process bucket_D by computing $h_D(b,a) = \max_d P(d|b,a)$ and putting the result in bucket_B; record also $D^o(b,a) = \arg\max_D P(d|b,a)$. The bucket of $F$, to be processed next, now contains two matrices, $P(f|b,c)$ and $h_G(f)$. Compute $h_F(b,c) = \max_f P(f|b,c)\, h_G(f)$, and place the resulting function in bucket_B. To eliminate $B$, we record the function $h_B(a,c) = \max_b P(b|a)\, h_D(b,a)\, h_F(b,c)$ and place it in bucket_C. To eliminate $C$, we compute $h_C(a) = \max_c P(c|a)\, h_B(a,c)$ and place it in bucket_A. Finally, the mpe value $M = \max_a P(a)\, h_C(a)$ is determined in bucket_A, along with the mpe tuple, which is obtained by going forward through the buckets.
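In code, the bucket operation of elim-mpe differs from the belief-updating one only by maximizing instead of summing and by recording the optimizing value $x_p^o$. A minimal sketch over the tabular factors used earlier (our own naming):

```python
from itertools import product

def eliminate_max(factors, var, domains):
    """Multiply the factors of a bucket and maximize out `var`,
    recording argmax tables (the x^o_p function of Figure 9)."""
    scope = tuple(sorted({v for s, _ in factors for v in s} - {var}))
    h, x_opt = {}, {}
    for vals in product(*(domains[v] for v in scope)):
        asg = dict(zip(scope, vals))
        best, arg = -1.0, None
        for x in domains[var]:
            asg[var] = x
            p = 1.0
            for s, t in factors:
                p *= t[tuple(asg[v] for v in s)]
            if p > best:
                best, arg = p, x
        h[vals], x_opt[vals] = best, arg
    return (scope, h), x_opt
```

The forward phase then assigns values along $d$, looking up the recorded `x_opt` tables for each variable in turn.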
The backward process can be viewed as a compilation (or learning)
phase, in which we compile information regarding the most probable extension of partial tuples to variables higher in the ordering (see also section
7.2).
Similarly to the case of belief updating, the complexity of elim-mpe is bounded exponentially in the dimension of the recorded matrices, and those are bounded by the induced width $w^*(d, e)$ of the ordered moral graph. In summary,

Theorem 4.2 Algorithm elim-mpe is complete for the mpe task. Its complexity (time and space) is $O(n \cdot \exp(w^*(d, e)))$, where $n$ is the number of variables and $w^*(d, e)$ is the $e$-adjusted induced width of the ordered moral graph. □
5. An Elimination Algorithm for MAP
We next present an elimination algorithm for the map task. By its definition, the task is a mixture of the previous two, and thus in the algorithm some of the variables are eliminated by summation, others by maximization. Given a belief network, a subset of hypothesized variables $A = \{A_1, \ldots, A_k\}$, and some evidence $e$, the problem is to find an assignment to the hypothesized variables that maximizes their probability given the evidence. Formally, we wish to compute
\[
\max_{\bar{a}_k} P(x, e) = \max_{\bar{a}_k} \sum_{x_{k+1}^{n}} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i})
\]
where $x = (a_1, \ldots, a_k, x_{k+1}, \ldots, x_n)$. In the algebraic manipulation of this expression, we push the maximization to the left of the summation. This means that in the elimination algorithm, the maximized variables should initiate the ordering (and therefore will be processed last). Algorithm elim-map in Figure 10 considers only orderings in which the hypothesized variables start the ordering. The algorithm has a backward phase and a forward phase, but the forward phase is relative to the hypothesized variables only. Maximization and summation may be somewhat interleaved to allow more effective orderings; however, we do not incorporate this option here. Note that the relevant graph for this task can be restricted by marking in a manner very similar to the belief-updating case. In this case the initial marking includes all the hypothesized variables, while otherwise the marking procedure is applied recursively to the summation variables only.

Theorem 5.1 Algorithm elim-map is complete for the map task. Its complexity is $O(n \cdot \exp(w^*(d, e)))$, where $n$ is the number of variables in the relevant marked graph and $w^*(d, e)$ is the $e$-adjusted induced width of its marked moral graph. □
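A sketch of elim-map's backward phase is given below, assuming the `eliminate` and `eliminate_max` helpers from the earlier sketches; summation and maximization buckets differ only in the elimination operator. The names and return conventions are our own assumptions.

```python
def elim_map_backward(cpts, order, hyp_vars, evidence, domains):
    """Backward phase of elim-map (Figure 10, sketched): hypothesis
    variables, which start `order`, are maximized out; all other
    variables are summed out. Returns the map value and the argmax
    tables consulted by the forward phase over A1..Ak."""
    doms = {v: ([evidence[v]] if v in evidence else list(d))
            for v, d in domains.items()}
    pos = {v: i for i, v in enumerate(order)}
    buckets = {v: [] for v in order}
    for f in cpts:
        buckets[max(f[0], key=pos.get)].append(f)
    map_value, argmaxes = 1.0, {}
    for v in reversed(order):
        if not buckets[v]:
            continue
        if v in hyp_vars:                 # maximization bucket
            f, argmaxes[v] = eliminate_max(buckets[v], v, doms)
        else:                             # summation bucket
            f = eliminate(buckets[v], v, doms)
        if f[0]:
            buckets[max(f[0], key=pos.get)].append(f)
        else:
            map_value *= f[1][()]         # constant factor: accumulate
    return map_value, argmaxes
```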
6. An Elimination Algorithm for MEU
The last and somewhat more complicated task we address is that of finding the maximum expected utility. Given a belief network, evidence $e$, a real-valued utility function $u(x)$ that is additively decomposable relative to functions $f_1, \ldots, f_j$ defined over $Q = \{Q_1, \ldots, Q_j\}$, $Q_i \subseteq X$, such that $u(x) = \sum_{Q_j \in Q} f_j(x_{Q_j})$, and a subset of decision variables $D = \{D_1, \ldots, D_k\}$ that are assumed to be root nodes, the meu task is to find a set of decisions
Algorithm elim-map
Input: A belief network $BN = \{P_1, \ldots, P_n\}$; a subset of variables $A = \{A_1, \ldots, A_k\}$; an ordering of the variables, $d$, in which the $A$'s are first in the ordering; observations $e$.
Output: A most probable assignment $A = a$.
1. Initialize: Generate an ordered partition of the conditional probability matrices, bucket_1, ..., bucket_n, where bucket_i contains all matrices whose highest variable is $X_i$.
2. Backward: For $p = n$ downto 1, do
   for all the matrices $\beta_1, \beta_2, \ldots, \beta_j$ in bucket_p, do
   (bucket with observed variable) if bucket_p contains the observation $X_p = x_p$, assign $X_p = x_p$ to each $\beta_i$ and put each in the appropriate bucket.
   else, $U_p \leftarrow \bigcup_{i=1}^{j} S_i - \{X_p\}$. If $X_p \notin A$ and bucket_p contains new functions, then $\beta_p = \sum_{X_p} \prod_{i=1}^{j} \beta_i$; else ($X_p \in A$), $\beta_p = \max_{X_p} \prod_{i=1}^{j} \beta_i$ and $a^o = \arg\max_{X_p} \beta_p$. Add $\beta_p$ to the bucket of the largest-index variable in $U_p$.
3. Forward: Assign values, in the ordering $d = A_1, \ldots, A_k$, using the information recorded in each bucket.

Figure 10. Algorithm elim-map
$d^o = (d_1^o, \ldots, d_k^o)$ that maximizes the expected utility. We assume that the variables not appearing in $D$ are indexed $X_{k+1}, \ldots, X_n$. Formally, we want to compute
\[
E = \max_{d_1, \ldots, d_k} \sum_{x_{k+1}, \ldots, x_n} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}, d_1, \ldots, d_k)\, u(x),
\]
and
\[
d^o = \arg\max_{D} E.
\]
As in the previous tasks, we will begin by identifying the computation associated with $X_n$, from which we will extract the computation in each bucket. We denote an assignment to the decision variables by $d = (d_1, \ldots, d_k)$ and $x_k^j = (x_k, \ldots, x_j)$. Algebraic manipulation yields
\[
E = \max_{d} \sum_{x_{k+1}^{(n-1)}} \sum_{x_n} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}, d) \sum_{Q_j \in Q} f_j(x_{Q_j})
\]
We can now separate the components in the utility functions into those mentioning $X_n$, denoted by the index set $t_n$, and those not mentioning $X_n$, labeled with indexes $l_n = \{1, \ldots, n\} - t_n$. Accordingly we get
\[
E = \max_{d} \sum_{x_{k+1}^{(n-1)}} \sum_{x_n} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}, d) \left( \sum_{j \in l_n} f_j(x_{Q_j}) + \sum_{j \in t_n} f_j(x_{Q_j}) \right)
\]
\[
E = \max_{d} \left[ \sum_{x_{k+1}^{(n-1)}} \sum_{x_n} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}, d) \sum_{j \in l_n} f_j(x_{Q_j}) \; + \; \sum_{x_{k+1}^{(n-1)}} \sum_{x_n} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}, d) \sum_{j \in t_n} f_j(x_{Q_j}) \right]
\]
By migrating to the left of $X_n$ all of the elements that are not a function of $X_n$, we get
\[
\max_{d} \left[ \sum_{x_{k+1}^{(n-1)}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}, d) \left( \sum_{j \in l_n} f_j(x_{Q_j}) \right) \sum_{x_n} \prod_{X_i \in F_n} P(x_i, e \mid x_{pa_i}, d) \right.
\]
\[
\left. + \sum_{x_{k+1}^{(n-1)}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}, d) \sum_{x_n} \prod_{X_i \in F_n} P(x_i, e \mid x_{pa_i}, d) \sum_{j \in t_n} f_j(x_{Q_j}) \right] \quad (9)
\]
We denote by $U_n$ the subset of variables that appear with $X_n$ in a probabilistic component, excluding $X_n$ itself, and by $W_n$ the union of variables that appear in probabilistic and utility components with $X_n$, excluding $X_n$ itself. We define $\lambda_n$ over $U_n$ as ($x$ is a tuple over $U_n \cup X_n$)
\[
\lambda_n(x_{U_n} \mid d) = \sum_{x_n} \prod_{X_i \in F_n} P(x_i, e \mid x_{pa_i}, d) \quad (10)
\]
We define $\theta_n$ over $W_n$ as
\[
\theta_n(x_{W_n} \mid d) = \sum_{x_n} \prod_{X_i \in F_n} P(x_i, e \mid x_{pa_i}, d) \sum_{j \in t_n} f_j(x_{Q_j}) \quad (11)
\]
After substituting Eqs. (10) and (11) into Eq. (9), we get
\[
E = \max_{d} \sum_{x_{k+1}^{(n-1)}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}, d)\, \lambda_n(x_{U_n} \mid d) \left[ \sum_{j \in l_n} f_j(x_{Q_j}) + \frac{\theta_n(x_{W_n} \mid d)}{\lambda_n(x_{U_n} \mid d)} \right] \quad (12)
\]
The functions $\lambda_n$ and $\theta_n$ compute the effect of eliminating $X_n$. The result (Eq. (12)) is an expression that does not include $X_n$, where the product has one more matrix $\lambda_n$ and the utility components have one more element, $\theta_n / \lambda_n$. Applying such algebraic manipulation to the rest of the variables in order yields the elimination algorithm elim-meu in Figure 11. We assume here that decision variables are processed last by elim-meu. Each bucket contains utility components $\theta_i$ and probability components $\lambda_i$. When there is no evidence, $\lambda_n$ is a constant and we can incorporate the marking modification we presented for elim-bel. Otherwise, during processing, the algorithm generates the $\lambda_i$ of a bucket by multiplying all its
Algorithm elim-meu
Input: A belief network $BN = \{P_1, \ldots, P_n\}$; a subset of variables $D_1, \ldots, D_k$ that are decision variables which are root nodes; a utility function over $X$, $u(x) = \sum_j f_j(x_{Q_j})$; an ordering of the variables, $o$, in which the $D$'s appear first; observations $e$.
Output: An assignment $d_1, \ldots, d_k$ that maximizes the expected utility.
1. Initialize: Partition components into buckets, where bucket_i contains all matrices whose highest variable is $X_i$. Call probability matrices $\lambda_1, \ldots, \lambda_j$ and utility matrices $\theta_1, \ldots, \theta_l$. Let $S_1, \ldots, S_j$ be the probability variable subsets and $Q_1, \ldots, Q_l$ be the utility variable subsets.
2. Backward: For $p = n$ downto $k+1$, do
   for all matrices $\lambda_1, \ldots, \lambda_j, \theta_1, \ldots, \theta_l$ in bucket_p, do
   (bucket with observed variable) if bucket_p contains the observation $X_p = x_p$, then assign $X_p = x_p$ to each $\lambda_i, \theta_i$ and put each resulting matrix in the appropriate bucket.
   else, $U_p \leftarrow \bigcup_{i=1}^{j} S_i - \{X_p\}$ and $W_p \leftarrow U_p \cup (\bigcup_{i=1}^{l} Q_i - \{X_p\})$. If bucket_p contains an observation or new $\lambda$'s, then $\lambda_p = \sum_{X_p} \prod_i \lambda_i$ and $\theta_p = \frac{1}{\lambda_p} \sum_{X_p} \prod_{i=1}^{j} \lambda_i \sum_{j=1}^{l} \theta_j$; else, $\theta_p = \sum_{X_p} \prod_{i=1}^{j} \lambda_i \sum_{j=1}^{l} \theta_j$. Add $\theta_p$ and $\lambda_p$ to the buckets of the largest-index variable in $W_p$ and $U_p$, respectively.
3. Forward: Assign values in the ordering $o = D_1, \ldots, D_k$ using the information recorded in each bucket of the decision variables.

Figure 11. Algorithm elim-meu
probability components and summing over $X_i$. The $\theta_i$ of bucket $X_i$ is computed as the average utility of the bucket; if the bucket is marked, the average utility of the bucket is normalized by its $\lambda_i$. The resulting $\theta_i$ and $\lambda_i$ are placed into the appropriate buckets.
The maximization over the decision variables can now be accomplished using maximization as the elimination operator. We do not include this step explicitly, since, given our simplifying assumption that all decisions are root nodes, this step is straightforward. Clearly, maximization and summation can be interleaved to some degree, thus allowing more efficient orderings.
As before, the algorithm's performance can be bounded as a function of the structure of its augmented graph. The augmented graph is the moral graph augmented with arcs connecting any two variables appearing in the same utility component $f_i$, for some $i$.

Theorem 6.1 Algorithm elim-meu computes the meu of a belief network augmented with utility components (i.e., an influence diagram) in time and space $O(n \cdot \exp(w^*(d, e)))$, where $w^*(d, e)$ is the induced width along $d$ of the augmented moral graph. □

Figure 12. (a) A poly-tree and (b) a legal processing ordering
Tatman and Shachter (Tatman and Shachter, 1990) have published an algorithm that is a variation of elim-meu, and Kjaerulff's algorithm (Kjaerulff, 1993) can be viewed as a variation of elim-meu tailored to dynamic probabilistic networks.
7. Relation of Bucket Elimination to Other Methods
7.1. POLY-TREE ALGORITHM
When the belief network is a poly-tree, belief assessment, the mpe task, and the map task can all be accomplished efficiently using Pearl's poly-tree algorithm (Pearl, 1988). Likewise, when the augmented graph is a tree, the meu can be computed efficiently. A poly-tree is a directed acyclic graph whose underlying undirected graph has no cycles.

We claim that if a bucket-elimination algorithm processes variables in a topological ordering (parents precede their child nodes), then the algorithm coincides (with some minor modifications) with the poly-tree algorithm. We will demonstrate the main idea using bucket elimination for the mpe task. The arguments are applicable to the rest of the tasks.
Example 7.1 Consider the ordering $X_1, U_1, U_2, U_3, Y_1, Z_1, Z_2, Z_3$ of the poly-tree in Figure 12a, and assume that the last four variables are observed (here we denote an observed value by a primed lowercase letter and leave other variables in lowercase). Processing the buckets from last to first, after the first four buckets have been processed as observation buckets, we get

bucket(U_3) = P(u_3), P(x_1|u_1, u_2, u_3), P(z'_3|u_3)
bucket(U_2) = P(u_2), P(z'_2|u_2)
bucket(U_1) = P(u_1), P(z'_1|u_1)
bucket(X_1) = P(y'_1|x_1)

When processing bucket(U_3) by elim-mpe, we get $h_{U_3}(x_1, u_2, u_1)$, which is placed in bucket(U_2). The final resulting buckets are

bucket(U_3) = P(u_3), P(x_1|u_1, u_2, u_3), P(z'_3|u_3)
bucket(U_2) = P(u_2), P(z'_2|u_2), h_{U_3}(x_1, u_2, u_1)
bucket(U_1) = P(u_1), P(z'_1|u_1), h_{U_2}(x_1, u_1)
bucket(X_1) = P(y'_1|x_1), h_{U_1}(x_1)

We can now choose a value $x_1$ that maximizes the product in $X_1$'s bucket, then choose a value $u_1$ that maximizes the product in $U_1$'s bucket given the selected value of $X_1$, and so on.
It is easy to see that if elim-mpe uses a topological ordering of the poly-tree, it is time and space $O(\exp(|F|))$, where $|F|$ is the cardinality of the maximum family. For instance, in Example 7.1, elim-mpe records the intermediate function $h_{U_3}(X_1, U_2, U_1)$, requiring $O(k^3)$ space, where $k$ bounds the domain size of each variable. Note, however, that Pearl's algorithm (which is also time exponential in the family size) is better, as it records functions on single variables only.

In order to restrict space needs, we modify elim-mpe in two ways. First, we restrict processing to a subset of the topological orderings in which sibling nodes and their parent appear consecutively as much as possible. Second, whenever the algorithm reaches a set of consecutive buckets from the same family, all such buckets are combined and processed as one super-bucket. With this change, elim-mpe is similar to Pearl's propagation algorithm on poly-trees.² Processing a super-bucket amounts to eliminating all the super-bucket's variables without recording intermediate results.
Example 7.2 Consider Example 7.1. Here, instead of processing each bucket of $U_i$ separately, we compute by a brute-force algorithm the function $h_{U_1,U_2,U_3}(x_1)$ in the super-bucket of $U_1, U_2, U_3$ and place the function in the bucket of $X_1$. We get the unary function
\[
h_{U_1,U_2,U_3}(x_1) = \max_{u_1,u_2,u_3} P(u_3) P(x_1|u_1,u_2,u_3) P(z'_3|u_3) P(u_2) P(z'_2|u_2) P(u_1) P(z'_1|u_1).
\]
The details for obtaining an ordering such that all families in a poly-tree can be processed as super-buckets can be worked out, but are beyond the scope of this paper. In summary,

Proposition 7.3 There exists an ordering of a poly-tree such that bucket-elimination algorithms (elim-bel, elim-mpe, etc.) with the super-bucket modification have the same time and space complexity as Pearl's poly-tree algorithm for the corresponding tasks. The modified algorithms' time complexity is exponential in the family size, and they require only linear space. □
² Actually, Pearl's algorithm should be restricted to message passing relative to one rooted tree in order to be identical with ours.
(GF) --F-- (FCB) --CB-- (CBA) --AB-- (DBA)

Figure 13. Clique-tree associated with the induced graph of Figure 7a
7.2. JOIN-TREE CLUSTERING
Join-tree clustering (Lauritzen and Spiegelhalter, 1988) and bucket elimination are closely related, and their worst-case complexity (time and space) is essentially the same. The sizes of the cliques in tree-clustering are identical to the induced width plus one of the corresponding ordered graph. In fact, elimination may be viewed as a directional (i.e., goal- or query-oriented) version of join-tree clustering. The close relationship between join-tree clustering and bucket elimination can be used to attribute meaning to the intermediate functions computed by elimination.

Given an elimination ordering, we can generate the ordered moral induced graph, whose maximal cliques (namely, maximal fully-connected subgraphs) can be enumerated as follows. Each variable and its earlier neighbors form a clique, and each clique is connected to a parent clique with which it shares the largest subset of variables (Dechter and Pearl, 1989). For example, the induced graph in Figure 7a yields the clique-tree in Figure 13. If this ordering is used by tree-clustering, the same tree may be generated.

The functions recorded by bucket elimination can be given the following meaning (details and proofs of these claims are beyond the scope of this paper). The function $h_p(u)$ recorded in bucket_p by elim-mpe and defined over $\bigcup_i S_i - \{X_p\}$ is the maximum-probability extension of $u$ to the variables that appear later in the ordering and are also mentioned in the clique subtree rooted at a clique containing $U_p$. For instance, $h_F(b,c)$ recorded by elim-mpe using $d_1$ (see Example 4.1) equals $\max_{f,g} P(b,c,f,g)$, since $F$ and $G$ appear in the clique subtree rooted at $(FCB)$. For belief assessment, the function $\lambda_p = \sum_{X_p} \prod_{i=1}^{j} \lambda_i$, defined over $U_p = \bigcup_i S_i - \{X_p\}$, denotes the probability of all the evidence $e_p^+$ observed in the clique subtree rooted at a clique containing $U_p$, conjoined with $u$. Namely, $\lambda_p(u) = P(e_p^+, u)$.
Figure 14. Probability tree of partial value assignments (branches assign a = 0, 1, then c, b, and f, with leaves holding products such as P(d=1|b,a) P(g=0|f=0)).
8. Combining Elimination and Conditioning
A serious drawback of elimination algorithms is that they require considerable memory for recording the intermediate functions. Conditioning, on the other hand, requires only linear space. By combining conditioning and elimination, we may be able to reduce the amount of memory needed while still having a performance guarantee.

Conditioning can be viewed as an algorithm for processing the algebraic expressions defined for the task from left to right. In this case, partial results cannot be assembled; rather, partial value assignments (conditioning on a subset of variables) unfold a tree of subproblems, each associated with an assignment to some variables. Say, for example, that we want to compute the expression for the mpe in the network of Figure 2:
\[
M = \max_{a,c,b,f,d,g} P(g|f)\, P(f|b,c)\, P(d|a,b)\, P(c|a)\, P(b|a)\, P(a)
\]
\[
= \max_a P(a) \max_c P(c|a) \max_b P(b|a) \max_f P(f|b,c) \max_d P(d|b,a) \max_g P(g|f). \quad (13)
\]
We can compute the expression by traversing the tree in Figure 14, going along the ordering from first variable to last variable. The tree can be traversed either breadth-first or depth-first, resulting in known search algorithms such as best-first search and branch and bound.
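For concreteness, here is a small depth-first traversal of that probability tree with a simple branch-and-bound cut (our own sketch, reusing the tabular factors introduced earlier). The pruning test is sound here because every CPT entry is at most 1, so a partial product already bounds any completion.

```python
def dfs_mpe(cpts, order, domains):
    """Depth-first search of the probability tree (Figure 14): extend
    partial assignments along `order`, multiplying in each CPT as soon
    as its scope is fully assigned. Space is linear in n."""
    pos = {v: i for i, v in enumerate(order)}
    by_depth = [[] for _ in order]        # CPTs that close at each depth
    for s, t in cpts:
        by_depth[max(pos[v] for v in s)].append((s, t))
    best = [0.0]
    def expand(i, asg, p):
        if i == len(order):
            best[0] = p                   # a strictly better leaf
            return
        for x in domains[order[i]]:
            asg[order[i]] = x
            q = p
            for s, t in by_depth[i]:
                q *= t[tuple(asg[v] for v in s)]
            if q > best[0]:               # branch-and-bound pruning
                expand(i + 1, asg, q)
    expand(0, {}, 1.0)
    return best[0]                        # the mpe value M of Eq. (13)
```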
Algorithm elim-cond-mpe
Input: A belief network $BN = \{P_1, \ldots, P_n\}$; an ordering of the variables, $d$; a subset $C$ of conditioned variables; observations $e$.
Output: The most probable assignment.
Initialize: $p = 0$.
1. For every assignment $C = c$, do
   (a) $p_1 \leftarrow$ the output of elim-mpe with $c \cup e$ as observations.
   (b) $p \leftarrow \max\{p, p_1\}$ (update the maximum probability).
2. Return $p$ and a maximizing tuple.

Figure 15. Algorithm elim-cond-mpe
We will demonstrate the idea of combining conditioning with elimination on the mpe task. Let $C \subseteq X$ be a subset of conditioned variables and $V = X - C$. We denote by $v$ an assignment to $V$ and by $c$ an assignment to $C$. Clearly,
\[
\max_x P(x, e) = \max_c \max_v P(c, v, e) = \max_{c,v} \prod_i P(x_i, e \mid x_{pa_i})
\]
Therefore, for every partial tuple $c$, we compute $\max_v P(v, c, e)$ and a corresponding maximizing tuple
\[
(x_V^o)(c) = \arg\max_{V} \prod_{i=1}^{n} P(x_i, e, c \mid x_{pa_i})
\]
by using the elimination algorithm while treating the conditioned variables as observed variables. This basic computation is enumerated for all value combinations of the conditioned variables, and the tuple retaining the maximum probability is kept. The algorithm is presented in Figure 15.
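The loop of Figure 15 is straightforward to render in code. In the sketch below, `elim_mpe` stands for a full mpe routine, e.g. one built around the `eliminate_max` sketch of section 4 with its forward phase; its name and its (probability, assignment) return convention are our assumptions.

```python
from itertools import product

def elim_cond_mpe(cpts, order, cond_vars, evidence, domains, elim_mpe):
    """elim-cond-mpe (Figure 15): enumerate the assignments C = c and
    run elim-mpe on each conditioned subproblem, treating c as observed;
    space stays bounded by the adjusted induced width w*(d, c u e)."""
    best_p, best_tuple = 0.0, None
    for vals in product(*(domains[c] for c in cond_vars)):
        c = dict(zip(cond_vars, vals))
        p, tup = elim_mpe(cpts, order, {**evidence, **c}, domains)
        if p > best_p:                    # keep the maximizing tuple
            best_p, best_tuple = p, {**tup, **c}
    return best_p, best_tuple
```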
Given a particular value assignment $c$, the time and space complexity of computing the maximum probability over the rest of the variables is bounded exponentially by the induced width $w^*(d, e \cup c)$ of the ordered moral graph adjusted for both observed and conditioned nodes. In this case the induced graph is generated without connecting earlier neighbors of either evidence or conditioned variables.

Theorem 8.1 Given a set of conditioning variables $C$, the space complexity of algorithm elim-cond-mpe is $O(n \cdot \exp(w^*(d, c \cup e)))$, while its time complexity is $O(n \cdot \exp(w^*(d, e \cup c) + |C|))$, where the induced width $w^*(d, c \cup e)$ is computed on the ordered moral graph that was adjusted relative to $e$ and $c$. □
When the variables in $e \cup c$ constitute a cycle-cutset of the graph, the graph can be ordered such that its adjusted induced width equals 1. In this case elim-cond-mpe reduces to the known loop-cutset algorithms (Pearl, 1988; Dechter, 1990).
Clearly, algorithm elim-cond-mpe can be implemented more effectively if we
take advantage of shared partial assignments to the conditioned variables.
There is a variety of possible hybrids between conditioning and elimination
that can refine the basic procedure in elim-cond-mpe. One method imposes an
upper bound on the arity of the functions recorded and decides dynamically,
during processing, whether to process a bucket by elimination or by
conditioning (see (Dechter and Rish, 1996)); a sketch of this decision rule
follows the paragraph. Another method, based on the super-bucket approach,
collects a set of consecutive buckets into one super-bucket that is
processed by conditioning, thus avoiding the recording of some intermediate
results (Dechter, 1996b; El-Fattah and Dechter, 1996).
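The following is a minimal sketch of the first hybrid's decision rule, in the spirit of (Dechter and Rish, 1996). It is a static rendering of a choice the actual method makes dynamically as buckets are processed, and `scope_of_recorded_function` is a hypothetical helper computing the scope of the function elimination would record for a bucket.

    def choose_bucket_processing(buckets, scope_of_recorded_function, b):
        # Process a bucket by elimination only when the function it would
        # record has arity at most b; otherwise condition on its variable.
        decisions = {}
        for var, bucket in buckets.items():
            arity = len(scope_of_recorded_function(bucket))
            decisions[var] = "eliminate" if arity <= b else "condition"
        return decisions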
9. Related work
Throughout this paper we have mentioned many algorithms in probabilistic
and deterministic reasoning that can be viewed as bucket-elimination
algorithms. In addition, unifying frameworks observing the common features
of various algorithms have appeared both in the past (Shenoy, 1992) and
more recently in (Bistarelli et al., 1997).
10. Summary and Conclusion
Using the bucket-elimination framework, which generalizes dynamic
programming, we have presented a concise and uniform way of expressing
algorithms for probabilistic reasoning. In this framework, algorithms
exploit the topological properties of the network without conscious effort
on the part of the designer. We have shown, for example, that when
algorithms elim-mpe and elim-bel are given a singly-connected network, they
reduce to Pearl's algorithms (Pearl, 1988) for some orderings (always
possible on trees). The same applies to elim-map and elim-meu, for which
tree-propagation algorithms were not explicitly derived.
The simplicity and elegance of the proposed framework highlight the
features common to bucket elimination and join-tree clustering, and allow
belief-assessment procedures to be focused on the relevant portions of the
network. Such enhancements were accompanied by graph-based complexity
bounds that are more refined than the standard induced-width bound.
The performance of bucket-elimination and tree-clustering algorithms is
likely to suffer from the usual difficulty associated with dynamic
programming: exponential space and exponential time in the worst case. Such
performance deficiencies also plague resolution and constraint-satisfaction
algorithms (Dechter and Rish, 1994; Dechter, 1997). Space complexity can be
reduced using conditioning. We have shown that conditioning can be
implemented naturally on top of elimination, thus reducing the space
requirement while still exploiting topological features. Combining
conditioning and elimination can be viewed as combining the virtues of
forward and backward search.
Finally, no attempt was made in this paper to optimize the algorithms for
distributed computation, nor to exploit compilation versus run-time
resources. These issues can and should be addressed within the
bucket-elimination framework. In particular, improvements exploiting the
structure of the conditional probability matrices, as presented recently in
(Santos et al., in press; Boutilier, 1996; Poole, 1997), can be
incorporated on top of bucket elimination.
In summary, what we provide here is a uniform exposition across several
tasks, applicable to both probabilistic and deterministic reasoning, which
facilitates the transfer of ideas between several areas of research. More
importantly, the organizational benefit associated with the use of buckets
should allow all bucket-elimination algorithms to be improved uniformly.
This can be done either by combining conditioning with elimination, as we
have shown, or via approximation algorithms, as shown in (Dechter, 1997).
11. Acknowledgment
A preliminary version of this paper appeared in (Dechter, 1996a). I would
like to thank Irina Rish and Nir Friedman for their useful comments on
different versions of this paper. This work was partially supported by NSF
grant IRI-9157636, Air Force Office of Scientific Research grant AFOSR
F49620-96-1-0224, Rockwell MICRO grants ACM-20775 and 95-043, Amada of
America, and Electrical Power Research Institute grant RP8014-06.
References
S. Arnborg and A. Proskurowski. Linear time algorithms for NP-hard problems restricted to partial k-trees. Discrete and Applied Mathematics, 23:11-24, 1989.
S.A. Arnborg. Efficient algorithms for combinatorial problems on graphs with bounded decomposability: a survey. BIT, 25:2-23, 1985.
F. Bacchus and P. van Run. Dynamic variable ordering in CSPs. In Principles and Practice of Constraint Programming (CP-95), Cassis, France, 1995.
A. Becker and D. Geiger. A sufficiently fast algorithm for finding close to optimal junction trees. In Uncertainty in AI (UAI-96), pages 81-89, 1996.
U. Bertele and F. Brioschi. Nonserial Dynamic Programming. Academic Press, 1972.
S. Bistarelli, U. Montanari, and F. Rossi. Semiring-based constraint satisfaction and optimization. Journal of the Association for Computing Machinery (JACM), to appear, 1997.
C. Boutilier. Context-specific independence in Bayesian networks. In Uncertainty in Artificial Intelligence (UAI-96), pages 115-123, 1996.
C. Cannings, E.A. Thompson, and H.H. Skolnick. Probability functions on complex pedigrees. Advances in Applied Probability, 10:26-61, 1978.
G.F. Cooper. NESTOR: A computer-based medical diagnosis aid that integrates causal and probabilistic knowledge. Technical report, Computer Science Department, Stanford University, Palo Alto, California, 1984.
M. Davis and H. Putnam. A computing procedure for quantification theory. Journal of the Association for Computing Machinery, 7(3), 1960.
R. Dechter and J. Pearl. Network-based heuristics for constraint satisfaction problems. Artificial Intelligence, 34:1-38, 1987.
R. Dechter and J. Pearl. Tree clustering for constraint networks. Artificial Intelligence, pages 353-366, 1989.
R. Dechter and I. Rish. Directional resolution: The Davis-Putnam procedure, revisited. In Principles of Knowledge Representation and Reasoning (KR-94), pages 134-145, 1994.
R. Dechter and I. Rish. To guess or to think? Hybrid algorithms for SAT. In Principles of Constraint Programming (CP-96), 1996.
R. Dechter and P. van Beek. Local and global relational consistency. In Principles and Practice of Constraint Programming (CP-95), pages 240-257, 1995.
R. Dechter. Enhancement schemes for constraint processing: Backjumping, learning and cutset decomposition. Artificial Intelligence, 41:273-312, 1990.
R. Dechter. Constraint networks. Encyclopedia of Artificial Intelligence, pages 276-285, 1992.
R. Dechter. Bucket elimination: A unifying framework for probabilistic inference algorithms. In Uncertainty in Artificial Intelligence (UAI-96), pages 211-219, 1996a.
R. Dechter. Topological parameters for time-space tradeoffs. In Uncertainty in Artificial Intelligence (UAI-96), pages 220-227, 1996b.
R. Dechter. Mini-buckets: A general scheme of generating approximations in automated reasoning. In IJCAI-97: Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, 1997.
Y. El-Fattah and R. Dechter. An evaluation of structural parameters for probabilistic reasoning: Results on benchmark circuits. In Uncertainty in Artificial Intelligence (UAI-96), pages 244-251, 1996.
D. Geiger, T. Verma, and J. Pearl. Identifying independence in Bayesian networks. Networks, 20:507-534, 1990.
F.V. Jensen, S.L. Lauritzen, and K.G. Olesen. Bayesian updating in causal probabilistic networks by local computation. Computational Statistics Quarterly, 4, 1990.
U. Kjaerulff. A computational scheme for reasoning in dynamic probabilistic networks. In Uncertainty in Artificial Intelligence (UAI-93), pages 121-149, 1993.
S.L. Lauritzen and D.J. Spiegelhalter. Local computation with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50(2):157-224, 1988.
J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
Y. Peng and J.A. Reggia. Plausibility of diagnostic hypotheses. In National Conference on Artificial Intelligence (AAAI-86), pages 140-145, 1986.
Y. Peng and J.A. Reggia. A connectionist model for diagnostic problem solving, 1989.
D. Poole. Probabilistic partial evaluation: Exploiting structure in probabilistic inference. In IJCAI-97: Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, 1997.
R.D. Shachter, S.K. Andersen, and P. Szolovits. Global conditioning for probabilistic inference in belief networks. In Uncertainty in Artificial Intelligence (UAI-91), pages 514-522, 1991.
R.D. Shachter, B. D'Ambrosio, and B.A. Del Favero. Symbolic probabilistic inference in belief networks. Automated Reasoning, pages 126-131, 1990.
E. Santos, S.E. Shimony, and E. Williams. Hybrid algorithms for approximate belief updating in Bayes nets. International Journal of Approximate Reasoning, in press.
E. Santos. On the generation of alternative explanations with implications for belief revision. In Uncertainty in Artificial Intelligence (UAI-91), pages 339-347, 1991.
R.D. Shachter. Evaluating influence diagrams. Operations Research, 34, 1986.
R.D. Shachter. Probabilistic inference and influence diagrams. Operations Research, 36, 1988.
R.D. Shachter. An ordered examination of influence diagrams. Networks, 20:535-563, 1990.
P.P. Shenoy. Valuation-based systems for Bayesian decision analysis. Operations Research, 40:463-484, 1992.
S.E. Shimony and E. Charniak. A new algorithm for finding MAP assignments to belief networks. In P. Bonissone, M. Henrion, L. Kanal, and J. Lemmer, eds., Uncertainty in Artificial Intelligence, volume 6, pages 185-193, 1991.
J.A. Tatman and R.D. Shachter. Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man, and Cybernetics, 1990.
N.L. Zhang and D. Poole. Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research (JAIR), 1996.