BUCKET ELIMINATION: A UNIFYING FRAMEWORK FOR PROBABILISTIC INFERENCE

R. DECHTER
Department of Information and Computer Science
University of California, Irvine
dechter@ics.uci.edu

Abstract. Probabilistic inference algorithms for belief updating, finding the most probable explanation, the maximum a posteriori hypothesis, and the maximum expected utility are reformulated within the bucket elimination framework. This emphasizes the principles common to many of the algorithms appearing in the probabilistic inference literature and clarifies the relationship of such algorithms to nonserial dynamic programming algorithms. A general method for combining conditioning and bucket elimination is also presented. For all the algorithms, bounds on complexity are given as a function of the problem's structure.

1. Overview

Bucket elimination is a unifying algorithmic framework that generalizes dynamic programming to accommodate algorithms for many complex problem-solving and reasoning activities, including directional resolution for propositional satisfiability (Davis and Putnam, 1960), adaptive consistency for constraint satisfaction (Dechter and Pearl, 1987), Fourier and Gaussian elimination for linear equalities and inequalities, and dynamic programming for combinatorial optimization (Bertele and Brioschi, 1972). Here, after presenting the framework, we demonstrate that a number of algorithms for probabilistic inference can also be expressed as bucket-elimination algorithms.

The main virtues of the bucket-elimination framework are simplicity and generality. By simplicity, we mean that a complete specification of bucket-elimination algorithms is feasible without introducing extensive terminology (e.g., graph concepts such as triangulation and arc-reversal), thus making the algorithms accessible to researchers in diverse areas. More important, the uniformity of the algorithms facilitates understanding, which encourages cross-fertilization and technology transfer between disciplines. Indeed, all bucket-elimination algorithms are similar enough for any improvement to a single algorithm to be applicable to all others expressed in this framework. For example, expressing probabilistic inference algorithms as bucket-elimination methods clarifies the former's relationship to dynamic programming and to constraint satisfaction such that the knowledge accumulated in those areas may be utilized in the probabilistic framework.

The generality of bucket elimination can be illustrated with an algorithm in the area of deterministic reasoning. Consider the following algorithm for deciding satisfiability. Given a set of clauses (a clause is a disjunction of propositional variables or their negations) and an ordering of the propositional variables, d = Q_1, ..., Q_n, algorithm directional resolution (DR) (Dechter and Rish, 1994) is the core of the well-known Davis-Putnam algorithm for satisfiability (Davis and Putnam, 1960). The algorithm is described using buckets that partition the given set of clauses: all the clauses containing Q_i that do not contain any symbol higher in the ordering are placed in the bucket of Q_i, denoted bucket_i. The algorithm (see Figure 1) processes the buckets in the reverse order of d. When processing bucket_i, it resolves over Q_i all possible pairs of clauses in the bucket and inserts the resolvents into the appropriate lower buckets (a small illustrative sketch follows).
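To make the bucket data structure concrete, here is a minimal sketch of directional resolution in Python (the paper itself gives no code). The clause representation as frozensets of signed integers and all function names are illustrative assumptions, and the unit-resolution special case of Figure 1 is omitted.

```python
from itertools import product

def directional_resolution(clauses, n):
    """Sketch of DR: a clause is a frozenset of ints (i means Q_i, -i its negation);
    the variables are ordered Q_1, ..., Q_n.  Returns None if the empty clause is
    generated (unsatisfiable), otherwise the union of all buckets (the output theory)."""
    # Place each clause in the bucket of its highest-indexed variable.
    buckets = {i: [] for i in range(1, n + 1)}
    for c in clauses:
        buckets[max(abs(l) for l in c)].append(c)
    # Process buckets from Q_n down to Q_1, resolving all pairs over the bucket's variable.
    for p in range(n, 0, -1):
        pos = [c for c in buckets[p] if p in c]
        neg = [c for c in buckets[p] if -p in c]
        for c1, c2 in product(pos, neg):
            resolvent = (c1 - {p}) | (c2 - {-p})
            if not resolvent:
                return None                           # empty clause: unsatisfiable
            if any(-l in resolvent for l in resolvent):
                continue                              # tautology, discard
            buckets[max(abs(l) for l in resolvent)].append(resolvent)
    return [c for b in buckets.values() for c in b]

# Example: (Q1 v Q2) ^ (~Q2 v Q3) ^ (~Q3)  -- satisfiable, so a theory is returned.
theory = [frozenset({1, 2}), frozenset({-2, 3}), frozenset({-3})]
print(directional_resolution(theory, 3))
```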
It was shown that if the empty clause is not generated in this process, then the theory is satisfiable and a satisfying truth assignment can be generated in time linear in the size of the resulting theory. The complexity of the algorithm is exponentially bounded (time and space) in a graph parameter called induced width (also called tree-width) of the interaction graph of the theory, where a node is associated with a proposition and an arc connects any two nodes appearing in the same clause (Dechter and Rish, 1994).

The belief-network algorithms we present in this paper have much in common with the resolution procedure above. They all possess the property of compiling a theory into one from which answers can be extracted easily, and their complexity depends on the same induced-width graph parameter. The algorithms are variations on known algorithms and, for the most part, are not new, in the sense that the basic ideas have existed for some time (Cannings et al., 1978; Pearl, 1988; Lauritzen and Spiegelhalter, 1988; Tatman and Shachter, 1990; Jensen et al., 1990; R.D. Shachter and Favro, 1990; Bacchus and van Run, 1995; Shachter, 1986; Shachter, 1988; Shimony and Charniack, 1991; Shenoy, 1992). What we are presenting here is a syntactic and uniform exposition emphasizing these algorithms' form as straightforward elimination algorithms.

Algorithm directional resolution
Input: A set of clauses φ, an ordering d = Q_1, ..., Q_n.
Output: A decision of whether φ is satisfiable. If it is, an equivalent output theory; else, an empty output theory.
1. Initialize: Generate an ordered partition of the clauses, bucket_1, ..., bucket_n, where bucket_i contains all the clauses whose highest literal is Q_i.
2. For p = n to 1, do
   if bucket_p contains a unit clause, perform only unit resolution; put each resolvent in the appropriate bucket.
   else, resolve each pair {(α ∨ Q_p), (β ∨ ¬Q_p)} ⊆ bucket_p. If γ = α ∨ β is empty, φ is not satisfiable; else, add γ to the appropriate bucket.
3. Return: ∪_i bucket_i.
Figure 1. Algorithm directional resolution

The main virtue of this presentation, beyond uniformity, is that it allows ideas and techniques to flow across the boundaries between areas of research. In particular, having noted that elimination algorithms and clustering algorithms are very similar in the context of constraint processing (Dechter and Pearl, 1989), we find that this similarity carries over to all other tasks. We also show that the idea of conditioning, which is as universal as that of elimination, can be incorporated and exploited naturally and uniformly within the elimination framework. Conditioning is a generic name for algorithms that search the space of partial value assignments, or partial conditionings. Conditioning means splitting a problem into subproblems based on a certain condition. Algorithms such as backtracking and branch and bound may be viewed as conditioning algorithms. The complexity of conditioning algorithms is exponential in the size of the conditioning set; however, their space complexity is only linear. The resulting hybrids of conditioning with elimination, which trade off time for space (see also (Dechter, 1996b; R. D. Shachter and Solovitz, 1991)), are applicable to all algorithms expressed within this framework.

The work we present here also fits into the framework developed by Arnborg and Proskourowski (Arnborg, 1985; Arnborg and Proskourowski, 1989). They present table-based reductions for various NP-hard graph problems such as the independent-set problem, network reliability, vertex cover, graph k-colorability, and Hamilton circuits. Here and elsewhere (Dechter and van Beek, 1995; Dechter, 1997) we extend the approach to a different set of problems.

Following preliminaries (section 2), we present the bucket-elimination algorithm for belief updating and analyze its performance (section 3). Then we extend the algorithm to the tasks of finding the most probable explanation (section 4), the maximum a posteriori hypothesis (section 5), and the maximum expected utility (section 6). Section 7 relates the algorithms to Pearl's poly-tree algorithm and to join-tree clustering. We then describe schemes for combining the conditioning method with elimination (section 8). Related work and conclusions are given in sections 9 and 10.

2. Preliminaries

Belief networks provide a formalism for reasoning about partial beliefs under conditions of uncertainty. A belief network is defined by a directed acyclic graph over nodes representing random variables that take values from given domains. The arcs signify the existence of direct causal influences between the linked variables, and the strength of these influences is quantified by conditional probabilities.

A belief network relies on the notion of a directed graph. A directed graph is a pair G = {V, E}, where V = {X_1, ..., X_n} is a set of elements and E = {(X_i, X_j) | X_i, X_j ∈ V, i ≠ j} is the set of edges. If (X_i, X_j) ∈ E, we say that X_i points to X_j. For each variable X_i, the set of parent nodes of X_i, denoted pa(X_i), comprises the variables pointing to X_i in G, while the set of child nodes of X_i, denoted ch(X_i), comprises the variables that X_i points to. Whenever no confusion can arise, we abbreviate pa(X_i) by pa_i and ch(X_i) by ch_i. The family of X_i, F_i, includes X_i and its child variables. A directed graph is acyclic if it has no directed cycles. In an undirected graph, the directions of the arcs are ignored: (X_i, X_j) and (X_j, X_i) are identical.

Let X = {X_1, ..., X_n} be a set of random variables over multivalued domains D_1, ..., D_n. A belief network is a pair (G, P) where G is a directed acyclic graph and P = {P_i}, where P_i denotes the probabilistic relationship between X_i and its parents, namely the conditional probability matrices P_i = {P(X_i | pa_i)}. The belief network represents a probability distribution over X having the product form

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_{pa_i}),$$

where an assignment (X_1 = x_1, ..., X_n = x_n) is abbreviated to x = (x_1, ..., x_n) and where x_S denotes the projection of a tuple x over a subset of variables S. An evidence set e is an instantiated subset of variables. A = a denotes a partial assignment to a subset of variables A from their respective domains. We use uppercase letters for variables and nodes in a graph and lowercase letters for values in the variables' domains.

Figure 2. A belief network, P(g, f, d, c, b, a) = P(g|f) P(f|c, b) P(d|b, a) P(b|a) P(c|a): (a) its directed acyclic graph; (b) its moral graph.

Example 2.1 Consider the belief network defined by P(g, f, d, c, b, a) = P(g|f) P(f|c, b) P(d|b, a) P(b|a) P(c|a). Its acyclic graph is given in Figure 2a. In this case, pa(F) = {B, C}.

The following queries are defined over belief networks:
1. Belief updating: given a set of observations, computing the posterior probability of each proposition.
2. Finding the most probable explanation (mpe): given some observed variables, finding a maximum probability assignment to the rest of the variables.
3. Finding the maximum a posteriori hypothesis (map): given some evidence, finding an assignment to a subset of hypothesis variables that maximizes their probability.
4. Finding the maximum expected utility (meu): given also a utility function, finding an assignment to a subset of decision variables that maximizes the expected utility of the problem.

It is known that these tasks are NP-hard. Nevertheless, they all permit a polynomial propagation algorithm for singly-connected networks (Pearl, 1988). The two main approaches to extending this propagation algorithm to multiply-connected networks are the cycle-cutset approach, also called conditioning, and tree-clustering (Pearl, 1988; Lauritzen and Spiegelhalter, 1988; Shachter, 1986). These methods work well for sparse networks with small cycle-cutsets or small clusters. In subsequent sections, bucket-elimination algorithms for each of the above tasks will be presented and their relationship with existing methods will be discussed.

We conclude this section with some notational conventions. Let u be a partial tuple, S a subset of variables, and X_p a variable not in S. We use (u_S, x_p) to denote the tuple u_S appended by a value x_p of X_p.

Definition 2.2 (elimination functions) Given a function h defined over a subset of variables S, where X ∈ S, the functions (min_X h), (max_X h), (mean_X h), and (Σ_X h) are defined over U = S − {X} as follows. For every U = u, (min_X h)(u) = min_x h(u, x), (max_X h)(u) = max_x h(u, x), (Σ_X h)(u) = Σ_x h(u, x), and (mean_X h)(u) = Σ_x h(u, x)/|X|, where |X| is the cardinality of X's domain. Given a set of functions h_1, ..., h_j defined over the subsets S_1, ..., S_j, the product function ∏_j h_j and the sum Σ_j h_j are defined over U = ∪_j S_j: for every U = u, (∏_j h_j)(u) = ∏_j h_j(u_{S_j}) and (Σ_j h_j)(u) = Σ_j h_j(u_{S_j}).

3. An Elimination Algorithm for Belief Assessment

Belief updating is the primary inference task over belief networks. The task is to maintain the probability of singleton propositions once new evidence arrives. Following Pearl's propagation algorithm for singly-connected networks (Pearl, 1988), researchers have investigated various approaches to belief updating. We will now present a step-by-step derivation of a general variable-elimination algorithm for belief updating. This process is typical of any derivation of elimination algorithms.

Let X_1 = x_1 be an atomic proposition. The problem is to assess and update the belief in x_1 given some evidence e. Namely, we wish to compute P(X_1 = x_1 | e) = α P(X_1 = x_1, e), where α is a normalization constant. We will develop the algorithm using Example 2.1 (Figure 2). Assume we have the evidence g = 1. Consider the variables in the order d_1 = A, C, B, F, D, G.
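Before walking through the example, the elimination operators of Definition 2.2 can be made concrete. The following is a minimal, illustrative sketch (not from the paper) in which a function is stored as a dictionary mapping value tuples, keyed in the order of the function's scope, to numbers; the helper name, the representation, and the example values are assumptions.

```python
def eliminate(h, scope, X, op):
    """Eliminate variable X from function h (a dict: value-tuple -> number, keyed
    in the order given by `scope`), using op in {'sum', 'max', 'min', 'mean'}.
    Returns (new_function, new_scope) defined over scope - {X}; 'mean' assumes the
    table enumerates every value of X for each tuple of the remaining variables."""
    i = scope.index(X)
    grouped = {}
    for assignment, value in h.items():
        u = assignment[:i] + assignment[i + 1:]      # project out X
        grouped.setdefault(u, []).append(value)
    combine = {'sum': sum, 'max': max, 'min': min,
               'mean': lambda vals: sum(vals) / len(vals)}[op]
    new_scope = [v for v in scope if v != X]
    return {u: combine(vals) for u, vals in grouped.items()}, new_scope

# Tiny example: h(b, g) with binary domains (the numbers are illustrative only).
h = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.4, (1, 1): 0.6}
print(eliminate(h, ['B', 'G'], 'G', 'sum'))   # (sum_G h)(b)
print(eliminate(h, ['B', 'G'], 'G', 'max'))   # (max_G h)(b)
```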
By definition we need to compute

$$P(a, g{=}1) = \sum_{c,b,f,d,g=1} P(g \mid f) P(f \mid b,c) P(d \mid a,b) P(c \mid a) P(b \mid a) P(a).$$

We can now apply some simple symbolic manipulation, migrating each conditional probability table to the left of the summation variables it does not reference. We get

$$= P(a) \sum_{c} P(c \mid a) \sum_{b} P(b \mid a) \sum_{f} P(f \mid b,c) \sum_{d} P(d \mid b,a) \sum_{g=1} P(g \mid f) \qquad (1)$$

Carrying the computation from right to left (from G to A), we first compute the rightmost summation, which generates a function over f, λ_G(f), defined by λ_G(f) = Σ_{g=1} P(g|f), and place it as far to the left as possible, yielding

$$P(a) \sum_{c} P(c \mid a) \sum_{b} P(b \mid a) \sum_{f} P(f \mid b,c) \, \lambda_G(f) \sum_{d} P(d \mid b,a) \qquad (2)$$

Summing next over d (generating a function denoted λ_D(a,b), defined by λ_D(a,b) = Σ_d P(d|a,b)), we get

$$P(a) \sum_{c} P(c \mid a) \sum_{b} P(b \mid a) \, \lambda_D(a,b) \sum_{f} P(f \mid b,c) \, \lambda_G(f) \qquad (3)$$

Next, summing over f (generating λ_F(b,c) = Σ_f P(f|b,c) λ_G(f)), we get

$$P(a) \sum_{c} P(c \mid a) \sum_{b} P(b \mid a) \, \lambda_D(a,b) \, \lambda_F(b,c) \qquad (4)$$

Summing over b (generating λ_B(a,c)), we get

$$P(a) \sum_{c} P(c \mid a) \, \lambda_B(a,c) \qquad (5)$$

Finally, summing over c (generating λ_C(a)), we get

$$P(a) \, \lambda_C(a) \qquad (6)$$

The answer to the query P(a | g = 1) can be computed by evaluating the last product and then normalizing.

The bucket-elimination algorithm mimics the above algebraic manipulation using a simple organizational device we call buckets, as follows. First, the conditional probability tables (CPTs, for short) are partitioned into buckets relative to the order used, d_1 = A, C, B, F, D, G, as follows (going from the last variable to the first): in the bucket of G we place all functions mentioning G; from the remaining CPTs we place all those mentioning D in the bucket of D, and so on. The partitioning rule can alternatively be stated as follows: in the bucket of variable X_i we put all functions that mention X_i but do not mention any variable having a higher index. The resulting initial partitioning for our example is given in Figure 3. Note that observed variables are also placed in their corresponding buckets.

bucket_G = P(g|f), g = 1
bucket_D = P(d|b,a)
bucket_F = P(f|b,c)
bucket_B = P(b|a)
bucket_C = P(c|a)
bucket_A = P(a)
Figure 3. Initial partitioning into buckets using d_1 = A, C, B, F, D, G

This initialization step corresponds to deriving the expression in Eq. (1). Now we process the buckets from top to bottom, implementing the right-to-left computation of Eq. (1). Bucket_G is processed first. Processing a bucket amounts to eliminating the variable in the bucket from subsequent computation. To eliminate G, we sum over all values of g. Since in this case we have an observed value g = 1, the summation is over a singleton value. Namely, λ_G(f) = Σ_{g=1} P(g|f) is computed and placed in bucket_F (this corresponds to deriving Eq. (2) from Eq. (1)). New functions are placed in lower buckets using the same placement rule. Bucket_D is processed next. We sum out D, getting λ_D(b,a) = Σ_d P(d|b,a), which is computed and placed in bucket_B (this corresponds to deriving Eq. (3) from Eq. (2)). The next variable is F. Bucket_F contains two functions, P(f|b,c) and λ_G(f), and thus, following Eq. (4), we generate the function λ_F(b,c) = Σ_f P(f|b,c)·λ_G(f), which is placed in bucket_B (this corresponds to deriving Eq. (4) from Eq. (3)). In processing the next bucket, bucket_B, the function λ_B(a,c) = Σ_b P(b|a)·λ_D(b,a)·λ_F(b,c) is computed and placed in bucket_C (deriving Eq. (5) from Eq. (4)). In processing bucket_C, λ_C(a) = Σ_c P(c|a)·λ_B(a,c) is computed (which corresponds to deriving Eq. (6) from Eq. (5)). Finally, the belief in a can be computed in bucket_A: P(a | g=1) = α P(a)·λ_C(a). Figure 4 summarizes the flow of computation of the bucket-elimination algorithm for our example.

Figure 4. Bucket elimination along ordering d_1 = A, C, B, F, D, G.

Note that since throughout this process we recorded functions of at most two dimensions, the complexity of the algorithm using ordering d_1 is (roughly) time and space quadratic in the domain sizes.

What will occur if we use a different variable ordering? For example, let us apply the algorithm using d_2 = A, F, D, C, B, G. Applying the algebraic manipulation from right to left along d_2 yields the following sequence of derivations:

$$P(a, g{=}1) = P(a) \sum_{f} \sum_{d} \sum_{c} P(c \mid a) \sum_{b} P(b \mid a) P(d \mid a,b) P(f \mid b,c) \sum_{g=1} P(g \mid f)$$
$$= P(a) \sum_{f} \lambda_G(f) \sum_{d} \sum_{c} P(c \mid a) \sum_{b} P(b \mid a) P(d \mid a,b) P(f \mid b,c)$$
$$= P(a) \sum_{f} \lambda_G(f) \sum_{d} \sum_{c} P(c \mid a) \, \lambda_B(a,d,c,f)$$
$$= P(a) \sum_{f} \lambda_G(f) \sum_{d} \lambda_C(a,d,f)$$
$$= P(a) \sum_{f} \lambda_G(f) \, \lambda_D(a,f)$$
$$= P(a) \, \lambda_F(a)$$

The bucket-elimination process for ordering d_2 is summarized in Figure 5a. Each bucket contains the initial CPTs, denoted by P's, and the functions generated throughout the process, denoted by λ's.

Figure 5. The buckets created when processing along d_2 = A, F, D, C, B, G:
bucket_G = P(g|f), g = 1
bucket_B = P(f|b,c), P(d|a,b), P(b|a)
bucket_C = P(c|a), λ_B(f,c,a,d)
bucket_D = λ_C(a,d,f)
bucket_F = λ_G(f), λ_D(a,f)
bucket_A = P(a), λ_F(a)

We summarize with a general derivation of the bucket-elimination algorithm, called elim-bel. Consider an ordering of the variables X = (X_1, ..., X_n). We use the notation x̄_i = (x_1, ..., x_i) and x̄_i^j = (x_i, x_{i+1}, ..., x_j), and F_i denotes the family of variable X_i. We want to compute

$$P(x_1, e) = \sum_{\bar{x}_2^{(n-1)}} \sum_{x_n} \prod_{i} P(x_i, e \mid x_{pa_i}).$$

Separating X_n from the rest of the variables we get

$$= \sum_{\bar{x}_2^{(n-1)}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}) \sum_{x_n} P(x_n, e \mid x_{pa_n}) \prod_{X_i \in ch_n} P(x_i, e \mid x_{pa_i})$$
$$= \sum_{\bar{x}_2^{(n-1)}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}) \; \lambda_n(x_{U_n}), \qquad (7)$$

where

$$\lambda_n(x_{U_n}) = \sum_{x_n} P(x_n, e \mid x_{pa_n}) \prod_{X_i \in ch_n} P(x_i, e \mid x_{pa_i})$$

and U_n denotes the variables appearing with X_n in a probability component, excluding X_n itself. The process continues recursively with X_{n-1}. Thus, the computation performed in the bucket of X_n is captured by Eq. (7).

Algorithm elim-bel
Input: A belief network BN = {P_1, ..., P_n}; an ordering of the variables, d = X_1, ..., X_n; evidence e.
Output: The belief in X_1 = x_1.
1. Initialize: Generate an ordered partition of the conditional probability matrices, bucket_1, ..., bucket_n, where bucket_i contains all matrices whose highest variable is X_i. Put each observed variable in its bucket. Let S_1, ..., S_j be the subsets of variables in the processed bucket on which matrices (new or old) are defined.
2. Backward: For p = n downto 1, do
   for all the matrices λ_1, λ_2, ..., λ_j in bucket_p, do
   (bucket with observed variable) if X_p = x_p appears in bucket_p, assign X_p = x_p to each λ_i and then put each resulting function in the appropriate bucket.
   else, U_p ← ∪_{i=1}^{j} S_i − {X_p}. Generate λ_p = Σ_{X_p} ∏_{i=1}^{j} λ_i and add λ_p to the bucket of the largest-index variable in U_p.
3. Return: Bel(x_1) = α P(x_1) ∏_i λ_i(x_1) (where the λ_i are in bucket_1 and α is a normalizing constant).
Figure 6. Algorithm elim-bel
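The same procedure can be phrased as a short, generic program. A minimal sketch follows, assuming factors are stored as (scope, table) pairs with tables keyed by value tuples; all names and the representation are illustrative, and evidence is handled simply by restricting the observed variable's domain to its observed value.

```python
from itertools import product

def multiply(factors, domains):
    """Pointwise product of (scope, table) factors; a table maps value tuples,
    keyed in the order of its scope, to numbers."""
    scope = sorted({v for s, _ in factors for v in s})
    table = {}
    for vals in product(*(domains[v] for v in scope)):
        asg = dict(zip(scope, vals))
        p = 1.0
        for s, t in factors:
            p *= t[tuple(asg[v] for v in s)]
        table[vals] = p
    return scope, table

def sum_out(scope, table, X):
    """Eliminate X from (scope, table) by summation."""
    i = scope.index(X)
    out = {}
    for vals, p in table.items():
        u = vals[:i] + vals[i + 1:]
        out[u] = out.get(u, 0.0) + p
    return [v for v in scope if v != X], out

def elim_bel(factors, ordering, domains):
    """Backward phase of elim-bel; returns the (unnormalized) belief factor left
    in the first bucket.  Evidence: restrict domains[X] to the observed value."""
    buckets = {X: [] for X in ordering}
    for f in factors:                                # bucket of the latest scope variable
        buckets[max(f[0], key=ordering.index)].append(f)
    for X in reversed(ordering[1:]):                 # process from last to first
        if not buckets[X]:
            continue
        lam_scope, lam = sum_out(*multiply(buckets[X], domains), X)
        dest = max(lam_scope, key=ordering.index) if lam_scope else ordering[0]
        buckets[dest].append((lam_scope, lam))
    return multiply(buckets[ordering[0]], domains)   # proportional to P(x1, e)
```

For the network of Example 2.1 one would call elim_bel with the six CPTs, the ordering ['A', 'C', 'B', 'F', 'D', 'G'], and domains in which G is restricted to the value 1; the functions the loop records then correspond to λ_G, λ_D, λ_F, λ_B, and λ_C above.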
Given ordering X_1, ..., X_n, where the queried variable appears first, the CPTs are partitioned using the rule described earlier. To process each bucket, all the bucket's functions, denoted λ_1, ..., λ_j and defined over subsets S_1, ..., S_j, are multiplied, and then the bucket's variable is eliminated by summation. The computed function is λ_p : U_p → R, λ_p = Σ_{X_p} ∏_{i=1}^{j} λ_i, where U_p = ∪_i S_i − {X_p}. This function is placed in the bucket of its largest-index variable in U_p. The procedure continues recursively with the bucket of the next variable, going from the last variable to the first. Once all the buckets are processed, the answer is available in the first bucket. Algorithm elim-bel is described in Figure 6.

Theorem 3.1 Algorithm elim-bel computes the posterior belief P(x_1 | e) for any given ordering of the variables. □

Both the peeling algorithm for genetic trees (Cannings et al., 1978) and Zhang and Poole's recent algorithm (Zhang and Poole, 1996) are variations of elim-bel.

3.1. COMPLEXITY

We see that although elim-bel can be applied using any ordering, its complexity varies considerably. Using ordering d_1 we recorded functions on pairs of variables only, while using d_2 we had to record functions on four variables (see bucket_C in Figure 5a). The arity of the function recorded in a bucket equals the number of variables appearing in that processed bucket, excluding the bucket's variable. Since recording a function of arity r is time and space exponential in r, we conclude that the complexity of the algorithm is exponential in the size of the largest bucket, which depends on the order of processing. Fortunately, for any variable ordering, bucket sizes can easily be read in advance from an ordered graph associated with the elimination process.

Consider the moral graph of a given belief network. This graph has a node for each propositional variable, and any two variables appearing in the same CPT are connected in the graph. The moral graph of the network in Figure 2a is given in Figure 2b. Let us take this moral graph and impose an ordering on its nodes. Figures 7a and 7b depict the ordered moral graph using the two orderings d_1 = A, C, B, F, D, G and d_2 = A, F, D, C, B, G. The ordering is pictured from bottom up.

Figure 7. Two orderings of the moral graph of our example problem.

The width of each variable in the ordered graph is the number of its earlier neighbors in the ordering. Thus, the width of G in the ordered graph along d_1 is 1 and the width of F is 2. Notice now that, using ordering d_1, the numbers of variables in the initial buckets of G and F are also 1 and 2, respectively. Indeed, in the initial partitioning, the number of variables mentioned in a bucket (excluding the bucket's variable) is always identical to the width of that node in the corresponding ordered moral graph.

During processing we wish to maintain the correspondence that any two nodes in the graph are connected iff there is a function (new or old) defined on both. Since, during processing, a function is recorded on all the variables appearing in a bucket, we should connect the corresponding nodes in the graph; namely, we should connect all the earlier neighbors of a processed variable. If we perform this graph operation recursively from the last node to the first (for each node connecting its earlier neighbors), we obtain the induced graph.
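The graph operation just described is easy to state programmatically. Here is a small sketch, assuming the moral graph is given as a dictionary of adjacency sets (an illustrative representation, not the paper's); it returns the width of every node in the induced ordered graph, whose maximum is the induced width of the ordering.

```python
def induced_width(adj, ordering):
    """adj: {node: set(neighbors)} of the (moral) graph; ordering: list of nodes.
    Connects the earlier neighbors of each node, processing from last to first,
    and returns {node: width in the induced ordered graph}."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    pos = {v: i for i, v in enumerate(ordering)}
    widths = {}
    for v in reversed(ordering):
        earlier = {u for u in adj[v] if pos[u] < pos[v]}
        widths[v] = len(earlier)
        for a in earlier:                  # connect all earlier neighbors of v
            for b in earlier:
                if a != b:
                    adj[a].add(b)
    return widths

# Moral graph of Example 2.1 (Figure 2b):
moral = {'A': {'B', 'C', 'D'}, 'B': {'A', 'C', 'D', 'F'}, 'C': {'A', 'B', 'F'},
         'D': {'A', 'B'}, 'F': {'B', 'C', 'G'}, 'G': {'F'}}
print(max(induced_width(moral, ['A', 'C', 'B', 'F', 'D', 'G']).values()))  # 2
print(max(induced_width(moral, ['A', 'F', 'D', 'C', 'B', 'G']).values()))  # 4
```

On the moral graph of the running example the two orderings d_1 and d_2 give maxima of 2 and 4, respectively, matching the bucket sizes observed above.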
The width of each node in this induced graph is identical to the bucket sizes generated during the elimination process (see Figure 5b).

Example 3.2 The induced moral graph of Figure 2b, relative to ordering d_1 = A, C, B, F, D, G, is depicted in Figure 7a. In this case the ordered graph and its induced ordered graph are identical, since all earlier neighbors of each node are already connected. The maximum induced width is 2. Indeed, in this case, the maximum arity of the functions recorded by the elimination algorithm is 2. For d_2 = A, F, D, C, B, G the induced graph is depicted in Figure 7c. The width of C is initially 1 (see Figure 7b), while its induced width is 3. The maximum induced width over all variables for d_2 is 4, and so is the dimensionality of the recorded functions.

A formal definition of all the above graph concepts is given next.

Definition 3.3 An ordered graph is a pair (G, d) where G is an undirected graph and d = X_1, ..., X_n is an ordering of the nodes. The width of a node in an ordered graph is the number of the node's neighbors that precede it in the ordering. The width of an ordering d, denoted w(d), is the maximum width over all nodes. The induced width of an ordered graph, w*(d), is the width of the induced ordered graph obtained as follows: nodes are processed from last to first; when node X is processed, all its preceding neighbors are connected. The induced width of a graph, w*, is the minimal induced width over all its orderings. The tree-width of a graph is the minimal induced width plus one (Arnborg, 1985).

The established connection between bucket sizes and induced width motivates finding an ordering with the smallest induced width. While it is known that finding an ordering with the smallest induced width is hard (Arnborg, 1985), useful greedy heuristics as well as approximation algorithms are available (Dechter, 1992; Becker and Geiger, 1996).

In summary, the complexity of algorithm elim-bel is dominated by the time and space needed to process a bucket. Recording a function on all the bucket's variables is time and space exponential in the number of variables mentioned in the bucket. As we have seen, the induced width bounds the arity of the functions recorded; the variables appearing in a bucket coincide with the earlier neighbors of the corresponding node in the ordered induced moral graph. In conclusion,

Theorem 3.4 Given an ordering d, the complexity of elim-bel is (time and space) exponential in the induced width w*(d) of the network's ordered moral graph. □

3.2. HANDLING OBSERVATIONS

Evidence should be handled in a special way during the processing of buckets. Continuing with our example using elimination on order d_1, suppose we wish to compute the belief in A = a having observed b = 1. This observation is relevant only when processing bucket_B. When the algorithm arrives at that bucket, the bucket contains the three functions P(b|a), λ_D(b,a), and λ_F(b,c), as well as the observation b = 1 (see Figure 4). The processing rule dictates computing λ_B(a,c) = P(b=1|a)·λ_D(b=1,a)·λ_F(b=1,c). Namely, we will generate and record a two-dimensional function. It would be more effective, however, to apply the assignment b = 1 to each function in the bucket separately and then put the resulting functions into lower buckets. In other words, we can generate P(b=1|a) and λ_D(b=1,a), each of which will be placed in the bucket of A, and λ_F(b=1,c), which will be placed in the bucket of C. By so doing, we avoid increasing the dimensionality of the recorded functions. Processing buckets containing observations in this manner automatically exploits the cutset conditioning effect (Pearl, 1988). Therefore, the algorithm has a special rule for processing buckets with observations: the observed value is assigned to each function in the bucket, and each resulting function is moved individually to a lower bucket.

Note that, if the bucket of B had been at the top of our ordering, as in d_2, the virtue of conditioning on B could have been exploited earlier. When processed, bucket_B contains P(b|a), P(d|b,a), P(f|c,b), and b = 1 (see Figure 5a). The special rule for processing buckets holding observations will place P(b=1|a) in bucket_A, P(d|b=1,a) in bucket_D, and P(f|c,b=1) in bucket_F. In subsequent processing, only one-dimensional functions will be recorded.

We see that the presence of observations reduces complexity. Since the buckets of observed variables are processed in linear time, and the recorded functions do not create functions on new subsets of variables, the corresponding new arcs should not be added when computing the induced graph. Namely, earlier neighbors of observed variables should not be connected. To capture this refinement we use the notion of the adjusted induced graph, which is defined recursively as follows. Given an ordering and a set of observed nodes, the adjusted induced graph is generated by processing the nodes from last to first, connecting the earlier neighbors of unobserved nodes only. The adjusted induced width is the width of the adjusted induced graph.

Theorem 3.5 Given a belief network having n variables, algorithm elim-bel using ordering d and evidence e is (time and space) exponential in the adjusted induced width w*(d, e) of the network's ordered moral graph. □

3.3. FOCUSING ON RELEVANT SUBNETWORKS

We will now present an improvement to elim-bel whose essence is restricting the computation to relevant portions of the belief network. Such restrictions are already available in the literature in the context of existing algorithms (Geiger et al., 1990; Shachter, 1990).

Since the summation over all values of a probability function is 1, the recorded functions of some buckets will degenerate to the constant 1. If we could recognize such cases in advance, we could avoid needless computation by skipping some buckets. If we use a topological ordering of the belief network's acyclic graph (where parents precede their child nodes), and assuming that the queried variable starts the ordering (otherwise, the queried variable can be moved to the top of the ordering), we can recognize skippable buckets dynamically, during the elimination process.

Proposition 3.6 Given a belief network and a topological ordering X_1, ..., X_n, algorithm elim-bel can skip a bucket if, at the time of processing, the bucket contains no evidence variable, no query variable, and no newly computed function. □
Proof: If a topological ordering is used, each bucket (that does not contain the queried variable) initially contains at most one function, describing the probability of its variable conditioned on all its parents. Clearly, if there is no evidence, summation will yield the constant 1. □

Example 3.7 Consider again the belief network whose acyclic graph is given in Figure 2a and the ordering d_1 = A, C, B, F, D, G, and assume we want to update the belief in variable A given evidence on F. Clearly the buckets of G and D can be skipped, and processing should start with bucket_F. Once the bucket of F is processed, none of the remaining buckets can be skipped.
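The observation rule of Section 3.2 and the skipping rule of Proposition 3.6 both amount to small changes in how a single bucket is handled. As a minimal illustration (an assumption-laden sketch, not the paper's code, reusing the (scope, table) factor representation of the earlier sketch and made-up numbers), restricting a factor by an observed value looks as follows; the skip test itself requires no table computation, only a check that the bucket holds nothing but original, evidence-free CPTs and no query variable.

```python
def restrict(factor, X, x):
    """Assign X = x inside a (scope, table) factor and drop X from the scope,
    so the result can be moved to a lower bucket without growing in arity."""
    scope, table = factor
    i = scope.index(X)
    return (scope[:i] + scope[i + 1:],
            {vals[:i] + vals[i + 1:]: p for vals, p in table.items() if vals[i] == x})

# Observation rule on bucket_B: instead of multiplying P(b|a), lambda_D(b,a) and
# lambda_F(b,c) and summing, restrict each factor separately at b = 1 and move it
# to the bucket of its remaining latest variable.
P_b_given_a = (['B', 'A'], {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.2, (1, 1): 0.8})
print(restrict(P_b_given_a, 'B', 1))     # a one-dimensional function over A
```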
Alternatively, the relevant portion of the network can be precomputed by using a recursive marking procedure applied to the ordered moral graph (see also (Zhang and Poole, 1996)).

Definition 3.8 Given an acyclic graph and a topological ordering that starts with the queried variable, and given evidence e, the marking process works as follows: an evidence node is marked, a neighbor of the query variable is marked, and any earlier neighbor of a marked node is marked.

The marked belief subnetwork, obtained by deleting all unmarked nodes, can now be processed by elim-bel to answer the belief-updating query. It is easy to see that

Theorem 3.9 The complexity of algorithm elim-bel given evidence e is exponential in the adjusted induced width of the marked ordered moral subgraph.
Proof: Deleting the unmarked nodes from the belief network results in a belief subnetwork whose distribution is identical to the marginal distribution over the marked variables. □

Algorithm elim-bel (improved)
...
2. Backward: For p = n downto 1, do
   for all the matrices λ_1, λ_2, ..., λ_j in bucket_p, do
   (bucket with observed variable) if X_p = x_p appears in bucket_p, then substitute X_p = x_p in each matrix λ_i and put each in the appropriate bucket.
   else, if bucket_p is NOT skippable, then U_p ← ∪_{i=1}^{j} S_i − {X_p} and λ_p = Σ_{X_p} ∏_{i=1}^{j} λ_i. Add λ_p to the bucket of the largest-index variable in U_p.
...
Figure 8. Improved algorithm elim-bel

4. An Elimination Algorithm for mpe

In this section we focus on the task of finding the most probable explanation. This task appears in applications such as diagnosis and abduction. For example, it can suggest the disease from which a patient suffers given data on clinical findings. Researchers have investigated various approaches to finding the mpe in a belief network (see, e.g., (Pearl, 1988; Cooper, 1984; Peng and Reggia, 1986; Peng and Reggia, 1989)). Recent proposals include best-first search algorithms (Shimony and Charniack, 1991) and algorithms based on linear programming (Santos, 1991).

The problem is to find x^o such that P(x^o) = max_x ∏_i P(x_i, e | x_{pa_i}), where x = (x_1, ..., x_n) and e is a set of observations. Namely, for a given ordering X_1, ..., X_n, we compute

$$M = \max_{\bar{x}} P(\bar{x}) = \max_{\bar{x}^{(n-1)}} \max_{x_n} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}) \qquad (8)$$

This can be accomplished, as before, by performing the maximization operation along the ordering from right to left, while migrating to the left, at each step, all components that do not mention the maximizing variable. We get

$$M = \max_{\bar{x}} P(\bar{x}, e) = \max_{\bar{x}^{(n-1)}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}) \; \max_{x_n} P(x_n, e \mid x_{pa_n}) \prod_{X_i \in ch_n} P(x_i, e \mid x_{pa_i})$$
$$= \max_{\bar{x}^{(n-1)}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}) \; h_n(x_{U_n}),$$

where

$$h_n(x_{U_n}) = \max_{x_n} P(x_n, e \mid x_{pa_n}) \prod_{X_i \in ch_n} P(x_i, e \mid x_{pa_i})$$

and U_n denotes the variables appearing in components defined over X_n, excluding X_n itself.

Clearly, the algebraic manipulation of the above expressions is the same as the algebraic manipulation for belief assessment, where summation is replaced by maximization. Consequently, the bucket-elimination procedure elim-mpe is identical to elim-bel except for this change. Given ordering X_1, ..., X_n, the conditional probability tables are partitioned as before. To process each bucket, we multiply all the bucket's matrices, which in this case are denoted h_1, ..., h_j and defined over subsets S_1, ..., S_j, and then eliminate the bucket's variable by maximization. The computed function in this case is h_p : U_p → R, h_p = max_{X_p} ∏_{i=1}^{j} h_i, where U_p = ∪_i S_i − {X_p}. The function obtained by processing a bucket is placed in the bucket of its largest-index variable in U_p. In addition, a function x^o_p(u) = argmax_{X_p} h_p(u), which relates an optimizing value of X_p to each tuple u of U_p, is recorded and placed in the bucket of X_p. The procedure continues recursively, processing the bucket of the next variable while going from the last variable to the first. Once all buckets are processed, the mpe value can be extracted from the first bucket. When this backward phase terminates, the algorithm initiates a forward phase to compute an mpe tuple.

Forward phase: Once all the variables are processed, an mpe tuple is computed by assigning values along the ordering from X_1 to X_n, consulting the information recorded in each bucket. Specifically, once the partial assignment x = (x_1, ..., x_{i-1}) is selected, the value of X_i appended to this tuple is x^o_i(x), where x^o_i is the function recorded in the backward phase. The algorithm is presented in Figure 9. Observed variables are handled as in elim-bel.

Algorithm elim-mpe
Input: A belief network BN = {P_1, ..., P_n}; an ordering of the variables, d; observations e.
Output: The most probable assignment.
1. Initialize: Generate an ordered partition of the conditional probability matrices, bucket_1, ..., bucket_n, where bucket_i contains all matrices whose highest variable is X_i. Put each observed variable in its bucket. Let S_1, ..., S_j be the subsets of variables in the processed bucket on which matrices (new or old) are defined.
2. Backward: For p = n downto 1, do
   for all the matrices h_1, h_2, ..., h_j in bucket_p, do
   (bucket with observed variable) if bucket_p contains X_p = x_p, assign X_p = x_p to each h_i and put each in the appropriate bucket.
   else, U_p ← ∪_{i=1}^{j} S_i − {X_p}. Generate the functions h_p = max_{X_p} ∏_{i=1}^{j} h_i and x^o_p = argmax_{X_p} h_p. Add h_p to the bucket of the largest-index variable in U_p.
3. Forward: Assign values in the ordering d using the recorded functions x^o in each bucket.
Figure 9. Algorithm elim-mpe

Example 4.1 Consider again the belief network of Figure 2. Given the ordering d = A, C, B, F, D, G and the evidence g = 1, we process the variables from last to first after partitioning the conditional probability matrices into buckets, such that bucket_G = {P(g|f), g = 1}, bucket_D = {P(d|b,a)}, bucket_F = {P(f|b,c)}, bucket_B = {P(b|a)}, bucket_C = {P(c|a)}, and bucket_A = {P(a)}. To process G, assign g = 1, get h_G(f) = P(g=1|f), and place the result in bucket_F; the function G^o(f) = argmax_g h_G(f) is placed in bucket_G as well. Process bucket_D by computing h_D(b,a) = max_d P(d|b,a) and putting the result in bucket_B; record also D^o(b,a) = argmax_d P(d|b,a). The bucket of F, to be processed next, now contains two matrices, P(f|b,c) and h_G(f). Compute h_F(b,c) = max_f P(f|b,c)·h_G(f) and place the resulting function in bucket_B. To eliminate B, we record the function h_B(a,c) = max_b P(b|a)·h_D(b,a)·h_F(b,c) and place it in bucket_C. To eliminate C, we compute h_C(a) = max_c P(c|a)·h_B(a,c) and place it in bucket_A. Finally, the mpe value M = max_a P(a)·h_C(a) is determined in bucket_A, and an mpe tuple is obtained by going forward through the buckets.

The backward process can be viewed as a compilation (or learning) phase, in which we compile information regarding the most probable extension of partial tuples to variables higher in the ordering (see also section 7.2).
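In code, elim-mpe differs from the elim-bel sketch above only in the elimination operator and in the bookkeeping for the forward phase. The following self-contained sketch (illustrative names and representation, not the paper's) records an argmax table per bucket and again handles observed variables by restricting their domains.

```python
from itertools import product

def multiply(factors, domains):
    # Same helper as in the elim-bel sketch: pointwise product of (scope, table) factors.
    scope = sorted({v for s, _ in factors for v in s})
    table = {}
    for vals in product(*(domains[v] for v in scope)):
        asg = dict(zip(scope, vals))
        p = 1.0
        for s, t in factors:
            p *= t[tuple(asg[v] for v in s)]
        table[vals] = p
    return scope, table

def max_out(scope, table, X):
    """Eliminate X by maximization, recording argmax for the forward phase."""
    i = scope.index(X)
    new_scope = [v for v in scope if v != X]
    h, arg = {}, {}
    for vals, p in table.items():
        u = vals[:i] + vals[i + 1:]
        if u not in h or p > h[u]:
            h[u], arg[u] = p, vals[i]
    return (new_scope, h), (new_scope, arg)

def elim_mpe(factors, ordering, domains):
    buckets = {X: [] for X in ordering}
    for f in factors:
        buckets[max(f[0], key=ordering.index)].append(f)
    argmaxes, mpe_value = {}, 1.0
    for X in reversed(ordering):                      # backward phase
        if not buckets[X]:
            continue
        (scope, h), argmaxes[X] = max_out(*multiply(buckets[X], domains), X)
        if scope:
            buckets[max(scope, key=ordering.index)].append((scope, h))
        else:
            mpe_value *= h[()]                        # constant factor: part of M
    assignment = {}
    for X in ordering:                                # forward phase
        if X not in argmaxes:
            assignment[X] = domains[X][0]             # X was never constrained
            continue
        scope, arg = argmaxes[X]
        assignment[X] = arg[tuple(assignment[v] for v in scope)]
    return mpe_value, assignment
```

On the network of Example 4.1, called with the ordering ['A', 'C', 'B', 'F', 'D', 'G'] and G restricted to its observed value, the recorded h functions correspond to h_G, h_D, h_F, h_B, and h_C of the example.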
Similarly to the case of belief updating, the complexity of elim-mpe is bounded exponentially in the dimension of the recorded matrices, and those are bounded by the induced width w*(d, e) of the ordered moral graph. In summary,

Theorem 4.2 Algorithm elim-mpe is complete for the mpe task. Its complexity (time and space) is O(n · exp(w*(d, e))), where n is the number of variables and w*(d, e) is the e-adjusted induced width of the ordered moral graph. □

5. An Elimination Algorithm for MAP

We next present an elimination algorithm for the map task. By its definition, the task is a mixture of the previous two; accordingly, some of the variables are eliminated by summation and others by maximization. Given a belief network, a subset of hypothesized variables A = {A_1, ..., A_k}, and some evidence e, the problem is to find an assignment to the hypothesized variables that maximizes their probability given the evidence. Formally, we wish to compute

$$\max_{\bar{a}_k} P(\bar{x}, e) = \max_{\bar{a}_k} \sum_{\bar{x}_{k+1}^{n}} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}),$$

where x = (a_1, ..., a_k, x_{k+1}, ..., x_n). In the algebraic manipulation of this expression, we push the maximization to the left of the summation. This means that in the elimination algorithm the maximized variables should initiate the ordering (and therefore will be processed last). Algorithm elim-map in Figure 10 considers only orderings in which the hypothesized variables start the ordering. The algorithm has a backward phase and a forward phase, but the forward phase is relative to the hypothesized variables only. Maximization and summation may be somewhat interleaved to allow more effective orderings; however, we do not incorporate this option here. Note that the relevant graph for this task can be restricted by marking in a manner very similar to the belief-updating case. Here the initial marking includes all the hypothesized variables, while otherwise the marking procedure is applied recursively to the summation variables only.

Algorithm elim-map
Input: A belief network BN = {P_1, ..., P_n}; a subset of hypothesis variables A = {A_1, ..., A_k}; an ordering of the variables, d, in which the A's are first in the ordering; observations e.
Output: A most probable assignment A = a.
1. Initialize: Generate an ordered partition of the conditional probability matrices, bucket_1, ..., bucket_n, where bucket_i contains all matrices whose highest variable is X_i.
2. Backward: For p = n downto 1, do
   for all the matrices β_1, β_2, ..., β_j in bucket_p, do
   (bucket with observed variable) if bucket_p contains the observation X_p = x_p, assign X_p = x_p to each β_i and put each in the appropriate bucket.
   else, U_p ← ∪_{i=1}^{j} S_i − {X_p}. If X_p is not in A and bucket_p contains new functions, then β_p = Σ_{X_p} ∏_{i=1}^{j} β_i; else (X_p ∈ A), β_p = max_{X_p} ∏_{i=1}^{j} β_i and a^o = argmax_{X_p} β_p. Add β_p to the bucket of the largest-index variable in U_p.
3. Forward: Assign values, in the ordering d = A_1, ..., A_k, using the information recorded in each bucket.
Figure 10. Algorithm elim-map

Theorem 5.1 Algorithm elim-map is complete for the map task. Its complexity is O(n · exp(w*(d, e))), where n is the number of variables in the relevant marked graph and w*(d, e) is the e-adjusted induced width of its marked moral graph. □

6. An Elimination Algorithm for MEU

The last and somewhat more complicated task we address is that of finding the maximum expected utility. Given a belief network, evidence e, a real-valued utility function u(x) that is additively decomposable relative to functions f_1, ..., f_j defined over Q = {Q_1, ..., Q_j}, Q_i ⊆ X, such that u(x) = Σ_{Q_j ∈ Q} f_j(x_{Q_j}), and a subset of decision variables D = {D_1, ..., D_k} that are assumed to be root nodes, the meu task is to find a set of decisions d^o = (d^o_1, ..., d^o_k) that maximizes the expected utility. We assume that the variables not appearing in D are indexed X_{k+1}, ..., X_n. Formally, we want to compute

$$E = \max_{d_1, \ldots, d_k} \sum_{x_{k+1}, \ldots, x_n} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}, d_1, \ldots, d_k) \, u(x),$$

and d^o = argmax_D E.

As in the previous tasks, we begin by identifying the computation associated with X_n, from which we will extract the computation performed in each bucket. We denote an assignment to the decision variables by d = (d_1, ..., d_k) and write x̄_k^j = (x_k, ..., x_j). Algebraic manipulation yields

$$E = \max_{d} \sum_{\bar{x}_{k+1}^{(n-1)}} \sum_{x_n} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}, d) \sum_{Q_j \in Q} f_j(x_{Q_j}).$$

We can now separate the components of the utility function into those mentioning X_n, denoted by the index set t_n, and those not mentioning X_n, labeled with indexes l_n = {1, ..., n} − t_n. Accordingly we get

$$E = \max_{d} \sum_{\bar{x}_{k+1}^{(n-1)}} \sum_{x_n} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}, d) \Big( \sum_{j \in l_n} f_j(x_{Q_j}) + \sum_{j \in t_n} f_j(x_{Q_j}) \Big)$$

$$E = \max_{d} \Big[ \sum_{\bar{x}_{k+1}^{(n-1)}} \sum_{x_n} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}, d) \sum_{j \in l_n} f_j(x_{Q_j}) \;+\; \sum_{\bar{x}_{k+1}^{(n-1)}} \sum_{x_n} \prod_{i=1}^{n} P(x_i, e \mid x_{pa_i}, d) \sum_{j \in t_n} f_j(x_{Q_j}) \Big]$$

By migrating to the left of X_n all of the elements that are not a function of X_n, we get

$$\max_{d} \Big[ \sum_{\bar{x}_{k+1}^{(n-1)}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}, d) \Big( \sum_{j \in l_n} f_j(x_{Q_j}) \Big) \sum_{x_n} \prod_{X_i \in F_n} P(x_i, e \mid x_{pa_i}, d)$$
$$+ \sum_{\bar{x}_{k+1}^{(n-1)}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}, d) \sum_{x_n} \prod_{X_i \in F_n} P(x_i, e \mid x_{pa_i}, d) \sum_{j \in t_n} f_j(x_{Q_j}) \Big] \qquad (9)$$

We denote by U_n the subset of variables that appear with X_n in a probabilistic component, excluding X_n itself, and by W_n the union of the variables that appear in probabilistic and utility components with X_n, excluding X_n itself. We define λ_n over U_n as (x is a tuple over U_n ∪ X_n)

$$\lambda_n(x_{U_n} \mid d) = \sum_{x_n} \prod_{X_i \in F_n} P(x_i, e \mid x_{pa_i}, d) \qquad (10)$$

and we define θ_n over W_n as

$$\theta_n(x_{W_n} \mid d) = \sum_{x_n} \prod_{X_i \in F_n} P(x_i, e \mid x_{pa_i}, d) \sum_{j \in t_n} f_j(x_{Q_j}). \qquad (11)$$

After substituting Eqs. (10) and (11) into Eq. (9), we get

$$E = \max_{d} \sum_{\bar{x}_{k+1}^{(n-1)}} \prod_{X_i \in X - F_n} P(x_i, e \mid x_{pa_i}, d) \, \lambda_n(x_{U_n} \mid d) \Big[ \sum_{j \in l_n} f_j(x_{Q_j}) + \frac{\theta_n(x_{W_n} \mid d)}{\lambda_n(x_{U_n} \mid d)} \Big] \qquad (12)$$

The functions λ_n and θ_n compute the effect of eliminating X_n. The result, Eq. (12), is an expression that does not include X_n, in which the product has one more matrix, λ_n, and the utility components have one more element, θ_n/λ_n. Applying such algebraic manipulation to the rest of the variables in order yields the elimination algorithm elim-meu in Figure 11. We assume here that the decision variables are processed last by elim-meu.

Each bucket contains utility components θ_i and probability components λ_i. When there is no evidence, λ_n is a constant and we can incorporate the marking modification presented for elim-bel. Otherwise, during processing, the algorithm generates the λ of a bucket by multiplying all its probability components and summing over the bucket's variable. The θ of a bucket is computed as the average utility of the bucket; if the bucket is marked, the average utility of the bucket is normalized by its λ. The resulting λ and θ are placed into the appropriate buckets.

Algorithm elim-meu
Input: A belief network BN = {P_1, ..., P_n}; a subset of variables D_1, ..., D_k that are decision variables and root nodes; a utility function over X, u(x) = Σ_j f_j(x_{Q_j}); an ordering of the variables, o, in which the D's appear first; observations e.
Output: An assignment d_1, ..., d_k that maximizes the expected utility.
1. Initialize: Partition the components into buckets, where bucket_i contains all matrices whose highest variable is X_i. Call the probability matrices λ_1, ..., λ_j and the utility matrices θ_1, ..., θ_l. Let S_1, ..., S_j be the probability variable subsets and Q_1, ..., Q_l the utility variable subsets.
2. Backward: For p = n downto k + 1, do
   for all the matrices λ_1, ..., λ_j, θ_1, ..., θ_l in bucket_p, do
   (bucket with observed variable) if bucket_p contains the observation X_p = x_p, then assign X_p = x_p to each λ_i and θ_i and put each resulting matrix in the appropriate bucket.
   else, U_p ← ∪_{i=1}^{j} S_i − {X_p} and W_p ← U_p ∪ (∪_{i=1}^{l} Q_i − {X_p}). If bucket_p contains an observation or new λ's, then λ_p = Σ_{X_p} ∏_i λ_i and θ_p = (1/λ_p) Σ_{X_p} ∏_{i=1}^{j} λ_i Σ_{j=1}^{l} θ_j; else, θ_p = Σ_{X_p} ∏_{i=1}^{j} λ_i Σ_{j=1}^{l} θ_j. Add θ_p and λ_p to the buckets of the largest-index variable in W_p and U_p, respectively.
3. Forward: Assign values in the ordering o = D_1, ..., D_k using the information recorded in each bucket of the decision variables.
Figure 11. Algorithm elim-meu

The maximization over the decision variables can now be accomplished using maximization as the elimination operator. We do not include this step explicitly, since, given our simplifying assumption that all decisions are root nodes, this step is straightforward. Clearly, maximization and summation can be interleaved to some degree, thus allowing more efficient orderings.

As before, the algorithm's performance can be bounded as a function of the structure of its augmented graph. The augmented graph is the moral graph augmented with arcs connecting any two variables appearing in the same utility component f_i, for some i.

Theorem 6.1 Algorithm elim-meu computes the meu of a belief network augmented with utility components (i.e., an influence diagram) in time and space O(n · exp(w*(d, e))), where w*(d, e) is the induced width along d of the augmented moral graph. □

Tatman and Shachter (Tatman and Shachter, 1990) have published an algorithm that is a variation of elim-meu, and Kjaerulff's algorithm (Kjaerulff, 1993) can be viewed as a variation of elim-meu tailored to dynamic probabilistic networks.

7. Relation of Bucket Elimination to Other Methods

7.1. POLY-TREE ALGORITHM

When the belief network is a poly-tree, belief assessment, the mpe task and the map task can all be accomplished efficiently using Pearl's poly-tree algorithm (Pearl, 1988). Likewise, when the augmented graph is a tree, the meu can be computed efficiently. A poly-tree is a directed acyclic graph whose underlying undirected graph has no cycles. We claim that if a bucket-elimination algorithm processes the variables in a topological ordering (parents precede their child nodes), then the algorithm coincides (with some minor modifications) with the poly-tree algorithm. We demonstrate the main idea using bucket elimination for the mpe task; the arguments apply to the rest of the tasks as well.

Figure 12. (a) A poly-tree and (b) a legal processing ordering.

Example 7.1 Consider the ordering X_1, U_3, U_2, U_1, Y_1, Z_1, Z_2, Z_3 of the poly-tree in Figure 12a, and assume that the last four variables are observed (here we denote an observed value by a primed lowercase letter and leave the other variables in lowercase). Processing the buckets from last to first, after the first four buckets have been processed as observation buckets, we get

bucket(U_3) = P(u_3), P(x_1 | u_1, u_2, u_3), P(z'_3 | u_3)
bucket(U_2) = P(u_2), P(z'_2 | u_2)
bucket(U_1) = P(u_1), P(z'_1 | u_1)
bucket(X_1) = P(y'_1 | x_1)

When processing bucket(U_3) by elim-mpe, we get h_{U_3}(x_1, u_1, u_2), which is placed in bucket(U_2). The final resulting buckets are

bucket(U_3) = P(u_3), P(x_1 | u_1, u_2, u_3), P(z'_3 | u_3)
bucket(U_2) = P(u_2), P(z'_2 | u_2), h_{U_3}(x_1, u_2, u_1)
bucket(U_1) = P(u_1), P(z'_1 | u_1), h_{U_2}(x_1, u_1)
bucket(X_1) = P(y'_1 | x_1), h_{U_1}(x_1)

We can now choose a value x_1 that maximizes the product in X_1's bucket, then choose a value u_1 that maximizes the product in U_1's bucket given the selected value of X_1, and so on.

It is easy to see that if elim-mpe uses a topological ordering of the poly-tree, it is time and space O(exp(|F|)), where |F| is the cardinality of the maximum family. For instance, in Example 7.1, elim-mpe records the intermediate function h_{U_3}(X_1, U_2, U_1), requiring O(k^3) space, where k bounds the domain size of each variable. Note, however, that Pearl's algorithm (which is also time exponential in the family size) is better, as it records functions on single variables only.

In order to restrict space needs, we modify elim-mpe in two ways. First, we restrict processing to a subset of the topological orderings in which sibling nodes and their parent appear consecutively as much as possible. Second, whenever the algorithm reaches a set of consecutive buckets from the same family, all such buckets are combined and processed as one super-bucket. With this change, elim-mpe is similar to Pearl's propagation algorithm on poly-trees (strictly, Pearl's algorithm should be restricted to message passing relative to one rooted tree in order to be identical with ours). Processing a super-bucket amounts to eliminating all the super-bucket's variables without recording intermediate results.

Example 7.2 Consider Example 7.1. Here, instead of processing each bucket of U_i separately, we compute by a brute-force algorithm the function h_{U_1,U_2,U_3}(x_1) in the super-bucket of U_1, U_2, U_3 and place the function in the bucket of X_1. We get the unary function h_{U_1,U_2,U_3}(x_1) = max_{u_1,u_2,u_3} P(u_3) P(x_1 | u_1, u_2, u_3) P(z'_3 | u_3) P(u_2) P(z'_2 | u_2) P(u_1) P(z'_1 | u_1).

The details of obtaining an ordering such that all families in a poly-tree can be processed as super-buckets can be worked out, but are beyond the scope of this paper. In summary,

Proposition 7.3 There exists an ordering of a poly-tree such that the bucket-elimination algorithms (elim-bel, elim-mpe, etc.) with the super-bucket modification have the same time and space complexity as Pearl's poly-tree algorithm for the corresponding tasks. The modified algorithms' time complexity is exponential in the family size, and they require only linear space. □

7.2. JOIN-TREE CLUSTERING

Join-tree clustering (Lauritzen and Spiegelhalter, 1988) and bucket elimination are closely related, and their worst-case complexity (time and space) is essentially the same. The sizes of the cliques in tree-clustering are identical to the induced width plus one of the corresponding ordered graph. In fact, elimination may be viewed as a directional (i.e., goal- or query-oriented) version of join-tree clustering.

Figure 13. Clique-tree associated with the induced graph of Figure 7a (cliques GF, FCB, CBA and DBA, with separators F, BC and AB).
The close relationship between join-tree clustering and bucket elimination can be used to attribute meaning to the intermediate functions computed by elimination. Given an elimination ordering, we can generate the ordered induced moral graph, whose maximal cliques (namely, maximal fully-connected subgraphs) can be enumerated as follows: each variable and its earlier neighbors form a clique, and each clique is connected to a parent clique with which it shares the largest subset of variables (Dechter and Pearl, 1989). For example, the induced graph in Figure 7a yields the clique-tree in Figure 13. If this ordering is used by tree-clustering, the same tree may be generated.

The functions recorded by bucket elimination can be given the following meaning (details and proofs of these claims are beyond the scope of this paper). The function h_p(u) recorded in bucket_p by elim-mpe, defined over U_p = ∪_i S_i − {X_p}, is the maximum-probability extension of u to the variables appearing later in the ordering that are mentioned in the clique subtree rooted at a clique containing U_p. For instance, h_F(b,c), recorded by elim-mpe using d_1 (see Example 4.1), equals max_{f,g} P(b, c, f, g), since F and G appear in the clique subtree rooted at (FCB). For belief assessment, the function λ_p = Σ_{X_p} ∏_{i=1}^{j} λ_i, defined over U_p = ∪_i S_i − {X_p}, denotes the probability of all the evidence e^+_p observed in the clique subtree rooted at a clique containing U_p, conjoined with u. Namely, λ_p(u) = P(e^+_p, u).

8. Combining Elimination and Conditioning

A serious drawback of elimination algorithms is that they require considerable memory for recording the intermediate functions. Conditioning, on the other hand, requires only linear space. By combining conditioning and elimination, we may be able to reduce the amount of memory needed while still having a performance guarantee.

Conditioning can be viewed as an algorithm for processing the algebraic expressions defined for the task from left to right. In this case, partial results cannot be assembled; rather, partial value assignments (conditioning on a subset of variables) unfold a tree of subproblems, each associated with an assignment to some variables. Say, for example, that we want to compute the expression for the mpe in the network of Figure 2:

$$M = \max_{a,c,b,f,d,g} P(g \mid f) P(f \mid b,c) P(d \mid a,b) P(c \mid a) P(b \mid a) P(a)$$
$$= \max_{a} P(a) \max_{c} P(c \mid a) \max_{b} P(b \mid a) \max_{f} P(f \mid b,c) \max_{d} P(d \mid b,a) \max_{g} P(g \mid f). \qquad (13)$$

We can compute this expression by traversing the tree in Figure 14, going along the ordering from the first variable to the last. The tree can be traversed either breadth-first or depth-first, resulting in known search algorithms such as best-first search and branch and bound.

Figure 14. Probability tree.

Algorithm elim-cond-mpe
Input: A belief network BN = {P_1, ..., P_n}; an ordering of the variables, d; a subset C of conditioned variables; observations e.
Output: The most probable assignment.
Initialize: p = 0.
1. For every assignment C = c, do
   p_1 ← the output of elim-mpe with c ∪ e as observations;
   p ← max{p, p_1} (update the maximum probability).
2. Return p and a maximizing tuple.
Figure 15. Algorithm elim-cond-mpe

We will demonstrate the idea of combining conditioning with elimination on the mpe task. Let C be a subset of conditioned variables, C ⊆ X, and V = X − C. We denote by v an assignment to V and by c an assignment to C. Clearly,

$$\max_{x} P(x, e) = \max_{c} \max_{v} P(c, v, e) = \max_{c,v} \prod_{i} P(x_i, c, v, e \mid x_{pa_i}).$$

Therefore, for every partial tuple c, we compute max_v P(v, c, e) and a corresponding maximizing tuple

$$(x^o_V)(c) = \arg\max_{V} \prod_{i=1}^{n} P(x_i, e, c \mid x_{pa_i})$$

by using the elimination algorithm while treating the conditioned variables as observed variables. This basic computation is enumerated over all value combinations of the conditioned variables, and the tuple retaining the maximum probability is kept. The algorithm is presented in Figure 15.

Given a particular value assignment c, the time and space complexity of computing the maximum probability over the rest of the variables is bounded exponentially by the induced width w*(d, e ∪ c) of the ordered moral graph adjusted for both observed and conditioned nodes. In this case the induced graph is generated without connecting the earlier neighbors of evidence or conditioned variables.

Theorem 8.1 Given a set of conditioning variables C, the space complexity of algorithm elim-cond-mpe is O(n · exp(w*(d, c ∪ e))), while its time complexity is O(n · exp(w*(d, e ∪ c) + |C|)), where the induced width w*(d, c ∪ e) is computed on the ordered moral graph adjusted relative to e and c. □

When the variables in e ∪ c constitute a cycle-cutset of the graph, the graph can be ordered so that its adjusted induced width equals 1. In this case elim-cond-mpe reduces to the known loop-cutset algorithms (Pearl, 1988; Dechter, 1990).

Clearly, algorithm elim-cond-mpe can be implemented more effectively if we take advantage of shared partial assignments to the conditioned variables. There is a variety of possible hybrids between conditioning and elimination that can refine the basic procedure in elim-cond-mpe. One method imposes an upper bound on the arity of the functions recorded and decides dynamically, during processing, whether to process a bucket by elimination or by conditioning (see (Dechter and Rish, 1996)). Another method, which uses the super-bucket approach, collects a set of consecutive buckets into one super-bucket that it processes by conditioning, thus avoiding the recording of some intermediate results (Dechter, 1996b; El-Fattah and Dechter, 1996).

9. Related work

We have mentioned throughout this paper many algorithms in probabilistic and deterministic reasoning that can be viewed as similar to bucket-elimination algorithms. In addition, unifying frameworks observing the common features between various algorithms have also appeared, both in the past (Shenoy, 1992) and more recently (Bistarelli et al., 1997).

10. Summary and Conclusion

Using the bucket-elimination framework, which generalizes dynamic programming, we have presented a concise and uniform way of expressing algorithms for probabilistic reasoning. In this framework, algorithms exploit the topological properties of the network without conscious effort on the part of the designer. We have shown, for example, that if algorithms elim-mpe and elim-bel are given a singly-connected network, then they reduce to Pearl's algorithms (Pearl, 1988) for some orderings (such orderings always exist on trees). The same applies to elim-map and elim-meu, for which tree-propagation algorithms were not explicitly derived. The simplicity and elegance of the proposed framework highlight the features common to bucket elimination and join-tree clustering, and allow for focusing belief-assessment procedures on the relevant portions of the network.
9. Related work

We have mentioned throughout this paper many algorithms in probabilistic and deterministic reasoning that can be viewed as bucket-elimination algorithms. In addition, unifying frameworks observing the common features of such algorithms have appeared both in the past (Shenoy, 1992) and more recently in (Bistarelli et al., 1997).

10. Summary and Conclusion

Using the bucket-elimination framework, which generalizes dynamic programming, we have presented a concise and uniform way of expressing algorithms for probabilistic reasoning. In this framework, algorithms exploit the topological properties of the network without conscious effort on the part of the designer. We have shown, for example, that when algorithms elim-mpe and elim-bel are given a singly-connected network, they reduce to Pearl's algorithms (Pearl, 1988) for some orderings (such orderings always exist on trees). The same applies to elim-map and elim-meu, for which tree-propagation algorithms were not explicitly derived. The simplicity and elegance of the proposed framework highlight the features common to bucket elimination and join-tree clustering, and allow belief-assessment procedures to be focused on the relevant portions of the network. Such enhancements were accompanied by graph-based complexity bounds that are more refined than the standard induced-width bound.

The performance of bucket-elimination and tree-clustering algorithms is likely to suffer from the usual difficulty associated with dynamic programming: exponential space and exponential time in the worst case. Such performance deficiencies also plague resolution and constraint-satisfaction algorithms (Dechter and Rish, 1994; Dechter, 1997). Space complexity can be reduced using conditioning. We have shown that conditioning can be implemented naturally on top of elimination, thus reducing the space requirement while still exploiting topological features. Combining conditioning and elimination can be viewed as combining the virtues of forward and backward search.

Finally, no attempt was made in this paper to optimize the algorithms for distributed computation, nor to exploit compilation versus run-time resources. These issues can and should be addressed within the bucket-elimination framework. In particular, improvements exploiting the structure of the conditional probability matrices, as presented recently in (Santos et al., in press; Boutilier, 1996; Poole, 1997), can be incorporated on top of bucket elimination.

In summary, what we provide here is a uniform exposition across several tasks, applicable to both probabilistic and deterministic reasoning, which facilitates the transfer of ideas between several areas of research. More importantly, the organizational benefit associated with the use of buckets should allow all bucket-elimination algorithms to be improved uniformly. This can be done either by combining conditioning with elimination, as we have shown, or via approximation algorithms, as shown in (Dechter, 1997).

11. Acknowledgment

A preliminary version of this paper appeared in (Dechter, 1996a). I would like to thank Irina Rish and Nir Friedman for their useful comments on different versions of this paper. This work was partially supported by NSF grant IRI-9157636, Air Force Office of Scientific Research grant AFOSR F49620-96-1-0224, Rockwell MICRO grants ACM-20775 and 95-043, Amada of America, and Electrical Power Research Institute grant RP8014-06.

References

S. Arnborg and A. Proskourowski. Linear time algorithms for NP-hard problems restricted to partial k-trees. Discrete and Applied Mathematics, 23:11-24, 1989.
S.A. Arnborg. Efficient algorithms for combinatorial problems on graphs with bounded decomposability - a survey. BIT, 25:2-23, 1985.
F. Bacchus and P. van Run. Dynamic variable ordering in CSPs. In Principles and Practice of Constraint Programming (CP-95), Cassis, France, 1995.
A. Becker and D. Geiger. A sufficiently fast algorithm for finding close to optimal junction trees. In Uncertainty in Artificial Intelligence (UAI-96), pages 81-89, 1996.
U. Bertele and F. Brioschi. Nonserial Dynamic Programming. Academic Press, 1972.
S. Bistarelli, U. Montanari, and F. Rossi. Semiring-based constraint satisfaction and optimization. Journal of the Association for Computing Machinery (JACM), to appear, 1997.
C. Boutilier. Context-specific independence in Bayesian networks. In Uncertainty in Artificial Intelligence (UAI-96), pages 115-123, 1996.
C. Cannings, E.A. Thompson, and H.H. Skolnick. Probability functions on complex pedigrees. Advances in Applied Probability, 10:26-61, 1978.
G.F. Cooper. Nestor: A computer-based medical diagnosis aid that integrates causal and probabilistic knowledge. Technical report, Computer Science Department, Stanford University, Palo Alto, California, 1984.
M. Davis and H. Putnam. A computing procedure for quantification theory. Journal of the Association for Computing Machinery, 7(3), 1960.
R. Dechter and J. Pearl. Network-based heuristics for constraint satisfaction problems. Artificial Intelligence, 34:1-38, 1987.
R. Dechter and J. Pearl. Tree clustering for constraint networks. Artificial Intelligence, pages 353-366, 1989.
R. Dechter and I. Rish. Directional resolution: The Davis-Putnam procedure, revisited. In Principles of Knowledge Representation and Reasoning (KR-94), pages 134-145, 1994.
R. Dechter and I. Rish. To guess or to think? Hybrid algorithms for SAT. In Principles of Constraint Programming (CP-96), 1996.
R. Dechter and P. van Beek. Local and global relational consistency. In Principles and Practice of Constraint Programming (CP-95), pages 240-257, 1995.
R. Dechter. Enhancement schemes for constraint processing: Backjumping, learning and cutset decomposition. Artificial Intelligence, 41:273-312, 1990.
R. Dechter. Constraint networks. Encyclopedia of Artificial Intelligence, pages 276-285, 1992.
R. Dechter. Bucket elimination: A unifying framework for probabilistic inference algorithms. In Uncertainty in Artificial Intelligence (UAI-96), pages 211-219, 1996.
R. Dechter. Topological parameters for time-space tradeoffs. In Uncertainty in Artificial Intelligence (UAI-96), pages 220-227, 1996.
R. Dechter. Mini-buckets: A general scheme of generating approximations in automated reasoning. In IJCAI-97: Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, 1997.
Y. El-Fattah and R. Dechter. An evaluation of structural parameters for probabilistic reasoning: Results on benchmark circuits. In Uncertainty in Artificial Intelligence (UAI-96), pages 244-251, 1996.
D. Geiger, T. Verma, and J. Pearl. Identifying independence in Bayesian networks. Networks, 20:507-534, 1990.
F.V. Jensen, S.L. Lauritzen, and K.G. Olesen. Bayesian updating in causal probabilistic networks by local computation. Computational Statistics Quarterly, 4, 1990.
U. Kjaerulff. A computational scheme for reasoning in dynamic probabilistic networks. In Uncertainty in Artificial Intelligence (UAI-93), pages 121-149, 1993.
S.L. Lauritzen and D.J. Spiegelhalter. Local computation with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50(2):157-224, 1988.
J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
Y. Peng and J.A. Reggia. Plausibility of diagnostic hypothesis. In National Conference on Artificial Intelligence (AAAI-86), pages 140-145, 1986.
Y. Peng and J.A. Reggia. A connectionist model for diagnostic problem solving, 1989.
D. Poole. Probabilistic partial evaluation: Exploiting structure in probabilistic inference. In IJCAI-97: Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, 1997.
R.D. Shachter, S.K. Anderson, and P. Solovitz. Global conditioning for probabilistic inference in belief networks. In Uncertainty in Artificial Intelligence (UAI-91), pages 514-522, 1991.
R.D. Shachter, B. D'Ambrosio, and B.A. Del Favro. Symbolic probabilistic inference in belief networks. Automated Reasoning, pages 126-131, 1990.
E. Santos, S.E. Shimony, and E. Williams. Hybrid algorithms for approximate belief updating in Bayes nets. International Journal of Approximate Reasoning, in press.
E. Santos. On the generation of alternative explanations with implications for belief revision. In Uncertainty in Artificial Intelligence (UAI-91), pages 339-347, 1991.
R.D. Shachter. Evaluating influence diagrams. Operations Research, 34, 1986.
R.D. Shachter. Probabilistic inference and influence diagrams. Operations Research, 36, 1988.
R.D. Shachter. An ordered examination of influence diagrams. Networks, 20:535-563, 1990.
P.P. Shenoy. Valuation-based systems for Bayesian decision analysis. Operations Research, 40:463-484, 1992.
S.E. Shimony and E. Charniak. A new algorithm for finding MAP assignments to belief networks. In P. Bonissone, M. Henrion, L. Kanal, and J. Lemmer, editors, Uncertainty in Artificial Intelligence, volume 6, pages 185-193, 1991.
J.A. Tatman and R.D. Shachter. Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man, and Cybernetics, 1990.
N.L. Zhang and D. Poole. Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research (JAIR), 1996.