From: AAAI-90 Proceedings. Copyright ©1990, AAAI (www.aaai.org). All rights reserved.

Probabilistic Semantics for Cost Based Abduction*

Eugene Charniak and Solomon E. Shimony
Computer Science Department
Box 1910, Brown University
Providence, RI 02906
ec@cs.brown.edu and ses@cs.brown.edu

Abstract

Cost-based abduction attempts to find the best explanation for a set of facts by finding a minimal cost proof for the facts. The costs are computed by summing the costs of the assumptions necessary for the proof plus the costs of the rules. We formalize the construction of explanations (proofs) as a minimization problem on a DAG. We then define a probabilistic semantics for the costs, and prove the equivalence of the cost minimization problem to the Bayesian network MAP solution of the system.

Introduction

The deductive nomological theory of explanation has it that an explanation is a proof of what is to be explained from knowledge of the world plus a set of assumptions. While there are well known problems with the theory [McDermott, 1987], it is nevertheless an attractive one for people in AI, since it ties something we know little about (explanation) to something we as a community know quite a bit more about (theorem proving). From an AI viewpoint the real problems with the deductive nomological theory are not the abstract ones of the philosopher, but rather the immediate one that there are many possible sets of assumptions which, together with our knowledge of the world, would serve to explain (prove) the desired fact. Somehow, a choice between the sets must be made. Several researchers ([Kautz and Allen, 1986], [Genesereth, 1984]) have used the above technique and graded assumption sets by a) only allowing some formulas to be assumed, and b) preferring sets with the minimum number of assumptions. Obviously, these simplifying assumptions are severely limiting. Cost-based abduction is the obvious generalization of these theorem proving techniques.
It allows any formula to be assumed, and assigns all assumed formulas a non-negative real number cost. The best explanation is then the proof with the minimum cost (a formal definition of cost-based abduction will be given in the next section). Something very similar to what we are calling "cost-based abduction" has been proposed and implemented by Hobbs and Stickel [Hobbs and Stickel, 1988], and it appears to be a promising method for handling abductive problems. However, their scheme has one immediate drawback: at the present moment the "costs" have no adequate semantics; for Hobbs and Stickel they are simply numbers pulled out of a hat. In what follows we will provide a probabilistic semantics for cost-based abduction, according to the following outline. First, we will formalize a cost-based proof (explanation) of some facts as an augmented DAG, and exploit the similarity of the DAG to a belief network (Bayesian network, [Pearl, 1988]) to define a probability distribution using the topology of the DAG plus the costs. A major theorem of the paper will show that the maximum a-posteriori (MAP) assignment of truth values to the network corresponds to the minimal cost proof¹. In the discussion these results will be explained in a more intuitive fashion. Appelt, in [Appelt, 1990], has attempted to give a semantics for the Hobbs-Stickel cost scheme. We will discuss the Hobbs-Stickel approach in more detail, and show why Appelt's semantics is not adequate. Lastly, we will briefly mention our practical experience with an implementation of cost-based abduction.

*This work has been supported in part by the National Science Foundation under grants IST 8416034 and IST 8515005 and the Office of Naval Research under grant N00014-79-C-0529. We also wish to thank Robert Goldman for helpful comments about the semantics, and Randy Calistri for proofreading earlier versions of this paper.

106 AUTOMATED REASONING
DAG Representation

A rule-based system with assumability costs has rules of the form:

R: p_1 ∧ p_2 ∧ ... ∧ p_n → q

with costs c(p_i) for each conjunct, and a cost c(R) for applying the rule. A conjunct has the same cost in all the rules where it appears on the left-hand side (LHS). The cost of proving q with this rule is the cost of all the conjuncts assumed, plus the cost of the rule. For the rest of this section and the next one, we assume without loss of generality that all rule costs are 0. We can do this by adding a p_0 (that appears nowhere else) to the LHS, with a cost c(p_0) = c(R). We want to find a minimal cost proof for some fact set E ("the evidence").

We now formalize the minimum cost proof problem as a minimization problem on a weighted AND/OR DAG (acronym WAODAG). We use a three-valued logic (values {T, F, U}), augmented by symbols for keeping track of assumed nodes versus implied nodes. The values we actually use are Q = {T^A, T, U, F^A, F}, where U stands for undetermined (intuitively: either true or false), T for true, F for false, and the A superscript stands for "assumed". We use u\v to say that u is an immediate parent of v.

¹A MAP assignment is the way to set the values of all random variables such that their joint probability is the highest.

Definition 1 A WAODAG is a 4-tuple (G, c, r, s), where:

1. G is a connected directed acyclic graph, G = (V, E).
2. c is a function from {V × Q} to the non-negative reals, called the cost function. For values T, F, U, the cost is zero. We abbreviate c(v, T^A) as c(v).
3. r is a function from V to {AND, OR}, called the label. A node labeled AND is called an AND node, etc.
4. s is an AND node with outdegree 0 (the evidence node).

Definition 2 A truth assignment for a WAODAG is a function f from V to Q. A truth assignment is a (possibly partial) model iff the following conditions hold:

1. If v is a root node (a node with in-degree 0), then f(v) ∈ {T^A, U, F^A}.
2. If v is a non-root node, then it can only be assigned values consistent with its parents and its label (AND or OR); if its parents do not uniquely determine the node's truth value, it can have any value in {T^A, F^A, U}.

The exact details of consistency are pursued in [Charniak and Shimony, 1990], but should be obvious from the well-known definitions of AND and OR in 3-valued logic. Note that in our DAG, an OR node is true if at least one of its parents is true, as in belief networks, but not as commonly used for search AND/OR trees. A non-root node may still be assumed true if its parents determine that it has to be true. Intuitively, an assignment is a model if the AND/OR constraints are obeyed. A node v where f(v) = T^A in an assignment is called an assumed true node relative to the assignment. Likewise for other values of f(v).

Definition 3 A model for a WAODAG is satisfying iff f(s) ∈ {T, T^A}.

The Best Selection Problem is the problem of finding a minimal cost (possibly not unique) satisfying model for a given WAODAG. The Given Cost Selection Problem is that of finding a satisfying model with cost less than or equal to a given cost. Note that in a partial model, assuming a node false is useless, as such an assumption cannot contribute towards a satisfying model.

Theorem 1 The Given Cost Selection Problem is NP-complete.

The theorem is easily proved via a reduction from Vertex Cover (see [Garey and Johnson, 1979]). We present the complete proof in [Charniak and Shimony, 1990]. The Best Selection Problem is clearly at least as hard as the Given Cost Selection Problem, because if we had a minimal cost satisfying model, we could find its cost in O(|V|), and give an answer to the Given Cost Selection Problem. Thus, the Best Selection Problem is NP-hard.

We will now make the connection between the graphs and the rule-based system. We assume that exactly all possibly relevant rule and fact instances are given. How that may be achieved is beyond the scope of this paper.
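As an illustration of these definitions, a complete assignment can be checked for satisfaction and costed as follows. This is a minimal sketch of our own, restricted to two-valued complete assignments (it ignores U and the T versus T^A distinction); all names are illustrative, not from the paper.

```python
# Minimal sketch: verify that a complete {True, False} assignment satisfies
# a WAODAG's AND/OR constraints and compute its assumption cost.  For
# simplicity we assume the root-only form, where c(v, F^A) = 0.

def is_satisfying(nodes, assignment, evidence="s"):
    """nodes: dict name -> (label, parents); label in {"AND", "OR"},
    parents is empty for roots.  assignment: dict name -> bool."""
    for name, (label, parents) in nodes.items():
        if not parents:
            continue  # roots may take any value (they are assumed)
        vals = [assignment[p] for p in parents]
        want = all(vals) if label == "AND" else any(vals)
        if assignment[name] != want:
            return False
    return assignment[evidence]  # satisfying iff the evidence node is true

def cost(assignment, assume_costs):
    """Sum c(v, T^A) over root nodes assumed true."""
    return sum(c for v, c in assume_costs.items() if assignment[v])

# Tiny example: s = AND(a, b); a and b are assumable roots.
nodes = {"a": ("OR", []), "b": ("OR", []), "s": ("AND", ["a", "b"])}
m = {"a": True, "b": True, "s": True}
print(is_satisfying(nodes, m))        # True
print(cost(m, {"a": 2.0, "b": 3.0}))  # 5.0
```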
Theorem 2 The Best Selection Problem subsumes the problem of finding a minimal cost proof for the rule-based system with assumability costs, assuming that the rule-based system is acyclic.

Informal proof: by constructing a WAODAG (i.e. constructing the graph G, and assigning labels and costs) for the rule instance set, as follows:

1. For each literal in any rule² R's LHS, construct an OR node v, and set c(v, T^A) to the cost of the literal in the system.
2. For each literal appearing only on the RHS of rules, construct an OR node v, with c(v, T^A) = ∞.
3. For each LHS of a rule R, construct an AND node v with c(v, T^A) = ∞, make it a parent of the node constructed for the literal on the RHS of R (in step 1), and make it a child of all the nodes constructed for the literals in the LHS of R.
4. Construct an AND node s, with parent nodes corresponding to the facts to be proved.

Example: Given the rule instances in the table, used for word-sense disambiguation in natural language, with rb = river-bank(bank1), sb = savings-bank(bank1), w = water(water5), p = plant(plant7), we want to explain the evidence: say(bank1) ∧ say(water5).

[Table: rule instances and literal costs]

²We assume that literals with the same name in different rules are the same literal.

Definition 4 The cost of an assignment f for a WAODAG is the sum

C = Σ_{v ∈ V} c(v, f(v))

Figure 1: WAODAG for our example.

Using the above construction we get the WAODAG in Figure 1, with the best partial model (total cost 5) shown.

Definition 5 A root-only assignment for a WAODAG is an assignment where only root nodes may be assumed (i.e. have values in {F^A, T^A}).

It is possible to force root-only assignments for a WAODAG to be globally minimal by setting the cost of all non-root nodes to infinity (in practice, it suffices to set a cost greater than the sum of the root costs). We now show that given a WAODAG D, we can create another WAODAG D', such that the semantics of minimal models is not changed, i.e. if a non-root node is selected in D, some corresponding new root node is selected in D'.

Proof: by construction, as follows (D' = D initially):

1. For each AND node v with cost c(v, T^A) < ∞ in D', construct an OR node w in D', and a new root node u, where c(u, T^A) = c(v, T^A), and make both v\w and u\w. Transfer all the children³ of v to w.
2. For each OR node v with cost c(v, T^A) < ∞, create a new root node u, with c(u, T^A) = c(v, T^A). Make u\v.
3. For all non-root nodes v in D', set c(v, T^A) = ∞.

It is clear that each time a node is selected in a minimal cost model of D to be assumed true, the node constructed from it in D' will be assumed true in some root-only minimal cost model for D'.

³If the AND node is s, create a new sink AND node, s', and make w\s'.

Definition 6 An assignment (or model) is complete iff ∀v ∈ V, f(v) ≠ U.

A variant of the Best Selection Problem is one of selecting a minimal cost complete model. Clearly, if the cost of assuming a node false is 0 for all nodes, the solution will be exactly the same as for the partial model Best Selection Problem. However, in our probabilistic semantics we intend to treat assumability costs as negative logarithms of probabilities (so that summing costs is akin to multiplying probabilities), and we want P(f(v) = F^A) = 1 - P(f(v) = T^A) to hold for all root nodes. Thus, the cost of assuming a node v false is

c(v, F^A) = -log(1 - e^{-c(v, T^A)})

Probabilistic Semantics for WAODAGs

We now provide a probabilistic semantics for the cost-based abduction system. We construct a boolean belief network out of the weighted AND/OR DAG, and show the correspondence between the solution to the Best Selection Problem and finding the most likely explanation for a given fact (or set of facts). We assume that the rule-based system is in the WAODAG format with root-only assignment.
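The cost-probability correspondence just introduced can be checked numerically. The sketch below (function names are ours) converts a prior into a cost of assuming true, c(v, T^A) = -log P(v), and derives from it the cost of assuming false.

```python
import math

# Costs as negative log priors: c(v, T^A) = -log P(v), and
# c(v, F^A) = -log(1 - P(v)) = -log(1 - exp(-c(v, T^A))).

def true_cost(p):
    """Cost of assuming v true, given prior P(v) = p."""
    return -math.log(p)

def false_cost_from_true(c_true):
    """Cost of assuming v false, recovered from c(v, T^A) alone."""
    return -math.log(1.0 - math.exp(-c_true))

p = 0.1                        # a low prior ...
c_t = true_cost(p)             # ... gives a high cost of assuming true
c_f = false_cost_from_true(c_t)
print(round(c_t, 3))           # 2.303
print(round(c_f, 3))           # 0.105  (low cost of assuming false)
```

Note how a low prior yields a low cost for assuming the node false, the condition invoked later when comparing complete and partial models.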
We now construct a belief network from the given WAODAG, and show that a minimal cost satisfying complete model for the WAODAG corresponds to a maximum-probability assignment of root nodes given the evidence in the belief network (where the evidence is exactly the set of facts to be proved using the rule system). From a WAODAG D we construct a belief network B as follows: B has exactly the nodes and arcs of D. Thus, we use the same name for a node of B and the corresponding node of D. Nodes retain their labels⁴. Each root node v in B has a prior probability of e^{-c(v, T^A)}. The node s is the "evidence node", i.e. the event of node s being true is the evidence E.

Defining an assignment for the network analogously with the WAODAG assignment, we assume, without loss of generality, that we are only interested in assignments to the set of root nodes⁵. We want to find the "best" satisfying model A, which assigns values from {T^A, F^A, U} to the set of all root nodes, i.e. the assignment that maximizes P(A | E). An assignment of U to a root node means that it is omitted from the calculation of joint probabilities, as P(v_i = U) = 1. Intuitively, we are searching for the most probable explanation for the given evidence. This can be done by running a Bayesian network algorithm for finding Bel* on the root nodes, as defined in [Pearl, 1988]. We now show the following result:

Theorem 3 In a boolean belief network B constructed as above, a satisfying complete model A that maximizes P(A | E) will also be a minimal cost satisfying complete model for D.

⁴A belief network AND node has a 1 in its conditional distribution array for the case of all parents being true, and 0 elsewhere. An OR node is defined analogously.

⁵Maximizing the probability over assignments to root nodes is equivalent to finding the MAP, when we allow only complete models, because a complete assignment for the root nodes induces a unique model for all other nodes.

Proof: In a belief network, all root nodes (given no evidence) are mutually independent. Thus, for any assignment of values to root nodes, A = (a_1, a_2, ..., a_n), where a_i = (v_i, q_i) and q_i ∈ {F^A, T^A}⁶, we have

P(a_1, a_2, ..., a_n) = P(a_1) P(a_2) ... P(a_n)

However, we also have (by definition of conditional probabilities):

P(A | E) = P(E | A) P(A) / P(E)

But as P(E | A) = 1 when the assignment is a satisfying model (because all nodes are strict OR and AND nodes), and 0 otherwise, and P(E) is a constant, we can eliminate everything but P(A) from the maximization. Also, we have:

P(A) = Π_{i=1}^{n} P(a_i) = Π_{i=1}^{n} e^{-c(a_i)} = e^{-Σ_{i=1}^{n} c(a_i)}

Since e^x is monotonically increasing in x, we see that maximizing P(A | E) is equivalent to minimizing the cost of the assignment. Q.E.D.

We now generalize the DAG so that nodes can have any gating function⁷. The definition of a model is extended in the obvious way.

Theorem 4 Given a gate-only belief network, with a single evidence node, the problem of finding the most probable complete satisfying model given the evidence is equivalent to finding a minimal cost complete model for the weighted gated DAG.

Proof: The proof of Theorem 3 relies only on the fact that the probability of the evidence given a satisfying model is 1, and 0 given any other complete model. Thus, we can use exactly the same proof here.

If we want to find the best partial model, and are only interested in satisfying models, the above theorem still holds. It is no longer true, however, that finding the minimum cost model is equivalent to finding the MAP over the entire belief net.

Evaluation of Cost Based Abduction

In essence, the theorems of the last two sections serve to define not one scheme of cost-based abduction, but four. The most obvious distinction is between Theorems 3 and 4. The first restricts itself to WAODAGs, while the second generalizes to arbitrary gating functions. There are two things to keep in mind about this distinction.
First, if one is using a standard theorem prover, as found in typical rule-based systems, the theorem prover will itself only generate AND/OR DAGs. So unless one is willing to add capabilities to the theorem prover, arbitrary gates are pointless. Also, since AND/OR gates do not require the labels F^A and F, they are marginally simpler to implement. (However, as we will note in the section on implementation, we found that we needed the capabilities of arbitrary gates for our domain.)

⁶Additionally, we use the a_i's to denote the event of node v_i having value q_i.

⁷Gate nodes are any probabilistic nodes which have only entries of 1 and 0 in their conditional distribution arrays.

The second distinction is between partial and complete models. The distinction here is whether we simply assign costs to those facts which we must assume true (or false) to make the proof work, or go on to make decisions about every fact in the domain, whether or not it plays a role in the proof. Intuitively the former makes more sense. Unfortunately, our theorem (that we would get the MAP assignment) is only true for complete models, and counterexamples exist for non-complete models. Also, the minimal cost complete model will not agree, in general, with the minimal cost partial model, even if we compare only the sets of nodes assumed. However, the minimal solution of the complete model problem will be a nearly minimal solution for the partial model, provided a) there is no other complete model with nearly the same cost as the minimum, and b) the cost for assuming nodes false is low. These are reasonable assumptions in many cases. For example, the probabilistic semantics presented in [Charniak and Goldman, 1988] is characterized by low prior probabilities, thus low costs for assuming nodes false. Lastly, a further word is required about the relation between the costs and the probabilities.
In our definition of cost-based abduction there were three sorts of entities which received costs: rules, root nodes, and interior nodes. By the time we reached Theorems 3 and 4, however, we had reduced this to one, by showing that rule and interior node costs could be replaced by added root node costs. But how do these transformations affect our interpretation of what the costs mean in terms of probabilities? Things are most obvious for root nodes. As stated earlier, the cost of assuming a root node must be -log(P(node)). For rules and interior nodes, however, things are slightly more complex. The cost of a rule got moved to the cost of a new root node. Suppose for rule R we add the new root node R'. We can, of course, say that the cost of a rule must be -log(P(R')). However, R' does not correspond to anything in our model of the world (it is merely a mathematical fiction designed to make the proof simpler). We need to define the cost in terms of elements of our world model. Thus, suppose R is a rule with consequent q, and denote the AND node corresponding to its left-hand side, with the added cost root attached, as A_{R'}. We will refer to the AND node without the added cost root attached as A_R. Suppose that the other rules which can prove q are R_1, ..., R_n, with the corresponding AND nodes (without attached cost roots) A_{R_1}, ..., A_{R_n}. It is easy to see that

P(q | A_R ∧ ¬A_{R_1} ∧ ... ∧ ¬A_{R_n}) = P(R')

That is, the cost of a rule is minus the log probability of its consequent being true given that a) its antecedent is true, and b) none of the other ways of proving (or assuming) the consequent are true. Analogously, we can show that the cost of assuming an interior OR node v is -log(P(v | all the ways of proving it are false)). Finally, since in practice there is never a need for assuming an AND node, we will ignore it here. The above analysis only holds when we are considering complete assignments.
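For complete assignments, the equivalence of Theorems 3 and 4 can be illustrated by brute force on a toy example. The network structure and the numbers below are our own, purely for illustration: three independent assumable roots, with the evidence a gate over them.

```python
import itertools
import math

# Brute-force illustration of Theorem 3/4 on a toy network: among all
# complete root assignments consistent with the evidence, the one that
# minimizes summed costs also maximizes the joint probability.

roots = {"r1": 0.2, "r2": 0.6, "r3": 0.3}    # prior P(v) for each root

def evidence_holds(a):
    # the evidence gate: s = OR(AND(r1, r2), r3)
    return (a["r1"] and a["r2"]) or a["r3"]

def cost(a):
    # c(v, T^A) = -log P(v); c(v, F^A) = -log(1 - P(v))
    return sum(-math.log(p if a[v] else 1 - p) for v, p in roots.items())

def prob(a):
    # roots are mutually independent, so the joint is exp(-total cost)
    return math.exp(-cost(a))

satisfying = [dict(zip(roots, vals))
              for vals in itertools.product([True, False], repeat=len(roots))
              if evidence_holds(dict(zip(roots, vals)))]

best_by_cost = min(satisfying, key=cost)
best_by_prob = max(satisfying, key=prob)
print(best_by_cost == best_by_prob)   # True: min cost picks the MAP model
```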
When partial assignments are allowed, a case may be made for setting the cost of assuming an interior OR node to -log(P(v)), because if none of its parents are assigned (i.e. we are not proving the node, just assuming it), then presumably the probability of the node reverts to its prior probability. Given that the notion of partial MAPs is not well defined in the literature, we defer the solution to this problem to future research. We believe, however, that decision theoretic methods may have to be applied in the latter case.

Appelt's Semantics for Hobbs-Stickel

In [Hobbs and Stickel, 1988], Hobbs and Stickel have proposed a scheme very similar to what we have proposed. In their scheme, the initial facts to be explained are each assigned an assumption cost c_i. All inference rules are of the form:

p_1^{w_1} ∧ ... ∧ p_n^{w_n} → q

The costs of the p_i's are then given by cost(p_i) = w_i cost(q). However, if two separate portions of the proof tree require the same assumption, the proof is only charged once for the assumption (and is charged the minimum of the two costs being charged for the fact). Hobbs and Stickel point out that C = Σ_{i=1}^{n} w_i does not have to equal 1. If C < 1 then the system will prefer to assume p_1, ..., p_n, since that will be less expensive than assuming q. They refer to this as most-specific abduction. On the other hand, if C > 1, then, everything else being equal, the system will tend to just assume q (least-specific abduction). Note, however, that even with least-specific abduction, cost sharing on common assumptions can make a more specific scenario cost less. Hobbs and Stickel believe that least-specific abduction (with cost sharing) is the way to go, at least for the abductive problems they are concerned with (natural language comprehension). In general we agree with this assessment. As we have noted, Hobbs and Stickel did not give a semantics for their weights, and Appelt in [Appelt, 1990] is concerned with overcoming this deficit.
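The Hobbs-Stickel costing just described can be sketched in a few lines. This is a hypothetical sketch of our own; the function names and the example weights and costs are illustrative, not taken from their paper.

```python
# Sketch of Hobbs-Stickel weighted abduction costs: backchaining on q with
# rule p1^{w1} & ... & pn^{wn} -> q prices each antecedent at w_i * cost(q);
# a proof pays for each assumption once, at the minimum price offered.

def antecedent_costs(cost_q, weights):
    """Price each antecedent p_i at w_i * cost(q)."""
    return {p: w * cost_q for p, w in weights.items()}

def proof_cost(charged):
    """charged: (assumption, cost) pairs gathered from all proof branches.
    Each assumption is paid once, at the minimum cost charged for it."""
    best = {}
    for name, c in charged:
        best[name] = min(c, best.get(name, float("inf")))
    return sum(best.values())

costs = antecedent_costs(10.0, {"p1": 0.6, "p2": 0.6})   # C = 1.2 > 1
print(costs)    # {'p1': 6.0, 'p2': 6.0}: assuming q directly (10) is cheaper,
                # i.e. least-specific abduction wins, everything else equal...
# ...unless cost sharing intervenes: if another branch also needs p1 and
# charges only 4.0 for it, the more specific proof costs 4 + 6 = 10, a tie.
print(proof_cost([("p1", 6.0), ("p1", 4.0), ("p2", 6.0)]))  # 10.0
```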
Appelt takes as his starting point Selman and Kautz's theory of default reasoning [Selman and Kautz, 1989], called model preference theory. In this theory a default rule p → q is interpreted as meaning that among the models in which p is true, the models in which q is also true are to be preferred. Appelt, in the spirit of abduction, reverses this by saying that a rule p^{w_p} → q, where w_p < 1, is to be interpreted as a preference, among those models which have q, for those which have p as well. However, it is possible to have another rule f^{w_f} → s (where w_f < 1), but where p and f are not compatible. Thus if s is also in our model we must choose which rule to use. Appelt specifies that if w_p < w_f then use p → q, and vice versa. Appelt calls this scheme weighted abduction.

The most obvious difference between the Hobbs-Stickel scheme and weighted abduction is that the former uses rules of the form p_1^{w_1} ∧ ... ∧ p_n^{w_n} → q, where weighted abduction only has rules of the form p^{w_p} → q. We assume that what Appelt has in mind is recasting the Hobbs-Stickel rules as (p_1 ∧ ... ∧ p_n)^{w_p} → q, with w_p = Σ_i w_i. Assuming this is correct, there are two immediate problems with Appelt's semantics. First, it only handles the case where the sum of the w_i's is less than 1. This is what Hobbs and Stickel call most-specific abduction. But as they note, least-specific abduction seems to be the more important case, and Appelt says nothing about it⁸. Secondly, Appelt can give no obvious guidance on how to apportion w_p into the individual w_i's required by Hobbs-Stickel. But even if we restrict ourselves to finding the single w_p, and also restrict ourselves to most-specific abduction, it would seem that Appelt's semantics gives little guidance when trying to judge if a number is "right". Suppose we have a system with a group of w's for various rules {w_a, w_b, ..., w_y}, and we now want to add a rule Z, and we want to know what w_z should be.
Following Appelt we will look for models in which the rules A, B, ..., Y are used but where the rule Z conflicts with their use. In each case we see which rule takes precedence. But this is not sufficient. Suppose we have a model in which A and B are used but where Z makes both unusable. Since costs are additive we have several possibilities: {A + B > Z, but A < Z}, {B < Z but A + B < Z}, etc. Depending on the numbers, in some cases we should prefer using rules A and B together over Z, in other cases not. But this is not sufficient either: what about A + C, and A + B + C, and B + C, etc.? In fact, the number of models which need to be checked grows exponentially with the number of costs in the system.

⁸In his talk at the Symposium on Abduction 1990, Appelt extended the scheme for some cases where the weights sum to more than 1. We doubt, however, whether this extension can be generalized.

Implementation

As we have shown in this paper, anything that can be done with cost-based abduction can also be done by probabilistic methods. One might then ask, why not use the more standard probability theory? The reason would have to be that the standard probabilistic methods are computationally expensive. Evaluating general belief nets is NP-hard. Unfortunately, as we saw in Theorem 1, so is the minimal cost proof problem. However, as the mathematics is quite a bit simpler for minimal cost proofs, one might hope that the constant in front is a lot less. Also, thinking in terms of minimal cost proofs suggests different ways of looking for solutions. There are well known techniques for doing best-first search on AND/OR search spaces, and this looks like an obvious application. We have implemented a best-first search scheme for finding minimal cost proofs.
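Such a scheme might be sketched as follows. This is a minimal, hypothetical sketch of our own, not the paper's implementation: it uses the trivial cost-so-far heuristic described below, rules are already in zero-cost form (rule costs folded into assumable literals), and the rule encoding and names are illustrative.

```python
import heapq

# Minimal best-first search for a low-cost proof.  Each state is the set of
# goals still to be established; a goal may be assumed at a given cost, or
# backchained through any rule whose consequent it is (rule cost 0).

def best_first_proof(goals, rules, assume_cost):
    """rules: dict goal -> list of alternative antecedent lists.
    assume_cost: dict goal -> cost of assuming it outright.
    Returns the cost of the cheapest proof found, or None."""
    frontier = [(0.0, tuple(sorted(goals)))]   # (cost so far, open goals)
    seen = set()
    while frontier:
        cost, remaining = heapq.heappop(frontier)
        if not remaining:
            return cost                        # every goal proved or assumed
        if remaining in seen:
            continue
        seen.add(remaining)
        g, rest = remaining[0], remaining[1:]
        # option 1: assume g, paying its assumption cost
        heapq.heappush(frontier, (cost + assume_cost.get(g, float("inf")),
                                  tuple(sorted(rest))))
        # option 2: backchain on each rule for g (shared subgoals merge,
        # so a common assumption is only charged once)
        for body in rules.get(g, []):
            heapq.heappush(frontier,
                           (cost, tuple(sorted(set(rest) | set(body)))))
    return None

rules = {"s": [["p", "q"]], "q": [["r"]]}
print(best_first_proof(["s"], rules, {"p": 2.0, "q": 5.0, "r": 1.0}))  # 3.0
```

In the example, proving s via p ∧ q and q via r, then assuming p (2.0) and r (1.0), beats assuming q directly (which would cost 2.0 + 5.0).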
We have applied it to the work described in [Charniak and Goldman, 1988], which uses belief nets to make abductive decisions on problems that come up in natural language understanding, such as noun-phrase reference or plan recognition. Thus, we had a ready-made source of networks and probabilities upon which we could test the system, and a benchmark (the speed of our current probabilistic methods). The approach seems to have some promise. We were able to adapt our networks to the new scheme with little difficulty. (Of the four versions of cost-based abduction mentioned above, we found that complete models did not seem to be required for our networks, but we did need general gates, and not just WAODAGs.) Because we did not use MAP labelings in our earlier work, but rather decided what to believe on the basis of the probability of individual statements, we simulated this by looking for multiple MAP solutions when there were several within some epsilon of each other, and only believing the statements which were in all of them. For the examples we have tried it on, this gave the same results as our belief network calculations. After minor tuning, our cost algorithm seems to be running significantly faster than the belief-network updating scheme we were using. However, we have not yet done formal timing comparisons between them, and at the moment both the cost algorithm and the belief-network updating scheme are very sensitive to efficiency measures. Thus it is unclear how much the timing will prove. It does, however, seem safe to say that cost-based abduction deserves serious consideration. We should note that this speed-up occurs despite the fact that we do not have a good admissible heuristic cost estimator for our best-first search. We are using the simplest heuristic imaginable: at any point in a partial proof, we assume that the final cost of the proof will be the costs incurred to date.
This is a very poor estimator because the bulk of a proof's cost comes from the root assumptions, and they are not found until the end. We are currently exploring other estimator possibilities, such as logic minimization approaches and factoring in assumption costs earlier in the search.

Conclusion

We have shown that cost-based abduction can be given an adequate semantics based upon probability theory, and that under the appropriate circumstances it is guaranteed to find the best MAP assignment of truth values to the propositions in our theory. Initial experiments with the model show that it produces results consistent with full-blown posterior probability calculation, and does so quickly compared to our current probabilistic methods. However, further experimentation is clearly required.

References

[Appelt, 1990] Douglas E. Appelt. A theory of abduction based on model preference. In Proceedings of the AAAI Symposium on Abduction, 1990.

[Charniak and Goldman, 1988] Eugene Charniak and Robert Goldman. A logic for semantic interpretation. In Proceedings of the AAAI Conference, 1988.

[Charniak and Shimony, 1990] Eugene Charniak and Solomon E. Shimony. Probabilistic semantics for rule based systems. Technical Report CS-90-02, Computer Science Department, Brown University, February 1990.

[Garey and Johnson, 1979] M. R. Garey and D. S. Johnson. Computers and Intractability, A Guide to the Theory of NP-completeness, page 190. W. H. Freeman and Co., 1979.

[Genesereth, 1984] Michael R. Genesereth. The use of design descriptions in automated diagnosis. Artificial Intelligence, pages 411-436, 1984.

[Hobbs and Stickel, 1988] Jerry R. Hobbs and Mark Stickel. Interpretation as abduction. In Proceedings of the 26th Conference of the ACL, 1988.

[Kautz and Allen, 1986] Henry A. Kautz and James F. Allen. Generalized plan recognition. In Proceedings of the Fifth Conference of AAAI, August 1986.

[McDermott, 1987] Drew V. McDermott. Critique of pure reason.
Computational Intelligence, 3:151-60, November 1987.

[Pearl, 1988] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.

[Selman and Kautz, 1989] Bart Selman and Henry Kautz. The complexity of model-preference default theories. In Reinfrank et al., editors, Non-Monotonic Reasoning, pages 115-130. Springer Verlag, Berlin, 1989.