From: FLAIRS-00 Proceedings. Copyright © 2000, AAAI (www.aaai.org). All rights reserved.

Independence Semantics for BKBs

Solomon Eyal Shimony
Dept. of Comp. Sci.
Ben Gurion University
Beer-Sheva 84105, ISRAEL
shimony@cs.bgu.ac.il

Eugene Santos Jr.
Dept. of Comp. Sci. and Eng.
University of Connecticut
Storrs, CT 06269
eugene@engr.uconn.edu

Tzachi Rosen
Dept. of Comp. Sci.
Ben Gurion University
Beer-Sheva 84105, ISRAEL
tzachi@cs.bgu.ac.il

Abstract

Bayesian Knowledge Bases (BKBs) are a rule-based probabilistic model that extends Bayes networks (BNs) by allowing context-sensitive independence and cycles in the directed graph. BKBs have probabilistic semantics, but lack independence semantics, i.e., a graph-based scheme determining what independence statements are sanctioned by the model. Such a semantics is provided here through generalized d-separation, by constructing an equivalent BN. While useful for showing correctness, the construction is not practical for decision algorithms due to exponential size. Some results for special cases, where independence can be determined from polynomial-time tests on the BKB graph, are presented.

Introduction

Managing uncertainty in complex domains is a difficult task, especially during knowledge acquisition and verification and validation. Approaches range from fuzzy logics to probabilistic networks (Nilsson 1986; Zadeh 1983; Pearl 1988; Thagard 1989; Dempster 1968; Shortliffe & Buchanan 1975; Shafer 1979; Heckerman 1991; Bacchus 1990). The difficulty lies in creating a knowledge representation with the right blend of flexibility and sound semantics. For the human expert and knowledge engineer, flexibility and intuitiveness ease the acquisition and organization of knowledge for the target domain. On the other hand, sound and formal semantics prevent confusion concerning the interaction between the different sources of uncertainty.

Most agree that encoding knowledge in terms of logical "if-then" style rules is the simplest and most intuitive approach to organization (Buchanan & Shortliffe 1984). Probability theory has been an accepted language both for the description of uncertainty and for making inferences from incomplete knowledge. However, the general language of probabilities is too unconstrained, making it hard to organize information. Without additional knowledge such as independence conditions, the various sources of uncertainty cannot be resolved or combined.

Bayesian Knowledge Bases (BKBs (Santos & Santos 1999)) are a rule-based probabilistic model that is a generalization of the well-known Bayes networks (BNs (Pearl 1988)). BKBs extend the BN model in two ways: by naturally allowing for context-sensitive independence, and by permitting cycles in the directed graph. These generalizations of Bayes networks are necessary when we need to model populations (or sample spaces) where the causal mechanism varies across the population. Several models in the literature permit such context-sensitive independence, using rules (Poole 1993; Shimony 1993; 1995), trees (Boutilier et al. 1996), or other methods (Geiger & Heckerman 1991). In addition to being intuitive, these schemes allow for a more compact model, and for specialized reasoning algorithms that improve reasoning speed (Boutilier et al. 1996; Shimony & Santos 1996).
When the direction of the causal chain depends on certain variable values, this creates cycles in the dependency graph that cannot be handled by existing schemes - except by lumping the variables together, or by using undirected models - neither approach preserving the intuitive causal structure.

BKBs were originally presented in (Santos & Santos 1999), and given a semantics in the form of a default probability distribution in (Rosen, Shimony, & Santos Jr. 2000); the most glaring deficiency from the point of view of the probabilistic reasoning community is the lack of independence semantics for BKBs. The question we need to answer is "what kind of independence statements between variables (or their instantiations) does the graph structure of the model sanction?" The cycles in the BKB correlation graph make it extremely hard to answer the question: applying the Bayes network d-separation criterion directly is useless. Nevertheless, d-separation is a powerful notion, and here we suggest using it in an indirect manner in two ways: 1) using d-separation on a Bayes network constructed from the BKB in a manner that preserves independence (used for showing correctness), and 2) a context-based d-separation criterion that can be used directly on the BKB, used for determining independence in more practical algorithms.

We begin by reviewing definitions and semantics of BKBs. Next we show the construction of the Bayes network that preserves the distribution and independence semantics, and show the correspondence between the above two types of d-separation. We conclude with results on special cases where independence statements can be tested efficiently in the BKB correlation graph.

Background

A Bayesian knowledge-base (abbrev. BKB) represents objects/world states and the relationships between them using a directed graph. The graph consists of nodes, which denote various random variable instantiations, and edges, which represent conditional dependencies between them (definitions repeated from (Santos & Santos 1999)). An equivalent definition via probabilistic rules is sometimes easier to work with, and the terms are used synonymously throughout. BKBs capture a finer level of conditional independence than BNs (Pearl 1988) - where appropriate, we note the correspondence between these models.

Definition 1 A correlation-graph is a directed graph G = (I ∪ S, E) such that I and S are disjoint, and E ⊆ {I × S} ∪ {S × I}. Furthermore, for all s ∈ S, (s, v) and (s, v′) are in E if and only if v = v′. I ∪ S are the nodes of G and E are the edges of G.

A node in I is called an instantiation-node (abbrev. I-node) and a node in S is called a support-node (abbrev. S-node). I-nodes represent the various instantiations of random variables (abbrev. r.v.s), that is, assignments of a value to a random variable.
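As a concrete reading of Definition 1, the following minimal Python sketch (our own illustration; the names INode, CorrelationGraph, and add_rule are hypothetical and not from the paper) stores a correlation graph rule by rule, enforcing by construction that each S-node has exactly one outgoing edge, to its single consequent I-node.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class INode:
    """An instantiation node: an assignment of one value to one random variable."""
    var: str     # the random variable, e.g. "Y"
    value: str   # the assigned value, e.g. "1"

@dataclass
class CorrelationGraph:
    """G = (I ∪ S, E) with E ⊆ {I × S} ∪ {S × I}; S-nodes are plain string ids.

    Definition 1's constraint that (s, v), (s, v′) ∈ E iff v = v′ is enforced
    by storing a single consequent I-node per S-node.
    """
    i_nodes: set = field(default_factory=set)
    s_nodes: set = field(default_factory=set)
    antecedents: dict = field(default_factory=dict)  # S-node -> set of I-node parents
    consequent: dict = field(default_factory=dict)   # S-node -> its unique I-node child

    def add_rule(self, s, ants, conseq):
        """Add S-node s with its incoming I-node edges and single outgoing edge."""
        self.s_nodes.add(s)
        self.i_nodes |= set(ants) | {conseq}
        self.antecedents[s] = set(ants)
        self.consequent[s] = conseq

# For instance, the rule R3 discussed below (for P(Y = 1 | X = 0, Z = 0)):
g = CorrelationGraph()
g.add_rule("R3", {INode("X", "0"), INode("Z", "0")}, INode("Y", "1"))
```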
S-nodes, on the other hand, explicitly embody the relationships (conditional dependencies) between the I-nodes. See Figure 1 - filled circles represent S-nodes, ovals represent I-nodes.

[Figure 1: Example correlation graph]

Let π be a partition on I. Each cell in π denotes the set of I-nodes (instantiations) which belong to a single r.v. and are mutually exclusive instantiations. In BKBs, we can represent random variables with discrete but multiple instantiations. In Figure 1, one cell in π would be {U = 0, U = 1}, which are two instantiations of the r.v. U. Thus, the set of I-nodes in a partition cell corresponds to a single node in a Bayes network. A set of I-nodes that contains at most one I-node in each partition cell (i.e., for each r.v.) is called a state (w.r.t. π). A state that contains exactly one I-node for each r.v. in a set of variables X is complete for X (resp., for a partition π). The set of variables assigned in a correlation-graph segment, rule, set of rules, or I-node is called the span of that object (denoted span(·)).

Definition 2 G is said to respect π if
• for any S-node s ∈ S, the predecessor I-nodes of s assign at most one instantiation to each r.v., and
• for any two S-nodes s1 ≠ s2 in S that have the same immediate descendant v, there exists an I-node predecessor of s1 whose r.v. instantiation contradicts an I-node predecessor of s2. Nodes s1 and s2 are said to be mutually exclusive.

An S-node represents a direct conditional dependency between the single immediate I-node descendant of the S-node (also called the consequent) and the immediate I-node predecessors (also called the antecedent) (see Figure 1), and corresponds to a conditioning case, or a conditional-probability table (CPT) entry, in a Bayes network. The value attached to the S-node R3 in the figure represents the conditional probability P(Y = 1 | X = 0, Z = 0) = 0.6. Priors are denoted by S-nodes without inputs, as shown in Figure 1, S-node R4. The subgraph consisting of an S-node s, its incident edges, its immediate neighbors, and the attached conditional probability is called a conditional probability rule (CPR). A set of rules is said to be mutually exclusive (w.r.t. a partition π) if their correlation graph respects π.

The conditions in Definition 2 assure that conditional dependencies are meaningful (see (Santos & Santos 1999)). The first condition prevents conditioning on a self-contradictory event, i.e., P(X = x | ..., Y = y, ..., Y = y′, ...) where y ≠ y′. The second condition does not allow the model to specify conditioning events that overlap in probability space for the same I-node (e.g., the events {X = 0} and {Y = 0} have an "overlap", or conjunction, {X = 0, Y = 0}), in turn preventing local inconsistency (Santos & Santos 1999).

Definition 3 A Bayesian knowledge-base K is a 3-tuple (G, w, π) where G = (I ∪ S, E) is a correlation-graph, w is a function from S to [0, 1], π is a partition on I, and G respects π. Furthermore, for each s ∈ S, w(s) is the weight¹ of s. Let τ = (I′ ∪ S′, E′) be some subgraph of our correlation-graph G = (I ∪ S, E), where I′ ⊆ I, S′ ⊆ S, and E′ ⊆ E. Then τ has a weight w(τ) defined as follows: w(τ) = ∏_{s ∈ S′} w(s).

¹Equivalently, we can define a BKB by a mutually exclusive set of rules R (which define the correlation graph and weights - the weight of a rule is the weight of its S-node) over a set of variables X (which defines the partition) - these notions are used interchangeably.

Definition 4 An I-node v ∈ I′ is said to be well-supported in τ if there exists an edge (s, v) in E′. Furthermore, τ is said to be well-supported if all I-nodes in I′ are well-supported. Each I-node must have an incoming S-node in τ.

Definition 5 An S-node s ∈ S′ is said to be well-founded in τ if for all (v, s) ∈ E, (v, s) ∈ E′. Furthermore, τ is said to be well-founded if all S-nodes in S′ are well-founded. If an S-node s is present in τ, then all incoming I-nodes (conditions) to s in G must also be in τ.

Definition 6 An S-node s ∈ S′ is said to be well-defined in τ if there exists an edge (s, v) ∈ E′. Furthermore, τ is said to be well-defined if all S-nodes in S′ are well-defined. Each S-node in τ must support some I-node in τ.
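The two conditions of Definition 2 are directly machine-checkable. Below is a sketch (our illustration, building on the hypothetical CorrelationGraph above; the function names are ours) that tests whether a correlation graph respects the partition induced by grouping I-nodes of the same variable.

```python
from itertools import combinations

def assigns_consistently(i_nodes):
    """True iff the set assigns at most one instantiation to each r.v."""
    seen = {}
    for v in i_nodes:
        if seen.setdefault(v.var, v.value) != v.value:
            return False
    return True

def contradict(a, b):
    """True iff some r.v. is assigned different values in a and b."""
    assign_a = {v.var: v.value for v in a}
    return any(v.var in assign_a and assign_a[v.var] != v.value for v in b)

def respects_partition(g):
    """Definition 2, with the partition implicit in the I-nodes' .var fields."""
    # Condition 1: no S-node conditions on a self-contradictory event.
    if not all(assigns_consistently(ants) for ants in g.antecedents.values()):
        return False
    # Condition 2: S-nodes sharing a consequent I-node must be mutually
    # exclusive, i.e., their antecedents must contradict on some r.v.
    for s1, s2 in combinations(g.s_nodes, 2):
        if (g.consequent[s1] == g.consequent[s2]
                and not contradict(g.antecedents[s1], g.antecedents[s2])):
            return False
    return True
```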
Definition 7 τ is said to be an inference over K if τ is well-supported, well-founded, well-defined, acyclic, and the set of I-nodes of τ is a state w.r.t. π. Furthermore, τ is said to be a complete inference over K if τ's I-nodes are a complete state for π. Inference τ is said to be maximal if no proper superset of τ is an inference.

Definition 8 A node v is said to be grounded in a correlation graph G if there exists an inference τ ⊆ G such that v ∈ τ. A CPR is grounded in a correlation graph G if its S-node is grounded in G.

The existence of a probability distribution for a BKB is assured by requiring normalization. A normalized set of CPRs is one for which the extenders of all its inferences are normalized. (Henceforth, we will assume a given BKB, and unless otherwise specified, all CPRs are taken from its correlation graph, all variables are from the set of BKB variables X, etc.)

Definition 9 Let R be a CPR. R is called an extender of inference I if R ∉ I and I ∪ {R} is an inference. A set of CPRs R is called complementary w.r.t. an inference I and a variable X if each of them extends I and their consequent variable is X, but no two of them have the same I-node as a consequent. R is called complete (for X) if the consequents include all possible instantiations of X.

For example, the CPRs R2 and R6 in Figure 1 are complementary w.r.t. the inference I and the variable Y. In a mutually exclusive set of CPRs, an inference I has, w.r.t. any variable X, a unique maximal complementary set of CPRs, denoted mcs(I, X).

Definition 10 Let C be a complementary set of CPRs w.r.t. an inference I and a variable X, and let W(C) = ∑_{R ∈ C} w(R). C is called normalized w.r.t. I and X if W(C) ≤ 1, and W(C) = 1 when C is a complete complementary set w.r.t. I and X. A set of CPRs (or correlation graph) is normalized if, for every inference, all complementary sets of rules are normalized.

Definition 11 The state of an inference I (denoted st(I)) is the set of I-nodes in its correlation graph. I is called relevant to a state S if st(I) ⊆ S. I is the maximal relevant inference (MRI) w.r.t. a state S if it is the (setwise) greatest inference relevant to S (if mutual exclusion holds, it is unique).

As an example, the state of the inference I in Figure 1 is {Z = 1, U = 1}, and that of K is {X = 0, Y = 1, Z = 0, T = 0, U = 0}. I is relevant, for instance, to the states {X = 0, Z = 1, U = 1} and {X = 0, Y = 1, Z = 1, T = 0, U = 1}, and K is an MRI for the complete state {X = 0, Y = 1, Z = 0, T = 0, U = 0, V = 0}.

Definition 12 The composite state of an inference, denoted C(I), is the set of complete states to which I is relevant. The dominated composite state of an inference, denoted C_D(I), is the set of complete states for which I is the maximal relevant inference.

Definition 13 Let X′ be the set of variables not assigned in inference I. The dominated weight of I is:
w_D(I) = w(I) · ∏_{X ∈ X′} [1 − ∑_{R ∈ mcs(I,X)} w(R)].

The probability of a complete state S can be derived from the dominated weight of the most relevant inference to S, as follows:

Definition 14 Let K be a normalized BKB over variables X, and p a function from the set of all possible complete states for X into [0, 1]. Then p is consistent with K (denoted K ⊨ p) if for each inference I in the correlation graph of K,
∑_{S ∈ C(I)} p(S) = w(I).
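The dominated-weight formula in Definition 13 is reconstructed from the surrounding definitions: the bracketed factor, read as the probability mass left over when the maximal complementary set mcs(I, X) fails to fire, is our interpretation of the garbled original. Under that assumption, a small Python sketch (function name ours):

```python
def dominated_weight(w_inference, mcs_weights):
    """w_D(I) = w(I) * prod over unassigned X of [1 - sum of w(R), R in mcs(I, X)].

    w_inference: w(I), the product of I's rule weights (Definition 3).
    mcs_weights: maps each unassigned variable X to the list of weights of
                 the rules in mcs(I, X); an empty list means no extender on X.
    """
    out = w_inference
    for weights in mcs_weights.values():
        total = sum(weights)
        # Definition 10 (normalization) guarantees total <= 1.
        out *= 1.0 - total
    return out

# If mcs(I, V) is complete (weights sum to 1), every complete state extending
# I through V has a strictly larger relevant inference, so I dominates none:
print(dominated_weight(0.3, {"T": [0.2, 0.4], "V": [0.7, 0.3]}))  # 0.3*0.4*0.0 = 0.0
```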
Function p is called the default distribution of K if K ⊨ p and, for each inference I and complete states S, S′ ∈ C_D(I), we have p(S) = p(S′). If the BKB is incomplete (normalization holds with the sum of rule weights being less than 1), there will be more than one consistent distribution - the default distribution is a method for spreading out the remaining probability mass uniformly.

Theorem 1 Let K be a normalized BKB over X, and p the default distribution of K. Then p is a joint probability distribution over X.

In what follows, we will assume that the BKB has consequent-completeness: if there is a rule that can deduce an I-node (that assigns a value to variable X) from some antecedent state, then all other values of X may be deduced from the same antecedents. We will also assume that all rules and I-nodes are grounded, and that all maximal inferences are complete. These assumptions are tantamount to assuming that the distribution is completely specified without resorting to defaults, i.e., that there is only one function p consistent with K.

BN equivalent to a BKB

It is easy to construct a Bayes network that has the same distribution as a BKB (e.g., a single node with an exponential-size domain), but doing so in a manner that preserves independence information is non-trivial. Our constructed Bayes network has a separate sub-graph corresponding to each inference in the BKB. The resulting graph is acyclic due to the fact that BKB inferences are acyclic. This entails that multiple (possibly an exponential number of) Bayes-network nodes are needed to represent each and every I-node and S-node (one for every inference in which it appears).

The Bayes network B representing BKB K is constructed as follows. First, construct a sub-network B_I for each maximal BKB inference I (the partial functions f, g below map to a new, unique Bayes-network node for each possible value of their argument(s)):
1. For each variable (partition cell) X in K, create a continuous-valued node f(X).
2. For each I-node v ∈ I, construct a binary-valued BN node f(v, I).
3. For each S-node s ∈ I, construct a binary-valued BN node f(s, I).
4. For each edge e = (s, v) ∈ I, where s is an S-node and v is an I-node, make a BN edge (f(s, I), f(v, I)). Let the conditional distribution be P(f(v, I) = T | f(s, I) = T) = 1, P(f(v, I) = T | f(s, I) = F) = 0 (this is a deterministic dependency).
5. For each edge e = (v, s) ∈ I, where v is an I-node and s is an S-node, make a BN edge (f(v, I), f(s, I)).

For every S-node s ∈ I, let f(s, I) have an additional "enabling" edge (f(X), f(s, I)), where X is the variable assigned by the consequent of the rule represented by s. Node f(s, I) is a deterministic AND. The role of each f(X) is to simulate the randomization process in a scheme that performs importance sampling on a BKB (Rosen, Shimony, & Santos Jr. 2000). Each f(X) has a uniform distribution over the real interval domain [0, 1], with sub-intervals corresponding to each rule - such that each rule f(s, I) is "enabled" with probability equal to w(s), and with the sub-intervals set so as to ensure that rules with the same antecedent but different assignments to X have disjoint sub-intervals. Nodes f(s, ·) (i.e., the same S-node but generated for different inferences) are all assigned the same interval, and thus represent the same event in sample-space.

Next, we glue all the multiple "clones" of the I-nodes together, by adding a binary-valued node g(v) for every I-node v in K. We add edges (f(v, I), g(v)) for every I-node v and inference I.
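To make the construction concrete, here is a minimal Python sketch (our own illustration; the Inference record and build_equivalent_bn are hypothetical names, and I-nodes are the INode records from the first sketch). It emits only the node and edge skeleton of B for steps 2-5 plus the gluing step; the f(X) interval CPTs, the deterministic AND/OR tables, and the mutual-exclusion clique described next are omitted.

```python
from dataclasses import dataclass

@dataclass
class Inference:
    name: str         # identifier of the maximal inference I
    i_nodes: set      # I-nodes appearing in I
    s_nodes: set      # S-nodes appearing in I
    consequent: dict  # S-node -> its consequent I-node (to find the enabling f(X))
    edges: set        # the correlation-graph edges of I, as (parent, child) pairs

def build_equivalent_bn(maximal_inferences):
    """Skeleton of B: clone each inference's sub-graph, then glue the clones."""
    nodes, edges = set(), set()
    for inf in maximal_inferences:
        for v in inf.i_nodes:                       # step 2
            nodes.add(("f", v, inf.name))
        for s in inf.s_nodes:                       # step 3, plus the enabling edge
            nodes.add(("f", s, inf.name))
            x = inf.consequent[s].var               # variable assigned by s's consequent
            nodes.add(("fX", x))                    # step 1: one f(X) shared by all clones
            edges.add((("fX", x), ("f", s, inf.name)))
        for (a, b) in inf.edges:                    # steps 4-5: copy I's edges
            edges.add((("f", a, inf.name), ("f", b, inf.name)))
        for v in inf.i_nodes:                       # gluing: f(v, I) -> g(v)
            nodes.add(("g", v))
            edges.add((("f", v, inf.name), ("g", v)))
    return nodes, edges
```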
Additionally, for each variable X, connect all the nodes in g(X) (shorthand for {g(x) | x an instantiation of X}) by a directed acyclic clique that assures mutual exclusion (e.g., if X is a binary-valued variable, we have an edge (g(X = F), g(X = T))). The conditional probabilities of g(v) are a deterministic OR w.r.t. all its f(v, I) parents, and a deterministic (inverted) AND w.r.t. the other g(X) nodes. For example, for a binary variable X, let P(g(X = T) = T | g(X = F) = F ∧ (f(X = T, I) = T for some I)) = 1 and P(g(X = T) = T | anything else) = 0.

Theorem 2 The distribution over the nodes g(X) in B is equal to the default distribution over X in K.

Now, let X, Y be disjoint sets of variables in K and let Z be a set of compatible I-nodes (a "context") disjoint from X and Y. Let f(Z) stand for the set {f(z, I) | z ∈ Z, I ∈ inferences of K}, and g(Z) for {g(z) | z ∈ Z}.

Corollary 1 If g(X) is d-separated from g(Y) given f(Z) ∪ g(Z) in B, then X is independent of Y given Z in the default distribution for K.

This follows immediately from the equivalence of the distributions - and d-separation in B once the nodes known to be true are set as evidence.

Independence for BKBs

Corollary 1 provides a graph-based scheme for testing independence in a BKB. However, it does not capture all cases where graph-based independence holds. Additionally, better space and time efficiency is desired, by avoiding construction of the equivalent BN. For any inference I, denote by I(W) the set of I-nodes from W that appear in I.

Definition 15 X and Y are i-d-separated by Z in I (denoted D_I(X, Y | Z)) if I(X) is d-separated from I(Y) given I(Z) in the sub-graph I.

Define B_Z, the Bayes network conditioned on Z, as the constructed network B (as in the previous section), but with {f(W) | W ∈ span(Z)} removed, and with B_I removed for all inferences I in K that are incompatible with Z.

Theorem 3 If g(X) is d-separated from g(Y) given g(Z) ∪ f(Z) in B_Z, then X is independent of Y given Z in K. Additionally, D_I(X, Y | Z) holds for every inference I in K that is compatible with Z.

Note that the converses of Theorem 3 do not hold, since an unblocked path in B_Z may exist that traverses more than one inference. Also, independence may actually hold despite non-d-separation, due to properties of the exact numbers in the distribution specification - but the latter phenomenon also occurs in Bayes networks. Unlike for Bayes network d-separation, we suspect that testing for independence in BKBs is NP-hard. Some results on special cases follow.

Let G be the correlation graph of K, augmented by bi-directional arcs between all I-nodes that belong to the same variable. Consider the graph G′ of the strongly connected components in G - denote the components by {A1, A2, ..., Am}. We define a path as "foreign" to a connected component A if it begins and ends outside A.

Definition 16 Strongly connected component A is dominated by Z if every foreign path that passes through any I-node of A includes a node v ∈ Z.

Consider the case where each strongly connected component is un-mixed, that is, if it contains nodes from X, then it contains no nodes from either Y or Z, and likewise w.r.t. Y and Z. In the (acyclic) directed graph of the connected components, let A_X be the set of components that contain nodes from X, A_Y the components with nodes from Y, and A_Z those with nodes from Z.

Proposition 1 If all components in A_Z are dominated by Z, then d-separation of A_X from A_Y by A_Z in G′ implies that X is independent of Y given Z in K.

The above proposition provides an obvious polynomial-time semi-decision algorithm for independence; a sketch follows.
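A sketch of that semi-decision test, assuming the networkx library (its condensation routine computes G′; d-separation on the acyclic component graph is done with the standard moralization test). The domination check of Definition 16 is left out here, so the function applies only when that side condition has been verified separately; all function names are ours.

```python
import networkx as nx

def d_separated_moral(dag, xs, ys, zs):
    """Standard d-separation test on an acyclic digraph: keep the ancestral
    sub-graph of X ∪ Y ∪ Z, moralize it, delete Z, and check disconnection."""
    keep = set(xs) | set(ys) | set(zs)
    for n in list(keep):
        keep |= nx.ancestors(dag, n)
    sub = dag.subgraph(keep)
    moral = nx.Graph(sub.to_undirected())
    for n in sub.nodes:                       # marry co-parents
        parents = list(sub.predecessors(n))
        moral.add_edges_from((p, q) for i, p in enumerate(parents)
                             for q in parents[i + 1:])
    moral.remove_nodes_from(zs)
    reachable = set()
    for x in xs:
        if x in moral:
            reachable |= nx.node_connected_component(moral, x)
    return not (reachable & set(ys))

def bkb_independent(g_aug, x_nodes, y_nodes, z_nodes):
    """Proposition 1 as a semi-decision procedure: True establishes the
    independence of X and Y given Z; False only means 'not established'.

    g_aug: the correlation graph (nx.DiGraph) with bi-directional arcs added
    between I-nodes of the same variable; x_nodes/y_nodes/z_nodes: the
    I-nodes of X, Y, and the context Z. Assumes all components of A_Z are
    dominated by Z (Definition 16) - that check is not implemented here.
    """
    comp = nx.condensation(g_aug)             # the acyclic component graph G′
    member = comp.graph["mapping"]            # original node -> component id
    ax = {member[v] for v in x_nodes}
    ay = {member[v] for v in y_nodes}
    az = {member[v] for v in z_nodes}
    if ax & ay or ax & az or ay & az:         # un-mixedness requirement
        return False
    return d_separated_moral(comp, ax, ay, az)
```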
With care, the algorithm can be extended to handle strongly connected components in A_Z that are not dominated by Z. Mixed components are partially overcome by partitioning the problem into an equivalent (polynomial-size) set of separate independence problems, each consisting of an independence problem with only singleton sets of I-nodes from X and Y. Another useful test is immediate Markov-blanket blocking (similar to the case for Bayes networks):

Proposition 2 If all I-node parents, children, and siblings of all I-nodes from X are in Z, then X is independent of Y given Z.

Conclusion

BKBs generalize Bayes nets by allowing context-specific independence and cycles (Santos & Santos 1999). The size of the BKB representation is at most linear in that of a Bayes network - but actually smaller than the equivalent explicit Bayes network representation when much context-specific independence and large in-degree occur (e.g., a multi-input OR node), or when cycles need to be represented. Reasoning complexity is NP-hard (NP-complete for decision problems), just as for Bayes networks, and with the same essential polynomial-time special cases. Consistency-checking is hard in the general case (Rosen, Shimony, & Santos Jr. 2000).

This paper introduced a graphical method for testing independence statements in a Bayesian knowledge-base with cycles. The method, based on a specially constructed Bayes network that mimics the inferences in the BKB, is an important step in advancing probabilistic rule-based schemes that permit cycles in the set of rules they allow. In special cases, such as with small strongly connected components, we have shown methods for efficiently testing for independence.

Acknowledgements

Supported by AFOSR Grant Nos. F49620-99-1-0059 and 940006, the Israel Ministry of Science, and the Paul Ivanier Center for Robotics (BGU). S. E. Shimony is on sabbatical at the Univ. of Connecticut.

References

Bacchus, F. 1990. Representing and Reasoning with Probabilistic Knowledge: A Logical Approach to Probabilities. The MIT Press.

Boutilier, C.; Friedman, N.; Goldszmidt, M.; and Koller, D. 1996. Context-specific independence in Bayesian networks. In Uncertainty in Artificial Intelligence, Proceedings of the 12th Conference, 115-123. Morgan Kaufmann.

Buchanan, B. G., and Shortliffe, E. H. 1984. Rule-Based Expert Systems. Addison Wesley.

Dempster, A. P. 1968. A generalization of Bayesian inference. J. Royal Statistical Society 30:205-247.

Geiger, D., and Heckerman, D. 1991. Advances in probabilistic reasoning. In Proceedings of the 7th Conference on Uncertainty in AI.

Heckerman, D. 1991. Probabilistic Similarity Networks. The MIT Press.

Nilsson, N. J. 1986. Probabilistic logic. Artificial Intelligence 28:71-87.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.

Poole, D. 1993. Probabilistic Horn abduction and Bayesian networks. Artificial Intelligence 64(1):81-129.

Rosen, T.; Shimony, S. E.; and Santos Jr., E. 2000. Reasoning with BKBs - algorithms and complexity. In Sixth International Symposium on Artificial Intelligence and Mathematics.

Santos, Jr., E., and Santos, E. S. 1999. A framework for building knowledge-bases under uncertainty. Journal of Experimental and Theoretical Artificial Intelligence 11:265-286.

Shafer, G. A. 1979. A Mathematical Theory of Evidence. Princeton University Press.

Shimony, S. E., and Santos, Jr., E. 1996. Exploiting case-based independence for approximating marginal probabilities. International Journal of Approximate Reasoning 14(1).
Shimony, S. E. 1993. The role of relevance in explanation I: Irrelevance as statistical independence. International Journal of Approximate Reasoning 8(4):281-324.

Shimony, S. E. 1995. The role of relevance in explanation II: Disjunctive assignments and approximate independence. International Journal of Approximate Reasoning 13(1):27-60.

Shortliffe, E. H., and Buchanan, B. G. 1975. A model of inexact reasoning in medicine. Mathematical Biosciences 23:351-379.

Thagard, P. 1989. Explanatory coherence. Behavioral and Brain Sciences 12:435-502.

Zadeh, L. A. 1983. The role of fuzzy logic in the management of uncertainty in expert systems. Fuzzy Sets and Systems 11:199-227.