Applied Intelligence 6, 275-285 (1996)
© 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Efficient Inferencing for Sigmoid Bayesian Networks by Reducing Sampling Space

YOUNG S. HAN, YOUNG C. PARK AND KEY-SUN CHOI
Computer Science Department, Korea Advanced Institute of Science and Technology, Taejon, 305-701, Korea

Abstract. A sigmoid Bayesian network is a Bayesian network in which a conditional probability is a sigmoid function of the weights of the relevant arcs. Its application domain includes that of the Boltzmann machine as well as traditional decision problems. In this paper we show that the node reduction method, an inferencing algorithm for general Bayesian networks, can also be used on sigmoid Bayesian networks, and we propose a hybrid inferencing method combining node reduction and Gibbs sampling. The time efficiency of sampling after node reduction is demonstrated through experiments. The results of this paper bring sigmoid Bayesian networks closer to large scale applications.

Keywords: Bayesian network, inferencing, Gibbs sampling

1. Introduction

The Bayesian model has been one of the major methods to handle uncertainty in artificial intelligence. The model gave birth to many implementations with different names (e.g., influence diagram, belief network, knowledge map, causal network, and causal probabilistic network). They have different model structures and problem domains. The model remains largely inefficient for application domains with a nontrivial number of variables, mainly due to the problems of knowledge acquisition and inference [5], though there have been attempts to deal with the difficulty of knowledge acquisition for general (non-restricted) Bayesian networks or sparse networks [2, 12]. Sigmoid Bayesian networks [9] have ways to acquire distribution information from samples mechanically, and thus enhance the efficiency of the model, though they are applicable to more limited domains.

It is the inference problem for the sigmoid variation of Bayesian networks that this paper addresses. We show that the node reduction algorithm by Shachter [11] can be applied to the sigmoid variation as well, and suggest hybrid inferencing that combines the node reduction and sampling approaches as a practical alternative. The inference algorithm based on node reduction was introduced by Shachter for the general definition of influence diagrams, but in practice it is hardly a feasible method because of the exponential complexity. It also needs to be mentioned that inferencing on restricted networks such as the Noisy-OR network [9] and Pearl's trees (and polytrees) can be efficient [10].

Another inference algorithm for general Bayesian networks is Gibbs sampling, which originates in the Metropolis algorithm [4, 8]. Sampling is an approximation of the exact inference value, and does not take excessive time in most cases. Though the sampling may take only linear time in the network size to reach equilibrium, approximation is theoretically also NP-hard [1, 3], and the cost can still become unbearably high with sufficiently large networks.

We propose a combination of the two approaches in which the node reduction algorithm is used to reduce the network size until the reduction cost reaches a certain threshold value, and sampling is then used on the reduced network. The resulting inference scheme is Gibbs sampling on a reduced network. The gain is the speed to achieve a desired accuracy, or the accuracy achievable within a fixed inference time.
An alternative method proposed by Kjaerulff [6] to combine Gibbs sampling and an exact algorithm uses junction trees that are obtained via the process of triangulation.

The difficulty in realizing the Bayesian network lies in the exponential nature of the two problems mentioned above. The knowledge acquisition problem is about building a network. The construction of a Bayesian network can be done in two steps. In the first step, the dependency information among variables is specified, resulting in a topology of the network. In the second step, the conditional probability distribution for each node is provided, usually by human experts. A rigorous specification of the probability distribution information accounts for the complexity. The noisy-OR model is a variation of Bayesian network that works with reduced overhead of the acquisition problem though it is still NP-hard [9, 10], and Heckerman's Pathfinder was also proposed in this context [5]. A sigmoid Bayesian network encodes distribution information as the weights of arcs instead of tables in nodes. The mechanical acquisition of probability distribution in the sigmoid Bayesian network leads to a different class of applications.

Another source of complexity is inferencing on the model. Due to the exponential number of marginal probabilities to compute for an inference problem, an exact assessment of a Bayesian network is not feasible [5]. To do away with this problem, many special forms of Bayesian networks have been suggested, such as the one developed by Pearl [10]. Gibbs sampling on reduced sigmoid Bayesian networks turned out to be reasonably practical judging from the tests with up to 250 nodes, in which it took about ten seconds to solve an inference problem on a Sparc 10 workstation. Our experiments show that the hybrid inference method can run at only one fifth of the sampling time on average.

In Section 2, the basic concepts of Bayesian networks and sigmoid Bayesian networks are explained, and relevant inference algorithms are described. Section 3 shows that the node reduction algorithm by Shachter [11] is also applicable to the sigmoid Bayesian network. Experimental support for the efficiency of the hybrid inferencing on sigmoid Bayesian networks is given in Section 4.

2. Related Work

In this section some background materials are briefly explained for the later development of our discussion. The basic definition of Bayesian network and a couple of inferencing algorithms are covered, and the definition of sigmoid Bayesian network follows.

2.1. Bayesian Network and Inference Algorithms

A Bayesian network is made of a graph and tables of conditional probability distributions. The graph structure takes the form of a directed acyclic graph (DAG) where a node represents a variable or a proposition in a domain. Each node produces an output value at any moment according to the conditional probability distribution maintained in the node. Usually a human expert is responsible for the construction of the graph and tables. The example network in Fig. 1 has 1/0 valued nodes, and the arc from node rain to node car accident indicates the conditional dependency between the two nodes.

[Figure 1. An example Bayesian network. The conditional probability table of node car accident (a), conditioned on nodes s, r, and f:

    p(a=1 | s=0, r=0, f=0) = 0.2
    p(a=1 | s=0, r=0, f=1) = 0.6
    p(a=1 | s=0, r=1, f=0) = 0.3
    p(a=1 | s=0, r=1, f=1) = 0.7
    p(a=1 | s=1, r=0, f=0) = 0.5
    p(a=1 | s=1, r=0, f=1) = 0.7
    p(a=1 | s=1, r=1, f=0) = 0.4
    p(a=1 | s=1, r=1, f=1) = 0.9]
The figure shows that the chance of a car accident in a street depends on the combination of various factors. Each node has its own conditional probability table, such as the one for node car accident shown in the figure. Given a network for the above description, a state of the network is a set of output values of the nodes. The probability of a state, or the joint probability of the nodes, is simply the product of the conditional probabilities specified in the tables. For example, a state probability of the example network in Fig. 1 can be computed as follows:

    P(r=1, s=1, f=0, a=1, c=0, t=1)
      = P(r=1) P(s=1) P(f=0 | r=1) P(a=1 | s=1, r=1, f=0) P(c=0 | a=1) P(t=1 | a=1).

This is based on the fact that there should be at least one topological order of the nodes, but in actual computation the order of the nodes need not be known since the multiplication is commutative.

Computing a conditional probability between arbitrary sets of the variables accounts for a decision process, or probabilistic inferencing, but is known to be very difficult. A conditional probability such as P(c=1, t=1 | r=1, f=1) is an inference problem for the example in Fig. 1. A straightforward computation of the conditional probability requires an exponential number of joint probabilities of the nodes. The inferencing problem turned out to be NP-hard for non-restricted Bayesian networks [1, 5]. For some special graph structures such as trees, polytrees, or Noisy-OR, there are efficient inference algorithms [5, 9, 10]. In the following we review an inference algorithm by Shachter [11] that we will apply to sigmoid Bayesian networks in the next section.

Whenever possible, the notations of Shachter [11] are used to avoid unnecessary confusion. The nodes of a network are indexed by N = {1, ..., n}, and the corresponding variables of the nodes are denoted by X_1, ..., X_n. For each variable X_i, a finite set of outcomes, Ω_i, is defined, and there is a conditional probability distribution table for each node, denoted by π_i. The conditioning nodes of node i are represented by C(i). Capital letters such as I and J represent sets of nodes. For example, in the example network C(a) includes s, r, and f.

The node reduction algorithm by Shachter is based on the observation that there can be nodes that are irrelevant to a given inference problem, and that any node that is not a variable of the given inference problem can be made to be irrelevant. Such irrelevant nodes, called barren, can be eliminated without distorting the inference value being computed. Once an inference problem is given, the node reduction is accomplished in two steps. The first step is to identify the nodes that are already barren and delete them. The second step is then to make other nodes barren. By definition, barren nodes are those that do not reach any node corresponding to the variables of a given inference problem. Graph theoretically, barren nodes are defined as follows.

Definition 1. Given an inference problem P(J | K) and a Bayesian network, a node i is barren iff node i belongs to neither J nor K and has no outgoing edges, where J and K are the sets of nodes (variables) constituting the inference problem.

In Fig. 1, given P(t=1 | r=1, f=1) to compute, it can be easily seen that the conditional probability is not affected by the presence of node c.
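As an illustration of Definition 1 and the first step of node reduction, the following minimal Python sketch (ours, not from the paper) repeatedly deletes barren nodes. Deleting one barren node can expose another, so the pruning iterates until no barren node remains; the second step, making further nodes barren by arc reversal, is not shown here.

    # Minimal sketch: iterated barren node deletion per Definition 1.
    # A DAG is a dict mapping each node to the set of its children;
    # J and K are the query and evidence node sets.

    def prune_barren(children, J, K):
        """Repeatedly remove nodes outside J and K that have no
        outgoing arcs, until no such node remains."""
        children = {n: set(cs) for n, cs in children.items()}
        while True:
            barren = [n for n, cs in children.items()
                      if not cs and n not in J and n not in K]
            if not barren:
                return children
            for n in barren:
                del children[n]              # drop the barren node
                for cs in children.values():
                    cs.discard(n)            # and its incoming arcs

    # The network of Fig. 1: r, s, f, a (car accident), and a's
    # children c and t.  For P(t=1 | r=1, f=1), node c is barren.
    dag = {"r": {"f", "a"}, "s": {"a"}, "f": {"a"},
           "a": {"c", "t"}, "c": set(), "t": set()}
    print(sorted(prune_barren(dag, J={"t"}, K={"r", "f"})))
    # ['a', 'f', 'r', 's', 't'] -- node c has been removed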
Conditioned node t is connected to a temporary node X_0 as a preparatory step before making the other nodes barren. Now every node except X_0 and the conditioning nodes, r and f, can be eliminated. Making a node barren is done by reversing the outgoing arcs from the node and updating the conditional probability tables accordingly to preserve the inference value, whereby the node is isolated from the rest of the network. Figure 2 shows the resulting graph after node reduction on the example network is completed with respect to the above inference problem. The probability table of X_0 should contain the desired inference value. The node reduction algorithm is inherently exponential since it is a process of updating the probability tables, which are exponential in size. The algorithm may not be practical for applications with heavy connectivity.

[Figure 2. Resulting network after node reduction (node c becomes barren).]

Figure 3. An instance of the Gibbs sampling method.

    Given an inference problem P(J | K) and a Bayesian network with n nodes,
    1. Set the output value of the nodes in K to 1.
    2. For each of the n nodes not in K, compute the distribution of the node
       conditioned by the rest of the network.  With the nodes indexed in a
       topological order, this distribution is given by the following
       proportionality, as explained in [9, 10]:
         P(X_i = x_i | X_j, j ≠ i)
           ∝ P(X_i = x_i | X_j, j < i) ∏_{j > i} P(X_j = x_j | X_i = x_i, X_k, k < j, k ≠ i).
    3. Determine the outcome of the current node according to the above distribution.
    4. Move to the next node till all the nodes are visited.
    5. Increase a counter if all the nodes in J display value 1 in the current network.
    6. Repeat steps 2 to 5 until a certain saturation condition is met.
    7. Report the inference value by normalizing the counter by the total number
       of repetitions.

Gibbs sampling is an approximation technique through stochastic simulation, and is regarded as one of the inference methods [4, 8, 9, 14]. Gibbs sampling accumulates a sample by visiting nodes in turn and changing the output value of the node. The value of each node is updated according to the probability of the node conditioned by the values of all the other nodes, and the convergence of the approximation is guaranteed if, in a given visiting scheme, each node can be visited infinitely often. Figure 3 summarizes the Gibbs sampling algorithm. Gibbs sampling belongs to a more well known class of sampling methods derived from the Metropolis algorithm. Simulated annealing is another example of such sampling. Discussion of the performance of the sampling method is given in Section 4.
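The scheme of Fig. 3 can be rendered in a few lines of Python. The sketch below is ours, not the authors' implementation; it assumes 0/1 outcomes and table entries strictly between 0 and 1, and evaluates the full conditional of step 2 through the factors that actually mention the node, namely its own table entry and those of its children (its Markov blanket), which is equivalent to the proportionality above.

    # Minimal Gibbs sampler for 0/1-valued nodes.  `parents` maps each
    # node to its parent tuple; `cpt[n][assignment]` is P(n=1 | parents).
    import random

    def gibbs(parents, cpt, J, K, sweeps=5000, seed=0):
        rng = random.Random(seed)
        nodes = list(parents)
        state = {n: 1 if n in K else rng.randint(0, 1) for n in nodes}
        children = {n: [c for c in nodes if n in parents[c]] for n in nodes}

        def p_value(n, v):                     # P(n=v | its parents)
            p1 = cpt[n][tuple(state[p] for p in parents[n])]
            return p1 if v == 1 else 1.0 - p1

        hits = 0
        for _ in range(sweeps):                # step 6: repeat until done
            for n in nodes:
                if n in K:                     # evidence stays clamped to 1
                    continue
                w = []
                for v in (0, 1):               # step 2: full conditional is
                    state[n] = v               # proportional to the node's
                    prod = p_value(n, v)       # own term times the terms of
                    for c in children[n]:      # its children
                        prod *= p_value(c, state[c])
                    w.append(prod)
                state[n] = 1 if rng.random() < w[1] / (w[0] + w[1]) else 0
            if all(state[j] == 1 for j in J):  # step 5: count J-all-ones
                hits += 1
        return hits / sweeps                   # step 7: normalize

For the problem P(t=1 | r=1, f=1) of Fig. 1, one would call gibbs(parents, cpt, J={"t"}, K={"r", "f"}) with the tables of the figure filled into cpt.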
2.2. Sigmoid Bayesian Network

Graphically there is no difference between traditional Bayesian networks and sigmoid Bayesian networks. In a sigmoid Bayesian network, however, the conditional probability distribution is encoded through the weights of the arcs between conditioning and conditioned nodes, and an actual probability is computed on demand from a sigmoid function whose input value is determined by the arc weights and node values. A popular sigmoid function is defined as σ(t) = 1/(1 + exp(-t)). The definition of a sigmoid function is as follows.

Definition 2. A function f(x) is sigmoidal iff
1. it is an odd function, in which f(-x) = -f(x),
2. lim_{x→∞} f(x) = 0.5 and lim_{x→-∞} f(x) = -0.5, and
3. it is monotone increasing.

A condition that combines conditions 1 and 2 of the above definition in our context is g(x) + g(-x) = 1, where g(x) is the function derived from f(x) by adding 0.5.

Definition 3. A function g(x) is positive sigmoidal iff there exists a function f(x) such that g(x) = f(x) + 0.5 where f(x) is sigmoidal, or equivalently g(x) + g(-x) = 1.

When we say sigmoid, it is in fact positive sigmoid according to our definition. There are numerous sigmoid functions, such as the tangent hyperbolic functions. With nodes taking binary values -1/1, a conditional probability is defined as follows:

    P(X_i = x_i | X_C(i) = x_C(i)) = σ( x_i Σ_{j∈C(i)} x_j w_ij ),

where x_i can be either -1 or 1, x_C(i) represents a vector of binary values, and w_ij is the weight of the arc from node j to node i.

The gain from sigmoid Bayesian networks is the efficiency of encoding the probability information, and the loss is the inflexibility of the probability distribution to be encoded. Only values following a sigmoid distribution can be represented. The change in fact brings about a radical shift in the applications of the network. A mechanical acquisition of the conditional probability becomes possible in the sigmoid Bayesian network, as in connectionist models. The applications now include the pattern classification problems that used to be tackled by neural networks and hidden Markov models. More on the sigmoid variation of Bayesian networks can be found in [9, 13].
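As a concrete rendering of this encoding, here is a small sketch (ours; the weights are made up for illustration) that computes the conditional probability above with the logistic function as the positive sigmoid σ.

    # Sigmoid encoding of Section 2.2: outcomes are -1/1 and
    # P(X_i = x_i | parents) = sigma(x_i * sum_j x_j * w_ij).
    import math

    def sigma(t):
        """Logistic function; positive sigmoidal, since sigma(t) + sigma(-t) = 1."""
        return 1.0 / (1.0 + math.exp(-t))

    def cond_prob(x_i, parent_values, weights):
        """P(X_i = x_i | parents); x_i and each parent value in {-1, 1}.
        weights[j] is the weight of the arc from parent j to node i."""
        t = sum(x_j * w for x_j, w in zip(parent_values, weights))
        return sigma(x_i * t)

    # Example: a node with parents (s, r, f) = (1, 1, -1) and illustrative
    # (made-up) arc weights; the two outcomes sum to one.
    p_pos = cond_prob(+1, (1, 1, -1), (0.4, 0.9, -0.2))
    p_neg = cond_prob(-1, (1, 1, -1), (0.4, 0.9, -0.2))
    assert abs(p_pos + p_neg - 1.0) < 1e-12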
3. Node Reduction in Sigmoid Bayesian Networks

In this section, we show that the same node reduction that Shachter [11] applied to influence diagrams can be done on sigmoid Bayesian networks. The major difference between sigmoid Bayesian networks and influence diagrams is that the former does not have a probability distribution table for each node. It is our goal to show that the node reduction algorithm works in spite of the difference. We begin by explaining Shachter's node reduction algorithm.

The process of node reduction is basically to solve a given inference problem P(f(X_J) | X_K), where J and K are arbitrary subsets of the N nodes and f() is a real valued function. Every node in the network except the nodes K and a special node replacing the nodes J is eliminated, preserving the probabilistic information in regard to the given inference problem. The first type of reduction is barren node reduction. Barren nodes are those whose information is orthogonal to the given inference problem; in other words, they do not lead to any node in J or K.

Proposition 1 (Barren Node Reduction [11]). If node i is barren with respect to J and K, then it can be eliminated from a Bayesian network without changing the value of P{f(X_J) | X_K}.

If a node is not barren, then the network can be changed such that the node becomes barren and the new network preserves the information of the original one with respect to the inference problem. More specifically, a node can be made barren by reversing its outgoing arcs. A general context of arc reversal is illustrated in Fig. 4, where I = C^old(i) - C^old(j), J = C^old(j) - C^old(i) - {i}, and K = C^old(i) ∩ C^old(j). In the new configuration, C^new(j) = I ∪ J ∪ K, and C^new(i) = I ∪ J ∪ K ∪ {j}. Notice that there are two types of new arcs created. The new arcs from the nodes in I to node j compensate for the influence the nodes in I used to deliver to j through i. The new arcs from the nodes in J to node i decompensate the undesired influence that the nodes in J are going to carry upon i through j.

[Figure 4. Typical situation of an arc reversal: the node groups I, K, and J before and after the reversal, with the new arcs marked.]

Proposition 2 (Arc Reversal [10]). Given a fully specified Bayesian network containing an arc from node i to node j with no other directed path from i to j, it is possible to transform the network to one with an arc from j to i.

From the proof of Proposition 2 by Shachter, the following updating rules are given for the conditional probability distribution tables of nodes i and j once the arc is reversed. For X_j,

    π_j^new(x_j | x_C^new(j)) = Σ_{x_i∈Ω_i} π_j^old(x_j | x_C^old(j)) π_i^old(x_i | x_C^old(i)),        (1)

and for X_i,

    π_i^new(x_i | x_C^new(i)) = π_j^old(x_j | x_C^old(j)) π_i^old(x_i | x_C^old(i)) / π_j^new(x_j | x_C^new(j)),        (2)

where π_j(x_j | x_C(j)) is the probability that X_j takes a certain outcome given the conditioning vector, and corresponds to a table entry in general Bayesian networks. The arc reversal is completed when the probability distribution tables of i and j are updated according to the above equations. Once the arc reversal is done, node i becomes barren and can be eliminated. From Propositions 1 and 2, any node in the network can be removed, and when this process is deliberately directed, an inference problem can be solved.

It is obvious that Proposition 1 (barren node reduction) is directly applicable to sigmoid Bayesian networks, so it remains to show the applicability of the arc reversal to sigmoid Bayesian networks. This can be proved by showing that the probability distributions of the above two equations are captured in sigmoid functions with the modified weights of the arcs. As mentioned earlier, in this paper we assume the outcomes of a node in sigmoid Bayesian networks are -1 and 1, thus Ω_i = {-1, 1}. By definition, the sigmoid interpretation of the probability that X_j takes x_j when conditioned by the value x_C(j) of the conditioning variables X_C(j) is

    π_j(x_j | x_C(j)) = σ( x_j Σ_{k∈C(j)} x_k w_jk ).

Then the following lemma readily follows from the property of the sigmoid function.

Lemma 1. For each node of a sigmoid Bayesian network, the conditional probabilities in a sigmoid distribution observe the following relation:

    π_j(x_j | x_C(j)) = 1 - π_j(x_j | x̄_C(j)) = 1 - π_j(-x_j | x_C(j)),

where x_C(j) is a vector of the outcomes of C(j), and x̄_C(j) is defined to be the vector of the opposite configuration of x_C(j), so that x̄_C(j) = -x_C(j). For example, if x_C(j) is (-1, 1, 1), x̄_C(j) becomes (1, -1, -1).

For a sigmoid function σ(), σ(t) = 1 - σ(-t) holds, which in fact is a necessary condition for a distribution of t to be captured in a sigmoid function. It should also be shown that the distribution is monotone increasing (or, more strongly, continuous and increasing) in order for the distribution to be sigmoidal.

An arc reversal entails updates of the weights of existing arcs as well as the creation of new arcs. The changes of the conditional probability distributions of nodes i and j conform to the distributions that Eqs. (1) and (2) specify. In the case of general Bayesian networks, the changes are directly encoded in the distribution tables. As for sigmoid Bayesian networks, however, the distributions from the equations should follow the sigmoid distribution. Otherwise, they will not be captured through the weights of arcs in the new configuration. The following lemmas show that the conditional probability distributions of nodes i and j can be captured in sigmoid functions, by showing that they satisfy the sufficient conditions to be qualified as sigmoid functions.

Lemma 2. For node j in Fig. 4, the conditional probability distribution according to Eq. (1) satisfies Lemma 1. That is,

    π_j^new(x_j | x_C^new(j)) = 1 - π_j^new(x_j | x̄_C^new(j)) = 1 - π_j^new(-x_j | x_C^new(j)).

Proof: See Appendix A for a proof. □

Lemma 3. π_j^new(x_j | x_C^new(j)) in Eq. (1) is continuous.

Proof: It suffices to show that Eq. (1) always defines an output value for every value of the old configurations and that a limit exists at each value. π_j^old(x_j | x_C^old(j)) and π_i^old(x_i | x_C^old(i)) are supposed to be sigmoidal, hence continuous, and return values between 0 and 1 at all times. It is obvious that multiplying and summing those numbers will always yield a number. Taking the limit of the right-hand side of Eq. (1),

    lim π_j^new(x_j | x_C^new(j)) = Σ_{x_i∈Ω_i} lim ( π_j^old(x_j | x_C^old(j)) π_i^old(x_i | x_C^old(i)) ).

Since limits are additive and multiplicative, the equality is obvious. □

Lemma 4. π_j^new(x_j | x_C^new(j)) in Eq. (1) is increasing.

Proof: Assuming that the nodes produce binary outcomes, Eq. (1) is like a function of four variables as follows:

    g(a, b, c, d) = f(a)f(b) + f(c)f(d),

where f() is a sigmoid function. Now the derivative of g() looks as follows:

    g'(a, b, c, d) = f'(a)f(b) + f(a)f'(b) + f'(c)f(d) + f(c)f'(d).

Obviously g'(a, b, c, d) ≥ 0, since a positive sigmoid and its derivative are both nonnegative. This implies Eq. (1) is an increasing function. □

Theorem 1. The distribution of Eq. (1) can be captured in a sigmoid function.

Proof: Lemmas 2, 3, and 4 fulfill the sufficiency for the equation to be sigmoidal. □
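The claim of Lemma 2 is easy to check numerically. The sketch below is our illustration with arbitrary weights, not the paper's code; it takes node i with old conditioning set {u, k} and node j with old conditioning set {i, k, v}, so that after the reversal C^new(j) = {u, k, v}, evaluates Eq. (1) at a configuration and at its opposite, and confirms that the two values sum to one.

    import math

    def sigma(t):
        return 1.0 / (1.0 + math.exp(-t))

    def pi_j_new(x_j, x_u, x_k, x_v, w_i, w_j):
        # Eq. (1): sum over the outcomes of X_i of
        # pi_j_old(x_j | x_i, x_k, x_v) * pi_i_old(x_i | x_u, x_k).
        total = 0.0
        for x_i in (-1, 1):
            t_j = x_i * w_j["i"] + x_k * w_j["k"] + x_v * w_j["v"]
            t_i = x_u * w_i["u"] + x_k * w_i["k"]
            total += sigma(x_j * t_j) * sigma(x_i * t_i)
        return total

    w_i = {"u": 0.7, "k": -0.3}               # arbitrary illustrative weights
    w_j = {"i": 1.1, "k": 0.5, "v": -0.8}
    p     = pi_j_new(+1, +1, -1, +1, w_i, w_j)
    p_bar = pi_j_new(+1, -1, +1, -1, w_i, w_j)   # opposite configuration
    assert abs(p + p_bar - 1.0) < 1e-12          # the relation of Lemma 2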
Lemma 5. For node i in Fig. 4, the conditional probability distribution according to Eq. (2) satisfies Lemma 1. That is,

    π_i^new(x_i | x_C^new(i)) = 1 - π_i^new(x_i | x̄_C^new(i)) = 1 - π_i^new(-x_i | x_C^new(i)).

Proof: See Appendix B for a proof. □

Lemma 6. π_i^new(x_i | x_C^new(i)) in Eq. (2) is continuous.

Proof: This can be shown in a similar manner to Lemma 3. □

Lemma 7. π_i^new(x_i | x_C^new(i)) in Eq. (2) is increasing.

Proof: This can be shown in a similar manner to Lemma 4. □

Theorem 2. The distribution of Eq. (2) can be captured in a sigmoid function.

Proof: Lemmas 5, 6, and 7 fulfill the sufficiency for the equation to be sigmoidal. □

From the above Theorems 1 and 2, we can now conclude that the new conditional probability distributions can be captured in the sigmoid function.

Theorem 3 (Arc Reversal in Sigmoid Bayesian Networks). Given a sigmoid Bayesian network containing an arc from node i to node j and with no other directed path from i to j, it is possible to transform the network to one with an arc from j to i.

Proof: The sigmoid interpretation of Eq. (1) gives a system of linear equations of the following form. Suppose there are three arcs to node j, with weights w_I, w_K, and w_J, from the three groups of nodes I, K, and J respectively. Each sign pattern (x_I, x_K, x_J) ∈ {-1, 1}³ of the conditioning values contributes one equation

    x_I w_I + x_K w_K + x_J w_J = k,

where k is the value with σ(k) equal to the probability computed from Eq. (1) for that pattern; the right-hand sides are thus those values computed from Eq. (1). There are as many linear equations as two to the power of the number of arcs. By Theorem 1, which shows that π_j^new conforms to a sigmoid distribution, we know that there exists a solution to the above linear equations. By the same argument, according to Theorem 2, there exists a unique assignment of weights to the arcs that satisfies π_i^new. Thus the arc from i to j can be reversed in sigmoid Bayesian networks. □

One way to compute each arc weight is to subtract one equation from another whose sign pattern is the same except for the arc to be computed. For instance, if w_I is to be computed, using the two equations

    -w_I - w_K - w_J = k_1,
    +w_I - w_K - w_J = k_2,

gives w_I = -(k_1 - k_2)/2.

Theorem 3 does not say that a particular sigmoid function must be used in describing the new distribution after arc reversal. The selection of a sigmoid function, however, will determine the inferencing behavior, or the sensitivity to the input values, and thus should depend on the performance characteristics of the applications.
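In code, this weight-recovery step looks as follows. This is our sketch, not the paper's implementation; it assumes each constant k is obtained as the logit (inverse logistic sigmoid) of the corresponding target probability from Eq. (1), and applies the subtraction trick above to isolate a single weight.

    import math

    def logit(p):
        # inverse of the logistic sigmoid
        return math.log(p / (1.0 - p))

    def recover_weight(pi_minus, pi_plus):
        """Weight of one arc into node j.  `pi_minus` and `pi_plus` are the
        target probabilities pi_j_new(x_j=1 | ...) from Eq. (1) with the
        chosen conditioning node at -1 and at +1 respectively, all other
        conditioning values held fixed at -1."""
        k1, k2 = logit(pi_minus), logit(pi_plus)
        return -(k1 - k2) / 2.0       # the subtraction trick from the proof

    # Round trip with made-up weights: build the two targets from known
    # weights w_I, w_K, w_J, then recover w_I.
    w_I, w_K, w_J = 0.8, -0.4, 1.5
    pi_minus = 1.0 / (1.0 + math.exp(-(-w_I - w_K - w_J)))   # pattern (-1,-1,-1)
    pi_plus  = 1.0 / (1.0 + math.exp(-(+w_I - w_K - w_J)))   # pattern (+1,-1,-1)
    assert abs(recover_weight(pi_minus, pi_plus) - w_I) < 1e-9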
Reversing an arc, however, has nontrivial time complexity. The high time complexity of arc reversal is one reason that gives rise to our approach to solving the inference problems.

Theorem 4. The time complexity of an arc reversal in sigmoid Bayesian networks is O(N²), where N is the number of nodes of a given network.

Proof: An arc reversal consists mainly of updating the arc weights to reflect the reversal. If we regard a visit to a node as a unit cost, an arc update takes an order of N visits. An arc update is done by computing Eqs. (1) and (2) and solving linear equations. There are at most 4N visits for Eq. (1) and 3N visits for Eq. (2). Also, solving the linear equations requires an additional 2N visits corresponding to the two vectors. There are at most 2N arcs to update. Together, an arc reversal takes an order of N². □

Consequently, making a node barren takes time complexity O(N³), as there can be at most N - 1 arcs to reverse. In practice, the number of arcs to reverse to make a node barren can vary from 0 to N, and as the number becomes bigger, the cost of node reduction may increase beyond the limit of average computing resources in many applications. After all, an inferencing takes an order of O(N⁴) in time, since there are O(N) nodes to be made barren and removed. Though all the nodes in the network can be made barren to be eliminated, the efficiency of reduction depends on the order of reduction. Searching for an optimal order is known to be NP-hard [11].

The hybrid approach proposed in this paper combines two inference methods: node reduction and Gibbs sampling. The biggest barrier to Gibbs sampling is the size of the network from which sample states are generated. The sample space increases by two to the power of the network size. By sample space we mean the space of unique states from which sample vectors (states) are collected. The bigger the sample space, the more diverse the samples will be. On the other hand, node reduction suffers from high computational cost for sufficiently large networks. The node reduction method is particularly fragile in networks with heavy connectivity, because its complexity is directly affected by the number of arcs to reverse. Node reduction and Gibbs sampling complement each other's drawbacks, which may allow a practical way to the inference problems of large applications. In the next section we describe the proposed inferencing algorithm in detail.
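The control flow of the proposal can be pictured as below. This is our schematic sketch, not code from the paper: the four injected callables (pick_node, reversal_cost, make_barren, sample) are hypothetical placeholders for the operations described in this section and in Section 3.

    def hybrid_inference(net, J, K, cost_threshold,
                         pick_node, reversal_cost, make_barren, sample):
        """Schematic driver for the hybrid scheme.
        pick_node(net, J, K)      -> next candidate node, or None;
        reversal_cost(net, node)  -> cost of making it barren, which grows
                                     with the incoming arcs of the two nodes
                                     involved in each reversal;
        make_barren(net, node)    -> net with the node reversed out and
                                     removed (Theorem 3 + Proposition 1);
        sample(net, J, K)         -> Gibbs estimate on the reduced network."""
        while True:
            node = pick_node(net, J, K)
            if node is None:
                break                                  # nothing left to reduce
            if reversal_cost(net, node) > cost_threshold:
                break                                  # reversal now too costly
            net = make_barren(net, node)
        return sample(net, J, K)                       # sample the smaller space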
4. Gibbs Sampling on Reduced Network

The basic idea of our proposal is to reduce the Bayesian network and then apply Gibbs sampling to solve a given inference problem. In this section we show that a sigmoid Bayesian network containing hundreds of nodes can be put into practical use when the hybrid inferencing is used.

A question arises as to when to stop the node reduction before Gibbs sampling starts. Obviously the arc reversal must be stopped when the cost of an arc reversal is sufficiently high because of many arc updates. The number of incoming arcs of the two nodes directly involved in the reversal determines the complexity of an arc reversal. How much connectivity is too high certainly depends on the available computing resources, and hence the threshold connectivity should be determined after test runs, as a function of the incoming edges of the two nodes involved in the arc to be reversed.

Experiments were done to see how much efficiency was gained from the hybrid inferencing. The domain of the experimental network is lexical study. The nodes correspond to words and the arcs to the cooccurrence dependencies of the words as observed in raw texts. The arc from node j to node i indicates that node i has taken place within a certain distance after the event of node j in the sample texts. A cycle caused by an addition of an arc or node can be prevented by creating a duplicate of the node causing the cycle. At most two nodes for each word are enough to avoid the cycles.

In our case, the sample texts are selected from the Penn Treebank corpus [7]. The Wall Street Journal articles are used in collecting the bigram cooccurrence statistics from which the network topology and the arc weights are obtained. The 250 nontrivial words with high frequency are selected from the texts, then order sensitive bigrams are computed with window size 5. The arc weights are determined by normalizing the frequencies of the arcs by the highest frequency. It turned out that the experimental results using the frequency weighting closely resembled our intuition.

Table 1 illustrates the inference problems used in the experiments and their results. The values in the table are from the reduction method, but the other methods (sampling and hybrid) gave similar values within a 0.01 difference range. It seems reasonable that P(at | good) > P(on | good), and that the word gain is more closely related to large than to if.

Table 1. Example inference outcomes from the node reduction method for a network with 150 nodes.

    Inference problem         Inference outcome
    P(dollar | company)       0.535875
    P(plant | company)        0.603191
    P(year | company)         0.500000
    P(election | president)   0.500002
    P(sale | president)       0.500000
    P(on | good)              0.500000
    P(at | good)              0.681567
    P(large | gain)           0.501558
    P(if | gain)              0.500057
    P(market | cash)          0.530684

To see how well the different inference methods perform on different network sizes, we prepared nine networks of varying numbers of nodes (from 20 to 250 nodes), and tested the same set of inference problems in Table 1 on the reduction, sampling, and hybrid methods. All three methods were run to produce the inference values within a 0.01 value difference on a Sparc 10 workstation. The value difference range is particularly critical to the sampling and hybrid methods, because the sampling cost can be much reduced if a larger difference is allowed.

[Figure 5. Performance comparison of the node reduction, sampling, and hybrid methods: inference time (sec.) against network size (number of nodes, 50 to 250). The experiments were done on a Sparc 10 workstation. The sampling method shows a clear linear increase, the reduction method follows a relatively slow nonlinear curve, and the hybrid method displays an even slower nonlinear curve. The curves indicate that the hybrid method will outperform the other methods for networks with more than 200 nodes.]

As shown in Fig. 5, the reduction method performed remarkably well with small numbers of nodes. It started to slow down at the highest rate beyond 150 nodes, while the sampling method maintained the same rate of increase over the different networks. Because the hybrid method uses node reduction, it may also show behaviour similar to the reduction method in the long run, but at a reduced rate. Figure 5 implies that the reduction method may be less efficient than the sampling method for sufficiently large networks. It can also be said that the hybrid method will tend to stay below the other two curves with larger networks. In summary, the experiments suggest strong evidence for the potential of the hybrid method for networks with more than 200 nodes.
5. Conclusion

Sigmoid Bayesian networks provide an alternative to the knowledge acquisition problem of Bayesian networks for some applications. A learning procedure similar to those used in neural networks can be used, as discussed in Neal [9], but the inferencing still remains a serious barrier to large scale applications. A known result is that an accurate assessment of general Bayesian networks is an NP-hard problem; thus, arguably, an approximation technique such as Gibbs sampling can be a practical approach to the problem, in particular for large applications. Gibbs sampling, however, is not sufficiently cheap, and thus may not be useful for real time applications. We have shown that the node reduction algorithm by Shachter can be used on sigmoid Bayesian networks, and that the Gibbs sampling approach can be made more practical when used with the reduction algorithm. It is clear that the proposed inference scheme can also be applied to the other variations of Bayesian networks.

There are several factors that determine the performance of an actual implementation of the hybrid algorithm. If possible, the network should be constructed with less connectivity, and the node reduction may well be used to the maximum by carefully arranging the sequence of reductions. An elaborate use of Gibbs sampling in which the node visiting scheme and the volume of samples are carefully decided should also help enhance the performance.

Appendix A. Proof of Lemma 2

Proof: From the sigmoid definition and Eq. (1),

    π_j^new(x_j | x_C^new(j)) = Σ_{x_i∈{-1,1}} σ( x_j ( Σ_{k∈C^old(j)-{i}} x_k w_jk + x_i w_ji ) ) σ( x_i Σ_{k∈C^old(i)} x_k w_ik ).

Evaluating the right-hand side at the opposite configuration x̄_C^new(j) negates every conditioning value and hence both inner sums. Replacing the summation variable x_i by -x_i restores the second factor to σ( x_i Σ_{k∈C^old(i)} x_k w_ik ), and applying σ(t) = 1 - σ(-t) to the first factor gives

    π_j^new(x_j | x̄_C^new(j)) = Σ_{x_i∈{-1,1}} ( 1 - σ( x_j ( Σ_{k∈C^old(j)-{i}} x_k w_jk + x_i w_ji ) ) ) σ( x_i Σ_{k∈C^old(i)} x_k w_ik )
      = Σ_{x_i∈{-1,1}} σ( x_i Σ_{k∈C^old(i)} x_k w_ik ) - π_j^new(x_j | x_C^new(j))
      = 1 - π_j^new(x_j | x_C^new(j)),

since the two outcomes of X_i have probabilities summing to one. The equality with 1 - π_j^new(-x_j | x_C^new(j)) follows in the same way by negating x_j instead of the conditioning vector. □

Appendix B. Proof of Lemma 5

Proof: It suffices to show that π_i^new(x_i | x_C^new(i)) + π_i^new(-x_i | x_C^new(i)) = 1. From Eq. (2) and using Lemma 2,

    π_i^new(x_i | x_C^new(i)) + π_i^new(-x_i | x_C^new(i))
      = ( Σ_{x_i∈Ω_i} π_j^old(x_j | x_C^old(j)) π_i^old(x_i | x_C^old(i)) ) / π_j^new(x_j | x_C^new(j)).

Notice the numerator of the last fraction is, by Eq. (1), exactly the sigmoid definition of π_j^new(x_j | x_C^new(j)), so the sum equals 1. □

Acknowledgments

We would like to thank Dr. Lee Kang-hyuk for his kind proofreading and the young physicist In S. Han, who as an undergraduate student made useful suggestions. We also thank the anonymous referees for their valuable comments that greatly helped improve this paper.

References

1. G. Cooper, "The computational complexity of probabilistic inferencing using Bayesian networks," Artificial Intelligence, vol. 42, pp. 393-405, 1990.
2. G. Cooper and E. Herskovits, "A Bayesian method for the induction of probabilistic networks from data," Machine Learning, vol. 9, pp. 54-62, 1992.
3. P. Dagum and M. Luby, "Approximate probabilistic inference in Bayesian belief networks is NP-hard," Artificial Intelligence, vol. 60, pp. 141-153, 1993.
4. A.E. Gelfand and A.F.M. Smith, "Sampling-based approaches to calculating marginal densities," Journal of the American Statistical Association, vol. 85, no. 410, pp. 398-409, 1990.
5. D. Heckerman, Probabilistic Similarity Networks, MIT Press, 1991.
6. U. Kjaerulff, "HUGS: Combining exact inference and Gibbs sampling in junction trees," Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann: San Francisco, 1995.
7. M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz, "Building a large annotated corpus of English: The Penn Treebank," Computational Linguistics, vol. 19, no. 2, pp. 313-330, 1993.
8. N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller, "Equation of state calculations by fast computing machines," Journal of Chemical Physics, vol. 21, no. 6, pp. 1087-1092, 1953.
9. R.M. Neal, "Connectionist learning of belief networks," Artificial Intelligence, vol. 56, pp. 71-113, 1992.
10. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann: San Mateo, CA, 1988.
11. R.D. Shachter, "Probabilistic inferencing and influence diagrams," Operations Research, vol. 36, no. 4, pp. 589-604, 1988.
12. M. Singh and M. Valtorta, "Bayesian network structures from data," Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, pp. 259-265, 1993.
13. D.J. Spiegelhalter and S.L. Lauritzen, "Sequential updating of conditional probabilities on directed graphical structures," Networks, vol. 20, pp. 579-605, 1990.
14. J. York, "Use of the Gibbs sampler in expert systems," Artificial Intelligence, vol. 56, pp. 115-130, 1992.

Young S. Han is a researcher at the Center for Artificial Intelligence Research of the Korea Advanced Institute of Science and Technology (KAIST). He received his Ph.D. in 1995 from KAIST and has accepted a Postdoctoral Fellowship from the University of Pennsylvania. His current research interests include pattern recognition, computational linguistics, and machine learning.

Young C. Park is a graduate student in the Computer Science Program at the Korea Advanced Institute of Science and Technology (KAIST). He earned his master's degree in 1994 from KAIST. His current research focuses on information retrieval for Korean, data compression, and computational linguistics.

Key-Sun Choi is an associate professor of Computer Science at the Korea Advanced Institute of Science and Technology (KAIST). He obtained his master's and doctoral degrees from KAIST in 1980 and 1986 respectively. He is a co-founder of the Korea Society of Cognitive Science and the Korea Society of Hangul Research. He has organized an international conference in the field of computational linguistics. He also serves on the advisory boards of several government agencies and leads a consortium for the development of a comprehensive research environment for Korean (Korea Information Base System). His research interests include information retrieval, computational linguistics, and cognitive science.