Applied Intelligence 6, 275-285 (1996)
© 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Efficient Inferencing for Sigmoid Bayesian Networks by Reducing Sampling Space

YOUNG S. HAN, YOUNG C. PARK AND KEY-SUN CHOI
Computer Science Department, Korea Advanced Institute of Science and Technology, Taejon, 305-701, Korea
Abstract. A sigmoid Bayesian network is a Bayesian network in which a conditional probability is a sigmoid function of the weights of the relevant arcs. Its application domain includes that of the Boltzmann machine as well as traditional decision problems. In this paper we show that the node reduction method, an inferencing algorithm for general Bayesian networks, can also be used on sigmoid Bayesian networks, and we propose a hybrid inferencing method combining node reduction and Gibbs sampling. The time efficiency of sampling after node reduction is demonstrated through experiments. The results of this paper bring sigmoid Bayesian networks closer to large scale applications.

Keywords: Bayesian network, inferencing, Gibbs sampling

1. Introduction
The Bayesian model has been one of the major methods to handle uncertainty in artificial intelligence. The model gave birth to many implementations with different names (e.g., influence diagram, belief network, knowledge map, causal network, and causal probabilistic network). They have different model structures and problem domains. The model remains largely inefficient for application domains with a nontrivial number of variables, mainly due to the problems in knowledge acquisition and inference [5], though there have been attempts to deal with the difficulty of knowledge acquisition for general (non-restricted) Bayesian networks or sparse networks [2, 12]. Sigmoid Bayesian networks [9] have ways to acquire distribution information from samples mechanically, and thus enhance the efficiency of the model, though they are applicable to more limited domains. It is the inference problem for the sigmoid variation of Bayesian networks that this paper addresses.

We show that the node reduction algorithm by Shachter [11] can be applied to the sigmoid variation as well, and suggest hybrid inferencing that combines the node reduction and sampling approaches as a practical alternative. The inference algorithm based on node reduction was introduced by Shachter for the general definition of influence diagrams, but in practice it is hardly a feasible method because of its exponential complexity. It also needs to be mentioned that inferencing on restricted networks such as the Noisy-OR network [9] and Pearl's tree (and polytree) can be efficient [10]. Another inference algorithm for general Bayesian networks is Gibbs sampling, which originates in the Metropolis algorithm [4, 8]. Sampling is an approximation of the exact inference value, and does not take excessive time in most cases. Though the sampling may take only linear time in network size to reach equilibrium, approximation is theoretically also NP-hard [1, 3], and the cost can still become unbearably high with sufficiently large networks.

We propose a combination of the two approaches in which the node reduction algorithm is used to reduce the network size until the reduction cost reaches a certain threshold value, and sampling is then used on the reduced network. The resulting inference scheme is Gibbs sampling on a reduced network. The gain is the speed to achieve a desired accuracy, or the accuracy with fixed inference time. An alternative method proposed by Kjaerulff [6] to combine Gibbs sampling and an exact algorithm uses junction trees that are obtained via the process of triangulation.
The difficulty in realizing the Bayesian network lies in the exponential nature of the two problems mentioned above. The knowledge acquisition problem is about building a network. The construction of a Bayesian network can be done in two steps. In the first step, the dependency information among variables is specified, resulting in a topology of the network. The conditional probability distribution for each node is then provided, usually by human experts. A rigorous specification of the probability distribution information accounts for the complexity. The noisy-OR model is a variation of the Bayesian network that works with reduced overhead of the acquisition problem, though it is still NP-hard [9, 10], and Heckerman's Pathfinder was also proposed in this context [5]. The sigmoid Bayesian network encodes distribution information as the weights of arcs instead of tables in nodes. The mechanical acquisition of probability distribution in the sigmoid Bayesian network leads to a different class of applications.

Another source of complexity is inferencing on the model. Due to the exponential number of marginal probabilities to compute for an inference problem, an exact assessment of a Bayesian network is not feasible [5]. To do away with this problem, many special forms of Bayesian networks have been suggested, such as the one developed by Pearl [10]. Gibbs sampling on a reduced sigmoid Bayesian network turned out to be reasonably practical, judging from the tests with up to 250 nodes in which it took about ten seconds to solve an inference problem on a Sparc 10 workstation. Our experiments show that the hybrid inference method can run at only one-fifth of the sampling time on average.
Figure 1. An example Bayesian network. The conditional probability table of node a (car accident) is:

p(a=1 | s=0, r=0, f=0) = 0.2
p(a=1 | s=0, r=0, f=1) = 0.6
p(a=1 | s=0, r=1, f=0) = 0.3
p(a=1 | s=0, r=1, f=1) = 0.7
p(a=1 | s=1, r=0, f=0) = 0.5
p(a=1 | s=1, r=0, f=1) = 0.7
p(a=1 | s=1, r=1, f=0) = 0.4
p(a=1 | s=1, r=1, f=1) = 0.9
In Section 2, the basic concepts of Bayesian networks and sigmoid Bayesian networks are explained, and relevant inference algorithms are described. Section 3 shows that the node reduction algorithm by Shachter [11] is also applicable to the sigmoid Bayesian network. Experimental support for the efficiency of the hybrid inferencing on sigmoid Bayesian networks is given in Section 4.
2. Related Work
In this section some background material is briefly explained for the later development of our discussion. The basic definition of Bayesian networks and a couple of inferencing algorithms are covered, and the definition of the sigmoid Bayesian network follows.
2.1. Bayesian Network and Inference Algorithms
A Bayesian network is made of a graph and tables of conditional probability distributions. The graph structure takes the form of a directed acyclic graph (DAG) where a node represents a variable or a proposition in a domain. Each node produces an output value at any moment according to the conditional probability distribution maintained in the node. Usually a human expert is responsible for the construction of the graph and tables.

The example network in Fig. 1 has 1/0 valued nodes, and the arc from node rain to node car accident indicates the conditional dependency between the two nodes. The figure shows that the chance of a car accident
in a street depends on the combination of various factors. Each node has its own conditional probability table, such as the one for node car accident shown in the figure.

Given a network for the above description, a state of the network is a set of output values of the nodes. The probability of a state, or the joint probability of the nodes, is simply the product of the conditional probabilities specified in the tables. For example, a state probability of the example network in Fig. 1 can be computed as follows:
P(r=1, s=1, f=0, a=1, c=0, t=1)
  = P(r=1) P(s=1) P(f=0 | r=1)
  × P(a=1 | s=1, r=1, f=0)
  × P(c=0 | a=1) P(t=1 | a=1).

This is based on the fact that there should be at least one topological order of the nodes, but in actual computation the order of the nodes need not be known, since the multiplication is commutative.
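To make the computation concrete, here is a minimal Python sketch of this product. The table for node a is taken from Fig. 1; the remaining priors and conditionals are not printed in the paper, so the numbers below are placeholder values for illustration only.

```python
# Joint probability of the state (r=1, s=1, f=0, a=1, c=0, t=1) for Fig. 1.
# Only p(a=1 | s, r, f) is given in the paper; the other numbers below are
# placeholder values for illustration.
P_A1 = {(0, 0, 0): 0.2, (0, 0, 1): 0.6, (0, 1, 0): 0.3, (0, 1, 1): 0.7,
        (1, 0, 0): 0.5, (1, 0, 1): 0.7, (1, 1, 0): 0.4, (1, 1, 1): 0.9}

P_R1, P_S1 = 0.3, 0.4        # assumed priors p(r=1), p(s=1)
P_F0_GIVEN_R1 = 0.8          # assumed p(f=0 | r=1)
P_C0_GIVEN_A1 = 0.5          # assumed p(c=0 | a=1)
P_T1_GIVEN_A1 = 0.6          # assumed p(t=1 | a=1)

# Product of the conditionals, in any order (multiplication is commutative).
state_prob = (P_R1 * P_S1 * P_F0_GIVEN_R1
              * P_A1[(1, 1, 0)]          # p(a=1 | s=1, r=1, f=0) = 0.4
              * P_C0_GIVEN_A1 * P_T1_GIVEN_A1)
print(state_prob)
```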
Computing a conditional probability between arbitrary sets of the variables accounts for a decision process or probabilistic inferencing, but is known to be very difficult. A conditional probability such as P(c = 1, t = 1 | r = 1, f = 1) is an inference problem for the example in Fig. 1. A straightforward computation of the conditional probability requires an exponential number of joint probabilities of the nodes. The inferencing problem turned out to be an NP-hard problem for non-restricted Bayesian networks [1, 5]. For some special graph structures such as trees, polytrees, or Noisy-OR, there are efficient inference algorithms [5, 9, 10]. In the following we review an inference algorithm by Shachter [11] that we will apply to the sigmoid Bayesian network in the next section.
Whenever possible, the notations of Shachter [11] are used to avoid unnecessary confusion. The nodes of a network are indexed by N = {1, ..., n}, and the corresponding variables of the nodes are denoted by X_1, ..., X_n. For each variable X_i, a finite set of outcomes, Ω_i, is defined, and there is a conditional probability distribution table for each node, denoted by π_i. The conditioning nodes of node i are represented by C(i). Capital letters such as I and J represent sets of nodes. For example, in the example network C(a) includes s, r, and f.
The node reduction algorithm by Shachter is based on the observation that there can be nodes that are irrelevant to a given inference problem, and that any node that is not a variable of the given inference problem can be made irrelevant. Such irrelevant nodes, called barren, can be eliminated without distorting the inference value being computed. Once an inference problem is given, the node reduction is accomplished in two steps. The first step is to identify the nodes that are already barren and delete them. The second step is then to make other nodes barren. By definition, barren nodes are those that do not reach any node corresponding to the variables of a given inference problem. Graph-theoretically, barren nodes are defined as follows.
Definition 1. Given an inference problem P(J | K) and a Bayesian network, a node i is barren iff node i belongs to neither J nor K and has no outgoing edges, where J and K are sets of nodes (variables) constituting the inference problem.
In Fig. 1, given P(t = 1 | r = 1, f = 1) to compute, c becomes barren. It can be easily seen that the conditional probability is not affected by the presence of node c. The conditioned node t is connected to a temporary node X_0 as a preparatory step before making other nodes barren. Now every node except X_0 and the conditioning nodes, r and f, can be eliminated. Making a node barren is done by reversing the outgoing arcs from the node and updating the conditional probability tables accordingly to preserve the inference value, thereby isolating the node from the rest of the network. Figure 2 shows the resulting graph after node reduction on the example network is completed with respect to the above inference problem. The probability table of X_0 should contain the desired inference value.

The node reduction algorithm is inherently exponential since it is a process of updating the probability tables, whose storage is exponential in nature. The algorithm may not be practical for applications with heavy connectivity.
Figure 2. Resulting network after node reduction.
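The first step of node reduction, deleting nodes that are already barren in the sense of Definition 1, amounts to repeatedly removing non-query nodes that have no surviving outgoing arcs. The following Python sketch is our illustration (the dictionary-of-children representation of the DAG is an assumption), applied to the network of Fig. 1:

```python
def eliminate_barren(children, J, K):
    """Delete barren nodes: outside J and K with no surviving children.
    `children` maps each node to the set of its child nodes (the DAG)."""
    nodes = set(children)
    keep = set(J) | set(K)
    changed = True
    while changed:                       # removing one node may expose others
        changed = False
        for n in list(nodes):
            if n not in keep and not (children[n] & nodes):
                nodes.discard(n)         # n is barren: eliminate it
                changed = True
    return nodes

# Network of Fig. 1; for P(t=1 | r=1, f=1), node c is barren and disappears.
dag = {"r": {"f", "a"}, "s": {"a"}, "f": {"a"},
       "a": {"c", "t"}, "c": set(), "t": set()}
print(eliminate_barren(dag, J={"t"}, K={"r", "f"}))  # {'r', 's', 'f', 'a', 't'}
```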
Given an inference problem P(J | K) and a Bayesian network with n nodes,

1. Set the output value of the nodes in K to 1.
2. For each of the n nodes not in K, compute the distribution of the node conditioned by the rest of the network. This distribution is given by the following proportionality, as explained in [9, 10]:
   P(X_i = x_i | X_{j≠i}) ∝ P(X_i = x_i | X_{j<i}) Π_{j>i} P(X_j = x_j | X_i = x_i, X_{k≠i, k≠j}).
3. Determine the outcome of the current node according to the above distribution.
4. Move to the next node till all the nodes are visited.
5. Increase a counter if all the nodes in J display value 1 in the current network.
6. Repeat steps 2 to 5 until a certain saturation condition is met.
7. Report the inference value by normalizing the counter by the total number of repetitions.

Figure 3. An instance of the Gibbs sampling method.
Gibbs sampling is an approximation technique based on stochastic simulation, and is regarded as one of the inference methods [4, 8, 9, 14]. Gibbs sampling accumulates a sample by visiting nodes in turn and changing the output value of each node. The value of each node is updated according to the probability of the node conditioned by the values of all the other nodes, and the convergence of the approximation is guaranteed if, in a given visiting scheme, each node can be visited infinitely often. Figure 3 summarizes the Gibbs sampling algorithm.
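As an illustration of the procedure of Fig. 3, the sketch below implements the sampler in Python for binary 0/1 nodes. The `full_conditional` interface, which must return P(X_i = 1 | all other node values) as in step 2, is our assumption, and the saturation condition of step 6 is simplified to a fixed number of sweeps.

```python
import random

def gibbs_inference(nodes, J, K, full_conditional, n_sweeps=10000):
    """Gibbs sampler of Fig. 3 for binary 0/1 nodes.  `full_conditional(i,
    state)` must return P(X_i = 1 | all other node values) (step 2)."""
    state = {i: 1 if i in K else random.choice((0, 1)) for i in nodes}  # step 1
    hits = 0
    for _ in range(n_sweeps):            # step 6, with a fixed sweep budget
        for i in nodes:                  # step 4: visit every node in turn
            if i in K:
                continue                 # evidence nodes stay clamped at 1
            p1 = full_conditional(i, state)
            state[i] = 1 if random.random() < p1 else 0   # step 3
        if all(state[j] == 1 for j in J):
            hits += 1                    # step 5
    return hits / n_sweeps               # step 7
```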
Gibbs sampling belongs to a better known class of sampling methods derived from the Metropolis algorithm. Simulated annealing is another example of such sampling. A discussion of the performance of the sampling method is given in Section 4.
2.2. Sigmoid Bayesian Network
Graphically there is no difference between traditional Bayesian networks and sigmoid Bayesian networks. In the sigmoid Bayesian network, however, the conditional probability distribution is encoded through the weights of the arcs between conditioning and conditioned nodes, and an actual probability is computed on demand from a sigmoid function whose input value is determined by the arc weights and node values. A popular sigmoid function is σ(t) = 1/(1 + exp(−t)). The definition of a sigmoid function is as follows.
Definition 2. A function f(x) is sigmoidal iff

1. it is an odd function, in which f(−x) = −f(x),
2. lim_{x→∞} f(x) = 0.5 and lim_{x→−∞} f(x) = −0.5, and
3. it is monotone increasing.

A condition that combines conditions 1 and 2 of the above definition in our context is

g(x) + g(−x) = 1,

where g(x) is the function derived from f(x) by adding 0.5.

Definition 3. A function g(x) is positive sigmoidal iff there exists a function f(x) such that g(x) = f(x) + 0.5, or equivalently g(x) + g(−x) = 1, where f(x) is sigmoidal.

When we say sigmoid, it is in fact positive sigmoid according to our definition. There are numerous sigmoid functions, such as hyperbolic tangent functions.

With nodes taking binary values −1/1, a conditional probability is defined as follows:

p(X_i = x_i | X_C(i) = x_C(i)) = σ(x_i Σ_{j∈C(i)} x_j · w_ij),
where x_i can be either −1 or 1, x_C(i) represents a vector of binary values, and w_ij is the weight of the arc from node j to node i.

The gain from sigmoid Bayesian networks is the efficiency of encoding the probability information, and the loss is the inflexibility of the probability distribution to be encoded: only values following a sigmoid distribution can be represented. The change in fact brings about a radical shift in the applications of the network. A mechanical acquisition of the conditional probability becomes possible in the sigmoid Bayesian network, as in connectionist models. The applications now include the pattern classification problems that used to be tackled by neural networks and hidden Markov models. More on the sigmoid variation of Bayesian networks can be found in [9, 13].
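For concreteness, the conditional probability of the definition above can be computed as in the following sketch; the dictionary-based interface is our illustration, not the paper's.

```python
import math

def sigma(t):
    """The sigmoid of Section 2.2: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def conditional(x_i, parent_values, weights):
    """p(x_i | x_C(i)) = sigma(x_i * sum_j x_j * w_ij) for outcomes in {-1, 1}.
    `parent_values` and `weights` are dicts keyed by the parent nodes j."""
    return sigma(x_i * sum(parent_values[j] * weights[j] for j in weights))

# The two outcomes are complementary, as required of a (positive) sigmoid,
# since sigma(t) + sigma(-t) = 1:
pv, w = {"a": 1, "b": -1}, {"a": 0.7, "b": 1.2}
assert abs(conditional(1, pv, w) + conditional(-1, pv, w) - 1.0) < 1e-12
```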
3. Node Reduction in Sigmoid Bayesian Network
In this section, we show that the same node reduction that Shachter [11] applied to influence diagrams can be done on sigmoid Bayesian networks. The major difference between sigmoid Bayesian networks and influence diagrams is that the former does not have a probability distribution table for each node. It is our goal to show that the node reduction algorithm works in spite of the difference. First we begin by explaining Shachter's node reduction algorithm.

The process of node reduction is basically to solve a given inference problem P{f(X_J) | X_K}, where J and K are arbitrary subsets of the N nodes and f() is a real-valued function. Every node in the network except the nodes in K and a special node replacing the nodes in J is eliminated, preserving the probabilistic information with regard to the given inference problem. The first type of reduction is barren node reduction. Barren nodes are those whose information is orthogonal to the given inference problem; in other words, they do not lead to any node in J or K.
Proposition 1 (Barren Node Reduction [11]). If node i is barren with respect to J and K, then it can be eliminated from the Bayesian network without changing the value of P{f(X_J) | X_K}.
If a node is not barren, then the network can be changed such that the node becomes barren and the new network preserves the information of the original one with respect to the inference problem. More specifically, a node can be made barren by reversing its outgoing arcs.

Figure 4. Typical situation of an arc reversal.

A general context of arc reversal is illustrated in Fig. 4, where I = C^old(i) − C^old(j), J = C^old(j) − C^old(i) − {i}, and K = C^old(i) ∩ C^old(j). In the new configuration, C^new(j) = I ∪ J ∪ K and C^new(i) = I ∪ J ∪ K ∪ {j}. Notice that there are two types of new arcs created. The new arcs from the nodes in I to node j compensate for the influence the nodes in I used to deliver to j through i. The new arcs from the nodes in J to node i cancel the undesired influence that the nodes in J would otherwise carry to i through j.
Proposition 2 (Arc Reversal [10]). Given a fully specified Bayesian network containing an arc from node i to node j with no other directed path from i to j, it is possible to transform the network to one with an arc from j to i.
From the proof of Proposition 2 by Shachter, the following updating rules are given for the conditional probability distribution tables of nodes i and j once the arc is reversed. For X_j,

π_j^new(x_j | x_C^new(j)) = Σ_{x_i∈Ω_i} π_j^old(x_j | x_C^old(j)) π_i^old(x_i | x_C^old(i)),   (1)
and for X_i,

π_i^new(x_i | x_C^new(i)) = π_j^old(x_j | x_C^old(j)) π_i^old(x_i | x_C^old(i)) / π_j^new(x_j | x_C^new(j)),   (2)

where π_j(x_j | x_C(j)) is the probability that X_j takes a certain outcome given the conditioning vector, and corresponds to a table entry in general Bayesian networks. The arc reversal is completed when the probability distribution tables of i and j are updated according to the above equations. Once the arc reversal is done, node i becomes barren and can be eliminated. By Propositions 1 and 2, any node in the network can be removed, and when this process is directed intentionally, an inference problem can be solved.
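A minimal numeric instance of the two updates may help. The sketch below reverses the single arc of a two-node network i → j (so the sets I, J, and K of Fig. 4 are all empty), in which case Eq. (1) is a marginalization and Eq. (2) reduces to Bayes' rule; the table values are made up for illustration.

```python
# Arc reversal for the smallest case: network i -> j with no other parents,
# so the sets I, J, and K of Fig. 4 are all empty.  Table values are made up.
pi_i_old = {1: 0.3, -1: 0.7}                   # pi_i(x_i)
pi_j_old = {(1, 1): 0.9, (-1, 1): 0.1,         # pi_j(x_j | x_i),
            (1, -1): 0.2, (-1, -1): 0.8}       # keyed by (x_j, x_i)

# Eq. (1): sum node i out of node j's table.
pi_j_new = {xj: sum(pi_j_old[(xj, xi)] * pi_i_old[xi] for xi in (1, -1))
            for xj in (1, -1)}

# Eq. (2): node i's new table, now conditioned on j (Bayes' rule).
pi_i_new = {(xi, xj): pi_j_old[(xj, xi)] * pi_i_old[xi] / pi_j_new[xj]
            for xi in (1, -1) for xj in (1, -1)}

print(pi_j_new)            # {1: 0.41, -1: 0.59}
print(pi_i_new[(1, 1)])    # P(x_i = 1 | x_j = 1) = 0.27 / 0.41, about 0.6585
```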
It is obvious that Proposition 1 (barren node reduction) is directly applicable to sigmoid Bayesian networks, so it remains to show the applicability of arc reversal to sigmoid Bayesian networks. This can be proved by showing that the probability distributions of the above two equations are captured in a sigmoid function with modified weights of the arcs.

As mentioned earlier, in this paper we assume the outcomes of a node in sigmoid Bayesian networks are −1 and 1; thus Ω = {−1, 1}. By definition, the sigmoid interpretation of the probability that X_j takes x_j when conditioned by the value x_C(j) of the conditioning variables X_C(j) is

π_j(x_j | x_C(j)) = σ(x_j Σ_{k∈C(j)} x_k · w_jk).
Then the following lemma readily follows from the property of the sigmoid function.
Lemma 1. For each node of a sigmoid Bayesian network, the conditional probabilities in a sigmoid distribution observe the following relation:

π_j(x_j | x_C(j)) = 1 − π_j(−x_j | x_C(j)) = 1 − π_j(x_j | x̄_C(j)),

where x_C(j) is a vector of the outcomes of C(j), and x̄_C(j) is defined to be the vector of the opposite configuration of x_C(j), so that x̄_C(j) = −x_C(j). For example, if x_C(j) is (−1, 1, 1), x̄_C(j) becomes (1, −1, −1). For a sigmoid function σ(), σ(t) = 1 − σ(−t) holds, which in fact is a necessary condition for a distribution of t to be captured in a sigmoid function. It should also be shown that the distribution is monotone increasing (or, more strongly, continuously increasing) in order for the distribution to be sigmoidal.

An arc reversal entails updates of the weights of existing arcs as well as the creation of new arcs. The changes of the conditional probability distributions of nodes i and j conform to the distributions that Eqs. (1) and (2) specify. In the case of general Bayesian networks, the changes are directly encoded in the distribution tables. As for sigmoid Bayesian networks, however, the distributions from the equations should follow the sigmoid distribution; otherwise, they cannot be captured through the weights of the arcs in the new configuration. The following lemmas show that the conditional probability distributions of nodes i and j can be captured in sigmoid functions, by showing that they satisfy the sufficient conditions to be qualified as sigmoid functions.

Lemma 2. For node j in Fig. 4, the conditional probability distribution according to Eq. (1) satisfies Lemma 1. That is,

π_j^new(x_j | x_C^new(j)) = 1 − π_j^new(−x_j | x_C^new(j)) = 1 − π_j^new(x_j | x̄_C^new(j)).

Proof: See Appendix A for a proof. □

Lemma 3. π_j^new(x_j | x_C^new(j)) in Eq. (1) is continuous.

Proof: It suffices to show that Eq. (1) always defines an output value for every value of the old configurations and that a limit exists at each value. π_j^old(x_j | x_C^old(j)) and π_i^old(x_i | x_C^old(i)) are supposed to be sigmoidal or continuous and return values between 0 and 1 at all times. It is obvious that multiplying and summing those numbers will always be a number. Taking the limit of the right-hand side of Eq. (1),

lim π_j^new(x_j | x_C^new(j)) = lim Σ_{x_i∈Ω_i} π_j^old(x_j | x_C^old(j)) π_i^old(x_i | x_C^old(i)).

Since limits are additive and multiplicative, the above equality is obvious. □

Lemma 4. π_j^new(x_j | x_C^new(j)) in Eq. (1) is increasing.
Proof: Assuming that the nodes produce binary outcomes, Eq. (1) is like a function of four variables, as follows:

g(a, b, c, d) = f(a)f(b) + f(c)f(d),

where f() is a sigmoid function. Now the derivative of g() looks as follows:

g'(a, b, c, d) = f'(a)f(b) + f(a)f'(b) + f'(c)f(d) + f(c)f'(d).

Obviously, g'(a, b, c, d) ≥ 0. This implies Eq. (1) is an increasing function. □

Theorem 1. The distribution of Eq. (1) can be captured in a sigmoid function.

Proof: Lemmas 2, 3, and 4 fulfill the sufficiency for the equation to be sigmoidal. □

Lemma 5. For node i in Fig. 4, the conditional probability distribution according to Eq. (2) satisfies Lemma 1. That is,

π_i^new(x_i | x_C^new(i)) = 1 − π_i^new(−x_i | x_C^new(i)) = 1 − π_i^new(x_i | x̄_C^new(i)).

Proof: See Appendix B for a proof. □

Lemma 6. π_i^new(x_i | x_C^new(i)) in Eq. (2) is continuous.

Proof: This can be shown in a similar manner to Lemma 3. □

Lemma 7. π_i^new(x_i | x_C^new(i)) in Eq. (2) is increasing.

Proof: This can be shown in a similar manner to Lemma 4. □

Theorem 2. The distribution of Eq. (2) can be captured in a sigmoid function.

Proof: Lemmas 5, 6, and 7 fulfill the sufficiency for the equation to be sigmoidal. □

From the above Theorems 1 and 2, we can now conclude that the new conditional probability distributions can be captured in the sigmoid function.

Theorem 3 (Arc Reversal in Sigmoid Bayesian Networks). Given a sigmoid Bayesian network containing an arc from node i to node j and with no other directed path from i to j, it is possible to transform the network to one with an arc from j to i.

Proof: The sigmoid interpretation of Eq. (1) gives a system of linear equations of the following form. Suppose there are three arcs to node j, w_I, w_K, and w_J, from the three groups of nodes I, K, and J respectively. The π_j's are those values computed from Eq. (1). There are as many linear equations as two to the power of the number of arcs. By Theorem 1, which shows that π_j^new conforms to sigmoid distributions, we know that there exists a solution to the above linear equations. By the same argument, according to Theorem 2 there exists a unique assignment of weights to the arcs that satisfies π_i^new. Thus the arc from i to j can be reversed in sigmoid Bayesian networks. □

One way to compute each arc weight is to subtract one vector equation from another that has the same polarity except for the arc to be computed. For instance, if w_I is to be computed, using the vectors

−w_I − w_K − w_J = k_1,
+w_I − w_K − w_J = k_2

gives w_I = −(k_1 − k_2)/2.

Theorem 3 does not say that a particular sigmoid function must be used in describing the new distribution after arc reversal. The selection of a sigmoid function, however, will determine the inferencing behavior or sensitivity to the input value, and thus should depend on the performance characteristics of the application.

Reversing an arc, however, has nontrivial time complexity. The high time complexity of arc reversal is one reason that gives rise to our approach to solving the inference problems.

Theorem 4. The time complexity of an arc reversal in sigmoid Bayesian networks is O(N²), where N is the number of nodes of a given network.
Proof: An arc reversal consists mainly of updating the arc weights to reflect the reversal. If we regard a visit to a node as a unit cost, an arc update takes on the order of N visits. An arc update is done by computing Eqs. (1) and (2) and solving linear equations. There are at most 4N visits for Eq. (1) and 3N visits for Eq. (2). Also, solving the linear equations requires an additional 2N corresponding to the two vectors. There are at most 2N arcs to update. Together, an arc reversal takes time on the order of N². □
Consequently, making a node barren takes time complexity of O(N³), as there can be at most N − 1 arcs to reverse. In practice, the number of arcs to reverse to make a node barren can vary from 0 to N, and as the number becomes bigger, the cost of node reduction may increase beyond the limit of average computing resources in many applications. After all, an inference takes O(N⁴) time, since there are O(N) nodes to be made barren and removed.

Though all the nodes in the network can be made barren and eliminated, the efficiency of reduction depends on the order of reduction. Searching for an optimal order is known to be NP-hard [11].
The hybrid approach proposed in this paper combines two inference methods: node reduction and Gibbs sampling. The biggest barrier to Gibbs sampling is the size of the network from which sample states are generated. The sample space increases as two to the power of the network size. By sample space we mean the space of unique states from which sample vectors (states) are collected. The bigger the sample space, the more diverse the samples will be. On the other hand, node reduction suffers from high computational cost for sufficiently large networks. The node reduction method is particularly fragile in networks with heavy connectivity because its complexity is directly affected by the number of arcs to reverse.

Node reduction and Gibbs sampling complement each other's drawbacks, which may allow a practical way to the inference problems of large applications. In the next section we describe the proposed inferencing algorithm in detail.
4. Gibbs Sampling on Reduced Network
The basic idea of our proposal is to reduce the Bayesian network and apply Gibbs sampling to solve a given inference problem. In this section we show that a sigmoid Bayesian network containing hundreds of nodes can be put to practical use when the hybrid inferencing is used.

A question arises as to when to stop the node reduction before Gibbs sampling starts. Obviously, arc reversal must be stopped when the cost of an arc reversal is sufficiently high because of many arc updates. The number of incoming arcs of the two nodes directly involved in the reversal determines the complexity of an arc reversal. How high the connectivity may be allowed to grow certainly depends on the available computing resources, and hence the threshold connectivity should be determined after test runs, as a function of the incoming edges of the two nodes involved in the arc to reverse. A sketch of this stopping rule follows.
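The paper gives no code, so the sketch below is our rendering of the hybrid scheme; the callables passed in (the cost estimate, the reduction step, and the sampler) are assumed interfaces standing for the operations described above.

```python
def hybrid_inference(net, order, J, K, threshold,
                     reversal_cost, make_barren, gibbs):
    """Reduce the network while reduction stays cheap, then Gibbs-sample.
    `order` lists candidate nodes to remove; `reversal_cost(net, n)` estimates
    the cost of making n barren (it grows with the incoming arcs of the nodes
    involved); `make_barren(net, n)` performs the arc reversals of Eqs. (1)
    and (2) and deletes n; `gibbs(net, J, K)` is a sampler such as the one
    sketched in Section 2."""
    for n in order:
        if n in J or n in K:
            continue                      # query and evidence nodes remain
        if reversal_cost(net, n) > threshold:
            break                         # connectivity too high: stop reducing
        make_barren(net, n)
    return gibbs(net, J, K)               # sampling on the reduced space
```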
Experiments were done to see how much efficiency was gained from the hybrid inferencing. The domain of the experimental network is a lexical study. The nodes correspond to words and the arcs to the cooccurrence dependencies of the words as observed in raw texts. The arc from node j to node i indicates that node i has taken place within a certain distance after the event of node j in the sample texts. A cycle caused by an addition of an arc or node can be prevented by creating a duplicate of the node causing the cycle. At most two nodes for each word are enough to avoid the cycles.

In our case, the sample texts are selected from the Penn Treebank corpus [7]. The Wall Street Journal articles are used in collecting the bigram cooccurrence statistics from which the network topology and the arc weights are obtained. The 250 nontrivial words with high frequency are selected from the texts; then order-sensitive bigrams are computed with window size 5. The arc weights are determined by normalizing the frequencies of the arcs by the highest frequency, as sketched below.
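A sketch of this weighting scheme (with simplified tokenization; the function and its interface are our illustration) is:

```python
from collections import Counter

def bigram_weights(tokens, vocab, window=5):
    """Arc weights from order-sensitive co-occurrence counts: an arc j -> i
    when word i appears within `window` tokens after word j; weights are the
    counts normalized by the highest frequency."""
    counts = Counter()
    for pos, w1 in enumerate(tokens):
        if w1 not in vocab:
            continue
        for w2 in tokens[pos + 1 : pos + window]:   # the following words
            if w2 in vocab:
                counts[(w1, w2)] += 1               # arc from w1 to w2
    top = max(counts.values()) if counts else 1
    return {arc: c / top for arc, c in counts.items()}
```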
It turned out that the experimental results using the frequency weighting closely resembled our intuition. Table 1 illustrates the inference problems used in the experiments and their results. The values in the table are from the reduction method, but the other methods (sampling and hybrid) gave similar values within a 0.01 difference range. It seems reasonable that P(at | good) > P(on | good), and the word gain is more closely related to large than to if.

To see how well the different inference methods perform on different network sizes, we prepared nine different networks with varying numbers of nodes (from 20 to 250 nodes), and tested the same set of inference problems
Table 1. Example inference results from node reduction methods for a network with 150 nodes.

Inference problems        Inference outcomes
P(year | company)         0.500057
P(dollar | company)       0.535875
P(plant | company)        0.603191
P(election | president)   0.500002
P(sale | president)       0.500000
P(on | good)              0.500000
P(at | good)              0.681567
P(large | gain)           0.501558
P(if | gain)              0.500000
P(market | cash)          0.530684
Figure 5. Performance comparison of node reduction, sampling, and hybrid methods (inference time in seconds versus network size in number of nodes). The experiments were done on a Sparc 10 workstation. The sampling method shows a clear linear increase, the reduction method follows a relatively slow nonlinear curve, and the hybrid method displays an even slower nonlinear curve. The curves indicate that the hybrid method will outperform the other methods for networks with more than 200 nodes.
in Table 1 on the reduction, sampling, and hybrid methods. All three methods were run to produce the inference values within a 0.01 value difference on a Sparc 10 workstation. The value difference range is particularly critical to the sampling and hybrid methods because the sampling cost can be saved considerably if a larger difference is allowed.

As shown in Fig. 5, the reduction method performed remarkably well with a small number of nodes. It started to slow down at the highest rate beyond 150 nodes, while the sampling method maintained the same rate of increase over the different networks. Because the hybrid method uses node reduction, it may also show behaviour similar to the reduction method in the long run, but at a reduced rate.

Figure 5 implies that the reduction method may be less efficient than the sampling method for sufficiently large networks. It can also be said that the hybrid method will tend to stay below the other two curves for larger networks.

In summary, the experiments suggest strong evidence for the potential of the hybrid method for networks with more than 200 nodes.
5. Conclusion
The sigmoid Bayesian network provides an alternative to the knowledge acquisition problem of Bayesian networks for some applications. Learning similar to that used in neural networks can be employed, as discussed by Neal [9], but the inferencing still remains a serious barrier to large scale applications. A known result is that an accurate assessment on general Bayesian networks is an NP-hard problem; thus, arguably, an approximation technique such as Gibbs sampling can be a practical approach to the problem, in particular for large applications.

Gibbs sampling, however, is not sufficiently cheap, and thus may not be useful for real time applications. We have shown that the node reduction algorithm by Shachter can be used on sigmoid Bayesian networks, and that the Gibbs sampling approach can be made more practical when used with the reduction algorithm. It is clear that the proposed inference scheme can also be applied to the other variations of Bayesian networks.

There are several factors that determine the performance of an actual implementation of the hybrid algorithm. If possible, the network should be constructed with less connectivity, and the node reduction may well be used to the maximum by carefully arranging the sequence of reductions. An elaborate use of Gibbs sampling, in which the node visiting scheme and the volume of samples are carefully decided, should also help enhance the performance.
Appendix A. Proof of Lemma 2

Proof: From the sigmoid definition of Eq. (1),

π_j^new(x_j | x_C^new(j)) = Σ_{x_i∈{−1,1}} σ(x_j Σ_{k∈C^old(j)} x_k · w_jk) σ(x_i Σ_{k∈C^old(i)} x_k · w_ik)
...
= 1 − π_j^new(x_j | x̄_C^new(j)). □

Appendix B. Proof of Lemma 5

Proof: It suffices to show that π_i^new(x_i | x_C^new(i)) + π_i^new(x̄_i | x_C^new(i)) = 1. From Eq. (2) and using Lemma 2,
...
Notice that the numerator of the last fraction is the sigmoid definition of π_j^new(x_j | x_C^new(j)). □
Acknowledgments

We would like to thank Dr. Lee Kang-hyuk for his kind proofreading and the young physicist In S. Han, who as an undergraduate student made useful suggestions. We also thank the anonymous referees for their valuable comments, which greatly helped improve this paper.
References
1. G. Cooper, "The computational complexity of probabilistic inference using Bayesian networks," Artificial Intelligence, vol. 42, pp. 393-405, 1990.
2. G. Cooper and E. Herskovits, "A Bayesian method for the induction of probabilistic networks from data," Machine Learning, vol. 9, pp. 54-62, 1992.
3. P. Dagum and M. Luby, "Approximate probabilistic inference in Bayesian belief networks is NP-hard," Artificial Intelligence, vol. 60, pp. 141-153, 1993.
4. A.E. Gelfand and A.F.M. Smith, "Sampling-based approaches to calculating marginal densities," Journal of the American Statistical Association, vol. 85, no. 410, pp. 398-409, 1990.
5. D. Heckerman, Probabilistic Similarity Networks, MIT Press, 1991.
6. U. Kjaerulff, "HUGS: Combining exact inference and Gibbs sampling in junction trees," in Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann: San Francisco, 1995.
7. M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz, "Building a large annotated corpus of English: The Penn Treebank," Computational Linguistics, vol. 19, no. 2, pp. 313-330, 1993.
8. N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller, "Equation of state calculations by fast computing machines," Journal of Chemical Physics, vol. 21, no. 6, pp. 1087-1092, 1953.
9. R.M. Neal, "Connectionist learning of Bayesian networks," Artificial Intelligence, vol. 56, pp. 71-113, 1992.
10. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann: San Mateo, CA, 1988.
11. R.D. Shachter, "Probabilistic inference and influence diagrams," Operations Research, vol. 36, no. 4, pp. 589-604, 1988.
12. M. Singh and M. Valtorta, "Bayesian network structures from data," in Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, pp. 259-265, 1993.
13. D.J. Spiegelhalter and S.L. Lauritzen, "Sequential updating of conditional probabilities on directed graphical structures," Networks, vol. 20, pp. 579-605, 1990.
14. J. York, "Use of the Gibbs sampler in expert systems," Artificial Intelligence, vol. 56, pp. 115-130, 1992.
Young S. Han is a researcher at the Center for Artificial Intelligence Research of the Korea Advanced Institute of Science and Technology (KAIST). He received his Ph.D. in 1995 from KAIST and has accepted a Postdoctoral Fellowship from the University of Pennsylvania. His current research interests include pattern recognition, computational linguistics, and machine learning.
Young C. Park is a graduate student in the Computer Science Program at the Korea Advanced Institute of Science and Technology (KAIST). He earned his master's degree in 1994 from KAIST. His current research focuses on information retrieval for Korean, data compression, and computational linguistics.
Key-Sun Choi is an associate professor of Computer Science at the Korea Advanced Institute of Science and Technology (KAIST). He obtained his master's and doctoral degrees from KAIST in 1980 and 1986, respectively. He is a co-founder of the Korea Society of Cognitive Science and the Korea Society of Hangul Research. He has organized an international conference in the field of computational linguistics. He also serves on the advisory boards of several government agencies and leads a consortium for the development of a comprehensive research environment for Korean (Korea Information Base System). His research interests include information retrieval, computational linguistics, and cognitive science.