An efficient Gibbs sampler for structural inference in Bayesian networks

Robert J. B. Goudie and Sach Mukherjee
Department of Statistics
University of Warwick
Coventry, CV4 7AL
June 9, 2011
Abstract
We propose a Gibbs sampler for structural inference in Bayesian networks. The standard Markov chain Monte Carlo (MCMC) algorithms
used for this problem are random-walk Metropolis-Hastings samplers, but
for problems of even moderate dimension, these samplers often exhibit
slow mixing. The Gibbs sampler proposed here conditionally samples
the complete set of parents of a node in a single move, by blocking together particular components. These blocks can themselves be paired
together to improve the efficiency of the sampler. The conditional distribution used for sampling can be viewed as a posterior distribution for a
constrained Bayesian variable selection for the parents of a node. This
view sheds further light on the increasingly well understood connection
between Bayesian variable selection and structural inference. We empirically examine the performance of the sampler using data simulated from
the ALARM network.
1 Introduction
Bayesian networks provide a framework for encoding probabilistic dependence
statements about the components of a multivariate probability distribution.
Random variables are represented by vertices of a directed acyclic graph (DAG).
The distribution of each random variable is specified conditional on its parents in
the graph. The full distribution is therefore described recursively. These models
are widely used in statistics, artificial intelligence and diverse other areas.
If the conditional distributions of the component random variables share a parametric form that makes sense under any dependence structure, we can consider
the space of Bayesian networks to be the space in which model selection or averaging takes place. In a Bayesian framework, selection is performed using the
marginal likelihood. This has a closed form if tractable conjugate priors are used
for the conditional distributions. Nonetheless, exact structural inference is not
usually possible because the large cardinality of the space of DAGs precludes an exhaustive search or enumeration.
Structural inference for Bayesian networks has been widely studied, and a number of different approaches are available. Several MCMC samplers have been proposed, including MC3 (Madigan and York, 1995) and variants that improve
its efficiency (Giudici and Castelo, 2003; Grzegorczyk and Husmeier, 2008). In
high, or even moderate dimensions, it is not straightforward to construct an
MCMC sampler that converges rapidly to its target distribution. One attempt
to address this problem has been to perform MCMC in order space (Friedman
and Koller, 2003; Ellis and Wong, 2008; Eaton and Murphy, 2007). These approaches have more recently led to exact methods (Koivisto and Sood, 2004;
Parviainen and Koivisto, 2009) using dynamic programming. An alternative
class of approaches is constraint-based methods (Spirtes et al., 2000; Kalisch
and Bühlmann, 2008; Xie and Geng, 2008).
In this paper we propose a method for constructing Gibbs samplers for structural
inference of Bayesian networks. The standard MCMC algorithms used for this
problem propose small changes to the current state, and accept these proposals
according to the usual Metropolis-Hastings acceptance probability. The Gibbs
sampler proposed here considers the parents of each node as a single component
(by constructing blocks), and conditionally samples the entire parent set. This
allows the sampler to make ‘large’ moves that are sampled exactly from the local
conditional posterior distribution, enabling the sampler to locate and explore
the areas of significant posterior mass efficiently.
The remainder of this paper is organised as follows.
We start by introducing structural inference for Bayesian networks. We then describe a naïve Gibbs sampler for this problem, which motivates the development
of an improved Gibbs sampler, based on the idea of blocking. We then describe
how this approach can be extended to larger blocks, before describing how both
of these algorithms can be implemented efficiently. Finally, we present empirical
results comparing the Gibbs sampler to some widely-used existing methods.
2 Background and Notation
A Bayesian network G is a directed, acyclic graph (DAG) with vertices V =
(V1 , . . . , Vp ), and directed edges E ⊂ V × V . The vertices correspond to the
components of a random vector X = [X1 , . . . , Xp ]T , subsets of which will be
denoted by XA for sets A ⊆ {1, . . . , p}. The arcs E of the graph G can be
specified in terms of the parents Gj of each vertex Vj for 1 ≤ j ≤ p. The
parents Gj of Vj are the subset of vertices V such that Vi ∈ Gj ⇔ (Vi , Vj ) ∈ E.
It will sometimes be convenient to use the collection of parents (G1 , . . . , Gp ) to
specify a graph G. Subsets thereof will be denoted by GA = {Gk : k ∈ A}. The subset given by the complement A^C = {1, . . . , p} \ A of a set A will be denoted by G−A = {Gk : k ∈ A^C}. In particular, the complete graph can be specified by G = (G1, . . . , Gp) = (Gi, G−i) for any i ∈ {1, . . . , p}. We will use XGi to refer
to the random variables that are parents of Xi in the graph G.
The joint distribution of X is specified in terms of p(Xi | XGi , θi ), the conditional distribution of each Xi , given its parents XGi in the Bayesian network,
with parameters θi . For structural inference our interest focuses on the posterior
distribution on Bayesian networks P (G | X), where G ∈ G, the space of all possible DAGs with p vertices. This is proportional to the product of the marginal
likelihood p(X | G), and a prior π(G) for the Bayesian network structure. By
choosing conjugate priors for the parameters θi , and assuming local parameter
independence and modularity (Heckerman et al., 1995), we can obtain a closedform marginal likelihood. The details for the multinomial and Gaussian cases
are described by Heckerman et al. (1995) and Geiger and Heckerman (1994)
respectively.
We will not need to assume the graph prior π(G) takes any specific form, and
so in particular either flat (improper) priors over graph space, or informative
priors (Werhli and Husmeier, 2007; Mukherjee and Speed, 2008) can be used.
Under the assumptions we have made, the posterior distribution on Bayesian networks is
\[ P(G \mid X) \propto \pi(G) \prod_{i=1}^{p} p(X_i \mid X_{G_i}). \]
This is the target distribution for our sampler.
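To make the factorisation concrete, the following minimal sketch (in Python; an illustration only, not the implementation used in this paper) shows how an unnormalised log posterior decomposes into a graph prior term plus a sum of per-vertex local scores. The helper local_log_marginal_likelihood is a hypothetical stand-in for whichever closed-form local marginal likelihood (multinomial or Gaussian) is in use.

```python
def log_posterior_score(parent_sets, data, local_log_marginal_likelihood,
                        log_graph_prior=lambda parent_sets: 0.0):
    """Unnormalised log posterior of a DAG represented by its parent sets.

    parent_sets[j] is the set of parent indices of vertex j, so the score is
    log pi(G) plus a sum of local terms log p(X_j | X_{G_j}).
    """
    score = log_graph_prior(parent_sets)
    for j, parents in enumerate(parent_sets):
        score += local_log_marginal_likelihood(j, frozenset(parents), data)
    return score
```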
3 A Naïve Gibbs Sampler
The standard sampler for structural inference for Bayesian networks is MC3
(Madigan et al., 1994), which is a Metropolis-Hastings sampler that explores
G by proposing to add or remove a single arc from the current graph G. The
sampler can be viewed as exploring the space of adjacency matrices, which
consists of p × p matrices whose elements Gij are indicator variables for whether
G includes an arc from i to j, and whose diagonal elements Gii = 0 for all i.
Each proposal G′ is drawn uniformly at random from the neighbourhood η(G) of the current graph, defined as the set of graphs G′ that differ from G by a single edge addition or removal. The proposal G′ is accepted with probability α, where
\[ \alpha = \min\left(1, \frac{P(G' \mid X)\,|\eta(G')|^{-1}}{P(G \mid X)\,|\eta(G)|^{-1}}\right). \]
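For reference, a hedged sketch of a single MC3 update under this description follows; log_score (returning log P(G | X) up to an additive constant) and is_acyclic_after_add are hypothetical helpers, and the neighbourhood η(G) is enumerated naively here rather than maintained incrementally as in Section 7.

```python
import math
import random

def valid_toggles(adj, is_acyclic_after_add):
    """The neighbourhood eta(G): removals are always valid, additions only if acyclic."""
    p = len(adj)
    moves = []
    for i in range(p):
        for j in range(p):
            if i != j and (adj[i][j] == 1 or is_acyclic_after_add(adj, i, j)):
                moves.append((i, j))
    return moves

def mc3_step(adj, log_score, is_acyclic_after_add, rng=random):
    """One Metropolis-Hastings step of MC3: toggle a uniformly chosen valid edge."""
    moves = valid_toggles(adj, is_acyclic_after_add)
    i, j = rng.choice(moves)
    proposal = [row[:] for row in adj]
    proposal[i][j] = 1 - proposal[i][j]
    log_alpha = (log_score(proposal)
                 - math.log(len(valid_toggles(proposal, is_acyclic_after_add)))
                 - log_score(adj) + math.log(len(moves)))
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal
    return adj
```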
Constructing a Gibbs sampler that is analogous to MC3 is straightforward. To
do this, we consider the posterior distribution on Bayesian networks to be a
joint distribution on off-diagonal entries in the adjacency matrix. We thus have
p(p − 1) random variables Gij , each of which takes the value 1 or 0.
The proposal distribution of MC3 can also be viewed as proposing to toggle the value of Gij of the adjacency matrix for some i ≠ j, subject to the restriction that the proposal must be acyclic. A simple Gibbs sampler works in a similar way. Each step of the Gibbs sampler samples Gij from its conditional distribution, given the rest of the graph G^C_{ij} = {Guv : 1 ≤ u ≤ p, 1 ≤ v ≤ p} \ {Gij}.
Define G^1_{ij} as the graph G with an arc from i to j, and G^0_{ij} as the graph G with no arc from i to j. If G^1_{ij} is acyclic, the conditional distribution of Gij is Bernoulli:
\[ P(G_{ij} = g \mid G^{C}_{ij}) =
\begin{cases}
\dfrac{P(G^{0}_{ij} \mid X)}{P(G^{0}_{ij} \mid X) + P(G^{1}_{ij} \mid X)} & g = 0 \\[2ex]
\dfrac{P(G^{1}_{ij} \mid X)}{P(G^{0}_{ij} \mid X) + P(G^{1}_{ij} \mid X)} & g = 1.
\end{cases} \]
If G^1_{ij} is cyclic, G^0_{ij} is sampled with probability 1.
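A corresponding sketch of the naive single-edge Gibbs update is given below (illustrative only); log_score is a hypothetical helper returning log P(G | X) up to a constant, and creates_cycle(adj, i, j) reports whether adding the arc (i, j) to adj would break acyclicity.

```python
import math
import random

def naive_gibbs_update(adj, i, j, log_score, creates_cycle, rng=random):
    """Resample the single edge indicator G_ij from its full conditional."""
    g0 = [row[:] for row in adj]; g0[i][j] = 0      # graph G^0_ij, arc absent
    g1 = [row[:] for row in adj]; g1[i][j] = 1      # graph G^1_ij, arc present
    if creates_cycle(g0, i, j):                     # G^1_ij would be cyclic,
        return g0                                   # so G_ij = 0 with probability 1
    s0, s1 = log_score(g0), log_score(g1)
    m = max(s0, s1)                                 # stabilise the normalisation
    p1 = math.exp(s1 - m) / (math.exp(s0 - m) + math.exp(s1 - m))
    return g1 if rng.random() < p1 else g0
```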
The choice of i and j can either be made sequentially (systematically) or randomly. There are few theoretical results to guide the choice between random- and systematic-scan Gibbs samplers (Roberts and Sahu, 1997). Here, random-scan
Gibbs samplers are used throughout.
This naı̈ve Gibbs sampler offers no advantages over MC3 . However, thinking of
structural inference from a Gibbs sampling perspective opens up the possibility
of drawing on ideas from the Gibbs sampling literature to improve the mixing
rate of the MCMC algorithm, as we now discuss.
4 Optimising Gibbs Samplers
The mixing of Metropolis-Hastings algorithms depends strongly upon the choice
of proposal distribution, but Gibbs samplers make moves according to the conditional distribution. The conditional distribution has the attractive property
that it exactly reflects some local structure of the target distribution, in contrast
to proposal distributions for a Metropolis-Hastings algorithm, which are often
chosen for convenience.
Nonetheless, Gibbs sampling is not always efficient. Inefficiency occurs when
there is strong correlation between the components of the random vector. To see
this, consider Gibbs sampling a multivariate continuous distribution with highly
correlated components. At each step, a single component of the random vector
is sampled according to its conditional distribution, but since this component is
strongly correlated with another component, the conditional distribution will be
concentrated on only a small part of its support. This means that the sampler
is likely to make only small moves, which will mean that it explores the sample space slowly. The same issue arises with discrete distributions.

Algorithm 1 A Gibbs sampler, with blocks
  Initialise starting point G^0 = (G^0_1, . . . , G^0_p)
  for i in 1 to N do
    for j in 1 to p do
      Draw G′_j ∼ P(G_j | G^{i−1}_{−j}, X)
      Set G^i ← G = (G′_j, G^{i−1}_{−j})
    end for
  end for
For Bayesian networks, it is clear that there will be strong dependence between
the Gij , particularly for the groups {Gij : 1 ≤ i ≤ p} for each 1 ≤ j ≤ p that
correspond to parent sets. For example, there may be random variables Xr and
Xs that do not individually predict Xj well, but do when taken in combination.
Another possibility is two pairs of random variables Xr, Xs and Xu, Xv that each predict Xj well in combination, but such that none of the four random variables does so individually. In this case, the probability of transitioning from a graph
in which the parents of Xj are Vr and Vs to a graph in which the parents of Xj
are Vu and Vv may be extremely low.
One method for alleviating this problem is to transform the distribution so that
the components of the random variable are not correlated. In general, finding
a suitable transformation can be very difficult, and in the case of Bayesian
networks it seems unlikely that an appropriate transformation is available.
Instead we propose to group a number of the components together and sample
from their joint conditional distribution. In Gibbs sampling, this is known
as ‘blocking’. The method is widely thought to be beneficial in settings such
as this where there is strong correlation between components of the random
variable. In the case of multivariate normal distributions, Roberts and Sahu
(1997) have shown that for random-scan Gibbs sampling, convergence improves
when components of the random vector are sampled as blocks. By sampling
from the joint conditional distribution of a number of components, we avoid the
issues caused by the correlation between these components because the joint
conditional distribution naturally incorporates the correlation structure, and so
can account for it.
5 Single Parent Set Blocks
As we noted above, the efficiency of a Gibbs sampler can be improved by blocking together a number of components, and sampling from their joint conditional
distribution. In theory, any group of components can be taken as a block, but
sampling from their joint conditional distribution needs to be possible, and ideally simple. For Bayesian networks, the most natural blocks are those consisting
of groups {Gij : 1 ≤ i ≤ p} for each 1 ≤ j ≤ p. We can think of this as parameterising G by its parent sets G1 , . . . , Gp . This seems natural for Bayesian
networks because the marginal likelihood p(X | G) for a graph G factorises
across vertices into conditionals p(Xj | XGj ). These conditionals depend on the
parent set of each vertex, and so the posterior distribution on Bayesian networks
G ∈ G can be written as a function of G1, . . . , Gp in the following way:
\[ P(G_1, \ldots, G_p \mid X) \propto \pi(G_1, \ldots, G_p) \prod_{i=1}^{p} p(X_i \mid X_{G_i}). \]
To be able to construct a Gibbs sampler using these blocks, we need to find
the conditional distribution for the blocks. Specifically, we need the conditional
distribution for a block Gj = {Gij : 1 ≤ i ≤ p}, 1 ≤ j ≤ p, given the other
parent sets G−j = {G1 , . . . , Gj−1 , Gj+1 , . . . , Gp }. Parent sets Gj for which G =
(Gj , G−j ) is cyclic will have no probability mass in the conditional distribution.
Let K_j^* be the set of parent sets Gj such that G = (Gj, G−j) is acyclic. The conditional posterior distribution of Gj is multinomial for Gj ∈ K_j^*, with weights given by the posterior distribution of G = (Gj, G−j):
\[ P(G_j \mid G_{-j}, X) = \frac{P(G_j, G_{-j} \mid X)}{P(G_{-j} \mid X)} = \frac{P(G_j, G_{-j} \mid X)}{\sum_{G_j \in K_j^{*}} P(G_j, G_{-j} \mid X)}. \qquad (1) \]
It is interesting to note that when K_j^* = P(V \ {Vj}), the conditional distribution (1) can be viewed as the posterior distribution of a standard Bayesian variable selection problem with dependent variable Vj, and independent variables V−j. When K_j^* ⊂ P(V \ {Vj}), the conditional distribution (1) is a Bayesian variable selection problem with independent variables given by a subset of V−j. This
highlights the connection between structural learning for Bayesian networks and
Bayesian variable selection problems. While this connection has not been widely
used for directed graphs, it has recently been studied and exploited extensively for
undirected graphs (e.g. Meinshausen and Bühlmann, 2006).
When these variable selection problems are tractable, we can immediately implement a Gibbs sampler for G. At each sample, we simply draw a new parent
set for vertex Vj from P (Gj | G−j , X). Algorithm 1 is an outline of this Gibbs
sampler.
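To illustrate a single draw from (1), the sketch below assumes the permissible parent sets K_j^* have already been enumerated (Section 7 explains how), and uses the hypothetical helpers local_log_marginal_likelihood and log_graph_prior from the earlier sketch. Because the posterior factorises, only the prior and the local term for vertex j differ between candidates; the remaining factors cancel in (1).

```python
import math
import random

def sample_parent_set(j, candidate_parent_sets, parent_sets, data,
                      local_log_marginal_likelihood, log_graph_prior, rng=random):
    """Draw G_j from P(G_j | G_{-j}, X) as in (1).

    candidate_parent_sets is K_j^*: every parent set for vertex j that keeps
    the graph acyclic given the other parent sets G_{-j}.
    """
    log_weights = []
    for candidate in candidate_parent_sets:
        proposal = list(parent_sets)
        proposal[j] = frozenset(candidate)
        log_weights.append(log_graph_prior(proposal)
                           + local_log_marginal_likelihood(j, frozenset(candidate), data))
    m = max(log_weights)
    weights = [math.exp(w - m) for w in log_weights]     # stable normalisation
    u, acc = rng.random() * sum(weights), 0.0
    for candidate, w in zip(candidate_parent_sets, weights):
        acc += w
        if u <= acc:
            return frozenset(candidate)
    return frozenset(candidate_parent_sets[-1])          # guard against rounding
```

In practice the local scores p(Xj | XGj) are cached, since the same parent set is revisited many times over the course of a run.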
6 Two Parent Set Blocks
The Gibbs sampler described in Section 5 groups the edge indicators Gij into
blocks corresponding to the parents of a node, as these are likely to be correlated.
However, this may not be sufficient to achieve quick convergence because there
is often strong correlation between components of Gij and Gij′ with j ≠ j′.
When this is the case, samplers using the blocking strategy described above
may fail to mix efficiently.
In this case, we can form larger blocks consisting of pairs of parent sets {Gij : j ∈ {j1, j2}, 1 ≤ i ≤ p}. At each step of the Gibbs sampler, we now conditionally sample pairs of parent sets (Gj1, Gj2), given the remainder of the graph G−{j1,j2}. For pairs of parent sets (Gj1, Gj2) such that G = (Gj1, Gj2, G−{j1,j2}) is not acyclic, the conditional probability is 0. Let K^*_{j1,j2} be the set of pairs of parent sets (Gj1, Gj2) such that G = (Gj1, Gj2, G−{j1,j2}) is acyclic. For (Gj1, Gj2) ∈ K^*_{j1,j2}, the conditional posterior distribution is multinomial, with weights given by the posterior distribution of G = (Gj1, Gj2, G−{j1,j2}):
\[ P(G_{j_1}, G_{j_2} \mid G_{-\{j_1,j_2\}}, X) = \frac{P(G_{j_1}, G_{j_2}, G_{-\{j_1,j_2\}} \mid X)}{P(G_{-\{j_1,j_2\}} \mid X)} = \frac{P(G_{j_1}, G_{j_2}, G_{-\{j_1,j_2\}} \mid X)}{\sum_{(G_{j_1}, G_{j_2}) \in K^{*}_{j_1,j_2}} P(G_{j_1}, G_{j_2}, G_{-\{j_1,j_2\}} \mid X)}. \qquad (2) \]

7 Computational Aspects
Up to this point we have ignored the computational aspects of the algorithm.
In this section, we describe how the algorithms described above can be implemented efficiently. The main computational difficulty encountered by these
algorithms is checking for cycles.
7.1 Checking for Cycles
The most straightforward method for checking for cycles is depth-first search,
which takes O(p + e) time, where e is the number of edges in the graph. For
MC3, we must consider all possible single-edge changes to a directed graph, of which there are O(p^2), and so checking for cycles at each iteration takes O(p^3) time in the worst case.
This creates a bottleneck in the algorithm, but we can avoid this by using ideas
first proposed in this context by Giudici and Castelo (2003). We describe an
alternative method that was proposed in the dynamic algorithms literature by
King and Sagert (2002).
Let T^G be the transitive closure of the current state of the sampler, which for a graph G = (V, E) is defined as the directed graph (V, E^*), where (Vi, Vj) ∈ E^* if and only if a path from Vi to Vj exists in G. Knowing the transitive closure is of use because its adjacency matrix T^G = (T^G_{ij}) immediately reveals which alterations can be made to G without introducing a cycle. The addition of an edge (i, j) will introduce a cycle if and only if T^G_{ji} = 1. Clearly removing an edge from G will never introduce a cycle. The adjacency matrix of the transitive closure therefore enables graphs created by single-edge additions to be screened for cycles in O(1) time.
The transitive closure for an arbitrary directed graph can be determined in O(p^ω) time (Munro, 1971), where ω is the best known exponent for matrix multiplication (Coppersmith and Winograd, 1990, show ω < 2.376). However, only
incremental changes are made to the current state G of the sampler, so a dynamic algorithm can be used to compute the transitive closure more efficiently.
We need a fully dynamic transitive closure algorithm, so that both insertion and
deletion of edges are supported. This problem has been the subject of significant
interest in the dynamic algorithms literature; Demetrescu et al. (2010) provide
an overview.
Algorithms for this problem provide a procedure for querying the transitive
closure, and procedures that update the transitive closure when an edge is added
or removed from the graph. A trade-off exists between the performance of these
two operations (Demetrescu and Italiano, 2005). We choose to implement the
algorithm introduced by King and Sagert (2002), which allows queries to be
performed in O(1) time, and updates in O(p2 ) worst-case time, assuming a
word size of O(log p). This is thought to be the best possible update bound
(Demetrescu and Italiano, 2005), yet the algorithm is simple to implement.
The algorithm maintains a path count matrix C^G = (C^G_{ij}), where C^G_{ij} is the number of distinct paths from Vi to Vj in G. Clearly, T^G_{ij} = 1 if and only if C^G_{ij} > 0, and so query operations are performed simply by checking whether the relevant component of C^G is positive.
Surprisingly, updating C^G is also straightforward. Suppose G′ is formed by adding an edge (i, j) to a graph G. Denote the ith column of C^G by C^G_{•i}, and the jth row by C^G_{j•}. The increase in the number of distinct paths between any two vertices Va and Vb is given by the (a, b) element of the outer product (denoted ⊗) of C^G_{•i} and C^G_{j•}. Updates to the path count matrix for the addition of an edge (i, j) therefore add this outer product to the existing path count matrix:
\[ C^{G'} = C^{G} + C^{G}_{\bullet i} \otimes C^{G}_{j \bullet}. \]
Removing an edge from Vi to Vj is performed analogously:
\[ C^{G'} = C^{G} - C^{G}_{\bullet i} \otimes C^{G}_{j \bullet}. \]
This algorithm is simple to implement, and provides a fast method for determining which edges can be added to a DAG without introducing a cycle.
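A compact sketch of this bookkeeping is given below (an illustration under the stated assumptions, not the authors' implementation). The path count matrix is stored densely, insertions and deletions apply the rank-one outer-product update, and the permissible parents of a vertex are read off from the zero entries, as used in Section 7.2. The convention C[a][a] = 1 (counting the empty path) is an assumption of the sketch.

```python
class PathCounts:
    """Dense path count matrix: C[a][b] = number of distinct directed paths a -> b."""

    def __init__(self, p):
        self.p = p
        self.C = [[1 if a == b else 0 for b in range(p)] for a in range(p)]

    def can_add(self, i, j):
        """Adding the arc i -> j keeps the graph acyclic iff no path j -> i exists."""
        return self.C[j][i] == 0

    def add_edge(self, i, j):
        """Rank-one update: new a -> b paths number (paths a -> i) * (paths j -> b)."""
        col_i = [self.C[a][i] for a in range(self.p)]   # snapshot before updating
        row_j = [self.C[j][b] for b in range(self.p)]
        for a in range(self.p):
            for b in range(self.p):
                self.C[a][b] += col_i[a] * row_j[b]

    def remove_edge(self, i, j):
        """Deletion subtracts the same outer product (valid because G is a DAG)."""
        col_i = [self.C[a][i] for a in range(self.p)]
        row_j = [self.C[j][b] for b in range(self.p)]
        for a in range(self.p):
            for b in range(self.p):
                self.C[a][b] -= col_i[a] * row_j[b]

    def permissible_parents(self, j):
        """K_j: vertices that can be added as parents of V_j without creating a cycle."""
        return [i for i in range(self.p) if i != j and self.C[j][i] == 0]
```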
Algorithm 2 An efficient Gibbs sampler, with blocks
  Initialise starting point G^0 = (G^0_1, . . . , G^0_p)
  Compute initial path count matrix C^{G^0}
  for i in 1 to N do
    for j in 1 to p do
      Remove parents of Vj in G^{i−1}. Update C^{G^{i−1}}
      Retrieve K_j = {V_k : C^{G^{i−1}}_{jk} = 0}, the permissible parents of Vj
      Compute power set K_j^* = P(K_j)
      for Gj in K_j^* do
        Evaluate p(Xj | XGj)
      end for
      Evaluate P(Gj | G^{i−1}_{−j}, X) using (1)
      Draw G′_j ∼ P(Gj | G^{i−1}_{−j}, X)
      Set G^i ← G = (G′_j, G^{i−1}_{−j})
      Update C^{G^i}
    end for
  end for
7.2 Application to Gibbs Sampler
The algorithms in the previous section enable the Gibbs samplers to be implemented efficiently. We first describe how this can be done for the Gibbs sampler
in Section 5, and then describe how this approach can be extended when blocks
are formed by two parent sets, as described in Section 6.
An efficient implementation of the Gibbs sampler with single parent set blocks (Section 5) requires a method for efficiently computing K_j^*, the set of parent sets Gj such that G = (Gj, G−j) is acyclic. This is, in fact, not difficult if the transitive closure of the current graph is available. Recall that adding an edge (i, j) will introduce a cycle if and only if T^G_{ji} = 1. Therefore, the set of vertices that can be added as parents of Vj is the set K_j = {Vi : T^G_{ji} = 0}. In fact, any subset of K_j can also be added without introducing a cycle. It is clear, therefore, that K_j^* = P(K_j), the power set of K_j. An efficient algorithm that implements the Gibbs sampler using this method is outlined in Algorithm 2.
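Putting these pieces together, a hedged sketch of one sweep of Algorithm 2 follows, reusing the hypothetical sample_parent_set and PathCounts helpers sketched in Sections 5 and 7.1; the sampler used in this paper is random-scan, but a systematic sweep keeps the sketch short.

```python
import random
from itertools import combinations

def power_set(items):
    """All subsets of items (as frozensets), i.e. P(items)."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def gibbs_sweep(parent_sets, counts, data,
                local_log_marginal_likelihood, log_graph_prior, rng=random):
    """One sweep of the single parent set blocked Gibbs sampler (cf. Algorithm 2).

    parent_sets[j] is the current parent set of V_j; counts is a PathCounts
    object kept consistent with the graph as edges are removed and added.
    """
    for j in range(counts.p):
        for parent in parent_sets[j]:           # detach V_j from its parents
            counts.remove_edge(parent, j)
        parent_sets[j] = frozenset()
        K_j = counts.permissible_parents(j)     # addable without creating a cycle
        candidates = power_set(K_j)             # K_j^* = P(K_j)
        parent_sets[j] = sample_parent_set(j, candidates, parent_sets, data,
                                           local_log_marginal_likelihood,
                                           log_graph_prior, rng)
        for parent in parent_sets[j]:           # reattach, updating path counts
            counts.add_edge(parent, j)
    return parent_sets
```

In practice the enumeration of P(K_j) is truncated by an in-degree limit (Section 8 uses a maximum in-degree of 3), which keeps the number of candidate parent sets manageable.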
The two parent set Gibbs sampler (Section 6) requires it to be feasible to draw samples from (2) quickly. To do this, we describe a method for identifying K^*_{j1,j2}, the pairs of parent sets which, when added to the graph, do not create a cycle.
Once again, the transitive closure makes it possible to find the required parent sets. We first define K_{j1} and K_{j2} to be the non-descendants of Vj1 and Vj2 respectively. Similarly, we define K^C_{j1} and K^C_{j2} to be the descendants of Vj1 and Vj2 respectively. As before, we have that K^*_{j1} = P(K_{j1}) and K^*_{j2} = P(K_{j2})
are the sets of parent sets that can separately be added to Vj1 and Vj2 without forming a cycle. However, K^*_{j1,j2} is not equal to K^*_{j1} × K^*_{j2} because there may be a subset C of K^*_{j1} × K^*_{j2} for which the (Gj1, Gj2) ∈ C are such that G = (Gj1, Gj2, G−{j1,j2}) is cyclic. Elements (Gj1, Gj2) in C have the property that Gj1 and Gj2 can each be added to the current graph without forming a cycle, but, when added to the current graph together, will create a cycle.

The elements of C correspond to the situation in which both a descendant of Vj1 is added as a parent of Vj2 and a descendant of Vj2 is added as a parent of Vj1. When enumerating K^*_{j1,j2} we must therefore ensure that our pairs of parent sets do not both have a descendant of the other vertex as a parent. This makes it helpful to partition K^*_{j1,j2} into three groups, depending on whether paths are formed from Vj1 to Vj2, from Vj2 to Vj1, or neither. The corresponding partition is K^*_{j1,j2} = {H_0, H_1, H_2}, where
\[ \begin{aligned}
H_0 &= (K^{*}_{j_1} \cap K^{*}_{j_2}) \times (K^{*}_{j_2} \cap K^{*}_{j_1}) \\
H_1 &= (K^{*}_{j_1} \cap K^{C}_{j_2}) \times (K^{*}_{j_2} \cap K^{*}_{j_1}) \\
H_2 &= (K^{*}_{j_1} \cap K^{*}_{j_2}) \times (K^{*}_{j_2} \cap K^{C}_{j_1}).
\end{aligned} \]
In H_0, no paths will be formed between Vj1 and Vj2, because only non-descendants of both Vj1 and Vj2 are included as potential parents. In H_1, Vj1 has a parent that is a descendant of Vj2, and so the parents of Vj2 cannot include Vj1 or any of its descendants. Similarly, in H_2, Vj2 has a parent that is a descendant of Vj1, and so the parents of Vj1 cannot include Vj2 or any of its descendants. The partition gives a simple way to enumerate all of the elements of K^*_{j1,j2}, because all of these sets are functions of the transitive closure, which we dynamically update as described above. Since we can enumerate H_0, H_1 and H_2, we can enumerate K^*_{j1,j2}. This means we can draw samples from (2), as required by the Gibbs sampler with two parent set blocks.
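As a check on this reasoning, the sketch below enumerates K^*_{j1,j2} directly rather than via the partition: a pair is excluded exactly when G_{j1} contains a descendant of V_{j2} and, at the same time, G_{j2} contains a descendant of V_{j1}. The descendant and non-descendant sets are assumed to be read from the maintained transitive closure (with the current parents of V_{j1} and V_{j2} removed), and the descendant sets are taken to include the vertex itself.

```python
from itertools import combinations

def power_set(items):
    """All subsets of items (as frozensets), i.e. P(items)."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def valid_parent_set_pairs(nondesc_j1, nondesc_j2, desc_j1, desc_j2):
    """Enumerate K*_{j1,j2} by direct checking.

    nondesc_j1, nondesc_j2 are the permissible parents K_{j1}, K_{j2}
    (non-descendants); desc_j1, desc_j2 are the descendant sets K^C_{j1},
    K^C_{j2}, here taken to include V_{j1} and V_{j2} themselves.  A pair
    creates a cycle exactly when each parent set contains a descendant of
    the other vertex, closing a loop through V_{j1} and V_{j2}.
    """
    desc_j1, desc_j2 = frozenset(desc_j1), frozenset(desc_j2)
    pairs = []
    for g_j1 in power_set(nondesc_j1):
        for g_j2 in power_set(nondesc_j2):
            if not (g_j1 & desc_j2 and g_j2 & desc_j1):
                pairs.append((g_j1, g_j2))
    return pairs
```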
8 Experiments
We analysed the performance of the MCMC samplers using synthetic data,
generated from the ALARM network (Beinlich et al., 1989), which includes 46
edges. This network is widely used to examine the performance of methods of structural learning.
We compare the performance of our Gibbs sampler with MC3 , the REV sampler of Grzegorczyk and Husmeier (2008), and a constraint-based approach introduced by Xie and Geng (2008). The constraint-based approach we use is a
variant of the PC-algorithm (Spirtes et al., 2000), which Xie and Geng (2008)
demonstrate can outperform the PC-algorithm. Our MC3 and Gibbs samplers
both implement the fast updating of the transitive closure from King and Sagert
(2002), and cache local marginal likelihoods. We used the default settings for
all of these samplers, in particular, setting significance level α = 0.05 for the
constraint-based method. The Gibbs sampler we used was a random-scan sampler, which used two-parent-set block moves with probability 0.25 and single-parent-set block moves otherwise.
We chose a flat, improper prior π(G) ∝ 1, and constrained all of the MCMC samplers to graphs with in-degree no greater than 3.
Since the true graph is known, the performance of structural learning algorithms
can be assessed using ROC (Receiver-operating characteristic) curves. The
MCMC algorithms return estimates for the posterior distribution on Bayesian
networks P (G | X), from which the posterior probability of any arc p(E) can
be computed. For some threshold τ , 0 ≤ τ ≤ 1, consider the set of edges
Eτ ⊂ V × V such that p(E) ≥ τ . Since we know the true graph, we can compare Eτ to the true graph, and in particular, count the number of true and
false positive edges. An ROC curve is given by plotting the number of true
positive edges against the number of false positive edges, for a range of values
of 0 ≤ τ ≤ 1. Naturally, we seek to maximise the number of true positives for
a given number of false positives, and so the algorithms with the greatest area
under the curve are preferred.
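A minimal sketch of this evaluation follows (illustrative only): given a matrix of estimated posterior arc probabilities and the adjacency matrix of the true network, it counts true and false positive edges at each threshold τ.

```python
def roc_points(edge_probs, true_adj, thresholds):
    """Return (false positives, true positives) for each threshold tau.

    edge_probs[i][j] is the estimated posterior probability of the arc i -> j;
    true_adj is the adjacency matrix of the true (here, ALARM) network.
    """
    p = len(true_adj)
    points = []
    for tau in thresholds:
        tp = fp = 0
        for i in range(p):
            for j in range(p):
                if i != j and edge_probs[i][j] >= tau:
                    if true_adj[i][j] == 1:
                        tp += 1
                    else:
                        fp += 1
        points.append((fp, tp))
    return points
```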
We drew three independent samples from the ALARM network, with sample
sizes n = 100, 500, 750 respectively. To ensure a fair comparison, we fixed compute time, running each of the MCMC samplers for 20 minutes, and performed
three independent runs starting from different initial graphs.
Figure 1 is a plot of ROC curves for the Gibbs sampler, MC3 , REV sampler
and Xie and Geng’s constraint-based method.
The MC3 algorithm gives the most variable results. For n = 100, in two cases,
the MC3 results are the most consistent with the true ALARM network. However, all of the MC3 runs assign probability zero to at least one true edge in the graph. This is apparent in the ROC plots from the failure of the number of true
positive edges to reach 46, the number of total edges in the ALARM network,
until τ = 0. This effect is symptomatic of a sampler that has not converged.
Figure 2, which compares the posterior edge probabilities of two independent
runs of the sampler, confirms that MC3 fails to converge. The REV sampler
performs well at low sample sizes, but for n = 500 and n = 750 we again observe
that the posterior edge probabilities given are zero for some true edges, again
implying that the sampler has failed to converge. The constraint-based method
only gives a point estimate, and so this is indicated on the ROC plot by a cross.
We see that the method performs poorly when n = 100, but impressively for
large sample sizes.
The Gibbs sampler, however, gives the most consistent results across runs. The samplers appear to have converged (see Figure 2), which is impressive in only 20 minutes, and in terms of the ROC analysis, the Gibbs sampler is mostly one of the best performers. However, it is the consistency of its results that is most appealing in practice, and this stems from its apparent ability to converge rapidly.
[Figure 1 here: three ROC panels (number of true positives against number of false positives) for n = 100, n = 500 and n = 750, comparing Gibbs, MC3, REV and Xie-Geng.]

Figure 1: ROC curves given by estimated posterior distributions from our Gibbs sampler, MC3, REV, and Xie-Geng's constraint-based method. These plot the number of true positives (y-axis) against the number of false positives (x-axis) for a range of values of τ.
[Figure 2 here: scatter plot matrices of final edge probabilities from 3 runs, for MC3 (left) and the Gibbs sampler (right).]

Figure 2: Convergence control for MC3 (left) and the Gibbs sampler (right), for n = 500. The posterior edge probabilities given by two independent runs are plotted against each other. When the two runs give the same estimates of the posterior edge probabilities, all of the points appear on the line y = x. We observe that the two Gibbs runs give comparable posterior edge probabilities, but the MC3 runs do not. This was typical of all the runs and sample sizes.

9 Conclusion

We have introduced a Gibbs sampler for structural inference of Bayesian networks. The sampler uses the idea of blocking to improve its rate of convergence, and we demonstrated empirically its utility on simulated data. An appealing aspect of the approach is that it highlights and uses the connection between Bayesian variable selection and structural inference of Bayesian networks. A number of theoretical results about Bayesian variable selection are known (see, e.g. Scott and Berger, 2010), whereas few equivalent theoretical results for Bayesian networks are available. For undirected graphical models, the connection is widely appreciated, for example in the context of the graphical lasso (Meinshausen and Bühlmann, 2006). Methods exploiting the connection for directed models may therefore enable methods and ideas from the variable selection literature to be adapted for use in structural inference.
References
Beinlich, I., Suermondt, H., Chavez, R. and Cooper, G. (1989) The ALARM
monitoring system: A case study with two probabilistic inference techniques
for belief networks. In Second European Conference on Artificial Intelligence
in Medicine, pp. 247–256. Springer-Verlag, Berlin.
Coppersmith, D. and Winograd, S. (1990) Matrix multiplication via arithmetic
progressions. Journal of Symbolic Computation, 9, 251–280.
Demetrescu, C., Eppstein, D., Galil, Z. and Italiano, G. F. (2010) Dynamic
Graph Algorithms. In Algorithms and Theory of Computation Handbook:
General Concepts and Techniques (eds. M. J. Atallah and M. Blanton), pp.
9.1–9.28. Boca Raton: CRC Press.
Demetrescu, C. and Italiano, G. (2005) Trade-offs for fully dynamic transitive
closure on DAGs: breaking through the O(n^2) barrier. Journal of the ACM,
52, 147–156.
Eaton, D. and Murphy, K. (2007) Bayesian structure learning using dynamic
programming and MCMC. In Proceedings of the 23rd Annual Conference on Uncertainty in Artificial Intelligence (UAI-07), pp. 101–108.
Corvallis, Oregon.
Ellis, B. and Wong, W. H. (2008) Learning Causal Bayesian Network Structures
From Experimental Data. Journal of the American Statistical Association,
103, 778–789.
Friedman, N. and Koller, D. (2003) Being Bayesian about network structure.
A Bayesian approach to structure discovery in Bayesian networks. Machine
Learning, 50, 95–125.
Geiger, D. and Heckerman, D. (1994) Learning Gaussian Networks. In Proceedings of the 10th Annual Conference on Uncertainty in Artificial
Intelligence (UAI-94).
Giudici, P. and Castelo, R. (2003) Improving Markov Chain Monte Carlo Model
Search for Data Mining. Machine Learning, 50, 127–158.
Grzegorczyk, M. and Husmeier, D. (2008) Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move. Machine
Learning, 71, 265–305.
Heckerman, D., Geiger, D. and Chickering, D. M. (1995) Learning Bayesian
Networks: The Combination of Knowledge and Statistical Data. Machine
Learning, 20, 197–243.
Kalisch, M. and Bühlmann, P. (2008) Robustification of the PC-Algorithm for
Directed Acyclic Graphs. Journal of Computational and Graphical Statistics,
17, 773–789.
King, V. and Sagert, G. (2002) A Fully Dynamic Algorithm for Maintaining the
Transitive Closure. Journal of Computer and System Sciences, 65, 150–167.
Koivisto, M. and Sood, K. (2004) Exact Bayesian structure discovery in
Bayesian networks. Journal of Machine Learning Research, 5, 549–573.
Madigan, D., Raftery, A. E., York, J. C., Bradshaw, J. M. and Almond, R. G.
(1994) Strategies for graphical model selection. In Selecting Models from Data:
AI and Statistics IV (eds. P. Cheeseman and R. W. Oldford), pp. 91–100. New York:
Springer-Verlag.
Madigan, D. and York, J. C. (1995) Bayesian Graphical Models for Discrete
Data. International Statistical Review / Revue Internationale de Statistique,
63, 215–232.
Meinshausen, N. and Bühlmann, P. (2006) High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34, 1436–1462.
Mukherjee, S. and Speed, T. P. (2008) Network Inference Using Informative
Priors. Proceedings of the National Academy of Sciences of the United States
of America, 105, 14313–14318.
Munro, I. (1971) Efficient determination of the transitive closure of a directed
graph. Information Processing Letters, 1, 56–58.
Parviainen, P. and Koivisto, M. (2009) Exact Structure Discovery in Bayesian
Networks with Less Space. In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI-09).
Roberts, G. O. and Sahu, S. K. (1997) Updating Schemes, Correlation Structure,
Blocking and Parameterization for the Gibbs Sampler. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 59, 291–317.
Scott, J. and Berger, J. (2010) Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38, 2587–
2619.
Spirtes, P., Glymour, C. and Scheines, R. (2000) Causation, Prediction, and
Search. Cambridge, MA: The MIT Press, second edition.
Werhli, A. V. and Husmeier, D. (2007) Reconstructing Gene Regulatory Networks with Bayesian Networks by Combining Expression Data with Multiple
Sources of Prior Knowledge. Statistical Applications in Genetics and Molecular Biology, 6.
Xie, X. and Geng, Z. (2008) A Recursive Method for Structural Learning of
Directed Acyclic Graphs. Journal of Machine Learning Research, 9, 459–483.