An efficient Gibbs sampler for structural inference in Bayesian networks

Robert J. B. Goudie and Sach Mukherjee
Department of Statistics, University of Warwick, Coventry, CV4 7AL

June 9, 2011

Abstract

We propose a Gibbs sampler for structural inference in Bayesian networks. The standard Markov chain Monte Carlo (MCMC) algorithms used for this problem are random-walk Metropolis-Hastings samplers, but for problems of even moderate dimension these samplers often exhibit slow mixing. The Gibbs sampler proposed here conditionally samples the complete set of parents of a node in a single move, by blocking together particular components. These blocks can themselves be paired together to improve the efficiency of the sampler. The conditional distribution used for sampling can be viewed as a posterior distribution for a constrained Bayesian variable selection for the parents of a node. This view sheds further light on the increasingly well understood connection between Bayesian variable selection and structural inference. We empirically examine the performance of the sampler using data simulated from the ALARM network.

1 Introduction

Bayesian networks provide a framework for encoding probabilistic dependence statements about the components of a multivariate probability distribution. Random variables are represented by vertices of a directed acyclic graph (DAG). The distribution of each random variable is specified conditional on its parents in the graph. The full distribution is therefore described recursively. These models are widely used in statistics, artificial intelligence and diverse other areas.

If the conditional distributions of the component random variables share a parametric form that makes sense under any dependence structure, we can consider the space of Bayesian networks to be the space in which model selection or averaging takes place. In a Bayesian framework, selection is performed using the marginal likelihood. This has a closed form if tractable conjugate priors are used for the conditional distributions. Nonetheless, exact structural inference is not usually possible because the large cardinality of the space of DAGs precludes an exhaustive search or enumeration.

Structural inference for Bayesian networks has been widely studied, and a number of different approaches are available. There are a number of MCMC samplers, including MC³ (Madigan and York, 1995) and variants that improve its efficiency (Giudici and Castelo, 2003; Grzegorczyk and Husmeier, 2008). In high, or even moderate, dimensions it is not straightforward to construct an MCMC sampler that converges rapidly to its target distribution. One attempt to address this problem has been to perform MCMC in order space (Friedman and Koller, 2003; Ellis and Wong, 2008; Eaton and Murphy, 2007). These approaches have more recently led to exact methods (Koivisto and Sood, 2004; Parviainen and Koivisto, 2009) using dynamic programming. An alternative class of approaches is constraint-based methods (Spirtes et al., 2000; Kalisch and Bühlmann, 2008; Xie and Geng, 2008).

In this paper we propose a method for constructing Gibbs samplers for structural inference of Bayesian networks. The standard MCMC algorithms used for this problem propose small changes to the current state, and accept these proposals according to the usual Metropolis-Hastings acceptance probability.
The Gibbs sampler proposed here considers the parents of each node as a single component (by constructing blocks), and conditionally samples the entire parent set. This allows the sampler to make 'large' moves that are sampled exactly from the local conditional posterior distribution, enabling the sampler to locate and explore the areas of significant posterior mass efficiently.

The remainder of this paper is organised as follows. We start by introducing structural inference for Bayesian networks. We then describe a naïve Gibbs sampler for this problem, which motivates the development of an improved Gibbs sampler based on the idea of blocking. We then describe how this approach can be extended to larger blocks, before describing how both of these algorithms can be implemented efficiently. Finally, we present empirical results comparing the Gibbs sampler to some widely-used existing methods.

2 Background and Notation

A Bayesian network G is a directed, acyclic graph (DAG) with vertices V = (V_1, . . . , V_p) and directed edges E ⊂ V × V. The vertices correspond to the components of a random vector X = [X_1, . . . , X_p]^T, subsets of which will be denoted by X_A for sets A ⊆ {1, . . . , p}. The arcs E of the graph G can be specified in terms of the parents G_j of each vertex V_j for 1 ≤ j ≤ p. The parents G_j of V_j are the subset of vertices V such that V_i ∈ G_j ⇔ (V_i, V_j) ∈ E. It will sometimes be convenient to use the collection of parents (G_1, . . . , G_p) to specify a graph G. Subsets thereof will be denoted by G_A = {G_k : k ∈ A}. The subset given by the complement A^C = {1, . . . , p} \ A of a set A will be denoted by G_{−A} = {G_k : k ∈ A^C}. In particular, the whole graph can be specified by G = (G_1, . . . , G_p) = (G_i, G_{−i}) for any i ∈ {1, . . . , p}. We will use X_{G_i} to refer to the random variables that are parents of X_i in the graph G.

The joint distribution of X is specified in terms of p(X_i | X_{G_i}, θ_i), the conditional distribution of each X_i, given its parents X_{G_i} in the Bayesian network, with parameters θ_i. For structural inference our interest focuses on the posterior distribution on Bayesian networks P(G | X), where G ∈ 𝒢, the space of all possible DAGs with p vertices. This is proportional to the product of the marginal likelihood p(X | G) and a prior π(G) for the Bayesian network structure. By choosing conjugate priors for the parameters θ_i, and assuming local parameter independence and modularity (Heckerman et al., 1995), we can obtain a closed-form marginal likelihood. The details for the multinomial and Gaussian cases are described by Heckerman et al. (1995) and Geiger and Heckerman (1994) respectively. We will not need to assume that the graph prior π(G) takes any specific form, and so in particular either flat (improper) priors over graph space or informative priors (Werhli and Husmeier, 2007; Mukherjee and Speed, 2008) can be used.

Under the assumptions we have made, the posterior distribution on Bayesian networks is

    P(G | X) ∝ π(G) ∏_{i=1}^{p} p(X_i | X_{G_i}).

This is the target distribution for our sampler.
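For concreteness, the target can be evaluated up to its normalising constant by summing local scores, one per vertex. The following sketch assumes discrete data coded as integers and a BDeu-style Dirichlet prior, corresponding to the multinomial case cited above; the function names, the equivalent sample size argument and the data layout are illustrative choices of ours rather than details fixed by the paper.

```python
# Illustrative evaluation of log P(G | X) up to a constant, assuming an
# n x p integer array `data` with X_j taking values 0, ..., levels[j] - 1,
# and a BDeu-style Dirichlet prior with equivalent sample size `ess`.
import numpy as np
from itertools import product
from scipy.special import gammaln

def local_log_marginal(data, child, parents, levels, ess=1.0):
    """log p(X_child | X_parents): the local marginal likelihood term."""
    r = levels[child]                                  # number of states of the child
    q = int(np.prod([levels[i] for i in parents])) if parents else 1
    a_jk, a_jkl = ess / q, ess / (q * r)               # BDeu hyperparameters
    score = 0.0
    for config in product(*[range(levels[i]) for i in parents]):
        if parents:
            rows = np.all(data[:, parents] == config, axis=1)
        else:
            rows = np.ones(len(data), dtype=bool)
        counts = np.bincount(data[rows, child], minlength=r)
        score += gammaln(a_jk) - gammaln(a_jk + counts.sum())
        score += np.sum(gammaln(a_jkl + counts) - gammaln(a_jkl))
    return score

def log_posterior(parent_sets, data, levels, log_prior=lambda G: 0.0):
    """Unnormalised log P(G | X) for a graph given as a list of parent sets."""
    return log_prior(parent_sets) + sum(
        local_log_marginal(data, j, sorted(Gj), levels)
        for j, Gj in enumerate(parent_sets))
```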
3 A Naïve Gibbs Sampler

The standard sampler for structural inference for Bayesian networks is MC³ (Madigan et al., 1994), a Metropolis-Hastings sampler that explores 𝒢 by proposing to add or remove a single arc from the current graph G. The sampler can be viewed as exploring the space of adjacency matrices: p × p matrices whose elements G_ij are indicator variables for whether G includes an arc from i to j, and whose diagonal elements G_ii are 0 for all i. Each proposal G′ is drawn uniformly at random from the neighbourhood η(G) of the current graph, defined as the set of graphs that differ from G by a single edge addition or removal. The proposal G′ is accepted with probability

    α = min{ 1, [P(G′ | X) |η(G′)|⁻¹] / [P(G | X) |η(G)|⁻¹] }.

Constructing a Gibbs sampler that is analogous to MC³ is straightforward. To do this, we consider the posterior distribution on Bayesian networks to be a joint distribution on the off-diagonal entries of the adjacency matrix. We thus have p(p − 1) random variables G_ij, each of which takes the value 0 or 1. The proposal distribution of MC³ can also be viewed as proposing to toggle the value of G_ij for some i ≠ j, subject to the restriction that the proposal must be acyclic. A simple Gibbs sampler works in a similar way. Each step of the Gibbs sampler samples G_ij from its conditional distribution, given the rest of the graph G^C_ij = {G_uv : 1 ≤ u ≤ p, 1 ≤ v ≤ p} \ {G_ij}. Define G¹_ij as the graph G with an arc from i to j, and G⁰_ij as the graph G with no arc from i to j. If G¹_ij is acyclic, the conditional distribution of G_ij is Bernoulli:

    P(G_ij = g | G^C_ij, X) = P(G^g_ij | X) / [P(G⁰_ij | X) + P(G¹_ij | X)],   g ∈ {0, 1}.

If G¹_ij is cyclic, G⁰_ij is sampled with probability 1. The choice of i and j can be made either sequentially (systematically) or randomly. There are few theoretical results to guide the choice between random- and systematic-scan Gibbs samplers (Roberts and Sahu, 1997). Here, random-scan Gibbs samplers are used throughout.

This naïve Gibbs sampler offers no advantages over MC³. However, thinking of structural inference from a Gibbs sampling perspective opens up the possibility of drawing on ideas from the Gibbs sampling literature to improve the mixing rate of the MCMC algorithm, as we now discuss.
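For concreteness, a minimal sketch of the single-edge update just described is given below. It reuses the illustrative log_posterior sketched at the end of Section 2, uses a plain depth-first search as the acyclicity check (Section 7 describes the much faster scheme we actually use), and for brevity rescores the whole graph even though only the j-th local term changes.

```python
# Illustrative single-edge Gibbs update: resample the indicator G_ij given
# the rest of the graph, using a depth-first search for the acyclicity check.
import math, random

def has_path(parent_sets, src, dst):
    """True if a directed path from V_src to V_dst exists in the graph."""
    stack, seen = [src], set()
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        if v not in seen:
            seen.add(v)
            # The children of v are the vertices that list v among their parents.
            stack.extend(j for j, Gj in enumerate(parent_sets) if v in Gj)
    return False

def gibbs_edge_update(parent_sets, i, j, data, levels):
    """Resample G_ij from its conditional distribution (i != j assumed)."""
    without = [set(G) for G in parent_sets]
    without[j].discard(i)
    if has_path(without, j, i):        # adding i -> j would close a cycle,
        return without                 # so G_ij = 0 is sampled with probability 1
    with_edge = [set(G) for G in without]
    with_edge[j].add(i)
    lw0 = log_posterior(without, data, levels)    # only the j-th local term differs;
    lw1 = log_posterior(with_edge, data, levels)  # a real implementation would cache the rest
    m = max(lw0, lw1)
    w0, w1 = math.exp(lw0 - m), math.exp(lw1 - m)
    return with_edge if random.random() < w1 / (w0 + w1) else without
```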
4 Optimising Gibbs Samplers

The mixing of Metropolis-Hastings algorithms depends strongly upon the choice of proposal distribution, whereas Gibbs samplers make moves according to the conditional distribution. The conditional distribution has the attractive property that it exactly reflects some local structure of the target distribution, in contrast to proposal distributions for a Metropolis-Hastings algorithm, which are often chosen for convenience.

Nonetheless, Gibbs sampling is not always efficient. Inefficiency occurs when there is strong correlation between the components of the random vector. To see this, consider Gibbs sampling a multivariate continuous distribution with highly correlated components. At each step, a single component of the random vector is sampled according to its conditional distribution, but since this component is strongly correlated with another component, the conditional distribution will be concentrated on only a small part of its support. This means that the sampler is likely to make only small moves, and so it explores the sample space slowly.

The same issue arises with discrete distributions. For Bayesian networks, it is clear that there will be strong dependence between the G_ij, particularly within the groups {G_ij : 1 ≤ i ≤ p} for each 1 ≤ j ≤ p that correspond to parent sets. For example, there may be random variables X_r and X_s that do not individually predict X_j well, but do so when taken in combination. Another possibility is two pairs of random variables X_r, X_s and X_u, X_v that both predict X_j well in combination, but such that none of the four random variables does so individually. In this case, the probability of transitioning from a graph in which the parents of X_j are V_r and V_s to a graph in which the parents of X_j are V_u and V_v may be extremely low.

One method for alleviating this problem is to transform the distribution so that the components of the random variable are not correlated. In general, finding a suitable transformation can be very difficult, and in the case of Bayesian networks it seems unlikely that an appropriate transformation is available. Instead we propose to group a number of the components together and sample from their joint conditional distribution. In Gibbs sampling, this is known as 'blocking'. The method is widely thought to be beneficial in settings such as this, where there is strong correlation between components of the random variable. In the case of multivariate normal distributions, Roberts and Sahu (1997) have shown that for random-scan Gibbs sampling, convergence improves when components of the random vector are sampled as blocks. By sampling from the joint conditional distribution of a number of components, we avoid the issues caused by the correlation between these components, because the joint conditional distribution naturally incorporates the correlation structure and so can account for it.

5 Single Parent Set Blocks

As we noted above, the efficiency of a Gibbs sampler can be improved by blocking together a number of components and sampling from their joint conditional distribution. In theory, any group of components can be taken as a block, but sampling from their joint conditional distribution needs to be possible, and ideally simple. For Bayesian networks, the most natural blocks are the groups {G_ij : 1 ≤ i ≤ p} for each 1 ≤ j ≤ p. We can think of this as parameterising G by its parent sets G_1, . . . , G_p. This seems natural for Bayesian networks because the marginal likelihood p(X | G) for a graph G factorises across vertices into conditionals p(X_j | X_{G_j}). These conditionals depend on the parent set of each vertex, and so the posterior distribution on Bayesian networks G ∈ 𝒢 can be written as a function of G_1, . . . , G_p in the following way:

    P(G_1, . . . , G_p | X) ∝ π(G_1, . . . , G_p) ∏_{i=1}^{p} p(X_i | X_{G_i}).

To be able to construct a Gibbs sampler using these blocks, we need to find the conditional distribution for the blocks. Specifically, we need the conditional distribution for a block G_j = {G_ij : 1 ≤ i ≤ p}, 1 ≤ j ≤ p, given the other parent sets G_{−j} = {G_1, . . . , G_{j−1}, G_{j+1}, . . . , G_p}. Parent sets G_j for which G = (G_j, G_{−j}) is cyclic will have no probability mass in the conditional distribution. Let K*_j be the set of parent sets G_j such that G = (G_j, G_{−j}) is acyclic. The conditional posterior distribution of G_j is multinomial for G_j ∈ K*_j, with weights given by the posterior distribution of G = (G_j, G_{−j}):

    P(G_j | G_{−j}, X) = P(G_j, G_{−j} | X) / P(G_{−j} | X)
                       = P(G_j, G_{−j} | X) / Σ_{G_j ∈ K*_j} P(G_j, G_{−j} | X).        (1)
It is interesting to note that when K*_j = P(V \ {V_j}), the conditional distribution (1) can be viewed as the posterior distribution of a standard Bayesian variable selection problem with dependent variable V_j and independent variables V_{−j}. When K*_j ⊂ P(V \ {V_j}), the conditional distribution (1) is the posterior of a Bayesian variable selection problem whose independent variables are a subset of V_{−j}. This highlights the connection between structural learning for Bayesian networks and Bayesian variable selection problems. While this connection has not been widely used for directed graphs, it has recently been widely studied and exploited for undirected graphs (e.g. Meinshausen and Bühlmann, 2006).

When these variable selection problems are tractable, we can immediately implement a Gibbs sampler for G. At each step, we simply draw a new parent set for vertex V_j from P(G_j | G_{−j}, X). Algorithm 1 is an outline of this Gibbs sampler.

Algorithm 1 A Gibbs sampler, with blocks
  Initialise starting point G⁰ = (G⁰_1, . . . , G⁰_p)
  for i in 1 to N do
    for j in 1 to p do
      Draw G′_j ∼ P(G_j | G^{i−1}_{−j}, X)
      Set G^i ← G′ = (G′_j, G^{i−1}_{−j})
    end for
  end for

6 Two Parent Set Blocks

The Gibbs sampler described in Section 5 groups the edge indicators G_ij into blocks corresponding to the parents of a node, as these are likely to be correlated. However, this may not be sufficient to achieve quick convergence, because there is often strong correlation between components G_ij and G_ij′ with j ≠ j′. When this is the case, samplers using the blocking strategy described above may fail to mix efficiently.

In this case, we can form larger blocks consisting of pairs of parent sets {G_ij : j ∈ {j_1, j_2}, 1 ≤ i ≤ p}. At each step of the Gibbs sampler, we now conditionally sample pairs of parent sets (G_{j1}, G_{j2}), given the remainder of the graph G_{−{j1,j2}}. Pairs of parent sets (G_{j1}, G_{j2}) for which G = (G_{j1}, G_{j2}, G_{−{j1,j2}}) is cyclic have zero probability in the conditional distribution. Let K*_{j1,j2} be the set of pairs of parent sets (G_{j1}, G_{j2}) such that G = (G_{j1}, G_{j2}, G_{−{j1,j2}}) is acyclic. For (G_{j1}, G_{j2}) ∈ K*_{j1,j2}, the conditional posterior distribution is multinomial, with weights given by the posterior distribution of G = (G_{j1}, G_{j2}, G_{−{j1,j2}}):

    P(G_{j1}, G_{j2} | G_{−{j1,j2}}, X) = P(G_{j1}, G_{j2}, G_{−{j1,j2}} | X) / P(G_{−{j1,j2}} | X)
                                        = P(G_{j1}, G_{j2}, G_{−{j1,j2}} | X) / Σ_{(G_{j1}, G_{j2}) ∈ K*_{j1,j2}} P(G_{j1}, G_{j2}, G_{−{j1,j2}} | X).        (2)

7 Computational Aspects

Up to this point we have ignored the computational aspects of the algorithm. In this section, we describe how the algorithms described above can be implemented efficiently. The main computational difficulty encountered by these algorithms is checking for cycles.

7.1 Checking for Cycles

The most straightforward method for checking for cycles is depth-first search, which takes O(p + e) time, where e is the number of edges in the graph. For MC³, we must consider all possible single-edge changes to a directed graph, of which there are O(p²), and so checking for cycles at each iteration takes O(p³) time in the worst case. This creates a bottleneck in the algorithm, but we can avoid it by using ideas first proposed in this context by Giudici and Castelo (2003). We describe an alternative method that was proposed in the dynamic algorithms literature by King and Sagert (2002).

Let T^G be the transitive closure of the current state of the sampler, which for a graph G = (V, E) is defined as the directed graph (V, E*), where (V_i, V_j) ∈ E* if and only if a path from V_i to V_j exists in G.
Knowing the transitive closure is of use because its adjacency matrix T^G = (T^G_{ij}) immediately reveals which alterations can be made to G without introducing a cycle. The addition of an edge (i, j) will introduce a cycle if and only if T^G_{ji} = 1. Clearly, removing an edge from G will never introduce a cycle. The adjacency matrix of the transitive closure therefore enables graphs created by single-edge additions to be screened for cycles in O(1) time.

The transitive closure for an arbitrary directed graph can be determined in O(p^ω) time (Munro, 1971), where ω is the best known exponent for matrix multiplication (Coppersmith and Winograd, 1990, show ω < 2.376). However, only incremental changes are made to the current state G of the sampler, so a dynamic algorithm can be used to compute the transitive closure more efficiently. We need a fully dynamic transitive closure algorithm, so that both insertion and deletion of edges are supported. This problem has been the subject of significant interest in the dynamic algorithms literature; Demetrescu et al. (2010) provide an overview. Algorithms for this problem provide a procedure for querying the transitive closure, and procedures that update the transitive closure when an edge is added to or removed from the graph. A trade-off exists between the performance of these two operations (Demetrescu and Italiano, 2005). We choose to implement the algorithm introduced by King and Sagert (2002), which allows queries to be performed in O(1) time, and updates in O(p²) worst-case time, assuming a word size of O(log p). This is thought to be the best possible update bound (Demetrescu and Italiano, 2005), yet the algorithm is simple to implement.

The algorithm maintains a path count matrix C^G = (C^G_{ij}), where C^G_{ij} is the number of distinct paths from V_i to V_j in G. Clearly, T^G_{ij} = 1 if and only if C^G_{ij} > 0, and so query operations are performed simply by checking whether the relevant component of C^G is positive. Surprisingly, updating C^G is also straightforward. Suppose G′ is formed by adding an edge (i, j) to a graph G. Denote the i-th column of C^G by C^G_{•i}, and the j-th row by C^G_{j•}. The increase in the number of distinct paths between any two vertices V_a and V_b is given by the (a, b) element of the outer product (denoted ⊗) of C^G_{•i} and C^G_{j•}. Updating the path count matrix after the addition of an edge (i, j) therefore amounts to adding this outer product to the existing path count matrix:

    C^{G′} = C^G + C^G_{•i} ⊗ C^G_{j•}.

Removing an edge from V_i to V_j is performed analogously:

    C^{G′} = C^G − C^G_{•i} ⊗ C^G_{j•}.

This algorithm is simple to implement, and provides a fast method for determining which edges can be added to a DAG without introducing a cycle.
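A minimal sketch of this bookkeeping is given below. The class name and interface are ours; Python's arbitrary-precision integers are used for the counts, since the number of distinct paths can grow very quickly.

```python
# Path count matrix in the style of King and Sagert (2002): C[a, b] counts the
# distinct directed paths from V_a to V_b (including the trivial path a = b),
# so the edge (i, j) can be added without creating a cycle iff C[j, i] == 0.
import numpy as np

class PathCounts:
    def __init__(self, p):
        # Start from the empty graph on p vertices.
        self.C = np.eye(p, dtype=object)   # object dtype holds unbounded Python ints

    def addable(self, i, j):
        """True if the edge (i, j) can be added without introducing a cycle."""
        return self.C[j, i] == 0

    def add_edge(self, i, j):
        # Every new path a -> b consists of a path a -> i, the edge (i, j),
        # and a path j -> b, so the counts increase by an outer product.
        self.C += np.outer(self.C[:, i], self.C[j, :])

    def remove_edge(self, i, j):
        self.C -= np.outer(self.C[:, i], self.C[j, :])

# Example: pc = PathCounts(3); pc.add_edge(0, 1); pc.add_edge(1, 2)
# then pc.addable(2, 0) is False, since adding 2 -> 0 would close a cycle.
```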
7.2 Application to Gibbs Sampler

The algorithms in the previous section enable the Gibbs samplers to be implemented efficiently. We first describe how this can be done for the Gibbs sampler of Section 5, and then describe how the approach can be extended when blocks are formed from two parent sets, as described in Section 6.

An efficient implementation of the Gibbs sampler with single parent set blocks (Section 5) requires a method for efficiently computing K*_j, the set of parent sets G_j such that G = (G_j, G_{−j}) is acyclic. This is, in fact, not difficult if the transitive closure of the current graph is available. Recall that adding an edge (i, j) will introduce a cycle if and only if T^G_{ji} = 1. Therefore, the set of vertices that can be added as parents of V_j is K_j = {V_i : T^G_{ji} = 0}. In fact, any subset of K_j can also be added without introducing a cycle. It is clear, therefore, that K*_j = P(K_j), the power set of K_j. An efficient algorithm that implements the Gibbs sampler using this method is outlined in Algorithm 2.

Algorithm 2 An efficient Gibbs sampler, with blocks
  Initialise starting point G⁰ = (G⁰_1, . . . , G⁰_p)
  Compute the initial path count matrix C^{G⁰}
  for i in 1 to N do
    for j in 1 to p do
      Remove the parents of V_j in G^{i−1}; update C^{G^{i−1}}
      Retrieve the permissible parents K_j = {V_k : C^{G^{i−1}}_{jk} = 0} of V_j
      Compute the power set K*_j = P(K_j)
      for G_j in K*_j do
        Evaluate p(X_j | X_{G_j})
      end for
      Evaluate P(G_j | G^{i−1}_{−j}, X) using (1)
      Draw G′_j ∼ P(G_j | G^{i−1}_{−j}, X)
      Set G^i ← G′ = (G′_j, G^{i−1}_{−j})
      Update C^{G^i}
    end for
  end for

The two parent set Gibbs sampler (Section 6) requires that samples can be drawn from (2) quickly. To do this, we describe a method for identifying K*_{j1,j2}, the pairs of parent sets which, when added to the graph, do not create a cycle. Once again, the transitive closure makes it possible to find the required parent sets. We first define K_{j1} and K_{j2} to be the non-descendants of V_{j1} and V_{j2} respectively, and K^C_{j1} and K^C_{j2} to be the descendants of V_{j1} and V_{j2} respectively. As before, K*_{j1} = P(K_{j1}) and K*_{j2} = P(K_{j2}) are the sets of parent sets that can separately be added to V_{j1} and V_{j2} without forming a cycle.

However, K*_{j1,j2} is not equal to K*_{j1} × K*_{j2}, because there may be a subset C of K*_{j1} × K*_{j2} whose elements (G_{j1}, G_{j2}) are such that G = (G_{j1}, G_{j2}, G_{−{j1,j2}}) is cyclic. Elements (G_{j1}, G_{j2}) in C have the property that G_{j1} and G_{j2} can each be added to the current graph without forming a cycle, but, when added to the current graph together, will create a cycle. The elements of C correspond to the situation in which both a descendant of V_{j1} is added as a parent of V_{j2} and a descendant of V_{j2} is added as a parent of V_{j1}. When enumerating K*_{j1,j2} we must therefore ensure that the two parent sets do not both contain a descendant of the other vertex. This makes it helpful to partition K*_{j1,j2} into three groups, depending on whether a path is formed from V_{j1} to V_{j2}, from V_{j2} to V_{j1}, or neither. The corresponding partition is K*_{j1,j2} = H_0 ∪ H_1 ∪ H_2, where

    H_0 = (K*_{j1} ∩ K*_{j2}) × (K*_{j2} ∩ K*_{j1}),
    H_1 = (K*_{j1} \ K*_{j2}) × (K*_{j2} ∩ K*_{j1}),
    H_2 = (K*_{j1} ∩ K*_{j2}) × (K*_{j2} \ K*_{j1}).

In H_0, no paths will be formed between V_{j1} and V_{j2}, because only non-descendants of both V_{j1} and V_{j2} are included as potential parents. In H_1, V_{j1} has a parent that is a descendant of V_{j2}, and so the parents of V_{j2} cannot include V_{j1} or any of its descendants. Similarly, in H_2, V_{j2} has a parent that is a descendant of V_{j1}, and so the parents of V_{j1} cannot include V_{j2} or any of its descendants. The partition gives a simple way to enumerate all of the elements of K*_{j1,j2}, because all of these sets are functions of the transitive closure, which we dynamically update as described above. Since we can enumerate H_0, H_1 and H_2, we can enumerate K*_{j1,j2}. This means we can draw samples from (2), as required by the Gibbs sampler with two parent set blocks.
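Putting these pieces together, a single inner step of Algorithm 2 might be sketched as follows. It reuses the illustrative PathCounts and local_log_marginal from the earlier sketches, assumes the flat prior used in the experiments below (so only the local marginal likelihoods vary across candidate parent sets), and caps the in-degree to keep the enumeration of K*_j small, anticipating the constraint used in Section 8.

```python
# Illustrative single-parent-set block update (the inner step of Algorithm 2).
import numpy as np
from itertools import combinations

def sample_parent_set(j, parent_sets, pc, data, levels, max_indegree=3, rng=None):
    """Draw G_j ~ P(G_j | G_{-j}, X) and keep the path count matrix in step."""
    rng = rng or np.random.default_rng()
    # Detach V_j from its current parents so that K_j reflects the rest of the graph.
    for i in list(parent_sets[j]):
        pc.remove_edge(i, j)
    K_j = [i for i in range(len(parent_sets)) if pc.addable(i, j)]
    # Enumerate the permitted parent sets (the power set of K_j, truncated here by
    # the in-degree limit) and score each one with its local marginal likelihood.
    candidates = [c for size in range(max_indegree + 1)
                  for c in combinations(K_j, size)]
    log_w = np.array([local_log_marginal(data, j, list(c), levels) for c in candidates])
    w = np.exp(log_w - log_w.max())                    # numerically stable weights
    new_Gj = candidates[rng.choice(len(candidates), p=w / w.sum())]
    parent_sets[j] = set(new_Gj)
    for i in new_Gj:
        pc.add_edge(i, j)
    return parent_sets
```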
8 Experiments

We analysed the performance of the MCMC samplers using synthetic data generated from the ALARM network (Beinlich et al., 1989), which includes 46 edges. This network is widely used to examine the performance of structural learning methods. We compare the performance of our Gibbs sampler with MC³, the REV sampler of Grzegorczyk and Husmeier (2008), and a constraint-based approach introduced by Xie and Geng (2008). The constraint-based approach we use is a variant of the PC-algorithm (Spirtes et al., 2000), which Xie and Geng (2008) demonstrate can outperform the PC-algorithm. Our MC³ and Gibbs samplers both implement the fast updating of the transitive closure from King and Sagert (2002), and cache local marginal likelihoods. We used the default settings for all of these samplers, in particular setting the significance level α = 0.05 for the constraint-based method. The Gibbs sampler we used was a random-scan sampler, which used two parent set block moves with probability 0.25, and one parent set block moves otherwise. We chose a flat, improper prior π(G) ∝ 1, and constrained all of the MCMC samplers to graphs with in-degree no greater than 3.

Since the true graph is known, the performance of structural learning algorithms can be assessed using ROC (receiver operating characteristic) curves. The MCMC algorithms return estimates of the posterior distribution on Bayesian networks P(G | X), from which the posterior probability p(E) of any arc can be computed. For some threshold τ, 0 ≤ τ ≤ 1, consider the set of edges E_τ ⊂ V × V such that p(E) ≥ τ. Since we know the true graph, we can compare E_τ to the true graph, and in particular count the number of true and false positive edges. An ROC curve is given by plotting the number of true positive edges against the number of false positive edges, for a range of values of 0 ≤ τ ≤ 1. Naturally, we seek to maximise the number of true positives for a given number of false positives, and so the algorithms with the greatest area under the curve are preferred.
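For illustration, the counts behind such a curve can be computed directly from the MCMC output. The sketch below (the function names are ours) averages the sampled adjacency matrices to estimate the posterior edge probabilities and then thresholds them against the known graph.

```python
# Illustrative ROC counts from MCMC output: `samples` is a list of sampled
# graphs, each a list of parent sets; `true_adj` is the known adjacency matrix.
import numpy as np

def edge_probabilities(samples):
    """Posterior probability of each arc, estimated by averaging over samples."""
    p = len(samples[0])
    A = np.zeros((p, p))
    for parent_sets in samples:
        for j, Gj in enumerate(parent_sets):
            for i in Gj:
                A[i, j] += 1
    return A / len(samples)

def roc_points(edge_probs, true_adj, thresholds=np.linspace(0.0, 1.0, 101)):
    """(false positives, true positives) for each threshold tau."""
    off_diag = ~np.eye(true_adj.shape[0], dtype=bool)
    points = []
    for tau in thresholds:
        E_tau = (edge_probs >= tau) & off_diag
        tp = int(np.sum(E_tau & (true_adj == 1)))
        fp = int(np.sum(E_tau & (true_adj != 1)))
        points.append((fp, tp))
    return points
```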
We drew three independent samples from the ALARM network, with sample sizes n = 100, 500 and 750. To ensure a fair comparison, we fixed the compute time, running each of the MCMC samplers for 20 minutes, and performed three independent runs starting from different initial graphs.

Figure 1 plots ROC curves for the Gibbs sampler, MC³, the REV sampler and Xie and Geng's constraint-based method. The MC³ algorithm gives the most variable results. For n = 100, in two cases, the MC³ results are the most consistent with the true ALARM network. However, all of the MC³ runs assign probability zero to at least one true edge in the graph. This is apparent in the ROC plots from the failure of the number of true positive edges to reach 46, the total number of edges in the ALARM network, until τ = 0. This effect is symptomatic of a sampler that has not converged. Figure 2, which compares the posterior edge probabilities of two independent runs of the sampler, confirms that MC³ fails to converge. The REV sampler performs well at low sample sizes, but for n = 500 and n = 750 we again observe that the posterior edge probabilities are zero for some true edges, again implying that the sampler has failed to converge. The constraint-based method only gives a point estimate, and so this is indicated on the ROC plot by a cross. We see that the method performs poorly when n = 100, but impressively for large sample sizes.

The Gibbs sampler, however, gives the most consistent results across runs. The samplers appear to have converged (see Figure 2), which is impressive in only 20 minutes, and in terms of the ROC analysis the Gibbs sampler is mostly among the best performers. However, it is the consistency of its results that is most appealing in practice, and this stems from its apparent ability to converge rapidly.

Figure 1: ROC curves given by the estimated posterior distributions from our Gibbs sampler, MC³, REV, and Xie and Geng's constraint-based method, for n = 100, 500 and 750. Each panel plots the number of true positives (y-axis) against the number of false positives (x-axis) for a range of values of τ.

Figure 2: Convergence control for MC³ (left) and the Gibbs sampler (right), for n = 500. The posterior edge probabilities given by two independent runs are plotted against each other. When the two runs give the same estimates of the posterior edge probabilities, all of the points appear on the line y = x. We observe that the two Gibbs runs give comparable posterior edge probabilities, but the MC³ runs do not. This was typical of all the runs and sample sizes.

9 Conclusion

We have introduced a Gibbs sampler for structural inference of Bayesian networks. The sampler uses the idea of blocking to improve its rate of convergence, and we demonstrated its utility empirically on simulated data.
An appealing aspect of the approach is that it highlights and uses the connection between Bayesian variable selection and structural inference of Bayesian networks. A number of theoretical results about Bayesian variable selection are known (see, e.g., Scott and Berger, 2010), whereas few equivalent theoretical results for Bayesian networks are available. For undirected graphical models, the connection is widely appreciated, for example in the context of the graphical lasso (Meinshausen and Bühlmann, 2006). Methods exploiting the connection for directed models may therefore enable methods and ideas from the variable selection literature to be adapted for use in structural inference.

References

Beinlich, I., Suermondt, H., Chavez, R. and Cooper, G. (1989) The ALARM monitoring system: a case study with two probabilistic inference techniques for belief networks. In Second European Conference on Artificial Intelligence in Medicine, pp. 247-256. Berlin: Springer-Verlag.

Coppersmith, D. and Winograd, S. (1990) Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9, 251-280.

Demetrescu, C., Eppstein, D., Galil, Z. and Italiano, G. F. (2010) Dynamic graph algorithms. In Algorithms and Theory of Computation Handbook: General Concepts and Techniques (eds. M. J. Atallah and M. Blanton), pp. 9.1-9.28. Boca Raton: CRC Press.

Demetrescu, C. and Italiano, G. (2005) Trade-offs for fully dynamic transitive closure on DAGs: breaking through the O(n²) barrier. Journal of the ACM, 52, 147-156.

Eaton, D. and Murphy, K. (2007) Bayesian structure learning using dynamic programming and MCMC. In Proceedings of the 23rd Annual Conference on Uncertainty in Artificial Intelligence (UAI-07), pp. 101-108. Corvallis, Oregon.

Ellis, B. and Wong, W. H. (2008) Learning causal Bayesian network structures from experimental data. Journal of the American Statistical Association, 103, 778-789.

Friedman, N. and Koller, D. (2003) Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50, 95-125.

Geiger, D. and Heckerman, D. (1994) Learning Gaussian networks. In Proceedings of the 10th Annual Conference on Uncertainty in Artificial Intelligence (UAI-94).

Giudici, P. and Castelo, R. (2003) Improving Markov chain Monte Carlo model search for data mining. Machine Learning, 50, 127-158.

Grzegorczyk, M. and Husmeier, D. (2008) Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move. Machine Learning, 71, 265-305.

Heckerman, D., Geiger, D. and Chickering, D. M. (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20, 197-243.

Kalisch, M. and Bühlmann, P. (2008) Robustification of the PC-algorithm for directed acyclic graphs. Journal of Computational and Graphical Statistics, 17, 773-789.

King, V. and Sagert, G. (2002) A fully dynamic algorithm for maintaining the transitive closure. Journal of Computer and System Sciences, 65, 150-167.

Koivisto, M. and Sood, K. (2004) Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5, 549-573.

Madigan, D., Raftery, A. E., York, J. C., Bradshaw, J. M. and Almond, R. G. (1994) Strategies for graphical model selection. In Selecting Models from Data: AI and Statistics IV (eds. P. Cheeseman and R. W. Oldford), pp. 91-100. New York: Springer-Verlag.
Madigan, D. and York, J. C. (1995) Bayesian graphical models for discrete data. International Statistical Review / Revue Internationale de Statistique, 63, 215-232.

Meinshausen, N. and Bühlmann, P. (2006) High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34, 1436-1462.

Mukherjee, S. and Speed, T. P. (2008) Network inference using informative priors. Proceedings of the National Academy of Sciences of the United States of America, 105, 14313-14318.

Munro, I. (1971) Efficient determination of the transitive closure of a directed graph. Information Processing Letters, 1, 56-58.

Parviainen, P. and Koivisto, M. (2009) Exact structure discovery in Bayesian networks with less space. In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI-09).

Roberts, G. O. and Sahu, S. K. (1997) Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 59, 291-317.

Scott, J. and Berger, J. (2010) Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38, 2587-2619.

Spirtes, P., Glymour, C. and Scheines, R. (2000) Causation, Prediction, and Search. Cambridge, MA: The MIT Press, second edition.

Werhli, A. V. and Husmeier, D. (2007) Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Statistical Applications in Genetics and Molecular Biology, 6.

Xie, X. and Geng, Z. (2008) A recursive method for structural learning of directed acyclic graphs. Journal of Machine Learning Research, 9, 459-483.