Annealed Importance Sampling for Structure Learning in Bayesian Networks∗

Teppo Niinimäki and Mikko Koivisto
HIIT & Department of Computer Science, University of Helsinki, Finland
teppo.niinimaki@cs.helsinki.fi, mikko.koivisto@cs.helsinki.fi

∗ This research is supported in part by the Academy of Finland, grants 125637, 218153, and 255675.

Abstract

We present a new sampling approach to Bayesian learning of the Bayesian network structure. Like some earlier sampling methods, we sample linear orders on nodes rather than directed acyclic graphs (DAGs). The key difference is that we replace the usual Markov chain Monte Carlo (MCMC) method by the method of annealed importance sampling (AIS). We show that AIS is not only competitive with MCMC in exploring the posterior, but also superior to MCMC in two ways: it enables easy and efficient parallelization, due to the independence of the samples, and lower-bounding of the marginal likelihood of the model with good probabilistic guarantees. We also provide a principled way to correct the bias due to order-based sampling, by implementing a fast algorithm for counting the linear extensions of a given partial order.

1 Introduction

To learn the structure of a Bayesian network (BN) [Pearl, 1988; 2000] from data, the Bayesian paradigm [Buntine, 1991; Cooper and Herskovits, 1992; Madigan and York, 1995] offers many appealing features, such as an explicit way to enter prior knowledge and full characterization of uncertainty about the quantities of interest, including proper treatment of potential non-identifiability issues. A major drawback of the Bayesian approach is, however, its large computational requirements. Indeed, the space of structures, namely directed acyclic graphs (DAGs), grows rapidly with the number of nodes in the network, and, even if a single good DAG could be found relatively fast, exploring the whole landscape of DAGs to some sufficient extent presents a difficult computational challenge. Currently, the fastest exact algorithms scale up to around 25-node instances [Koivisto and Sood, 2004; Koivisto, 2006; Tian and He, 2009].

The Markov chain Monte Carlo (MCMC) method has revolutionized applied mathematics in general [Diaconis, 2009] and the practice of Bayesian statistics in particular [Cappé and Robert, 2000], and it has been applied in various forms also to structure learning in BNs. We summarize some corner stones of the developments. Madigan and York [1995] presented a Markov chain that moves in the space of DAGs by simple arc changes. Friedman and Koller [2003] obtained a significantly faster-mixing chain by operating, not directly on DAGs, but in the much smaller and smoother space of node orderings. A drawback of the sampler, order-MCMC in the sequel, is that it introduces a bias favoring DAGs that are compatible with many node orderings. Ellis and Wong [2008] enhanced order-MCMC in two dimensions: a sophisticated sampler exploiting tempered distributions was implemented, and a heuristic for correcting the bias was introduced. Niinimäki et al. [2011] extended order-MCMC in another dimension, by showing that sampling suitable partial orders, instead of linear orders, further improves the mixing of the Markov chain, with negligible computational overhead. Also other refinements to Madigan and York's sampler have been presented [Eaton and Murphy, 2007; Grzegorczyk and Husmeier, 2008; Corander et al., 2008], however with somewhat more limited advantages over order-MCMC.

While the current MCMC methods for structure learning seem to work well in many cases, they, unfortunately, fail to satisfy some key desiderata:
1. Guarantees. We would like to know how the algorithm's output relates to the quantity of interest. For instance, is it a lower bound, an upper bound, or an approximation to within some multiplicative or additive term? (Existing MCMC methods offer such guarantees only in the limit of running the algorithm infinitely many steps.)

2. Parallelizability. We would like to fully exploit parallel computation, that is, to run the algorithm in parallel on thousands of processors, preferably without frequent synchronization or communication between the parallel processes. (Existing MCMC methods are designed rather for a small number of very long runs.)

3. Bias correction. We would like to take advantage of the reduced state space (whether total or partial orders), yet enable the use of the uniform prior on (Markov-equivalent) DAGs. (Existing MCMC methods rely on heuristic arguments to correct the bias and may become computationally infeasible with small data; we elaborate on this in Section 2.2.)

The present work makes a step toward satisfying these three desiderata. Motivated by the first desire for guarantees, we seek a sampler such that we know exactly from which distribution the samples are drawn. Here, the method of annealed importance sampling (AIS) by Neal [2001] provides an appealing solution. It enables drawing independent and identically distributed samples and computing the associated importance weights, so that the expected value of each weighted sample matches the quantity of interest. In this setting, already a small number of samples may suffice, not only for finding an accurate estimate, but also for showing a relatively tight, high-confidence lower bound on the true value [Gomes et al., 2007; Gogate et al., 2007; Gogate and Dechter, 2011]. The independence of the samples also readily offers an easy way to parallelize the computations, thus satisfying the second desideratum. Finally, we address the issue of bias correction by computing the required correction term explicitly. This amounts to implementing an efficient algorithm for counting the number of topological sortings of a given DAG, or in other words, the linear extensions of the corresponding partial order. The problem is #P-hard [Brightwell and Winkler, 1991], which may have led to the search for indirect solutions in earlier works. However, as we will show in this paper, a careful implementation of a dynamic programming algorithm allows us to count the number of linear extensions exactly up to around 40-node instances, typically within a few seconds.

The purpose of this report is to communicate the main ideas and motivation underlying the proposed approach, demonstrate the potential of the approach, identify its current bottlenecks, and discuss the prospects of future research in the suggested direction.
We have targeted our experiments to study rather specific questions using a few benchmark data sets. Extensive experimentation is left for future work.

We note that previously Battle et al. [2010] have applied AIS to structure learning in Bayesian networks in a molecular biology application. The key difference to our method is that, instead of node orderings, they sample fully specified network models. They motivate the choice of AIS mainly by the complexity and multi-modality of the posterior distribution, which often lead to slow mixing of usual MCMC methods.

2 Preliminaries

We begin by describing the structure learning problem, mostly adopting the notation of Niinimäki et al. [2011]. We then briefly review Friedman and Koller's [2003] order-MCMC method and extend it based on Geyer's [1991] Metropolis-coupled MCMC (MC3). In our experiments, reported in Section 3, we use MC3 as a proxy of a related implementation by Ellis and Wong [2008].¹ We also discuss the possible drawbacks of Ellis and Wong's [2008] heuristic for correcting the bias of order-MCMC.

¹ The software used by Ellis and Wong [2008] is not publicly available at the moment (W. H. Wong, personal communication).

2.1 Structure Learning in Bayesian Networks

A Bayesian network (BN) represents a joint distribution of a vector of random variables D = (D1, . . . , Dn) in terms of a directed acyclic graph (DAG) (N, A), where each node v ∈ N = {1, . . . , n} corresponds to a random variable Dv, and the arc set A ⊆ N × N specifies the parent set of each node, Av = {u ∈ N : uv ∈ A}. For each node v and its parent set Av the BN specifies a local conditional distribution p(Dv | DAv, Av), and composes the joint distribution of D as

    p(D|A) = ∏_{v∈N} p(Dv | DAv, Av).

Here and henceforth, we identify the DAG with its arc set A, with the understanding that the node set N is fixed while varying configurations for A are of our interest. Indeed, we treat the arc set A as a random variable and consider the Bayesian approach to learn A from observed values of D, called data. In this setting it is customary to extend the BN model to the case where D is not a single vector but a matrix of random variables, consisting of m vectors, so that each Dv is a tuple of m random variables. Usual assumptions of exchangeability or modularity concerning the parametrization of the local conditional distributions keep the above factorization of p(D|A) valid. Under the most popular specifications, this term can be efficiently computed for a given A [Heckerman et al., 1995].

By choosing some prior distribution p(A), the posterior distribution is obtained via the Bayes rule: p(A|D) = p(A)p(D|A)/p(D), where p(D) is the marginal likelihood of the model, sometimes also called the evidence or the normalization constant. In addition to this fundamental quantity, the user is often interested in posterior expectations of various structural features f(A). For example, f(A) can be the indicator function of some particular arc of interest, evaluating to 1 if the arc is present in A, and to 0 otherwise [Friedman and Koller, 2003].

It is common to assign a modular prior, that is, p(A) ∝ q(A) = ∏_v qv(Av) for some nonnegative functions qv. For instance, the uniform prior on DAGs is obtained by setting simply qv(Av) ≡ 1. Often the support of the prior is restricted to DAGs in which the indegree of every node is at most some constant k, that is, qv(Av) = 0 when |Av| > k.

Motivated by computational issues, Friedman and Koller [2003] introduced another class of priors, in which the model is augmented with a new variable, a linear order on the nodes, L ⊆ N × N, or node ordering less formally. We call a linear order L an extension of A if A ⊆ L. Now, we assign a joint prior p(A, L) that vanishes whenever A ⊄ L, and p(A, L) ∝ q(A)ρ(L) otherwise, where q(A) is as before and ρ(L) is some nonnegative function; in the sequel, we take simply ρ(L) ≡ 1. The motivation of this construction stems from the factorization

    p(D|L) = Σ_A p(A|L) p(D|A) ∝ ∏_{v∈N} Σ_{Av ⊆ Lv} qv(Av) p(Dv | DAv, Av),   (1)

where Lv = {u ∈ N : uv ∈ L} is the set of nodes that precede v in the order.
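To make the use of factorization (1) concrete, the following minimal Python sketch (ours, for illustration only, not the authors' implementation) evaluates log p(D|L), up to a constant, for a given node ordering. It assumes a hypothetical callback local_log_score(v, parents) that returns the logarithm of qv(Av) p(Dv | DAv, Av) for a candidate parent set, and it restricts parent sets to size at most k as in the indegree-bounded prior.

```python
from itertools import combinations
import math

def log_order_score(order, local_log_score, max_indegree):
    """Evaluate log p(D | L) (up to a constant) for the node ordering `order`
    via factorization (1): a product over nodes v of a sum over parent sets
    A_v drawn from the predecessors L_v of v in the order.

    order           -- sequence of node labels; earlier nodes precede later ones
    local_log_score -- hypothetical callback (v, parents) -> log[q_v(A_v) p(D_v | D_{A_v}, A_v)]
    max_indegree    -- maximum allowed parent set size k
    """
    total = 0.0
    for i, v in enumerate(order):
        preds = order[:i]                               # L_v: predecessors of v in L
        terms = [local_log_score(v, parents)
                 for size in range(min(max_indegree, len(preds)) + 1)
                 for parents in combinations(preds, size)]
        m = max(terms)                                  # log-sum-exp over parent sets
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total
```

In order-MCMC, MC3, and AIS below, this quantity (together with the log of the order prior ρ(L)) plays the role of the unnormalized log-posterior of a node ordering.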
By modeling different node orderings as mutually exclusive events, the factorization enables efficient treatment of DAGs via node orderings. A drawback of this augmentation is, however, that it introduces a systematic deviation from the simpler modular prior on DAGs. Namely, letting ℓ(A) denote the number of linear extensions of A, we have p(A) ∝ q(A) Σ_{L⊇A} ρ(L) = q(A) ℓ(A). That is, the prior is biased due to the extra factor ℓ(A), which is reflected also in the posterior; for more thorough discussions, see the references [Friedman and Koller, 2003; Koivisto and Sood, 2004; Eaton and Murphy, 2007; Ellis and Wong, 2008].

2.2 Order-MCMC and Correction

Friedman and Koller's [2003] order-MCMC samples first node orderings and then DAGs that are compatible with the sampled orderings:

1. Sample node orderings along a Markov chain. Start from a random node ordering. To move, propose the swapping of the positions of two random nodes in the current order. Accept the proposal according to the Metropolis–Hastings ratio, such that the stationary distribution of the resulting Markov chain is the posterior of node orderings. Evaluate the posterior probability of a node ordering efficiently using the factorization (1). The accepted states yield a sample of node orderings, L(1), L(2), . . . , L(T), possibly after thinning and discarding some burn-in period. (A minimal sketch of this move is given at the end of this subsection.)

2. Sample DAGs from the sampled orders. For each sampled node ordering L, generate a DAG compatible with the order from the posterior distribution p(A|L, D). Again this can be done efficiently by exploiting the independence of the parent sets Av given the order L. This produces a sample of DAGs A(1), A(2), . . . , A(T).

If the feature of interest f is modular, i.e., f(A) = Σ_v fv(Av), then an estimate for the posterior expectation of f is obtained using only the sampled orderings, as Σ_t f̄(L(t)) / T, where f̄(L) denotes the posterior expectation of f(A) under the constraint A ⊆ L, that is, f̄(L) = Σ_A f(A) p(A|D, L). This, again, can be evaluated relatively fast for any given L due to a factorization similar to (1). For non-modular features, sampled DAGs are used, the estimate being simply Σ_t f(A(t)) / T.

In principle, the bias due to favoring DAGs that have many linear extensions can be corrected by simple reweighting. The bias is removed in the estimate

    [Σ_t f(A(t)) / ℓ(A(t))] / [Σ_t 1 / ℓ(A(t))].

A difficulty here is, however, that computing ℓ(A) for a given DAG A is #P-hard in general [Brightwell and Winkler, 1991]. To circumvent the computation of the terms ℓ(A(t)), Ellis and Wong [2008] proposed the following heuristic: From each unique sampled order L, sample a set of unique DAGs such that their total posterior mass, conditionally on L, is at least 1 − ε. Let U be the union of all the DAGs so obtained, over all the sampled orders L. Treat U as an importance-weighted sample, in which the weight of a DAG A is given by p(A) / Σ_{A′∈U} p(A′).

While this heuristic corrects the bias reliably when the sampled orders and DAGs cover nearly all the mass of the posterior, it may fail in other typical cases. To see this, consider what happens when ε = 0 and one samples only a single order L. Then, a calculation shows that the expected value of the "corrected" estimator for the posterior expectation of f(A) is still proportional to Σ_A f(A) p(A|D) ℓ(A), thus unchanged and biased. For another example, consider a scenario where the posterior probability is 1/2 for a special DAG A0 and 1/(2M) for some other M DAGs. Assume also that f(A0) = 1 and f(A) = 0 for A ≠ A0, and that every DAG in question has exactly one linear extension. Now, suppose T = M DAGs are sampled, again with ε = 0. Then, with high probability, nearly one half of the DAGs equal A0, the rest, about T/2 of them, being unique. Since only unique DAGs are kept for the estimator, the corrected estimate is about (1/2) / (1/2 + T/(4M)) = 2/3, whereas the correct value is 1/2. Taking fewer samples rapidly worsens the situation.

A further concern with the heuristic is its computational requirements when the number of nodes is large, say 25 or larger, but the number of data points is small: the posterior is flat, and consequently, a very large number of DAGs need to be sampled from each order to gather the required mass of 1 − ε. (We note that Ellis and Wong [2008] limit their experiments to instances of at most 14 nodes and at least 100 data points.)
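For illustration, here is a minimal sketch (under our own simplifying assumptions, not the authors' code) of the Metropolis–Hastings swap move used in step 1 of order-MCMC above; the same move is reused, with tempered targets, by MC3 and AIS below. The function log_score stands for the unnormalized log-posterior of an ordering, e.g. log_order_score from the previous sketch plus a log-prior term.

```python
import math
import random

def order_mcmc_step(order, log_score, rng=random):
    """One Metropolis-Hastings move on node orderings: propose swapping the
    positions of two random nodes and accept with probability
    min{1, p(L' | D) / p(L | D)}, computed from unnormalized log-scores."""
    proposal = list(order)
    i, j = rng.sample(range(len(order)), 2)
    proposal[i], proposal[j] = proposal[j], proposal[i]
    log_ratio = log_score(proposal) - log_score(order)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal, True      # accepted
    return list(order), False      # rejected; the chain stays at the current order
```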
2.3 MC3

Extending order-MCMC with tempering techniques can yield still better mixing and also a good estimator for the marginal likelihood p(D). Here we consider one such technique, Metropolis-coupled MCMC (MC3) [Geyer, 1991]. In MC3 several Markov chains, indexed by 0, 1, . . . , r, are simulated in parallel, each chain i having its own stationary distribution pi. The idea is to take p0 as a "hot" distribution, e.g., the uniform distribution, and then let the pi be increasingly "cooler" and closer approximations of the posterior p, putting finally pr = p. Usually, powering schemes of the form

    pi ∝ p^{βi},   0 ≤ β0 < β1 < . . . < βr = 1,

are used. For instance, Geyer and Thompson [1995] suggest harmonic stepping, βi = 1/(r + 1 − i); in our experiments we have used linear stepping, βi = i/r.

In addition to running the chains in parallel, every now and then we propose a swap of the states Li and Lj of two randomly chosen chains i and j = i + 1. The proposal is accepted with probability

    min{1, [pi(Lj) pj(Li)] / [pi(Li) pj(Lj)]}.

We note that each pi needs to be known only up to some constant factor, that is, we can efficiently evaluate a function gi that is proportional to pi. We denote by Zi the normalization constant gi/pi. Having T samples from each chain, the ratio Zr/Z0 can be estimated by the telescoping product of the estimates

    Zi / Zi−1 ≈ (1/T) Σ_t gi(Li−1(t)) / gi−1(Li−1(t)),   i = 1, . . . , r,

where the Li−1(t) are the samples from chain i − 1. When p0 is taken as the uniform distribution, the constant Z0 is known or easy to estimate, and so an estimate of the ratio Zr/Z0 gives us an estimate of the marginal likelihood p(D).
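The two MC3 ingredients just described, the state swap between adjacent tempered chains and the telescoping estimate of Zr/Z0, can be sketched as follows. This is a simplified illustration under the powering scheme gi(L) = g(L)^{βi}, not the authors' implementation; log_g denotes the unnormalized log-posterior of an ordering.

```python
import math
import random

def mc3_swap(states, log_g, betas, rng=random):
    """Propose swapping the states of two adjacent chains i and i+1 and accept
    with probability min{1, p_i(L_j) p_j(L_i) / (p_i(L_i) p_j(L_j))}.  With the
    powering scheme p_i proportional to g^{beta_i}, the log of this ratio is
    (beta_i - beta_j) * (log g(L_j) - log g(L_i))."""
    i = rng.randrange(len(states) - 1)
    j = i + 1
    log_ratio = (betas[i] - betas[j]) * (log_g(states[j]) - log_g(states[i]))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        states[i], states[j] = states[j], states[i]
    return states

def log_marginal_ratio(samples_per_chain, log_g, betas):
    """Telescoping estimate of ln(Z_r / Z_0): for each i, average the ratios
    g_i(L) / g_{i-1}(L) = g(L)^{beta_i - beta_{i-1}} over the samples L
    recorded from chain i - 1.  (For readability the ratios are averaged
    directly; a practical version would use a log-sum-exp to avoid underflow.)"""
    total = 0.0
    for i in range(1, len(betas)):
        ratios = [math.exp((betas[i] - betas[i - 1]) * log_g(L))
                  for L in samples_per_chain[i - 1]]
        total += math.log(sum(ratios) / len(ratios))
    return total
```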
3 AIS on Node Orderings

We next describe our application of the AIS method, our approach to correct the bias by explicitly counting the linear extensions of a given DAG, and the lower bounding method based on Markov's inequality. We also include experimental results on a few selected data sets.

3.1 Sampling Node Orderings

AIS produces a sample of linear orders L(1), . . . , L(T), and corresponding importance weights w(1), . . . , w(T). Like in MC3, a sequence of distributions p0, p1, . . . , pr is introduced, such that sampling from p0 is easy, and as i increases, the distributions pi provide gradually improving approximations to the posterior distribution p, until finally pr equals p. For each pi we assume the availability of a corresponding function gi that is proportional to pi and that can be evaluated fast at any given point.

To sample L(t), we first sample a sequence of linear orders L0, L1, . . . , Lr−1 along a Markov chain, starting from p0 and moving according to suitably defined transition kernels τi, as follows:

    Generate L0 from p0.
    Generate L1 from L0 using τ1.
    . . .
    Generate Lr−1 from Lr−2 using τr−1.

The transition kernels τi are constructed by a simple Metropolis move: at state Li−1 a new state L′ is proposed by swapping the positions of two random nodes in the order; the proposal is accepted as the state Li with probability min{1, gi(L′)/gi(Li−1)}, and otherwise Li is set to Li−1. It follows that τi leaves pi invariant. Finally, we set L(t) = Lr−1 and assign the importance weight as

    w(t) = [g1(L0)/g0(L0)] × [g2(L1)/g1(L1)] × · · · × [gr(Lr−1)/gr−1(Lr−1)].

With such a weighted sample in hand, the ratio of the normalization constants Z = gr/p and Z0 = g0/p0 can be estimated by Z/Z0 ≈ Σ_t w(t) / T. Likewise, the expectation of a function f with respect to the posterior p can be estimated by Σ_t w(t) f(L(t)) / Σ_t w(t). That these estimators are well justified follows directly from Neal's argumentation [2001].
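A whole annealing run of the kind described above then looks roughly as follows. This is a minimal sketch assuming the powering scheme gi(L) = g(L)^{βi} with β0 = 0 (so that p0 is uniform); log_g is again the unnormalized log-posterior of an ordering, and it is not the authors' code. Z/Z0 is then estimated by the average of the returned weights, and posterior expectations by the weight-normalized average of f(L(t)), with the log-weights combined via a log-sum-exp.

```python
import math
import random

def ais_sample(n, log_g, betas, rng=random):
    """One AIS run over node orderings: returns (L^(t), log w^(t)).

    n     -- number of nodes; L_0 is a uniformly random ordering (p_0 uniform)
    log_g -- unnormalized log-posterior log g(L) of an ordering
    betas -- schedule 0 = beta_0 < beta_1 < ... < beta_r = 1, with g_i = g^{beta_i}
    """
    order = list(range(n))
    rng.shuffle(order)                                      # L_0 drawn from p_0
    log_w = 0.0
    r = len(betas) - 1
    for i in range(1, r):
        log_w += (betas[i] - betas[i - 1]) * log_g(order)   # factor g_i(L_{i-1}) / g_{i-1}(L_{i-1})
        # Metropolis move tau_i leaving p_i invariant: swap two random positions.
        proposal = list(order)
        a, b = rng.sample(range(n), 2)
        proposal[a], proposal[b] = proposal[b], proposal[a]
        if rng.random() < math.exp(min(0.0, betas[i] * (log_g(proposal) - log_g(order)))):
            order = proposal
    log_w += (betas[r] - betas[r - 1]) * log_g(order)       # final factor g_r / g_{r-1} at L_{r-1}
    return order, log_w
```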
3.2 Sampling DAGs and Correcting the Bias

Similar to order-MCMC, sampling DAGs from the sampled node orderings enables posterior inference for non-modular features and a way to correct the bias. Specifically, if A(t) is a sample from the posterior distribution of DAGs compatible with a node ordering L(t), then we see that the corrected importance sampling estimate for the posterior expectation of f takes the form

    [Σ_t w(t) f(A(t)) / ℓ(A(t))] / [Σ_t w(t) / ℓ(A(t))].

Likewise, the corrected estimate for the ratio Z/Z0 becomes

    (1/T) Σ_t w(t) / ℓ(A(t)).

Since sampling a single order L(t) is relatively expensive, it is often advisable to draw several DAGs per sampled order, to reduce the variance of the estimator. If two DAGs, A(t1) and A(t2), are generated from the same order L(t), the associated terms in the above sums share the weight w(t) and are no longer independent random variables. However, while this will change the variance of the respective estimators, their expectations remain unchanged. What is a reasonable number of DAGs per order depends on the complexity of evaluating the term ℓ(A(t)); we consider this issue next.

3.3 Counting Linear Extensions

To count the linear extensions of a DAG A ⊆ N × N, we turn to the corresponding partial order P(A) on N, obtained from A by adding the loops vv for each v ∈ N (making the relation reflexive) and then taking the transitive closure (making the relation transitive). Clearly, A and P(A) have the same linear extensions. Since the transitive closure P(A) of a given A can be computed efficiently (e.g., in O(n³) time using the Floyd–Warshall algorithm), it suffices to consider the problem of counting the linear extensions of a given partial order.

It is easy to see that the number of linear extensions of a partial order P on a set S satisfies the recurrence

    ℓ(P) = Σ_{v maximal in P} ℓ(P − v),   with ℓ(∅) = 1,

where P − v is the partial order on S \ {v} obtained from P by deleting all pairs that mention v. This recurrence immediately suggests a dynamic programming algorithm that tabulates the number of linear extensions for all subsets that are downward closed with respect to the partial order, known as the downsets or ideals of the partial order. A straightforward implementation runs in time that scales as n²I(P), where I(P) is the number of ideals of P. We have further reduced the time requirement to O(nI(P)) by implementing an efficient way to maintain the set of maximal elements while traversing through the ideals. Since I(P) is typically much less than its worst-case bound 2^n, our implementation scales well up to around 40-element instances even if the DAG underlying the partial order is relatively sparse; see Figure 1. The bottleneck of the implementation is in fact the space requirement, which grows roughly as I(P).

[Figure 1 (log-scale plot): runtime in seconds versus number of nodes, with curves for maximum indegree 3 and 6.]
Figure 1: Speed of exact counting of linear extensions. For each number of nodes, the runtime (in seconds) is shown for 9 random DAGs, with the maximum indegree set to 3 or 6. Runtimes less than 0.01 seconds are rounded up to 0.01. The mean of the runtimes is shown by a solid line. No runtime estimates are shown when the memory requirement exceeded 16GB for at least one of the 9 DAGs.
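A compact version of the dynamic program over ideals can be written with memoization on bitmask-encoded subsets. The sketch below is ours and is simpler than the O(nI(P)) implementation described above: it fills the order from the front, the mirror image of the recurrence that peels off maximal elements, which counts the same quantity, and it only ever visits downward-closed subsets, so its table size is I(P) rather than 2^n.

```python
from functools import lru_cache

def count_linear_extensions(n, arcs):
    """Count linear extensions of the partial order induced by a DAG on nodes
    0..n-1, where an arc (u, v) means that u must precede v."""
    preds = [0] * n
    for u, v in arcs:
        preds[v] |= 1 << u                          # bitmask of predecessors of v

    @lru_cache(maxsize=None)
    def count(placed):
        # `placed` is the bitmask of nodes already placed; it is always an ideal.
        if placed == (1 << n) - 1:
            return 1
        total = 0
        for v in range(n):
            # v can be placed next iff it is unplaced and all its predecessors are placed
            if not (placed >> v) & 1 and preds[v] & ~placed == 0:
                total += count(placed | (1 << v))
        return total

    return count(0)

# For example, count_linear_extensions(3, [(0, 1), (0, 2)]) returns 2.
```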
3.4 Lower Bounding

We will use the following elementary but useful consequence of Markov's inequality; for a proof and variations, see the works of Gomes et al. [2007] and Gogate and Dechter [2011].

Theorem 1. Let X1, X2, . . . , Xs be independent nonnegative random variables with mean μ. Let 0 < δ < 1. Then, with probability at least 1 − δ,

    δ^{1/s} min{X1, X2, . . . , Xs} ≤ μ.

For example, at confidence 0.95 = 1 − δ, a lower bound is obtained by taking the minimum over s = 5 samples and multiplying it by 0.05^{1/5} ≈ 0.55, or by taking the minimum over s = 30 samples and multiplying it by 0.05^{1/30} ≈ 0.90.

In our experiments we have applied this idea to lower bound the marginal likelihood p(D) as follows: The T samples generated by AIS are partitioned into s = 9 bins. For the jth bin, let Xj be the corresponding average of T/9 weights w(t). By Theorem 1, the minimum over the Xj multiplied by 0.05^{1/9} ≈ 0.7168 is a lower bound of p(D)/Z0 with probability at least 0.95. (We will ignore the term Z0, as it depends on neither the data nor the prior, except for the maximum indegree, and it could be evaluated, e.g., using Robinson's [1973] recurrences.)
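In code, this lower-bounding recipe amounts to a few lines. The sketch below is an illustration under our own assumptions (not the authors' code); it works on log-weights and uses a log-sum-exp so that the bin averages do not overflow.

```python
import math

def log_lower_bound(log_weights, s=9, delta=0.05):
    """Return a (1 - delta)-confidence lower bound on ln(p(D)/Z_0) from AIS
    log-weights: split the weights into s bins, let X_j be the average weight
    in bin j, and return ln(delta^(1/s) * min_j X_j), as in Theorem 1."""
    bins = [log_weights[j::s] for j in range(s)]      # s roughly equal-sized bins

    def log_mean(ls):                                 # ln of the average of exp(ls)
        m = max(ls)
        return m + math.log(sum(math.exp(x - m) for x in ls) / len(ls))

    return math.log(delta) / s + min(log_mean(b) for b in bins)
```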
3.5 Experimental Results

We have experimented on three data sets. Our ALARM data set contains 100 data records generated from the 37-variable Alarm network. Mushroom is a frequently used 22-variable data set of 8124 records, of which we included a random subset of size 1000 in our MUSHROOM data set. PROMOTERS is another benchmark data set, consisting of 58 variables and 106 records. See the references [Beinlich et al., 1989; Blake and Merz, 1998] for more about these data sources.

We specified the Bayesian model in a usual manner. For p(D|A) we used the BDeu scoring. We set qv(Av) ≡ 1 for the modular part, and ρ(L) ≡ 1 for the term concerning node orderings. The maximum indegree parameter k was set to 4 for MUSHROOM and to 3 for ALARM and PROMOTERS. For MUSHROOM we used exact algorithms [Koivisto, 2006; Tian and He, 2009] to compute the correct arc posterior probabilities, and for the biased prior, also the marginal likelihood.

For a fair comparison of AIS and MC3, we fixed or varied the free parameters as follows: For both methods, we used linear stepping in the annealing scheme. (We experimented also with other schemes, but the linear one achieved the most robust performance across the data sets and repeated runs.) However, the number of steps was varied: for MC3 we put r = 4, 16, 64; for AIS we put r = nm, 4nm, 16nm, thus depending on the data size. These choices reflect the fact that in MC3 it is crucial to simulate each parallel chain a large number of steps to get sufficient mixing and convergence, whereas in AIS it is crucial to simulate the annealed chain a large number of steps. While we gave each of the six samplers roughly 200 hours of running time, we measured the performance of the corresponding estimators throughout the process. Consequently, at any given time budget samplers with larger r yielded respectively fewer samples. For each run of MC3, we thinned the chains and included every 100th (MUSHROOM and ALARM) or 10th (PROMOTERS) simulated state in the sample. The first half of the samples were discarded as burn-in. From each sampled node ordering a single DAG was generated for bias correction, except that for AIS on MUSHROOM we generated 10 DAGs per node ordering to balance the time requirements of sampling and counting linear extensions.

We first studied the performance of AIS and MC3 in estimating the marginal likelihood (Figure 2). We observed that all the six samplers typically produced a good estimate, the larger number of steps paying off when given sufficiently long running time. An exception is the MC3 sampler with r = 4 on MUSHROOM. We also observed that, for MUSHROOM, AIS (e.g., with r = 4nm steps) finally yields a lower bound that is within a factor of about 2 of the exact value. For ALARM and PROMOTERS the lower bounds are even better (compared to the median estimate); in fact, they are nearly as good as they can get by taking the minimum over 9 estimates, which necessarily implies an error of at least ln 0.7168 ≈ −0.33 in the logarithm of the marginal likelihood.

We then investigated the performance of the proposed bias-corrected estimators (Figure 3, left). (Here we do not have results for PROMOTERS, as the number of nodes, 58, turned out to be too large for counting the linear extensions, given the small maximum indegree of 3.) We found a moderate increase in the variance of the estimates, compared to the case of not correcting the bias. The lower bounds were now within a factor of about 2 and 3 of the median estimate over 9 runs, for MUSHROOM and ALARM, respectively.

Finally, we compared the performance of AIS and MC3 with and without bias correction in estimating the arc posterior probabilities (Figure 3, right). We include results only for MUSHROOM, for which the exact values are available, and only for the best choice of the number of annealing steps r. We measured the performance of a single run by the largest absolute error over all possible arcs. We observed that bias correction does not significantly worsen the accuracy of the estimates. We also observed that while MC3 is somewhat more efficient than AIS, the MUSHROOM data set presents a difficult challenge for all the methods studied here.

[Figure 2 (three panels, ln of marginal likelihood estimate versus time (s)): MUSHROOM (n = 22, m = 1000), ALARM (n = 37, m = 100), PROMOTERS (n = 58, m = 106).]
Figure 2: Performance of AIS and MC3 in estimating the marginal likelihood, without bias correction. For each of the six samplers, the median (of 9 runs) of the estimates is shown as a function of running time. In addition, a 0.95-confidence lower bound based on AIS with the medium number of steps is shown ("AISL"). For MUSHROOM the correct value is shown by a horizontal line. The legend shows the number of annealing steps r.

[Figure 3 (three panels versus time (s)): MUSHROOM (n = 22, m = 1000) and ALARM (n = 37, m = 100), ln of marginal likelihood estimate; MUSHROOM (n = 22, m = 1000), median of largest absolute error.]
Figure 3: Left: Performance of AIS in estimating the marginal likelihood, with bias correction. For each sampler, the median (of 9 runs) of the estimates is shown as a function of running time. In addition, a 0.95-confidence lower bound based on AIS with the medium number of steps is shown ("cAISL"). The legend shows the number of annealing steps r. Rightmost: Performance of AIS and MC3 in estimating the arc posterior probabilities, with and without bias correction. For each sampler, the median (of 9 runs) of the largest absolute error over all possible arcs is shown as a function of running time.

4 Conclusions and Future Work

We have contributed two new ideas to structure learning in Bayesian networks: (1) the use of Neal's [2001] annealed importance sampling (AIS) method to obtain independent samples with known expected value, and (2) the use of moderately-exponential-time dynamic programming algorithms to correct the bias due to sampling node orderings. By comparing to an advanced MCMC technique, MC3, we have shown that the efficiency of AIS in exploring the posterior is competitive: AIS does not sacrifice accuracy for its other advantages. We next discuss to what extent our present implementation satisfies the three desiderata that motivated this work, and what are the future prospects in this regard.

Bias correction. We implemented a direct way to correct the bias by explicitly counting the linear extensions of each sampled DAG. The main advantage of this approach is that it yields a weighting that provably corrects the bias. We showed that the #P-hardness of the problem does not render it intractable in practice, as long as the number of nodes is at most about 40; this is much beyond the scope of the existing dynamic programming algorithms for structure learning [Koivisto, 2006; Eaton and Murphy, 2007; Tian and He, 2009]. A topic of future research is to improve the exact algorithm further and to explore the practical value of the existing polynomial-time approximation schemes [Dyer et al., 1991; Karzanov and Khachiyan, 1991; Huber, 2006].
Guarantees. We showed that AIS supplies an importance sampling distribution that enables lower-bounding of the marginal likelihood of the model with fairly good probabilistic guarantees. In obtaining quality guarantees for sampling-based estimates more generally, our success is, admittedly, only partial so far. Most importantly, we currently do not obtain useful lower (or upper) bounds for the posterior probabilities of individual arcs or larger subgraphs, since we do not have a useful upper bound for the involved normalization constant, that is, the marginal likelihood. In this light, the key topic of future research is to find efficient methods for upper-bounding the marginal likelihood.

Parallelization. A main advantage of AIS over MCMC is that AIS enables easy large-scale parallelization. There do exist parallel implementations of MCMC [Altekar et al., 2004; Corander et al., 2008], which rely on frequent synchronized communication. They are thus only suited for architectures with shared memory and some tens of parallel processes. In AIS the bottleneck is the complexity of simulating the annealing process to get a single sample. As it is vital to keep the number of simulation steps large, the hope for scalability to larger instances is in developing significantly faster algorithms for evaluating a given node ordering. Our current implementation can be boosted by an order of magnitude by incorporating some of the tricks of Friedman and Koller [2003].

References

[Altekar et al., 2004] G. Altekar, S. Dwarkadas, J. P. Huelsenbeck, and F. Ronquist. Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics, 20:407–415, 2004.
[Battle et al., 2010] A. J. Battle, M. Jonikas, P. Walter, J. Weissman, and D. Koller. Automated identification of pathways from quantitative genetic interaction data. Molecular Systems Biology, 6:379–391, June 2010.
[Beinlich et al., 1989] I. Beinlich, G. Suermondt, R. Chavez, and G. F. Cooper. The ALARM monitoring system. In Proc. of the Second European Conference on Artificial Intelligence and Medicine, pages 247–256, 1989.
[Blake and Merz, 1998] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.
[Brightwell and Winkler, 1991] G. Brightwell and P. Winkler. Counting linear extensions. Order, 8:225–242, 1991.
[Buntine, 1991] W. Buntine. Theory refinement on Bayesian networks. In B. D'Ambrosio and P. Smets, editors, UAI, pages 52–60. Morgan Kaufmann, 1991.
[Cappé and Robert, 2000] O. Cappé and C. P. Robert. Markov chain Monte Carlo: 10 years and still running! Journal of the American Statistical Association, 95:1282–1286, 2000.
[Cooper and Herskovits, 1992] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
[Corander et al., 2008] J. Corander, M. Ekdahl, and T. Koski. Parallel interacting MCMC for learning of topologies of graphical models. Data Min. Knowl. Discov., 17(3):431–456, 2008.
[Diaconis, 2009] P. Diaconis. The Markov chain Monte Carlo revolution. Bull. Amer. Math. Soc., 46(2):179–205, 2009.
[Dyer et al., 1991] M. E. Dyer, A. M. Frieze, and R. Kannan. A random polynomial time algorithm for approximating the volume of convex bodies. J. ACM, 38(1):1–17, 1991.
[Eaton and Murphy, 2007] D. Eaton and K. P. Murphy. Bayesian structure learning using dynamic programming and MCMC. In Parr and van der Gaag [2007], pages 101–108.
[Ellis and Wong, 2008] B. Ellis and W. H. Wong. Learning causal Bayesian network structures from experimental data. Journal of the American Statistical Association, 103:778–789, 2008.
[Friedman and Koller, 2003] N. Friedman and D. Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50:95–125, 2003.
[Geyer and Thompson, 1995] C. J. Geyer and E. A. Thompson. Annealing Markov chain Monte Carlo with application to ancestral inference. Journal of the American Statistical Association, 90:909–920, 1995.
[Geyer, 1991] C. J. Geyer. Markov chain Monte Carlo maximum likelihood. In E. M. Keramidas, editor, 23rd Symposium on the Interface, pages 156–163. Interface Foundation, Fairfax Station, 1991.
[Gogate and Dechter, 2011] V. Gogate and R. Dechter. Sampling-based lower bounds for counting queries. Intelligenza Artificiale, 5(2):171–188, 2011.
[Gogate et al., 2007] V. Gogate, B. Bidyuk, and R. Dechter. Studies in lower bounding probabilities of evidence using the Markov inequality. In Parr and van der Gaag [2007], pages 141–148.
[Gomes et al., 2007] C. P. Gomes, J. Hoffmann, A. Sabharwal, and B. Selman. From sampling to model counting. In M. M. Veloso, editor, IJCAI, pages 2293–2299, 2007.
[Grzegorczyk and Husmeier, 2008] M. Grzegorczyk and D. Husmeier. Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move. Machine Learning, 71(2-3):265–305, 2008.
[Heckerman et al., 1995] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.
[Huber, 2006] M. Huber. Fast perfect sampling from linear extensions. Discrete Mathematics, 306(4):420–428, 2006.
[Karzanov and Khachiyan, 1991] A. Karzanov and L. Khachiyan. On the conductance of order Markov chains. Order, 8:7–15, 1991.
[Koivisto and Sood, 2004] M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573, 2004.
[Koivisto, 2006] M. Koivisto. Advances in exact Bayesian structure discovery in Bayesian networks. In UAI, pages 241–248. AUAI, 2006.
[Madigan and York, 1995] D. Madigan and J. York. Bayesian graphical models for discrete data. International Statistical Review, 63:215–232, 1995.
[Neal, 2001] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
[Niinimäki et al., 2011] T. Niinimäki, P. Parviainen, and M. Koivisto. Partial order MCMC for structure discovery in Bayesian networks. In F. G. Cozman and A. Pfeffer, editors, UAI, pages 557–564. AUAI, 2011.
[Parr and van der Gaag, 2007] R. Parr and L.-C. van der Gaag, editors. UAI 2007, Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, July 19-22, 2007. AUAI Press, 2007.
[Pearl, 1988] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, 1988.
[Pearl, 2000] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000.
[Robinson, 1973] R. W. Robinson. Counting labeled acyclic digraphs. In F. Harary, editor, New Directions in Graph Theory. Academic Press, 1973.
[Tian and He, 2009] J. Tian and R. He. Computing posterior probabilities of structural features in Bayesian networks. In J. Bilmes and A. Y. Ng, editors, UAI, pages 538–547. AUAI, 2009.