Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Swift: Compiled Inference for Probabilistic Programming Languages

Yi Wu (UC Berkeley) jxwuyi@gmail.com
Lei Li (Toutiao.com) lileicc@gmail.com
Stuart Russell (UC Berkeley) russell@cs.berkeley.edu
Rastislav Bodik (University of Washington) bodik@cs.washington.edu

Abstract

A probabilistic program defines a probability measure over its semantic structures. One common goal of probabilistic programming languages (PPLs) is to compute posterior probabilities for arbitrary models and queries, given observed evidence, using a generic inference engine. Most PPL inference engines—even the compiled ones—incur significant runtime interpretation overhead, especially for contingent and open-universe models. This paper describes Swift, a compiler for the BLOG PPL. Swift-generated code incorporates optimizations that eliminate interpretation overhead, maintain dynamic dependencies efficiently, and handle memory management for possible worlds of varying sizes. Experiments comparing Swift with other PPL engines on a variety of inference problems demonstrate speedups ranging from 12x to 326x.

1 Introduction

Probabilistic programming languages (PPLs) aim to combine sufficient expressive power for writing real-world probability models with efficient, general-purpose inference algorithms that can answer arbitrary queries with respect to those models. One underlying motive is to relieve the user of the obligation to carry out machine learning research and implement new algorithms for each problem that comes along. Another is to support a wide range of cognitive functions in AI systems and to model those functions in humans.

General-purpose inference for PPLs is very challenging; the models may include unbounded numbers of discrete and continuous variables, a rich library of distributions, and the ability to describe uncertainty over functions, relations, and the existence and identity of objects (so-called open-universe models). Existing PPL inference algorithms include likelihood weighting (LW) [Milch et al., 2005b], parental Metropolis–Hastings (PMH) [Milch and Russell, 2006; Goodman et al., 2008], generalized Gibbs sampling (Gibbs) [Arora et al., 2010], generalized sequential Monte Carlo [Wood et al., 2014], Hamiltonian Monte Carlo (HMC) [Stan Development Team, 2014], variational methods [Minka et al., 2014; Kucukelbir et al., 2015], and a form of approximate Bayesian computation [Mansinghka et al., 2013]. While better algorithms are certainly possible, our focus in this paper is on achieving orders-of-magnitude improvement in the execution efficiency of a given algorithmic process.

A PPL system takes a probabilistic program (PP) specifying a probabilistic model as its input and performs inference to compute the posterior distribution of a query given some observed evidence. The inference process does not (in general) execute the PP, but instead executes the steps of an inference algorithm (e.g., Gibbs) guided by the dependency structures implicit in the PP. In many PPL systems the PP exists as an internal data structure consulted by the inference algorithm at each step [Pfeffer, 2001; Lunn et al., 2000; Plummer, 2003; Milch et al., 2005a; Pfeffer, 2009]. This process is in essence an interpreter for the PP, similar to early Prolog systems that interpreted the logic program. Particularly when running sampling algorithms that involve millions of repetitive steps, the overhead can be enormous. A natural solution is to produce model-specific compiled inference code, but, as we show in Sec. 2, existing compilers for general open-universe models [Wingate et al., 2011; Yang et al., 2014; Hur et al., 2014; Chaganty et al., 2013; Nori et al., 2014] miss out on optimization opportunities and often produce inefficient inference code.

This paper analyzes the optimization opportunities for PPL compilers and describes the Swift compiler, which takes as input a BLOG program [Milch et al., 2005a] and one of three inference algorithms (LW, PMH, Gibbs) and generates target code for answering queries. Swift includes three main contributions: (1) elimination of interpretative overhead by joint analysis of the model structure and inference algorithm; (2) a dynamic slice maintenance method (FIDS) for incremental computation of the current dependency structure as sampling proceeds; and (3) efficient runtime memory management for maintaining the current-possible-world data structure as the number of objects changes. Comparisons between Swift and other PPLs on a variety of models demonstrate speedups ranging from 12x to 326x, leading in some cases to performance comparable to that of hand-built model-specific code. To the extent possible, we also analyze the contribution of each optimization technique to the overall speedup. Although Swift is developed for the BLOG language, the overall design and the choices of optimizations can be applied to other PPLs and may bring useful insights to similar AI systems for real-world applications.

[Figure 1: PPL compilers and optimization opportunities. The model P_M is combined with the inference algorithm (F_I : P_M -> P_I) to produce the algorithm-specific, model-specific inference code P_I, which is then compiled (F_cg : P_I -> P_T) to the target code P_T in C++. Classical program optimizations are insensitive to the inference algorithm, and classical algorithm optimizations are insensitive to the model; our optimizations (FIDS and the data structures for P_T) combine model and algorithm.]

2 Existing PPL Compilation Approaches

In a general-purpose programming language (e.g., C++), the compiler compiles exactly what the user writes (the program). By contrast, a PPL compiler essentially compiles the inference algorithm, which is written by the PPL developer, as applied to a PP. This means that different implementations of the same inference algorithm for the same PP result in completely different target code. As shown in Fig. 1, a PPL compiler first produces an intermediate representation combining the inference algorithm (I) and the input model (P_M) as the inference code (P_I), and then compiles P_I to the target code (P_T). Swift focuses on optimizing the inference code P_I and the target code P_T given a fixed input model P_M.

Although there have been successful PPL compilers, these compilers are all designed for a restricted class of models with fixed dependency structures: see work by Tristan et al. [2014] and the Stan Development Team [2014] for closed-world Bayes nets, Minka et al. [2014] for factor graphs, and Kazemi and Poole [2016] for Markov logic networks.

Church [Goodman et al., 2008] and its variants [Wingate et al., 2011; Yang et al., 2014; Ritchie et al., 2016] provide a lightweight compilation framework for performing the Metropolis–Hastings algorithm (MH) over general open-universe probability models (OUPMs). However, these approaches (1) are based on an inefficient implementation of MH, which results in overhead in P_I, and (2) rely on inefficient data structures in the target code P_T.

For the first point, consider an MH iteration where we are proposing a new possible world w' from w according to the proposal g(·) by resampling random variable X from v to v'. The acceptance ratio α is

    α = min( 1, (g(v' -> v) Pr[w']) / (g(v -> v') Pr[w]) ).    (1)

Since only X is resampled, it is sufficient to compute α using merely the Markov blanket of X, which yields a much simplified formula by cancelling terms in Pr[w] and Pr[w'] in Eq. (1). However, when applying MH to a contingent model, the Markov blanket of a random variable cannot be determined at compile time, since the model dependencies vary during inference. This introduces tremendous runtime overhead for interpretive systems (e.g., BLOG, Figaro), which track all the dependencies at runtime even though typically only a tiny portion of the dependency structure changes per iteration. Church simply avoids keeping track of dependencies and uses Eq. (1) directly, incurring a potentially huge overhead from redundant probability computations.

For the second point, in order to track the existence of the variables in an open-universe model, Church, similar to BLOG, maintains a complicated dynamic string-based hashing scheme in the target code, which again causes interpretation overhead.

Lastly, the techniques proposed by Hur et al. [2014], Chaganty et al. [2013] and Nori et al. [2014] primarily focus on optimizing P_M by analyzing static properties of the input PP; these techniques are complementary to Swift.

3 Background

This paper focuses on the BLOG language, although our approach also applies to other PPLs with equivalent semantics [McAllester et al., 2008; Wu et al., 2014]. Other PPLs can be converted to BLOG via a static single assignment form (SSA form) transformation [Cytron et al., 1989; Hur et al., 2014].

3.1 The BLOG Language

The BLOG language [Milch et al., 2005a] defines probability measures over first-order (relational) possible worlds; in this sense it is a probabilistic analogue of first-order logic. A BLOG program declares types for objects and defines distributions over their numbers, as well as defining distributions for the values of random functions applied to objects. Observation statements supply evidence, and a query statement specifies the posterior probability of interest. A random variable in BLOG corresponds to the application of a random function to specific objects in a possible world. BLOG naturally supports open universe probability models (OUPMs) and context-specific dependencies.

Fig. 2 demonstrates the open-universe urn-ball model. In this version, the query asks for the color of the next ball drawn from an urn, given the colors of balls drawn previously. Line 4 is a number statement declaring that the number variable #Ball, corresponding to the total number of balls, is uniformly distributed between 1 and 20. Lines 5–6 declare a random function color(·), applied to balls, picking Blue or Green with a biased probability. Lines 7–8 state that each draw chooses a ball at random from the urn, with replacement.

1  type Ball; type Draw; type Color;  // 3 types
2  distinct Color Blue, Green;        // two colors
3  distinct Draw D[2];                // two draws: D[0], D[1]
4  #Ball ~ UniformInt(1,20);          // unknown # of balls
5  random Color color(Ball b)         // color of a ball
6    ~ Categorical({Blue -> 0.9, Green -> 0.1});
7  random Ball drawn(Draw d)
8    ~ UniformChoice({b for Ball b});
9  obs color(drawn(D[0])) = Green;    // drawn ball is green
10 query color(drawn(D[1]));          // query

Figure 2: The urn-ball model

Fig. 3 describes another OUPM, the infinite Gaussian mixture model (∞-GMM). The model includes an unknown number of clusters, as stated in line 3, which is to be inferred from the data.

1  type Cluster; type Data;
2  distinct Data D[20];               // 20 data points
3  #Cluster ~ Poisson(4);             // number of clusters
4  random Real mu(Cluster c) ~ Gaussian(0,10);  // cluster mean
5  random Cluster z(Data d)
6    ~ UniformChoice({c for Cluster c});
7  random Real x(Data d) ~ Gaussian(mu(z(d)), 1.0);  // data
8  obs x(D[0]) = 0.1;                 // omit other data points
9  query size({c for Cluster c});

Figure 3: The infinite Gaussian mixture model
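The cancellation argument behind the Markov-blanket shortcut can be made concrete. The sketch below is ours, not Swift's generated code: the chain A -> B -> C and its probability tables are hypothetical, chosen only so that resampling B leaves the factor of A untouched. It checks that the full ratio Pr[w']/Pr[w] from Eq. (1) equals the ratio restricted to the factors of the resampled variable and its children.

```python
# Toy chain A -> B -> C over booleans; the tables are illustrative only.
def p_A(w): return 0.3 if w["A"] else 0.7
def p_B(w): return (0.7 if w["B"] else 0.3) if w["A"] else (0.2 if w["B"] else 0.8)
def p_C(w): return (0.9 if w["C"] else 0.1) if w["B"] else (0.1 if w["C"] else 0.9)

FACTORS = {"A": p_A, "B": p_B, "C": p_C}
CHILDREN = {"A": ["B"], "B": ["C"], "C": []}

def joint(w):
    prob = 1.0
    for f in FACTORS.values():
        prob *= f(w)
    return prob

def ratio_full(w, w2):
    # Eq. (1) with a symmetric flip proposal, so the g-terms cancel.
    return joint(w2) / joint(w)

def ratio_blanket(w, w2, X):
    # Only the factors of X and its children differ between w and w2;
    # every other term in Pr[w'] / Pr[w] cancels.
    num = den = 1.0
    for v in [X] + CHILDREN[X]:
        num *= FACTORS[v](w2)
        den *= FACTORS[v](w)
    return num / den

w = {"A": True, "B": False, "C": True}
w2 = dict(w, B=True)  # propose flipping B
assert abs(ratio_full(w, w2) - ratio_blanket(w, w2, "B")) < 1e-12
```

For a fixed-structure model the set CHILDREN is known at compile time; the difficulty discussed above is that in a contingent model it changes as sampling proceeds.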
3.2 Generic Inference Algorithms

All PPLs need to infer the posterior distribution of a query given observed evidence. The majority of PPLs use generic Monte Carlo sampling methods, notably likelihood weighting (LW) [Milch et al., 2005a] and Metropolis–Hastings MCMC (MH).

Algorithm 1: Metropolis–Hastings algorithm (MH)
Input: M, E, Q, N;  Output: samples H
1  initialize a possible world w(0) with E satisfied;
2  for i <- 1 to N do
3    randomly pick a variable X from w(i-1);
4    w(i) <- w(i-1) and v <- X(w(i-1));
5    propose a value v' for X via proposal g;
6    X(w(i)) <- v' and ensure w(i) is self-supporting;
7    α <- min(1, (g(v' -> v) Pr[w(i)]) / (g(v -> v') Pr[w(i-1)]));
8    if rand(0, 1) ≥ α then w(i) <- w(i-1);
9    H <- H + ⟨Q(w(i))⟩;
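Algorithm 1's control flow can be traced in executable pseudo-code. The loop below is our illustration, not the paper's generated code: the two-variable model (Rain, Sprinkler, observed Wet = Rain or Sprinkler) and its numbers are hypothetical, and the proposal g resamples the picked variable from its prior, so the algorithm is the PMH special case discussed below.

```python
import random

def likelihood(w):
    # prior factors times the hard evidence Wet = True
    p = (0.2 if w["Rain"] else 0.8) * 0.5
    return p if (w["Rain"] or w["Sprinkler"]) else 0.0

def pmh(n_samples, seed=0):
    rng = random.Random(seed)
    w = {"Rain": True, "Sprinkler": True}      # line 1: E satisfied
    hits = 0
    for _ in range(n_samples):                 # line 2
        x = rng.choice(["Rain", "Sprinkler"])  # line 3
        w2 = dict(w)                           # line 4
        prior = {"Rain": 0.2, "Sprinkler": 0.5}[x]
        w2[x] = rng.random() < prior           # lines 5-6: propose from prior
        g_fwd = prior if w2[x] else 1 - prior
        g_bwd = prior if w[x] else 1 - prior
        alpha = 1.0
        if likelihood(w) > 0:                  # line 7: acceptance ratio
            alpha = min(1.0, g_bwd * likelihood(w2) / (g_fwd * likelihood(w)))
        if rng.random() < alpha:               # line 8: accept or reject
            w = w2
        hits += w["Rain"]                      # line 9: collect Q(w)
    return hits / n_samples

est = pmh(20000)
# exact posterior: P(Rain | Wet) = 0.2 / (0.2 + 0.8 * 0.5) = 1/3
assert abs(est - 1 / 3) < 0.05
```

Note that because the proposal is the prior of the flipped variable, its prior factors cancel in α and only the evidence terms matter, which is exactly the Markov-blanket cancellation of Eq. (1).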
LW samples unobserved random variables sequentially in topological order, conditioned on the evidence and on values sampled earlier in the sequence; each complete sample is weighted by the likelihood of the evidence variables appearing in the sequence.

The MH algorithm is summarized in Alg. 1. M denotes an input BLOG model, E its evidence, Q a query, and N the number of samples; w denotes a possible world, X(w) the value of random variable X in w, Pr[X(w) | w_{-X}] the conditional probability of X in w, and Pr[w] the likelihood of possible world w. In particular, when the proposal distribution g in Alg. 1 samples a variable conditioned on its parents, the algorithm becomes the parental MH algorithm (PMH). Our discussion in this paper focuses on LW and PMH, but our proposed solution applies to other Monte Carlo methods as well.

4 The Swift Compiler

Swift has the following design goals for handling open-universe probabilistic models.

• Provide a general framework to automatically and efficiently (1) track the exact Markov blanket for each random variable and (2) maintain the minimum set of random variables necessary to evaluate the query and the evidence (the dynamic slice [Agrawal and Horgan, 1990], or equivalently the minimum self-supporting partial world [Milch et al., 2005a]).

• Provide efficient memory management and fast data structures for Monte Carlo sampling algorithms to avoid interpretive overhead at runtime.

For the first goal, we propose a novel framework, FIDS (Framework for Incrementally updating Dynamic Slices), for generating the inference code (P_I in Fig. 1). For the second goal, we carefully choose data structures in the target C++ code (P_T).

For convenience, r.v. is short for random variable. We demonstrate examples of compiled code in the following discussion: the inference code (P_I) generated by FIDS in Sec. 4.1 is in pseudo-code, while the target code (P_T) produced by Swift shown in Sec. 4.2 is in C++. Due to limited space, we show the detailed transformation rules only for the adaptive contingency updating (ACU) technique and omit the others.
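As a concrete reference point for the LW sampler described in Sec. 3.2, here is a compact sketch. It is our illustration, not generated code: the one-variable model and the sensor probabilities are hypothetical. The unobserved variable is sampled from its prior and each sample is weighted by the likelihood of the evidence.

```python
import random

# Toy model (hypothetical numbers): X ~ Bern(0.3); sensor Y with
# P(Y=1 | X=1) = 0.9 and P(Y=1 | X=0) = 0.2; we observe Y = 1.
def lw_posterior_x(n, seed=0):
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        x = rng.random() < 0.3        # sample X from its prior
        w = 0.9 if x else 0.2         # weight = Pr[evidence | sample]
        num += w * x
        den += w
    return num / den                  # weighted estimate of P(X=1 | Y=1)

est = lw_posterior_x(50000)
# exact posterior: 0.3*0.9 / (0.3*0.9 + 0.7*0.2) = 0.27 / 0.41
assert abs(est - 0.27 / 0.41) < 0.02
```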
4.1 Optimizations in FIDS

Our discussion in this section focuses on LW and PMH, although FIDS can also be applied to other algorithms. FIDS includes three optimizations: dynamic backchaining (DB), adaptive contingency updating (ACU), and reference counting (RC). DB is applied to both LW and PMH, while ACU and RC are specific to PMH.
Dynamic backchaining

Dynamic backchaining (DB) [Milch et al., 2005a] constructs a dynamic slice incrementally in each LW iteration and samples only those variables necessary at runtime. DB in Swift is an example of compiling lazy evaluation (similar in spirit to a Prolog compiler): in each iteration, DB backtracks from the evidence and query and samples an r.v. only when its value is required during inference.

For every r.v. declaration random T X ~ C_X; in the input PP, FIDS generates a getter function get_X(), which is used to (1) sample a value for X from its declaration C_X and (2) memoize the sampled value for later references in the current iteration. Whenever an r.v. is referred to during inference, its getter function is invoked.
DB is the fundamental technique in Swift. The key insight is to replace dependency look-ups in an internal PP data structure with direct machine address accessing and branching in the target executable file.

For example, the inference code for the getter function of x(d) in the ∞-GMM model (Fig. 3) is shown below.

double get_x(Data d) {
  // some code memoizing the sampled value
  memoization;
  // if not sampled, sample a new value
  val = sample_gaussian(get_mu(get_z(d)), 1);
  return val;
}

Here memoization denotes a pseudo-code snippet for performing memoization. Since get_z(·) is called before get_mu(·), only those mu(c) corresponding to non-empty clusters will be sampled. Notably, with the inference code above, no explicit dependency look-up is ever required at runtime to discover the non-empty clusters.

When sampling a number variable, we need to allocate memory for the associated variables. For example, in the urn-ball model (Fig. 2), we need to allocate memory for color(b) after sampling #Ball. The corresponding generated inference code is shown below.

int get_num_Ball() {
  // some code for memoization
  memoization;
  // sample a value
  val = sample_uniformInt(1, 20);
  // some code allocating memory for color
  allocate_memory_color(val);
  return val;
}

allocate_memory_color(val) denotes a pseudo-code segment that allocates val chunks of memory for the values of color(b).
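The lazy-getter structure above can be mimicked directly in executable pseudo-code. The sketch below is ours, not Swift output: the memo dictionary plays the role of the memoization snippets, and randint stands in for the Poisson(4) number statement. Backtracking from a single observed data point instantiates only the dynamic slice, so mu(c) is sampled for exactly one cluster.

```python
import random

rng = random.Random(0)
memo = {}  # (r.v. name, args) -> sampled value, valid for one iteration

def get_num_cluster():
    if "#Cluster" not in memo:
        memo["#Cluster"] = rng.randint(1, 8)   # stand-in for Poisson(4)
    return memo["#Cluster"]

def get_z(d):
    if ("z", d) not in memo:
        memo[("z", d)] = rng.randrange(get_num_cluster())
    return memo[("z", d)]

def get_mu(c):
    if ("mu", c) not in memo:
        memo[("mu", c)] = rng.gauss(0, 10)
    return memo[("mu", c)]

def get_x(d):
    if ("x", d) not in memo:
        memo[("x", d)] = rng.gauss(get_mu(get_z(d)), 1.0)
    return memo[("x", d)]

# Backtrack from the single observation: only the slice
# {#Cluster, z(0), mu(z(0)), x(0)} is instantiated, not every mu(c).
get_x(0)
sampled_mus = [k for k in memo if isinstance(k, tuple) and k[0] == "mu"]
assert sampled_mus == [("mu", get_z(0))]
```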
Reference Counting

Reference counting (RC) generalizes the idea of DB to incrementally maintain the dynamic slice in PMH. RC is an efficient compilation strategy for the interpretive BLOG approach of dynamically maintaining references to variables and excluding those without any references from the current possible world, with minimal runtime overhead. RC is similar in spirit to dynamic garbage collection in the programming languages community.

For an r.v. being tracked, say X, RC maintains a reference counter cnt(X), defined by cnt(X) = |Ch_w(X)|, where Ch_w(X) denotes the children of X in the possible world w.
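A minimal sketch of the RC bookkeeping follows. It is our illustration, not Swift output; in Swift the updates are generated as specialized code per variable, and instantiation and removal recurse through dependents.

```python
# cnt[x] tracks |Ch_w(x)|; a variable leaves the possible world when its
# last reference disappears and re-enters when a reference comes back.
cnt = {}
world = set()

def inc_cnt(x):
    if cnt.get(x, 0) == 0:
        world.add(x)          # instantiate x (recursively, in Swift)
    cnt[x] = cnt.get(x, 0) + 1

def dec_cnt(x):
    cnt[x] -= 1
    if cnt[x] == 0:
        world.remove(x)       # x no longer referenced: drop it from w

# urn-ball flavor: draw D[0] stops referring to color(Ball2)
# and refers to color(Ball5) instead.
inc_cnt("color(Ball2)")
assert "color(Ball2)" in world
dec_cnt("color(Ball2)")
inc_cnt("color(Ball5)")
assert "color(Ball2)" not in world and "color(Ball5)" in world
```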
When cnt(X) becomes zero, X is removed from w; when cnt(X) becomes positive, X is instantiated and added back to w. This procedure is performed recursively.

Tracking references to every r.v. might cause unnecessary overhead. For example, for classical Bayes nets, RC never excludes any variables. Hence, as a trade-off, Swift only counts references in open-universe models, particularly to those variables associated with number variables (e.g., color(b) in the urn-ball model).

Take the urn-ball model as an example. When resampling drawn(d) and accepting the proposed value v, the generated code for accepting the proposal is

void inc_cnt(X) {
  if (cnt(X) == 0) W.add(X);
  cnt(X) = cnt(X) + 1;
}
void dec_cnt(X) {
  cnt(X) = cnt(X) - 1;
  if (cnt(X) == 0) W.remove(X);
}
void accept_value_drawn(Draw d, Ball v) {
  // code for updating dependencies omitted
  dec_cnt(color(val_drawn(d)));
  val_drawn(d) = v;
  inc_cnt(color(v));
}

The functions inc_cnt(X) and dec_cnt(X) update the references to X. The function accept_value_drawn() is specialized to the r.v. drawn(d): Swift analyzes the input program and generates specialized code for different variables.

Adaptive contingency updating

The goal of ACU is to incrementally maintain the Markov blanket of every r.v. with minimal effort.
C_X denotes the declaration of r.v. X in the input PP; Par_w(X) denotes the parents of X in the possible world w; w[X ← v] denotes the new possible world derived from w by only changing the value of X to v. Note that deriving w[X ← v] may require instantiating new variables not existing in w due to dependency changes.
Computing Par_w(X) is straightforward: we execute its generation process, C_X, within w, which is often inexpensive. Hence, the principal challenge for maintaining the Markov blanket is to efficiently maintain a children set Ch(X) for each r.v. X, with the aim of keeping Ch(X) identical to the true children set Ch_w(X) under the current possible world w.
With a proposed possible world (PW) w' = w[X ← v], it is easy to add all missing dependencies from a particular r.v. U. That is, for a r.v. U and every V ∈ Par_w'(U), since we keep Ch(V) up-to-date, we can verify whether U ∈ Ch(V) and add U to Ch(V) if U ∉ Ch(V). Therefore, the key step is tracking the set of variables Δ(w[X ← v]) that have a different set of parents in w[X ← v] than in w:

    Δ(w[X ← v]) = {Y : Par_w(Y) ≠ Par_{w[X ← v]}(Y)}

Precisely computing Δ(w[X ← v]) is very expensive at runtime, but computing an upper bound for Δ(w[X ← v]) does not influence the correctness of the inference code. Therefore, we proceed to find an upper bound that holds for any value v. We call this over-approximation the contingent set Cont_w(X).
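For intuition, Δ can be computed by brute force on a toy contingent model; this is a reference definition only, never something Swift does at runtime (the model, with y reading a when x == 0 and b otherwise, is invented for illustration):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using WorldMap = std::map<std::string, int>;

// Toy contingent dependency: Par_w(y) switches between {x, a} and {x, b}
// depending on the value of x in the world w.
std::set<std::string> parents(const WorldMap& w, const std::string& var) {
    if (var == "y")
        return w.at("x") == 0 ? std::set<std::string>{"x", "a"}
                              : std::set<std::string>{"x", "b"};
    return {};  // x, a, b have no parents
}

// Brute-force Δ(w[X ← v]): compare parent sets in w and in w[X ← v].
std::set<std::string> delta(WorldMap w, const std::string& X, int v) {
    WorldMap w2 = w;
    w2[X] = v;
    std::set<std::string> d;
    for (const auto& kv : w)
        if (parents(w, kv.first) != parents(w2, kv.first))
            d.insert(kv.first);
    return d;
}
```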
Note that for every r.v. U with dependency changes in w[X ← v], X must be reached in the condition of a control statement on the execution path of C_U within w. This implies Δ(w[X ← v]) ⊆ Ch_w(X). One straightforward idea is to set Cont_w(X) = Ch_w(X). However, this leads to too much overhead. For example, Δ(·) should always be empty for models with fixed dependencies.
Another approach is to first find out the set of switching variables S_X for every r.v. X, defined by

    S_X = {Y : ∃ w, v. Par_w(X) ≠ Par_{w[Y ← v]}(X)}.

This brings an immediate advantage: S_X can be statically computed at compile time. However, computing S_X is NP-Complete.¹
Our solution is to derive the approximation Ŝ_X by taking the union of the free variables of if/case conditions, function arguments, and set expression conditions in C_X, and then to set

    Cont_w(X) = {Y : Y ∈ Ch_w(X) ∧ X ∈ Ŝ_Y}.

Cont_w(X) is a function of the sampling variable X and the current PW w. Likewise, for every r.v. X we maintain a runtime contingent set Cont(X) identical to the true contingent set Cont_w(X) under the current PW w; Cont(X) can be incrementally updated as well.
Back to the original focus of ACU, suppose we are adding the new dependencies in the new PW w' = w[X ← v]. There are three steps to accomplish the goal: (1) enumerate all the variables V ∈ Cont(X); (2) for all U ∈ Par_w'(V), add V to Ch(U); and (3) for all U ∈ Par_w'(V) ∩ Ŝ_V, add V to Cont(U). These steps can also be repeated in a similar way to remove the vanished dependencies.

    Fc(random T Id([T Id,]*) ~ C) =
        void Id::add_to_Ch() { Fcc(C, Id, {}) }
        void Id::accept_value(T v) {
          for (u in Cont(Id)) u.del_from_Ch();
          Id = v;
          for (u in Cont(Id)) u.add_to_Ch();
        }
    Fcc(C = Exp, X, seen) = Fcc(Exp, X, seen)
    Fcc(C = Dist(Exp), X, seen) = Fcc(Exp, X, seen)
    Fcc(C = if (Exp) then C1 else C2, X, seen) =
        Fcc(Exp, X, seen);
        if (Exp) { Fcc(C1, X, seen ∪ FV(Exp)) }
        else { Fcc(C2, X, seen ∪ FV(Exp)) }
    Fcc(Exp, X, seen) =
        /* emit a line of code for every r.v. U in the corresponding set */
        ∀ U ∈ ((FV(Exp) ∩ Ŝ_X) \ seen) : Cont(U) += X;
        ∀ U ∈ (FV(Exp) \ seen) : Ch(U) += X;

Figure 4: Transformation rules for outputting inference code with ACU. FV(Expr) denotes the free variables in Expr. We assume the switching variables Ŝ_X have already been computed. The rules for number statements and case statements are not shown for conciseness; they are analogous to the rules for random variables and if statements respectively.

¹ Since the declaration C_X may contain arbitrary boolean formulas, one can reduce the 3-SAT problem to computing S_X.
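The compile-time computation of Ŝ_X can be sketched over a toy declaration AST; the sketch below is simplified to if-conditions only (the paper's rule also unions in free variables of function arguments and set-expression conditions), and the node layout is illustrative, not Swift's internal representation:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct Node {
    enum Kind { VAR, CONST, IF, CALL } kind;
    std::string name;        // variable name when kind == VAR
    std::vector<Node> kids;  // for IF: {cond, then, else}; for CALL: args
};

// Free variables of an arbitrary (sub)expression.
void free_vars(const Node& n, std::set<std::string>& out) {
    if (n.kind == Node::VAR) out.insert(n.name);
    for (const Node& k : n.kids) free_vars(k, out);
}

// Over-approximation of the switching variables of a declaration:
// the union of free variables of every if-condition in it.
std::set<std::string> switching_vars(const Node& decl) {
    std::set<std::string> s;
    if (decl.kind == Node::IF) free_vars(decl.kids[0], s);
    for (const Node& k : decl.kids) {
        std::set<std::string> sub = switching_vars(k);
        s.insert(sub.begin(), sub.end());
    }
    return s;
}
```

For a declaration shaped like `if (z == 1) then Normal(mu1) else Normal(mu2)`, only z lands in the switching set: mu1 and mu2 appear in branches, not in the condition.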
add to V ments areVnot2Vshown for conciseness—they are analogous b We assume the switching variables S have already been X have already We assume the switching variables SX been brespectively. the rules forand random variables and if statements to Ch(U), (3) for all U 2 P ar(V |W ) \ S , add V to V computed. The The rules rules for for number number statements statements and and case case statestatecomputed. Cont(U). These steps can be also repeated in a similar way ments are are not not shown shown for for conciseness—they conciseness—they are are analogous analogous to to ments to remove the vanished dependencies. the rules for random variables and if statements respectively. the rules for random variables and if statements respectively. variables V 1-GMM 2 Cont(X), (2)(Fig. for all 2 Pexample ar(V |W, when ), addreV Take the model 3) U as an to Ch(U),z(d), and we (3)need for all U 2 P ar(V |W ) \ SbV of , add to sampling to change the dependency r.v. Vx(d) Cont(U). These steps can be also repeated in a similar way sinceCont Cont(z(d)|W ) = {x(d)} for any W . The generated since Cont (z(d)) = {x(d)} for any w. The generated code w since (z(d)) = {x(d)} for any w. The generated code w to remove theprocess vanished dependencies. code this is shown for thisforprocess process is shown shown below.below. for this is below. Takex(d)::add_to_Ch() the 1-GMM model (Fig. void { 3) as an example , when resampling z(d), += wex(d); need to change the dependency of Cont(z(d)) r.v.Ch(z(d)) x(d) since += Cont(z(d)|W ) = {x(d)} for any W . The x(d); Ch(mu(z(d))) += process x(d); is}shown below. generated code for this void z(d)::accept_value(Cluster v) { // accept the proposed value v for r.v. 
z(d) // code for updating references omitted for (u in Cont(z(d))) u.del_from_Ch(); val_z(d) = v; for (u in Cont(z(d))) u.add_to_Ch(); } We omitthe the code code for for del_from_Ch() del_from_Ch() for for conciseness, We omit the code for We omit for conciseness, conciseness, which isisessentially essentially the same as add_to_Ch() which whichis essentiallythe thesame sameas asadd_to_Ch() add_to_Ch()... The formal transformation rules for ACU are demonstrated The Theformal formaltransformation transformationrules rulesfor forACU ACUare aredemonstrated demonstrated in Fig. 4. F takes in a random variable declaration statement c ininFig. 4.4.FFcctakes ininaarandom variable declaration statement Fig. takes random variable declaration statement We omit theprogram code for del_from_Ch() forcode conciseness, in the BLOG program and outputs the inference code containin the BLOG and outputs the inference containin the BLOG program and outputs the inference code containwhich is essentially thevalue() same as add_to_Ch() . Ch(). ing methods accept and add to FFcc cc ing value()and andadd add to to Ch(). Ch(). F ingmethods methodsaccept accept value() cc The transformation rules for ACUthe areinference demonstrated takes in BLOG expression and generates the inference code takes in aaaBLOG expression generates code takes informal BLOG expressionand and generates the inference code in Fig.method 4. Fc takes a random statement inside method add to Ch(). ItItkeeps keeps track of already seen inside add Ch(). It track seen inside method addinto to Ch().variable keepsdeclaration trackof ofalready already seen in the BLOG program and outputs the inference code containvariables in seen, which makes sure that a variable will be variables in seen, which makes sure that a variable will variables in seen, which makes sure that a variable will be be ing methods accept and addset toat Ch(). Fcc added to the contingent set or the children set atatmost most once. 
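The runtime effect of the generated add_to_Ch/accept_value pair can be distilled into a runnable sketch; sets of pointers stand in for the generated per-variable methods, and the two-cluster layout is illustrative:

```cpp
#include <cassert>
#include <set>
#include <vector>

struct RV { std::set<RV*> Ch, Cont; };  // children and contingent sets

struct Model {
    RV z, x;
    std::vector<RV> mu = std::vector<RV>(2);  // mu(0), mu(1)
    int val_z = 0;

    void x_add_to_Ch() {         // dependencies of x under current val_z
        z.Cont.insert(&x);       // x is contingent on z
        z.Ch.insert(&x);
        mu[val_z].Ch.insert(&x); // x is a child of mu(val_z)
    }
    void x_del_from_Ch() {       // inverse of x_add_to_Ch
        z.Cont.erase(&x);
        z.Ch.erase(&x);
        mu[val_z].Ch.erase(&x);
    }
    void z_accept_value(int v) {
        // detach contingent children under the old value, switch the
        // value, then re-attach under the new dependency structure
        // (only x here; the generated code dispatches virtually)
        for (RV* u : std::set<RV*>(z.Cont))
            if (u == &x) x_del_from_Ch();
        val_z = v;
        x_add_to_Ch();
    }
};
```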
Code for removing variables can be generated similarly.
Since we maintain the exact children set for each r.v. with ACU, the computation of the acceptance ratio α (line 7 in Alg. 1) for PMH can be simplified to

    α = min(1, (|w| · Pr[X = v | w_{-X}] · ∏_{U ∈ Ch_w'(X)} Pr[U(w') | w'_{-U}])
             / (|w'| · Pr[X = v' | w_{-X}] · ∏_{V ∈ Ch_w(X)} Pr[V(w) | w_{-V}]))    (2)

where w' = w[X ← v]. Here |w| denotes the total number of random variables existing in w, which can be maintained via RC. Finally, the computation time of ACU is strictly shorter than that of the acceptance ratio α (Eq. 2).
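A sketch of evaluating the simplified ratio of Eq. (2), assuming the prior terms and the child likelihoods have already been computed from the maintained children sets (the function name and signature are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// n_old/n_new: |w| and |w'| (maintained via RC).
// p_new/p_old: Pr[X = v | ...] and Pr[X = v' | ...].
// child_lik_*: likelihoods of the children of X in the new/old world;
// with ACU these are exactly the variables whose likelihood changes.
double pmh_acceptance(std::size_t n_old, std::size_t n_new,
                      double p_new, double p_old,
                      const std::vector<double>& child_lik_new,
                      const std::vector<double>& child_lik_old) {
    double num = static_cast<double>(n_old) * p_new;
    for (double l : child_lik_new) num *= l;
    double den = static_cast<double>(n_new) * p_old;
    for (double l : child_lik_old) den *= l;
    return std::min(1.0, num / den);
}
```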
4.2 Implementation of Swift

Lightweight memoization
Here we introduce the implementation details of the memoization code in the getter functions mentioned in Section 4.1.
Objects are converted to integers in the target code. Swift analyzes the input PP and allocates static memory for memoization in the target code. For open-universe models, where the number of random variables is unknown, a dynamic table data structure (e.g., vector<int> in C++) is used to reduce the amount of dynamic memory allocations: we only increase the length of the array when the number becomes larger than the capacity.
One potential weakness of directly applying a dynamic table is that, for models with multiple nested number statements (e.g., the aircraft tracking model with multiple aircraft and blips for each one), the memory consumption can be large. In Swift, the user can force the target code to clear all the unused memory every fixed number of iterations via a compilation option, which is turned off by default.
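A runnable distillation of this counter-based memoization plus grow-only table scheme (the sampler body is a stand-in that just counts draws; names are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bumping `iter` invalidates every cached value at once, with no
// per-variable clearing pass.
struct MemoVar {
    std::vector<int> val, mark;  // value and "valid at iteration" flag
    int draws = 0;               // counts how often we actually sample

    int get(int i, int iter) {
        if (static_cast<std::size_t>(i) >= val.size()) {  // grow-only table
            val.resize(i + 1);
            mark.resize(i + 1, -1);
        }
        if (mark[i] == iter) return val[i];  // memoized this iteration
        mark[i] = iter;
        val[i] = ++draws;                    // "sampling" code stand-in
        return val[i];
    }
};
```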
The following code demonstrates the target code for the number variable #Ball and its associated color(b)'s in the urn-ball model (Fig. 2), where iter is the current iteration number.

    vector<int> val_color, mark_color;
    int get_color(int b) {
      if (mark_color[b] == iter)
        return val_color[b];  // memoization
      mark_color[b] = iter;   // mark the flag
      val_color[b] = ...      // sampling code omitted
      return val_color[b];
    }
    int val_num_B, mark_num_B;
    int get_num_Ball() {
      if (mark_num_B == iter) return val_num_B;
      val_num_B = ...  // sampling code omitted
      if (val_num_B > val_color.size()) {  // allocate
        val_color.resize(val_num_B);       // memory
        mark_color.resize(val_num_B);
      }
      return val_num_B;
    }

Note that when generating a new PW, simply increasing the counter iter automatically clears all the memoization flags (e.g., mark_color for color(·)).
Lastly, in PMH we need to randomly select an r.v. to sample per iteration. In the target C++ code, this is accomplished via polymorphism: we declare an abstract class with a resample() method and a derived class for every r.v. in the PP. An example code snippet for the ∞-GMM model is shown below.

    class MH_OBJ { public:  // abstract class
      virtual void resample()=0;      // resample step
      virtual void add_to_Ch()=0;     // for ACU
      virtual void accept_value()=0;  // for ACU
      virtual double get_likeli()=0;
    };  // some methods omitted
    // maintain all the r.v.s in the current PW
    std::vector<MH_OBJ*> active_vars;
The acceptance ratio ↵ is Gibbs1 iswhile a variant of MH with the proposal distribution g(·) always the disadvantage is that it is only possible to set to be the posterior The acceptance ratio ↵ is explicitly construct thedistribution. posterior distribution with conjugate always 1 while is that it is only possible to priors. In Gibbs,the thedisadvantage proposal distribution still constructed explicitly constructblanket, the posterior with conjugate from the Markov which distribution again requires maintaining priors. In Gibbs, the proposal distribution still constructed the children set. Hence, FIDS can be fully is utilized. from the Markov blanket, whichpriors again yield requires maintaining However, different conjugate different forms the children set. Hence, FIDS be to fully utilized. of posterior distributions. In can order support a variety of However, different conjugate yield different forms proposal distributions, we need topriors (1) implement a conjugacy of posterior distributions. In posterior order to distribution support a variety of checker to check the form of for each proposal wetime need(the to (1) implement a conjugacy r.v. in thedistributions, PP at compile checker has nothing to do checker to check the form aofposterior posteriorsampler distribution for each with FIDS); (2) implement for every conr.v. in the PPinatthe compile (theof checker jugate prior runtimetime library Swift. has nothing to do with FIDS); (2) implement a posterior sampler for every conjugate prior in the runtime library of Swift. 5 Experiments // derived classes for r.v.s in the PP class Var_mu:public MH_OBJ{public://for mu(c) double val; // value for the var double cached_val; //cache for proposal int mark, cache_mark; // flags std::set<MH_OBJ*> Ch, Cont; //sets for ACU double getval(){...}; //sample & memoize double getcache(){...};//generate proposal ... }; // some methods omitted std::vector<Var_mu*> mu; // dynamic table class Var_z:public MH_OBJ{ ... 
Efficient proposal manipulation
Although one PMH iteration only samples a single variable, generating a new PW may still involve (de-)instantiating an arbitrary number of random variables due to RC or sampling a number variable. The general approach commonly adopted by many PPL systems is to construct the proposed possible world in dynamically allocated memory and then copy it to the current world [Milch et al., 2005a; Wingate et al., 2011], which suffers from significant overhead. In order to generate and accept new PWs in PMH with negligible memory-management overhead, we extend the lightweight memoization to manipulate proposals: Swift statically allocates an extra memoized cache for each random variable, dedicated to storing the proposed value for that variable. During resampling, all the intermediate results are stored in the cache. When accepting the proposal, the proposed value is directly loaded from the cache; when rejecting, no action is needed due to our memoization mechanism.
Here is an example target code fragment for mu(c) in the ∞-GMM model. proposed_vars is an array storing all the variables to be updated if the proposal gets accepted.

    std::vector<MH_OBJ*> proposed_vars;
    class Var_mu : public MH_OBJ { public:  // for mu(c)
      double val;            // value
      double cached_val;     // cache for proposal
      int mark, cache_mark;  // flags
      double getcache() {  // propose new value
        if (mark == 1) return val;
        if (cache_mark == iter) return cached_val;
        cache_mark = iter;  // memoization
        proposed_vars.push_back(this);
        cached_val = ...  // sample new value
        return cached_val;
      }
      void accept_value() {
        val = cached_val;  // accept proposed value
        if (mark == 0) {   // not instantiated yet
          mark = 1;        // instantiate variable
          active_vars.push_back(this);
        }
      }
      void resample() {
        proposed_vars.clear();
        mark = 0;  // generate proposal
        new_val = getcache();
        mark = 1;
        alpha = ...  // compute acceptance ratio
        if (sample_unif(0,1) <= alpha) {
          for (auto v : proposed_vars)
            v->accept_value();  // accept
          // code for ACU and RC omitted
        }
      }
    };  // some methods omitted

5 Experiments
In the experiments, Swift generates target code in C++, with the C++ standard <random> library for random number generation and the armadillo package [Sanderson, 2010] for matrix computation. The baseline systems include BLOG (version 0.9.1), Figaro (version 3.3.0), Church (webchurch), Infer.NET (version 2.6), BUGS (winBUGS 1.4.3) and Stan (cmdStan 2.6.0).

5.1 Benchmark models
We collect a set of benchmark models² which exhibit various capabilities of a PPL (Tab. 1), including the burglary model (Bur), the hurricane model (Hur), the urn-ball model (U-B(x,y) denotes the urn-ball model with at most x balls and y draws), the TrueSkill model [Herbrich et al., 2007] (T-K), the 1-dimensional Gaussian mixture model (GMM, 100 points with 4 clusters) and the infinite GMM model (∞-GMM). We also include a real-world dataset: handwritten digits [LeCun et al., 1998] using the PPCA model [Tipping and Bishop, 1999]. All the models can be downloaded from the GitHub repository of BLOG.

5.2 Speedup by FIDS within Swift
We first evaluate the speedup contributed by each of the three optimizations in FIDS individually.

Dynamic backchaining for LW
We compare the running time of the following versions of code: (1) the code generated by Swift ("Swift"); (2) the modified compiled code with DB manually turned off ("No DB"), which sequentially samples all the variables in the PP; and (3) the hand-optimized code without unnecessary memoization or recursive function calls ("Hand-Opt").
We measure the running time for all 3 versions, and the number of calls to the random number generator for "Swift" and "No DB". The results are summarized in Tab. 2.

² Most models are simplified to ensure that all the benchmark systems can produce an output in reasonably short running time. All of these models can be enlarged to handle more data without adding extra lines of BLOG code.

Table 1: Features of the benchmark models.
Table 1: Features of the benchmark models. R: continuous scalar or vector variables. CT: context-specific dependencies (contingency). CC: cyclic dependencies in the PP (while in any particular PW the dependency structure remains acyclic). OU: open-universe. CG: conjugate priors.

Alg.         Bur       Hur       U-B(20,2)   U-B(20,8)
Running Time (s): Swift vs. No DB
No DB        0.768     1.288     2.952       5.040
Swift        0.782     1.115     1.755       4.601
Speedup      0.982     1.155     1.682       1.10
Running Time (s): Swift vs. Hand-Optimized
Hand-Opt     0.768     1.099     1.723       4.492
Swift        0.782     1.115     1.755       4.601
Overhead     1.8%      1.4%      1.8%        2.4%
Calls to Random Number Generator
No DB        3×10^7    3×10^7    1.35×10^8   1.95×10^8
Swift        3×10^7    2.0×10^7  4.82×10^7   1.42×10^8
Calls Saved  0%        33.3%     64.3%       27.1%

Table 2: Performance on LW with 10^7 samples for Swift, the version without DB, and the hand-optimized version.

The overhead due to memoization, compared against the hand-optimized code, is less than 2.4%. Notice further that the speedup is proportional to the number of calls saved.

Adaptive Contingency Updating for PMH
We compare the code generated by Swift with ACU ("Swift") against a manually modified version without ACU ("Static Dep"), and measure the number of calls to the likelihood function (Tab. 3). "Static Dep" demonstrates the best efficiency that can be achieved via compile-time analysis without maintaining dependencies at runtime: it statically computes an upper bound on the children set via Ch(X) = {Y : X appears in C_Y} and again uses Eq. (2) to compute the acceptance ratio. Thus, for models with fixed dependencies, "Static Dep" should be faster than "Swift". In Tab. 3, however, "Swift" is up to 1.7x faster than "Static Dep", thanks to up to an 81% reduction in likelihood-function evaluations. Note that for the model with fixed dependencies (Bur), the overhead due to ACU is almost negligible: 0.2%, from traversing the hashmap to access the child variables.

Alg.         Bur       Hur       U-B(20,10)  U-B(40,20)
Running Time (s): Swift vs. Static Dep
Static Dep   0.986     1.642     4.433       7.722
Swift        0.988     1.492     3.891       4.514
Speedup      0.998     1.100     1.139       1.711
Calls to Likelihood Functions
Static Dep   1.8×10^7  2.7×10^7  1.5×10^8    3.0×10^8
Swift        1.8×10^7  1.7×10^7  4.1×10^7    5.5×10^7
Calls Saved  0%        37.6%     72.8%       81.7%

Table 3: Performance of PMH in Swift and the version utilizing static dependencies on benchmark models.

5.3 Reference Counting for PMH
RC only applies to open-universe models in Swift, so we focus on the urn-ball model with various model parameters. The urn-ball model represents a common structure appearing in many applications with open-universe uncertainty; this experiment therefore indicates the potential speedup by RC for real-world problems with similar structures.
We compare the code produced by Swift with RC ("Swift") and with RC manually turned off ("No RC"). For "No RC", we traverse the whole dependency structure and reconstruct the dynamic slice every iteration. Fig. 5 shows the running time of both versions. RC is indeed effective, leading to up to a 2.3x speedup, and it achieves greater speedup as the number of balls and the number of observations grow.

Figure 5: Running time (s) of PMH in Swift with and without RC on urn-ball models (U-B(10,5) up to U-B(40,40)) for 20 million samples.

Swift against other PPL systems
We compare Swift with other systems using LW, PMH and Gibbs respectively on the benchmark models. The running time is presented in Tab. 4. The speedup is measured between Swift and the fastest among the rest. An empty entry means that the corresponding PPL fails to perform inference on that model. Though these PPLs have different host languages (C++, Java, Scala, etc.), the performance difference resulting from the host language is within a factor of two3.

5.4 Experiment on real-world dataset
We use Swift with Gibbs to handle the probabilistic principal component analysis (PPCA) model [Tipping and Bishop, 1999] with real-world data: 5958 images of the digit "2" from the handwritten digits dataset (MNIST) [LeCun et al., 1998].
We compute 10 principal components for the digits. The training and testing sets include 5958 and 1032 images respectively, each with 28x28 pixels and pixel values rescaled to [0, 1]. Since most benchmark PPL systems are too slow to handle this amount of data, we compare Swift against two other scalable PPLs, Infer.NET and Stan, on this model. Both PPLs have compiled inference and are widely used for real-world applications. Stan uses HMC as its inference algorithm. For Infer.NET, we select the variational message passing algorithm (VMP), which is Infer.NET's primary focus. Note that HMC and VMP are usually favored for fast convergence.
Stan requires a tuning process before it can produce samples. We ran Stan with 0, 5 and 9 tuning steps respectively4. We measure the perplexity of the generated samples over the test images w.r.t. the running time in Fig. 6, where we also visualize the produced principal components. For Infer.NET, we consider the mean of the approximating distribution. Swift quickly generates visually meaningful outputs with around 10^5 Gibbs iterations in 5 seconds. Infer.NET takes 13.4 seconds to finish its first iteration5 and converges to a result with the same perplexity after 25 iterations and 150 seconds. The overall convergence of Gibbs w.r.t. the running time benefits significantly from the speedup by Swift. For Stan, its no-U-turn sampler [Homan and Gelman, 2014] suffers from significant parameter-tuning issues; we also tried to tune the parameters manually, which did not help much. Nevertheless, Stan does work for the simplified PPCA model with 1-dimensional data. Although further investigation is still needed, we conjecture that Stan is very sensitive to its parameters in high-dimensional cases. Lastly, this experiment also suggests that parameter-robust inference engines, such as Swift with Gibbs and Infer.NET with VMP, would be preferred in practice when possible.

PPL          Bur    Hur    U-B(20,10)  T-K    GMM    ∞-GMM
LW with 10^6 samples
Church       9.6    22.9   179         57.4   1627   1038
Figaro       15.8   24.7   176         48.6   997    235
BLOG         8.4    11.9   189         49.5   998    261
Swift        0.04   0.08   0.54        0.24   6.8    1.1
Speedup      196    145    326         202    147    214
PMH with 10^6 samples
Church       12.7   25.2   246         173    3703   1057
Figaro       10.6   59.6   -           17.4   151    62.2
BLOG         6.7    18.5   30.4        38.7   72.5   68.3
Swift        0.11   0.15   0.4         0.32   0.76   0.60
Speedup      61     121    75          54     95     103
Gibbs with 10^6 samples
BUGS         87.7   -      -           -      84.4   -
Infer.NET    1.5    -      -           -      77.8   -
Swift        0.12   0.19   0.34        -      0.42   0.86
Speedup      12     1      1           -      185    1

Table 4: Running time (s) of Swift and other PPLs on the benchmark models using LW, PMH and Gibbs.

Figure 6: Log-perplexity w.r.t. running time (s) on the PPCA model with visualized principal components. Swift converges faster.

6 Conclusion and Future Work
We have developed Swift, a PPL compiler that generates model-specific target code for performing inference. Swift uses a dynamic slicing framework for open-universe models to incrementally maintain the dependency structure at runtime. In addition, we carefully design data structures in the target code to avoid dynamic memory allocation when possible. Our experiments show that Swift achieves orders-of-magnitude speedups over other PPL engines.
The next step for Swift is to support more algorithms, such as SMC [Wood et al., 2014], as well as more samplers, such as the block sampler for (nearly) deterministic dependencies [Li et al., 2013]. Although FIDS is a general framework that can be naturally extended to these cases, there are still implementation details to be studied. Another direction is partial evaluation, which allows the compiler to reason about the input program in order to simplify it. Shah et al. [2016] propose a partial evaluation framework for the inference code (PI in Fig. 1); it would be interesting to extend this work to the whole Swift pipeline.

3 http://benchmarksgame.alioth.debian.org/
4 We also ran Stan with 50 and 100 tuning steps, which took more than 3 days to finish 130 iterations (including tuning). However, the results are almost the same as those with 9 tuning steps.
5 Gibbs by Swift samples a single variable per iteration while Infer.NET processes all the variables per iteration.

Acknowledgement
We would like to thank our anonymous reviewers, as well as Rohin Shah, for valuable discussions. This work is supported by the DARPA PPAML program, contract FA8750-14-C-0011.

References
[Agrawal and Horgan, 1990] Hiralal Agrawal and Joseph R. Horgan. Dynamic program slicing. In ACM SIGPLAN Notices, volume 25, pages 246–256. ACM, 1990.
[Arora et al., 2010] Nimar S. Arora, Rodrigo de Salvo Braz, Erik B. Sudderth, and Stuart J. Russell. Gibbs sampling in open-universe stochastic languages. In UAI, pages 30–39. AUAI Press, 2010.
[Chaganty et al., 2013] Arun Chaganty, Aditya Nori, and Sriram Rajamani. Efficiently sampling probabilistic programs via program analysis. In AISTATS, pages 153–160, 2013.
[Cytron et al., 1989] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. An efficient method of computing static single assignment form. In Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 25–35. ACM, 1989.
[Minka et al., 2014] T. Minka, J.M. Winn, J.P. Guiver, S. Webster, Y. Zaykov, B. Yangel, A. Spengler, and J. Bronskill. Infer.NET 2.6, 2014.
[Nori et al., 2014] Aditya V. Nori, Chung-Kil Hur, Sriram K. Rajamani, and Selva Samuel. R2: An efficient MCMC sampler for probabilistic programs.
In AAAI Conference on Artificial Intelligence, 2014.
[Pfeffer, 2001] Avi Pfeffer. IBAL: A probabilistic rational programming language. In Proc. 17th IJCAI, pages 733–740. Morgan Kaufmann Publishers, 2001.
[Pfeffer, 2009] Avi Pfeffer. Figaro: An object-oriented probabilistic programming language. Charles River Analytics Technical Report, 2009.
[Plummer, 2003] Martyn Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, volume 124, page 125, Vienna, 2003.
[Ritchie et al., 2016] Daniel Ritchie, Andreas Stuhlmüller, and Noah D. Goodman. C3: Lightweight incrementalized MCMC for probabilistic programs using continuations and callsite caching. In AISTATS, 2016.
[Sanderson, 2010] Conrad Sanderson. Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments. 2010.
[Shah et al., 2016] Rohin Shah, Emina Torlak, and Rastislav Bodik. SIMPL: A DSL for automatic specialization of inference algorithms. arXiv preprint arXiv:1604.04729, 2016.
[Stan Development Team, 2014] Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, Version 2.5.0, 2014.
[Tipping and Bishop, 1999] Michael E. Tipping and Chris M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611–622, 1999.
[Tristan et al., 2014] Jean-Baptiste Tristan, Daniel Huang, Joseph Tassarotti, Adam C. Pocock, Stephen Green, and Guy L. Steele. Augur: Data-parallel probabilistic modeling. In NIPS, pages 2600–2608, 2014.
[Wingate et al., 2011] David Wingate, Andreas Stuhlmueller, and Noah D. Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In International Conference on Artificial Intelligence and Statistics, pages 770–778, 2011.
[Wood et al., 2014] Frank Wood, Jan Willem van de Meent, and Vikash Mansinghka. A new approach to probabilistic programming inference. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, pages 2–46, 2014.
[Wu et al., 2014] Yi Wu, Lei Li, and Stuart J. Russell. BFiT: From possible-world semantics to random-evaluation semantics in open universe. In Neural Information Processing Systems, Probabilistic Programming Workshop, 2014.
[Yang et al., 2014] Lingfeng Yang, Pat Hanrahan, and Noah D. Goodman. Generating efficient MCMC kernels from probabilistic programs. In AISTATS, pages 1068–1076, 2014.
[Goodman et al., 2008] Noah D. Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. Church: A language for generative models. In UAI, pages 220–229, 2008.
[Herbrich et al., 2007] Ralf Herbrich, Tom Minka, and Thore Graepel. TrueSkill(TM): A Bayesian skill rating system. In Advances in Neural Information Processing Systems 20, pages 569–576. MIT Press, January 2007.
[Homan and Gelman, 2014] Matthew D. Homan and Andrew Gelman. The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research, 15(1):1593–1623, 2014.
[Hur et al., 2014] Chung-Kil Hur, Aditya V. Nori, Sriram K. Rajamani, and Selva Samuel. Slicing probabilistic programs. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, pages 133–144, New York, NY, USA, 2014. ACM.
[Kazemi and Poole, 2016] Seyed Mehran Kazemi and David Poole. Knowledge compilation for lifted probabilistic inference: Compiling to a low-level language. 2016.
[Kucukelbir et al., 2015] Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, and David Blei. Automatic variational inference in Stan. In NIPS, pages 568–576, 2015.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[Li et al., 2013] Lei Li, Bharath Ramsundar, and Stuart Russell. Dynamic scaled sampling for deterministic constraints. In AISTATS, pages 397–405, 2013.
[Lunn et al., 2000] David J. Lunn, Andrew Thomas, Nicky Best, and David Spiegelhalter. WinBUGS - a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10(4):325–337, October 2000.
[Mansinghka et al., 2013] Vikash K. Mansinghka, Tejas D. Kulkarni, Yura N. Perov, and Joshua B. Tenenbaum. Approximate Bayesian image interpretation using generative probabilistic graphics programs. In NIPS, pages 1520–1528, 2013.
[McAllester et al., 2008] David McAllester, Brian Milch, and Noah D. Goodman. Random-world semantics and syntactic independence for expressive languages. Technical report, 2008.
[Milch and Russell, 2006] Brian Milch and Stuart J. Russell. General-purpose MCMC inference over relational structures. In UAI. AUAI Press, 2006.
[Milch et al., 2005a] Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L. Ong, and Andrey Kolobov. BLOG: Probabilistic models with unknown objects. In IJCAI, pages 1352–1359, 2005.
[Milch et al., 2005b] Brian Milch, Bhaskara Marthi, David Sontag, Stuart Russell, Daniel L. Ong, and Andrey Kolobov. Approximate inference for infinite contingent Bayesian networks. In Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 2005.