
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)
Swift: Compiled Inference for Probabilistic Programming Languages
Yi Wu, UC Berkeley, jxwuyi@gmail.com
Lei Li, Toutiao.com, lileicc@gmail.com
Stuart Russell, UC Berkeley, russell@cs.berkeley.edu
Rastislav Bodik, University of Washington, bodik@cs.washington.edu
Abstract

A probabilistic program defines a probability measure over its semantic structures. One common goal of probabilistic programming languages (PPLs) is to compute posterior probabilities for arbitrary models and queries, given observed evidence, using a generic inference engine. Most PPL inference engines—even the compiled ones—incur significant runtime interpretation overhead, especially for contingent and open-universe models. This paper describes Swift, a compiler for the BLOG PPL. Swift-generated code incorporates optimizations that eliminate interpretation overhead, maintain dynamic dependencies efficiently, and handle memory management for possible worlds of varying sizes. Experiments comparing Swift with other PPL engines on a variety of inference problems demonstrate speedups ranging from 12x to 326x.

1 Introduction

Probabilistic programming languages (PPLs) aim to combine sufficient expressive power for writing real-world probability models with efficient, general-purpose inference algorithms that can answer arbitrary queries with respect to those models. One underlying motive is to relieve the user of the obligation to carry out machine learning research and implement new algorithms for each problem that comes along. Another is to support a wide range of cognitive functions in AI systems and to model those functions in humans.

General-purpose inference for PPLs is very challenging; they may include unbounded numbers of discrete and continuous variables, a rich library of distributions, and the ability to describe uncertainty over functions, relations, and the existence and identity of objects (so-called open-universe models). Existing PPL inference algorithms include likelihood weighting (LW) [Milch et al., 2005b], parental Metropolis–Hastings (PMH) [Milch and Russell, 2006; Goodman et al., 2008], generalized Gibbs sampling (Gibbs) [Arora et al., 2010], generalized sequential Monte Carlo [Wood et al., 2014], Hamiltonian Monte Carlo (HMC) [Stan Development Team, 2014], variational methods [Minka et al., 2014; Kucukelbir et al., 2015], and a form of approximate Bayesian computation [Mansinghka et al., 2013]. While better algorithms are certainly possible, our focus in this paper is on achieving orders-of-magnitude improvement in the execution efficiency of a given algorithmic process.

A PPL system takes a probabilistic program (PP) specifying a probabilistic model as its input and performs inference to compute the posterior distribution of a query given some observed evidence. The inference process does not (in general) execute the PP, but instead executes the steps of an inference algorithm (e.g., Gibbs) guided by the dependency structures implicit in the PP. In many PPL systems the PP exists as an internal data structure consulted by the inference algorithm at each step [Pfeffer, 2001; Lunn et al., 2000; Plummer, 2003; Milch et al., 2005a; Pfeffer, 2009]. This process is in essence an interpreter for the PP, similar to early Prolog systems that interpreted the logic program. Particularly when running sampling algorithms that involve millions of repetitive steps, the overhead can be enormous. A natural solution is to produce model-specific compiled inference code, but, as we show in Sec. 2, existing compilers for general open-universe models [Wingate et al., 2011; Yang et al., 2014; Hur et al., 2014; Chaganty et al., 2013; Nori et al., 2014] miss out on optimization opportunities and often produce inefficient inference code.

The paper analyzes the optimization opportunities for PPL compilers and describes the Swift compiler, which takes as input a BLOG program [Milch et al., 2005a] and one of three inference algorithms (LW, PMH, Gibbs) and generates target code for answering queries. Swift includes three main contributions: (1) elimination of interpretative overhead by joint analysis of the model structure and inference algorithm; (2) a dynamic slice maintenance method (FIDS) for incremental computation of the current dependency structure as sampling proceeds; and (3) efficient runtime memory management for maintaining the current-possible-world data structure as the number of objects changes. Comparisons between Swift and other PPLs on a variety of models demonstrate speedups ranging from 12x to 326x, leading in some cases to performance comparable to that of hand-built model-specific code. To the extent possible, we also analyze the contributions of each optimization technique to the overall speedup.

Although Swift is developed for the BLOG language, the overall design and the choices of optimizations can be applied to other PPLs and may bring useful insights to similar AI systems for real-world applications.
Figure 1: PPL compilers and optimization opportunities. A PPL compiler maps the probabilistic program (model) P_M to inference code P_I (F_I : P_M → P_I) and then to target C++ code P_T (F_cg : P_I → P_T). Classical probabilistic program optimizations are insensitive to the inference algorithm; classical algorithm optimizations are insensitive to the model; our optimizations—FIDS and its supporting data structures—combine model and algorithm, producing algorithm-specific, model-specific code.
1 type Ball; type Draw; type Color; // 3 types
2 distinct Color Blue, Green; // two colors
3 distinct Draw D[2]; // two draws: D[0], D[1]
4 #Ball ~ UniformInt(1, 20); // unknown # of balls
5 random Color color(Ball b) // color of a ball
6   ~ Categorical({Blue -> 0.9, Green -> 0.1});
7 random Ball drawn(Draw d)
8   ~ UniformChoice({b for Ball b});
9 obs color(drawn(D[0])) = Green; // drawn ball is green
10 query color(drawn(D[1])); // query

Figure 2: The urn-ball model
1 type Cluster; type Data;
2 distinct Data D[20]; // 20 data points
3 #Cluster ~ Poisson(4); // the number of clusters
4 random Real mu(Cluster c) ~ Gaussian(0, 10); // cluster mean
5 random Cluster z(Data d)
6   ~ UniformChoice({c for Cluster c});
7 random Real x(Data d) ~ Gaussian(mu(z(d)), 1.0); // data
8 obs x(D[0]) = 0.1; // omit other data points
9 query size({c for Cluster c});

Figure 3: The infinite Gaussian mixture model
2 Existing PPL Compilation Approaches

In a general purpose programming language (e.g., C++), the compiler compiles exactly what the user writes (the program). By contrast, a PPL compiler essentially compiles the inference algorithm, which is written by the PPL developer, as applied to a PP. This means that different implementations of the same inference algorithm for the same PP result in completely different target code.

As shown in Fig. 1, a PPL compiler first produces an intermediate representation combining the inference algorithm (I) and the input model (P_M) as the inference code (P_I), and then compiles P_I to the target code (P_T). Swift focuses on optimizing the inference code P_I and the target code P_T given a fixed input model P_M.

Although there have been successful PPL compilers, these compilers are all designed for a restricted class of models with fixed dependency structures: see work by Tristan [2014] and the Stan Development Team [2014] for closed-world Bayes nets, Minka [2014] for factor graphs, and Kazemi and Poole [2016] for Markov logic networks.

Church [Goodman et al., 2008] and its variants [Wingate et al., 2011; Yang et al., 2014; Ritchie et al., 2016] provide a lightweight compilation framework for performing the Metropolis–Hastings algorithm (MH) over general open-universe probability models (OUPMs). However, these approaches (1) are based on an inefficient implementation of MH, which results in overhead in P_I, and (2) rely on inefficient data structures in the target code P_T.

For the first point, consider an MH iteration where we are proposing a new possible world w′ from w according to the proposal g(·) by resampling random variable X from v to v′. The computation of the acceptance ratio α follows

α = min( 1, g(v′ → v) Pr[w′] / (g(v → v′) Pr[w]) ).   (1)

Since only X is resampled, it is sufficient to compute α using merely the Markov blanket of X, which leads to a much simplified formula for α from Eq. (1) by cancelling terms in Pr[w] and Pr[w′]. However, when applying MH to a contingent model, the Markov blanket of a random variable cannot be determined at compile time since the model dependencies vary during inference. This introduces tremendous runtime overhead for interpretive systems (e.g., BLOG, Figaro), which track all the dependencies at runtime even though typically only a tiny portion of the dependency structure may change per iteration. Church simply avoids keeping track of dependencies and uses Eq. (1), incurring a potentially huge overhead of redundant probability computations.

For the second point, in order to track the existence of the variables in an open-universe model, Church, similar to BLOG, maintains a complicated dynamic string-based hashing scheme in the target code, which again causes interpretation overhead.

Lastly, the generic techniques proposed by Hur et al. [2014], Chaganty et al. [2013] and Nori et al. [2014] primarily focus on optimizing P_M by analyzing the static properties of the input PP, which are complementary to Swift.
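To make the cancellation behind Eq. (1) concrete, the sketch below (our own illustration, not code from the paper) contrasts the full-world computation with the Markov-blanket shortcut it reduces to; the Node structure, the children lists, and the Gaussian factors are assumptions made only for this example.

// Minimal sketch: full-world acceptance computation of Eq. (1) versus the
// Markov-blanket shortcut. Node, children lists, and log_prob callbacks are
// illustrative assumptions, not Swift's actual data structures.
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

struct Node {
    double value;
    std::function<double(const std::vector<Node>&)> log_prob; // log Pr[node | parents]
    std::vector<int> children;                                 // indices of children
};

// Full-world likelihood: what Eq. (1) uses directly -- O(|w|) work per proposal.
double log_world(const std::vector<Node>& w) {
    double s = 0;
    for (const Node& n : w) s += n.log_prob(w);
    return s;
}

// Markov-blanket version: only X's own factor and its children's factors change,
// so the ratio Pr[w'] / Pr[w] reduces to these terms.
double log_blanket(const std::vector<Node>& w, int x) {
    double s = w[x].log_prob(w);
    for (int c : w[x].children) s += w[c].log_prob(w);
    return s;
}

int main() {
    // Toy chain a -> b -> c with Gaussian factors, just to exercise the code.
    auto gauss = [](double v, double mean) { return -0.5 * (v - mean) * (v - mean); };
    std::vector<Node> w(3);
    w[0] = {0.3, [&](const std::vector<Node>& s) { return gauss(s[0].value, 0.0); }, {1}};
    w[1] = {0.1, [&](const std::vector<Node>& s) { return gauss(s[1].value, s[0].value); }, {2}};
    w[2] = {0.4, [&](const std::vector<Node>& s) { return gauss(s[2].value, s[1].value); }, {}};

    int x = 1;                 // variable being resampled
    double proposed = 0.7;     // proposed value v'
    double old_value = w[x].value;

    double before = log_blanket(w, x);
    w[x].value = proposed;
    double after = log_blanket(w, x);
    w[x].value = old_value;

    // With a symmetric proposal, log alpha = min(0, after - before); the same
    // number would be obtained, more expensively, from full log_world() differences.
    std::printf("log acceptance ratio = %f\n", std::fmin(0.0, after - before));
    return 0;
}

The point of the sketch is only the cost difference: the blanket version touches a handful of factors, while the full-world version touches all of them; in a contingent model the hard part is knowing, at runtime, which factors form that blanket.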
3 Background

This paper focuses on the BLOG language, although our approach also applies to other PPLs with equivalent semantics [McAllester et al., 2008; Wu et al., 2014]. Other PPLs can be converted to BLOG via static single assignment form (SSA form) transformation [Cytron et al., 1989; Hur et al., 2014].

3.1 The BLOG Language

The BLOG language [Milch et al., 2005a] defines probability measures over first-order (relational) possible worlds; in this sense it is a probabilistic analogue of first-order logic. A BLOG program declares types for objects and defines distributions over their numbers, as well as defining distributions for the values of random functions applied to objects. Observation statements supply evidence, and a query statement specifies the posterior probability of interest. A random variable in BLOG corresponds to the application of a random function to specific objects in a possible world. BLOG naturally supports open universe probability models (OUPMs) and context-specific dependencies.

Fig. 2 demonstrates the open-universe urn-ball model. In this version the query asks for the color of the next random pick from an urn given the colors of balls drawn previously. Line 4 is a number statement stating that the number variable #Ball, corresponding to the total number of balls, is uniformly distributed between 1 and 20. Lines 5–6 declare a random function color(·), applied to balls, picking Blue or Green with a biased probability. Lines 7–8 state that each draw chooses a ball at random from the urn, with replacement.

Fig. 3 describes another OUPM, the infinite Gaussian mixture model (∞-GMM). The model includes an unknown number of clusters, as stated in line 3, which is to be inferred from the data.
3.2 Generic Inference Algorithms

All PPLs need to infer the posterior distribution of a query given observed evidence. The great majority of PPLs use Monte Carlo sampling methods such as likelihood weighting (LW) and Metropolis–Hastings MCMC (MH). LW samples unobserved random variables sequentially in topological order, conditioned on evidence and sampled values earlier in the sequence; each complete sample is weighted by the likelihood of the evidence variables appearing in the sequence.

The MH algorithm is summarized in Alg. 1. M denotes a BLOG model, E its evidence, Q a query, N the number of samples, w a possible world, X(w) the value of random variable X in w, Pr[X(w) | w_{-X}] the conditional probability of X in w, and Pr[w] the likelihood of possible world w. In particular, when the proposal distribution g in Alg. 1 samples a variable conditioned on its parents, the algorithm becomes the parental MH algorithm (PMH). Our discussion in the paper focuses on LW and PMH but our proposed solution applies to other Monte Carlo methods as well.

Algorithm 1: Metropolis–Hastings algorithm (MH)
Input: M, E, Q, N; Output: samples H
1 initialize a possible world w^(0) with E satisfied;
2 for i ← 1 to N do
3   randomly pick a variable X from w^(i-1);
4   w^(i) ← w^(i-1) and v ← X(w^(i-1));
5   propose a value v′ for X via proposal g;
6   X(w^(i)) ← v′ and ensure w^(i) is self-supporting;
7   α ← min( 1, g(v′→v) Pr[w^(i)] / (g(v→v′) Pr[w^(i-1)]) );
8   if rand(0, 1) ≥ α then w^(i) ← w^(i-1);
9   H ← H + (Q(w^(i)));

4 The Swift Compiler

Swift has the following design goals for handling open-universe probabilistic models.

• Provide a general framework to automatically and efficiently (1) track the exact Markov blanket for each random variable and (2) maintain the minimum set of random variables necessary to evaluate the query and the evidence (the dynamic slice [Agrawal and Horgan, 1990], or equivalently the minimum self-supporting partial world [Milch et al., 2005a]).

• Provide efficient memory management and fast data structures for Monte Carlo sampling algorithms to avoid interpretive overhead at runtime.
For the first goal, we propose a novel framework, FIDS (Framework for Incrementally updating Dynamic Slices), for generating the inference code (P_I in Fig. 1). For the second goal, we carefully choose data structures in the target C++ code (P_T).

For convenience, r.v. is short for random variable. We demonstrate examples of compiled code in the following discussion: the inference code (P_I) generated by FIDS in Sec. 4.1 is in pseudo-code, while the target code (P_T) produced by Swift shown in Sec. 4.2 is in C++. Due to limited space, we only show the detailed transformation rules for the adaptive contingency updating (ACU) technique and omit the others.

4.1 Optimizations in FIDS

Our discussion in this section focuses on LW and PMH, although FIDS can also be applied to other algorithms. FIDS includes three optimizations: dynamic backchaining (DB), adaptive contingency updating (ACU), and reference counting (RC). DB is applied to both LW and PMH, while ACU and RC are specific to PMH.

Dynamic backchaining

Dynamic backchaining (DB) [Milch et al., 2005a] constructs a dynamic slice incrementally in each LW iteration and samples only those variables necessary at runtime. DB in Swift is an example of compiling lazy evaluation (also similar to the Prolog compiler): in each iteration, DB backtracks from the evidence and query and samples an r.v. only when its value is required during inference.

For every r.v. declaration random T X ~ C_X; in the input PP, FIDS generates a getter function get_X(), which is used to (1) sample a value for X from its declaration C_X and (2) memoize the sampled value for later references in the current iteration. Whenever an r.v. is referred to during inference, its getter function is invoked.

DB is the fundamental technique for Swift. The key insight is to replace dependency look-ups in some interpretive PP data structure with direct machine address/instruction accessing and branching in the target executable file.

For example, the inference code for the getter function of x(d) in the ∞-GMM model (Fig. 3) is shown below.

double get_x(Data d) {
  // some code memoizing the sampled value
  memoization;
  // if not sampled, sample a new value
  val = sample_gaussian(get_mu(get_z(d)), 1);
  return val; }

memoization denotes some pseudo-code snippet for performing memoization. Since get_z(·) is called before get_mu(·), only those mu(c) corresponding to non-empty clusters will be sampled. Notably, with the inference code above, no explicit dependency look-up is ever required at runtime to discover those non-empty clusters.

When sampling a number variable, we need to allocate memory for the associated variables; for example, in the urn-ball model (Fig. 2), we need to allocate memory for color(b) after sampling #Ball. The corresponding generated inference code is shown below.

int get_num_Ball() {
  // some code for memoization
  memoization;
  // sample a value
  val = sample_uniformInt(1, 20);
  // some code allocating memory for color
  allocate_memory_color(val);
  return val; }

allocate_memory_color(val) denotes some pseudo-code segment for allocating val chunks of memory for the values of color(b).
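For completeness, here is a minimal runnable sketch (our own illustration, not Swift's generated code) of how the companion getters for the ∞-GMM could fit together in C++: get_x(d) lazily pulls in get_z(d), which pulls in get_num_Cluster() and then get_mu(c), so only non-empty clusters ever receive a sampled mean. The Memo tables, the iteration counter, and the +1 guard on the Poisson draw are assumptions made for the sketch; the counter-based memoization anticipates the implementation described in Sec. 4.2.

// Memoized-getter sketch for the infinite GMM of Fig. 3 (illustrative only).
#include <cstdio>
#include <random>
#include <vector>

std::mt19937 rng(0);
int iter = 0;  // current iteration; bumping it invalidates all memoized values

struct Memo { std::vector<double> val; std::vector<int> mark; };
Memo mu_m, z_m, x_m;                       // per-variable memo tables
int num_cluster_val = 0, num_cluster_mark = -1;

int get_num_Cluster() {
    if (num_cluster_mark != iter) {
        num_cluster_mark = iter;
        // +1 keeps the sketch simple by avoiding an empty cluster set
        num_cluster_val = std::poisson_distribution<int>(4)(rng) + 1;
        mu_m.val.resize(num_cluster_val);
        mu_m.mark.assign(num_cluster_val, -1);   // allocate memory for mu(c)
    }
    return num_cluster_val;
}

double get_mu(int c) {                     // cluster mean prior (Fig. 3, line 4)
    if (mu_m.mark[c] != iter) {
        mu_m.mark[c] = iter;
        mu_m.val[c] = std::normal_distribution<double>(0.0, 10.0)(rng);
    }
    return mu_m.val[c];
}

int get_z(int d) {                         // cluster assignment ~ UniformChoice
    if (z_m.mark[d] != iter) {
        z_m.mark[d] = iter;
        z_m.val[d] = std::uniform_int_distribution<int>(0, get_num_Cluster() - 1)(rng);
    }
    return (int)z_m.val[d];
}

double get_x(int d) {                      // data ~ Gaussian(mu(z(d)), 1.0)
    if (x_m.mark[d] != iter) {
        x_m.mark[d] = iter;
        x_m.val[d] = std::normal_distribution<double>(get_mu(get_z(d)), 1.0)(rng);
    }
    return x_m.val[d];
}

int main() {
    const int D = 20;
    z_m.val.resize(D); z_m.mark.assign(D, -1);
    x_m.val.resize(D); x_m.mark.assign(D, -1);
    for (iter = 0; iter < 3; ++iter)       // three LW-style iterations
        std::printf("iter %d: x(D[1]) = %f\n", iter, get_x(1));
    return 0;
}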
Reference Counting

Reference counting (RC) generalizes the idea of DB to incrementally maintain the dynamic slice in PMH. RC is an efficient compilation strategy for the interpretive BLOG to dynamically maintain references to variables and exclude those without any references from the current possible world, with minimal runtime overhead. RC is also similar to dynamic garbage collection in the programming language community.

For an r.v. being tracked, say X, RC maintains a reference counter cnt(X) defined by cnt(X) = |Ch_w(X)|, where Ch_w(X) denotes the children of X in the possible world w. When cnt(X) becomes zero, X is removed from w; when cnt(X) becomes positive, X is instantiated and added back to w. This procedure is performed recursively.

Tracking references to every r.v. might cause unnecessary overhead. For example, for classical Bayes nets, RC never excludes any variables. Hence, as a trade-off, Swift only counts references in open-universe models, particularly to those variables associated with number variables (e.g., color(b) in the urn-ball model).

Take the urn-ball model as an example. When resampling drawn(d) and accepting the proposed value v, the generated code for accepting the proposal will be

void inc_cnt(X) {
  if (cnt(X) == 0) W.add(X);
  cnt(X) = cnt(X) + 1; }
void dec_cnt(X) {
  cnt(X) = cnt(X) - 1;
  if (cnt(X) == 0) W.remove(X); }
void accept_value_drawn(Draw d, Ball v) {
  // code for updating dependencies omitted
  dec_cnt(color(val_drawn(d)));
  val_drawn(d) = v;
  inc_cnt(color(v)); }

The functions inc_cnt(X) and dec_cnt(X) update the references to X. The function accept_value_drawn() is specialized to r.v. drawn(d): Swift analyzes the input program and generates specialized code for different variables.
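A minimal runnable rendering of the same bookkeeping for the urn-ball model is sketched below (our own illustration, not Swift's output): counters are indexed by ball id, and a color(b) variable stays in the possible world only while some drawn(d) currently refers to ball b. The data layout and helper names are assumptions.

// Reference-counting sketch for color(b) in the urn-ball model (illustrative only).
#include <cstdio>
#include <set>
#include <vector>

const int NUM_BALLS = 20;
std::vector<int> cnt(NUM_BALLS, 0);   // cnt[b] = number of references to color(b)
std::set<int> world;                  // balls whose color(b) is currently instantiated
std::vector<int> val_drawn(2, 0);     // current values of drawn(D[0]) and drawn(D[1])

void inc_cnt(int b) {
    if (cnt[b] == 0) world.insert(b); // instantiate color(b) on first reference
    cnt[b] += 1;
}

void dec_cnt(int b) {
    cnt[b] -= 1;
    if (cnt[b] == 0) world.erase(b);  // drop color(b) once unreferenced
}

// Accepting a PMH proposal that moves drawn(d) from its old ball to ball v.
void accept_value_drawn(int d, int v) {
    dec_cnt(val_drawn[d]);
    val_drawn[d] = v;
    inc_cnt(val_drawn[d]);
}

int main() {
    inc_cnt(val_drawn[0]);            // initial world: both draws point at ball 0
    inc_cnt(val_drawn[1]);
    accept_value_drawn(1, 7);         // resample drawn(D[1]) to ball 7
    std::printf("instantiated color variables: %zu\n", world.size());  // prints 2
    return 0;
}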
Adaptive contingency updating

The goal of ACU is to incrementally maintain the Markov blanket for every r.v. with minimal effort.

C_X denotes the declaration of r.v. X in the input PP; Par_w(X) denotes the parents of X in the possible world w; w[X ← v] denotes the new possible world derived from w by only changing the value of X to v. Note that deriving w[X ← v] may require instantiating new variables not existing in w due to dependency changes.

Computing Par_w(X) for X is straightforward: execute its generation process, C_X, within w, which is often inexpensive. Hence, the principal challenge for maintaining the Markov blanket is to efficiently maintain a children set Ch(X) for each r.v. X, with the aim of keeping Ch(X) identical to the true set Ch_w(X) for the current possible world w.

With a proposed possible world w[X ← v], it is easy to add all missing dependencies from a particular r.v. U: for U and every V ∈ Par_{w[X←v]}(U), we can verify whether U ∈ Ch(V) and add U to Ch(V) if U ∉ Ch(V). Therefore, the key step becomes tracking the set Δ(w[X ← v]) of variables whose parents in w[X ← v] differ from their parents in w:

Δ(w[X ← v]) = {Y : Par_w(Y) ≠ Par_{w[X←v]}(Y)}.

Precisely computing Δ(w[X ← v]) is very expensive at runtime, but computing an upper bound for Δ(w[X ← v]) does not influence the correctness of the inference code. Therefore, we proceed to find an upper bound that holds for any value v. We call this over-approximation the contingent set Cont_w(X).

Note that for every r.v. U whose dependency changes in Δ(w[X ← v]), X must be reached in the condition of a control statement on the execution path of C_U within w. This implies Δ(w[X ← v]) ⊆ Ch_w(X). One straightforward idea is to set Cont_w(X) = Ch_w(X). However, this leads to too much overhead; for example, Cont(·) should always be empty for models with fixed dependencies.

Another approach is to first find the set of switching variables S_X for every r.v. X, defined by

S_X = {Y : ∃ w, v  Par_w(X) ≠ Par_{w[Y←v]}(X)}.

This brings an immediate advantage: S_X can be statically computed at compile time. However, computing S_X is NP-complete.¹

¹ Since the declaration C_X may contain arbitrary boolean formulas, one can reduce the 3-SAT problem to computing S_X.

Our solution is to derive the approximation Ŝ_X by taking the union of the free variables of if/case conditions, function arguments, and set expression conditions in C_X, and then set

Cont_w(X) = {Y : Y ∈ Ch_w(X) ∧ X ∈ Ŝ_Y}.

Cont_w(X) is a function of the sampling variable X and the current PW w. Likewise, for every r.v. X we maintain a runtime contingent set Cont(X) identical to the true set Cont_w(X) under the current PW w; Cont(X) can be incrementally updated as well.

Back to the original focus of ACU, suppose we are adding the new dependencies in the new PW w′ = w[X ← v]. There are three steps to accomplish the goal: (1) enumerate all the variables V ∈ Cont(X); (2) for all U ∈ Par_{w′}(V), add V to Ch(U); and (3) for all U ∈ Par_{w′}(V) ∩ Ŝ_V, add V to Cont(U). These steps can be repeated in a similar way to remove the vanished dependencies.

Take the ∞-GMM model (Fig. 3) as an example. When resampling z(d), we need to change the dependency of r.v. x(d), since Cont_w(z(d)) = {x(d)} for any w. The generated code for this process is shown below.

void x(d)::add_to_Ch() {
  Cont(z(d)) += x(d);
  Ch(z(d)) += x(d);
  Ch(mu(z(d))) += x(d); }
void z(d)::accept_value(Cluster v) {
  // accept the proposed value v for r.v. z(d)
  // code for updating references omitted
  for (u in Cont(z(d))) u.del_from_Ch();
  val_z(d) = v;
  for (u in Cont(z(d))) u.add_to_Ch(); }

We omit the code for del_from_Ch() for conciseness; it is essentially the same as add_to_Ch().

The formal transformation rules for ACU are demonstrated in Fig. 4. F_c takes in a random variable declaration statement in the BLOG program and outputs the inference code containing the methods accept_value() and add_to_Ch(). F_cc takes in a BLOG expression and generates the inference code inside the method add_to_Ch(). It keeps track of already seen variables in seen, which makes sure that a variable will be added to the contingent set or the children set at most once. Code for removing variables can be generated similarly.

F_c(random T Id([T Id,]*) ~ C) =
  void Id::add_to_Ch() { F_cc(C, Id, {}) }
  void Id::accept_value(T v) {
    for (u in Cont(Id)) u.del_from_Ch();
    Id = v;
    for (u in Cont(Id)) u.add_to_Ch(); }
F_cc(C = Exp, X, seen) = F_cc(Exp, X, seen)
F_cc(C = Dist(Exp), X, seen) = F_cc(Exp, X, seen)
F_cc(C = if (Exp) then C1 else C2, X, seen) =
  F_cc(Exp, X, seen);
  if (Exp) { F_cc(C1, X, seen ∪ FV(Exp)) }
  else { F_cc(C2, X, seen ∪ FV(Exp)) }
F_cc(Exp, X, seen) =
  /* emit a line of code for every r.v. U in the corresponding set */
  ∀ U ∈ (FV(Exp) ∩ Ŝ_X) \ seen : Cont(U) += X;
  ∀ U ∈ FV(Exp) \ seen : Ch(U) += X;

Figure 4: Transformation rules for outputting inference code with ACU. FV(Expr) denotes the free variables in Expr. We assume the switching variables Ŝ_X have already been computed. The rules for number statements and case statements are not shown for conciseness—they are analogous to the rules for random variables and if statements respectively.

Since we maintain the exact children for each r.v. with ACU, the computation of the acceptance ratio α (line 7 in Alg. 1) for PMH can be simplified to

α = min( 1, ( |w| Pr[X = v | w_{-X}] ∏_{U ∈ Ch_{w′}(X)} Pr[U(w′) | w′_{-U}] ) / ( |w′| Pr[X = v′ | w_{-X}] ∏_{V ∈ Ch_w(X)} Pr[V(w) | w_{-V}] ) ).   (2)

Here |w| denotes the total number of random variables existing in w, which can be maintained via RC. Finally, the computation time of ACU is strictly shorter than that of the acceptance ratio α (Eq. 2).

4.2 Implementation of Swift

Lightweight memoization

Here we introduce the implementation details of the memoization code in the getter functions mentioned in Sec. 4.1.

Objects are converted to integers in the target code. Swift analyzes the input PP and allocates static memory for memoization in the target code. For open-universe models, where the number of random variables is unknown, a dynamic table data structure (e.g., vector<int> in C++) is used to reduce the amount of dynamic memory allocations: we only increase the length of the array when the number becomes larger than the capacity.

One potential weakness of directly applying a dynamic table is that, for models with multiple nested number statements (e.g., an aircraft tracking model with multiple aircraft and blips for each one), the memory consumption can be large. In Swift, the user can force the target code to clear all the unused memory every fixed number of iterations via a compilation option, which is turned off by default.

The following code demonstrates the target code for the number variable #Ball and its associated color(b)'s in the urn-ball model (Fig. 2), where iter is the current iteration number.

vector<int> val_color, mark_color;
int get_color(int b) {
  if (mark_color[b] == iter)
    return val_color[b]; // memoization
  mark_color[b] = iter; // mark the flag
  val_color[b] = ...; // sampling code omitted
  return val_color[b]; }
int val_num_B, mark_num_B;
int get_num_Ball() {
  if (mark_num_B == iter) return val_num_B;
  mark_num_B = iter; // mark the flag
  val_num_B = ...; // sampling code omitted
  if (val_num_B > val_color.size()) { // allocate memory
    val_color.resize(val_num_B);
    mark_color.resize(val_num_B); }
  return val_num_B; }

Note that when generating a new PW, by simply increasing the counter iter, all the memoization flags (e.g., mark_color for color(·)) are automatically cleared.

Lastly, in PMH we need to randomly select an r.v. to sample per iteration. In the target C++ code, this process is accomplished via polymorphism, by declaring an abstract class with a resample() method and a derived class for every r.v. in the PP. An example code snippet for the ∞-GMM model is shown below.

class MH_OBJ { public: // abstract class
  virtual void resample() = 0; // resample step
  virtual void add_to_Ch() = 0; // for ACU
  virtual void accept_value() = 0; // for ACU
  virtual double get_likeli() = 0;
}; // some methods omitted
// maintain all the r.v.s in the current PW
std::vector<MH_OBJ*> active_vars;
Supporting new algorithms
We demonstrate that FIDS can be applied to new algorithms
Supporting
new algorithms
by implementing
a translator for the Gibbs sampling [Arora
We
demonstrate
that in
FIDS
can be applied to new algorithms
et al.,
2010] (Gibbs)
Swift.
[Arora
byGibbs
implementing
a translator
forthe
theproposal
Gibbs sampling
is a variant
of MH with
distribution
g(·)
et
2010
Swift.
setal.,
to be
the] (Gibbs)
posteriorindistribution.
The acceptance ratio ↵ is
Gibbs1 iswhile
a variant
of MH with the
proposal
distribution
g(·)
always
the disadvantage
is that
it is only
possible
to
set
to be the
posterior
The acceptance
ratio ↵ is
explicitly
construct
thedistribution.
posterior distribution
with conjugate
always
1 while
is that it is only
possible to
priors. In
Gibbs,the
thedisadvantage
proposal distribution
still constructed
explicitly
constructblanket,
the posterior
with
conjugate
from the Markov
which distribution
again requires
maintaining
priors.
In Gibbs,
the proposal
distribution
still constructed
the children
set. Hence,
FIDS can
be fully is
utilized.
from
the Markov
blanket,
whichpriors
again yield
requires
maintaining
However,
different
conjugate
different
forms
the
children set.
Hence, FIDS
be to
fully
utilized.
of posterior
distributions.
In can
order
support
a variety of
However,
different conjugate
yield different
forms
proposal
distributions,
we need topriors
(1) implement
a conjugacy
of
posterior
distributions.
In posterior
order to distribution
support a variety
of
checker
to check
the form of
for each
proposal
wetime
need(the
to (1)
implement
a conjugacy
r.v. in thedistributions,
PP at compile
checker
has nothing
to do
checker
to check
the form aofposterior
posteriorsampler
distribution
for each
with FIDS);
(2) implement
for every
conr.v.
in the
PPinatthe
compile
(theof
checker
jugate
prior
runtimetime
library
Swift. has nothing to do
with FIDS); (2) implement a posterior sampler for every conjugate
prior in the runtime library of Swift.
5 Experiments
// derived classes for r.v.s in the PP
class Var_mu:public MH_OBJ{public://for mu(c)
double val; // value for the var
double cached_val; //cache for proposal
int mark, cache_mark; // flags
std::set<MH_OBJ*> Ch, Cont; //sets for ACU
double getval(){...}; //sample & memoize
double getcache(){...};//generate proposal
... }; // some methods omitted
std::vector<Var_mu*> mu; // dynamic table
class Var_z:public MH_OBJ{ ... };//for z(d)
Var_z* z[20]; // fixed-size array
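To make point (2) concrete, the following is a minimal sketch of what a posterior sampler in the runtime library could look like for one conjugate pair (a Normal prior on a mean, with Normal likelihood of known variance); the function name and signature are hypothetical and not the actual Swift runtime API.

#include <cmath>
#include <random>
// Conjugate Normal-Normal update: prior N(mu0, sigma0_sq), i.i.d. Normal
// likelihood with known variance sigma_sq, given n observations summing to sum_x.
double sample_post_normal(double mu0, double sigma0_sq, double sigma_sq,
                          double sum_x, int n, std::mt19937 &rng) {
  double prec = 1.0 / sigma0_sq + n / sigma_sq;          // posterior precision
  double mean = (mu0 / sigma0_sq + sum_x / sigma_sq) / prec;
  std::normal_distribution<double> post(mean, std::sqrt(1.0 / prec));
  return post(rng);                                      // one posterior draw
}

The conjugacy checker of point (1) would decide at compile time, from the prior and likelihood forms of each r.v., whether such a closed-form sampler can be emitted.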
Efficient proposal manipulation
Although one PMH iteration only samples a single variable,
generating a new PW may still involve (de-)instantiating an
arbitrary number of random variables due to RC or sampling
a number variable. The general approach commonly adopted
by many PPL systems is to construct the proposed possible
world in dynamically allocated memory and then copy it to
the current world [Milch et al., 2005a; Wingate et al., 2011],
which suffers from significant overhead.
In order to generate and accept new PWs in PMH with
negligible memory management overhead, we extend the
lightweight memoization to manipulate proposals: Swift statically allocates an extra memoized cache for each random
variable, which is dedicated to storing the proposed value for
that variable. During resampling, all the intermediate results
are stored in the cache. When the proposal is accepted, the proposed value is directly loaded from the cache; upon rejection,
no action is needed due to our memoization mechanism.
Here is an example target code fragment for mu(c) in the
1-GMM model.
proposed_vars is an array storing all the variables to be updated if the proposal gets accepted.

std::vector<MH_OBJ*> proposed_vars;
class Var_mu: public MH_OBJ { public: // for mu(c)
  double val;           // value
  double cached_val;    // cache for proposal
  int mark, cache_mark; // flags
  double getcache() {   // propose new value
    if (mark == 1) return val;
    if (cache_mark == iter) return cached_val;
    cache_mark = iter;  // memoization
    proposed_vars.push_back(this);
    cached_val = ...    // sample new value
    return cached_val; }
  void accept_value() {
    val = cached_val;   // accept proposed value
    if (mark == 0) {    // not instantiated yet
      mark = 1;         // instantiate variable
      active_vars.push_back(this);
    } }
  void resample() {     // resample
    proposed_vars.clear();
    mark = 0;           // generate proposal
    double new_val = getcache();
    mark = 1;
    double alpha = ...  // compute acceptance ratio
    if (sample_unif(0,1) <= alpha) {
      for (auto v : proposed_vars)
        v->accept_value(); // accept
      // code for ACU and RC omitted
  } } }; // some methods omitted

5 Experiments
In the experiments, Swift generates target code in C++ with the C++ standard <random> library for random number generation and the armadillo package [Sanderson, 2010] for matrix computation. The baseline systems include BLOG (version 0.9.1), Figaro (version 3.3.0), Church (webchurch), Infer.NET (version 2.6), BUGS (winBUGS 1.4.3) and Stan (cmdStan 2.6.0).

5.1 Benchmark models
We collect a set of benchmark models [2] which exhibit various capabilities of a PPL (Tab. 1), including the burglary model (Bur), the hurricane model (Hur), the urn-ball model (U-B(x,y) denotes the urn-ball model with at most x balls and y draws), the TrueSkill model [Herbrich et al., 2007] (T-K), the 1-dimensional Gaussian mixture model (GMM, 100 points with 4 clusters) and the infinite GMM model (1-GMM). We also include a real-world dataset: handwritten digits [LeCun et al., 1998] using the PPCA model [Tipping and Bishop, 1999]. All the models can be downloaded from the GitHub repository of BLOG.

5.2 Speedup by FIDS within Swift
We first evaluate the speedup contributed by each of the three optimizations in FIDS individually.
Dynamic Backchaining for LW
We compare the running time of the following versions of code: (1) the code generated by Swift (“Swift”); (2) the modified compiled code with DB manually turned off (“No DB”), which sequentially samples all the variables in the PP; (3) the hand-optimized code without unnecessary memoization or recursive function calls (“Hand-Opt”).
[2] Most models are simplified to ensure that all the benchmark systems can produce an output in reasonably short running time. All of these models can be enlarged to handle more data without adding extra lines of BLOG code.
Table 1: Features of the benchmark models. R: continuous scalar or vector variables. CT: context-specific dependencies (contingency). CC: cyclic dependencies in the PP (while in any particular PW the dependency structure remains acyclic). OU: open-universe. CG: conjugate priors.

Figure 5: Running time (s) of PMH in Swift with and without RC on urn-ball models (U-B(10,5) through U-B(40,40)) for 20 million samples. RC yields up to 2.3x speedup.
Alg.          Bur       Hur       U-B(20,2)   U-B(20,8)
Running Time (s): Swift v.s. No DB
No DB         0.768     1.288     2.952       5.040
Swift         0.782     1.115     1.755       4.601
Speedup       0.982     1.155     1.682       1.10
Running Time (s): Swift v.s. Hand-Optimized
Hand-Opt      0.768     1.099     1.723       4.492
Swift         0.782     1.115     1.755       4.601
Overhead      1.8%      1.4%      1.8%        2.4%
Calls to Random Number Generator
No DB         3×10^7    3×10^7    1.35×10^8   1.95×10^8
Swift         3×10^7    2.0×10^7  4.82×10^7   1.42×10^8
Calls Saved   0%        33.3%     64.3%       27.1%

Table 2: Performance on LW with 10^7 samples for Swift, the version without DB and the hand-optimized version.
Alg.          Bur       Hur       U-B(20,10)  U-B(40,20)
Running Time (s): Swift v.s. Static Dep
Static Dep    0.986     1.642     4.433       7.722
Swift         0.988     1.492     3.891       4.514
Speedup       0.998     1.100     1.139       1.711
Calls to Likelihood Functions
Static Dep    1.8×10^7  2.7×10^7  1.5×10^8    3.0×10^8
Swift         1.8×10^7  1.7×10^7  4.1×10^7    5.5×10^7
Calls Saved   0%        37.6%     72.8%       81.7%

Table 3: Performance of PMH in Swift and the version utilizing static dependencies on benchmark models.
We measure the running time for all three versions, as well as the number of calls to the random number generator for “Swift” and “No DB”. The results are summarized in Tab. 2. The overhead due to memoization compared against the hand-optimized code is less than 2.4%. We can further notice that the speedup is proportional to the number of calls saved.

Adaptive Contingency Updating for PMH
We compare the code generated by Swift with ACU (“Swift”) and the manually modified version without ACU (“Static Dep”), and measure the number of calls to the likelihood function in Tab. 3. The version without ACU (“Static Dep”) demonstrates the best efficiency that can be achieved via compile-time analysis without maintaining dependencies at runtime. This version statically computes an upper bound of the children set by Ĉh(X) = {Y : X appears in C_Y} and again uses Eq. (2) to compute the acceptance ratio. Thus, for models with fixed dependencies, “Static Dep” should be faster than “Swift”.
In Tab. 3, “Swift” is up to 1.7x faster than “Static Dep”, thanks to up to an 81% reduction in likelihood function evaluations. Note that for the model with fixed dependencies (Bur), the overhead of ACU is almost negligible: 0.2%, due to traversing the hashmap to access the child variables.
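To see where the saved likelihood evaluations come from, the fragment below is an illustrative sketch (Ch_dynamic and Ch_static are hypothetical names, and only the child-likelihood factor of the acceptance ratio is shown): ACU iterates over the children of the resampled variable in the current possible world, while “Static Dep” must iterate over the compile-time superset.

// child-likelihood factor for a resampled variable X
double log_child_term = 0;
for (MH_OBJ* y : Ch_dynamic)   // ACU: children of X in the current PW only
  log_child_term += std::log(y->get_likeli());
// "Static Dep" would loop over the superset Ch_static instead, paying a
// get_likeli() call even for variables that do not currently depend on X.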
5.3 Reference Counting for PMH
RC only applies to open-universe models in Swift. Hence we
focus on the urn-ball model with various model parameters.
The urn-ball model represents a common model structure appearing in many applications with open-universe uncertainty.
This experiment reveals the potential speedup by RC for real-world problems with similar structures. RC achieves greater
speedup when the number of balls and the number of observations become larger.
We compare the code produced by Swift with RC (“Swift”)
and RC manually turned off (“No RC”). For “No RC”, we
traverse the whole dependency structure and reconstruct the
dynamic slice every iteration. Fig. 5 shows the running time
of both versions. RC is indeed effective – leading to up to
2.3x speedup.
Swift against other PPL systems
We compare Swift with other systems using LW, PMH and
Gibbs respectively on the benchmark models. The running
time is presented in Tab. 4. The speedup is measured between
Swift and the fastest one among the rest. An empty entry
means that the corresponding PPL fails to perform inference
on that model. Though these PPLs have different host languages (C++, Java, Scala, etc.), the performance difference
resulting from host languages is within a factor of 2 (see http://benchmarksgame.alioth.debian.org/).
5.4 Experiment on real-world dataset
We use Swift with Gibbs to handle probabilistic principal component analysis (PPCA) [Tipping and Bishop, 1999], with real-world data: 5958 images of digit “2” from the handwritten digits dataset (MNIST) [LeCun et al., 1998]. We compute 10 principal components for the digits. The training and testing sets include 5958 and 1032 images respectively, each with 28x28 pixels and pixel values rescaled to [0, 1].
Since most benchmark PPL systems are too slow to handle this amount of data, we compare Swift against two other scalable PPLs, Infer.NET and Stan, on this model. Both PPLs have compiled inference and are widely used for real-world applications. Stan uses HMC as its inference algorithm. For Infer.NET, we select the variational message passing algorithm (VMP), which is Infer.NET's primary focus. Note that HMC and VMP are usually favored for fast convergence.
Stan requires a tuning process before it can produce samples. We ran Stan with 0, 5 and 9 tuning steps respectively [4]. We measure the perplexity of the generated samples over test images w.r.t. the running time in Fig. 6, where we also visualize the produced principal components. For Infer.NET, we consider the mean of the approximation distribution.
Swift quickly generates visually meaningful outputs with around 10^5 Gibbs iterations in 5 seconds. Infer.NET takes 13.4 seconds to finish the first iteration [5] and converges to a result with the same perplexity after 25 iterations and 150 seconds. The overall convergence of Gibbs w.r.t. the running time significantly benefits from the speedup by Swift.
For Stan, its no-U-turn sampler [Homan and Gelman, 2014] suffers from significant parameter tuning issues. We also tried to manually tune the parameters, which does not help much. Nevertheless, Stan does work for the simplified PPCA model with 1-dimensional data. Although further investigation is still needed, we conjecture that Stan is very sensitive to its parameters in high-dimensional cases. Lastly, this experiment also suggests that parameter-robust inference engines, such as Swift with Gibbs and Infer.NET with VMP, would be preferred in practice when possible.

[4] We also ran Stan with 50 and 100 tuning steps, which took more than 3 days to finish 130 iterations (including tuning). However, the results are almost the same as with 9 tuning steps.
[5] Gibbs by Swift samples a single variable while Infer.NET processes all the variables per iteration.
PPL          Bur     Hur     U-B(20,10)  T-K     GMM     1-GMM
LW with 10^6 samples
Church       9.6     22.9    179         57.4    1627    1038
Figaro       15.8    24.7    176         48.6    997     235
BLOG         8.4     11.9    189         49.5    998     261
Swift        0.04    0.08    0.54        0.24    6.8     1.1
Speedup      196     145     326         202     147     214
PMH with 10^6 samples
Church       12.7    25.2    246         173     3703    1057
Figaro       10.6    59.6    -           17.4    151     62.2
BLOG         6.7     18.5    30.4        38.7    72.5    68.3
Swift        0.11    0.15    0.4         0.32    0.76    0.60
Speedup      61      121     75          54      95      103
Gibbs with 10^6 samples
BUGS         87.7    -       -           -       84.4    -
Infer.NET    1.5     -       -           -       77.8    -
Swift        0.12    0.19    0.34        1       0.42    0.86
Speedup      12      1       -           -       185     1

Table 4: Running time (s) of Swift and other PPLs on the benchmark models using LW, PMH and Gibbs.
Figure 6: Log-perplexity w.r.t. running time (s) on the PPCA
model with visualized principal components. Swift converges
faster.
6 Conclusion and Future Work
We have developed Swift, a PPL compiler that generates
model-specific target code for performing inference. Swift
uses a dynamic slicing framework for open-universe models
to incrementally maintain the dependency structure at runtime. In addition, we carefully design data structures in the
target code to avoid dynamic memory allocations when possible. Our experiments show that Swift achieves orders of
magnitude speedup against other PPL engines.
The next step for Swift is to support more algorithms, such
as SMC [Wood et al., 2014], as well as more samplers, such
as the block sampler for (nearly) deterministic dependencies [Li et al., 2013]. Although FIDS is a general framework
that can be naturally extended to these cases, there are still
implementation details to be studied.
Another direction is partial evaluation, which allows the
compiler to reason about the input program to simplify it.
Shah et al. [2016] propose a partial evaluation framework
for the inference code (PI in Fig. 1). It is very interesting to
extend this work to the whole Swift pipeline.
Acknowledgement
We would like to thank our anonymous reviewers, as well
as Rohin Shah for valuable discussions. This work is supported by the DARPA PPAML program, contract FA8750-14-C-0011.
References
[Agrawal and Horgan, 1990] Hiralal Agrawal and Joseph R Horgan. Dynamic program slicing. In ACM SIGPLAN Notices,
volume 25, pages 246–256. ACM, 1990.
[Arora et al., 2010] Nimar S. Arora, Rodrigo de Salvo Braz,
Erik B. Sudderth, and Stuart J. Russell. Gibbs sampling in
open-universe stochastic languages. In UAI, pages 30–39.
AUAI Press, 2010.
[Chaganty et al., 2013] Arun Chaganty, Aditya Nori, and Sriram
Rajamani. Efficiently sampling probabilistic programs via program analysis. In AISTATS, pages 153–160, 2013.
[Cytron et al., 1989] Ron Cytron, Jeanne Ferrante, Barry K
Rosen, Mark N Wegman, and F Kenneth Zadeck. An efficient method of computing static single assignment form. In
Proceedings of the 16th ACM SIGPLAN-SIGACT symposium
on Principles of programming languages, pages 25–35. ACM,
1989.
[Minka et al., 2014] T. Minka, J.M. Winn, J.P. Guiver, S. Webster, Y. Zaykov, B. Yangel, A. Spengler, and J. Bronskill. Infer.NET 2.6, 2014.
[Nori et al., 2014] Aditya V Nori, Chung-Kil Hur, Sriram K Rajamani, and Selva Samuel. R2: An efficient MCMC sampler
for probabilistic programs. In AAAI Conference on Artificial
Intelligence, 2014.
[Pfeffer, 2001] Avi Pfeffer. IBAL: A probabilistic rational programming language. In Proc. 17th IJCAI, pages 733–740.
Morgan Kaufmann Publishers, 2001.
[Pfeffer, 2009] Avi Pfeffer. Figaro: An object-oriented probabilistic programming language. Charles River Analytics Technical Report, 2009.
[Plummer, 2003] Martyn Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In
Proceedings of the 3rd International Workshop on Distributed
Statistical Computing, volume 124, page 125. Vienna, 2003.
[Ritchie et al., 2016] Daniel Ritchie, Andreas Stuhlmüller, and
Noah D. Goodman. C3: Lightweight incrementalized MCMC
for probabilistic programs using continuations and callsite
caching. In AISTATS 2016, 2016.
[Sanderson, 2010] Conrad Sanderson. Armadillo: An open
source c++ linear algebra library for fast prototyping and computationally intensive experiments. 2010.
[Shah et al., 2016] Rohin Shah, Emina Torlak, and Rastislav
Bodik. SIMPL: A DSL for automatic specialization of inference algorithms. arXiv preprint arXiv:1604.04729, 2016.
[Stan Development Team, 2014] Stan Development Team. Stan
Modeling Language Users Guide and Reference Manual, Version 2.5.0, 2014.
[Tipping and Bishop, 1999] Michael E. Tipping and Chris M.
Bishop. Probabilistic principal component analysis. Journal
of the Royal Statistical Society, Series B, 61:611–622, 1999.
[Tristan et al., 2014] Jean-Baptiste Tristan, Daniel Huang,
Joseph Tassarotti, Adam C Pocock, Stephen Green, and Guy L
Steele. Augur: Data-parallel probabilistic modeling. In NIPS,
pages 2600–2608. 2014.
[Wingate et al., 2011] David Wingate, Andreas Stuhlmueller,
and Noah D Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In International Conference on Artificial Intelligence and
Statistics, pages 770–778, 2011.
[Wood et al., 2014] Frank Wood, Jan Willem van de Meent, and
Vikash Mansinghka. A new approach to probabilistic programming inference. In Proceedings of the 17th International conference on Artificial Intelligence and Statistics, pages 2–46,
2014.
[Wu et al., 2014] Yi Wu, Lei Li, and Stuart J. Russell. BFiT:
From possible-world semantics to random-evaluation semantics in open universe. In Neural Information Processing Systems, Probabilistic Programming workshop, 2014.
[Yang et al., 2014] Lingfeng Yang, Pat Hanrahan, and Noah D.
Goodman. Generating efficient MCMC kernels from probabilistic programs. In AISTATS, pages 1068–1076, 2014.
[Goodman et al., 2008] Noah D. Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. Church: A language for generative models. In UAI,
pages 220–229, 2008.
[Herbrich et al., 2007] Ralf Herbrich, Tom Minka, and Thore
Graepel. Trueskill(tm): A bayesian skill rating system. In
Advances in Neural Information Processing Systems 20, pages
569–576. MIT Press, January 2007.
[Homan and Gelman, 2014] Matthew D Homan and Andrew
Gelman. The no-u-turn sampler: Adaptively setting path
lengths in hamiltonian monte carlo. The Journal of Machine
Learning Research, 15(1):1593–1623, 2014.
[Hur et al., 2014] Chung-Kil Hur, Aditya V. Nori, Sriram K. Rajamani, and Selva Samuel. Slicing probabilistic programs. In
Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14,
pages 133–144, New York, NY, USA, 2014. ACM.
[Kazemi and Poole, 2016] Seyed Mehran Kazemi and David
Poole. Knowledge compilation for lifted probabilistic inference: Compiling to a low-level language. 2016.
[Kucukelbir et al., 2015] Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, and David Blei. Automatic variational inference
in Stan. In NIPS, pages 568–576, 2015.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio,
and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–
2324, 1998.
[Li et al., 2013] Lei Li, Bharath Ramsundar, and Stuart Russell.
Dynamic scaled sampling for deterministic constraints. In AISTATS, pages 397–405, 2013.
[Lunn et al., 2000] David J. Lunn, Andrew Thomas, Nicky Best,
and David Spiegelhalter. WinBUGS - a Bayesian modelling
framework: Concepts, structure, and extensibility. Statistics
and Computing, 10(4):325–337, October 2000.
[Mansinghka et al., 2013] Vikash K. Mansinghka, Tejas D.
Kulkarni, Yura N. Perov, and Joshua B. Tenenbaum. Approximate Bayesian image interpretation using generative probabilistic graphics programs. In NIPS, pages 1520–1528, 2013.
[McAllester et al., 2008] David McAllester, Brian Milch, and
Noah D Goodman. Random-world semantics and syntactic independence for expressive languages. Technical report, 2008.
[Milch and Russell, 2006] Brian Milch and Stuart J. Russell.
General-purpose MCMC inference over relational structures.
In UAI. AUAI Press, 2006.
[Milch et al., 2005a] Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L. Ong, and Andrey Kolobov.
BLOG: Probabilistic models with unknown objects. In IJCAI,
pages 1352–1359, 2005.
[Milch et al., 2005b] Brian Milch, Bhaskara Marthi, David Sontag, Stuart Russell, Daniel L. Ong, and Andrey Kolobov. Approximate inference for infinite contingent Bayesian networks. In Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 2005.