A Novel Inference Algorithm On ...

A Novel Inference Algorithm On Graphical Model
by
Yewen Pu
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
wU
0
-
z.-
Masters of Science in Computer Science and Engineering
CJ)
CI)
at the
LLI (iuo
<0
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
-Ii01
February 2015
@ Massachusetts Institute of Technology 2015. All rights reserved.
Signature redacted
.................................. ..
Author .........
Department of Electrical Engineering and Computer Science
January 30, 2015
Signature redacted
Certified by
Armando Solar-Lezama
Associate Professor
Thesis Supervisor
Signature redacted
.....................
Accepted by....
)
LC
)OProfessor Leslie A. Kolodziejski
Chair, Department Committee on Graduate Theses
1w)
A Novel Inference Algorithm On Graphical Model
by
Yewen Pu
Submitted to the Department of Electrical Engineering and Computer Science
on Jan 30, 2015, in partial fulfillment of the
requirements for the degree of
Masters of Science in Computer Science and Engineering
Abstract
We present a framework for approximate inference that, given a factor graph and
a subset of its variables, produces an approximate marginal distribution over these
variables with bounds. The factors of the factor graph are abstracted as as piecewise
polynomial functions with lower and upper bounds, and a variant of the variable elimination algorithm solves the inference problem over this abstraction. The resulting
distributions bound quantifies the error between it and the true distribution. We also
give a set of heuristics for improving the bounds by further refining the binary space
partition trees.
Thesis Supervisor: Armando Solar-Lezama
Title: Associate Professor
3
4
Acknowledgments
A list:
Thanks Armando, my advisor, for guidance and mental therapies and kind words.
Thanks Zenna, my labmate, for asking "but what does it mean?"
Thanks Rohit, my labmate, for helps with algorithms and speaking spanish.
Thanks Thomas Gregoire, Alex Townsend, Will Cuello, for math.
Thanks Meng, whom in hours of desperation, developed the skill to cook.
5
6
Contents
Introduction Overview . . . . . .
13
1.2
Bayesian Inference
. . . . . . . .
13
1.3
Probablistic Programming
. . . .
15
1.4
Related Works . . . . . . . . . . .
16
1.4.1
Sampling
. . . . . . . . .
16
1.4.2
Variational Inference
. . .
17
1.4.3
Variable Elimination
. . .
17
Our Approach . . . . . . . . . . .
17
.
.
.
.
.
.
.
.
1.1
1.5
19
. . . . . . . . . . . . . . . . . . .
19
. . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
19
Factors. . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
22
. . . . . . . .
. . . . . . . . . . . . . . . . . . .
25
.
. . . . . . . . . . . . . . . . . . .
25
. . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
26
28
30
Algorithm Overview
2.2
Factor Graph
Distribution Abstraction
2.3.2
Patch
2.3.3
Abstract Factor . . . . . .
.
.
2.3.1
.
2.4
Factor Abstraction
.... ... ... ... ... ...
.
2.3
.
2.2.1
.
2.1
.
. . . . . . .
.
The Algorithm
... .... ... ... .. ....
2.4.1
Variable Elimination
.
2
13
Introduction
.... ... ... ... ... ...
30
2.4.2
Patch for the original factors . . . . . . . . . . . . . . . . . . .
31
2.4.3
Patch for multiplication
... .... ... ... ... ...
31
2.4.4
Patch for integration
Operations on Factors
. . . . . .
. . .
.
.
1
32
7
36
.
. . . . . . . . . . . .
Measurement of Error
. . . . . . . .
. . . . . . . . . . . .
36
2.5.2
The Splitting Heuristic . . . . . . . .
. . . . . . . . . . . .
37
2.5.3
Optimization of the Heuristic
. . . .
. . . . . . . . . . . .
38
Polynomial Bounds . . . . . . . . . . . . . .
. . . . . . . . . . . .
40
2.6.1
Polynomial Approximation . . . . . .
. . . . . . . . . . . .
40
2.6.2
Approximate Bounding . . . . . . . .
. . . . . . . . . . . .
41
2.6.3
Exact Bounding . . . . . . . . . . . .
. . . . . . . . . . . .
42
2.6.4
Bounding Polynomials Exactly
. . .
. . . . . . . . . . . .
42
2.6.5
Bounding Potential Functions Exactly
. . . . . . . . . . . .
43
.
.
.
.
.
.
2.5.1
.
2.6
Refinement Heuristic . . . . . . . . . . . . .
.
2.5
3 Results
45
Simple Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . .
45
3.2
Backward Inference . . . . . . . . . . . . . . . . . . . . . . . . . . .
46
.
.
3.1
A Potential Function and Distance Functions
8
47
List of Figures
1-1
A code snippet on how one might express our example problem in a
probablistic programming language setting
. . . . . . . . . . . . . .
15
2-1
Visualizing the factor graph of the bus problem
. . . . . . . . . . . .
21
2-2
An example of a partition set Pf and the finest partition part* . . . .
28
2-3
The patch set Pf organized as a BSP to form the abstract factor G
29
2-4
The 1-dimensional abstraction factors G1 and G 2 are being multiplied.
The upper bound for the domain d is computed by multiplying together the polynomials from the smallest coverings of d from each of
the abstract factors . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2-5
33
The 2-dimensional abstract factor GI is integrated. A domain d has
its intersected patches in GI being drawn, for brevity these patches do
. . . . . . . . . . . . . . . . . . . .
not nessesarily come from a BSP
2-6
34
The domain d, a rectangle, has its distance bounded by a and b, deriving from the bounding sphere . . . . . . . . . . . . . . . . . . . . .
44
3-1
Visualizing the result factor of propagating a simple constraint . . . .
45
3-2
Visualizing the relationship between y and x after making an observation 46
9
10
List of Tables
11
12
Chapter 1
Introduction
1.1
Introduction Overview
In this master thesis, we explore a new method of performing bayesian inference on
a factor graph. It is the first algorithm of its kind to be able to perform inference
on a continuously valued factor graph with sound bounds.
At each stage of the
algorithm's execution, an under and over approximation of the true distribution is
computed, guaranteeing the true distribution to lay between these bounds.
1.2
Bayesian Inference
Bayesian inference is a set of techniques that allow one to model a phenomenon by
changing the hypothesis as one acquires more observations
tions in various disciplines of science such as biology
[6],
[4].
It has found applica-
geology [2], and physics [11].
Modeling a phenomenon in a bayesian setting gives a systematic framework on how
to robustly adjust the hypothesis with both prior beliefs and additional observations.
Formally, bayesian inference addresses the following question: Given some prior
beliefs about a hypothesis x, expressed as a prior probability p(x), a likelihood function p(yI'), modeling how likely one is to make an observation y depending on the
hypothesis x, and an actual observation y, we wish to conclude the posterior probability p(xly). The posterior probability is interpreted as our updated beliefs about
13
our hypothesis x taken into account the observations. The posterior probability can
be obtained by applying bayes rule:
p~l)=
p~ ~~)(1.1)
P(I)-p(ylx)p(x)
p(y)
Well use the following example to guide our discussion: Suppose we have a bag
of 3 coins, with two fair coins and one unfair coin that has a _a_
10 chance of showing a
head. We grab a coin from the bag at random, and flips the coin. If we observe the
flip to be a tail, what is the probability that we have chosen the fair coin?
From the problem statement, we know the prior distribution is as follows:
2
-
p(f air) =
-
1(1.2)
p(unf air) =
And the likelihood as follows:
1
p(taillfair)
2
2
p(tailUnTJair) =
p(head unf air) =
(1.3)
-
1
-
-
p(head f air)
We first compute p(tail), then apply bayes rule, to obtain the solution as follows:
p(tail) = p(tailIfair)p(fair) + p(tail unf air)p(unfair) = -1
p(tail f air)p(fair)
p(tail)
_
10
11
30
(1.4)
Note that in computing the posterior probability, the quantity p(tail) can be seen
as a normalization factor, which can be computed in the end. In practice, one often
adopt the proportionality version of the bayes rule:
p(xIy) cX p(y x)p(x)
14
(1.5)
and leave the normalization until the very end of the computation.
1.3
Probablistic Programming
Although one can always attempt solve the inference problems by hand, for larger and
more complex instances, it can be helpful to express the problems in a probabilistic
programming language. A probabilistic programming language is a set of programming language constructs that allows the user to formally model probabilistic (and
deterministic) relationships between objects of interest, and make queries about certain aspects of the model
[5].
A probabilistic programming language usually comes
with it an underlying algorithm, which solves for the query posed by the user.
Our example problem above may be expressed in a probabilistic programming
language as follows:
fair = Ber(2/3)
outcome = (if
(fair)
Ber (1/2)
e ls e
Ber (9/10)
end)
observe(outcome = false)
query-distribution(fair)
Figure 1-1: A code snippet on how one might express our example problem in a
probablistic programming language setting
Here, the function Ber models the bernoulli distribution with the probability
of returning true, which we can interpret as a coin has landed on its head.
variable fair thus denote the probability of us picking a fair coin.
The
The outcome
of our experiment outcome depends on if we have picked a fair coin.
If we have
chosen a fair coin, we can get a head with probability I, denoted by another bernoulli
distribution Ber(1/2), and if we have chosen an unfair coin, we can get a head with
probability -.
The observe function denote the outcome of our experiment is a tail,
15
and by querying the probability distribution of fair, we can answer the question
pr(f air tail).
Once the problem is modeled, the probablistic programming framework needs
to solve for the query posed by the user. There are two main approaches to solving the query, the generative/sampling based approach and the variational inference
approach. We will discuss these approaches briefly to give a context of where our
techniques fit in.
1.4
1.4.1
Related Works
Sampling
Conceptually, a sampling based inference algorithm works by executing the program
and collect traces of the execution. For instance, by executing the program in our
example, we might get a trace (fair = true, outcome = true). The traces are then
filtered via rejection sampling, discarding the traces that contradicts with our observation. For instance, the trace (fair = false, outcome = true) will be discarded since
it does not satisfy the observation that the outcome is a tail. The remaining traces
are then counted to approximate the queried probability. In this setting, the more
samples one can drawn, the closer the estimation it is to the true probability. When
the program size becomes large, however, one must adopt a more efficient sampling
algorithm than rejection sampling, such as MCMC, which moves from one valid sample state to another with a proposal distribution to avoid unnecessary rejections [1].
A MCMC algorithm provides a sample distribution that is guaranteed to converge to
the desired distribution if the algorithm is executed for sufficiently large number of
iterations. A key challenge to the sampling approach is the measurement of quality,
as there is difficult to bound the closeness between the sample distribution and the
true distribution at each iteration of the algorithm, as convergence is only guaranteed
16
in the limit.
1.4.2
Variational Inference
In contrast to the sampling approach which approximate the queried distribution,
the variational inference approach first simplifies the distribution to a simpler form,
then computes a locally optimal, symbolic solution of this simpler distribution [3]. In
the variational setting, the form of the distributions are restricted, for instance to the
family of exponential distributions with conjugate priors, and the posterior distribution is assumed to be factorizable. These assumptions then allows for fast symbolic
computations of the posterior distribution. In our example, we might approximate
the Ber distribution with a beta distribution, and approximate the if expression
with a mixture of gaussians. The challenge of the variational method is thus, while
able to compute the posterior quickly, there is no guarantee how close the computed
distribution is from the true distribution.
1.4.3
Variable Elimination
In the case where the variables are discrete, one can perform the variable elimination
algorithm, successively summing and multiplying factors until only the variables of
the query remains [12]. This algorithm can be extended to handle large amount of
variables by approximating the multiplication step, integrating part of the variables
away before the full product space is explored
1.5
[7].
Our Approach
The starting point for our approach is the general variable elimination algorithm
on factor graphs [12].
This algorithm works over the discrete domain, summing
away the variables that are not present in the queried distribution. We work with
17
the continuous analog of this algorithm, replacing summation with integration. The
main challenge of the continuous algorithm is that it requires the computation of
integrals and function products which may not have closed form solutions. Instead of
computing the exact integrals and products, Our algorithm computes these operations
on abstractions representing over-approximations and under-approximations of the
true distributions.
This allows the algorithm to compute sound upper and lower
bounds on the probabilities of arbitrary events. The algorithm achieves precision by
adaptively refining these abstractions to increase precision only in the regions where
additional precision is needed.
In contrast with sampling-based approaches such as MCMC, our algorithm can
provide strong guarantees on the probabilities of particular events, giving a sound
lower and upper bound at each stage of its refinement toward a better precision. In
contrast with variational inference methods, our approach is non-parametric and can
provide guaranees without making assumptions about the forms of the underlying
distribution.
It also provides sound bounds which the variational method cannot
guarantee.
In summary, we present the first algorithm capable of computing sound over and
under-approximations of distributions over multiple variables on a factor graph. We
show that by adaptively refining these abstractions, the algorithm can focus resources
on the regions of the distribution that need it most. As a consequence of providing
sound bounds, our algorithm does suffer from slowness of computation, however, it
does enables us to ask questions previously unanswerable by the sampling or variational approaches. We will also demonstrate that our approach provides a flexible
framework for extension by including user defined distributions and constraints. We
evaluate the algorithm on a number of micro-benchmarks designed to illustrate the
different features of the algorithm as well as two simple case studies: a model of bus
arrival and a model of the monty-hall problem.
18
Chapter 2
The Algorithm
2.1
Algorithm Overview
We will now give a detailed description of our algorithm.
We will first formally
describe what is a factor graph, and what kind of factors are supported in our framework. We will then describe the abstraction on the factors, assuming we can provide
lower and upper bound for the potential functions. We will then describe the variable elimination algorithm on this abstraction. We will then describe our refinement
heuristics for refining the factors for higher precisions. Lastly we will demonstrate
how to provide sound lower and upper bounds for the factors.
2.2
Factor Graph
A factor graph is a form of representation for a probability distribution. Formally, a
factor graph is a tuple (X, F) where X is the set of variables and F is the set of factors
over these variables. A variable x E X can take on values from a domain dom(x) C R,
a domain dom({xo, . .
, Xk}) over a set of variables is the cartesian product of each of
the variables domain. A factor f E F has a set of contributing variables denoted by
var(f). A factor also has an associated potential function over its variables, which
19
maps an assignment of these variables to a real number. Intuitively, the potential
function relates a set of variables by giving a higher weight for events that are more
likely to happen among these variables. For instance, one might use a factor to model
the relationships between the occurance of rain and the observation that the ground
is wet by returning a large value for when both events happen, and a small value
for when one event happens and the other does not. We will denote the potential
function by
f
as well when the context makes it clear. A factor graph is usually
represented as a bipartite graph where the nodes are the variables and the factors,
and edges denote whether a variable contributes to a factor.
A factor graph defines a joint probability distribution p over X as a product of
its potential functions:
p(x 1
r
... xn)=
J
fj(x3 1 ...
(2.1)
xjk)
fj EF
where Z is the normalization constant and
ij1, .
A marginalized distribution over a subset A
. ,
=
E var(fh).
{xo, . .
, Xr}
C X of the variables
is given by integrating away the rest of the variables:
p(x 1 . .. Xr) =
J
p(x1 ...
xn)dz
(2.2)
In this paper, we focus on the following question: given a factor graph (X, F) and a
subset of the variables A C X, produce an approximation of the marginal distribution
over the subset with sound bounds. Note we do not mention conditional probability
because one can express conditional probability in a factor graph setting by adding
new factors to the graph. The following example illustrates how an inference problem
might be expressed this way.
20
Example: Waiting For Bus
Consider the problem of predicting the arrival time
of a bus. Imagine a bus that comes once every hour in an unspecified 20 minutes
interval. Given an observation of the bus at the 10th minutes of the hour, what is
the distribution for the arrival time of the bus over the next hour? We can model
this problem with a factor graph as follows:
i_start = Var([0, 601)
i_end = Var([0, 601)
Factor(Plus(i.start, 20, iend))
Factor(Uniform(i-start, i_end, 10))
nextbus = Var([0, 601)
Factor(Uniform(i-start, i_end, next.bus))
20
i-start
_end
Uniform
Uniform
10
next-bus
Figure 2-1: Visualizing the factor graph of the bus problem
The factor graph generated by this program is shown in figure 3-2 In our example,
the variables are instantiated with the Var keyword and an interval domain, and
the factors are instantiated with the Factor keyword and the potential function to
be associated with this factor. The Plus factor constraints that the start time must
preceed the end time by 20 minutes. This effectively give assignments of ijtart and
iend a higher weight of they differ by 20. The observation of the first bus which
21
came at minute 10 is modeled by adding an additional uniform factor, with 10 as the
output. Also note that we only handle continuous distributions in our framework,
where we can approximate discrete distributions with their continuous analogs.
To predict the arrival time of the next bus, we ask for the distribution on next-bus.
Note that in this particular example we asked for a marginal distribution of a single
variable, but in general our framework can solve for distribution of multiple variables.
2.2.1
Factors
We now formally describe the set of factors that are supported in our algorithm.
Recall factors in a Factor Graph is simply a function mapping from a set of (real
valued) assignments to a non-negative value. There are two kinds of factors that we
support:
" factors that express distributions
" factors that express constraints
Our framework supports several distribution potentials, we list a few here.
Uriiform(a,b, x)
=
U[bra
otherwise
0
e*-L)2
Gauss(i, 0-, X)
if (a < x < b) A (b - a > E)
=
-
o, f2-7
Note that these distribution are no different than their usual definitions, and we
can add additional distributions easily.
Factor with a potential function that describes a constraint requires more efforts
to define. Factors such as equalities or additions does not naturally define a potential
function, but rather, they impose a constraint on the values attainable by the set of
22
variables. One can take the boolean approach into defining their potential functions.
For instance, let plus(y, X1, X2) denotes the constraint that y is the sum of x, and
2,
one might define the potential function for plus as follows:
2
)
PlUs(y, Xi, x
1 if
(2.3)
(y = X1 + X2)
0 otherwise
However, a boolean potential as the one defined above has measure 0 in the
domain which the potential is defined over, which is harmful as we employ a numerical
approach in bounding the potential functions. Therefore, we adopt a relaxed notion
of constraint, based on a distance function.
A distance function d, for a constraint c measures the euclidean distance between
an arbitrary ponit x C dom(f) to the set of points that satisfied the constraint:
dc(x) = miny {norm(x - y) I y satisfies c}
(2.4)
Here, the function norm computes the euclidean distance between the point x and
the point y.
Finding the distance function for an arbitrary constraint is computed by taking
the lagrange multiplier, where the objective function is simply the distance between
a point y and our input point x, and the constraint function that restrict the point
y to satisfy the constraint. In many cases, the satisfying set {y
satisfies c} for a
constraint c can be decomposed into a finitely many simpler shapes, and the distance
function is thus a minimum of the distances from our point x to each of these shapes.
We list several distance functions below, the full list is available in the appendix.
23
Note that when the point x is on satisfying set, the distance is 0.
3(j (y - xi - X2))2
dpus (y, x 1 , x 2 ) =
dequai(y,
x)
=
-
2
(2.5)
(2.6)
With the distance function defined, we define the potential function of a constraint
as follows
fc(x) = Gaussian(p = 0, o-, dc(x))
(2.7)
By mapping the distance function for a constraint d, with a gaussian, we have the
guarantee that as the constraint is satisfied, the value of the potential function is at
its highest, and when the constraint is unsatisfied, instead of setting the potential to
0, we measure "how far" has the point becomes unsatisfied, and gradually decays the
potential function depending on the distance.
We would like to remark that the choice of euclidean distance is not incidental,
as it enables us to compute sound bounds for the potential functions, which we will
discuss in a later section. If a user wish to extend the set of constraint potentials
without the requirement of sound bounds but rather approximate bounds, one may
supply distance function measured in other norms, as long as the distance satisfies
the condition that it is 0 when the constraint is satisfied.
24
2.3
Factor Abstraction
2.3.1
Distribution Abstraction
As stated above, our goal is to compute a marginal distribution, call it g, over a subset
A C X. In general, this problem has no closed form solution, so the common approach
is to either find an approximate distribution g* that is a member of a parameterized
family of functions (such as the exponential family) or to derive a set of samples
through a monte carlo simulation. A problem with both approaches is that, while
they may work well in practice, they can not provide hard bounds on the probability
of a specific event.
To compute a set of hard bounds, our method derives an an abstraction of the
desired distribution g. The abstraction j stands for a set of (unnormalized) distributions that is guaranteed to contain the exact distribution, and is bounded by gmmj
and
Ymax,
where:
S{h
Vx E dom(g), g...i(x) < h(x) < gmax(x)}
Given this abstraction, we can bound the probability of an event E by first defining
two functions:
gElower(x)
min(x)
if x
C
E, gma(x) otherwise
gE,,,,(x) = gmca (x) if x C E, gmin(x) otherwise
One can think of these functions as the most conservative distributions out of q for
25
bounding the probability of E. For instance, the function gE,1 0
is used to compute
the lower bound, and it assumes the smallest density values over E and the largest
density values outside of E. The bound is given as:
fLEE
gE1 0 wer
fXglue
(x)dx
(x) dx
< fr(E
< Pr(E) <
'x
E uPpe.(x)
YXEupper
dx
(x)dx
A nice property to note is that given g, and two events E1 and E2 such that
E1 C E2 , lowerbound(Pr(Ei)) < lowerbound(Pr(E2 )) and upperbound(Pr(E1)) <
upperbound(Pr(E2 )).
It remains that we need a computationally tractable way of expressing gmin and
gmax,
and in our framework, they are given as piecewise polynomials. Although we
would like to remark that the choice of any class of functions which has a closed form
multiplication and integration will be just as well.
We now formally define how the abstraction is implemented in our algorithm,
starting with the basic construct of a Patch. and moving up to an abstraction over a
factor called the AbstractFactor.
2.3.2
Patch
A Patch of a factor
f
It is a triple
is used to define the piece-wise polynomial.
(d, pl, p,), where d is the domain of the patch, and pi and p, are lower and upper
bound polynomials, respectively, such that for any x C d, pl(x) < p(x).
A domain d of a patch is a cartesian product of intervals d = I, x
Ii C dom(xi) Vxi C X. For a particular interval Ij, if xj
that Ij = dom(xj).
...
x I, , where
var(f), then we insist
By defining the domains on all the variables' domain instead
of only the variables that contributes to the factor
f,
we can express intersections
of domains between abstractions of different factors more easily.
26
Intersections of
domains is defined by the usual cartesian intersection:
f
di n d2
ij
i2j
We say patch, is a subset of patchy if
patchi C patch 6 (dom(patchi) n dom(patchj) = don(patchi))
A set of patch can form a partition for the factor f
.
partition({patch1, .. ,patch,} ) e
Pf is the Patch Set of a factor
(y
f,
dom(patchi) = dom(
f ))
A
(p
jni, p =0)
it is a set of patches that has the following
properties:
* there exists a patch in Pf that covers the entire domain of
f
3patch G Pf : dom(f) = dom(patch)
* any patch in Pf belongs to a partition of
f
made of patches in Pf
Vpatch C Pf ]X = {patchj C Pf} s.t. patch, E X A partition(X)
A finest partition part* is a partition of factor
(2.8)
f which
(2.9)
each of its patches cannot
contain a smaller patch in Pf
part* = {patch E Pf I Vpatchj C Pf,patch Z patch}
27
(2.10)
We now give an example illustrating a few of the definitions above. Let
factor that covers the domain [0,16] x [0, 16], let Pf the covering set of
f
f
be a
be {[0, 16] x
[0, 16], [0,16] x [0, 8], [0, 16] x [8, 16], [0,8] x [0, 8], [8,16] x [0, 8]1 and the a finest partition
part*
=
{[0, 16]
x [8, 16], [0, 8] x [0, 8], [8, 16] x [0, 8]}
Patch Set of a factor, P,
[0,1]x[6, 167]
[0,x1 6]x[,,16]
x
[0,16]x[0,11
[0,16]x[0,81
(8,16]x--
0,8](0,B]
The finest partition part*
[
I[08S]
B 16x816]
I 816
Figure 2-2: An example of a partition set Pf and the finest partition part*
We can now give the definition for gm n and gmax precisely as follows:
grin(x)
patch.pW(x) s.t. x E domr(patch) A patch G part*
(2.11)
x c domr(patch) A patch E part*
(2.12)
f, organized
as a binary
gmax(X) = patch.pi(x) s.t.
2.3.3
Abstract Factor
An Abstract Factor Gf consists of a patch set Pf of a factor
space partition BSP. Each patch in the patch set Pf make up the BSP as follows
BSP = Leaf (patch)
I Branch(svar, patch, BSP, BSP)
Here, svar denote the variable which splitted the domain of the branch.
An
example of an abstract factor is given below, using the Pf defined in the previous
section:
28
[0,161x[8,161
[8,16]x
[0,16x0,161
y
[0,81
[0, 16]x[0,8)
-
x
[0,8] x
[0,81
Figure 2-3: The patch set Pj organized as a BSP to form the abstract factor Gf
Organizing the patch set into a BSP makes several sub-routines of our algorithm
possible, such as the computation of intergrals.
of patches.
It also enables effecient look up
However, we will still refer to the patch set Pf as it is whenever the
explaination is more concise.
29
2.4
Operations on Factors
2.4.1
Variable Elimination
One way of solving for the marginal distribution is using the variable elimination (VE)
algorithm[12]. Given a factor graph (X, F) and a subset of the variables A C X, the
VE algorithm successively pick variables x' C X\A, and marginalize it away until
only the variables that remain are in A. The algorithm is given below, in its recursive
formulation:
VE(X, F, A) :
if(X\A)
0:
(2.13)
(2.14)
Hf
(2.15)
fEF
else :
(2.16)
let x' E (X\A)
F
{f E F I x' E var(f)}
f-
JJ1f
(2.17)
(2.18)
(2.19)
f E F.1
f\XI
fx, dx'
(2.20)
X' = X\{x'}
(2.21)
F' = (F\Fxi) U{f\xI}
(2.22)
VE(X', F', A)
(2.23)
However, for a set of arbitrary potential functions, the intergration operator may
not have a, closed form solution. As stated before, one of the main contributions of
our approach is to use abstractions to represent the factors, and we will now show
how one can perform the VE algorithm over the abstract factors. To do so, we must
achieve the following:
* convert the each of the original factors f E F to abstract factors Gf
30
* able to compute pair-wise multiplication of abstract factors G1 x Gf2
s able to compute integration of an abstract factor f, Gfdy
Once the above objectives can be met, the VE algorithm will iteratively reduce
the set of original abstract factors, creating intermediate abstract factors along the
way, until only a single abstract factor, Gresuit, remains. This abstract factor is the
output of the VE algorithm.
Essentially, the computation of an abstract factor Gf is the computation of the
patches in its patch set PF. Once the set of patches are obtained, they can be organized
together easily with some book keeping to form the BSP that makes up the abstract
factor. Therefore, we focus on the computation of a single patch for the remainder of
this section.
2.4.2
Patch for the original factors
The computation of patches for the original factors will be thoroughly examined in a
later chapter. For time time being, we will assume given a domain d we can always
find the appropriate lower bound polynomial pi and the upper bound polynomial pu
that correctly bounds the original potential function over the domain d.
2.4.3
Patch for multiplication
To compute a patch for a multiplication, the algorithm is given as input two abstract
factors GfI and G2, along with a domain d, which it uses to compute the appropriate
lower and upper bound pi and pu over the domain d.
Let's define the smallest cover c* of a domain d in a patch set Pf as follows:
c* = argminpatch{volume(patch) e Pf I d C domairn(patch)}
31
(2.24)
Where volume is simply the volume of d, obtained by multipliing the length of
all its intervals. The intuition of the smallest cover is that we want to find the most
precise patches in Gf 1 and Gf 2 that can be used to derive our bounds, and a smaller
patch will have a higher precision compared to a big one. The requirement that d C
domain(c*) is important because one cannot multiply together piece-wise polynomial
bounds and obtain a single polynomial bound over d. In the implementation of our
algorithm the smallest cover is found by searching the BSP of each of the factors for
a faster lookup.
Let c*' be the best cover of d in the patch set Pfi of Gf i, and let c*2 be the best
cover of d in the patch set Pf 2 . Then, we define the lower and upper bound for the
domain d as follows:
*1
Pi =
*2
c *P1i x C *P
x c*2 TU
PU = c *1
Since multiplication of polynomials gives another polynomial, the resulting p, and
pu are valid lower and upper bounds for the potential function over d. Note that
the bound only holds when the polynomials are non-negative, which is the case since
a distribution function can never be negative. We illustrate the computation of the
upper bound pu in multiplication with the following figure.
2.4.4
Patch for integration
To compute a patch for integration, the algorithm is given as input an abstract factor
Gf along with a variable y which we would like to integrate over. For convinience, we
use Ya(d) and Yb(d) to denote the lower and upper ends of the interval corresponding
to the variable y in a domain d.
32
P1
3
p2
G31:
G2:
3 x q1
G1 x G2
d
Figure 2-4: The 1-dimensional abstraction factors G 1 and G 2 are being multiplied.
The upper bound for the domain d is computed by multiplying together the polynomials from the smallest coverings of d from each of the abstract factors
Unlike integrateion, there may be many patches in Gf that intersects with the
domain d, as the integration reduces the dimension of the factor by 1.
Akin to
multiplication, however, we would also like to find the smallest covering of domain
d within Gf.
The algorithm for finding the smallest covering uses the BSP data
structure, and is given below:
smallestCover(BSP, d, y)
if
BSP == Leaf (Patch)
{Pat ch}
else
if BSP.split-var == y
union(smallestCover(BSP.left, d, y),
smallestCover(BSP.right, d, y))
else
if contains(BSP.left.patch, d)
smallestCover(BSP.left, d, y)
else
if contains(BSP.right.patch, d)
smallestCover(BSP.right, d, y)
else
BSP.Patch
end
end
end
end
end
One should notice that when the split variable is equal to the variable we're
33
integrating over, we can use it to filter which side of the BSP to further explore. This
is a nice property of the BSP: The abstract factor Gf and the abstract factor which
the patch over d is computed over shares the same variable y, and BSP only partition
the space by halving the domain, the intersection of d will only happen at one branch
of the BSP. or not at all, when the split variable is not equal to y. This property
let us find the smallest coverings for the domain d in Gf: If one branch contains d,
go down that branch, if neither branches can contain d, then we should return the
current patch without sub-dividing it. Below is a figure illustrating a smallestCover
for d computed in the integration step.
G1:
inte(G1,y)
d
Figure 2-5: The 2-dimensional abstract factor G1 is integrated. A domain d has its
intersected patches in G1 being drawn, for brevity these patches do not nessesarily
come from a BSP
Let C* be the best covers for d in Gf, consisting of patches in Gf. We define the
lower and upper bound for d as follows:
34
I
patcheC*
Pu =
E
patcheC*
Yb(patch.d)
ya (patch.d)
patch.pl dy
/Yb(patch.d)
patch.pu dy
ya(patch.d)
Since inequalities are preserved under integration, the new bounds pi and pu are
again valid lower and upper bounds for the true potential function over the domain
d.
35
2.5
Refinement Heuristic
Up until this point we demonstrated that one can compute the lower and upper
polynomial bounds of a patch over a domain d, and the resulting patches can be
then organized together to form a BSP of an abstract factor G. However, the task
of choosing a good set of patches remains. It is not difficult to see why a uniform
coverage of patches is inefficient, for the regions where the lower bound is already close
to the upper bound, there is no reason to further partition the patches for a better
precision. To intelligently partition the domain into patches, we needed a heuristic
function that can decide, at each iteration of our refinement step, which patch should
be splitted into smaller patches. We begin by defining a measurement of errors of
a factor, and explain why a some simple heuristics are inadequate. We then define
a measure of imprecision of a patch, and describe a heuristic that selects the patch
based on a ranking of all patches imprecisions. We finish this section by noting several
nice properties of this heuristic and the optimization opportunities these properties
enable.
2.5.1
Measurement of Error
To obtain a heuristic, we must first define an appropriate cost function. We define
the error of an abstract factor G as follows:
volum e(patch)
Err(G) =
(2.25)
patchE part*
Here, part* refers to the finest partition of an abstract factor G, and volume of
a patch is computed by integrating the difference function pu - pi over the entire
domain of the patch.
Naively, one would simply split a patch if it has a big volume, i.e. its lower bound
36
is much smaller than its upper bound. However, this approach does not work for two
reasons:
e The volumes of patches increases multiplicatively, a small differences in the
bounds of one patch can cause a much larger volume in other patches down
stream
* Some patches with big errors do not need to be split, because they are multiplied
by a patch that is close to 0, effectively cancling out the errors
Instead, one should measure directly the effect of a patch on the resulting factor
of the VE algorithm, Gresuit. One approach is to simply attempt to split a patch, and
measure the difference of error between Gresuitunspit and Gresuispit, and select the
patch that leads to the greatest decrease of errors. However, this approach does not
work because the effect of a split of a single patch becomes diluted by the imprecision
of other patches, and the improvement cannot be measured accurately until the other
patches have also been refined and made precise. As a result, this heuristic will favor
patches that has an immediate improvement on the qualities of Gresuit, and is prone
to get stuck in an local optimal, splitting a single patch into smaller and smaller
patches while ignoring the other patches altogether.
2.5.2
The Splitting Heuristic
Instead of measuring the effect of a patch with the errors, our heuristic measure the
effect of a patch with imprecisions. Intuitively, the imprecision of a patch is measured
by pretending all the other patches are precise, and that this patch is the only patch
with imprecise lower and upper bounds. Formally, let G',teh be the set of abstract
factors where all the patches have their lower bounds pi set to be equal to their upper
bounds pu, except for the patch in question patch*. Then, the imprecision is defined
as:
37
Imprecision(patch*) = E
(rr(G'tch*
esult)
(2.26)
Our heuristic computes the imprecision of all the patches, and choose the one with
the largest imprecision to split.
patchpit = argminpatchImprecision(patch)
(2.27)
By measuring imprecision instead of error, the effects of a patch's bound are not
diluted by the imprecision of other patches. As a result, as a patch is splitted and
became smaller, its imprecision will also decreases compared to the other patches,
allowing other patches to split as well. This avoids the problem of getting stuck in a
local optimal.
2.5.3
Optimization of the Heuristic
As one can imagine, computing imprecision for all of the patches is an expensive
task. In the worst case, this computation runs in 0(n2 ) where n is the total number
of patches, as during the computation of imprecision, each patch induces its own
modified abstract factors G'>ch, effectively creating a completely new set of patches.
However, there is a nice property of imprecision: For any patch, its imprecision can
only decrease as the other patches become more refined. Let G denote a set of abstract
factors and G' denote the same set of abstract factors after several refinement and
splitting has occured, then:
38
Err(Gatch ,result) <; ETTr(Gpatch,result)
( 2.28
)
Vpatch
Because our heuristic only split the patch with the most imprecision, if a patch
already has a small imprecision, there is no need to re-compute its imprecision because
only patches with larger imprecisions can be chosen by the heuristic for splitting.
Using this insight, we can organize all the patches and their imprecisions into a
queue, and only re-compute the maximum imprecision patches of the queue.
get-split()
=
patch, imprecision = dequeue!(imprecision-queue)
recomputed-imprecision = compute-imprecision(patch)
while (recomputed-imprecision != imprecision)
enqueue!(imprecision-queue, patch, recomputed-imprecision)
patch, imprecision = dequeue! (imprecision-queue)
recomputed-imprecision = computeimprecision(patch)
end
patch
end
This algorithm chooses the patch with the largest imprecision, if it discovers its
imprecision is the same as its re-computed imprecision, it returns the patch. Otherwise, it put the patch along with its re-computed imprecision back into the queue
and try again.
Further optimization is obtained by noticing the bounds of a patch only has effect
on few of the other patches, so imprecision of a patch can be computed more locally,
by using only the patches that it affects.
39
2.6
Polynomial Bounds
Recall from the variable elimination section, the soundness of the algorithm only
assumes that one can derive a sound lower bound pi and a sound upper bound p" for
the patches of the original factors, as the bounds are preserved in the integration and
multiplication operations. We would like to emphasize that our technique can work
with any valid lower and upper bounds for the patches, as long as the bounds come
from a family of functions that can be symbolically multiplied and integrated. In our
algorithm, the family of functions are polynomials, as they can be easily multiplied
and integrated, and can be easily bounded. The key question of this section is, given
an arbitrary potential function
f, that
is non-negative and over a domain d, can the
f
algorithm provide a lower and upper bound such that p' <
< pa. We will explain
how to obtain the bounds in two parts: In part one we will briefly explain how to
obtain a good polynomial approximation to the function
f,
and in part two we will
explain how to shift the polynomial by a constant, so that the shifted polynomial
becomes valid bounds for the function
2.6.1
f.
Polynomial Approximation
In the approximation step, an arbitrary, multi-variate function f(Xi,
be approximated with a polynomial p over a domain d.
As function
...
f
X) must
is a valid
potential function, it must have non-zero measure, is everywhere non-negative, and
continuous. The function
f is
not necessarily differentiable, as our distance functions.
which defines the potentials for the constraints, uses the min function internally.
To approximate the function
f,
we use techniques developed in the chebfun pack-
age of matlab [8], which we will describe on a high level. This algorithm works by
iteratively project the function f onto an outer product of Chebychev polynomials.
Let p, denote a chebychev polynomial of the single variable xi, then the product
40
PP = f1
p
is a polynomial of variables X1 ... Xk, expressed as an outer product. The
algorithm constructs ppi, which attempts to approximate our original function
computes the residue ri =
f
f,
and
- ppi. This residue function is again approximated by
PP2, and a new residue is computed r2 = r1 - PP2. This process is repeated until the
residue function is sufficiently small. The result is a sum of outer products of polynomials, which is the polynomial p = ppi + - - - + PPk that approximates the function
f.
In our work, we limit the degree of the chebychev polynomials to 2, as higher degrees,
while able to approximate the function
f
better, becomes numerically unstable when
being multiplied and integrated.
2.6.2
Approximate Bounding
Although the algorithm in
[8]
gives very good polynomial approximations, it does
not provide an error bound on how close the polynomial actually is to the function it
approximates. As a result, we are tasked with retrofitting the approximate polynomial
p with two constants ci and c2 such that p - c
<
f
< p
+ c2.
The simplest bounding method is simply use random sampling in conjunction of
gradient descent on the difference function
f
- p.
The maximum of the difference
function becomes c2, as it is the amount p will have to shift up to be always greater
than
f.
Conversely, the minimum of the difference function becomes -ci,
amount p will have to shift down to be always smaller than
f.
as it is the
This is the default
method of proving bounds as it gives very good bounds and is cheap to compute.
The obvious dropback of this method is it is not sound, as sampling with gradient
descent is prone to stuck in local minimums, which will lead to an incorrect bound.
However, in practice it rarely appears to behave abnormally.
41
2.6.3
Exact Bounding
However, if one insists on providing sound bounds, the algorithm can provide these
sound bounds at the expense of bound quality. Consider a function
f
and its ap-
proximation polynomial p over domain d, consider the bounds on the value of f and
p individually: a < min(f) < max(f) < b, c < min(p) < max(p) < d. Then, by
letting c2 = b - c and ci = d - a, p - cl and p + c2 forms a valid lower and upper
bound for the function
f
over d. It thus remains to provide sound bounds for the
polynomial p and the potential function
f
individually over the domain d, which we
will describe now.
2.6.4
Bounding Polynomials Exactly
Given a polynomial p over a domain d, the maximum and minimum of the polynomial
can be bounded in two ways, either via interval arithmetic or by root-finding. Interval
arithmetic [9] is a form of arithmetic where the operators work on intervals instead of
single values, with the guarantee that for any points chosen from the input intervals,
the usual arithmetic operation on those points results in a point that is contained
in the result interval. For instance, we can add two intervals: [1, 3] + [3, 5] = [4, 8],
and multiply two intervals [-1, 2] * [-1,2] = [-2,4].
As a polynomial is a series
of additions and multiplications, one can bound the maximum and minimum of a
polynomial over a domain d using interval arithmetic.
Another more precise way of obtaining a bound on the polynomial is to use rootfinding. We know the maximum and minimum of the polynomial can only occur at
the critical points where all the partial derivatives of the polynomials are equal to
0, or at the boundaries of the domain.
Using a root finding library [10], one can
easily identify the critical points on the interiors of the domain, and by substituting
the values at the boundaries of the domain to obtain a simpler polynomial, one can
obtain all the critical points of the polynomial on domain d, and enumerate over them
42
to obtain the maximum and the minimum.
2.6.5
Bounding Potential Functions Exactly
Like polynomials, one can bound the potential functions with interval arithmetic to
obtain a very loose bound.
However, we are able to find the exact bound for the
uniform potential function over the domain d analytically, the code for computing
this bound is in the appendix. For the gaussian function, we are still in the process
of discovering a sound bound for its potentials.
For the potential function that corresponds to a constraint, however, we are able
to obtain very tight bounds symbolically.
Recall that the potential function of a
constraint factor is given by first obtaining the distance function d, of the constraint
c, then applying a gaussian function on top of the distance function to obtain the
potential function. Therefore, if we are able to bound the distance function over a
domain d, we can obtain the bound for the potential function, as the gaussian function
is monotonically decreasing for a non-zero input. The question we must answer is
therefore: Given a distance function d, and a domain d, what is the smallest and
largest distances attainble over all the points in the domain?
The fact that the distance function measures euclidean distance will play a key
role: We can draw a bounding sphere around the domain d, and since the distance
function is measured in the euclidean distance, the smallest and largest distance value
attainable over the domain d is bounded by the smallest and largest value attainble
on the bounding sphere:
43
Bound f,d:
o
-
center(d)
diameter(d)
2
dfurthest = df(o) + r
r
dciosest = max(0, df(o) - r)
Below is a figure illustrating the bound for a distance function for a constraint
factor.
{y I y satisfy
c}
b
d
Figure 2-6: The domain d, a rectangle, has its distance bounded by a and b, deriving
from the bounding sphere
44
Chapter 3
Results
We will now list several preliminary results of our algorithm.
3.1
Simple Constraint
Consider the propagation of a simple set of constraints. Say we have the constraints
x = y and x = 5, and we wish to draw a conclusion on the value of y. This can
be done by combining the two factors together and integrating away the variable x.
Below is the result after 100 steps of refinements.
~~xbn
Figure 3-1: Visualizing the result factor of propagating a simple constraint
As we can see, the distribution correctly cluster around the valu y = 5, and we
45
have the guarantee that the true distribution lies between the lower bound and the
upper bound.
3.2
Backward Inference
Consider these set of constraints z = Unif (y, x), y + 20 = x, z = 10. Given these
constraints, what can one say about the joint distribution of y, x? This experiment
explores the ability to infer the hidden relationship between y and x after making an
observation on the output z.
Color
150
Figure 3-2: Visualizing the relationship between y and x after making an observation
This is a preliminary result demonstrating the ability to infer relationships between
variables, in this case, we can see that the distribution roughly satisfies the constraint
y+20 = x, however, the effect of making an observation at z = 10 is not so pronounced
from this experiment. This can be due to a insufficient number of refinements, which
our algorithm still has trouble with scaling.
46
Appendix A
Potential Function and Distance
Functions
Here is the python code defining the bound object for the uniform distribution
# the variable order for this is [low, high, x]
# the value for this potential is 1 /
(high - low) if
low < x < high
# and 0 otherwise
class UniformPotential(Potential):
def __init__(self, delta):
self .delta = delta
def find_minmax(self, constraints):
L1, L2 = constraints[0]
H1, H2 = constraints[1]
X1, X2 = constraints[2]
delta = self.delta
# let's first find the minimum
# for finding the minimum, there are 2 steps to it. For the 8 vertex of the
# cube if any point is UNSAT, then the minimum is 0 otherwise, it is the
# smallest achievable value of 1 / (H2 - Li), the value of x does _not_
# matter here
def pt-satconstraints(pt):
l,h,x = pt
sati = 1 <= x
sat2 = x <= h
47
sat3 = h >= 1 + delta
return sati and sat2 and sat3
pts-satisfy = [pt-satconstraints(pt) for pt in self.get-points(constraints)]
# if it has false, min is 0, otherwise, min is smallest value
minimum = 0.0 if False in pts-satisfy else 1.0 / (H2 - Li)
# let's now find the maximum, some helper functions first
# the idea is the max can only happen at the SAT side, so we don't need to
# worry about the UNSAT regions, if nothing is sat, the max will just be 0
# intersection with the first constraint H >= X1
# return a box if has intersection, or None if intersection empty
def getifirst intersection(11, 12, hi, h2):
if hi >= X1:
return 11, 12, hi, h2
if h2 < X1:
return None, None, None, None
else:
return 11, 12, X1, h2
# intersection with the second constraint L <= X2
# return a box if has intersction, or None if intersection empty
def get.second intersection(11, 12, hi, h2):
if 11 == None:
return None, None, None, None
if 12 <= X2:
return 11, 12, hi, h2
if 11 > X2:
return None, None, None, None
else:
return 11, X2, hi, h2
# given a point 11, hh in 2D, check if it satisfy the 3rd constraint H > L
# + delta
def pt.sat_3rd(pt):
11, hh = pt
return hh >= 11 + delta
# run through the boxes through the first 2 if there are any UNSAT
# we know the max / min are both 0, otherwise we do some math
afterintersection = get-second-intersection(*get-firstintersection(Li, L2, H1
48
maximum = None
# if intersection is empty, our box is unsat
# both upper / lower are 0
if after intersection[0] == None:
maximum = 0.0
else:
11, 12, hi, h2 = after-intersection
pt-satisfiability = [pt-sat_3rd(pt) for pt in self.get-points([(l1,12),(h1,h2
# if everything sat, max is closest to line, min is furthest(to line) max
# is closest
if False not in pt-satisfiability:
maximum = 1.0 / (hi - 12)
# if none of the points satisfy, we're entirely out of the game, so 0, 0
# for min max
if True not in pt-satisfiability:
maximum = 0.0
# if some are SAT and some are UNSAT, then the line pierce the thing.
# the minimum is outside aka 0, the max is 1.0 / delta
if False in ptsatisfiability and True in pt-satisfiability:
maximum = 1.0 / delta
return minimum, maximum
Here are the python code defining various distance functions for the constraint
factors
# equality:
# y = x
@to_11_norm
def equal-dist(lst):
xl = 1st[0]
x2 = 1st[11
return 0.5 * pow((xl - x2), 2)
# constant:
# x = const
def constdist(const):
def constdist(lst):
x = lst[0]
return equal-dist([x, const])
return constdist
# uniform:
# x = uniform (a, b)
49
@to_11_norm
def uniform_dist(a, b):
def _uniform(lst):
x = ist[]
if a < x and x < b:
return 0.0
else:
return 5.0
return _uniform
# asssert:
# asserting x is true
def assertdist(lst):
x = ist[]
return equal-dist([x,
pow(xl,2) + pow(x2,2),
pow(xl-1,2) + pow(x2,2),
pow(xl,2) + pow(x2-1,2),
+ pow(xl-1,2) + pow(x2-1,2)
)
# and:
# y = x1 and x2
@to_11_norm
def and-dist(lst):
y = ist[O]
xl = 1st[1]
x2 = lst[2]
return min( pow(y,2) +
pow(y,2) +
pow(y,2) +
pow(y-1,2)
1.0])
pow(xl,2) + pow(x2,2),
+ pow(xl-1,2) + pow(x2,2),
+ pow(xl,2) + pow(x2-1,2),
+ pow(xl-1,2) + pow(x2-1,2)
)
# or:
# y = x1 or x2
@to_11_norm
def ordist(lst):
y = ist[]
xl = 1st[11
x2 = lst[2]
return min( pow(y,2) +
pow(y-1,2)
pow(y-1,2)
pow(y-1,2)
# xor:
# y = x1 xor x2
@to_11_norm
50
pow(xl,2) + pow(x2,2),
+ pow(xl-1,2) + pow(x2,2),
+ pow(xl,2) + pow(x2-1,2),
pow(xl-1,2) + pow(x2-1,2)
)
def xordist(lst):
y = lst[O]
xi = 1st[1]
x2 = lst[2]
return min( pow(y,2) +
pow(y-1,2)
pow(y-1,2)
pow(y,2) +
# plus:
# y = x1 + x2
@to_11_norm
def plus.dist (1st):
y = ist[]
x1 = lst[11
x2 = 1st[21
return 3*pow(l.0/3 * (xl+x2-y),2)
# TODO: make it not approximate
# times:
# y = x1 * x2
@to_11_norm
def times-dist (1st):
y = ist[Ol
xl = 1st[11
x2 = lst[2]
return abs(y-(xl*x2))
# mod:
# y = x1 % x2
def moddist(lst):
y = lst[O]
x1 = lst[1]
x2 = lst[2]
def mod-sat(pt):
y,xl,x2 = pt
if (x2 == 0):
return False
return y == x1 % x2
yr = round(y)
x1r = round(xl)
x2r = round(x2)
maxdist = ()
d = 0
51
canddist = ()
while (d < maxdist):
for i in range(-d,d+1):
for j in range(-d,d+1):
ptl = (yr-d, xlr+i, x2r+j)
pt2 = (yr+d, xlr+i, x2r+j)
pt3 = (yr+i, x1r-d, x2r+j)
pt4 = (yr+i, xlr+d, x2r+j)
pt5 = (yr+i, xlr+j, x2r-d)
pt6 = (yr+i, xlr+j, x2r+d)
candidates = filter(mod-sat,[ptl,pt2,pt3,pt4,pt5,pt6l)
dists = map(lambda x: euclid-dist(x,(y,xl,x2)), candidates)
if dists == [I:
continue
canddist = min(canddistmin(dists))
#euclidian distance actually returns the squared dist, we need to sqrt it
maxdist = int(pow(cand-dist,0.5)+1)
d = d+1
return canddist
# indirect less than:
# y = x1 < x2
@to_11_norm
def lt-dist (1st):
y = lst[0]
x1 = lst[1]
x2 = 1st[21
def ltcasel(y, x1, x2):
if (xl > x2 or x1 == x2):
return pow(y,2)
else:
return pow(y,2)+2*pow((xl-x2)/2,2)
def ltcase2(y, x1, x2):
if (xl < x2):
return pow(y-1,2)
else:
return pow(y-1,2)+2*pow((xl-x2)/2,2)
return min(lt-casel(y,xl,x2),lt-case2(y,xl,x2))
# direct less than
# x1 < x2
@to_11_norm
def lessdist (1st):
y = 1.0
xl = 1st[0]
52
x2 = 1st [1
def lt-casel(y, xl, x2):
if (xl > x2 or xl == x2):
return pow(y,2)
else:
return pow(y,2)+2*pow((xl-x2)/2,2)
def ltcase2(y, xl, x2):
if (xl < x2):
return pow(y-1,2)
else:
return pow(y-1,2)+2*pow((xl-x2)/2,2)
return min(lt-casel(y,xl,x2),lt-case2(y,xl,x2))
# indirect equal:
# y =
(xl == x2)
@to_11_norm
def eq-eq-dist (1st):
y = ist[Ol
x1 = lst[1]
x2 = 1st[21
dist-uneq = 0
# if the two numbers are sufficiently different, take projection
if (abs(xl - x2) > 1):
distuneq = pow(y,2)
# otherwise, takes the two lines:
# x2 = xl+1, x2 = xl-l and
# take the min dist to those two lines
else:
distuneqi = pow(y, 2) +\
pow(0.5*(x2-xl-1), 2) +\
pow(0.5*(xl-x2+1), 2)
distuneq2 = pow(y, 2) +\
pow(O.5*(x2-xl+l), 2) +\
pow(0.5*(xl-x2-1), 2)
distuneq = min(dist-uneql, dist-uneq2)
dist-eq = pow(0.5*(x2-xl),2) +\
pow(0.5*(xl-x2),2) +\
pow(y-1,2)
return min(dist-eq, dist-uneq)
# gernealize from indirect equal:
# y = lxl - x21 < bound
@to_11_norm
def samedist(bound):
53
def _eq-eq-dist (1st):
y = lst[O]
xl = 1st [1]
x2 = 1st[21
true-y-dist = pow(l - y, 2)
false-y-dist = pow(y, 2)
distlaterall = pow(0.5*(x2-xl-bound),2) +\
pow(O.5*(xl-x2+bound), 2)
distlateral2 = pow(0.5*(x2-xl+bound),2) +\
pow(O.5*(xt-x2-bound), 2)
# if sufficiently close, distance to true is just projection
if (abs(xl - x2) < bound):
dist-eq = true-y-dist
dist-uneq = false-y-dist + min(distlaterall, distlateral2)
return min (dist-eq, dist.uneq)
else:
dist-uneq = false-y-dist
dist-eq = true.y.dist + min(distlaterall, distlateral2)
return min (dist-eq, distuneq)
return _eq-eqdist
# not:
# y = not x
@to_11_norm
def not-dist(lst):
)
y = lst[O]
x = lst[1]
return min( pow(y-1,2) + pow(x,2),
pow(y,2) + pow(x-1,2)
# neg:
# y = -x
@to_11_norm
def neg-dist(lst):
y = ist[O]
x = lst[1]
return 2*pow((y+x)/2,2)
# arracc:
# y = idx xO x2 x3 x4 ...
def arracc-dist(lst):
y = ist[Ol
xn-1
54
idx = 1st [11
xs = lst[2]
size = len(xs)
def d-regionl():
if (idx < -1):
return pow(y,2)
else:
return pow(idx+1,2) + pow(y,2)
def d-region-po:
ret = ()
for p in range(size):
ret = min(ret,
pow(idx-p,2) + 2.0 * pow((y-xs[pl)/2,2)
return ret
)
def d-region2(:
if (idx > size):
return pow(y,2)
else:
return pow(idx-size,2) + pow(y,2)
return min(d-regionl(), d-region2(, d.region-po)
# arrass
# y = idx realidx noval yes-val
def arrassdist(lst):
y = lst[0]
idx = 1st[1]
real idx = lst[21
valno = 1st[3]
val-yes = 1st[41
def d-eqO:
return pow(idx-real idx,2) + pow((y-val-yes)/2,2) * 2
def duneqi():
if (idx < realidx - 1):
return 2 * pow((y-val-no)/2,2)
else:
return 2 * pow((y-val-no)/2,2) + pow(idx-(real-idx-1),2)
def duneq2():
if (idx > realidx + 1):
return 2 * pow((y-valno)/2,2)
else:
return 2 * pow((y-val-no)/2,2) + pow(idx-(real-idx+1),2)
55
return min(d-eq(),d.un-eqi(),d-un-eq2())
# if then else: if b then vi else v2
@to_11_norm
def ite-dist(lst):
y = ist[]
b = ist[1]
v1 = lst[2]
v2 = 1st[31
def dtrueand-vl():
return pow(b-1.0,2) + pow((y-vt)/2,2) * 2
def dfalseand-v2():
return pow(b-0.0,2) + pow((y-v2)/2,2) * 2
return min(d.trueand-vl(), dfalse_andv2())
56
Bibliography
[1] Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan.
An introduction to MCMC for machine learning. Machine Learning, 50(1-2):543, 2003.
[2] R.C. Aster, B. Borchers, and C.H. Thurber. ParameterEstimation and Inverse
Problems. Elsevier Science, 2011.
[3] Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference.
PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[4] Christopher M. Bishop. PatternRecognition and Machine Learning (Information
Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA,
2006.
[5] Noah D. Goodman and Andreas Stuhlm"uller. The design and implementation
of probabilistic programming languages. Accessed: 2014-08-27.
[6] John P. Huelsenbeck, Fredrik Ronquist, Rasmus Nielsen, and Jonathan P. Bollback. Bayesian inference of phylogeny and its impact on evolutionary biology.
Science, 294(5550):2310-2314, December 2001.
[7] Emma Rollon and Rina Dechter. New mini-bucket partitioning heuristics for
bounding the probability of evidence. In Maria Fox and David Poole, editors,
AAAI. AAAI Press, 2010.
[8] Alex Townsend and Lloyd N. Trefethen. An extension of chebfun to two dimensions. SIAM J. Scientific Computing, 35(6), 2013.
[9] Warwick Tucker. Validated numerics : A Short Introduction to Rigorous Computations. Princeton University Press, 2011.
[10] Jan Verschelde. Algorithm 795: Phcpack: A general-purpose solver for polynomial systems by homotopy continuation. ACM Trans. Math. Softw., 25(2):251276, June 1999.
[11] Udo von Toussaint. Bayesian inference in physics. Rev. Mod. Phys., 83:943-999,
Sep 2011.
57
[12] N.L. Zhang and D. Poole. A simple approach to bayesian network computation.
In proceedings of the 10th Canadian Conference on Artificial Intelligence, pages
16-22, Banff, Alberta, Canada, 1994.
58