22 - arXiv.org

advertisement
Asymptotically exact inference in deep generative
models and differentiable simulators
Matthew M. Graham
School of Informatics
University of Edinburgh
m.m.graham@ed.ac.uk
Amos Storkey
School of Informatics
University of Edinburgh
a.storkey@ed.ac.uk
arXiv:1605.07826v2 [stat.CO] 12 Jul 2016
Abstract
Many generative models can be expressed as a deterministic differentiable function
of random inputs drawn from some simple probability density. This framework
includes both deep generative architectures such as Variational Autoencoders and a
large class of dynamical system simulators. We present a method for performing
efficient MCMC inference in such models when conditioning on observations of
the model output. For some models this offers an asymptotically exact inference
method where Approximate Bayesian Computation might otherwise be employed.
We use the intuition that conditional inference corresponds to integrating a density
across the manifold corresponding to the set of inputs consistent with the observed
outputs. This motivates the use of a constrained variant of Hamiltonian Monte Carlo
which leverages the smooth geometry of the manifold to coherently move between
states exactly satisfying the constraint. We validate the method by performing
inference tasks in a diverse set of models: parameter inference in a dynamical
predator-prey simulation, joint 3D pose and camera model inference from 2D
projections and image in-painting with a generative model of MNIST digit images.
1
Introduction
Developments in generative modelling with deep architectures such as Variational AutoEncoders (VAEs) [17, 33] and Generative Adversial Nets (GANs) [14] have made it possible to
learn probabilistic models of increasingly complex, high-dimensional datasets. These models can be
formulated as a differentiable function g : U → Y which maps random inputs U ∈ U ⊆ RM from a
simple base density pU [u] to an implicit learnt distribution over outputs Y ∈ Y ⊆ RN ; the term
Differentiable Generator Net (DGN) has been suggested to encompass such models [13].
The DGN framework also encapsulates a wide class of simulator models defined procedurally (e.g.
numerical integration of a system of stochastic differential equations) with randomness introduced
by draws from a random number generator. The operations used in such simulations are often
differentiable, but manually calculating the derivatives of the simulator outputs with respect to inputs
would be tedious at best and unmanageable for larger computational graphs. Fortunately automatic
differentiation provides a computationally efficient framework for exactly calculating such derivatives
given just the code used to define the forward model, with the implementations in frameworks such
as Theano [36] able to cope with an increasingly wide class of models.
The end-to-end differentiability of DGNs is key in allowing them to be efficiently fit to data, usually
with some variant of stochastic gradient descent. The usefulness of differentiability extends beyond
learning however — inference methods such as Hamiltonian Monte Carlo (HMC) fundamentally rely
on gradient information to allow coherent exploration of high-dimensional state spaces.
A key application of generative models is to perform inference given observations of the output of
the system. For example in a generative model of images we may wish to infer plausible in-paintings
of an image patch given knowledge of the surrounding pixel values. Parameter inference can be
seen as another instance of this task. Here parameters Θ are considered as part of the model output
Y = [Θ ; X ], sampled from their prior via a deterministic function of a subset of the random inputs
U , and the model is conditioned on observations of the remaining ‘data’ outputs X . The data outputs
are generated from parameters by another deterministic function which takes as input the parameters
Θ and the remaining random inputs which produce the variability in X given Θ .
For many generators we cannot evaluate the marginal density on the outputs pY [y]. In a parameter
inference setting this usually corresponds to not being able to evaluate the likelihood pX | Θ [x | θ].
The lack of a closed form density precludes direct use of approximate inference methods such as
variational inference and Markov chain Monte Carlo (MCMC) in the output space.
2
Approximate Bayesian Computation
A lack of explicit likelihoods is the motivation for Approximate Bayesian Computation (ABC)
methods [4, 22]. In ABC the simulated data outputs X are decoupled from the observed data X̆ by a
noise model or kernel pX̆ | X [x̆ | x], typically parametrised by a tolerance parameter e.g. uniform
ball kernel radius or Gaussian kernel standard deviation. The ABC likelihood is then defined as
Z
pX̆ | Θ [x̆ | θ] = pX̆ | X [x̆ | x] pX | Θ [x | θ] dx.
(1)
X
An unbiased Monte Carlo estimate can be calculated from one or more simulated model outputs
pX̆
|Θ
[x̆ | θ] ≈
S
io
h
1 Xn
pX̆ | X x̆ | x(s)
S s=1
x(s) ∼ pX | Θ [· | θ] ∀s ∈ {1, . . . S} .
(2)
This unbiased ABC likelihood estimator can be used in various Monte Carlo inference schemes. In
the ABC rejection algorithm, the kernel is typically uniform across some bounded region defined by
a distance measure between the simulated and true data being less than , with parameters proposed
from the prior accepted if the simulated data lies within the of the true data and rejected otherwise.
In ABC MCMC methods [23], the ABC likelihood estimator is instead used in a Metropolis–Hastings
acceptance step, with new model parameters proposed
from some perturbative proposal distribution
given the current values. If the simulated outputs x(s) used in the estimator are maintained as part
of the Markov chain state, this can be seen as an instance of a pseudo-marginal (PM) method [3].
The kernel can have a probabilistic interpretation as representing uncertainty introduced by measurement error or mismatch between the unknown generative process by which the observed data was
produced and the modelled generator [32]. However in practice the kernel is often motivated on
practical terms and the tolerance parameter fixed on computational efficiency grounds rather than
treated as some unobserved variable to be inferred [34].
A common explanation for the use of a kernel in ABC is that for continuous data the probability
of generating simulated data exactly matching the observed data is zero if using an ABC rejection
type scheme [22] — the observed data forms a zero measure set. ABC therefore can be seen as
relaxing the constraint that true and simulated data exactly match and instead accepting parameters
which produce simulated data within some region around the true data. From this perspective the
distribution the parameters are sampled from only approximates the ‘true’ posterior.
3
Conditional inference in differentiable generative models
In a typical VAE model for continuous data, the generator is specified with a hierarchical Gaussian
structure, with Gaussian distributed latent variables Z parametrising some lower dimensional manifold
in the output space via a function m(Z ) specified by a feed-forward network. Additional Gaussian
noise N with dim(N ) = dim(Y ) is added, allowing points in the output space off the manifold to
be generated, scaled by a further function s(Z ) specified by a feed-forward network. Considering the
overall input as U = [Z ; N ] the generator function is therefore g(U ) = m(Z ) + s(Z ) N .
In VAE models we could exploit the hierarchical structure to perform MCMC inference in the joint
distribution pY ,Z [y, z] = pY | Z [y | z] pZ [z] (or conditionals thereof) over latent variables Z and
outputs Y which is evaluable as both the latent marginal pZ [z] and output conditional pY | Z [y | z]
are of known forms (e.g. Gaussian). However such hierarchical model structures are known to be
challenging for MCMC methods to explore well [28], with often large variations in scale across the
joint space and tight coupling between latent variables and outputs which can prove challenging even
for MCMC methods designed for densities with complex geometry like HMC [5].
Rather than defining the Markov chain on [Y ; Z ] here we propose to instead use U . As the outputs
Y are full determined by U there is no loss of information in this alternative parametrisation.
Importantly we can also use U as the state with arbitrary generator functions without any assumption
of some hierarchical structure, including likelihood-free simulator models typically used in ABC.
Generating samples of U to compute unconditional expectations of functions of Y is trivial - we
simply need to define a MCMC dynamic which leaves pU [u] invariant. Often we can in fact generate
exact independent samples from pU [u].
2
It is not immediately clear however how to perform conditional inference. More explicitly if we
partition the generator output as Y = [Y 1 ; Y 2 ] = [g1 (U ); g2 (U )], how can we produce MCMC
samples of U which will allow us to estimate expectations of Y given Y 2 = y̆2 has been observed?
Conditioning on part of the output maps to constraining the outputs to be in an axis-aligned linear subspace. However, as the mapping from inputs to outputs in the generator is generally a non-linear function, this corresponds to restricting the inputs to a non-linear manifold C = {u ∈ U : g2 (u) = y̆2 }
embedded in the input space. We claim here and show in Appendix A that if we generate a set of
input MCMC samples {u(s) }Ss=1 restricted to C and with invariant density
1
∂g ∂g T − 2
2 2 πU [u] ∝ pU [u] (3)
∂u ∂u with respect to the natural reference (Hausdorff) measure for the manifold, then
S
1 X
f ◦ g(u(s) ) .
S→∞ S
s=1
E [f(Y ) | Y 2 = y̆2 ] = lim
(4)
Intuitively the determinant term in (3) adjusts for the contraction / expansion of the infinitesimal
‘thickness’ (extent in directions orthogonal to the tangent space) of the manifold when mapping
through the generator function.
A general framework for performing conditional inference in differentiable generators is therefore to
define MCMC updates which explore the target density (3) on the constraint manifold C. We propose
here to use a HMC method which simulates the dynamics of a constrained mechanical system.
4
Constrained Hamiltonian Monte Carlo
In standard HMC [11, 29] the vector variable of interest U is augmented with a momentum variable
P ∈ RM . The momentum is taken to be independently Gaussian distributed with zero mean and
covariance M, often called the mass matrix. The new joint target density is then πU ,P [u, p] ∝
exp [−H(u, p)] where H(u, p) = − log πU [u] + 21 pT M−1 p is termed the Hamiltonian.
The canonical Hamiltonian dynamic is described by the system of ordinary differential equations
∂H
du
=
= M−1 p,
dt
∂p
dp
∂H
∂ log πU [u]
=−
=
.
dt
∂u
∂u
(5)
This dynamic is time-reversible, measure-preserving and exactly conserves the Hamiltonian. Symplectic numerical integrators allow approximate integration of the Hamiltonian flow while maintaining
the time-reversibility and measure-preservation properties. Further subject to stability bounds on the
time-step, such integrators will exactly conserve some ‘nearby’ Hamiltonian, and so the change in
the Hamiltonian will tend to remain small even over long simulated trajectories [19].
These properties make simulated Hamiltonian dynamics an ideal proposal mechanism for a Metropolis
MCMC method. The Metropolis accept ratio for a proposal (up , pp ) generated by simulating the
dynamic Ns time steps forward from (u, p) with a symplectic integrator and then negating the
momentum, is simply exp {H(u, p) − H(up , pp )}. Typically the change in the Hamiltonian will
be small and so the probability of acceptance high. To ensure ergodicity, dynamic moves can be
interspersed with updates independently sampling a new momentum from N (0, M).
In our case the system is subject to a constraint of the form c(u) = 0. By introducing Lagrangian
multipliers λi for each of the constraints, the Hamiltonian for a constrained system can be written as
1
H(u, p) = − log πU [u] + pT M−1 p + λT c(u),
2
with the corresponding constrained Hamiltonian dynamic being
du
= M−1 p,
dt
dp
∂ log πU [u]
∂c T
=
−
λ,
dt
∂u
∂u
c(u) = 0,
(6)
∂c −1
M p = 0.
∂u
(7)
Two popular symplectic numerical integrators for simulating the constrained dynamics are SHAKE
[35] and its variant RATTLE [2]. These methods are generalisations of the Störmer-Verlet (‘leapfrog’)
integrator typically used in standard HMC, and maintain the favourable properties of being timereversible, measure-preserving and having bounded Hamiltonian error growth.
The use of constrained dynamics in HMC has been proposed several times. In the molecular dynamics
literature, both [16] and [21] suggest using a simulated constrained dynamic within a HMC framework
to estimate free-energy profiles.
3
Most relevantly here Brubaker et al. proposed using a constrained HMC variant to perform inference
in distributions defined on non-Euclidean manifolds [7]. They give sufficient conditions on H, C
and c for the scheme to satisfy detailed balance and be ergodic: that H is C 2 continuous, and C is a
∂c
connected smooth and differentiable manifold and ∂u
has full row-rank everywhere.
5
Method
Our constrained HMC implementation is shown in algorithm 1. We use a generalisation of the
RATTLE scheme to simulate the dynamic. The inner updates of the state to solve for the geodesic
motion on the constraint manifold are split into multiple smaller steps, which can be considered a
special case of the scheme described in [18]. This allows more flexibility in choosing an appropriately
small step-size to ensure convergence of the iterative solution of the equations projecting on to the
constraint manifold while still allowing a more efficient larger step size for updates to the momentum
due to the negative log density gradient. We have assumed M = I here; other mass matrix choices
can be equivalently implemented by adding an initial linear transformation stage in the generator.
Algorithm 1 Constrained Hamiltonian Monte Carlo algorithm
∂c
Input: (u, p) : current state pair with c(u) = 0 and ∂u
p = 0, (C, L) : constraint Jacobian and Gram matrix
Cholesky decomposition at current u, : Newton iteration tolerance, δt : time step, Ns : number of time
steps to simulate, Ni : number of inner geodesic updates per time step
∂c 0
Output: (u0 , p0 ) : new state pair with c(u0 ) = 0, ∂u
0 p = 0 and if (u, p) ∼ πU ,P [·, ·] then
(u0 , p0 ) ∼ πU ,P [·, ·], (C0 , L0 ) : constraint Jacobian and Gram matrix Cholesky decomposition at new u0
1:
2:
3:
4:
5:
6:
7:
8:
up , pp , Cp , Lp ← S IM DYNAMIC(u, p, C, L)
r ← Uniform(0, 1)
if r < exp {H(u, p) − H(up , pp )} then
u0 , p0 , C0 , L0 ← up , pp , Cp , Lp
else
u0 , p0 , C0 , L0 ← u, p, C, L
n ← N (0, I)
p0 ← P ROJECT M OM(n, C0 , L0 )
21:
22:
23:
24:
25:
26:
27:
28:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
29:
function S IM DYNAMIC(u, p, C, L)
for s ∈ {1 . . . Ns } do
if s = 1 then
∂ log π p̃ ← p + δt
2
∂u
u
else
∂ log π p̃ ← p + δt
function P ROJECT M OM(p, C, L)
return p − CT L−T L−1 Cp
30:
31:
32:
U
33:
34:
U
∂u
function P ROJECT P OS(u, C, L)
c ← c(u)
while kck∞ > do
u ← u − CT L−T L−1 c
c ← c(u)
return u
u
35:
p ← P ROJECT M OM(p̃, C, L)
u, p, C, L ← I NNER S TEP(u, p, C, L)
∂ log π p̃ ← p + δt
2
∂u
u
p ← P ROJECT M OM(p̃, C, L)
return u, p, C, L
36:
37:
U
38:
39:
40:
function I NNER S TEP(u, p, C, L)
for i ∈ {1 . . . Ni } do
δt
p
ũ ← u + N
i
0
u ← P ROJECT
P OS(ũ, C, L)
∂c C ← ∂u
u0
L ← chol CCT
i
p̃ ← N
(u0 − u)
δt
p ← P ROJECT M OM(p̃, C, L)
u ← u0
return u, p, C, L
Each inner dynamics time-step involves making an unconstrained updates u → ũ and then projecting
∂c T
ũ back on to C by solving for λ which satisfy c(ũ − δt2 ∂u
λ/2) = 0. This is performed in the
function P ROJECT P OS in algorithm 1. Here we use a quasi-Newton method for solving the system of
equations in the projection step. The true Newton update would be
n o−1
∂c T
∂c ∂c T
u(t+1) ← u(t+1) − ∂u
c(u(t+1) ).
(8)
(t)
(t+1)
∂u u
∂u u(t)
u
This requires recalculating the constraint Jacobian and solving a dense linear system within the
optimisation loop. Instead we use a symmetric quasi-Newton update, the Jacobian Gram matrix
∂c ∂c T
∂u ∂u evaluated at the previous state used to condition the moves. This matrix is guaranteed to
be positive-definite and a Cholesky decomposition can be calculated outside the optimisation loop
allowing cheaper quadratic cost solves within the loop. In the rare cases where the quasi-Newton
iteration fails we fall back to a MINPACK implementation of the robust Powell’s Hybrid method [31].
∂c ∂c T
For larger systems, the Choleksy decomposition of the constraint Jacobian Gram matrix ∂u
∂u
(line 36) will become a dominant computational cost, generally scaling cubically with the number of
constraints. The elementwise or autoregressive noise structures of many models however allows a
significantly reduced quadratically scaling computational cost as explained in Appendix C.
4
The momentum updates in the S IM DYNAMIC routine require evaluating the gradient of the logarithm
of the target density (3). This can by achieved by using automatic differentiation to directly calculate
the gradient from the expression given in (3), however both the log-density and gradient can be more
efficiently calculated by reusing the Cholesky decomposition of the constraint Jacobian Gram matrix
computed in line 36. Details are given in Appendix B.
In the P ROJECT P OS routine convergence is signalled when the elementwise maximum absolute value
of the constraint function is below some tolerance . This acts analogously to the parameter in ABC
methods, however here we typically set this parameter to some multiple of machine precision and so
the approximation introduced is comparable to that otherwise incurred for using non-exact arithmetic.
A final implementation detail is the requirement to find some initial u satisfying c(u) = 0. If the
generator has one of the two structures described in Appendix C this can usually be easily achieved.
In both cases z can be chosen arbitrarily then n found by solving for each of the output constraints
independently (elementwise) or sequentially (autoregressive). For more general generators, we could
choose a subset of the inputs (or linear projections of the inputs) of equal dimensionality to the
number of constraints and plug the resulting system of equations into a black-box solver.
6
Related work
Closely related is the aforementioned Constrained HMC method of [7]. This demonstrates the validity
of the constrained HMC framework theoretically and experimentally, and we do not claim any original
contribution in this respect. The focus in [7] is on performing inference in distributions inherently
defined on some fixed non-Euclidean manifold such as the unit sphere or space of orthogonal matrices.
Our work builds on [7] by highlighting that conditioning on the output of a generator imposes a
constraint on its inputs and so defines a density in input space restricted to some manifold. Unlike
the cases considered in [7] our constraints are therefore data-driven and the target density on the
manifold implicitly defined by a generator function and base density.
Geodesic Monte Carlo [8] also considers applying a HMC scheme to sample from non-linear
manifolds embedded in a Euclidean space. Similarly to [7] however the motivation is performing
inference with respect to distributions explicitly defined on a manifold such as directional statistics.
The method presented in [8] requires an exact solution for the geodesic flow on the manifold. Our
use of constrained Hamiltonian dynamics, and in particular the geodesic integration scheme of [20],
can be considered an extension for cases when an exact geodesic solution is not available. Instead the
geodesic flow is approximately simulated while still maintaining the required measure-preservation
and reversibility properties for validity of the overall HMC scheme.
Sampling from a Manifold [10] gives a clear exposition on the theory underlying sampling from
distributions on embedded submanifolds, including a derivation of the form for conditional densities
in this setting which (3) is a particular case of.
Hamiltonian ABC [24], also proposes applying HMC to perform conditional inference in the
likelihood-free models considered here. An ABC set-up is used with a Gaussian synthetic-likelihood
formed by estimating first and second moments from simulated data; this stabilises the likelihood and
gradient evaluations for small but introduces extra bias on top of that inherent to ABC methods.
Rather than using automatic differentiation to exactly calculate gradients of the generator function,
Hamiltonian ABC proposes using an unbiased stochastic gradient estimator. Having only a stochastic
gradient requires the use of methods from the series of works extending HMC to such cases [9, 37]. It
has been suggested however that the use of stochastic gradients can destroy the favourable properties
of Hamiltonian dynamics which enable coherent exploration of high dimesional state spaces [6].
In Hamiltonian ABC it is also observed that representing the generative model as a deterministic
function by fixing the random inputs to the generator is a useful method for improving exploration of
the state space. This is achieved by including the seed of the pseudo-random number generator in the
chain state rather than the set of random inputs as we do.
Also related is Optimisation Monte Carlo [25]. The authors propose using an optimiser to find
parameters of a simulator model consistent with observed data (to within some tolerance ) given
fixed random inputs sampled independently. The optimisation is not measure-preserving and so
the Jacobian of the map is approximated with finite differences to weight the samples. Our method
also uses an optimiser to find inputs consistent with the observations, however by using a measurepreserving dynamic we avoid having to re-weight samples which can scale poorly with dimensionality.
Our method also differs in treating all inputs to a generator equivalently; while the Optimisation Monte
Carlo authors similarly identify the simulator models as deterministic functions they distinguish
between parameters and random inputs, optimising the first and independently sampling the latter.
This can lead to random inputs being sampled for which no parameters can be found consistent
5
(a)
(b)
(c)
Figure 1: Lotka–Volterra (a) Observed predator-prey populations (solid) and ABC sample trajectories with = 10 (dashed) and = 100 (dot-dashed). (b) Marginal empirical histograms for the four
parameters (columns) from constrained HMC samples (top) and ABC samples with = 10 (middle)
and = 100 (bottom). Horizontal axes shared across columns. Red arrows indicate true parameter
values. (c) Mean ESS normalised by compute time for each of four parameters for ABC with = 10
(red), = 100 (green) and our method (blue). Error bars show ±3 standard errors of mean.
with the observations (even with a ‘soft’ within constraint). Although optimisation failure is also
potentially an issue for our method, as we jointly optimise all inputs and are approximating some
exact continuous time constraint-satisfying dynamic we found this occurred rarely in practice if
an appropriate step size is chosen. Our method can also be applied in cases were the parameter
dimensionality is greater than the number of output constraints unlike in Optimization Monte Carlo.
7
Experiments
To illustrate the general applicability of our method we performed conditional inference tasks in three
diverse settings: parameter inference in a stochastic Lokta-Volterra predator-prey model simulation,
three-dimensional human pose and camera parameter inference given two-dimensional joint position
information and finally in-painting of missing regions of digit images using a generative model
trained on MNIST. In all three experiments Theano was used to specify the generator function and
calculate the required derivatives. All experiments were run on an Intel Core i5-2400 quad-core CPU.
7.1
Lotka–Volterra model parameter inference
As a first demonstration we considered a stochastic continuous time and state variant of the Lotka–
Volterra model, a common example problem for ABC methods e.g. [25]. In particular we consider
parameter inference given a simulated solution of the following stochastic differential equations
dq1 = (θ1 q1 − θ2 q1 q2 ) dt + dn1
dq2 = (−θ3 q2 + θ4 q1 q2 ) dt + dn2
(9)
4
where q1 and q2 represent the prey and predator population respectively, θ = {θi }i=1 the system
parameters and n1 and n2 zero-mean, unit variance white noise processes.
To generate the observed data, an Euler-Maruyama discretisation was used with time-step 1, initial
condition q1 = q2 = 100 and θ = [0.4 0.005, 0.05, 0.001] (chosen to give stable oscillatory dynamics). Theh simulation was runifor 50 time-steps with the observed data defined as the concatenated
vector x = q1(1) q2(1) ... q1(50) q2(50) . The parameters were given log-normal priors θi ∼ log N (−2, 1).
We compared our method to various ABC approaches (§2) using a uniform ball kernel with radius .
ABC rejection failed catastrophically, with no acceptances in 106 samples even with a large = 1000.
Standard random-walk ABC MCMC also performed very poorly with the dynamic having zero
acceptances over multiple runs of 105 updates for = 100 and getting stuck at points in parameter
space over many updates for larger = 1000, even with small proposal steps. Because of this poor
performance we used a recent method from the pseudo-marginal literature, which uses alternating
elliptical slice sampling updates of the parameters and random draws [27]. The slice sampling updates
locally adapt the size of steps made to ensure a move can always be made. Using this method we
were able to obtain reasonable convergence over long runs for both = 100 and = 10.
The results are summarised in Figure 1. Figure 1a shows the simulated data used as observations and
ABC sample trajectories for = 10 and = 100 . Though both samples follow the general trends
of the observed data there are large discrepancies particularly for = 100. Our method in contrast
samples parameters generating trajectories exactly matching the observations at all points. Figure
1b shows the marginal histograms for the parameters. The inferred posterior on the parameters are
significantly more tightly distributed about the true values used to generate the observations for our
approach and the = 10 case compared to the results for = 100; even for the = 10 case however
it can be seen that there are spurious appearing peaks in the distributions for θ3 and θ4 .
Figure 1c shows the relative sampling efficiency of our approach against the ABC methods, as
measured by the effective sample sizes (ESS) (computed with R-CODA [30]) normalised by run time
6
(a)
(b)
Figure 2: Human pose (a) RMSEs of 3D pose posterior mean estimates given binocular projections,
using samples from our method (blue) versus running HMC in hierarchical model (red) for three
different scenes sampled from prior. Horizontal axes show computation time to produce number of
samples in estimate. Solid curves are average RMSE over 10 runs with different seeds and shaded
regions show ±3 standard errors of mean. (b) Orthographic projections (top front view, bottom side
view) of 3D poses consistent with monocular projections. Left most pair (black) shows true pose
used to generate observations, right three pairs (coloured) show constrained HMC posterior samples.
averaged across 10 sampling runs for each method. Despite the significantly higher per-update cost
in our method, the coherent movement about the state space afforded by the Hamiltonian dynamic
gave significantly better performance even over the very approximate = 100 case.
7.2
Human pose and camera model inference
For our next experiment we considered inferring a 3D human pose and camera model from 2D
projection(s) of joint positions. We used a 19 joint skeleton model, with a prior density over poses
parametrised by 47 local joint angles Θ learnt from the PosePrior motion capture dataset [1] with
a VAE with a 30 dimensional latent representation Z . For the bone lengths L, a log-normal model
was fitted to data from the ANSUR anthropometric dataset [15], due to symmetry 13 independent
lengths being specified. A simple pin-hole projective camera model with 3 position parameters P
and fixed focal-length was used1 . A log-normal prior was placed on the depth position P 3 to enforce
positivity with normal priors on P 1 and P 2 . A small amount of Gaussian noise was added to the
projected positions to give the observations Y . This ensured the constraint Jacobian was full row-rank
everywhere and gave a known hierarchical Gaussian density on {Y , Θ , Z , log L, P 1:2 , log P 3 }.
We first considered binocular pose estimation, with the 3D scene information inferred given 2D
projections from two cameras with a known offset between them. In this stereo vision case, the
disparity between the projections gives information about the depth direction and so we would expect
the posterior distribution on the 3D pose to be tightly distributed around the true values used to
generate the observations. We compared our constrained HMC method to running standard HMC on
the known (unnormalised) conditional density of {Θ , Z , log L, P 1:2 , log P 3 } given Y .
Figure 2a shows the root mean squared error (RMSE) between the posterior mean estimate of the
3D joint positions and the true positions used to generate the observations as the number of samples
included in the estimate increases for three test scenes. For both methods the horizontal axis have
been scaled by computation time and the curves shown are averages across 10 runs. Initial pilot runs
were used to find appropriate settings for the time step and number of integrator steps.
The constrained HMC method (red curves) tends to give significantly more accurate estimates
particularly over longer periods. Visually inspecting the sampled poses and individual run traces (not
shown) it seems that the HMC runs tended to often get stuck in local modes corresponding to a subset
of joints being ‘incorrectly’ positioned while still broadly matching the (noisy) projections. The
complex dependencies of the joint positions on the angle parameters mean the dynamic struggles to
find an update which brings the ‘incorrect’ joints closer to their true positions without moving other
joints out of line. The constrained HMC method seems to have been less susceptible to this issue.
We also considered inferring 3D scene information from a single 2D projection. Monocular projection is inherently information destroying with significant uncertainty to the true pose and camera
parameters which generated the observations. Figure 2b shows pairs of orthographic projections of
3D poses: the left most column is the pose used to generate the projection conditioned on and the
right three columns are poses sampled using constrained HMC consistent with the observations. The
top row shows front x–y views, corresponding to the camera view though with a orthographic rather
than perspective projection, the bottom row shows side z–y views with the z axis the depth from the
camera. The dynamic is able to move between a range of plausible appearing poses consistent with
the observations while reflecting the inherent depth ambiguity from the monocular projection.
1
The camera orientation was assumed fixed to avoid replicating the degrees of freedom specified by the
angular orientation of the root joint of the skeleton: only the relative camera–skeleton orientation is important.
7
(a)
(b)
Figure 3: MNIST In-painting samples generated using constrained HMC (a) and standard HMC (b).
The top black-on-white quarter of each image is the fixed observed region and the remaining whiteon-black region the proposed in-painting. In left-right scan order the images in (a) are consecutive
samples from a run; in (b) the images are every 40th sample to account for the quicker updates.
7.3
MNIST in-painting
As a final task we considered inferring in-paintings for missing regions of digit images Y INF given
knowledge of the rest of the pixel values Y OBS . A Gaussian VAE trained on MNIST was used as
the generative model, with a 50-dimensional latent representation Z . We compared our method to
running standard HMC in the known conditional density on Z given Y OBS (Y INF is conditionally
independent of Y OBS given Z and so can be directly sampled from its Gaussian conditional).
Example samples are shown in Figure 3. In this case the constrained and standard HMC approaches
appear to be performing similarly, with both seeming to be able to find a range of plausible inpaintings given the observed pixels. Without cost adjustment the standard HMC samples show
greater correlation between subsequent updates, however for a fairer comparison the samples shown
were thinned to account for the approximately 40× larger run-time per constrained HMC sample.
Although the constrained dynamic does not improve efficiency here neither does it seem to hurt it.
8
Discussion
We have presented a generally applicable framework for performing conditional inference in generative models specified by a differentiable function of continuous random inputs. Though simulating
the constrained Hamiltonian dynamics is computationally costly, the resulting coherent exploration
of the state space can lead to significantly improved sampling efficiency over alternative methods.
Further our approach allows asymptotically exact inference in differentiable likelihood-free models
where ABC methods might otherwise be used. We suggest an approach for dealing with two of
the key issues in ABC set-ups - allowing inference in continuous spaces as the kernel parameter collapses to zero and providing a method for updating the random draws used in simulator models
which maintains a large probability of acceptance in high-dimensional spaces. As well as being of
practical importance itself, this approach should prove useful in providing ‘ground truth’ inferences
in more complex models to assess the affect of the approximations used in various ABC methods on
the quality of the inference.
A limitation of our method is the (strong) requirement of differentiability of the generator. This
prevents applying our approach to generative models which use discontinuous operations or discrete
random inputs. For discrete inputs one option would be to alternate updating the discrete inputs
given fixed continuous inputs with some valid MCMC move and applying our approach to update
the continuous inputs for fixed discrete inputs; if there is strong coupling between the discrete and
continuous inputs however this approach could mix poorly.
A common technique in ABC methods is to define a kernel or distance measure in terms of summary
statistics of the observed data [22]. If appropriately informative statistics are chosen this helps to
alleviate poor scaling of the methods with dimensionality (in both cost and approximation error).
Providing the summary statistics are a differentiable function of the observations this idea can be
quite naturally applied in our approach by defining the constraint function c in terms of summary
statistics computed from the observations y. Given the key dependence of the computational cost of
our method on the constraint dimensionality, defining the constraints in terms of lower-dimensional
summaries could be key in allowing scaling the method to larger systems.
Acknowledgments
Thanks to Iain Murray for several useful discussions and to Ben Leimkuhler for pointing out
the potential of using a geodesic integration scheme. This work was supported in part by grants
EP/F500385/1 and BB/F529254/1 for the University of Edinburgh School of Informatics Doctoral
Training Centre in Neuroinformatics and Computational Neuroscience (www.anc.ac.uk/dtc) from
the UK Engineering and Physical Sciences Research Council (EPSRC), UK Biotechnology and
Biological Sciences Research Council (BBSRC), and the UK Medical Research Council (MRC).
8
A
Computing conditional expectations on the output of a generator
We here give a derivation for the target density on the constraint manifold in the input space of
a differentiable generator for computing conditional expectations on the output. This is largely a
restatement of results in [10, §2.3] and is provided mainly to make those results more easily relatable
to the notation of this paper.
For clarity in this section the measure being integrated with respect to will be explicitly denoted. For
a variable of integration x, λD {dx} will denote the D dimensional Lebesgue measure and HD {dx}
the D dimensional Hausdorff measure over some space X which will be specified.
A key result we will use is Federer’s Co-Area Formula [12, §3.2.12]:
Theorem (Co-Area Formula). Let m : X ⊆ RL → V ⊆ RK be Lipschitz with L > K and
h : X → R be Lebesgue measurable. Then
1
Z
Z Z
∂m ∂m T 2
h(x) h(x) HL−K {dx} λK {dv}
(10)
λL {dx} =
∂x
∂x
−1
X
V m (v)
h
i
∂mi
=
with ∂m
the Jacobian of the map, m−1 (v) the L − K dimensional sub-manifold embed∂x
∂xj
i,j
ded in X with Hausdorff measure HL−K {dx}, which is the solution set {x ∈ X : m(x) = v}.
∂m T Corollary. If the Jacobian of m has full row-rank everywhere such that ∂m
∂x ∂x > 0 ∀x ∈ X
∗
then for a Lebesgue measurable h : X → R
1
Z
Z Z
∂m ∂m T − 2
∗
L
∗
h (x) λ {dx} =
h (x) HL−K {dx} λK {dv}
(11)
∂x ∂x X
V m−1 (v)
− 1
∂m T 2
which can be easily shown by setting h(x) = h∗ (x) ∂m
in (10).
∂x ∂x We will also use what is sometimes termed the Law of the Unconscious Statistician (LOTUS) to
express expectations of functions of random (vector) variables when an explicit density on the random
output of the function is not known.
Theorem (Law of the Unconscious Statistician). Let Y be a random vector on support Y ⊆ RN
with density pY [y] with respect to the Lebesgue measure λN {dy} and f : Y → R be Lebesgue
measurable. If we define a new random variable Z = f(Y ) then
Z
E [Z ] = E [f(Y )] =
f(y) pY [y] λN {dy} .
(12)
Y
Corollary. If Y is defined as Y = g(U ) for some g : U ⊆ RM → Y then
Z
E [f(Y )] = E [(f ◦ g)(U )] =
f ◦ g(u) pU [u] λM {du} .
(13)
U
This leads us to the main result
Theorem. Let U be a random vector with density pU [u] with respect to the Lebesgue measure
λM {du} on support U ⊆ RM . Further let g : U → Y be a smooth map, with Y ⊆ RN ; N ≤ M
∂g
defining a random vector Y = g(U ). Assume ∂u
has full row-rank everywhere.
Partition the output space Y = Y1 × Y2 with Y1 ⊆ RN1 and Y2 ⊆ RN2 and Y 1 = g1 (U ),
Y 2 = g2 (U ). Then the conditional expectation of some function f of Y given Y 2 = y̆2 has been
observed is
1
Z
∂g ∂g T − 2
1
2 2 E [f(Y ) | Y 2 = y̆2 ] =
f ◦ g(u) pU [u] HM −N2 {du} .
(14)
∂u ∂u pY 2 [y̆2 ] C
with pY 2 [y̆2 ] the marginal density on Y 2 with respect to the Lebesgue measure λN2 {dy2 } which
must be non-zero for the conditional expectation to be well-defined; C is the M − N2 dimensional
sub-manifold defined by the solution set {u ∈ U : g2 (u) = y̆2 }.
Proof. By the Law of Total Expectation we have that
Z
E [f (Y )] =
E [f (Y ) | Y 2 = y2 ]
Y2
9
pY
2
[y2 ] λN2 {dy2 } .
(15)
Using LOTUS (12) we get
Z
E [f (Y ) | Y 2 = y2 ]
Y2
pY
2
[y2 ] λ
N2
Z
{dy2 } =
U
f ◦ g(u) pU [u] λM {du} .
Applying the co-area formula corollary (11) to the right-hand side gives
Z
E [f (Y ) | Y 2 = y2 ] pY 2 [y2 ] λN2 {dy2 } =
(16)
(17)
Y2
Z
Y2
Z
g2−1 (y2 )
f ◦ g(u) pU
1
∂g ∂g T − 2
2 2 HM −N2 {du} λN2 {dy2 } .
[u] ∂u ∂u Define Y2? = {y2 ∈ Y2 : pY 2 [y2 ] > 0}. Then we have
Z
{E [f (Y ) | Y 2 = y2 ]} pY 2 [y2 ] λN2 {dy2 } =
(18)
(19)
Y2?


Z
Y2?
1
 pY 2 [y2 ]
Z
g2−1 (y2 )
f ◦ g(u) pU

1

∂g ∂g T − 2
2 2 [u] HM −N2 {du} pY 2 [y2 ] λN2 {dy2 } .
∂u ∂u 
(20)
As this holds for arbitrary Lebesgue measurable f, this implies that the terms inside the braces are
equal for y2 ∈ Y2? 2 . As y̆2 ∈ Y2? by assumption and C = g2−1 (y̆2 ) we have
1
Z
∂g ∂g T − 2
1
2 2 (21)
HM −N2 {du} .
E [f(Y ) | Y 2 = y̆2 ] =
f ◦ g(u) pU [u] ∂u ∂u pY 2 [y̆2 ] C
Corollary. Define a target density with respect to the Hausdorff measure HM −N2 {du} on C
1
∂g ∂g T − 2
2 2 (22)
πU [u] ∝ pU [u] .
∂u ∂u S
If we generate a set of MCMC samples u(s) s=1 which leave πU [u] invariant with respect to
HM −N2 {du} on C, by the Law of Large Numbers we can then form a Monte Carlo estimate for the
conditional expectation
S
1 X
E [f(Y ) | Y 2 = y̆2 ] = lim
f ◦ g(u(s) ) .
(23)
S→∞ S
s=1
B
Evaluating the target density and its gradient
For the constrained Hamiltonian dynamics we need to be able to evaluate the logarithm of the target
density (22) up to an additive constant and its gradient with respect to u. We have that
∂g ∂g T 1
2 2 log πU [u] = log pU [u] − log (24)
− log Z
∂u ∂u 2
where Z is the normalising constant for the density which is independent of u.
2 ∂g2 T In general evaluating the Gram matrix determinant log ∂g
∂u ∂u has computational cost which
scales as O(M N22 ). However as part of the constrained dynamics updates the lower-triangular
Cholesky decomposition L of the Gram matrix
matrix determinant we have
∂g2 ∂g2 T
∂u ∂u
is calculated. Using basic properties of the
1
log LLT − log Z
2
1
[u] − log |L| LT − log Z
2
[u] − log |L| − log Z
log πU [u] = log pU [u] −
= log pU
= log pU
= log pU [u] −
N2
X
log(Lii ) − log Z
(25)
(26)
(27)
(28)
i=1
2
Intuitively we could consider setting f(y) = f ? (y)δ N2 [y2 − y̆2 ] where δ N [·] denotes the N dimensional
Dirac delta and f ? is arbitrary, in which case the result follows from the sifting property of the Dirac delta.
10
The base density pU [u] will typically be of a simple form e.g. standard Gaussian, therefore we can
evaluate the logarithm of the target density up to an additive constant at a marginal computational
cost that scales linearly with dimensionality.
For the gradient we can use reverse-mode automatic differentiation to calculate the gradient of (28)
with respect to u. This requires propagating partial derivatives through the Choleksy decomposition
[26]; efficient implementations for this are present in many automatic differentation frameworks
including Theano.
Alternatively the gradient of (24) can be explicitly derived. The gradient of log pU [u] will generally
Z
be trivial and ∂ log
= 0. The gradient of the second term can be calculated using
∂u
"
#−1 "
#
T 

T
T
T
2
2
∂g2 ∂g2
∂
∂ g2 ∂g2
∂g2 ∂ g2
∂g2 ∂g2 log +
(29)
= Trace
 ∂u ∂u
∂u ∂u ∂ui
∂ui ∂u ∂u
∂u ∂ui ∂u 
(
T )
∂ 2 g2
−T −1 ∂g2
= 2 Trace
L L
.
(30)
∂ui ∂u
∂u
The matrix inside the square brackets is independent of i and can be computed once by solving the
system of equations by forward and backward substitution. The matrix of second partial derivatives
∂ 2 g2
∂ui ∂u can either be manually derived for the specific generator function or calculated using automatic
differentiation. The trace of the matrix product is then just the sum over all indices of the element-wise
product of the pair.
C
Exploiting structure in the generator
Often the generator inputs U can be split in to two distinct groups — global inputs Z which effect all
of the conditioned on outputs (e.g. inputs which map to model parameters) and local ‘noise’ inputs
N , each element of which affect only a subset of the outputs.
In particular systems with a constraint c(u) = g2 (u) − y̆2 and a generator function g2 giving the
conditioned on outputs which can be expressed in one of the two forms
yi = gi (z, ni ) (element-wise) or yi = g̃i (z, yi−1 , ni ) = gi (z, n≤i ) (autoregressive)
(31)
∂c
∂u
∂c
∂c ∂c
have a Jacobian
= [ ∂z
∂n ] in which ∂n is diagonal (element-wise) or triangular (autoregressive).
∂c ∂c T
∂c ∂c T
∂c ∂c T
The decomposition of ∂u
can then be computed by low-rank Cholesky
∂u = ∂n ∂n + ∂z ∂z
∂c
∂c
updates of the triangular / diagonal matrix ∂n with each of the columns of ∂z
. As dim(z) = L
is often significantly less than, and independent of, the number of outputs conditioned on N2 , the
resulting O(LN22 ) cost of the Cholesky updates is a significant improvement over the original O(N23 ).
Many DGN models have an element-wise noise structure including the Gaussian VAE. The autoregressive noise structure commonly occurs in stochastic dynamical simulations where the outputs are
a time sequence of states, with noise being added each time-step, for example the Lotka-Volterra
model considered in the experiments in Section 7.
References
[1] I. Akhter and M. J. Black. Pose-conditioned joint angle limits for 3D human pose reconstruction. In IEEE
Conference on Computer Vision and Pattern Recognition, 2015.
[2] H. C. Andersen. RATTLE: A “velocity” version of the SHAKE algorithm for molecular dynamics
calculations. Journal of Computational Physics, 1983.
[3] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations.
The Annals of Statistics, 2009.
[4] M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian computation in population genetics.
Genetics, 2002.
[5] M. Betancourt. A general metric for Riemannian manifold Hamiltonian Monte Carlo. In Geometric science
of information. Springer, 2013.
[6] M. Betancourt. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data
subsampling. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[7] M. A. Brubaker, M. Salzmann, and R. Urtasun. A family of MCMC methods on implicitly defined
manifolds. In International Conference on Artificial Intelligence and Statistics, 2012.
[8] S. Byrne and M. Girolami. Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of
Statistics, 40(4):825–845, 2013.
11
[9] T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the
31st International Conference on Machine Learning, 2014.
[10] P. Diaconis, S. Holmes, and M. Shahshahani. Sampling from a manifold. In Advances in Modern Statistical
Theory and Applications, pages 102–125. Institute of Mathematical Statistics, 2013.
[11] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics letters B, 1987.
[12] H. Federer. Geometric measure theory. Springer, 2014.
[13] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. Book in preparation for MIT Press, 2016.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.
Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[15] C. C. Gordon, T. Churchill, C. E. Clauser, B. Bradtmiller, J. T. McConville, I. Tebbets, and R. A. Walker.
Anthropometric survey of US army personell: Final report. Technical report, United States Army, 1988.
[16] C. Hartmann. An ergodic sampling scheme for constrained Hamiltonian systems with applications to
molecular dynamics. Journal of Statistical Physics, 2008.
[17] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the International
Conference on Learning Representations (ICLR), 2014.
[18] B. Leimkuhler and C. Matthews. Efficient molecular dynamics using geodesic integration and solvent–
solute splitting. In Proc. R. Soc. A, volume 472, page 20160138. The Royal Society, 2016.
[19] B. Leimkuhler and S. Reich. Simulating Hamiltonian dynamics. Cambridge University Press, 2004.
[20] B. J. Leimkuhler and R. D. Skeel. Symplectic numerical integrators in constrained Hamiltonian systems.
Journal of Computational Physics, 1994.
[21] T. Lelièvre, M. Rousset, and G. Stoltz. Langevin dynamics with constraints and computation of free energy
differences. Mathematics of computation, 2012.
[22] J.-M. Marin, P. Pudlo, C. P. Robert, and R. J. Ryder. Approximate Bayesian computational methods.
Statistics and Computing, 2012.
[23] P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré. Markov chain Monte Carlo without likelihoods.
Proceedings of the National Academy of Sciences, 2003.
[24] E. Meeds, R. Leenders, and M. Welling. Hamiltonian ABC. In Proceedings of 31st Conference of
Uncertainty in Artificial Intelligence, 2015.
[25] T. Meeds and M. Welling. Optimization Monte Carlo: Efficient and embarrassingly parallel likelihood-free
inference. In Advances in Neural Information Processing Systems, 2015.
[26] I. Murray. Differentiation of the Cholesky decomposition. arXiv preprint arXiv:1602.07527, 2016.
[27] I. Murray and M. M. Graham. Pseudo-marginal slice sampling. In International Conference on Artificial
Intelligence and Statistics, 2016.
[28] R. M. Neal. Slice sampling. Annals of statistics, 2003.
[29] R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2011.
[30] M. Plummer, N. Best, K. Cowles, and K. Vines. CODA: Convergence diagnosis and output analysis for
MCMC. R News, 2006.
[31] M. J. D. Powell. Numerical Methods for Nonlinear Algebraic Equations, chapter A Hybrid Method for
Nonlinear Equations. Gordon and Breach, 1970.
[32] O. Ratmann, C. Andrieu, C. Wiuf, and S. Richardson. Model criticism based on likelihood-free inference,
with an application to protein network evolution. Proceedings of the National Academy of Sciences, 2009.
[33] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in
deep generative models. In Proceedings of The 31st International Conference on Machine Learning, 2014.
[34] C. P. Robert, K. Mengersen, and C. Chen. Model choice versus model criticism. Proceedings of the
National Academy of Sciences of the United States of America, 2010.
[35] J.-P. Ryckaert, G. Ciccotti, and H. J. Berendsen. Numerical integration of the Cartesian equations of motion
of a system with constraints: molecular dynamics of n-alkanes. Journal of Computational Physics, 1977.
[36] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, 2016.
[37] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings
of the 28th International Conference on Machine Learning, 2011.
12
Download