Asymptotically exact inference in deep generative models and differentiable simulators

Matthew M. Graham
School of Informatics, University of Edinburgh
m.m.graham@ed.ac.uk

Amos Storkey
School of Informatics, University of Edinburgh
a.storkey@ed.ac.uk

arXiv:1605.07826v2 [stat.CO] 12 Jul 2016

Abstract

Many generative models can be expressed as a deterministic differentiable function of random inputs drawn from some simple probability density. This framework includes both deep generative architectures such as Variational Autoencoders and a large class of dynamical system simulators. We present a method for performing efficient MCMC inference in such models when conditioning on observations of the model output. For some models this offers an asymptotically exact inference method where Approximate Bayesian Computation might otherwise be employed. We use the intuition that conditional inference corresponds to integrating a density across the manifold corresponding to the set of inputs consistent with the observed outputs. This motivates the use of a constrained variant of Hamiltonian Monte Carlo which leverages the smooth geometry of the manifold to coherently move between states exactly satisfying the constraint. We validate the method by performing inference tasks in a diverse set of models: parameter inference in a dynamical predator-prey simulation, joint 3D pose and camera model inference from 2D projections and image in-painting with a generative model of MNIST digit images.

1 Introduction

Developments in generative modelling with deep architectures such as Variational AutoEncoders (VAEs) [17, 33] and Generative Adversarial Nets (GANs) [14] have made it possible to learn probabilistic models of increasingly complex, high-dimensional datasets.
These models can be formulated as a differentiable function g : 𝒰 → 𝒴 which maps random inputs U ∈ 𝒰 ⊆ R^M from a simple base density p_U[u] to an implicit learnt distribution over outputs Y ∈ 𝒴 ⊆ R^N; the term Differentiable Generator Net (DGN) has been suggested to encompass such models [13].

The DGN framework also encapsulates a wide class of simulator models defined procedurally (e.g. numerical integration of a system of stochastic differential equations) with randomness introduced by draws from a random number generator. The operations used in such simulations are often differentiable, but manually calculating the derivatives of the simulator outputs with respect to inputs would be tedious at best and unmanageable for larger computational graphs. Fortunately automatic differentiation provides a computationally efficient framework for exactly calculating such derivatives given just the code used to define the forward model, with the implementations in frameworks such as Theano [36] able to cope with an increasingly wide class of models.

The end-to-end differentiability of DGNs is key in allowing them to be efficiently fit to data, usually with some variant of stochastic gradient descent. The usefulness of differentiability extends beyond learning, however: inference methods such as Hamiltonian Monte Carlo (HMC) fundamentally rely on gradient information to allow coherent exploration of high-dimensional state spaces.

A key application of generative models is to perform inference given observations of the output of the system. For example in a generative model of images we may wish to infer plausible in-paintings of an image patch given knowledge of the surrounding pixel values. Parameter inference can be seen as another instance of this task.
Here parameters Θ are considered as part of the model output Y = [Θ; X], sampled from their prior via a deterministic function of a subset of the random inputs U, and the model is conditioned on observations of the remaining ‘data’ outputs X. The data outputs are generated from parameters by another deterministic function which takes as input the parameters Θ and the remaining random inputs which produce the variability in X given Θ.

For many generators we cannot evaluate the marginal density on the outputs p_Y[y]. In a parameter inference setting this usually corresponds to not being able to evaluate the likelihood p_{X | Θ}[x | θ]. The lack of a closed form density precludes direct use of approximate inference methods such as variational inference and Markov chain Monte Carlo (MCMC) in the output space.

2 Approximate Bayesian Computation

A lack of explicit likelihoods is the motivation for Approximate Bayesian Computation (ABC) methods [4, 22]. In ABC the simulated data outputs X are decoupled from the observed data X̆ by a noise model or kernel p_{X̆ | X}[x̆ | x], typically parametrised by a tolerance parameter ε, e.g. uniform ball kernel radius or Gaussian kernel standard deviation. The ABC likelihood is then defined as

$$ p_{\breve{X} \mid \Theta}[\breve{x} \mid \theta] = \int_{\mathcal{X}} p_{\breve{X} \mid X}[\breve{x} \mid x] \, p_{X \mid \Theta}[x \mid \theta] \, \mathrm{d}x. \tag{1} $$

An unbiased Monte Carlo estimate can be calculated from one or more simulated model outputs

$$ p_{\breve{X} \mid \Theta}[\breve{x} \mid \theta] \approx \frac{1}{S} \sum_{s=1}^{S} p_{\breve{X} \mid X}\left[\breve{x} \mid x^{(s)}\right], \qquad x^{(s)} \sim p_{X \mid \Theta}[\,\cdot \mid \theta] \quad \forall s \in \{1, \dots, S\}. \tag{2} $$

This unbiased ABC likelihood estimator can be used in various Monte Carlo inference schemes. In the ABC rejection algorithm, the kernel is typically uniform across some bounded region defined by a distance measure between the simulated and true data being less than ε, with parameters proposed from the prior accepted if the simulated data lies within ε of the true data and rejected otherwise.
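The ABC rejection scheme just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the toy simulator (Gaussian data with unknown mean), the standard normal prior, the Euclidean distance and all function names are hypothetical choices that do not come from the text.

```python
import math
import random


def simulate(theta, n_obs, rng):
    """Hypothetical toy simulator: n_obs draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n_obs)]


def distance(x, x_obs):
    """Euclidean distance between simulated and observed data vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_obs)))


def abc_rejection(x_obs, epsilon, n_proposals, rng):
    """Keep prior draws whose simulated data lie within epsilon of x_obs."""
    accepted = []
    for _ in range(n_proposals):
        theta = rng.gauss(0.0, 1.0)           # propose from the (assumed) prior
        x = simulate(theta, len(x_obs), rng)  # simulate data given theta
        if distance(x, x_obs) < epsilon:      # uniform ball kernel of radius epsilon
            accepted.append(theta)
    return accepted


rng = random.Random(0)
x_obs = [0.9, 1.1]
samples = abc_rejection(x_obs, epsilon=0.5, n_proposals=20000, rng=rng)
```

Shrinking `epsilon` towards zero reduces the approximation error but also the acceptance rate; with `epsilon=0.0` no proposal can ever be accepted for continuous data.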
In ABC MCMC methods [23], the ABC likelihood estimator is instead used in a Metropolis–Hastings acceptance step, with new model parameters proposed from some perturbative proposal distribution given the current values. If the simulated outputs x^(s) used in the estimator are maintained as part of the Markov chain state, this can be seen as an instance of a pseudo-marginal (PM) method [3].

The kernel can have a probabilistic interpretation as representing uncertainty introduced by measurement error or mismatch between the unknown generative process by which the observed data was produced and the modelled generator [32]. However in practice the kernel is often motivated on practical terms and the tolerance parameter fixed on computational efficiency grounds rather than treated as some unobserved variable to be inferred [34].

A common explanation for the use of a kernel in ABC is that for continuous data the probability of generating simulated data exactly matching the observed data is zero if using an ABC rejection type scheme [22]: the observed data forms a zero measure set. ABC therefore can be seen as relaxing the constraint that true and simulated data exactly match and instead accepting parameters which produce simulated data within some region around the true data. From this perspective the distribution the parameters are sampled from only approximates the ‘true’ posterior.

3 Conditional inference in differentiable generative models

In a typical VAE model for continuous data, the generator is specified with a hierarchical Gaussian structure, with Gaussian distributed latent variables Z parametrising some lower dimensional manifold in the output space via a function m(Z) specified by a feed-forward network. Additional Gaussian noise N with dim(N) = dim(Y) is added, allowing points in the output space off the manifold to be generated, scaled by a further function s(Z) specified by a feed-forward network.
Considering the overall input as U = [Z; N] the generator function is therefore g(U) = m(Z) + s(Z) N.

In VAE models we could exploit the hierarchical structure to perform MCMC inference in the joint distribution p_{Y,Z}[y, z] = p_{Y | Z}[y | z] p_Z[z] (or conditionals thereof) over latent variables Z and outputs Y, which is evaluable as both the latent marginal p_Z[z] and output conditional p_{Y | Z}[y | z] are of known forms (e.g. Gaussian). However such hierarchical model structures are known to be challenging for MCMC methods to explore well [28], with often large variations in scale across the joint space and tight coupling between latent variables and outputs which can prove challenging even for MCMC methods designed for densities with complex geometry like HMC [5].

Rather than defining the Markov chain on [Y; Z], here we propose to instead use U. As the outputs Y are fully determined by U there is no loss of information in this alternative parametrisation. Importantly we can also use U as the state with arbitrary generator functions without any assumption of some hierarchical structure, including likelihood-free simulator models typically used in ABC.

Generating samples of U to compute unconditional expectations of functions of Y is trivial: we simply need to define an MCMC dynamic which leaves p_U[u] invariant. Often we can in fact generate exact independent samples from p_U[u].

It is not immediately clear however how to perform conditional inference. More explicitly if we partition the generator output as Y = [Y_1; Y_2] = [g_1(U); g_2(U)], how can we produce MCMC samples of U which will allow us to estimate expectations of Y given Y_2 = y̆_2 has been observed? Conditioning on part of the output maps to constraining the outputs to be in an axis-aligned linear subspace.
However, as the mapping from inputs to outputs in the generator is generally a non-linear function, this corresponds to restricting the inputs to a non-linear manifold 𝒞 = {u ∈ 𝒰 : g_2(u) = y̆_2} embedded in the input space. We claim here and show in Appendix A that if we generate a set of input MCMC samples {u^(s)}, s = 1, ..., S, restricted to 𝒞 and with invariant density

$$ \pi_U[u] \propto p_U[u] \left| \frac{\partial g_2}{\partial u} \frac{\partial g_2}{\partial u}^{\mathsf{T}} \right|^{-\frac{1}{2}} \tag{3} $$

with respect to the natural reference (Hausdorff) measure for the manifold, then

$$ \mathbb{E}\left[ f(Y) \mid Y_2 = \breve{y}_2 \right] = \lim_{S \to \infty} \frac{1}{S} \sum_{s=1}^{S} f \circ g\left(u^{(s)}\right). \tag{4} $$

Intuitively the determinant term in (3) adjusts for the contraction / expansion of the infinitesimal ‘thickness’ (extent in directions orthogonal to the tangent space) of the manifold when mapping through the generator function.

A general framework for performing conditional inference in differentiable generators is therefore to define MCMC updates which explore the target density (3) on the constraint manifold 𝒞. We propose here to use a HMC method which simulates the dynamics of a constrained mechanical system.

4 Constrained Hamiltonian Monte Carlo

In standard HMC [11, 29] the vector variable of interest U is augmented with a momentum variable P ∈ R^M. The momentum is taken to be independently Gaussian distributed with zero mean and covariance M, often called the mass matrix. The new joint target density is then π_{U,P}[u, p] ∝ exp[−H(u, p)] where H(u, p) = − log π_U[u] + ½ p^T M^{−1} p is termed the Hamiltonian. The canonical Hamiltonian dynamic is described by the system of ordinary differential equations

$$ \frac{\mathrm{d}u}{\mathrm{d}t} = \frac{\partial H}{\partial p} = \mathbf{M}^{-1} p, \qquad \frac{\mathrm{d}p}{\mathrm{d}t} = -\frac{\partial H}{\partial u} = \frac{\partial \log \pi_U[u]}{\partial u}. \tag{5} $$

This dynamic is time-reversible, measure-preserving and exactly conserves the Hamiltonian. Symplectic numerical integrators allow approximate integration of the Hamiltonian flow while maintaining the time-reversibility and measure-preservation properties.
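As a concrete instance of a symplectic integrator for the unconstrained dynamic (5), the Störmer-Verlet (‘leapfrog’) scheme can be sketched as below. This is an illustrative sketch only: it assumes M = I and a standard normal target density, and the function names and step settings are hypothetical rather than taken from the text.

```python
def grad_log_density(u):
    """Gradient of the log density of an assumed standard normal target."""
    return [-ui for ui in u]


def hamiltonian(u, p):
    """H(u, p) = -log pi(u) + p.p / 2 (up to a constant), with M = I."""
    return 0.5 * sum(ui * ui for ui in u) + 0.5 * sum(pi * pi for pi in p)


def leapfrog(u, p, dt, n_steps):
    """Simulate the dynamic (5) for n_steps leapfrog steps of size dt."""
    u, p = list(u), list(p)
    g = grad_log_density(u)
    for _ in range(n_steps):
        p = [pi + 0.5 * dt * gi for pi, gi in zip(p, g)]  # half momentum step
        u = [ui + dt * pi for ui, pi in zip(u, p)]        # full position step
        g = grad_log_density(u)
        p = [pi + 0.5 * dt * gi for pi, gi in zip(p, g)]  # half momentum step
    return u, p


u0, p0 = [1.0, -0.5], [0.3, 0.8]
u1, p1 = leapfrog(u0, p0, dt=0.1, n_steps=50)
# Reversibility check: negate the momentum and integrate the same number of steps.
u2, p2 = leapfrog(u1, [-pi for pi in p1], dt=0.1, n_steps=50)
```

Integrating forward, negating the momentum and integrating again returns the chain to its starting position up to floating-point error, while the Hamiltonian changes only slightly along the trajectory.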
Further, subject to stability bounds on the time-step, such integrators will exactly conserve some ‘nearby’ Hamiltonian, and so the change in the Hamiltonian will tend to remain small even over long simulated trajectories [19].

These properties make simulated Hamiltonian dynamics an ideal proposal mechanism for a Metropolis MCMC method. The Metropolis accept ratio for a proposal (u_p, p_p), generated by simulating the dynamic N_s time steps forward from (u, p) with a symplectic integrator and then negating the momentum, is simply exp{H(u, p) − H(u_p, p_p)}. Typically the change in the Hamiltonian will be small and so the probability of acceptance high. To ensure ergodicity, dynamic moves can be interspersed with updates independently sampling a new momentum from N(0, M).

In our case the system is subject to a constraint of the form c(u) = 0. By introducing Lagrangian multipliers λ_i for each of the constraints, the Hamiltonian for a constrained system can be written as

$$ H(u, p) = -\log \pi_U[u] + \frac{1}{2} p^{\mathsf{T}} \mathbf{M}^{-1} p + \lambda^{\mathsf{T}} c(u), $$

with the corresponding constrained Hamiltonian dynamic being

$$ \frac{\mathrm{d}u}{\mathrm{d}t} = \mathbf{M}^{-1} p, \qquad \frac{\mathrm{d}p}{\mathrm{d}t} = \frac{\partial \log \pi_U[u]}{\partial u} - \frac{\partial c}{\partial u}^{\mathsf{T}} \lambda, \tag{6} $$

$$ c(u) = 0, \qquad \frac{\partial c}{\partial u} \mathbf{M}^{-1} p = 0. \tag{7} $$

Two popular symplectic numerical integrators for simulating the constrained dynamics are SHAKE [35] and its variant RATTLE [2]. These methods are generalisations of the Störmer-Verlet (‘leapfrog’) integrator typically used in standard HMC, and maintain the favourable properties of being time-reversible, measure-preserving and having bounded Hamiltonian error growth.

The use of constrained dynamics in HMC has been proposed several times. In the molecular dynamics literature, both [16] and [21] suggest using a simulated constrained dynamic within a HMC framework to estimate free-energy profiles. Most relevantly here Brubaker et al. proposed using a constrained HMC variant to perform inference in distributions defined on non-Euclidean manifolds [7].
They give sufficient conditions on H, 𝒞 and c for the scheme to satisfy detailed balance and be ergodic: that H is C² continuous, that 𝒞 is a connected, smooth and differentiable manifold, and that ∂c/∂u has full row-rank everywhere.

5 Method

Our constrained HMC implementation is shown in Algorithm 1. We use a generalisation of the RATTLE scheme to simulate the dynamic. The inner updates of the state to solve for the geodesic motion on the constraint manifold are split into multiple smaller steps, which can be considered a special case of the scheme described in [18]. This allows more flexibility in choosing an appropriately small step-size to ensure convergence of the iterative solution of the equations projecting on to the constraint manifold while still allowing a more efficient larger step size for updates to the momentum due to the negative log density gradient. We have assumed M = I here; other mass matrix choices can be equivalently implemented by adding an initial linear transformation stage in the generator.

Algorithm 1 Constrained Hamiltonian Monte Carlo algorithm

Input:
    (u, p): current state pair with c(u) = 0 and (∂c/∂u) p = 0
    (C, L): constraint Jacobian and Gram matrix Cholesky decomposition at current u
    ε: Newton iteration tolerance
    δt: time step
    N_s: number of time steps to simulate
    N_i: number of inner geodesic updates per time step
Output:
    (u′, p′): new state pair with c(u′) = 0, (∂c/∂u)|_{u′} p′ = 0 and if (u, p) ∼ π_{U,P}[·, ·] then (u′, p′) ∼ π_{U,P}[·, ·]
    (C′, L′): constraint Jacobian and Gram matrix Cholesky decomposition at new u′

    u_p, p_p, C_p, L_p ← SIMDYNAMIC(u, p, C, L)
    r ← Uniform(0, 1)
    if r < exp{H(u, p) − H(u_p, p_p)} then
        u′, p′, C′, L′ ← u_p, p_p, C_p, L_p
    else
        u′, p′, C′, L′ ← u, p, C, L
    n ← N(0, I)
    p′ ← PROJECTMOM(n, C′, L′)

    function SIMDYNAMIC(u, p, C, L)
        for s ∈ {1 ... N_s} do
            if s = 1 then
                p̃ ← p + (δt/2) ∂log π_U/∂u |_u
            else
                p̃ ← p + δt ∂log π_U/∂u |_u
            p ← PROJECTMOM(p̃, C, L)
            u, p, C, L ← INNERSTEP(u, p, C, L)
        p̃ ← p + (δt/2) ∂log π_U/∂u |_u
        p ← PROJECTMOM(p̃, C, L)
        return u, p, C, L

    function PROJECTMOM(p, C, L)
        return p − C^T L^{−T} L^{−1} C p

    function PROJECTPOS(u, C, L)
        c ← c(u)
        while ‖c‖_∞ > ε do
            u ← u − C^T L^{−T} L^{−1} c
            c ← c(u)
        return u

    function INNERSTEP(u, p, C, L)
        for i ∈ {1 ... N_i} do
            ũ ← u + (δt/N_i) p
            u′ ← PROJECTPOS(ũ, C, L)
            C ← ∂c/∂u |_{u′}
            L ← chol(C C^T)        (line 36)
            p̃ ← (N_i/δt) (u′ − u)
            p ← PROJECTMOM(p̃, C, L)
            u ← u′
        return u, p, C, L

Each inner dynamics time-step involves making an unconstrained update u → ũ and then projecting ũ back on to 𝒞 by solving for λ which satisfy c(ũ − δt² (∂c/∂u)^T λ / 2) = 0. This is performed in the function PROJECTPOS in Algorithm 1. Here we use a quasi-Newton method for solving the system of equations in the projection step. The true Newton update would be

$$ u^{(t+1)} \leftarrow u^{(t+1)} - \left. \frac{\partial c}{\partial u}^{\mathsf{T}} \right|_{u^{(t)}} \left\{ \left. \frac{\partial c}{\partial u} \right|_{u^{(t+1)}} \left. \frac{\partial c}{\partial u}^{\mathsf{T}} \right|_{u^{(t)}} \right\}^{-1} c\left(u^{(t+1)}\right). \tag{8} $$

This requires recalculating the constraint Jacobian and solving a dense linear system within the optimisation loop. Instead we use a symmetric quasi-Newton update, with the Jacobian Gram matrix (∂c/∂u)(∂c/∂u)^T evaluated at the previous state used to condition the moves. This matrix is guaranteed to be positive-definite and a Cholesky decomposition can be calculated outside the optimisation loop, allowing cheaper quadratic cost solves within the loop. In the rare cases where the quasi-Newton iteration fails we fall back to a MINPACK implementation of the robust Powell’s Hybrid method [31].

For larger systems, the Cholesky decomposition of the constraint Jacobian Gram matrix (∂c/∂u)(∂c/∂u)^T (line 36) will become a dominant computational cost, generally scaling cubically with the number of constraints.
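The quasi-Newton projection performed by PROJECTPOS can be sketched as follows. This is a hedged toy example rather than the implementation used for the experiments: the unit-circle constraint, the function names and the iteration cap are assumptions made for illustration, and with a single constraint the Cholesky factor of the Gram matrix reduces to a scalar.

```python
import math


def constraint(u):
    """Assumed toy constraint: points on the unit circle satisfy c(u) = 0."""
    return u[0] ** 2 + u[1] ** 2 - 1.0


def constraint_jacobian(u):
    """Jacobian of the toy constraint (a single row)."""
    return [2.0 * u[0], 2.0 * u[1]]


def project_pos(u_tilde, C, L, tol=1e-12, max_iters=100):
    """Quasi-Newton projection using the Jacobian C fixed at the previous state.

    With one constraint the Cholesky factor L of C C^T is the scalar |C|.
    """
    u = list(u_tilde)
    c = constraint(u)
    for _ in range(max_iters):
        if abs(c) <= tol:
            return u
        lam = c / (L * L)                            # solves (L L^T) lam = c
        u = [ui - Ci * lam for ui, Ci in zip(u, C)]  # u <- u - C^T lam
        c = constraint(u)
    raise RuntimeError("projection did not converge; a fallback solver is needed")


u_prev = [1.0, 0.0]                      # previous state on the manifold
C = constraint_jacobian(u_prev)
L = math.sqrt(sum(Ci * Ci for Ci in C))  # chol(C C^T) for a single constraint
u_new = project_pos([1.0, 0.3], C, L)    # unconstrained move, then projection
```

Because the Jacobian is held fixed at the previous state, each iteration only moves the point along the row space of that Jacobian, which is why the second coordinate is left unchanged in this example.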
The elementwise or autoregressive noise structures of many models however allow a significantly reduced quadratically scaling computational cost, as explained in Appendix C.

The momentum updates in the SIMDYNAMIC routine require evaluating the gradient of the logarithm of the target density (3). This can be achieved by using automatic differentiation to directly calculate the gradient from the expression given in (3), however both the log-density and gradient can be more efficiently calculated by reusing the Cholesky decomposition of the constraint Jacobian Gram matrix computed in line 36. Details are given in Appendix B.

In the PROJECTPOS routine convergence is signalled when the elementwise maximum absolute value of the constraint function is below some tolerance ε. This acts analogously to the ε parameter in ABC methods, however here we typically set this parameter to some multiple of machine precision and so the approximation introduced is comparable to that otherwise incurred for using non-exact arithmetic.

A final implementation detail is the requirement to find some initial u satisfying c(u) = 0. If the generator has one of the two structures described in Appendix C this can usually be easily achieved. In both cases z can be chosen arbitrarily then n found by solving for each of the output constraints independently (elementwise) or sequentially (autoregressive). For more general generators, we could choose a subset of the inputs (or linear projections of the inputs) of equal dimensionality to the number of constraints and plug the resulting system of equations into a black-box solver.

6 Related work

Closely related is the aforementioned Constrained HMC method of [7]. This demonstrates the validity of the constrained HMC framework theoretically and experimentally, and we do not claim any original contribution in this respect.
The focus in [7] is on performing inference in distributions inherently defined on some fixed non-Euclidean manifold such as the unit sphere or space of orthogonal matrices. Our work builds on [7] by highlighting that conditioning on the output of a generator imposes a constraint on its inputs and so defines a density in input space restricted to some manifold. Unlike the cases considered in [7] our constraints are therefore data-driven and the target density on the manifold implicitly defined by a generator function and base density.

Geodesic Monte Carlo [8] also considers applying a HMC scheme to sample from non-linear manifolds embedded in a Euclidean space. Similarly to [7] however the motivation is performing inference with respect to distributions explicitly defined on a manifold such as directional statistics. The method presented in [8] requires an exact solution for the geodesic flow on the manifold. Our use of constrained Hamiltonian dynamics, and in particular the geodesic integration scheme of [20], can be considered an extension for cases when an exact geodesic solution is not available. Instead the geodesic flow is approximately simulated while still maintaining the required measure-preservation and reversibility properties for validity of the overall HMC scheme.

Sampling from a Manifold [10] gives a clear exposition on the theory underlying sampling from distributions on embedded submanifolds, including a derivation of the form for conditional densities in this setting, of which (3) is a particular case.

Hamiltonian ABC [24] also proposes applying HMC to perform conditional inference in the likelihood-free models considered here. An ABC set-up is used with a Gaussian synthetic-likelihood formed by estimating first and second moments from simulated data; this stabilises the likelihood and gradient evaluations for small ε but introduces extra bias on top of that inherent to ABC methods.
Rather than using automatic differentiation to exactly calculate gradients of the generator function, Hamiltonian ABC proposes using an unbiased stochastic gradient estimator. Having only a stochastic gradient requires the use of methods from the series of works extending HMC to such cases [9, 37]. It has been suggested however that the use of stochastic gradients can destroy the favourable properties of Hamiltonian dynamics which enable coherent exploration of high dimensional state spaces [6]. In Hamiltonian ABC it is also observed that representing the generative model as a deterministic function by fixing the random inputs to the generator is a useful method for improving exploration of the state space. This is achieved by including the seed of the pseudo-random number generator in the chain state rather than the set of random inputs as we do.

Also related is Optimisation Monte Carlo [25]. The authors propose using an optimiser to find parameters of a simulator model consistent with observed data (to within some tolerance ε) given fixed random inputs sampled independently. The optimisation is not measure-preserving and so the Jacobian of the map is approximated with finite differences to weight the samples. Our method also uses an optimiser to find inputs consistent with the observations, however by using a measure-preserving dynamic we avoid having to re-weight samples, which can scale poorly with dimensionality. Our method also differs in treating all inputs to a generator equivalently; while the Optimisation Monte Carlo authors similarly identify the simulator models as deterministic functions they distinguish between parameters and random inputs, optimising the first and independently sampling the latter. This can lead to random inputs being sampled for which no parameters can be found consistent with the observations (even with a ‘soft’ within ε constraint). Although optimisation failure is also potentially an issue for our method, as we jointly optimise all inputs and are approximating some exact continuous time constraint-satisfying dynamic we found this occurred rarely in practice if an appropriate step size is chosen. Our method can also be applied in cases where the parameter dimensionality is greater than the number of output constraints, unlike in Optimisation Monte Carlo.

Figure 1: Lotka–Volterra. (a) Observed predator-prey populations (solid) and ABC sample trajectories with ε = 10 (dashed) and ε = 100 (dot-dashed). (b) Marginal empirical histograms for the four parameters (columns) from constrained HMC samples (top) and ABC samples with ε = 10 (middle) and ε = 100 (bottom). Horizontal axes shared across columns. Red arrows indicate true parameter values. (c) Mean ESS normalised by compute time for each of four parameters for ABC with ε = 10 (red), ε = 100 (green) and our method (blue). Error bars show ±3 standard errors of mean.

7 Experiments

To illustrate the general applicability of our method we performed conditional inference tasks in three diverse settings: parameter inference in a stochastic Lotka–Volterra predator-prey model simulation, three-dimensional human pose and camera parameter inference given two-dimensional joint position information and finally in-painting of missing regions of digit images using a generative model trained on MNIST. In all three experiments Theano was used to specify the generator function and calculate the required derivatives. All experiments were run on an Intel Core i5-2400 quad-core CPU.

7.1 Lotka–Volterra model parameter inference

As a first demonstration we considered a stochastic continuous time and state variant of the Lotka–Volterra model, a common example problem for ABC methods e.g. [25].
In particular we consider parameter inference given a simulated solution of the following stochastic differential equations

$$ \mathrm{d}q_1 = (\theta_1 q_1 - \theta_2 q_1 q_2)\, \mathrm{d}t + \mathrm{d}n_1, \qquad \mathrm{d}q_2 = (-\theta_3 q_2 + \theta_4 q_1 q_2)\, \mathrm{d}t + \mathrm{d}n_2, \tag{9} $$

where q_1 and q_2 represent the prey and predator population respectively, θ = {θ_i}, i = 1, ..., 4, the system parameters and n_1 and n_2 zero-mean, unit variance white noise processes. To generate the observed data, an Euler-Maruyama discretisation was used with time-step 1, initial condition q_1 = q_2 = 100 and θ = [0.4, 0.005, 0.05, 0.001] (chosen to give stable oscillatory dynamics). The simulation was run for 50 time-steps with the observed data defined as the concatenated vector x = [q_1^(1) q_2^(1) ... q_1^(50) q_2^(50)]. The parameters were given log-normal priors θ_i ∼ log N(−2, 1).

We compared our method to various ABC approaches (§2) using a uniform ball kernel with radius ε. ABC rejection failed catastrophically, with no acceptances in 10^6 samples even with a large ε = 1000. Standard random-walk ABC MCMC also performed very poorly, with the dynamic having zero acceptances over multiple runs of 10^5 updates for ε = 100 and getting stuck at points in parameter space over many updates for larger ε = 1000, even with small proposal steps. Because of this poor performance we used a recent method from the pseudo-marginal literature, which uses alternating elliptical slice sampling updates of the parameters and random draws [27]. The slice sampling updates locally adapt the size of steps made to ensure a move can always be made. Using this method we were able to obtain reasonable convergence over long runs for both ε = 100 and ε = 10.

The results are summarised in Figure 1. Figure 1a shows the simulated data used as observations and ABC sample trajectories for ε = 10 and ε = 100. Though both samples follow the general trends of the observed data there are large discrepancies, particularly for ε = 100.
Our method in contrast samples parameters generating trajectories exactly matching the observations at all points. Figure 1b shows the marginal histograms for the parameters. The inferred posteriors on the parameters are significantly more tightly distributed about the true values used to generate the observations for our approach and the ε = 10 case compared to the results for ε = 100; even for the ε = 10 case however it can be seen that there are spurious-appearing peaks in the distributions for θ_3 and θ_4. Figure 1c shows the relative sampling efficiency of our approach against the ABC methods, as measured by the effective sample sizes (ESS) (computed with R-CODA [30]) normalised by run time averaged across 10 sampling runs for each method. Despite the significantly higher per-update cost in our method, the coherent movement about the state space afforded by the Hamiltonian dynamic gave significantly better performance even over the very approximate ε = 100 case.

Figure 2: Human pose. (a) RMSEs of 3D pose posterior mean estimates given binocular projections, using samples from our method (blue) versus running HMC in hierarchical model (red) for three different scenes sampled from prior. Horizontal axes show computation time to produce number of samples in estimate. Solid curves are average RMSE over 10 runs with different seeds and shaded regions show ±3 standard errors of mean. (b) Orthographic projections (top front view, bottom side view) of 3D poses consistent with monocular projections. Left most pair (black) shows true pose used to generate observations, right three pairs (coloured) show constrained HMC posterior samples.

7.2 Human pose and camera model inference

For our next experiment we considered inferring a 3D human pose and camera model from 2D projection(s) of joint positions.
We used a 19 joint skeleton model, with a prior density over poses parametrised by 47 local joint angles Θ learnt from the PosePrior motion capture dataset [1] with a VAE with a 30 dimensional latent representation Z. For the bone lengths L, a log-normal model was fitted to data from the ANSUR anthropometric dataset [15], with 13 independent lengths being specified due to symmetry. A simple pin-hole projective camera model with 3 position parameters P and fixed focal-length was used.¹ A log-normal prior was placed on the depth position P_3 to enforce positivity, with normal priors on P_1 and P_2. A small amount of Gaussian noise was added to the projected positions to give the observations Y. This ensured the constraint Jacobian was full row-rank everywhere and gave a known hierarchical Gaussian density on {Y, Θ, Z, log L, P_{1:2}, log P_3}.

We first considered binocular pose estimation, with the 3D scene information inferred given 2D projections from two cameras with a known offset between them. In this stereo vision case, the disparity between the projections gives information about the depth direction and so we would expect the posterior distribution on the 3D pose to be tightly distributed around the true values used to generate the observations. We compared our constrained HMC method to running standard HMC on the known (unnormalised) conditional density of {Θ, Z, log L, P_{1:2}, log P_3} given Y.

Figure 2a shows the root mean squared error (RMSE) between the posterior mean estimate of the 3D joint positions and the true positions used to generate the observations as the number of samples included in the estimate increases for three test scenes. For both methods the horizontal axes have been scaled by computation time and the curves shown are averages across 10 runs. Initial pilot runs were used to find appropriate settings for the time step and number of integrator steps.
The constrained HMC method (red curves) tends to give significantly more accurate estimates, particularly over longer periods. Visually inspecting the sampled poses and individual run traces (not shown) it seems that the HMC runs tended to often get stuck in local modes corresponding to a subset of joints being ‘incorrectly’ positioned while still broadly matching the (noisy) projections. The complex dependencies of the joint positions on the angle parameters mean the dynamic struggles to find an update which brings the ‘incorrect’ joints closer to their true positions without moving other joints out of line. The constrained HMC method seems to have been less susceptible to this issue.

We also considered inferring 3D scene information from a single 2D projection. Monocular projection is inherently information destroying, with significant uncertainty as to the true pose and camera parameters which generated the observations. Figure 2b shows pairs of orthographic projections of 3D poses: the left most column is the pose used to generate the projection conditioned on and the right three columns are poses sampled using constrained HMC consistent with the observations. The top row shows front x–y views, corresponding to the camera view though with an orthographic rather than perspective projection; the bottom row shows side z–y views with the z axis the depth from the camera. The dynamic is able to move between a range of plausible appearing poses consistent with the observations while reflecting the inherent depth ambiguity from the monocular projection.

¹ The camera orientation was assumed fixed to avoid replicating the degrees of freedom specified by the angular orientation of the root joint of the skeleton: only the relative camera–skeleton orientation is important.

Figure 3: MNIST. In-painting samples generated using constrained HMC (a) and standard HMC (b).
The top black-on-white quarter of each image is the fixed observed region and the remaining white-on-black region the proposed in-painting. In left-right scan order the images in (a) are consecutive samples from a run; in (b) the images are every 40th sample to account for the quicker updates.

7.3 MNIST in-painting

As a final task we considered inferring in-paintings for missing regions of digit images Y_INF given knowledge of the rest of the pixel values Y_OBS. A Gaussian VAE trained on MNIST was used as the generative model, with a 50-dimensional latent representation Z. We compared our method to running standard HMC in the known conditional density on Z given Y_OBS (Y_INF is conditionally independent of Y_OBS given Z and so can be directly sampled from its Gaussian conditional).

Example samples are shown in Figure 3. In this case the constrained and standard HMC approaches appear to be performing similarly, with both seeming to be able to find a range of plausible in-paintings given the observed pixels. Without cost adjustment the standard HMC samples show greater correlation between subsequent updates, however for a fairer comparison the samples shown were thinned to account for the approximately 40× larger run-time per constrained HMC sample. Although the constrained dynamic does not improve efficiency here neither does it seem to hurt it.

8 Discussion

We have presented a generally applicable framework for performing conditional inference in generative models specified by a differentiable function of continuous random inputs. Though simulating the constrained Hamiltonian dynamics is computationally costly, the resulting coherent exploration of the state space can lead to significantly improved sampling efficiency over alternative methods.

Further, our approach allows asymptotically exact inference in differentiable likelihood-free models where ABC methods might otherwise be used.
We suggest an approach for dealing with two of the key issues in ABC set-ups: allowing inference in continuous spaces as the kernel parameter collapses to zero, and providing a method for updating the random draws used in simulator models which maintains a large probability of acceptance in high-dimensional spaces. As well as being of practical importance itself, this approach should prove useful in providing ‘ground truth’ inferences in more complex models to assess the effect of the approximations used in various ABC methods on the quality of the inference.

A limitation of our method is the (strong) requirement of differentiability of the generator. This prevents applying our approach to generative models which use discontinuous operations or discrete random inputs. For discrete inputs one option would be to alternate between updating the discrete inputs given fixed continuous inputs with some valid MCMC move and applying our approach to update the continuous inputs for fixed discrete inputs; if there is strong coupling between the discrete and continuous inputs, however, this approach could mix poorly.

A common technique in ABC methods is to define a kernel or distance measure in terms of summary statistics of the observed data [22]. If appropriately informative statistics are chosen this helps to alleviate the poor scaling of the methods with dimensionality (in both cost and approximation error). Provided the summary statistics are a differentiable function of the observations, this idea can be quite naturally applied in our approach by defining the constraint function $\mathbf{c}$ in terms of summary statistics computed from the observations $\mathbf{y}$. Given the key dependence of the computational cost of our method on the constraint dimensionality, defining the constraints in terms of lower-dimensional summaries could be key in allowing the method to scale to larger systems.
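As a rough illustration of the summary-statistic idea in the preceding paragraph, the sketch below builds a constraint of the form c(u) = s(g(u)) − s(y_obs) for a toy generator and compares the size of its Jacobian with that of the full-output constraint. Everything here is an assumption made for illustration: the location-scale generator, the choice of summaries (sample mean and mean of squares) and all function names are hypothetical and are not taken from the experiments in this paper.

```python
import numpy as np


def generator(u):
    """Toy generator: u = (mu, log_sigma, n_1..n_N) maps to N outputs."""
    mu, log_sigma, noise = u[0], u[1], u[2:]
    return mu + np.exp(log_sigma) * noise


def generator_jacobian(u):
    """Analytic Jacobian dy/du of the toy generator, shape (N, N + 2)."""
    sigma, noise = np.exp(u[1]), u[2:]
    n = noise.shape[0]
    jac = np.empty((n, n + 2))
    jac[:, 0] = 1.0               # dy_i / dmu
    jac[:, 1] = sigma * noise     # dy_i / dlog_sigma
    jac[:, 2:] = sigma * np.eye(n)  # dy_i / dn_j
    return jac


def summary(y):
    """Hypothetical differentiable summaries: mean and mean of squares."""
    return np.array([y.mean(), (y ** 2).mean()])


def summary_jacobian(y):
    """Jacobian ds/dy of the summaries, shape (2, N)."""
    n = y.shape[0]
    return np.vstack([np.full(n, 1.0 / n), 2.0 * y / n])


def constraint(u, s_obs):
    """Constraint defined on summaries rather than on the raw outputs."""
    return summary(generator(u)) - s_obs


def constraint_jacobian(u):
    """Chain rule: dc/du = ds/dy @ dy/du, shape (2, N + 2)."""
    return summary_jacobian(generator(u)) @ generator_jacobian(u)


def half_log_gram_det(jac):
    """0.5 * log det(J J^T) computed from the Cholesky factor of J J^T."""
    chol = np.linalg.cholesky(jac @ jac.T)
    return np.sum(np.log(np.diag(chol)))


rng = np.random.default_rng(0)
u = rng.standard_normal(22)              # 2 global inputs + 20 noise inputs
s_obs = summary(rng.standard_normal(20))  # stand-in for observed summaries

print(generator_jacobian(u).shape)   # Jacobian of the full-output constraint
print(constraint_jacobian(u).shape)  # Jacobian of the summary constraint
print(half_log_gram_det(constraint_jacobian(u)))
```

With 20 outputs reduced to two summaries, the Gram matrix that must be factorised shrinks from 20 × 20 to 2 × 2, which is the source of the potential saving discussed above. Whether two summaries would be sufficiently informative in a real problem is a separate modelling question that this sketch does not address.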
Acknowledgments

Thanks to Iain Murray for several useful discussions and to Ben Leimkuhler for pointing out the potential of using a geodesic integration scheme. This work was supported in part by grants EP/F500385/1 and BB/F529254/1 for the University of Edinburgh School of Informatics Doctoral Training Centre in Neuroinformatics and Computational Neuroscience (www.anc.ac.uk/dtc) from the UK Engineering and Physical Sciences Research Council (EPSRC), UK Biotechnology and Biological Sciences Research Council (BBSRC), and the UK Medical Research Council (MRC).

A Computing conditional expectations on the output of a generator

Here we give a derivation of the target density on the constraint manifold in the input space of a differentiable generator for computing conditional expectations on the output. This is largely a restatement of results in [10, §2.3] and is provided mainly to make those results more easily relatable to the notation of this paper.

For clarity, in this section the measure being integrated with respect to will be explicitly denoted. For a variable of integration $\mathbf{x}$, $\lambda^D\{\mathrm{d}\mathbf{x}\}$ will denote the $D$ dimensional Lebesgue measure and $\mathcal{H}^D\{\mathrm{d}\mathbf{x}\}$ the $D$ dimensional Hausdorff measure over some space $\mathcal{X}$ which will be specified. A key result we will use is Federer’s Co-Area Formula [12, §3.2.12]:

Theorem (Co-Area Formula). Let $\mathbf{m} : \mathcal{X} \subseteq \mathbb{R}^L \to \mathcal{V} \subseteq \mathbb{R}^K$ be Lipschitz with $L > K$ and $h : \mathcal{X} \to \mathbb{R}$ be Lebesgue measurable. Then

$$\int_{\mathcal{X}} h(\mathbf{x}) \left| \frac{\partial \mathbf{m}}{\partial \mathbf{x}} \frac{\partial \mathbf{m}}{\partial \mathbf{x}}^{\mathsf{T}} \right|^{\frac{1}{2}} \lambda^L\{\mathrm{d}\mathbf{x}\} = \int_{\mathcal{V}} \int_{\mathbf{m}^{-1}(\mathbf{v})} h(\mathbf{x}) \, \mathcal{H}^{L-K}\{\mathrm{d}\mathbf{x}\} \, \lambda^K\{\mathrm{d}\mathbf{v}\} \tag{10}$$

with $\frac{\partial \mathbf{m}}{\partial \mathbf{x}} = \left[ \frac{\partial m_i}{\partial x_j} \right]_{i,j}$ the Jacobian of the map and $\mathbf{m}^{-1}(\mathbf{v})$ the $L - K$ dimensional sub-manifold embedded in $\mathcal{X}$ with Hausdorff measure $\mathcal{H}^{L-K}\{\mathrm{d}\mathbf{x}\}$, which is the solution set $\{\mathbf{x} \in \mathcal{X} : \mathbf{m}(\mathbf{x}) = \mathbf{v}\}$.

Corollary.
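If the Jacobian of $\mathbf{m}$ has full row-rank everywhere such that $\left| \frac{\partial \mathbf{m}}{\partial \mathbf{x}} \frac{\partial \mathbf{m}}{\partial \mathbf{x}}^{\mathsf{T}} \right| > 0 \;\; \forall \mathbf{x} \in \mathcal{X}$, then for a Lebesgue measurable $h^\ast : \mathcal{X} \to \mathbb{R}$

$$\int_{\mathcal{X}} h^\ast(\mathbf{x}) \, \lambda^L\{\mathrm{d}\mathbf{x}\} = \int_{\mathcal{V}} \int_{\mathbf{m}^{-1}(\mathbf{v})} h^\ast(\mathbf{x}) \left| \frac{\partial \mathbf{m}}{\partial \mathbf{x}} \frac{\partial \mathbf{m}}{\partial \mathbf{x}}^{\mathsf{T}} \right|^{-\frac{1}{2}} \mathcal{H}^{L-K}\{\mathrm{d}\mathbf{x}\} \, \lambda^K\{\mathrm{d}\mathbf{v}\} \tag{11}$$

which can be easily shown by setting $h(\mathbf{x}) = h^\ast(\mathbf{x}) \left| \frac{\partial \mathbf{m}}{\partial \mathbf{x}} \frac{\partial \mathbf{m}}{\partial \mathbf{x}}^{\mathsf{T}} \right|^{-\frac{1}{2}}$ in (10).

We will also use what is sometimes termed the Law of the Unconscious Statistician (LOTUS) to express expectations of functions of random (vector) variables when an explicit density on the random output of the function is not known.

Theorem (Law of the Unconscious Statistician). Let $\mathbf{Y}$ be a random vector on support $\mathcal{Y} \subseteq \mathbb{R}^N$ with density $p_{\mathbf{Y}}[\mathbf{y}]$ with respect to the Lebesgue measure $\lambda^N\{\mathrm{d}\mathbf{y}\}$ and $f : \mathcal{Y} \to \mathbb{R}$ be Lebesgue measurable. If we define a new random variable $Z = f(\mathbf{Y})$ then

$$\mathbb{E}[Z] = \mathbb{E}[f(\mathbf{Y})] = \int_{\mathcal{Y}} f(\mathbf{y}) \, p_{\mathbf{Y}}[\mathbf{y}] \, \lambda^N\{\mathrm{d}\mathbf{y}\}. \tag{12}$$

Corollary. If $\mathbf{Y}$ is defined as $\mathbf{Y} = \mathbf{g}(\mathbf{U})$ for some $\mathbf{g} : \mathcal{U} \subseteq \mathbb{R}^M \to \mathcal{Y}$ then

$$\mathbb{E}[f(\mathbf{Y})] = \mathbb{E}[(f \circ \mathbf{g})(\mathbf{U})] = \int_{\mathcal{U}} f \circ \mathbf{g}(\mathbf{u}) \, p_{\mathbf{U}}[\mathbf{u}] \, \lambda^M\{\mathrm{d}\mathbf{u}\}. \tag{13}$$

This leads us to the main result.

Theorem. Let $\mathbf{U}$ be a random vector with density $p_{\mathbf{U}}[\mathbf{u}]$ with respect to the Lebesgue measure $\lambda^M\{\mathrm{d}\mathbf{u}\}$ on support $\mathcal{U} \subseteq \mathbb{R}^M$. Further let $\mathbf{g} : \mathcal{U} \to \mathcal{Y}$ be a smooth map, with $\mathcal{Y} \subseteq \mathbb{R}^N$ and $N \leq M$, defining a random vector $\mathbf{Y} = \mathbf{g}(\mathbf{U})$. Assume $\frac{\partial \mathbf{g}}{\partial \mathbf{u}}$ has full row-rank everywhere. Partition the output space $\mathcal{Y} = \mathcal{Y}_1 \times \mathcal{Y}_2$ with $\mathcal{Y}_1 \subseteq \mathbb{R}^{N_1}$ and $\mathcal{Y}_2 \subseteq \mathbb{R}^{N_2}$, and $\mathbf{Y}_1 = \mathbf{g}_1(\mathbf{U})$, $\mathbf{Y}_2 = \mathbf{g}_2(\mathbf{U})$. Then the conditional expectation of some function $f$ of $\mathbf{Y}$ given that $\mathbf{Y}_2 = \breve{\mathbf{y}}_2$ has been observed is

$$\mathbb{E}[f(\mathbf{Y}) \mid \mathbf{Y}_2 = \breve{\mathbf{y}}_2] = \frac{1}{p_{\mathbf{Y}_2}[\breve{\mathbf{y}}_2]} \int_{\mathcal{C}} f \circ \mathbf{g}(\mathbf{u}) \, p_{\mathbf{U}}[\mathbf{u}] \left| \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}} \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}}^{\mathsf{T}} \right|^{-\frac{1}{2}} \mathcal{H}^{M-N_2}\{\mathrm{d}\mathbf{u}\} \tag{14}$$

with $p_{\mathbf{Y}_2}[\breve{\mathbf{y}}_2]$ the marginal density on $\mathbf{Y}_2$ with respect to the Lebesgue measure $\lambda^{N_2}\{\mathrm{d}\mathbf{y}_2\}$, which must be non-zero for the conditional expectation to be well-defined; $\mathcal{C}$ is the $M - N_2$ dimensional sub-manifold defined by the solution set $\{\mathbf{u} \in \mathcal{U} : \mathbf{g}_2(\mathbf{u}) = \breve{\mathbf{y}}_2\}$.

Proof. By the Law of Total Expectation we have that

$$\mathbb{E}[f(\mathbf{Y})] = \int_{\mathcal{Y}_2} \mathbb{E}[f(\mathbf{Y}) \mid \mathbf{Y}_2 = \mathbf{y}_2] \, p_{\mathbf{Y}_2}[\mathbf{y}_2] \, \lambda^{N_2}\{\mathrm{d}\mathbf{y}_2\}. \tag{15}$$

Using LOTUS in the form (13) we get

$$\int_{\mathcal{Y}_2} \mathbb{E}[f(\mathbf{Y}) \mid \mathbf{Y}_2 = \mathbf{y}_2] \, p_{\mathbf{Y}_2}[\mathbf{y}_2] \, \lambda^{N_2}\{\mathrm{d}\mathbf{y}_2\} = \int_{\mathcal{U}} f \circ \mathbf{g}(\mathbf{u}) \, p_{\mathbf{U}}[\mathbf{u}] \, \lambda^M\{\mathrm{d}\mathbf{u}\}. \tag{16}$$

Applying the co-area formula corollary (11) to the right-hand side gives

\begin{align}
&\int_{\mathcal{Y}_2} \mathbb{E}[f(\mathbf{Y}) \mid \mathbf{Y}_2 = \mathbf{y}_2] \, p_{\mathbf{Y}_2}[\mathbf{y}_2] \, \lambda^{N_2}\{\mathrm{d}\mathbf{y}_2\} = \tag{17} \\
&\qquad \int_{\mathcal{Y}_2} \int_{\mathbf{g}_2^{-1}(\mathbf{y}_2)} f \circ \mathbf{g}(\mathbf{u}) \, p_{\mathbf{U}}[\mathbf{u}] \left| \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}} \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}}^{\mathsf{T}} \right|^{-\frac{1}{2}} \mathcal{H}^{M-N_2}\{\mathrm{d}\mathbf{u}\} \, \lambda^{N_2}\{\mathrm{d}\mathbf{y}_2\}. \tag{18}
\end{align}

Define $\mathcal{Y}_2^\star = \{\mathbf{y}_2 \in \mathcal{Y}_2 : p_{\mathbf{Y}_2}[\mathbf{y}_2] > 0\}$. Then we have

\begin{align}
&\int_{\mathcal{Y}_2^\star} \Big\{ \mathbb{E}[f(\mathbf{Y}) \mid \mathbf{Y}_2 = \mathbf{y}_2] \Big\} \, p_{\mathbf{Y}_2}[\mathbf{y}_2] \, \lambda^{N_2}\{\mathrm{d}\mathbf{y}_2\} = \tag{19} \\
&\qquad \int_{\mathcal{Y}_2^\star} \left\{ \frac{1}{p_{\mathbf{Y}_2}[\mathbf{y}_2]} \int_{\mathbf{g}_2^{-1}(\mathbf{y}_2)} f \circ \mathbf{g}(\mathbf{u}) \, p_{\mathbf{U}}[\mathbf{u}] \left| \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}} \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}}^{\mathsf{T}} \right|^{-\frac{1}{2}} \mathcal{H}^{M-N_2}\{\mathrm{d}\mathbf{u}\} \right\} p_{\mathbf{Y}_2}[\mathbf{y}_2] \, \lambda^{N_2}\{\mathrm{d}\mathbf{y}_2\}. \tag{20}
\end{align}

As this holds for arbitrary Lebesgue measurable $f$, this implies that the terms inside the braces are equal for $\mathbf{y}_2 \in \mathcal{Y}_2^\star$ (footnote 2). As $\breve{\mathbf{y}}_2 \in \mathcal{Y}_2^\star$ by assumption and $\mathcal{C} = \mathbf{g}_2^{-1}(\breve{\mathbf{y}}_2)$ we have

$$\mathbb{E}[f(\mathbf{Y}) \mid \mathbf{Y}_2 = \breve{\mathbf{y}}_2] = \frac{1}{p_{\mathbf{Y}_2}[\breve{\mathbf{y}}_2]} \int_{\mathcal{C}} f \circ \mathbf{g}(\mathbf{u}) \, p_{\mathbf{U}}[\mathbf{u}] \left| \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}} \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}}^{\mathsf{T}} \right|^{-\frac{1}{2}} \mathcal{H}^{M-N_2}\{\mathrm{d}\mathbf{u}\}. \tag{21}$$

Footnote 2: Intuitively we could consider setting $f(\mathbf{y}) = f^\star(\mathbf{y}) \, \delta^{N_2}[\mathbf{y}_2 - \breve{\mathbf{y}}_2]$ where $\delta^N[\cdot]$ denotes the $N$ dimensional Dirac delta and $f^\star$ is arbitrary, in which case the result follows from the sifting property of the Dirac delta.

Corollary. Define a target density with respect to the Hausdorff measure $\mathcal{H}^{M-N_2}\{\mathrm{d}\mathbf{u}\}$ on $\mathcal{C}$

$$\pi_{\mathbf{U}}[\mathbf{u}] \propto p_{\mathbf{U}}[\mathbf{u}] \left| \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}} \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}}^{\mathsf{T}} \right|^{-\frac{1}{2}}. \tag{22}$$

If we generate a set of MCMC samples $\{\mathbf{u}^{(s)}\}_{s=1}^{S}$ from a Markov chain which leaves $\pi_{\mathbf{U}}[\mathbf{u}]$ invariant with respect to $\mathcal{H}^{M-N_2}\{\mathrm{d}\mathbf{u}\}$ on $\mathcal{C}$, by the Law of Large Numbers we can then form a Monte Carlo estimate for the conditional expectation

$$\mathbb{E}[f(\mathbf{Y}) \mid \mathbf{Y}_2 = \breve{\mathbf{y}}_2] = \lim_{S \to \infty} \frac{1}{S} \sum_{s=1}^{S} f \circ \mathbf{g}(\mathbf{u}^{(s)}). \tag{23}$$

B Evaluating the target density and its gradient

For the constrained Hamiltonian dynamics we need to be able to evaluate the logarithm of the target density (22) up to an additive constant, and its gradient with respect to $\mathbf{u}$. We have that

$$\log \pi_{\mathbf{U}}[\mathbf{u}] = \log p_{\mathbf{U}}[\mathbf{u}] - \frac{1}{2} \log \left| \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}} \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}}^{\mathsf{T}} \right| - \log Z \tag{24}$$

where $Z$ is the normalising constant for the density, which is independent of $\mathbf{u}$.

In general evaluating the log-determinant of the Gram matrix, $\log \left| \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}} \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}}^{\mathsf{T}} \right|$, has a computational cost which scales as $O(M N_2^2)$. However, as part of the constrained dynamics updates the lower-triangular Cholesky decomposition $\mathbf{L}$ of the Gram matrix $\frac{\partial \mathbf{g}_2}{\partial \mathbf{u}} \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}}^{\mathsf{T}}$ is calculated. Using basic properties of the matrix determinant we have

\begin{align}
\log \pi_{\mathbf{U}}[\mathbf{u}] &= \log p_{\mathbf{U}}[\mathbf{u}] - \frac{1}{2} \log \left| \mathbf{L} \mathbf{L}^{\mathsf{T}} \right| - \log Z \tag{25} \\
&= \log p_{\mathbf{U}}[\mathbf{u}] - \frac{1}{2} \log \left( |\mathbf{L}| \left| \mathbf{L}^{\mathsf{T}} \right| \right) - \log Z \tag{26} \\
&= \log p_{\mathbf{U}}[\mathbf{u}] - \log |\mathbf{L}| - \log Z \tag{27} \\
&= \log p_{\mathbf{U}}[\mathbf{u}] - \sum_{i=1}^{N_2} \log(L_{ii}) - \log Z. \tag{28}
\end{align}

The base density $p_{\mathbf{U}}[\mathbf{u}]$ will typically be of a simple form, e.g. standard Gaussian; therefore we can evaluate the logarithm of the target density up to an additive constant at a marginal computational cost that scales linearly with dimensionality.

For the gradient we can use reverse-mode automatic differentiation to calculate the gradient of (28) with respect to $\mathbf{u}$. This requires propagating partial derivatives through the Cholesky decomposition [26]; efficient implementations for this are present in many automatic differentiation frameworks including Theano.

Alternatively the gradient of (24) can be explicitly derived. The gradient of $\log p_{\mathbf{U}}[\mathbf{u}]$ will generally be trivial and $\frac{\partial \log Z}{\partial \mathbf{u}} = \mathbf{0}$. The gradient of the second term can be calculated using

\begin{align}
\frac{\partial}{\partial u_i} \log \left| \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}} \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}}^{\mathsf{T}} \right| &= \mathrm{Trace} \left[ \left( \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}} \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}}^{\mathsf{T}} \right)^{-1} \left( \frac{\partial^2 \mathbf{g}_2}{\partial u_i \partial \mathbf{u}} \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}}^{\mathsf{T}} + \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}} \frac{\partial^2 \mathbf{g}_2}{\partial u_i \partial \mathbf{u}}^{\mathsf{T}} \right) \right] \tag{29} \\
&= 2 \, \mathrm{Trace} \left\{ \left[ \mathbf{L}^{-\mathsf{T}} \mathbf{L}^{-1} \frac{\partial \mathbf{g}_2}{\partial \mathbf{u}} \right] \frac{\partial^2 \mathbf{g}_2}{\partial u_i \partial \mathbf{u}}^{\mathsf{T}} \right\}. \tag{30}
\end{align}

The matrix inside the square brackets is independent of $i$ and can be computed once by solving the system of equations by forward and backward substitution. The matrix of second partial derivatives $\frac{\partial^2 \mathbf{g}_2}{\partial u_i \partial \mathbf{u}}$ can either be manually derived for the specific generator function or calculated using automatic differentiation. The trace of the matrix product is then just the sum over all indices of the element-wise product of the pair.

C Exploiting structure in the generator

Often the generator inputs $\mathbf{U}$ can be split into two distinct groups: global inputs $\mathbf{Z}$ which affect all of the conditioned-on outputs (e.g.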
inputs which map to model parameters) and local ‘noise’ inputs $\mathbf{N}$, each element of which affects only a subset of the outputs. In particular, systems with a constraint $\mathbf{c}(\mathbf{u}) = \mathbf{g}_2(\mathbf{u}) - \breve{\mathbf{y}}_2$ and a generator function $\mathbf{g}_2$ giving the conditioned-on outputs which can be expressed in one of the two forms

$$y_i = g_i(\mathbf{z}, n_i) \;\; \text{(element-wise)} \quad \text{or} \quad y_i = \tilde{g}_i(\mathbf{z}, y_{i-1}, n_i) = g_i(\mathbf{z}, \mathbf{n}_{\leq i}) \;\; \text{(autoregressive)} \tag{31}$$

have a Jacobian $\frac{\partial \mathbf{c}}{\partial \mathbf{u}} = \left[ \frac{\partial \mathbf{c}}{\partial \mathbf{z}} \;\; \frac{\partial \mathbf{c}}{\partial \mathbf{n}} \right]$ in which $\frac{\partial \mathbf{c}}{\partial \mathbf{n}}$ is diagonal (element-wise) or triangular (autoregressive). The Cholesky decomposition of $\frac{\partial \mathbf{c}}{\partial \mathbf{u}} \frac{\partial \mathbf{c}}{\partial \mathbf{u}}^{\mathsf{T}} = \frac{\partial \mathbf{c}}{\partial \mathbf{n}} \frac{\partial \mathbf{c}}{\partial \mathbf{n}}^{\mathsf{T}} + \frac{\partial \mathbf{c}}{\partial \mathbf{z}} \frac{\partial \mathbf{c}}{\partial \mathbf{z}}^{\mathsf{T}}$ can then be computed by low-rank Cholesky updates of the triangular / diagonal matrix $\frac{\partial \mathbf{c}}{\partial \mathbf{n}}$ with each of the columns of $\frac{\partial \mathbf{c}}{\partial \mathbf{z}}$. As $\dim(\mathbf{z}) = L$ is often significantly less than, and independent of, the number of outputs conditioned on $N_2$, the resulting $O(L N_2^2)$ cost of the Cholesky updates is a significant improvement over the original $O(N_2^3)$.

Many DGN models have an element-wise noise structure, including the Gaussian VAE. The autoregressive noise structure commonly occurs in stochastic dynamical simulations where the outputs are a time sequence of states, with noise being added at each time-step, for example the Lotka-Volterra model considered in the experiments in Section 7.

References

[1] I. Akhter and M. J. Black. Pose-conditioned joint angle limits for 3D human pose reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[2] H. C. Andersen. RATTLE: A “velocity” version of the SHAKE algorithm for molecular dynamics calculations. Journal of Computational Physics, 1983.
[3] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 2009.
[4] M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian computation in population genetics. Genetics, 2002.
[5] M. Betancourt. A general metric for Riemannian manifold Hamiltonian Monte Carlo. In Geometric Science of Information. Springer, 2013.
[6] M. Betancourt.
The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[7] M. A. Brubaker, M. Salzmann, and R. Urtasun. A family of MCMC methods on implicitly defined manifolds. In International Conference on Artificial Intelligence and Statistics, 2012.
[8] S. Byrne and M. Girolami. Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845, 2013.
[9] T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[10] P. Diaconis, S. Holmes, and M. Shahshahani. Sampling from a manifold. In Advances in Modern Statistical Theory and Applications, pages 102–125. Institute of Mathematical Statistics, 2013.
[11] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 1987.
[12] H. Federer. Geometric measure theory. Springer, 2014.
[13] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. Book in preparation for MIT Press, 2016.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[15] C. C. Gordon, T. Churchill, C. E. Clauser, B. Bradtmiller, J. T. McConville, I. Tebbets, and R. A. Walker. Anthropometric survey of US Army personnel: Final report. Technical report, United States Army, 1988.
[16] C. Hartmann. An ergodic sampling scheme for constrained Hamiltonian systems with applications to molecular dynamics. Journal of Statistical Physics, 2008.
[17] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[18] B. Leimkuhler and C. Matthews. Efficient molecular dynamics using geodesic integration and solvent–solute splitting. In Proc. R. Soc.
A, volume 472, page 20160138. The Royal Society, 2016.
[19] B. Leimkuhler and S. Reich. Simulating Hamiltonian dynamics. Cambridge University Press, 2004.
[20] B. J. Leimkuhler and R. D. Skeel. Symplectic numerical integrators in constrained Hamiltonian systems. Journal of Computational Physics, 1994.
[21] T. Lelièvre, M. Rousset, and G. Stoltz. Langevin dynamics with constraints and computation of free energy differences. Mathematics of Computation, 2012.
[22] J.-M. Marin, P. Pudlo, C. P. Robert, and R. J. Ryder. Approximate Bayesian computational methods. Statistics and Computing, 2012.
[23] P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 2003.
[24] E. Meeds, R. Leenders, and M. Welling. Hamiltonian ABC. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, 2015.
[25] T. Meeds and M. Welling. Optimization Monte Carlo: Efficient and embarrassingly parallel likelihood-free inference. In Advances in Neural Information Processing Systems, 2015.
[26] I. Murray. Differentiation of the Cholesky decomposition. arXiv preprint arXiv:1602.07527, 2016.
[27] I. Murray and M. M. Graham. Pseudo-marginal slice sampling. In International Conference on Artificial Intelligence and Statistics, 2016.
[28] R. M. Neal. Slice sampling. The Annals of Statistics, 2003.
[29] R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2011.
[30] M. Plummer, N. Best, K. Cowles, and K. Vines. CODA: Convergence diagnosis and output analysis for MCMC. R News, 2006.
[31] M. J. D. Powell. Numerical Methods for Nonlinear Algebraic Equations, chapter A Hybrid Method for Nonlinear Equations. Gordon and Breach, 1970.
[32] O. Ratmann, C. Andrieu, C. Wiuf, and S. Richardson. Model criticism based on likelihood-free inference, with an application to protein network evolution. Proceedings of the National Academy of Sciences, 2009.
[33] D. J. Rezende, S.
Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[34] C. P. Robert, K. Mengersen, and C. Chen. Model choice versus model criticism. Proceedings of the National Academy of Sciences of the United States of America, 2010.
[35] J.-P. Ryckaert, G. Ciccotti, and H. J. Berendsen. Numerical integration of the Cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. Journal of Computational Physics, 1977.
[36] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, 2016.
[37] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, 2011.
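Supplementary sketch for Appendices B and C. The structured Cholesky computation of Appendix C, and the log-determinant term of (28) that it feeds into, can be illustrated numerically as below. This is a minimal sketch under stated assumptions rather than the implementation used for the experiments: it assumes the element-wise noise structure (a diagonal noise block in the constraint Jacobian), uses random matrices as stand-ins for the Jacobian blocks, and the function name `chol_rank_one_update` is hypothetical. The update loop is the standard textbook rank-one Cholesky update, written out in NumPy.

```python
import numpy as np


def chol_rank_one_update(chol, x):
    """Return the lower Cholesky factor of chol @ chol.T + outer(x, x).

    Standard O(n^2) rank-one update; `chol` must be lower triangular with a
    strictly positive diagonal. Inputs are copied, not modified in place.
    """
    chol, x = chol.copy(), x.copy()
    n = x.shape[0]
    for k in range(n):
        r = np.hypot(chol[k, k], x[k])
        c, s = r / chol[k, k], x[k] / chol[k, k]
        chol[k, k] = r
        # Update the remainder of column k, then the working vector.
        chol[k + 1:, k] = (chol[k + 1:, k] + s * x[k + 1:]) / c
        x[k + 1:] = c * x[k + 1:] - s * chol[k + 1:, k]
    return chol


rng = np.random.default_rng(1)
n_out, n_global = 30, 3  # N_2 conditioned outputs, L global inputs

# Stand-ins for the Jacobian blocks (assumed values, for illustration only).
dc_dn_diag = 0.5 + rng.random(n_out)               # diagonal of dc/dn
dc_dz = rng.standard_normal((n_out, n_global))     # dc/dz

# Structured route: start from the factor of (dc/dn)(dc/dn)^T, which for a
# diagonal block is just diag(|d|), then apply one update per column of dc/dz.
chol_structured = np.diag(np.abs(dc_dn_diag))
for j in range(n_global):
    chol_structured = chol_rank_one_update(chol_structured, dc_dz[:, j])

# Direct route: form the full Jacobian and factorise its Gram matrix.
jac = np.hstack([dc_dz, np.diag(dc_dn_diag)])
chol_direct = np.linalg.cholesky(jac @ jac.T)

# Log-determinant term of (28): sum of the logs of the Cholesky diagonal.
log_det_term = np.sum(np.log(np.diag(chol_structured)))

print(np.allclose(chol_structured, chol_direct))
print(log_det_term)
```

Each rank-one update costs O(N_2^2), so L updates give the O(L N_2^2) total quoted in Appendix C. For the autoregressive case the same loop would start from the triangular noise block instead of a diagonal one, after flipping the signs of any columns with a negative diagonal entry.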