Adiabatic Monte Carlo

Michael Betancourt (@betanalpha), University of Warwick
CRiSM Workshop: Estimating Constants, University of Warwick, April 21, 2016

Computational statistics is all about computing expectations with respect to a given target distribution,

  E_\pi[f] = \int f(q) \, \pi(q) \, dq.

High-dimensional target distributions exhibit concentration of measure, which frustrates these computations. Markov chains provide a generic scheme for finding and then exploring the resulting typical set. In particular, Markov chains define consistent Markov chain Monte Carlo estimators of the desired expectations,

  \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N} f(q_n) = E_\pi[f].

In order to scale to high-dimensional target distributions, however, we need efficient exploration of the typical set.

Hamiltonian Monte Carlo

Hamiltonian Monte Carlo uses auxiliary momenta and density gradients to generate coherent exploration,

  q \to (q, p), \qquad \pi(q) \to \pi(q, p) = \pi(p \mid q) \, \pi(q).
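The consistency of these Markov chain Monte Carlo estimators is easy to see in a toy experiment. Below is a minimal sketch (my own illustration, not code from the talk) using a random-walk Metropolis chain on a one-dimensional standard normal target:

```python
import math
import random

def target_log_density(q):
    """Unnormalized log density of a standard normal target."""
    return -0.5 * q * q

def random_walk_metropolis(n_samples, step=1.0, seed=1):
    """Generate a Markov chain whose stationary distribution is the target."""
    rng = random.Random(seed)
    q = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = q + rng.gauss(0.0, step)
        # Metropolis accept/reject preserves the target distribution.
        if math.log(rng.random()) < target_log_density(proposal) - target_log_density(q):
            q = proposal
        samples.append(q)
    return samples

samples = random_walk_metropolis(50_000)
# MCMC estimator: (1/N) sum_n f(q_n) converges to E_pi[f].
mean_est = sum(samples) / len(samples)                 # estimates E[q] = 0
var_est = sum(q * q for q in samples) / len(samples)   # estimates E[q^2] = 1
```

With 50,000 iterations the running averages settle near the exact values E[q] = 0 and E[q^2] = 1, illustrating the limit above.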
The addition of momenta defines a Hamiltonian that decomposes into a kinetic energy and a potential energy,

  H(p, q) = -\log \big( \pi(p \mid q) \, \pi(q) \big)
          = -\log \pi(p \mid q) - \log \pi(q)
          = T + V,

with the kinetic energy T = -\log \pi(p \mid q) and the potential energy V = -\log \pi(q).

The Hamiltonian defines a vector field aligned with the typical set from which we can generate exploration,

  \frac{dq}{dt} = \frac{\partial T}{\partial p}, \qquad
  \frac{dp}{dt} = -\frac{\partial T}{\partial q} - \frac{\partial V}{\partial q}.

In practice we integrate along this vector field only approximately, using powerful symplectic integrators. The numerical error introduced by the integrator can be eliminated with a careful Metropolis correction,

  q \to q + \epsilon \, \frac{\partial T}{\partial p}, \qquad
  p \to p - \epsilon \left( \frac{\partial T}{\partial q} + \frac{\partial V}{\partial q} \right),

  \pi(\mathrm{accept}) = \min \left( 1, \frac{\pi(\tau(p, q))}{\pi(p, q)} \right).

http://arxiv.org/abs/1405.3489

Adiabatic Monte Carlo

Like any MCMC algorithm, Hamiltonian Monte Carlo struggles to explore multimodal target distributions. In Bayesian settings the marginal likelihood is difficult to estimate from the Markov chain output:

  \pi(D) = \int \pi(D \mid q) \, \pi(q) \, dq = E_{\mathrm{prior}}[\pi(D \mid q)],

  \pi(D) = \left( E_{\mathrm{post}} \left[ \big( \pi(D \mid q) \big)^{-1} \right] \right)^{-1}.
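The instability of the harmonic-mean identity above can be seen even in a conjugate normal model where everything is available in closed form. The following sketch (my own illustration, not from the talk) compares the prior-sampling estimator with the harmonic-mean estimator computed from exact posterior draws:

```python
import math
import random

D = 2.0   # single observed datum; model: q ~ N(0, 1), D | q ~ N(q, 1)

def likelihood(q):
    """pi(D | q) for the normal observation model."""
    return math.exp(-0.5 * (D - q) ** 2) / math.sqrt(2 * math.pi)

# Exact marginal likelihood: D ~ N(0, 2) under the prior predictive.
truth = math.exp(-D * D / 4) / math.sqrt(4 * math.pi)

rng = random.Random(1)
prior_estimates, harmonic_estimates = [], []
for _ in range(20):
    # Prior-sampling estimator: pi(D) = E_prior[pi(D | q)].
    draws = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
    prior_estimates.append(sum(likelihood(q) for q in draws) / 50_000)
    # Harmonic-mean estimator from exact posterior draws, q | D ~ N(D/2, 1/2).
    draws = [rng.gauss(D / 2, math.sqrt(0.5)) for _ in range(50_000)]
    harmonic_estimates.append(1.0 / (sum(1.0 / likelihood(q) for q in draws) / 50_000))

spread = lambda xs: max(xs) - min(xs)
```

The harmonic-mean summand 1/pi(D | q) has infinite variance under this posterior, so the harmonic-mean replications scatter far more widely than the prior-sampling estimates, even though both use exact draws.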
Both of these problems can be tackled by interpolating between the target distribution and an auxiliary, unimodal base distribution,

  \pi_\beta(q) = \frac{1}{Z(\beta)} \big( \pi(q) \big)^{\beta} \big( \pi_B(q) \big)^{1 - \beta},

which recovers the base distribution at one endpoint and the target at the other:

  \beta = 0: \; \pi_0(q) = \pi_B(q), \qquad \beta = 1: \; \pi_1(q) = \pi(q).

To move along the interpolation in practice, however, we need to impose a discrete partition of the interpolation. If the partition is chosen poorly then transitions will be difficult, and we are left with a delicate tuning problem.
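The geometric bridge is easy to visualize numerically. The sketch below (my own illustration; the bimodal target and broad Gaussian base are hypothetical choices, not from the talk) normalizes \pi_\beta on a grid and confirms the two endpoints:

```python
import math

def log_base(q):
    """Broad Gaussian base, pi_B = N(0, 3^2), up to a constant."""
    return -0.5 * (q / 3.0) ** 2

def log_target(q):
    """Bimodal target: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2), up to a constant."""
    return math.log(math.exp(-0.5 * ((q + 2) / 0.5) ** 2)
                    + math.exp(-0.5 * ((q - 2) / 0.5) ** 2))

def bridge_density(beta, grid):
    """Normalized pi_beta on a grid: pi_beta proportional to target^beta * base^(1-beta)."""
    log_unnorm = [beta * log_target(q) + (1 - beta) * log_base(q) for q in grid]
    m = max(log_unnorm)                      # shift for numerical stability
    w = [math.exp(l - m) for l in log_unnorm]
    dq = grid[1] - grid[0]
    z = sum(w) * dq                          # Riemann-sum approximation of Z(beta)
    return [wi / z for wi in w]

grid = [-10 + 0.01 * i for i in range(2001)]
base = bridge_density(0.0, grid)    # beta = 0 recovers pi_B
target = bridge_density(1.0, grid)  # beta = 1 recovers pi
half = bridge_density(0.5, grid)    # an intermediate bridge distribution
```

Plotting `half` against `base` and `target` shows how the bridge gradually splits the unimodal base into the two target modes, which is exactly the family a discrete partition has to step through.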
Mirroring Hamiltonian Monte Carlo, Adiabatic Monte Carlo generates optimal transitions by making the interpolation parameter \beta itself dynamic,

  q \to (q, p, \beta), \qquad \pi(q) \to \pi_\beta(q, p) = \pi(p \mid q) \, \pi_\beta(q).

Introducing auxiliary momenta to the interpolating distributions defines a contact Hamiltonian,

  H(q, p, \beta) = -\log \big( \pi(p \mid q) \, \pi_\beta(q) \big)
                 = -\log \pi(p \mid q) - (1 - \beta) \log \pi_B(q) - \beta \log \pi(q) + \log Z(\beta) + H_0
                 = T + (1 - \beta) V_B + \beta V + \log Z(\beta) + H_0,

with V_B = -\log \pi_B(q) and V = -\log \pi(q).

The contact Hamiltonian defines a vector field that generates efficient motion along the interpolation,

  \frac{dq}{dt} = \frac{\partial T}{\partial p},

  \frac{dp}{dt} = -\frac{\partial T}{\partial q} - \frac{\partial V_\beta}{\partial q} - \big( \Delta V - E_{\pi_\beta}[\Delta V] \big) \, p,

  \frac{d\beta}{dt} = p \cdot \frac{\partial T}{\partial p},

where V_\beta = (1 - \beta) V_B + \beta V and \Delta V = V - V_B.
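These flow equations can be sanity-checked numerically. The sketch below (my own reconstruction, taking d\beta/dt = p \cdot \partial T/\partial p, a (\Delta V - E_{\pi_\beta}[\Delta V]) p term in dp/dt, and H_0 = 0; the Gaussian base and target are illustrative choices, not code from the talk) integrates the flow for a one-dimensional Gaussian pair, where E_{\pi_\beta}[\Delta V] and \log Z(\beta) are available in closed form:

```python
import math

SIGMA_B, SIGMA, MU = 2.0, 1.0, 1.0   # base N(0, SIGMA_B^2), target N(MU, SIGMA^2)

def V_B(q):  return 0.5 * (q / SIGMA_B) ** 2        # base potential
def V(q):    return 0.5 * ((q - MU) / SIGMA) ** 2   # target potential
def dV_B(q): return q / SIGMA_B ** 2
def dV(q):   return (q - MU) / SIGMA ** 2

def moments(beta):
    """Precision and mean of the Gaussian interpolation pi_beta."""
    a = (1 - beta) / SIGMA_B ** 2 + beta / SIGMA ** 2
    m = beta * MU / SIGMA ** 2 / a
    return m, a

def expected_delta_V(beta):
    """Closed-form E_{pi_beta}[Delta V] with Delta V = V - V_B."""
    m, a = moments(beta)
    return (((m - MU) ** 2 + 1 / a) / (2 * SIGMA ** 2)
            - (m ** 2 + 1 / a) / (2 * SIGMA_B ** 2))

def log_Z(beta):
    """Closed-form log Z(beta) for the Gaussian bridge."""
    m, a = moments(beta)
    return 0.5 * math.log(2 * math.pi / a) - (beta * MU ** 2 / (2 * SIGMA ** 2) - 0.5 * a * m ** 2)

def H(q, p, beta):
    """Contact Hamiltonian with H_0 = 0; should be invariant under the flow."""
    return 0.5 * p * p + (1 - beta) * V_B(q) + beta * V(q) + log_Z(beta)

def flow(state):
    """Contact vector field: (dq/dt, dp/dt, dbeta/dt)."""
    q, p, beta = state
    friction = (V(q) - V_B(q)) - expected_delta_V(beta)
    dq = p
    dp = -(1 - beta) * dV_B(q) - beta * dV(q) - friction * p
    dbeta = p * p                            # p * dT/dp for T = p^2 / 2
    return dq, dp, dbeta

def rk4_step(state, dt):
    k1 = flow(state)
    k2 = flow(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = flow(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = flow(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

state = (0.5, 1.0, 0.0)                      # start in the base distribution
H_start = H(*state)
for _ in range(200_000):
    state = rk4_step(state, 1e-3)
    if state[2] >= 1.0:                      # beta has reached the target
        break
H_end = H(*state)
```

Because d\beta/dt = p^2 \geq 0 for a Euclidean kinetic energy, \beta climbs monotonically from 0 to 1 along the trajectory, while the contact Hamiltonian stays constant to integrator precision.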
Because the contact Hamiltonian is invariant under this motion, we can also recover the normalizing constant:

  \log Z(\beta) = H(q, p, \beta) + \log \pi(p \mid q) + (1 - \beta) \log \pi_B(q) + \beta \log \pi(q) - H_0.

Adiabatic Monte Carlo dynamically transitions from the base distribution to the target distribution.

To see the optimality of adiabatic transitions, consider the interpolation of a unidimensional distribution. Adiabatic transitions automatically equilibrate, implicitly generating an optimal interpolation partition.

[Figure: q versus \beta, with the adiabatic trajectory equilibrating across the interpolation.]

In theory we can recover the normalizing constant exactly. In practice we can recover it incredibly accurately.

[Figure: exact versus flow-based \log Z(\beta(t)), with the relative error (Exact - Flow) / Exact at the percent level.]

Open Problems

The immediate problem with adiabatic transitions is that metastabilities prevent them from being isomorphisms.

[Figure: a trajectory in the (q, p) plane stalling at a metastability.]
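The mechanism behind this recovery is the identity d log Z(\beta)/d\beta = -E_{\pi_\beta}[\Delta V], which follows from differentiating Z(\beta) under the integral sign. The sketch below (my own illustration with a hypothetical bimodal target, not from the talk) checks the integrated identity \log Z(1) - \log Z(0) = -\int_0^1 E_{\pi_\beta}[\Delta V] \, d\beta by quadrature:

```python
import math

def V_B(q):
    """Base potential: N(0, 2^2), up to a constant."""
    return 0.5 * (q / 2.0) ** 2

def V(q):
    """Bimodal target potential: mixture of N(-2, 0.5^2) and N(2, 0.5^2)."""
    return -math.log(math.exp(-0.5 * ((q + 2) / 0.5) ** 2)
                     + math.exp(-0.5 * ((q - 2) / 0.5) ** 2))

GRID = [-12 + 0.01 * i for i in range(2401)]
DQ = 0.01

def log_Z(beta):
    """Quadrature estimate of log Z(beta) for the geometric bridge."""
    return math.log(sum(math.exp(-(1 - beta) * V_B(q) - beta * V(q)) for q in GRID) * DQ)

def expected_delta_V(beta):
    """Quadrature estimate of E_{pi_beta}[Delta V], Delta V = V - V_B."""
    num = den = 0.0
    for q in GRID:
        w = math.exp(-(1 - beta) * V_B(q) - beta * V(q))
        num += (V(q) - V_B(q)) * w
        den += w
    return num / den

# Simpson's rule over beta for -integral of E_{pi_beta}[Delta V].
n = 100
h = 1.0 / n
integral = expected_delta_V(0.0) + expected_delta_V(1.0)
for i in range(1, n):
    integral += (4 if i % 2 else 2) * expected_delta_V(i * h)
integral *= h / 3

lhs = log_Z(1.0) - log_Z(0.0)   # direct difference of normalizing constants
rhs = -integral                 # thermodynamic identity
```

The two sides agree to quadrature precision; the adiabatic flow exploits the same identity continuously in time rather than over a fixed \beta grid.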
[Figure: heating and cooling metastabilities, shown as stalled trajectories in the (q, p) plane.]

Fortunately we can readily recover from a metastability by resampling the momenta,

  p \sim \pi(p \mid q),

effectively reheating the system.

We also need to compute the intermediate expectations E_{\pi_\beta}[\Delta V] needed to generate each transition. Hamiltonian Monte Carlo gives efficient local estimates, which can be aggregated together into a global estimator,

  E_{\pi_\beta}[\Delta V] \approx \frac{\sum_{n=1}^{N} \hat{z}_n \, \Delta V(q_n)}{\sum_{n=1}^{N} \hat{z}_n}.

Finally, there is the problem of correcting for the error from numerical approximations to the exact transitions, which carry an initial state (q_i, p_i) \sim \pi_{\beta = 1} to a final state (q_f, p_f) \sim \pi_{\beta = 0}. We can't apply a naive Metropolis correction, but perhaps we can apply a correction with a swap? Unfortunately, swapping states doesn't work because the discretized \beta trajectories will not, in general, be aligned.
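A concrete version of such an aggregated estimator can be sketched as follows. The talk does not spell out the weights \hat{z}_n, so as an assumption I take draws q_n from the base distribution with bridge weights \hat{z}_n = \exp(-\beta \, \Delta V(q_n)), which makes the ratio a self-normalized importance-sampling estimate of E_{\pi_\beta}[\Delta V]:

```python
import math
import random

SIGMA_B, SIGMA, MU = 2.0, 1.0, 1.0   # base N(0, 4), target N(1, 1)
BETA = 0.5

def delta_V(q):
    """Delta V = V - V_B for the Gaussian pair."""
    return 0.5 * ((q - MU) / SIGMA) ** 2 - 0.5 * (q / SIGMA_B) ** 2

rng = random.Random(1)
num = den = 0.0
for _ in range(200_000):
    q = rng.gauss(0.0, SIGMA_B)        # draw q_n from the base distribution
    z = math.exp(-BETA * delta_V(q))   # bridge weight z_n
    num += z * delta_V(q)
    den += z
estimate = num / den                   # self-normalized estimate of E_{pi_beta}[Delta V]

# Closed form for comparison: pi_beta is Gaussian with precision a and mean m.
a = (1 - BETA) / SIGMA_B ** 2 + BETA / SIGMA ** 2
m = BETA * MU / SIGMA ** 2 / a
exact = (((m - MU) ** 2 + 1 / a) / (2 * SIGMA ** 2)
         - (m ** 2 + 1 / a) / (2 * SIGMA_B ** 2))
```

In this easy one-dimensional example the weighted estimate lands on the closed-form expectation to Monte Carlo precision; the open problem is doing this reliably along a dynamic \beta trajectory in high dimensions.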