UNIVERSITY OF CALIFORNIA, IRVINE
Advanced Bayesian Computational Methods
through Geometric Techniques
DISSERTATION
submitted in partial satisfaction of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
in Statistics
by
Shiwei Lan
Dissertation Committee
Assistant Professor Babak Shahbaba, Chair
Professor Wesley O. Johnson
Assistant Professor Jeffrey Streets
2013
© 2013 Shiwei Lan
DEDICATION
To my dear wife Yuanyuan and lovely daughter Lydia coming next January. . .
Contents

List of Figures
List of Tables
List of Algorithms
Acknowledgements
Curriculum Vitae
Abstract

1 Introduction
  1.1 Background
  1.2 Contributions
  1.3 Outline

2 Hamiltonian Monte Carlo
  2.1 Hamiltonian Dynamics
    2.1.1 Properties
  2.2 Hamiltonian Monte Carlo Algorithm
    2.2.1 Metropolis-Hastings Algorithm
    2.2.2 Proposal guided by Hamiltonian dynamics
    2.2.3 Leapfrog Method
  2.3 Discussion

3 Split Hamiltonian Monte Carlo
  3.1 Introduction
  3.2 Splitting the Hamiltonian
    3.2.1 Splitting the Hamiltonian with a partial analytic solution
    3.2.2 Splitting the Hamiltonian by splitting the data
  3.3 Application of Split HMC to logistic regression models
    3.3.1 Split HMC with a partial analytical solution for a logistic model
    3.3.2 Split HMC with splitting of data for a logistic model
  3.4 Experiments
    3.4.1 Simulated data
    3.4.2 Results on real data sets
  3.5 Discussion

4 Lagrangian Monte Carlo
  4.1 Introduction
  4.2 Riemannian Hamiltonian Monte Carlo
    4.2.1 Hamiltonian dynamics on Riemannian manifold
    4.2.2 Riemannian Hamiltonian Monte Carlo Algorithm
  4.3 Semi-explicit Lagrangian Monte Carlo
    4.3.1 Lagrangian Dynamics: from Momentum to Velocity
    4.3.2 Semi-explicit Lagrangian Monte Carlo Algorithm
    4.3.3 Stationarity
  4.4 Explicit Lagrangian Monte Carlo
    4.4.1 Fully explicit integrator
    4.4.2 Volume Correction
  4.5 Experimental Results
    4.5.1 Banana-shaped distributions
    4.5.2 Logistic Regression Models
    4.5.3 Multivariate T-distributions
    4.5.4 Finite Mixture of Gaussians
  4.6 Discussion

5 Wormhole Hamiltonian Monte Carlo
  5.1 Introduction
  5.2 Energy Barrier in HMC
  5.3 Wormhole HMC Algorithm
    5.3.1 Tunnel Metric
    5.3.2 Wind Tunnel
    5.3.3 Wormhole
  5.4 Mode Searching After Regeneration
    5.4.1 Identifying Regeneration Times
    5.4.2 Searching New Modes
    5.4.3 Regenerative Wormhole HMC
  5.5 Empirical Results
    5.5.1 Sensor Network Localization
    5.5.2 Mixture of Gaussians with Known Modes
    5.5.3 Mixture of Gaussians with Unknown Modes
  5.6 Discussion

6 Spherical Hamiltonian Monte Carlo for Constrained Target Distributions
  6.1 Introduction
  6.2 Sampling from distributions defined on the unit ball
    6.2.1 Change of the domain: from unit ball B_0^D(1) to sphere S^D
    6.2.2 Hamiltonian Dynamics on Sphere
    6.2.3 Spherical HMC algorithm
  6.3 Constraints
    6.3.1 Norm constraints
    6.3.2 Functional constraints
  6.4 Experimental results
    6.4.1 Truncated Multivariate Gaussian
    6.4.2 Bayesian Lasso
    6.4.3 Bridge regression
    6.4.4 Modeling synchrony among multiple neurons
  6.5 Discussion

7 Conclusion
  7.1 Future Directions

Bibliography

Appendices
  A Lagrangian Monte Carlo
    A.1 Equivalence between Riemannian Hamiltonian dynamics and Lagrangian dynamics
    A.2 Stationarity of Lagrangian Monte Carlo
    A.3 Convergence of explicit integrator to Lagrangian dynamics
  B Solutions to split Lagrangian dynamics on Sphere
List of Figures

1.1 Comparison of RWM, HMC and RHMC
1.2 Relationship of Chapters
2.1 Illustration of Hamiltonian Dynamics
3.1 Comparison of HMC and RWM in simulating a 2d Gaussian
3.2 An illustrative binary classification problem
3.3 Approximation in Split HMC with a partial analytic solution
3.4 Approximation in Split HMC by splitting the data
4.1 Comparison of RWM, HMC and RHMC in exploring a banana-shaped distribution
4.2 Comparison of RHMC, sLMC and LMC in exploring a banana-shaped distribution
4.3 Histograms of the banana-shaped distribution
4.4 Comparison of RHMC, sLMC and LMC in exploring a thin banana-shaped distribution
4.5 Change of sampling efficiency: trade-off between geometry and efficiency
4.6 Density plots of the generated synthetic Mixture of Gaussians
5.1 Energy Barrier in HMC
5.2 Comparison of HMC and THMC in sampling from a 2d distribution with two modes
5.3 Shape of wind tunnel
5.4 Sampling from a mixture of 10 Gaussians in 100 dimensions using THMC with wind vector
5.5 Wormhole Construction
5.6 Discovering unknown modes by down-weighting known ones
5.7 Comparison of RDMC and WHMC in location inference of a wireless sensor network
5.8 Comparing WHMC to RDMC using K mixtures of D-dimensional Gaussians
5.9 Comparing RWHMC to RDMC in terms of REM using K = 10 mixtures of D-dimensional Gaussians
5.10 Number of modes identified by RWHMC over time in simulating K = 10 mixtures of Gaussians with D = 10, 100
5.11 Comparison of WHMC and WLMC in simulating a 2d distribution with 2 modes
6.1 Transforming unit ball B_0^D(1) to sphere S^D
6.2 Truncated Multivariate Gaussian
6.3 Bayesian Lasso using different sampling algorithms
6.4 Sampling Efficiency in Bayesian Lasso
6.5 Bayesian Bridge Regression by Spherical HMC
6.6 Trace plots of samples: rewarded stimulus
6.7 Trace plots of samples: non-rewarded stimulus
List of Tables

3.1 Split HMC vs HMC in sampling efficiency: simulated logistic regression
3.2 Split HMC vs HMC in sampling efficiency: logistic regression on real data
4.1 Efficiency comparison of HMC, RHMC, sLMC and LMC: banana-shaped distribution
4.2 Efficiency comparison of HMC, RHMC, sLMC and LMC: thin banana-shaped distribution
4.3 Efficiency comparison of HMC, RHMC, sLMC and LMC: 5 real logistic regression problems
4.4 Densities used for the generation of synthetic Mixture of Gaussian data sets
4.5 Efficiency comparison of HMC, RHMC, sLMC and LMC: 5 mixtures of Gaussians
6.1 Moments Matching by RWM, Wall HMC, and Spherical HMC
6.2 Efficiency comparison of RWM, Wall HMC, and Spherical HMC: Truncated Multivariate Gaussian
6.3 Efficiency comparison of RWM, Wall HMC, and Spherical HMC: Copula modeling of synchrony among multiple neurons
List of Algorithms

2.1 Hamiltonian Monte Carlo (HMC)
3.1 Split Hamiltonian Monte Carlo with a partial analytic solution (Split HMC-PAS)
3.2 Split Hamiltonian Monte Carlo by splitting the data (Split HMC-SD)
4.1 Riemannian Hamiltonian Monte Carlo (RHMC)
4.2 Semi-explicit Lagrangian Monte Carlo (sLMC)
4.3 Explicit Lagrangian Monte Carlo (LMC)
5.1 Wormhole Hamiltonian Monte Carlo (WHMC)
5.2 Regenerative Wormhole Hamiltonian Monte Carlo (RWHMC)
6.1 Spherical Hamiltonian Monte Carlo (Spherical HMC)
ACKNOWLEDGEMENTS

I would like to express my greatest gratitude to my advisor, Professor Babak Shahbaba, for his insightful guidance and persistent encouragement throughout my doctoral program. It has truly been a blessing to work with him; he made my transition to the field of Statistics a smooth one. He has not only encouraged me to think independently at every step of my research and given me much support and advice, but has also taught me the necessary presentation and writing skills from his own experience. This dissertation would never have been written without his help.

I am also grateful to the other two members of my dissertation committee, Professor Wesley O. Johnson and Professor Jeffrey Streets. I learned advanced statistics and Bayesian modeling from Professor Johnson, who also provided me with a lot of support and suggestions during my degree pursuit. I want to thank Professor Streets for his precious discussion time, which inspired many ideas in this dissertation. I would like to express my thanks to all my collaborators: Professor Mark Girolami, Professor Jeffrey Streets, Vasileios Stathopoulos and Bo Zhou. I want to acknowledge the help of Sungjin Ahn, Yutian Chen and Anoop Korattikara, who patiently answered my questions about the details of their work. I am also thankful to Professor Max Welling for his enlightening comments.

Finally, thanks to my family! My wife, Yuanyuan Li, has sacrificed much in order to accompany and support me wherever I am. We owe our deepest gratitude to our parents for their continuing love, care, help and support!
CURRICULUM VITAE
Shiwei Lan
EDUCATION
Doctor of Philosophy in Statistics
University of California, Irvine
2013
Irvine, California
Master of Science in Mathematics
University of California, Irvine
2010
Irvine, California
Bachelor of Science in Mathematics
Nanjing University
2005
Nanjing, China
EXPERIENCE
Graduate Research Assistant
University of California, Irvine
06/2013–present
Irvine, California
Teaching Assistant
University of California, Irvine
09/2006–06/2013
Irvine, California
MATHEMATICAL SKILLS
General: Mathematical/Real/Complex/Numerical Analysis, ODE/PDE
Geometry: Topology, Differential Geometry, Geometric Analysis
Statistics: Bayesian Statistics, Data Analysis, Stochastic Process
COMPUTER SKILLS
C/C++, Matlab, Mathematica, R, SAS, Stata
HONORS
Excellent Graduation
Nanjing University
top 20%
2005
National Scholarship
Nanjing University
4 of 150
2002, 2003, 2004
REVIEWER
Statistical Analysis and Data Mining
Scandinavian Journal of Statistics
TALKS
Spherical HMC for Constrained Target Distributions
AI/ML seminar
November 2013
UC Irvine
Split HMC
5th International Conference of ERCIM
December 2012
Oviedo, Spain
Lagrangian Dynamical Monte Carlo
AI/ML seminar
November 2012
UC Irvine
PUBLICATIONS
Spherical HMC for Constrained Target Distributions
Shiwei Lan, Bo Zhou, and Babak Shahbaba
http://arxiv.org/abs/1309.4289
2013
Wormhole Hamiltonian Monte Carlo
Shiwei Lan, Jeffrey Streets, and Babak Shahbaba
http://arxiv.org/abs/1306.0063
2013
Split Hamiltonian Monte Carlo
Babak Shahbaba, Shiwei Lan, Wesley O. Johnson and Radford M. Neal
Statistics and Computing, DOI: 10.1007/s11222-012-9373-1.
2013
Lagrangian Dynamical Monte Carlo
Shiwei Lan, Vassilios Stathopoulos, Babak Shahbaba, and Mark Girolami
http://arxiv.org/abs/1211.3759
2012
ABSTRACT
Modern statistical methods relying on Bayesian inference typically involve intractable models that require computationally intensive algorithms, such as Markov
Chain Monte Carlo (MCMC), for their implementation. While simple MCMC
algorithms (e.g., random walk Metropolis) might be effective at exploring low-dimensional probability distributions, they can be very inefficient for complex,
high-dimensional distributions. More specifically, broader application of MCMC
is hindered by either slow mixing or expensive computational cost. As a result,
many existing MCMC algorithms are not efficient or capable enough to handle
complex models that are now commonly used in statistics and machine learning. This dissertation focuses on utilizing geometrically motivated methods to
improve the efficiency of MCMC samplers while lowering the computational cost, with the aim of extending the application of MCMC methods to complex statistical problems involving heavy computation, complicated distribution structure, multimodality, and parameter constraints.
We start by extending the standard Hamiltonian Monte Carlo (HMC) algorithm by splitting the Hamiltonian in a way that allows enhanced movement around the state space at low computational cost. For more advanced
HMC algorithms defined on Riemannian manifolds, we propose a new method,
Lagrangian Monte Carlo, which is capable of exploring complex probability distributions at relatively low computational cost. For multimodal distributions,
we have developed a geometrically motivated approach, Wormhole Hamiltonian
Monte Carlo, that explores the distribution around the known modes effectively
while identifying previously unknown modes in the process. Furthermore, we
propose another algorithm, Spherical Hamiltonian Monte Carlo, that combines
geometric methods and computational techniques to provide a natural and efficient framework for sampling from constrained distributions. We use a variety
of simulations and real data to illustrate the substantial improvement obtained
by our proposed methods over alternative solutions.
1 Introduction
1.1 Background
In Bayesian statistics, for given data D, our model P(D|θ) contains parameters θ, which are usually assigned a distribution P(θ) based on certain prior knowledge. The posterior knowledge of θ is then:

P(θ|D) = P(D|θ)P(θ)/P(D) ∝ P(D|θ)P(θ)

It is important, for example, in prediction:

P(y*|D) = ∫ P(y*|θ)P(θ|D)dθ

Such integration is almost omnipresent in Bayesian modeling, but it is very often intractable in the sense that it has no closed form. To infer/estimate the intractable posterior, we appeal to approximate methods. The two most prominent strategies in the literature are variational inference [1, 2] and Markov chain Monte Carlo (MCMC) [3, 4].

Taking advantage of mean-field theory [5], variational Bayesian inference searches for a variational distribution Q(θ) in a flexible family that is closest to the true posterior P(θ|D) by iteratively reducing their distance (the Kullback-Leibler divergence, D_KL(Q||P)), thus transforming the inference problem into an optimization problem. The variational Bayesian method can be viewed as an extension of the Expectation-Maximization (EM) algorithm [6]; but instead of finding the maximum a posteriori (MAP) estimate, it computes an approximation to the entire posterior distribution for statistical inference/estimation, and it also provides an optimal lower bound on the marginal likelihood as a byproduct. This method is further studied by [7, 8, 9, 10, 11, 12, 13] for applications in more general settings.
While variational Bayes provides a locally optimal, exact analytical solution to an approximation of the posterior, MCMC, on the other hand, approximates the exact posterior using a set of samples from a Markov chain. For example,

P(y*|D) = ∫ P(y*|θ)P(θ|D)dθ ≈ (1/S) ∑_{s=1}^{S} P(y*|θ^(s)),  θ^(s) ∼ P(θ|D)

By the functional central limit theorem [14], the above approximation is unbiased with variance approximately σ²τ/S, where the autocorrelation time¹ τ can be interpreted as the number of dependent samples carrying the same information as one independent sample [3, 15, 16, 17].

¹ τ = 1 + 2 ∑_{k=1}^{+∞} ρ(k), where ρ(k) is the autocorrelation function at lag k.

Compared to variational Bayes, MCMC algorithms tend to provide better approximations (typically at higher computational cost), especially in high dimensions. MCMC provides a simple but powerful tool in Bayesian learning [4, 18, 19, 20, 21]. Even though variational Bayes and MCMC are two different approximation techniques, they can be naturally combined [22, 23]. This dissertation will concentrate on MCMC methods.
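As a concrete sketch of this Monte Carlo estimate and of the autocorrelation time τ, consider the following Python fragment (the AR(1) chain standing in for MCMC output and the Gaussian predictive density are illustrative placeholders, not models from this dissertation):

    import numpy as np

    def autocorr_time(x, max_lag=200):
        # tau = 1 + 2 * sum_k rho(k), truncated at the first non-positive rho(k)
        x = np.asarray(x) - np.mean(x)
        var = np.mean(x**2)
        tau = 1.0
        for k in range(1, min(max_lag, len(x) - 1)):
            rho = np.mean(x[:-k] * x[k:]) / var
            if rho <= 0.0:
                break
            tau += 2.0 * rho
        return tau

    rng = np.random.default_rng(0)
    theta = np.zeros(5000)
    for s in range(1, 5000):           # AR(1) chain mimicking correlated MCMC draws
        theta[s] = 0.9 * theta[s - 1] + rng.normal()

    # Monte Carlo estimate of P(y*|D) for an illustrative predictive density N(theta, 1), y* = 1
    pred = np.exp(-0.5 * (1.0 - theta) ** 2) / np.sqrt(2 * np.pi)
    tau = autocorr_time(theta)
    print(pred.mean(), tau, len(theta) / tau)   # estimate, tau, effective sample size

The last quantity, S/τ, is the effective number of independent samples implied by the variance formula above.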
The fundamental theorem of Markov chains states that an aperiodic, irreducible Markov chain that has a stationary distribution π(·) must uniquely converge to π(·) [24, 25]. An MCMC method involves designing a reversible transition kernel that has the target distribution as its stationary distribution, and then generating samples according to that transition kernel. Regardless of the starting point, these samples will follow the target distribution once the chain reaches equilibrium. MCMC was introduced to tackle high-dimensional integrals in statistics and machine learning. It is well known, however, that MCMC may suffer from slow mixing (convergence to the stationary distribution) and a heavy computational burden with large data volumes (number of observations) in high dimensions (number of features). The complexity of the target distribution (skewness, multimodality, etc.) can make MCMC sampling of the parameter space difficult, resulting in a low mixing rate. High dimensionality adds another layer of difficulty due to the concentration of probability in certain regions.
Rejection sampling and importance sampling [26, 27] are two primitive Monte Carlo algorithms that remain mainly of demonstrative value due to their inefficiency in practice. The Metropolis algorithm [18] is responsible for the universality of MCMC. Given the current state θ, to derive a Markov chain having π(θ) as its stationary distribution, it first makes a proposal θ* ∼ q(θ*|θ) and accepts it as the next state with probability min{1, π(θ*)/π(θ)}, or otherwise stays at the current state. A simple proposal is q(θ*|θ) = N(θ, σ²I), called Random Walk Metropolis (RWM). However, its diffusive behavior makes the resulting Markov chains mix slowly, thus limiting its efficiency in practice. [19] generalizes it to allow asymmetric proposals (q(θ*|θ) ≠ q(θ|θ*)), e.g. independent proposals. Note that the Gibbs sampler [20] is a special case of the cyclic Metropolis-Hastings (M-H) algorithm, obtained by taking the proposal distribution as P(θ_i*|θ_{−i}) and updating parameters coordinate-wise. Though such proposals are always accepted [28], the full conditionals are not necessarily available or easy to sample from. Besides the recent advancements of M-H algorithms by [29, 30, 31, 32], careful design of the transition kernel is needed for the Markov chain to converge quickly to the target distribution.
Using auxiliary variables can allow us to design efficient MCMC algorithms. This strategy is successfully used in slice sampling [33], which uniformly samples from the region under the density plot by alternating uniform sampling in the ancillary vertical direction with uniform sampling in the horizontal "slice". Although slice sampling performs very well for univariate distributions, its generalization to higher dimensions can be problematic. It has recently been developed further by [34, 35].
Hamiltonian Monte Carlo (HMC) [36] is another popular example of MCMC design using ancillary variables. As a special case of the Metropolis algorithm, HMC augments the states of interest with ancillary variables and proposes augmented states that are distant from the current state by deterministically simulating Hamiltonian dynamics, yet are nevertheless accepted with high probability. Guided by the gradient information of the log density, HMC reduces the random walk behavior of RWM and significantly improves the efficiency of exploring the target distribution. We can see from figure 1.1 that RWM moves slowly, whereas HMC is more efficient in exploring the distribution with the help of geometry. [37] provides a complete introduction to HMC. [38] address two major issues involving tuning parameters (trajectory length and step size). [39] generalize HMC to a Riemannian manifold to further improve the sampler's ability to explore complicated distributions. There are other recent works on HMC by [40, 41, 42, 43, 44, 45].
As the dimension grows, the Hamiltonian dynamical system becomes increasingly restricted by its smallest eigen-direction, requiring smaller step sizes to maintain stability. Moreover, complicated distribution structure demands local adaptation of both the step size and the direction for HMC to better explore the parameter space. Riemannian HMC (RHMC) [39] defines HMC on a Riemannian manifold, which, as argued by [46], is more suitable for sampling from complicated non-Gaussian distributions. Specifically, RHMC uses a position-dependent preconditioning matrix G(θ) in HMC to adapt to the local geometry of the distribution. As seen in figure 1.1, with the geometric information from the second-order derivative matrix (Fisher metric) of the log posterior density, RHMC avoids the erratic behavior of HMC and explores the parameter space more smoothly. RHMC is developed and generalized by [47, 48, 49, 50, 51].
[Figure 1.1 panels: Sampling Path of RWM, Sampling Path of HMC, Sampling Path of RHMC; axes θ1 and θ2 over [−2, 2].]
Figure 1.1: The first 10 iterations in sampling from a banana-shaped distribution by RWM, HMC and RHMC. Left: RWM explores the distribution in a non-systematic way. Middle: HMC, with gradient information, is more guided in its exploration. Right: RHMC uses more geometric information, the curvature, to explore the distribution even more directly.
Figure 1.1 illustrates the motivation for using geometry to improve MCMC sampling efficiency. From RWM to RHMC, the more geometric information a sampler adopts, the better its capability to explore the target distribution, and thus the better the mixing behavior of the Markov chain. As expected, the computational cost increases as more geometry is incorporated. This dissertation mainly focuses on using geometry to improve the efficiency of MCMC samplers while keeping the computational cost low, with the aim of making MCMC methods more applicable to complex statistical problems involving heavy computation, complicated distribution structure, multimodality, constraints, etc.
There are other interesting Monte Carlo methods not mentioned above. Tempered transitions [52] are an elaborate improvement over simulated tempering [53, 54] for sampling from multimodal distributions; both take advantage of simulated annealing [55] as an optimization algorithm. [51, 56, 57] discuss sampling from probability distributions defined on a submanifold embedded in R^D. Reversible jump MCMC [58] extends standard MCMC algorithms by allowing the dimension of the posterior to vary. Sequential Monte Carlo (particle filtering) [59] methods are a set of online posterior density estimation algorithms successfully used in time series modeling. They are all more or less related to the work of this dissertation.
1.2 Contributions
Both computational and geometric methods are used to improve the efficiency of MCMC samplers. The main contributions of the dissertation are as follows:

• Split Hamiltonian Monte Carlo speeds up HMC by splitting the Hamiltonian to allow enhanced movement around the state space at low computational cost.

• Lagrangian Monte Carlo is capable of exploring complex probability distributions just as RHMC does, but at reduced cost, by avoiding RHMC's expensive implicit updates.

• Wormhole Hamiltonian Monte Carlo is a novel geometric MCMC method that can effectively and efficiently sample from multimodal distributions in high dimensions.

• Spherical Hamiltonian Monte Carlo combines geometric and computational techniques to provide a natural and efficient framework for sampling from constrained distributions.
1.3 Outline
This dissertation is organized as follows. Chapter 2 provides an overview of Hamiltonian Monte Carlo. Chapter 3 discusses the split HMC method for improving the computational efficiency of HMC. Chapter 4 explains why Lagrangian dynamics, which uses velocity as opposed to momentum, is preferable to Riemannian Hamiltonian dynamics. Chapter 5 discusses how geometry can be utilized to facilitate movement between modes when sampling from multimodal distributions. Chapter 6 combines both computational and geometric methods to implicitly and efficiently handle constrained sampling problems. Chapter 7 provides conclusions and discusses future research directions. The relationship among these chapters is shown in figure 1.2.
[Figure 1.2 diagram labels: HMC, RHMC; Split HMC (efficiency), LMC (complicated structure), Wormhole HMC (multimodality), Spherical HMC (constraints).]
Figure 1.2: Relationship of Chapters
2 Hamiltonian Monte Carlo
Hamiltonian Monte Carlo originated from the landmark paper [36], which termed the method Hybrid Monte Carlo and united MCMC with molecular simulation. Its statistical application began with Neal's work on neural networks [60]. HMC suppresses the random walk behavior of RWM by making proposals that are distant from the current state, yet have a high probability of being accepted. These proposals are found by numerically simulating Hamiltonian dynamics for some number of discretized time steps. In this chapter, we review Hamiltonian dynamics and its application to MCMC, where we are interested in sampling from the distribution of θ. By using auxiliary variables p, we can improve the computational efficiency of MCMC algorithms. While we provide the physical interpretation of this method, it can also simply be considered a data augmentation approach.
2.1 Hamiltonian Dynamics
Hamiltonian dynamics is a set of differential equations guiding the evolution of the state of a particle in a closed system according to the law of energy conservation. It provides useful intuition for its application to MCMC. In this section, a brief overview of Hamiltonian dynamics and its properties is given; one can find a more detailed review in [37].
Consider a frictionless puck sliding on a surface of varying height. The state space of this system consists of its position, denoted by the vector θ ∈ R^D, and its momentum, denoted by the vector p ∈ R^D. The potential energy, U(θ), is proportional to the height of the surface at position θ, and the kinetic energy is K(p) := p^T p/(2m), where m is the mass of the puck. As the puck moves up a slope, its potential energy increases while its kinetic energy decreases. The puck keeps climbing to the point where its kinetic energy becomes zero, then slides back down, with its potential energy decreasing and its kinetic energy increasing.

[Figure 2.1: Illustration of Hamiltonian Dynamics: a puck on a curved surface, with height U(θ) over position θ.]
The total energy of the above dynamical system is represented by a function called the Hamiltonian, defined as follows:
Definition 2.1 (Hamiltonian). The Hamiltonian H is defined as the total energy, the sum
of the potential and kinetic energy:
H(θ, p) = U(θ) + K(p)    (2.1)
Then the evolution of the state (θ, p) over time t is governed by the following Hamilton equations (2.2).

Definition 2.2 (Hamiltonian Dynamics). Given a differentiable Hamiltonian H(θ, p), Hamiltonian dynamics is defined by the following differential equations:

θ̇ = ∂H/∂p = ∇_p K(p)
ṗ = −∂H/∂θ = −∇_θ U(θ)    (2.2)

where ˙ denotes the time derivative, and ∇_p = [∂/∂p_1, · · · , ∂/∂p_D].
The solution to (2.2) defines a flow¹ T_t : M^{2D} × R → M^{2D}, (θ(t_0), p(t_0), t) ↦ (θ(t_0 + t), p(t_0 + t)), for all t_0, t ∈ R.

¹ A flow T is a mapping T : M × R → M such that for all z ∈ M and s, t ∈ R: T(z, 0) = z and T(T(z, t), s) = T(z, s + t).

Alternatively, denote the state z := (θ, p) and the symplectic matrix J := [[0_{D×D}, I_{D×D}], [−I_{D×D}, 0_{D×D}]]; then Hamiltonian dynamics (2.2) can be rewritten as

ż = J∇_z H(z)
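As a concrete one-dimensional example, take U(θ) = θ²/2 and K(p) = p²/2. Then (2.2) reads θ̇ = p, ṗ = −θ, whose solution is the rotation

θ(t) = θ(0) cos t + p(0) sin t,  p(t) = −θ(0) sin t + p(0) cos t

A rotation preserves areas in the (θ, p) plane and keeps H(θ, p) = (θ² + p²)/2 constant, and negating p and rotating for time t inverts the original rotation; this simple case previews the three properties established below.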
2.1.1 Properties

Hamiltonian dynamics has three fundamental properties that are crucial for its application to MCMC: i) time reversibility; ii) symplecticity (volume preservation); iii) energy (Hamiltonian) conservation [61].
Time Reversibility states the one-to-one correspondence between the forward-time evolution T_t and the time-reversed evolution T_{−t}, which is also the inverse of the flow, T_t^{−1}. Time reversibility of the dynamics is used to prove the reversibility of the Markov chain transitions in HMC, which in turn provides an easy proof of the stationarity of the resulting Markov chain.

Definition 2.3 (Time Reversibility). A dynamical system is time reversible if there exists an involution¹ ν that gives a one-to-one mapping between its forward-time evolution and time-reversed evolution in the following way:

T_{−t} = ν ∘ T_t ∘ ν

¹ An involution ν is a function that is its own inverse, i.e. ν² = ν ∘ ν = id, where id is the identity map.
Proposition 2.1 (Time Reversibility). Hamiltonian dynamics (2.2) is time reversible.

Proof. Let ν be the mapping reverting the direction of the momentum; i.e., ν can be taken as the matrix I := [[I_{D×D}, 0_{D×D}], [0_{D×D}, −I_{D×D}]] acting on z. Denote z′ = ν(z) = Iz. H is a quadratic function of p, thus H(z′) = H(z). We have

ż′ = Iż = IJ∇_z H(z) = IJI^T ∇_{z′} H(z′) = −J∇_{z′} H(z′)

Therefore, if z(t) = T_t(z_0) for some initial value z_0, then by the above equation we must have z′(t) = T_{−t}(z′_0) for the initial value z′_0 = ν(z_0). Thus, according to the uniqueness of the solution,

T_t(z_0) = z(t) = ν(z′(t)) = ν ∘ T_{−t} ∘ ν(z_0)

Noticing that ν is an involution, we have T_{−t} = ν ∘ T_t ∘ ν.

Remark 2.1. The time reversibility of Hamiltonian dynamics has the following interpretation. Starting from θ_0 with some initial momentum p_0, evolve the dynamics (2.2) for some time to reach θ_1 with momentum p_1. Now, starting from θ_1 with flipped momentum −p_1, we arrive at θ_0 with momentum −p_0 after evolving (2.2) for the same amount of time; further flipping the direction of −p_0 recovers the original initial state (θ_0, p_0).
Volume Preservation means that any infinitesimal region R in the state space has the same volume after being mapped by the flow T_t, i.e. Vol(R) = Vol(T_t(R)) for all t ∈ R. It (or the stronger condition of symplecticity) is a property that simplifies the acceptance probability for Metropolis updates: if it did not hold, the acceptance probability would need to be adjusted by the Jacobian determinant of the discretized evolution to guarantee the stationarity of the resulting Markov chain. See Proposition 4.3 in Chapter 4.
Proposition 2.2 (Volume Preservation). Hamiltonian dynamics (2.2) is volume preserving.

Proof. The easiest proof is to show that the divergence of the vector field (θ̇, ṗ) is zero, so that the flux across the boundary of any infinitesimal volume is zero; namely,

∇ · (θ̇, ṗ) = ∂θ̇/∂θ^T + ∂ṗ/∂p^T = ∂/∂θ^T (∂H/∂p) − ∂/∂p^T (∂H/∂θ) = 0

One could also refer to the divergence theorem, or more directly to Liouville's theorem [62].
Energy Conservation makes the acceptance probability one in Metropolis updates if (2.2) is solved analytically, and makes the acceptance probability depend only on the discretization error when (2.2) is solved numerically.

Proposition 2.3 (Energy Conservation). Hamiltonian dynamics (2.2) conserves energy.

Proof. Based on the Hamilton equations (2.2),

dH/dt = θ̇^T (∂H/∂θ) + ṗ^T (∂H/∂p) = (∂H/∂p)^T (∂H/∂θ) + (−∂H/∂θ)^T (∂H/∂p) = 0
In practice, Hamiltonian dynamics (2.2) can often only be solved numerically, and the first two properties remain valid for a suitable discretization. They are important in the application to MCMC, as they constitute an essential part of the proof of stationarity of the induced Markov chain, and they make the algorithm convenient to use. In particular, a numerical method for solving differential equations that satisfies properties i) time reversibility and ii) symplecticity (volume preservation) is called a geometric integrator [39, 61].
2.2 Hamiltonian Monte Carlo Algorithm

Hamiltonian dynamics can be used to guide the proposals in the Metropolis-Hastings algorithm and thus suppress the random walk behavior of the RWM algorithm. The resulting algorithm is called Hamiltonian Monte Carlo (HMC).
2.2.1 Metropolis-Hastings Algorithm
The Metropolis algorithm [18] is a popular MCMC sampling scheme, which was generalized
for asymmetric proposals by [19].
Suppose the target distribution is π(·). We want to derive a transition probability (kernel) T(θ^(n+1)|θ^(n)) for generating samples {θ^(n)} that have the target distribution π(·) as their stationary distribution. A crucial sufficient (but not necessary) condition to ensure stationarity is the detailed balance condition [28]:

π(θ^(n)) T(θ^(n+1)|θ^(n)) = π(θ^(n+1)) T(θ^(n)|θ^(n+1))    (2.3)
Given the current state θ^(n), the M-H algorithm makes a proposal θ* according to some easy-to-sample distribution q(θ*|θ^(n)), then accepts the proposal θ* with the following acceptance probability:

α_MH(θ^(n), θ*) = min{1, [π(θ*)/q(θ*|θ^(n))] / [π(θ^(n))/q(θ^(n)|θ*)]}    (2.4)

Set θ^(n+1) = θ* if θ* is accepted, or θ^(n+1) = θ^(n) otherwise.
The M-H transition kernel is:

T(θ^(n+1)|θ^(n)) = q(θ^(n+1)|θ^(n)) α(θ^(n), θ^(n+1)) + δ_{θ^(n)}(θ^(n+1)) ∫ q(θ*|θ^(n)) (1 − α(θ^(n), θ*)) dθ*    (2.5)
The detailed balance condition (2.3) can be verified from (2.4) and (2.5) [28].

Note there are two popular choices of proposal distribution q(θ*|θ^(n)): the independence sampler, q(θ*|θ^(n)) = q(θ*), and the symmetric proposal (Metropolis algorithm), which satisfies q(θ*|θ^(n)) = q(θ^(n)|θ*). For RWM, q(θ*|θ^(n)) = N(θ^(n), σ²I). For HMC, q(θ*|θ^(n)) is defined by a symmetric deterministic process, discussed below. Both RWM and HMC are specific examples of the Metropolis algorithm.
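A minimal Python sketch of one M-H transition may clarify the mechanics (the standard normal target and the Gaussian random walk proposal are illustrative assumptions; with a symmetric proposal, the q-terms in (2.4) cancel):

    import numpy as np

    rng = np.random.default_rng(1)

    def log_pi(theta):
        # illustrative target: standard normal log-density, up to a constant
        return -0.5 * np.sum(theta**2)

    def mh_step(theta, sigma=0.5):
        # one RWM transition: propose, then accept with probability (2.4)
        proposal = theta + sigma * rng.normal(size=theta.shape)
        if np.log(rng.uniform()) < log_pi(proposal) - log_pi(theta):
            return proposal              # accept
        return theta                     # reject: stay at the current state

    theta = np.zeros(2)
    chain = np.empty((1000, 2))
    for n in range(1000):
        theta = mh_step(theta)
        chain[n] = theta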
2.2.2 Proposal guided by Hamiltonian dynamics
prediction. Here we use Hamiltonian dynamics to guide the proposal q(θ ∗ |θ (n) ) to develop
HMC algorithm. Instead of only using the variables of interest θ alone, we consider the joint
state z = (θ, p), where p is a vector of fictitious variables of the same dimension as θ.
Assume the distribution of interest has density π(θ). We define its potential energy for
the dynamical system as minus the log of the density π(θ). In Bayesian statistics, θ consists
11
2. HAMILTONIAN MONTE CARLO
of the model parameters (and perhaps latent variables). It is of interest to sample from the
posterior distribution of θ given the observed data D. Thus the corresponding potential
energy is defined up to a constant as follows:
U(θ) = −log(π(θ|D)) = −[log(P(θ)) + log(L(θ|D))]    (2.6)

where P(θ) is the prior density and L(θ|D) is the likelihood function.
To make use of Hamiltonian dynamics, we augment the parameter space of θ with an auxiliary momentum vector p of the same dimension as θ. This vector p is assigned the distribution defined by the kinetic energy function K(p) := ½p^T M^{−1} p, resulting in a density proportional to exp(−K(p)), i.e.

p ∼ N(0, M)

where M is the mass matrix, often set to the identity matrix I in standard HMC for convenience. An alternative, more complex choice is the Fisher information matrix, which can help explore the parameter space more efficiently; see [39] and chapter 4 for more details.
The joint density of (θ, p) is defined through the Hamiltonian function as

f(θ, p) ∝ exp(−H(θ, p)) = exp(−U(θ)) exp(−K(p))    (2.7)

Note that θ and p are independent for a fixed mass matrix M ≡ const, but not in general, e.g. for a position-dependent mass matrix G(θ).
The HMC algorithm works as follows: i) given the current state θ^(n), first sample a random momentum variable p^(n) ∼ N(0, M); ii) evolve the joint state z = (θ, p) for some time t according to Hamiltonian dynamics (2.2) to obtain a proposal z* = (θ*, p*) = T_t(z); iii) accept the proposal z* with the following acceptance probability:

α_HMC(z^(n), z*) = min{1, [f(z*) δ_{T_{−t}(z*)}(z^(n))] / [f(z^(n)) δ_{T_t(z^(n))}(z*)]} = min{1, exp(−H(z*) + H(z^(n)))}    (2.8)

where δ is the Dirac delta function. Finally, drop the auxiliary momentum variable p and repeat i)-iii). In fact, step ii) means that the proposal mechanism in HMC is actually deterministic, i.e.

q(z*|z^(n)) = δ_{T_t(z^(n))}(z*)    (2.9)

with the randomness coming from sampling the momentum p^(n) in step i).

The following theorem ensures the validity of HMC as detailed above:
Theorem 2.1. The Markov chain generated by the HMC procedure i)-iii) has the joint distribution (2.7) as its stationary distribution.

Proof. Let z^(n+1) = T_t(z^(n)). It suffices to verify the detailed balance condition (2.3) for z^(n+1) ≠ z^(n) (otherwise (2.3) is trivial).

LHS = f(z^(n)) T(z^(n+1)|z^(n)) = f(z^(n)) q(z^(n+1)|z^(n)) α_HMC(z^(n), z^(n+1))
    = f(z^(n)) δ_{T_t(z^(n))}(z^(n+1)) min{1, [f(z^(n+1)) δ_{T_{−t}(z^(n+1))}(z^(n))] / [f(z^(n)) δ_{T_t(z^(n))}(z^(n+1))]}
    = min{f(z^(n)) δ_{T_t(z^(n))}(z^(n+1)), f(z^(n+1)) δ_{T_{−t}(z^(n+1))}(z^(n))}
    = f(z^(n+1)) δ_{T_{−t}(z^(n+1))}(z^(n)) min{1, [f(z^(n)) δ_{T_t(z^(n))}(z^(n+1))] / [f(z^(n+1)) δ_{T_{−t}(z^(n+1))}(z^(n))]}
    = f(z^(n+1)) q(z^(n)|z^(n+1)) α_HMC(z^(n+1), z^(n)) = f(z^(n+1)) T(z^(n)|z^(n+1)) = RHS
Remark 2.2. Note that in the above proof, only the difference of the energy functions enters (2.8), and only their gradients enter T_t in (2.9); so the energy functions could have been defined up to a fixed constant.
2.2.3 Leapfrog Method
Observe that in the acceptance probability (2.8), if z* = T_t(z^(n)) is evolved analytically, then according to property iii), energy conservation of Hamiltonian dynamics, we would have α_HMC(z^(n), z*) ≡ 1, i.e. proposals are always accepted. In practice, however, it is difficult to solve the Hamilton equations (2.2) analytically, so we need to approximate these equations by discretizing time with some small step size ε. Because of its accuracy (small local discretization error) and stability (controlled global discretization error), the following leapfrog method is commonly used to solve (2.2) numerically:

p(t + ε/2) = p(t) − (ε/2)∇_θ U(θ(t))
θ(t + ε) = θ(t) + ε∇_p K(p(t + ε/2))
p(t + ε) = p(t + ε/2) − (ε/2)∇_θ U(θ(t + ε))    (2.10)
The leapfrog integrator, also known as the Störmer-Verlet method [63] and denoted T̂_ε, is i) time reversible and ii) volume preserving. One can check the time reversibility of T̂_ε by noting that switching the two states z(t) and z(t + ε) and negating time¹ does not change the form of the integrator (2.10). The volume preservation of T̂_ε can be verified from the fact that the Jacobian determinant |∂z(t + ε)/∂z(t)| ≡ 1.

¹ This property is actually called (time) symmetry of an integrator, T̂_ε = T̂*_ε := T̂_{−ε}^{−1}, which is not trivial for a discretized solution. According to the definition 2.3 of time reversibility, we should have checked that flipping the momentum direction, switching the states, and flipping the momentum direction again after evolution keeps the form of the integrator. But since the kinetic energy is quadratic in the momentum in classical mechanics, the two are equivalent.
In practice, we numerically solve (2.2) using the leapfrog method for L steps with step size ε to make a proposal, which we accept as the new state with probability (2.8), which may be less than 1 in this case, or else we stay at the current state. During this procedure, the Metropolis updates leave H fluctuating around some fixed value.

See [37] for more discussion of the leapfrog method's numerical properties, and [61] and chapter 3 for a deeper interpretation of leapfrog. Algorithm 2.1 below summarizes the HMC steps used to generate a sample.
Algorithm 2.1 Hamiltonian Monte Carlo (HMC)
  Initialize θ^(1) = current θ
  Sample new momentum p^(1) ∼ N(0, M)
  Calculate current H(θ^(1), p^(1)) = U(θ^(1)) + K(p^(1))
  for ℓ = 1 to L do
    % Update the momentum for a half step
    p^(ℓ+1/2) = p^(ℓ) − (ε/2)∇_θ U(θ^(ℓ))
    % Update the position for a full step
    θ^(ℓ+1) = θ^(ℓ) + εM^{−1} p^(ℓ+1/2)
    % Update the momentum for a half step
    p^(ℓ+1) = p^(ℓ+1/2) − (ε/2)∇_θ U(θ^(ℓ+1))
  end for
  Calculate proposed H(θ^(L+1), p^(L+1)) = U(θ^(L+1)) + K(p^(L+1))
  α_HMC = exp{−proposed H + current H}
  if runif(1) < α_HMC then
    Current θ = θ^(L+1)
  end if
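Algorithm 2.1 translates almost line for line into code. The following Python sketch assumes M = I and a two-dimensional standard normal target; the step size and trajectory length are illustrative choices, not values prescribed by the text:

    import numpy as np

    rng = np.random.default_rng(0)

    def U(theta):                    # potential energy: minus log target density
        return 0.5 * np.sum(theta**2)

    def grad_U(theta):
        return theta

    def hmc_step(theta, eps=0.1, L=20):
        p = rng.normal(size=theta.shape)             # sample new momentum
        current_H = U(theta) + 0.5 * np.sum(p**2)
        theta_new, p_new = theta.copy(), p.copy()
        for _ in range(L):                           # leapfrog (2.10)
            p_new -= 0.5 * eps * grad_U(theta_new)   # half step for momentum
            theta_new += eps * p_new                 # full step for position
            p_new -= 0.5 * eps * grad_U(theta_new)   # half step for momentum
        proposed_H = U(theta_new) + 0.5 * np.sum(p_new**2)
        if rng.uniform() < np.exp(current_H - proposed_H):
            return theta_new                         # accept
        return theta                                 # reject

    theta = np.zeros(2)
    chain = np.empty((2000, 2))
    for n in range(2000):
        theta = hmc_step(theta)
        chain[n] = theta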
2.3 Discussion
Now we revisit the illustration of Hamiltonian dynamics in figure 2.1. By the definition of the potential energy, U(θ), its minimum corresponds to the maximum of the target density. The sliding puck in figure 2.1 provides the intuition for sampling: recording the proposal after evolving Hamiltonian dynamics (2.2) for a fixed trajectory length (εL) is equivalent to recording the puck's position at fixed time intervals. The puck moves faster towards the lower energy region than away from it, so it takes less time for the puck to move from a higher energy region to a lower energy region than in the reverse direction. Being recorded at a constant frequency, the puck therefore has a greater chance of visiting the lower energy (higher density) region.
Even though HMC has an advantage over RWM in guiding the proposals, there are more parameters to tune: the step size ε, the number of leapfrog steps L, and the mass matrix M. The choice of step size ε is crucial: a small value of ε leads to slow convergence of the resulting Markov chain, whereas a large value of ε results in a low acceptance rate of proposals. As suggested by [37], the number of leapfrog steps L can be randomized over a certain range to avoid periodic movement while exploring the distribution. A recent work, the No-U-Turn Sampler (NUTS) [38], gives a tuning-free solution by letting the sampler follow the longest trajectory that does not turn back on itself. The mass matrix M can be chosen as the inverse Hessian of the potential energy evaluated at the density mode θ̂ if the target distribution can be well approximated by a multivariate Gaussian; but in general, a position-specific matrix G(θ), e.g. the Fisher information, can be adopted [39]. See more discussion in chapter 4.
3 Split Hamiltonian Monte Carlo
3.1 Introduction
The simple Metropolis algorithm [18] is often effective at exploring low-dimensional distributions, but it can be very inefficient for complex, high-dimensional distributions — successive
states may exhibit high autocorrelation, due to the random walk nature of the movement.
Faster exploration can be obtained using Hamiltonian Monte Carlo (HMC), which was first
introduced by [36], who called it “hybrid Monte Carlo”, and which has been recently reviewed by [37]. HMC reduces the random walk behavior of Metropolis by proposing states
that are distant from the current state, but nevertheless have a high probability of acceptance. These distant proposals are found by numerically simulating Hamiltonian dynamics
for some specified amount of fictitious time.
For this simulation to be reasonably accurate (as required for a high acceptance probability), the step size used must be suitably small. This step size determines the number of
steps needed to produce the proposed new state. Since each step of this simulation requires
a costly evaluation of the gradient of the log density, the step size is the main determinant
of computational cost.
In this chapter, we show how the technique of “splitting” the Hamiltonian [37, 61] can
be used to reduce the computational cost of producing proposals for HMC. In our approach,
splitting “separates” the Hamiltonian, and consequently the simulation of the dynamics, into
two parts. We discuss two contexts in which one of these parts can capture most of the rapid
variation in the energy function, but is computationally cheap. Simulating the other slowly
varying part requires costly steps, but can use a large step size. The result is that fewer
costly gradient evaluations are needed to produce a distant proposal. We illustrate these
splitting methods using logistic regression models. Computer programs for our methods are
publicly available from http://www.ics.uci.edu/~babaks/Site/Codes.html.
Figure 3.1: Comparison of Hamiltonian Monte Carlo (HMC) and Random Walk Metropolis (RWM) when applied to a bivariate normal distribution. Left plot: The first 30 iterations of HMC with 20 leapfrog steps. Right plot: The first 30 iterations of RWM with 20 updates per iteration.
As an illustration, consider sampling from the following bivariate normal distribution:

θ ∼ N(µ, Σ),  with µ = (3, 3)^T and Σ = [[1, 0.95], [0.95, 1]]

For HMC, we set L = 20 and ε = 0.15. The left plot in figure 3.1 shows the first 30 states from an HMC run started with θ = (0, 0). The density contours of the bivariate normal distribution are shown as gray ellipses. The right plot shows every 20th state from the
For HMC, we set L = 20 and ε = 0.15. The left plot in figure 3.1 shows the first 30 states
from an HMC run started with θ = (0, 0). The density contours of the bivariate normal
distribution are shown as gray ellipses. The right plot shows every 20th state from the
first 600 iterations of a run of a simple random walk Metropolis (RWM) algorithm. (This
takes time comparable to that for the HMC run.) The proposal distribution for RWM is a
bivariate normal with the current state as the mean, and 0.152 I2 as the covariance matrix.
(The standard deviation of this proposal is the same as the step size of HMC.) Figure 3.1
shows that HMC explores the distribution more efficiently, with successive samples being
farther from each other, and autocorrelations being smaller. For an extended review of
HMC, its properties, and its advantages over RWM, see [37].
In this example, we have assumed that one leapfrog step for HMC (which requires evaluating the gradient of the log density) takes approximately the same computation time as one
Metropolis update (which requires evaluating the log density), and that both move approximately the same distance. The benefit of HMC comes from this movement being systematic,
rather than in a random walk.¹ We now propose a new approach, called Split Hamiltonian Monte Carlo (Split HMC), which further improves the performance of HMC by modifying how steps are done, with the effect of reducing the time for one step or increasing the distance that one step moves.

¹ Indeed, in this two-dimensional example, it is better to use Metropolis with a large proposal standard deviation, even though this leads to a low acceptance probability, because this also avoids a random walk. However, in higher-dimensional problems with more than one highly confining direction, a large proposal standard deviation leads to such a low acceptance probability that this strategy is not viable.
3.2 Splitting the Hamiltonian
As discussed by [37], variations on HMC can be obtained by using discretizations of Hamiltonian dynamics derived by "splitting" the Hamiltonian, H, into several terms:

H(θ, p) = H_1(θ, p) + H_2(θ, p) + · · · + H_K(θ, p)

We use T_{i,t}, for i = 1, . . . , K, to denote the mapping defined by H_i for time t. Assuming that we can implement Hamiltonian dynamics for each H_i exactly, the composition T_{1,ε} ∘ T_{2,ε} ∘ . . . ∘ T_{K,ε} is a valid discretization of Hamiltonian dynamics based on H if the H_i are twice differentiable [61]. This discretization is symplectic and hence preserves volume. It will also be reversible if the sequence of H_i is symmetric: H_i(θ, p) = H_{K−i+1}(θ, p).

Indeed, the leapfrog method (2.10) can be regarded as a symmetric splitting of the Hamiltonian H(θ, p) = U(θ) + K(p) as

H(θ, p) = U(θ)/2 + K(p) + U(θ)/2    (3.1)
In this case, H_1(θ, p) = H_3(θ, p) = U(θ)/2 and H_2(θ, p) = K(p). Hamiltonian dynamics for H_1 is

θ̇ = ∂H_1/∂p = 0
ṗ = −∂H_1/∂θ = −(1/2)∇_θ U(θ)

which for a duration of discretized time step size ε gives the first part of a leapfrog step. For H_2, the dynamics is

θ̇ = ∂H_2/∂p = ∇_p K(p)
ṗ = −∂H_2/∂θ = 0

For step size ε, this gives the second part of the leapfrog step. Hamiltonian dynamics for H_3 is the same as that for H_1 since H_1 = H_3, giving the third part of the leapfrog step.
3.2.1 Splitting the Hamiltonian with a partial analytic solution
Suppose the potential energy U(θ) can be written as U_0(θ) + U_1(θ). We can then split H as

H(θ, p) = U_1(θ)/2 + [U_0(θ) + K(p)] + U_1(θ)/2    (3.2)
Here, H1 (θ, p) = H3 (θ, p) = U1 (θ)/2 and H2 (θ, p) = U0 (θ) + K(p). The first and the last
terms in this splitting are similar to equation (3.1), except that U1 (θ) replaces U (θ), so the
first and the last part of a leapfrog step remain as before, except that we use U1 (θ) rather
than U (θ) to update p. Now suppose that the middle part of the leapfrog, which is based
on the Hamiltonian U0 (θ) + K(p), can be handled analytically — that is, we can compute
the exact dynamics for any duration of time. We hope that since this part of the simulation
introduces no error, we will be able to use a larger step size, and hence take fewer steps,
reducing the computation time for the dynamical simulations.
We are mainly interested in situations where U_0(θ) provides a reasonable approximation to U(θ); in particular, for Bayesian applications, we can use the Laplace approximation. Specifically, we approximate U(θ) with U_0(θ), the energy function for N(θ̂, J^{−1}(θ̂)), where θ̂ is the posterior mode (maximum a posteriori, MAP), obtained by fast optimization algorithms such as the Newton-Raphson method, and J(θ̂) is the Hessian matrix of U at θ̂. Finally, we set U_1(θ) = U(θ) − U_0(θ), the error in this approximation.
[40] have recently proposed a similar splitting strategy for HMC, in which a Gaussian
component is handled analytically, in the context of high-dimensional approximations to
a distribution on an infinite-dimensional Hilbert space. In such applications, the Gaussian
distribution will typically be derived from the problem specification, rather than being found
as a numerical approximation, as we do here.
Using a normal approximation in which U_0(θ) = ½(θ − θ̂)^T J(θ̂)(θ − θ̂), and letting K(p) = ½p^T p (the energy for the standard normal distribution), H_2(θ, p) = U_0(θ) + K(p) in equation (3.2) will be quadratic, and Hamilton's equations will be a system of first-order linear differential equations that can be handled analytically [64]. Specifically, setting θ_o = θ − θ̂, the dynamical equations can be written as follows:

d/dt [θ_o(t); p(t)] = [[0, I], [−J(θ̂), 0]] [θ_o(t); p(t)]    (3.3)

which can be denoted as ż(t) = Az(t), where z = [θ_o; p] and A = [[0, I], [−J(θ̂), 0]].

The solution of this system is z(t) = e^{At} z(0), where z(0) is the initial value at time t = 0, and e^{At} = ∑_{k=0}^{+∞} (At)^k / k! is a matrix exponential. This matrix exponential can be simplified as e^{At} = Γe^{Dt}Γ^{−1} using the following eigendecomposition of the matrix A:
A = ΓDΓ^{−1}

where Γ is invertible and D is a diagonal matrix of eigenvalues. Therefore the solution to the system (3.3) is

z(t) = Γe^{Dt}Γ^{−1}z(0)

and e^{Dt} can easily be computed by exponentiating the diagonal elements of D times t.
Remark 3.1. Note that Γ and D could be complex matrices since A is not symmetric, but the solution z(t) must be real. This can be shown as follows.

Eigendecompose J(θ̂) = Γ*D*(Γ*)^{−1}, where Γ* and D* are real because J(θ̂) is symmetric and positive definite. Let θ* = (Γ*)^T θ_o and p* = (Γ*)^T p. The dynamics (3.3) can then also be solved as

[θ*(t); p*(t)] = [[(D*)^{−1/2}, 0], [0, I]] · [[cos((D*)^{1/2}t), sin((D*)^{1/2}t)], [−sin((D*)^{1/2}t), cos((D*)^{1/2}t)]] · [[(D*)^{1/2}, 0], [0, I]] · [θ*(0); p*(0)]

The solution can be recognized as stretching, rotating, then shrinking the initial state, which is related to the symplectic structure of the dynamical system (3.3).
The above analytical solution is, of course, only for the middle part (denoted H_2) of equation (3.2). We still need to approximate the overall Hamiltonian dynamics based on H using the leapfrog method. Algorithm 3.1 shows the corresponding leapfrog steps: after an initial step of size ε/2 based on U_1(θ), we obtain the exact solution for a time step of ε based on H_2(θ, p) = U_0(θ) + K(p), and finish by taking another step of size ε/2 based on U_1(θ).
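A minimal Python sketch of one trajectory of Algorithm 3.1 follows; the Hessian J_hat, the MAP estimate theta_hat, and the zero placeholder for ∇U_1 are illustrative assumptions, not values from the text:

    import numpy as np

    D = 2
    theta_hat = np.zeros(D)                    # MAP estimate (placeholder)
    J_hat = np.array([[2.0, 0.3],
                      [0.3, 1.0]])             # Hessian of U at the MAP (placeholder)

    def grad_U1(theta):
        # gradient of the residual U1 = U - U0; zero only if U is exactly quadratic
        return np.zeros_like(theta)

    eps, L = 0.5, 10
    A = np.block([[np.zeros((D, D)), np.eye(D)],
                  [-J_hat,           np.zeros((D, D))]])
    w, Gamma = np.linalg.eig(A)                # A = Gamma D Gamma^{-1} (complex in general)
    R_eps = (Gamma @ np.diag(np.exp(w * eps)) @ np.linalg.inv(Gamma)).real

    rng = np.random.default_rng(0)
    theta, p = theta_hat.copy(), rng.normal(size=D)
    for _ in range(L):
        p -= 0.5 * eps * grad_U1(theta)        # half step driven by U1 only
        z = R_eps @ np.concatenate([theta - theta_hat, p])
        theta, p = z[:D] + theta_hat, z[D:]    # exact flow of U0 + K for time eps
        p -= 0.5 * eps * grad_U1(theta)

The real part is taken exactly because, as shown in Remark 3.1, the solution must be real even though Γ and D may be complex.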
3.2.2 Splitting the Hamiltonian by splitting the data
The method discussed in the previous section requires that we be able to handle the Hamiltonian H_2(θ, p) = U_0(θ) + K(p) analytically. If this is not the case, splitting the Hamiltonian in this way may still be beneficial if the computational cost for U_0(θ) is substantially lower than for U(θ). In these situations, we can use the following split:

H(θ, p) = U_1(θ)/2 + ∑_{m=1}^{M} [U_0(θ)/(2M) + K(p)/M + U_0(θ)/(2M)] + U_1(θ)/2    (3.4)
the outer part takes half steps to update p based on U1 alone, and the inner part involves
21
3. SPLIT HAMILTONIAN MONTE CARLO
Algorithm 3.1 Leapfrog for split Hamilto- Algorithm 3.2 Nested leapfrog for split
nian Monte Carlo with a partial analytic so- Hamiltonian Monte Carlo with splitting of
lution
data
Dε −1
Sample initial values for p from N(0, I)
Rε ← Γe Γ
for ` = 1 to L do
Sample initial values for p from N(0, I)
p ← p − (ε/2)∇θ U1 (θ)
for ` = 1 to L do
for m = 1 to M do
p ← p − (ε/2)∇θ U1 (θ)
p ← p − (ε/(2M ))∇θ U0 (θ)
θo ← θ − θ̂
θ ← θ + (ε/M )p
z0 ← (θo , p)
p ← p − (ε/(2M ))∇θ U0 (θ)
(θo , p) ← Rε z0
end
for
θ ← θo + θ̂
p ← p − (ε/2)∇θ U1 (θ)
p ← p − (ε/2)∇θ U1 (θ)
end for
end for
M leapfrog steps of size ε/M based on U0 . Algorithm 3.2 implements this nested leapfrog
method.
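As a concrete illustration of Algorithm 3.2, the R sketch below simulates a single nested-leapfrog trajectory; grad_U0 and grad_U1 are assumed user-supplied gradient functions for the two energy terms, and the standard Metropolis acceptance test on H is applied to the returned state afterwards.

  # One trajectory of the nested leapfrog (Algorithm 3.2); a minimal sketch.
  split_hmc_trajectory <- function(theta, grad_U0, grad_U1, eps, L, M) {
    p <- rnorm(length(theta))                 # p ~ N(0, I)
    for (l in 1:L) {
      p <- p - (eps / 2) * grad_U1(theta)     # outer half step based on U1
      for (m in 1:M) {                        # M inner leapfrog steps based on U0
        p     <- p - (eps / (2 * M)) * grad_U0(theta)
        theta <- theta + (eps / M) * p
        p     <- p - (eps / (2 * M)) * grad_U0(theta)
      }
      p <- p - (eps / 2) * grad_U1(theta)     # outer half step based on U1
    }
    list(theta = theta, p = p)
  }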
For example, suppose our statistical analysis involves a large data set with many observations, but we believe that a small subset of data is sufficient to build a model that performs
reasonably well (compared to the model that uses all the observations). In this case, we can
construct U0 (θ) based on a small part of the observed data, and use the remaining observations to construct U1 (θ). If this strategy is successful, we will be able to use a large step
size for steps based on U1 , reducing the cost of a trajectory computation.
In detail, we divide the observed data, y, into two subsets: R0 , which is used to construct
U0 (θ), and R1 , which is used to construct U1 (θ):
U(θ) = U0(θ) + U1(θ)
U0(θ) = − log[P(θ)] − Σ_{i∈R0} log[P(yi|θ)]   (3.5)
U1(θ) = − Σ_{i′∈R1} log[P(yi′|θ)]
Note that the prior P (θ) appears in U0 (θ) only.
[37] discusses a related strategy for splitting the Hamiltonian by splitting the observed
data into multiple subsets. However, instead of randomly splitting data, as proposed there,
here we split data by building an initial model based on the MAP estimate, θ̂, and using
this model to identify the small subset of data that captures most of the information in the
full data set.
3.3 Application of Split HMC to logistic regression models
We now look at how Split HMC can be applied to Bayesian logistic regression models for
binary classification problems. We will illustrate this method using the simulated data set
with n = 100 data points and p = 2 covariates that is shown in figure 3.2.
Figure 3.2: An illustrative binary classification problem with n = 100 data points and two
covariates, x1 and x2 , with the two classes represented by white circles and black squares.
The logistic regression model assigns probabilities to the two possible classes (denoted
by 0 and 1) in case i (for i = 1, . . . , n) as follows:
P(yi = 1|xi, α, β) = exp(α + xiᵀβ) / (1 + exp(α + xiᵀβ))
Here, xi is the vector of length p with the observed values of the covariates in case i, α is
the intercept, and β is the vector of p regression coefficients. We use θ to denote the vector
of all p + 1 unknown parameters, (α, β).
Let P(θ) be the prior distribution for θ. The posterior distribution of θ given x and y is proportional to P(θ) ∏_{i=1}^{n} P(yi|xi, θ). The corresponding potential energy function is

U(θ) = − log[P(θ)] − Σ_{i=1}^{n} log[P(yi|xi, θ)]
We assume the following (independent) priors for the model parameters:
α ∼ N(0, σα²)
βj ∼ N(0, σβ²),   j = 1, . . . , p
where σα and σβ are known constants.
The potential energy function for the above logistic regression model is therefore as
follows:
U(θ) = α²/(2σα²) + Σ_{j=1}^{p} βj²/(2σβ²) − Σ_{i=1}^{n} [yi(α + xiᵀβ) − log(1 + exp(α + xiᵀβ))]
The partial derivatives of the energy function with respect to α and the βj are

∂U/∂α = α/σα² − Σ_{i=1}^{n} [yi − exp(α + xiᵀβ)/(1 + exp(α + xiᵀβ))]
∂U/∂βj = βj/σβ² − Σ_{i=1}^{n} xij[yi − exp(α + xiᵀβ)/(1 + exp(α + xiᵀβ))]
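For reference, the energy function and its gradient above translate directly into code; a minimal R sketch (not the thesis implementation), with assumed inputs X (the n × p covariate matrix), y (the 0/1 responses), and theta = c(alpha, beta):

  U <- function(theta, X, y, sigma_a = 5, sigma_b = 5) {
    eta <- theta[1] + drop(X %*% theta[-1])
    theta[1]^2 / (2 * sigma_a^2) + sum(theta[-1]^2) / (2 * sigma_b^2) -
      sum(y * eta - log1p(exp(eta)))
  }
  grad_U <- function(theta, X, y, sigma_a = 5, sigma_b = 5) {
    eta <- theta[1] + drop(X %*% theta[-1])
    r <- y - plogis(eta)                              # y_i - P(y_i = 1 | x_i, theta)
    c(theta[1] / sigma_a^2 - sum(r),                  # dU / d alpha
      theta[-1] / sigma_b^2 - drop(crossprod(X, r)))  # dU / d beta_j
  }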
3.3.1 Split HMC with a partial analytical solution for a logistic model
To apply algorithm 3.1 for Split HMC to this problem, we approximate the potential energy
function U (θ) for the logistic regression model with the potential energy function U0 (θ) of
the normal distribution N(θ̂, J −1 (θ̂)), where θ̂ is the MAP estimate of model parameters.
U0 (θ) usually provides a reasonable approximation to U (θ), as illustrated in figure 3.3. In the
plot on the left, the solid curve shows the value of the potential energy, U , as β1 varies, with
β2 and α fixed to their MAP values, while the dashed curve shows U0 for the approximating
normal distribution. The right plot of figure 3.3 compares the partial derivatives of U and
U0 with respect to β1 , showing that ∂U0 /∂βj provides a reasonable linear approximation to
∂U /∂βj .
Since there is no error when solving Hamiltonian dynamics based on U0 (θ), we would
expect that the total discretization error of the steps taken by algorithm 3.1 will be less than
for the standard leapfrog method, for a given step size, and that we will therefore be able to
use a larger step size — and hence need fewer steps for a given trajectory length — while
still maintaining a good acceptance rate. The step size will still be limited to the region of
stability imposed by the discretization error from U1 = U − U0 , but this limit will tend to
be larger than for the standard leapfrog method.
Figure 3.3: Left plot: The potential energy, U , for the logistic regression model (solid curve)
and its normal approximation, U0 (dashed curve), as β1 varies, with other parameters at their
MAP values. Right plot: The partial derivatives of U and U0 with respect to β1 .
3.3.2 Split HMC with splitting of data for a logistic model
To apply algorithm 3.2 to this logistic regression model, we split the Hamiltonian by splitting
the data into two subsets. Consider the illustrative example discussed above. In the left plot
of figure 3.4, the thick line represents the classification boundary using the MAP estimate,
θ̂. For the points that fall on this boundary line, the estimated probabilities for the two
groups are equal, both being 1/2. The probabilities of the two classes become less similar
as the distance of the covariates from this line increases. We will define U0 using the points
within the region, R0 , within some distance of this line, and define U1 using the points in
the region, R1 , at a greater distance from this line. Equivalently, R0 contains those points
for which the probability that y = 1 (based on the MAP estimates) is closest to 1/2.
The shaded area in Figure 3.4 shows the region, R0 , containing the 30% of the observations closest to the MAP line, or equivalently the 30% of observations for which the
probability of class 1 is closest (in either direction) to 1/2. The unshaded region containing the remaining data points is denoted as R1 . Using these two subsets, we can split the
energy function U (θ) into two terms: U0 (θ) based on the data points that fall within R0 ,
and U1 based on the data points that fall within R1 (see equation (3.5)). Then, we use the
equation (3.4) to split the Hamiltonian dynamics.
Note that U0 is not used to approximate the potential energy function, U (the exact
value of U is used for the acceptance test at the end of the trajectory to ensure that
the equilibrium distribution is exactly the target distribution). Rather, ∂U0 /∂βj is used
to approximate ∂U /∂βj , which is the costly computation when we simulate Hamiltonian
dynamics.
Figure 3.4: Left plot: A split of the data into two parts based on the MAP model, represented
by the solid line; the energy function U is then divided into U0 , based on the data points in
R0 , and U1 , based on the data points in R1 . Right plot: The partial derivatives of U and U0
with respect to β1 , with other parameters at their MAP values.
To see that it is appropriate to split the data according to how close the probability of
class 1 is to 1/2, note first that the leapfrog step of the equation (2.10) will have no error
if the derivatives ∇θ U do not depend on θ — that is, when the second derivatives of U are
zero. Recall that for the logistic model,
∂U/∂βj = βj/σβ² − Σ_{i=1}^{n} xij[yi − exp(α + xiᵀβ)/(1 + exp(α + xiᵀβ))]
from which we get

∂²U/(∂βj∂βj′) = δjj′/σβ² + Σ_{i=1}^{n} xij xij′ [exp(α + xiᵀβ)/(1 + exp(α + xiᵀβ)) − (exp(α + xiᵀβ)/(1 + exp(α + xiᵀβ)))²]
             = δjj′/σβ² + Σ_{i=1}^{n} xij xij′ P(yi = 1|xi, α, β)[1 − P(yi = 1|xi, α, β)]
The product P(yi = 1|xi, α, β)[1 − P(yi = 1|xi, α, β)] is symmetric about its maximum at P(yi = 1|xi, α, β) = 1/2, justifying our criterion for selecting points in R0. The right plot of figure 3.4 shows the approximation of ∂U/∂β1 by ∂U0/∂β1, with β2 and α fixed to their MAP values.
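In code, the selection of R0 reduces to ranking the observations by |P̂(yi = 1) − 1/2| under the MAP fit; a minimal R sketch, where theta_hat is an assumed MAP estimate (e.g. from glm() or optim()):

  # Indices of R0: the fraction f of observations whose fitted MAP probability
  # is closest to 1/2.
  select_R0 <- function(theta_hat, X, f = 0.3) {
    p_hat <- plogis(theta_hat[1] + drop(X %*% theta_hat[-1]))
    order(abs(p_hat - 0.5))[1:ceiling(f * nrow(X))]
  }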
3.4 Experiments
In this section, we use simulated and real data to compare our proposed methods to standard
HMC. For each problem, we set the number of leapfrog steps to L = 20 for standard HMC,
26
3.4 Experiments
and find ε such that the acceptance probability (AP) is close to 0.65 [37]. We set L and ε
for the Split HMC methods such that the trajectory length, εL, remains the same, but with
a larger step size and hence a smaller number of steps. Note that this trajectory length is
not necessarily optimal for these problems, but this should not affect our comparisons, in
which the length is kept fixed.
We try to choose ε for the Split HMC methods such that the acceptance probability is
equal to that of standard HMC. However, increasing the step size beyond a certain point
leads to instability of trajectories, in which the error of the Hamiltonian grows rapidly with
L [37], so that proposals are rejected with very high probability. This sometimes limits the
step size of Split HMC to values at which the acceptance probability is greater than the 0.65
aimed at for standard HMC. Additionally, to avoid near periodic Hamiltonian dynamics [37],
we randomly vary the step size over a small range. Specifically, at each iteration of MCMC,
we sample the step size from the Uniform(0.8ε, ε) distribution, where ε is the reported step
size for each experiment.
To measure the efficiency of each sampling method, we use the following autocorrelation
time (ACT) [3, 17]. Throughout this section, we set the number of Markov chain Monte
Carlo (MCMC) iterations for simulating posterior samples to N = 50000.
Definition 3.1 (Autocorrelation Time). Given N posterior samples, we divide them into
batches of size B, then autocorrelation time τ can be estimated as follows:
τ = B · Sb²/S²

where S² is the sample variance and Sb² is the sample variance of the batch means.
Remark 3.2. Autocorrelation time can be roughly interpreted as the number of MCMC transitions required to produce samples that can be considered as independent. In practice, the posterior samples can be divided into N^{1/3} batches of size B = N^{2/3} [65].
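A direct implementation of this batch-means estimator, as a minimal R sketch for a scalar chain x:

  act_batch <- function(x) {
    N <- length(x)
    B <- floor(N^(2/3))                           # batch size, about N^(1/3) batches
    nb <- floor(N / B)
    batch_means <- colMeans(matrix(x[1:(nb * B)], nrow = B))
    B * var(batch_means) / var(x)                 # tau = B * Sb^2 / S^2
  }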
For the logistic regression problems discussed in this section, we could find the autocorrelation time separately for each parameter and summarize the autocorrelation times using their maximum value (i.e., for the slowest moving parameter) to compare different methods. However, since one common goal is to use logistic regression models for prediction, we look at the autocorrelation time, τ, for the log likelihood, Σ_{i=1}^{n} log[P(yi|xi, θ)], using the posterior samples of θ. We also look at the autocorrelation time for Σ_j βj² (denoting it τβ), since this may be more relevant when the goal is interpretation of parameter estimates.
We adjust τ (and similarly τβ ) to account for the varying computation time needed by the
different methods in two ways. One is to compare different methods using τ × s, where s is
the CPU time per iteration, using an implementation written in R. This measures the CPU
time required to produce samples that can be regarded as independent samples. We also
compare in terms of τ × g, where g is the number of gradient computations required for each trajectory simulated by HMC, measured in units of a gradient computation over all cases in the full data set. This will be
equal to the number of leapfrog steps, L, for standard HMC or Split HMC using a normal
approximation. When using data splitting with a fraction f of data in R0 and M inner
leapfrog steps, g will be (f M + (1 − f )) × L. In general, we expect that computation time
will be dominated by the gradient computations counted by g, so that τ × g will provide a
measure of performance independent of any particular implementation. In our experiments,
s was close to being proportional to g, except for slightly larger than expected times for Split
HMC with data splitting.
Note that compared to standard HMC, our two methods involve some computational
overhead for finding the MAP estimate. However, the additional overhead associated with
finding the MAP estimate remains negligible (less than a second for most examples discussed
here) compared to the sampling time.
3.4.1 Simulated data
We first tested the methods on a simulated data set with 100 covariates and 10000 observations. The covariates were sampled as xij ∼ N(0, σj²), for i = 1, . . . , 10000 and j = 1, . . . , 100,
with σj set to 5 for the first five variables, to 1 for the next five variables, and to 0.2 for
the remaining 90 variables. We sampled true parameter values, α and βj , independently
from N(0, 1) distributions. Finally, we sampled the class labels according to the model, as
yi ∼ Bernoulli(πi) with logit(πi) = α + xiᵀβ.
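For concreteness, this simulation setup can be written in a few lines of R (the seed is an assumption for reproducibility, not part of the original experiments):

  set.seed(1)
  n <- 10000; p <- 100
  sigma_x <- c(rep(5, 5), rep(1, 5), rep(0.2, 90))
  X <- sapply(sigma_x, function(s) rnorm(n, 0, s))    # covariates x_ij ~ N(0, sigma_j^2)
  alpha <- rnorm(1); beta <- rnorm(p)                 # true parameters from N(0, 1)
  y <- rbinom(n, 1, plogis(alpha + drop(X %*% beta)))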
For the Bayesian logistic regression model, we assumed normal priors with mean zero
and standard deviation 5 for α and βj , where j = 1, . . . , 100. We ran standard HMC,
Split HMC with normal approximation, and Split HMC with data splitting for N = 50000
iterations. For the standard HMC, we set L = 20 and ε = 0.015, so the trajectory length
was 20 × 0.015 = 0.3. For Split HMC with normal approximation and Split HMC with data
splitting, we reduce the number of leapfrog steps to 10 and 3 respectively, while increasing
the step sizes so that the trajectory length remained 0.3. For the data splitting method, we
use 40% of the data points for U0 and set M = 9, which makes g equal 4.2L. Since we set
L = 3, we have g = 12.6, which is smaller than g = L = 20 used for the standard HMC
algorithm.
Table 3.1 shows the results for the three methods. The CPU times (in seconds) per iteration, s, and τ × s for the Split HMC methods are substantially lower than for standard HMC. The comparison is similar looking at τ × g. Based on τβ × s and τβ × g, however, the improvement in efficiency is more substantial for the data splitting method compared to the normal approximation method, mainly because of the difference in their corresponding values of τβ.

            HMC      Split HMC
                     Normal Appr.   Data Splitting
  L         20       10             3
  g         20       10             12.6
  s         0.187    0.087          0.096
  AP        0.69     0.74           0.74
  τ         4.6      3.2            3.0
  τ × g     92       32             38
  τ × s     0.864    0.284          0.287
  τβ        11.7     13.5           7.3
  τβ × g    234      135            92
  τβ × s    2.189    1.180          0.703

Table 3.1: Split HMC (with normal approximation and data splitting) compared to standard HMC using simulated data, on a data set with n = 10000 observations and p = 100 covariates. Here, L is the number of leapfrog steps, g is the number of gradient computations, s is the CPU time (in seconds) per iteration, AP is the acceptance probability, τ is the autocorrelation time based on the log likelihood, and τβ is the autocorrelation time based on Σ_j βj².
3.4.2 Results on real data sets
In this section, we evaluate our proposed method using three real binary classification problems. The data for these three problems are available from the UCI Machine Learning
Repository (http://archive.ics.uci.edu/ml/index.html). For all data sets, we standardized the numerical variables to have mean zero and standard deviation 1. Further, we
assumed normal priors with mean zero and standard deviation 5 for the regression parameters. We used the setup described at the beginning of Section 3.4, running each Markov
chain for N = 50000 iterations. Table 3.2 summarizes the results using the three sampling
methods.
The first problem, StatLog, involves using multi-spectral values of pixels in a satellite
image in order to classify the associated area into soil or cotton crop. (In the original data,
different types of soil are identified.) The sample size for this data set is n = 4435, and the
number of features is p = 37. For the standard HMC, we set L = 20 and ε = 0.08. For
the two Split HMC methods with normal approximation and data splitting, we reduce L
to 14 and 3 respectively while increasing ε so ε × L remains the same as that of standard
HMC. For the data splitting methods, we use 40% of data points for U0 and set M = 10. As
seen in the table, the Split HMC methods improve efficiency, with the data splitting method
performing better than the normal approximation method.
                             HMC      Split HMC
                                      Normal Appr.   Data Splitting
  StatLog            L       20       14             3
  n = 4435, p = 37   g       20       14             13.8
                     s       0.033    0.026          0.023
                     AP      0.69     0.74           0.85
                     τ       5.6      6.0            4.0
                     τ × g   112      84             55
                     τ × s   0.190    0.144          0.095
                     τβ      5.6      4.7            3.8
                     τβ × g  112      66             52
                     τβ × s  0.191    0.122          0.090
  CTG                L       20       13             2
  n = 2126, p = 21   g       20       13             9.8
                     s       0.011    0.008          0.005
                     AP      0.69     0.77           0.81
                     τ       6.2      7.0            5.0
                     τ × g   124      91             47
                     τ × s   0.069    0.055          0.028
                     τβ      24.4     19.6           11.5
                     τβ × g  488      255            113
                     τβ × s  0.271    0.154          0.064
  Chess              L       20       9              2
  n = 3196, p = 36   g       20       13             11.8
                     s       0.022    0.011          0.013
                     AP      0.62     0.73           0.62
                     τ       10.7     12.8           12.1
                     τ × g   214      115            143
                     τ × s   0.234    0.144          0.161
                     τβ      23.4     18.9           19.0
                     τβ × g  468      246            224
                     τβ × s  0.511    0.212          0.252

Table 3.2: HMC and Split HMC (normal approximation and data splitting) on three real data sets. Here, L is the number of leapfrog steps, g is the number of gradient computations, s is the CPU time (in seconds) per iteration, AP is the acceptance probability, τ is the autocorrelation time based on the log likelihood, and τβ is the autocorrelation time based on Σ_j βj².
The second problem, CTG, involves analyzing 2126 fetal cardiotocograms along with
their respective diagnostic features [66]. The objective is to determine whether the fetal
state class is “pathologic” or not. The data include 2126 observations and 21 features. For
the standard HMC, we set L = 20 and ε = 0.08. We reduced the number of leapfrog steps to
13 and 2 for Split HMC with normal approximation and data splitting respectively. For the
latter, we use 30% of data points for U0 and set M = 14. Both splitting methods improved
performance significantly.
The objective of the last problem, Chess, is to predict chess endgame outcomes — either
“white can win” or “white cannot win”. This data set includes n = 3196 instances, where
each instance is a board-description for the chess endgame. There are p = 36 attributes
describing the board. For standard HMC, we set L = 20 and ε = 0.09. For the two
Split HMC methods with normal approximation and data splitting, we reduced L to 9 and
2 respectively. For the data splitting method, we use 35% of the data points for U0 and
set M = 15. Using the Split HMC methods, the computational efficiency is improved
substantially compared to standard HMC. This time, however, the normal approximation
approach performs better than the data splitting method in terms of τ × g, τ × s, and τβ × s,
while the latter performs better in terms of τβ × g.
3.5 Discussion
We have proposed two new methods for improving the efficiency of HMC, both based on
splitting the Hamiltonian in a way that allows much of the movement around the state space
to be performed at low computational cost.
While we demonstrated our methods on binary logistic regression models, they can be
extended to multinomial logistic (MNL) models for multiple classes. For MNL models, the
regression parameters for p covariates and K classes form a matrix of (p + 1) rows and
K columns, which we can regard as a vector of (p + 1) × K elements. For Split HMC
with normal approximation, we can define U0 (θ) using an approximate multivariate normal
N(θ̂, J −1 (θ̂)) as before. For Split HMC with data splitting, we can still construct U0 (θ)
using a small subset of data, based on the class probabilities for each data item found using
the MAP estimates for the parameters (the best way of doing this is a subject for future
research). The data splitting method could be further extended to any model for which it is
feasible to find a MAP estimate, and then divide the data into two parts based on “residuals”
of some form.
Although in theory our method can be used for many statistical models, its usefulness is
of course limited by how well the posterior distribution can be approximated by a Gaussian
distribution in algorithm 3.1, and how well the gradient of the energy function can be approximated using a small but influential subset of data in algorithm 3.2. For example, algorithm
3.1 might not perform well for neural network models, for which the posterior distribution is
usually multimodal. When using neural network classification models, however, one could use algorithm 3.2, selecting a small subset of the data with a simple logistic regression model.
This could be successful when a linear model performs reasonably well, even if the optimal
decision boundary is nonlinear.
The scope of algorithm 3.1 proposed in this chapter might be broadened by finding
better methods to approximate the posterior distribution, such as variational Bayes methods.
Future research could involve finding tractable approximations to the posterior distribution
other than normal distributions. Also, one could investigate other methods for splitting the
Hamiltonian dynamics by splitting the data — for example, fitting a support vector machine
(SVM) to binary classification data, and using the support vectors for constructing U0 .
While the results on simulated data and real problems presented in this chapter have
demonstrated the advantages of splitting the Hamiltonian dynamics in terms of improving the sampling efficiency, our proposed methods do require preliminary analysis of data,
mainly finding the MAP estimate. As mentioned above, the performance of our approach
obviously depends on how well the corresponding normal distribution based on MAP estimates approximates the posterior distribution, or how well a small subset of data found
using this MAP estimate captures the overall patterns in the whole data set. Moreover, this
preliminary analysis involves some computational overhead. For many problems, however,
the computational cost associated with finding the MAP estimate is negligible compared to
the potential improvement in sampling efficiency for the full Bayesian model. For most of
the examples discussed here, the additional computational cost is less than a second. Of
course, there are situations for which finding the MAP estimate could be an issue; this is
especially true for high dimensional problems. For such cases, it might be more practical to
use algorithm 3.2 after selecting a small but influential subset of data based on probabilities
found using a simpler model. For the neural network example discussed above, we can use a
simple logistic regression model with maximum likelihood estimates to select the data points
for U0 .
Although normal approximations have been used for Bayesian inference in the past [see 67], we use them here to explore the parameter space more efficiently while sampling from the exact distribution. One could of course use the approximate normal (Laplace) distribution as a proposal distribution in a Metropolis-Hastings algorithm. With this approach, however, the acceptance rates drop substantially (below 10%) for our examples.
Another approach to improving HMC has recently been proposed by [39]. Their method,
Riemannian HMC (RHMC), can also substantially improve performance. RHMC utilizes
the geometric properties of the parameter space to explore the best direction, typically at
higher computational cost, to produce distant proposals with high probability of acceptance.
In contrast, our method attempts to find a simple approximation to the Hamiltonian to
reduce the computational time required for reaching distant states. It is possible that these
approaches could be combined, to produce a method that performs better than either method
alone. The recent proposals by [38] for automatic tuning of HMC could also be combined
with our Split HMC methods.
4 Lagrangian Monte Carlo
4.1 Introduction
Hamiltonian Monte Carlo (HMC) [36] reduces the random walk behavior of the Metropolis-Hastings algorithm by proposing samples distant from the current state, which nevertheless have a high probability of being accepted. These distant proposals are found by numerically simulating Hamiltonian dynamics for some specified amount of fictitious time [37]. Hamiltonian dynamics can be represented by a function, known as the Hamiltonian, of model parameters θ ∼ π(θ) and auxiliary momentum parameters p ∼ N(0, M) (with the same dimension as θ) as follows:

H(θ, p) = − log π(θ) + ½pᵀM⁻¹p   (4.1)
where M is a symmetric, positive-definite mass matrix.
Hamilton’s equations, which involve differential equations derived from H, determine how
θ and p change over time. In practice, however, solving these equations exactly is difficult
in general, so we need to approximate them by discretizing time, using some small step size
ε. For this purpose, the leapfrog method (2.10) is commonly used.
The numerical simulation of Hamiltonian dynamics is restricted by the smallest eigen-direction, requiring a small step size to maintain the stability of the discretization. [39] propose a new method,
called Riemannian HMC (RHMC), that exploits the geometric properties of the parameter
space to improve the efficiency of standard HMC, especially in sampling distributions with
complex structure (e.g., high correlation, non-Gaussian shape). Simulating the resulting dynamics, however, is computationally intensive since it involves solving two implicit equations,
which require additional iterative numerical computation (e.g., fixed-point iteration).
In an attempt to increase the speed of RHMC, we propose a new integrator that is completely explicit: we replace momentum with velocity in the definition of the
Riemannian Hamiltonian dynamics. As we will see, this is equivalent to using Lagrangian
dynamics as opposed to Hamiltonian dynamics. By doing so, we eliminate one of the implicit
steps in RHMC. Next, we construct a time symmetric integrator to remove the remaining
implicit step in RHMC. This leads to a valid sampling scheme (i.e., converges to the true
target distribution) that involves only explicit equations. We refer to this algorithm as
Lagrangian Monte Carlo (LMC).
In what follows, we begin with a brief review of RHMC and its geometric integrator in section 4.2. Section 4.3 introduces our proposed semi-explicit integrator based on defining Hamiltonian dynamics in terms of velocity as opposed to momentum. Next, in section 4.4, we eliminate the remaining implicit equation and propose a fully explicit integrator. In section 4.5, we use simulated and real data to evaluate our methods' performance. Finally, in section 4.6, we discuss some possible future research directions.
4.2 Riemannian Hamiltonian Monte Carlo
As discussed above, although HMC explores the parameter space more efficiently than random walk Metropolis does, it does not fully exploit the geometric properties of the parameter
space defined by the density π(θ). Indeed, [39] argue that dynamics over Euclidean space
may not be appropriate to guide the exploration of parameter space. To address this issue,
they propose a new method that exploits the Riemannian geometry of the parameter space to
improve standard HMC’s efficiency by automatically adapting to the local structure. They
do this by replacing the fixed mass matrix M in the standard HMC with a more informative
position-specific matrix G(θ), which is set to the Fisher information matrix in this chapter. The resulting method is called Riemannian Hamiltonian Monte Carlo (RHMC). As an illustrative example, figure 4.1 shows the sampling paths of random walk Metropolis (RWM),
HMC, and RHMC for an artificially created banana-shaped distribution [See 39, discussion
by Luke Bornn and Julien Cornebise]. For this example, we fix the trajectory length and choose the step sizes such that the acceptance probability for all three methods remains around 0.7. RWM moves slowly and spends most of its iterations in the distribution's low-density tail, HMC explores the parameter space in an indirect way, while RHMC moves directly to the high-density region and explores the distribution more efficiently.
[Figure 4.1: three panels, Sampling Path of RWM, Sampling Path of HMC, and Sampling Path of RHMC, each plotting θ2 against θ1.]
Figure 4.1: The first 10 iterations in sampling from a banana shaped distribution with random
walk Metropolis (RWM), Hamiltonian Monte Carlo (HMC), and Riemannian HMC (RHMC).
For all three methods, the trajectory length (i.e., step size ε times number of integration steps
L) is set to 1. For RWM, L=17, for HMC, L=7, and for RHMC, L=5. Solid red lines are the
sampling paths, and black circles are the accepted proposals.
4.2.1 Hamiltonian dynamics on Riemannian manifold
Following [46], we define a family of probability distributions as a manifold in the following
sense.
Definition 4.1 (Statistical Manifold). Consider a family of probability distributions parametrized by a D-dimensional vector θ:

M^D := {πθ = π(· ; θ) : X → R | πθ(x) ≥ 0, ∫_X πθ(x)dx = 1, ∀θ ∈ Θ ⊂ R^D}

where the probability density π(·) is defined in general as a Radon-Nikodym derivative with respect to a σ-finite measure, e.g. Lebesgue measure, on the probability space X. If there exists a coordinate system A = {(φ, U) | φ : U ⊂ R^D → M^D, U open, θ ↦ πθ} satisfying

i) any parametrization φ ∈ A is a one-to-one mapping U → M^D;

ii) [Compatible Transitions] given any other one-to-one mapping ψ : V ⊂ R^D → M^D, the following holds: ψ ∈ A ⟺ ψ⁻¹ ∘ φ is a C∞ diffeomorphism (both the mapping and its inverse are C∞),

then we call (M^D, A) a C∞ differentiable manifold, or Statistical Manifold, regarding all compatible parametrizations as equivalent.
Remark 4.1. It is assumed that there exists a true underlying distribution π ∗ (·) that governs
the generation of observations x1 , · · · , xN . Although π ∗ (·) is unknown, the objective is often
to estimate it in order to best model the given data D.
In Bayesian statistics, it is of interest to obtain the posterior π(θ|x) of the model parameter θ given a certain prior. Each π(θ|x), specified by a vector of parameters θ, is an element of M^D, a model to explain the given observations. When substituting in the given data D, π(θ|D) becomes a scalar. For convenience in the following discussion, we assume that ∀x ∈ X, the function θ ↦ π(x; θ) is C∞.
In order to calculate quantities such as length, area, and volume, and to form Hamiltonian dynamics on the manifold M^D, we need to introduce a Riemannian metric on M^D [46, 68].

Definition 4.2 (Riemannian Metric). A Riemannian metric on a smooth manifold M^D is a correspondence which associates to each point πθ ∈ M^D an inner product ⟨· , ·⟩π (a symmetric, bilinear, positive-definite form) on the tangent space TπM, such that for any vector fields X = xᵀ(∂/∂θ)(πθ) and Y = yᵀ(∂/∂θ)(πθ) on M^D,

θ ↦ gθ(X(πθ), Y(πθ)) = ⟨X(πθ), Y(πθ)⟩π = xᵀ⟨(∂/∂θ)(πθ), (∂/∂θ)(πθ)⟩π y =: xᵀG(θ)y

defines a C∞ function on some U. Then we call (M^D, g) a Riemannian manifold.
Remark 4.2. Following [39, 46], we use the Fisher information matrix for the Riemannian metric G(θ) = (gij(θ))_{D×D}; thus it is also called the Fisher metric:

gij(θ) := E[∂i log L(x; θ) ∂j log L(x; θ)] = ∫ ∂i log L(x; θ) ∂j log L(x; θ) L(x; θ)dx   (4.2)

with the shorthand notation ∂i = ∂/∂θi for partial derivatives. In the above definition (4.2), we integrate out all random variables being modeled in the likelihood, so that G becomes a function of θ alone. G(θ) may (e.g. logistic regression in section 4.5.2) or may not (e.g. the banana-shaped distribution in section 4.5.1) involve data. When such integration is not explicit, we use the empirical Fisher information (section 4.5.4) instead. In certain cases, the negative Hessian of the log-prior is also added to the Fisher metric to ensure positive-definiteness [39].
Given the target distribution with density π(θ), which could be the posterior density of θ, we introduce the ancillary momentum p depending on θ, p|θ ∼ N(0, G(θ)), and define the Hamiltonian as follows:

H(θ, p) = − log π(θ) + ½ log det G(θ) + ½pᵀG(θ)⁻¹p = φ(θ) + ½pᵀG(θ)⁻¹p   (4.3)

where φ(θ) := − log π(θ) + ½ log det G(θ). Based on this Hamiltonian, [39] propose the
following Hamiltonian dynamics on the Riemannian manifold:

θ̇ = ∇p H(θ, p) = G(θ)⁻¹p
ṗ = −∇θ H(θ, p) = −∇θφ(θ) + ½ν(θ, p)   (4.4)

where the ith element of the vector ν(θ, p) is (ν(θ, p))i = −pᵀ∂i(G(θ)⁻¹)p = (G(θ)⁻¹p)ᵀ∂iG(θ)G(θ)⁻¹p.
Remark 4.3. Like the general Hamiltonian dynamics (section 2.1.1), the Riemannian Hamiltonian dynamics (4.4) also has the corresponding properties important for MCMC applications: i) time reversibility; ii) volume preservation; iii) energy conservation.
4.2.2 Riemannian Hamiltonian Monte Carlo Algorithm
In practice, we need to numerically solve the non-separable (containing products of θ and p) dynamical system (4.4). However, the resulting map (θ, p) → (θ*, p*) based on the standard leapfrog method (2.10) is neither time-reversible nor symplectic, and thus is not appropriate for solving (4.4) [39]. Instead, they use the Störmer-Verlet [63] method as follows:
p(n+1/2) = p(n) − (ε/2)[∇θφ(θ(n)) − ½ν(θ(n), p(n+1/2))]   (4.5)
θ(n+1) = θ(n) + (ε/2)[G⁻¹(θ(n)) + G⁻¹(θ(n+1))]p(n+1/2)   (4.6)
p(n+1) = p(n+1/2) − (ε/2)[∇θφ(θ(n+1)) − ½ν(θ(n+1), p(n+1/2))]   (4.7)
where ε is the size of the time step. This is also known as the generalized leapfrog method, which can be derived by concatenating a symplectic Euler-B integrator of (4.4) with its adjoint symplectic Euler-A integrator [see 61 for more details]. The above series of transformations T̂ε : (θ(n), p(n)) ↦ (θ(n+1), p(n+1)) provides a deterministic geometric integrator (both time-reversible and volume-preserving) for (4.4).
Starting from the current state (θ(1), p(1)), we evolve the dynamics (4.4) for L discretized steps to get a proposal (θ(L+1), p(L+1)) and accept it according to the following acceptance probability, as in (2.8):

αRHMC = min{1, exp(−H(θ(L+1), p(L+1)) + H(θ(1), p(1)))}   (4.8)

Note the proposal distribution is actually a delta function δ_{T̂εL(θ(1), p(1))}(θ(L+1), p(L+1)).
Algorithm 4.1 summarizes the steps of Riemannian Hamiltonian Monte Carlo (RHMC) [39].

Algorithm 4.1 Riemannian Hamiltonian Monte Carlo (RHMC)
  Initialize θ(1) = current θ
  Sample new momentum p(1) ∼ N(0, G(θ(1)))
  Calculate current H(θ(1), p(1)) according to equation (4.3)
  for ℓ = 1 to L (leapfrog steps) do
    % Update the momentum with fixed-point iteration
    p̂(0) = p(ℓ)
    for i = 1 to NumOfFixedPointSteps do
      p̂(i) = p(ℓ) − (ε/2)[∇θφ(θ(ℓ)) − ½ν(θ(ℓ), p̂(i−1))]
    end for
    p(ℓ+1/2) = p̂(last i)
    % Update the position with fixed-point iteration
    θ̂(0) = θ(ℓ)
    for i = 1 to NumOfFixedPointSteps do
      θ̂(i) = θ(ℓ) + (ε/2)[G⁻¹(θ(ℓ)) + G⁻¹(θ̂(i−1))]p(ℓ+1/2)
    end for
    θ(ℓ+1) = θ̂(last i)
    % Update the momentum exactly
    p(ℓ+1) = p(ℓ+1/2) − (ε/2)[∇θφ(θ(ℓ+1)) − ½ν(θ(ℓ+1), p(ℓ+1/2))]
  end for
  Calculate proposed H(θ(L+1), p(L+1)) according to equation (4.3)
  logRatio = −ProposedH + CurrentH
  Accept or reject the proposal (θ(L+1), p(L+1)) according to logRatio

One major drawback of the generalized leapfrog method is that it involves two implicit functions: equations (4.5) and (4.6). These functions require extra numerical analysis (e.g. fixed-point iteration), which results in higher computational cost and simulation error. This is especially true when solving for θ(n+1), because the fixed-point iteration for (4.6) repeatedly inverts the matrix G(θ). To address this problem, we propose an alternative approach that uses velocity instead of momentum in the equations of motion.
4.3 Semi-explicit Lagrangian Monte Carlo
In this section, Einstein notation is adopted. Whenever an index appears twice in a mathematical expression, we sum over it: e.g., a_i b_i := Σ_i a_i b_i and Γ^k_{ij} v^i v^j := Σ_{i,j} Γ^k_{ij} v^i v^j. A lower index is used for a covariant tensor, whose components vary by the same transformation as the change of basis (e.g., a gradient), whereas the upper index is reserved for a contravariant tensor, whose components vary in the opposite way as the change of basis in order to compensate (e.g., a velocity vector). Interested readers should refer to [69].
4.3.1 Lagrangian Dynamics: from Momentum to Velocity
In the equations of Hamiltonian dynamics (4.4), the term G(θ)−1 p appears several times.
This motivates us to re-parameterize the dynamics in terms of velocity, v = G(θ)−1 p. Note
that this in fact corresponds to the usual definition of velocity in physics, i.e., momentum
divided by mass. The transformation p ↦ v changes the Hamiltonian dynamics (4.4) to the following Lagrangian dynamics¹:

θ̇ = v
v̇ = −η(θ, v) − G(θ)⁻¹∇θφ(θ)   (4.9)

where η(θ, v) is a vector whose kth element is Γ^k_{ij}(θ)v^i v^j. Here, Γ^k_{ij}(θ) := ½g^{kl}(∂i g_{lj} + ∂j g_{il} − ∂l g_{ij}) are the Christoffel symbols, where g_{ij} and g^{ij} denote the (i, j)th elements of G(θ) and G(θ)⁻¹ respectively.
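For intuition, η(θ, v) can be assembled from the metric and its first derivatives without forming the Christoffel symbols one at a time; a minimal R sketch (not the thesis implementation), assuming dG is a precomputed D × D × D array with dG[i, j, l] = ∂g_ij/∂θ_l:

  # eta(theta, v), whose k-th element is Gamma^k_ij(theta) v^i v^j.
  eta_term <- function(G, dG, v) {
    D <- length(v)
    vv <- outer(v, v)
    # Gamma-tilde_{ijk} v^i v^j; by symmetry of g the first two terms coincide
    q <- sapply(1:D, function(k) sum(dG[k, , ] * vv) - 0.5 * sum(dG[, , k] * vv))
    drop(solve(G, q))                # raise the index with the inverse metric
  }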
Proposition 4.1. The Riemannian Hamiltonian dynamics (4.4) is equivalent to the Lagrangian dynamics (4.9).
Proof. Appendix A.1.
Remark 4.4. This transformation p ↦ v moves the complexity of the dynamics (4.4) from its first equation, for θ, to the second equation in (4.9), where most of the time is spent in finding a good direction v. In the following section, we will show that it helps the developed integrator to resolve the implicitness of updating θ, reducing the associated computational cost.
The introduction of velocity v in place of p is also advocated by [40] to avoid large
momentum variables p for the sake of numerical stability. They consider a constant mass so
the resulting dynamics is still Hamiltonian. Indeed, we have an example (section 4.5.1.1) for which RHMC based on the momentum p is numerically very unstable when simulating the dynamics (4.4), so the Lagrangian dynamics (4.9) is preferred over the Hamiltonian dynamics (4.4) also for reasons of numerical stability.
¹Define the Lagrangian as kinetic energy minus potential energy: L = ½vᵀG(θ)v − φ(θ). The new dynamics (4.9) can be proved equivalent to the following Euler-Lagrange equation of the second kind:

d/dt (∂L/∂θ̇) = ∂L/∂θ

which is the solution to the variation of the total Lagrangian (action); that is, in our case, θ̈ = −η(θ, θ̇) − G(θ)⁻¹∇θφ(θ).
Although the Lagrangian dynamics (4.9) in general cannot be recognized as Hamiltonian dynamics in (θ, v), it nevertheless preserves the original Hamiltonian of the system, as one would expect.
Proposition 4.2. The Lagrangian dynamics (4.9) preserves the Hamiltonian H(θ, p) = H(θ, G(θ)v).

Proof. It suffices to prove that (d/dt)H ≡ 0 along (4.9):

d/dt H(θ, G(θ)v) = θ̇ᵀ(∂/∂θ)H(θ, G(θ)v) + v̇ᵀ(∂/∂v)H(θ, G(θ)v)
  = vᵀ∇θφ(θ) + ½vᵀ(vᵀ∂G(θ)v) + [−vᵀΓ(θ)v − G(θ)⁻¹∇θφ(θ)]ᵀG(θ)v
  = [vᵀ∇θφ(θ) − (∇θφ(θ))ᵀv] + [½vᵀ(vᵀ∂G(θ)v) − (vᵀΓ̃(θ)v)ᵀv]
  = 0 + 0 = 0

where vᵀΓ(θ)v is a vector whose kth element is Γ^k_{ij}(θ)v^i v^j. The second 0 is due to the triple form (vᵀΓ̃(θ)v)ᵀv = Γ̃_{ijk}v^i v^j v^k = ½∂k g_{ij} v^i v^j v^k, where Γ̃ is the Christoffel symbol of the first kind, with elements Γ̃_{ijk}(θ) := g_{kl}Γ^l_{ij}(θ) = ½(∂i g_{kj} + ∂j g_{ik} − ∂k g_{ij}).
4.3.2 Semi-explicit Lagrangian Monte Carlo Algorithm
Now we want to use the Lagrangian dynamics (4.9), instead of the Riemannian Hamiltonian dynamics (4.4), as the proposal mechanism in the Metropolis algorithm. In the following we derive a time-reversible integrator for (4.9) which is not volume preserving; however, the detailed balance condition (2.3) can still be achieved by adjusting for the Jacobian determinant in the acceptance probability.
4.3.2.1 Time reversible integrator
Similarly to the generalized leapfrog (4.5)-(4.7), we concatenate a half step of the following Euler-B integrator of (4.9) [chap 4 of 61]:

θ(n+1/2) = θ(n) + (ε/2)v(n+1/2)
v(n+1/2) = v(n) − (ε/2)[(v(n+1/2))ᵀΓ(θ(n))v(n+1/2) + G(θ(n))⁻¹∇θφ(θ(n))]

with another half step of its adjoint Euler-A integrator:

θ(n+1) = θ(n+1/2) + (ε/2)v(n+1/2)
v(n+1) = v(n+1/2) − (ε/2)[(v(n+1/2))ᵀΓ(θ(n+1))v(n+1/2) + G(θ(n+1))⁻¹∇θφ(θ(n+1))]

to get the following semi-explicit time-reversible integrator:

v(n+1/2) = v(n) − (ε/2)[η(θ(n), v(n+1/2)) + G(θ(n))⁻¹∇θφ(θ(n))]   (4.10)
θ(n+1) = θ(n) + εv(n+1/2)   (4.11)
v(n+1) = v(n+1/2) − (ε/2)[η(θ(n+1), v(n+1/2)) + G(θ(n+1))⁻¹∇θφ(θ(n+1))]   (4.12)
Note that (4.11) resolves the implicitness of updating θ in the generalized leapfrog method, and thus reduces the associated computational cost; the equation (4.10) for updating v, however, remains implicit.
4.3.2.2 Detailed balance condition
Note that the integrator (4.10)-(4.12) is (i) time reversible and (ii) energy preserving up to a
global error of order O(ε), where ε is the step size. The resulting map, however, is no longer
volume preserving (see section 4.3.2.3). Nevertheless, based on proposition 4.3, we can still
have detailed balance after determinant adjustment [See also 58].
Proposition 4.3 (Detailed Balance Condition with determinant adjustment). Denote z = (θ, v) and z′ = T̂L(z) for some time-reversible integrator T̂L for the Lagrangian dynamics (4.9). If the acceptance probability is adjusted in the following way:

α̃(z, z′) = min{1, [exp(−H(z′))/exp(−H(z))] |det T̂L|}   (4.13)

then the detailed balance condition still holds:

α̃(z, z′)P(dz) = α̃(z′, z)P(dz′)   (4.14)
Proof.

α̃(z, z′)P(dz) = min{1, [exp(−H(z′))/exp(−H(z))] |dz′/dz|} exp(−H(z))dz
  = min{exp(−H(z))dz, exp(−H(z′))dz′}
  = min{1, [exp(−H(z))/exp(−H(z′))] |dz/dz′|} exp(−H(z′))dz′ = α̃(z′, z)P(dz′)

where z = T̂L⁻¹(z′) in the middle step.
Before discussing the calculation of the adjusted acceptance probability (4.13), we define the energy of the Lagrangian dynamics (4.9) as follows:

Definition 4.3 (Energy of Lagrangian Dynamics). Because p|θ ∼ N(0, G(θ)), the distribution of v|θ is N(0, G(θ)⁻¹). The energy function E(θ, v) is defined as the sum of the potential energy, U(θ) = − log π(θ), and the kinetic energy, K(θ, v) = − log P(v|θ):

E(θ, v) = − log π(θ) − ½ log det G(θ) + ½vᵀG(θ)v   (4.15)
Remark 4.5. This energy (4.15) differs from the Hamiltonian H(θ, G(θ)v) (4.3) in the sign of the middle term, due to the difference between the distributions of p|θ and v|θ. Note that the energy (4.15) is not preserved by the Lagrangian dynamics (4.9), in contrast to proposition 4.2. The energy is related to the Hamiltonian through the following change of variables formula, and it is more natural to work with the energy:

∫ f(θ, p) exp(−H(θ, p))|dθ ∧ dp| = ∫ f(θ, G(θ)v) exp(−H(θ, G(θ)v)) |∂(θ, p)/∂(θ, v)| |dθ ∧ dv|
  = ∫ f(θ, G(θ)v) exp(−E(θ, v))|dθ ∧ dv|

Note that the adjusted acceptance probability (4.13) should be calculated based on H(θ, G(θ)v).
However, the following proposition allows it to be calculated based on the energy function
E(θ, v) (4.15), which is more intuitive.
Proposition 4.4. The adjusted acceptance probability (4.13) can be calculated based on either
H(θ, G(θ)v) or E(θ, v).
Proof. Note that |∂(θ′, p′)/∂(θ, p)| = [det G(θ′)/det G(θ)] |∂(θ′, v′)/∂(θ, v)|; then

α̃ = min{1, [exp(−H(θ′, p′))/exp(−H(θ, p))] |∂(θ′, p′)/∂(θ, p)|}
  = min{1, [exp(−H(θ′, G(θ′)v′))/exp(−H(θ, G(θ)v))] [det G(θ′)/det G(θ)] |∂(θ′, v′)/∂(θ, v)|}
  = min{1, [exp{−(− log π(θ′) − ½ log det G(θ′) + ½(v′)ᵀG(θ′)v′)} / exp{−(− log π(θ) − ½ log det G(θ) + ½vᵀG(θ)v)}] |∂(θ′, v′)/∂(θ, v)|}
  = min{1, [exp(−E(θ′, v′))/exp(−E(θ, v))] |∂(θ′, v′)/∂(θ, v)|}

Therefore, after solving the Lagrangian dynamics (4.9) by the semi-explicit integrator (4.10)-(4.12) for L steps, we get a proposal (θ(L+1), v(L+1)) to be accepted with the following acceptance probability:

αsLMC = min{1, exp(−E(θ(L+1), v(L+1)) + E(θ(1), v(1)))|det JsLMC|}   (4.16)

where JsLMC is the Jacobian matrix of (θ(1), v(1)) → (θ(L+1), v(L+1)) according to (4.10)-(4.12), with the following determinant calculated in section 4.3.2.3.
Proposition 4.5 (Jacobian determinant of semi-explicit integrator).

det JsLMC := |∂(θ(L+1), v(L+1))/∂(θ(1), v(1))| = ∏_{n=1}^{L} det(I − εΩ(θ(n+1), v(n+1/2))) / det(I + εΩ(θ(n), v(n+1/2)))   (4.17)

Here, Ω(θ(n+1), v(n+1/2)) is a matrix whose (i, j)th element is Σ_k v_k^{(n+1/2)} Γ^i_{kj}(θ(n+1)).

4.3.2.3 Volume Correction
To adjust for the volume change in (θ(1), v(1)) → (θ(L+1), v(L+1)) according to (4.10)-(4.12), we need to derive the Jacobian determinant, det J := |∂(θ(L+1), v(L+1))/∂(θ(1), v(1))|, which can be calculated using wedge products [61].
Definition 4.4 (Differential Forms, Wedge Product). A differential one-form α : TM^D → R on a differentiable manifold M^D is a smooth mapping from the tangent space TM^D to R, which can be expressed as a linear combination of differentials of local coordinates: α = f_i dx^i =: f · dx.
For example, if f : R^D → R is a smooth function, then its directional derivative along a vector v ∈ R^D, denoted by df(v), is given by

df(v) = (∂f/∂z_i) v^i

so df(·) is a linear functional of v, called the differential of f at z, and is an example of a differential one-form. In particular, dz^i(v) = v^i, thus

df(v) = (∂f/∂z_i) dz^i(v),   i.e.,   df = (∂f/∂z_i) dz^i
The wedge product of two one-forms α, β is a 2-form α ∧ β, an anti-symmetric bilinear function on the tangent space with the following properties (α, β, γ one-forms, A a square matrix of the same dimension D):
• α ∧ α = 0
• α ∧ (β + γ) = α ∧ β + α ∧ γ (thus α ∧ β = −β ∧ α)
• α ∧ Aβ = Aᵀα ∧ β
The following proposition enables us to calculate the Jacobian determinant det J.
Proposition 4.6. Let TL : (θ(1), v(1)) → (θ(L+1), v(L+1)) be the evolution of a smooth flow; then

dθ(L+1) ∧ dv(L+1) = |∂(θ(L+1), v(L+1))/∂(θ(1), v(1))| dθ(1) ∧ dv(1)
Remark 4.6. The Jacobian determinant det J can also be regarded as a Radon-Nikodym derivative of two probability measures: det J = P(dθ(L+1), dv(L+1)) / P(dθ(1), dv(1)), where P(dθ, dv) = p(θ, v)dθdv.
Proof of proposition 4.5. According to the semi-explicit integrator (4.10)-(4.12),
dv(n+1/2) = dv(n) − ε(v(n+1/2) )T Γ(θ (n) )dv(n+1/2) + (∗∗)dθ (n)
dθ (n+1)
= dθ (n) + εdv(n+1/2)
dv(n+1)
= dv(n+1/2) − ε(v(n+1/2) )T Γ(θ (n+1) )dv(n+1/2) + (∗∗)dθ (n+1)
where vT Γ(θ) is a matrix whose (k, j)th element is v i Γkij (θ). Therefore,
dθ (n+1) ∧ dv(n+1) = [I − ε(v(n+1/2) )T Γ(θ (n+1) )]T dθ (n+1) ∧ dv(n+1/2)
= [I − ε(v(n+1/2) )T Γ(θ (n+1) )]T dθ (n) ∧ dv(n+1/2)
= [I − ε(v(n+1/2) )T Γ(θ (n+1) )]T [I + ε(v(n+1/2) )T Γ(θ (n) )]−T dθ (n) ∧ dv(n)
For volume adjustment, we must use the following Jacobian determinant accumulated along the integration steps:

det JsLMC := |∂(θ(L+1), v(L+1))/∂(θ(1), v(1))| = ∏_{n=1}^{L} det(I − ε(v(n+1/2))ᵀΓ(θ(n+1))) / det(I + ε(v(n+1/2))ᵀΓ(θ(n)))
Algorithm 4.2 Semi-explicit Lagrangian Monte Carlo (sLMC)
Initialize θ (1) = current θ
Sample new velocity v(1) ∼ N(0, G−1 (θ (1) ))
Calculate current E(θ (1) , v(1) ) according to equation (4.15)
for n = 1 to L (leapfrog steps) do
% Update the velocity with fixed point iterations
v̂(0) = v(n)
for i = 1 to NumOfFixedPointSteps do
v̂(i) = v(n) − (ε/2)G(θ(n))⁻¹[(v̂(i−1))ᵀΓ̃(θ(n))v̂(i−1) + ∇θφ(θ(n))]
end for
v(n+1/2) = v̂(last i)
% Update the position with a single explicit step
θ(n+1) = θ(n) + εv(n+1/2)
∆ log detn = log det(I − εΩ(θ(n+1), v(n+1/2))) − log det(I + εΩ(θ(n), v(n+1/2)))
% Update the velocity exactly
v(n+1) = v(n+1/2) − (ε/2)G(θ(n+1))⁻¹[(v(n+1/2))ᵀΓ̃(θ(n+1))v(n+1/2) + ∇θφ(θ(n+1))]
end for
Calculate proposed E(θ(L+1), v(L+1)) according to equation (4.15)
logRatio = −ProposedE + CurrentE + Σ_{n=1}^{L} ∆ log detn
Accept or reject the proposal (θ (L+1) , v(L+1) ) according to logRatio
Algorithm 4.2 provides the corresponding steps of the semi-explicit Lagrangian Monte
Carlo (sLMC) algorithm. It has a physical interpretation as exploring the parameter space
along the path on a Riemannian manifold that minimizes the action (total Lagrangian). In
contrast to RHMC augmenting parameter space with momentum, sLMC augments parameter space with velocity. In Section 4.5, we use several experiments to show that switching
from momentum to velocity can lead to improvements in computational efficiency in some
cases.
4.3.3 Stationarity
Now, with proposition 4.3, we can prove that the Markov chain derived by our reversible integrator with the adjusted acceptance probability (4.13) converges to the true target distribution. One can also find a similar proof in [chap 9 of 70].

Theorem 4.1. The Markov chain generated by algorithm 4.2 (sLMC) has the target distribution as its stationary distribution.
Proof. Appendix A.2.
4.4 Explicit Lagrangian Monte Carlo
In this section we modify the semi-explicit integrator (4.10)-(4.12) to obtain a fully explicit integrator and validate it as a numerical method for solving the Lagrangian dynamics (4.9). The derived explicit integrator further reduces the computational cost of the implicit update of v in (4.10). It is time reversible but not volume preserving, and thus needs a determinant adjustment in the acceptance probability for the adjusted detailed balance condition (proposition 4.3).
4.4.1 Fully explicit integrator
To resolve the remaining implicit equation (4.10), we propose an additional modification motivated by the following relationship (notice the symmetry of the lower indices in Γ):

vᵀΓu = ½[(v + u)ᵀΓ(v + u) − vᵀΓv − uᵀΓu]
To keep time-reversibility, we make the modification to both (4.10) and (4.12) as follows:

v(n+1/2) = v(n) − (ε/2)[(v(n+1/2))ᵀΓ(θ(n))v(n+1/2) + G(θ(n))⁻¹∇θφ(θ(n))]
  ⇓
v(n+1/2) = v(n) − (ε/2)[(v(n))ᵀΓ(θ(n))v(n+1/2) + G(θ(n))⁻¹∇θφ(θ(n))]   (4.18)

θ(n+1) = θ(n) + εv(n+1/2)   (4.19)

v(n+1) = v(n+1/2) − (ε/2)[(v(n+1/2))ᵀΓ(θ(n+1))v(n+1/2) + G(θ(n+1))⁻¹∇θφ(θ(n+1))]
  ⇓
v(n+1) = v(n+1/2) − (ε/2)[(v(n+1/2))ᵀΓ(θ(n+1))v(n+1) + G(θ(n+1))⁻¹∇θφ(θ(n+1))]   (4.20)
The time-reversibility of the integrator (4.18)-(4.20) can be shown by the fact that switching
(θ, v)(n+1) and (θ, v)(n) and negating time do not change the format. The resulting integrator
is completely explicit since both updates of velocity (4.18) and (4.20) can be solved by
collecting terms containing v(n+1/2) and v(n+1) respectively:
v(n+1/2) = [I + (ε/2)(v(n))ᵀΓ(θ(n))]⁻¹[v(n) − (ε/2)G(θ(n))⁻¹∇θφ(θ(n))]
v(n+1) = [I + (ε/2)(v(n+1/2))ᵀΓ(θ(n+1))]⁻¹[v(n+1/2) − (ε/2)G(θ(n+1))⁻¹∇θφ(θ(n+1))]
Therefore we achieve a fully explicit integrator for the Lagrangian dynamics (4.9):

v(n+1/2) = [I + (ε/2)Ω(θ(n), v(n))]⁻¹[v(n) − (ε/2)G(θ(n))⁻¹∇θφ(θ(n))]   (4.21)
θ(n+1) = θ(n) + εv(n+1/2)   (4.22)
v(n+1) = [I + (ε/2)Ω(θ(n+1), v(n+1/2))]⁻¹[v(n+1/2) − (ε/2)G(θ(n+1))⁻¹∇θφ(θ(n+1))]   (4.23)
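Each velocity update in (4.21) and (4.23) therefore costs a single linear solve. The R sketch below (a minimal sketch, not the thesis implementation) implements one explicit half step in the equivalent form [G + (ε/2)Ω̃]⁻¹[Gv − (ε/2)∇θφ] used in Algorithm 4.3 below, where Ω̃ := G(θ)Ω(θ, v) (see proposition 4.8); G, dG (with dG[i, j, l] = ∂g_ij/∂θ_l), and grad_phi are assumed precomputed at the current θ:

  omega_tilde <- function(dG, v) {
    D <- length(v)
    T1 <- matrix(0, D, D)                           # sum_i v^i d_i g_kj
    for (i in 1:D) T1 <- T1 + v[i] * dG[, , i]
    T2 <- sapply(1:D, function(j) dG[, , j] %*% v)  # sum_i v^i d_j g_ik
    (T1 + T2 - t(T2)) / 2                           # third term equals t(T2) by symmetry of g
  }

  explicit_half_step <- function(G, dG, grad_phi, v, eps) {
    Ot <- omega_tilde(dG, v)
    drop(solve(G + (eps / 2) * Ot, G %*% v - (eps / 2) * grad_phi))
  }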
The following proposition verifies that the derived integrator (4.21)-(4.23) is a valid numerical method for solving the Lagrangian dynamics (4.9), in the sense that the global error between the numerical solution and the theoretical solution diminishes as the discretization step size decreases to 0 [see 61 for a similar proof for the generalized leapfrog method].
Proposition 4.7 (Convergence of Numerical Solution). Suppose from the same initial point z(0) = z0 we evolve the Lagrangian dynamics (4.9) for some time T to get the theoretical solution z(T), and numerically solve (4.9) according to the integrator (4.21)-(4.23) with step size ε for T/ε steps to get a solution z(T/ε); then

‖z(T) − z(T/ε)‖ → 0,   as ε → 0
Proof. Appendix A.3.
This fully explicit integrator (4.21)-(4.23) is (i) time reversible and (ii) energy preserving
up to a global error of order O(ε). The resulting map is not volume preserving as the
Jacobian determinant of (θ (1) , v(1) ) → (θ (L+1) , v(L+1) ) by (4.21)-(4.23) is not 1.
Proposition 4.8 (Jacobian determinant of fully explicit integrator).

det JLMC := ∏_{n=1}^{L} [det(G(θ(n+1)) − (ε/2)Ω̃(θ(n+1), v(n+1))) det(G(θ(n)) − (ε/2)Ω̃(θ(n), v(n+1/2)))] / [det(G(θ(n+1)) + (ε/2)Ω̃(θ(n+1), v(n+1/2))) det(G(θ(n)) + (ε/2)Ω̃(θ(n), v(n)))]   (4.24)

Here, Ω̃(θ, v) denotes G(θ)Ω(θ, v), whose (k, j)th element is equal to Σ_i v^i Γ̃_{ijk}(θ), with Γ̃_{ijk}(θ) = g_{kl}Γ^l_{ij}(θ) = ½(∂i g_{kj} + ∂j g_{ik} − ∂k g_{ij}).
As a result, the acceptance probability must be adjusted as follows:

αLMC = min{1, exp(−E(θ(L+1), v(L+1)) + E(θ(1), v(1)))|det JLMC|}   (4.25)

4.4.2 Volume Correction
As in section 4.3.2.3, we use the wedge product on the system of equations (4.21)-(4.23) to calculate its Jacobian determinant.

Proof of proposition 4.8.
The Jacobian matrix of the integrator (4.21)-(4.23) for two consecutive steps is

∂(θ(n+1), v(n+1))/∂(θ(n), v(n)) = [I + (ε/2)(v(n+1/2))ᵀΓ(θ(n+1))]⁻ᵀ[I − (ε/2)(v(n+1))ᵀΓ(θ(n+1))]ᵀ · [I + (ε/2)(v(n))ᵀΓ(θ(n))]⁻ᵀ[I − (ε/2)(v(n+1/2))ᵀΓ(θ(n))]ᵀ
Accumulating all the determinants along the L integration steps:

det JLMC := |∂(θ(L+1), v(L+1))/∂(θ(1), v(1))|
  = ∏_{n=1}^{L} [det(I − (ε/2)(v(n+1))ᵀΓ(θ(n+1))) det(I − (ε/2)(v(n+1/2))ᵀΓ(θ(n)))] / [det(I + (ε/2)(v(n+1/2))ᵀΓ(θ(n+1))) det(I + (ε/2)(v(n))ᵀΓ(θ(n)))]
  = ∏_{n=1}^{L} [det(G(θ(n+1)) − (ε/2)(v(n+1))ᵀΓ̃(θ(n+1))) det(G(θ(n)) − (ε/2)(v(n+1/2))ᵀΓ̃(θ(n)))] / [det(G(θ(n+1)) + (ε/2)(v(n+1/2))ᵀΓ̃(θ(n+1))) det(G(θ(n)) + (ε/2)(v(n))ᵀΓ̃(θ(n)))]
Algorithm 4.3 Explicit Lagrangian Monte Carlo (LMC)
Initialize θ (1) = current θ
Sample new velocity v(1) ∼ N(0, G(θ (1) )−1 )
Calculate current E(θ (1) , v(1) ) according to equation (4.15)
∆ log det = 0
for n = 1 to L do
∆ log det = ∆ log det − log det(G(θ (n) ) + ε/2Ω̃(θ (n) , v(n) ))
% Update the velocity explicitly with a half step:
v(n+1/2) = [G(θ(n)) + (ε/2)Ω̃(θ(n), v(n))]⁻¹[G(θ(n))v(n) − (ε/2)∇θφ(θ(n))]
∆ log det = ∆ log det + log det(G(θ (n) ) − ε/2Ω̃(θ (n) , v(n+1/2) ))
% Update the position with a full step:
θ(n+1) = θ(n) + εv(n+1/2)
∆ log det = ∆ log det − log det(G(θ (n+1) ) + ε/2Ω̃(θ (n+1) , v(n+1/2) ))
% Update the velocity explicitly with a half step:
v(n+1) = [G(θ(n+1)) + (ε/2)Ω̃(θ(n+1), v(n+1/2))]⁻¹[G(θ(n+1))v(n+1/2) − (ε/2)∇θφ(θ(n+1))]
∆ log det = ∆ log det + log det(G(θ (n+1) ) − ε/2Ω̃(θ (n+1) , v(n+1) ))
end for
Calculate proposed E(θ (L+1) , v(L+1) ) according to equation (4.15)
logRatio = −ProposedE + CurrentE + ∆ log det
Accept or reject the proposal (θ (L+1) , v(L+1) ) according to logRatio
Algorithm 4.3 shows the corresponding steps for the fully explicit Lagrangian Monte Carlo
(LMC) algorithm. In both algorithms 4.2 and 4.3, the position update is relatively simple
while the computational time is dominated by choosing the “right” direction (velocity) using
the geometry of the parameter space. In sLMC, solving θ explicitly reduces the computational cost by (F − 1)O(D^{2.373}), where F is the number of fixed-point iterations and D is the number of parameters. This is because each fixed-point iteration takes O(D^{2.373}) elementary linear algebraic operations to invert G(θ). The connection terms Γ̃(θ) in Ω̃ do not add substantial computational cost since they are obtained from permuting three dimensions of the array ∂G(θ), which is also computed in RHMC. The additional price of the determinant adjustment is O(D^{2.373}).

LMC avoids the fixed-point iteration method in updating v. Therefore, it further reduces computation by (F − 1)O(D²). In addition, it resolves possible convergence issues associated with using the fixed-point iteration method (section 4.5.1.1). However, because it involves additional matrix inversions to update v, its benefits can occasionally be undermined. This is evident from our experimental results presented in section 4.5.3.
4.5 Experimental Results
In this section, we use both simulated and real data to evaluate our methods, sLMC and
LMC, compared to standard HMC and RHMC. Following [39], we use a time-normalized
effective sample size (ESS) [17] to compare these methods.
Definition 4.5 (Effective Sample Size). For S samples, the effective sample size is calculated as follows:

ESS = S[1 + 2Σ_{k=1}^{K} ρ(k)]⁻¹

where ρ(k) is the autocorrelation function at lag k, and K ≫ 1.
Remark 4.7. The effective sample size can be understood as the number of nearly independent samples. So the more effective samples a sampling algorithm can generate within a fixed amount of CPU time (time-normalized ESS), the more efficient it is considered to be.
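As an illustration, here is a small NumPy sketch of this estimator; the truncation rule (stopping the sum at the first non-positive autocorrelation) is our assumption, one common convention rather than necessarily the exact rule of [17].

import numpy as np

def ess(x):
    """Effective sample size of a 1-d chain x, per Definition 4.5."""
    x = np.asarray(x, dtype=float)
    S = len(x)
    xc = x - x.mean()
    # Autocovariance at all lags, normalized to an autocorrelation function.
    acov = np.correlate(xc, xc, mode="full")[S - 1:] / S
    rho = acov / acov[0]
    tail = 0.0
    for k in range(1, S):
        if rho[k] <= 0.0:      # truncate at the first non-positive value
            break
        tail += rho[k]
    return S / (1.0 + 2.0 * tail)

# Time-normalized efficiency: min over parameters divided by CPU seconds,
# e.g. min(ess(chain[:, d]) for d in range(D)) / cpu_time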
Minimum, median, and maximum values of ESS over all parameters are provided for comparing the different algorithms. More specifically, we use the minimum ESS normalized by CPU time (s), min(ESS)/s, as the measure of sampling efficiency. All computer programs and data sets discussed in this chapter are available online at http://www.ics.uci.edu/~babaks/Site/Codes.html.
4.5.1 Banana-shaped distributions
The banana-shaped distribution, which we used above for illustration, can be constructed as the posterior distribution of θ = (θ1, θ2)|y based on the following model:

y|θ ∼ N(θ1 + θ2², σ_y²)
θ ∼ N(0, σ_θ² I₂)

The data {y_i}_{i=1}^{100} are generated with θ1 + θ2² = 1, σ_y = 2, and σ_θ = 1.
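For reference, a minimal sketch of the resulting log posterior and its gradient (up to additive constants), under the model and parameter values above; the random seed and data-generation details are ours.

import numpy as np

sigma2_y, sigma2_theta = 2.0 ** 2, 1.0 ** 2   # sigma_y = 2, sigma_theta = 1
rng = np.random.default_rng(0)
y = rng.normal(1.0, np.sqrt(sigma2_y), size=100)   # theta1 + theta2^2 = 1

def log_post(theta):
    t1, t2 = theta
    mu = t1 + t2 ** 2
    return (-np.sum((y - mu) ** 2) / (2 * sigma2_y)
            - (t1 ** 2 + t2 ** 2) / (2 * sigma2_theta))

def grad_log_post(theta):
    t1, t2 = theta
    r = np.sum(y - (t1 + t2 ** 2))
    return np.array([r / sigma2_y - t1 / sigma2_theta,
                     2 * t2 * r / sigma2_y - t2 / sigma2_theta])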
As we can see in figure 4.2, similar to RHMC, sLMC and LMC explore the parameter space efficiently by adapting to its local geometry. The histograms of posterior samples shown in figure 4.3 confirm that our algorithms converge to the true posterior distributions of θ1 and θ2, whose density functions are shown as red solid curves.
Table 4.1 compares the performance of these algorithms based on 20000 MCMC iterations after 5000 burn-in iterations. For this specific example, sLMC has the best performance, followed by LMC. As discussed above, although LMC is fully explicit, its numerical benefits (obtained by removing implicit equations) can be negated in certain examples since it involves additional matrix inversion operations to update v.
[Figure 4.2 appears here: three panels, "Sampling Path of RHMC", "Sampling Path of sLMC" and "Sampling Path of LMC", each plotting θ2 against θ1.]
Figure 4.2: The first 10 iterations in sampling from the banana-shaped distribution with Riemannian HMC (RHMC), semi-explicit Lagrangian Monte Carlo (sLMC) and explicit LMC (LMC). For all three methods, the trajectory length (i.e., step size times number of integration steps) is set to 1.45 and the number of integration steps is set to 10. Solid red lines show the sampling path, and each point represents an accepted proposal.
[Figure 4.3 appears here: histograms of θ1 (top row) and θ2 (bottom row) under RHMC, sLMC and LMC.]
Figure 4.3: Histograms of 1 million posterior samples of θ1 and θ2 for the banana-shaped distribution using RHMC (left), sLMC (middle) and LMC (right). Solid red curves are the true density functions.
Method   AP     s/Iter     ESS (min, med, max)    min(ESS)/s
HMC      0.79   6.96e-04   (288, 614, 941)        20.65
RHMC     0.78   4.56e-03   (4514, 5779, 7044)     49.50
sLMC     0.84   7.90e-04   (2195, 3476, 4757)     138.98
LMC      0.73   7.27e-04   (1139, 2409, 3678)     78.32

Table 4.1: Comparing alternative methods using a banana-shaped distribution. For each method, the trajectory length is kept at 1.2 and the step size is tuned to make the acceptance rates comparable. We provide the acceptance probability (AP), the CPU time (s) for each iteration, ESS (min., med., max.) and the time-normalized ESS.
4.5.1.1 Thinner banana-shaped distribution
In this section we discuss the issue of the solutions given by fixed point iteration in RHMC. It turns out that sLMC and LMC not only reduce the computational cost of RHMC, but are also more numerically stable than RHMC by avoiding the fixed point iteration. Indeed, for fixed point iteration to find a solution to (4.6), the iterated function

f(·) = θ^(n) + (ε/2) [G^{-1}(θ^(n)) + G^{-1}(·)] p^(n+1/2)

has to satisfy a certain contraction condition, e.g., a Lipschitz condition with constant 0 ≤ L < 1. When this is not satisfied, [39] argue that fixed point iteration can still be used, not to obtain the exact solution, but to generate a proposal after several runs (5 or 6 in practice). However, with a limited number of runs, fixed point iteration can return extremely large solutions that indicate strongly divergent behavior. We observe this phenomenon in the following experiment, where G is very ill-conditioned (see more discussion on condition numbers in section 4.5.3).
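The following schematic NumPy sketch shows the fixed-point update in question; G_inv is a placeholder for the inverse-metric function, and the divergence guard is our own illustrative safeguard, not part of the original scheme.

import numpy as np

def implicit_theta_update(theta, p_half, eps, G_inv, runs=6, tol=1e-8):
    """Fixed-point iteration for the implicit position update (4.6):
    theta' = theta + eps/2 * [G^{-1}(theta) + G^{-1}(theta')] p_half.
    With a limited number of runs, the iterates can blow up when the
    map is not a contraction (ill-conditioned G)."""
    base = 0.5 * eps * G_inv(theta) @ p_half
    theta_new = theta
    for _ in range(runs):
        theta_prev = theta_new
        theta_new = theta + base + 0.5 * eps * G_inv(theta_new) @ p_half
        if not np.all(np.isfinite(theta_new)):
            raise FloatingPointError("fixed-point iteration diverged")
        if np.linalg.norm(theta_new - theta_prev) < tol:
            break
    return theta_new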
If we increase the number of records y to 10000, the posterior distribution of θ|y becomes more concentrated around θ1 + θ2² = 1, yielding a 'thinner banana' that is challenging for both HMC and RHMC: HMC bounces around more in the thinner banana, resulting in slow exploration, while RHMC updates θ by fixed point iteration, which frequently gives divergent solutions due to the ill-conditioned nature of the metric G(θ) (with condition number as large as 10⁴).
Figure 4.4 shows that RHMC frequently gives solutions divergent to infinity, as its sampling path (red lines) goes beyond the range of the figure, rendering 7 of 10 proposals rejected, while sLMC and LMC still explore the distribution well and accept most of the proposals (10 and 8, respectively). Table 4.2 compares these algorithms in simulating the thinner banana-shaped distribution based on 20000 MCMC iterations after 5000 burn-in iterations.
[Figure 4.4 appears here: three panels, "Sampling Path of RHMC", "Sampling Path of sLMC" and "Sampling Path of LMC", each plotting θ2 against θ1.]
Figure 4.4: The first 10 iterations in sampling from the thinner banana-shaped distribution with RHMC, sLMC and LMC. For all three methods, the trajectory length (i.e., step size times number of integration steps) is set to 0.95 and the number of integration steps is set to 10. Solid red lines show the sampling path; dots represent accepted proposals while crosses represent rejected ones.
Method   AP     s/Iter     ESS (min, med, max)    min(ESS)/s
HMC      0.82   3.22e-03   (545, 567, 590)        8.46
RHMC     0.70   1.37e-02   (506, 995, 1484)       1.84
sLMC     0.84   1.01e-03   (1022, 1806, 2589)     50.57
LMC      0.80   1.63e-03   (545, 1197, 1848)      16.77

Table 4.2: Comparing alternative methods using a 'thinner' banana-shaped distribution. For each method, the trajectory length is kept at 1 and the step size is tuned to make the acceptance rates comparable. We provide the acceptance probability (AP), the CPU time (s) for each iteration, ESS (min., med., max.) and the time-normalized ESS.
Note that RHMC has to reduce the step size significantly to mitigate the issue of divergent solutions, and it surprisingly performs even worse than HMC.
4.5.2 Logistic Regression Models
Next, we evaluate our methods based on five binary classification problems used in [39]: the Australian Credit data, German Credit data, Heart data, Pima Indian data, and Ripley data. These data sets are publicly available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). For each problem, we use a logistic regression model,
p(y_i = 1 | x_i, β) = exp(x_i^T β) / (1 + exp(x_i^T β)),   i = 1, ..., n
β ∼ N(0, 100 I)
where yi is a binary outcome for the ith observation, xi is the corresponding vector of
predictors (with the first element equal to 1), and β is the set of regression parameters.
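A minimal sketch of the log posterior and its gradient for this model (up to constants), which is what all four samplers require; the design matrix X is assumed to carry a leading column of ones.

import numpy as np

def log_post(beta, X, y, prior_var=100.0):
    """Log posterior for Bayesian logistic regression with
    beta ~ N(0, prior_var * I); y in {0, 1}."""
    eta = X @ beta
    # log p(y | X, beta) = sum_i [ y_i eta_i - log(1 + exp(eta_i)) ]
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    return loglik - beta @ beta / (2.0 * prior_var)

def grad_log_post(beta, X, y, prior_var=100.0):
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - prob) - beta / prior_var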
We use standard HMC, RHMC, sLMC, and LMC to simulate 20000 posterior samples of β. We fix the trajectory length for the different algorithms and tune the step sizes so that they have comparable acceptance rates. Results (after discarding the initial 5000 iterations) are summarized in Table 4.3 and show that, in general, our methods improve the sampling efficiency, measured in terms of minimum ESS per second, compared to RHMC on these examples.
Data              Method   AP     s/Iter     ESS (min, med, max)      min(ESS)/s
Australian        HMC      0.75   6.13E-03   (1225, 3253, 10691)      13.32
(D=14, N=690)     RHMC     0.72   2.96E-02   (7825, 9238, 9797)       17.62
                  sLMC     0.83   2.17E-02   (10184, 13001, 13735)    31.29
                  LMC      0.75   1.60E-02   (9636, 10443, 11268)     40.17
German            HMC      0.74   1.31E-02   (766, 4006, 15000)       3.90
(D=24, N=1000)    RHMC     0.76   6.55E-02   (14886, 15000, 15000)    15.15
                  sLMC     0.71   4.13E-02   (13395, 15000, 15000)    21.64
                  LMC      0.70   3.74E-02   (13762, 15000, 15000)    24.54
Heart             HMC      0.71   1.75E-03   (378, 850, 2624)         14.44
(D=13, N=270)     RHMC     0.73   2.12E-02   (6263, 7430, 8191)       19.68
                  sLMC     0.77   1.30E-02   (10318, 11337, 12409)    52.73
                  LMC      0.76   1.15E-02   (10347, 10724, 11773)    59.80
Pima              HMC      0.85   5.75E-03   (887, 4566, 12408)       10.28
(D=7, N=532)      RHMC     0.81   1.64E-02   (4349, 4693, 5178)       17.65
                  sLMC     0.81   8.98E-03   (4784, 5437, 5592)       35.50
                  LMC      0.82   7.90E-03   (4839, 5193, 5539)       40.84
Ripley            HMC      0.88   1.50E-03   (820, 3077, 15000)       36.39
(D=2, N=250)      RHMC     0.74   1.09E-02   (12876, 15000, 15000)    78.83
                  sLMC     0.80   6.79E-03   (15000, 15000, 15000)    147.38
                  LMC      0.79   5.36E-03   (12611, 15000, 15000)    157.02

Table 4.3: Comparing alternative methods using five binary classification problems discussed in [39]. For each dataset, the number of predictors, D, and the number of observations, N, are specified. For each method, we provide the acceptance probability (AP), the CPU time (s) for each iteration, ESS (min., med., max.) and the time-normalized ESS.
4.5.3 Multivariate t-distributions
The computational complexity of standard HMC is O(D). This is substantially lower than O(D^{2.373}), the computational complexity of the three geometrically motivated methods discussed here (RHMC, sLMC, and LMC). On the other hand, these three methods can have substantially better mixing rates than standard HMC, whose mixing time is mainly determined by the condition number of the target distribution, defined as the ratio of the maximum and minimum eigenvalues of its covariance matrix: λ_max/λ_min.
[Figure 4.5 appears here: two panels of Min(ESS)/s for HMC, RHMC, sLMC and LMC; the left panel varies the condition number λ_max/λ_min with D = 20 fixed, the right panel varies the dimension with λ_max/λ_min = 10000 fixed.]
Figure 4.5: Left: Sampling efficiency, Min(ESS)/s, vs. the condition number for a fixed dimension (D = 20). Right: Sampling efficiency vs. dimension for a fixed condition number (λ_max/λ_min = 10000). Each algorithm is tuned to have an acceptance rate around 70%. Results are based on 5000 samples after discarding the initial 1000 samples.
In this section, we illustrate how the efficiency of these sampling algorithms changes as the condition number varies, using multivariate t-distributions with the following density function:

π(x) = [Γ((ν + D)/2) / Γ(ν/2)] (πν)^{-D/2} |Σ|^{-1/2} [1 + (1/ν) x^T Σ^{-1} x]^{-(ν+D)/2}   (4.26)
where ν is the degrees of freedom and D is the dimension. In our first simulation, we fix the dimension at D = 20 and vary the condition number of Σ from 10 to 10⁵. As the condition number increases, one can expect HMC to be increasingly restricted by the smallest eigen-direction, whereas RHMC, sLMC, and LMC adapt to the local geometry. Results presented in figure 4.5 (left panel) show that this is in fact the case: for higher condition numbers, the geometrically motivated methods perform substantially better than standard HMC. Note that our two proposed algorithms, sLMC and LMC, provide substantial improvements over RHMC.
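One simple way to set up this experiment — an illustrative sketch under our own assumptions, not necessarily the exact construction used here — is to build Σ from a random rotation and eigenvalues spaced between 1 and the desired condition number:

import numpy as np

def covariance_with_condition(D, cond, rng):
    """Random D x D covariance with lambda_max / lambda_min = cond."""
    # Random orthogonal matrix via QR of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
    eigs = np.geomspace(1.0, cond, D)    # eigenvalues from 1 to cond
    return Q @ np.diag(eigs) @ Q.T

rng = np.random.default_rng(1)
Sigma = covariance_with_condition(20, 1e4, rng)
print(np.linalg.cond(Sigma))   # approximately 1e4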
For our second simulation, we fix the condition number at 10000 and let the dimension change from 10 to 50. Our results (Figure 4.5, right panel) show that the gain from exploiting geometric properties of the target distribution can eventually be undermined as the dimension increases.
4.5.4 Finite Mixture of Gaussians
Finally we consider finite mixtures of univariate Gaussian components of the form

p(x|θ) = ∑_{k=1}^{K} π_k N(x | µ_k, σ_k²)   (4.27)

where θ is the vector of size D = 3K − 1 of all the parameters π_k, µ_k and σ_k², and N(· | µ, σ²) is a Gaussian density with mean µ and variance σ². A common choice of prior takes the form
p(θ) = D(π_1, ..., π_K | λ) ∏_{k=1}^{K} N(µ_k | m, β^{-1} σ_k²) IG(σ_k² | b, c)   (4.28)

where D(· | λ) is the symmetric Dirichlet distribution with parameter λ, and IG(· | b, c) is the inverse Gamma distribution with shape parameter b and scale parameter c.
Although the posterior distribution associated with this model is formally explicit, it is computationally intractable, since it can be expressed as a sum of K^N terms corresponding to all possible allocations of the observations {x_i}_{i=1}^{N} to the mixture components [chap. 9 of 71]. We use this model to test the efficiency of the four methods in sampling θ from the posterior. A more extensive comparison of Riemannian Manifold MCMC and HMC, Gibbs sampling, and standard Metropolis-Hastings for finite Gaussian mixture models can be found in [47]. Due to the non-analytic nature of the expected Fisher information, I(θ), we use the empirical Fisher information as the metric tensor [chap. 2 of 72].
Definition 4.6 (Empirical Fisher information).

G(θ) = S^T S − (1/N) s s^T

where the N × D score matrix S has elements S_{i,d} = ∂ log p(x_i|θ)/∂θ_d and s^T = ∑_{i=1}^{N} S_{i,·}.
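In code, Definition 4.6 amounts to a single matrix product over the score matrix; a minimal sketch, where score(theta, x) is a placeholder returning the N × D matrix S:

import numpy as np

def empirical_fisher(theta, x, score):
    """Empirical Fisher information G(theta) = S^T S - s s^T / N,
    where S[i, d] = d log p(x_i | theta) / d theta_d and s is the
    vector of column sums of S (Definition 4.6)."""
    S = score(theta, x)        # shape (N, D)
    N = S.shape[0]
    s = S.sum(axis=0)          # shape (D,)
    return S.T @ S - np.outer(s, s) / N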
Dataset name   Density function                                                                Num. of parameters
Kurtotic       (2/3) N(x|0, 1) + (1/3) N(x|0, (1/10)²)                                         5
Bimodal        (1/2) N(x|−1, (2/3)²) + (1/2) N(x|1, (2/3)²)                                    5
Skewed         (3/4) N(x|0, 1) + (1/4) N(x|3/2, (1/3)²)                                        5
Trimodal       (9/20) N(x|−6/5, (3/5)²) + (9/20) N(x|6/5, (3/5)²) + (1/10) N(x|0, (1/4)²)      8
Claw           (1/2) N(x|0, 1) + ∑_{i=0}^{4} (1/10) N(x|i/2 − 1, (1/10)²)                      17

Table 4.4: Densities used for the generation of the synthetic Mixture of Gaussians data sets.
[Figure 4.6 appears here: five density plots on [−3, 3], one per mixture in Table 4.4.]
Figure 4.6: Densities used to generate the synthetic datasets. From left to right the densities are in the same order as in Table 4.4. The densities are taken from [72].
We show the five Gaussian mixtures in table 4.4 and figure 4.6, and compare the sampling efficiency of HMC, RHMC, sLMC and LMC using simulated datasets in table 4.5. As before, our two algorithms outperform RHMC.
4.6 Discussion
Following the method of [39] for more efficient exploration of the parameter space, we have proposed new sampling schemes to reduce the computational cost associated with using a position-specific mass matrix. To this end, we have developed a semi-explicit (sLMC) integrator and a fully explicit (LMC) integrator for RHMC and demonstrated their advantage in computational efficiency over the generalized leapfrog (RHMC) method used by [39]. It is easy to show that if G(θ) ≡ M, our method reduces to standard HMC. Compared to HMC, whose local and global errors are O(ε³) and O(ε²) respectively, LMC's local error is O(ε²) and its global error is O(ε) (proposition 4.7). Although the numerical solutions of LMC converge to the true solutions of the corresponding dynamics at a slower rate than those of HMC, in general the approximation remains adequate, leading to reasonably high acceptance rates while providing a more computationally efficient sampling mechanism. Compared to RHMC, our LMC method has the additional advantage of being more stable by avoiding implicit updates relying on the fixed point iteration method: RHMC can occasionally give highly divergent solutions, especially for ill-conditioned metrics G(θ).
Data        Method   AP     s          ESS (min, med, max)    min(ESS)/s
claw        HMC      0.88   7.01E-01   (1916, 3761, 4970)     0.54
            RHMC     0.80   5.08E-01   (1524, 3474, 4586)     0.60
            sLMC     0.86   3.76E-01   (2531, 4332, 5000)     1.35
            LMC      0.82   2.92E-01   (2436, 3455, 4608)     1.67
trimodal    HMC      0.77   3.43E-01   (2244, 2945, 3159)     1.30
            RHMC     0.79   9.94E-02   (4701, 4928, 5000)     9.46
            sLMC     0.82   4.02E-02   (4978, 5000, 5000)     24.77
            LMC      0.80   4.84E-02   (4899, 4982, 5000)     20.21
skewed      HMC      0.83   1.78E-01   (2915, 3237, 3630)     3.27
            RHMC     0.85   5.10E-02   (5000, 5000, 5000)     19.63
            sLMC     0.82   2.26E-02   (4698, 4940, 5000)     41.68
            LMC      0.84   2.52E-02   (4935, 5000, 5000)     39.09
kurtotic    HMC      0.78   2.85E-01   (3013, 3331, 3617)     2.11
            RHMC     0.82   4.72E-02   (5000, 5000, 5000)     21.20
            sLMC     0.85   2.54E-02   (5000, 5000, 5000)     39.34
            LMC      0.81   2.70E-02   (5000, 5000, 5000)     36.90
bimodal     HMC      0.73   1.61E-01   (2923, 2991, 3091)     3.62
            RHMC     0.86   5.38E-02   (5000, 5000, 5000)     18.56
            sLMC     0.81   2.06E-02   (4935, 4996, 5000)     48.00
            LMC      0.85   2.06E-02   (5000, 5000, 5000)     46.43

Table 4.5: Acceptance probability (AP), seconds per iteration (s), ESS (min., med., max.) and time-normalized ESS for Gaussian mixture models. Results are calculated on a 5,000 sample chain with a 5,000 sample burn-in session. For HMC the burn-in session was 20,000 samples in order to ensure convergence.
Future directions could involve splitting the Hamiltonian [37, 41, 73, 74] to develop explicit geometric integrators. For example, one could split a non-separable Hamiltonian dynamics into several smaller dynamics, some of which can be solved analytically. Specifically, the Lagrangian dynamics (4.9) could be split into the following two smaller dynamics:

{ θ̇ = v,  v̇ = −(1/2) G(θ)^{-1} ∇_θ φ(θ) }    { θ̇ = 0,  v̇ = −v^T Γ(θ) v }   (4.29)

the first one separable and the second one solvable elementwise. A similar idea has been explored by [75], where the Hamiltonian, instead of the dynamics, is split. Recently, [51] proposed an alternative splitting essentially similar to (4.29):

{ θ̇ = 0,  v̇ = −(1/2) G(θ)^{-1} ∇_θ φ(θ) }    { θ̇ = v,  v̇ = −v^T Γ(θ) v }   (4.30)

the first one only updating v, and the second one having an analytical solution as a geodesic when available. See more discussion in chapter 6.
Because our methods involve costly matrix inversions, another possible research direction
could be to approximate the mass matrix (and the Christoffel symbols as well) to reduce
computational cost. For many high-dimensional problems, the mass matrix could be appropriately approximated by a highly sparse or structured (e.g., tridiagonal) matrix. This could
further improve our method’s computational efficiency.
5 Wormhole Hamiltonian Monte Carlo
5.1 Introduction
It is well known that standard Markov Chain Monte Carlo (MCMC) methods (e.g., Metropolis algorithms) tend to fail when the target distribution is multimodal [3, 52, 76, 77, 78, 79, 80]. These methods typically fail to move from one mode to another since such moves require passing through low probability regions. This is especially true for high dimensional problems with isolated modes. Therefore, despite recent advances in computational Bayesian methods, designing effective MCMC samplers for multimodal distributions has remained a major challenge. In the statistics and machine learning literature, many methods have been proposed to address this issue [see 52, 77, 78, 79, 81, 82, 83, 84, 85, for example]. However, these methods tend to suffer from the curse of dimensionality [83, 85].
In this chapter, we propose a new algorithm, which exploits and modifies the Riemannian
geometric properties of the target distribution to create wormholes connecting modes in
order to facilitate moving between them. Our method can be regarded as an extension of
Hamiltonian Monte Carlo (HMC, chapter 2). Compared to random walk Metropolis (RWM),
standard HMC explores the target distribution more efficiently by exploiting its geometric
properties. However, it also tends to fail when the target distribution is multimodal since
the modes are separated by high energy barriers (low probability regions) [79].
Before presenting our proposed method, we provide an explanation of energy barriers that
prevent standard HMC from moving between modes in the next section. We then introduce
our method in three steps assuming the locations of the modes are known (either exactly
or approximately), possibly through some optimization techniques [e.g. 55, 86]. Later, we
relax this assumption by incorporating a mode searching algorithm in our method in order
to identify new modes and to update the network of wormholes. To this end, we use the
regeneration method [87, 88, 89]. Throughout this chapter, we evaluate our method’s performance by comparing it to a state-of-the-art algorithm called Regenerative Darting Monte
Carlo (RDMC) [85], which is designed for sampling from multimodal distributions. RDMC
itself is an improved version of the Darting Monte Carlo (DMC) algorithm [79, 90]. We show
that our proposed approach performs substantially better than RDMC, especially for high
dimensional problems.
5.2 Energy Barrier in HMC
HMC [36, 37] is a Metropolis algorithm with proposals made by numerically simulating Hamiltonian dynamics on an augmented state space (position θ and ancillary momentum p). Being guided by Hamiltonian dynamics, HMC improves upon RWM by proposing states that are distant from the current state but nevertheless accepted with high probability (chapter 2 provides details of the HMC algorithm). However, HMC does not fully exploit the geometric structure of the target distribution, and thus may not explore complicated distributions efficiently. [39] define HMC on a Riemannian manifold (RHMC, see chapter 4 for more details) by replacing the fixed mass matrix M with the position-dependent Fisher metric G(θ) to adapt to the local geometry of the parameter space. In the remainder of the chapter, we use the notation G0 to refer generally to a Riemannian metric, which is not necessarily the Fisher information.
Even though the ability to explore the target distribution improves as more geometric information (gradient, metric) is utilized, these energy (Hamiltonian) based algorithms alone cannot explore multimodal distributions well, due to the energy barrier phenomenon: the sampler gets trapped in some of the modes, unable to move to other modes because they are isolated by low probability regions.
Recall that in HMC, the potential energy is defined as the negative log of the target density, so each local maximum (mode) of the density corresponds to a local minimum of the potential energy (a well), and low density regions correspond to energy barriers. The total energy (Hamiltonian) is (approximately) preserved in the Hamiltonian dynamical system (section 2.1.1), but it may not be enough to let the sampler escape from one energy well to another. Figure 5.1, showing a frictionless puck sliding on a surface with two local minima, illustrates this phenomenon: once an initial velocity (or momentum) with value v0 is sampled, the whole system (θ, v) evolves with some fixed energy H = U(θ0) + K(v0) until it reaches the highest point, where the kinetic energy has completely converted to potential energy and thus v = 0; if it stays within the same energy well it started from, it will then start sliding backwards to the bottom, lacking the momentum to pass over the barrier into the other energy well.
[Figure 5.1 appears here: a potential energy curve U(θ) with two wells; the puck starts with velocity v = v0 and stops (v = 0) before the energy barrier.]
Figure 5.1: A frictionless puck that starts from the left energy well (corresponding to the left mode of the density) cannot pass over the energy barrier into the right energy well (corresponding to the right mode of the density).
Note that it is not as simple as increasing the initial velocity to endow the sampler with more energy to overcome the barrier. In practice, the Hamiltonian dynamics (2.2) is solved numerically, so a larger velocity means a larger leap at each discretized step, which causes larger error and, in turn, a higher chance of rejection.
In the following section, we introduce a natural modification of the base metric G0 such
that the associated Hamiltonian dynamical system has a much greater chance of moving
between isolated modes.
5.3 Wormhole HMC Algorithm
We need a concept called distance on a manifold to develop our method.
Definition 5.1 (Distance on a manifold). Let (M, G(θ)) be a Riemannian manifold. Given a differentiable curve θ(t)¹ : [0, T] → M, one can define its arclength as follows:

ℓ(θ) := ∫_0^T √( θ̇(t)^T G(θ(t)) θ̇(t) ) dt   (5.1)

Given any two points θ_1, θ_2 ∈ M, there exists (nearly always, in statistical models) a curve θ(t) : [0, T] → M satisfying the boundary conditions θ(0) = θ_1, θ(T) = θ_2 whose arclength is minimal among the curves connecting θ_1 and θ_2. The length of such a minimal curve defines a distance function on M.
Remark 5.1. The minimal curve satisfies the following geodesic equation:

θ̈ + θ̇^T Γ(θ) θ̇ = 0   (5.2)

Thus the minimal curve is also called a minimizing geodesic. The solution to (5.2) is provably equivalent to a Hamiltonian flow with only kinetic energy (see section 4.3.1). In Euclidean space, where G(θ) ≡ I, the shortest curve connecting θ_1 and θ_2 is simply a straight line with the Euclidean length ‖θ_1 − θ_2‖₂.
In the following, we use the Hamiltonian flow (2.2), the Riemannian Hamiltonian flow (4.4), or the Lagrangian flow (4.9) to define the distance on the manifold, whichever is appropriate in context.
5.3.1 Tunnel Metric
To overcome the energy barrier, we propose to replace the base metric G0 with a new metric under which the distance between modes is shortened. This way, we facilitate moving between modes by creating high-speed "tunnels" that connect the modes beneath the energy barrier.
Let θ̂_1 and θ̂_2 be two modes of the target distribution. We define a straight line segment, v_T := θ̂_2 − θ̂_1, and refer to a small neighborhood (tube) of the line segment as a tunnel. Next, we define a tunnel metric, G_T(θ), in the vicinity of the tunnel. The metric G_T(θ) is an inner product assigning a non-negative real number to a pair of tangent vectors u, w: G_T(θ)(u, w) ∈ R+. To shorten the distance in the direction of v_T, we project both u and w onto the plane normal to v_T and then take the Euclidean inner product of the projected vectors.
¹ Here we identify the curve defined on the manifold, which should properly be written φ(θ(t)), with its coordinate θ(t). Accordingly, the curve length is

∫_0^T √( ⟨dφ(θ(t))/dt, dφ(θ(t))/dt⟩ ) dt = ∫_0^T √( θ̇^T ⟨∂/∂θ, ∂/∂θ⟩ θ̇ ) dt = ∫_0^T √( θ̇(t)^T G(θ(t)) θ̇(t) ) dt
Definition 5.2 (Tunnel Metric). Set v*_T = v_T/‖v_T‖. First, define a pseudo tunnel metric G*_T as follows:

G*_T(u, w) := ⟨u − ⟨u, v*_T⟩ v*_T, w − ⟨w, v*_T⟩ v*_T⟩ = u^T [I − v*_T (v*_T)^T] w

Note that G*_T := I − v*_T (v*_T)^T is positive semidefinite (degenerate, since G*_T v*_T = 0 for v*_T ≠ 0). We then modify it to be positive definite, and define the tunnel metric G_T as follows:

G_T = G*_T + ε v*_T (v*_T)^T = I − (1 − ε) v*_T (v*_T)^T   (5.3)

where 0 < ε ≪ 1 is a small positive number.

Remark 5.2. The smallest eigenvalue of G_T is ε, with eigendirection v*_T; all the others are 1, with eigendirections normal to v*_T. The metric has a clear interpretation as cutting off the projection of any vector v onto the tunnel direction v*_T, in the following sense:

v = [(1 − ε) v*_T (v*_T)^T] v + [I − (1 − ε) v*_T (v*_T)^T] v = (1 − ε) ⟨v, v*_T⟩ v*_T + G_T v

with most of the projection onto v*_T removed after multiplying by G_T.
To see that the tunnel metric G_T in fact shortens the distance between θ̂_1 and θ̂_2, consider a simple case where θ(t) follows a straight line: θ(t) = θ̂_1 + v_T t, t ∈ [0, 1]. In this case, the distance under G_T is

dist(θ̂_1, θ̂_2) = ∫_0^1 √( v_T^T G_T v_T ) dt = √ε ‖v_T‖ ≪ ‖v_T‖

which is much smaller than the Euclidean distance.
Next, we define the overall metric, G, for the whole parameter space of θ as a weighted sum of the base metric G0 and the tunnel metric G_T:

G(θ) = (1 − m(θ)) G0(θ) + m(θ) G_T   (5.4)

where m(θ) ∈ (0, 1) is a mollifying function designed to make the tunnel metric G_T influential only in the vicinity of the tunnel, chosen as follows:

m(θ) := exp{ −(‖θ − θ̂_1‖ + ‖θ − θ̂_2‖ − ‖θ̂_1 − θ̂_2‖)/F }   (5.5)

where the influence factor F > 0 is a free parameter that can be tuned to modify the extent of the influence of G_T: decreasing F makes the influence of G_T more restricted around the tunnel. The resulting metric leaves the base metric almost intact outside of the tunnel, while making the transition of the metric from outside to inside smooth.
Within the tunnel, the trajectories are mainly guided along the tunnel direction v*_T: G(θ) ≈ G_T, so G(θ)^{-1} ≈ G_T^{-1} has the dominant eigenvector v*_T (with eigenvalue 1/ε ≫ 1), and therefore v ∼ N(0, G(θ)^{-1}) tends to be directed along v*_T.
The tunnel metric G_T is constant and is calculated before we start the Markov chain. At each step, one only needs to recalculate the mollifier, adding an almost negligible cost compared to updating G(θ) in RHMC [39].
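Putting (5.3)-(5.5) together, the following is a minimal NumPy sketch of the tunnel and overall metrics; the default values ε = 0.03 and F = 0.3 match the illustrative example below, and G0 is a placeholder for the base metric function.

import numpy as np

def tunnel_metric(mode1, mode2, eps_t=0.03):
    """Constant tunnel metric G_T = I - (1 - eps_t) v* v*^T  (5.3)."""
    vT = mode2 - mode1
    v_star = vT / np.linalg.norm(vT)
    return np.eye(len(vT)) - (1.0 - eps_t) * np.outer(v_star, v_star)

def mollifier(theta, mode1, mode2, F=0.3):
    """m(theta) in (5.5): close to 1 inside the tunnel, decays outside."""
    excess = (np.linalg.norm(theta - mode1) + np.linalg.norm(theta - mode2)
              - np.linalg.norm(mode1 - mode2))
    return np.exp(-excess / F)

def overall_metric(theta, G0, GT, mode1, mode2, F=0.3):
    """G(theta) = (1 - m(theta)) G0(theta) + m(theta) G_T  (5.4)."""
    m = mollifier(theta, mode1, mode2, F)
    return (1.0 - m) * G0(theta) + m * GT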
We substitute the mixed overall metric (5.4) for the Fisher metric in RHMC [39] or LMC (see chapter 4), and call the resulting algorithm Tunnel Hamiltonian Monte Carlo (THMC). Figure 5.2 compares THMC with standard HMC on the following illustrative example discussed in [91]:

θ_d ∼ N(0, σ_d²),   d = 1, 2
x_i ∼ (1/2) N(θ_1, σ_x²) + (1/2) N(θ_1 + θ_2, σ_x²)

Here, we set θ_1 = 0, θ_2 = 1, σ_1² = 10, σ_2² = 1, σ_x² = 2, and generate 1000 data points from the above model. In figure 5.2, the dots show the posterior samples of (θ_1, θ_2) given the simulated data. As we can see, the two modes are far from each other, and moving from one mode to the other requires passing through a low density region. While HMC is trapped in one mode, THMC moves easily between the two modes. For this example, we set G0 = I to make THMC comparable to standard HMC. Further, we use 0.03 and 0.3 for ε and F, respectively.
[Figure 5.2 appears here: two panels of posterior samples in the (θ1, θ2) plane, one for HMC and one for THMC.]
Figure 5.2: Comparing HMC and THMC in terms of sampling from a 2d posterior distribution of a mixture of 2 Gaussians with tied means.
For more than two modes, we can construct a network of tunnels by creating a tunnel
between any two modes. Alternatively, we can create a tunnel between neighboring modes
only. We can define the neighborhood using, for example, a minimal spanning tree [92].
5.3.2 Wind Tunnel
The above method can occasionally fail when the target distribution is highly concentrated around its modes. This often happens in high-dimensional problems. In such cases, the effect of the tunnel metric diminishes quickly as the sampler leaves one mode towards another. To address this issue, we propose to add an external vector field f to the Lagrangian dynamics (equation (4.9) in section 4.3.1) to enforce movement between modes, as shown below:

θ̇ = v + f(θ, v)
v̇ = −η(θ, v) − G(θ)^{-1} ∇_θ φ(θ)   (5.6)
We define the wind vector f(θ, v) in terms of the position θ and the velocity v.

Definition 5.3 (Wind Vector). A wind vector f(θ, v) is defined as follows:

f(θ, v) := exp{ −V(θ)/(DF) } ⟨v, v*_T⟩ v*_T = m(θ) ⟨v, v*_T⟩ v*_T

with mollifier m(θ) := exp{ −V(θ)/(DF) }, where D is the dimension, F > 0 is the influence factor, and V(θ) is a vicinity function indicating the Euclidean distance from the line segment v_T:

V(θ) := ⟨θ − θ̂_1, θ − θ̂_2⟩ + |⟨θ − θ̂_1, v*_T⟩| |⟨θ − θ̂_2, v*_T⟩|   (5.7)
[Figure 5.3 appears here: left panel, "Contour of Tunnel"; right panel, "A tunnel in MOG".]
Figure 5.3: Left: contour of the vicinity function in equation (5.7) defining the wind tunnel. Right: a tunnel shown in a 2d mixture of 5 Gaussians.
The contour of this vicinity function V(θ) indeed looks like a tunnel, as shown in figure 5.3. Note that the resulting wind vector field has three desirable properties: i) it is confined to a neighborhood of each tunnel; ii) it enforces movement along the tunnel; iii) its influence diminishes at the end of the tunnel, when the sampler reaches the other mode.
Now, to use the wind Lagrangian dynamics (5.6) to make proposals, we need a proper integrator that satisfies the detailed balance condition (2.3). We construct a time reversible integrator for the system (5.6) by concatenating its Euler-B integrator with its Euler-A integrator [61] (see also section 4.3.2.1):

v^(n+1/2) = [I + (ε/2) Ω(θ^(n), v^(n))]^{-1} [v^(n) − (ε/2) G(θ^(n))^{-1} ∇_θ φ(θ^(n))]   (5.8)
θ^(n+1) = θ^(n) + ε [v^(n+1/2) + (f(θ^(n), v^(n+1/2)) + f(θ^(n+1), v^(n+1/2)))/2]   (5.9)
v^(n+1) = [I + (ε/2) Ω(θ^(n+1), v^(n+1/2))]^{-1} [v^(n+1/2) − (ε/2) G(θ^(n+1))^{-1} ∇_θ φ(θ^(n+1))]   (5.10)

where the implicit equation (5.9) can be solved by fixed point iteration.
The integrator (5.8)-(5.10) is time reversible and numerically stable; however, it is not volume preserving. Therefore we need to adjust the acceptance rate by the Jacobian determinant, calculated by the wedge product (see section 4.3.2.3):

dθ^(n+1) ∧ dv^(n+1) = [I + (ε/2) Ω(θ^(n+1), v^(n+1/2))]^{-T} [I − (ε/2) Ω(θ^(n+1), v^(n+1))]^T ·
[I − (ε/2) ∇_{θ^T} f(θ^(n+1), v^(n+1/2))]^{-1} [I + (ε/2) ∇_{θ^T} f(θ^(n), v^(n+1/2))] ·
[I + (ε/2) Ω(θ^(n), v^(n))]^{-T} [I − (ε/2) Ω(θ^(n), v^(n+1/2))]^T · dθ^(n) ∧ dv^(n)   (5.11)

where ∇_{θ^T} f(θ, v) = v*_T (v*_T)^T v ∇m(θ)^T.
We then accept the proposal obtained by implementing (5.8)-(5.10) for L steps with the following probability:

α_WT = min{ 1, exp(−E(θ^(L+1), v^(L+1)) + E(θ^(1), v^(1))) |det J_WT| }

where the Jacobian determinant is det J_WT = ∏_{n=1}^{L} |∂(θ^(n+1), v^(n+1)) / ∂(θ^(n), v^(n))| and the energy E is defined in (4.15) (see more details in chapter 4). Figure 5.4 illustrates this approach based on sampling from a mixture of 10 Gaussian distributions with dimension D = 100.
[Figure 5.4 appears here: two panels of samples in the (x1, x2) plane.]
Figure 5.4: Sampling from a mixture of 10 Gaussian distributions with dimension D = 100 using THMC along with a wind vector f(θ, v) to enforce moving between modes in higher dimensions.
5.3.3 Wormhole
While the previous examples show that our addition of tunnels to Hamiltonian dynamics succeeds in facilitating rapid transitions between modes, the implementation has the downside that the native HMC dynamics are overridden in a neighborhood of each tunnel, possibly preventing the sampler from properly exploring some of the low probability regions, as well as some parts of a mode. Indeed, any tunneling mechanism that modifies the dynamics in the existing parameter space will suffer from this issue. Thus we are led to the idea of allowing the tunnels to pass through an extra dimension, so as not to interfere with the existing HMC dynamics in the given parameter space; we call such tunnels wormholes. In particular, we introduce an extra auxiliary variable θ_{D+1} ∼ N(0, 1) corresponding to an auxiliary dimension. We use θ̃ := (θ, θ_{D+1}) to denote the position parameters in the resulting D + 1 dimensional space M^D × R. θ_{D+1} can be viewed as random noise independent of θ, and it contributes θ²_{D+1}/2 to the total potential energy. At the end of sampling, we discard θ_{D+1}, projecting θ̃ back to the real world. Correspondingly, we augment the velocity v with one extra dimension, denoted ṽ := (v, v_{D+1}).
We refer to M^D × {−h} as the real world, and M^D × {+h} as the mirror world. The two worlds are connected by networks of wormholes, as shown in figure 5.5. We construct these wormholes in a 'mobile network' fashion: when the sampler is near a mode (θ̂_1, −h) in the real world, we build a wormhole network by connecting it to all the modes in the mirror world. Similarly, we connect the corresponding mode in the mirror world, (θ̂_1, +h), to all
[Figure 5.5 appears here: a 3d view of the real world (auxiliary dimension −h) and the mirror world (+h), with numbered modes 1-5 and wormholes connecting them.]
Figure 5.5: Illustrating a wormhole network connecting the real world to the mirror world (h = 1). As an example, the cylinder shows a wormhole connecting mode 5 in the real world to its mirror image. The dashed lines show two sets of wormholes: the red lines show the wormholes when the sampler is close to mode 1 in the real world, and the magenta lines show the wormholes when the sampler is close to mode 5 in the mirror world.
the modes in the real world. Note that such a construction allows the sampler to jump from a mode to the vicinity of that same mode, avoiding being blown overzealously through the wind tunnel.
Several wormholes starting from the same mode may still influence each other where they intersect, provided they exist simultaneously. To further resolve this interference, we adopt a stochastic way of weighing these wormholes through a random wind vector f̃, instead of deterministically weighing wind tunnels by the vicinity function (5.7). Now suppose that the current position, θ̃, of the sampler is near a mode denoted θ̃*_0. A network of wormholes connects this mode to all the modes in the opposite world, θ̃*_k, k = 1, ..., K.
Definition 5.4 (Random Wind Vector). A random wind vector f̃(θ̃, ṽ) is defined as follows:

f̃(θ̃, ṽ) ∼ (1 − ∑_k m_k(θ̃)) δ_ṽ(·) + ∑_k m_k(θ̃) δ_{2(θ̃*_k − θ̃)/e}(·),   if ∑_k m_k(θ̃) < 1
f̃(θ̃, ṽ) ∼ ∑_k [m_k(θ̃) / ∑_{k'} m_{k'}(θ̃)] δ_{2(θ̃*_k − θ̃)/e}(·),   if ∑_k m_k(θ̃) ≥ 1

where e is the stepsize, δ is the Kronecker delta function, and m_k(θ̃) = exp{ −V_k(θ̃)/(DF) }, with V_k(θ̃) the vicinity function defined similarly to (5.7) along the k-th wormhole in the network:

V_k(θ̃) = ⟨θ̃ − θ̃*_0, θ̃ − θ̃*_k⟩ + |⟨θ̃ − θ̃*_0, ṽ*_{T_k}⟩| |⟨θ̃ − θ̃*_k, ṽ*_{T_k}⟩|

where ṽ*_{T_k} = (θ̃*_k − θ̃*_0)/‖θ̃*_k − θ̃*_0‖.
For each update, f̃(θ̃, ṽ) is either ṽ or 2(θ̃*_k − θ̃)/e, according to the position-dependent probabilities defined in terms of m_k(θ̃). We then make proposals with the following modified Lagrangian dynamics with a random wind vector field in the extended space:

θ̃˙ = f̃(θ̃, ṽ)
ṽ˙ = −η(θ̃, ṽ) − G(θ̃)^{-1} ∇_θ̃ φ(θ̃)   (5.12)
Note that compared to the first equation in (5.6), ṽ is now absorbed into f̃(θ̃, ṽ). To solve the modified Lagrangian dynamics (5.12) in a time-reversible manner, we still refer to (5.8)-(5.10), except that solving (5.9) by fixed point iteration now involves random vectors:

θ̃^(ℓ+1) = θ̃^(ℓ) + (e/2) [f̃(θ̃^(ℓ), ṽ^(ℓ+1/2)) + f̃(θ̃^(ℓ+1), ṽ^(ℓ+1/2))]   (5.13)
Therefore, in each update, the sampler either stays in the vicinity of θ̃*_0 or proposes a move towards a mode θ̃*_k in the opposite world, depending on the values of f̃(θ̃^(ℓ), ṽ^(ℓ+1/2)) and f̃(θ̃^(ℓ+1), ṽ^(ℓ+1/2)). For example, if we have f̃(θ̃^(ℓ), ṽ^(ℓ+1/2)) = 2(θ̃*_k − θ̃^(ℓ))/e and f̃(θ̃^(ℓ+1), ṽ^(ℓ+1/2)) = ṽ^(ℓ+1/2), then equation (5.13) becomes

θ̃^(ℓ+1) = θ̃*_k + (e/2) ṽ^(ℓ+1/2)

which indicates that a move to the k-th mode in the opposite world has in fact occurred.
Note that the movement θ̃^(ℓ) → θ̃^(ℓ+1) in this case is discontinuous, since

lim_{e→0} ‖θ̃^(ℓ+1) − θ̃^(ℓ)‖ ≥ 2h > 0

where 2h is the distance between the two worlds and should be chosen at the same scale as the average distance among modes. Therefore, in such cases, there will be an energy gap, ∆E = E(θ̃^(ℓ+1), ṽ^(ℓ+1)) − E(θ̃^(ℓ), ṽ^(ℓ)), between the two states. Instead of the volume correction (5.11) (see also section 4.3.2.3), which is not well defined here¹, we adjust the Metropolis acceptance probability to account for the resulting energy gap. Further, we limit the maximum number of jumps within each iteration of MCMC (i.e., over L leapfrog steps) to 1, in order to avoid overzealous jumps between the two worlds. Algorithm 5.1 provides the details of our sampling method, Wormhole Hamiltonian Monte Carlo (WHMC).

¹ ∇_θ̃^T f̃(θ̃, ṽ) in (5.11) has elements that are either all 0 (staying) or all ∞ (jumping).
Algorithm 5.1 Wormhole Hamiltonian Monte Carlo (WHMC)
Prepare the modes θ*_k, k = 1, ..., K
Set θ̃^(1) = current θ̃
Sample velocity ṽ^(1) ∼ N(0, I_{D+1})
Calculate E(θ̃^(1), ṽ^(1)) = U(θ̃^(1)) + K(ṽ^(1))
Set ∆ log det = 0, ∆E = 0, Jumped = false
for ℓ = 1 to L do
  ṽ^(ℓ+1/2) = ṽ^(ℓ) − (e/2) ∇_θ̃ U(θ̃^(ℓ))
  if Jumped then
    θ̃^(ℓ+1) = θ̃^(ℓ) + e ṽ^(ℓ+1/2)
  else
    Find the closest mode θ̃*_0 and build a network connecting it to all modes θ̃*_k, k = 1, ..., K in the opposite world
    for m = 1 to M do
      Calculate m_k(θ̂̃^(m)), k = 1, ..., K
      Sample u ∼ Unif(0, 1)
      if u < 1 − ∑_k m_k(θ̂̃^(m)) then
        Set f̃(θ̂̃^(m), ṽ^(ℓ+1/2)) = ṽ^(ℓ+1/2)
      else
        Choose one of the K wormholes according to the probabilities {m_k / ∑_{k'} m_{k'}} and set f̃(θ̂̃^(m), ṽ^(ℓ+1/2)) = 2(θ̃*_k − θ̂̃^(m))/e
      end if
      θ̂̃^(m+1) = θ̃^(ℓ) + (e/2) [f̃(θ̂̃^(m), ṽ^(ℓ+1/2)) + f̃(θ̃^(ℓ), ṽ^(ℓ+1/2))]
    end for
    θ̃^(ℓ+1) = θ̂̃^(M+1)
  end if
  ṽ^(ℓ+1) = ṽ^(ℓ+1/2) − (e/2) ∇_θ̃ U(θ̃^(ℓ+1))
  If a modal jump truly happens, set Jumped = true and calculate the energy gap ∆E
end for
Calculate E(θ̃^(L+1), ṽ^(L+1)) = U(θ̃^(L+1)) + K(ṽ^(L+1))
p = exp{ −E(θ̃^(L+1), ṽ^(L+1)) + E(θ̃^(1), ṽ^(1)) + ∆E }
Accept or reject the proposal (θ̃^(L+1), ṽ^(L+1)) according to p

(Here θ̂̃^(m) denotes the m-th fixed-point iterate of (5.13) within a leapfrog step.)
We close this section with some comments on the width of the wormholes. When modes have drastically different shapes (high density regions), jumping from a small, round, concentrated mode might be easier than jumping from a long, narrow, spread-out mode. This is because, for the latter, the sampler may wander around the narrow wings and have less chance of entering the wormhole if the wormhole is not wide enough. It is therefore plausible to adapt the width of the wormholes to the shape of the modes. One possibility is to project the principal direction of the mode onto the plane perpendicular to the wormhole direction. A more adaptive wormhole should work even better.
5.4 Mode Searching After Regeneration
So far, we assumed that the locations of modes are known. This is of course not a realistic
assumption in many situations. In this section, we relax this assumption by extending
our method to search for new modes proactively and to update the network of wormholes
dynamically. In general, however, allowing such adaptation to take place infinitely often
will disturb the stationary distribution of the chain, rendering the process no longer Markov
[89, 93]. To avoid this issue, we use the regeneration method discussed by [87, 88, 89, 94].
Regeneration allows adaptation to occur infinitely often without affecting the stationary
distribution or the consistency of sample path averages.
Informally, a regenerative process “starts again” probabilistically at each of a set of random stopping times, called regeneration times [94]. These regeneration times divide the chain
into segments, called tours, which are independent from each other [88, 89, 94]. Therefore,
at regeneration times, the transition mechanism can be modified based on the entire history
of the chain up to that point without disturbing consistency of MCMC estimators. In our
method, when the regeneration occurs, we search for new modes and update the network of
wormholes moving forward until the next regeneration time. When searching for new modes
at regeneration times, we learn about the distribution around the known modes from the
history of the chain to increase the possibility of finding new modes as opposed to rediscovering known ones. In what follows, we discuss how our method identifies regeneration times
and how it discovers new modes.
5.4.1 Identifying Regeneration Times
The main idea of regeneration is to regard the transition kernel T(θ_{t+1}|θ_t), e.g., a Metropolis-Hastings algorithm with an independent proposal (section 2.2.1), as a mixture of two kernels, Q and R [85, 87]:

T(θ_{t+1}|θ_t) = S(θ_t) Q(θ_{t+1}) + (1 − S(θ_t)) R(θ_{t+1}|θ_t)   (5.14)
where Q(θ_{t+1}) is an independence kernel, and the residual kernel R(θ_{t+1}|θ_t) is defined as follows:

R(θ_{t+1}|θ_t) = [T(θ_{t+1}|θ_t) − S(θ_t) Q(θ_{t+1})] / (1 − S(θ_t)),   if S(θ_t) ∈ [0, 1)
R(θ_{t+1}|θ_t) = 1,   if S(θ_t) = 1   (5.15)

S(θ_t) is the mixing coefficient between the two kernels, such that

T(θ_{t+1}|θ_t) ≥ S(θ_t) Q(θ_{t+1}),   ∀ θ_t, θ_{t+1}   (5.16)
Now suppose that at iteration t, the current state is θ_t. There are two ways to identify regeneration times.

Prospective Regeneration. Generate a Bernoulli random variable B_{t+1} with success probability S(θ_t):

B_{t+1} | θ_t ∼ Bern(S(θ_t))   (5.17)

If B_{t+1} = 1, sample θ_{t+1} from the independence kernel, θ_{t+1} ∼ Q(·); otherwise, use the residual kernel to generate θ_{t+1} ∼ R(·|θ_t). When B_{t+1} = 1, the chain regenerates and the transition mechanism Q(·) becomes independent of the current state θ_t. To sum up,

P[θ_{t+1} | B_{t+1}, θ_t] = Q(θ_{t+1}) δ_1(B_{t+1}) + R(θ_{t+1}|θ_t) δ_0(B_{t+1})   (5.18)

where δ is the Kronecker delta function.
Note that S(·) has to be between 0 and 1, as in definition (5.15), and the non-regenerative states have to be sampled from the residual kernel, which might not be easy. The following retrospective procedure avoids both constraints and is therefore preferred in practice.
Retrospective Regeneration. For this method, the Bernoulli random variable B_{t+1} is always generated after sampling θ_{t+1}. This way, Q(·) does not need to be normalized, and we do not need to specify R(·|θ_t) explicitly [87, 89]. To implement this approach, we first generate θ_{t+1} using the original transition kernel, θ_{t+1}|θ_t ∼ T(·|θ_t). Then, we sample B_{t+1} from the Bernoulli distribution with the retrospective success probability calculated as follows (using equations (5.17)(5.18)):

r(θ_t, θ_{t+1}) := P[B_{t+1} = 1 | θ_{t+1}, θ_t] = P[B_{t+1} = 1, θ_{t+1} | θ_t] / P[θ_{t+1} | θ_t]
= P[θ_{t+1} | B_{t+1} = 1, θ_t] P[B_{t+1} = 1 | θ_t] / P[θ_{t+1} | θ_t] = S(θ_t) Q(θ_{t+1}) / T(θ_{t+1}|θ_t)   (5.19)
If B_{t+1} = 1, a regeneration has occurred; we then discard θ_{t+1} and sample θ_{t+1} ∼ Q(·) from the independence kernel. At regeneration times, we redefine the dynamics using the past sample path. This process is discussed in the following section.
Remark 5.3. It is essential to find a function S ≥ 0 and a probability measure Q (not necessarily normalized) satisfying condition (5.16); this is called splitting the MCMC kernel, and the pair (S, Q) is called an atom [88, 89]. For MH algorithms (section 2.2.1), it is much easier to split the MCMC kernel for the independent proposal mechanism than for the symmetric one. Suppose the proposal kernel is an independent sampler, q(θ_{t+1}|θ_t) = q(θ_{t+1}); then by (2.4)(2.5) we can split the MH transition kernel:

T(θ_{t+1}|θ_t) = q(θ_{t+1}|θ_t) α(θ_t, θ_{t+1}) + δ_{θ_t}(θ_{t+1}) ∫ q(θ*|θ_t)(1 − α(θ_t, θ*)) dθ*
≥ q(θ_{t+1}|θ_t) α(θ_t, θ_{t+1}) = q(θ_{t+1}) min{ 1, [π(θ_{t+1})/q(θ_{t+1})] / [π(θ_t)/q(θ_t)] }
≥ q(θ_{t+1}) min{ 1, [π(θ_{t+1})/q(θ_{t+1})] / c } · min{ c, 1/[π(θ_t)/q(θ_t)] } =: Q(θ_{t+1}) S(θ_t)

with some c > 0 and

S(θ_t) = min{ c, 1/[π(θ_t)/q(θ_t)] },   Q(θ_{t+1}) = q(θ_{t+1}) min{ 1, [π(θ_{t+1})/q(θ_{t+1})] / c }   (5.20)

However, this is difficult for a symmetric proposal kernel q(θ_{t+1}|θ_t) = q(θ_t|θ_{t+1}); [88, 89] provide one such splitting, which, however, quickly fails as the dimension grows.
In our method, the independence kernel Q(θ_{t+1}) is defined as in (5.20), with the proposal kernel q(θ_{t+1}) specified by a mixture of Gaussians with means centered at the k known modes prior to regeneration. The covariance matrix for each mixture component is set to the inverse Hessian evaluated at the mode. The relative weight of each mixture component can be initialized as 1/k and updated at regeneration times to be proportional to the number of times the corresponding mode has been visited up to that regeneration time.
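Concretely, for an accepted independence-sampler move, the retrospective check (5.19) with the atom (5.20) reduces to a few lines; a sketch in log space, with log_pi and log_q as placeholders for the log target and log proposal densities, and c an assumed tuning constant.

import numpy as np

def regen_prob(theta_t, theta_next, log_pi, log_q, c=1.0):
    """Retrospective regeneration probability (5.19) for an accepted
    independence-sampler move, using the atom (S, Q) of (5.20), where
    w(theta) = pi(theta)/q(theta)."""
    log_w_t = log_pi(theta_t) - log_q(theta_t)
    log_w_n = log_pi(theta_next) - log_q(theta_next)
    log_S = min(np.log(c), -log_w_t)             # S = min{c, 1/w(theta_t)}
    log_Qratio = min(0.0, log_w_n - np.log(c))   # Q/q = min{1, w/c}
    log_alpha = min(0.0, log_w_n - log_w_t)      # accepted-move part of T/q
    return np.exp(log_S + log_Qratio - log_alpha)

# Usage: B ~ Bernoulli(regen_prob(...)); on B = 1, resample from Q.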
5.4.2 Searching New Modes
When the chain regenerates, we can modify the transition kernel by including newly found
modes in the mode library and updating the wormhole network accordingly. This way, starting with a limited number of modes (identified by some preliminary optimization process),
our wormhole HMC will discover unknown modes on the fly without affecting the stationarity
of the chain.
[Figure 5.6 appears here: three contour plots with known modes (red) and unknown modes (blue).]
Figure 5.6: Left panel: true energy contour (red: known modes, blue: unknown modes). Middle panel: residual energy contour at T = 1.2. Right panel: residual energy contour at T = 1.05.
To search for new modes after regeneration, we could simply do optimization on the
original target density function π(θ) with some random starting point. This, however, could
lead to frequently rediscovering the known modes. To reduce this computational waste, we
propose a surgery on π(θ) to remove/down-weight the known modes using the history of
the chain up to the regeneration time and use an optimization algorithm on the resulting
residual density. To this end, we fit a mixture of Gaussians with the best knowledge of modes
(locations, Hessians and relative weights) prior to the regeneration. It has the same density
as q(θ) in the independence kernel Q(θ) (5.20), which will be adapted at future regeneration
times.
The residual density function could simply be defined as π_r(θ) = π(θ) − q(θ), with the corresponding residual potential energy

U_r(θ) = −log(π_r(θ) + c) = −log(π(θ) − q(θ) + c)   (5.21)
where the constant c > 0 is used to make the term inside the log function positive. However, in regions where the mixture of Gaussians is a good fit to the original density, the corresponding residual energy, U_r, becomes flat, causing gradient-based minimization, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, to fail. To avoid this issue, we propose to use the following tempered residual potential energy:

U_r(θ, T) = −log( π(θ) − exp{ log q(θ) / T } + c )   (5.22)

where T is the temperature.
Figure 5.6 shows how the residual energy function changes at different temperatures. As the temperature cools down, the known modes become more and more down-weighted, so the optimization algorithm has a higher chance of discovering unknown modes.
When the optimizer finds new modes (verified by checking that the smallest distance to the known modes exceeds some threshold), they are added to the existing mode library, and the wormhole network is updated accordingly.
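A sketch of this mode-search step, using SciPy's BFGS on the tempered residual energy (5.22); log_pi and log_q are placeholders, and the novelty threshold is an assumed tuning constant rather than a value prescribed by the method.

import numpy as np
from scipy.optimize import minimize

def residual_energy(theta, log_pi, log_q, T=1.05, c=1e-10):
    """Tempered residual potential energy (5.22):
    U_r(theta, T) = -log( pi(theta) - q(theta)^{1/T} + c ).
    (Working with raw densities can underflow; this is only a sketch.)"""
    val = np.exp(log_pi(theta)) - np.exp(log_q(theta) / T) + c
    return -np.log(max(val, c))   # guard against a non-positive argument

def search_new_mode(start, known_modes, log_pi, log_q, thresh=1.0):
    res = minimize(residual_energy, start, args=(log_pi, log_q),
                   method="BFGS")
    cand = res.x
    # Accept as new only if far enough from every known mode.
    if min(np.linalg.norm(cand - m) for m in known_modes) > thresh:
        return cand
    return None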
Algorithm 5.2 Regenerative Wormhole Hamiltonian Monte Carlo (RWHMC)
Initially search for modes θ̂_1, ..., θ̂_k
for n = 1 to L do
  Sample θ̃ = (θ, θ_{D+1}) as the current state according to WHMC (algorithm 5.1)
  Fit a mixture of Gaussians q(θ) with the known modes, Hessians and relative weights
  Propose θ* ∼ q(·) and accept it with probability α = min{ 1, [π(θ*)/q(θ*)] / [π(θ)/q(θ)] }
  if θ* is accepted then
    Determine whether θ* is a regeneration using (5.19)(5.20) with θ_t = θ, θ_{t+1} = θ*
    if a regeneration occurs then
      Search for new modes by minimizing U_r(θ, T) (5.22); if new modes are discovered, update the library, the wormhole network and q(θ)
      Discard θ*, and sample θ^(n+1) ∼ Q(·) as in (5.20) using rejection sampling
    else
      Set θ^(n+1) = θ*
    end if
  else
    Set θ^(n+1) = θ̃
  end if
end for
5.4.3 Regenerative Wormhole HMC
Before giving the regenerative version of the Wormhole HMC algorithm, we comment on the independence kernel Q in (5.20) and the underlying mechanism that guides jumps among modes.
What we need for WHMC to adapt to new modes is a timing rule (regeneration) that does not break the stationarity of the Markov chain. A splitting of the WHMC kernel to identify regeneration times would be ideal, but this is difficult in practice¹. Therefore, we introduce the mixture of Gaussians q(θ), built from the best knowledge of the discovered modes, as an independent proposal for the target density π(θ); it can be viewed as another mechanism, aside from WHMC, for jumping among modes (similar to the Truncated Dirichlet Process Mixture of Gaussians in RDMC [85]). It is valid to use several different proposals (WHMC and the mixture of Gaussians) in a hybrid sampler, combined in a random or systematic scheme [25, 89]. Only the second jumping mechanism (MH with the mixture of Gaussians as proposal) is split in the process of identifying regeneration times. However, as we will see in section 5.5, this mechanism alone fails in high dimensions, as does RDMC [85].
We use these two proposal mechanisms in a cyclic manner [85] and summarize the Regenerative Wormhole HMC (RWHMC) in algorithm 5.2.

¹ The symmetric proposal for WHMC ((2.9), q(z*|z_t) = δ_{T̃_e(z_t)}(z*), with T̃_e the integrator for (5.12) described in algorithm 5.1) is hard to express as a product of separate functions of z_t and z*.
5.5 Empirical Results
In this section, we evaluate the performance of our method, henceforth called Wormhole
Hamiltonian Monte Carlo (WHMC), using three examples. The first example, which is
discussed in [85, 95], involves inference regarding the locations of sensors in a network.
The second example involves sampling from mixtures of Gaussian distributions with varying
number of modes and dimensions. In this example, which is discussed in [85], the locations
of modes are assumed to be known. For our third example, we also use mixtures of Gaussian
distribution, but this time we assume that the locations of modes are unknown.
We evaluate our method’s performance by comparing it to Regeneration Darting Monte
Carlo (RDMC) [85], which is one of the most recent algorithms designed for sampling from
multimodal distributions based on the Darting Monte Carlo (DMC) [79] approach. DMC
defines high density regions around the modes. When the sampler enters these regions, a
jump between the regions will be attempted. RDMC enriches the DMC method by using
the regeneration approach [88, 89].
We compare the two methods (i.e., WHMC and RDMC) in terms of Relative Error of
Mean (REM) [85] and R (MPSRF) statistic [96]. REM summarizes the errors in approximating the expectation of variables across all dimensions.
Definition 5.5 (Relative Error of Mean). Given samples {θ(k)}_{k=1}^{t}, the relative error of mean estimated by the samples at time t is defined as

REM(t) = ‖θ̄(t) − θ*‖₁ / ‖θ*‖₁

where θ̄(t) is the mean of the MCMC samples obtained by time t and θ* is the true mean.
The R statistic measures the convergence rate to the stationary distribution based on
within and between variances across multiple chains, and it approaches 1 when the chains
converge.
Definition 5.6 (R (Multivariate Potential Scale Reduction Factor)). Denote by θ_{jt} the j-th chain at time t, for j = 1, ..., m and t = 1, ..., n. Estimate the posterior variance-covariance matrix by

V̂ = [(n − 1)/n] W + (1 + 1/m) B/n,

where

W = [1/(m(n − 1))] ∑_{j=1}^{m} ∑_{t=1}^{n} (θ_{jt} − θ̄_{j·})(θ_{jt} − θ̄_{j·})^T,
B/n = [1/(m − 1)] ∑_{j=1}^{m} (θ̄_{j·} − θ̄_{··})(θ̄_{j·} − θ̄_{··})^T

Then R (the multivariate potential scale reduction factor, MPSRF) is estimated by

R̂ := max_a (a^T V̂ a)/(a^T W a) = λ_1(W^{-1} V̂)

where λ_1 denotes the largest eigenvalue.
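For chains stored as an (m, n, D) array, the MPSRF estimate of Definition 5.6 can be computed in a few lines; a minimal NumPy sketch:

import numpy as np

def mpsrf(chains):
    """Multivariate potential scale reduction factor (Definition 5.6).
    chains: array of shape (m, n, D) = (num chains, length, dim)."""
    m, n, D = chains.shape
    chain_means = chains.mean(axis=1)          # (m, D)
    grand_mean = chain_means.mean(axis=0)      # (D,)
    # Within-chain covariance W.
    dev = chains - chain_means[:, None, :]
    W = np.einsum('mti,mtj->ij', dev, dev) / (m * (n - 1))
    # Between-chain covariance B/n.
    dmu = chain_means - grand_mean
    B_over_n = dmu.T @ dmu / (m - 1)
    V_hat = (n - 1) / n * W + (1 + 1 / m) * B_over_n
    # R-hat is the largest eigenvalue of W^{-1} V-hat.
    return np.linalg.eigvals(np.linalg.solve(W, V_hat)).real.max()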
Because RDMC uses the standard HMC algorithm with a flat metric, we set the metric G0 ≡ I to make the two algorithms comparable. However, our approach can easily be modified to use other metrics, such as the Fisher metric.
5.5.1 Sensor Network Localization
For our first example, we use a problem discussed in [85, 95]. We assume N sensors are scattered in a planar region, with 2d locations denoted {x_i}_{i=1}^{N}. The distance Y_ij between a pair of sensors (x_i, x_j) is observed with probability π(x_i, x_j) = exp(−‖x_i − x_j‖²/(2R²)). If the distance is in fact observed (Y_ij > 0), then Y_ij follows a Gaussian distribution N(‖x_i − x_j‖, σ²) with small σ; otherwise Y_ij = 0. That is,

Z_ij = I(Y_ij > 0) | x ∼ Binom(1, π(x_i, x_j))
Y_ij | Z_ij = 1, x ∼ N(‖x_i − x_j‖, σ²)

where Z_ij is a binary indicator set to 1 if the distance between x_i and x_j is observed.
Given a set of observations Y_ij and a prior distribution on x, assumed to be uniform in this example, it is of interest to infer the posterior distribution of all the sensor locations. Following [85], we set N = 8, R = 0.3, σ = 0.02, and add three additional base sensors with known locations to avoid the ambiguities of translation, rotation, and negation (mirror symmetry). The locations of the 8 sensors form a multimodal distribution with dimension D = 16.
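A minimal sketch of the resulting log posterior (up to constants) for the unknown sensors; the base-sensor likelihood terms, omitted here, would be added analogously, and the looping/vectorization choices are ours.

import numpy as np

def log_post(x_flat, Y, Z, N=8, R=0.3, sigma=0.02):
    """Log posterior of sensor locations under a uniform prior.
    Y, Z: (N, N) observed distances and observation indicators for
    the unknown sensors."""
    x = x_flat.reshape(N, 2)
    lp = 0.0
    for i in range(N):
        for j in range(i + 1, N):
            d = np.linalg.norm(x[i] - x[j])
            p_obs = np.exp(-d ** 2 / (2 * R ** 2))
            if Z[i, j]:   # observed: Bernoulli success + Gaussian distance
                lp += np.log(p_obs) - (Y[i, j] - d) ** 2 / (2 * sigma ** 2)
            else:         # unobserved: Bernoulli failure
                lp += np.log1p(-p_obs)
    return lp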
[Figure 5.7 appears here: posterior samples in the (x, y) plane for RDMC and WHMC, plus REM vs. seconds for both methods.]
Figure 5.7: Posterior samples of the sensor locations using RDMC (left panel) and WHMC (middle panel), along with their corresponding REM over time (right panel).
Figure 5.7 shows the posterior samples based on the two methods. As we can see, RDMC very rarely visits one of the modes (shown in red in the top middle part), whereas WHMC generates enough samples from this mode to make it discernible. As a result, WHMC has a substantially lower REM than RDMC (Figure 5.7, right panel).
5.5.2 Mixture of Gaussians with Known Modes
Next, we evaluate the performance of our method by sampling from mixtures of K D-dimensional Gaussian distributions with known modes. (We relax this assumption in the next section.) The means of these distributions are randomly generated from D-dimensional uniform distributions such that the average pairwise distance remains around 20. The corresponding covariance matrices are constructed so that the mixture components have different density functions. Simulating samples from the resulting D-dimensional mixture of K Gaussians is challenging because the modes are far apart and the high density regions have different shapes.
The left panels of figure 5.8 compare the two methods for a varying number of mixture components with fixed dimension (D = 20). The right panels show the results for a varying number of dimensions with a fixed number of mixture components (K = 10). For both scenarios, we stop the two algorithms after 500 seconds and compare their REM and R. We run 10 chains from different locations to calculate R, and we use these 10 chains to estimate REM along with its 95% confidence interval. As we can see, WHMC has substantially lower REM and R (i.e., converges faster) compared to RDMC, especially as the number of modes and dimensions increases.
Figure 5.8: Comparing WHMC to RDMC using K mixtures of D-dimensional Gaussians.
Left panels show REM (along with 95% confidence interval) and R based on 10 MCMC chains
for varying number of mixture components with fixed dimension (D = 20). Right panels
show REM (along with 95% confidence interval) and R based on 10 MCMC chains for varying
number of dimensions with fixed number of mixture components (K = 10).
5.5.3 Mixture of Gaussians with Unknown Modes
We now evaluate our method's performance in terms of searching for new modes and updating the network of wormholes. For this example, we simulate a mixture of 10 D-dimensional Gaussian distributions for D = 10, 100, and compare our method to RDMC. While RDMC initially runs four parallel HMC chains to discover a subset of modes and to fit a truncated Gaussian distribution around each identified mode, we run four parallel BFGS optimizers (with different starting points). At regeneration times, each chain of RDMC uses a Dirichlet process mixture model to fit new truncated Gaussians around modes and possibly identify new modes; we, on the other hand, run the BFGS algorithm based on the residual energy function (with T = 1.05) to discover new modes for each chain.
Figure 5.9 shows that RWHMC reduces REM much faster than RDMC for both D = 10 and D = 100. For both methods, the recorded time (horizontal axis) accounts for the computational overhead of adapting the transition kernels.
Figure 5.9: Comparing RWHMC to RDMC in terms of REM using K = 10 mixtures of D-dimensional Gaussians. Left panel: D = 10. Right panel: D = 100.
For D = 10, our method has a substantially lower REM compared to RDMC. For D = 100, while our method identifies
new modes over time and reduces REM substantially, RDMC fails to identify new modes and, as a result, its REM stays high over time. Figure 5.10 shows the number of modes identified over time by our parallelized RWHMC for D = 10 and D = 100 separately.
Figure 5.10: Number of identified modes over time using our regenerative WHMC method
for K = 10 mixtures of Gaussians with D = 10, 100.
5.6 Discussion
We have proposed a new algorithm, called Wormhole Hamiltonian Monte Carlo, for sampling
from multimodal distributions. Using empirical results, we have shown that our method
performs well in high dimensions.
Moving continuously, the wind-tunnel approach weighs the jumping routes deterministically via a smooth mollifier function: local HMC provides the continuous movement, occasionally interrupted by a quick leap, as if the sampler were blown through a tunnel. The wormhole algorithm, despite the extra dimension, likewise moves continuously most of the time, with occasional discontinuous jumps via stochastically weighted routes aimed directly at a mode. Regenerative WHMC extends WHMC by adapting the chain through regeneration, which allows mode searching on the fly. Our method involves several parameters that require tuning. However, these parameters can be adjusted at regeneration times without affecting the stationary distribution or the consistency of sample path averages.
Although we used a flat base metric (i.e., I) in the examples discussed in this chapter, our method can easily be extended by specifying a more informative base metric (e.g., Fisher information) that adapts to the local geometry. For example, figure 5.11 shows the additional improvement in REM for the illustrative example of section 5.3.1 obtained by using Fisher information instead of I. In this example, Wormhole Lagrangian Monte Carlo (WLMC) is similar to WHMC, but uses Lagrangian Monte Carlo (LMC, see chapter 4) instead of HMC, i.e., the base metric G₀ is the Fisher metric.
Figure 5.11: Comparing Wormhole Lagrangian Monte Carlo (WLMC) to WHMC for posterior sampling of the 2D mixture of 2 Gaussians with tied means in section 5.3.1. WLMC is similar to WHMC, but it uses Fisher information as its base metric instead of the flat metric (I) used in WHMC. (Shaded areas represent the 95% confidence intervals based on 10 MCMC chains.)
Further technical improvements can be made by finding better (and possibly adaptive)
tunnel metrics, mollifiers, and vicinity functions.
6 Spherical Hamiltonian Monte Carlo for Constrained Target Distributions
6.1 Introduction
Many commonly used statistical models in Bayesian analysis involve high-dimensional probability distributions confined to constrained domains. Some examples include regression models with norm constraints (e.g., Lasso), probit models, many copula models, and Latent Dirichlet Allocation (LDA) models. Very often, the resulting models are intractable, simulating samples for Monte Carlo estimation is quite challenging [45, 57, 97, 98, 99], and mapping the domain to the entire Euclidean space for convenience would be computationally inefficient due to the exploration of a much larger space. In this chapter, we propose a novel Markov Chain Monte Carlo (MCMC) method, which provides a natural and computationally efficient framework for sampling from constrained target distributions. Our method is based on Hamiltonian Monte Carlo (HMC, chapter 2) [36, 37], which is a Metropolis algorithm with proposals guided by Hamiltonian dynamics.
In recent years, several methods have been proposed to improve the computational efficiency of HMC [38, 39, 40, 41, 49, 51]. In general, these methods do not directly address problems with constrained target distributions. In contrast, in this chapter we focus on improving HMC-based algorithms when the target distribution is constrained. When dealing with constrained target distributions, the standard HMC algorithm needs to check each proposal to ensure that it is within the boundaries imposed by the constraints, and discarding the proposals that do not satisfy the constraints is quite computationally inefficient. Alternatively, as discussed by [37], one could modify standard HMC such that the sampler bounces off the boundaries, by letting the potential energy go to infinity for parameter values that violate the constraints. This approach, however, is not very efficient either, due to the constant monitoring of boundary hitting times and the frequent bouncing. There are some recent papers in this research direction: [51] discuss an approach for distributions defined on a simplex; [57] propose a modified version of HMC for handling constraint functions c(θ) = 0; and [45] propose an HMC algorithm with an exact analytical solution for truncated Gaussian distributions. All these methods provide interesting solutions for specific types of constraints. Our proposed method, however, provides a general and computationally efficient framework for handling many types of constraints.
In what follows, we first present our method for distributions confined to the unit ball in section 6.2; the unit ball is a special case of q-norm constraints. In section 6.3.1, we discuss the application of our method to q-norm constraints in general. In section 6.4, we evaluate our proposed method using simulated and real data. Finally, we discuss future directions in section 6.5.
6.2 Sampling from distributions defined on the unit ball
In many cases, bounded connected constrained regions can be bijectively mapped to the D-dimensional unit ball B₀^D(1) := {θ ∈ ℝ^D : ‖θ‖₂ = √(Σ_{i=1}^{D} θ_i²) ≤ 1}. Therefore, in this section, we first focus on distributions confined to the unit ball with the constraint ‖θ‖₂ ≤ 1.
6.2.1 Change of the domain: from unit ball B₀^D(1) to sphere S^D
We start by augmenting the original D-dimensional parameter θ with an extra auxiliary variable θ_{D+1} to form an extended (D+1)-dimensional parameter θ̃ = (θ, θ_{D+1}) such that ‖θ̃‖₂ = 1, so θ_{D+1} = ±√(1 − ‖θ‖₂²). This way, the domain of the target distribution is changed from the unit ball B₀^D(1) to the D-dimensional sphere S^D := {θ̃ ∈ ℝ^{D+1} : ‖θ̃‖₂ = 1}, through the following transformation:
T_{B→S}: B₀^D(1) → S^D,  θ ↦ θ̃ = (θ, ±√(1 − ‖θ‖₂²))   (6.1)
Note that although θ_{D+1} can be either positive or negative, its sign does not affect our Monte Carlo estimates, since after applying the above transformation we need to adjust our estimates according to the following change of variable theorem.
Proposition 6.1 (Change of Variable: from unit Ball to hyper-Sphere).
∫_{B₀^D(1)} π(θ) dθ_B = ∫_{S₊^D} π(θ̃) |dθ_B/dθ̃_S| dθ̃_S = ∫_{S₊^D} π(θ̃) |θ_{D+1}| dθ̃_S   (6.2)
where π(θ̃) ≡ π(θ).
Proof. It suffices to show |dθ_B/dθ̃_S| = |θ_{D+1}|, or equivalently that the Jacobian determinant of T_{B→S₊} is 1/|θ_{D+1}|, since the map T_{B→S₊}: θ ↦ θ̃ = (θ, √(1 − ‖θ‖₂²)) bijectively maps the unit ball B₀^D(1) to the upper hemisphere S₊^D:
|dT_{B→S₊}| := |dθ̃_S/dθ_B| = 1/|θ_{D+1}|
If we view {θ, B₀^D(1)} as a coordinate chart for the manifold S^D, then by the volume form [68, 100] we have
dθ̃_S = √(det G_S(θ)) dθ_B
where G_S(θ) is the canonical metric on the sphere S^D. Therefore it suffices to prove
√(det G_S(θ)) = 1/|θ_{D+1}|   (6.3)
In the following we calculate the canonical metric G_S(θ). For S^D, the first fundamental form ds², i.e., the squared infinitesimal length of a curve, is explicitly expressed in terms of the differential form dθ and the canonical metric G_S(θ) as follows:
ds² = ⟨dθ, dθ⟩_{G_S} = dθᵀ G_S(θ) dθ
which can be obtained as follows [100]:
ds² = Σ_{i=1}^{D+1} dθ_i² = Σ_{i=1}^{D} dθ_i² + (d(θ_{D+1}(θ)))² = dθᵀ dθ + (θᵀ dθ)²/(1 − ‖θ‖₂²) = dθᵀ [I + θθᵀ/θ_{D+1}²] dθ
Therefore, the canonical metric G_S(θ) on S^D is¹
G_S(θ) = I_D + θθᵀ/θ_{D+1}² = I_D + θθᵀ/(1 − ‖θ‖₂²)   (6.4)
The determinant of the canonical metric G_S(θ) is given by the matrix determinant lemma,
det G_S(θ) = det(I_D + θθᵀ/θ_{D+1}²) = 1 + θᵀθ/θ_{D+1}² = 1/θ_{D+1}²   (6.5)
thus (6.3) follows, and the inverse of G_S(θ) is obtained by the Sherman–Morrison–Woodbury formula [101]:
G_S(θ)⁻¹ = (I_D + θθᵀ/θ_{D+1}²)⁻¹ = I_D − (θθᵀ/θ_{D+1}²)/(1 + θᵀθ/θ_{D+1}²) = I_D − θθᵀ   (6.6)
Remark 6.1. According to formula (6.2), we can do Monte Carlo estimation directly with samples θ̃ ∼ π(θ̃) dθ̃_S, each associated with a weight |θ_{D+1}|. Alternatively, when we need samples θ ∼ π(θ) dθ_B for estimation or inference, we can resample {θ̃} according to their weights and drop the auxiliary variables θ_{D+1}.
Note that the necessity of re-weighting θ̃ by |θ_{D+1}| to recover samples on the unit ball, θ ∼ π(θ) dθ_B, is verified in our experiments. Otherwise, the sampler would have 'oversampled' from around the boundary, due to the change of geometry from the unit ball B₀^D(1) to the sphere S^D.
Using the above transformation (6.1), we redefine the Hamiltonian dynamics on the sphere. This way, the resulting HMC sampler can move freely on S^D while implicitly handling the constraints imposed on the original parameters. As illustrated in figure 6.1, the boundary of the constraint, i.e., ‖θ‖₂ = 1, corresponds to the equator of the sphere S^D. Therefore, as the sampler moves on the sphere, passing across the equator from one hemisphere to the other (from A to B on the right) translates to "bouncing back" off the boundary in the original parameter space (from A to B on the left).
¹For any vector ṽ = (v, v_{D+1})ᵀ ∈ T_{θ̃}S^D = {ṽ ∈ ℝ^{D+1} : θ̃ᵀṽ = 0}, one could view G_S(θ) as a means to express the length of ṽ in terms of v:
vᵀ G_S(θ) v = ‖v‖₂² + (vᵀθθᵀv)/θ_{D+1}² = ‖v‖₂² + (−θ_{D+1}v_{D+1})²/θ_{D+1}² = ‖v‖₂² + v_{D+1}² = ‖ṽ‖₂²
Figure 6.1: Transforming the unit ball B₀^D(1) to the sphere S^D.
6.2.2 Hamiltonian Dynamics on Sphere
By defining HMC on the sphere (hence named Spherical HMC), besides handling the constraints implicitly, the computational efficiency of the sampling algorithm can be improved by using the splitting technique previously exploited by [40, 41, 51]. To this end, we first need to study the Hamiltonian dynamics defined on the manifold (S^D, G_S(θ)) (see section 4.2.1).
Consider a family of target distributions, {π(·; θ)}, defined on the unit ball B₀^D(1) endowed with the Euclidean metric I. The potential energy is defined as U(θ) := −log π(·; θ). Associated with the auxiliary velocity variable v ∈ T_θ B₀^D(1), a D-dimensional vector sampled from the tangent space of B₀^D(1), we define the kinetic energy K(v) = ½ vᵀIv. Therefore, the Hamiltonian is defined on B₀^D(1) as
H(θ, v) = U(θ) + K(v) = U(θ) + ½ vᵀIv   (6.7)
Next, we derive the corresponding Hamiltonian function on S^D. The potential energy U(θ̃) = U(θ) remains the same, since the distribution is fully defined in terms of the original parameter θ, i.e., the first D elements of θ̃. However, the kinetic energy, K(ṽ) := ½ ṽᵀṽ, changes since the velocity ṽ = (v, v_{D+1}) is now sampled from the tangent space of the sphere, T_{θ̃}S^D := {ṽ ∈ ℝ^{D+1} : θ̃ᵀṽ = 0}, with v_{D+1} = −θᵀv/θ_{D+1}. Therefore, on the sphere S^D, the Hamiltonian H*(θ̃, ṽ) is defined as follows:
H*(θ̃, ṽ) = U(θ̃) + K(ṽ) = U(θ̃) + ½ ṽᵀṽ   (6.8)
If we view {θ, B₀^D(1)} as a coordinate chart of S^D, this is equivalent to replacing the Euclidean metric I with the canonical spherical metric G_S(θ) = I_D + θθᵀ/(1 − ‖θ‖₂²) in the definition of H(θ, v) (6.7), as we rewrite the Hamiltonian function (6.8) (see footnote 1):
H*(θ̃, ṽ) = U(θ̃) + ½ ṽᵀṽ = U(θ) + ½ vᵀ G_S(θ) v   (6.9)
Now we can sample the velocity v ∼ N(0, G_S(θ)⁻¹) and set ṽ = [I; −θᵀ/θ_{D+1}] v (i.e., I_D stacked over the row −θᵀ/θ_{D+1}). Alternatively, we can sample ṽ directly from the standard (D+1)-dimensional Gaussian and project it onto T_{θ̃}S^D:
ṽ ∼ N(0, [I; −θᵀ/θ_{D+1}] G_S(θ)⁻¹ [I, −θ/θ_{D+1}])
which simplifies to
ṽ ∼ (I_{D+1} − θ̃θ̃ᵀ) N(0, I_{D+1})   (6.10)
The Hamiltonian function (6.9), H = U(θ) + ½ pᵀ G_S(θ)⁻¹ p, defines the Hamiltonian dynamics on the Riemannian manifold (S^D, G_S(θ)) in terms of (θ, p = G_S(θ)v) [39, see also chapter 4]:
θ̇ = G_S(θ)⁻¹ p
ṗ = −∇_θ U(θ) + ½ (G_S(θ)⁻¹p)ᵀ dG_S(θ) G_S(θ)⁻¹p   (6.11)
which is equivalent to the following Lagrangian dynamics in terms of (θ, v) (see chapter 4 for more details):
θ̇ = v
v̇ = −vᵀ Γ(θ) v − G_S(θ)⁻¹ ∇_θ U(θ)   (6.12)
6.2.3 Spherical HMC algorithm
Now we use the splitting technique [61] to derive an efficient geometric (time-reversible and volume-preserving) integrator for the above Riemannian Hamiltonian dynamics (6.11) [51], or equivalently the Lagrangian dynamics (6.12).
[51] split the Hamiltonian (6.9) as
H*(θ, p) = U(θ)/2 + ½ pᵀ G_S(θ)⁻¹ p + U(θ)/2
and the Hamiltonian dynamics corresponding to U(θ)/2 and ½ pᵀ G_S(θ)⁻¹ p are as follows:
{ θ̇ = 0;  ṗ = −½ ∇_θ U(θ) }   and   { θ̇ = G_S(θ)⁻¹p;  ṗ = ½ (G_S(θ)⁻¹p)ᵀ dG_S(θ) G_S(θ)⁻¹p }   (6.13)
They note that the second dynamics in (6.13) is equivalent to the geodesic equation (5.2), but solve it under the condition that the manifold be embedded in a larger Euclidean space.
To avoid such a strong assumption, we propose to split the Lagrangian dynamics instead of the Hamiltonian dynamics. Although splitting the Hamiltonian and its usefulness in improving HMC is well studied [41, 51, 61], splitting the Lagrangian has, to the best of our knowledge, not been discussed in the literature. Nevertheless, we can split the Lagrangian dynamics (6.12) into smaller dynamics corresponding to U(θ)/2 and ½ vᵀ G_S(θ) v by applying the transformation p ↦ v to the dynamics (6.13), respectively (see section 4.3.1):
{ θ̇ = 0;  v̇ = −½ G_S(θ)⁻¹ ∇_θ U(θ) }   and   { θ̇ = v;  v̇ = −vᵀ Γ(θ) v }   (6.14)
In the following we solve these dynamics (6.14) defined on S^D.
Proposition 6.2. The dynamics (6.14) have the following solutions, respectively:
θ̃(t) = θ̃(0),   ṽ(t) = ṽ(0) − (t/2) ([I_D; 0ᵀ] − θ̃(0)θ(0)ᵀ) ∇_θ U(θ(0))   (6.15)
θ̃(t) = θ̃(0) cos(‖ṽ(0)‖₂ t) + (ṽ(0)/‖ṽ(0)‖₂) sin(‖ṽ(0)‖₂ t),   ṽ(t) = −θ̃(0)‖ṽ(0)‖₂ sin(‖ṽ(0)‖₂ t) + ṽ(0) cos(‖ṽ(0)‖₂ t)   (6.16)
where t denotes time.
Proof. Appendix B.
Remark 6.2. Based on the solutions (6.15) and (6.16), we have ‖θ̃(t)‖₂ = 1 if ‖θ̃(0)‖₂ = 1, and ṽ(t) ∈ T_{θ̃(t)}S^D if ṽ(0) ∈ T_{θ̃(0)}S^D.
We observe that the whole dynamics do not take place on an embedded manifold, but rather on a manifold whose geodesics are known explicitly. With this viewpoint, the applicability of the ideas of [51] can be further expanded.
Note that (6.15) and (6.16) are both symplectic. Due to the explicit formula for the geodesic flow on the sphere, the second dynamics in (6.14) is simulated exactly. Therefore,
Algorithm 6.1 Spherical Hamiltonian Monte Carlo (Spherical HMC)
  Initialize θ̃^(1) at the current θ̃ after transformation
  Sample a new velocity value ṽ^(1) ∼ N(0, I_{D+1})
  Set ṽ^(1) ← ṽ^(1) − θ̃^(1)(θ̃^(1))ᵀ ṽ^(1)
  Calculate H(θ̃^(1), ṽ^(1)) = U(θ^(1)) + K(ṽ^(1)) for the current state
  for ℓ = 1 to L do
    ṽ^(ℓ+½) = ṽ^(ℓ) − (ε/2) ([I_D; 0ᵀ] − θ̃^(ℓ)(θ^(ℓ))ᵀ) ∇_θ U(θ^(ℓ))
    θ̃^(ℓ+1) = θ̃^(ℓ) cos(‖ṽ^(ℓ+½)‖ε) + (ṽ^(ℓ+½)/‖ṽ^(ℓ+½)‖) sin(‖ṽ^(ℓ+½)‖ε)
    ṽ^(ℓ+½) ← −θ̃^(ℓ)‖ṽ^(ℓ+½)‖ sin(‖ṽ^(ℓ+½)‖ε) + ṽ^(ℓ+½) cos(‖ṽ^(ℓ+½)‖ε)
    ṽ^(ℓ+1) = ṽ^(ℓ+½) − (ε/2) ([I_D; 0ᵀ] − θ̃^(ℓ+1)(θ^(ℓ+1))ᵀ) ∇_θ U(θ^(ℓ+1))
  end for
  Calculate H(θ̃^(L+1), ṽ^(L+1)) = U(θ^(L+1)) + K(ṽ^(L+1)) for the proposed state
  α = min{1, exp(−H(θ̃^(L+1), ṽ^(L+1)) + H(θ̃^(1), ṽ^(1)))}
  Accept or reject the proposal (θ̃^(L+1), ṽ^(L+1)) according to α
  Calculate the corresponding weight |θ_{D+1}^(n)|
updating θ̃ does not involve discretization error, so we can use large step sizes. This can lead to improved computational efficiency. Since this step is in fact a rotation on the sphere, we set the trajectory length to 2π/D and randomize the number of leapfrog steps to avoid periodicity. Algorithm 6.1 shows the steps for implementing this approach, henceforth called Spherical HMC.
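A minimal Python/NumPy sketch of one iteration of Algorithm 6.1 follows; U and grad_U are user-supplied callables for the potential energy and its gradient in the first D coordinates, and the function name and interface are our own, not from the original text:

import numpy as np

def spherical_hmc_step(theta_tilde, U, grad_U, eps, L, rng):
    D = len(theta_tilde) - 1

    def half_kick(tt, vt):
        # v <- v - (eps/2) * ([I_D; 0^T] - theta_tilde theta^T) grad U(theta)
        g = grad_U(tt[:D])
        g_lift = np.append(g, 0.0)
        return vt - 0.5 * eps * (g_lift - tt * (tt[:D] @ g))

    v = rng.standard_normal(D + 1)
    v = v - theta_tilde * (theta_tilde @ v)        # project onto tangent space (6.10)
    H0 = U(theta_tilde[:D]) + 0.5 * (v @ v)
    tt = theta_tilde.copy()
    for _ in range(L):
        v = half_kick(tt, v)
        nv = np.linalg.norm(v)
        tt_new = tt * np.cos(nv * eps) + (v / nv) * np.sin(nv * eps)  # exact geodesic flow
        v = -tt * nv * np.sin(nv * eps) + v * np.cos(nv * eps)
        tt = tt_new
        v = half_kick(tt, v)
    H1 = U(tt[:D]) + 0.5 * (v @ v)
    if np.log(rng.uniform()) < H0 - H1:            # Metropolis acceptance
        return tt, np.abs(tt[-1])
    return theta_tilde, np.abs(theta_tilde[-1])

For a target on the unit ball, U should be the negative log-density in the θ coordinates; estimates are then re-weighted by the returned |θ_{D+1}| as in Remark 6.1.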
6.3 Constraints
In this section we discuss several types of constraints that can be transformed to ball-type constraints, so that Spherical HMC can be applied to sample from target distributions with these constraints.
6.3.1 Norm constraints
The unit-ball region discussed in the previous section is in fact a special case of q-norm constraints. In this section we discuss constraints given by the q-norm of the parameters.
Definition 6.1 (q-norm). For any β ∈ ℝ^D, the q-norm (q > 0) of β is defined as follows:
‖β‖_q = (Σ_{i=1}^{D} |β_i|^q)^{1/q},  for q ∈ (0, +∞);   ‖β‖_q = max_{1≤i≤D} |β_i|,  for q = +∞   (6.17)
For example, when β are regression parameters, q = 1 corresponds to the Lasso method and q = 2 corresponds to ridge regression. In what follows, we show how this type of constraint can be transformed to S^D.
6.3.1.1 Norm constraints with q = +∞
When q = +∞, the norm inequality defines a hypercube. Note that the hypercube, and in general the hyper-rectangle, R^D := {β ∈ ℝ^D : l ≤ β ≤ u}, can be bijectively transformed to the unit hypercube, C^D := [−1, 1]^D = {β ∈ ℝ^D : ‖β‖_∞ ≤ 1}, by proper shifting and scaling of the original parameters. [37] discusses this kind of constraint, which can be handled by adding a term to the energy function such that the energy goes to infinity for values that violate the constraints. This creates "energy walls" at the boundaries; as a result, the sampler bounces off the energy wall whenever it reaches the boundary. As mentioned earlier, this approach, henceforth called Wall HMC, has limited applications and tends to be computationally inefficient.
To use Spherical HMC, the unit hypercube can be bijectively transformed to its inscribed unit ball through the following map:
T_{C→B}: [−1, 1]^D → B₀^D(1),  β ↦ θ = β ‖β‖_∞/‖β‖₂   (6.18)
Further, as discussed in the previous section, the resulting unit ball B₀^D(1) can be mapped to the sphere S^D through T_{B→S}, for which Spherical HMC can be used. The following proposition gives the weights needed for the change of domains from the hyper-rectangle R^D to the sphere S^D.
Proposition 6.3. The Jacobian determinant (weight) of T_{S→R} is as follows:
|dT_{S→R}| = |θ_{D+1}| (‖θ‖₂^D/‖θ‖_∞^D) Π_{i=1}^{D} (u_i − l_i)/2   (6.19)
Proof. First, we note
T_{S→R} = T_{C→R} ∘ T_{B→C} ∘ T_{S→B}:  θ̃ ↦ θ ↦ β′ = θ ‖θ‖₂/‖θ‖_∞ ↦ β = ((u − l)/2) β′ + (u + l)/2
The corresponding Jacobian matrices are
T_{B→C}:  dβ′/dθᵀ = (‖θ‖₂/‖θ‖_∞) [I + θ (θᵀ/‖θ‖₂² − eᵀ_{arg max |θ|}/θ_{arg max |θ|})]
T_{C→R}:  dβ/d(β′)ᵀ = diag((u − l)/2)
where e_{arg max |θ|} is a vector whose (arg max |θ|)-th element is 1 and all others are 0. Therefore,
|dT_{S→R}| = |dT_{C→R}| |dT_{B→C}| |dT_{S→B}| = |dβ/d(β′)ᵀ| |dβ′/dθᵀ| |dθ_B/dθ̃_S| = |θ_{D+1}| (‖θ‖₂^D/‖θ‖_∞^D) Π_{i=1}^{D} (u_i − l_i)/2
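For illustration, the composed map from the hyper-rectangle to the sphere and the weight of Proposition 6.3 can be coded directly; the following Python/NumPy sketch (function names ours) is one way to do it:

import numpy as np

def rect_to_sphere(beta, l, u):
    # T_{R->C}: shift/scale to [-1, 1]^D, then T_{C->B} (6.18), then T_{B->S} (6.1)
    b = (2 * beta - (u + l)) / (u - l)
    theta = b * np.max(np.abs(b)) / np.linalg.norm(b)
    return np.append(theta, np.sqrt(max(1 - theta @ theta, 0.0)))

def rect_weight(theta_tilde, l, u):
    # Jacobian weight |dT_{S->R}| of Proposition 6.3, equation (6.19)
    theta = theta_tilde[:-1]
    D = len(theta)
    return (np.abs(theta_tilde[-1])
            * (np.linalg.norm(theta) / np.max(np.abs(theta))) ** D
            * np.prod((u - l) / 2))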
6.3.1.2 Norm constraints with q ∈ (0, +∞)
A domain constrained by the q-norm, Q^D := {β ∈ ℝ^D : ‖β‖_q ≤ 1} for q ∈ (0, +∞), can be transformed to the unit ball B₀^D(1) bijectively via the following map:
T_{Q→B}: Q^D → B₀^D(1),  β_i ↦ θ_i = sgn(β_i)|β_i|^{q/2}   (6.20)
As before, the unit ball B₀^D(1) can be transformed to the sphere S^D, for which we can use the Spherical HMC method. The following proposition gives the weights needed for the transformation from Q^D to S^D.
Proposition 6.4. The Jacobian determinant (weight) of T_{S→Q} is as follows:
|dT_{S→Q}| = (2/q)^D (Π_{i=1}^{D} |θ_i|)^{2/q−1} |θ_{D+1}|   (6.21)
Proof. Note
T_{S→Q} = T_{B→Q} ∘ T_{S→B}:  θ̃ ↦ θ ↦ β = sgn(θ)|θ|^{2/q}
The Jacobian matrix for T_{B→Q} is
dβ/dθᵀ = (2/q) diag(|θ|^{2/q−1})
Therefore the Jacobian determinant of T_{S→Q} is
|dT_{S→Q}| = |dT_{B→Q}| |dT_{S→B}| = |dβ/dθᵀ| |dθ_B/dθ̃_S| = (2/q)^D (Π_{i=1}^{D} |θ_i|)^{2/q−1} |θ_{D+1}|
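Analogously, a short sketch (Python/NumPy, function names ours) of T_{Q→B} followed by T_{B→S}, together with the weight (6.21):

import numpy as np

def qnorm_to_sphere(beta, q):
    # T_{Q->B} (6.20) followed by T_{B->S} (6.1)
    theta = np.sign(beta) * np.abs(beta) ** (q / 2.0)
    return np.append(theta, np.sqrt(max(1 - theta @ theta, 0.0)))

def qnorm_weight(theta_tilde, q):
    # Jacobian weight |dT_{S->Q}| of Proposition 6.4, equation (6.21)
    theta = theta_tilde[:-1]
    return ((2.0 / q) ** len(theta)
            * np.prod(np.abs(theta)) ** (2.0 / q - 1)
            * np.abs(theta_tilde[-1]))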
6.3.2 Functional constraints
[45] discuss linear and quadratic constraints for the multivariate Gaussian distribution. Since the target distribution is simple, the Hamiltonian dynamics can be simulated exactly and the hitting time can be obtained analytically. As the authors acknowledge, however, most of the computation is spent finding wall-hitting times and bouncing off walls. In this section, we treat this type of constraint by mapping the constrained domain to the sphere S^D, for sampling from general distributions.
6.3.2.1 Linear constraints
In general, M linear constraints can be written as l ≤ Xβ ≤ u, with X an M × D matrix, β a D-vector, and l, u both M-vectors. Assume there are no conflicting inequalities. Take the Singular Value Decomposition (SVD) X = LΣRᵀ, where L_{M×M} and R_{D×D} are both orthogonal matrices and Σ_{M×D} is a rectangular diagonal matrix with positive diagonal entries σ₁, ···, σ_K, where K = rank(X). Notice that these inequalities actually constrain only K variables β* := Rᵀβ. Without loss of generality, we assume X is full rank. For the convenience of discussion, we assume M ≥ D = K; then (XᵀX)_{D×D} is invertible.
Now we can consider the hyper-rectangle type constraints for η := Xβ, namely l ≤ η ≤ u, and apply the same procedure as in section 6.3.1.1 to sample η using Spherical HMC. We then obtain samples of β = (XᵀX)⁻¹Xᵀη, which simplifies to β = X⁻¹η when X is a square invertible matrix. Needless to say, this method does not scale up well if M ≫ D; when that happens, we can directly obtain bounds for β using norm inequalities, which suffices in many scenarios.
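For example, after sampling η under the hyper-rectangle constraint, β can be recovered by ordinary least squares; a one-line sketch (assuming X has full column rank):

import numpy as np

def beta_from_eta(X, eta):
    # beta = (X^T X)^{-1} X^T eta; reduces to X^{-1} eta when X is square
    return np.linalg.solve(X.T @ X, X.T @ eta)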
6.3.2.2 Quadratic constraints
There is no easy solution for general quadratic constraints l ≤ βᵀXβ + bᵀβ ≤ u, where l, u > 0 are scalars. Here we consider X symmetric and positive definite. By the spectral theorem, we have the decomposition X = QΣQᵀ with Q orthogonal and Σ diagonal with positive entries. By shifting and scaling, β ↦ β* = √Σ Qᵀ(β + ½X⁻¹b), we only need to consider the concentric-ball type constraints for β*:
T^D: l* ≤ ‖β*‖₂² = (β*)ᵀβ* ≤ u*,  where l* = l + ¼bᵀX⁻¹b, u* = u + ¼bᵀX⁻¹b   (6.22)
which can further be mapped to the unit ball as follows:
T_{T→B}: B₀^D(√u*) \ B₀^D(√l*) → B₀^D(1),  β* ↦ θ = (β*/‖β*‖₂) (‖β*‖₂ − √l*)/(√u* − √l*)   (6.23)
whose inverse is T_{B→T}: B₀^D(1) → B₀^D(√u*) \ B₀^D(√l*),  θ ↦ β* = (θ/‖θ‖₂) ((√u* − √l*)‖θ‖₂ + √l*).
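Putting the shift, scaling, and (6.23) together, a sketch (Python/NumPy, function name ours) of the map from the quadratically constrained domain to the unit ball, using the completed-square offset ¼bᵀX⁻¹b derived above:

import numpy as np

def quad_to_ball(beta, X, b, l, u):
    # beta* = sqrt(Sigma) Q^T (beta + X^{-1} b / 2), with X = Q Sigma Q^T
    w, Q = np.linalg.eigh(X)
    beta_star = np.sqrt(w) * (Q.T @ (beta + 0.5 * np.linalg.solve(X, b)))
    shift = 0.25 * b @ np.linalg.solve(X, b)      # l* = l + shift, u* = u + shift
    l_s, u_s = np.sqrt(l + shift), np.sqrt(u + shift)
    r = np.linalg.norm(beta_star)
    return beta_star / r * (r - l_s) / (u_s - l_s)   # T_{T->B}, equation (6.23)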
We conclude this section with a comment on general functional constraints. Unless a bijective differentiable mapping from the constrained domain to the sphere exists, Spherical HMC cannot be applied directly. However, one can still find a piecewise linear envelope (e.g., tangent planes) of the domain that can be mapped to the sphere; sampling with Spherical HMC on the envelope and discarding the small portion of samples outside the boundary of the original constraint can still improve efficiency compared to standard HMC with simple truncation.
6.4 Experimental results
In this section, we evaluate our proposed method, Spherical HMC, by comparing its efficiency to that of Random Walk Metropolis (RWM) and Wall HMC using simulated and real data. To this end, we define efficiency in terms of time-normalized effective sample size (ESS, definition 4.5; see section 4.5) [17]. Roughly speaking, ESS can be interpreted as the number of samples that can be regarded as independent. We use the minimum ESS normalized by CPU time as the overall measure of efficiency: min(ESS)/s. All computer codes are available online at http://www.ics.uci.edu/~slan/lanzi/CODES.html.
6.4.1 Truncated Multivariate Gaussian
For illustration purposes, we first start with a truncated bivariate Gaussian distribution,
(β₁, β₂)ᵀ ∼ N(0, [1, .5; .5, 1]),   0 ≤ β₁ ≤ 5,  0 ≤ β₂ ≤ 1
The lower and upper limits are l = (0, 0) and u = (5, 1), respectively. The original rectangular domain can be mapped to the 2-dimensional unit sphere through the following transformation:
T: [0, 5] × [0, 1] → S²,  β ↦ β′ = (2β − (u + l))/(u − l) ↦ θ = β′ ‖β′‖_∞/‖β′‖₂ ↦ θ̃ = (θ, √(1 − ‖θ‖₂²))
Figure 6.2: Density plots of a truncated bivariate Gaussian using the exact density function (left) and MCMC samples from Spherical HMC (right).
The left panel of figure 6.2 shows the heatmap based on the exact density function, and the right panel shows the corresponding heatmap based on MCMC samples from Spherical HMC. Table 6.1 compares the true mean and covariance (computed with the R package 'tmvtnorm' [102]) of the above truncated bivariate Gaussian distribution with the point estimates obtained from RWM, Wall HMC, and Spherical HMC using 100000 MCMC iterations. Overall, all methods provide reasonably good estimates.
Method          Mean              Covariance
Truth           (0.7906, 0.4889)  [0.3269 0.0172; 0.0172 0.0800]
RWM             (0.7764, 0.4891)  [0.3216 0.0152; 0.0152 0.0801]
Wall HMC        (0.7929, 0.4890)  [0.3283 0.0163; 0.0163 0.0800]
Spherical HMC   (0.7925, 0.4892)  [0.3261 0.0170; 0.0170 0.0797]
Table 6.1: Comparing the point estimates of mean and covariance matrix of a bivariate
truncated Gaussian distribution using RWM, Wall HMC, and Spherical HMC.
To evaluate the efficiency of the above three methods (RWM, Wall HMC, and Spherical HMC), we repeat this experiment in higher dimensions, D = 10 and D = 100. As before, we set the mean to zero and set the (i, j)-th element of the covariance matrix to Σ_{ij} = 1/(1 + |i − j|). Further, we impose the following constraints on the parameters:
0 ≤ β₁ ≤ 5;   0 ≤ β_i ≤ 0.5, i ≠ 1.

Dim     Method          AP     s/Iteration   Min(ESS)/s
D=10    RWM             0.64   1.6E-04       8.80
        Wall HMC        0.93   5.8E-04       426.79
        Spherical HMC   0.81   9.7E-04       602.78
D=100   RWM             0.72   1.3E-03       0.06
        Wall HMC        0.94   1.4E-02       14.23
        Spherical HMC   0.88   1.5E-02       40.12

Table 6.2: Sampling efficiency of RWM, Wall HMC, and Spherical HMC for generating samples from truncated Gaussian distributions.
For each method, we obtain 10000 MCMC samples after discarding the initial 1000 samples. We set the tuning parameters of the algorithms such that their overall acceptance rates are within a reasonable range. For RWM, over 95% of the proposed states are rejected for violating the constraints. As shown in table 6.2, Spherical HMC is substantially more efficient than RWM and Wall HMC. On average, Wall HMC bounces off the wall around 7.68 and 31.10 times per iteration for D = 10 and D = 100, respectively. In contrast, by augmenting the parameter space, Spherical HMC handles the constraints in an efficient way.
6.4.2 Bayesian Lasso
In regression analysis, overly complex models tend to overfit the data. Regularized regression models control complexity by imposing a penalty on the model parameters. By far the most popular model in this group is Lasso (least absolute shrinkage and selection operator), proposed by [103]. In this approach, the coefficients are obtained by minimizing the residual sum of squares (RSS) subject to Σ_{j=1}^{D} |β_j| ≤ t.
[104] and [105] have proposed a Bayesian alternative, called Bayesian Lasso. More specifically, the penalty term is replaced by a Laplace prior distribution of the form P(β) ∝ Π_{j=1}^{D} exp(−λ|β_j|), which can be represented as a scale mixture of normal distributions [106]. This leads to a hierarchical Bayesian model with full conditional conjugacy; therefore, the Gibbs sampler can be used for inference.
Our proposed method in this chapter can directly handle the constraint in Lasso. Therefore, we can conveniently use Gaussian priors for the model parameters, β|σ² ∼ N(0, σ²I), and use Spherical HMC with the transformation discussed in section 6.3.1.2.
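Concretely, the q = 1 case of that transformation maps Lasso coefficients with Σ|β_j| ≤ t onto the sphere; a short sketch (Python/NumPy, function name ours):

import numpy as np

def lasso_to_sphere(beta, t):
    # q = 1: scale by t, apply T_{Q->B}, then lift to the sphere
    b = beta / t
    theta = np.sign(b) * np.sqrt(np.abs(b))
    return np.append(theta, np.sqrt(max(1 - theta @ theta, 0.0)))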
Figure 6.3: Bayesian Lasso using three different sampling algorithms: Gibbs sampler (left),
Wall HMC (middle) and Spherical HMC (right)
We now evaluate our method based on the diabetes data set discussed in [104]. Figure 6.3 compares the coefficient estimates given by the Gibbs sampler [104], Wall HMC, and Spherical HMC algorithms as the shrinkage factor s := ‖β̂^{Lasso}‖₁/‖β̂^{OLS}‖₁ changes from 0 to 1. Here, β̂^{OLS} denotes the ordinary least squares (OLS) estimates. For the Gibbs sampler, we choose different λ so that the corresponding s varies from 0 to 1. For Wall HMC and Spherical HMC, we fix the number of leapfrog steps to 10 and set the trajectory length such that they have comparable acceptance rates around 70%.
Figure 6.4 compares the sampling efficiency of these three methods. As we impose tighter constraints (i.e., lower shrinkage factors), our method becomes substantially more efficient than the Gibbs sampler and Wall HMC.
Figure 6.4: Sampling efficiency of different algorithms for Bayesian Lasso based on the diabetes dataset.
6.4.3 Bridge regression
The Lasso model discussed in the previous section is in fact a member of a family of regression models called Bridge regression [104, 107, 108], where the coefficients are obtained by minimizing the residual sum of squares subject to Σ_{j=1}^{D} |β_j|^q ≤ t. For Lasso, q = 1, which allows the model to force some of the coefficients to become exactly zero (i.e., to be excluded from the model).
As mentioned earlier, our Spherical HMC method can easily handle this type of constraint through the following transformation:
T: Q^D → S^D,  β_i ↦ β′_i = β_i/t ↦ θ_i = sgn(β′_i)|β′_i|^{q/2},  θ ↦ θ̃ = (θ, √(1 − ‖θ‖₂²))
Figure 6.5 compares the parameter estimates of Bayesian Lasso to the estimates obtained from two Bridge regression models, with q = 1.2 and q = 0.8, for the diabetes dataset [104] using our Spherical HMC algorithm. As expected, tighter constraints (e.g., q = 0.8) lead to faster shrinkage of the regression parameters as we decrease s.
6.4.4 Modeling synchrony among multiple neurons
[109] have recently proposed a semiparametric Bayesian model to capture dependencies among multiple neurons by detecting their co-firing patterns over time. In this approach, after discretizing time, there is at most one spike in each interval. The resulting sequence of 1's (spike) and 0's (silence) for each neuron is called a spike train, which is denoted as Y and is modeled using the logistic function of a continuous latent variable with a Gaussian process prior. For n neurons, the joint probability distribution of the spike trains Y₁, ..., Yₙ is coupled to the marginal distributions using a parametric copula model.
Figure 6.5: Bayesian Bridge Regression by Spherical HMC: Lasso (q=1, left), q=1.2 (middle),
and q=0.8 (right).
Let H be an n-dimensional distribution function with marginals F₁, ..., Fₙ. In general, an n-dimensional copula is a function of the following form:
H(y₁, ..., yₙ) = C(F₁(y₁), ..., Fₙ(yₙ)),  for all y₁, ..., yₙ
Here, C defines the dependence structure between the marginals. [109] use a special case of the Farlie–Gumbel–Morgenstern (FGM) copula family [110, 111, 112, 113], for which C has the following form:
[1 + Σ_{k=2}^{n} Σ_{1≤j₁<···<j_k≤n} β_{j₁j₂...j_k} Π_{l=1}^{k} (1 − F_{j_l})] Π_{i=1}^{n} F_i
where F_i = F_i(y_i). Restricting the model to second-order interactions, we have
H(y₁, ..., yₙ) = [1 + Σ_{1≤j₁<j₂≤n} β_{j₁j₂} Π_{l=1}^{2} (1 − F_{j_l})] Π_{i=1}^{n} F_i
Here, F_i = P(Y_i ≤ y_i) for the i-th neuron (i = 1, ..., n), where y₁, ..., yₙ denote the firing status of the n neurons at time t. β_{j₁,j₂} captures the relationship between the j₁-th and j₂-th neurons, with β_{j₁,j₂} = 0 interpreted as "no relationship" between the two neurons. To ensure that the probability distribution functions remain within [0, 1], the following constraints
on all C(n, 2) parameters β_{j₁j₂} are imposed:
1 + Σ_{1≤j₁<j₂≤n} β_{j₁j₂} Π_{l=1}^{2} ε_{j_l} ≥ 0,   ε₁, ···, εₙ ∈ {−1, 1}
Considering all possible combinations of ε_{j₁} and ε_{j₂} in the above condition, there are n(n−1) linear inequalities, which can be combined into the following inequality:
Σ_{1≤j₁<j₂≤n} |β_{j₁j₂}| ≤ 1
Figure 6.6: Trace plots of β₁₄ under the rewarded stimulus (RWM, Wall HMC, and Spherical HMC).
Figure 6.7: Trace plots of β₃₄ under the non-rewarded stimulus (RWM, Wall HMC, and Spherical HMC).
For this model, we can use the square-root mapping described in section 6.3.1.2 to transform the original domain (q = 1) of the parameters to the unit ball before using Spherical HMC.
We apply our method to a real dataset from an experiment investigating the role of the prefrontal cortical area in rats with respect to reward-seeking behavior, discussed in [109]. Here, we focus on 5 simultaneously recorded neurons under two scenarios: I) rewarded (pressing a lever delivers 0.1 ml of 15% sucrose solution), and II) non-rewarded (nothing happens after pressing the lever). The copula model detected significant associations among three neurons: the first and fourth neurons (β₁,₄) under the rewarded scenario, and the third and fourth neurons (β₃,₄) under the non-rewarded scenario. All other parameters were deemed non-significant (based on 95% posterior probability intervals).
6.5 Discussion
Scenario   Method          AP     s/Iteration   Min(ESS)/s
I          RWM             0.69   8.2           2.8E-04
           Wall HMC        0.67   17.0          7.0E-03
           Spherical HMC   0.83   17.0          2.0E-02
II         RWM             0.67   8.1           2.8E-04
           Wall HMC        0.75   19.4          1.8E-03
           Spherical HMC   0.81   18.0          2.2E-02

Table 6.3: Comparing sampling efficiencies of RWM, Wall HMC, and Spherical HMC based on the copula model for detecting synchrony among five neurons under the rewarded and non-rewarded stimuli.
The trace plots of β₁₄ under the rewarded stimulus and β₃₄ under the non-rewarded stimulus are provided in figures 6.6 and 6.7, respectively. As we can see in table 6.3, Spherical HMC is orders of magnitude more efficient than RWM and Wall HMC.
6.5 Discussion
We have introduced a new efficient sampling algorithm for constrained distributions. Our method first maps the parameter space to the unit ball and then augments the resulting space to a sphere. A dynamical system is then defined on the sphere to propose new states that are guaranteed to remain within the boundaries imposed by the constraints. We have also shown how our method can be used for other types of constraints after mapping them to the unit ball. Further, by using the splitting strategy, we can improve the computational efficiency of our algorithm: we split the Lagrangian dynamics and solve the corresponding dynamics without requiring the manifold to be embedded in a larger space, which extends [51]. Note that the radii of the ball B and sphere S do not have to be restricted to 1, as assumed in this chapter for the convenience of discussion.
In this chapter, we assumed the Euclidean metric I on the unit ball B₀^D(1). The proposed approach can be extended to more complex metrics, such as the Fisher information metric G_F(θ), in order to exploit the geometric properties of the parameter space [39]. This way, the metric for the augmented space could be defined as G_F(θ) + θθᵀ/θ_{D+1}². Under such a metric, however, we might not be able to find the geodesic flow analytically. Therefore, the added benefit of using the Fisher information metric might be undermined by the resulting computational overhead. See [39, 51] for more discussion.
We have discussed several applications of our method in this chapter. The proposed method can be applied to other problems involving constrained target distributions. Further, the ideas presented here can be employed in other MCMC algorithms.
7 Conclusion
Markov Chain Monte Carlo is a crucial tool for Bayesian statistics, not only because it can handle intractable integration, which is almost omnipresent in modern Bayesian modeling, but also because it naturally provides interval estimates. The wider application of MCMC is, however, hindered by either slow mixing rates or expensive computational cost. Hamiltonian Monte Carlo is an efficient Metropolis–Hastings algorithm: it uses Hamiltonian dynamics to guide the proposal so that the sampler can make several consecutive and systematic moves towards a distant state. Yet the standard HMC algorithm is not efficient or capable enough to handle statistical or machine learning problems that involve certain complicated probability distributions. This dissertation is an attempt to use geometry to help solve these challenges, including computational burden, exploration of complex distribution structure, multimodal distributions, and constrained distributions. The experimental results provided here confirm the potential for substantial improvement over traditional solutions.
Split HMC improves the computational efficiency of HMC by splitting the Hamiltonian
into smaller dynamics, one of which can be simulated exactly or at lower cost, thus allowing a
larger step size and fewer steps. Two scenarios have been discussed: in one case, the
potential energy can be well approximated by a quadratic function so that the dynamics has
a partial analytic solution; in the other, the most influential terms of the potential
and their gradients can be evaluated based on a small subset of data, making the simulation
computationally less expensive. In both scenarios, the original potential energy or its
gradient has to be well approximated to avoid large errors.
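For concreteness, the quadratic-approximation variant can be sketched as follows (a minimal Python sketch under the assumption that $U_1(\theta) = \frac{1}{2}(\theta - \hat\theta)^T\mathrm{diag}(b)(\theta - \hat\theta)$ with b > 0 elementwise; make_gaussian_flow, grad_U2, and exact_flow are illustrative names, not the dissertation's code):

import numpy as np

def make_gaussian_flow(b, theta_hat):
    # Exact flow of the quadratic part H1 = 0.5*(theta - theta_hat)^T diag(b) (theta - theta_hat)
    # plus the kinetic term 0.5*p^T p: a harmonic oscillator, solved coordinatewise.
    w = np.sqrt(b)
    def flow(theta, p, t):
        d = theta - theta_hat
        d_new = d * np.cos(w * t) + (p / w) * np.sin(w * t)
        p_new = -d * w * np.sin(w * t) + p * np.cos(w * t)
        return theta_hat + d_new, p_new
    return flow

def split_hmc_step(theta, p, eps, grad_U2, exact_flow):
    # One step of the split integrator: half kicks from the residual
    # potential U2 = U - U1 sandwich the exact Gaussian flow.
    p = p - 0.5 * eps * grad_U2(theta)
    theta, p = exact_flow(theta, p, eps)
    p = p - 0.5 * eps * grad_U2(theta)
    return theta, p

# Usage: flow = make_gaussian_flow(b, theta_hat)
#        theta, p = split_hmc_step(theta, p, 0.1, grad_U2, flow)

Because the dominant quadratic part is integrated exactly, only the small residual term limits the step size.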
Lagrangian Monte Carlo reduces the computational cost of RHMC by removing the
expensive implicit updates, using velocity instead of momentum. The original Hamiltonian
dynamics on a Riemannian manifold is shown to be equivalent to Lagrangian dynamics,
which is the solution to the variation of the action in physics. A semi-explicit integrator is
derived in the same way as the generalized leapfrog, and is further made fully explicit by a
symmetric construction.
Wormhole HMC is a novel geometric MCMC algorithm designed for sampling from multimodal distributions, a challenging problem in high dimensions. By modifying the metric to create tunnels,
adding an external vector field, and passing through an extra auxiliary dimension, wormholes
facilitate movement between modes and naturally embed the mode-jumping
mechanism in the HMC algorithm. Moreover, with the regeneration technique to allow adaptation,
the sampler can proactively search for unknown modes, as opposed to rediscovering known ones,
and dynamically update the wormhole network on the fly without affecting stationarity.
Spherical HMC provides a natural and efficient framework for sampling from constrained
distributions. It first maps the constrained domain to a unit ball, then augments it to a
sphere in one higher dimension such that the original boundary corresponds to the equator
of the sphere. The sampler defined on the sphere handles the constraints implicitly: by moving
freely on the sphere, it generates proposals that remain within the boundary when mapped
back to the original space. Although we discussed applications of this method using HMC,
the proposed framework can easily be extended to other MCMC samplers.
The work presented here is by no means a comprehensive application of geometry in
Bayesian inference. The author believes that using other geometrically motivated methods
could substantially advance the development of MCMC methodology. Combined with computational
techniques that offset the added cost, such methods could broaden the application of MCMC
to large, complicated problems.
7.1 Future Directions
Even though the methods proposed in this dissertation show the benefits of using geometry
in Bayesian inference, the associated computational overhead cannot be neglected. Occasionally,
the extra computational cost overwhelms the gain (see section 4.5.3). This suggests
that we should attempt to develop better geometric methods and to integrate them more
effectively with computational techniques. In the following, I will point out some possible
future directions.
Matrix Calculation In general, this is a challenging problem in numerical analysis. Many
matrix calculations, e.g., multiplication and inversion, have complexity $O(D^{2.373})$. Therefore,
it is quite expensive to work with full matrices in our proposed methods. To avoid this
issue, we could use sparse or structurally sparse (e.g., tri-diagonal) matrices instead. For
example, we could approximate full matrices with simpler and easier-to-calculate forms
[48]. We can also take advantage of the features of specific problems that involve structured
matrices [114, 115].
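As a small illustration of the potential savings (a Python sketch with an arbitrary tridiagonal metric, not taken from the dissertation's experiments), a banded solver reduces the cost of a linear solve from O(D^3) to O(D):

import numpy as np
from scipy.linalg import solveh_banded

D = 1000
# A symmetric positive-definite tridiagonal "metric": 2 on the diagonal,
# -1 on the off-diagonals, stored in the upper banded format scipy expects.
ab = np.zeros((2, D))
ab[0, 1:] = -1.0   # superdiagonal
ab[1, :] = 2.0     # main diagonal
rhs = np.random.randn(D)

x = solveh_banded(ab, rhs)   # O(D) banded Cholesky solve instead of O(D^3)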
Stochastic Updates When the data volume is extremely large, it is not computationally
practical to directly apply these geometric methods, since each update of the geometric
terms and each acceptance test require scanning all of the data. The idea of
stochastic updates stems from [116], where a stochastic gradient calculated from a uniformly
sampled subset of data is used for optimization. For variational Bayes, [117] develop
stochastic variational inference by solving the variational problem with stochastic
optimization. For MCMC methods, [91, 118] are pioneers in using stochastic gradients to reduce the
computational cost. Their method is based on Langevin dynamics, which is a simpler version
of HMC with a single leapfrog step per iteration. Extending this approach to HMC, however,
might be challenging, since the introduced errors accumulate along the trajectory, rendering
the movement more diffusive. [91] also avoid the acceptance test for the proposal by annealing
the step size during sampling; they show a trade-off between computational
cost and accuracy. [119] point out an interesting approach to reducing the computational
cost of Metropolis-Hastings algorithms by using sequential testing for acceptance tests.
The (stochastic gradient) Langevin versions of the algorithms presented in this dissertation
are worth investigating for more scalable applications.
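For illustration, a minimal Python sketch of the stochastic gradient Langevin update of [91] follows (grad_log_prior, grad_log_lik, and the decay constants are hypothetical placeholders; the polynomially decaying step size is the schedule suggested in [91]):

import numpy as np

def sgld_step(theta, t, X, batch_size, grad_log_prior, grad_log_lik,
              a=1e-2, b=1.0, gamma=0.55):
    # One stochastic gradient Langevin step: minibatch gradient plus
    # injected Gaussian noise with variance equal to the step size.
    N = X.shape[0]
    eps_t = a * (b + t) ** (-gamma)               # decaying step size
    idx = np.random.choice(N, batch_size, replace=False)
    # Unbiased estimate of the full log-posterior gradient from a minibatch.
    grad = grad_log_prior(theta) + (N / batch_size) * grad_log_lik(theta, X[idx])
    noise = np.random.randn(*theta.shape) * np.sqrt(eps_t)
    return theta + 0.5 * eps_t * grad + noise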
Geometric Variational Bayes Variational Bayes relies on iteratively reducing the distance
(Kullback-Leibler divergence) between a variational distribution and the true posterior
distribution. However, K-L divergence is not always the best choice of distance function between
distributions; in fact, it is not a proper distance measure since it is not symmetric.
Besides, K-L divergence can have a complicated form, e.g., the K-L divergence between $N(0, \sigma^2)$
and $N(0, \sigma^2 + \delta^2)$. On the other hand, if we view a family of distributions as a manifold
[46] with a proper metric (e.g., the Fisher metric), we can define their distance as the length
(or simply the squared length, which is twice the energy) of the geodesic connecting them. In this
example, such a distance would be as simple as $\frac{1}{2}(\log(1 + \delta^2/\sigma^2))^2$. One future direction could
be to develop a geometric version of variational Bayes using a geodesic-based distance function.
Variation of energy is a fully developed concept in geometry and could naturally be
adapted to provide an easier alternative to current variational Bayes methods based on K-L
divergence.
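As a quick numerical sanity check of this example (a sketch; the constants are arbitrary illustrative values):

import numpy as np

sigma2, delta2 = 1.0, 0.5

# Fisher length element for N(0, s^2) in the scale parameter s is sqrt(2)/s ds,
# so the geodesic length is sqrt(2) * log(s2/s1); its square matches the
# closed form (1/2) * log(1 + delta2/sigma2)^2 quoted in the text.
L = np.sqrt(2.0) * np.log(np.sqrt((sigma2 + delta2) / sigma2))
closed_form = 0.5 * np.log1p(delta2 / sigma2) ** 2
print(np.isclose(L**2, closed_form))    # True

# The asymmetric K-L divergence KL(N(0, sigma2) || N(0, sigma2 + delta2)),
# for comparison with the symmetric geodesic quantity:
kl = 0.5 * (sigma2 / (sigma2 + delta2) - 1.0 + np.log1p(delta2 / sigma2))
print(kl)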
References

[1] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2):183–233, November 1999.
[2] Tommi S. Jaakkola. Tutorial on Variational Approximation Methods. In Advanced Mean Field Methods: Theory and Practice, pages 129–159. MIT Press, 2000.
[3] R. M. Neal. Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.
[4] Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, 2nd edition, 2004.
[5] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. MIT Press, 1999.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[7] Hagai Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI'99, pages 21–30. Morgan Kaufmann Publishers Inc., 1999.
[8] Hagai Attias. A Variational Bayesian Framework for Graphical Models. In Advances in Neural Information Processing Systems 12, pages 209–215. MIT Press, 2000.
[9] Wim Wiegerinck. Variational approximations between mean field theory and the junction tree algorithm. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, UAI'00, pages 626–633. Morgan Kaufmann Publishers Inc., 2000.
[10] Zoubin Ghahramani and Matthew J. Beal. Propagation Algorithms for Variational Bayesian Learning. In Advances in Neural Information Processing Systems 13, pages 507–513. MIT Press, 2001.
[11] Mark Girolami. A Variational Method for Learning Sparse and Overcomplete Representations. Neural Computation, 13(11):2517–2532, November 2001.
[12] Eric P. Xing, Michael I. Jordan, and Stuart Russell. A generalized mean field algorithm for variational inference in exponential families. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, UAI'03, pages 583–591. Morgan Kaufmann Publishers Inc., 2003.
[13] C. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A Variational Inference Engine for Bayesian Networks. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 777–784. MIT Press, Cambridge, MA, 2003.
[14] C. Kipnis and S. R. S. Varadhan. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Communications in Mathematical Physics, 104:1–19, 1986.
[15] T. P. Straatsma, H. J. C. Berendsen, and A. J. Stam. Estimation of statistical errors in molecular simulation calculations. Molecular Physics, 57:89–95, 1986.
[16] Brian D. Ripley. Stochastic Simulation. John Wiley & Sons, Inc., New York, NY, USA, 1987.
[17] C. J. Geyer. Practical Markov Chain Monte Carlo. Statistical Science, 7(4):473–483, 1992.
[18] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
[19] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
[20] Stuart Geman and Donald Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, November 1984.
[21] Alan E. Gelfand and Adrian F. M. Smith. Sampling-Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association, 85(410):398–409, 1990.
[22] Tommi Jaakkola and Michael I. Jordan. Variational probabilistic inference and the QMR-DT database. Journal of Artificial Intelligence Research, 10:291–322, 1999.
[23] Zoubin Ghahramani and Matthew J. Beal. Variational Inference for Bayesian Mixtures of Factor Analysers. In Advances in Neural Information Processing Systems 12, pages 449–455. MIT Press, 2000.
[24] R. Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 4th edition, August 2010.
[25] Luke Tierney. Markov Chains for Exploring Posterior Distributions. The Annals of Statistics, 22(4):1701–1728, 1994.
[26] John Geweke. Bayesian Inference in Econometric Models Using Monte Carlo Integration. Econometrica, 57(6):1317–1339, 1989.
[27] A. F. M. Smith and A. E. Gelfand. Bayesian Statistics without Tears: A Sampling-Resampling Perspective. The American Statistician, 46(2):84–88, May 1992.
[28] Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan. An Introduction to MCMC for Machine Learning. Machine Learning, 50(1-2):5–43, 2003.
[29] W. R. Gilks, N. G. Best, and K. K. C. Tan. Adaptive Rejection Metropolis Sampling within Gibbs Sampling. Journal of the Royal Statistical Society, Series C (Applied Statistics), 44(4):455–472, 1995.
[30] Yves F. Atchadé and François Perron. Improving on the independent Metropolis-Hastings algorithm. Statistica Sinica, 15:3–18, 2005.
[31] Lars Holden, Ragnar Hauge, and Marit Holden. Adaptive independent Metropolis–Hastings. Annals of Applied Probability, 19(1):395–413, 2009.
[32] Paolo Giordani and Robert Kohn. Adaptive Independent Metropolis–Hastings by Fast Estimation of Mixtures of Normals. Journal of Computational and Graphical Statistics, 19(2):243–259, 2010.
[33] Radford M. Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.
[34] Iain Murray, Ryan Prescott Adams, and David J. C. MacKay. Elliptical slice sampling. JMLR: W&CP, 9:541–548, 2010.
[35] Robert Nishihara, Iain Murray, and Ryan P. Adams. Parallel MCMC with Generalized Elliptical Slice Sampling. http://arxiv.org/abs/1210.7477, 2012.
[36] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
[37] R. M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. Jones, and X. L. Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC, 2010.
[38] M. Hoffman and A. Gelman. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. http://arxiv.org/abs/1111.4246, 2011.
[39] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society, Series B (with discussion), 73(2):123–214, 2011.
[40] A. Beskos, F. J. Pinski, J. M. Sanz-Serna, and A. M. Stuart. Hybrid Monte-Carlo on Hilbert spaces. Stochastic Processes and their Applications, 121:2201–2230, 2011.
[41] Babak Shahbaba, Shiwei Lan, Wesley O. Johnson, and Radford M. Neal. Split Hamiltonian Monte Carlo. Statistics and Computing, pages 1–11, 2013.
[42] Michael Betancourt and Leo C. Stein. The Geometry of Hamiltonian Monte Carlo. http://arxiv.org/abs/1112.4118, December 2011.
[43] Jascha Sohl-Dickstein. Hamiltonian Monte Carlo with Reduced Momentum Flips. http://arxiv.org/abs/1205.1939, May 2012.
[44] Jascha Sohl-Dickstein and Benjamin J. Culpepper. Hamiltonian Annealed Importance Sampling for partition function estimation. http://arxiv.org/abs/1205.1925, May 2012.
[45] A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for Truncated Multivariate Gaussians. arXiv e-prints, August 2013.
[46] S. Amari and H. Nagaoka. Methods of Information Geometry, volume 191 of Translations of Mathematical Monographs. Oxford University Press, 2000.
[47] V. Stathopoulos and M. Girolami. Manifold MCMC for Mixtures. In K. Mengersen, C. P. Robert, and M. D. Titterington, editors, Mixtures: Estimation and Applications, pages 255–276. John Wiley & Sons, Ltd, 2011.
[48] Yichuan Zhang and Charles Sutton. Quasi-Newton Methods for Markov Chain Monte Carlo. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2393–2401, 2011.
[49] S. Lan, V. Stathopoulos, B. Shahbaba, and M. Girolami. Lagrangian Dynamical Monte Carlo. http://arxiv.org/abs/1211.3759, 2012.
[50] Ziyu Wang, Shakir Mohamed, and Nando de Freitas. Adaptive Hamiltonian and Riemann Manifold Monte Carlo Samplers. http://arxiv.org/abs/1302.6182, February 2013.
[51] S. Byrne and M. Girolami. Geodesic Monte Carlo on Embedded Manifolds. arXiv e-prints, January 2013.
[52] R. M. Neal. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4):353, 1996.
[53] E. Marinari and G. Parisi. Simulated tempering: a new Monte Carlo scheme. Europhysics Letters, 19:451–458, 1992.
[54] Charles J. Geyer and Elizabeth A. Thompson. Annealing Markov Chain Monte Carlo With Applications to Ancestral Inference. Journal of the American Statistical Association, 90(431):909–920, September 1995.
[55] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.
[56] P. Diaconis, S. Holmes, and M. Shahshahani. Sampling from a Manifold. In Galin Jones and Xiaotong Shen, editors, Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton, volume 10, pages 102–125. Institute of Mathematical Statistics, 2013.
[57] Marcus A. Brubaker, Mathieu Salzmann, and Raquel Urtasun. A Family of MCMC Methods on Implicitly Defined Manifolds. In Neil D. Lawrence and Mark A. Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), volume 22, pages 161–172, 2012.
[58] Peter J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711–732, 1995.
[59] Arnaud Doucet, Nando de Freitas, and Neil Gordon. An Introduction to Sequential Monte Carlo Methods. In Arnaud Doucet, Nando de Freitas, and Neil Gordon, editors, Sequential Monte Carlo Methods in Practice, Statistics for Engineering and Information Science, pages 3–14. Springer New York, 2001.
[60] Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996.
[61] B. Leimkuhler and S. Reich. Simulating Hamiltonian Dynamics. Cambridge University Press, 2004.
[62] V. I. Arnold. Mathematical Methods of Classical Mechanics. Springer, 2nd edition, May 1989.
[63] L. Verlet. Computer "Experiments" on Classical Fluids. I. Thermodynamical Properties of Lennard-Jones Molecules. Physical Review, 159(1):98–103, 1967.
[64] A. D. Polyanin, V. F. Zaitsev, and A. Moussiaux. Handbook of First Order Partial Differential Equations. Taylor & Francis, London, 2002.
[65] Madeleine B. Thompson. A Comparison of Methods for Computing Autocorrelation Time. Technical Report 1007, University of Toronto, 2010.
[66] D. Ayres de Campos, J. Bernardes, A. Garrido, J. Marques de Sá, and L. Pereira-Leite. SisPorto 2.0: A Program for Automated Analysis of Cardiotocograms. Journal of Maternal-Fetal Medicine, 9:311–318, 2000.
[67] Luke Tierney and Joseph B. Kadane. Accurate Approximations for Posterior Moments and Marginal Densities. Journal of the American Statistical Association, 81(393):82–86, 1986.
[68] Manfredo P. do Carmo. Riemannian Geometry. Birkhäuser, Boston, 1st edition, January 1992.
[69] Richard L. Bishop and Samuel I. Goldberg. Tensor Analysis on Manifolds. Dover Publications, Inc., December 1980.
[70] Jun S. Liu. Monte Carlo Strategies in Scientific Computing, chapter Molecular Dynamics and Hybrid Monte Carlo. Springer-Verlag, 2001.
[71] Jean-Michel Marin, Kerrie L. Mengersen, and Christian Robert. Bayesian modelling and inference on mixtures of distributions. In D. Dey and C. R. Rao, editors, Handbook of Statistics, Volume 25. Elsevier, 2005.
[72] Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, Inc., 2005.
[73] Andreas Dullweber, Benedict Leimkuhler, and Robert McLachlan. Split-Hamiltonian methods for rigid body molecular dynamics. The Journal of Chemical Physics, 107:5840–5852, 1997.
[74] J. C. Sexton and D. H. Weingarten. Hamiltonian evolution for the hybrid Monte Carlo algorithm. Nuclear Physics B, 380(3):665–677, 1992.
[75] Siu A. Chin. Explicit symplectic integrators for solving nonseparable Hamiltonians. Physical Review E, 80:037701, September 2009.
[76] G. Celeux, M. Hurn, and C. P. Robert. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 95:957–970, 2000.
[77] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
[78] D. Rudoy and P. J. Wolfe. Monte Carlo Methods for Multi-Modal Distributions. In Fortieth Asilomar Conference on Signals, Systems and Computers (ACSSC '06), pages 2019–2023, 2006.
[79] C. Sminchisescu and M. Welling. Generalized darting Monte Carlo. Pattern Recognition, 44(10-11), 2011.
[80] Radu V. Craiu, Jeffrey Rosenthal, and Chao Yang. Learn From Thy Neighbor: Parallel-Chain and Regional Adaptive MCMC. Journal of the American Statistical Association, 104(488):1454–1466, 2009.
[81] G. R. Warnes. The normal kernel coupler: An adaptive Markov Chain Monte Carlo method for efficiently sampling from multi-modal distributions. Technical Report 395, University of Washington, 2001.
[82] K. B. Laskey and J. W. Myers. Population Markov Chain Monte Carlo. Machine Learning, 50:175–196, 2003.
[83] G. E. Hinton, M. Welling, and A. Mnih. Wormholes Improve Contrastive Divergence. In Advances in Neural Information Processing Systems 16, 2004.
[84] C. J. F. Ter Braak. A Markov Chain Monte Carlo version of the genetic algorithm Differential Evolution: easy Bayesian computing for real parameter spaces. Statistics and Computing, 16(3):239–249, 2006.
[85] S. Ahn, Y. Chen, and M. Welling. Distributed and adaptive darting Monte Carlo through regenerations. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.
[86] C. Sminchisescu and B. Triggs. Building Roadmaps of Local Minima of Visual Models. In European Conference on Computer Vision, pages 566–582, 2002.
[87] Esa Nummelin. General Irreducible Markov Chains and Non-Negative Operators, volume 83 of Cambridge Tracts in Mathematics. Cambridge University Press, 1984.
[88] Per Mykland, Luke Tierney, and Bin Yu. Regeneration in Markov Chain Samplers. Journal of the American Statistical Association, 90(429):233–241, 1995.
[89] Walter R. Gilks, Gareth O. Roberts, and Sujit K. Sahu. Adaptive Markov Chain Monte Carlo through Regeneration. Journal of the American Statistical Association, 93(443):1045–1054, 1998.
[90] Ioan Andricioaei, John E. Straub, and Arthur F. Voter. Smart darting Monte Carlo. The Journal of Chemical Physics, 114(16):6994–7000, 2001.
[91] M. Welling and Y. W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the International Conference on Machine Learning, 2011.
[92] J. Kleinberg and E. Tardos. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.
[93] A. E. Gelfand and D. K. Dey. Bayesian model choice: Asymptotic and exact calculation. Journal of the Royal Statistical Society, Series B, 56(3):501–514, 1994.
[94] Anthony E. Brockwell and Joseph B. Kadane. Identification of regeneration times in MCMC simulation, with application to adaptive schemes. Journal of Computational and Graphical Statistics, 14:436–458, 2005.
[95] A. T. Ihler, J. W. Fisher III, R. L. Moses, and A. S. Willsky. Nonparametric belief propagation for self-localization of sensor networks. IEEE Journal on Selected Areas in Communications, 23(4):809–819, 2005.
[96] S. P. Brooks and A. Gelman. General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics, 7(4):434–455, 1998.
[97] Peter Neal and Gareth O. Roberts. Optimal scaling for random walk Metropolis on spherically constrained target densities. Methodology and Computing in Applied Probability, 10(2):277–297, June 2008.
[98] Chris Sherlock and Gareth O. Roberts. Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli, 15(3):774–798, August 2009.
[99] Peter Neal, Gareth O. Roberts, and Wai Kong Yuen. Optimal scaling of random walk Metropolis algorithms with discontinuous target densities. Annals of Applied Probability, 22(5):1880–1927, 2012.
[100] Michael Spivak. A Comprehensive Introduction to Differential Geometry, volume 1. Publish or Perish, Inc., Houston, 2nd edition, 1979.
[101] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, USA, 3rd edition, 1996.
[102] Stefan Wilhelm and Manjunath B G. tmvtnorm: Truncated Multivariate Normal and Student t Distribution, 2013. R package version 1.4-8.
[103] Robert Tibshirani. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[104] Trevor Park and George Casella. The Bayesian Lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
[105] Chris Hans. Bayesian lasso regression. Biometrika, 96(4):835–845, 2009.
[106] M. West. On scale mixtures of normal distributions. Biometrika, 74(3):646–648, 1987.
[107] Ildiko E. Frank and Jerome H. Friedman. A Statistical View of Some Chemometrics Regression Tools. Technometrics, 35(2):109–135, 1993.
[108] Nicholas G. Polson, James G. Scott, and Jesse Windle. The Bayesian Bridge. http://arxiv.org/abs/1109.2279v2, 2012.
[109] B. Shahbaba, B. Zhou, H. Ombao, D. Moorman, and S. Behseta. A semiparametric Bayesian model for neural coding. arXiv:1306.6103, 2013.
[110] D. J. G. Farlie. The Performance of Some Correlation Coefficients for a General Bivariate Distribution. Biometrika, 47(3/4), 1960.
[111] E. J. Gumbel. Bivariate Exponential Distributions. Journal of the American Statistical Association, 55:698–707, 1960.
[112] D. Morgenstern. Einfache Beispiele zweidimensionaler Verteilungen [Simple examples of two-dimensional distributions]. Mitteilungsblatt für Mathematische Statistik, 8:234–235, 1956.
[113] Roger B. Nelsen. An Introduction to Copulas. Springer Series in Statistics. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2nd edition, 2006.
[114] Zhen Chen and David B. Dunson. Random Effects Selection in Linear Mixed Models. Biometrics, 59(4):762–769, 2003.
[115] Mohsen Pourahmadi. Cholesky Decompositions and Estimation of A Covariance Matrix: Orthogonality of Variance–Correlation Parameters. Biometrika, 94(4):1006–1013, 2007.
[116] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. Annals of Mathematical Statistics, 22(3):400–407, 1951.
[117] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, May 2013.
[118] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1591–1598, New York, NY, USA, 2012. ACM.
[119] Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget. http://arxiv.org/abs/1304.5299, April 2013.
Appendix A
Lagrangian Monte Carlo
A.1 Equivalence between Riemannian Hamiltonian dynamics and Lagrangian dynamics
Proof of Proposition 4.1. The first equation in (4.9) is directly obtained from the transformation $p \mapsto v$: $\dot\theta^k = g^{kl}p_l = v^k$. For the second equation in (4.9), we have from the definition
\[
\dot p_l = \frac{d(g_{lj}(\theta)v^j)}{dt} = \frac{\partial g_{lj}}{\partial\theta^i}\dot\theta^i v^j + g_{lj}\dot v^j = \partial_i g_{lj}\, v^i v^j + g_{lj}\dot v^j \tag{A.1}
\]
Further, from equation (4.4) we have
\[
\dot p_l = -\partial_l\phi(\theta) + \frac{1}{2}v^T\partial_l G(\theta)v = -\partial_l\phi + \frac{1}{2}\partial_l g_{ij}\, v^i v^j = \partial_i g_{lj}\, v^i v^j + g_{lj}\dot v^j
\]
which means
\[
g_{lj}\dot v^j = -\Big(\partial_i g_{lj} - \frac{1}{2}\partial_l g_{ij}\Big)v^i v^j - \partial_l\phi
\]
Multiplying both sides by $G^{-1} = (g^{kl})$, we have
\[
\dot v^k = \delta^k_j\dot v^j = -g^{kl}\Big(\partial_i g_{lj} - \frac{1}{2}\partial_l g_{ij}\Big)v^i v^j - g^{kl}\partial_l\phi \tag{A.2}
\]
Since $i, j$ are symmetric in the first summand (see equation (A.1)), switching them gives the following equation:
\[
\dot v^k = -g^{kl}\Big(\partial_j g_{li} - \frac{1}{2}\partial_l g_{ji}\Big)v^i v^j - g^{kl}\partial_l\phi \tag{A.3}
\]
The final form of the second equation in (4.9) is then obtained by adding equations (A.2) and (A.3) and dividing the result by two:
\[
\dot v^k = -\Gamma^k_{ij}(\theta)v^i v^j - g^{kl}(\theta)\partial_l\phi(\theta)
\]
Here, $\Gamma^k_{ij}(\theta) := \frac{1}{2}g^{kl}(\partial_i g_{lj} + \partial_j g_{il} - \partial_l g_{ij})$ are the Christoffel symbols of the second kind.
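As a quick symbolic sanity check of this equivalence (our sketch, not part of the original derivation), the one-dimensional case can be verified with sympy: there the metric is a scalar g(θ), the only Christoffel symbol is Γ = g′/(2g), and solving the Hamiltonian update for the acceleration recovers the Lagrangian form:

import sympy as sp

th, v, vdot = sp.symbols('theta v vdot')
g = sp.Function('g')(th)      # scalar metric g(theta)
phi = sp.Function('phi')(th)  # potential phi(theta)

# Hamiltonian side: with p = g*v, the chain rule gives dp/dt = g'*v^2 + g*vdot,
# while (4.4) gives dp/dt = -phi' + (1/2) g' v^2. Solve for vdot:
eq = sp.Eq(g.diff(th) * v**2 + g * vdot,
           -phi.diff(th) + sp.Rational(1, 2) * g.diff(th) * v**2)
vdot_ham = sp.solve(eq, vdot)[0]

# Lagrangian side: vdot = -Gamma v^2 - phi'/g with Gamma = g'/(2g).
Gamma = g.diff(th) / (2 * g)
vdot_lag = -Gamma * v**2 - phi.diff(th) / g

print(sp.simplify(vdot_ham - vdot_lag) == 0)   # True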
A.2 Stationarity of Lagrangian Monte Carlo

Proof of Theorem 4.1. Starting from position $\theta \sim \pi(\theta)$ at time 0, we generate a velocity $v \sim N(0, G(\theta))$. Then $(\theta, v)$ evolves according to our time-reversible integrator $\hat T$ to reach a new state $(\theta^*, v^*)$ with $\theta^* \sim f(\theta^*)$ after the acceptance test. We want to prove that $f(\cdot) = \pi(\cdot)$, which can be done by showing $E_f[h(\theta^*)] = E_\pi[h(\theta^*)]$ for any square-integrable function $h$. Denote $z := (\theta, v)$ and $P(dz) := \exp(-E(z))dz$.

Note that $z^* = (\theta^*, v^*)$ can be reached in two ways: the proposal is either accepted or rejected. Therefore,
\[
E_f[h(\theta^*)] = \int h(\theta^*)\Big[P(d\hat T^{-1}(z^*))\,\tilde\alpha(\hat T^{-1}(z^*), z^*) + P(dz^*)\,\big(1 - \tilde\alpha(z^*, \hat T(z^*))\big)\Big]
= \int h(\theta^*)P(dz^*) + \int h(\theta^*)\Big[P(d\hat T^{-1}(z^*))\,\tilde\alpha(\hat T^{-1}(z^*), z^*) - P(dz^*)\,\tilde\alpha(z^*, \hat T(z^*))\Big]
\]
So it suffices to prove
\[
\int h(\theta^*)P(d\hat T^{-1}(z^*))\,\tilde\alpha(\hat T^{-1}(z^*), z^*) = \int h(\theta^*)P(dz^*)\,\tilde\alpha(z^*, \hat T(z^*)) \tag{A.4}
\]
Denote the involution $\nu: (\theta, v) \mapsto (\theta, -v)$. First, by time reversibility we have $\hat T^{-1}(z^*) = \nu\hat T\nu(z^*)$. Further, we claim $\tilde\alpha(\nu(z), z') = \tilde\alpha(z, \nu(z'))$. This is true because: i) $E$ is quadratic in $v$, so $E(\nu(z)) = E(z)$; ii) $\nu$ is volume-preserving and $\nu \circ \nu = \mathrm{id}$, so $\big|\frac{d\nu(z')}{dz}\big| = \big|\frac{dz'}{d\nu(z)}\big|$. The claim then follows from the definition of the adjusted acceptance probability (4.13) and the equivalence discussed in Proposition 4.4. Therefore
\[
\int h(\theta^*)P(d\hat T^{-1}(z^*))\,\tilde\alpha(\hat T^{-1}(z^*), z^*) = \int h(\theta^*)P(d\nu\hat T\nu(z^*))\,\tilde\alpha(\nu\hat T\nu(z^*), z^*)
= \int h(\theta^*)P(d\hat T\nu(z^*))\,\tilde\alpha(\hat T\nu(z^*), \nu(z^*)) \tag{A.5}
\]
Next, applying the detailed balance condition (4.14) to $\nu(z^*)$ we get
\[
P(d\hat T\nu(z^*))\,\tilde\alpha(\hat T\nu(z^*), \nu(z^*)) = P(d\nu(z^*))\,\tilde\alpha(\nu(z^*), \hat T\nu(z^*))
\]
Substituting this into (A.5) and continuing with the change of variables $\nu(z^*) \mapsto z^*$,
\[
\int h(\theta^*)P(d\nu(z^*))\,\tilde\alpha(\nu(z^*), \hat T\nu(z^*)) = \int h(\theta^*)P(dz^*)\,\tilde\alpha(z^*, \hat T(z^*))
\]
Therefore, equation (A.4) holds, and the proof is complete.
A.3 Convergence of explicit integrator to Lagrangian dynamics

Proof of Proposition 4.7. We first examine how the discretization error $e_n = \|z(t_n) - z^{(n)}\| = \|(\theta(t_n), v(t_n)) - (\theta^{(n)}, v^{(n)})\|$ changes over two consecutive steps (the local error), and then investigate how this error accumulates over multiple steps (the global error).

Assume $f(\theta, v) := v^T\Gamma(\theta)v + G(\theta)^{-1}\nabla_\theta\phi(\theta)$ is smooth; hence $f$ and its derivatives are uniformly bounded as $(\theta, v)$ evolves within a finite time duration $T$. First we expand the true solution $z(t_{n+1})$ at $t_n$:
\[
z(t_{n+1}) = z(t_n) + \dot z(t_n)\varepsilon + \frac{1}{2}\ddot z(t_n)\varepsilon^2 + o(\varepsilon^2)
= \begin{bmatrix}\theta(t_n)\\ v(t_n)\end{bmatrix} + \begin{bmatrix} v(t_n)\\ -f(\theta(t_n), v(t_n))\end{bmatrix}\varepsilon
+ \frac{1}{2}\begin{bmatrix} -f(\theta(t_n), v(t_n)) \\ -\frac{\partial f}{\partial\theta^T}v(t_n) + \frac{\partial f}{\partial v^T}f(\theta(t_n), v(t_n)) \end{bmatrix}\varepsilon^2 + o(\varepsilon^2)
= \begin{bmatrix}\theta(t_n)\\ v(t_n)\end{bmatrix} + \begin{bmatrix} v(t_n)\\ -f(\theta(t_n), v(t_n))\end{bmatrix}\varepsilon + O(\varepsilon^2)
\]
Next, we simplify the expression for the numerical solution $z^{(n+1)} = \begin{bmatrix}\theta^{(n+1)}\\ v^{(n+1)}\end{bmatrix}$ given by the integrator (4.21)-(4.23) and compare it to the true solution above. To this end, we rewrite equation (4.21) as follows:
\[
v^{(n+1/2)} = \Big[I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\Big]^{-1}\Big[v^{(n)} - \frac{\varepsilon}{2}G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\Big]
= v^{(n)} - \frac{\varepsilon}{2}\Big[I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\Big]^{-1}\Big[(v^{(n)})^T\Gamma(\theta^{(n)})v^{(n)} + G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\Big]
= v^{(n)} - \frac{\varepsilon}{2}\Big[I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\Big]^{-1} f(\theta^{(n)}, v^{(n)})
= v^{(n)} - \frac{\varepsilon}{2}f(\theta^{(n)}, v^{(n)}) + \frac{\varepsilon^2}{4}\Big[I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\Big]^{-1}\big[(v^{(n)})^T\Gamma(\theta^{(n)})\big] f(\theta^{(n)}, v^{(n)})
= v^{(n)} - \frac{\varepsilon}{2}f(\theta^{(n)}, v^{(n)}) + O(\varepsilon^2)
\]
Similarly, from equation (4.23) we have
\[
v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}f(\theta^{(n+1)}, v^{(n+1/2)}) + O(\varepsilon^2)
\]
Substituting $v^{(n+1/2)}$ into the above equation, we obtain $v^{(n+1)}$ as follows:
\[
v^{(n+1)} = v^{(n)} - \frac{\varepsilon}{2}f(\theta^{(n)}, v^{(n)}) - \frac{\varepsilon}{2}f(\theta^{(n+1)}, v^{(n)}) + O(\varepsilon^2)
= v^{(n)} - f(\theta^{(n)}, v^{(n)})\varepsilon + \frac{\varepsilon}{2}\big[f(\theta^{(n)}, v^{(n)}) - f(\theta^{(n)} + O(\varepsilon), v^{(n)})\big] + O(\varepsilon^2)
= v^{(n)} - f(\theta^{(n)}, v^{(n)})\varepsilon + O(\varepsilon^2)
\]
From (4.19) and the above equations, we have the following numerical solution:
\[
z^{(n+1)} = \begin{bmatrix}\theta^{(n+1)}\\ v^{(n+1)}\end{bmatrix} = \begin{bmatrix}\theta^{(n)}\\ v^{(n)}\end{bmatrix} + \begin{bmatrix}v^{(n)}\\ -f(\theta^{(n)}, v^{(n)})\end{bmatrix}\varepsilon + O(\varepsilon^2)
\]
Therefore, the local error is
\[
e_{n+1} = \|z(t_{n+1}) - z^{(n+1)}\| = \left\| \begin{bmatrix}\theta(t_n) - \theta^{(n)}\\ v(t_n) - v^{(n)}\end{bmatrix} + \begin{bmatrix} v(t_n) - v^{(n)} \\ -\big[f(\theta(t_n), v(t_n)) - f(\theta^{(n)}, v^{(n)})\big] \end{bmatrix}\varepsilon + O(\varepsilon^2) \right\|
\le (1 + M\varepsilon)e_n + O(\varepsilon^2)
\]
where $M = c\sup_{t\in[0,T]}\|\nabla f(\theta(t), v(t))\|$ for some constant $c > 0$. Accumulating the local errors by iterating the above inequality for $L = T/\varepsilon$ steps provides the following global error:
\[
e_{n+1} \le (1 + M\varepsilon)e_n + O(\varepsilon^2) \le (1 + M\varepsilon)^2 e_{n-1} + 2O(\varepsilon^2) \le \cdots \le (1 + M\varepsilon)^n e_1 + nO(\varepsilon^2)
\le (1 + M\varepsilon)^L\varepsilon + L\,O(\varepsilon^2) \le (e^{MT} + T)\varepsilon \to 0, \quad \text{as } \varepsilon \to 0
\]

Appendix B
Solutions to split Lagrangian dynamics on Sphere
Proof of Proposition 6.2. To solve the first dynamics in (6.14), we note that
\[
\dot\theta_{D+1} = \frac{d}{dt}\sqrt{1 - \|\theta\|_2^2} = -\frac{\theta^T\dot\theta}{\theta_{D+1}} = 0
\]
\[
\dot v_{D+1} = -\frac{d}{dt}\frac{\theta^T v}{\theta_{D+1}} = -\frac{\dot\theta^T v + \theta^T\dot v}{\theta_{D+1}} + \frac{\theta^T v}{\theta_{D+1}^2}\dot\theta_{D+1} = \frac{1}{2}\frac{\theta^T}{\theta_{D+1}}G_S(\theta)^{-1}\nabla_\theta U(\theta)
\]
Therefore, we have
\[
\tilde\theta(t) = \tilde\theta(0)
\]
\[
\tilde v(t) = \tilde v(0) - \frac{t}{2}\begin{bmatrix} I \\ -\frac{\theta(0)^T}{\theta_{D+1}(0)} \end{bmatrix}\big[I - \theta(0)\theta(0)^T\big]\nabla_\theta U(\theta(0))
\]
where
\[
\begin{bmatrix} I \\ -\frac{\theta(0)^T}{\theta_{D+1}(0)} \end{bmatrix}\big[I - \theta(0)\theta(0)^T\big] = \begin{bmatrix} I - \theta(0)\theta(0)^T \\ -\theta_{D+1}(0)\theta(0)^T \end{bmatrix} = \begin{bmatrix} I \\ 0^T \end{bmatrix} - \tilde\theta(0)\theta(0)^T
\]
Note that this dynamics only involves updating the velocity $\tilde v$ in the tangent space $T_{\tilde\theta}S^D$.

The second dynamics in (6.14) only involves the kinetic energy; hence, it is equivalent to the geodesic flow on the sphere $S^D$, with a great circle (orthodrome or Riemannian circle) as its analytical solution. To solve it, we first need to calculate the Christoffel symbols $\Gamma(\theta)$. Note that the $(i, j)$-th element of $G_S$ is $g_{ij} = \delta_{ij} + \theta_i\theta_j/\theta_{D+1}^2$, and the $(i, j, k)$-th element of $dG_S$ is $g_{ij,k} = (\delta_{ik}\theta_j + \theta_i\delta_{jk})/\theta_{D+1}^2 + 2\theta_i\theta_j\theta_k/\theta_{D+1}^4$. Therefore
\[
\Gamma^k_{ij} = \frac{1}{2}g^{kl}\big[g_{lj,i} + g_{il,j} - g_{ij,l}\big]
= \frac{1}{2}(\delta^{kl} - \theta^k\theta^l)\Big[(\delta_{li}\theta_j + \theta_l\delta_{ji})/\theta_{D+1}^2 + (\delta_{ij}\theta_l + \theta_i\delta_{lj})/\theta_{D+1}^2 - (\delta_{il}\theta_j + \theta_i\delta_{jl})/\theta_{D+1}^2 + 2\theta_i\theta_j\theta_l/\theta_{D+1}^4\Big]
= (\delta^{kl} - \theta^k\theta^l)\theta_l/\theta_{D+1}^2\,\big[\delta_{ij} + \theta_i\theta_j/\theta_{D+1}^2\big]
= \theta^k\big[\delta_{ij} + \theta_i\theta_j/\theta_{D+1}^2\big] = [G_S(\theta)\otimes\theta]_{ijk}
\]
Using these results, we can write the second equation, evolving $v$, as $\dot v = -v^T G_S(\theta)v\,\theta = -\|\tilde v\|_2^2\,\theta$. Further, we have
\[
\dot\theta_{D+1} = -\frac{\theta^T\dot\theta}{\theta_{D+1}} = v_{D+1}
\]
\[
\dot v_{D+1} = -\frac{\dot\theta^T v + \theta^T\dot v}{\theta_{D+1}} + \frac{\theta^T v}{\theta_{D+1}^2}\dot\theta_{D+1} = -\|\tilde v\|_2^2\,\theta_{D+1}
\]
Therefore, we can rewrite the geodesic equations (the second dynamics in (6.14)) as
\[
\dot{\tilde\theta} = \tilde v \tag{B.6}
\]
\[
\dot{\tilde v} = -\|\tilde v\|_2^2\,\tilde\theta \tag{B.7}
\]
Multiplying both sides of equation (B.7) by $\tilde v^T$ gives $\frac{d}{dt}\|\tilde v\|_2^2 = 0$, and the rest is straightforward.
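Concretely, since $\|\tilde v\|_2$ is conserved, (B.6)-(B.7) rotate $(\tilde\theta, \tilde v)$ along a great circle at constant speed. A minimal Python sketch of the resulting exact flow (illustrative names, not the dissertation's implementation):

import numpy as np

def geodesic_flow_sphere(theta_tilde, v_tilde, t):
    # Exact solution of (B.6)-(B.7): rotation along the great circle
    # spanned by theta_tilde and v_tilde; the speed ||v|| is conserved.
    speed = np.linalg.norm(v_tilde)
    if speed == 0.0:
        return theta_tilde, v_tilde
    u = v_tilde / speed                        # unit tangent direction
    theta_new = theta_tilde * np.cos(speed * t) + u * np.sin(speed * t)
    v_new = -theta_tilde * speed * np.sin(speed * t) + v_tilde * np.cos(speed * t)
    return theta_new, v_new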