Markov Chain Monte Carlo From Lagrangian Dynamics

Shiwei LAN, Vasileios STATHOPOULOS, Babak SHAHBABA, and Mark GIROLAMI
Hamiltonian Monte Carlo (HMC) improves the computational efficiency of the
Metropolis–Hastings algorithm by reducing its random walk behavior. Riemannian
HMC (RHMC) further improves the performance of HMC by exploiting the geometric
properties of the parameter space. However, the geometric integrator used for RHMC
involves implicit equations that require fixed-point iterations. In some cases, the computational overhead for solving implicit equations undermines RHMC’s benefits. In an
attempt to circumvent this problem, we propose an explicit integrator that replaces the
momentum variable in RHMC by velocity. We show that the resulting transformation is
equivalent to transforming Riemannian Hamiltonian dynamics to Lagrangian dynamics.
Experimental results suggest that our method improves RHMC’s overall computational
efficiency in the cases considered. All computer programs and datasets are available online (http://www.ics.uci.edu/babaks/Site/Codes.html) to allow replication of the results
reported in this article.
Key Words: Explicit integrator; Hamiltonian Monte Carlo; Riemannian manifold.
1. INTRODUCTION
Hamiltonian Monte Carlo (HMC; Duane et al. 1987) reduces the random walk behavior
of the Metropolis–Hastings algorithm by proposing samples that are distant from the current
state, but nevertheless have a high probability of acceptance. These distant proposals
are found by numerically simulating Hamiltonian dynamics for some specified amount
of fictitious time (Neal 2010). Hamiltonian dynamics can be represented by a function,
known as the Hamiltonian, of model parameters θ and auxiliary momentum parameters
p ∼ N (0, M) (with the same dimension as θ ) as follows:
$$H(\theta, p) = -\log p(\theta) + \frac{1}{2}\,p^T M^{-1} p, \qquad (1)$$

where M is a symmetric, positive-definite mass matrix.
Shiwei Lan (E-mail: slan@uci.edu) is a postdoc scholar and Babak Shahbaba (E-mail: babaks@uci.edu)
is Associate Professor, Department of Statistics, University of California, Irvine, Irvine, CA 92697.
Vasileios Stathopoulos (E-mail: v.stathopoulos@ucl.ac.uk) is Research Associate and Mark Girolami (E-mail:
m.girolami@warwick.ac.uk) is Professor of Statistics, University of Warwick, Coventry CV4 7AL, UK.
© 2015 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America
Journal of Computational and Graphical Statistics, Volume 24, Number 2, Pages 357–378
DOI: 10.1080/10618600.2014.902764
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/jcgs.
Hamilton’s equations, which involve differential equations derived from H, determine
how θ and p change over time. In practice, however, solving these equations exactly is
difficult in general, so we need to approximate them by discretizing time, using some small
step size ε. For this purpose, the leapfrog method is commonly used.
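For illustration, here is a minimal Python sketch of a leapfrog trajectory (our own illustration, not code from this article; the toy Gaussian target and the function names are assumptions):

```python
import numpy as np

def leapfrog(theta, p, grad_log_p, eps, L, M_inv):
    """One leapfrog trajectory for the Hamiltonian in Equation (1)."""
    p = p + 0.5 * eps * grad_log_p(theta)          # initial half-step for the momentum
    for i in range(L):
        theta = theta + eps * M_inv @ p            # full step for the position
        step = eps if i < L - 1 else 0.5 * eps     # final momentum update is a half-step
        p = p + step * grad_log_p(theta)
    return theta, p

# Toy usage: standard normal target with an identity mass matrix.
grad_log_p = lambda th: -th                        # gradient of log N(0, I)
theta_new, p_new = leapfrog(np.ones(2), np.random.randn(2), grad_log_p,
                            eps=0.1, L=10, M_inv=np.eye(2))
```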
The simulation of Hamiltonian dynamics is restricted by the smallest eigen-direction, requiring a small step size to maintain the stability of the numerical discretization. Girolami and Calderhead
(2011) proposed a new method, called Riemannian HMC (RHMC), which exploits the
geometric properties of the parameter space to improve the efficiency of standard HMC,
especially in sampling distributions with complex structure (e.g., high correlation, non-Gaussian shape). Simulating the resulting dynamics, however, is computationally intensive
since it involves solving two implicit equations, which require additional iterative numerical
computation (e.g., fixed-point iteration).
In an attempt to increase the speed of RHMC, we propose a new integrator that is
completely explicit: we propose to replace momentum with velocity in the definition of the
Riemannian Hamiltonian dynamics. As we will see, this is equivalent to using Lagrangian
dynamics as opposed to Hamiltonian dynamics. By doing so, we eliminate one of the
implicit steps in RHMC. Next, we construct a time symmetric integrator to remove the
remaining implicit step in RHMC. This leads to a valid sampling scheme (i.e., converges to
the true target distribution) that involves only explicit equations. We refer to this algorithm
as Lagrangian Monte Carlo (LMC).
In what follows, we begin with a brief review of RHMC and its geometric integrator.
Section 3 introduces our proposed semi-explicit integrator based on defining Hamiltonian
dynamics in terms of velocity as opposed to momentum. Next, in Section 4, we eliminate
the remaining implicit equation and propose a fully explicit integrator. In Section 5, we use
simulated and real data to evaluate our methods’ performance. Finally, in Section 6, we
discuss some possible future research directions.
2. RIEMANNIAN HAMILTONIAN MONTE CARLO
As discussed above, although HMC explores the parameter space more efficiently than
random walk Metropolis (RWM) does, it does not fully exploit the geometric properties
of parameter space defined by the density p(θ). Indeed, Girolami and Calderhead (2011)
argued that dynamics over Euclidean space may not be appropriate to guide the exploration
of parameter space. To address this issue, they proposed a new method, called RHMC, which
exploits the Riemannian geometry of the parameter space (Amari and Nagaoka 2000)
to improve standard HMC’s efficiency by automatically adapting to the local structure.
They did this by using a position-specific mass matrix M = G(θ ). More specifically, they
set G(θ ) to the Fisher information matrix E[∇ log p(θ )∇ log p(θ )T ]. As a result, p|θ ∼
N(0, G(θ)), and the Hamiltonian is defined as follows:

$$H(\theta, p) = -\log p(\theta) + \frac{1}{2}\log\det G(\theta) + \frac{1}{2}\,p^T G(\theta)^{-1} p = \phi(\theta) + \frac{1}{2}\,p^T G(\theta)^{-1} p, \qquad (2)$$

where $\phi(\theta) := -\log p(\theta) + \frac{1}{2}\log\det G(\theta)$. Based on this Hamiltonian, Girolami and Calderhead (2011) proposed the following Hamiltonian dynamics on the Riemannian manifold:
$$\dot{\theta} = \nabla_p H(\theta, p) = G(\theta)^{-1} p$$
$$\dot{p} = -\nabla_\theta H(\theta, p) = -\nabla_\theta\phi(\theta) + \frac{1}{2}\nu(\theta, p). \qquad (3)$$

With the shorthand notation ∂ᵢ = ∂/∂θⁱ for partial derivative, the ith element of the vector ν(θ, p) is

$$(\nu(\theta, p))_i = -p^T\partial_i(G(\theta)^{-1})p = (G(\theta)^{-1}p)^T\,\partial_i G(\theta)\,G(\theta)^{-1}p.$$
The above equations of dynamics are nonseparable (they contain products of θ and p),
and the resulting map (θ, p) → (θ ∗ , p∗ ) based on the standard leapfrog method is neither
time-reversible nor symplectic. Therefore, we cannot use the standard leapfrog algorithm
(Girolami and Calderhead 2011). Instead, they use the Störmer–Verlet (Verlet 1967) method
as follows:
$$p^{(n+1/2)} = p^{(n)} - \frac{\varepsilon}{2}\left[\nabla_\theta\phi(\theta^{(n)}) - \frac{1}{2}\nu(\theta^{(n)}, p^{(n+1/2)})\right] \qquad (4)$$
$$\theta^{(n+1)} = \theta^{(n)} + \frac{\varepsilon}{2}\left[G^{-1}(\theta^{(n)}) + G^{-1}(\theta^{(n+1)})\right]p^{(n+1/2)} \qquad (5)$$
$$p^{(n+1)} = p^{(n+1/2)} - \frac{\varepsilon}{2}\left[\nabla_\theta\phi(\theta^{(n+1)}) - \frac{1}{2}\nu(\theta^{(n+1)}, p^{(n+1/2)})\right], \qquad (6)$$

where ε is the size of the time step.
This is also known as generalized leapfrog, which can be derived by concatenating a
symplectic Euler-B integrator of (3) with its adjoint symplectic Euler-A integrator (see
more details in Leimkuhler and Reich 2004). The above series of transformations are (i)
deterministic, (ii) reversible, and (iii) volume-preserving. Therefore, the effective proposal
distribution is a delta function δ((θ (1) , p(1) ), (θ (L+1) , p(L+1) )) and the acceptance probability
is simply

$$\exp\!\left(H(\theta^{(1)}, p^{(1)}) - H(\theta^{(L+1)}, p^{(L+1)})\right). \qquad (7)$$
Here, (θ (1) , p(1) ) is the current state, and (θ (L+1) , p(L+1) ) is the proposed sample after L
leapfrog steps.
As an illustrative example, Figure 1 shows the sampling paths of RWM, HMC, and
RHMC for an artificially created banana-shaped distribution. (See Girolami and Calderhead
2011, discussion by Luke Bornn and Julien Cornebise.) For this example, we fix the
trajectory and choose the step sizes such that the acceptance probability for all three
methods remains around 0.7. RWM moves slowly and spends most of its iterations in the
distribution’s low-density tail, and HMC explores the parameter space in an indirect way,
while RHMC moves directly to the high-density region and explores the distribution more
efficiently.
Figure 1. The first 10 iterations in sampling from a banana-shaped distribution with random walk Metropolis (RWM), Hamiltonian Monte Carlo (HMC), and Riemannian HMC (RHMC). For all three methods, the trajectory length (i.e., step size ε times number of integration steps L) is set to 1. For RWM, L = 17, for HMC, L = 7, and for RHMC, L = 5. Solid red lines are the sampling paths, and black circles are the accepted proposals.

One major drawback of this geometric integrator, which is both time-reversible and volume-preserving, is that it involves two implicit functions: Equations (4) and (5). These functions require extra numerical analysis (e.g., fixed-point iteration), which results in higher computational cost and simulation error. This is especially true when solving θ(n+1), because the fixed-point iteration for (5) repeatedly inverts the matrix G. To address this problem, we propose an alternative approach that uses velocity instead of momentum in the equations of motion.
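To make this cost concrete, the following sketch (our own illustration, not the authors' released code) solves the implicit position update (5) by fixed-point iteration; note that every iteration re-inverts G at the provisional position:

```python
import numpy as np

def implicit_position_update(theta_n, p_half, G, eps, n_fixed_point=5):
    """Fixed-point iteration for Equation (5), where theta^(n+1) appears on both sides."""
    G_inv_n = np.linalg.inv(G(theta_n))              # fixed part, computed once
    theta_next = theta_n.copy()                      # initialize at the current position
    for _ in range(n_fixed_point):
        G_inv_next = np.linalg.inv(G(theta_next))    # re-inverted at every iteration
        theta_next = theta_n + 0.5 * eps * (G_inv_n + G_inv_next) @ p_half
    return theta_next

# Toy metric, purely for illustration.
G = lambda th: np.eye(2) * (1.0 + th @ th)
theta_next = implicit_position_update(np.zeros(2), np.array([0.3, -0.1]), G, eps=0.1)
```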
3. MOVING FROM MOMENTUM TO VELOCITY
In the equations of Hamiltonian dynamics (3), the term G(θ )−1 p appears several times.
This motivates us to reparameterize the dynamics in terms of velocity, v = G(θ)−1 p. Note
that this in fact corresponds to the usual definition of velocity in physics, that is, momentum
divided by mass. The transformation p → v changes the Hamiltonian dynamics (3) to the
following form (derivation in Appendix A):
$$\dot{\theta} = v$$
$$\dot{v} = -\eta(\theta, v) - G(\theta)^{-1}\nabla_\theta\phi(\theta), \qquad (8)$$

where η(θ, v) is a vector whose kth element is $\sum_{i,j}\Gamma^k_{ij}(\theta)v^iv^j$. Here, $\Gamma^k_{ij}(\theta) := \frac{1}{2}\sum_l g^{kl}(\partial_i g_{lj} + \partial_j g_{il} - \partial_l g_{ij})$ are the Christoffel symbols, where $g_{ij}$ and $g^{ij}$ denote the (i, j)th elements of G(θ) and G(θ)⁻¹, respectively.
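For concreteness, a small sketch (ours) of computing the Christoffel symbols and the quadratic term η(θ, v); the finite-difference approximation of ∂G is an assumption made only to keep the example self-contained (analytic derivatives would normally be used):

```python
import numpy as np

def christoffel(G, theta, h=1e-5):
    """Second-kind Christoffel symbols Gamma[k, i, j] for a metric function G(theta)."""
    D = theta.size
    dG = np.empty((D, D, D))                           # dG[l, i, j] = d g_{ij} / d theta^l
    for l in range(D):
        e = np.zeros(D); e[l] = h
        dG[l] = (G(theta + e) - G(theta - e)) / (2.0 * h)   # central finite differences
    G_inv = np.linalg.inv(G(theta))
    # T[l, i, j] = d_i g_{lj} + d_j g_{il} - d_l g_{ij}
    T = np.transpose(dG, (1, 0, 2)) + np.transpose(dG, (2, 1, 0)) - dG
    return 0.5 * np.einsum('kl,lij->kij', G_inv, T)

def eta(G, theta, v):
    """Quadratic velocity term eta(theta, v)^k = Gamma^k_{ij} v^i v^j from Equation (8)."""
    Gamma = christoffel(G, theta)
    return np.einsum('kij,i,j->k', Gamma, v, v)

# Toy metric for illustration only.
G = lambda th: np.diag(1.0 + th ** 2)
print(eta(G, np.array([0.5, -0.2]), np.array([1.0, 2.0])))
```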
This transformation moves the complexity of the dynamics from the first equation, which updates θ, to the second equation, which spends more time finding a “good” direction v. By resolving
the implicitness of updating θ in the generalized leapfrog method, we reduce the associated
computational cost. Concatenating the Euler-B integrator with its adjoint Euler-A integrator
(Leimkuhler and Reich 2004, see also Appendix B) leads to the following semi-explicit
integrator:
$$v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\left[\eta(\theta^{(n)}, v^{(n+1/2)}) + G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right] \qquad (9)$$
$$\theta^{(n+1)} = \theta^{(n)} + \varepsilon\,v^{(n+1/2)} \qquad (10)$$
$$v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\left[\eta(\theta^{(n+1)}, v^{(n+1/2)}) + G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right]. \qquad (11)$$
Note that Equation (9) updating v remains implicit (more details are available in Appendix
B).
The introduction of velocity v in place of p is also advocated by Beskos et al. (2011)
to avoid large values of p for the sake of numerical stability. They considered a constant
mass so the resulting dynamics is still Hamiltonian. In general, however, the new dynamics
(8) cannot be recognized as Hamiltonian dynamics of (θ, v); rather, it is equivalent to the
following Euler–Lagrange equation of the second kind:
$$\frac{d}{dt}\frac{\partial L}{\partial\dot{\theta}} = \frac{\partial L}{\partial\theta},$$

which is the solution to variation of the Lagrangian $L = \frac{1}{2}v^T G(\theta)v - \phi(\theta)$. That is, in our case,

$$\ddot{\theta} = -\eta(\theta, \dot{\theta}) - G(\theta)^{-1}\nabla_\theta\phi(\theta).$$
Therefore, we refer to the new dynamics (8) as Lagrangian dynamics. Although the
resulting dynamics is not Hamiltonian, it nevertheless remains a valid proposal-generating
mechanism, which preserves the original Hamiltonian H (θ , p = G(θ )v) (proof in Appendix
A); thus, the acceptance probability is only determined by the discretization error from the
numerical integration as usual (to be discussed below).
Because p|θ ∼ N (0, G(θ )), the distribution of v|θ is N (0, G(θ )−1 ). We define the energy
function E(θ , v) as the sum of the potential energy, U (θ) = − log p(θ ) and kinetic energy
K(θ, v) = − log(p(v|θ)):
$$E(\theta, v) = -\log p(\theta) - \frac{1}{2}\log\det G(\theta) + \frac{1}{2}v^T G(\theta)v. \qquad (12)$$
Note that the integrator (9)–(11) is (i) reversible and (ii) energy-preserving up to a
global error with order O(ε), where ε is the step size. (See Proposition B.1 in Appendix
B and a proof of distribution invariance in Appendix B.1.) The resulting map, however,
is not volume-preserving. Therefore, the acceptance probability based on E(θ, v) must
be adjusted with the determinant of the transformation to ensure that a detailed balance
condition is satisfied (Green 1995),
$$\alpha_{\mathrm{sLMC}} = \min\{1,\ \exp(-E(\theta^{(L+1)}, v^{(L+1)}) + E(\theta^{(1)}, v^{(1)}))\,|\det J_{\mathrm{sLMC}}|\},$$

where J_sLMC is the Jacobian matrix of (θ(1), v(1)) → (θ(L+1), v(L+1)) according to (9)–(11), with the following determinant (see more in Appendix B.2):

$$\det J_{\mathrm{sLMC}} := \left|\frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\right| = \prod_{n=1}^{L}\frac{\det\!\left(I - \varepsilon\,\Omega(\theta^{(n+1)}, v^{(n+1/2)})\right)}{\det\!\left(I + \varepsilon\,\Omega(\theta^{(n)}, v^{(n+1/2)})\right)}.$$

Here, $\Omega(\theta^{(n+1)}, v^{(n+1/2)})$ is a matrix whose (i, j)th element is $\sum_k v^k_{(n+1/2)}\Gamma^i_{kj}(\theta^{(n+1)})$.
The dynamics is now defined by the semi-explicit integrator (9)–(11). Algorithm 1 (see
pseudocode in Appendix D) provides the corresponding steps. We refer to this approach
as semi-explicit Lagrangian Monte Carlo (sLMC), which has a physical interpretation as
exploring the parameter space along the path on a Riemannian manifold that minimizes
the action (total Lagrangian). In contrast to RHMC, which augments the parameter space with momentum, sLMC augments it with velocity. In Section 5, we use several
experiments to show that switching from momentum to velocity may lead to improvements
in computational efficiency in some cases.
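As a rough illustration of one sLMC leapfrog step (a sketch under our own simplifying assumptions, not the released implementation; the callable Gamma(θ), returning the array of Christoffel symbols Γ^k_{ij}, is an interface we assume):

```python
import numpy as np

def slmc_step(theta, v, eps, G, grad_phi, Gamma, n_fixed_point=5):
    """One semi-explicit (sLMC) leapfrog step, Equations (9)-(11), plus the per-step
    log-Jacobian factor given above. G, grad_phi, Gamma are callables returning
    G(theta), the gradient of phi(theta), and the Christoffel array Gamma[k,i,j]."""
    # (9): implicit velocity half-step, solved by fixed-point iteration.
    f = np.linalg.solve(G(theta), grad_phi(theta))
    v_half = v.copy()
    for _ in range(n_fixed_point):
        eta = np.einsum('kij,i,j->k', Gamma(theta), v_half, v_half)
        v_half = v - 0.5 * eps * (eta + f)
    # (10): explicit position update.
    theta_new = theta + eps * v_half
    # (11): explicit velocity half-step at the new position.
    f_new = np.linalg.solve(G(theta_new), grad_phi(theta_new))
    eta_new = np.einsum('kij,i,j->k', Gamma(theta_new), v_half, v_half)
    v_new = v_half - 0.5 * eps * (eta_new + f_new)
    # Log-Jacobian contribution: Omega(theta, v)_{kj} = sum_i v^i Gamma^k_{ij}(theta).
    Om_old = np.einsum('i,kij->kj', v_half, Gamma(theta))
    Om_new = np.einsum('i,kij->kj', v_half, Gamma(theta_new))
    I = np.eye(theta.size)
    log_jac = (np.linalg.slogdet(I - eps * Om_new)[1]
               - np.linalg.slogdet(I + eps * Om_old)[1])
    return theta_new, v_new, log_jac

# Toy usage with a constant metric (Gamma identically zero), for shape-checking only.
D = 2
print(slmc_step(np.zeros(D), np.ones(D), 0.1,
                lambda th: np.eye(D), lambda th: th,
                lambda th: np.zeros((D, D, D))))
```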
4. EXPLICIT LAGRANGIAN MONTE CARLO
To further resolve the remaining implicit Equation (9), we modify η(θ(n), v(n+1/2)) to be
an asymmetric function of both v(n) and v(n+1/2) such that v(n+1/2) can be solved explicitly
(see details in Appendix C). We now propose a fully explicit integrator for Lagrangian
dynamics (8) as follows:
$$v^{(n+1/2)} = \left[I + \frac{\varepsilon}{2}\Omega(\theta^{(n)}, v^{(n)})\right]^{-1}\left[v^{(n)} - \frac{\varepsilon}{2}G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right] \qquad (13)$$
$$\theta^{(n+1)} = \theta^{(n)} + \varepsilon\,v^{(n+1/2)} \qquad (14)$$
$$v^{(n+1)} = \left[I + \frac{\varepsilon}{2}\Omega(\theta^{(n+1)}, v^{(n+1/2)})\right]^{-1}\left[v^{(n+1/2)} - \frac{\varepsilon}{2}G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right]. \qquad (15)$$
In Appendix C.1, we prove that the solution obtained from this integrator numerically
converges to the true solution of the Lagrangian dynamics. This integrator is (i) reversible
and (ii) energy-preserving up to a global error with order O(ε). The resulting map is not
volume-preserving, as the Jacobian determinant of (θ(1), v(1)) → (θ(L+1), v(L+1)) by (13)–(15)
is (see details in Appendix C.2)
$$\det J_{\mathrm{LMC}} := \prod_{n=1}^{L}\frac{\det\!\left(G(\theta^{(n+1)}) - \frac{\varepsilon}{2}\tilde\Omega(\theta^{(n+1)}, v^{(n+1)})\right)\det\!\left(G(\theta^{(n)}) - \frac{\varepsilon}{2}\tilde\Omega(\theta^{(n)}, v^{(n+1/2)})\right)}{\det\!\left(G(\theta^{(n+1)}) + \frac{\varepsilon}{2}\tilde\Omega(\theta^{(n+1)}, v^{(n+1/2)})\right)\det\!\left(G(\theta^{(n)}) + \frac{\varepsilon}{2}\tilde\Omega(\theta^{(n)}, v^{(n)})\right)}.$$

Here, $\tilde\Omega(\theta, v)$ denotes $G(\theta)\Omega(\theta, v)$, whose (k, j)th element is equal to $\sum_i v^i\tilde\Gamma_{ijk}(\theta)$, with $\tilde\Gamma_{ijk}(\theta) = g_{kl}\Gamma^l_{ij}(\theta) = \frac{1}{2}(\partial_i g_{kj} + \partial_j g_{ik} - \partial_k g_{ij})$.
must be adjusted as follows:
$$\alpha_{\mathrm{LMC}} = \min\{1,\ \exp(-E(\theta^{(L+1)}, v^{(L+1)}) + E(\theta^{(1)}, v^{(1)}))\,|\det J_{\mathrm{LMC}}|\}.$$
We refer to this approach, which involves the fully explicit integrator (13)–(15), as LMC. Algorithm 2 (see pseudocode in Appendix D) shows the corresponding steps for this method. In both Algorithms 1 and 2, the position update is relatively simple, while the computational time is dominated by choosing the “right” direction (velocity) using the geometry of the parameter space. In sLMC, solving θ explicitly reduces the computational cost by (F − 1)O(D^2.373), where F is the number of fixed-point iterations, and D is the number of parameters. This is because each fixed-point iteration takes O(D^2.373) elementary linear algebraic operations to invert G(θ). The connection terms Γ̃(θ) in Ω̃ do not add substantial computational cost since they are obtained by permuting three dimensions of the array ∂G(θ), which is also computed in RHMC. The additional price of the determinant adjustment is O(D^2.373).
LMC avoids the fixed-point iteration method. Therefore, it reduces computation by
(F − 1)O(D^2). Further, it resolves possible convergence issues associated with using the
fixed-point iteration method. However, because it involves additional matrix inversions to
update v, its benefits could be undermined occasionally. This is evident from our experimental results presented in Section 5.
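For illustration, here is a sketch (ours, not the released code) of the explicit velocity half-step (13), in which a single linear solve replaces the fixed-point loop; the Gamma(θ) interface is an assumption carried over from the earlier sketches:

```python
import numpy as np

def lmc_velocity_half_step(theta, v, eps, G, grad_phi, Gamma):
    """Explicit velocity update of Equation (13): one linear solve instead of a
    fixed-point iteration. Gamma(theta) returns the Christoffel array Gamma[k,i,j]."""
    D = theta.size
    Omega = np.einsum('i,kij->kj', v, Gamma(theta))     # Omega_{kj} = v^i Gamma^k_{ij}
    rhs = v - 0.5 * eps * np.linalg.solve(G(theta), grad_phi(theta))
    return np.linalg.solve(np.eye(D) + 0.5 * eps * Omega, rhs)

# Toy usage with a flat metric (Gamma = 0); this reduces to a plain half-step on v.
D = 2
v_half = lmc_velocity_half_step(np.zeros(D), np.ones(D), 0.1,
                                lambda th: np.eye(D), lambda th: th,
                                lambda th: np.zeros((D, D, D)))
```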
5. EXPERIMENTAL RESULTS
In this section, we use both simulated and real data to evaluate our methods, sLMC
and LMC, compared to standard HMC and RHMC. Following Girolami and Calderhead (2011), we use a time-normalized effective sample size (ESS) to compare these
methods. For B posterior samples, we calculate $\mathrm{ESS} = B\,[1 + 2\sum_{k=1}^{K}\gamma(k)]^{-1}$ for each parameter, where $\sum_{k=1}^{K}\gamma(k)$ is the sum of K monotone sample autocorrelations (Geyer 1992). Minimum, median, and maximum values of ESS over all parameters are provided for comparing different algorithms. More specifically, we use the minimum ESS
normalized by CPU time (sec), min(ESS)/sec, as the measure of sampling efficiency.
All computer programs and datasets discussed in this article are available online at
http://www.ics.uci.edu/babaks/Site/Codes.html.
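A rough sketch (ours) of the time-normalized ESS computation; truncating the autocorrelation sum at the first non-positive term is a simplification of Geyer's (1992) monotone-sequence estimator, used here only for illustration:

```python
import numpy as np

def ess(x):
    """Effective sample size B / (1 + 2 * sum of autocorrelations), truncating the
    sum at the first non-positive autocorrelation (a simplification for illustration)."""
    x = np.asarray(x, dtype=float)
    B = x.size
    xc = x - x.mean()
    acf = np.correlate(xc, xc, mode='full')[B - 1:] / (np.arange(B, 0, -1) * x.var())
    s = 0.0
    for k in range(1, B):
        if acf[k] <= 0:
            break
        s += acf[k]
    return B / (1.0 + 2.0 * s)

def min_ess_per_sec(samples, seconds):
    """min(ESS)/sec over the columns (parameters) of a B x D sample matrix."""
    return min(ess(samples[:, d]) for d in range(samples.shape[1])) / seconds
```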
5.1 SIMULATING A BANANA-SHAPED DISTRIBUTION
The banana-shaped distribution, which we used above for illustration, can be constructed
as the posterior distribution of θ = (θ1 , θ2 )|y based on the following model:
$$y\,|\,\theta \sim N(\theta_1 + \theta_2^2,\ \sigma_y^2)$$
$$\theta \sim N(0,\ \sigma_\theta^2 I_2).$$

The data $\{y_i\}_{i=1}^{100}$ are generated with $\theta_1 + \theta_2^2 = 1$, $\sigma_y = 2$, and $\sigma_\theta = 1$.
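For reference, a sketch (ours) of the log-posterior and a position-specific metric for this example; the simulated data below and the choice of G(θ) as the likelihood Fisher information plus the prior precision are our own assumptions, not necessarily the exact settings used in the experiments:

```python
import numpy as np

# Simulated data following the model above (sigma_y = 2, sigma_theta = 1),
# with mean theta1 + theta2^2 = 1 as stated in the text.
rng = np.random.default_rng(0)
sigma_y, sigma_theta, n = 2.0, 1.0, 100
y = 1.0 + sigma_y * rng.standard_normal(n)

def log_post(theta):
    """Unnormalized log-posterior of the banana-shaped example."""
    mu = theta[0] + theta[1] ** 2
    return (-np.sum((y - mu) ** 2) / (2 * sigma_y ** 2)
            - np.sum(theta ** 2) / (2 * sigma_theta ** 2))

def metric(theta):
    """Assumed G(theta): Fisher information of the likelihood plus prior precision."""
    J = np.array([1.0, 2.0 * theta[1]])          # derivative of the mean w.r.t. theta
    return (n / sigma_y ** 2) * np.outer(J, J) + np.eye(2) / sigma_theta ** 2
```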
As we can see in Figure 2, similar to RHMC, sLMC and LMC explore the parameter
space efficiently by adapting to its local geometry. Table 1 compares the performances
of these algorithms based on 20,000 Markov chain Monte Carlo (MCMC) iterations after
5000 burn-in. For this specific example, sLMC has the best performance followed by LMC.
As discussed above, although LMC is fully explicit, its numerical benefits (obtained by
removing implicit equations) can be negated in certain examples since it involves additional
matrix inversion operations to update v. The histograms of posterior samples shown in
Figure 3 confirm that our algorithms converge to the true posterior distributions of θ1 and
θ2 , whose density functions are shown as red solid curves.
Table 1. Comparing alternative methods using a banana-shaped distribution

Method   AP     sec        ESS (min., med., max.)   min(ESS)/sec
HMC      0.79   6.96e-04   (288, 614, 941)          20.65
RHMC     0.78   4.56e-03   (4514, 5779, 7044)       49.50
sLMC     0.84   7.90e-04   (2195, 3476, 4757)       138.98
LMC      0.73   7.27e-04   (1139, 2409, 3678)       78.32

NOTE: For each method, we provide the acceptance probability (AP), the CPU time (sec) for each iteration, ESS (min., med., max.), and the time-normalized ESS.
Figure 2. The first 10 iterations in sampling from the banana-shaped distribution with Riemannian HMC
(RHMC), semi-explicit Lagrangian Monte Carlo (sLMC), and explicit LMC (LMC). For all three methods, the
trajectory length (i.e., step size times number of integration steps) is set to 1.45 and number of integration steps
is set to 10. Solid red lines show the sampling path, and each point represents an accepted proposal.
5.2 LOGISTIC REGRESSION MODELS
Next, we evaluate our methods based on five binary classification problems used in
Girolami and Calderhead (2011). These are Australian Credit data, German Credit data,
Heart data, Pima Indian data, and Ripley data. These datasets are publicly available from
the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). For each problem,
we use a logistic regression model,
$$p(y_i = 1\,|\,x_i, \beta) = \frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)}, \qquad i = 1, \ldots, n,$$
$$\beta \sim N(0,\ 100\,I),$$
Figure 3. Histograms of 1 million posterior samples of θ1 and θ2 for the banana-shaped distribution using RHMC
(left), sLMC (middle), and LMC (right). Solid red curves are the true density functions.
Table 2. Comparing alternative methods using five binary classification problems discussed in Girolami and Calderhead (2011)

Data                          Method   AP     sec        ESS (min., med., max.)    min(ESS)/sec
Australian (D = 14, N = 690)  HMC      0.75   6.13E-03   (1225, 3253, 10691)       13.32
                              RHMC     0.72   2.96E-02   (7825, 9238, 9797)        17.62
                              sLMC     0.83   2.17E-02   (10184, 13001, 13735)     31.29
                              LMC      0.75   1.60E-02   (9636, 10443, 11268)      40.17
German (D = 24, N = 1000)     HMC      0.74   1.31E-02   (766, 4006, 15000)        3.90
                              RHMC     0.76   6.55E-02   (14886, 15000, 15000)     15.15
                              sLMC     0.71   4.13E-02   (13395, 15000, 15000)     21.64
                              LMC      0.70   3.74E-02   (13762, 15000, 15000)     24.54
Heart (D = 13, N = 270)       HMC      0.71   1.75E-03   (378, 850, 2624)          14.44
                              RHMC     0.73   2.12E-02   (6263, 7430, 8191)        19.68
                              sLMC     0.77   1.30E-02   (10318, 11337, 12409)     52.73
                              LMC      0.76   1.15E-02   (10347, 10724, 11773)     59.80
Pima (D = 7, N = 532)         HMC      0.85   5.75E-03   (887, 4566, 12408)        10.28
                              RHMC     0.81   1.64E-02   (4349, 4693, 5178)        17.65
                              sLMC     0.81   8.98E-03   (4784, 5437, 5592)        35.50
                              LMC      0.82   7.90E-03   (4839, 5193, 5539)        40.84
Ripley (D = 2, N = 250)       HMC      0.88   1.50E-03   (820, 3077, 15000)        36.39
                              RHMC     0.74   1.09E-02   (12876, 15000, 15000)     78.83
                              sLMC     0.80   6.79E-03   (15000, 15000, 15000)     147.38
                              LMC      0.79   5.36E-03   (12611, 15000, 15000)     157.02

NOTES: For each dataset, the number of predictors, D, and the number of observations, N, are specified. For each method, we provide the acceptance probability (AP), the CPU time (sec) for each iteration, ESS (min., med., max.), and the time-normalized ESS.
where yi is a binary outcome for the ith observation, x i is the corresponding vector of
predictors (with the first element equal to 1), and β is the set of regression parameters.
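To make the geometric quantities concrete, here is a sketch (ours, not the paper's code) of the log-posterior and a Fisher-information metric for this model; taking G(β) to be XᵀΛX plus the prior precision is a standard choice for Bayesian logistic regression, but the exact form used in the experiments is not restated here, so treat the constants below as assumptions:

```python
import numpy as np

def log_post(beta, X, y, prior_var=100.0):
    """Log-posterior of the logistic regression model (up to an additive constant)."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    return loglik - beta @ beta / (2.0 * prior_var)

def metric(beta, X, prior_var=100.0):
    """Assumed metric: Fisher information X^T Lambda X plus the prior precision,
    where Lambda = diag(p * (1 - p))."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    Lam = p * (1.0 - p)
    return (X * Lam[:, None]).T @ X + np.eye(X.shape[1]) / prior_var

# Tiny synthetic usage (two predictors plus an intercept column of ones).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.standard_normal((50, 2))])
y = rng.integers(0, 2, size=50).astype(float)
print(log_post(np.zeros(3), X, y), metric(np.zeros(3), X).shape)
```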
We use standard HMC, RHMC, sLMC, and LMC to simulate 20,000 posterior samples
for β. We fix the trajectory length for different algorithms, and tune the step sizes so that
they have comparable acceptance rates. Results (after discarding the initial 5000 iterations)
are summarized in Table 2, and show that in general our methods improve the sampling
efficiency measured in terms of minimum ESS per second compared to RHMC on these
examples.
5.3 RESULTS FOR MULTIVARIATE t-DISTRIBUTIONS
The computational complexity of standard HMC is O(D). This is substantially lower
than O(D 2.373 ), which is the computational complexity of the three geometrically motivated
methods discussed here (RHMC, sLMC, and LMC). On the other hand, these three methods
could have substantially better mixing rates compared to standard HMC, whose mixing
time is mainly determined by the condition number of the target distribution defined as the
ratio of the maximum and minimum eigenvalues of its covariance matrix: λmax /λmin .
In this section, we illustrate how the efficiency of these sampling algorithms changes as
the condition number varies using multivariate t-distributions with the following density
function:

$$p(x) = \frac{\Gamma((\nu + D)/2)}{\Gamma(\nu/2)}(\pi\nu)^{-D/2}\,|\Sigma|^{-1/2}\left[1 + \frac{1}{\nu}x^T\Sigma^{-1}x\right]^{-(\nu + D)/2}, \qquad (16)$$

where ν is the degrees of freedom and D is the dimension.

Figure 4. Left: sampling efficiency, Min(ESS)/s, versus the condition number for a fixed dimension (D = 20). Right: sampling efficiency versus dimension for a fixed condition number (λmax/λmin = 10,000). Each algorithm is tuned to have an acceptance rate around 70%. Results are based on 5000 samples after discarding the initial 1000 samples.

In our first simulation, we fix the dimension at D = 20 and vary the condition number of Σ from 10 to 10⁵. As the condition
number increases, one can expect HMC to be more restricted by the smallest eigen-direction,
whereas RHMC, sLMC, and LMC would adapt to the local geometry. Results presented
in Figure 4 (left panel) show that this is in fact the case: for higher condition numbers,
geometrically motivated methods perform substantially better than standard HMC. Note
that our two proposed algorithms, sLMC and LMC, provide substantial improvements over
RHMC.
For our second simulation, we fix the condition number at 10,000 and let the dimension change from 10 to 50. Our results (Figure 4, right panel) show that the gain from exploiting geometric properties of the target distribution could eventually be undermined as the
dimension increases.
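As an aside on the setup of this experiment, one simple way (ours, not necessarily the authors') to construct a covariance Σ with a prescribed condition number is to fix the eigenvalue range and apply a random rotation:

```python
import numpy as np

def covariance_with_condition_number(D, cond, rng=None):
    """Random D x D covariance whose eigenvalues are spaced between 1 and `cond`,
    so that lambda_max / lambda_min equals `cond`."""
    rng = np.random.default_rng() if rng is None else rng
    eigvals = np.linspace(1.0, cond, D)
    Q, _ = np.linalg.qr(rng.standard_normal((D, D)))    # random orthogonal matrix
    return Q @ np.diag(eigvals) @ Q.T

Sigma = covariance_with_condition_number(20, 1e4)
print(np.linalg.cond(Sigma))                            # approximately 10000
```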
5.4 FINITE MIXTURE OF GAUSSIANS
Finally, we consider finite mixtures of univariate Gaussian components of the form
$$p(x_i\,|\,\theta) = \sum_{k=1}^{K}\pi_k\,N(x_i\,|\,\mu_k, \sigma_k^2), \qquad (17)$$
where θ is the vector of size D = 3K − 1 of all the parameters πk , μk , and σk2 and
N (·|μ, σ 2 ) is a Gaussian density with mean μ and variance σ 2 . A common choice of prior
Figure 5. Densities used to generate synthetic datasets. From left to right the densities are in the same order as
in Table 3. The densities are taken from McLachlan and Peel (2000).
takes the form

$$p(\theta) = D(\pi_1, \ldots, \pi_K\,|\,\lambda)\prod_{k=1}^{K}N(\mu_k\,|\,m, \beta^{-1}\sigma_k^2)\,IG(\sigma_k^2\,|\,b, c), \qquad (18)$$
where D(·|λ) is the symmetric Dirichlet distribution with parameter λ, and IG(·|b, c) is the
inverse Gamma distribution with shape parameter b and scale parameter c.
Although the posterior distribution associated with this model is formally explicit, it is
computationally intractable, since it can be expressed as a sum of K^N terms corresponding
to all possible allocations of observations xi to mixture components (Marin, Mengersen,
and Robert 2005, chap. 9). We want to use this model to test the efficiency of posterior
sampling θ using the four methods. A more extensive comparison of Riemannian Manifold
MCMC and HMC, Gibbs sampling, and standard Metropolis–Hastings for finite Gaussian
mixture models can be found at Stathopoulos and Girolami (2011). Due to the nonanalytic
nature of the expected Fisher information, I(θ), we use the empirical Fisher information
as metric tensor (McLachlan and Peel 2000, chap. 2),

$$G(\theta) = S^T S - \frac{1}{N}\,s\,s^T,$$

where the N × D score matrix S has elements $S_{i,d} = \frac{\partial\log p(x_i|\theta)}{\partial\theta_d}$ and $s = \sum_{i=1}^{N}S_{i,\cdot}$.
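A compact sketch (ours) of the empirical Fisher information above, given the N × D score matrix:

```python
import numpy as np

def empirical_fisher(score_matrix):
    """Empirical Fisher information G = S^T S - (1/N) s s^T, with s the column sum of S."""
    S = np.asarray(score_matrix, dtype=float)
    N = S.shape[0]
    s = S.sum(axis=0)
    return S.T @ S - np.outer(s, s) / N

# Toy usage with a random score matrix of N = 100 observations and D = 8 parameters.
G = empirical_fisher(np.random.default_rng(2).standard_normal((100, 8)))
print(G.shape)
```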
We show five Gaussian mixtures in Table 3 and Figure 5 and compare sampling efficiency
of HMC, RHMC, sLMC, and LMC using simulated datasets in Table 4. As before, our
two algorithms outperform RHMC.
Table 3. Densities used for the generation of synthetic mixture of Gaussian datasets

Dataset name   Density function                                                                   Num. of parameters
Kurtotic       (2/3) N(x | 0, 1) + (1/3) N(x | 0, (1/10)²)                                        5
Bimodal        (1/2) N(x | −1, (2/3)²) + (1/2) N(x | 1, (2/3)²)                                   5
Skewed         (3/4) N(x | 0, 1) + (1/4) N(x | 3/2, (1/3)²)                                       5
Trimodal       (9/20) N(x | −6/5, (3/5)²) + (9/20) N(x | 6/5, (3/5)²) + (1/10) N(x | 0, (1/4)²)   8
Claw           (1/2) N(x | 0, 1) + Σ_{i=0}^{4} (1/10) N(x | i/2 − 1, (1/10)²)                     17
Table 4. Acceptance probability (AP), seconds per iteration (sec), ESS (min., med., max.), and time-normalized ESS for Gaussian mixture models

Data       Method   AP     sec        ESS (min., med., max.)   min(ESS)/sec
Claw       HMC      0.88   7.01E-01   (1916, 3761, 4970)       0.54
           RHMC     0.80   5.08E-01   (1524, 3474, 4586)       0.60
           sLMC     0.86   3.76E-01   (2531, 4332, 5000)       1.35
           LMC      0.82   2.92E-01   (2436, 3455, 4608)       1.67
Trimodal   HMC      0.77   3.43E-01   (2244, 2945, 3159)       1.30
           RHMC     0.79   9.94E-02   (4701, 4928, 5000)       9.46
           sLMC     0.82   4.02E-02   (4978, 5000, 5000)       24.77
           LMC      0.80   4.84E-02   (4899, 4982, 5000)       20.21
Skewed     HMC      0.83   1.78E-01   (2915, 3237, 3630)       3.27
           RHMC     0.85   5.10E-02   (5000, 5000, 5000)       19.63
           sLMC     0.82   2.26E-02   (4698, 4940, 5000)       41.68
           LMC      0.84   2.52E-02   (4935, 5000, 5000)       39.09
Kurtotic   HMC      0.78   2.85E-01   (3013, 3331, 3617)       2.11
           RHMC     0.82   4.72E-02   (5000, 5000, 5000)       21.20
           sLMC     0.85   2.54E-02   (5000, 5000, 5000)       39.34
           LMC      0.81   2.70E-02   (5000, 5000, 5000)       36.90
Bimodal    HMC      0.73   1.61E-01   (2923, 2991, 3091)       3.62
           RHMC     0.86   5.38E-02   (5000, 5000, 5000)       18.56
           sLMC     0.81   2.06E-02   (4935, 4996, 5000)       48.00
           LMC      0.85   2.06E-02   (5000, 5000, 5000)       46.43

NOTES: Results are calculated on a 5000-sample chain with a 5000-sample burn-in session. For HMC the burn-in session was 20,000 samples to ensure convergence.
6. CONCLUSIONS AND DISCUSSION
Following the method of Girolami and Calderhead (2011) for more efficient exploration
of parameter space, we have proposed new sampling schemes to reduce the computational
cost associated with using a position-specific mass matrix. To this end, we have developed
a semi-explicit (sLMC) integrator and a fully explicit (LMC) integrator for RHMC and
demonstrated their advantage in improving computational efficiency over the generalized
leapfrog (RHMC) method used by Girolami and Calderhead (2011). It is easy to show that
if G(θ) ≡ M, our method reduces to standard HMC.
Compared to HMC, whose local and global errors are O(ε3 ) and O(ε2 ), respectively,
LMC’s local error is O(ε2 ), and its global error is O(ε) (proof in Appendix C.1). Although
the numerical solutions converge to the true solutions of the corresponding dynamics at a
slower rate for LMC compared to HMC, in general, the approximation remains adequate
leading to reasonably high acceptance rates while providing a more computationally efficient sampling mechanism. Compared to RHMC, our LMC method has the additional
advantage of being more stable by avoiding implicit updates relying on the fixed-point
iteration method: RHMC could occasionally give highly divergent solutions, especially for
ill-conditioned metrics, G(θ).
Future directions could involve splitting the Hamiltonian (Dullweber, Leimkuhler, and McLachlan 1997; Sexton and Weingarten 1992; Neal 2010; Shahbaba et al. 2014) to develop explicit geometric integrators. For example, one could split a nonseparable Hamiltonian dynamic
into several smaller dynamics some of which can be solved analytically. A similar idea has
been explored by Chin (2009), where the Hamiltonian, instead of the dynamics, is split.
Because our methods involve costly matrix inversions, another possible research direction could be to approximate the mass matrix (and the Christoffel symbols as well) to
reduce computational cost. For many high-dimensional problems, the mass matrix could
be appropriately approximated by a highly sparse or structured sparse (e.g., tridiagonal)
matrix. This could further improve our method’s computational efficiency.
APPENDICES: DERIVATIONS AND PROOFS
In what follows, we show the detailed derivations of our methods. We adopt Einstein notation, so whenever an index appears twice in a mathematical expression, we sum over it: for example, $a_i b^i := \sum_i a_i b^i$, $\Gamma^k_{ij}v^iv^j := \sum_{i,j}\Gamma^k_{ij}v^iv^j$. A lower index is used for a covariant tensor, whose components vary by the same transformation as the change of basis (e.g., a gradient), whereas an upper index is reserved for a contravariant tensor, whose components vary in the opposite way as the change of basis to compensate (e.g., a velocity vector). Interested readers should refer to Bishop and Goldberg (1980).
APPENDIX A: TRANSFORMATION OF HAMILTONIAN DYNAMICS
To derive the dynamics (8) from the Hamiltonian dynamics (3), the first equation in (8) is directly obtained from the assumed transformation: $\dot\theta^k = g^{kl}p_l = v^k$. For the second equation in (8), we have

$$\dot p_l = \frac{d(g_{lj}(\theta)v^j)}{dt} = \frac{\partial g_{lj}}{\partial\theta^i}\dot\theta^i v^j + g_{lj}\dot v^j = \partial_i g_{lj}\,v^iv^j + g_{lj}\dot v^j.$$

Further, from Equation (3) we have

$$\dot p_l = -\partial_l\phi(\theta) + \frac{1}{2}v^T\partial_l G(\theta)v = -\partial_l\phi + \frac{1}{2}g_{ij,l}v^iv^j = \partial_i g_{lj}\,v^iv^j + g_{lj}\dot v^j,$$

which means

$$g_{lj}\dot v^j = -\left(\partial_i g_{lj} - \frac{1}{2}\partial_l g_{ij}\right)v^iv^j - \partial_l\phi.$$

By multiplying $G^{-1} = (g^{kl})$ on both sides, we have

$$\dot v^k = \delta^k_j\dot v^j = -g^{kl}\left(\partial_i g_{lj} - \frac{1}{2}\partial_l g_{ij}\right)v^iv^j - g^{kl}\partial_l\phi. \qquad (A.1)$$

Since i, j are symmetric in the first summand, switching them gives the following equation:

$$\dot v^k = -g^{kl}\left(\partial_j g_{li} - \frac{1}{2}\partial_l g_{ji}\right)v^iv^j - g^{kl}\partial_l\phi, \qquad (A.2)$$

which in turn gives the final form of Equation (8) after adding Equations (A.1) and (A.2) and dividing the result by two:

$$\dot v^k = -\Gamma^k_{ij}(\theta)v^iv^j - g^{kl}(\theta)\partial_l\phi(\theta).$$
Here, $\Gamma^k_{ij}(\theta) := \frac{1}{2}g^{kl}(\partial_i g_{lj} + \partial_j g_{il} - \partial_l g_{ij})$ are the Christoffel symbols of the second kind.
Note that the new dynamics (8) still preserves the original Hamiltonian H(θ, p = G(θ)v). This is of course intuitive, but it also can be proven as follows:

$$\frac{d}{dt}H(\theta, G(\theta)v) = \dot\theta^T\frac{\partial}{\partial\theta}H(\theta, G(\theta)v) + \dot v^T\frac{\partial}{\partial v}H(\theta, G(\theta)v)$$
$$= v^T\nabla_\theta\phi(\theta) + \frac{1}{2}v^T\big(v^T\partial G(\theta)v\big) + \left[-v^T\Gamma(\theta)v - G(\theta)^{-1}\nabla_\theta\phi(\theta)\right]^T G(\theta)v$$
$$= \left[v^T\nabla_\theta\phi(\theta) - (\nabla_\theta\phi(\theta))^T v\right] + \left[\frac{1}{2}v^T\big(v^T\partial G(\theta)v\big) - \big(v^T\tilde\Gamma(\theta)v\big)^T v\right]$$
$$= 0 + 0 = 0,$$

where $v^T\Gamma(\theta)v$ is a vector whose kth element is $\Gamma^k_{ij}(\theta)v^iv^j$. The second 0 is due to the triple form $\big(v^T\tilde\Gamma(\theta)v\big)^T v = \tilde\Gamma_{ijk}v^iv^jv^k = \frac{1}{2}\partial_k g_{ij}\,v^iv^jv^k$, where $\tilde\Gamma$ is the Christoffel symbol of the first kind, with elements $\tilde\Gamma_{ijk}(\theta) := g_{kl}\Gamma^l_{ij}(\theta) = \frac{1}{2}(\partial_i g_{kj} + \partial_j g_{ik} - \partial_k g_{ij})$.
APPENDIX B: DERIVATION OF SEMI-EXPLICIT LAGRANGIAN
MONTE CARLO (SLMC)
To construct a time-reversible integrator for (8), we concatenate a half-step of the following Euler-B integrator of (8) (Leimkuhler and Reich 2004, chap. 4):

$$\theta^{(n+1/2)} = \theta^{(n)} + \frac{\varepsilon}{2}v^{(n+1/2)}$$
$$v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\left[(v^{(n+1/2)})^T\Gamma(\theta^{(n)})v^{(n+1/2)} + G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right]$$

with another half-step of its adjoint Euler-A integrator:

$$\theta^{(n+1)} = \theta^{(n+1/2)} + \frac{\varepsilon}{2}v^{(n+1/2)}$$
$$v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\left[(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})v^{(n+1/2)} + G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right].$$

Then, we obtain the following semi-explicit integrator:

$$v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\left[(v^{(n+1/2)})^T\Gamma(\theta^{(n)})v^{(n+1/2)} + G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right]$$
$$\theta^{(n+1)} = \theta^{(n)} + \varepsilon\,v^{(n+1/2)}$$
$$v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\left[(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})v^{(n+1/2)} + G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right].$$
The resulting integrator, however, is no longer volume-preserving (see Appendix B.2). Nevertheless, based on Proposition B.1, we can still have detailed balance after determinant adjustment (also see Green 1995).

Proposition B.1 (Detailed balance condition with determinant adjustment). Denote z = (θ, v) and z′ = T̂_L(z) for some time-reversible integrator T̂_L of the Lagrangian dynamics. If the acceptance probability is adjusted in the following way:

$$\tilde\alpha(z, z') = \min\left\{1,\ \frac{\exp(-H(z'))}{\exp(-H(z))}\left|\det\frac{\partial\hat T_L(z)}{\partial z}\right|\right\}, \qquad (B.1)$$

then the detailed balance condition still holds:

$$\tilde\alpha(z, z')P(dz) = \tilde\alpha(z', z)P(dz'). \qquad (B.2)$$
Proof. With $z = \hat T_L^{-1}(z')$,

$$\tilde\alpha(z, z')P(dz) = \min\left\{1,\ \frac{\exp(-H(z'))}{\exp(-H(z))}\left|\frac{dz'}{dz}\right|\right\}\exp(-H(z))\,dz = \min\left\{\exp(-H(z)),\ \exp(-H(z'))\left|\frac{dz'}{dz}\right|\right\}dz$$
$$= \min\left\{\exp(-H(z))\left|\frac{dz}{dz'}\right|,\ \exp(-H(z'))\right\}dz' = \min\left\{1,\ \frac{\exp(-H(z))}{\exp(-H(z'))}\left|\frac{dz}{dz'}\right|\right\}\exp(-H(z'))\,dz' = \tilde\alpha(z', z)P(dz').$$
exp(−H (z )) dz Note, the acceptance probability should be calculated based on H (θ, G(θ)v). However, it could
also be calculated as follows based on the energy function E(θ, v) defined in Section 3. To show their
,p )
)) ∂(θ ,v )
| = det(G(θ
|
|and prove as follows:
equivalence, we note that | ∂(θ
∂(θ ,p)
det(G(θ )) ∂(θ ,v)
exp(−H (θ , G(θ )v )) ∂(θ , p ) exp(−H (θ , p )) ∂(θ , p ) =
min
1,
α̃ = min 1,
exp(−H (θ, p)) ∂(θ , p) exp(−H (θ, G(θ)v)) ∂(θ, p) exp{−(log p(θ ) + 12 log det G(θ ) + 12 v T G(θ )v )} det(G(θ )) ∂(θ , v ) = min 1,
exp{−(log p(θ) + 12 log det G(θ) + 12 vT G(θ)v)} det(G(θ)) ∂(θ, v) exp{−(log p(θ ) − 12 log det G(θ ) + 12 v T G(θ )v )} ∂(θ , v ) = min 1,
exp{−(log p(θ) − 12 log det G(θ) + 12 vT G(θ)v)} ∂(θ, v) exp(−E(θ , v )) ∂(θ , v ) = min 1,
.
exp(−E(θ, v)) ∂(θ, v) B.1 DISTRIBUTION INVARIANCE WITH VOLUME CORRECTION
Now with proposition B.1, we can prove that the Markov chain derived by our reversible integrator
with adjusted acceptance probability (B.1) converges to the true target density. One can also find a similar proof in Liu (2001, chap. 9).
Starting from position θ ∼ p(θ|X) at time 0, we generate a velocity v ∼ N (0, G(θ)). Then evolve
(θ , v) according to our time-reversible integrator T̂ to reach a new state (θ ∗ , v∗ ) with θ ∗ ∼ f (θ ∗ ). We
want to prove that f (·) = p(·|X), which can be done by showing Ef [h(θ ∗ )] = Eθ |X [h(θ ∗ )] for any
square integrable function h. Denote z := (θ, v) and P (dz) := exp(−E(z))dz.
Note that z∗ = (θ ∗ , v∗ ) can be reached from two scenarios: either the proposal is accepted or
rejected. Therefore,
$$E_f[h(\theta^*)] = \int h(\theta^*)\left[P(d\hat T^{-1}(z^*))\tilde\alpha(\hat T^{-1}(z^*), z^*) + P(dz^*)\big(1 - \tilde\alpha(z^*, \hat T(z^*))\big)\right]$$
$$= \int h(\theta^*)P(dz^*) + \int h(\theta^*)\left[P(d\hat T^{-1}(z^*))\tilde\alpha(\hat T^{-1}(z^*), z^*) - P(dz^*)\tilde\alpha(z^*, \hat T(z^*))\right].$$
So it suffices to prove

$$\int h(\theta^*)P(d\hat T^{-1}(z^*))\tilde\alpha(\hat T^{-1}(z^*), z^*) = \int h(\theta^*)P(dz^*)\tilde\alpha(z^*, \hat T(z^*)). \qquad (B.3)$$
Denote the involution ν: (θ, v) → (θ, −v). First, by time reversibility we have $\hat T^{-1}(z^*) = \nu\hat T\nu(z^*)$. Further, we claim $\tilde\alpha(\nu(z), z') = \tilde\alpha(z, \nu(z'))$. This is true because: (i) E is quadratic in v, so E(ν(z)) = E(z); (ii) the Jacobian factors agree, $\left|\frac{dz'}{d\nu(z)}\right| = \left|\frac{d\nu(z')}{dz}\right|$, since ν only flips the sign of v. The claim then follows from the definition of the adjusted acceptance probability (B.1) and the equivalence discussed under Proposition B.1. Therefore,

$$\int h(\theta^*)P(d\hat T^{-1}(z^*))\tilde\alpha(\hat T^{-1}(z^*), z^*) = \int h(\theta^*)P(d\nu\hat T\nu(z^*))\tilde\alpha(\nu\hat T\nu(z^*), z^*)$$
$$= \int h(\theta^*)P(d\hat T\nu(z^*))\tilde\alpha(\hat T\nu(z^*), \nu(z^*)). \qquad (B.4)$$

Next, applying the detailed balance condition (B.2) to ν(z*), we get

$$P(d\hat T\nu(z^*))\tilde\alpha(\hat T\nu(z^*), \nu(z^*)) = P(d\nu(z^*))\tilde\alpha(\nu(z^*), \hat T\nu(z^*));$$

substituting this in (B.4) and continuing,

$$= \int h(\theta^*)P(d\nu(z^*))\tilde\alpha(\nu(z^*), \hat T\nu(z^*)) \;\overset{\nu(z^*)\to z^*}{=}\; \int h(\theta^*)P(dz^*)\tilde\alpha(z^*, \hat T(z^*)).$$

Therefore, Equation (B.3) holds, and we complete the proof.
B.2 VOLUME CORRECTION
To adjust volume, we must derive the Jacobian determinant, $\det J := \left|\frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\right|$, which can be calculated using wedge products.
Definition B.1 (Differential forms, wedge product). The differential one-form α: TM^D → R on a differential manifold M^D is a smooth mapping from the tangent space TM^D to R, which can be expressed as a linear combination of differentials of local coordinates: $\alpha = f_i\,dx^i =: f\cdot dx$.

For example, if f: R^D → R is a smooth function, then its directional derivative along a vector v ∈ R^D, denoted by df(v), is given by

$$df(v) = \frac{\partial f}{\partial z^i}v^i;$$

then df(·) is a linear functional of v, called the differential of f at z, and is an example of a differential one-form. In particular, $dz^i(v) = v^i$, thus

$$df(v) = \frac{\partial f}{\partial z^i}\,dz^i(v), \qquad\text{and hence}\quad df = \frac{\partial f}{\partial z^i}\,dz^i.$$
The wedge product of two one-forms α, β is a two-form α ∧ β, an antisymmetric bilinear function on the tangent space, which has the following properties (α, β, γ one-forms, A a square matrix of the same dimension D):

• α ∧ α = 0
• α ∧ (β + γ) = α ∧ β + α ∧ γ (thus, α ∧ β = −β ∧ α)
• α ∧ (Aβ) = (Aᵀα) ∧ β
The following proposition enables us to calculate the Jacobian determinant, denoted det J.

Proposition B.2. Let $T_L: (\theta^{(1)}, v^{(1)}) \to (\theta^{(L+1)}, v^{(L+1)})$ be the evolution of a smooth flow; then

$$d\theta^{(L+1)}\wedge dv^{(L+1)} = \frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\,d\theta^{(1)}\wedge dv^{(1)}.$$

Note that the Jacobian determinant det J can also be regarded as a Radon–Nikodym derivative of two probability measures: $\det J = \frac{P(d\theta^{(L+1)}, dv^{(L+1)})}{P(d\theta^{(1)}, dv^{(1)})}$, where P(dθ, dv) = p(θ, v) dθ dv. We have
$$dv^{(n+1/2)} = dv^{(n)} - \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n)})\,dv^{(n+1/2)} + (\ast\ast)\,d\theta^{(n)}$$
$$d\theta^{(n+1)} = d\theta^{(n)} + \varepsilon\,dv^{(n+1/2)}$$
$$dv^{(n+1)} = dv^{(n+1/2)} - \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})\,dv^{(n+1/2)} + (\ast\ast)\,d\theta^{(n+1)},$$

where $v^T\Gamma(\theta)$ is a matrix whose (k, j)th element is $v^i\Gamma^k_{ij}(\theta)$. Therefore,

$$d\theta^{(n+1)}\wedge dv^{(n+1)} = \left[I - \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})\right]^T d\theta^{(n+1)}\wedge dv^{(n+1/2)}$$
$$= \left[I - \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})\right]^T d\theta^{(n)}\wedge dv^{(n+1/2)}$$
$$= \left[I - \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})\right]^T\left[I + \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n)})\right]^{-T} d\theta^{(n)}\wedge dv^{(n)}.$$
For volume adjustment, we must use the following Jacobian determinant, accumulated along the integration steps:

$$\det J_{\mathrm{sLMC}} := \left|\frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\right| = \prod_{n=1}^{L}\frac{\det\!\left(I - \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})\right)}{\det\!\left(I + \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n)})\right)}. \qquad (B.5)$$

As a result, the acceptance probability becomes

$$\alpha_{\mathrm{sLMC}} = \min\{1,\ \exp(-E(\theta^{(L+1)}, v^{(L+1)}) + E(\theta^{(1)}, v^{(1)}))\,|\det J_{\mathrm{sLMC}}|\}.$$
APPENDIX C: DERIVATION OF EXPLICIT LAGRANGIAN
MONTE CARLO (LMC)
To resolve the remaining implicit Equation (9), we now propose an additional modification motivated by the following relationship (notice the symmetry of the lower indices in Γ):

$$v^T\Gamma u = \frac{1}{2}\left[(v + u)^T\Gamma(v + u) - v^T\Gamma v - u^T\Gamma u\right].$$
To keep time-reversibility, we make the modification to both (9) and (11) as shown below:

$$v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\left[(v^{(n+1/2)})^T\Gamma(\theta^{(n)})v^{(n+1/2)} + G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right]$$
$$\Downarrow$$
$$v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\left[(v^{(n)})^T\Gamma(\theta^{(n)})v^{(n+1/2)} + G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right] \qquad (C.1)$$
$$\theta^{(n+1)} = \theta^{(n)} + \varepsilon\,v^{(n+1/2)} \qquad (C.2)$$
$$v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\left[(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})v^{(n+1/2)} + G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right]$$
$$\Downarrow$$
$$v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\left[(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})v^{(n+1)} + G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right]. \qquad (C.3)$$
The time-reversibility of the integrator (C.1)–(C.3) can be shown by switching (θ, v)^(n+1) and (θ, v)^(n) and negating the velocity. The resulting integrator is completely explicit, since both updates of the velocity, (C.1) and (C.3), can be solved by collecting the terms containing v^(n+1/2) and v^(n+1), respectively:

$$v^{(n+1/2)} = \left[I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\right]^{-1}\left[v^{(n)} - \frac{\varepsilon}{2}G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right] \qquad (C.4)$$
$$v^{(n+1)} = \left[I + \frac{\varepsilon}{2}(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})\right]^{-1}\left[v^{(n+1/2)} - \frac{\varepsilon}{2}G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right]. \qquad (C.5)$$
C.1 CONVERGENCE OF NUMERICAL SOLUTION
We now show that the discretization error $e_n = z(t_n) - z^{(n)} = (\theta(t_n), v(t_n)) - (\theta^{(n)}, v^{(n)})$ (i.e., the difference between the true solution and the numerical solution) accumulated over the finite time interval [0, T] is bounded and goes to zero as the step size ε goes to zero. (See Leimkuhler and Reich 2004 for a similar proof for the generalized leapfrog method.) Here, we assume that $\mathbf{f}(\theta, v) := v^T\Gamma(\theta)v + G(\theta)^{-1}\nabla_\theta\phi(\theta)$ is smooth; hence, f and its derivatives are uniformly bounded as (θ, v) evolves within the finite time duration T. We expand the true solution $z(t_{n+1})$ at $t_n$:
$$z(t_{n+1}) = z(t_n) + \dot z(t_n)\varepsilon + \frac{1}{2}\ddot z(t_n)\varepsilon^2 + o(\varepsilon^2)$$
$$= \begin{bmatrix}\theta(t_n)\\ v(t_n)\end{bmatrix} + \begin{bmatrix}v(t_n)\\ -\mathbf{f}(\theta(t_n), v(t_n))\end{bmatrix}\varepsilon + \frac{\varepsilon^2}{2}\begin{bmatrix}-\mathbf{f}(\theta(t_n), v(t_n))\\ -\frac{\partial\mathbf{f}}{\partial\theta}v(t_n) + \frac{\partial\mathbf{f}}{\partial v}\mathbf{f}(\theta(t_n), v(t_n))\end{bmatrix} + o(\varepsilon^2)$$
$$= \begin{bmatrix}\theta(t_n)\\ v(t_n)\end{bmatrix} + \begin{bmatrix}v(t_n)\\ -\mathbf{f}(\theta(t_n), v(t_n))\end{bmatrix}\varepsilon + O(\varepsilon^2).$$

Next, we simplify the expression of the numerical solution $z^{(n+1)} = \begin{bmatrix}\theta^{(n+1)}\\ v^{(n+1)}\end{bmatrix}$ for the fully explicit integrator and compare it to the above true solution. To this end, we rewrite Equation (C.4) as follows:
$$v^{(n+1/2)} = \left[I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\right]^{-1}\left[v^{(n)} - \frac{\varepsilon}{2}G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right]$$
$$= v^{(n)} - \left[I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\right]^{-1}\frac{\varepsilon}{2}\left[(v^{(n)})^T\Gamma(\theta^{(n)})v^{(n)} + G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right]$$
$$= v^{(n)} - \left[I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\right]^{-1}\frac{\varepsilon}{2}\,\mathbf{f}(\theta^{(n)}, v^{(n)})$$
$$= v^{(n)} - \frac{\varepsilon}{2}\mathbf{f}(\theta^{(n)}, v^{(n)}) + \frac{\varepsilon^2}{4}\left[I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\right]^{-1}(v^{(n)})^T\Gamma(\theta^{(n)})\,\mathbf{f}(\theta^{(n)}, v^{(n)})$$
$$= v^{(n)} - \frac{\varepsilon}{2}\mathbf{f}(\theta^{(n)}, v^{(n)}) + O(\varepsilon^2).$$
Similarly, from Equation (C.5) we have

$$v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\mathbf{f}(\theta^{(n+1)}, v^{(n+1/2)}) + O(\varepsilon^2).$$

Substituting $v^{(n+1/2)}$ in the above equation, we obtain $v^{(n+1)}$ as follows:

$$v^{(n+1)} = v^{(n)} - \frac{\varepsilon}{2}\mathbf{f}(\theta^{(n)}, v^{(n)}) - \frac{\varepsilon}{2}\mathbf{f}(\theta^{(n+1)}, v^{(n)}) + O(\varepsilon^2)$$
$$= v^{(n)} - \mathbf{f}(\theta^{(n)}, v^{(n)})\varepsilon + \frac{\varepsilon}{2}\left[\mathbf{f}(\theta^{(n)}, v^{(n)}) - \mathbf{f}(\theta^{(n)} + O(\varepsilon), v^{(n)})\right] + O(\varepsilon^2)$$
$$= v^{(n)} - \mathbf{f}(\theta^{(n)}, v^{(n)})\varepsilon + O(\varepsilon^2).$$
From (C.4), (C.2), and (C.5), we have the following numerical solution:

$$z^{(n+1)} = \begin{bmatrix}\theta^{(n+1)}\\ v^{(n+1)}\end{bmatrix} = \begin{bmatrix}\theta^{(n)}\\ v^{(n)}\end{bmatrix} + \begin{bmatrix}v^{(n)}\\ -\mathbf{f}(\theta^{(n)}, v^{(n)})\end{bmatrix}\varepsilon + O(\varepsilon^2).$$
Therefore, the local error is

$$e_{n+1} = z(t_{n+1}) - z^{(n+1)} = \begin{bmatrix}\theta(t_n) - \theta^{(n)}\\ v(t_n) - v^{(n)}\end{bmatrix} + \begin{bmatrix}v(t_n) - v^{(n)}\\ -\left[\mathbf{f}(\theta(t_n), v(t_n)) - \mathbf{f}(\theta^{(n)}, v^{(n)})\right]\end{bmatrix}\varepsilon + O(\varepsilon^2),$$

so that

$$\|e_{n+1}\| \le (1 + M\varepsilon)\|e_n\| + O(\varepsilon^2),$$

where $M = c\,\sup_{t\in[0,T]}\|\nabla\mathbf{f}(\theta(t), v(t))\|$ for some constant c > 0. Accumulating the local errors by iterating the above inequality for L = T/ε steps provides the following global error:

$$\|e_{n+1}\| \le (1 + M\varepsilon)\|e_n\| + O(\varepsilon^2) \le (1 + M\varepsilon)^2\|e_{n-1}\| + 2\,O(\varepsilon^2) \le \cdots \le (1 + M\varepsilon)^n\|e_1\| + n\,O(\varepsilon^2)$$
$$\le (1 + M\varepsilon)^L\varepsilon + L\,O(\varepsilon^2) \le (e^{MT} + T)\varepsilon \to 0, \qquad\text{as } \varepsilon\to 0.$$
C.2 VOLUME CORRECTION
As before, using the wedge product on the system (C.4), (C.2), and (C.5), the Jacobian matrix is

$$\frac{\partial(\theta^{(n+1)}, v^{(n+1)})}{\partial(\theta^{(n)}, v^{(n)})} = \left[I + \frac{\varepsilon}{2}(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})\right]^{-T}\left[I - \frac{\varepsilon}{2}(v^{(n+1)})^T\Gamma(\theta^{(n+1)})\right]^{T}\cdot$$
$$\left[I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\right]^{-T}\left[I - \frac{\varepsilon}{2}(v^{(n+1/2)})^T\Gamma(\theta^{(n)})\right]^{T}.$$

As these new equations show, our derived integrator is not symplectic, so the acceptance probability needs to be adjusted by the following Jacobian determinant, det J, to preserve the detailed balance condition:
$$\det J_{\mathrm{LMC}} := \left|\frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\right| = \prod_{n=1}^{L}\frac{\det\!\left(I - \frac{\varepsilon}{2}(v^{(n+1)})^T\Gamma(\theta^{(n+1)})\right)\det\!\left(I - \frac{\varepsilon}{2}(v^{(n+1/2)})^T\Gamma(\theta^{(n)})\right)}{\det\!\left(I + \frac{\varepsilon}{2}(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})\right)\det\!\left(I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\right)}$$
$$= \prod_{n=1}^{L}\frac{\det\!\left(G(\theta^{(n+1)}) - \frac{\varepsilon}{2}(v^{(n+1)})^T\tilde\Gamma(\theta^{(n+1)})\right)\det\!\left(G(\theta^{(n)}) - \frac{\varepsilon}{2}(v^{(n+1/2)})^T\tilde\Gamma(\theta^{(n)})\right)}{\det\!\left(G(\theta^{(n+1)}) + \frac{\varepsilon}{2}(v^{(n+1/2)})^T\tilde\Gamma(\theta^{(n+1)})\right)\det\!\left(G(\theta^{(n)}) + \frac{\varepsilon}{2}(v^{(n)})^T\tilde\Gamma(\theta^{(n)})\right)}. \qquad (C.6)$$

As a result, the acceptance probability is

$$\alpha_{\mathrm{LMC}} = \min\{1,\ \exp(-E(\theta^{(L+1)}, v^{(L+1)}) + E(\theta^{(1)}, v^{(1)}))\,|\det J_{\mathrm{LMC}}|\}.$$
APPENDIX D: PSEUDOCODES
Algorithm 1 Semi-explicit Lagrangian Monte Carlo (sLMC)
  Initialize θ(1) = current θ
  Sample new velocity v(1) ∼ N(0, G(θ(1))⁻¹)
  Calculate current E(θ(1), v(1)) according to Equation (12)
  for n = 1 to L (leapfrog steps) do
    % Update the velocity with fixed-point iterations
    v̂(0) = v(n)
    for i = 1 to NumOfFixedPointSteps do
      v̂(i) = v(n) − (ε/2) G(θ(n))⁻¹ [(v̂(i−1))ᵀ Γ̃(θ(n)) v̂(i−1) + ∇θφ(θ(n))]
    end for
    v(n+1/2) = v̂(last i)
    % Update the position with a single explicit step
    θ(n+1) = θ(n) + ε v(n+1/2)
    log detₙ = log det(I − ε Ω(θ(n+1), v(n+1/2))) − log det(I + ε Ω(θ(n), v(n+1/2)))
    % Update the velocity exactly
    v(n+1) = v(n+1/2) − (ε/2) G(θ(n+1))⁻¹ [(v(n+1/2))ᵀ Γ̃(θ(n+1)) v(n+1/2) + ∇θφ(θ(n+1))]
  end for
  Calculate proposed E(θ(L+1), v(L+1)) according to Equation (12)
  logRatio = −ProposedE + CurrentE + Σ_{n=1}^{L} log detₙ
  Accept or reject the proposal (θ(L+1), v(L+1)) according to logRatio
Algorithm 2 Explicit Lagrangian Monte Carlo (LMC)
  Initialize θ(1) = current θ
  Sample new velocity v(1) ∼ N(0, G(θ(1))⁻¹)
  Calculate current E(θ(1), v(1)) according to Equation (12)
  log det = 0
  for n = 1 to L do
    log det = log det − log det(G(θ(n)) + (ε/2) Ω̃(θ(n), v(n)))
    % Update the velocity explicitly with a half-step:
    v(n+1/2) = [G(θ(n)) + (ε/2) Ω̃(θ(n), v(n))]⁻¹ [G(θ(n)) v(n) − (ε/2) ∇θφ(θ(n))]
    log det = log det + log det(G(θ(n)) − (ε/2) Ω̃(θ(n), v(n+1/2)))
    % Update the position with a full step:
    θ(n+1) = θ(n) + ε v(n+1/2)
    log det = log det − log det(G(θ(n+1)) + (ε/2) Ω̃(θ(n+1), v(n+1/2)))
    % Update the velocity explicitly with a half-step:
    v(n+1) = [G(θ(n+1)) + (ε/2) Ω̃(θ(n+1), v(n+1/2))]⁻¹ [G(θ(n+1)) v(n+1/2) − (ε/2) ∇θφ(θ(n+1))]
    log det = log det + log det(G(θ(n+1)) − (ε/2) Ω̃(θ(n+1), v(n+1)))
  end for
  Calculate proposed E(θ(L+1), v(L+1)) according to Equation (12)
  logRatio = −ProposedE + CurrentE + log det
  Accept or reject the proposal (θ(L+1), v(L+1)) according to logRatio
ACKNOWLEDGMENTS
VS and MAG are grateful to the UK Engineering and Physical Sciences Research Council (EPSRC) for supporting
this research through project grant EP/K015664/1. MAG is grateful for support by an EPSRC Established Research
Fellowship EP/J016934/1 and a Royal Society Wolfson Research Merit Award. SL and BS are supported by NSF
grant IIS-1216045 and NIH grant R01-AI107034.
[Received March 2013. Revised August 2013.]
REFERENCES
Amari, S., and Nagaoka, H. (2000), Methods of Information Geometry, New York: Oxford University Press.

Beskos, A., Pinski, J. F., Sanz-Serna, J. M., and Stuart, A. M. (2011), "Hybrid Monte Carlo on Hilbert Spaces," Stochastic Processes and Their Applications, 121, 2201–2230.

Bishop, R. L., and Goldberg, S. I. (2000), Tensor Analysis on Manifolds, Mineola, NY: Dover Publications, Inc.

Chin, S. A. (2009), "Explicit Symplectic Integrators for Solving Nonseparable Hamiltonians," Physical Review E, 80, 037701.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987), "Hybrid Monte Carlo," Physics Letters B, 195, 216–222.

Dullweber, A., Leimkuhler, B., and McLachlan, R. (1997), "Split-Hamiltonian Methods for Rigid Body Molecular Dynamics," Journal of Chemical Physics, 107, 5840–5852.

Geyer, C. J. (1992), "Practical Markov Chain Monte Carlo," Statistical Science, 7, 473–483.

Girolami, M., and Calderhead, B. (2011), "Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods" (with discussion), Journal of the Royal Statistical Society, Series B, 73, 123–214.

Green, P. J. (1995), "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.

Leimkuhler, B., and Reich, S. (2004), Simulating Hamiltonian Dynamics, New York: Cambridge University Press.

Liu, J. S. (2001), "Molecular Dynamics and Hybrid Monte Carlo," in Monte Carlo Strategies in Scientific Computing, ed. Jun S. Liu, New York: Springer-Verlag, pp. 183–204.

Marin, J. M., Mengersen, K. L., and Robert, C. (2005), "Bayesian Modeling and Inference on Mixtures of Distributions," in Handbook of Statistics (Vol. 25), eds. D. Dey and C. R. Rao, Amsterdam: Elsevier, pp. 459–507.

McLachlan, G. J., and Peel, D. (2000), Finite Mixture Models, New York: Wiley.

Neal, R. M. (2010), "MCMC Using Hamiltonian Dynamics," in Handbook of Markov Chain Monte Carlo, eds. S. Brooks, A. Gelman, G. Jones, and X. L. Meng, Boca Raton, FL: Chapman and Hall/CRC, pp. 113–162.

Sexton, J. C., and Weingarten, D. H. (1992), "Hamiltonian Evolution for the Hybrid Monte Carlo Algorithm," Nuclear Physics B, 380, 665–677.

Shahbaba, B., Lan, S., Johnson, W., and Neal, R. (2014), "Split Hamiltonian Monte Carlo," Statistics and Computing, 24, 339–349.

Stathopoulos, V., and Girolami, M. (2011), "Manifold MCMC for Mixtures," in Mixture Estimation and Applications, eds. K. Mengersen, C. P. Robert, and D. M. Titterington, West Sussex, UK: Wiley, pp. 255–276.

Verlet, L. (1967), "Computer 'Experiments' on Classical Fluids. I. Thermodynamical Properties of Lennard-Jones Molecules," Physical Review, 159, 98–103.