Journal of Computational and Graphical Statistics, Volume 24, Number 2, Pages 357–378. DOI: 10.1080/10618600.2014.902764

Markov Chain Monte Carlo From Lagrangian Dynamics

Shiwei LAN, Vasileios STATHOPOULOS, Babak SHAHBABA, and Mark GIROLAMI

Hamiltonian Monte Carlo (HMC) improves the computational efficiency of the Metropolis–Hastings algorithm by reducing its random walk behavior. Riemannian HMC (RHMC) further improves the performance of HMC by exploiting the geometric properties of the parameter space. However, the geometric integrator used for RHMC involves implicit equations that require fixed-point iterations. In some cases, the computational overhead for solving implicit equations undermines RHMC's benefits. In an attempt to circumvent this problem, we propose an explicit integrator that replaces the momentum variable in RHMC by velocity. We show that the resulting transformation is equivalent to transforming Riemannian Hamiltonian dynamics to Lagrangian dynamics. Experimental results suggest that our method improves RHMC's overall computational efficiency in the cases considered. All computer programs and datasets are available online (http://www.ics.uci.edu/babaks/Site/Codes.html) to allow replication of the results reported in this article.
Shiwei Lan (E-mail: slan@uci.edu) is a Postdoctoral Scholar and Babak Shahbaba (E-mail: babaks@uci.edu) is Associate Professor, Department of Statistics, University of California, Irvine, Irvine, CA 92697. Vasileios Stathopoulos (E-mail: v.stathopoulos@ucl.ac.uk) is a Research Associate and Mark Girolami (E-mail: m.girolami@warwick.ac.uk) is Professor of Statistics, University of Warwick, Coventry CV4 7AL, UK.

Key Words: Explicit integrator; Hamiltonian Monte Carlo; Riemannian manifold.

1. INTRODUCTION

Hamiltonian Monte Carlo (HMC; Duane et al. 1987) reduces the random walk behavior of the Metropolis–Hastings algorithm by proposing samples that are distant from the current state but nevertheless have a high probability of acceptance. These distant proposals are found by numerically simulating Hamiltonian dynamics for some specified amount of fictitious time (Neal 2010). Hamiltonian dynamics can be represented by a function, known as the Hamiltonian, of the model parameters θ and auxiliary momentum parameters p ∼ N(0, M) (with the same dimension as θ):

$$H(\theta, p) = -\log p(\theta) + \frac{1}{2}\,p^{T} M^{-1} p, \qquad (1)$$

where M is a symmetric, positive-definite mass matrix.

Hamilton's equations, which are differential equations derived from H, determine how θ and p change over time. In practice, however, solving these equations exactly is difficult in general, so we approximate them by discretizing time with some small step size ε. For this purpose, the leapfrog method is commonly used. The stability of this discretization is limited by the smallest eigen-direction of the target, which forces a small step size. Girolami and Calderhead (2011) proposed a new method, called Riemannian HMC (RHMC), which exploits the geometric properties of the parameter space to improve the efficiency of standard HMC, especially when sampling distributions with complex structure (e.g., high correlation or non-Gaussian shape). Simulating the resulting dynamics, however, is computationally intensive, since it involves solving two implicit equations, which require additional iterative numerical computation (e.g., fixed-point iteration).

In an attempt to increase the speed of RHMC, we propose a new integrator that is completely explicit: we replace momentum with velocity in the definition of the Riemannian Hamiltonian dynamics. As we will see, this is equivalent to using Lagrangian dynamics as opposed to Hamiltonian dynamics. By doing so, we eliminate one of the implicit steps in RHMC. Next, we construct a time-symmetric integrator to remove the remaining implicit step. This leads to a valid sampling scheme (i.e., one that converges to the true target distribution) that involves only explicit equations. We refer to this algorithm as Lagrangian Monte Carlo (LMC).
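Before reviewing RHMC, it helps to recall the discretization that standard HMC uses for the fixed-metric Hamiltonian (1). The sketch below is a minimal Python illustration (not the authors' released code); `grad_log_target` and the inverse mass matrix `Minv` are assumed to be supplied by the user.

```python
import numpy as np

def leapfrog(theta, p, grad_log_target, Minv, eps, L):
    """Leapfrog trajectory for H(theta, p) = -log p(theta) + p^T M^{-1} p / 2.

    grad_log_target(theta) returns the gradient of log p(theta);
    Minv is the fixed inverse mass matrix; eps is the step size and
    L the number of integration steps.
    """
    theta, p = theta.copy(), p.copy()
    p = p + 0.5 * eps * grad_log_target(theta)            # initial half step for momentum
    for step in range(L):
        theta = theta + eps * (Minv @ p)                  # full step for position
        if step < L - 1:
            p = p + eps * grad_log_target(theta)          # two half steps combined
        else:
            p = p + 0.5 * eps * grad_log_target(theta)    # final half step for momentum
    return theta, p
```

The proposal (θ*, p*) is then accepted with probability min{1, exp(H(θ, p) − H(θ*, p*))}; because the leapfrog map is volume-preserving and time-reversible, no Jacobian correction is needed, a property that the integrators developed below deliberately give up in exchange for explicitness.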
In what follows, we begin with a brief review of RHMC and its geometric integrator. Section 3 introduces our proposed semi-explicit integrator, based on defining Hamiltonian dynamics in terms of velocity rather than momentum. Next, in Section 4, we eliminate the remaining implicit equation and propose a fully explicit integrator. In Section 5, we use simulated and real data to evaluate our methods' performance. Finally, in Section 6, we discuss some possible future research directions.

2. RIEMANNIAN HAMILTONIAN MONTE CARLO

As discussed above, although HMC explores the parameter space more efficiently than random walk Metropolis (RWM) does, it does not fully exploit the geometric properties of the parameter space defined by the density p(θ). Indeed, Girolami and Calderhead (2011) argued that dynamics over Euclidean space may not be appropriate to guide the exploration of the parameter space. To address this issue, they proposed RHMC, which exploits the Riemannian geometry of the parameter space (Amari and Nagaoka 2000) to improve standard HMC's efficiency by automatically adapting to the local structure. They did this by using a position-specific mass matrix M = G(θ); more specifically, they set G(θ) to the Fisher information matrix E[∇ log p(θ) ∇ log p(θ)^T]. As a result, p | θ ∼ N(0, G(θ)), and the Hamiltonian is defined as follows:

$$H(\theta, p) = -\log p(\theta) + \frac{1}{2}\log\det G(\theta) + \frac{1}{2}p^{T}G(\theta)^{-1}p = \phi(\theta) + \frac{1}{2}p^{T}G(\theta)^{-1}p, \qquad (2)$$

where φ(θ) := −log p(θ) + ½ log det G(θ). Based on this Hamiltonian, Girolami and Calderhead (2011) proposed the following Hamiltonian dynamics on the Riemannian manifold:

$$\dot{\theta} = \nabla_{p} H(\theta, p) = G(\theta)^{-1}p, \qquad
\dot{p} = -\nabla_{\theta} H(\theta, p) = -\nabla_{\theta}\phi(\theta) + \frac{1}{2}\nu(\theta, p). \qquad (3)$$

With the shorthand notation ∂_i = ∂/∂θ_i for partial derivatives, the ith element of the vector ν(θ, p) is

$$(\nu(\theta, p))_i = -p^{T}\,\partial_i\big(G(\theta)^{-1}\big)\,p = \big(G(\theta)^{-1}p\big)^{T}\,\partial_i G(\theta)\,G(\theta)^{-1}p.$$

The above equations of motion are nonseparable (they contain products of θ and p), and the resulting map (θ, p) → (θ*, p*) based on the standard leapfrog method is neither time-reversible nor symplectic. Therefore, the standard leapfrog algorithm cannot be used (Girolami and Calderhead 2011). Instead, they use the Störmer–Verlet (Verlet 1967) method:

$$p^{(n+1/2)} = p^{(n)} - \frac{\varepsilon}{2}\Big[\nabla_{\theta}\phi(\theta^{(n)}) - \frac{1}{2}\nu\big(\theta^{(n)}, p^{(n+1/2)}\big)\Big], \qquad (4)$$

$$\theta^{(n+1)} = \theta^{(n)} + \frac{\varepsilon}{2}\Big[G^{-1}(\theta^{(n)}) + G^{-1}(\theta^{(n+1)})\Big]p^{(n+1/2)}, \qquad (5)$$

$$p^{(n+1)} = p^{(n+1/2)} - \frac{\varepsilon}{2}\Big[\nabla_{\theta}\phi(\theta^{(n+1)}) - \frac{1}{2}\nu\big(\theta^{(n+1)}, p^{(n+1/2)}\big)\Big], \qquad (6)$$

where ε is the step size. This scheme is also known as the generalized leapfrog, and can be derived by concatenating a symplectic Euler-B integrator of (3) with its adjoint symplectic Euler-A integrator (see Leimkuhler and Reich 2004 for details). The above series of transformations is (i) deterministic, (ii) reversible, and (iii) volume-preserving. Therefore, the effective proposal distribution is a delta function concentrated on the trajectory end point, and the acceptance probability is simply

$$\min\Big\{1, \exp\big(H(\theta^{(1)}, p^{(1)}) - H(\theta^{(L+1)}, p^{(L+1)})\big)\Big\}. \qquad (7)$$

Here, (θ^{(1)}, p^{(1)}) is the current state, and (θ^{(L+1)}, p^{(L+1)}) is the proposed sample after L leapfrog steps.

As an illustrative example, Figure 1 shows the sampling paths of RWM, HMC, and RHMC for an artificially created banana-shaped distribution (see Girolami and Calderhead 2011, discussion by Luke Bornn and Julien Cornebise). For this example, we fix the trajectory length and choose the step sizes such that the acceptance probability for all three methods remains around 0.7. RWM moves slowly and spends most of its iterations in the distribution's low-density tail; HMC explores the parameter space in an indirect way; RHMC, in contrast, moves directly to the high-density region and explores the distribution more efficiently.
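To make the role of the implicit Equations (4) and (5) concrete, the following Python sketch implements one generalized leapfrog step with a fixed number of fixed-point sweeps. It is an illustration under our own naming conventions (grad_phi, G, and dG are assumed user-supplied callables, with dG(θ) the D × D × D array of partial derivatives of G), not the authors' released code.

```python
import numpy as np

def nu(theta, p, G, dG):
    """nu_i(theta, p) = (G^{-1} p)^T dG_i G^{-1} p, where dG(theta)[i] = dG / d theta_i."""
    a = np.linalg.solve(G(theta), p)                      # a = G(theta)^{-1} p
    return np.einsum('j,ijk,k->i', a, dG(theta), a)

def generalized_leapfrog_step(theta, p, grad_phi, G, dG, eps, n_fixed=6):
    """One step of the implicit integrator (4)-(6) used by RHMC."""
    # (4): implicit half step for momentum, solved by fixed-point iteration
    p_half = p.copy()
    for _ in range(n_fixed):
        p_half = p - 0.5 * eps * (grad_phi(theta) - 0.5 * nu(theta, p_half, G, dG))
    # (5): implicit full step for position; every sweep re-inverts G(theta_new)
    Ginv_cur = np.linalg.inv(G(theta))
    theta_new = theta.copy()
    for _ in range(n_fixed):
        theta_new = theta + 0.5 * eps * (Ginv_cur + np.linalg.inv(G(theta_new))) @ p_half
    # (6): explicit half step for momentum
    p_new = p_half - 0.5 * eps * (grad_phi(theta_new) - 0.5 * nu(theta_new, p_half, G, dG))
    return theta_new, p_new
```

In practice the fixed-point loops run until a tolerance is met rather than for a fixed count; either way, each sweep of (5) requires a fresh matrix inversion of G, which is exactly the overhead that the next two sections remove.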
One major drawback of this geometric integrator, which is both time-reversible and volume-preserving, is that it involves two implicit equations, (4) and (5). These require extra numerical work (e.g., fixed-point iteration), which results in higher computational cost and simulation error. This is especially true when solving for θ^{(n+1)}, because the fixed-point iteration for (5) repeatedly inverts the matrix G. To address this problem, we propose an alternative approach that uses velocity instead of momentum in the equations of motion.

Figure 1. The first 10 iterations in sampling from a banana-shaped distribution with random walk Metropolis (RWM), Hamiltonian Monte Carlo (HMC), and Riemannian HMC (RHMC). For all three methods, the trajectory length (i.e., step size ε times number of integration steps L) is set to 1; L = 17 for RWM, L = 7 for HMC, and L = 5 for RHMC. Solid red lines are the sampling paths, and black circles are the accepted proposals.

3. MOVING FROM MOMENTUM TO VELOCITY

In the equations of Hamiltonian dynamics (3), the term G(θ)^{-1}p appears several times. This motivates us to reparameterize the dynamics in terms of velocity, v = G(θ)^{-1}p. Note that this corresponds to the usual definition of velocity in physics, that is, momentum divided by mass. The transformation p → v changes the Hamiltonian dynamics (3) to the following form (derivation in Appendix A):

$$\dot{\theta} = v, \qquad \dot{v} = -\eta(\theta, v) - G(\theta)^{-1}\nabla_{\theta}\phi(\theta), \qquad (8)$$

where η(θ, v) is a vector whose kth element is Σ_{i,j} Γ^k_{ij}(θ) v^i v^j. Here, Γ^k_{ij}(θ) := ½ g^{kl}(∂_i g_{lj} + ∂_j g_{il} − ∂_l g_{ij}) are the Christoffel symbols, where g_{ij} and g^{ij} denote the (i, j)th elements of G(θ) and G(θ)^{-1}, respectively. This transformation moves the complexity of the dynamics from the first equation, for θ, to the second equation, for v, so that the integrator's effort is spent on finding a "good" direction v. By resolving the implicitness of the θ-update in the generalized leapfrog method, we reduce the associated computational cost. Concatenating the Euler-B integrator with its adjoint Euler-A integrator (Leimkuhler and Reich 2004; see also Appendix B) leads to the following semi-explicit integrator:

$$v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\Big[\eta\big(\theta^{(n)}, v^{(n+1/2)}\big) + G(\theta^{(n)})^{-1}\nabla_{\theta}\phi(\theta^{(n)})\Big], \qquad (9)$$

$$\theta^{(n+1)} = \theta^{(n)} + \varepsilon\, v^{(n+1/2)}, \qquad (10)$$

$$v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\Big[\eta\big(\theta^{(n+1)}, v^{(n+1/2)}\big) + G(\theta^{(n+1)})^{-1}\nabla_{\theta}\phi(\theta^{(n+1)})\Big]. \qquad (11)$$

Note that Equation (9), which updates v, remains implicit (more details are available in Appendix B). The introduction of velocity v in place of p is also advocated by Beskos et al. (2011) to avoid large values of p, for the sake of numerical stability; they consider a constant mass matrix, so the resulting dynamics is still Hamiltonian. In general, however, the new dynamics (8) cannot be recognized as Hamiltonian dynamics in (θ, v); rather, it is equivalent to the Euler–Lagrange equation of the second kind,

$$\frac{d}{dt}\frac{\partial L}{\partial\dot{\theta}} = \frac{\partial L}{\partial\theta},$$

which is the stationarity condition obtained by varying the Lagrangian L = ½ v^T G(θ)v − φ(θ). That is, in our case, \ddot{\theta} = -\eta(\theta, \dot{\theta}) - G(\theta)^{-1}\nabla_{\theta}\phi(\theta). Therefore, we refer to the new dynamics (8) as Lagrangian dynamics.
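For concreteness, the right-hand side of (8) can be assembled directly from G and its derivatives. The following Python sketch is an illustration, not the authors' code; it assumes dG(θ) returns the D × D × D array with entries ∂_i g_{jk}.

```python
import numpy as np

def christoffel(G_val, dG_val):
    """Christoffel symbols of the second kind, Gamma[k, i, j], from the metric
    G_val (D x D) and its derivatives dG_val[i, j, k] = d g_{jk} / d theta_i."""
    # first kind: Gamma1[i, j, k] = 0.5 * (d_i g_{kj} + d_j g_{ik} - d_k g_{ij})
    Gamma1 = 0.5 * (dG_val.transpose(0, 2, 1)      # d_i g_{kj}
                    + dG_val.transpose(1, 0, 2)    # d_j g_{ik}
                    - dG_val.transpose(2, 0, 1))   # d_k g_{ij}
    # second kind: Gamma[k, i, j] = g^{kl} Gamma1[i, j, l]
    return np.einsum('kl,ijl->kij', np.linalg.inv(G_val), Gamma1)

def lagrangian_field(theta, v, grad_phi, G, dG):
    """Right-hand side of the Lagrangian dynamics (8)."""
    G_val = G(theta)
    Gamma = christoffel(G_val, dG(theta))
    eta = np.einsum('kij,i,j->k', Gamma, v, v)         # eta^k = Gamma^k_{ij} v^i v^j
    theta_dot = v
    v_dot = -eta - np.linalg.solve(G_val, grad_phi(theta))
    return theta_dot, v_dot
```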
Although the resulting dynamics is not Hamiltonian, it nevertheless remains a valid proposal-generating mechanism: it preserves the original Hamiltonian H(θ, p = G(θ)v) (proof in Appendix A); thus, the acceptance probability is determined only by the discretization error of the numerical integration, as usual (to be discussed below). Because p | θ ∼ N(0, G(θ)), the distribution of v | θ is N(0, G(θ)^{-1}). We define the energy function E(θ, v) as the sum of the potential energy, U(θ) = −log p(θ), and the kinetic energy, K(θ, v) = −log p(v | θ):

$$E(\theta, v) = -\log p(\theta) - \frac{1}{2}\log\det G(\theta) + \frac{1}{2}v^{T}G(\theta)v. \qquad (12)$$

Note that the integrator (9)–(11) is (i) reversible and (ii) energy-preserving up to a global error of order O(ε), where ε is the step size (see Proposition B.1 in Appendix B and a proof of distribution invariance in Appendix B.1). The resulting map, however, is not volume-preserving. Therefore, the acceptance probability based on E(θ, v) must be adjusted with the determinant of the transformation to ensure that the detailed balance condition is satisfied (Green 1995),

$$\alpha_{\mathrm{sLMC}} = \min\{1, \exp(-E(\theta^{(L+1)}, v^{(L+1)}) + E(\theta^{(1)}, v^{(1)}))\,|\det J_{\mathrm{sLMC}}|\},$$

where J_sLMC is the Jacobian matrix of the map (θ^{(1)}, v^{(1)}) → (θ^{(L+1)}, v^{(L+1)}) defined by (9)–(11), with the following determinant (see Appendix B.2):

$$\det J_{\mathrm{sLMC}} := \left|\frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\right| = \prod_{n=1}^{L}\frac{\det\big(I - \varepsilon\,\Omega(\theta^{(n+1)}, v^{(n+1/2)})\big)}{\det\big(I + \varepsilon\,\Omega(\theta^{(n)}, v^{(n+1/2)})\big)}.$$

Here, Ω(θ^{(n+1)}, v^{(n+1/2)}) denotes the matrix whose (i, j)th element is Σ_k v^{(n+1/2),k} Γ^i_{kj}(θ^{(n+1)}).

The dynamics is now defined by the semi-explicit integrator (9)–(11). Algorithm 1 (see pseudocode in Appendix D) provides the corresponding steps. We refer to this approach as semi-explicit Lagrangian Monte Carlo (sLMC); it has a physical interpretation as exploring the parameter space along paths on the Riemannian manifold that minimize the action (the total Lagrangian). Whereas RHMC augments the parameter space with momentum, sLMC augments it with velocity. In Section 5, we use several experiments to show that switching from momentum to velocity may lead to improvements in computational efficiency in some cases.

4. EXPLICIT LAGRANGIAN MONTE CARLO

To resolve the remaining implicit equation (9), we modify η(θ^{(n)}, v^{(n+1/2)}) into an asymmetric function of both v^{(n)} and v^{(n+1/2)}, such that v^{(n+1/2)} can be solved explicitly (see details in Appendix C). We thus propose the following fully explicit integrator for the Lagrangian dynamics (8):

$$v^{(n+1/2)} = \Big[I + \frac{\varepsilon}{2}\Omega(\theta^{(n)}, v^{(n)})\Big]^{-1}\Big[v^{(n)} - \frac{\varepsilon}{2}G(\theta^{(n)})^{-1}\nabla_{\theta}\phi(\theta^{(n)})\Big], \qquad (13)$$

$$\theta^{(n+1)} = \theta^{(n)} + \varepsilon\,v^{(n+1/2)}, \qquad (14)$$

$$v^{(n+1)} = \Big[I + \frac{\varepsilon}{2}\Omega(\theta^{(n+1)}, v^{(n+1/2)})\Big]^{-1}\Big[v^{(n+1/2)} - \frac{\varepsilon}{2}G(\theta^{(n+1)})^{-1}\nabla_{\theta}\phi(\theta^{(n+1)})\Big]. \qquad (15)$$

In Appendix C.1, we prove that the solution obtained from this integrator converges numerically to the true solution of the Lagrangian dynamics. This integrator is (i) reversible and (ii) energy-preserving up to a global error of order O(ε). The resulting map is not volume-preserving; the Jacobian determinant of (θ^{(1)}, v^{(1)}) → (θ^{(L+1)}, v^{(L+1)}) under (13)–(15) is (see details in Appendix C.2)

$$\det J_{\mathrm{LMC}} := \prod_{n=1}^{L}\frac{\det\big(G(\theta^{(n+1)}) - \tfrac{\varepsilon}{2}\tilde{\Omega}(\theta^{(n+1)}, v^{(n+1)})\big)\,\det\big(G(\theta^{(n)}) - \tfrac{\varepsilon}{2}\tilde{\Omega}(\theta^{(n)}, v^{(n+1/2)})\big)}{\det\big(G(\theta^{(n+1)}) + \tfrac{\varepsilon}{2}\tilde{\Omega}(\theta^{(n+1)}, v^{(n+1/2)})\big)\,\det\big(G(\theta^{(n)}) + \tfrac{\varepsilon}{2}\tilde{\Omega}(\theta^{(n)}, v^{(n)})\big)}.$$

Here, Ω̃(θ, v) denotes G(θ)Ω(θ, v), whose (k, j)th element equals Σ_i v^i Γ̃_{ijk}(θ), with Γ̃_{ijk}(θ) = g_{kl}Γ^l_{ij}(θ) = ½(∂_i g_{kj} + ∂_j g_{ik} − ∂_k g_{ij}). As a result, the acceptance probability must be adjusted as follows:

$$\alpha_{\mathrm{LMC}} = \min\{1, \exp(-E(\theta^{(L+1)}, v^{(L+1)}) + E(\theta^{(1)}, v^{(1)}))\,|\det J_{\mathrm{LMC}}|\}.$$

We refer to this approach, which uses the fully explicit integrator (13)–(15), as LMC. Algorithm 2 (see pseudocode in Appendix D) shows the corresponding steps for this method.
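A single step of (13)–(15), together with its contribution to log|det J_LMC|, can be sketched in Python as follows. This is an illustration under the notation above, not the authors' released code; grad_phi, G, and dG are assumed user-supplied callables, with dG(θ)[i] = ∂G/∂θ_i.

```python
import numpy as np

def lmc_step(theta, v, grad_phi, G, dG, eps):
    """One explicit LMC step, Equations (13)-(15), plus its log-Jacobian term."""
    D = theta.size
    I = np.eye(D)

    def Omega(th, vel):
        """Omega(theta, v): matrix with (k, j) entry sum_i v^i Gamma^k_{ij}(theta)."""
        G_val, dG_val = G(th), dG(th)
        Gamma1 = 0.5 * (dG_val.transpose(0, 2, 1) + dG_val.transpose(1, 0, 2)
                        - dG_val.transpose(2, 0, 1))        # Christoffel, first kind
        Gamma = np.einsum('kl,ijl->kij', np.linalg.inv(G_val), Gamma1)  # second kind
        return np.einsum('i,kij->kj', vel, Gamma)

    def logdet(A):
        return np.linalg.slogdet(A)[1]

    Om_n = Omega(theta, v)
    v_half = np.linalg.solve(
        I + 0.5 * eps * Om_n,
        v - 0.5 * eps * np.linalg.solve(G(theta), grad_phi(theta)))            # (13)
    theta_new = theta + eps * v_half                                           # (14)
    Om_half = Omega(theta_new, v_half)
    v_new = np.linalg.solve(
        I + 0.5 * eps * Om_half,
        v_half - 0.5 * eps * np.linalg.solve(G(theta_new), grad_phi(theta_new)))  # (15)

    # per-step factor of det J_LMC (Appendix C.2)
    log_jac = (logdet(I - 0.5 * eps * Omega(theta_new, v_new))
               + logdet(I - 0.5 * eps * Omega(theta, v_half))
               - logdet(I + 0.5 * eps * Om_half)
               - logdet(I + 0.5 * eps * Om_n))
    return theta_new, v_new, log_jac
```

A proposal is obtained by composing L such steps and summing the per-step log-Jacobian terms; it is then accepted with probability min{1, exp(E(θ^{(1)}, v^{(1)}) − E(θ^{(L+1)}, v^{(L+1)}) + Σ_n log_jac_n)}, matching α_LMC above.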
In both Algorithms 1 and 2, the position update is relatively simple, while the computational time is dominated by choosing the "right" direction (velocity) using the geometry of the parameter space. In sLMC, solving for θ explicitly reduces the computational cost by (F − 1)O(D^{2.373}), where F is the number of fixed-point iterations and D is the number of parameters; this is because each fixed-point iteration takes O(D^{2.373}) elementary linear-algebra operations to invert G(θ). The connection terms Γ̃(θ) (and hence Ω̃) do not add substantial computational cost, since they are obtained by permuting the three dimensions of the array ∂G(θ), which is also computed in RHMC. The additional price of the determinant adjustment is O(D^{2.373}). LMC avoids the fixed-point iteration method altogether; it therefore reduces computation by a further (F − 1)O(D^2) and resolves possible convergence issues associated with fixed-point iteration. However, because it involves additional matrix inversions to update v, its benefits can occasionally be undermined. This is evident from our experimental results presented in Section 5.

5. EXPERIMENTAL RESULTS

In this section, we use both simulated and real data to evaluate our methods, sLMC and LMC, compared to standard HMC and RHMC. Following Girolami and Calderhead (2011), we use a time-normalized effective sample size (ESS) to compare these methods. For B posterior samples, we calculate ESS = B[1 + 2 Σ_{k=1}^{K} γ(k)]^{-1} for each parameter, where Σ_{k=1}^{K} γ(k) is the sum of K monotone sample autocorrelations (Geyer 1992). Minimum, median, and maximum values of ESS over all parameters are provided for comparing the different algorithms. More specifically, we use the minimum ESS normalized by CPU time (sec), min(ESS)/sec, as the measure of sampling efficiency. All computer programs and datasets discussed in this article are available online at http://www.ics.uci.edu/babaks/Site/Codes.html.
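One concrete way to compute this criterion is sketched below (Geyer's initial monotone sequence estimator in Python); the authors' implementation may differ in details such as the truncation rule.

```python
import numpy as np

def ess(x):
    """Effective sample size B / (1 + 2 * sum of monotone autocorrelations)."""
    x = np.asarray(x, dtype=float)
    B = x.size
    xc = x - x.mean()
    # sample autocovariances (an FFT-based version is preferable for long chains)
    acov = np.correlate(xc, xc, mode='full')[B - 1:] / B
    rho = acov / acov[0]
    # Geyer (1992): Gamma_t = rho_{2t} + rho_{2t+1}; keep the initial positive,
    # monotone nonincreasing part of this sequence
    K = (B - 1) // 2
    gamma = rho[0::2][:K] + rho[1::2][:K]
    nonpos = np.where(gamma <= 0)[0]
    m = nonpos[0] if nonpos.size else K
    gamma = np.minimum.accumulate(gamma[:m])
    tau = max(1.0, -1.0 + 2.0 * gamma.sum())    # integrated autocorrelation time
    return B / tau
```

min(ESS)/sec is then the smallest per-parameter ESS divided by the total CPU time of the run.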
5.1 SIMULATING A BANANA-SHAPED DISTRIBUTION

The banana-shaped distribution used above for illustration can be constructed as the posterior distribution of θ = (θ1, θ2) | y under the following model:

$$y \mid \theta \sim N(\theta_1 + \theta_2^2, \sigma_y^2), \qquad \theta \sim N(0, \sigma_\theta^2 I_2).$$

The data {y_i}_{i=1}^{100} are generated with θ1 + θ2² = 1, σ_y = 2, and σ_θ = 1. As we can see in Figure 2, similar to RHMC, sLMC and LMC explore the parameter space efficiently by adapting to its local geometry. Table 1 compares the performance of these algorithms based on 20,000 Markov chain Monte Carlo (MCMC) iterations after 5,000 burn-in iterations. For this specific example, sLMC has the best performance, followed by LMC. As discussed above, although LMC is fully explicit, its numerical benefits (obtained by removing implicit equations) can be negated in certain examples, since it involves additional matrix inversions to update v. The histograms of posterior samples shown in Figure 3 confirm that our algorithms converge to the true posterior distributions of θ1 and θ2, whose density functions are shown as red solid curves.

Table 1. Comparing alternative methods using a banana-shaped distribution

Method   AP     sec        ESS (min, med, max)    min(ESS)/sec
HMC      0.79   6.96e-04   (288, 614, 941)        20.65
RHMC     0.78   4.56e-03   (4514, 5779, 7044)     49.50
sLMC     0.84   7.90e-04   (2195, 3476, 4757)     138.98
LMC      0.73   7.27e-04   (1139, 2409, 3678)     78.32

NOTE: For each method, we provide the acceptance probability (AP), the CPU time (sec) per iteration, ESS (min., med., max.), and the time-normalized ESS.

Figure 2. The first 10 iterations in sampling from the banana-shaped distribution with Riemannian HMC (RHMC), semi-explicit Lagrangian Monte Carlo (sLMC), and explicit LMC (LMC). For all three methods, the trajectory length (i.e., step size times number of integration steps) is set to 1.45 and the number of integration steps is set to 10. Solid red lines show the sampling paths, and each point represents an accepted proposal.

Figure 3. Histograms of 1 million posterior samples of θ1 and θ2 for the banana-shaped distribution using RHMC (left), sLMC (middle), and LMC (right). Solid red curves are the true density functions.

5.2 LOGISTIC REGRESSION MODELS

Next, we evaluate our methods on five binary classification problems used in Girolami and Calderhead (2011): the Australian Credit, German Credit, Heart, Pima Indian, and Ripley datasets. These datasets are publicly available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). For each problem, we use a logistic regression model,

$$p(y_i = 1 \mid x_i, \beta) = \frac{\exp(x_i^{T}\beta)}{1 + \exp(x_i^{T}\beta)}, \quad i = 1, \ldots, n, \qquad \beta \sim N(0, 100\,I),$$

where y_i is a binary outcome for the ith observation, x_i is the corresponding vector of predictors (with the first element equal to 1), and β is the vector of regression parameters. We use standard HMC, RHMC, sLMC, and LMC to simulate 20,000 posterior samples of β. We fix the trajectory length for the different algorithms and tune the step sizes so that they have comparable acceptance rates.

Table 2. Comparing alternative methods using five binary classification problems discussed in Girolami and Calderhead (2011)

Data                        Method   AP     sec        ESS (min, med, max)      min(ESS)/sec
Australian (D=14, N=690)    HMC      0.75   6.13E-03   (1225, 3253, 10691)      13.32
                            RHMC     0.72   2.96E-02   (7825, 9238, 9797)       17.62
                            sLMC     0.83   2.17E-02   (10184, 13001, 13735)    31.29
                            LMC      0.75   1.60E-02   (9636, 10443, 11268)     40.17
German (D=24, N=1000)       HMC      0.74   1.31E-02   (766, 4006, 15000)       3.90
                            RHMC     0.76   6.55E-02   (14886, 15000, 15000)    15.15
                            sLMC     0.71   4.13E-02   (13395, 15000, 15000)    21.64
                            LMC      0.70   3.74E-02   (13762, 15000, 15000)    24.54
Heart (D=13, N=270)         HMC      0.71   1.75E-03   (378, 850, 2624)         14.44
                            RHMC     0.73   2.12E-02   (6263, 7430, 8191)       19.68
                            sLMC     0.77   1.30E-02   (10318, 11337, 12409)    52.73
                            LMC      0.76   1.15E-02   (10347, 10724, 11773)    59.80
Pima (D=7, N=532)           HMC      0.85   5.75E-03   (887, 4566, 12408)       10.28
                            RHMC     0.81   1.64E-02   (4349, 4693, 5178)       17.65
                            sLMC     0.81   8.98E-03   (4784, 5437, 5592)       35.50
                            LMC      0.82   7.90E-03   (4839, 5193, 5539)       40.84
Ripley (D=2, N=250)         HMC      0.88   1.50E-03   (820, 3077, 15000)       36.39
                            RHMC     0.74   1.09E-02   (12876, 15000, 15000)    78.83
                            sLMC     0.80   6.79E-03   (15000, 15000, 15000)    147.38
                            LMC      0.79   5.36E-03   (12611, 15000, 15000)    157.02

NOTES: For each dataset, the number of predictors, D, and the number of observations, N, are specified. For each method, we provide the acceptance probability (AP), the CPU time (sec) per iteration, ESS (min., med., max.), and the time-normalized ESS.
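For this model, the quantities that RHMC, sLMC, and LMC need, namely ∇φ, the metric G, and its derivatives ∂G, have simple closed forms. The sketch below follows the standard Fisher-information metric for Bayesian logistic regression as used by Girolami and Calderhead (2011); the code and variable names are ours, for illustration only.

```python
import numpy as np

def logistic_geometry(beta, X, y, alpha=100.0):
    """Metric quantities for logistic regression with prior beta ~ N(0, alpha I).

    Returns U = -log posterior (up to a constant), grad_phi, G, and the
    D x D x D array dG with dG[d] = dG / d beta_d.
    """
    eta = X @ beta
    pi = 1.0 / (1.0 + np.exp(-eta))                       # predicted probabilities
    U = -(y @ eta - np.logaddexp(0.0, eta).sum()) + beta @ beta / (2.0 * alpha)
    gradU = -X.T @ (y - pi) + beta / alpha
    lam = pi * (1.0 - pi)
    G = X.T @ (lam[:, None] * X) + np.eye(beta.size) / alpha
    # dG[d] = X^T diag(lam * (1 - 2 pi) * X[:, d]) X
    w = lam * (1.0 - 2.0 * pi)
    dG = np.einsum('n,nd,nj,nk->djk', w, X, X, X)
    # phi = U + 0.5 log det G, so grad_phi adds 0.5 * tr(G^{-1} dG[d]) in coordinate d
    grad_phi = gradU + 0.5 * np.einsum('jk,dkj->d', np.linalg.inv(G), dG)
    return U, grad_phi, G, dG
```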
Results (after discarding the initial 5,000 iterations) are summarized in Table 2 and show that, in general, our methods improve the sampling efficiency, measured in terms of minimum ESS per second, compared to RHMC on these examples.

5.3 RESULTS FOR MULTIVARIATE t-DISTRIBUTIONS

The computational complexity of standard HMC is O(D). This is substantially lower than O(D^{2.373}), the computational complexity of the three geometrically motivated methods discussed here (RHMC, sLMC, and LMC). On the other hand, these three methods can have substantially better mixing rates than standard HMC, whose mixing time is mainly determined by the condition number of the target distribution, defined as the ratio of the maximum and minimum eigenvalues of its covariance matrix, λ_max/λ_min. In this section, we illustrate how the efficiency of these sampling algorithms changes as the condition number varies, using multivariate t-distributions with density

$$p(x) = \frac{\Gamma\big((\nu + D)/2\big)}{\Gamma(\nu/2)}\,(\pi\nu)^{-D/2}\,|\Sigma|^{-1/2}\Big[1 + \frac{1}{\nu}\,x^{T}\Sigma^{-1}x\Big]^{-(\nu+D)/2}, \qquad (16)$$

where ν is the degrees of freedom and D is the dimension.

Figure 4. Left: sampling efficiency, min(ESS)/sec, versus the condition number for a fixed dimension (D = 20). Right: sampling efficiency versus dimension for a fixed condition number (λ_max/λ_min = 10,000). Each algorithm is tuned to have an acceptance rate around 70%. Results are based on 5,000 samples after discarding the initial 1,000 samples.

In our first simulation, we fix the dimension at D = 20 and vary the condition number of Σ from 10 to 10^5. As the condition number increases, one can expect HMC to be more restricted by the smallest eigen-direction, whereas RHMC, sLMC, and LMC adapt to the local geometry. Results presented in Figure 4 (left panel) show that this is in fact the case: for higher condition numbers, the geometrically motivated methods perform substantially better than standard HMC. Note that our two proposed algorithms, sLMC and LMC, provide substantial improvements over RHMC. For our second simulation, we fix the condition number at 10,000 and let the dimension vary from 10 to 50. Our results (Figure 4, right panel) show that the gain from exploiting the geometric properties of the target distribution can eventually be undermined as the dimension increases.

5.4 FINITE MIXTURE OF GAUSSIANS

Finally, we consider finite mixtures of univariate Gaussian components of the form

$$p(x_i \mid \theta) = \sum_{k=1}^{K}\pi_k\,N(x_i \mid \mu_k, \sigma_k^2), \qquad (17)$$

where θ is the vector, of size D = 3K − 1, of all the parameters π_k, μ_k, and σ_k², and N(· | μ, σ²) is a Gaussian density with mean μ and variance σ². A common choice of prior takes the form

$$p(\theta) = \mathcal{D}(\pi_1, \ldots, \pi_K \mid \lambda)\prod_{k=1}^{K}N(\mu_k \mid m, \beta^{-1}\sigma_k^2)\,\mathrm{IG}(\sigma_k^2 \mid b, c), \qquad (18)$$

where D(· | λ) is the symmetric Dirichlet distribution with parameter λ, and IG(· | b, c) is the inverse gamma distribution with shape parameter b and scale parameter c.
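A direct evaluation of the posterior defined by (17) and (18) can be sketched as follows (Python; the hyperparameter defaults are placeholders for illustration, not the values used in the experiments).

```python
import numpy as np
from scipy.stats import norm, dirichlet, invgamma

def mixture_log_posterior(x, pi, mu, sigma2, lam=1.0, m=0.0, beta=1.0, b=1.0, c=1.0):
    """Log posterior (up to a constant) for the K-component univariate Gaussian
    mixture (17) with prior (18)."""
    x = np.asarray(x, dtype=float)
    pi, mu, sigma2 = np.asarray(pi), np.asarray(mu), np.asarray(sigma2)
    # observed-data log-likelihood: sum_i log sum_k pi_k N(x_i | mu_k, sigma2_k)
    comp = np.log(pi)[None, :] + norm.logpdf(
        x[:, None], loc=mu[None, :], scale=np.sqrt(sigma2)[None, :])
    loglik = np.logaddexp.reduce(comp, axis=1).sum()
    # prior (18): symmetric Dirichlet weights, conditionally normal means,
    # inverse-gamma variances
    logprior = (dirichlet.logpdf(pi, lam * np.ones(pi.size))
                + norm.logpdf(mu, loc=m, scale=np.sqrt(sigma2 / beta)).sum()
                + invgamma.logpdf(sigma2, a=b, scale=c).sum())
    return loglik + logprior
```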
Although the posterior distribution associated with this model is formally explicit, it is computationally intractable, since it can be expressed as a sum of K^N terms corresponding to all possible allocations of the observations x_i to mixture components (Marin, Mengersen, and Robert 2005, chap. 9). We use this model to test the efficiency of posterior sampling of θ with the four methods. A more extensive comparison of Riemannian manifold MCMC and HMC, Gibbs sampling, and standard Metropolis–Hastings for finite Gaussian mixture models can be found in Stathopoulos and Girolami (2011). Because the expected Fisher information, I(θ), is not available analytically for this model, we use the empirical Fisher information as the metric tensor (McLachlan and Peel 2000, chap. 2),

$$G(\theta) = S^{T}S - \frac{1}{N}\,s\,s^{T},$$

where the N × D score matrix S has elements S_{i,d} = ∂ log p(x_i | θ)/∂θ_d and s = Σ_{i=1}^{N} S_{i,·}.

We show five Gaussian mixtures in Table 3 and Figure 5, and compare the sampling efficiency of HMC, RHMC, sLMC, and LMC on the corresponding simulated datasets in Table 4. As before, our two algorithms outperform RHMC.

Table 3. Densities used for the generation of synthetic mixture-of-Gaussian datasets

Dataset name   Num. of parameters   Density function
Kurtotic       5                    (2/3) N(x | 0, 1) + (1/3) N(x | 0, (1/10)^2)
Bimodal        5                    (1/2) N(x | -1, (2/3)^2) + (1/2) N(x | 1, (2/3)^2)
Skewed         5                    (3/4) N(x | 0, 1) + (1/4) N(x | 3/2, (1/3)^2)
Trimodal       8                    (9/20) N(x | -6/5, (3/5)^2) + (9/20) N(x | 6/5, (3/5)^2) + (1/10) N(x | 0, (1/4)^2)
Claw           17                   (1/2) N(x | 0, 1) + sum_{i=0}^{4} (1/10) N(x | i/2 - 1, (1/10)^2)

Figure 5. Densities used to generate the synthetic datasets. From left to right, the densities are in the same order as in Table 3. The densities are taken from McLachlan and Peel (2000).

Table 4. Acceptance probability (AP), seconds per iteration (sec), ESS (min., med., max.), and time-normalized ESS for Gaussian mixture models

Data       Method   AP     sec        ESS (min, med, max)     min(ESS)/sec
Claw       HMC      0.88   7.01E-01   (1916, 3761, 4970)      0.54
           RHMC     0.80   5.08E-01   (1524, 3474, 4586)      0.60
           sLMC     0.86   3.76E-01   (2531, 4332, 5000)      1.35
           LMC      0.82   2.92E-01   (2436, 3455, 4608)      1.67
Trimodal   HMC      0.77   3.43E-01   (2244, 2945, 3159)      1.30
           RHMC     0.79   9.94E-02   (4701, 4928, 5000)      9.46
           sLMC     0.82   4.02E-02   (4978, 5000, 5000)      24.77
           LMC      0.80   4.84E-02   (4899, 4982, 5000)      20.21
Skewed     HMC      0.83   1.78E-01   (2915, 3237, 3630)      3.27
           RHMC     0.85   5.10E-02   (5000, 5000, 5000)      19.63
           sLMC     0.82   2.26E-02   (4698, 4940, 5000)      41.68
           LMC      0.84   2.52E-02   (4935, 5000, 5000)      39.09
Kurtotic   HMC      0.78   2.85E-01   (3013, 3331, 3617)      2.11
           RHMC     0.82   4.72E-02   (5000, 5000, 5000)      21.20
           sLMC     0.85   2.54E-02   (5000, 5000, 5000)      39.34
           LMC      0.81   2.70E-02   (5000, 5000, 5000)      36.90
Bimodal    HMC      0.73   1.61E-01   (2923, 2991, 3091)      3.62
           RHMC     0.86   5.38E-02   (5000, 5000, 5000)      18.56
           sLMC     0.81   2.06E-02   (4935, 4996, 5000)      48.00
           LMC      0.85   2.06E-02   (5000, 5000, 5000)      46.43

NOTES: Results are calculated on a chain of 5,000 samples after a burn-in of 5,000 samples. For HMC, the burn-in was 20,000 samples to ensure convergence.

6. CONCLUSIONS AND DISCUSSION

Following the approach of Girolami and Calderhead (2011) for more efficient exploration of the parameter space, we have proposed new sampling schemes to reduce the computational cost associated with using a position-specific mass matrix. To this end, we have developed a semi-explicit (sLMC) integrator and a fully explicit (LMC) integrator for RHMC and demonstrated their advantage in computational efficiency over the generalized leapfrog (RHMC) method used by Girolami and Calderhead (2011). It is easy to show that if G(θ) ≡ M, our method reduces to standard HMC.
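For the record, the check of this reduction is one line under the stated assumption G(θ) ≡ M: then ∂G ≡ 0, so all Christoffel symbols vanish, Ω(θ, v) ≡ 0, det J_LMC ≡ 1, and ∇_θφ = ∇_θU, so (13)–(15) collapse to

$$v^{(n+1/2)} = v^{(n)} - \tfrac{\varepsilon}{2}M^{-1}\nabla_{\theta}U(\theta^{(n)}), \qquad
\theta^{(n+1)} = \theta^{(n)} + \varepsilon\,v^{(n+1/2)}, \qquad
v^{(n+1)} = v^{(n+1/2)} - \tfrac{\varepsilon}{2}M^{-1}\nabla_{\theta}U(\theta^{(n+1)}),$$

which is the standard leapfrog scheme written in terms of v = M^{-1}p, with the usual Metropolis acceptance probability.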
Compared to HMC, whose local and global errors are O(ε³) and O(ε²), respectively, LMC's local error is O(ε²) and its global error is O(ε) (proof in Appendix C.1). Although LMC's numerical solutions therefore converge to the true solutions of the corresponding dynamics at a slower rate than HMC's, the approximation generally remains adequate, leading to reasonably high acceptance rates while providing a more computationally efficient sampling mechanism. Compared to RHMC, our LMC method has the additional advantage of being more stable, since it avoids implicit updates that rely on the fixed-point iteration method: RHMC can occasionally give highly divergent solutions, especially for ill-conditioned metrics G(θ).

Future directions could involve splitting the Hamiltonian (Dullweber, Leimkuhler, and McLachlan 1997; Sexton and Weingarten 1992; Neal 2010; Shahbaba et al. 2014) to develop explicit geometric integrators. For example, one could split a nonseparable Hamiltonian dynamic into several smaller dynamics, some of which can be solved analytically. A similar idea has been explored by Chin (2009), where the Hamiltonian, instead of the dynamic, is split. Because our methods involve costly matrix inversions, another possible research direction is to approximate the mass matrix (and the Christoffel symbols as well) to reduce computational cost. For many high-dimensional problems, the mass matrix could be appropriately approximated by a highly sparse or structured sparse (e.g., tridiagonal) matrix. This could further improve our method's computational efficiency.

APPENDICES: DERIVATIONS AND PROOFS

In what follows, we show the detailed derivations of our methods. We adopt the Einstein summation convention, so whenever an index appears twice in an expression we sum over it: for example, a_i b^i := Σ_i a_i b^i and Γ^k_{ij} v^i v^j := Σ_{i,j} Γ^k_{ij} v^i v^j. A lower index is used for a covariant tensor, whose components vary by the same transformation as the change of basis (e.g., a gradient), whereas an upper index is reserved for a contravariant tensor, whose components vary in the opposite way to the change of basis (e.g., a velocity vector). Interested readers may refer to Bishop and Goldberg (1980).

APPENDIX A: TRANSFORMATION OF THE HAMILTONIAN DYNAMICS

To derive the dynamics (8) from the Hamiltonian dynamics (3), note that the first equation in (8) follows directly from the assumed transformation: \dot{\theta}^k = g^{kl} p_l = v^k. For the second equation in (8), we have

$$\dot{p}_l = \frac{d\big(g_{lj}(\theta)v^j\big)}{dt} = \frac{\partial g_{lj}}{\partial\theta^i}\,\dot{\theta}^i v^j + g_{lj}\dot{v}^j = \partial_i g_{lj}\,v^i v^j + g_{lj}\dot{v}^j.$$

Further, from Equation (3) we have

$$\dot{p}_l = -\partial_l\phi(\theta) + \frac{1}{2}v^{T}\partial_l G(\theta)v = -\partial_l\phi + \frac{1}{2}\partial_l g_{ij}\,v^i v^j = \partial_i g_{lj}\,v^i v^j + g_{lj}\dot{v}^j,$$

which means

$$g_{lj}\dot{v}^j = -\Big(\partial_i g_{lj} - \frac{1}{2}\partial_l g_{ij}\Big)v^i v^j - \partial_l\phi.$$

Multiplying both sides by G^{-1} = (g^{kl}), we have

$$\dot{v}^k = \delta^k_j\dot{v}^j = -g^{kl}\Big(\partial_i g_{lj} - \frac{1}{2}\partial_l g_{ij}\Big)v^i v^j - g^{kl}\partial_l\phi. \qquad (\mathrm{A.1})$$

Since i and j are symmetric in the first summand, switching them gives

$$\dot{v}^k = -g^{kl}\Big(\partial_j g_{li} - \frac{1}{2}\partial_l g_{ji}\Big)v^i v^j - g^{kl}\partial_l\phi, \qquad (\mathrm{A.2})$$

which in turn gives the final form of Equation (8) after adding Equations (A.1) and (A.2) and dividing the result by two:

$$\dot{v}^k = -\Gamma^k_{ij}(\theta)\,v^i v^j - g^{kl}(\theta)\,\partial_l\phi(\theta).$$

Here, Γ^k_{ij}(θ) := ½ g^{kl}(∂_i g_{lj} + ∂_j g_{il} − ∂_l g_{ij}) are the Christoffel symbols of the second kind. Note that the new dynamics (8) still preserves the original Hamiltonian H(θ, p = G(θ)v).
This is of course intuitive, but it can also be proven directly:

$$\frac{d}{dt}H(\theta, G(\theta)v) = \dot{\theta}^{T}\frac{\partial}{\partial\theta}H(\theta, G(\theta)v) + \dot{v}^{T}\frac{\partial}{\partial v}H(\theta, G(\theta)v)$$
$$= v^{T}\nabla_{\theta}\phi(\theta) + \frac{1}{2}v^{T}\big[v^{T}\partial G(\theta)v\big] + \Big[-v^{T}\Gamma(\theta)v - G(\theta)^{-1}\nabla_{\theta}\phi(\theta)\Big]^{T}G(\theta)v$$
$$= v^{T}\nabla_{\theta}\phi(\theta) - \big(\nabla_{\theta}\phi(\theta)\big)^{T}v + \frac{1}{2}v^{T}\big[v^{T}\partial G(\theta)v\big] - \big(v^{T}\tilde{\Gamma}(\theta)v\big)^{T}v = 0 + 0 = 0,$$

where v^TΓ(θ)v is the vector whose kth element is Γ^k_{ij}(θ)v^i v^j. The second zero follows from the triple form (v^{T}\tilde{\Gamma}(\theta)v)^{T}v = \tilde{\Gamma}_{ijk}v^i v^j v^k = \tfrac{1}{2}\partial_k g_{ij}\,v^i v^j v^k, where Γ̃ is the Christoffel symbol of the first kind, with elements Γ̃_{ijk}(θ) := g_{kl}Γ^l_{ij}(θ) = ½(∂_i g_{kj} + ∂_j g_{ik} − ∂_k g_{ij}).

APPENDIX B: DERIVATION OF SEMI-EXPLICIT LAGRANGIAN MONTE CARLO (sLMC)

To construct a time-reversible integrator for (8), we concatenate a half-step of the following Euler-B integrator of (8) (Leimkuhler and Reich 2004, chap. 4),

$$\theta^{(n+1/2)} = \theta^{(n)} + \frac{\varepsilon}{2}v^{(n+1/2)},$$
$$v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\Big[(v^{(n+1/2)})^{T}\Gamma(\theta^{(n)})v^{(n+1/2)} + G(\theta^{(n)})^{-1}\nabla_{\theta}\phi(\theta^{(n)})\Big],$$

with another half-step of its adjoint Euler-A integrator:

$$\theta^{(n+1)} = \theta^{(n+1/2)} + \frac{\varepsilon}{2}v^{(n+1/2)},$$
$$v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\Big[(v^{(n+1/2)})^{T}\Gamma(\theta^{(n+1)})v^{(n+1/2)} + G(\theta^{(n+1)})^{-1}\nabla_{\theta}\phi(\theta^{(n+1)})\Big].$$

Then we obtain the following semi-explicit integrator:

$$v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\Big[(v^{(n+1/2)})^{T}\Gamma(\theta^{(n)})v^{(n+1/2)} + G(\theta^{(n)})^{-1}\nabla_{\theta}\phi(\theta^{(n)})\Big],$$
$$\theta^{(n+1)} = \theta^{(n)} + \varepsilon\,v^{(n+1/2)},$$
$$v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\Big[(v^{(n+1/2)})^{T}\Gamma(\theta^{(n+1)})v^{(n+1/2)} + G(\theta^{(n+1)})^{-1}\nabla_{\theta}\phi(\theta^{(n+1)})\Big].$$

The resulting integrator, however, is no longer volume-preserving (see Appendix B.2). Nevertheless, by Proposition B.1 below, detailed balance still holds after a determinant adjustment (see also Green 1995).

Proposition B.1 (Detailed balance condition with determinant adjustment). Denote z = (θ, v) and z' = T̂_L(z) for some time-reversible integrator T̂_L of the Lagrangian dynamics. If the acceptance probability is adjusted in the following way,

$$\tilde{\alpha}(z, z') = \min\Big\{1, \frac{\exp(-H(z'))}{\exp(-H(z))}\,|\det \hat{T}_L|\Big\}, \qquad (\mathrm{B.1})$$

then the detailed balance condition still holds:

$$\tilde{\alpha}(z, z')P(dz) = \tilde{\alpha}(z', z)P(dz'). \qquad (\mathrm{B.2})$$

Proof. With z = T̂_L^{-1}(z'),

$$\tilde{\alpha}(z, z')P(dz) = \min\Big\{1, \frac{\exp(-H(z'))}{\exp(-H(z))}\Big|\frac{dz'}{dz}\Big|\Big\}\exp(-H(z))\,dz
= \min\big\{\exp(-H(z))\,dz,\ \exp(-H(z'))\,dz'\big\}$$
$$= \min\Big\{1, \frac{\exp(-H(z))}{\exp(-H(z'))}\Big|\frac{dz}{dz'}\Big|\Big\}\exp(-H(z'))\,dz' = \tilde{\alpha}(z', z)P(dz').$$

Note that the acceptance probability should be calculated based on H(θ, G(θ)v). However, it can equivalently be calculated from the energy function E(θ, v) defined in Section 3. To show the equivalence, we note that |∂(θ', p')/∂(θ, p)| = [det G(θ')/det G(θ)] |∂(θ', v')/∂(θ, v)|, and proceed as follows:

$$\tilde{\alpha} = \min\Big\{1, \frac{\exp(-H(\theta', p'))}{\exp(-H(\theta, p))}\Big|\frac{\partial(\theta', p')}{\partial(\theta, p)}\Big|\Big\}
= \min\Big\{1, \frac{\exp(-H(\theta', G(\theta')v'))}{\exp(-H(\theta, G(\theta)v))}\,\frac{\det G(\theta')}{\det G(\theta)}\Big|\frac{\partial(\theta', v')}{\partial(\theta, v)}\Big|\Big\}$$
$$= \min\Big\{1, \frac{\exp\{-(-\log p(\theta') - \tfrac12\log\det G(\theta') + \tfrac12 v'^{T}G(\theta')v')\}}{\exp\{-(-\log p(\theta) - \tfrac12\log\det G(\theta) + \tfrac12 v^{T}G(\theta)v)\}}\Big|\frac{\partial(\theta', v')}{\partial(\theta, v)}\Big|\Big\}
= \min\Big\{1, \frac{\exp(-E(\theta', v'))}{\exp(-E(\theta, v))}\Big|\frac{\partial(\theta', v')}{\partial(\theta, v)}\Big|\Big\}.$$

B.1 DISTRIBUTION INVARIANCE WITH VOLUME CORRECTION

With Proposition B.1, we can now prove that the Markov chain derived from our reversible integrator with the adjusted acceptance probability (B.1) converges to the true target density. A similar proof can be found in Liu (2001, chap. 9). Starting from a position θ ∼ p(θ | X) at time 0, we generate a velocity v ∼ N(0, G(θ)^{-1}).
Then we evolve (θ, v) according to our time-reversible integrator T̂ to reach a new state (θ*, v*) with θ* ∼ f(θ*). We want to prove that f(·) = p(· | X), which can be done by showing E_f[h(θ*)] = E_{θ|X}[h(θ*)] for any square-integrable function h. Denote z := (θ, v) and P(dz) := exp(−E(z))dz. Note that z* = (θ*, v*) can be reached in two ways: either the proposal is accepted or it is rejected. Therefore,

$$E_f[h(\theta^*)] = \int h(\theta^*)\big[P(d\hat{T}^{-1}(z^*))\,\tilde{\alpha}(\hat{T}^{-1}(z^*), z^*) + P(dz^*)\big(1 - \tilde{\alpha}(z^*, \hat{T}(z^*))\big)\big]$$
$$= \int h(\theta^*)P(dz^*) + \int h(\theta^*)\big[P(d\hat{T}^{-1}(z^*))\,\tilde{\alpha}(\hat{T}^{-1}(z^*), z^*) - P(dz^*)\,\tilde{\alpha}(z^*, \hat{T}(z^*))\big].$$

So it suffices to prove

$$\int h(\theta^*)P(d\hat{T}^{-1}(z^*))\,\tilde{\alpha}(\hat{T}^{-1}(z^*), z^*) = \int h(\theta^*)P(dz^*)\,\tilde{\alpha}(z^*, \hat{T}(z^*)). \qquad (\mathrm{B.3})$$

Denote the involution ν : (θ, v) → (θ, −v). First, by time reversibility we have T̂^{-1}(z*) = ν T̂ ν(z*). Further, we claim that α̃(ν(z), z') = α̃(z, ν(z')). This is true because (i) E is quadratic in v, so E(ν(z)) = E(z); and (ii) |dν(z')/dν(z)| = |dz'/dz|; the claim then follows from the definition of the adjusted acceptance probability (B.1) and the equivalence discussed under Proposition B.1. Therefore,

$$\int h(\theta^*)P(d\hat{T}^{-1}(z^*))\,\tilde{\alpha}(\hat{T}^{-1}(z^*), z^*) = \int h(\theta^*)P(d\nu\hat{T}\nu(z^*))\,\tilde{\alpha}(\nu\hat{T}\nu(z^*), z^*)
= \int h(\theta^*)P(d\hat{T}\nu(z^*))\,\tilde{\alpha}(\hat{T}\nu(z^*), \nu(z^*)). \qquad (\mathrm{B.4})$$

Next, applying the detailed balance condition (B.2) to ν(z*), we get

$$P(d\hat{T}\nu(z^*))\,\tilde{\alpha}(\hat{T}\nu(z^*), \nu(z^*)) = P(d\nu(z^*))\,\tilde{\alpha}(\nu(z^*), \hat{T}\nu(z^*));$$

substituting this in (B.4) and continuing, with the change of variables ν(z*) → z*,

$$= \int h(\theta^*)P(d\nu(z^*))\,\tilde{\alpha}(\nu(z^*), \hat{T}\nu(z^*)) = \int h(\theta^*)P(dz^*)\,\tilde{\alpha}(z^*, \hat{T}(z^*)).$$

Therefore, Equation (B.3) holds, which completes the proof.

B.2 VOLUME CORRECTION

To adjust the volume, we must derive the Jacobian determinant, det J := |∂(θ^{(L+1)}, v^{(L+1)})/∂(θ^{(1)}, v^{(1)})|, which can be calculated using wedge products.

Definition B.1 (Differential forms, wedge product). A differential one-form α : TM^D → R on a differentiable manifold M^D is a smooth mapping from the tangent space TM^D to R, which can be expressed as a linear combination of differentials of local coordinates: α = f_i dx^i =: f · dx. For example, if f : R^D → R is a smooth function, then its directional derivative along a vector v ∈ R^D, denoted df(v), is given by

$$df(v) = \frac{\partial f}{\partial z^i}\,v^i,$$

so df(·) is a linear functional of v, called the differential of f at z; it is an example of a differential one-form. In particular, dz^i(v) = v^i, thus df(v) = (∂f/∂z^i) dz^i(v), that is, df = (∂f/∂z^i) dz^i.

The wedge product of two one-forms α, β is a two-form α ∧ β, an antisymmetric bilinear function on the tangent space, which has the following properties (α, β, γ one-forms, A a square matrix of the same dimension D):

• α ∧ α = 0
• α ∧ (β + γ) = α ∧ β + α ∧ γ (thus, α ∧ β = −β ∧ α)
• α ∧ Aβ = A^T α ∧ β

The following proposition enables us to calculate the Jacobian determinant det J.

Proposition B.2. Let T_L : (θ^{(1)}, v^{(1)}) → (θ^{(L+1)}, v^{(L+1)}) be the evolution of a smooth flow; then

$$d\theta^{(L+1)} \wedge dv^{(L+1)} = \left|\frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\right| d\theta^{(1)} \wedge dv^{(1)}.$$

Note that the Jacobian determinant det J can also be regarded as a Radon–Nikodym derivative of two probability measures: det J = P(dθ^{(L+1)}, dv^{(L+1)})/P(dθ^{(1)}, dv^{(1)}), where P(dθ, dv) = p(θ, v)dθ dv.
We have

$$dv^{(n+1/2)} = dv^{(n)} - \varepsilon\,(v^{(n+1/2)})^{T}\Gamma(\theta^{(n)})\,dv^{(n+1/2)} + (**)\,d\theta^{(n)},$$
$$d\theta^{(n+1)} = d\theta^{(n)} + \varepsilon\,dv^{(n+1/2)},$$
$$dv^{(n+1)} = dv^{(n+1/2)} - \varepsilon\,(v^{(n+1/2)})^{T}\Gamma(\theta^{(n+1)})\,dv^{(n+1/2)} + (**)\,d\theta^{(n+1)},$$

where v^{T}Γ(θ) is the matrix whose (k, j)th element is v^{i}Γ^{k}_{ij}(θ). Therefore,

$$d\theta^{(n+1)} \wedge dv^{(n+1)} = \big[I - \varepsilon(v^{(n+1/2)})^{T}\Gamma(\theta^{(n+1)})\big]^{T}\,d\theta^{(n+1)} \wedge dv^{(n+1/2)}
= \big[I - \varepsilon(v^{(n+1/2)})^{T}\Gamma(\theta^{(n+1)})\big]^{T}\,d\theta^{(n)} \wedge dv^{(n+1/2)}$$
$$= \big[I - \varepsilon(v^{(n+1/2)})^{T}\Gamma(\theta^{(n+1)})\big]^{T}\big[I + \varepsilon(v^{(n+1/2)})^{T}\Gamma(\theta^{(n)})\big]^{-T}\,d\theta^{(n)} \wedge dv^{(n)}.$$

For the volume adjustment, we must use the following Jacobian determinant accumulated along the integration steps:

$$\det J_{\mathrm{sLMC}} := \left|\frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\right| = \prod_{n=1}^{L}\frac{\det\big(I - \varepsilon(v^{(n+1/2)})^{T}\Gamma(\theta^{(n+1)})\big)}{\det\big(I + \varepsilon(v^{(n+1/2)})^{T}\Gamma(\theta^{(n)})\big)}. \qquad (\mathrm{B.5})$$

As a result, the acceptance probability becomes

$$\alpha_{\mathrm{sLMC}} = \min\{1, \exp(-E(\theta^{(L+1)}, v^{(L+1)}) + E(\theta^{(1)}, v^{(1)}))\,|\det J_{\mathrm{sLMC}}|\}.$$

APPENDIX C: DERIVATION OF EXPLICIT LAGRANGIAN MONTE CARLO (LMC)

To resolve the remaining implicit Equation (9), we now propose an additional modification motivated by the following relationship (notice the symmetry of the lower indices of Γ):

$$v^{T}\Gamma u = \frac{1}{2}\big[(v + u)^{T}\Gamma(v + u) - v^{T}\Gamma v - u^{T}\Gamma u\big].$$

To keep time-reversibility, we make the modification to both (9) and (11), as shown below:

$$v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\big[(v^{(n+1/2)})^{T}\Gamma(\theta^{(n)})v^{(n+1/2)} + G(\theta^{(n)})^{-1}\nabla_{\theta}\phi(\theta^{(n)})\big]$$
$$\Downarrow$$
$$v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\big[(v^{(n)})^{T}\Gamma(\theta^{(n)})v^{(n+1/2)} + G(\theta^{(n)})^{-1}\nabla_{\theta}\phi(\theta^{(n)})\big], \qquad (\mathrm{C.1})$$

$$\theta^{(n+1)} = \theta^{(n)} + \varepsilon\,v^{(n+1/2)}, \qquad (\mathrm{C.2})$$

$$v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\big[(v^{(n+1/2)})^{T}\Gamma(\theta^{(n+1)})v^{(n+1/2)} + G(\theta^{(n+1)})^{-1}\nabla_{\theta}\phi(\theta^{(n+1)})\big]$$
$$\Downarrow$$
$$v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\big[(v^{(n+1/2)})^{T}\Gamma(\theta^{(n+1)})v^{(n+1)} + G(\theta^{(n+1)})^{-1}\nabla_{\theta}\phi(\theta^{(n+1)})\big]. \qquad (\mathrm{C.3})$$

The time-reversibility of the integrator (C.1)–(C.3) can be shown by switching (θ, v)^{(n+1)} and (θ, v)^{(n)} and negating the velocity. The resulting integrator is completely explicit, since both velocity updates (C.1) and (C.3) can be solved by collecting the terms containing v^{(n+1/2)} and v^{(n+1)}, respectively:

$$v^{(n+1/2)} = \Big[I + \frac{\varepsilon}{2}(v^{(n)})^{T}\Gamma(\theta^{(n)})\Big]^{-1}\Big[v^{(n)} - \frac{\varepsilon}{2}G(\theta^{(n)})^{-1}\nabla_{\theta}\phi(\theta^{(n)})\Big], \qquad (\mathrm{C.4})$$
$$v^{(n+1)} = \Big[I + \frac{\varepsilon}{2}(v^{(n+1/2)})^{T}\Gamma(\theta^{(n+1)})\Big]^{-1}\Big[v^{(n+1/2)} - \frac{\varepsilon}{2}G(\theta^{(n+1)})^{-1}\nabla_{\theta}\phi(\theta^{(n+1)})\Big]. \qquad (\mathrm{C.5})$$

C.1 CONVERGENCE OF THE NUMERICAL SOLUTION

We now show that the discretization error e^{(n)} = z(t_n) − z^{(n)} = (θ(t_n), v(t_n)) − (θ^{(n)}, v^{(n)}) (i.e., the difference between the true solution and the numerical solution), accumulated over the finite time interval [0, T], is bounded and goes to zero as the step size ε goes to zero. (See Leimkuhler and Reich 2004 for a similar proof for the generalized leapfrog method.) Here, we assume that f(θ, v) := v^{T}Γ(θ)v + G(θ)^{-1}∇_θφ(θ) is smooth; hence f and its derivatives are uniformly bounded as (θ, v) evolves within the finite time duration T. We expand the true solution z(t_{n+1}) at t_n:

$$z(t_{n+1}) = z(t_n) + \dot{z}(t_n)\varepsilon + \frac{1}{2}\ddot{z}(t_n)\varepsilon^{2} + o(\varepsilon^{2})$$
$$= \begin{bmatrix}\theta(t_n)\\ v(t_n)\end{bmatrix}
+ \begin{bmatrix}v(t_n)\\ -\mathbf{f}(\theta(t_n), v(t_n))\end{bmatrix}\varepsilon
+ \frac{\varepsilon^{2}}{2}\begin{bmatrix}-\mathbf{f}(\theta(t_n), v(t_n))\\ -\frac{\partial\mathbf{f}}{\partial\theta}v(t_n) + \frac{\partial\mathbf{f}}{\partial v}\mathbf{f}(\theta(t_n), v(t_n))\end{bmatrix} + o(\varepsilon^{2})
= \begin{bmatrix}\theta(t_n)\\ v(t_n)\end{bmatrix}
+ \begin{bmatrix}v(t_n)\\ -\mathbf{f}(\theta(t_n), v(t_n))\end{bmatrix}\varepsilon + O(\varepsilon^{2}).$$

Next, we simplify the expression for the numerical solution z^{(n+1)} = (θ^{(n+1)}, v^{(n+1)}) of the fully explicit integrator and compare it with the true solution above.
To this end, we rewrite Equation (C.4) as follows:

$$v^{(n+1/2)} = \Big[I + \frac{\varepsilon}{2}(v^{(n)})^{T}\Gamma(\theta^{(n)})\Big]^{-1}\Big[v^{(n)} - \frac{\varepsilon}{2}G(\theta^{(n)})^{-1}\nabla_{\theta}\phi(\theta^{(n)})\Big]$$
$$= v^{(n)} - \Big[I + \frac{\varepsilon}{2}(v^{(n)})^{T}\Gamma(\theta^{(n)})\Big]^{-1}\frac{\varepsilon}{2}\Big[(v^{(n)})^{T}\Gamma(\theta^{(n)})v^{(n)} + G(\theta^{(n)})^{-1}\nabla_{\theta}\phi(\theta^{(n)})\Big]$$
$$= v^{(n)} - \frac{\varepsilon}{2}\Big[I + \frac{\varepsilon}{2}(v^{(n)})^{T}\Gamma(\theta^{(n)})\Big]^{-1}\mathbf{f}(\theta^{(n)}, v^{(n)})$$
$$= v^{(n)} - \frac{\varepsilon}{2}\mathbf{f}(\theta^{(n)}, v^{(n)}) + \frac{\varepsilon^{2}}{4}(v^{(n)})^{T}\Gamma(\theta^{(n)})\Big[I + \frac{\varepsilon}{2}(v^{(n)})^{T}\Gamma(\theta^{(n)})\Big]^{-1}\mathbf{f}(\theta^{(n)}, v^{(n)})$$
$$= v^{(n)} - \frac{\varepsilon}{2}\mathbf{f}(\theta^{(n)}, v^{(n)}) + O(\varepsilon^{2}).$$

Similarly, from Equation (C.5) we have

$$v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\mathbf{f}(\theta^{(n+1)}, v^{(n+1/2)}) + O(\varepsilon^{2}).$$

Substituting v^{(n+1/2)} in the above equation, we obtain v^{(n+1)} as follows:

$$v^{(n+1)} = v^{(n)} - \frac{\varepsilon}{2}\mathbf{f}(\theta^{(n)}, v^{(n)}) - \frac{\varepsilon}{2}\mathbf{f}(\theta^{(n+1)}, v^{(n)}) + O(\varepsilon^{2})
= v^{(n)} - \mathbf{f}(\theta^{(n)}, v^{(n)})\varepsilon + \frac{\varepsilon}{2}\big[\mathbf{f}(\theta^{(n)}, v^{(n)}) - \mathbf{f}(\theta^{(n)} + O(\varepsilon), v^{(n)})\big] + O(\varepsilon^{2})
= v^{(n)} - \mathbf{f}(\theta^{(n)}, v^{(n)})\varepsilon + O(\varepsilon^{2}).$$

From (C.4), (C.2), and (C.5), we have the following numerical solution:

$$z^{(n+1)} = \begin{bmatrix}\theta^{(n+1)}\\ v^{(n+1)}\end{bmatrix} = \begin{bmatrix}\theta^{(n)}\\ v^{(n)}\end{bmatrix} + \begin{bmatrix}v^{(n)}\\ -\mathbf{f}(\theta^{(n)}, v^{(n)})\end{bmatrix}\varepsilon + O(\varepsilon^{2}).$$

Therefore, the local error satisfies

$$\|e^{(n+1)}\| = \|z(t_{n+1}) - z^{(n+1)}\|
\le \left\|\begin{bmatrix}\theta(t_n) - \theta^{(n)}\\ v(t_n) - v^{(n)}\end{bmatrix}\right\|
+ \varepsilon\left\|\begin{bmatrix}v(t_n) - v^{(n)}\\ -\big[\mathbf{f}(\theta(t_n), v(t_n)) - \mathbf{f}(\theta^{(n)}, v^{(n)})\big]\end{bmatrix}\right\| + O(\varepsilon^{2})
\le (1 + M\varepsilon)\|e^{(n)}\| + O(\varepsilon^{2}),$$

where M = c sup_{t∈[0,T]} ‖∇f(θ(t), v(t))‖ for some constant c > 0. Accumulating the local errors by iterating the above inequality over L = T/ε steps gives the following global error bound:

$$\|e^{(n+1)}\| \le (1 + M\varepsilon)\|e^{(n)}\| + O(\varepsilon^{2}) \le (1 + M\varepsilon)^{2}\|e^{(n-1)}\| + 2\,O(\varepsilon^{2}) \le \cdots
\le (1 + M\varepsilon)^{n}\|e^{(1)}\| + n\,O(\varepsilon^{2}) \le (1 + M\varepsilon)^{L}\varepsilon + L\,O(\varepsilon^{2}) \le (e^{MT} + T)\,\varepsilon \to 0, \quad \text{as } \varepsilon \to 0.$$

C.2 VOLUME CORRECTION

As before, applying the wedge product to the system (C.4), (C.2), and (C.5), the Jacobian matrix of one step is

$$\frac{\partial(\theta^{(n+1)}, v^{(n+1)})}{\partial(\theta^{(n)}, v^{(n)})}
= \Big[I + \frac{\varepsilon}{2}(v^{(n+1/2)})^{T}\Gamma(\theta^{(n+1)})\Big]^{-T}\Big[I - \frac{\varepsilon}{2}(v^{(n+1)})^{T}\Gamma(\theta^{(n+1)})\Big]^{T}
\cdot\Big[I + \frac{\varepsilon}{2}(v^{(n)})^{T}\Gamma(\theta^{(n)})\Big]^{-T}\Big[I - \frac{\varepsilon}{2}(v^{(n+1/2)})^{T}\Gamma(\theta^{(n)})\Big]^{T}.$$

As these equations show, our integrator is not symplectic, so the acceptance probability must be adjusted by the following Jacobian determinant, det J, to preserve the detailed balance condition:

$$\det J_{\mathrm{LMC}} := \left|\frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\right|
= \prod_{n=1}^{L}\frac{\det\big(I - \tfrac{\varepsilon}{2}(v^{(n+1)})^{T}\Gamma(\theta^{(n+1)})\big)\,\det\big(I - \tfrac{\varepsilon}{2}(v^{(n+1/2)})^{T}\Gamma(\theta^{(n)})\big)}{\det\big(I + \tfrac{\varepsilon}{2}(v^{(n+1/2)})^{T}\Gamma(\theta^{(n+1)})\big)\,\det\big(I + \tfrac{\varepsilon}{2}(v^{(n)})^{T}\Gamma(\theta^{(n)})\big)}$$
$$= \prod_{n=1}^{L}\frac{\det\big(G(\theta^{(n+1)}) - \tfrac{\varepsilon}{2}(v^{(n+1)})^{T}\tilde{\Gamma}(\theta^{(n+1)})\big)\,\det\big(G(\theta^{(n)}) - \tfrac{\varepsilon}{2}(v^{(n+1/2)})^{T}\tilde{\Gamma}(\theta^{(n)})\big)}{\det\big(G(\theta^{(n+1)}) + \tfrac{\varepsilon}{2}(v^{(n+1/2)})^{T}\tilde{\Gamma}(\theta^{(n+1)})\big)\,\det\big(G(\theta^{(n)}) + \tfrac{\varepsilon}{2}(v^{(n)})^{T}\tilde{\Gamma}(\theta^{(n)})\big)}. \qquad (\mathrm{C.6})$$

As a result, the acceptance probability is

$$\alpha_{\mathrm{LMC}} = \min\{1, \exp(-E(\theta^{(L+1)}, v^{(L+1)}) + E(\theta^{(1)}, v^{(1)}))\,|\det J_{\mathrm{LMC}}|\}.$$
APPENDIX D: PSEUDOCODES

Algorithm 1 Semi-explicit Lagrangian Monte Carlo (sLMC)

    Initialize θ^(1) = current θ
    Sample a new velocity v^(1) ~ N(0, G(θ^(1))^{-1})
    Calculate the current E(θ^(1), v^(1)) according to Equation (12)
    for n = 1 to L (leapfrog steps) do
        % Update the velocity with fixed-point iterations
        v̂^(0) = v^(n)
        for i = 1 to NumOfFixedPointSteps do
            v̂^(i) = v^(n) − (ε/2) G(θ^(n))^{-1} [ (v̂^(i−1))^T Γ̃(θ^(n)) v̂^(i−1) + ∇_θ φ(θ^(n)) ]
        end for
        v^(n+1/2) = v̂^(last i)
        % Update the position with a single explicit step
        θ^(n+1) = θ^(n) + ε v^(n+1/2)
        logdet_n = log det(I − ε Ω(θ^(n+1), v^(n+1/2))) − log det(I + ε Ω(θ^(n), v^(n+1/2)))
        % Update the velocity exactly
        v^(n+1) = v^(n+1/2) − (ε/2) G(θ^(n+1))^{-1} [ (v^(n+1/2))^T Γ̃(θ^(n+1)) v^(n+1/2) + ∇_θ φ(θ^(n+1)) ]
    end for
    Calculate the proposed E(θ^(L+1), v^(L+1)) according to Equation (12)
    logRatio = −ProposedE + CurrentE + Σ_{n=1}^{L} logdet_n
    Accept or reject the proposal (θ^(L+1), v^(L+1)) according to logRatio

Algorithm 2 Explicit Lagrangian Monte Carlo (LMC)

    Initialize θ^(1) = current θ
    Sample a new velocity v^(1) ~ N(0, G(θ^(1))^{-1})
    Calculate the current E(θ^(1), v^(1)) according to Equation (12)
    logdet = 0
    for n = 1 to L do
        logdet = logdet − log det(G(θ^(n)) + (ε/2) Ω̃(θ^(n), v^(n)))
        % Update the velocity explicitly with a half-step:
        v^(n+1/2) = [G(θ^(n)) + (ε/2) Ω̃(θ^(n), v^(n))]^{-1} [ G(θ^(n)) v^(n) − (ε/2) ∇_θ φ(θ^(n)) ]
        logdet = logdet + log det(G(θ^(n)) − (ε/2) Ω̃(θ^(n), v^(n+1/2)))
        % Update the position with a full step:
        θ^(n+1) = θ^(n) + ε v^(n+1/2)
        logdet = logdet − log det(G(θ^(n+1)) + (ε/2) Ω̃(θ^(n+1), v^(n+1/2)))
        % Update the velocity explicitly with a half-step:
        v^(n+1) = [G(θ^(n+1)) + (ε/2) Ω̃(θ^(n+1), v^(n+1/2))]^{-1} [ G(θ^(n+1)) v^(n+1/2) − (ε/2) ∇_θ φ(θ^(n+1)) ]
        logdet = logdet + log det(G(θ^(n+1)) − (ε/2) Ω̃(θ^(n+1), v^(n+1)))
    end for
    Calculate the proposed E(θ^(L+1), v^(L+1)) according to Equation (12)
    logRatio = −ProposedE + CurrentE + logdet
    Accept or reject the proposal (θ^(L+1), v^(L+1)) according to logRatio

ACKNOWLEDGMENTS

VS and MAG are grateful to the UK Engineering and Physical Sciences Research Council (EPSRC) for supporting this research through project grant EP/K015664/1. MAG is grateful for support by an EPSRC Established Research Fellowship EP/J016934/1 and a Royal Society Wolfson Research Merit Award. SL and BS are supported by NSF grant IIS-1216045 and NIH grant R01-AI107034.

[Received March 2013. Revised August 2013.]
REFERENCES

Amari, S., and Nagaoka, H. (2000), Methods of Information Geometry, New York: Oxford University Press.

Beskos, A., Pinski, J. F., Sanz-Serna, J. M., and Stuart, A. M. (2011), "Hybrid Monte Carlo on Hilbert Spaces," Stochastic Processes and their Applications, 121, 2201–2230.

Bishop, R. L., and Goldberg, S. I. (2000), Tensor Analysis on Manifolds, Mineola, NY: Dover Publications.

Chin, S. A. (2009), "Explicit Symplectic Integrators for Solving Nonseparable Hamiltonians," Physical Review E, 80, 037701.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987), "Hybrid Monte Carlo," Physics Letters B, 195, 216–222.

Dullweber, A., Leimkuhler, B., and McLachlan, R. (1997), "Split-Hamiltonian Methods for Rigid Body Molecular Dynamics," Journal of Chemical Physics, 107, 5840–5852.

Geyer, C. J. (1992), "Practical Markov Chain Monte Carlo," Statistical Science, 7, 473–483.

Girolami, M., and Calderhead, B. (2011), "Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods" (with discussion), Journal of the Royal Statistical Society, Series B, 73, 123–214.

Green, P. J. (1995), "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.

Leimkuhler, B., and Reich, S. (2004), Simulating Hamiltonian Dynamics, New York: Cambridge University Press.

Liu, J. S. (2001), "Molecular Dynamics and Hybrid Monte Carlo," in Monte Carlo Strategies in Scientific Computing, New York: Springer-Verlag, pp. 183–204.

Marin, J. M., Mengersen, K. L., and Robert, C. (2005), "Bayesian Modeling and Inference on Mixtures of Distributions," in Handbook of Statistics (Vol. 25), eds. D. Dey and C. R. Rao, Amsterdam: Elsevier, pp. 459–507.

McLachlan, G. J., and Peel, D. (2000), Finite Mixture Models, New York: Wiley.

Neal, R. M. (2010), "MCMC Using Hamiltonian Dynamics," in Handbook of Markov Chain Monte Carlo, eds. S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, Boca Raton, FL: Chapman and Hall/CRC, pp. 113–162.

Sexton, J. C., and Weingarten, D. H. (1992), "Hamiltonian Evolution for the Hybrid Monte Carlo Algorithm," Nuclear Physics B, 380, 665–677.

Shahbaba, B., Lan, S., Johnson, W., and Neal, R. (2014), "Split Hamiltonian Monte Carlo," Statistics and Computing, 24, 339–349.

Stathopoulos, V., and Girolami, M. (2011), "Manifold MCMC for Mixtures," in Mixture Estimation and Applications, eds. K. Mengersen, C. P. Robert, and D. M. Titterington, West Sussex, UK: Wiley, pp. 255–276.

Verlet, L. (1967), "Computer 'Experiments' on Classical Fluids. I. Thermodynamical Properties of Lennard-Jones Molecules," Physical Review, 159, 98–103.