UNIVERSITY OF CALIFORNIA, IRVINE

Advanced Bayesian Computational Methods through Geometric Techniques

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY in Statistics

by

Shiwei Lan

Dissertation Committee:
Assistant Professor Babak Shahbaba, Chair
Professor Wesley O. Johnson
Assistant Professor Jeffrey Streets

2013

© 2013 Shiwei Lan

DEDICATION

To my dear wife Yuanyuan and lovely daughter Lydia coming next January...

Contents

List of Figures
List of Tables
List of Algorithms
Acknowledgements
Curriculum Vitae
Abstract

1 Introduction
  1.1 Background
  1.2 Contributions
  1.3 Outline

2 Hamiltonian Monte Carlo
  2.1 Hamiltonian Dynamics
    2.1.1 Properties
  2.2 Hamiltonian Monte Carlo Algorithm
    2.2.1 Metropolis-Hastings Algorithm
    2.2.2 Proposal guided by Hamiltonian dynamics
    2.2.3 Leapfrog Method
  2.3 Discussion

3 Split Hamiltonian Monte Carlo
  3.1 Introduction
  3.2 Splitting the Hamiltonian
    3.2.1 Splitting the Hamiltonian with a partial analytic solution
    3.2.2 Splitting the Hamiltonian by splitting the data
  3.3 Application of Split HMC to logistic regression models
    3.3.1 Split HMC with a partial analytical solution for a logistic model
    3.3.2 Split HMC with splitting of data for a logistic model
  3.4 Experiments
    3.4.1 Simulated data
    3.4.2 Results on real data sets
  3.5 Discussion

4 Lagrangian Monte Carlo
  4.1 Introduction
  4.2 Riemannian Hamiltonian Monte Carlo
    4.2.1 Hamiltonian dynamics on Riemannian manifold
    4.2.2 Riemannian Hamiltonian Monte Carlo Algorithm
  4.3 Semi-explicit Lagrangian Monte Carlo
    4.3.1 Lagrangian Dynamics: from Momentum to Velocity
    4.3.2 Semi-explicit Lagrangian Monte Carlo Algorithm
    4.3.3 Stationarity
  4.4 Explicit Lagrangian Monte Carlo
    4.4.1 Fully explicit integrator
    4.4.2 Volume Correction
  4.5 Experimental Results
    4.5.1 Banana-shaped distributions
    4.5.2 Logistic Regression Models
    4.5.3 Multivariate T-distributions
    4.5.4 Finite Mixture of Gaussians
  4.6 Discussion

5 Wormhole Hamiltonian Monte Carlo
  5.1 Introduction
  5.2 Energy Barrier in HMC
  5.3 Wormhole HMC Algorithm
    5.3.1 Tunnel Metric
    5.3.2 Wind Tunnel
    5.3.3 Wormhole
  5.4 Mode Searching After Regeneration
    5.4.1 Identifying Regeneration Times
    5.4.2 Searching New Modes
    5.4.3 Regenerative Wormhole HMC
  5.5 Empirical Results
    5.5.1 Sensor Network Localization
    5.5.2 Mixture of Gaussians with Known Modes
    5.5.3 Mixture of Gaussians with Unknown Modes
  5.6 Discussion

6 Spherical Hamiltonian Monte Carlo for Constrained Target Distributions
  6.1 Introduction
  6.2 Sampling from distributions defined on the unit ball
    6.2.1 Change of the domain: from unit ball B_0^D(1) to sphere S^D
    6.2.2 Hamiltonian Dynamics on Sphere
    6.2.3 Spherical HMC algorithm
  6.3 Constraints
    6.3.1 Norm constraints
    6.3.2 Functional constraints
  6.4 Experimental results
    6.4.1 Truncated Multivariate Gaussian
    6.4.2 Bayesian Lasso
    6.4.3 Bridge regression
    6.4.4 Modeling synchrony among multiple neurons
  6.5 Discussion

7 Conclusion
  7.1 Future Directions

Bibliography

Appendices
  A Lagrangian Monte Carlo
    A.1 Equivalence between Riemannian Hamiltonian dynamics and Lagrangian dynamics
    A.2 Stationarity of Lagrangian Monte Carlo
    A.3 Convergence of explicit integrator to Lagrangian dynamics
  B Solutions to split Lagrangian dynamics on Sphere

List of Figures

1.1 Comparison of RWM, HMC and RHMC
1.2 Relationship of Chapters
2.1 Illustration of Hamiltonian Dynamics
3.1 Comparison of HMC and RWM in simulating a 2d Gaussian
3.2 An illustrative binary classification problem
3.3 Approximation in Split HMC with a partial analytic solution
3.4 Approximation in Split HMC by splitting the data
4.1 Comparison of RWM, HMC and RHMC in exploring a banana-shaped distribution
4.2 Comparison of RHMC, sLMC and LMC in exploring a banana-shaped distribution
4.3 Histograms of the banana-shaped distribution
4.4 Comparison of RHMC, sLMC and LMC in exploring a thin banana-shaped distribution
4.5 Change of sampling efficiency: the trade-off between geometry and efficiency
4.6 Density plots of the generated synthetic Mixture of Gaussians
5.1 Energy Barrier in HMC
5.2 Comparison of HMC and THMC in sampling from a 2d distribution with two modes
5.3 Shape of wind tunnel
5.4 Sampling from a mixture of 10 Gaussians in 100 dimensions using THMC with wind vector
5.5 Wormhole Construction
5.6 Discovering unknown modes by down-weighting known ones
5.7 Comparison of RDMC and WHMC in location inference for a wireless sensor network
5.8 Comparing WHMC to RDMC using K mixtures of D-dimensional Gaussians
5.9 Comparing RWHMC to RDMC in terms of REM using K = 10 mixtures of D-dimensional Gaussians
5.10 Number of modes identified by RWHMC over time in simulating K = 10 mixtures of Gaussians with D = 10, 100
5.11 Comparison of WHMC and WLMC in simulating a 2d distribution with 2 modes
6.1 Transforming unit ball B_0^D(1) to sphere S^D
6.2 Truncated Multivariate Gaussian
6.3 Bayesian Lasso using different sampling algorithms
6.4 Sampling Efficiency in Bayesian Lasso
6.5 Bayesian Bridge Regression by Spherical HMC
6.6 Trace plots of samples: rewarded stimulus
6.7 Trace plots of samples: non-rewarded stimulus
List of Tables

3.1 Split HMC vs HMC in sampling efficiency: simulated logistic regression
3.2 Split HMC vs HMC in sampling efficiency: logistic regression on real data
4.1 Efficiency comparison of HMC, RHMC, sLMC and LMC: banana-shaped distribution
4.2 Efficiency comparison of HMC, RHMC, sLMC and LMC: thin banana-shaped distribution
4.3 Efficiency comparison of HMC, RHMC, sLMC and LMC: 5 real logistic regression problems
4.4 Densities used for the generation of synthetic Mixture of Gaussian data sets
4.5 Efficiency comparison of HMC, RHMC, sLMC and LMC: 5 mixtures of Gaussians
6.1 Moment Matching by RWM, Wall HMC, and Spherical HMC
6.2 Efficiency comparison of RWM, Wall HMC, and Spherical HMC: Truncated Multivariate Gaussian
6.3 Efficiency comparison of RWM, Wall HMC, and Spherical HMC: Copula modeling of synchrony among multiple neurons

List of Algorithms

2.1 Hamiltonian Monte Carlo (HMC)
3.1 Split Hamiltonian Monte Carlo with a partial analytic solution (Split HMC-PAS)
3.2 Split Hamiltonian Monte Carlo by splitting the data (Split HMC-SD)
4.1 Riemannian Hamiltonian Monte Carlo (RHMC)
4.2 Semi-explicit Lagrangian Monte Carlo (sLMC)
4.3 Explicit Lagrangian Monte Carlo (LMC)
5.1 Wormhole Hamiltonian Monte Carlo (WHMC)
5.2 Regenerative Wormhole Hamiltonian Monte Carlo (RWHMC)
6.1 Spherical Hamiltonian Monte Carlo (Spherical HMC)

ACKNOWLEDGEMENTS

I would like to express my greatest gratitude to my advisor, Professor Babak Shahbaba, for his insightful guidance and persistent encouragement throughout my doctoral program. It has truly been a blessing to work with him; he made my transition into the field of Statistics a smooth one. He has not only encouraged me to think independently at every step of my research and offered me constant support and advice, but also taught me the necessary presentation and writing skills, drawing on his own experience. This dissertation would never have been written without his help.

I am also grateful to the other two members of my dissertation committee, Professor Wesley O. Johnson and Professor Jeffrey Streets. I learned advanced statistics and Bayesian modeling from Professor Johnson, who also provided me with much support and many suggestions during my degree. I want to thank Professor Streets for his generous time in our discussions, which inspired many ideas in this dissertation.

I would like to express my thanks to all my collaborators: Professor Mark Girolami, Professor Jeffrey Streets, Vasileios Stathopoulos and Bo Zhou. I want to acknowledge the help of Sungjin Ahn, Yutian Chen and Anoop Korattikara, who patiently answered my questions about the details of their work. I am also thankful to Professor Max Welling for his enlightening comments.

Finally, thanks to my family!
My wife, Yuanyuan Li, has sacrificed much in order to accompany and support me wherever I am. We owe our deepest gratitude to our parents for their continuing love, care, help and support!

CURRICULUM VITAE

Shiwei Lan

EDUCATION
Doctor of Philosophy in Statistics, University of California, Irvine, 2013, Irvine, California
Master of Science in Mathematics, University of California, Irvine, 2010, Irvine, California
Bachelor of Science in Mathematics, Nanjing University, 2005, Nanjing, China

EXPERIENCE
Graduate Research Assistant, University of California, Irvine, 06/2013-present
Teaching Assistant, University of California, Irvine, 09/2006-06/2013

MATHEMATICAL SKILLS
General: Mathematical/Real/Complex/Numerical Analysis, ODE/PDE
Geometry: Topology, Differential Geometry, Geometric Analysis
Statistics: Bayesian Statistics, Data Analysis, Stochastic Processes

COMPUTER SKILLS
C/C++, Matlab, Mathematica, R, SAS, Stata

HONORS
Excellent Graduation, Nanjing University (top 20%), 2005
National Scholarship, Nanjing University (4 of 150), 2002, 2003, 2004

REVIEWER
Statistical Analysis and Data Mining; Scandinavian Journal of Statistics

TALKS
Spherical HMC for Constrained Target Distributions, AI/ML seminar, UC Irvine, November 2013
Split HMC, 5th International Conference of ERCIM, Oviedo, Spain, December 2012
Lagrangian Dynamical Monte Carlo, AI/ML seminar, UC Irvine, November 2012

PUBLICATIONS
Shiwei Lan, Bo Zhou, and Babak Shahbaba. Spherical HMC for Constrained Target Distributions. http://arxiv.org/abs/1309.4289, 2013.
Shiwei Lan, Jeffrey Streets, and Babak Shahbaba. Wormhole Hamiltonian Monte Carlo. http://arxiv.org/abs/1306.0063, 2013.
Babak Shahbaba, Shiwei Lan, Wesley O. Johnson, and Radford M. Neal. Split Hamiltonian Monte Carlo. Statistics and Computing, DOI: 10.1007/s11222-012-9373-1, 2013.
Shiwei Lan, Vassilios Stathopoulos, Babak Shahbaba, and Mark Girolami. Lagrangian Dynamical Monte Carlo. http://arxiv.org/abs/1211.3759, 2012.

ABSTRACT

Modern statistical methods relying on Bayesian inference typically involve intractable models that require computationally intensive algorithms, such as Markov Chain Monte Carlo (MCMC), for their implementation. While simple MCMC algorithms (e.g., random walk Metropolis) might be effective at exploring low-dimensional probability distributions, they can be very inefficient for complex, high-dimensional distributions. More specifically, broader application of MCMC is hindered by either slow mixing or expensive computational cost. As a result, many existing MCMC algorithms are not efficient or capable enough to handle the complex models that are now commonly used in statistics and machine learning. This dissertation focuses on utilizing geometrically motivated methods to improve the efficiency of MCMC samplers while lowering their computational cost, with the aim of extending the application of MCMC methods to complex statistical problems involving heavy computation, complicated distribution structure, multimodality, and parameter constraints. We start by extending the standard Hamiltonian Monte Carlo (HMC) algorithm through splitting the Hamiltonian in a way that allows enhanced movement around the state space at low computational cost. For more advanced HMC algorithms defined on Riemannian manifolds, we propose a new method, Lagrangian Monte Carlo, which is capable of exploring complex probability distributions at relatively low computational cost.
For multimodal distributions, we have developed a geometrically motivated approach, Wormhole Hamiltonian Monte Carlo, that explores the distribution around the known modes effectively while identifying previously unknown modes in the process. Furthermore, we propose another algorithm, Spherical Hamiltonian Monte Carlo, that combines geometric methods and computational techniques to provide a natural and efficient framework for sampling from constrained distributions. We use a variety of simulations and real data to illustrate the substantial improvement obtained by our proposed methods over alternative solutions.

1 Introduction

1.1 Background

In Bayesian statistics, for given data D, our model P(D|θ) contains parameters θ, which are usually assigned a distribution P(θ) based on prior knowledge. The posterior distribution of θ is then

P(θ|D) = P(D|θ)P(θ) / P(D) ∝ P(D|θ)P(θ)

It is important, for example, in prediction:

P(y*|D) = ∫ P(y*|θ) P(θ|D) dθ

Such integrals are almost omnipresent in Bayesian modeling, yet they are very often intractable in the sense that no closed form exists. To infer or estimate the intractable posterior, we appeal to approximate methods. The two most prominent strategies in the literature are variational inference [1, 2] and Markov Chain Monte Carlo (MCMC) [3, 4].

Taking advantage of mean-field theory [5], variational Bayesian inference searches for a variational distribution Q(θ) in a flexible family that is closest to the true posterior P(θ|D) by iteratively reducing their distance (the Kullback-Leibler divergence, D_KL(Q||P)), thus transforming the inference problem into an optimization problem. The variational Bayesian method can be viewed as an extension of the Expectation-Maximization (EM) algorithm [6]; but instead of finding the Maximum A Posteriori (MAP) estimate, it computes an approximation to the entire posterior distribution for statistical inference and estimation, and it also provides an optimal lower bound on the marginal likelihood as a byproduct. This method has been further studied by [7, 8, 9, 10, 11, 12, 13] for applications in more general settings.

While variational Bayes provides a locally optimal, exact analytical solution to an approximation of the posterior, MCMC, on the other hand, approximates the exact posterior using a set of samples from a Markov chain. For example,

P(y*|D) = ∫ P(y*|θ) P(θ|D) dθ ≈ (1/S) Σ_{s=1}^{S} P(y*|θ^(s)),   θ^(s) ∼ P(θ|D)

By the functional central limit theorem [14], the above approximation is unbiased, with variance approximately σ²τ/S, where the autocorrelation time τ = 1 + 2 Σ_{k=1}^{+∞} ρ(k) (with ρ(k) the autocorrelation function at lag k) can be interpreted as the number of dependent samples equivalent to a single independent one [3, 15, 16, 17].
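To make the estimator and its error concrete, the following sketch (Python with NumPy; the truncation rule and the 0.05 threshold are illustrative choices, not from the text) estimates τ from a one-dimensional chain of samples:

```python
import numpy as np

def autocorr_time(x, max_lag=200):
    """Estimate tau = 1 + 2 * sum_k rho(k) for a 1-d chain of MCMC samples,
    truncating the sum once the empirical autocorrelation becomes negligible."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.mean(x * x)
    tau = 1.0
    for k in range(1, min(max_lag, x.size - 1)):
        rho = np.dot(x[:-k], x[k:]) / (x.size * var)
        if rho < 0.05:          # crude truncation; dedicated estimators do better
            break
        tau += 2.0 * rho
    return tau

# S dependent samples are worth roughly S / tau independent ones, so the
# Monte Carlo standard error of a posterior mean is about sigma * sqrt(tau / S).
```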
Compared to variational Bayes, MCMC algorithms tend to provide better approximations (typically at higher computational cost), especially in high dimensions. MCMC provides a simple but powerful tool for Bayesian learning [4, 18, 19, 20, 21]. Even though variational Bayes and MCMC are two different approximation techniques, they can be naturally combined [22, 23]. This dissertation concentrates on MCMC methods.

The fundamental theorem of Markov chains states that an aperiodic, irreducible Markov chain with stationary distribution π(·) converges uniquely to π(·) [24, 25]. An MCMC method therefore involves designing a reversible transition kernel that has the target distribution as its stationary distribution, and then generating samples according to that transition kernel. Regardless of the starting point, these samples will follow the target distribution once the chain enters equilibrium.

MCMC was introduced to tackle high-dimensional integrals in statistics and machine learning. It is well known, however, that MCMC may suffer from slow mixing (slow convergence to the stationary distribution) and a heavy computational burden when the data volume (number of observations) or the dimension (number of features) is large. The complexity of the target distribution (skewness, multimodality, etc.) can make MCMC sampling of the parameter space difficult, resulting in a low mixing rate. High dimensionality adds another layer of difficulty due to the concentration of probability in certain regions.

Rejection sampling and importance sampling [26, 27] are two primitive Monte Carlo algorithms that now serve mainly demonstrative purposes, due to their inefficiency in practice. The Metropolis algorithm [18] is largely responsible for the universality of MCMC. Given the current state θ, to derive a Markov chain having π(θ) as its stationary distribution, it first makes a proposal θ* ∼ q(θ*|θ) and either accepts it as the next state with probability min{1, π(θ*)/π(θ)} or stays at the current state. A simple choice of proposal is q(θ*|θ) = N(θ, σ²I), called Random Walk Metropolis (RWM); however, its diffusive behavior makes the resulting Markov chains mix slowly, limiting its efficiency in practice. [19] generalizes the algorithm to allow asymmetric proposals (q(θ*|θ) ≠ q(θ|θ*)), e.g. independent proposals. Note that the Gibbs sampler [20] is a special case of the cyclic Metropolis-Hastings (M-H) algorithm, obtained by taking the full conditional P(θ_i*|θ_{−i}) as the proposal distribution and updating the parameters coordinate-wise. Though such proposals are always accepted [28], the full conditionals are not necessarily available or easy to sample from. Despite recent advances in M-H algorithms [29, 30, 31, 32], careful design of the transition kernel is still needed for the Markov chain to converge quickly to the target distribution.
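As a concrete reference point for the Metropolis mechanism just described, here is a minimal RWM sketch (Python with NumPy); the target log density log_pi, step scale, and iteration count are illustrative placeholders:

```python
import numpy as np

def rwm(log_pi, theta0, sigma=0.15, n_iter=5000, rng=None):
    """Random walk Metropolis: propose theta* ~ N(theta, sigma^2 I),
    accept with probability min{1, pi(theta*)/pi(theta)}."""
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_iter, theta.size))
    for n in range(n_iter):
        prop = theta + sigma * rng.standard_normal(theta.size)
        # accept/reject on the log scale for numerical stability
        if np.log(rng.uniform()) < log_pi(prop) - log_pi(theta):
            theta = prop
        samples[n] = theta
    return samples

# Example: sample a 2-d standard Gaussian target.
draws = rwm(lambda th: -0.5 * th @ th, theta0=np.zeros(2), sigma=0.5)
```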
Using auxiliary variables can allow us to design efficient MCMC algorithms. This strategy is successfully used in slice sampling [33], which samples uniformly from the region under the density plot by alternating uniform sampling in the auxiliary vertical direction with uniform sampling in the horizontal "slice". Although slice sampling performs very well for univariate distributions, its generalization to higher dimensions can be problematic. It has recently been developed further by [34, 35].

Hamiltonian Monte Carlo (HMC) [36] is another popular example of MCMC design using auxiliary variables. As a special case of the Metropolis algorithm, HMC augments the states of interest with auxiliary variables and proposes augmented states that are distant from the current state by deterministically simulating Hamiltonian dynamics; these proposals are nevertheless accepted with high probability. Guided by the gradient information of the log density, HMC reduces the random walk behavior of RWM and significantly improves the efficiency of exploring the target distribution. We can see from figure 1.1 that RWM moves slowly, whereas HMC is more efficient in exploring the distribution with the help of geometry. [37] provides a complete introduction to HMC. [38] address two major issues involving its tuning parameters (trajectory length and step size). [39] generalize HMC to a Riemannian manifold to further improve the sampler's ability to explore complicated distributions. There are other recent works on HMC by [40, 41, 42, 43, 44, 45].

As the dimension grows, the Hamiltonian dynamical system becomes increasingly restricted by its smallest eigen-direction, requiring smaller step sizes to maintain stability. Moreover, complicated distribution structure demands local adaptation of both the step size and the direction for HMC to explore the parameter space well. Riemannian HMC (RHMC) [39] defines HMC on a Riemannian manifold, which, as argued by [46], is more suitable for sampling from complicated non-Gaussian distributions. Specifically, RHMC uses a position-dependent pre-conditioning matrix G(θ) in HMC to adapt to the local geometry of the distribution. As seen in figure 1.1, with the geometric information from the second-order derivative matrix (Fisher metric) of the log posterior density, RHMC avoids the erratic behavior of HMC and explores the parameter space more smoothly. RHMC has been developed and generalized by [47, 48, 49, 50, 51].

[Figure 1.1 consists of three panels, "Sampling Path of RWM", "Sampling Path of HMC" and "Sampling Path of RHMC", each plotting θ2 against θ1.]

Figure 1.1: The first 10 iterations in sampling from a banana-shaped distribution by RWM, HMC and RHMC. Left: RWM explores the distribution in a non-systematic way. Middle: HMC, with gradient information, is more guided in its exploration. Right: RHMC uses more geometric information (curvature) to explore the distribution even more directly.

Figure 1.1 illustrates the motivation for using geometry to improve MCMC sampling efficiency. From RWM to RHMC, the more geometric information a sampler adopts, the better its ability to explore the target distribution, and thus the better the mixing behavior of the Markov chain. As expected, the computational cost increases as more geometry is incorporated. This dissertation mainly focuses on using geometry to improve the efficiency of MCMC samplers while keeping the computational cost low, with the aim of making MCMC methods more applicable to complex statistical problems involving heavy computation, complicated distribution structure, multimodality, constraints, etc.

There are other interesting Monte Carlo methods not mentioned above. Tempered transitions [52] are an elaborate improvement over simulated tempering [53, 54] for sampling from multimodal distributions; both take advantage of simulated annealing [55] as an optimization algorithm. [51, 56, 57] discuss sampling from probability distributions defined on a submanifold embedded in R^D. Reversible jump MCMC [58] extends standard MCMC algorithms by allowing the dimension of the posterior to vary. Sequential Monte Carlo (particle filtering) methods [59] are a set of online posterior density estimation algorithms that have been used successfully in time series modeling. All of these are more or less related to the work of this dissertation.

1.2 Contributions

Both computational and geometric methods are used to improve the efficiency of MCMC samplers. The main contributions of this dissertation are as follows:

• Split Hamiltonian Monte Carlo speeds up HMC by splitting the Hamiltonian to allow enhanced movement around the state space at low computational cost.
• Lagrangian Monte Carlo is capable of exploring complex probability distributions as RHMC does, but at reduced cost, by avoiding RHMC's expensive implicit updates.
• Wormhole Hamiltonian Monte Carlo is a novel geometric MCMC method that can effectively and efficiently sample from multimodal distributions in high dimensions.
• Spherical Hamiltonian Monte Carlo combines geometric and computational techniques to provide a natural and efficient framework for sampling from constrained distributions.

1.3 Outline

This dissertation is organized as follows. Chapter 2 provides an overview of Hamiltonian Monte Carlo. Chapter 3 discusses the Split HMC method for improving the computational efficiency of HMC. Chapter 4 explains why Lagrangian dynamics, which uses velocity as opposed to momentum, is preferable to Riemannian Hamiltonian dynamics. Chapter 5 discusses how geometry can be utilized to facilitate movement between modes when sampling from multimodal distributions. Chapter 6 combines computational and geometric methods to handle constrained sampling problems implicitly and efficiently. The last chapter, chapter 7, provides conclusions and discusses future research directions. The relationship among these chapters is shown in figure 1.2.

[Figure 1.2 diagrams how Split HMC (efficiency), LMC (complicated structure), Wormhole HMC (multimodality) and Spherical HMC (constraints) extend HMC and RHMC.]

Figure 1.2: Relationship of Chapters

2 Hamiltonian Monte Carlo

Hamiltonian Monte Carlo originated in the landmark paper [36], which termed the method "hybrid Monte Carlo" and united MCMC with molecular simulation. Its statistical application began with Neal's work on neural networks [60]. HMC suppresses the random walk behavior of RWM by making proposals that are distant from the current state yet have a high probability of being accepted. These proposals are found by numerically simulating Hamiltonian dynamics for some number of discretized time steps.

In this chapter, we review Hamiltonian dynamics and its application to MCMC, where we are interested in sampling from the distribution of θ. By using auxiliary variables p, we can improve the computational efficiency of MCMC algorithms. While we provide the physical interpretation of this method, one can simply consider it a data augmentation approach.

2.1 Hamiltonian Dynamics

Hamiltonian dynamics is a system of differential equations guiding the evolution of the state of a particle in a closed system according to the law of energy conservation. It provides useful intuition for its application to MCMC. In this section, a brief overview of Hamiltonian dynamics and its properties is given; one can find a more detailed review in [37].

Consider a frictionless puck sliding on a surface of varying height (figure 2.1). The state of this system consists of the puck's position, denoted by the vector θ ∈ R^D, and its momentum, denoted by the vector p ∈ R^D. The potential energy, U(θ), is proportional to the height of the surface at position θ, and the kinetic energy is K(p) := pᵀp/(2m), where m is the mass of the puck. As the puck moves up a slope, its potential energy increases while its kinetic energy decreases. The puck keeps climbing until its kinetic energy reaches zero, then it slides back down, with its potential energy decreasing and its kinetic energy increasing.

[Figure 2.1: Illustration of Hamiltonian Dynamics: a frictionless puck sliding on a surface of height U(θ) at position θ.]

The total energy of the above dynamical system is represented by a function called the Hamiltonian, defined as follows:

Definition 2.1 (Hamiltonian).
The Hamiltonian H is defined as the total energy, the sum of the potential and kinetic energies:

H(θ, p) = U(θ) + K(p)   (2.1)

The evolution of the state (θ, p) over time t is then governed by Hamilton's equations (2.2).

Definition 2.2 (Hamiltonian Dynamics). Given a differentiable Hamiltonian H(θ, p), Hamiltonian dynamics is defined by the following differential equations:

θ̇ = ∂H/∂p = ∇_p K(p)
ṗ = −∂H/∂θ = −∇_θ U(θ)   (2.2)

where the dot denotes the time derivative and ∇_p = [∂/∂p_1, · · · , ∂/∂p_D]. The solution to (2.2) defines a flow T_t : M^{2D} × R → M^{2D}, (θ(t_0), p(t_0), t) ↦ (θ(t_0 + t), p(t_0 + t)), for all t_0, t ∈ R. (A flow T is a mapping T : M × R → M such that for all z ∈ M and s, t ∈ R, T(z, 0) = z and T(T(z, t), s) = T(z, s + t).)

Alternatively, denote the state z := (θ, p) and the symplectic matrix J := [[0_{D×D}, I_{D×D}], [−I_{D×D}, 0_{D×D}]]; then Hamiltonian dynamics (2.2) can be rewritten as

ż = J ∇_z H(z)

2.1.1 Properties

Hamiltonian dynamics has three fundamental properties that are crucial for its application in MCMC: i) time reversibility; ii) symplecticity (volume preservation); iii) energy (Hamiltonian) conservation [61].

Time Reversibility states the one-to-one correspondence between the forward-time evolution, T_t, and the time-reversed evolution, T_{−t}, which is also the inverse of the flow, T_t^{−1}. The time reversibility of the dynamics is used to prove the reversibility of the Markov chain transitions in HMC, which in turn provides an easy proof of the stationarity of the resulting Markov chain.

Definition 2.3 (Time Reversibility). A dynamical system is time reversible if there exists an involution ν (a function that is its own inverse, i.e. ν² = ν ∘ ν = id, with id the identity map) that gives a one-to-one mapping between its forward-time and time-reversed evolutions in the following way: T_{−t} = ν ∘ T_t ∘ ν.

Proposition 2.1 (Time Reversibility). Hamiltonian dynamics (2.2) is time reversible.

Proof. Let ν be the mapping that reverses the direction of the momentum; i.e., ν acts on z as the matrix I = [[I_{D×D}, 0_{D×D}], [0_{D×D}, −I_{D×D}]]. Since K is a quadratic function of p, we have H(z′) = H(z) for z′ = ν(z) = Iz, and therefore

ż′ = I ż = I J ∇_z H(z) = I J Iᵀ ∇_{z′} H(z′) = −J ∇_{z′} H(z′)

Thus, if z(t) = T_t(z_0) for some initial value z_0, then z′(t) = T_{−t}(z′_0) for the initial value z′_0 = ν(z_0), by the above equation. By the uniqueness of the solution,

T_t(z_0) = z(t) = ν(z′(t)) = ν ∘ T_{−t} ∘ ν(z_0)

Noting that ν is an involution, we conclude that T_{−t} = ν ∘ T_t ∘ ν.

Remark 2.1. The time reversibility of Hamiltonian dynamics has the following interpretation. Starting from θ_0 with some initial momentum p_0, evolve the dynamics (2.2) for some time to reach θ_1 with momentum p_1. If we now start from θ_1 with the flipped momentum −p_1, we arrive back at θ_0 with momentum −p_0 after evolving (2.2) for the same amount of time; further flipping the direction of −p_0 recovers the original initial state (θ_0, p_0).
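Remark 2.1 can be verified numerically. The sketch below assumes the simple Hamiltonian H(θ, p) = θᵀθ/2 + pᵀp/2, whose exact flow is a rotation (a convenient special case chosen for illustration, not the general setting):

```python
import numpy as np

def exact_flow(theta, p, t):
    """Exact Hamiltonian flow for H = theta^T theta/2 + p^T p/2
    (standard Gaussian target, identity mass): a rotation in phase space."""
    return (theta * np.cos(t) + p * np.sin(t),
            -theta * np.sin(t) + p * np.cos(t))

rng = np.random.default_rng(0)
theta0, p0 = rng.standard_normal(3), rng.standard_normal(3)

theta1, p1 = exact_flow(theta0, p0, t=1.3)     # evolve forward
theta2, p2 = exact_flow(theta1, -p1, t=1.3)    # flip momentum, evolve again
assert np.allclose(theta2, theta0) and np.allclose(-p2, p0)  # back to start

# Energy is conserved exactly along the flow, as Proposition 2.3 asserts.
H = lambda th, p: 0.5 * (th @ th + p @ p)
assert np.isclose(H(theta0, p0), H(theta1, p1))
```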
Volume Preservation means that any infinitesimal region R in the state space has the same volume after being mapped by the flow T_t, i.e. Vol(R) = Vol(T_t(R)) for all t ∈ R. It (or the stronger condition of symplecticity) is the property that simplifies the acceptance probability for Metropolis updates. If it does not hold, the acceptance probability needs to be adjusted by the Jacobian determinant of the discretized evolution to guarantee the stationarity of the resulting Markov chain; see Proposition 4.3 in Chapter 4.

Proposition 2.2 (Volume Preservation). Hamiltonian dynamics (2.2) is volume preserving.

Proof. The easiest proof is to show that the divergence of the vector field (θ̇, ṗ) is zero, so that the flux across the boundary of any infinitesimal volume is zero:

∇ · (θ̇, ṗ) = ∂θ̇/∂θᵀ + ∂ṗ/∂pᵀ = ∂/∂θᵀ (∂H/∂p) − ∂/∂pᵀ (∂H/∂θ) = 0

One could also appeal to the divergence theorem, or more directly to Liouville's theorem [62].

Energy Conservation makes the acceptance probability equal to one in Metropolis updates when (2.2) is solved analytically, and makes the acceptance probability depend only on the discretization error when (2.2) is solved numerically.

Proposition 2.3 (Energy Conservation). Hamiltonian dynamics (2.2) conserves energy.

Proof. Based on the Hamilton equations (2.2),

dH/dt = θ̇ᵀ ∂H/∂θ + ṗᵀ ∂H/∂p = (∂H/∂p)ᵀ ∂H/∂θ − (∂H/∂θ)ᵀ ∂H/∂p = 0

In practice, Hamiltonian dynamics (2.2) can usually only be solved numerically. The first two properties remain valid for a suitably discretized dynamics, while energy is then conserved only approximately. These properties are important in the application to MCMC, as they constitute an essential part of the proof of stationarity of the induced Markov chain and provide convenience in the use of the algorithm. In particular, a numerical method for solving differential equations that satisfies i) time reversibility and ii) symplecticity (volume preservation) is called a geometric integrator [39, 61].

2.2 Hamiltonian Monte Carlo Algorithm

Hamiltonian dynamics can be used to guide proposals in the Metropolis-Hastings algorithm, thereby suppressing the random walk behavior of the RWM algorithm. The resulting algorithm is called Hamiltonian Monte Carlo (HMC).

2.2.1 Metropolis-Hastings Algorithm

The Metropolis algorithm [18] is a popular MCMC sampling scheme, which was generalized to asymmetric proposals by [19]. Suppose the target distribution is π(·). We want to derive a transition probability (kernel) T(θ^(n+1)|θ^(n)) for generating samples {θ^(n)} that have the target distribution π(·) as their stationary distribution. A crucial sufficient (but not necessary) condition to ensure stationarity is the detailed balance condition [28]:

π(θ^(n)) T(θ^(n+1)|θ^(n)) = π(θ^(n+1)) T(θ^(n)|θ^(n+1))   (2.3)

Given the current state θ^(n), the M-H algorithm makes a proposal θ* according to some easy-to-sample distribution q(θ*|θ^(n)), then accepts the proposal θ* with the acceptance probability

α_MH(θ^(n), θ*) = min{ 1, [π(θ*)/q(θ*|θ^(n))] / [π(θ^(n))/q(θ^(n)|θ*)] }   (2.4)

We set θ^(n+1) = θ* if θ* is accepted, and θ^(n+1) = θ^(n) otherwise. The M-H transition kernel is

T(θ^(n+1)|θ^(n)) = q(θ^(n+1)|θ^(n)) α(θ^(n), θ^(n+1)) + δ_{θ^(n)}(θ^(n+1)) ∫ q(θ*|θ^(n)) (1 − α(θ^(n), θ*)) dθ*   (2.5)

The detailed balance condition (2.3) can be verified from (2.4) and (2.5) [28].

Note that there are two popular choices of proposal distribution q(θ*|θ^(n)): the independence sampler, q(θ*|θ^(n)) = q(θ*), and the symmetric proposal (the Metropolis algorithm), which satisfies q(θ*|θ^(n)) = q(θ^(n)|θ*). For RWM, q(θ*|θ^(n)) = N(θ^(n), σ²I); for HMC, q(θ*|θ^(n)) is defined by a symmetric deterministic process discussed below. Both RWM and HMC are specific examples of the Metropolis algorithm.
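As a quick sanity check of (2.3)-(2.5), the following sketch builds the M-H transition matrix for a small three-state target with a symmetric proposal (so the q terms in (2.4) cancel) and verifies detailed balance and stationarity; the target probabilities are arbitrary illustrative values:

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])        # target distribution over 3 states
q = np.full((3, 3), 1 / 3)            # symmetric (uniform) proposal q(j|i)

# M-H kernel: off-diagonal T[i, j] = q(j|i) * min{1, pi[j]/pi[i]};
# the diagonal collects the rejection mass, as in (2.5).
T = q * np.minimum(1.0, pi[None, :] / pi[:, None])
np.fill_diagonal(T, 0.0)
np.fill_diagonal(T, 1.0 - T.sum(axis=1))

# Detailed balance (2.3): pi_i T_ij = pi_j T_ji, hence pi is stationary.
flow = pi[:, None] * T
assert np.allclose(flow, flow.T)
assert np.allclose(pi @ T, pi)
```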
2.2.2 Proposal guided by Hamiltonian dynamics

In Bayesian statistics, we need posterior samples of the model parameters θ for inference or prediction. Here we use Hamiltonian dynamics to guide the proposal q(θ*|θ^(n)), which yields the HMC algorithm.

Instead of using the variables of interest θ alone, we consider the joint state z = (θ, p), where p is a vector of fictitious variables of the same dimension as θ. Assume the distribution of interest has density π(θ). We define the potential energy of the dynamical system as minus the log of the density π(θ). In Bayesian statistics, θ consists of the model parameters (and perhaps latent variables), and it is of interest to sample from the posterior distribution of θ given the observed data D. The corresponding potential energy is thus defined, up to a constant, as

U(θ) = − log(π(θ|D)) = −[log(P(θ)) + log(L(θ|D))]   (2.6)

where P(θ) is the prior density and L(θ|D) is the likelihood function.

To make use of Hamiltonian dynamics, we augment the parameter space of θ by creating an auxiliary momentum vector p of the same dimension as θ. This vector p is assigned the distribution defined by the kinetic energy function K(p) := ½ pᵀM⁻¹p, resulting in a density proportional to exp(−K(p)), i.e. p ∼ N(0, M), where M is the mass matrix, often set to the identity matrix I in standard HMC for convenience. An alternative, more complex choice is the Fisher information matrix, which can help explore the parameter space more efficiently; see [39] and chapter 4 for more details. The joint density of (θ, p) is defined through the Hamiltonian function as

f(θ, p) ∝ exp(−H(θ, p)) = exp(−U(θ)) exp(−K(p))   (2.7)

Note that θ and p are independent for a fixed mass matrix M ≡ const, but not in general, e.g. for a position-dependent mass matrix G(θ).

The HMC algorithm works as follows: i) given the current state θ^(n), we first sample a random momentum variable p^(n) ∼ N(0, M); ii) we evolve the joint state z = (θ, p) for some time t according to Hamiltonian dynamics (2.2) to obtain a proposal z* = (θ*, p*) = T_t(z); iii) we decide whether to accept the proposal z* according to the acceptance probability

α_HMC(z^(n), z*) = min{ 1, [f(z*) δ_{T_{−t}(z*)}(z^(n))] / [f(z^(n)) δ_{T_t(z^(n))}(z*)] } = min{1, exp(−H(z*) + H(z^(n)))}   (2.8)

where δ is the Dirac delta function. Finally, we drop the auxiliary momentum variable p and repeat steps i)-iii). In fact, step ii) means that the proposal mechanism in HMC is actually deterministic, i.e.

q(z*|z^(n)) = δ_{T_t(z^(n))}(z*)   (2.9)

and the randomness comes from sampling the momentum p^(n) in step i). The following theorem ensures the validity of HMC as detailed above:

Theorem 2.1. The Markov chain generated by the HMC procedure i)-iii) has the joint distribution (2.7) as its stationary distribution.

Proof. Let z^(n+1) = T_t(z^(n)). It suffices to verify the detailed balance condition (2.3) for z^(n+1) ≠ z^(n) (otherwise (2.3) is trivial).

LHS = f(z^(n)) T(z^(n+1)|z^(n)) = f(z^(n)) q(z^(n+1)|z^(n)) α_HMC(z^(n), z^(n+1))
    = f(z^(n)) δ_{T_t(z^(n))}(z^(n+1)) min{ 1, [f(z^(n+1)) δ_{T_{−t}(z^(n+1))}(z^(n))] / [f(z^(n)) δ_{T_t(z^(n))}(z^(n+1))] }
    = min{ f(z^(n)) δ_{T_t(z^(n))}(z^(n+1)), f(z^(n+1)) δ_{T_{−t}(z^(n+1))}(z^(n)) }
    = f(z^(n+1)) δ_{T_{−t}(z^(n+1))}(z^(n)) min{ 1, [f(z^(n)) δ_{T_t(z^(n))}(z^(n+1))] / [f(z^(n+1)) δ_{T_{−t}(z^(n+1))}(z^(n))] }
    = f(z^(n+1)) q(z^(n)|z^(n+1)) α_HMC(z^(n+1), z^(n)) = f(z^(n+1)) T(z^(n)|z^(n+1)) = RHS

Remark 2.2. Note that in the above proof, only differences of the energy functions appear in (2.8), and only their gradients are used for T_t in (2.9), so the energy functions need only be defined up to a fixed constant.
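To fix ideas before turning to the leapfrog method, here is a minimal sketch of the energy functions (2.6) and K(p) for a toy model, a Gaussian prior with a Gaussian likelihood; the data and prior scales are invented for illustration:

```python
import numpy as np

# Toy posterior: theta ~ N(0, s0^2 I) prior, y_i ~ N(theta, s^2 I) likelihood.
y = np.array([[0.5, -0.2], [1.0, 0.3], [0.7, 0.1]])
s0, s = 10.0, 1.0

def U(theta):
    """Potential energy: minus log prior minus log likelihood, up to a constant."""
    return (theta @ theta) / (2 * s0**2) + ((y - theta)**2).sum() / (2 * s**2)

def grad_U(theta):
    return theta / s0**2 - (y - theta).sum(axis=0) / s**2

def K(p):
    """Kinetic energy with identity mass matrix M = I."""
    return 0.5 * p @ p
```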
2.2.3 Leapfrog Method

Observe that in the acceptance probability (2.8), if z* = T_t(z^(n)) were evolved analytically, then by property iii), energy conservation, of Hamiltonian dynamics we would have α_HMC(z^(n), z*) ≡ 1; i.e., proposals would always be accepted. In practice, however, it is difficult to solve the Hamilton equations (2.2) analytically, so we need to approximate these equations by discretizing time with some small step size ε. Because of its accuracy (small local discretization error) and stability (controlled global discretization error), the following leapfrog method is commonly used to solve (2.2) numerically:

p(t + ε/2) = p(t) − (ε/2) ∇_θ U(θ(t))
θ(t + ε) = θ(t) + ε ∇_p K(p(t + ε/2))   (2.10)
p(t + ε) = p(t + ε/2) − (ε/2) ∇_θ U(θ(t + ε))

The leapfrog integrator, also known as the Störmer-Verlet method [63] and denoted T̂_ε, is i) time reversible and ii) volume preserving. One can check the time reversibility of T̂_ε by noting that switching the two states z(t) and z(t + ε) and negating time does not change the form of the integrator (2.10). (Strictly, this property is called the (time) symmetry of an integrator, T̂_ε^{−1} = T̂_ε* := T̂_{−ε}, which is not trivial for a discretized solution. According to definition 2.3 of time reversibility, we should instead have checked that flipping the momentum direction, evolving, switching the states, and flipping the momentum direction again preserves the form of the integrator; but since the kinetic energy is quadratic in the momentum in classical mechanics, the two conditions are equivalent.) The volume preservation of T̂_ε can be verified by checking that the Jacobian determinant |∂z(t + ε)/∂z(t)| ≡ 1.

In practice, we numerically solve (2.2) using the leapfrog method for L steps with step size ε to make a proposal, and we either accept it as the new state with probability (2.8), which could be less than 1 in this case, or stay at the current state. During this procedure, the Metropolis updates leave H fluctuating around some fixed value. See [37] for more discussion of the leapfrog method's numerical properties, and [61] and chapter 3 for a deeper interpretation of leapfrog. Algorithm 2.1 below summarizes the HMC steps for generating a sample.

Algorithm 2.1 Hamiltonian Monte Carlo (HMC)
  Initialize θ^(1) = current θ
  Sample new momentum p^(1) ∼ N(0, M)
  Calculate current H(θ^(1), p^(1)) = U(θ^(1)) + K(p^(1))
  for ℓ = 1 to L do
    p^(ℓ+1/2) = p^(ℓ) − (ε/2) ∇_θ U(θ^(ℓ))       % update the momentum for a half step
    θ^(ℓ+1) = θ^(ℓ) + ε M⁻¹ p^(ℓ+1/2)            % update the position for a full step
    p^(ℓ+1) = p^(ℓ+1/2) − (ε/2) ∇_θ U(θ^(ℓ+1))   % update the momentum for a half step
  end for
  Calculate proposed H(θ^(L+1), p^(L+1)) = U(θ^(L+1)) + K(p^(L+1))
  α_HMC = exp{−proposed H + current H}
  if runif(1) < α_HMC then current θ = θ^(L+1) end if
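A direct transcription of Algorithm 2.1 into code might look as follows (Python with NumPy; identity mass matrix assumed, and the standard Gaussian target in the usage example is an arbitrary illustration):

```python
import numpy as np

def hmc_step(theta, U, grad_U, eps, L, rng):
    """One HMC update: sample momentum, run L leapfrog steps, accept/reject."""
    p = rng.standard_normal(theta.shape)               # p ~ N(0, I)
    current_H = U(theta) + 0.5 * p @ p
    th = theta.copy()
    p = p - 0.5 * eps * grad_U(th)                     # initial half step
    for l in range(L):
        th = th + eps * p                              # full position step
        # full momentum step, except a half step to close the trajectory
        p = p - (eps if l < L - 1 else 0.5 * eps) * grad_U(th)
    proposed_H = U(th) + 0.5 * p @ p
    if rng.uniform() < np.exp(current_H - proposed_H):
        return th                                      # accept
    return theta                                       # reject: stay put

# Usage: sample a 2-d standard Gaussian, U = theta^T theta / 2.
rng = np.random.default_rng(1)
theta = np.zeros(2)
samples = []
for _ in range(1000):
    theta = hmc_step(theta, lambda t: 0.5 * t @ t, lambda t: t,
                     eps=0.15, L=20, rng=rng)
    samples.append(theta)
```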
2.3 Discussion

Now we revisit the illustration of Hamiltonian dynamics in figure 2.1. By the definition of the potential energy, U(θ), its minimum corresponds to the maximum of the target density. The sliding puck of figure 2.1 provides the intuition for sampling: recording the proposal after evolving the Hamiltonian dynamics (2.2) for a fixed trajectory length (εL) is equivalent to recording the puck's position at fixed time intervals. The puck moves faster towards the lower-energy region than away from it, so it takes less time for the puck to travel from a higher-energy region to a lower-energy region than in the reverse direction. Being recorded at a constant frequency, the puck therefore has a greater chance of being observed in the lower-energy (higher-density) region.

Even though HMC has the advantage over RWM of guiding its proposals, it has more parameters to tune: the step size ε, the number of leapfrog steps L, and the mass matrix M. The choice of step size ε is crucial: a small value of ε leads to slow convergence of the resulting Markov chain, whereas a large value of ε results in a low acceptance rate for proposals. As suggested by [37], the number of leapfrog steps L can be randomized within a certain range to avoid periodic movement while exploring the distribution. A recent work, the No-U-Turn Sampler (NUTS) [38], gives a tuning-free solution by letting the sampler follow the longest trajectory that does not turn back on itself. The inverse mass matrix M⁻¹ can be chosen as the inverse Hessian of the potential energy evaluated at the density mode θ̂ if the target distribution is well approximated by a multivariate Gaussian; in general, a position-specific matrix G(θ), e.g. the Fisher information, can be adopted [39]. See chapter 4 for more discussion.

3 Split Hamiltonian Monte Carlo

3.1 Introduction

The simple Metropolis algorithm [18] is often effective at exploring low-dimensional distributions, but it can be very inefficient for complex, high-dimensional distributions: successive states may exhibit high autocorrelation, due to the random walk nature of the movement. Faster exploration can be obtained using Hamiltonian Monte Carlo (HMC), which was first introduced by [36], who called it "hybrid Monte Carlo", and which has recently been reviewed by [37]. HMC reduces the random walk behavior of Metropolis by proposing states that are distant from the current state but nevertheless have a high probability of acceptance. These distant proposals are found by numerically simulating Hamiltonian dynamics for some specified amount of fictitious time. For this simulation to be reasonably accurate (as required for a high acceptance probability), the step size used must be suitably small. This step size determines the number of steps needed to produce the proposed new state. Since each step of the simulation requires a costly evaluation of the gradient of the log density, the step size is the main determinant of computational cost.

In this chapter, we show how the technique of "splitting" the Hamiltonian [37, 61] can be used to reduce the computational cost of producing proposals for HMC. In our approach, splitting separates the Hamiltonian, and consequently the simulation of the dynamics, into two parts. We discuss two contexts in which one of these parts can capture most of the rapid variation in the energy function but is computationally cheap. Simulating the other, slowly varying part requires costly steps but can use a large step size. The result is that fewer costly gradient evaluations are needed to produce a distant proposal. We illustrate these splitting methods using logistic regression models. Computer programs for our methods are publicly available from http://www.ics.uci.edu/~babaks/Site/Codes.html.

As an illustration, consider sampling from the following bivariate normal distribution:

θ ∼ N(µ, Σ), with µ = (3, 3)ᵀ and Σ = [[1, 0.95], [0.95, 1]]

For HMC, we set L = 20 and ε = 0.15.

Figure 3.1: Comparison of Hamiltonian Monte Carlo (HMC) and Random Walk Metropolis (RWM) applied to a bivariate normal distribution. Left plot: the first 30 iterations of HMC with 20 leapfrog steps. Right plot: the first 30 iterations of RWM with 20 updates per iteration.
The left plot in figure 3.1 shows the first 30 states from an HMC run started at θ = (0, 0). The density contours of the bivariate normal distribution are shown as gray ellipses. The right plot shows every 20th state from the first 600 iterations of a run of a simple random walk Metropolis (RWM) algorithm (this takes time comparable to that of the HMC run). The proposal distribution for RWM is a bivariate normal with the current state as the mean and 0.15² I_2 as the covariance matrix (the standard deviation of this proposal is the same as the step size of HMC). Figure 3.1 shows that HMC explores the distribution more efficiently, with successive samples farther from each other and autocorrelations smaller. For an extended review of HMC, its properties, and its advantages over RWM, see [37].

In this example, we have assumed that one leapfrog step for HMC (which requires evaluating the gradient of the log density) takes approximately the same computation time as one Metropolis update (which requires evaluating the log density), and that both move approximately the same distance. The benefit of HMC comes from this movement being systematic rather than a random walk. (Indeed, in this two-dimensional example it is better to use Metropolis with a large proposal standard deviation, even though this leads to a low acceptance probability, because this too avoids a random walk. However, in higher-dimensional problems with more than one highly confining direction, a large proposal standard deviation leads to such a low acceptance probability that this strategy is not viable.) We now propose a new approach called Split Hamiltonian Monte Carlo (Split HMC), which further improves the performance of HMC by modifying how steps are done, with the effect of reducing the time for one step or increasing the distance that one step moves.

3.2 Splitting the Hamiltonian

As discussed by [37], variations on HMC can be obtained by using discretizations of Hamiltonian dynamics derived by "splitting" the Hamiltonian, H, into several terms:

H(θ, p) = H_1(θ, p) + H_2(θ, p) + · · · + H_K(θ, p)

We use T_{i,t}, for i = 1, . . . , K, to denote the mapping defined by H_i for time t. Assuming that we can implement the Hamiltonian dynamics for each H_i exactly, the composition T_{1,ε} ∘ T_{2,ε} ∘ . . . ∘ T_{K,ε} is a valid discretization of the Hamiltonian dynamics based on H if the H_i are twice differentiable [61]. This discretization is symplectic and hence preserves volume. It is also reversible if the sequence of H_i is symmetric: H_i(θ, p) = H_{K−i+1}(θ, p).

Indeed, the leapfrog method (2.10) can be regarded as a symmetric splitting of the Hamiltonian H(θ, p) = U(θ) + K(p) as

H(θ, p) = U(θ)/2 + K(p) + U(θ)/2   (3.1)

In this case, H_1(θ, p) = H_3(θ, p) = U(θ)/2 and H_2(θ, p) = K(p). Hamiltonian dynamics for H_1 is

θ̇ = ∂H_1/∂p = 0
ṗ = −∂H_1/∂θ = −½ ∇_θ U(θ)

which, run for a duration equal to the discretized time step ε, gives the first part of a leapfrog step. For H_2, the dynamics is

θ̇ = ∂H_2/∂p = ∇_p K(p)
ṗ = −∂H_2/∂θ = 0

For step size ε, this gives the second part of the leapfrog step. Hamiltonian dynamics for H_3 is the same as that for H_1 since H_1 = H_3, giving the third part of the leapfrog step.
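The correspondence between the splitting (3.1) and the leapfrog update (2.10) can be checked mechanically: composing the three exact sub-flows, each for time ε, reproduces one leapfrog step. A small sketch, assuming the standard kinetic energy K(p) = pᵀp/2:

```python
import numpy as np

def flow_H1(theta, p, t, grad_U):
    """Exact flow of H1 = U(theta)/2: theta is fixed, p drifts."""
    return theta, p - 0.5 * t * grad_U(theta)

def flow_H2(theta, p, t):
    """Exact flow of H2 = K(p) = p^T p / 2: p is fixed, theta drifts."""
    return theta + t * p, p

def leapfrog(theta, p, eps, grad_U):
    """Symmetric composition T_{1,eps} o T_{2,eps} o T_{3,eps}:
    exactly one leapfrog step (2.10)."""
    theta, p = flow_H1(theta, p, eps, grad_U)   # H1 = U/2 for time eps
    theta, p = flow_H2(theta, p, eps)           # H2 = K   for time eps
    theta, p = flow_H1(theta, p, eps, grad_U)   # H3 = U/2 for time eps
    return theta, p
```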
3.2.1 Splitting the Hamiltonian with a partial analytic solution

Suppose the potential energy U(θ) can be written as U_0(θ) + U_1(θ). We can then split H as

H(θ, p) = U_1(θ)/2 + [U_0(θ) + K(p)] + U_1(θ)/2   (3.2)

Here, H_1(θ, p) = H_3(θ, p) = U_1(θ)/2 and H_2(θ, p) = U_0(θ) + K(p). The first and last terms in this splitting are similar to those of equation (3.1), except that U_1(θ) replaces U(θ), so the first and last parts of a leapfrog step remain as before, except that we use U_1(θ) rather than U(θ) to update p. Now suppose that the middle part of the leapfrog, which is based on the Hamiltonian U_0(θ) + K(p), can be handled analytically; that is, we can compute its exact dynamics for any duration of time. We hope that, since this part of the simulation introduces no error, we will be able to use a larger step size, and hence take fewer steps, reducing the computation time for the dynamical simulations.

We are mainly interested in situations where U_0(θ) provides a reasonable approximation to U(θ); in particular, for Bayesian applications we can use the Laplace approximation. Specifically, we approximate U(θ) with U_0(θ), the energy function of N(θ̂, J^{−1}(θ̂)), where θ̂ is the posterior mode (maximum a posteriori, MAP), obtained by a fast optimization algorithm such as the Newton-Raphson method, and J(θ̂) is the Hessian matrix of U at θ̂. Finally, we set U_1(θ) = U(θ) − U_0(θ), the error in this approximation. [40] have recently proposed a similar splitting strategy for HMC, in which a Gaussian component is handled analytically, in the context of high-dimensional approximations to a distribution on an infinite-dimensional Hilbert space. In such applications, the Gaussian distribution will typically be derived from the problem specification, rather than found as a numerical approximation, as we do here.

Using a normal approximation in which U_0(θ) = ½(θ − θ̂)ᵀ J(θ̂)(θ − θ̂), and letting K(p) = ½pᵀp (the energy of the standard normal distribution), H_2(θ, p) = U_0(θ) + K(p) in equation (3.2) is quadratic, and Hamilton's equations form a system of first-order linear differential equations that can be handled analytically [64]. Specifically, setting θ_o = θ − θ̂, the dynamical equations can be written as

d/dt [θ_o(t); p(t)] = [[0, I], [−J(θ̂), 0]] [θ_o(t); p(t)]   (3.3)

which can be denoted ż(t) = Az(t), where z = (θ_o, p) and A = [[0, I], [−J(θ̂), 0]]. The solution of this system is z(t) = e^{At} z(0), where z(0) is the initial value at time t = 0 and e^{At} = Σ_{k=0}^{+∞} (At)^k / k! is a matrix exponential. This matrix exponential can be simplified to e^{At} = Γ e^{Dt} Γ^{−1} using the eigendecomposition A = Γ D Γ^{−1}, where Γ is invertible and D is the diagonal matrix of eigenvalues. Therefore the solution to the system (3.3) is

z(t) = Γ e^{Dt} Γ^{−1} z(0)

and e^{Dt} is easily computed by exponentiating the diagonal elements of D times t.

Remark 3.1. Note that Γ and D may be complex matrices, since A is not symmetric, but the solution z(t) must be real. This can be shown as follows. Eigendecompose J(θ̂) = Γ* D* (Γ*)^{−1}, where Γ* and D* are real because J(θ̂) is symmetric and positive definite. Let θ* = (Γ*)ᵀ θ_o and p* = (Γ*)ᵀ p. The dynamics (3.3) can then be solved as

[θ*(t); p*(t)] = diag((D*)^{−1/2}, I) · [[cos((D*)^{1/2} t), sin((D*)^{1/2} t)], [−sin((D*)^{1/2} t), cos((D*)^{1/2} t)]] · diag((D*)^{1/2}, I) · [θ*(0); p*(0)]

The solution can be recognized as stretching, rotating, then shrinking the initial state, which reflects the symplectic structure of the dynamical system (3.3).

The above analytical solution is of course only for the middle part (denoted H_2) of equation (3.2).
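The exact update z(t) = Γe^{Dt}Γ^{−1}z(0) is easy to precompute once per run. A sketch (Python with NumPy; the matrix J_hat below is a random symmetric positive definite stand-in for the Hessian J(θ̂) of an actual model):

```python
import numpy as np

D_dim = 3
rng = np.random.default_rng(2)
B = rng.standard_normal((D_dim, D_dim))
J_hat = B @ B.T + D_dim * np.eye(D_dim)      # stand-in for J(theta_hat), SPD

# Build A = [[0, I], [-J, 0]] and precompute R_eps = Gamma exp(D eps) Gamma^{-1}
A = np.block([[np.zeros((D_dim, D_dim)), np.eye(D_dim)],
              [-J_hat, np.zeros((D_dim, D_dim))]])
eps = 0.1
evals, Gamma = np.linalg.eig(A)              # complex in general
R_eps = ((Gamma * np.exp(evals * eps)) @ np.linalg.inv(Gamma)).real

# One exact step of the middle (Gaussian) part of the dynamics:
z0 = rng.standard_normal(2 * D_dim)          # stacked (theta - theta_hat, p)
z1 = R_eps @ z0

# H2 = 0.5 theta_o^T J theta_o + 0.5 p^T p is conserved exactly by this step.
H2 = lambda z: 0.5 * z[:D_dim] @ J_hat @ z[:D_dim] + 0.5 * z[D_dim:] @ z[D_dim:]
assert np.isclose(H2(z0), H2(z1))
```

As Remark 3.1 predicts, the imaginary parts of Γe^{Dε}Γ^{−1} cancel, so taking the real part only removes numerical round-off.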
We still need to approximate the overall Hamiltonian dynamics based on H using the leapfrog method. Algorithm 3.1 shows the corresponding leapfrog steps: after an initial step of size ε/2 based on U_1(θ), we obtain the exact solution for a time step of ε based on H_2(θ, p) = U_0(θ) + K(p), and finish by taking another step of size ε/2 based on U_1(θ).

3.2.2 Splitting the Hamiltonian by splitting the data

The method discussed in the previous section requires that we be able to handle the Hamiltonian H_2(θ, p) = U_0(θ) + K(p) analytically. If this is not the case, splitting the Hamiltonian in this way may still be beneficial if the computational cost of U_0(θ) is substantially lower than that of U(θ). In these situations, we can use the following split:

H(θ, p) = U_1(θ)/2 + Σ_{m=1}^{M} [ U_0(θ)/(2M) + K(p)/M + U_0(θ)/(2M) ] + U_1(θ)/2   (3.4)

for some M > 1. This discretization can be considered a nested leapfrog, where the outer part takes half steps to update p based on U_1 alone, and the inner part involves M leapfrog steps of size ε/M based on U_0. Algorithm 3.2 implements this nested leapfrog method.

Algorithm 3.1 Leapfrog for split Hamiltonian Monte Carlo with a partial analytic solution
  R_ε ← Γ e^{Dε} Γ^{−1}
  Sample initial values for p from N(0, I)
  for ℓ = 1 to L do
    p ← p − (ε/2) ∇_θ U_1(θ)
    θ_o ← θ − θ̂
    z_0 ← (θ_o, p)
    (θ_o, p) ← R_ε z_0
    θ ← θ_o + θ̂
    p ← p − (ε/2) ∇_θ U_1(θ)
  end for

Algorithm 3.2 Nested leapfrog for split Hamiltonian Monte Carlo with splitting of data
  Sample initial values for p from N(0, I)
  for ℓ = 1 to L do
    p ← p − (ε/2) ∇_θ U_1(θ)
    for m = 1 to M do
      p ← p − (ε/(2M)) ∇_θ U_0(θ)
      θ ← θ + (ε/M) p
      p ← p − (ε/(2M)) ∇_θ U_0(θ)
    end for
    p ← p − (ε/2) ∇_θ U_1(θ)
  end for

For example, suppose our statistical analysis involves a large data set with many observations, but we believe that a small subset of the data is sufficient to build a model that performs reasonably well (compared to the model that uses all the observations). In this case, we can construct U_0(θ) based on a small part of the observed data and use the remaining observations to construct U_1(θ). If this strategy is successful, we will be able to use a large step size for the steps based on U_1, reducing the cost of a trajectory computation. In detail, we divide the observed data, y, into two subsets: R_0, which is used to construct U_0(θ), and R_1, which is used to construct U_1(θ):

U(θ) = U_0(θ) + U_1(θ)
U_0(θ) = − log[P(θ)] − Σ_{i∈R_0} log[P(y_i|θ)]   (3.5)
U_1(θ) = − Σ_{i′∈R_1} log[P(y_{i′}|θ)]

Note that the prior P(θ) appears in U_0(θ) only. [37] discusses a related strategy for splitting the Hamiltonian by splitting the observed data into multiple subsets. However, instead of randomly splitting the data, as proposed there, here we split the data by building an initial model based on the MAP estimate, θ̂, and using this model to identify the small subset of data that captures most of the information in the full data set.
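Algorithm 3.2 translates almost line by line into code. A sketch (Python; grad_U0 and grad_U1 are placeholders for the gradients of the two energy terms in (3.5)):

```python
def split_leapfrog_data(theta, p, eps, L, M, grad_U0, grad_U1):
    """Nested leapfrog (Algorithm 3.2): the slowly varying but costly U1 term
    is evaluated only twice per outer step, while the inner loop takes M cheap
    leapfrog steps driven by U0, built from a small subset of the data."""
    for _ in range(L):
        p = p - 0.5 * eps * grad_U1(theta)
        for _ in range(M):
            p = p - eps / (2 * M) * grad_U0(theta)
            theta = theta + (eps / M) * p
            p = p - eps / (2 * M) * grad_U0(theta)
        p = p - 0.5 * eps * grad_U1(theta)
    return theta, p
```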
The logistic regression model assigns probabilities to the two possible classes (denoted by 0 and 1) in case i (for i = 1, ..., n) as follows:

\[
P(y_i = 1|x_i, \alpha, \beta) = \frac{\exp(\alpha + x_i^T\beta)}{1 + \exp(\alpha + x_i^T\beta)}
\]

Here, xi is the vector of length p with the observed values of the covariates in case i, α is the intercept, and β is the vector of p regression coefficients. We use θ to denote the vector of all p + 1 unknown parameters, (α, β).

Let P(θ) be the prior distribution for θ. The posterior distribution of θ given x and y is proportional to P(θ)∏_{i=1}^{n} P(yi|xi, θ). The corresponding potential energy function is

\[
U(\theta) = -\log[P(\theta)] - \sum_{i=1}^{n}\log[P(y_i|x_i, \theta)]
\]

We assume the following (independent) priors for the model parameters:

\[
\alpha \sim \mathrm{N}(0, \sigma_\alpha^2), \qquad \beta_j \sim \mathrm{N}(0, \sigma_\beta^2), \quad j = 1, \ldots, p
\]

where σα and σβ are known constants. The potential energy function for the above logistic regression model is therefore as follows:

\[
U(\theta) = \frac{\alpha^2}{2\sigma_\alpha^2} + \sum_{j=1}^{p}\frac{\beta_j^2}{2\sigma_\beta^2} - \sum_{i=1}^{n}\left[y_i(\alpha + x_i^T\beta) - \log(1 + \exp(\alpha + x_i^T\beta))\right]
\]

The partial derivatives of the energy function with respect to α and the βj are

\[
\frac{\partial U}{\partial \alpha} = \frac{\alpha}{\sigma_\alpha^2} - \sum_{i=1}^{n}\left[y_i - \frac{\exp(\alpha + x_i^T\beta)}{1 + \exp(\alpha + x_i^T\beta)}\right], \qquad
\frac{\partial U}{\partial \beta_j} = \frac{\beta_j}{\sigma_\beta^2} - \sum_{i=1}^{n}x_{ij}\left[y_i - \frac{\exp(\alpha + x_i^T\beta)}{1 + \exp(\alpha + x_i^T\beta)}\right]
\]

3.3.1 Split HMC with a partial analytical solution for a logistic model

To apply algorithm 3.1 for Split HMC to this problem, we approximate the potential energy function U(θ) for the logistic regression model with the potential energy function U0(θ) of the normal distribution N(θ̂, J⁻¹(θ̂)), where θ̂ is the MAP estimate of the model parameters. U0(θ) usually provides a reasonable approximation to U(θ), as illustrated in figure 3.3. In the plot on the left, the solid curve shows the value of the potential energy, U, as β1 varies, with β2 and α fixed to their MAP values, while the dashed curve shows U0 for the approximating normal distribution. The right plot of figure 3.3 compares the partial derivatives of U and U0 with respect to β1, showing that ∂U0/∂βj provides a reasonable linear approximation to ∂U/∂βj.

Since there is no error when solving Hamiltonian dynamics based on U0(θ), we would expect the total discretization error of the steps taken by algorithm 3.1 to be less than that of the standard leapfrog method for a given step size, so that we will be able to use a larger step size (and hence need fewer steps for a given trajectory length) while still maintaining a good acceptance rate. The step size will still be limited to the region of stability imposed by the discretization error from U1 = U − U0, but this limit will tend to be larger than for the standard leapfrog method.

Figure 3.3: Left plot: The potential energy, U, for the logistic regression model (solid curve) and its normal approximation, U0 (dashed curve), as β1 varies, with other parameters at their MAP values. Right plot: The partial derivatives of U and U0 with respect to β1.

3.3.2 Split HMC with splitting of data for a logistic model

To apply algorithm 3.2 to this logistic regression model, we split the Hamiltonian by splitting the data into two subsets. Consider the illustrative example discussed above. In the left plot of figure 3.4, the thick line represents the classification boundary using the MAP estimate, θ̂. For the points that fall on this boundary line, the estimated probabilities for the two groups are equal, both being 1/2.
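Anticipating the construction formalized next, a sketch in R of this boundary-based selection: the fraction f of observations whose MAP-fitted class-1 probability is closest to 1/2 is flagged, assuming a design matrix X and a MAP estimate theta_hat obtained, e.g., with optim().

```r
# A sketch of the data split: flag the fraction `f` of observations whose
# MAP-fitted probability of class 1 is closest to 1/2.
p_hat <- drop(plogis(theta_hat[1] + X %*% theta_hat[-1]))  # P(y = 1 | x, theta_hat)
in_R0 <- rank(abs(p_hat - 0.5)) <= ceiling(f * nrow(X))    # TRUE for points in R0
```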
The probabilities of the two classes become less similar as the distance of the covariates from this line increases. We will define U0 using the points within the region, R0, within some distance of this line, and define U1 using the points in the region, R1, at a greater distance from this line. Equivalently, R0 contains those points for which the probability that y = 1 (based on the MAP estimates) is closest to 1/2. The shaded area in figure 3.4 shows the region, R0, containing the 30% of the observations closest to the MAP line, or equivalently the 30% of observations for which the probability of class 1 is closest (in either direction) to 1/2. The unshaded region containing the remaining data points is denoted as R1. Using these two subsets, we can split the energy function U(θ) into two terms: U0(θ), based on the data points that fall within R0, and U1(θ), based on the data points that fall within R1 (see equation (3.5)). Then, we use equation (3.4) to split the Hamiltonian dynamics. Note that U0 is not used to approximate the potential energy function, U (the exact value of U is used for the acceptance test at the end of the trajectory, ensuring that the equilibrium distribution is exactly the target distribution). Rather, ∂U0/∂βj is used to approximate ∂U/∂βj, which is the costly computation when we simulate Hamiltonian dynamics.

Figure 3.4: Left plot: A split of the data into two parts based on the MAP model, represented by the solid line; the energy function U is then divided into U0, based on the data points in R0, and U1, based on the data points in R1. Right plot: The partial derivatives of U and U0 with respect to β1, with other parameters at their MAP values.

To see that it is appropriate to split the data according to how close the probability of class 1 is to 1/2, note first that the leapfrog step of equation (2.10) will have no error if the derivatives ∇θU do not depend on θ, that is, when the second derivatives of U are zero. Recall that for the logistic model,

\[
\frac{\partial U}{\partial \beta_j} = \frac{\beta_j}{\sigma_\beta^2} - \sum_{i=1}^{n}x_{ij}\left[y_i - \frac{\exp(\alpha + x_i^T\beta)}{1 + \exp(\alpha + x_i^T\beta)}\right]
\]

from which we get

\[
\begin{aligned}
\frac{\partial^2 U}{\partial\beta_j\,\partial\beta_{j'}}
&= \frac{\delta_{jj'}}{\sigma_\beta^2} + \sum_{i=1}^{n}x_{ij}x_{ij'}\left[\frac{\exp(\alpha + x_i^T\beta)}{1 + \exp(\alpha + x_i^T\beta)} - \left(\frac{\exp(\alpha + x_i^T\beta)}{1 + \exp(\alpha + x_i^T\beta)}\right)^2\right] \\
&= \frac{\delta_{jj'}}{\sigma_\beta^2} + \sum_{i=1}^{n}x_{ij}x_{ij'}\,P(y_i = 1|x_i, \alpha, \beta)\left[1 - P(y_i = 1|x_i, \alpha, \beta)\right]
\end{aligned}
\]

The product P(yi = 1|xi, α, β)[1 − P(yi = 1|xi, α, β)] attains its maximum where P(yi = 1|xi, α, β) = 1/2 and is symmetric about that point, so the observations with fitted probabilities nearest 1/2 contribute most to the second derivatives of U; placing them in R0, where the dynamics is simulated with the smaller inner step size, justifies our criterion for selecting points in R0. The right plot of figure 3.4 shows the approximation of ∂U/∂β1 by ∂U0/∂β1, with β2 and α fixed to their MAP values.

3.4 Experiments

In this section, we use simulated and real data to compare our proposed methods to standard HMC. For each problem, we set the number of leapfrog steps to L = 20 for standard HMC, and find ε such that the acceptance probability (AP) is close to 0.65 [37]. We set L and ε for the Split HMC methods such that the trajectory length, εL, remains the same, but with a larger step size and hence a smaller number of steps. Note that this trajectory length is not necessarily optimal for these problems, but this should not affect our comparisons, in which the length is kept fixed. We try to choose ε for the Split HMC methods such that the acceptance probability is equal to that of standard HMC.
However, increasing the step size beyond a certain point leads to instability of trajectories, in which the error of the Hamiltonian grows rapidly with L [37], so that proposals are rejected with very high probability. This sometimes limits the step size of Split HMC to values at which the acceptance probability is greater than the 0.65 aimed at for standard HMC. Additionally, to avoid near-periodic Hamiltonian dynamics [37], we randomly vary the step size over a small range. Specifically, at each iteration of MCMC, we sample the step size from the Uniform(0.8ε, ε) distribution, where ε is the reported step size for each experiment.

To measure the efficiency of each sampling method, we use the following autocorrelation time (ACT) [3, 17]. Throughout this section, we set the number of Markov chain Monte Carlo (MCMC) iterations for simulating posterior samples to N = 50000.

Definition 3.1 (Autocorrelation Time). Given N posterior samples, we divide them into batches of size B; the autocorrelation time τ can then be estimated as

\[
\tau = B\,\frac{S_b^2}{S^2}
\]

where S² is the sample variance and S_b² is the sample variance of the batch means.

Remark 3.2. Autocorrelation time can be roughly interpreted as the number of MCMC transitions required to produce samples that can be considered independent. In practice, the posterior samples can be divided into N^{1/3} batches of size B = N^{2/3} [65].

For the logistic regression problems discussed in this section, we could find the autocorrelation time separately for each parameter and summarize the autocorrelation times using their maximum value (i.e., for the slowest moving parameter) to compare different methods. However, since one common goal is to use logistic regression models for prediction, we look at the autocorrelation time, τ, for the log likelihood, Σ_{i=1}^{n} log[P(yi|xi, θ)], using the posterior samples of θ. We also look at the autocorrelation time for Σ_j βj² (denoting it τβ), since this may be more relevant when the goal is interpretation of parameter estimates.

We adjust τ (and similarly τβ) to account for the varying computation time needed by the different methods in two ways. One is to compare different methods using τ × s, where s is the CPU time per iteration, using an implementation written in R. This measures the CPU time required to produce samples that can be regarded as independent. We also compare in terms of τ × g, where g is the number of gradient computations on the number of cases in the full data set required for each trajectory simulated by HMC. This will be equal to the number of leapfrog steps, L, for standard HMC or Split HMC using a normal approximation. When using data splitting with a fraction f of data in R0 and M inner leapfrog steps, g will be (fM + (1 − f)) × L. In general, we expect that computation time will be dominated by the gradient computations counted by g, so that τ × g will provide a measure of performance independent of any particular implementation. In our experiments, s was close to being proportional to g, except for slightly larger than expected times for Split HMC with data splitting.

Note that compared to standard HMC, our two methods involve some computational overhead for finding the MAP estimate. However, the additional overhead associated with finding the MAP estimate remains negligible (less than a second for most examples discussed here) compared to the sampling time.
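As a concrete rendering of Definition 3.1, the batch-means estimator can be written in a few lines of R. This sketch is ours for illustration and takes the chain of scalar summaries (e.g., the log-likelihood values along the chain) as input.

```r
# A sketch of the batch-means autocorrelation time of Definition 3.1,
# with roughly N^(1/3) batches of size B = N^(2/3) (Remark 3.2).
act <- function(x) {
  N <- length(x)
  B <- floor(N^(2 / 3))                 # batch size
  n_batch <- floor(N / B)               # about N^(1/3) batches
  batch_means <- colMeans(matrix(x[1:(n_batch * B)], nrow = B))
  B * var(batch_means) / var(x)         # tau = B * S_b^2 / S^2
}
```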
3.4.1 Simulated data

We first tested the methods on a simulated data set with 100 covariates and 10000 observations. The covariates were sampled as xij ~ N(0, σj²), for i = 1, ..., 10000 and j = 1, ..., 100, with σj set to 5 for the first five variables, to 1 for the next five variables, and to 0.2 for the remaining 90 variables. We sampled true parameter values, α and βj, independently from N(0, 1) distributions. Finally, we sampled the class labels according to the model, as yi ~ Bernoulli(πi) with logit(πi) = α + xiᵀβ.

For the Bayesian logistic regression model, we assumed normal priors with mean zero and standard deviation 5 for α and βj, where j = 1, ..., 100. We ran standard HMC, Split HMC with normal approximation, and Split HMC with data splitting for N = 50000 iterations. For standard HMC, we set L = 20 and ε = 0.015, so the trajectory length was 20 × 0.015 = 0.3. For Split HMC with normal approximation and Split HMC with data splitting, we reduced the number of leapfrog steps to 10 and 3 respectively, while increasing the step sizes so that the trajectory length remained 0.3. For the data splitting method, we use 40% of the data points for U0 and set M = 9, which makes g equal to 4.2L. Since we set L = 3, we have g = 12.6, which is smaller than the g = L = 20 used for the standard HMC algorithm.

                 HMC      Split HMC
                          Normal Appr.   Data Splitting
  L              20       10             3
  g              20       10             12.6
  s              0.187    0.087          0.096
  AP             0.69     0.74           0.74
  τ              4.6      3.2            3.0
  τ × g          92       32             38
  τ × s          0.864    0.284          0.287
  τβ             11.7     13.5           7.3
  τβ × g         234      135            92
  τβ × s         2.189    1.180          0.703

Table 3.1: Split HMC (with normal approximation and data splitting) compared to standard HMC on a simulated data set with n = 10000 observations and p = 100 covariates. Here, L is the number of leapfrog steps, g is the number of gradient computations, s is the CPU time (in seconds) per iteration, AP is the acceptance probability, τ is the autocorrelation time based on the log likelihood, and τβ is the autocorrelation time based on Σj βj².

Table 3.1 shows the results for the three methods. The CPU times (in seconds) per iteration, s, and τ × s for the Split HMC methods are substantially lower than for standard HMC. The comparison is similar when looking at τ × g. Based on τβ × s and τβ × g, however, the improvement in efficiency is more substantial for the data splitting method compared to the normal approximation method, mainly because of the difference in their corresponding values of τβ.

3.4.2 Results on real data sets

In this section, we evaluate our proposed method using three real binary classification problems. The data for these three problems are available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html). For all data sets, we standardized the numerical variables to have mean zero and standard deviation 1. Further, we assumed normal priors with mean zero and standard deviation 5 for the regression parameters. We used the setup described at the beginning of section 3.4, running each Markov chain for N = 50000 iterations. Table 3.2 summarizes the results using the three sampling methods.

The first problem, StatLog, involves using multi-spectral values of pixels in a satellite image in order to classify the associated area into soil or cotton crop. (In the original data, different types of soil are identified.) The sample size for this data set is n = 4435, and the number of features is p = 37. For standard HMC, we set L = 20 and ε = 0.08.
For the two Split HMC methods with normal approximation and data splitting, we reduced L to 14 and 3 respectively, while increasing ε so that ε × L remained the same as for standard HMC. For the data splitting method, we use 40% of the data points for U0 and set M = 10. As seen in the table, the Split HMC methods improve efficiency, with the data splitting method performing better than the normal approximation method.

  StatLog (n = 4435, p = 37)
                 HMC      Normal Appr.   Data Splitting
  L              20       14             3
  g              20       14             13.8
  s              0.033    0.026          0.023
  AP             0.69     0.74           0.85
  τ              5.6      6.0            4.0
  τ × g          112      84             55
  τ × s          0.190    0.144          0.095
  τβ             5.6      4.7            3.8
  τβ × g         112      66             52
  τβ × s         0.191    0.122          0.090

  CTG (n = 2126, p = 21)
                 HMC      Normal Appr.   Data Splitting
  L              20       13             2
  g              20       13             9.8
  s              0.011    0.008          0.005
  AP             0.69     0.77           0.81
  τ              6.2      7.0            5.0
  τ × g          124      91             47
  τ × s          0.069    0.055          0.028
  τβ             24.4     19.6           11.5
  τβ × g         488      255            113
  τβ × s         0.271    0.154          0.064

  Chess (n = 3196, p = 36)
                 HMC      Normal Appr.   Data Splitting
  L              20       9              2
  g              20       13             11.8
  s              0.022    0.011          0.013
  AP             0.62     0.73           0.62
  τ              10.7     12.8           12.1
  τ × g          214      115            143
  τ × s          0.234    0.144          0.161
  τβ             23.4     18.9           19.0
  τβ × g         468      246            224
  τβ × s         0.511    0.212          0.252

Table 3.2: HMC and Split HMC (normal approximation and data splitting) on three real data sets. Here, L is the number of leapfrog steps, g is the number of gradient computations, s is the CPU time (in seconds) per iteration, AP is the acceptance probability, τ is the autocorrelation time based on the log likelihood, and τβ is the autocorrelation time based on Σj βj².

The second problem, CTG, involves analyzing 2126 fetal cardiotocograms along with their respective diagnostic features [66]. The objective is to determine whether the fetal state class is "pathologic" or not. The data include 2126 observations and 21 features. For standard HMC, we set L = 20 and ε = 0.08. We reduced the number of leapfrog steps to 13 and 2 for Split HMC with normal approximation and data splitting respectively. For the latter, we use 30% of the data points for U0 and set M = 14. Both splitting methods improved performance significantly.

The objective of the last problem, Chess, is to predict chess endgame outcomes: either "white can win" or "white cannot win". This data set includes n = 3196 instances, where each instance is a board description for the chess endgame. There are p = 36 attributes describing the board. For standard HMC, we set L = 20 and ε = 0.09. For the two Split HMC methods with normal approximation and data splitting, we reduced L to 9 and 2 respectively. For the data splitting method, we use 35% of the data points for U0 and set M = 15. Using the Split HMC methods, the computational efficiency is improved substantially compared to standard HMC. This time, however, the normal approximation approach performs better than the data splitting method in terms of τ × g, τ × s, and τβ × s, while the latter performs better in terms of τβ × g.

3.5 Discussion

We have proposed two new methods for improving the efficiency of HMC, both based on splitting the Hamiltonian in a way that allows much of the movement around the state space to be performed at low computational cost. While we demonstrated our methods on binary logistic regression models, they can be extended to multinomial logistic (MNL) models for multiple classes. For MNL models, the regression parameters for p covariates and K classes form a matrix of (p + 1) rows and K columns, which we can regard as a vector of (p + 1) × K elements.
For Split HMC with normal approximation, we can define U0(θ) using an approximate multivariate normal N(θ̂, J⁻¹(θ̂)) as before. For Split HMC with data splitting, we can still construct U0(θ) using a small subset of data, based on the class probabilities for each data item found using the MAP estimates of the parameters (the best way of doing this is a subject for future research). The data splitting method could be further extended to any model for which it is feasible to find a MAP estimate, and then divide the data into two parts based on "residuals" of some form.

Although in theory our method can be used for many statistical models, its usefulness is of course limited by how well the posterior distribution can be approximated by a Gaussian distribution in algorithm 3.1, and how well the gradient of the energy function can be approximated using a small but influential subset of data in algorithm 3.2. For example, algorithm 3.1 might not perform well for neural network models, for which the posterior distribution is usually multimodal. When using neural network classification models, one could, however, use algorithm 3.2, selecting a small subset of data using a simple logistic regression model. This could be successful when a linear model performs reasonably well, even if the optimal decision boundary is nonlinear.

The scope of algorithm 3.1 proposed in this chapter might be broadened by finding better methods to approximate the posterior distribution, such as variational Bayes methods. Future research could involve finding tractable approximations to the posterior distribution other than normal distributions. Also, one could investigate other methods for splitting the Hamiltonian dynamics by splitting the data: for example, fitting a support vector machine (SVM) to binary classification data and using the support vectors to construct U0.

While the results on simulated data and real problems presented in this chapter have demonstrated the advantages of splitting the Hamiltonian dynamics in terms of improving sampling efficiency, our proposed methods do require a preliminary analysis of the data, mainly finding the MAP estimate. As mentioned above, the performance of our approach obviously depends on how well the corresponding normal distribution based on the MAP estimate approximates the posterior distribution, or how well a small subset of data found using this MAP estimate captures the overall patterns in the whole data set. Moreover, this preliminary analysis involves some computational overhead. For many problems, however, the computational cost associated with finding the MAP estimate is negligible compared to the potential improvement in sampling efficiency for the full Bayesian model. For most of the examples discussed here, the additional computational cost is less than a second. Of course, there are situations for which finding the MAP estimate could be an issue; this is especially true for high dimensional problems. For such cases, it might be more practical to use algorithm 3.2 after selecting a small but influential subset of data based on probabilities found using a simpler model. For the neural network example discussed above, we can use a simple logistic regression model with maximum likelihood estimates to select the data points for U0.
Although normal approximations have been used for Bayesian inference in the past [see 67], here we use the approximation to explore the parameter space more efficiently while still sampling from the exact distribution. One could of course use the approximate normal (Laplace) distribution as a proposal distribution in a Metropolis-Hastings algorithm; with this approach, however, the acceptance rates drop substantially (below 10%) for our examples.

Another approach to improving HMC has recently been proposed by [39]. Their method, Riemannian HMC (RHMC), can also substantially improve performance. RHMC utilizes the geometric properties of the parameter space to choose good directions of exploration, typically at higher computational cost, in order to produce distant proposals with high probability of acceptance. In contrast, our method attempts to find a simple approximation to the Hamiltonian to reduce the computational time required for reaching distant states. It is possible that these approaches could be combined to produce a method that performs better than either method alone. The recent proposals by [38] for automatic tuning of HMC could also be combined with our Split HMC methods.

4 Lagrangian Monte Carlo

4.1 Introduction

Hamiltonian Monte Carlo (HMC) [36] reduces the random walk behavior of the Metropolis-Hastings algorithm by proposing samples distant from the current state, which nevertheless have a high probability of being accepted. These distant proposals are found by numerically simulating Hamiltonian dynamics for some specified amount of fictitious time [37]. Hamiltonian dynamics can be represented by a function, known as the Hamiltonian, of model parameters θ ~ π(θ) and auxiliary momentum parameters p ~ N(0, M) (with the same dimension as θ) as follows:

\[
H(\theta, p) = -\log\pi(\theta) + \frac{1}{2}p^T M^{-1} p \tag{4.1}
\]

where M is a symmetric, positive-definite mass matrix. Hamilton's equations, which involve differential equations derived from H, determine how θ and p change over time. In practice, however, solving these equations exactly is difficult in general, so we need to approximate them by discretizing time, using some small step size ε. For this purpose, the leapfrog method (2.10) is commonly used. The numerical discretization of Hamiltonian dynamics is restricted by the smallest eigen-direction, requiring a small step size to maintain stability.

[39] propose a new method, called Riemannian HMC (RHMC), that exploits the geometric properties of the parameter space to improve the efficiency of standard HMC, especially in sampling distributions with complex structure (e.g., high correlation, non-Gaussian shape). Simulating the resulting dynamics, however, is computationally intensive, since it involves solving two implicit equations, which require additional iterative numerical computation (e.g., fixed-point iteration).

In an attempt to increase the speed of RHMC, we propose a new integrator that is completely explicit: we propose to replace momentum with velocity in the definition of the Riemannian Hamiltonian dynamics. As we will see, this is equivalent to using Lagrangian dynamics as opposed to Hamiltonian dynamics. By doing so, we eliminate one of the implicit steps in RHMC. Next, we construct a time-symmetric integrator to remove the remaining implicit step in RHMC. This leads to a valid sampling scheme (i.e., one that converges to the true target distribution) that involves only explicit equations.
We refer to this algorithm as Lagrangian Monte Carlo (LMC). In what follows, we begin with a brief review of RHMC and its geometric integrator in section 4.2. Section 4.3 introduces our proposed semi-explicit integrator, based on defining the dynamics in terms of velocity as opposed to momentum. Next, in section 4.4, we eliminate the remaining implicit equation and propose a fully explicit integrator. In section 4.5, we use simulated and real data to evaluate our methods' performance. Finally, in section 4.6, we discuss some possible future research directions.

4.2 Riemannian Hamiltonian Monte Carlo

As discussed above, although HMC explores the parameter space more efficiently than random walk Metropolis does, it does not fully exploit the geometric properties of the parameter space defined by the density π(θ). Indeed, [39] argue that dynamics over Euclidean space may not be appropriate to guide the exploration of the parameter space. To address this issue, they propose a new method that exploits the Riemannian geometry of the parameter space to improve standard HMC's efficiency by automatically adapting to the local structure. They do this by replacing the fixed mass matrix M in standard HMC with a more informative position-specific matrix G(θ), which is set to the Fisher information matrix in this chapter. The resulting method is named Riemannian Hamiltonian Monte Carlo (RHMC).

As an illustrative example, figure 4.1 shows the sampling paths of random walk Metropolis (RWM), HMC, and RHMC for an artificially created banana-shaped distribution [see 39, discussion by Luke Bornn and Julien Cornebise]. For this example, we fix the trajectory length and choose the step sizes such that the acceptance probability for all three methods remains around 0.7. RWM moves slowly and spends most of its iterations at the distribution's low-density tail, and HMC explores the parameter space in an indirect way, while RHMC moves directly to the high density region and explores the distribution more efficiently.

Figure 4.1: The first 10 iterations in sampling from a banana-shaped distribution with random walk Metropolis (RWM), Hamiltonian Monte Carlo (HMC), and Riemannian HMC (RHMC). For all three methods, the trajectory length (i.e., step size ε times number of integration steps L) is set to 1. For RWM, L = 17; for HMC, L = 7; and for RHMC, L = 5. Solid red lines are the sampling paths, and black circles are the accepted proposals.

4.2.1 Hamiltonian dynamics on Riemannian manifold

Following [46], we define a family of probability distributions as a manifold in the following sense.

Definition 4.1 (Statistical Manifold). Consider a family of probability distributions parametrized by a D-dimensional vector θ:

\[
\mathcal{M}^D := \left\{\pi_\theta = \pi(\cdot\,;\theta) : X \to \mathbb{R} \,\middle|\, \pi_\theta(x) \geq 0,\ \int_X \pi_\theta(x)\,dx = 1,\ \forall\theta \in \Theta \subset \mathbb{R}^D\right\}
\]

where the probability density π(·) is defined in general as a Radon-Nikodym derivative with respect to a σ-finite measure, e.g. Lebesgue measure, on the probability space X. If there exists a coordinate system A = {(φ, U) | φ : U ⊂ R^D → M^D, θ ↦ π_θ, U open} satisfying

i) any parametrization φ ∈ A is a one-to-one mapping U → M^D;

ii) [compatible transitions] given any other one-to-one mapping ψ : V ⊂ R^D → M^D, the following holds: ψ ∈ A ⟺ ψ⁻¹ ∘ φ is a C∞ diffeomorphism (both the mapping and its inverse are C∞),
then we call (M^D, A) a C∞ differentiable manifold, or statistical manifold, regarding all compatible parametrizations as equivalent.

Remark 4.1. It is assumed that there exists a true underlying distribution π*(·) that governs the generation of the observations x1, ..., xN. Although π*(·) is unknown, the objective is often to estimate it in order to best model the given data D. In Bayesian statistics, it is of interest to obtain the posterior π(θ|x) of the model parameter θ given a certain prior. Each π(θ|x) specified by a vector of parameters θ is an element of M^D, a model to explain the given observations. When substituting in the given data D, π(θ|D) becomes a scalar. For convenience in the following discussion, we assume that for all x ∈ X, the function θ ↦ π(x; θ) is C∞.

In order to calculate quantities such as length, area, and volume, and to form Hamiltonian dynamics on the manifold M^D, we need to introduce a Riemannian metric on M^D [46, 68].

Definition 4.2 (Riemannian Metric). A Riemannian metric on a smooth manifold M^D is a correspondence which associates to each point πθ ∈ M^D an inner product ⟨·, ·⟩π (a symmetric, bilinear, positive-definite form) on the tangent space TπM, such that for any vector fields X = xᵀ∂/∂θ(πθ) and Y = yᵀ∂/∂θ(πθ) on M^D,

\[
\theta \mapsto g_\theta(X(\pi_\theta), Y(\pi_\theta)) = \langle X(\pi_\theta), Y(\pi_\theta)\rangle_\pi = \left\langle x^T\frac{\partial}{\partial\theta}(\pi_\theta),\ y^T\frac{\partial}{\partial\theta}(\pi_\theta)\right\rangle_\pi =: x^T G(\theta)\,y
\]

defines a C∞ function on some U. Then we call (M^D, g) a Riemannian manifold.

Remark 4.2. Following [39, 46], we use the Fisher information matrix as the Riemannian metric G(θ) = (g_{ij}(θ))_{D×D}; it is therefore also called the Fisher metric:

\[
g_{ij}(\theta) := \mathrm{E}\left[\partial_i \log L(x;\theta)\,\partial_j \log L(x;\theta)\right] \tag{4.2}
\]

with the shorthand notation ∂i = ∂/∂θ^i for partial derivatives. In the above definition (4.2), the expectation integrates out all the random variables being modeled in the likelihood, so the metric becomes a function of θ. G(θ) may (e.g., logistic regression in section 4.5.2) or may not (e.g., the banana-shaped distribution in section 4.5.1) involve data. When such integration is not explicit, we use the empirical Fisher information (section 4.5.4) instead. In certain cases, minus the Hessian of the log-prior is also added to the Fisher metric to ensure positive-definiteness [39].

Given the target distribution with density π(θ), which could be the posterior density of θ, we introduce an auxiliary momentum p depending on θ, p|θ ~ N(0, G(θ)), and define the Hamiltonian as follows:

\[
H(\theta, p) = -\log\pi(\theta) + \frac{1}{2}\log\det G(\theta) + \frac{1}{2}p^T G(\theta)^{-1}p = \phi(\theta) + \frac{1}{2}p^T G(\theta)^{-1}p \tag{4.3}
\]

where φ(θ) := −log π(θ) + (1/2) log det G(θ). Based on this Hamiltonian, [39] propose the following Hamiltonian dynamics on the Riemannian manifold:

\[
\begin{aligned}
\dot\theta &= \nabla_p H(\theta, p) = G(\theta)^{-1}p \\
\dot p &= -\nabla_\theta H(\theta, p) = -\nabla_\theta\phi(\theta) + \frac{1}{2}\nu(\theta, p)
\end{aligned} \tag{4.4}
\]

where the ith element of the vector ν(θ, p) is

\[
(\nu(\theta, p))_i = -p^T\partial_i(G(\theta)^{-1})p = (G(\theta)^{-1}p)^T\,\partial_i G(\theta)\,G(\theta)^{-1}p
\]

Remark 4.3. Like the general Hamiltonian dynamics (section 2.1.1), the Riemannian Hamiltonian dynamics (4.4) has the corresponding properties important for MCMC applications: i) time reversibility; ii) volume preservation; iii) energy conservation.

4.2.2 Riemannian Hamiltonian Monte Carlo Algorithm

In practice, we need to numerically solve the non-separable (containing products of θ and p) dynamical system (4.4). However, the resulting map (θ, p) → (θ*, p*) based on the standard leapfrog method (2.10) is neither time-reversible nor symplectic, and is thus not appropriate for solving (4.4) [39].
Instead, they use the Störmer-Verlet [63] method as follows:

\[
p^{(n+1/2)} = p^{(n)} - \frac{\varepsilon}{2}\left[\nabla_\theta\phi(\theta^{(n)}) - \frac{1}{2}\nu(\theta^{(n)}, p^{(n+1/2)})\right] \tag{4.5}
\]
\[
\theta^{(n+1)} = \theta^{(n)} + \frac{\varepsilon}{2}\left[G^{-1}(\theta^{(n)}) + G^{-1}(\theta^{(n+1)})\right]p^{(n+1/2)} \tag{4.6}
\]
\[
p^{(n+1)} = p^{(n+1/2)} - \frac{\varepsilon}{2}\left[\nabla_\theta\phi(\theta^{(n+1)}) - \frac{1}{2}\nu(\theta^{(n+1)}, p^{(n+1/2)})\right] \tag{4.7}
\]

where ε is the size of a time step. This is also known as the generalized leapfrog, which can be derived by concatenating a symplectic Euler-B integrator of (4.4) with its adjoint symplectic Euler-A integrator [see more details in 61]. The above series of transformations T̂ε : (θ^{(n)}, p^{(n)}) ↦ (θ^{(n+1)}, p^{(n+1)}) provides a deterministic geometric integrator (both time-reversible and volume-preserving) for (4.4). Starting from the current state (θ^{(1)}, p^{(1)}), we evolve the dynamics (4.4) for L discretized steps to get a proposal (θ^{(L+1)}, p^{(L+1)}) and accept it according to the following acceptance probability, as in (2.8):

\[
\alpha_{RHMC} = \min\{1, \exp(-H(\theta^{(L+1)}, p^{(L+1)}) + H(\theta^{(1)}, p^{(1)}))\} \tag{4.8}
\]

Note that the proposal distribution is actually a delta function, δ_{T̂ε^L(θ^{(1)}, p^{(1)})}((θ^{(L+1)}, p^{(L+1)})). Algorithm 4.1 summarizes the steps of Riemannian Hamiltonian Monte Carlo (RHMC) [39].

Algorithm 4.1 Riemannian Hamiltonian Monte Carlo (RHMC)
  Initialize θ^{(1)} = current θ
  Sample new momentum p^{(1)} ~ N(0, G(θ^{(1)}))
  Calculate current H(θ^{(1)}, p^{(1)}) according to equation (4.3)
  for ℓ = 1 to L (leapfrog steps) do
    % Update the momentum with fixed-point iteration
    p̂^{(0)} = p^{(ℓ)}
    for i = 1 to NumOfFixedPointSteps do
      p̂^{(i)} = p^{(ℓ)} − (ε/2)[∇θφ(θ^{(ℓ)}) − (1/2)ν(θ^{(ℓ)}, p̂^{(i−1)})]
    end for
    p^{(ℓ+1/2)} = p̂^{(last i)}
    % Update the position with fixed-point iteration
    θ̂^{(0)} = θ^{(ℓ)}
    for i = 1 to NumOfFixedPointSteps do
      θ̂^{(i)} = θ^{(ℓ)} + (ε/2)[G⁻¹(θ^{(ℓ)}) + G⁻¹(θ̂^{(i−1)})] p^{(ℓ+1/2)}
    end for
    θ^{(ℓ+1)} = θ̂^{(last i)}
    % Update the momentum exactly
    p^{(ℓ+1)} = p^{(ℓ+1/2)} − (ε/2)[∇θφ(θ^{(ℓ+1)}) − (1/2)ν(θ^{(ℓ+1)}, p^{(ℓ+1/2)})]
  end for
  Calculate proposed H(θ^{(L+1)}, p^{(L+1)}) according to equation (4.3)
  logRatio = −ProposedH + CurrentH
  Accept or reject the proposal (θ^{(L+1)}, p^{(L+1)}) according to logRatio

One major drawback of the generalized leapfrog method is that it involves two implicit equations, (4.5) and (4.6). These require extra numerical work (e.g., fixed-point iteration), which results in higher computational cost and simulation error. This is especially true when solving for θ^{(n+1)}, because the fixed-point iteration for (4.6) repeatedly inverts the matrix G(θ). To address this problem, we propose an alternative approach that uses velocity instead of momentum in the equations of motion.

4.3 Semi-explicit Lagrangian Monte Carlo

In this section, Einstein notation is adopted: whenever an index appears twice in a mathematical expression, we sum over it, e.g., a_i b_i := Σ_i a_i b_i and Γ^k_{ij}v^i v^j := Σ_{i,j} Γ^k_{ij}v^i v^j. A lower index is used for a covariant tensor, whose components vary by the same transformation as the change of basis (e.g., a gradient), whereas an upper index is reserved for a contravariant tensor, whose components vary in the opposite way to the change of basis in order to compensate (e.g., a velocity vector). Interested readers should refer to [69].

4.3.1 Lagrangian Dynamics: from Momentum to Velocity

In the equations of Hamiltonian dynamics (4.4), the term G(θ)⁻¹p appears several times. This motivates us to re-parameterize the dynamics in terms of velocity, v = G(θ)⁻¹p.
Note that this in fact corresponds to the usual definition of velocity in physics, i.e., momentum divided by mass. The transformation p ↦ v changes the Hamiltonian dynamics (4.4) to the following Lagrangian dynamics¹:

\[
\begin{aligned}
\dot\theta &= v \\
\dot v &= -\eta(\theta, v) - G(\theta)^{-1}\nabla_\theta\phi(\theta)
\end{aligned} \tag{4.9}
\]

where η(θ, v) is a vector whose kth element is Γ^k_{ij}(θ)v^i v^j. Here, Γ^k_{ij}(θ) := (1/2)g^{kl}(∂_i g_{lj} + ∂_j g_{il} − ∂_l g_{ij}) are the Christoffel symbols, where g_{ij} and g^{ij} denote the (i, j)th elements of G(θ) and G(θ)⁻¹ respectively.

Proposition 4.1. The Riemannian Hamiltonian dynamics (4.4) is equivalent to the Lagrangian dynamics (4.9).

Proof. Appendix A.1.

Remark 4.4. The transformation p ↦ v moves the complexity of the dynamics (4.4) from its first equation, for θ, to the second equation in (4.9), where most of the time is spent in finding a good direction v. In the following section, we will show that this helps the developed integrator resolve the implicitness of updating θ, reducing the associated computational cost. The introduction of velocity v in place of p is also advocated by [40], to avoid large momentum variables p for the sake of numerical stability. They consider a constant mass, so the resulting dynamics is still Hamiltonian. We in fact have an example (section 4.5.1.1) for which RHMC using momentum p is very unstable in numerically simulating the dynamics (4.4), so the Lagrangian dynamics (4.9) is preferred to the Hamiltonian dynamics (4.4) also for reasons of numerical stability.

¹ Defining the Lagrangian as kinetic energy minus potential energy, L = (1/2)vᵀG(θ)v − φ(θ), the new dynamics (4.9) can be proved equivalent to the following Euler-Lagrange equation of the second kind:

\[
\frac{d}{dt}\frac{\partial L}{\partial\dot\theta} = \frac{\partial L}{\partial\theta}
\]

which solves the variational problem for the total Lagrangian (the action); in our case, θ̈ = −η(θ, θ̇) − G(θ)⁻¹∇θφ(θ).

Although the Lagrangian dynamics (4.9) in general cannot be recognized as Hamiltonian dynamics in (θ, v), it nevertheless preserves the original Hamiltonian of the system, which is intuitive.

Proposition 4.2. The Lagrangian dynamics (4.9) preserves the Hamiltonian H(θ, p) = H(θ, G(θ)v).

Proof. It suffices to prove that dH/dt ≡ 0 along (4.9):

\[
\begin{aligned}
\frac{d}{dt}H(\theta, G(\theta)v) &= \dot\theta^T\frac{\partial}{\partial\theta}H(\theta, G(\theta)v) + \dot v^T\frac{\partial}{\partial v}H(\theta, G(\theta)v) \\
&= v^T\left[\nabla_\theta\phi(\theta) + \frac{1}{2}v^T\partial G(\theta)v\right] + \left[-v^T\Gamma(\theta)v - G(\theta)^{-1}\nabla_\theta\phi(\theta)\right]^T G(\theta)v \\
&= \left[v^T\nabla_\theta\phi(\theta) - (\nabla_\theta\phi(\theta))^T v\right] + \left[\frac{1}{2}v^T\left(v^T\partial G(\theta)v\right) - (v^T\tilde\Gamma(\theta)v)^T v\right] \\
&= 0 + 0 = 0
\end{aligned}
\]

where vᵀΓ(θ)v is a vector whose kth element is Γ^k_{ij}(θ)v^i v^j. The second 0 is due to the triple form (vᵀΓ̃(θ)v)ᵀv = Γ̃_{ijk}v^i v^j v^k = (1/2)∂_k g_{ij}v^i v^j v^k, where Γ̃ is the Christoffel symbol of the first kind, with elements Γ̃_{ijk}(θ) := g_{kl}Γ^l_{ij}(θ) = (1/2)(∂_i g_{kj} + ∂_j g_{ik} − ∂_k g_{ij}).

4.3.2 Semi-explicit Lagrangian Monte Carlo Algorithm

Now we want to use the Lagrangian dynamics (4.9), instead of the Riemannian Hamiltonian dynamics (4.4), as the proposal mechanism in the Metropolis algorithm. In the following, we derive a time-reversible integrator for (4.9) which is not volume preserving; the detailed balance condition (2.3) can nevertheless still be achieved by adjusting for the Jacobian determinant in the acceptance probability.
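Since η(θ, v) in (4.9), as well as the matrices Ω and Ω̃ used below, are built from the Christoffel symbols, it may help to see how these can be assembled numerically. The following R sketch is ours for illustration; it assumes Ginv = G(θ)⁻¹ and a D × D × D array dG with dG[i, j, l] = ∂_l g_{ij} (the same array of metric derivatives that RHMC already computes).

```r
# A sketch of the Christoffel symbols of the second kind:
# Gamma^k_{ij} = (1/2) g^{kl} (d_i g_{lj} + d_j g_{il} - d_l g_{ij}).
christoffel <- function(Ginv, dG) {
  D <- dim(dG)[1]
  Gamma <- array(0, c(D, D, D))   # Gamma[i, j, k] stores Gamma^k_{ij}
  for (k in 1:D) for (i in 1:D) for (j in 1:D) {
    Gamma[i, j, k] <- 0.5 * sum(Ginv[k, ] * (dG[, j, i] + dG[i, , j] - dG[i, j, ]))
  }
  Gamma
}
# eta(theta, v) in (4.9) then has kth element sum_{i,j} Gamma[i, j, k] v_i v_j.
```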
4.3.2.1 Time reversible integrator

In analogy with the generalized leapfrog (4.5)-(4.7), we concatenate a half step of the following Euler-B integrator of (4.9) [chap. 4 of 61]:

\[
\begin{aligned}
\theta^{(n+1/2)} &= \theta^{(n)} + \frac{\varepsilon}{2}v^{(n+1/2)} \\
v^{(n+1/2)} &= v^{(n)} - \frac{\varepsilon}{2}\left[(v^{(n+1/2)})^T\Gamma(\theta^{(n)})v^{(n+1/2)} + G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right]
\end{aligned}
\]

with another half step of its adjoint Euler-A integrator:

\[
\begin{aligned}
\theta^{(n+1)} &= \theta^{(n+1/2)} + \frac{\varepsilon}{2}v^{(n+1/2)} \\
v^{(n+1)} &= v^{(n+1/2)} - \frac{\varepsilon}{2}\left[(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})v^{(n+1/2)} + G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right]
\end{aligned}
\]

to get the following semi-explicit time-reversible integrator:

\[
v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\left[\eta(\theta^{(n)}, v^{(n+1/2)}) + G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right] \tag{4.10}
\]
\[
\theta^{(n+1)} = \theta^{(n)} + \varepsilon v^{(n+1/2)} \tag{4.11}
\]
\[
v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\left[\eta(\theta^{(n+1)}, v^{(n+1/2)}) + G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right] \tag{4.12}
\]

Note that (4.11) resolves the implicitness of the θ update in the generalized leapfrog method, and thus reduces the associated computational cost; equation (4.10) for updating v, however, remains implicit.

4.3.2.2 Detailed balance condition

Note that the integrator (4.10)-(4.12) is (i) time reversible and (ii) energy preserving up to a global error of order O(ε), where ε is the step size. The resulting map, however, is no longer volume preserving (see section 4.3.2.3). Nevertheless, based on proposition 4.3, we can still obtain detailed balance after a determinant adjustment [see also 58].

Proposition 4.3 (Detailed Balance Condition with determinant adjustment). Denote z = (θ, v) and z′ = T̂L(z) for some time-reversible integrator T̂L of the Lagrangian dynamics (4.9). If the acceptance probability is adjusted in the following way:

\[
\tilde\alpha(z, z') = \min\left\{1, \frac{\exp(-H(z'))}{\exp(-H(z))}\left|\det\frac{dz'}{dz}\right|\right\} \tag{4.13}
\]

then the detailed balance condition still holds:

\[
\tilde\alpha(z, z')\,\mathbb{P}(dz) = \tilde\alpha(z', z)\,\mathbb{P}(dz') \tag{4.14}
\]

Proof.

\[
\begin{aligned}
\tilde\alpha(z, z')\,\mathbb{P}(dz) &= \min\left\{1, \frac{\exp(-H(z'))}{\exp(-H(z))}\left|\frac{dz'}{dz}\right|\right\}\exp(-H(z))\,dz \\
&= \min\left\{\exp(-H(z))\,dz,\ \exp(-H(z'))\,dz'\right\} \\
&= \min\left\{\frac{\exp(-H(z))}{\exp(-H(z'))}\left|\frac{dz}{dz'}\right|,\ 1\right\}\exp(-H(z'))\,dz' = \tilde\alpha(z', z)\,\mathbb{P}(dz')
\end{aligned}
\]

where the middle step uses z = T̂L⁻¹(z′).

Before discussing the calculation of the adjusted acceptance probability (4.13), we define the energy of the Lagrangian dynamics (4.9) as follows.

Definition 4.3 (Energy of Lagrangian Dynamics). Because p|θ ~ N(0, G(θ)), the distribution of v|θ is N(0, G(θ)⁻¹). The energy function E(θ, v) is defined as the sum of the potential energy, U(θ) = −log π(θ), and the kinetic energy, K(θ, v) = −log P(v|θ):

\[
E(\theta, v) = -\log\pi(\theta) - \frac{1}{2}\log\det G(\theta) + \frac{1}{2}v^T G(\theta)v \tag{4.15}
\]

Remark 4.5. This energy (4.15) differs from the Hamiltonian H(θ, G(θ)v) (4.3) in the sign of the middle term, due to the difference between the distributions of p|θ and v|θ. Note that the energy (4.15) is not preserved by the Lagrangian dynamics (4.9), in contrast to proposition 4.2. The energy is related to the Hamiltonian through the following change-of-variables formula, and it is more natural to work with the energy:

\[
\begin{aligned}
\int f(\theta, p)\exp(-H(\theta, p))\,|d\theta\wedge dp|
&\overset{p\mapsto v}{=} \int f(\theta, G(\theta)v)\exp(-H(\theta, G(\theta)v))\left|\frac{\partial(\theta, p)}{\partial(\theta, v)}\right||d\theta\wedge dv| \\
&= \int f(\theta, G(\theta)v)\exp(-E(\theta, v))\,|d\theta\wedge dv|
\end{aligned}
\]

Note that the adjusted acceptance probability (4.13) should be calculated based on H(θ, G(θ)v). However, the following proposition allows it to be calculated based on the energy function E(θ, v) (4.15), which is more intuitive.

Proposition 4.4. The adjusted acceptance probability (4.13) can be calculated based on either H(θ, G(θ)v) or E(θ, v).

Proof.
Note that

\[
\left|\frac{\partial(\theta', p')}{\partial(\theta, p)}\right| = \frac{\det(G(\theta'))}{\det(G(\theta))}\left|\frac{\partial(\theta', v')}{\partial(\theta, v)}\right|
\]

and that exp(−H(θ, G(θ)v)) det G(θ) = exp(−E(θ, v)), since E and H(·, G(·)v) differ only in the sign of the (1/2) log det G(θ) term. Then

\[
\begin{aligned}
\tilde\alpha &= \min\left\{1, \frac{\exp(-H(\theta', p'))}{\exp(-H(\theta, p))}\left|\frac{\partial(\theta', p')}{\partial(\theta, p)}\right|\right\}
= \min\left\{1, \frac{\exp(-H(\theta', G(\theta')v'))\det(G(\theta'))}{\exp(-H(\theta, G(\theta)v))\det(G(\theta))}\left|\frac{\partial(\theta', v')}{\partial(\theta, v)}\right|\right\} \\
&= \min\left\{1, \frac{\exp(-E(\theta', v'))}{\exp(-E(\theta, v))}\left|\frac{\partial(\theta', v')}{\partial(\theta, v)}\right|\right\}
\end{aligned}
\]

Therefore, after solving the Lagrangian dynamics (4.9) by the semi-explicit integrator (4.10)-(4.12) for L steps, we get a proposal (θ^{(L+1)}, v^{(L+1)}) to be accepted with the following acceptance probability:

\[
\alpha_{sLMC} = \min\{1, \exp(-E(\theta^{(L+1)}, v^{(L+1)}) + E(\theta^{(1)}, v^{(1)}))\,|\det J_{sLMC}|\} \tag{4.16}
\]

where J_sLMC is the Jacobian matrix of (θ^{(1)}, v^{(1)}) → (θ^{(L+1)}, v^{(L+1)}) according to (4.10)-(4.12), with the following determinant, calculated in section 4.3.2.3.

Proposition 4.5 (Jacobian determinant of semi-explicit integrator).

\[
\det J_{sLMC} := \left|\frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\right| = \prod_{n=1}^{L}\frac{\det(I - \varepsilon\,\Omega(\theta^{(n+1)}, v^{(n+1/2)}))}{\det(I + \varepsilon\,\Omega(\theta^{(n)}, v^{(n+1/2)}))} \tag{4.17}
\]

Here, Ω(θ^{(n+1)}, v^{(n+1/2)}) is a matrix whose (i, j)th element is Σ_k v_k^{(n+1/2)} Γ^i_{kj}(θ^{(n+1)}).

4.3.2.3 Volume Correction

To adjust for the volume change in (θ^{(1)}, v^{(1)}) → (θ^{(L+1)}, v^{(L+1)}) according to (4.10)-(4.12), we need to derive the Jacobian determinant, det J := |∂(θ^{(L+1)}, v^{(L+1)})/∂(θ^{(1)}, v^{(1)})|, which can be calculated using wedge products [61].

Definition 4.4 (Differential Forms, Wedge Product). A differential one-form α : TM^D → R on a differentiable manifold M^D is a smooth mapping from the tangent space TM^D to R, which can be expressed as a linear combination of differentials of local coordinates: α = f_i dx^i =: f · dx. For example, if f : R^D → R is a smooth function, then its directional derivative along a vector v ∈ R^D, denoted df(v), is given by

\[
df(v) = \frac{\partial f}{\partial z_i}v^i
\]

so df(·) is a linear functional of v, called the differential of f at z, and is an example of a differential one-form. In particular, dz^i(v) = v^i, thus

\[
df(v) = \frac{\partial f}{\partial z_i}dz^i(v), \qquad\text{hence}\qquad df = \frac{\partial f}{\partial z_i}dz^i
\]

The wedge product of two one-forms α, β is a 2-form α ∧ β, an anti-symmetric bilinear function on the tangent space, with the following properties (for one-forms α, β, γ and a square matrix A of the same dimension D):

• α ∧ α = 0
• α ∧ (β + γ) = α ∧ β + α ∧ γ (thus α ∧ β = −β ∧ α)
• α ∧ Aβ = Aᵀα ∧ β

The following proposition enables us to calculate the Jacobian determinant det J.

Proposition 4.6. Let TL : (θ^{(1)}, v^{(1)}) → (θ^{(L+1)}, v^{(L+1)}) be the evolution of a smooth flow; then

\[
d\theta^{(L+1)}\wedge dv^{(L+1)} = \left|\frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\right|\,d\theta^{(1)}\wedge dv^{(1)}
\]

Remark 4.6. The Jacobian determinant det J can also be regarded as a Radon-Nikodym derivative of two probability measures: det J = P(dθ^{(L+1)}, dv^{(L+1)})/P(dθ^{(1)}, dv^{(1)}), where P(dθ, dv) = p(θ, v)dθdv.

Proof of proposition 4.5. According to the semi-explicit integrator (4.10)-(4.12),

\[
\begin{aligned}
dv^{(n+1/2)} &= dv^{(n)} - \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n)})\,dv^{(n+1/2)} + (**)\,d\theta^{(n)} \\
d\theta^{(n+1)} &= d\theta^{(n)} + \varepsilon\,dv^{(n+1/2)} \\
dv^{(n+1)} &= dv^{(n+1/2)} - \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})\,dv^{(n+1/2)} + (**)\,d\theta^{(n+1)}
\end{aligned}
\]

where vᵀΓ(θ) is a matrix whose (k, j)th element is v^i Γ^k_{ij}(θ).
Therefore,

\[
\begin{aligned}
d\theta^{(n+1)}\wedge dv^{(n+1)} &= [I - \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})]^T\,d\theta^{(n+1)}\wedge dv^{(n+1/2)} \\
&= [I - \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})]^T\,d\theta^{(n)}\wedge dv^{(n+1/2)} \\
&= [I - \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})]^T[I + \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n)})]^{-T}\,d\theta^{(n)}\wedge dv^{(n)}
\end{aligned}
\]

For the volume adjustment, we must use the following Jacobian determinant accumulated along the integration steps:

\[
\det J_{sLMC} := \left|\frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\right| = \prod_{n=1}^{L}\frac{\det(I - \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)}))}{\det(I + \varepsilon(v^{(n+1/2)})^T\Gamma(\theta^{(n)}))}
\]

Algorithm 4.2 Semi-explicit Lagrangian Monte Carlo (sLMC)
  Initialize θ^{(1)} = current θ
  Sample new velocity v^{(1)} ~ N(0, G⁻¹(θ^{(1)}))
  Calculate current E(θ^{(1)}, v^{(1)}) according to equation (4.15)
  for n = 1 to L (leapfrog steps) do
    % Update the velocity with fixed-point iterations
    v̂^{(0)} = v^{(n)}
    for i = 1 to NumOfFixedPointSteps do
      v̂^{(i)} = v^{(n)} − (ε/2)G(θ^{(n)})⁻¹[(v̂^{(i−1)})ᵀΓ̃(θ^{(n)})v̂^{(i−1)} + ∇θφ(θ^{(n)})]
    end for
    v^{(n+1/2)} = v̂^{(last i)}
    % Update the position with a single explicit step
    θ^{(n+1)} = θ^{(n)} + εv^{(n+1/2)}
    Δ log det_n = log det(I − εΩ(θ^{(n+1)}, v^{(n+1/2)})) − log det(I + εΩ(θ^{(n)}, v^{(n+1/2)}))
    % Update the velocity exactly
    v^{(n+1)} = v^{(n+1/2)} − (ε/2)G(θ^{(n+1)})⁻¹[(v^{(n+1/2)})ᵀΓ̃(θ^{(n+1)})v^{(n+1/2)} + ∇θφ(θ^{(n+1)})]
  end for
  Calculate proposed E(θ^{(L+1)}, v^{(L+1)}) according to equation (4.15)
  logRatio = −ProposedE + CurrentE + Σ_{n=1}^{L} Δ log det_n
  Accept or reject the proposal (θ^{(L+1)}, v^{(L+1)}) according to logRatio

Algorithm 4.2 provides the corresponding steps of the semi-explicit Lagrangian Monte Carlo (sLMC) algorithm. It has a physical interpretation as exploring the parameter space along the path on a Riemannian manifold that minimizes the action (total Lagrangian). In contrast to RHMC, which augments the parameter space with momentum, sLMC augments the parameter space with velocity. In section 4.5, we use several experiments to show that switching from momentum to velocity can lead to improvements in computational efficiency in some cases.

4.3.3 Stationarity

Now, with proposition 4.3, we can prove that the Markov chain derived by our reversible integrator with the adjusted acceptance probability (4.13) converges to the true target distribution. One can also find a similar proof in [chap. 9 of 70].

Theorem 4.1. The Markov chain generated by algorithm 4.2 (sLMC) has the target distribution as its stationary distribution.

Proof. Appendix A.2.

4.4 Explicit Lagrangian Monte Carlo

In this section, we modify the semi-explicit integrator (4.10)-(4.12) into a fully explicit integrator and validate it as a numerical method for solving the Lagrangian dynamics (4.9). The derived explicit integrator further reduces the computational cost of implicitly updating v in (4.10). It is time reversible but not volume preserving, and thus needs a determinant adjustment in the acceptance probability for the adjusted detailed balance condition (proposition 4.3).
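For reference, the per-step adjustment Δ log det_n accumulated by Algorithm 4.2 (equation (4.17)) can be computed directly; the terms of Proposition 4.8 below have the same structure. A sketch in R, ours for illustration, where Omega(theta, v) is an assumed helper returning the matrix with (i, j)th entry Σ_k v_k Γ^i_{kj}(θ):

```r
# A sketch of one term of the log-Jacobian in (4.17):
# log|det(I - eps*Omega(theta_new, v_half))| - log|det(I + eps*Omega(theta_old, v_half))|.
log_det_step <- function(theta_old, theta_new, v_half, eps, Omega) {
  D <- length(theta_old)
  # determinant() returns log|det| in its `modulus` component
  as.numeric(determinant(diag(D) - eps * Omega(theta_new, v_half))$modulus) -
    as.numeric(determinant(diag(D) + eps * Omega(theta_old, v_half))$modulus)
}
```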
4.4.1 Fully explicit integrator

To resolve the remaining implicit equation (4.10), we propose an additional modification motivated by the following relationship (notice the symmetry of the lower indices of Γ):

\[
v^T\Gamma u = \frac{1}{2}\left[(v + u)^T\Gamma(v + u) - v^T\Gamma v - u^T\Gamma u\right]
\]

To keep time-reversibility, we modify both (4.10) and (4.12). The implicit half-step

\[
v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\left[(v^{(n+1/2)})^T\Gamma(\theta^{(n)})v^{(n+1/2)} + G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right]
\]

is replaced by

\[
v^{(n+1/2)} = v^{(n)} - \frac{\varepsilon}{2}\left[(v^{(n)})^T\Gamma(\theta^{(n)})v^{(n+1/2)} + G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right] \tag{4.18}
\]

keeping the position update

\[
\theta^{(n+1)} = \theta^{(n)} + \varepsilon v^{(n+1/2)} \tag{4.19}
\]

and the final half-step

\[
v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\left[(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})v^{(n+1/2)} + G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right]
\]

is replaced by

\[
v^{(n+1)} = v^{(n+1/2)} - \frac{\varepsilon}{2}\left[(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})v^{(n+1)} + G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right] \tag{4.20}
\]

The time-reversibility of the integrator (4.18)-(4.20) can be shown by the fact that switching (θ, v)^{(n+1)} and (θ, v)^{(n)} and negating time do not change its form. The resulting integrator is completely explicit, since both velocity updates (4.18) and (4.20) can be solved by collecting the terms containing v^{(n+1/2)} and v^{(n+1)} respectively:

\[
\begin{aligned}
v^{(n+1/2)} &= \left[I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\right]^{-1}\left[v^{(n)} - \frac{\varepsilon}{2}G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right] \\
v^{(n+1)} &= \left[I + \frac{\varepsilon}{2}(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})\right]^{-1}\left[v^{(n+1/2)} - \frac{\varepsilon}{2}G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right]
\end{aligned}
\]

Therefore we achieve a fully explicit integrator for the Lagrangian dynamics (4.9):

\[
v^{(n+1/2)} = \left[I + \frac{\varepsilon}{2}\Omega(\theta^{(n)}, v^{(n)})\right]^{-1}\left[v^{(n)} - \frac{\varepsilon}{2}G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right] \tag{4.21}
\]
\[
\theta^{(n+1)} = \theta^{(n)} + \varepsilon v^{(n+1/2)} \tag{4.22}
\]
\[
v^{(n+1)} = \left[I + \frac{\varepsilon}{2}\Omega(\theta^{(n+1)}, v^{(n+1/2)})\right]^{-1}\left[v^{(n+1/2)} - \frac{\varepsilon}{2}G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right] \tag{4.23}
\]

The following proposition verifies that the derived integrator (4.21)-(4.23) is a valid numerical method for solving the Lagrangian dynamics (4.9), in the sense that the global error between the numerical solution and the theoretical solution diminishes as the discretization step size decreases to 0 [see 61, for a similar proof for the generalized leapfrog method].

Proposition 4.7 (Convergence of Numerical Solution). Suppose from the same initial point z(0) = z0 we evolve the Lagrangian dynamics (4.9) for some time T to get the theoretical solution z(T), and numerically solve (4.9) according to the integrator (4.21)-(4.23) with step size ε for T/ε steps to get a solution z^{(T/ε)}; then

\[
\|z(T) - z^{(T/\varepsilon)}\| \to 0, \qquad\text{as } \varepsilon \to 0
\]

Proof. Appendix A.3.

This fully explicit integrator (4.21)-(4.23) is (i) time reversible and (ii) energy preserving up to a global error of order O(ε). The resulting map is not volume preserving, as the Jacobian determinant of (θ^{(1)}, v^{(1)}) → (θ^{(L+1)}, v^{(L+1)}) by (4.21)-(4.23) is not 1.

Proposition 4.8 (Jacobian determinant of fully explicit integrator).

\[
\det J_{LMC} := \prod_{n=1}^{L}\frac{\det(G(\theta^{(n+1)}) - \frac{\varepsilon}{2}\tilde\Omega(\theta^{(n+1)}, v^{(n+1)}))\,\det(G(\theta^{(n)}) - \frac{\varepsilon}{2}\tilde\Omega(\theta^{(n)}, v^{(n+1/2)}))}{\det(G(\theta^{(n+1)}) + \frac{\varepsilon}{2}\tilde\Omega(\theta^{(n+1)}, v^{(n+1/2)}))\,\det(G(\theta^{(n)}) + \frac{\varepsilon}{2}\tilde\Omega(\theta^{(n)}, v^{(n)}))} \tag{4.24}
\]

Here, Ω̃(θ, v) denotes G(θ)Ω(θ, v), whose (k, j)th element is equal to Σ_i v^i Γ̃_{ijk}(θ), with Γ̃_{ijk}(θ) = g_{kl}Γ^l_{ij}(θ) = (1/2)(∂_i g_{kj} + ∂_j g_{ik} − ∂_k g_{ij}).

As a result, the acceptance probability must be adjusted as follows:

\[
\alpha_{LMC} = \min\{1, \exp(-E(\theta^{(L+1)}, v^{(L+1)}) + E(\theta^{(1)}, v^{(1)}))\,|\det J_{LMC}|\} \tag{4.25}
\]

4.4.2 Volume Correction

As in section 4.3.2.3, we apply the wedge product to the system of equations (4.21)-(4.23) to calculate its Jacobian determinant.

Proof
of proposition 4.8. The Jacobian matrix of the integrator (4.21)-(4.23) over two consecutive half-steps is

\[
\frac{\partial(\theta^{(n+1)}, v^{(n+1)})}{\partial(\theta^{(n)}, v^{(n)})} = \left[I + \frac{\varepsilon}{2}(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)})\right]^{-T}\left[I - \frac{\varepsilon}{2}(v^{(n+1)})^T\Gamma(\theta^{(n+1)})\right]^{T}\left[I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)})\right]^{-T}\left[I - \frac{\varepsilon}{2}(v^{(n+1/2)})^T\Gamma(\theta^{(n)})\right]^{T}
\]

Accumulating all the determinants along the L integration steps:

\[
\begin{aligned}
\det J_{LMC} &:= \left|\frac{\partial(\theta^{(L+1)}, v^{(L+1)})}{\partial(\theta^{(1)}, v^{(1)})}\right|
= \prod_{n=1}^{L}\frac{\det(I - \frac{\varepsilon}{2}(v^{(n+1)})^T\Gamma(\theta^{(n+1)}))\,\det(I - \frac{\varepsilon}{2}(v^{(n+1/2)})^T\Gamma(\theta^{(n)}))}{\det(I + \frac{\varepsilon}{2}(v^{(n+1/2)})^T\Gamma(\theta^{(n+1)}))\,\det(I + \frac{\varepsilon}{2}(v^{(n)})^T\Gamma(\theta^{(n)}))} \\
&= \prod_{n=1}^{L}\frac{\det(G(\theta^{(n+1)}) - \frac{\varepsilon}{2}(v^{(n+1)})^T\tilde\Gamma(\theta^{(n+1)}))\,\det(G(\theta^{(n)}) - \frac{\varepsilon}{2}(v^{(n+1/2)})^T\tilde\Gamma(\theta^{(n)}))}{\det(G(\theta^{(n+1)}) + \frac{\varepsilon}{2}(v^{(n+1/2)})^T\tilde\Gamma(\theta^{(n+1)}))\,\det(G(\theta^{(n)}) + \frac{\varepsilon}{2}(v^{(n)})^T\tilde\Gamma(\theta^{(n)}))}
\end{aligned}
\]

Algorithm 4.3 Explicit Lagrangian Monte Carlo (LMC)
  Initialize θ^{(1)} = current θ
  Sample new velocity v^{(1)} ~ N(0, G(θ^{(1)})⁻¹)
  Calculate current E(θ^{(1)}, v^{(1)}) according to equation (4.15)
  Δ log det = 0
  for n = 1 to L do
    Δ log det = Δ log det − log det(G(θ^{(n)}) + (ε/2)Ω̃(θ^{(n)}, v^{(n)}))
    % Update the velocity explicitly with a half step
    v^{(n+1/2)} = [G(θ^{(n)}) + (ε/2)Ω̃(θ^{(n)}, v^{(n)})]⁻¹[G(θ^{(n)})v^{(n)} − (ε/2)∇θφ(θ^{(n)})]
    Δ log det = Δ log det + log det(G(θ^{(n)}) − (ε/2)Ω̃(θ^{(n)}, v^{(n+1/2)}))
    % Update the position with a full step
    θ^{(n+1)} = θ^{(n)} + εv^{(n+1/2)}
    Δ log det = Δ log det − log det(G(θ^{(n+1)}) + (ε/2)Ω̃(θ^{(n+1)}, v^{(n+1/2)}))
    % Update the velocity explicitly with a half step
    v^{(n+1)} = [G(θ^{(n+1)}) + (ε/2)Ω̃(θ^{(n+1)}, v^{(n+1/2)})]⁻¹[G(θ^{(n+1)})v^{(n+1/2)} − (ε/2)∇θφ(θ^{(n+1)})]
    Δ log det = Δ log det + log det(G(θ^{(n+1)}) − (ε/2)Ω̃(θ^{(n+1)}, v^{(n+1)}))
  end for
  Calculate proposed E(θ^{(L+1)}, v^{(L+1)}) according to equation (4.15)
  logRatio = −ProposedE + CurrentE + Δ log det
  Accept or reject the proposal (θ^{(L+1)}, v^{(L+1)}) according to logRatio

Algorithm 4.3 shows the corresponding steps for the fully explicit Lagrangian Monte Carlo (LMC) algorithm. In both algorithms 4.2 and 4.3, the position update is relatively simple, while the computational time is dominated by choosing the "right" direction (velocity) using the geometry of the parameter space. In sLMC, solving for θ explicitly reduces the computational cost by (F − 1)O(D^{2.373}), where F is the number of fixed-point iterations and D is the number of parameters; for each fixed-point iteration, it takes O(D^{2.373}) elementary linear algebraic operations to invert G(θ). The connection terms Γ̃(θ) in Ω̃ do not add substantial computational cost, since they are obtained by permuting the three dimensions of the array ∂G(θ), which is also computed in RHMC. The additional price of the determinant adjustment is O(D^{2.373}). LMC avoids the fixed-point iteration method in updating v, and therefore further reduces computation by (F − 1)O(D²). Besides, it resolves possible convergence issues associated with using the fixed-point iteration method (section 4.5.1.1). However, because it involves additional matrix inversions to update v, its benefits could occasionally be undermined. This is evident from our experimental results presented in section 4.5.3.

4.5 Experimental Results

In this section, we use both simulated and real data to evaluate our methods, sLMC and LMC, compared to standard HMC and RHMC. Following [39], we use a time-normalized effective sample size (ESS) [17] to compare these methods.

Definition 4.5 (Effective Sample Size).
For S samples, the effective sample size is calculated as

\[
ESS = S\left[1 + 2\sum_{k=1}^{K}\rho(k)\right]^{-1}
\]

where ρ(k) is the autocorrelation function at lag k, and K ≫ 1.

Remark 4.7. Effective sample size can be understood as the number of nearly independent samples. The more effective samples a sampling algorithm can generate within a fixed CPU time (time-normalized ESS), the more efficient it is considered. Minimum, median, and maximum values of ESS over all parameters are provided for comparing the different algorithms. More specifically, we use the minimum ESS normalized by CPU time (s), min(ESS)/s, as the measure of sampling efficiency. All computer programs and data sets discussed in this chapter are available online at http://www.ics.uci.edu/~babaks/Site/Codes.html.

4.5.1 Banana-shaped distributions

The banana-shaped distribution, which we used above for illustration, can be constructed as the posterior distribution of θ = (θ1, θ2)|y based on the following model:

\[
y|\theta \sim \mathrm{N}(\theta_1 + \theta_2^2, \sigma_y^2), \qquad \theta \sim \mathrm{N}(0, \sigma_\theta^2 I_2)
\]

The data {yi}_{i=1}^{100} are generated with θ1 + θ2² = 1, σy = 2, and σθ = 1. As we can see in figure 4.2, similarly to RHMC, sLMC and LMC explore the parameter space efficiently by adapting to its local geometry. The histograms of posterior samples shown in figure 4.3 confirm that our algorithms converge to the true posterior distributions of θ1 and θ2, whose density functions are shown as red solid curves. Table 4.1 compares the performance of these algorithms based on 20000 MCMC iterations after 5000 burn-in iterations.

Figure 4.2: The first 10 iterations in sampling from the banana-shaped distribution with Riemannian HMC (RHMC), semi-explicit Lagrangian Monte Carlo (sLMC) and explicit LMC (LMC). For all three methods, the trajectory length (i.e., step size times number of integration steps) is set to 1.45 and the number of integration steps is set to 10. Solid red lines show the sampling path, and each point represents an accepted proposal.

Figure 4.3: Histograms of 1 million posterior samples of θ1 and θ2 for the banana-shaped distribution using RHMC (left), sLMC (middle) and LMC (right). Solid red curves are the true density functions.

  Method   AP     s/Iter     ESS                    min(ESS)/s
  HMC      0.79   6.96e-04   (288, 614, 941)        20.65
  RHMC     0.78   4.56e-03   (4514, 5779, 7044)     49.50
  sLMC     0.84   7.90e-04   (2195, 3476, 4757)     138.98
  LMC      0.73   7.27e-04   (1139, 2409, 3678)     78.32

Table 4.1: Comparing alternative methods using a banana-shaped distribution. For each method, the trajectory length is kept at 1.2 and the step size is tuned to make the acceptance rates comparable. We provide the acceptance probability (AP), the CPU time (s) for each iteration, ESS (min., med., max.), and the time-normalized ESS.

For this specific example, sLMC has the best performance, followed by LMC. As discussed above, although LMC is fully explicit, its numerical benefits (obtained by removing implicit equations) can be negated in certain examples, since it involves additional matrix inversion operations to update v.
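To make the geometric methods concrete for this example: with a Gaussian likelihood the Fisher information has a closed form, and adding the prior precision keeps the metric positive definite (Remark 4.2). The following R sketch is ours for illustration, under these assumptions; the function name banana_metric is hypothetical.

```r
# A sketch of a position-specific metric for the banana-shaped posterior:
# Fisher information of N(theta1 + theta2^2, sigma_y^2) over N observations,
# plus the prior precision (1/sigma_theta^2) I for positive-definiteness.
banana_metric <- function(theta, N = 100, sigma_y = 2, sigma_theta = 1) {
  grad_mu <- c(1, 2 * theta[2])        # gradient of the mean theta1 + theta2^2
  (N / sigma_y^2) * tcrossprod(grad_mu) + diag(1 / sigma_theta^2, 2)
}
```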
4.5.1 Banana-shaped distributions

The banana-shaped distribution, which we used above for illustration, can be constructed as the posterior distribution of θ = (θ1, θ2)|y based on the following model:

$$y|\theta \sim N(\theta_1 + \theta_2^2, \sigma_y^2), \qquad \theta \sim N(0, \sigma_\theta^2 I_2)$$

The data {y_i}_{i=1}^{100} are generated with θ1 + θ2² = 1, σy = 2, and σθ = 1. As we can see in figure 4.2, similar to RHMC, sLMC and LMC explore the parameter space efficiently by adapting to its local geometry. The histograms of posterior samples shown in figure 4.3 confirm that our algorithms converge to the true posterior distributions of θ1 and θ2, whose density functions are shown as red solid curves. Table 4.1 compares the performance of these algorithms based on 20000 MCMC iterations after 5000 burn-in iterations. For this specific example, sLMC has the best performance, followed by LMC. As discussed above, although LMC is fully explicit, its numerical benefits (obtained by removing implicit equations) can be negated in certain examples since it involves additional matrix inversion operations to update v.

Figure 4.2: The first 10 iterations in sampling from the banana-shaped distribution with Riemannian HMC (RHMC), semi-explicit Lagrangian Monte Carlo (sLMC) and explicit LMC (LMC). For all three methods, the trajectory length (i.e., step size times number of integration steps) is set to 1.45 and the number of integration steps is set to 10. Solid red lines show the sampling path, and each point represents an accepted proposal.

Figure 4.3: Histograms of 1 million posterior samples of θ1 and θ2 for the banana-shaped distribution using RHMC (left), sLMC (middle) and LMC (right). Solid red curves are the true density functions.

Method | AP | s/Iter | ESS (min, med, max) | min(ESS)/s
HMC | 0.79 | 6.96e-04 | (288, 614, 941) | 20.65
RHMC | 0.78 | 4.56e-03 | (4514, 5779, 7044) | 49.50
sLMC | 0.84 | 7.90e-04 | (2195, 3476, 4757) | 138.98
LMC | 0.73 | 7.27e-04 | (1139, 2409, 3678) | 78.32

Table 4.1: Comparing alternative methods using a banana-shaped distribution. For each method, the trajectory length is kept at 1.2 and the step size is tuned to make the acceptance rate comparable. We provide the acceptance probability (AP), the CPU time (s) per iteration, ESS (min., med., max.), and the time-normalized ESS.

4.5.1.1 Thinner banana-shaped distribution

In this section we discuss the issue of solutions given by fixed-point iteration in RHMC. It turns out that sLMC and LMC not only reduce the computational cost of RHMC, but are also more numerically stable than RHMC by avoiding the fixed-point iteration. For the fixed-point iteration to find a solution to (4.6), the iterated function

$$f(\cdot) = \theta^{(n)} + \frac{\varepsilon}{2}\left[G^{-1}(\theta^{(n)}) + G^{-1}(\cdot)\right]p^{(n+1/2)}$$

has to satisfy a certain contraction condition, e.g., a Lipschitz condition with constant 0 ≤ L < 1. When this is not satisfied, [39] argue that fixed-point iteration can still be used, not to obtain the exact solution, but to generate a proposal after several runs (5 or 6 in practice). However, fixed-point iteration with a limited number of runs can return extremely large solutions, indicating strongly divergent behavior. We observe this phenomenon in the following experiment, where G is very ill-conditioned (see more discussion on condition numbers in section 4.5.3).

If we increase the number of records y to 10000, the posterior distribution of θ|y becomes more concentrated around θ1 + θ2² = 1, hence a 'thinner banana', which is challenging for both HMC and RHMC: HMC bounces around more in the thinner banana, resulting in slow exploration; RHMC updates θ by the fixed-point iteration, which frequently gives divergent solutions due to the ill-conditioned metric G(θ) (with condition number as large as 10^4). Figure 4.4 shows that RHMC frequently gives solutions divergent to infinity, as its sampling path (red lines) goes beyond the range of the figure, rendering 7 of 10 proposals rejected, while sLMC and LMC still explore the distribution well and accept most of the proposals (10 and 8 respectively). Table 4.2 compares these algorithms in simulating the thinner banana-shaped distribution based on 20000 MCMC iterations after 5000 burn-in iterations. Note that RHMC has to significantly reduce the step size to mitigate the issue of divergent solutions, surprisingly performing even worse than HMC.

Figure 4.4: The first 10 iterations in sampling from the thinner banana-shaped distribution with RHMC, sLMC and LMC. For all three methods, the trajectory length (i.e., step size times number of integration steps) is set to 0.95 and the number of integration steps is set to 10. Solid red lines show the sampling path; dots represent accepted proposals while crosses represent rejected ones.

Method | AP | s/Iter | ESS (min, med, max) | min(ESS)/s
HMC | 0.82 | 3.22e-03 | (545, 567, 590) | 8.46
RHMC | 0.70 | 1.37e-02 | (506, 995, 1484) | 1.84
sLMC | 0.84 | 1.01e-03 | (1022, 1806, 2589) | 50.57
LMC | 0.80 | 1.63e-03 | (545, 1197, 1848) | 16.77

Table 4.2: Comparing alternative methods using a 'thinner' banana-shaped distribution. For each method, the trajectory length is kept at 1 and the step size is tuned to make the acceptance rate comparable. We provide the acceptance probability (AP), the CPU time (s) per iteration, ESS (min., med., max.), and the time-normalized ESS.

4.5.2 Logistic Regression Models

Next, we evaluate our methods based on five binary classification problems used in [39]. These are the Australian Credit data, German Credit data, Heart data, Pima Indian data, and Ripley data. These data sets are publicly available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). For each problem, we use a logistic regression
model,

$$p(y_i = 1|x_i, \beta) = \frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)}, \quad i = 1, \ldots, n, \qquad \beta \sim N(0, 100 I)$$

where y_i is a binary outcome for the ith observation, x_i is the corresponding vector of predictors (with the first element equal to 1), and β is the set of regression parameters. We use standard HMC, RHMC, sLMC, and LMC to simulate 20000 posterior samples for β. We fix the trajectory length for the different algorithms and tune the step sizes so that they have comparable acceptance rates. Results (after discarding the initial 5000 iterations) are summarized in Table 4.3 and show that, in general, our methods improve the sampling efficiency, measured in terms of minimum ESS per second, compared to RHMC on these examples.

Data | Method | AP | s/Iter | ESS (min, med, max) | min(ESS)/s
Australian (D=14, N=690) | HMC | 0.75 | 6.13E-03 | (1225, 3253, 10691) | 13.32
 | RHMC | 0.72 | 2.96E-02 | (7825, 9238, 9797) | 17.62
 | sLMC | 0.83 | 2.17E-02 | (10184, 13001, 13735) | 31.29
 | LMC | 0.75 | 1.60E-02 | (9636, 10443, 11268) | 40.17
German (D=24, N=1000) | HMC | 0.74 | 1.31E-02 | (766, 4006, 15000) | 3.90
 | RHMC | 0.76 | 6.55E-02 | (14886, 15000, 15000) | 15.15
 | sLMC | 0.71 | 4.13E-02 | (13395, 15000, 15000) | 21.64
 | LMC | 0.70 | 3.74E-02 | (13762, 15000, 15000) | 24.54
Heart (D=13, N=270) | HMC | 0.71 | 1.75E-03 | (378, 850, 2624) | 14.44
 | RHMC | 0.73 | 2.12E-02 | (6263, 7430, 8191) | 19.68
 | sLMC | 0.77 | 1.30E-02 | (10318, 11337, 12409) | 52.73
 | LMC | 0.76 | 1.15E-02 | (10347, 10724, 11773) | 59.80
Pima (D=7, N=532) | HMC | 0.85 | 5.75E-03 | (887, 4566, 12408) | 10.28
 | RHMC | 0.81 | 1.64E-02 | (4349, 4693, 5178) | 17.65
 | sLMC | 0.81 | 8.98E-03 | (4784, 5437, 5592) | 35.50
 | LMC | 0.82 | 7.90E-03 | (4839, 5193, 5539) | 40.84
Ripley (D=2, N=250) | HMC | 0.88 | 1.50E-03 | (820, 3077, 15000) | 36.39
 | RHMC | 0.74 | 1.09E-02 | (12876, 15000, 15000) | 78.83
 | sLMC | 0.80 | 6.79E-03 | (15000, 15000, 15000) | 147.38
 | LMC | 0.79 | 5.36E-03 | (12611, 15000, 15000) | 157.02

Table 4.3: Comparing alternative methods using five binary classification problems discussed in [39]. For each dataset, the number of predictors, D, and the number of observations, N, are specified. For each method, we provide the acceptance probability (AP), the CPU time (s) per iteration, ESS (min., med., max.), and the time-normalized ESS.

4.5.3 Multivariate T-distributions

The computational complexity of standard HMC is O(D). This is substantially lower than O(D^2.373), which is the computational complexity of the three geometrically motivated methods discussed here (RHMC, sLMC, and LMC). On the other hand, these three methods can have substantially better mixing rates compared to standard HMC, whose mixing time is mainly determined by the condition number of the target distribution, defined as the ratio of the maximum and minimum eigenvalues of its covariance matrix: λmax/λmin.

Figure 4.5: Left: Sampling efficiency, min(ESS)/s, vs. the condition number for a fixed dimension (D = 20). Right: Sampling efficiency vs. dimension for a fixed condition number (λmax/λmin = 10000). Each algorithm is tuned to have an acceptance rate around 70%. Results are based on 5000 samples after discarding the initial 1000 samples.

In this section, we illustrate how the efficiency of these sampling algorithms changes as the condition number varies, using multivariate t-distributions with the following density function:

$$\pi(x) = \frac{\Gamma((\nu+D)/2)}{\Gamma(\nu/2)}(\pi\nu)^{-D/2}|\Sigma|^{-1/2}\left[1 + \frac{1}{\nu}x^T\Sigma^{-1}x\right]^{-(\nu+D)/2} \qquad (4.26)$$

where ν is the degrees of freedom and D is the dimension.
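For reference, a minimal sketch of the log of density (4.26) follows; it is illustrative only (with `scipy` assumed available), not the code used for the experiments.

import numpy as np
from scipy.special import gammaln

def mvt_logpdf(x, Sigma, nu):
    """Log density of the multivariate t-distribution (4.26)."""
    D = len(x)
    _, logdet = np.linalg.slogdet(Sigma)
    quad = x @ np.linalg.solve(Sigma, x)  # x^T Sigma^{-1} x
    return (gammaln((nu + D) / 2) - gammaln(nu / 2)
            - 0.5 * D * np.log(np.pi * nu) - 0.5 * logdet
            - 0.5 * (nu + D) * np.log1p(quad / nu))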
In our first simulation, we fix the dimension at D = 20 and vary the condition number of Σ from 10 to 10^5. As the condition number increases, one can expect HMC to be more restricted by the smallest eigen-direction, whereas RHMC, sLMC, and LMC adapt to the local geometry. Results presented in figure 4.5 (left panel) show that this is in fact the case: for higher condition numbers, the geometrically motivated methods perform substantially better than standard HMC. Note that our two proposed algorithms, sLMC and LMC, provide substantial improvements over RHMC. For our second simulation, we fix the condition number at 10000 and let the dimension change from 10 to 50. Our results (figure 4.5, right panel) show that the gain from exploiting geometric properties of the target distribution could eventually be undermined as the dimension increases.

4.5.4 Finite Mixture of Gaussians

Finally, we consider finite mixtures of univariate Gaussian components of the form

$$p(x|\theta) = \sum_{k=1}^{K}\pi_k N(x|\mu_k, \sigma_k^2) \qquad (4.27)$$

where θ is the vector of size D = 3K − 1 of all the parameters πk, µk and σk², and N(·|µ, σ²) is a Gaussian density with mean µ and variance σ². A common choice of prior takes the form

$$p(\theta) = \mathcal{D}(\pi_1, \ldots, \pi_K|\lambda)\prod_{k=1}^{K}N(\mu_k|m, \beta^{-1}\sigma_k^2)\,\mathrm{IG}(\sigma_k^2|b, c) \qquad (4.28)$$

where D(·|λ) is the symmetric Dirichlet distribution with parameter λ, and IG(·|b, c) is the inverse Gamma distribution with shape parameter b and scale parameter c. Although the posterior distribution associated with this model is formally explicit, it is computationally intractable, since it can be expressed as a sum of K^N terms corresponding to all possible allocations of observations {x_i}_{i=1}^N to mixture components [chap. 9 of 71]. We want to use this model to test the efficiency of posterior sampling of θ using the four methods. A more extensive comparison of Riemannian manifold MCMC and HMC, Gibbs sampling, and standard Metropolis-Hastings for finite Gaussian mixture models can be found in [47]. Due to the non-analytic nature of the expected Fisher information, I(θ), we use the empirical Fisher information as the metric tensor [chap. 2 of 72].

Definition 4.6 (Empirical Fisher information).

$$G(\theta) = S^T S - \frac{1}{N}ss^T$$

where the N × D score matrix S has elements S_{i,d} = ∂ log p(x_i|θ)/∂θ_d and s = Σ_{i=1}^N S_{i,·}^T.

Dataset name | Density function | Num. of parameters
Kurtotic | (2/3)N(x|0, 1) + (1/3)N(x|0, (1/10)²) | 5
Bimodal | (1/2)N(x|−1, (2/3)²) + (1/2)N(x|1, (2/3)²) | 5
Skewed | (3/4)N(x|0, 1) + (1/4)N(x|3/2, (1/3)²) | 5
Trimodal | (9/20)N(x|−6/5, (3/5)²) + (9/20)N(x|6/5, (3/5)²) + (1/10)N(x|0, (1/4)²) | 8
Claw | (1/2)N(x|0, 1) + Σ_{i=0}^{4}(1/10)N(x|i/2 − 1, (1/10)²) | 17

Table 4.4: Densities used for the generation of synthetic mixture-of-Gaussian data sets.

Figure 4.6: Densities used to generate synthetic datasets. From left to right the densities are in the same order as in Table 4.4. The densities are taken from [72].

We show five Gaussian mixtures in table 4.4 and figure 4.6 and compare the sampling efficiency of HMC, RHMC, sLMC and LMC using simulated datasets in table 4.5. As before, our two algorithms outperform RHMC.
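Definition 4.6 translates directly into code; the following minimal sketch assumes a hypothetical user-supplied function `score(theta, x_i)` returning the D-vector of per-observation scores ∂ log p(x_i|θ)/∂θ.

import numpy as np

def empirical_fisher(theta, data, score):
    """Empirical Fisher information G = S^T S - (1/N) s s^T (Definition 4.6)."""
    S = np.array([score(theta, x_i) for x_i in data])  # N x D score matrix
    s = S.sum(axis=0)                                  # sum of score rows
    N = len(data)
    return S.T @ S - np.outer(s, s) / N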
4.6 Discussion

Following the approach of [39] for more efficient exploration of the parameter space, we have proposed new sampling schemes to reduce the computational cost associated with using a position-specific mass matrix. To this end, we have developed a semi-explicit (sLMC) integrator and a fully explicit (LMC) integrator for RHMC and demonstrated their advantage in improving computational efficiency over the generalized leapfrog (RHMC) method used by [39]. It is easy to show that if G(θ) ≡ M, our method reduces to standard HMC. Compared to HMC, whose local and global errors are O(ε³) and O(ε²) respectively, LMC's local error is O(ε²) and its global error is O(ε) (proposition 4.7). Although the numerical solutions converge to the true solutions of the corresponding dynamics at a slower rate for LMC than for HMC, in general the approximation remains adequate, leading to reasonably high acceptance rates while providing a more computationally efficient sampling mechanism. Compared to RHMC, our LMC method has the additional advantage of being more stable by avoiding implicit updates relying on the fixed-point iteration method: RHMC can occasionally give highly divergent solutions, especially for ill-conditioned metrics G(θ).

Data | Method | AP | s | ESS (min, med, max) | min(ESS)/s
claw | HMC | 0.88 | 7.01E-01 | (1916, 3761, 4970) | 0.54
 | RHMC | 0.80 | 5.08E-01 | (1524, 3474, 4586) | 0.60
 | sLMC | 0.86 | 3.76E-01 | (2531, 4332, 5000) | 1.35
 | LMC | 0.82 | 2.92E-01 | (2436, 3455, 4608) | 1.67
trimodal | HMC | 0.77 | 3.43E-01 | (2244, 2945, 3159) | 1.30
 | RHMC | 0.79 | 9.94E-02 | (4701, 4928, 5000) | 9.46
 | sLMC | 0.82 | 4.02E-02 | (4978, 5000, 5000) | 24.77
 | LMC | 0.80 | 4.84E-02 | (4899, 4982, 5000) | 20.21
skewed | HMC | 0.83 | 1.78E-01 | (2915, 3237, 3630) | 3.27
 | RHMC | 0.85 | 5.10E-02 | (5000, 5000, 5000) | 19.63
 | sLMC | 0.82 | 2.26E-02 | (4698, 4940, 5000) | 41.68
 | LMC | 0.84 | 2.52E-02 | (4935, 5000, 5000) | 39.09
kurtotic | HMC | 0.78 | 2.85E-01 | (3013, 3331, 3617) | 2.11
 | RHMC | 0.82 | 4.72E-02 | (5000, 5000, 5000) | 21.20
 | sLMC | 0.85 | 2.54E-02 | (5000, 5000, 5000) | 39.34
 | LMC | 0.81 | 2.70E-02 | (5000, 5000, 5000) | 36.90
bimodal | HMC | 0.73 | 1.61E-01 | (2923, 2991, 3091) | 3.62
 | RHMC | 0.86 | 5.38E-02 | (5000, 5000, 5000) | 18.56
 | sLMC | 0.81 | 2.06E-02 | (4935, 4996, 5000) | 48.00
 | LMC | 0.85 | 2.06E-02 | (5000, 5000, 5000) | 46.43

Table 4.5: Acceptance probability (AP), seconds per iteration (s), ESS (min., med., max.) and time-normalized ESS for Gaussian mixture models. Results are calculated on a 5,000-sample chain with a 5,000-sample burn-in session. For HMC the burn-in session was 20,000 samples in order to ensure convergence.

Future directions could involve splitting the Hamiltonian [37, 41, 73, 74] to develop explicit geometric integrators. For example, one could split a non-separable Hamiltonian dynamics into several smaller dynamics, some of which can be solved analytically. Specifically, the Lagrangian dynamics (4.9) could be split into the following two smaller dynamics:

$$\begin{cases}\dot\theta = v\\ \dot v = -\frac{1}{2}G(\theta)^{-1}\nabla_\theta\phi(\theta)\end{cases} \qquad \begin{cases}\dot\theta = 0\\ \dot v = -v^T\Gamma(\theta)v\end{cases} \qquad (4.29)$$

the first one separable and the second one solvable element-wise. A similar idea has been explored by [75], where the Hamiltonian, instead of the dynamics, is split. Recently, [51] proposed an alternative splitting essentially similar to (4.29):

$$\begin{cases}\dot\theta = 0\\ \dot v = -\frac{1}{2}G(\theta)^{-1}\nabla_\theta\phi(\theta)\end{cases} \qquad \begin{cases}\dot\theta = v\\ \dot v = -v^T\Gamma(\theta)v\end{cases} \qquad (4.30)$$

the first one only updating v and the second one having an analytical solution as a geodesic when available. See more discussion in chapter 6.
Because our methods involve costly matrix inversions, another possible research direction could be to approximate the mass matrix (and the Christoffel symbols as well) to reduce the computational cost. For many high-dimensional problems, the mass matrix could be appropriately approximated by a highly sparse or structured (e.g., tridiagonal) matrix. This could further improve our method's computational efficiency.

5 Wormhole Hamiltonian Monte Carlo

5.1 Introduction

It is well known that standard Markov Chain Monte Carlo (MCMC) methods (e.g., Metropolis algorithms) tend to fail when the target distribution is multimodal [3, 52, 76, 77, 78, 79, 80]. These methods typically fail to move from one mode to another since such moves require passing through low probability regions. This is especially true for high dimensional problems with isolated modes. Therefore, despite recent advances in computational Bayesian methods, designing effective MCMC samplers for multimodal distributions has remained a major challenge. In the statistics and machine learning literature, many methods have been proposed to address this issue [see 52, 77, 78, 79, 81, 82, 83, 84, 85, for example]. However, these methods tend to suffer from the curse of dimensionality [83, 85].

In this chapter, we propose a new algorithm, which exploits and modifies the Riemannian geometric properties of the target distribution to create wormholes connecting modes in order to facilitate moving between them. Our method can be regarded as an extension of Hamiltonian Monte Carlo (HMC, chapter 2). Compared to random walk Metropolis (RWM), standard HMC explores the target distribution more efficiently by exploiting its geometric properties. However, it also tends to fail when the target distribution is multimodal, since the modes are separated by high energy barriers (low probability regions) [79]. Before presenting our proposed method, we provide an explanation of the energy barriers that prevent standard HMC from moving between modes in the next section. We then introduce our method in three steps, assuming the locations of the modes are known (either exactly or approximately), possibly through some optimization techniques [e.g. 55, 86]. Later, we relax this assumption by incorporating a mode searching algorithm in our method in order to identify new modes and to update the network of wormholes. To this end, we use the regeneration method [87, 88, 89].

Throughout this chapter, we evaluate our method's performance by comparing it to a state-of-the-art algorithm called Regenerative Darting Monte Carlo (RDMC) [85], which is designed for sampling from multimodal distributions. RDMC itself is an improved version of the Darting Monte Carlo (DMC) algorithm [79, 90]. We show that our proposed approach performs substantially better than RDMC, especially for high dimensional problems.

5.2 Energy Barrier in HMC

HMC [36, 37] is a Metropolis algorithm with proposals made by numerically simulating the Hamiltonian dynamics of an augmented state space (position θ and ancillary momentum p). Because it is guided by Hamiltonian dynamics, HMC improves upon RWM by proposing states that are distant from the current state but nevertheless accepted with high probability (chapter 2 provides details of the HMC algorithm). However, HMC does not fully exploit the geometric structure of the target distribution and thus may not explore complicated distributions efficiently.
[39] define HMC on a Riemannian manifold (RHMC, see chapter 4 for more details) by replacing the fixed mass matrix M with the position dependent Fisher metric G(θ) to adapt to the local geometry of the parameter space. In the remainder of the chapter, we use the notation G0 to refer generally to a Riemannian metric, which is not necessarily the Fisher information. Even though the ability to explore the target distribution improves as more and more geometric information (gradient, metric) is utilized, these energy (Hamiltonian) based algorithms alone cannot explore multimodal distributions very well due to the energy barrier phenomenon: the sampler gets trapped in some of the modes, unable to move to other modes that are isolated by low probability regions.

Recall that in HMC, the potential energy is defined as minus the log of the target density, so each local maximum (mode) of the density corresponds to a local minimum of the potential energy (a well), and low density regions correspond to energy barriers. The total energy (Hamiltonian) is (approximately) preserved in the Hamiltonian dynamical system (section 2.1.1), but it may not be enough for the sampler to escape from one energy well to another. Figure 5.1, showing a frictionless puck sliding on a surface with two local minima, illustrates this phenomenon: once an initial velocity (or momentum) with value v0 is sampled, the whole system (θ, v) evolves with some fixed energy H = U(θ0) + K(v0) until the highest point, where the kinetic energy has completely converted to potential energy and thus v = 0; if the puck stays within the same energy well it started from, it will start sliding backwards to the bottom, lacking the momentum to pass over the barrier into the other energy well.

Figure 5.1: A frictionless puck starting from the left energy well (corresponding to the left mode of the density) cannot pass over the energy barrier into the right energy well (corresponding to the right mode of the density).

Note that it is not as simple as increasing the initial velocity to endow the sampler with more energy to overcome the barrier. In practice, the Hamiltonian dynamics (2.2) are solved numerically, so a larger velocity means a larger leap at each discretized step, which causes a larger error and, in turn, a higher chance of rejection. In the following section, we introduce a natural modification of the base metric G0 such that the associated Hamiltonian dynamical system has a much greater chance of moving between isolated modes.

5.3 Wormhole HMC Algorithm

We need the concept of distance on a manifold to develop our method.

Definition 5.1 (Distance on a manifold). Let (M, G(θ)) be a Riemannian manifold. Given a differentiable curve θ(t)¹ : [0, T] → M, one can define its arclength as follows:

$$\ell(\theta) := \int_0^T\sqrt{\dot\theta(t)^T G(\theta(t))\,\dot\theta(t)}\,dt \qquad (5.1)$$

Given any two points θ1, θ2 ∈ M, there exists (a condition nearly always satisfied in statistical models) a curve θ(t) : [0, T] → M satisfying the boundary conditions θ(0) = θ1, θ(T) = θ2 whose arclength is minimal among the curves connecting θ1 and θ2. The length of such a minimal curve defines a distance function on M.

Remark 5.1. The minimal curve satisfies the following geodesic equation:

$$\ddot\theta + \dot\theta^T\Gamma(\theta)\dot\theta = 0 \qquad (5.2)$$

Thus the minimal curve is also called a minimizing geodesic. The solution to (5.2) is equivalent to the Hamiltonian flow with only kinetic energy (see section 4.3.1).

¹ Here we identify the curve defined on the manifold, which should have been written π_θ(t) = φ(θ(t)), with its coordinate θ(t). Therefore, the curve length should be

$$\int_0^T\left\langle\frac{d\phi(\theta(t))}{dt}, \frac{d\phi(\theta(t))}{dt}\right\rangle^{1/2}dt = \int_0^T\sqrt{\dot\theta^T\left\langle\frac{\partial}{\partial\theta}, \frac{\partial}{\partial\theta}\right\rangle\dot\theta}\,dt = \int_0^T\sqrt{\dot\theta(t)^T G(\theta(t))\,\dot\theta(t)}\,dt$$
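For intuition, the arclength (5.1) can be approximated by discretizing the curve into points and using finite differences for the velocity; the sketch below (with a hypothetical callable `G` for the metric) is illustrative only.

import numpy as np

def arclength(curve, G, dt):
    """Discretized arclength (5.1); curve is an array of points theta(t)."""
    length = 0.0
    for a, b in zip(curve[:-1], curve[1:]):
        v = (b - a) / dt                      # finite-difference velocity
        length += np.sqrt(v @ G(a) @ v) * dt  # sqrt(v^T G(theta) v) dt
    return length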
In Euclidean space, where G(θ) ≡ I, the shortest curve connecting θ1 and θ2 is simply a straight line with the Euclidean length ‖θ1 − θ2‖2. In the following, we use the Hamiltonian flow (2.2), the Riemannian Hamiltonian flow (4.4), or the Lagrangian flow (4.9) to define the distance on the manifold, whichever is appropriate from the context.

5.3.1 Tunnel Metric

To overcome the energy barrier, we propose to replace the base metric G0 with a new metric under which the distance between modes is shortened. This way, we can facilitate moving between modes by creating high-speed "tunnels" connecting the modes through the energy barrier. Let θ̂1 and θ̂2 be two modes of the target distribution. We define a straight line segment, vT := θ̂2 − θ̂1, and refer to a small neighborhood (tube) of the line segment as a tunnel. Next, we define a tunnel metric, GT(θ), in the vicinity of the tunnel. The metric GT(θ) is an inner product assigning a non-negative real number to a pair of tangent vectors u, w: GT(θ)(u, w) ∈ R+. To shorten the distance in the direction of vT, we project both u and w onto the plane normal to vT and then take the Euclidean inner product of the projected vectors.

Definition 5.2 (Tunnel Metric). Set v*T = vT/‖vT‖. First, define a pseudo tunnel metric G*T as follows:

$$G_T^*(u, w) := \langle u - \langle u, v_T^*\rangle v_T^*,\; w - \langle w, v_T^*\rangle v_T^*\rangle = u^T[I - v_T^*(v_T^*)^T]\,w$$

Note that G*T := I − v*T(v*T)^T is positive semi-definite (degenerate in the direction v*T ≠ 0). We then modify it to be positive definite, and define the tunnel metric GT as follows:

$$G_T = G_T^* + \epsilon\, v_T^*(v_T^*)^T = I - (1 - \epsilon)v_T^*(v_T^*)^T \qquad (5.3)$$

where 0 < ε ≪ 1 is a small positive number.

Remark 5.2. The smallest eigenvalue of GT is ε, with eigen-direction v*T; all the others are 1, with eigen-directions normal to v*T. The tunnel metric has a clear interpretation as cutting off the projection of any vector v onto the tunnel direction v*T, in the following sense:

$$v = [(1 - \epsilon)v_T^*(v_T^*)^T]v + [I - (1 - \epsilon)v_T^*(v_T^*)^T]v = (1 - \epsilon)\langle v, v_T^*\rangle v_T^* + G_T v$$

so most of the projection onto v*T is removed after multiplying by GT.

To see that the tunnel metric GT in fact shortens the distance between θ̂1 and θ̂2, consider a simple case where θ(t) follows a straight line: θ(t) = θ̂1 + vT t, t ∈ [0, 1]. In this case, the distance under GT is

$$\mathrm{dist}(\hat\theta_1, \hat\theta_2) = \int_0^1\sqrt{v_T^T G_T v_T}\,dt = \sqrt{\epsilon}\,\|v_T\| \ll \|v_T\|$$

which is much smaller than the Euclidean distance. Next, we define the overall metric, G, for the whole parameter space of θ as a weighted sum of the base metric G0 and the tunnel metric GT:

$$G(\theta) = (1 - m(\theta))G_0(\theta) + m(\theta)G_T \qquad (5.4)$$

where m(θ) ∈ (0, 1) is a mollifying function designed to make the tunnel metric GT influential only in the vicinity of the tunnel, chosen as follows:

$$m(\theta) := \exp\{-(\|\theta - \hat\theta_1\| + \|\theta - \hat\theta_2\| - \|\hat\theta_1 - \hat\theta_2\|)/F\} \qquad (5.5)$$

where the influence factor F > 0 is a free parameter that can be tuned to modify the extent of the influence of GT: decreasing F makes the influence of GT more restricted around the tunnel. The resulting metric leaves the base metric almost intact outside of the tunnel, while making the transition of the metric from outside to inside smooth.
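A minimal sketch of the tunnel metric (5.3) and the mollified overall metric (5.4)-(5.5) follows; it is illustrative only, with `G0` a hypothetical callable for the base metric. The defaults ε = 0.03 and F = 0.3 are the values chosen in the illustrative example below.

import numpy as np

def tunnel_metric(mode1, mode2, eps=0.03):
    """G_T = I - (1 - eps) v* v*^T, with v* the unit tunnel direction (5.3)."""
    vT = mode2 - mode1
    u = vT / np.linalg.norm(vT)
    return np.eye(len(vT)) - (1.0 - eps) * np.outer(u, u)

def overall_metric(theta, mode1, mode2, G0, GT, F=0.3):
    """Mollified mixture (5.4): (1 - m) G0(theta) + m G_T."""
    # m(theta) decays with the excess of the broken-line distance through
    # theta over the direct distance between the two modes (5.5)
    m = np.exp(-(np.linalg.norm(theta - mode1) + np.linalg.norm(theta - mode2)
                 - np.linalg.norm(mode1 - mode2)) / F)
    return (1.0 - m) * G0(theta) + m * GT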
Within the tunnel, the trajectories are mainly guided in the tunnel direction v*T: G(θ) ≈ GT, so G(θ)^{-1} ≈ GT^{-1} has the dominant eigenvector v*T (with eigenvalue 1/ε ≫ 1), and therefore v ∼ N(0, G(θ)^{-1}) tends to be directed along v*T.

The tunnel metric GT is constant and is calculated before we start the Markov chain. At each step, one only needs to recalculate the mollifier, adding an almost negligible cost compared to updating G(θ) in RHMC [39]. We use the mixed overall metric (5.4) to substitute for the Fisher metric in RHMC [39] or LMC (see chapter 4), and call the resulting algorithm Tunnel Hamiltonian Monte Carlo (THMC). Figure 5.2 compares THMC with standard HMC based on the following illustrative example discussed in [91]:

$$\theta_d \sim N(0, \sigma_d^2),\ d = 1, 2, \qquad x_i \sim \frac{1}{2}N(\theta_1, \sigma_x^2) + \frac{1}{2}N(\theta_1 + \theta_2, \sigma_x^2)$$

Here, we set θ1 = 0, θ2 = 1, σ1² = 10, σ2² = 1, σx² = 2, and generate 1000 data points from the above model. In figure 5.2, the dots show the posterior samples of (θ1, θ2) given the simulated data. As we can see, the two modes are far from each other, and moving from one mode to the other requires passing through a low density region. While HMC is trapped in one mode, THMC moves easily between the two modes. For this example, we set G0 = I to make THMC comparable to standard HMC. Further, we use 0.03 and 0.3 for ε and F respectively.

Figure 5.2: Comparing HMC and THMC in terms of sampling from a 2d posterior distribution of a mixture of 2 Gaussians with tied means.

For more than two modes, we can construct a network of tunnels by creating a tunnel between any two modes. Alternatively, we can create a tunnel between neighboring modes only. We can define the neighborhood using, for example, a minimal spanning tree [92].

5.3.2 Wind Tunnel

The above method can fail occasionally when the target distribution is highly concentrated around its modes. This often happens in high-dimensional problems. In such cases, the effect of the tunnel metric diminishes fast as the sampler leaves one mode towards another. To address this issue, we propose to add an external vector field f to the Lagrangian dynamics (equation (4.9) in section 4.3.1) to enforce the movement between modes, as shown below:

$$\begin{cases}\dot\theta = v + f(\theta, v)\\ \dot v = -\eta(\theta, v) - G(\theta)^{-1}\nabla_\theta\phi(\theta)\end{cases} \qquad (5.6)$$

We define the wind vector f(θ, v) in terms of the position θ and the velocity v.

Definition 5.3 (Wind Vector). A wind vector f(θ, v) is defined as follows:

$$f(\theta, v) := \exp\{-V(\theta)/(DF)\}\,\langle v, v_T^*\rangle v_T^* = m(\theta)\langle v, v_T^*\rangle v_T^*$$

with mollifier m(θ) := exp{−V(θ)/(DF)}, where D is the dimension, F > 0 is the influence factor, and V(θ) is a vicinity function indicating the Euclidean distance from the line segment vT:

$$V(\theta) := \langle\theta - \hat\theta_1, \theta - \hat\theta_2\rangle + |\langle\theta - \hat\theta_1, v_T^*\rangle|\,|\langle\theta - \hat\theta_2, v_T^*\rangle| \qquad (5.7)$$

Figure 5.3: Left: contour of the vicinity function in equation (5.7) defining the wind tunnel; Right: a tunnel shown in a 2d mixture of 5 Gaussians.

The contour of this vicinity function V(θ) indeed looks like a tunnel, as shown in figure 5.3.
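The vicinity function (5.7) and the wind vector of Definition 5.3 translate directly into code; the following minimal sketch is illustrative only.

import numpy as np

def vicinity(theta, mode1, mode2, u):
    """Vicinity function V(theta) of (5.7); u is the unit tunnel direction."""
    return (np.dot(theta - mode1, theta - mode2)
            + abs(np.dot(theta - mode1, u)) * abs(np.dot(theta - mode2, u)))

def wind_vector(theta, v, mode1, mode2, F):
    """Wind vector f(theta, v) = m(theta) <v, v*> v* (Definition 5.3)."""
    vT = mode2 - mode1
    u = vT / np.linalg.norm(vT)
    D = len(theta)
    m = np.exp(-vicinity(theta, mode1, mode2, u) / (D * F))
    return m * np.dot(v, u) * u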
Note that the resulting wind vector field has three desirable properties: i) it is confined to a neighborhood of each tunnel; ii) it enforces the movement along the tunnel; iii) its influence diminishes at the end of the tunnel, when the sampler reaches the other mode.

Now, to use the wind Lagrangian dynamics (5.6) for proposals, we need a proper integrator in order to satisfy the detailed balance condition (2.3). We construct a time reversible integrator for the system (5.6) by concatenating its Euler-B integrator with its Euler-A integrator [61] (see also section 4.3.2.1):

$$v^{(n+1/2)} = \left[I + \tfrac{\varepsilon}{2}\Omega(\theta^{(n)}, v^{(n)})\right]^{-1}\left[v^{(n)} - \tfrac{\varepsilon}{2}G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\right] \qquad (5.8)$$

$$\theta^{(n+1)} = \theta^{(n)} + \varepsilon\left[v^{(n+1/2)} + \left(f(\theta^{(n)}, v^{(n+1/2)}) + f(\theta^{(n+1)}, v^{(n+1/2)})\right)/2\right] \qquad (5.9)$$

$$v^{(n+1)} = \left[I + \tfrac{\varepsilon}{2}\Omega(\theta^{(n+1)}, v^{(n+1/2)})\right]^{-1}\left[v^{(n+1/2)} - \tfrac{\varepsilon}{2}G(\theta^{(n+1)})^{-1}\nabla_\theta\phi(\theta^{(n+1)})\right] \qquad (5.10)$$

where the implicit equation (5.9) can be solved by fixed-point iteration.

The integrator (5.8)-(5.10) is time reversible and numerically stable, but not volume preserving. Therefore we need to adjust the acceptance rate by the Jacobian determinant calculated by the wedge product (see section 4.3.2.3):

$$d\theta^{(n+1)}\wedge dv^{(n+1)} = \left[I + \tfrac{\varepsilon}{2}\Omega(\theta^{(n+1)}, v^{(n+1/2)})\right]^{-T}\left[I - \tfrac{\varepsilon}{2}\Omega(\theta^{(n+1)}, v^{(n+1)})\right]^T\cdot\left[I - \tfrac{\varepsilon}{2}\nabla_{\theta^T}f(\theta^{(n+1)}, v^{(n+1/2)})\right]^{-1}\left[I + \tfrac{\varepsilon}{2}\nabla_{\theta^T}f(\theta^{(n)}, v^{(n+1/2)})\right]\cdot\left[I + \tfrac{\varepsilon}{2}\Omega(\theta^{(n)}, v^{(n)})\right]^{-T}\left[I - \tfrac{\varepsilon}{2}\Omega(\theta^{(n)}, v^{(n+1/2)})\right]^T d\theta^{(n)}\wedge dv^{(n)} \qquad (5.11)$$

where ∇_{θ^T} f(θ, v) = v*T(v*T)^T v ∇m(θ)^T. We then accept the proposal obtained by implementing (5.8)-(5.10) for L steps with the following probability:

$$\alpha_{WT} = \min\{1, \exp(-E(\theta^{(L+1)}, v^{(L+1)}) + E(\theta^{(1)}, v^{(1)}))\,|\det J_{WT}|\}$$

where the Jacobian determinant is det J_{WT} = Π_{n=1}^{L} ∂(θ^{(n+1)}, v^{(n+1)})/∂(θ^{(n)}, v^{(n)}), and the energy E is defined in (4.15) (see more details in chapter 4). Figure 5.4 illustrates this approach based on sampling from a mixture of 10 Gaussian distributions with dimension D = 100.

Figure 5.4: Sampling from a mixture of 10 Gaussian distributions with dimension D = 100 using THMC along with a wind vector f(θ, v) to enforce moving between modes in higher dimensions.

5.3.3 Wormhole

While the previous examples show that our addition of tunnels to the Hamiltonian dynamics succeeds in facilitating rapid transitions between modes, the implementation has the downside that the native HMC dynamics are overridden in a neighborhood of the tunnel, possibly preventing the sampler from properly exploring some of the low probability regions, as well as some parts of a mode. Indeed, any tunneling mechanism that modifies the dynamics in the existing parameter space will suffer from this issue. Thus we are inevitably led to the idea of allowing the tunnels to pass through an extra dimension so as not to interfere with the existing HMC dynamics in the given parameter space; we call such tunnels wormholes.

In particular, we introduce an extra auxiliary variable θ_{D+1} ∼ N(0, 1) corresponding to an auxiliary dimension. We use θ̃ := (θ, θ_{D+1}) to denote the position parameters in the resulting (D + 1)-dimensional space M^D × R. θ_{D+1} can be viewed as random noise independent of θ, and it contributes ½θ²_{D+1} to the total potential energy. At the end of the sampling, we discard θ_{D+1}, thereby projecting θ̃ back to the real world. Correspondingly, we augment the velocity v with one extra dimension, denoted as ṽ := (v, v_{D+1}). We refer to M^D × {−h} as the real world, and M^D × {+h} as the mirror world.
The two worlds are connected by networks of wormholes, as shown in figure 5.5. We construct these wormholes in a 'mobile network' fashion. When the sampler is near a mode (θ̂1, −h) in the real world, we build a wormhole network by connecting it to all the modes in the mirror world. Similarly, we connect the corresponding mode in the mirror world, (θ̂1, +h), to all the modes in the real world.

Figure 5.5: Illustrating a wormhole network connecting the real world to the mirror world (h = 1). As an example, the cylinder shows a wormhole connecting mode 5 in the real world to its mirror image. The dashed lines show two sets of wormholes. The red lines show the wormholes when the sampler is close to mode 1 in the real world, and the magenta lines show the wormholes when the sampler is close to mode 5 in the mirror world.

Note that this construction allows the sampler to jump from one mode to the vicinity of the same mode, avoiding overzealous blowing by the wind tunnel. Note also that several wormholes starting from the same mode may still influence each other in the intersecting region if they exist simultaneously. To further resolve this interference, we adopt a stochastic way to weigh these wormholes through a random wind vector f̃, instead of deterministically weighing wind tunnels by the vicinity function (5.7). Now suppose that the current position, θ̃, of the sampler is near a mode denoted as θ̃*0. A network of wormholes connects this mode to all the modes in the opposite world, θ̃*k, k = 1, ..., K.

Definition 5.4 (Random Wind Vector). A random wind vector f̃(θ̃, ṽ) is defined as follows:

$$\tilde f(\tilde\theta, \tilde v) \sim \begin{cases}\left(1 - \sum_k m_k(\tilde\theta)\right)\delta_{\tilde v}(\cdot) + \sum_k m_k(\tilde\theta)\,\delta_{2(\tilde\theta_k^* - \tilde\theta)/e}(\cdot), & \text{if } \sum_k m_k(\tilde\theta) < 1\\[4pt] \sum_k \dfrac{m_k(\tilde\theta)}{\sum_k m_k(\tilde\theta)}\,\delta_{2(\tilde\theta_k^* - \tilde\theta)/e}(\cdot), & \text{if } \sum_k m_k(\tilde\theta) \geq 1\end{cases}$$

where e is the stepsize, δ is the Kronecker delta function, and m_k(θ̃) = exp{−V_k(θ̃)/(DF)}, with V_k(θ̃) the vicinity function defined similarly to (5.7) along the k-th wormhole in the network:

$$V_k(\tilde\theta) = \langle\tilde\theta - \tilde\theta_0^*, \tilde\theta - \tilde\theta_k^*\rangle + |\langle\tilde\theta - \tilde\theta_0^*, \tilde v_{T_k}^*\rangle|\,|\langle\tilde\theta - \tilde\theta_k^*, \tilde v_{T_k}^*\rangle|$$

where ṽ*_{T_k} = (θ̃*k − θ̃*0)/‖θ̃*k − θ̃*0‖.

For each update, f̃(θ̃, ṽ) is either ṽ or 2(θ̃*k − θ̃)/e, according to the position dependent probabilities defined in terms of m_k(θ̃). We then make proposals with the following modified Lagrangian dynamics with a random wind vector field in the extended space:

$$\begin{cases}\dot{\tilde\theta} = \tilde f(\tilde\theta, \tilde v)\\ \dot{\tilde v} = -\eta(\tilde\theta, \tilde v) - G(\tilde\theta)^{-1}\nabla_{\tilde\theta}\phi(\tilde\theta)\end{cases} \qquad (5.12)$$

Note that compared to the first equation in (5.6), ṽ is now absorbed into f̃(θ̃, ṽ). To solve the modified Lagrangian dynamics (5.12) in a time-reversible manner, we still refer to (5.8)-(5.10), except that solving (5.9) by fixed-point iteration involves random vectors:

$$\tilde\theta^{(\ell+1)} = \tilde\theta^{(\ell)} + \frac{e}{2}\left[\tilde f(\tilde\theta^{(\ell)}, \tilde v^{(\ell+1/2)}) + \tilde f(\tilde\theta^{(\ell+1)}, \tilde v^{(\ell+1/2)})\right] \qquad (5.13)$$

Therefore, in each update the sampler either stays in the vicinity of θ̃*0 or proposes a move towards a mode θ̃*k in the opposite world, depending on the values of f̃(θ̃^(ℓ), ṽ^(ℓ+1/2)) and f̃(θ̃^(ℓ+1), ṽ^(ℓ+1/2)). For example, if we have f̃(θ̃^(ℓ), ṽ^(ℓ+1/2)) = 2(θ̃*k − θ̃^(ℓ))/e and f̃(θ̃^(ℓ+1), ṽ^(ℓ+1/2)) = ṽ^(ℓ+1/2), then equation (5.13) becomes

$$\tilde\theta^{(\ell+1)} = \tilde\theta_k^* + \frac{e}{2}\tilde v^{(\ell+1/2)}$$

which indicates that a move to the k-th mode in the opposite world has in fact occurred.
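Sampling the random wind vector amounts to a simple categorical draw; a minimal sketch (with `rng` a NumPy random generator and `m_k` the precomputed mollifier values) follows.

import numpy as np

def random_wind(theta_tilde, v_tilde, modes_opposite, m_k, e, rng):
    """Draw f-tilde according to Definition 5.4."""
    total = m_k.sum()
    if total < 1 and rng.uniform() < 1 - total:
        return v_tilde                        # keep moving continuously
    k = rng.choice(len(m_k), p=m_k / total)   # pick a wormhole
    return 2 * (modes_opposite[k] - theta_tilde) / e  # jump toward mode k

When the total mollifier mass is below 1, a jump toward mode k occurs with probability m_k exactly as in the definition; otherwise the wormhole is chosen with the normalized probabilities.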
Note that the movement θ̃^(ℓ) → θ̃^(ℓ+1) in this case is discontinuous, since

$$\lim_{e\to 0}\|\tilde\theta^{(\ell+1)} - \tilde\theta^{(\ell)}\| \geq 2h > 0$$

where 2h is the distance between the two worlds, which should be chosen on the same scale as the average distance among the modes. Therefore, in such cases there will be an energy gap, ∆E = E(θ̃^(ℓ+1), ṽ^(ℓ+1)) − E(θ̃^(ℓ), ṽ^(ℓ)), between the two states. Instead of the volume correction (5.11) (see also section 4.3.2.3), which is not well defined here¹, we adjust the Metropolis acceptance probability to account for the resulting energy gap. Further, we limit the maximum number of jumps within each iteration of MCMC (i.e., over L leapfrog steps) to 1 in order to avoid overzealous jumps between the two worlds. Algorithm 5.1 provides the details of our sampling method, Wormhole Hamiltonian Monte Carlo (WHMC).

Algorithm 5.1 Wormhole Hamiltonian Monte Carlo (WHMC)
Prepare the modes θ̃*k, k = 1, ..., K
Set θ̃^(1) = current θ̃
Sample velocity ṽ^(1) ∼ N(0, I_{D+1})
Calculate E(θ̃^(1), ṽ^(1)) = U(θ̃^(1)) + K(ṽ^(1))
Set ∆E = 0, Jumped = false
for ℓ = 1 to L do
    ṽ^(ℓ+1/2) = ṽ^(ℓ) − (e/2)∇_θ̃ U(θ̃^(ℓ))
    if Jumped then
        θ̃^(ℓ+1) = θ̃^(ℓ) + e ṽ^(ℓ+1/2)
    else
        Find the closest mode θ̃*0 and build a network connecting it to all modes θ̃*k, k = 1, ..., K in the opposite world
        for m = 1 to M do
            Calculate m_k(θ̂̃^(m)), k = 1, ..., K
            Sample u ∼ Unif(0, 1)
            if u < 1 − Σ_k m_k(θ̂̃^(m)) then
                Set f̃(θ̂̃^(m), ṽ^(ℓ+1/2)) = ṽ^(ℓ+1/2)
            else
                Choose one of the K wormholes according to the probabilities {m_k/Σ_{k'} m_{k'}} and set f̃(θ̂̃^(m), ṽ^(ℓ+1/2)) = 2(θ̃*k − θ̂̃^(m))/e
            end if
            θ̂̃^(m+1) = θ̃^(ℓ) + (e/2)[f̃(θ̂̃^(m), ṽ^(ℓ+1/2)) + f̃(θ̃^(ℓ), ṽ^(ℓ+1/2))]
        end for
        θ̃^(ℓ+1) = θ̂̃^(M+1)
    end if
    ṽ^(ℓ+1) = ṽ^(ℓ+1/2) − (e/2)∇_θ̃ U(θ̃^(ℓ+1))
    If a modal jump truly happens, set Jumped = true and calculate the energy gap ∆E
end for
Calculate E(θ̃^(L+1), ṽ^(L+1)) = U(θ̃^(L+1)) + K(ṽ^(L+1))
p = exp{−E(θ̃^(L+1), ṽ^(L+1)) + E(θ̃^(1), ṽ^(1)) + ∆E}
Accept or reject the proposal (θ̃^(L+1), ṽ^(L+1)) according to p

¹ ∇_{θ̃^T} f̃(θ̃, ṽ) in (5.11) has elements that are either all 0 (staying) or all ∞ (jumping).

We close this section with some comments on the width of the wormholes. When the modes have drastically different shapes (high density regions), jumping from a small, round, concentrated mode may be easier than jumping from a long, narrow, spread-out mode. This is because for the latter, the sampler may wander around the narrow wings and have less chance of entering the wormhole if it is not wide enough. So it is plausible to adapt the width of the wormholes to the shape of the modes. One possibility is to project the principal direction of the mode onto the plane perpendicular to the wormhole direction. A more adaptive wormhole design should work even better.

5.4 Mode Searching After Regeneration

So far, we have assumed that the locations of the modes are known. This is of course not a realistic assumption in many situations. In this section, we relax this assumption by extending our method to search for new modes proactively and to update the network of wormholes dynamically. In general, however, allowing such adaptation to take place infinitely often will disturb the stationary distribution of the chain, rendering the process no longer Markov [89, 93]. To avoid this issue, we use the regeneration method discussed by [87, 88, 89, 94]. Regeneration allows adaptation to occur infinitely often without affecting the stationary distribution or the consistency of sample path averages.
Informally, a regenerative process "starts again" probabilistically at each of a set of random stopping times, called regeneration times [94]. These regeneration times divide the chain into segments, called tours, which are independent of each other [88, 89, 94]. Therefore, at regeneration times, the transition mechanism can be modified based on the entire history of the chain up to that point without disturbing the consistency of MCMC estimators. In our method, when a regeneration occurs, we search for new modes and update the network of wormholes going forward until the next regeneration time. When searching for new modes at regeneration times, we learn about the distribution around the known modes from the history of the chain to increase the possibility of finding new modes, as opposed to rediscovering known ones. In what follows, we discuss how our method identifies regeneration times and how it discovers new modes.

5.4.1 Identifying Regeneration Times

The main idea of regeneration is to regard the transition kernel T(θ_{t+1}|θ_t), e.g., a Metropolis-Hastings algorithm with independent proposal (section 2.2.1), as a mixture of two kernels, Q and R [85, 87]:

$$T(\theta_{t+1}|\theta_t) = S(\theta_t)Q(\theta_{t+1}) + (1 - S(\theta_t))R(\theta_{t+1}|\theta_t) \qquad (5.14)$$

where Q(θ_{t+1}) is an independence kernel, and the residual kernel R(θ_{t+1}|θ_t) is defined as follows:

$$R(\theta_{t+1}|\theta_t) = \begin{cases}\dfrac{T(\theta_{t+1}|\theta_t) - S(\theta_t)Q(\theta_{t+1})}{1 - S(\theta_t)}, & \text{if } S(\theta_t) \in [0, 1)\\[6pt] 1, & \text{if } S(\theta_t) = 1\end{cases} \qquad (5.15)$$

Here S(θ_t) is the mixing coefficient between the two kernels, such that

$$T(\theta_{t+1}|\theta_t) \geq S(\theta_t)Q(\theta_{t+1}), \quad \forall\,\theta_t, \theta_{t+1} \qquad (5.16)$$

Now suppose that at iteration t, the current state is θ_t. There are two ways to identify regeneration times.

Prospective Regeneration. Generate a Bernoulli random variable B_{t+1} with success probability S(θ_t),

$$B_{t+1}|\theta_t \sim \mathrm{Bern}(S(\theta_t)) \qquad (5.17)$$

If B_{t+1} = 1, sample θ_{t+1} from the independence kernel, θ_{t+1} ∼ Q(·); otherwise, use the residual kernel to generate θ_{t+1} ∼ R(·|θ_t). When B_{t+1} = 1, the chain regenerates and the transition mechanism Q(·) becomes independent of the current state θ_t. To sum up,

$$P[\theta_{t+1}|B_{t+1}, \theta_t] = Q(\theta_{t+1})\delta_1(B_{t+1}) + R(\theta_{t+1}|\theta_t)\delta_0(B_{t+1}) \qquad (5.18)$$

where δ is the Kronecker delta function. Note that S(·) has to be between 0 and 1, as in definition (5.15), and the non-regenerative states have to be sampled from the residual kernel, which might not be easy. The following retrospective procedure avoids these two restrictions and is thus preferred in practice.

Retrospective Regeneration. For this method, the Bernoulli random variable B_{t+1} is always generated after sampling θ_{t+1}. This way, Q(·) does not need to be normalized and we do not need to specify R(·|θ_t) explicitly [87, 89]. To implement this approach, we first generate θ_{t+1} using the original transition kernel, θ_{t+1}|θ_t ∼ T(·|θ_t). Then, we sample B_{t+1} from the Bernoulli distribution with the retrospective success probability calculated as follows (cf. equations (5.17)-(5.18)):

$$r(\theta_t, \theta_{t+1}) := P[B_{t+1} = 1|\theta_{t+1}, \theta_t] = \frac{P[B_{t+1} = 1, \theta_{t+1}|\theta_t]}{P[\theta_{t+1}|\theta_t]} = \frac{P[\theta_{t+1}|B_{t+1} = 1, \theta_t]\,P[B_{t+1} = 1|\theta_t]}{P[\theta_{t+1}|\theta_t]} = \frac{S(\theta_t)Q(\theta_{t+1})}{T(\theta_{t+1}|\theta_t)} \qquad (5.19)$$

If B_{t+1} = 1, a regeneration has occurred; we then discard θ_{t+1} and sample from the independence kernel θ_{t+1} ∼ Q(·). At regeneration times, we redefine the dynamics using the past sample path. This process is discussed in the following section.

Remark 5.3. It is essential to find a function S ≥ 0 and a probability measure Q (not necessarily normalized) satisfying condition (5.16); this is also called splitting the MCMC kernel, and the pair (S, Q) is called an atom [88, 89]. For MH algorithms (section 2.2.1), it is much easier to split the MCMC kernel for the independent proposal mechanism than for the symmetric one. Suppose the proposal kernel is an independence sampler, q(θ_{t+1}|θ_t) = q(θ_{t+1}); then by (2.4)-(2.5) we can split the MH transition kernel:

$$T(\theta_{t+1}|\theta_t) = q(\theta_{t+1}|\theta_t)\alpha(\theta_t, \theta_{t+1}) + \delta_{\theta_t}(\theta_{t+1})\int q(\theta^*|\theta_t)(1 - \alpha(\theta_t, \theta^*))\,d\theta^* \geq q(\theta_{t+1})\min\left\{1, \frac{\pi(\theta_{t+1})/q(\theta_{t+1})}{\pi(\theta_t)/q(\theta_t)}\right\} \geq q(\theta_{t+1})\min\left\{1, \frac{\pi(\theta_{t+1})/q(\theta_{t+1})}{c}\right\}\cdot\min\left\{c, \frac{1}{\pi(\theta_t)/q(\theta_t)}\right\} =: Q(\theta_{t+1})S(\theta_t)$$

for some c > 0, with

$$S(\theta_t) = \min\left\{c, \frac{1}{\pi(\theta_t)/q(\theta_t)}\right\}, \qquad Q(\theta_{t+1}) = q(\theta_{t+1})\min\left\{1, \frac{\pi(\theta_{t+1})/q(\theta_{t+1})}{c}\right\} \qquad (5.20)$$

However, this is difficult for a symmetric proposal kernel, q(θ_{t+1}|θ_t) = q(θ_t|θ_{t+1}). [88, 89] provide one splitting for it, which however quickly fails as the dimension grows.
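A minimal sketch of the retrospective check follows, assuming the independence-sampler splitting (5.20) above; `log_pi` and `log_q` are hypothetical callables for the log target and log proposal densities, and the final clipping is a numerical safeguard rather than part of the theory.

import numpy as np

def regenerated(theta_t, theta_next, log_pi, log_q, c, rng):
    """Retrospective regeneration test based on (5.19) and (5.20)."""
    w_t = np.exp(log_pi(theta_t) - log_q(theta_t))        # pi/q at theta_t
    w_next = np.exp(log_pi(theta_next) - log_q(theta_next))
    S = min(c, 1.0 / w_t)                   # S(theta_t) as in (5.20)
    Q_over_q = min(1.0, w_next / c)         # Q(theta_{t+1}) / q(theta_{t+1})
    alpha = min(1.0, w_next / w_t)          # MH acceptance probability
    r = S * Q_over_q / alpha                # r = S Q / T on an accepted move
    return rng.uniform() < min(r, 1.0)      # clipping: safeguard only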
In our method, the independence kernel Q(θ_{t+1}) is defined as in (5.20), with the proposal kernel q(θ_{t+1}) specified by a mixture of Gaussians with means centered at the k known modes prior to regeneration. The covariance matrix for each mixture component is set to the inverse Hessian evaluated at the mode. The relative weight of each mixture component can be initialized as 1/k, but is updated at regeneration times to be proportional to the number of times the corresponding mode has been visited up to that regeneration time.

5.4.2 Searching New Modes

When the chain regenerates, we can modify the transition kernel by including newly found modes in the mode library and updating the wormhole network accordingly. This way, starting with a limited number of modes (identified by some preliminary optimization process), our wormhole HMC will discover unknown modes on the fly without affecting the stationarity of the chain.

Figure 5.6: Left panel: True energy contour (red: known modes, blue: unknown modes). Middle panel: Residual energy contour at T = 1.2. Right panel: Residual energy contour at T = 1.05.

To search for new modes after regeneration, we could simply run an optimization on the original target density function π(θ) with some random starting point. This, however, could lead to frequently rediscovering the known modes. To reduce this computational waste, we propose a surgery on π(θ) to remove/down-weight the known modes, using the history of the chain up to the regeneration time, and then use an optimization algorithm on the resulting residual density. To this end, we fit a mixture of Gaussians using the best knowledge of the modes (locations, Hessians and relative weights) prior to the regeneration. It has the same density as q(θ) in the independence kernel Q(θ) (5.20), which will be adapted at future regeneration times. The residual density function could be simply defined as π_r(θ) = π(θ) − q(θ), with the corresponding residual potential energy as follows:

$$U_r(\theta) = -\log(\pi_r(\theta) + c) = -\log(\pi(\theta) - q(\theta) + c) \qquad (5.21)$$

where the constant c > 0 is used to make the term inside the log function positive.
However, in regions where the mixture of Gaussians is a good fit for the original density, the corresponding residual energy, U_r, becomes flat, causing gradient-based minimization, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, to fail. To avoid this issue, we propose to use the following tempered residual potential energy:

$$U_r(\theta, T) = -\log\left(\pi(\theta) - \exp\left\{\frac{1}{T}\log q(\theta)\right\} + c\right) \qquad (5.22)$$

where T is the temperature. Figure 5.6 shows how the residual energy function changes at different temperatures. As the temperature cools down, known modes become more and more down-weighted, so the optimization algorithm has a higher chance of discovering unknown modes.

When the optimizer finds new modes (checked by the smallest distance to the known modes being bigger than some threshold), they are added to the existing mode library, and the wormhole network is updated accordingly.
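In code, the tempering simply raises q to the power 1/T inside the residual; a minimal sketch (with `pi` and `q` hypothetical density callables) follows.

import numpy as np

def tempered_residual_energy(theta, pi, q, T, c):
    """Tempered residual potential energy (5.22).
    exp{(1/T) log q} = q^{1/T}; c must be large enough to keep the
    argument of the log positive."""
    return -np.log(pi(theta) - q(theta) ** (1.0 / T) + c)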
Algorithm 5.2 Regenerative Wormhole Hamiltonian Monte Carlo (RWHMC)
Initially search for modes θ̂1, ..., θ̂k
for n = 1 to L do
    Sample θ̃ = (θ, θ_{D+1}) as the current state according to WHMC (algorithm 5.1)
    Fit a mixture of Gaussians q(θ) with the known modes, Hessians and relative weights
    Propose θ* ∼ q(·) and accept it with probability α = min{1, [π(θ*)/q(θ*)] / [π(θ)/q(θ)]}
    if θ* accepted then
        Determine if θ* is a regeneration using (5.19)-(5.20) with θ_t = θ, θ_{t+1} = θ*
        if regeneration occurs then
            Search for new modes by minimizing U_r(θ, T) (5.22); if new modes are discovered, update the library, the wormhole network and q(θ)
            Discard θ*, sample θ^(n+1) ∼ Q(·) as in (5.20) using rejection sampling
        else
            Set θ^(n+1) = θ*
        end if
    else
        Set θ^(n+1) = θ̃
    end if
end for

5.4.3 Regenerative Wormhole HMC

Before giving the regenerative version of the Wormhole HMC algorithm, we comment on the independence kernel Q in (5.20) with regard to the underlying mechanism guiding the jumps among modes. What we need for our WHMC to adapt to new modes is a timing rule (regeneration) that does not break the stationarity of the Markov chain. A splitting of the WHMC kernel itself to identify regeneration times would be ideal; this, however, is difficult in practice¹. Therefore, we introduce the mixture of Gaussians q(θ), based on the best knowledge of the discovered modes, as an independent proposal for the target density π(θ), which can be viewed as another mechanism, alongside WHMC, to help jump among modes (similar to the truncated Dirichlet process mixture of Gaussians in RDMC [85]). It is valid to use several different proposals (WHMC and the mixture of Gaussians) in a hybrid sampler in a random or systematic scheme [25, 89]. Only the second jumping mechanism (MH with the mixture of Gaussians as proposal) is split in the process of identifying regeneration times. However, as we will see in section 5.5, without WHMC this mechanism alone fails in high dimensions, as does RDMC [85]. We use these two proposal mechanisms in a cyclic manner [85] and summarize the Regenerative Wormhole HMC (RWHMC) in algorithm 5.2.

¹ The symmetric proposal for WHMC ((2.9), q(z*|z_t) = δ_{T̃_e(z_t)}(z*), with T̃_e the integrator for (5.12) described in algorithm 5.1) is hard to express as a product of separate functions of z_t and z* respectively.

5.5 Empirical Results

In this section, we evaluate the performance of our method, henceforth called Wormhole Hamiltonian Monte Carlo (WHMC), using three examples. The first example, which is discussed in [85, 95], involves inference regarding the locations of sensors in a network. The second example involves sampling from mixtures of Gaussian distributions with varying numbers of modes and dimensions. In this example, which is discussed in [85], the locations of the modes are assumed to be known. For our third example, we also use mixtures of Gaussian distributions, but this time we assume that the locations of the modes are unknown.

We evaluate our method's performance by comparing it to Regenerative Darting Monte Carlo (RDMC) [85], which is one of the most recent algorithms designed for sampling from multimodal distributions, based on the Darting Monte Carlo (DMC) [79] approach. DMC defines high density regions around the modes; when the sampler enters these regions, a jump between the regions is attempted. RDMC enriches the DMC method by using the regeneration approach [88, 89]. We compare the two methods (i.e., WHMC and RDMC) in terms of the Relative Error of Mean (REM) [85] and the R (MPSRF) statistic [96]. REM summarizes the errors in approximating the expectation of variables across all dimensions.

Definition 5.5 (Relative Error of Mean). Given samples {θ(k)}_{k=1}^t, the relative error of mean estimated by the samples at time t is defined as

$$\mathrm{REM}(t) = \|\bar\theta(t) - \theta^*\|_1/\|\theta^*\|_1$$

where θ̄(t) is the mean of the MCMC samples obtained by time t, and θ* is the true mean.

The R statistic measures the convergence rate to the stationary distribution based on within- and between-chain variances across multiple chains; it approaches 1 when the chains converge.

Definition 5.6 (R (Multivariate Potential Scale Reduction Factor)). Denote by θ_{jt} the j-th chain at time t, for j = 1, ..., m and t = 1, ..., n. Estimate the posterior variance-covariance matrix by

$$\hat V = \frac{n-1}{n}W + \left(1 + \frac{1}{m}\right)\frac{B}{n}$$

where

$$W = \frac{1}{m(n-1)}\sum_{j=1}^m\sum_{t=1}^n(\theta_{jt} - \bar\theta_{j\cdot})(\theta_{jt} - \bar\theta_{j\cdot})^T, \qquad \frac{B}{n} = \frac{1}{m-1}\sum_{j=1}^m(\bar\theta_{j\cdot} - \bar\theta_{\cdot\cdot})(\bar\theta_{j\cdot} - \bar\theta_{\cdot\cdot})^T$$

Then R (the multivariate potential scale reduction factor, MPSRF) is estimated by

$$\hat R := \max_a\frac{a^T\hat V a}{a^T W a} = \lambda_1(W^{-1}\hat V)$$

where λ1 is the largest eigenvalue.

Because RDMC uses the standard HMC algorithm with a flat metric, we set the metric G0 ≡ I to make the two algorithms comparable. However, our approach can be easily modified to use other metrics, such as the Fisher metric.
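Both diagnostics are straightforward to compute from stored chains; the following minimal sketch (not the evaluation code used for the figures) assumes `chains` has shape (m, n, D) and `theta_star` is the true mean.

import numpy as np

def rem(samples, theta_star):
    """REM (Definition 5.5) for a (t x D) array of samples."""
    theta_bar = samples.mean(axis=0)
    return np.abs(theta_bar - theta_star).sum() / np.abs(theta_star).sum()

def mpsrf(chains):
    """MPSRF R-hat (Definition 5.6) for m chains of length n in D dims."""
    m, n, D = chains.shape
    chain_means = chains.mean(axis=1)            # (m, D) per-chain means
    grand_mean = chain_means.mean(axis=0)
    centered = chains - chain_means[:, None, :]
    W = np.einsum('jti,jtk->ik', centered, centered) / (m * (n - 1))
    dm = chain_means - grand_mean
    B_over_n = dm.T @ dm / (m - 1)
    V_hat = (n - 1) / n * W + (1 + 1 / m) * B_over_n
    # largest eigenvalue of W^{-1} V_hat
    return np.max(np.linalg.eigvals(np.linalg.solve(W, V_hat)).real)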
5.5.1 Sensor Network Localization

For our first example, we use a problem discussed in [85, 95]. We assume N sensors are scattered in a planar region with 2d locations denoted as {x_i}_{i=1}^N. The distance Y_ij between a pair of sensors (x_i, x_j) is observed with probability π(x_i, x_j) = exp(−‖x_i − x_j‖²/(2R²)). If the distance is in fact observed (Y_ij > 0), then Y_ij follows a Gaussian distribution N(‖x_i − x_j‖, σ²) with small σ; otherwise Y_ij = 0. That is,

$$Z_{ij} = I(Y_{ij} > 0)\,|\,x \sim \mathrm{Binom}(1, \pi(x_i, x_j)), \qquad Y_{ij}\,|\,Z_{ij} = 1, x \sim N(\|x_i - x_j\|, \sigma^2)$$

where Z_ij is a binary indicator set to 1 if the distance between x_i and x_j is observed. Given a set of observations Y_ij and a prior distribution for x, which is assumed to be uniform in this example, it is of interest to infer the posterior distribution of all the sensor locations. Following [85], we set N = 8, R = 0.3, σ = 0.02, and add three additional base sensors with known locations to avoid ambiguities of translation, rotation, and negation (mirror symmetry). The locations of the 8 sensors form a multimodal distribution with dimension D = 16.

Figure 5.7: Posterior samples for sensor locations using RDMC (left panel) and WHMC (middle panel), along with their corresponding REM over time (right panel).

Figure 5.7 shows the posterior samples based on the two methods. As we can see, RDMC very rarely visits one of the modes (shown in red in the top middle part), whereas WHMC generates enough samples from this mode to make it discernible. As a result, WHMC has a substantially lower REM compared to RDMC (figure 5.7, right panel).

5.5.2 Mixture of Gaussians with Known Modes

Next, we evaluate the performance of our method based on sampling from mixtures of K D-dimensional Gaussian distributions with known modes. (We relax this assumption in the next section.) The means of these distributions are randomly generated from D-dimensional uniform distributions such that the average pairwise distance remains around 20. The corresponding covariance matrices are constructed in a way that gives the mixture components different density functions. Simulating samples from the resulting D-dimensional mixture of K Gaussians is challenging because the modes are far apart and the high density regions have different shapes.

The left panels of figure 5.8 compare the two methods for a varying number of mixture components with fixed dimension (D = 20). The right panels show the results for a varying number of dimensions with a fixed number of mixture components (K = 10). For both scenarios, we stop the two algorithms after 500 seconds and compare their REM and R. We run 10 chains from different locations to calculate R, and we use these 10 chains to estimate REM along with its 95% confidence interval. As we can see, WHMC has substantially lower REM and R (i.e., it converges faster) compared to RDMC, especially as the number of modes and dimensions increases.

Figure 5.8: Comparing WHMC to RDMC using mixtures of K D-dimensional Gaussians. Left panels show REM (along with 95% confidence intervals) and R based on 10 MCMC chains for a varying number of mixture components with fixed dimension (D = 20). Right panels show REM (along with 95% confidence intervals) and R based on 10 MCMC chains for a varying number of dimensions with a fixed number of mixture components (K = 10).

5.5.3 Mixture of Gaussians with Unknown Modes

We now evaluate our method's performance in terms of searching for new modes and updating the network of wormholes. For this example, we simulate a mixture of 10 D-dimensional Gaussian distributions for D = 10, 100, and compare our method to RDMC. While RDMC initially runs four parallel HMC chains to discover a subset of modes and to fit a truncated Gaussian distribution around each identified mode, we run four parallel optimizers (with different starting points) using BFGS. At regeneration times, each chain of RDMC uses the Dirichlet process mixture model to fit new truncated Gaussians around modes and possibly identify new modes.
We, on the other hand, run the BFGS algorithm on the residual energy function (with T = 1.05) to discover new modes for each chain.

Figure 5.9 shows that RWHMC reduces REM much faster than RDMC for both D = 10 and D = 100. For both methods, the recorded time (horizontal axis) accounts for the computational overhead of adapting the transition kernels. For D = 10, our method has a substantially lower REM compared to RDMC. For D = 100, while our method identifies new modes over time and reduces REM substantially, RDMC fails to identify new modes, so its REM stays high over time.

Figure 5.9: Comparing RWHMC to RDMC in terms of REM using mixtures of K = 10 D-dimensional Gaussians. Left panel: D = 10. Right panel: D = 100.

Figure 5.10 shows the number of modes identified by our parallelized RWHMC over time for D = 10 and D = 100 separately.

Figure 5.10: Number of identified modes over time using our regenerative WHMC method for mixtures of K = 10 Gaussians with D = 10, 100.

5.6 Discussion

We have proposed a new algorithm, called Wormhole Hamiltonian Monte Carlo, for sampling from multimodal distributions. Using empirical results, we have shown that our method performs well in high dimensions. The wind tunnel moves continuously and weighs the jumping routes deterministically via a smooth mollifier function; it behaves as local HMC, i.e., continuous movement, interrupted by quick leaps as the sampler is blown through a tunnel. The wormhole algorithm, despite the extra dimension, also moves continuously most of the time, with occasional discontinuous jumps via routes weighted in a stochastic way, aiming directly at a mode. Regenerative WHMC extends WHMC by adapting the chain through regeneration to allow mode searching on the fly.

Our method involves several parameters that require tuning. However, these parameters can be adjusted at regeneration times without affecting the stationary distribution or the consistency of sample path averages. Although we used a flat base metric (i.e., I) in the examples discussed in this chapter, our method can easily be extended by specifying a more informative base metric (e.g., the Fisher information) that adapts to the local geometry. For example, figure 5.11 shows the additional improvement in REM for the illustrative example of section 5.3.1 obtained by using the Fisher information instead of I. In this example, Wormhole Lagrangian Monte Carlo (WLMC) is similar to WHMC, but uses Lagrangian Monte Carlo (LMC, see chapter 4) as opposed to HMC, i.e., the base metric G0 is the Fisher metric.

Figure 5.11: Comparing Wormhole Lagrangian Monte Carlo (WLMC) to WHMC for posterior sampling of the 2d mixture of 2 Gaussians with tied means in section 5.3.1. WLMC is similar to WHMC, but it uses the Fisher information as its base metric instead of the flat metric (I) used in WHMC. (Shaded areas represent the 95% confidence intervals based on 10 MCMC chains.)

Further technical improvements can be made by finding better (and possibly adaptive) tunnel metrics, mollifiers, and vicinity functions.
6 Spherical Hamiltonian Monte Carlo for Constrained Target Distributions

6.1 Introduction

Many commonly used statistical models in Bayesian analysis involve high-dimensional probability distributions confined to constrained domains. Some examples include regression models with norm constraints (e.g., Lasso), probit models, many copula models, and Latent Dirichlet Allocation (LDA) models. Very often the resulting models are intractable, simulating samples for Monte Carlo estimation is quite challenging [45, 57, 97, 98, 99], and mapping the domain to the entire Euclidean space for convenience would be computationally inefficient because the sampler would have to explore a much larger space. In this chapter, we propose a novel Markov Chain Monte Carlo (MCMC) method that provides a natural and computationally efficient framework for sampling from constrained target distributions. Our method is based on Hamiltonian Monte Carlo (HMC, chapter 2) [36, 37], a Metropolis algorithm with proposals guided by Hamiltonian dynamics. In recent years, several methods have been proposed to improve the computational efficiency of HMC [38, 39, 40, 41, 49, 51]. In general, these methods do not directly address problems with constrained target distributions. In contrast, in this chapter we focus on improving HMC-based algorithms when the target distribution is constrained.

When dealing with constrained target distributions, the standard HMC algorithm needs to check each proposal to ensure it is within the boundaries imposed by the constraints, and discarding proposals that violate the constraints is computationally wasteful. Alternatively, as discussed by [37], one could modify standard HMC so that the sampler bounces off the boundaries, by letting the potential energy go to infinity for parameter values that violate the constraints. This approach, however, is not very efficient either, because of the constant monitoring of boundary hitting times and the frequent bouncing. There are some recent papers in this research direction. [51] discuss an approach for distributions defined on a simplex. [57] propose a modified version of HMC for handling constraint functions c(θ) = 0. [45] propose an HMC algorithm with an exact analytical solution for truncated Gaussian distributions. All these methods provide interesting solutions for specific types of constraints. Our proposed method, in contrast, provides a general and computationally efficient framework for handling many types of constraints.

In what follows, we first present our method for distributions confined to the unit ball in section 6.2. The unit ball is a special case of q-norm constraints; in section 6.3.1 we discuss the application of our method to q-norm constraints in general. In section 6.4, we evaluate our proposed method using simulated and real data. Finally, we discuss future directions in section 6.5.

6.2 Sampling from distributions defined on the unit ball

In many cases, bounded connected constrained regions can be bijectively mapped to the D-dimensional unit ball $\mathbf{B}_0^D(1) := \{\theta \in \mathbb{R}^D : \|\theta\|_2^2 = \sum_{i=1}^D \theta_i^2 \le 1\}$. Therefore, in this section we first focus on distributions confined to the unit ball, with the constraint $\|\theta\|_2 \le 1$.
6.2.1 Change of the domain: from unit ball $\mathbf{B}_0^D(1)$ to sphere $\mathbf{S}^D$

We start by augmenting the original D-dimensional parameter θ with an extra auxiliary variable $\theta_{D+1}$ to form an extended (D+1)-dimensional parameter $\tilde\theta = (\theta, \theta_{D+1})$ such that $\|\tilde\theta\|_2 = 1$, so $\theta_{D+1} = \pm\sqrt{1 - \|\theta\|_2^2}$. This way, the domain of the target distribution is changed from the unit ball $\mathbf{B}_0^D(1)$ to the D-dimensional sphere $\mathbf{S}^D := \{\tilde\theta \in \mathbb{R}^{D+1} : \|\tilde\theta\|_2 = 1\}$, through the following transformation:
\[ T_{B\to S}: \mathbf{B}_0^D(1) \to \mathbf{S}^D, \quad \theta \mapsto \tilde\theta = \left(\theta, \pm\sqrt{1-\|\theta\|_2^2}\right) \tag{6.1} \]
Note that although $\theta_{D+1}$ can be either positive or negative, its sign does not affect our Monte Carlo estimates: after applying the above transformation, we adjust our estimates according to the following change of variable theorem.

Proposition 6.1 (Change of Variable: from unit Ball to hyper-Sphere).
\[ \int_{\mathbf{B}_0^D(1)} \pi(\theta)\, d\theta_B = \int_{\mathbf{S}_+^D} \pi(\tilde\theta) \left|\frac{d\theta_B}{d\tilde\theta_S}\right| d\tilde\theta_S = \int_{\mathbf{S}_+^D} \pi(\tilde\theta)\,|\theta_{D+1}|\, d\tilde\theta_S \tag{6.2} \]
where $\pi(\tilde\theta) \equiv \pi(\theta)$.

Proof. It suffices to show $\left|\frac{d\theta_B}{d\tilde\theta_S}\right| = |\theta_{D+1}|$, or equivalently to show that the Jacobian determinant of $T_{B\to S_+}$ is $1/|\theta_{D+1}|$, since the map $T_{B\to S_+}: \theta \mapsto \tilde\theta = (\theta, \sqrt{1-\|\theta\|_2^2})$ bijectively maps the unit ball $\mathbf{B}_0^D(1)$ to the upper hemisphere $\mathbf{S}_+^D$:
\[ |dT_{B\to S_+}| := \left|\frac{d\tilde\theta_S}{d\theta_B}\right| = \frac{1}{|\theta_{D+1}|} \]
If we view $\{\theta, \mathbf{B}_0^D(1)\}$ as a coordinate chart for the manifold $\mathbf{S}^D$, then by the volume form [68, 100] we have
\[ d\tilde\theta_S = \sqrt{\det \mathbf{G}_S(\theta)}\, d\theta_B \]
where $\mathbf{G}_S(\theta)$ is the canonical metric on the sphere $\mathbf{S}^D$. Therefore it suffices to prove
\[ \sqrt{\det \mathbf{G}_S(\theta)} = 1/|\theta_{D+1}| \tag{6.3} \]
In the following we calculate the canonical metric $\mathbf{G}_S(\theta)$. For $\mathbf{S}^D$, the first fundamental form $ds^2$, i.e., the squared infinitesimal length of a curve, is explicitly expressed in terms of the differential form $d\theta$ and the canonical metric as $ds^2 = \langle d\theta, d\theta\rangle_{\mathbf{G}_S} = d\theta^T \mathbf{G}_S(\theta)\, d\theta$, which can be obtained as follows [100]:
\[ ds^2 = \sum_{i=1}^{D+1} d\theta_i^2 = \sum_{i=1}^{D} d\theta_i^2 + \left(d(\theta_{D+1}(\theta))\right)^2 = d\theta^T d\theta + \frac{(\theta^T d\theta)^2}{1-\|\theta\|_2^2} = d\theta^T \left[\mathbf{I} + \theta\theta^T/\theta_{D+1}^2\right] d\theta \]
Therefore, the canonical metric $\mathbf{G}_S(\theta)$ on $\mathbf{S}^D$ is(1)
\[ \mathbf{G}_S(\theta) = \mathbf{I}_D + \frac{\theta\theta^T}{\theta_{D+1}^2} = \mathbf{I}_D + \frac{\theta\theta^T}{1-\|\theta\|_2^2} \tag{6.4} \]
The determinant of $\mathbf{G}_S(\theta)$ is given by the matrix determinant lemma,
\[ \det \mathbf{G}_S(\theta) = \det\left(\mathbf{I}_D + \frac{\theta\theta^T}{\theta_{D+1}^2}\right) = 1 + \frac{\theta^T\theta}{\theta_{D+1}^2} = \frac{1}{\theta_{D+1}^2} \tag{6.5} \]
thus (6.3) follows, and the inverse of $\mathbf{G}_S(\theta)$ is obtained by the Sherman-Morrison-Woodbury formula [101]:
\[ \mathbf{G}_S(\theta)^{-1} = \left(\mathbf{I}_D + \frac{\theta\theta^T}{\theta_{D+1}^2}\right)^{-1} = \mathbf{I}_D - \frac{\theta\theta^T/\theta_{D+1}^2}{1+\theta^T\theta/\theta_{D+1}^2} = \mathbf{I}_D - \theta\theta^T \tag{6.6} \]

Remark 6.1. According to formula (6.2), we can do the Monte Carlo estimation directly with samples $\tilde\theta \sim \pi(\tilde\theta)\,d\tilde\theta_S$, each associated with a weight $|\theta_{D+1}|$. Alternatively, when we need samples $\theta \sim \pi(\theta)\,d\theta_B$ for estimation or inference, we can resample $\{\tilde\theta\}$ according to their weights and drop the auxiliary variables $\theta_{D+1}$. Note that the necessity of re-weighting $\tilde\theta$ by $|\theta_{D+1}|$ to recover samples on the unit ball is verified in our experiments: otherwise, the sampler would have 'oversampled' the region around the boundary, due to the change of geometry from the unit ball $\mathbf{B}_0^D(1)$ to the sphere $\mathbf{S}^D$.

(1) For any vector $\tilde v = (v, v_{D+1}) \in T_{\tilde\theta}\mathbf{S}^D = \{\tilde v \in \mathbb{R}^{D+1} : \tilde\theta^T \tilde v = 0\}$, one can view $\mathbf{G}_S(\theta)$ as a means of expressing the length of $\tilde v$ in terms of $v$:
\[ v^T \mathbf{G}_S(\theta) v = \|v\|_2^2 + \frac{(v^T\theta)(\theta^T v)}{\theta_{D+1}^2} = \|v\|_2^2 + \frac{(-\theta_{D+1} v_{D+1})^2}{\theta_{D+1}^2} = \|v\|_2^2 + v_{D+1}^2 = \|\tilde v\|_2^2 \]
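To make the augmentation and re-weighting concrete, the following is a minimal Python sketch (ours, not the dissertation's released code; function names are hypothetical) of the map $T_{B\to S}$ from (6.1), its inverse, and the weight $|\theta_{D+1}|$ from Proposition 6.1.

```python
import numpy as np

def ball_to_sphere(theta, sign=1.0):
    """T_{B->S} (6.1): embed a point of the unit ball B_0^D(1) into S^D
    by appending theta_{D+1} = +/- sqrt(1 - ||theta||_2^2)."""
    last = sign * np.sqrt(max(1.0 - theta @ theta, 0.0))
    return np.append(theta, last)

def sphere_to_ball(theta_tilde):
    """T_{S->B}: drop the auxiliary coordinate."""
    return theta_tilde[:-1]

def weight(theta_tilde):
    """Importance weight |theta_{D+1}| from the change of variables (6.2)."""
    return abs(theta_tilde[-1])

# points near the boundary of the ball receive small weights, which
# counteracts the volume inflation near the equator of the sphere
theta = np.array([0.6, 0.7])
tt = ball_to_sphere(theta)
assert np.isclose(np.linalg.norm(tt), 1.0)
print(sphere_to_ball(tt), weight(tt))
```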
Using the above transformation (6.1), we redefine the Hamiltonian dynamics on the sphere. This way, the resulting HMC sampler can move freely on $\mathbf{S}^D$ while implicitly handling the constraints imposed on the original parameters. As illustrated in figure 6.1, the boundary of the constraint, i.e., $\|\theta\|_2 = 1$, corresponds to the equator of the sphere $\mathbf{S}^D$. Therefore, as the sampler moves on the sphere, passing across the equator from one hemisphere to the other (from A to B on the right panel) translates to "bouncing back" off the boundary in the original parameter space (from A to B on the left panel).
Figure 6.1: Transforming the unit ball $\mathbf{B}_0^D(1)$ to the sphere $\mathbf{S}^D$.

6.2.2 Hamiltonian Dynamics on Sphere

By defining HMC on the sphere (hence the name Spherical HMC), besides handling the constraints implicitly, the computational efficiency of the sampling algorithm can be improved by using the splitting technique previously exploited by [40, 41, 51]. To this end, we first need to study the Hamiltonian dynamics defined on the manifold $(\mathbf{S}^D, \mathbf{G}_S(\theta))$ (see section 4.2.1).

Consider a family of target distributions, $\{\pi(\cdot\,; \theta)\}$, defined on the unit ball $\mathbf{B}_0^D(1)$ endowed with the Euclidean metric $\mathbf{I}$. The potential energy is defined as $U(\theta) := -\log \pi(\cdot\,; \theta)$. Associated with the auxiliary velocity variable $v \in T_\theta\mathbf{B}_0^D(1)$, a D-dimensional vector sampled from the tangent space of $\mathbf{B}_0^D(1)$, we define the kinetic energy $K(v) = \frac{1}{2}v^T \mathbf{I} v$. Therefore, the Hamiltonian is defined on $\mathbf{B}_0^D(1)$ as
\[ H(\theta, v) = U(\theta) + K(v) = U(\theta) + \frac{1}{2} v^T \mathbf{I} v \tag{6.7} \]
Next, we derive the corresponding Hamiltonian function on $\mathbf{S}^D$. The potential energy $U(\tilde\theta) = U(\theta)$ remains the same, since the distribution is fully defined in terms of the original parameter θ, i.e., the first D elements of $\tilde\theta$.
However, the kinetic energy, $K(\tilde v) := \frac{1}{2}\tilde v^T \tilde v$, changes, since the velocity $\tilde v = (v, v_{D+1})$ is now sampled from the tangent space of the sphere, $T_{\tilde\theta}\mathbf{S}^D := \{\tilde v \in \mathbb{R}^{D+1} \,|\, \tilde\theta^T \tilde v = 0\}$, with $v_{D+1} = -\theta^T v/\theta_{D+1}$. Therefore, on the sphere $\mathbf{S}^D$, the Hamiltonian $H^*(\tilde\theta, \tilde v)$ is defined as follows:
\[ H^*(\tilde\theta, \tilde v) = U(\tilde\theta) + K(\tilde v) = U(\tilde\theta) + \frac{1}{2}\tilde v^T \tilde v \tag{6.8} \]
If we view $\{\theta, \mathbf{B}_0^D(1)\}$ as a coordinate chart of $\mathbf{S}^D$, this is equivalent to replacing the Euclidean metric $\mathbf{I}$ with the canonical spherical metric $\mathbf{G}_S(\theta) = \mathbf{I}_D + \theta\theta^T/(1-\|\theta\|_2^2)$ in the definition of $H(\theta, v)$ (6.7), as we rewrite the Hamiltonian function (6.8) (see footnote 1 in section 6.2.1):
\[ H^*(\tilde\theta, \tilde v) = U(\tilde\theta) + \frac{1}{2}\tilde v^T \tilde v = U(\theta) + \frac{1}{2}v^T \mathbf{G}_S(\theta) v \tag{6.9} \]
Now we can sample the velocity $v \sim N(0, \mathbf{G}_S(\theta)^{-1})$ and set $\tilde v = \begin{bmatrix} \mathbf{I} \\ -\theta^T/\theta_{D+1} \end{bmatrix} v$. Alternatively, we can sample $\tilde v$ directly from the standard (D+1)-dimensional Gaussian and project it onto $T_{\tilde\theta}\mathbf{S}^D$, which simplifies to
\[ \tilde v \sim \left(\mathbf{I}_{D+1} - \tilde\theta\tilde\theta^T\right) N(0, \mathbf{I}_{D+1}) \tag{6.10} \]
The Hamiltonian function (6.9), $H^* = U(\theta) + \frac{1}{2}p^T \mathbf{G}_S(\theta)^{-1} p$, defines the Hamiltonian dynamics on the Riemannian manifold $(\mathbf{S}^D, \mathbf{G}_S(\theta))$ in terms of $(\theta, p = \mathbf{G}_S(\theta)v)$ [39, see also chapter 4]:
\[ \dot\theta = \mathbf{G}_S(\theta)^{-1} p, \qquad \dot p = -\nabla_\theta U(\theta) + \frac{1}{2}\left(\mathbf{G}_S(\theta)^{-1}p\right)^T d\mathbf{G}_S(\theta)\, \mathbf{G}_S(\theta)^{-1}p \tag{6.11} \]
which is equivalent to the following Lagrangian dynamics in terms of $(\theta, v)$ (see chapter 4 for more details):
\[ \dot\theta = v, \qquad \dot v = -v^T \Gamma(\theta) v - \mathbf{G}_S(\theta)^{-1}\nabla_\theta U(\theta) \tag{6.12} \]

6.2.3 Spherical HMC algorithm

Now we use the splitting technique [61] to derive an efficient geometric (time reversible and volume preserving) integrator for the above Riemannian Hamiltonian dynamics (6.11) [51], or equivalently the Lagrangian dynamics (6.12). [51] split the Hamiltonian (6.9) as
\[ H^*(\theta, p) = U(\theta)/2 + \frac{1}{2}p^T\mathbf{G}_S(\theta)^{-1}p + U(\theta)/2 \]
and the Hamiltonian dynamics corresponding to $U(\theta)/2$ and $\frac{1}{2}p^T\mathbf{G}_S(\theta)^{-1}p$ are, respectively,
\[ \begin{cases} \dot\theta = 0 \\ \dot p = -\frac{1}{2}\nabla_\theta U(\theta) \end{cases} \qquad\qquad \begin{cases} \dot\theta = \mathbf{G}_S(\theta)^{-1}p \\ \dot p = \frac{1}{2}\left(\mathbf{G}_S(\theta)^{-1}p\right)^T d\mathbf{G}_S(\theta)\,\mathbf{G}_S(\theta)^{-1}p \end{cases} \tag{6.13} \]
They recognize that the second dynamics in (6.13) is equivalent to the geodesic equation (5.2), but solve it under the condition that the manifold is embedded in a larger Euclidean space. To avoid such a strong assumption, we propose to split the Lagrangian dynamics instead of the Hamiltonian dynamics. Although splitting the Hamiltonian and its usefulness in improving HMC are well studied [41, 51, 61], splitting the Lagrangian has, to the best of our knowledge, not been discussed in the literature. Nevertheless, we can split the Lagrangian dynamics (6.12) into smaller dynamics corresponding to $U(\theta)/2$ and $\frac{1}{2}v^T\mathbf{G}_S(\theta)v$ by applying the transformation $p \mapsto v$ to the dynamics (6.13), respectively (see section 4.3.1):
\[ \begin{cases} \dot\theta = 0 \\ \dot v = -\frac{1}{2}\mathbf{G}_S(\theta)^{-1}\nabla_\theta U(\theta) \end{cases} \qquad\qquad \begin{cases} \dot\theta = v \\ \dot v = -v^T\Gamma(\theta)v \end{cases} \tag{6.14} \]
In the following we solve these dynamics (6.14) defined on $\mathbf{S}^D$.

Proposition 6.2. The dynamics (6.14) have the following solutions, respectively:
\[ \tilde\theta(t) = \tilde\theta(0), \qquad \tilde v(t) = \tilde v(0) - \frac{t}{2}\left(\begin{bmatrix}\mathbf{I}_D \\ \mathbf{0}^T\end{bmatrix} - \tilde\theta(0)\theta(0)^T\right)\nabla_\theta U(\theta(0)) \tag{6.15} \]
\[ \tilde\theta(t) = \tilde\theta(0)\cos(\|\tilde v(0)\|_2 t) + \frac{\tilde v(0)}{\|\tilde v(0)\|_2}\sin(\|\tilde v(0)\|_2 t), \qquad \tilde v(t) = -\tilde\theta(0)\|\tilde v(0)\|_2\sin(\|\tilde v(0)\|_2 t) + \tilde v(0)\cos(\|\tilde v(0)\|_2 t) \tag{6.16} \]
where t denotes time.

Proof. Appendix B.

Remark 6.2. Based on the solutions (6.15) and (6.16), we have $\|\tilde\theta(t)\|_2 = 1$ if $\|\tilde\theta(0)\|_2 = 1$, and $\tilde v(t) \in T_{\tilde\theta(t)}\mathbf{S}^D$ if $\tilde v(0) \in T_{\tilde\theta(0)}\mathbf{S}^D$. We observe that the whole dynamics do not need to take place on an embedded manifold: they occur on a manifold whose geodesics are known explicitly.
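For concreteness, the following minimal Python sketch (ours; variable names are our own) implements one integrator step built from the two exact solutions: a half-step of (6.15), the exact geodesic rotation (6.16), then another half-step of (6.15).

```python
import numpy as np

def grad_half_step(tt, vt, grad_U, eps):
    """Solution (6.15): half-step velocity update, with the gradient of U
    (a function of the first D coordinates) projected onto the tangent
    space of the sphere at tt."""
    D = tt.size - 1
    g = grad_U(tt[:D])                    # nabla_theta U(theta)
    g_full = np.append(g, 0.0)            # embed as [I_D; 0^T] g
    proj_g = g_full - tt * (tt[:D] @ g)   # ([I_D; 0^T] - tt theta^T) g
    return vt - 0.5 * eps * proj_g

def geodesic_step(tt, vt, eps):
    """Solution (6.16): exact geodesic flow (a rotation) on the sphere;
    assumes a nonzero velocity vt."""
    s = np.linalg.norm(vt)
    tt_new = tt * np.cos(s * eps) + (vt / s) * np.sin(s * eps)
    vt_new = -tt * s * np.sin(s * eps) + vt * np.cos(s * eps)
    return tt_new, vt_new

def spherical_leapfrog(tt, vt, grad_U, eps):
    """One step of the split integrator: (6.15), (6.16), (6.15)."""
    vt = grad_half_step(tt, vt, grad_U, eps)
    tt, vt = geodesic_step(tt, vt, eps)
    vt = grad_half_step(tt, vt, grad_U, eps)
    return tt, vt
```

By construction, the position always stays exactly on the sphere and the velocity in its tangent space, as stated in Remark 6.2.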
With this viewpoint, the applicability of the ideas of [51] should be further expanded. Note that (6.15) and (6.16) are both symplectic. Due to the explicit formula for the geodesic flow on the sphere, the second dynamics in (6.14) is simulated exactly. Therefore, updating $\tilde\theta$ does not involve discretization error, so we can use large step sizes; this could lead to improved computational efficiency. Since this step is in fact a rotation on the sphere, we set the trajectory length to $2\pi/D$ and randomize the number of leapfrog steps to avoid periodicity. Algorithm 6.1 shows the steps for implementing this approach, henceforth called Spherical HMC.

Algorithm 6.1 Spherical Hamiltonian Monte Carlo (Spherical HMC)
  Initialize $\tilde\theta^{(1)}$ at the current $\tilde\theta$ after transformation
  Sample a new velocity $\tilde v^{(1)} \sim N(0, \mathbf{I}_{D+1})$
  Set $\tilde v^{(1)} \leftarrow \tilde v^{(1)} - \tilde\theta^{(1)}(\tilde\theta^{(1)})^T\tilde v^{(1)}$
  Calculate $H(\tilde\theta^{(1)}, \tilde v^{(1)}) = U(\theta^{(1)}) + K(\tilde v^{(1)})$ for the current state
  for $\ell = 1$ to $L$ do
    $\tilde v^{(\ell+\frac{1}{2})} = \tilde v^{(\ell)} - \frac{\varepsilon}{2}\left(\begin{bmatrix}\mathbf{I}_D\\ \mathbf{0}^T\end{bmatrix} - \tilde\theta^{(\ell)}(\theta^{(\ell)})^T\right)\nabla_\theta U(\theta^{(\ell)})$
    $\tilde\theta^{(\ell+1)} = \tilde\theta^{(\ell)}\cos(\|\tilde v^{(\ell+\frac{1}{2})}\|\varepsilon) + \frac{\tilde v^{(\ell+\frac{1}{2})}}{\|\tilde v^{(\ell+\frac{1}{2})}\|}\sin(\|\tilde v^{(\ell+\frac{1}{2})}\|\varepsilon)$
    $\tilde v^{(\ell+\frac{1}{2})} \leftarrow -\tilde\theta^{(\ell)}\|\tilde v^{(\ell+\frac{1}{2})}\|\sin(\|\tilde v^{(\ell+\frac{1}{2})}\|\varepsilon) + \tilde v^{(\ell+\frac{1}{2})}\cos(\|\tilde v^{(\ell+\frac{1}{2})}\|\varepsilon)$
    $\tilde v^{(\ell+1)} = \tilde v^{(\ell+\frac{1}{2})} - \frac{\varepsilon}{2}\left(\begin{bmatrix}\mathbf{I}_D\\ \mathbf{0}^T\end{bmatrix} - \tilde\theta^{(\ell+1)}(\theta^{(\ell+1)})^T\right)\nabla_\theta U(\theta^{(\ell+1)})$
  end for
  Calculate $H(\tilde\theta^{(L+1)}, \tilde v^{(L+1)}) = U(\theta^{(L+1)}) + K(\tilde v^{(L+1)})$ for the proposed state
  $\alpha = \min\{1, \exp(-H(\tilde\theta^{(L+1)}, \tilde v^{(L+1)}) + H(\tilde\theta^{(1)}, \tilde v^{(1)}))\}$
  Accept or reject the proposal $(\tilde\theta^{(L+1)}, \tilde v^{(L+1)})$ according to $\alpha$
  Calculate the corresponding weight $|\theta_{D+1}|$

6.3 Constraints

In this section we discuss several types of constraints that can be transformed to ball-type constraints, so that Spherical HMC can be applied to sample from target distributions with these constraints.

6.3.1 Norm constraints

The unit ball region discussed in the previous section is in fact a special case of q-norm constraints. In this section we discuss constraints given by the q-norm of the parameters.

Definition 6.1 (q-norm). For any $\beta \in \mathbb{R}^D$, the q-norm ($q > 0$) of β is defined as follows:
\[ \|\beta\|_q = \begin{cases} \left(\sum_{i=1}^{D} |\beta_i|^q\right)^{1/q}, & q \in (0, +\infty) \\ \max_{1\le i\le D} |\beta_i|, & q = +\infty \end{cases} \tag{6.17} \]
For example, when β are regression parameters, q = 1 corresponds to the Lasso method and q = 2 corresponds to ridge regression. In what follows, we show how this type of constraint can be transformed to $\mathbf{S}^D$.

6.3.1.1 Norm constraints with q = +∞

When q = +∞, the norm inequality defines a hypercube. Note that the hypercube, and more generally the hyper-rectangle $\mathbf{R}^D := \{\beta \in \mathbb{R}^D : l \le \beta \le u\}$, can be bijectively transformed to the unit hypercube $\mathbf{C}^D := [-1,1]^D = \{\beta \in \mathbb{R}^D : \|\beta\|_\infty \le 1\}$ by proper shifting and scaling of the original parameters. [37] discusses this kind of constraint, which can be handled by adding a term to the energy function such that the energy goes to infinity for values that violate the constraints. This creates "energy walls" at the boundaries, so the sampler bounces off the energy wall whenever it reaches the boundary. As mentioned earlier, this approach, henceforth called Wall HMC, has limited applications and tends to be computationally inefficient. To use Spherical HMC, the unit hypercube can be bijectively transformed to its inscribed unit ball through the following map:
\[ T_{C\to B}: [-1,1]^D \to \mathbf{B}_0^D(1), \quad \beta \mapsto \theta = \beta\,\frac{\|\beta\|_\infty}{\|\beta\|_2} \tag{6.18} \]
Further, as discussed in the previous section, the resulting unit ball $\mathbf{B}_0^D(1)$ can be mapped to the sphere $\mathbf{S}^D$ through $T_{B\to S}$, for which Spherical HMC can be used.
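The following is a minimal Python sketch (ours; not the dissertation's released code) of the hyper-rectangle-to-ball map just described, composing the shift/scale step with $T_{C\to B}$ (6.18), together with its inverse.

```python
import numpy as np

def rectangle_to_ball(beta, l, u):
    """Map the hyper-rectangle {l <= beta <= u} into the unit ball: shift and
    scale to the unit hypercube [-1, 1]^D, then apply T_{C->B} (6.18)."""
    b = (2.0 * beta - (u + l)) / (u - l)         # hyper-rectangle -> cube
    if np.all(b == 0.0):
        return b                                  # center maps to center
    return b * np.linalg.norm(b, np.inf) / np.linalg.norm(b, 2)

def ball_to_rectangle(theta, l, u):
    """Inverse map: T_{B->C} followed by rescaling back to the rectangle."""
    if np.all(theta == 0.0):
        b = theta
    else:
        b = theta * np.linalg.norm(theta, 2) / np.linalg.norm(theta, np.inf)
    return (u - l) / 2.0 * b + (u + l) / 2.0

# round trip on the example domain [0,5] x [0,1] used in section 6.4.1
l, u = np.array([0.0, 0.0]), np.array([5.0, 1.0])
beta = np.array([4.2, 0.3])
theta = rectangle_to_ball(beta, l, u)
assert np.linalg.norm(theta) <= 1.0 + 1e-12
assert np.allclose(ball_to_rectangle(theta, l, u), beta)
```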
The following proposition gives the weights needed for the change of domains from the hyper-rectangle $\mathbf{R}^D$ to the sphere $\mathbf{S}^D$.

Proposition 6.3. The Jacobian determinant (weight) of $T_{S\to R}$ is
\[ |dT_{S\to R}| = \prod_{i=1}^{D}\frac{u_i - l_i}{2}\cdot\frac{\|\theta\|_2^D}{\|\theta\|_\infty^D}\,|\theta_{D+1}| \tag{6.19} \]
Proof. First, we note that $T_{S\to R} = T_{C\to R}\circ T_{B\to C}\circ T_{S\to B}$:
\[ \tilde\theta \mapsto \theta \mapsto \beta' = \theta\,\frac{\|\theta\|_2}{\|\theta\|_\infty} \mapsto \beta = \beta'\,\frac{u-l}{2} + \frac{u+l}{2} \]
The corresponding Jacobian matrices are
\[ T_{B\to C}: \quad \frac{d\beta'}{d\theta^T} = \frac{\|\theta\|_2}{\|\theta\|_\infty}\left[\mathbf{I} + \theta\left(\frac{\theta^T}{\|\theta\|_2^2} - \frac{e_{\arg\max|\theta|}^T}{\theta_{\arg\max|\theta|}}\right)\right], \qquad T_{C\to R}: \quad \frac{d\beta}{d(\beta')^T} = \mathrm{diag}\left(\frac{u-l}{2}\right) \]
where $e_{\arg\max|\theta|}$ is the vector whose $(\arg\max|\theta|)$-th element is 1 and all others 0. Therefore,
\[ |dT_{S\to R}| = |dT_{C\to R}|\,|dT_{B\to C}|\,|dT_{S\to B}| = \left|\frac{d\beta}{d(\beta')^T}\right|\left|\frac{d\beta'}{d\theta^T}\right|\left|\frac{d\theta_B}{d\tilde\theta_S}\right| = \prod_{i=1}^{D}\frac{u_i-l_i}{2}\cdot\frac{\|\theta\|_2^D}{\|\theta\|_\infty^D}\,|\theta_{D+1}| \]

6.3.1.2 Norm constraints with q ∈ (0, +∞)

A domain constrained by the q-norm, $\mathbf{Q}^D := \{\beta \in \mathbb{R}^D : \|\beta\|_q \le 1\}$ for $q \in (0, +\infty)$, can be transformed to the unit ball $\mathbf{B}_0^D(1)$ bijectively via the following map:
\[ T_{Q\to B}: \mathbf{Q}^D \to \mathbf{B}_0^D(1), \quad \beta_i \mapsto \theta_i = \mathrm{sgn}(\beta_i)|\beta_i|^{q/2} \tag{6.20} \]
As before, the unit ball $\mathbf{B}_0^D(1)$ can be transformed to the sphere $\mathbf{S}^D$, for which we can use the Spherical HMC method. The following proposition gives the weights needed for the transformation from $\mathbf{Q}^D$ to $\mathbf{S}^D$.

Proposition 6.4. The Jacobian determinant (weight) of $T_{S\to Q}$ is
\[ |dT_{S\to Q}| = \left(\frac{2}{q}\right)^D \left(\prod_{i=1}^{D}|\theta_i|\right)^{2/q-1} |\theta_{D+1}| \tag{6.21} \]
Proof. Note that $T_{S\to Q} = T_{B\to Q}\circ T_{S\to B}: \tilde\theta \mapsto \theta \mapsto \beta = \mathrm{sgn}(\theta)|\theta|^{2/q}$. The Jacobian matrix of $T_{B\to Q}$ is
\[ \frac{d\beta}{d\theta^T} = \frac{2}{q}\,\mathrm{diag}\left(|\theta|^{2/q-1}\right) \]
Therefore the Jacobian determinant of $T_{S\to Q}$ is
\[ |dT_{S\to Q}| = |dT_{B\to Q}|\,|dT_{S\to B}| = \left|\frac{d\beta}{d\theta^T}\right|\left|\frac{d\theta_B}{d\tilde\theta_S}\right| = \left(\frac{2}{q}\right)^D\left(\prod_{i=1}^{D}|\theta_i|\right)^{2/q-1}|\theta_{D+1}| \]

6.3.2 Functional constraints

[45] discuss linear and quadratic constraints for the multivariate Gaussian distribution. Since the target distribution is simple, the Hamiltonian dynamics can be simulated exactly and the hitting time can be obtained analytically. As the authors acknowledge, however, most of the computation is spent finding wall-hitting times and handling wall bouncing. In this section, we treat this type of constraint by mapping the constrained domain to the sphere $\mathbf{S}^D$, which allows sampling from general distributions.

6.3.2.1 Linear constraints

In general, M linear constraints can be written as $l \le \mathbf{X}\beta \le u$, with X an M×D matrix, β a D-vector, and l, u both M-vectors. Assume there are no conflicting inequalities. Take the singular value decomposition (SVD) $\mathbf{X} = \mathbf{L}\Sigma\mathbf{R}^T$, where $\mathbf{L}_{M\times M}$ and $\mathbf{R}_{D\times D}$ are both orthogonal matrices and $\Sigma_{M\times D}$ is a rectangular diagonal matrix with positive diagonal entries $\sigma_1, \cdots, \sigma_K$, where $K = \mathrm{rank}(\mathbf{X})$. Notice that these inequalities actually constrain only the K variables $\beta^* := \mathbf{R}^T\beta$. Without loss of generality, we assume X is full rank. For convenience of discussion, we assume $M \ge D = K$; then $(\mathbf{X}^T\mathbf{X})_{D\times D}$ is invertible. Now we can consider the hyper-rectangle type constraints for $\eta := \mathbf{X}\beta$, namely $l \le \eta \le u$, and apply the same procedure as in section 6.3.1.1 to sample η using Spherical HMC. We then obtain samples of $\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\eta$, which simplifies to $\beta = \mathbf{X}^{-1}\eta$ when X is a square invertible matrix. Needless to say, this method does not scale well when $M \gg D$; when that happens, we can often bound β directly using norm inequalities, which suffices in many scenarios.
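The recovery of β from η is a single linear solve; the following minimal Python sketch (ours, with a toy matrix as an assumption) illustrates the bookkeeping for the full-column-rank case $M \ge D$ described above.

```python
import numpy as np

def eta_to_beta_map(X):
    """Precompute (X^T X)^{-1} X^T for a full-column-rank X (M x D, M >= D),
    so that beta can be recovered from each sampled eta = X beta."""
    return np.linalg.inv(X.T @ X) @ X.T

# the M linear constraints l <= X beta <= u become box constraints on eta,
# which section 6.3.1.1 maps to the sphere for Spherical HMC
X = np.array([[1.0, 0.5], [0.2, 1.0], [1.0, -1.0]])   # M=3, D=2 (toy example)
recover = eta_to_beta_map(X)
beta_true = np.array([0.3, -0.4])
eta = X @ beta_true
assert np.allclose(recover @ eta, beta_true)
```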
6.3.2.2 Quadratic constraints

There is no easy solution for general quadratic constraints $l \le \beta^T\mathbf{X}\beta + b^T\beta \le u$, where $l, u > 0$ are scalars. Here we consider the case where X is symmetric and positive definite. By the spectral theorem, we have the decomposition $\mathbf{X} = \mathbf{Q}\Sigma\mathbf{Q}^T$ with Q orthogonal and Σ diagonal with positive entries. By shifting and scaling, $\beta \mapsto \beta^* = \sqrt{\Sigma}\,\mathbf{Q}^T(\beta + \frac{1}{2}\mathbf{X}^{-1}b)$, we only need to consider the concentric-ball type constraints for β*:
\[ \mathbf{T}^D: \quad l^* \le \|\beta^*\|_2^2 = (\beta^*)^T\beta^* \le u^*, \qquad l^* = l + \frac{1}{4}b^T\mathbf{X}^{-1}b, \quad u^* = u + \frac{1}{4}b^T\mathbf{X}^{-1}b \tag{6.22} \]
which can further be mapped to the unit ball as follows:
\[ T_{T\to B}: \mathbf{B}_0^D(\sqrt{u^*})\backslash\mathbf{B}_0^D(\sqrt{l^*}) \to \mathbf{B}_0^D(1), \quad \beta^* \mapsto \theta = \frac{\beta^*}{\|\beta^*\|_2}\cdot\frac{\|\beta^*\|_2 - \sqrt{l^*}}{\sqrt{u^*} - \sqrt{l^*}} \tag{6.23} \]
whose inverse is $T_{B\to T}: \mathbf{B}_0^D(1) \to \mathbf{B}_0^D(\sqrt{u^*})\backslash\mathbf{B}_0^D(\sqrt{l^*})$, $\theta \mapsto \beta^* = \frac{\theta}{\|\theta\|_2}\left((\sqrt{u^*}-\sqrt{l^*})\|\theta\|_2 + \sqrt{l^*}\right)$.

We conclude this section with a comment on general functional constraints. Unless a bijective differentiable mapping from the constrained domain to the sphere exists, Spherical HMC cannot be applied directly. However, one can still find a piecewise linear envelope (e.g., tangent planes) of the domain that can be mapped to the sphere; sampling with Spherical HMC on the envelope and then discarding the small portion of samples outside the boundary of the original constraint can still improve efficiency compared to standard HMC with simple rejection.

6.4 Experimental results

In this section, we evaluate our proposed method, Spherical HMC, by comparing its efficiency to that of Random Walk Metropolis (RWM) and Wall HMC using simulated and real data. To this end, we define efficiency in terms of the time-normalized effective sample size (ESS; definition 4.5, see section 4.5) [17]. Roughly speaking, ESS can be interpreted as the number of samples that can be regarded as independent. We use the minimum ESS normalized by CPU time, min(ESS)/s, as the overall measure of efficiency. All computer codes are available online at http://www.ics.uci.edu/~slan/lanzi/CODES.html.

6.4.1 Truncated Multivariate Gaussian

For illustration purposes, we first start with a truncated bivariate Gaussian distribution,
\[ \begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix} \sim N\left(\mathbf{0},\ \begin{pmatrix}1 & .5\\ .5 & 1\end{pmatrix}\right), \qquad 0 \le \beta_1 \le 5, \quad 0 \le \beta_2 \le 1 \]
The lower and upper limits are l = (0, 0) and u = (5, 1) respectively. The original rectangular domain can be mapped to the 2-dimensional unit sphere through the following transformation:
\[ T: [0,5]\times[0,1] \to \mathbf{S}^2, \quad \beta \mapsto \beta' = \frac{2\beta - (u+l)}{u-l} \mapsto \theta = \beta'\,\frac{\|\beta'\|_\infty}{\|\beta'\|_2} \mapsto \tilde\theta = \left(\theta, \sqrt{1-\|\theta\|_2^2}\right) \]

Figure 6.2: Density plots of a truncated bivariate Gaussian using the exact density function (left) and MCMC samples from Spherical HMC (right).

The left panel of figure 6.2 shows the heatmap based on the exact density function, and the right panel shows the corresponding heatmap based on MCMC samples from Spherical HMC. Table 6.1 compares the true mean and covariance (computed with the R package 'tmvtnorm' [102]) of the above truncated bivariate Gaussian distribution with the point estimates obtained from RWM, Wall HMC, and Spherical HMC using 100000 MCMC iterations. Overall, all methods provide reasonably good estimates.

Method          Mean              Covariance
Truth           (0.7906, 0.4889)  [0.3269 0.0172; 0.0172 0.0800]
RWM             (0.7764, 0.4891)  [0.3216 0.0152; 0.0152 0.0801]
Wall HMC        (0.7929, 0.4890)  [0.3283 0.0163; 0.0163 0.0800]
Spherical HMC   (0.7925, 0.4892)  [0.3261 0.0170; 0.0170 0.0797]

Table 6.1: Comparing the point estimates of the mean and covariance matrix of a bivariate truncated Gaussian distribution using RWM, Wall HMC, and Spherical HMC.
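As a quick numerical check of Propositions 6.1 and 6.3 (ours, not part of the dissertation's experiments), one can estimate the truncated-Gaussian mean without any MCMC at all: draw points uniformly on the upper hemisphere, map them back to the rectangle, and use $\pi(\beta)\,|dT_{S\to R}|$ as self-normalized importance weights. A minimal Python sketch follows; the sample size is an arbitrary choice, and constant factors in (6.19) cancel in the self-normalization.

```python
import numpy as np

rng = np.random.default_rng(1)
l, u = np.array([0.0, 0.0]), np.array([5.0, 1.0])
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.5], [0.5, 1.0]]))
D, n = 2, 200000

# uniform draws on the upper hemisphere S^2_+
z = rng.normal(size=(n, D + 1))
tt = z / np.linalg.norm(z, axis=1, keepdims=True)
tt[:, -1] = np.abs(tt[:, -1])

# map back: sphere -> ball -> cube -> rectangle
theta = tt[:, :D]
r2 = np.linalg.norm(theta, axis=1)
rinf = np.linalg.norm(theta, np.inf, axis=1)
b = theta * (r2 / rinf)[:, None]                  # T_{B->C}
beta = (u - l) / 2.0 * b + (u + l) / 2.0          # T_{C->R}

# weights: unnormalized Gaussian density times |dT_{S->R}| from (6.19)
logpi = -0.5 * np.einsum('ni,ij,nj->n', beta, Sigma_inv, beta)
w = np.exp(logpi) * (r2 / rinf) ** D * tt[:, -1]
print(w @ beta / w.sum())   # should be close to (0.79, 0.49), cf. Table 6.1
```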
To evaluate the efficiency of the above three methods (RWM, Wall HMC, and Spherical HMC), we repeat this experiment in higher dimensions, D = 10 and D = 100. As before, we set the mean to zero and set the (i, j)-th element of the covariance matrix to $\Sigma_{ij} = 1/(1 + |i - j|)$. Further, we impose the following constraints on the parameters: $0 \le \beta_1 \le 5$ and $0 \le \beta_i \le 0.5$ for $i \ne 1$.

Dim     Method          AP     s/Iteration   Min(ESS)/s
D=10    RWM             0.64   1.6E-04       8.80
D=10    Wall HMC        0.93   5.8E-04       426.79
D=10    Spherical HMC   0.81   9.7E-04       602.78
D=100   RWM             0.72   1.3E-03       0.06
D=100   Wall HMC        0.94   1.4E-02       14.23
D=100   Spherical HMC   0.88   1.5E-02       40.12

Table 6.2: Sampling efficiency of RWM, Wall HMC, and Spherical HMC for generating samples from truncated Gaussian distributions.

For each method, we obtain 10000 MCMC samples after discarding the initial 1000 samples. We set the tuning parameters of the algorithms such that their overall acceptance rates are within a reasonable range. For RWM, above 95% of the proposed states are rejected due to violating the constraints. As shown in table 6.2, Spherical HMC is substantially more efficient than RWM and Wall HMC. On average, Wall HMC bounces off the wall around 7.68 and 31.10 times per iteration for D = 10 and D = 100 respectively. In contrast, by augmenting the parameter space, Spherical HMC handles the constraints in an efficient way.

6.4.2 Bayesian Lasso

In regression analysis, overly complex models tend to overfit the data. Regularized regression models control complexity by imposing a penalty on the model parameters. By far the most popular model in this group is Lasso (least absolute shrinkage and selection operator), proposed by [103]. In this approach, the coefficients are obtained by minimizing the residual sum of squares (RSS) subject to $\sum_{j=1}^D |\beta_j| \le t$. [104] and [105] have proposed a Bayesian alternative, called Bayesian Lasso. More specifically, the penalty term is replaced by a Laplace prior distribution of the form $P(\beta) \propto \prod_{j=1}^D \exp(-\lambda|\beta_j|)$, which can be represented as a scale mixture of normal distributions [106]. This leads to a hierarchical Bayesian model with full conditional conjugacy; therefore, the Gibbs sampler can be used for inference.

Our proposed method can directly handle the constraint in Lasso. Therefore, we can conveniently use Gaussian priors for the model parameters, $\beta|\sigma^2 \sim N(0, \sigma^2\mathbf{I})$, and use Spherical HMC with the transformation discussed in section 6.3.1.2.
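For the Lasso constraint, that transformation takes a particularly simple form. The following minimal Python sketch (ours; function names are hypothetical) shows the map $T_{Q\to B}$ (6.20) for general q applied to the rescaled coefficients, its inverse, and the weight (6.21).

```python
import numpy as np

def qnorm_to_ball(beta, t=1.0, q=1.0):
    """T_{Q->B} (6.20) applied to beta/t, so that the Lasso constraint
    sum|beta_j| <= t (q = 1) maps into the unit ball."""
    b = beta / t
    return np.sign(b) * np.abs(b) ** (q / 2.0)

def ball_to_qnorm(theta, t=1.0, q=1.0):
    """Inverse map: beta_i = t * sgn(theta_i)|theta_i|^{2/q}."""
    return t * np.sign(theta) * np.abs(theta) ** (2.0 / q)

def weight_qnorm(theta_tilde, q=1.0):
    """Jacobian weight (6.21) for the change of domain from Q^D to S^D,
    up to the constant factor t^D from the initial rescaling."""
    theta, last = theta_tilde[:-1], theta_tilde[-1]
    D = theta.size
    return (2.0 / q) ** D * np.prod(np.abs(theta)) ** (2.0 / q - 1.0) * abs(last)

beta = np.array([0.2, -0.3, 0.1])      # sum |beta_j| = 0.6 <= t = 1
theta = qnorm_to_ball(beta)
assert np.linalg.norm(theta) <= 1.0    # ||theta||_2^2 = sum |beta_j| for q = 1
assert np.allclose(ball_to_qnorm(theta), beta)
```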
Figure 6.3: Bayesian Lasso using three different sampling algorithms: Gibbs sampler (left), Wall HMC (middle), and Spherical HMC (right).

We now evaluate our method based on the diabetes data set discussed in [104]. Figure 6.3 compares the coefficient estimates given by the Gibbs sampler [104], Wall HMC, and Spherical HMC as the shrinkage factor $s := \|\hat\beta_{\mathrm{Lasso}}\|_1/\|\hat\beta_{\mathrm{OLS}}\|_1$ changes from 0 to 1. Here, $\hat\beta_{\mathrm{OLS}}$ denotes the ordinary least squares (OLS) estimates. For the Gibbs sampler, we choose different λ so that the corresponding s varies from 0 to 1. For Wall HMC and Spherical HMC, we fix the number of leapfrog steps to 10 and set the trajectory length such that they have comparable acceptance rates of around 70%. Figure 6.4 compares the sampling efficiency of these three methods. As we impose tighter constraints (i.e., lower shrinkage factors), our method becomes substantially more efficient than the Gibbs sampler and Wall HMC.

Figure 6.4: Sampling efficiency of different algorithms for Bayesian Lasso based on the diabetes dataset.

6.4.3 Bridge regression

The Lasso model discussed in the previous section is in fact a member of a family of regression models called Bridge regression [104, 107, 108], where the coefficients are obtained by minimizing the residual sum of squares subject to $\sum_{j=1}^D |\beta_j|^q \le t$. For Lasso, q = 1, which allows the model to force some of the coefficients to become exactly zero (i.e., excluded from the model). As mentioned earlier, our Spherical HMC method can easily handle this type of constraint through the following transformation:
\[ T: \mathbf{Q}^D \to \mathbf{S}^D, \quad \beta_i \mapsto \beta_i' = \beta_i/t \mapsto \theta_i = \mathrm{sgn}(\beta_i')|\beta_i'|^{q/2}, \quad \theta \mapsto \tilde\theta = \left(\theta, \sqrt{1-\|\theta\|_2^2}\right) \]
Figure 6.5 compares the parameter estimates of Bayesian Lasso to the estimates obtained from two Bridge regression models with q = 1.2 and q = 0.8 for the diabetes dataset [104] using our Spherical HMC algorithm. As expected, tighter constraints (e.g., q = 0.8) lead to faster shrinkage of the regression parameters as we decrease s.

Figure 6.5: Bayesian Bridge regression by Spherical HMC: Lasso (q = 1, left), q = 1.2 (middle), and q = 0.8 (right).

6.4.4 Modeling synchrony among multiple neurons

[109] have recently proposed a semiparametric Bayesian model to capture dependencies among multiple neurons by detecting their co-firing patterns over time. In this approach, time is discretized so that there is at most one spike in each interval. The resulting sequence of 1's (spike) and 0's (silence) for each neuron is called a spike train; it is denoted by Y and modeled using the logistic function of a continuous latent variable with a Gaussian process prior. For n neurons, the joint probability distribution of the spike trains $Y_1, \ldots, Y_n$ is coupled to the marginal distributions using a parametric copula model. Let H be an n-dimensional distribution function with marginals $F_1, \ldots, F_n$. In general, an n-dimensional copula is a function C with the following form:
\[ H(y_1, \ldots, y_n) = C(F_1(y_1), \ldots, F_n(y_n)), \quad \text{for all } y_1, \ldots, y_n \]
Here, C defines the dependence structure between the marginals. [109] use a special case of the Farlie-Gumbel-Morgenstern (FGM) copula family [110, 111, 112, 113], for which C has the following form:
\[ \left[1 + \sum_{k=2}^{n}\ \sum_{1\le j_1<\cdots<j_k\le n} \beta_{j_1 j_2 \ldots j_k} \prod_{l=1}^{k}(1-F_{j_l})\right]\prod_{i=1}^{n}F_i \]
where $F_i = F_i(y_i)$. Restricting the model to second-order interactions, we have
\[ H(y_1, \ldots, y_n) = \left[1 + \sum_{1\le j_1<j_2\le n}\beta_{j_1 j_2}\prod_{l=1}^{2}(1-F_{j_l})\right]\prod_{i=1}^{n}F_i \]
Here, $F_i = P(Y_i \le y_i)$ for the i-th neuron (i = 1, ..., n), where $y_1, \ldots, y_n$ denote the firing status of the n neurons at time t. The parameter $\beta_{j_1 j_2}$ captures the relationship between the $j_1$-th and $j_2$-th neurons, with $\beta_{j_1 j_2} = 0$ interpreted as "no relationship" between the two neurons. To ensure that the probability distribution functions remain within [0, 1], the following constraints on all $\binom{n}{2}$ parameters $\beta_{j_1 j_2}$ are imposed:
\[ 1 + \sum_{1\le j_1 < j_2 \le n} \beta_{j_1 j_2} \prod_{l=1}^{2} \varepsilon_{j_l} \ge 0, \qquad \varepsilon_1, \cdots, \varepsilon_n \in \{-1, 1\} \]
Considering all possible combinations of $\varepsilon_{j_1}$ and $\varepsilon_{j_2}$ in the above condition, there are n(n−1) linear inequalities, which can be combined into the following inequality:
\[ \sum_{1\le j_1 < j_2 \le n} |\beta_{j_1 j_2}| \le 1 \]
For this model, we can use the square root mapping described in section 6.3.1.2 to transform the original domain (q = 1) of the parameters to the unit ball before using Spherical HMC.

We apply our method to a real dataset from an experiment investigating the role of the prefrontal cortical area in rats with respect to reward-seeking behavior, discussed in [109]. Here, we focus on 5 simultaneously recorded neurons under two scenarios: I) rewarded (pressing a lever delivers 0.1 ml of 15% sucrose solution), and II) non-rewarded (nothing happens after pressing the lever). The copula model detected significant associations among three neurons: the first and fourth neurons ($\beta_{1,4}$) under the rewarded scenario, and the third and fourth neurons ($\beta_{3,4}$) under the non-rewarded scenario. All other parameters were deemed non-significant (based on 95% posterior probability intervals). The trace plots of $\beta_{14}$ under the rewarded stimulus and $\beta_{34}$ under the non-rewarded stimulus are provided in figures 6.6 and 6.7, respectively.

Figure 6.6: Trace plots of $\beta_{14}$ under the rewarded stimulus (RWM, Wall HMC, and Spherical HMC).

Figure 6.7: Trace plots of $\beta_{34}$ under the non-rewarded stimulus (RWM, Wall HMC, and Spherical HMC).

Scenario   Method          AP     s/Iteration   Min(ESS)/s
I          RWM             0.69   8.2           2.8E-04
I          Wall HMC        0.67   17.0          7.0E-03
I          Spherical HMC   0.83   17.0          2.0E-02
II         RWM             0.67   8.1           2.8E-04
II         Wall HMC        0.75   19.4          1.8E-03
II         Spherical HMC   0.81   18.0          2.2E-02

Table 6.3: Comparing the sampling efficiency of RWM, Wall HMC, and Spherical HMC based on the copula model for detecting synchrony among five neurons under the rewarded and non-rewarded stimuli.

As we can see in table 6.3, Spherical HMC is order(s) of magnitude more efficient than RWM and Wall HMC.

6.5 Discussion

We have introduced a new efficient sampling algorithm for constrained distributions. Our method first maps the parameter space to the unit ball and then augments the resulting space to a sphere. A dynamical system is then defined on the sphere to propose new states that are guaranteed to remain within the boundaries imposed by the constraints. We have also shown how our method can be used for other types of constraints after mapping them to the unit ball. Further, by using the splitting strategy, we could improve the computational efficiency of our algorithm.
We split the Lagrangian dynamics and solve the corresponding dynamics without requiring an embedding of the manifold into a larger space, which extends [51]. Note that the radii of the ball B and the sphere S do not have to be restricted to 1; we assumed this throughout the chapter only for convenience of discussion.

In this chapter, we assumed the Euclidean metric I on the unit ball $\mathbf{B}_0^D(1)$. The proposed approach can be extended to more complex metrics, such as the Fisher information metric $\mathbf{G}_F(\theta)$, in order to exploit the geometric properties of the parameter space [39]. This way, the metric for the augmented space could be defined as $\mathbf{G}_F(\theta) + \theta\theta^T/\theta_{D+1}^2$. Under such a metric, however, we might not be able to find the geodesic flow analytically. Therefore, the added benefit of using the Fisher information metric might be undermined by the resulting computational overhead. See [39, 51] for more discussion.

We have discussed several applications of our method in this chapter. The proposed method can be applied to other problems involving constrained target distributions. Further, the ideas presented here can be employed in other MCMC algorithms.

7 Conclusion

Markov Chain Monte Carlo is a crucial tool for Bayesian statistics, not only because it can handle the intractable integration that is almost omnipresent in modern Bayesian modeling, but also because it naturally provides interval estimates. The wider application of MCMC is, however, hindered by slow mixing rates and expensive computational cost. Hamiltonian Monte Carlo is an efficient Metropolis-Hastings algorithm: it uses Hamiltonian dynamics to guide the proposal so that the sampler can make several consecutive and systematic moves towards a distant state. Yet the standard HMC algorithm is not efficient or capable enough to handle statistical or machine learning problems that involve certain complicated probability distributions. This dissertation is an attempt to use geometry to help solve these challenges, including computational burden, exploration of complex distribution structures, multimodal distributions, and constrained distributions. The experimental results provided here confirm the potential for substantial improvement over traditional solutions.

Split HMC improves the computational efficiency of HMC by splitting the Hamiltonian into smaller dynamics, one of which can be simulated exactly or at lower cost, and thus with a larger step size and fewer steps. Two scenarios have been discussed: in one, the potential energy can be well approximated by a quadratic function so that the dynamics has a partial analytic solution; in the other, the most influential terms of the potential and their gradients can be evaluated based on a small subset of the data, making the simulation computationally less expensive. In both scenarios, the original potential energy or its gradient has to be well approximated to avoid large errors.

Lagrangian Monte Carlo reduces the computational cost of RHMC by removing the expensive implicit updates, using velocity instead of momentum. The original Hamiltonian dynamics on a Riemannian manifold is shown to be equivalent to Lagrangian dynamics, which is the solution to the variation of the action in physics. A semi-explicit integrator is derived in the same way as the generalized leapfrog, and is further made explicit by a symmetric construction.
Wormhole HMC is a novel geometric MCMC algorithm designed for sampling from multimodal distributions, a challenging problem in high dimensions. By tunneling the metric, adding an external vector field, and passing through an extra auxiliary dimension, the wormhole facilitates movement between modes and naturally embeds the jumping mechanism in the HMC algorithm. Moreover, with the regeneration technique to allow adaptation, the sampler can proactively search for unknown modes, as opposed to rediscovering known ones, and dynamically update the wormhole network on the fly without affecting the stationarity.

Spherical HMC provides a natural and efficient framework to sample from constrained distributions. It first maps the constrained domain to a unit ball, then augments it to a sphere in one higher dimension such that the original boundary corresponds to the equator of the sphere. The sampler defined on the sphere handles the constraints implicitly by moving freely on the sphere, generating proposals that remain within the boundary when mapped back to the original space. Although we discussed applications of this method using HMC, the proposed framework can easily be extended to other MCMC samplers.

The work presented here is by no means a comprehensive application of geometry to Bayesian inference. The author believes that using other geometrically motivated methods could substantially advance the development of MCMC methodology. With computational methods to balance the added cost, these methods could broaden the application of MCMC to large, complicated problems.

7.1 Future Directions

Even though our proposed methods in this dissertation show the benefits of using geometry in Bayesian inference, the associated computational overhead cannot be neglected. Occasionally, the extra computational cost overwhelms the gain (see section 4.5.3). However, this means that we should attempt to develop better geometric methods and find better integrations of these methods with computational techniques. In the following I point out some possible future directions.

Matrix Calculation In general, this is a challenging problem in numerical analysis. Many matrix calculations, e.g., multiplication and inversion, have complexity O(D^2.373). Therefore, it is quite expensive to work with full matrices in our proposed methods. To avoid this issue, we could use sparse or structurally sparse (e.g., tri-diagonal) matrices instead. For example, we could approximate full matrices with some simpler and easier-to-calculate forms [48]. We can also take advantage of the features of specific problems that involve structured matrices [114, 115].

Stochastic Updates When the data volume is extremely large, it is not computationally practical to directly apply these geometric methods, considering that each update of the geometric terms and each acceptance test require scanning all of the data. The idea of stochastic updates stems from [116], where a stochastic gradient calculated on a uniformly sampled subset of the data is used for optimization. For variational Bayes, [117] develop stochastic variational Bayes by solving the variational inference problem with stochastic optimization. For MCMC methods, [91, 118] are pioneers in using stochastic gradients to reduce the computational cost. Their method is based on Langevin dynamics, which is a simpler version of HMC with one leapfrog step in each iteration.
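For reference, a minimal Python sketch of one such stochastic gradient Langevin update, in the spirit of [91, 118] (ours; `grad_log_prior` and `grad_log_lik` are placeholder model callables, and the step-size schedule is left to the caller):

```python
import numpy as np

def sgld_step(theta, eps, X, batch_size, grad_log_prior, grad_log_lik,
              rng=np.random.default_rng()):
    """One stochastic gradient Langevin step: a single leapfrog-like move
    driven by a minibatch gradient estimate plus injected Gaussian noise."""
    N = X.shape[0]
    idx = rng.choice(N, size=batch_size, replace=False)
    # unbiased estimate of the full-data gradient of the log posterior
    g = grad_log_prior(theta) + (N / batch_size) * sum(
        grad_log_lik(theta, x) for x in X[idx])
    noise = rng.normal(scale=np.sqrt(eps), size=theta.shape)
    return theta + 0.5 * eps * g + noise
```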
Extension of this method to HMC, however, might be challenging, since the introduced errors will accumulate along the trajectory, rendering the movement more diffusive. [91] also avoid the acceptance test for the proposal by annealing the step size along the sampling; it is shown that there is a trade-off between computational cost and accuracy. [119] point out an interesting approach to reduce the computational cost of Metropolis-Hastings algorithms by using sequential testing for the acceptance tests. The (stochastic gradient) Langevin versions of the algorithms presented in this dissertation are worth investigating for more scalable applications.

Geometric Variational Bayes Variational Bayes relies on iteratively reducing the distance (Kullback-Leibler divergence) between a variational distribution and the true posterior distribution. However, K-L is not always the best choice of distance function between distributions; in fact, it is not a proper distance measure since it is not symmetric. Besides, the K-L divergence can have complicated forms, e.g., the K-L divergence between $N(0, \sigma^2)$ and $N(0, \sigma^2 + \delta^2)$. On the other hand, if we view the family of distributions as a manifold [46] with a proper metric (e.g., the Fisher metric), we can define their distance as the length (or simply the kinetic energy, which is the squared length) of the geodesic connecting them. In this example, such a distance would be as simple as $\frac{1}{2}(\log(1 + \delta^2/\sigma^2))^2$. One future direction could be to develop a geometric version of variational Bayes by using a geodesic-based distance function. Variation of energy is a fully developed concept in geometry and could naturally be adopted to provide an easier alternative to current variational Bayes methods based on K-L divergence.

References

[1] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An Introduction to Variational Methods for Graphical Models. Mach. Learn., 37(2):183–233, November 1999. 1

[2] Tommi S. Jaakkola. Tutorial on Variational Approximation Methods. In Advanced Mean Field Methods: Theory and Practice, pages 129–159. MIT Press, 2000. 1

[3] R. M. Neal. Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993. 1, 2, 27, 63

[4] Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, 2nd edition, 2004. 1, 2

[5] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. MIT Press, 1999. 1

[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977. 1

[7] Hagai Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI'99, pages 21–30. Morgan Kaufmann Publishers Inc., 1999. 1

[8] Hagai Attias. A Variational Bayesian Framework for Graphical Models. In Advances in Neural Information Processing Systems 12, pages 209–215. MIT Press, 2000. 1

[9] Wim Wiegerinck. Variational approximations between mean field theory and the junction tree algorithm. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, UAI'00, pages 626–633. Morgan Kaufmann Publishers Inc., 2000. 1

[10] Zoubin Ghahramani and Matthew J. Beal. Propagation Algorithms for Variational Bayesian Learning. In Advances in Neural Information Processing Systems 13, pages 507–513. MIT Press, 2001. 1

[11] Mark Girolami. A Variational Method for Learning Sparse and Overcomplete Representations. Neural Comput., 13(11):2517–2532, November 2001. 1

[12] Eric P. Xing, Michael I. Jordan, and Stuart Russell. A generalized mean field algorithm for variational inference in exponential families. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, UAI'03, pages 583–591. Morgan Kaufmann Publishers Inc., 2003. 1

[13] C. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A Variational Inference Engine for Bayesian Networks. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 777–784. MIT Press, Cambridge, MA, 2003. 1

[14] C. Kipnis and S. R. S. Varadhan. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Commun. Math. Phys., 104:1–19, 1986. 2

[15] T. P. Straatsma, H. J. C. Berendsen, and A. J. Stam. Estimation of statistical errors in molecular simulation calculations. Molecular Physics, 57:89–95, 1986. 2

[16] Brian D. Ripley. Stochastic Simulation. John Wiley & Sons, Inc., New York, NY, USA, 1987. 2

[17] C. J. Geyer. Practical Markov Chain Monte Carlo. Statistical Science, 7(4):473–483, 1992. 2, 27, 52, 98

[18] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953. 2, 11, 17

[19] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970. 2, 11

[20] Stuart Geman and Donald Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741, November 1984. 2, 3

[21] Alan E. Gelfand and Adrian F. M. Smith. Sampling-Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association, 85(410):398–409, 1990. 2

[22] Tommi Jaakkola and Michael I. Jordan. Variational probabilistic inference and the QMR-DT database. Journal of Artificial Intelligence Research, 10:291–322, 1999. 2
Geometric Variational Bayes

Variational Bayes relies on iteratively reducing the Kullback-Leibler (K-L) divergence between a variational distribution and the true posterior distribution. However, K-L divergence is not always the best way to compare distributions; in fact, it is not a proper distance, since it is not symmetric. Moreover, K-L divergence can take complicated forms, e.g. between N(0, σ²) and N(0, σ² + δ²). On the other hand, if we view a family of distributions as a manifold [46] equipped with a proper metric (e.g., the Fisher metric), we can define the distance between two distributions as the length of the geodesic connecting them, or, more conveniently, as the energy of the geodesic, which is half its squared length. In the above example, this energy is simply ½(log(1 + δ²/σ²))². One future direction, therefore, is to develop a geometric version of variational Bayes using a geodesic-based distance function: variation of energy is a fully developed concept in geometry and could naturally provide an easier alternative to current variational Bayes methods based on K-L divergence. A small numerical comparison is given below.
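The following short Python check (with arbitrary illustrative values of σ² and δ²) contrasts the two K-L divergences between the zero-mean Gaussians above, which disagree with each other, with the single geodesic energy quoted in the text.

import numpy as np

sigma2, delta2 = 1.0, 0.5
s1, s2 = sigma2, sigma2 + delta2    # variances of N(0, sigma^2) and N(0, sigma^2 + delta^2)

# K-L divergence between zero-mean Gaussians: KL(N(0,s1) || N(0,s2))
kl_12 = 0.5 * (s1 / s2 - 1.0 + np.log(s2 / s1))
kl_21 = 0.5 * (s2 / s1 - 1.0 + np.log(s1 / s2))

# geodesic-based energy from the text
energy = 0.5 * np.log(1.0 + delta2 / sigma2) ** 2

print(kl_12, kl_21, energy)         # kl_12 != kl_21, while the energy is a single symmetric quantity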
References

[1] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2):183–233, 1999.
[2] Tommi S. Jaakkola. Tutorial on Variational Approximation Methods. In Advanced Mean Field Methods: Theory and Practice, pages 129–159. MIT Press, 2000.
[3] R. M. Neal. Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.
[4] Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, 2nd edition, 2004.
[5] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. MIT Press, 1999.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[7] Hagai Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI'99, pages 21–30. Morgan Kaufmann Publishers Inc., 1999.
[8] Hagai Attias. A Variational Bayesian Framework for Graphical Models. In Advances in Neural Information Processing Systems 12, pages 209–215. MIT Press, 2000.
[9] Wim Wiegerinck. Variational approximations between mean field theory and the junction tree algorithm. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, UAI'00, pages 626–633. Morgan Kaufmann Publishers Inc., 2000.
[10] Zoubin Ghahramani and Matthew J. Beal. Propagation Algorithms for Variational Bayesian Learning. In Advances in Neural Information Processing Systems 13, pages 507–513. MIT Press, 2001.
[11] Mark Girolami. A Variational Method for Learning Sparse and Overcomplete Representations. Neural Computation, 13(11):2517–2532, 2001.
[12] Eric P. Xing, Michael I. Jordan, and Stuart Russell. A generalized mean field algorithm for variational inference in exponential families. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, UAI'03, pages 583–591. Morgan Kaufmann Publishers Inc., 2003.
[13] C. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A Variational Inference Engine for Bayesian Networks. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 777–784. MIT Press, Cambridge, MA, 2003.
[14] C. Kipnis and S. R. S. Varadhan. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Communications in Mathematical Physics, 104:1–19, 1986.
[15] T. P. Straatsma, H. J. C. Berendsen, and A. J. Stam. Estimation of statistical errors in molecular simulation calculations. Molecular Physics, 57:89–95, 1986.
[16] Brian D. Ripley. Stochastic Simulation. John Wiley & Sons, Inc., New York, NY, USA, 1987.
[17] C. J. Geyer. Practical Markov Chain Monte Carlo. Statistical Science, 7(4):473–483, 1992.
[18] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
[19] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
[20] Stuart Geman and Donald Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
[21] Alan E. Gelfand and Adrian F. M. Smith. Sampling-Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association, 85(410):398–409, 1990.
[22] Tommi Jaakkola and Michael I. Jordan. Variational probabilistic inference and the QMR-DT database. Journal of Artificial Intelligence Research, 10:291–322, 1999.
[23] Zoubin Ghahramani and Matthew J. Beal. Variational Inference for Bayesian Mixtures of Factor Analysers. In Advances in Neural Information Processing Systems 12, pages 449–455. MIT Press, 2000.
[24] R. Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 4th edition, 2010.
[25] Luke Tierney. Markov Chains for Exploring Posterior Distributions. The Annals of Statistics, 22(4):1701–1728, 1994.
[26] John Geweke. Bayesian Inference in Econometric Models Using Monte Carlo Integration. Econometrica, 57(6):1317–1339, 1989.
[27] A. F. M. Smith and A. E. Gelfand. Bayesian Statistics without Tears: A Sampling-Resampling Perspective. The American Statistician, 46(2):84–88, 1992.
[28] Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan. An Introduction to MCMC for Machine Learning. Machine Learning, 50(1-2):5–43, 2003.
[29] W. R. Gilks, N. G. Best, and K. K. C. Tan. Adaptive Rejection Metropolis Sampling within Gibbs Sampling. Journal of the Royal Statistical Society, Series C (Applied Statistics), 44(4):455–472, 1995.
[30] Yves F. Atchadé and François Perron. Improving on the independent Metropolis-Hastings algorithm. Statistica Sinica, 15:3–18, 2005.
[31] Lars Holden, Ragnar Hauge, and Marit Holden. Adaptive independent Metropolis-Hastings. Annals of Applied Probability, 19(1):395–413, 2009.
[32] Paolo Giordani and Robert Kohn. Adaptive Independent Metropolis-Hastings by Fast Estimation of Mixtures of Normals. Journal of Computational and Graphical Statistics, 19(2):243–259, 2010.
[33] Radford M. Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.
[34] Iain Murray, Ryan Prescott Adams, and David J. C. MacKay. Elliptical slice sampling. JMLR: W&CP, 9:541–548, 2010.
[35] Robert Nishihara, Iain Murray, and Ryan P. Adams. Parallel MCMC with Generalized Elliptical Slice Sampling. http://arxiv.org/abs/1210.7477, 2012.
[36] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
[37] R. M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. Jones, and X. L. Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC, 2010.
[38] M. Hoffman and A. Gelman. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. http://arxiv.org/abs/1111.4246, 2011.
[39] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society, Series B (with discussion), 73(2):123–214, 2011.
[40] A. Beskos, F. J. Pinski, J. M. Sanz-Serna, and A. M. Stuart. Hybrid Monte Carlo on Hilbert spaces. Stochastic Processes and their Applications, 121:2201–2230, 2011.
[41] Babak Shahbaba, Shiwei Lan, Wesley O. Johnson, and Radford M. Neal. Split Hamiltonian Monte Carlo. Statistics and Computing, pages 1–11, 2013.
[42] Michael Betancourt and Leo C. Stein. The Geometry of Hamiltonian Monte Carlo. http://arxiv.org/abs/1112.4118, 2011.
[43] Jascha Sohl-Dickstein. Hamiltonian Monte Carlo with Reduced Momentum Flips. http://arxiv.org/abs/1205.1939, 2012.
[44] Jascha Sohl-Dickstein and Benjamin J. Culpepper. Hamiltonian Annealed Importance Sampling for partition function estimation. http://arxiv.org/abs/1205.1925, 2012.
[45] A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for Truncated Multivariate Gaussians. ArXiv e-prints, 2013.
[46] S. Amari and H. Nagaoka. Methods of Information Geometry, volume 191 of Translations of Mathematical Monographs. Oxford University Press, 2000.
[47] V. Stathopoulos and M. Girolami. Manifold MCMC for Mixtures. In K. L. Mengersen, C. P. Robert, and D. M. Titterington, editors, Mixtures: Estimation and Applications, pages 255–276. John Wiley & Sons, Ltd, 2011.
[48] Yichuan Zhang and Charles Sutton. Quasi-Newton Methods for Markov Chain Monte Carlo. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2393–2401, 2011.
[49] S. Lan, V. Stathopoulos, B. Shahbaba, and M. Girolami. Lagrangian Dynamical Monte Carlo. http://arxiv.org/abs/1211.3759, 2012.
[50] Ziyu Wang, Shakir Mohamed, and Nando de Freitas. Adaptive Hamiltonian and Riemann Manifold Monte Carlo Samplers. http://arxiv.org/abs/1302.6182, 2013.
[51] S. Byrne and M. Girolami. Geodesic Monte Carlo on Embedded Manifolds. ArXiv e-prints, 2013.
[52] R. M. Neal. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4):353, 1996.
[53] E. Marinari and G. Parisi. Simulated tempering: a new Monte Carlo scheme. Europhysics Letters, 19:451–458, 1992.
[54] Charles J. Geyer and Elizabeth A. Thompson. Annealing Markov Chain Monte Carlo with Applications to Ancestral Inference. Journal of the American Statistical Association, 90(431):909–920, 1995.
[55] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.
[56] P. Diaconis, S. Holmes, and M. Shahshahani. Sampling from a Manifold. In Galin Jones and Xiaotong Shen, editors, Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton, volume 10, pages 102–125. Institute of Mathematical Statistics, 2013.
[57] Marcus A. Brubaker, Mathieu Salzmann, and Raquel Urtasun. A Family of MCMC Methods on Implicitly Defined Manifolds. In Neil D. Lawrence and Mark A. Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), volume 22, pages 161–172, 2012.
[58] Peter J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711–732, 1995.
[59] Arnaud Doucet, Nando de Freitas, and Neil Gordon. An Introduction to Sequential Monte Carlo Methods. In Arnaud Doucet, Nando de Freitas, and Neil Gordon, editors, Sequential Monte Carlo Methods in Practice, Statistics for Engineering and Information Science, pages 3–14. Springer New York, 2001.
[60] Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996.
[61] B. Leimkuhler and S. Reich. Simulating Hamiltonian Dynamics. Cambridge University Press, 2004.
[62] V. I. Arnold. Mathematical Methods of Classical Mechanics. Springer, 2nd edition, 1989.
[63] L. Verlet. Computer "Experiments" on Classical Fluids. I. Thermodynamical Properties of Lennard-Jones Molecules. Physical Review, 159(1):98–103, 1967.
[64] A. D. Polyanin, V. F. Zaitsev, and A. Moussiaux. Handbook of First Order Partial Differential Equations. Taylor & Francis, London, 2002.
[65] Madeleine B. Thompson. A Comparison of Methods for Computing Autocorrelation Time. Technical Report 1007, University of Toronto, 2010.
[66] D. Ayres-de-Campos, J. Bernardes, A. Garrido, J. Marques-de-Sá, and L. Pereira-Leite. SisPorto 2.0: A Program for Automated Analysis of Cardiotocograms. Journal of Maternal-Fetal Medicine, 9:311–318, 2000.
[67] Luke Tierney and Joseph B. Kadane. Accurate Approximations for Posterior Moments and Marginal Densities. Journal of the American Statistical Association, 81(393):82–86, 1986.
[68] Manfredo P. do Carmo. Riemannian Geometry. Birkhäuser, Boston, 1st edition, 1992.
[69] Richard L. Bishop and Samuel I. Goldberg. Tensor Analysis on Manifolds. Dover Publications, Inc., 1980.
[70] Jun S. Liu. Monte Carlo Strategies in Scientific Computing, chapter Molecular Dynamics and Hybrid Monte Carlo. Springer-Verlag, 2001.
[71] Jean-Michel Marin, Kerrie L. Mengersen, and Christian Robert. Bayesian modelling and inference on mixtures of distributions. In D. Dey and C. R. Rao, editors, Handbook of Statistics, Volume 25. Elsevier, 2005.
[72] Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, Inc., 2005.
[73] A. Dullweber, B. Leimkuhler, and R. McLachlan. Split-Hamiltonian methods for rigid body molecular dynamics. Journal of Chemical Physics, 107:5840–5852, 1997.
[74] J. C. Sexton and D. H. Weingarten. Hamiltonian evolution for the hybrid Monte Carlo algorithm. Nuclear Physics B, 380(3):665–677, 1992.
[75] Siu A. Chin. Explicit symplectic integrators for solving nonseparable Hamiltonians. Physical Review E, 80:037701, 2009.
[76] G. Celeux, M. Hurn, and C. P. Robert. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 95:957–970, 2000.
[77] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
[78] D. Rudoy and P. J. Wolfe. Monte Carlo Methods for Multi-Modal Distributions. In Signals, Systems and Computers, 2006. ACSSC '06. Fortieth Asilomar Conference on, pages 2019–2023, 2006.
[79] C. Sminchisescu and M. Welling. Generalized darting Monte Carlo. Pattern Recognition, 44(10-11), 2011.
[80] Radu V. Craiu, Jeffrey Rosenthal, and Chao Yang. Learn From Thy Neighbor: Parallel-Chain and Regional Adaptive MCMC. Journal of the American Statistical Association, 104(488):1454–1466, 2009.
[81] G. R. Warnes. The normal kernel coupler: An adaptive Markov Chain Monte Carlo method for efficiently sampling from multi-modal distributions. Technical Report 395, University of Washington, 2001.
[82] K. B. Laskey and J. W. Myers. Population Markov Chain Monte Carlo. Machine Learning, 50:175–196, 2003.
[83] G. E. Hinton, M. Welling, and A. Mnih. Wormholes Improve Contrastive Divergence. In Advances in Neural Information Processing Systems 16, 2004.
[84] C. J. F. Ter Braak. A Markov Chain Monte Carlo version of the genetic algorithm Differential Evolution: easy Bayesian computing for real parameter spaces. Statistics and Computing, 16(3):239–249, 2006.
[85] S. Ahn, Y. Chen, and M. Welling. Distributed and adaptive darting Monte Carlo through regenerations. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.
[86] C. Sminchisescu and B. Triggs. Building Roadmaps of Local Minima of Visual Models. In European Conference on Computer Vision, pages 566–582, 2002.
[87] Esa Nummelin. General Irreducible Markov Chains and Non-Negative Operators, volume 83 of Cambridge Tracts in Mathematics. Cambridge University Press, 1984.
[88] Per Mykland, Luke Tierney, and Bin Yu. Regeneration in Markov Chain Samplers. Journal of the American Statistical Association, 90(429):233–241, 1995.
[89] Walter R. Gilks, Gareth O. Roberts, and Sujit K. Sahu. Adaptive Markov Chain Monte Carlo through Regeneration. Journal of the American Statistical Association, 93(443):1045–1054, 1998.
[90] Ioan Andricioaei, John E. Straub, and Arthur F. Voter. Smart darting Monte Carlo. The Journal of Chemical Physics, 114(16):6994–7000, 2001.
[91] M. Welling and Y. W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the International Conference on Machine Learning, 2011.
[92] J. Kleinberg and É. Tardos. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.
[93] A. E. Gelfand and D. K. Dey. Bayesian model choice: Asymptotics and exact calculations. Journal of the Royal Statistical Society, Series B, 56(3):501–514, 1994.
[94] Anthony E. Brockwell and Joseph B. Kadane. Identification of regeneration times in MCMC simulation, with application to adaptive schemes. Journal of Computational and Graphical Statistics, 14:436–458, 2005.
[95] A. T. Ihler, J. W. Fisher III, R. L. Moses, and A. S. Willsky. Nonparametric belief propagation for self-localization of sensor networks. IEEE Journal on Selected Areas in Communications, 23(4):809–819, 2005.
[96] S. P. Brooks and A. Gelman. General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics, 7(4):434–455, 1998.
[97] Peter Neal and Gareth O. Roberts. Optimal scaling for random walk Metropolis on spherically constrained target densities. Methodology and Computing in Applied Probability, 10(2):277–297, 2008.
[98] Chris Sherlock and Gareth O. Roberts. Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli, 15(3):774–798, 2009.
[99] Peter Neal, Gareth O. Roberts, and Wai Kong Yuen. Optimal scaling of random walk Metropolis algorithms with discontinuous target densities. Annals of Applied Probability, 22(5):1880–1927, 2012.
[100] Michael Spivak. A Comprehensive Introduction to Differential Geometry, Volume 1. Publish or Perish, Inc., Houston, 2nd edition, 1979.
[101] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, USA, 3rd edition, 1996.
[102] Stefan Wilhelm and Manjunath B G. tmvtnorm: Truncated Multivariate Normal and Student t Distribution, 2013. R package version 1.4-8.
[103] Robert Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[104] Trevor Park and George Casella. The Bayesian Lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
[105] Chris Hans. Bayesian lasso regression. Biometrika, 96(4):835–845, 2009.
[106] M. West. On scale mixtures of normal distributions. Biometrika, 74(3):646–648, 1987.
[107] Ildiko E. Frank and Jerome H. Friedman. A Statistical View of Some Chemometrics Regression Tools. Technometrics, 35(2):109–135, 1993.
[108] Nicholas G. Polson, James G. Scott, and Jesse Windle. The Bayesian Bridge. http://arxiv.org/abs/1109.2279v2, 2012.
[109] B. Shahbaba, B. Zhou, H. Ombao, D. Moorman, and S. Behseta. A semiparametric Bayesian model for neural coding. http://arxiv.org/abs/1306.6103, 2013.
[110] D. J. G. Farlie. The Performance of Some Correlation Coefficients for a General Bivariate Distribution. Biometrika, 47(3/4), 1960.
[111] E. J. Gumbel. Bivariate Exponential Distributions. Journal of the American Statistical Association, 55:698–707, 1960.
[112] D. Morgenstern. Einfache Beispiele zweidimensionaler Verteilungen. Mitteilungsblatt für Mathematische Statistik, 8:234–235, 1956.
[113] Roger B. Nelsen. An Introduction to Copulas. Springer Series in Statistics. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2nd edition, 2006.
[114] Zhen Chen and David B. Dunson. Random Effects Selection in Linear Mixed Models. Biometrics, 59(4):762–769, 2003.
[115] Mohsen Pourahmadi. Cholesky Decompositions and Estimation of a Covariance Matrix: Orthogonality of Variance-Correlation Parameters. Biometrika, 94(4):1006–1013, 2007.
[116] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. Annals of Mathematical Statistics, 22(3):400–407, 1951.
[117] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[118] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1591–1598, New York, NY, USA, 2012. ACM.
[119] Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget. http://arxiv.org/abs/1304.5299, 2013.

Appendix A

Lagrangian Monte Carlo

A.1 Equivalence between Riemannian Hamiltonian dynamics and Lagrangian dynamics

Proof of Proposition 4.1. The first equation in (4.9) is obtained directly from the transformation $p \mapsto v$: $\dot\theta^k = g^{kl} p_l = v^k$. For the second equation in (4.9), the definition of $p$ gives

$$\dot p_l = \frac{d(g_{lj}(\theta)v^j)}{dt} = \frac{\partial g_{lj}}{\partial\theta^i}\,\dot\theta^i v^j + g_{lj}\dot v^j = \partial_i g_{lj}\, v^i v^j + g_{lj}\dot v^j. \tag{A.1}$$

Further, from equation (4.4) we have

$$\dot p_l = -\partial_l \phi(\theta) + \tfrac12 v^{\mathsf T}\partial_l G(\theta)\, v = -\partial_l\phi + \tfrac12 \partial_l g_{ij}\, v^i v^j = \partial_i g_{lj}\, v^i v^j + g_{lj}\dot v^j,$$

which means

$$g_{lj}\dot v^j = -\big(\partial_i g_{lj} - \tfrac12 \partial_l g_{ij}\big) v^i v^j - \partial_l\phi.$$

Multiplying both sides by $G^{-1} = (g^{kl})$, we have

$$\dot v^k = \delta^k_j \dot v^j = -g^{kl}\big(\partial_i g_{lj} - \tfrac12\partial_l g_{ij}\big) v^i v^j - g^{kl}\partial_l\phi. \tag{A.2}$$

Since $i$ and $j$ are symmetric in the first summand (see equation (A.1)), switching them gives

$$\dot v^k = -g^{kl}\big(\partial_j g_{li} - \tfrac12 \partial_l g_{ji}\big) v^i v^j - g^{kl}\partial_l\phi. \tag{A.3}$$

The final form of the second equation in (4.9) is then obtained by adding equations (A.2) and (A.3) and dividing the result by two:

$$\dot v^k = -\Gamma^k_{ij}(\theta)\, v^i v^j - g^{kl}(\theta)\,\partial_l\phi(\theta),$$

where $\Gamma^k_{ij}(\theta) := \tfrac12 g^{kl}(\partial_i g_{lj} + \partial_j g_{il} - \partial_l g_{ij})$ are the Christoffel symbols of the second kind. ∎

A.2 Stationarity of Lagrangian Monte Carlo

Proof of Theorem 4.1. Starting from position $\theta \sim \pi(\theta)$ at time 0, we generate a velocity $v \sim \mathrm N(0, G(\theta))$ and evolve $(\theta, v)$ according to our time-reversible integrator $\hat T$ to reach a new state $(\theta^*, v^*)$, with $\theta^* \sim f(\theta^*)$ after the acceptance test. We want to prove that $f(\cdot) = \pi(\cdot)$, which can be done by showing $\mathrm E_f[h(\theta^*)] = \mathrm E_\pi[h(\theta^*)]$ for any square-integrable function $h$. Denote $z := (\theta, v)$ and $\mathbf P(dz) := \exp(-E(z))\,dz$. Note that $z^* = (\theta^*, v^*)$ can be reached in two ways: the proposal is either accepted or rejected. Therefore

$$\mathrm E_f[h(\theta^*)] = \int h(\theta^*)\Big[\mathbf P\big(d\hat T^{-1}(z^*)\big)\,\tilde\alpha\big(\hat T^{-1}(z^*), z^*\big) + \mathbf P(dz^*)\,\big(1 - \tilde\alpha(z^*, \hat T(z^*))\big)\Big]$$
$$= \int h(\theta^*)\,\mathbf P(dz^*) + \int h(\theta^*)\Big[\mathbf P\big(d\hat T^{-1}(z^*)\big)\tilde\alpha\big(\hat T^{-1}(z^*), z^*\big) - \mathbf P(dz^*)\,\tilde\alpha(z^*, \hat T(z^*))\Big].$$

So it suffices to prove

$$\int h(\theta^*)\,\mathbf P\big(d\hat T^{-1}(z^*)\big)\,\tilde\alpha\big(\hat T^{-1}(z^*), z^*\big) = \int h(\theta^*)\,\mathbf P(dz^*)\,\tilde\alpha\big(z^*, \hat T(z^*)\big). \tag{A.4}$$

Denote the involution $\nu: (\theta, v) \mapsto (\theta, -v)$. First, by time reversibility we have $\hat T^{-1}(z^*) = \nu\hat T\nu(z^*)$. Further, we claim that $\tilde\alpha(\nu(z), z') = \tilde\alpha(z, \nu(z'))$. This is true because: i) $E$ is quadratic in $v$, so $E(\nu(z)) = E(z)$; and ii) $\left|\frac{d\nu(z')}{d\nu(z)}\right| = \left|\frac{dz'}{dz}\right|$. The claim then follows from the definition of the adjusted acceptance probability (4.13) and the equivalence discussed in Proposition 4.4. Therefore

$$\int h(\theta^*)\,\mathbf P\big(d\hat T^{-1}(z^*)\big)\,\tilde\alpha\big(\hat T^{-1}(z^*), z^*\big) = \int h(\theta^*)\,\mathbf P\big(d\nu\hat T\nu(z^*)\big)\,\tilde\alpha\big(\nu\hat T\nu(z^*), z^*\big)$$
$$= \int h(\theta^*)\,\mathbf P\big(d\hat T\nu(z^*)\big)\,\tilde\alpha\big(\hat T\nu(z^*), \nu(z^*)\big). \tag{A.5}$$

Next, applying the detailed balance condition (4.14) to $\nu(z^*)$ we get

$$\mathbf P\big(d\hat T\nu(z^*)\big)\,\tilde\alpha\big(\hat T\nu(z^*), \nu(z^*)\big) = \mathbf P\big(d\nu(z^*)\big)\,\tilde\alpha\big(\nu(z^*), \hat T\nu(z^*)\big).$$

Substituting this into (A.5) and continuing,

$$\int h(\theta^*)\,\mathbf P\big(d\nu(z^*)\big)\,\tilde\alpha\big(\nu(z^*), \hat T\nu(z^*)\big) \;\overset{\nu(z^*)\mapsto z^*}{=}\; \int h(\theta^*)\,\mathbf P(dz^*)\,\tilde\alpha\big(z^*, \hat T(z^*)\big).$$

Therefore equation (A.4) holds, and the proof is complete. ∎
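For concreteness, the volume-adjusted acceptance test that this proof manipulates can be sketched in a few lines of Python. The integrator T_hat and energy E below are hypothetical placeholders standing in for the explicit integrator and the energy of Chapter 4; only the structure of the test, min{1, exp(E(z) − E(z*)) |det dT̂(z)/dz|}, is the point.

import numpy as np

def lmc_accept(E, T_hat, z, rng):
    """One volume-adjusted accept/reject step (schematic).

    E     : energy function of the state z = (theta, v)
    T_hat : placeholder for the time-reversible integrator; assumed to
            return the proposal z* and log|det dT_hat(z)/dz|
    """
    z_star, log_det_jac = T_hat(z)
    # adjusted acceptance probability, cf. (4.13)
    log_alpha = min(0.0, E(z) - E(z_star) + log_det_jac)
    if np.log(rng.uniform()) < log_alpha:
        return z_star   # accepted: move to the proposal
    return z            # rejected: stay put (velocity is refreshed next iteration)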
A.3 Convergence of the explicit integrator to Lagrangian dynamics

Proof of Proposition 4.7. We first examine how the discretization error $e_n = \|z(t_n) - z^{(n)}\| = \|(\theta(t_n), v(t_n)) - (\theta^{(n)}, v^{(n)})\|$ changes over two consecutive steps (the local error), and then investigate how this error accumulates over multiple steps (the global error). Assume $f(\theta, v) := v^{\mathsf T}\Gamma(\theta)v + G(\theta)^{-1}\nabla_\theta\phi(\theta)$ is smooth; hence $f$ and its derivatives are uniformly bounded as $(\theta, v)$ evolves within a finite time duration $T$.

First we expand the true solution $z(t_{n+1})$ at $t_n$:

$$z(t_{n+1}) = z(t_n) + \dot z(t_n)\,\varepsilon + \tfrac12 \ddot z(t_n)\,\varepsilon^2 + o(\varepsilon^2)$$
$$= \begin{bmatrix}\theta(t_n)\\ v(t_n)\end{bmatrix} + \begin{bmatrix}v(t_n)\\ -f(\theta(t_n), v(t_n))\end{bmatrix}\varepsilon + \frac12\begin{bmatrix}-f(\theta(t_n), v(t_n))\\ -\frac{\partial f}{\partial\theta^{\mathsf T}}v(t_n) + \frac{\partial f}{\partial v^{\mathsf T}}f(\theta(t_n), v(t_n))\end{bmatrix}\varepsilon^2 + o(\varepsilon^2)$$
$$= \begin{bmatrix}\theta(t_n)\\ v(t_n)\end{bmatrix} + \begin{bmatrix}v(t_n)\\ -f(\theta(t_n), v(t_n))\end{bmatrix}\varepsilon + O(\varepsilon^2).$$

Next, we simplify the expression of the numerical solution $z^{(n+1)} = (\theta^{(n+1)}, v^{(n+1)})$ given by the integrator (4.21)–(4.23) and compare it to the true solution. To this end, we rewrite equation (4.21) as follows:

$$v^{(n+1/2)} = \Big[I + \tfrac\varepsilon2 (v^{(n)})^{\mathsf T}\Gamma(\theta^{(n)})\Big]^{-1}\Big[v^{(n)} - \tfrac\varepsilon2\, G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\Big]$$
$$= v^{(n)} - \tfrac\varepsilon2\Big[I + \tfrac\varepsilon2(v^{(n)})^{\mathsf T}\Gamma(\theta^{(n)})\Big]^{-1}\Big[(v^{(n)})^{\mathsf T}\Gamma(\theta^{(n)})\,v^{(n)} + G(\theta^{(n)})^{-1}\nabla_\theta\phi(\theta^{(n)})\Big]$$
$$= v^{(n)} - \tfrac\varepsilon2\Big[I + \tfrac\varepsilon2(v^{(n)})^{\mathsf T}\Gamma(\theta^{(n)})\Big]^{-1} f(\theta^{(n)}, v^{(n)})$$
$$= v^{(n)} - \tfrac\varepsilon2 f(\theta^{(n)}, v^{(n)}) + \tfrac{\varepsilon^2}4\Big[I + \tfrac\varepsilon2(v^{(n)})^{\mathsf T}\Gamma(\theta^{(n)})\Big]^{-1}\big[(v^{(n)})^{\mathsf T}\Gamma(\theta^{(n)})\big]f(\theta^{(n)}, v^{(n)})$$
$$= v^{(n)} - \tfrac\varepsilon2 f(\theta^{(n)}, v^{(n)}) + O(\varepsilon^2).$$

Similarly, from equation (4.23) we have

$$v^{(n+1)} = v^{(n+1/2)} - \tfrac\varepsilon2 f(\theta^{(n+1)}, v^{(n+1/2)}) + O(\varepsilon^2).$$

Substituting $v^{(n+1/2)}$ into the above equation, we obtain

$$v^{(n+1)} = v^{(n)} - \tfrac\varepsilon2 f(\theta^{(n)}, v^{(n)}) - \tfrac\varepsilon2 f(\theta^{(n+1)}, v^{(n)}) + O(\varepsilon^2)$$
$$= v^{(n)} - f(\theta^{(n)}, v^{(n)})\,\varepsilon + \tfrac\varepsilon2\big[f(\theta^{(n)}, v^{(n)}) - f(\theta^{(n)} + O(\varepsilon), v^{(n)})\big] + O(\varepsilon^2)$$
$$= v^{(n)} - f(\theta^{(n)}, v^{(n)})\,\varepsilon + O(\varepsilon^2).$$

From (4.19) and the above equations, the numerical solution satisfies

$$z^{(n+1)} = \begin{bmatrix}\theta^{(n+1)}\\ v^{(n+1)}\end{bmatrix} = \begin{bmatrix}\theta^{(n)}\\ v^{(n)}\end{bmatrix} + \begin{bmatrix}v^{(n)}\\ -f(\theta^{(n)}, v^{(n)})\end{bmatrix}\varepsilon + O(\varepsilon^2).$$

Therefore the local error satisfies

$$e_{n+1} = \|z(t_{n+1}) - z^{(n+1)}\| = \left\|\begin{bmatrix}\theta(t_n) - \theta^{(n)}\\ v(t_n) - v^{(n)}\end{bmatrix} + \begin{bmatrix}v(t_n) - v^{(n)}\\ -\big[f(\theta(t_n), v(t_n)) - f(\theta^{(n)}, v^{(n)})\big]\end{bmatrix}\varepsilon + O(\varepsilon^2)\right\| \le (1 + M\varepsilon)\,e_n + O(\varepsilon^2),$$

where $M = c\,\sup_{t\in[0,T]}\|\nabla f(\theta(t), v(t))\|$ for some constant $c > 0$. Accumulating the local errors by iterating this inequality for $L = T/\varepsilon$ steps gives the global error:

$$e_{n+1} \le (1+M\varepsilon)\,e_n + O(\varepsilon^2) \le (1+M\varepsilon)^2 e_{n-1} + 2\,O(\varepsilon^2) \le \cdots \le (1+M\varepsilon)^n e_1 + n\,O(\varepsilon^2) \le (1+M\varepsilon)^L\varepsilon + L\,O(\varepsilon^2) \le (e^{MT} + T)\,\varepsilon \to 0 \quad \text{as } \varepsilon \to 0. \qquad ∎$$
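As a quick numerical sanity check of this convergence result, the Python sketch below integrates a one-dimensional toy Lagrangian system with the explicit scheme as transcribed in the proof above. The metric $g(\theta) = 1+\theta^2$ and potential $\phi(\theta) = \theta^2/2$ are illustrative assumptions, and the reference endpoint is the same scheme run with a much finer step; the printed errors should vanish as ε → 0, consistent with the O(ε) bound.

import numpy as np

# 1-D toy instance: g(theta) = 1 + theta^2, phi(theta) = theta^2 / 2,
# so Gamma(theta) = theta/(1+theta^2) and f = Gamma v^2 + phi'/g.
gamma = lambda th: th / (1.0 + th**2)
f = lambda th, v: gamma(th) * v**2 + th / (1.0 + th**2)

def step(th, v, eps):
    # half step for v, full step for theta, half step for v
    v = (v - 0.5 * eps * th / (1.0 + th**2)) / (1.0 + 0.5 * eps * v * gamma(th))
    th = th + eps * v
    v = (v - 0.5 * eps * th / (1.0 + th**2)) / (1.0 + 0.5 * eps * v * gamma(th))
    return th, v

def integrate(eps, T=1.0, th=1.0, v=0.5):
    for _ in range(int(round(T / eps))):
        th, v = step(th, v, eps)
    return th, v

ref = np.array(integrate(1e-5))                 # fine-step reference endpoint
for eps in [0.2, 0.1, 0.05, 0.025]:
    err = np.linalg.norm(np.array(integrate(eps)) - ref)
    print(eps, err)                             # error shrinks as eps -> 0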
B Solutions to split Lagrangian dynamics on Sphere

Proof of Proposition 6.2. To solve the first dynamics in (6.14), note that $\theta_{D+1} = \sqrt{1 - \|\theta\|_2^2}$ implies

$$\dot\theta_{D+1} = \frac{d}{dt}\sqrt{1 - \|\theta\|_2^2} = -\frac{\theta^{\mathsf T}\dot\theta}{\theta_{D+1}} = 0,$$

since $\dot\theta = 0$ in this dynamics, and

$$\dot v_{D+1} = -\frac{d}{dt}\frac{\theta^{\mathsf T}v}{\theta_{D+1}} = -\frac{\dot\theta^{\mathsf T}v + \theta^{\mathsf T}\dot v}{\theta_{D+1}} + \frac{\theta^{\mathsf T}v}{\theta_{D+1}^2}\,\dot\theta_{D+1} = \frac12\,\frac{\theta^{\mathsf T}}{\theta_{D+1}}\, G_{\mathbf S}(\theta)^{-1}\nabla_\theta U(\theta).$$

Therefore we have

$$\tilde\theta(t) = \tilde\theta(0), \qquad \tilde v(t) = \tilde v(0) - \frac t2 \begin{bmatrix} I \\ -\theta(0)^{\mathsf T}/\theta_{D+1}(0)\end{bmatrix}\big[I - \theta(0)\theta(0)^{\mathsf T}\big]\nabla_\theta U(\theta),$$

where

$$\begin{bmatrix} I \\ -\theta(0)^{\mathsf T}/\theta_{D+1}(0)\end{bmatrix}\big[I - \theta(0)\theta(0)^{\mathsf T}\big] = \begin{bmatrix} I - \theta(0)\theta(0)^{\mathsf T} \\ -\theta_{D+1}(0)\,\theta(0)^{\mathsf T}\end{bmatrix} = \begin{bmatrix} I \\ \mathbf 0^{\mathsf T}\end{bmatrix} - \tilde\theta(0)\theta(0)^{\mathsf T}.$$

Note that this dynamics only involves updating the velocity $\tilde v$ in the tangent space $T_{\tilde\theta}\mathbf S^D$.

The second dynamics in (6.14) only involves the kinetic energy; hence it is equivalent to the geodesic flow on the sphere $\mathbf S^D$, whose analytical solution is a great circle (orthodrome or Riemannian circle). To solve it, we first calculate the Christoffel symbols $\Gamma(\theta)$. Note that the $(i,j)$-th element of $G_{\mathbf S}$ is $g_{ij} = \delta_{ij} + \theta_i\theta_j/\theta_{D+1}^2$, and the $(i,j,k)$-th element of $dG_{\mathbf S}$ is $g_{ij,k} = (\delta_{ik}\theta_j + \theta_i\delta_{jk})/\theta_{D+1}^2 + 2\theta_i\theta_j\theta_k/\theta_{D+1}^4$. Therefore

$$\Gamma^k_{ij} = \tfrac12 g^{kl}\big[g_{lj,i} + g_{il,j} - g_{ij,l}\big]$$
$$= \tfrac12\big(\delta^{kl} - \theta^k\theta^l\big)\Big[(\delta_{li}\theta_j + \theta_l\delta_{ji})/\theta_{D+1}^2 + (\delta_{ij}\theta_l + \theta_i\delta_{lj})/\theta_{D+1}^2 - (\delta_{il}\theta_j + \theta_i\delta_{jl})/\theta_{D+1}^2 + 2\theta_i\theta_j\theta_l/\theta_{D+1}^4\Big]$$
$$= \big(\delta^{kl} - \theta^k\theta^l\big)\,\theta_l/\theta_{D+1}^2\,\big[\delta_{ij} + \theta_i\theta_j/\theta_{D+1}^2\big] = \theta^k\big[\delta_{ij} + \theta_i\theta_j/\theta_{D+1}^2\big] = \big[G_{\mathbf S}(\theta)\otimes\theta\big]_{ijk}.$$

Using these results, the second equation, evolving $v$, can be written as $\dot v = -v^{\mathsf T}G_{\mathbf S}(\theta)v\,\theta = -\|\tilde v\|_2^2\,\theta$. Further, we have

$$\dot\theta_{D+1} = -\frac{\theta^{\mathsf T}\dot\theta}{\theta_{D+1}} = v_{D+1}, \qquad \dot v_{D+1} = -\frac{\dot\theta^{\mathsf T}v + \theta^{\mathsf T}\dot v}{\theta_{D+1}} + \frac{\theta^{\mathsf T}v}{\theta_{D+1}^2}\,\dot\theta_{D+1} = -\|\tilde v\|_2^2\,\theta_{D+1}.$$

Therefore, we can rewrite the geodesic equations (the second dynamics in (6.14)) as

$$\dot{\tilde\theta} = \tilde v, \tag{B.6}$$
$$\dot{\tilde v} = -\|\tilde v\|_2^2\,\tilde\theta. \tag{B.7}$$

Multiplying both sides of equation (B.7) by $\tilde v^{\mathsf T}$ gives $\frac{d}{dt}\|\tilde v\|_2^2 = 0$, and the rest is straightforward. ∎

The great-circle solution implied by (B.6)–(B.7) is easy to implement; a short sketch follows.
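Since the speed $\|\tilde v\|_2$ is conserved along the flow, the solution rotates $\tilde\theta$ toward $\tilde v/\|\tilde v\|_2$ along a great circle. The following Python sketch implements this closed form (the test point and velocity are arbitrary illustrations).

import numpy as np

def geodesic_flow(theta, v, t):
    """Exact great-circle solution of (B.6)-(B.7) on the unit sphere.

    theta: point with ||theta|| = 1; v: tangent velocity with theta @ v = 0."""
    speed = np.linalg.norm(v)          # constant along the flow: d/dt ||v||^2 = 0
    if speed == 0.0:
        return theta, v
    c, s = np.cos(speed * t), np.sin(speed * t)
    theta_t = theta * c + (v / speed) * s
    v_t = -theta * speed * s + v * c
    return theta_t, v_t

# quick check on S^2: stays on the sphere, stays tangent, speed is preserved
rng = np.random.default_rng(1)
theta = rng.normal(size=3); theta /= np.linalg.norm(theta)
v = rng.normal(size=3); v -= (v @ theta) * theta   # project into the tangent space
th1, v1 = geodesic_flow(theta, v, 0.7)
print(np.linalg.norm(th1), th1 @ v1, np.linalg.norm(v1) - np.linalg.norm(v))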