MCMC Using Parallel Computation
Andrew Beam
ST 790 – Advanced Bayesian Inference
02/21/2013

Outline
Talk focus: a grab bag of ways to speed up R, plus methods for using alternative platforms.
• Serial vs. Parallel Computation
• Chain-level parallelization in MCMC
  – R code and example
• Kernel speed-up and parallelization for larger data sets
  – Calling C/Java from R
  – GPU example
• MCMC in pure Java
• Conclusions

MCMC Methods
• MCMC methods are great!
  – They are general: they work for every family of distributions.
  – They are exact (in the limit), unlike approximate methods, e.g. variational methods.
  – Flexible.
• They do have some drawbacks…
  – They can be painfully slow for complex problems.
  – Complex problems are exactly when we would like to use MCMC.

[Cartoon: a grad student waits while the simulation runs ("Simulating!"); by the time it finishes, the advisor reports that the funding has been revoked.]

Serial vs. Parallel Computation
• Computers have used roughly the same architecture for nearly 70 years.
  – Proposed by John von Neumann in 1945.
• Programs are stored on disk and loaded into system memory (RAM) by a control unit.
• Instructions are executed one at a time by an arithmetic logic unit (ALU).
• A Central Processing Unit (CPU) consists of one ALU, control units, communication lines, and devices for I/O.
  – The ALU is what does instruction execution.

Moore's law
• "The number of transistors on integrated circuits doubles approximately every two years." – Intel co-founder Gordon E. Moore
• Around 2002, chip manufacturers began to shift from increasing the speed of single processors to providing multiple computational units on each processor.
• Consistent with Moore's law, but it changes the way programmers (i.e. us) have to use them.
• The CPU still consists of the same components, but now has multiple ALUs capable of executing instructions at the same time.

Multicore architectures
• Each central processing unit (CPU) now consists of multiple ALUs.
• Can execute multiple instructions simultaneously.
• Communication between cores happens with shared memory.
• Programs can create multiple execution "threads" to take advantage of this.
• Speed-up requires explicit parallel programming.
• Running a program designed for a single-core processor on a multicore processor will not make it faster.

Program execution hierarchy
• The atomic unit of execution is a thread.
  – A serial set of instructions, e.g. a = 2+2; b = 3*4; c = (b/a)^2
  – Threads communicate with each other using a variety of system-dependent protocols.
• A collection of threads is known as a process.
  – Usually represents a common computational task.
• A collection of processes is known as an application.
• Example: the Google Chrome browser.
  – The entire browser is an application.
  – Each tab is a process controlling multiple threads.
  – A thread handles a different element of each page (i.e. music, chat, etc.).

Decomposing the MCMC problem
• Given that our computers are capable of doing multiple things at once, we should use all the resources available to us.
• If a problem has independent parts, we should try to execute them in parallel. This is often called "embarrassingly" parallel.
• Let's go through the general Metropolis-Hastings algorithm and look for parts we can speed up using parallel computation.

General MH MCMC Algorithm
1. For each chain in 1 to number of chains (Nc)
2.   For i in 1 to number of simulation iterations (Ns)
3.     Draw θnew from the proposal density
4.     Evaluate the posterior kernel for this point, K(θnew)
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold))
6.     Accept θnew with probability ρ(K(θnew), K(θold))
7.   Repeat Ns times
8. Repeat Nc times
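For reference, here is a minimal R sketch of the inner, single-chain loop above (steps 2-7); the function name rw_mh, the user-supplied log_kernel(), and the proposal standard deviation are illustrative choices, not code from the deck.

# Minimal single-chain random-walk Metropolis-Hastings sketch (log scale).
# log_kernel() is a hypothetical user-supplied function returning the
# log posterior kernel at a parameter vector theta.
rw_mh <- function(log_kernel, theta0, n_iter, prop_sd = 0.1) {
  p       <- length(theta0)
  samples <- matrix(NA_real_, nrow = n_iter, ncol = p)  # pre-allocate
  theta   <- theta0
  k_old   <- log_kernel(theta)
  for (i in seq_len(n_iter)) {
    theta_new <- theta + rnorm(p, 0, prop_sd)      # draw from the proposal
    k_new     <- log_kernel(theta_new)             # evaluate the kernel
    if (log(runif(1)) < (k_new - k_old)) {         # accept with probability rho
      theta <- theta_new
      k_old <- k_new
    }
    samples[i, ] <- theta
  }
  samples
}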
How can we make this faster?
(The algorithm above, repeated in the deck, with step 1 – the loop over chains – highlighted.)
• These are Nc independent operations – we can run them concurrently on a multicore machine.

Speeding up multi-chain simulations
[Animated diagrams: three chains scheduled on 4 available cores]
• Without parallel computing, chains must be completed one at a time: the first chain finishes and the second one starts, then the second chain finishes and the third one starts.
• If one chain takes T seconds, the entire serial simulation takes 3*T seconds.
• Now consider a parallel version: each chain is assigned to its own core.
• The entire parallel simulation takes T + ε, where ε is the thread communication penalty and is usually << T.

Multiple Chains vs. One Long Chain
• It has been claimed that a chain of length N*L is preferable to N chains of length L, due to the better mixing properties of the longer chain.
  – Robert and Casella agree (text, pg. 464).
• However, if you have a multicore machine, for moderately sized problems you should never use just one chain, because you can get additional samples for essentially no extra time.
  – Buy one, get two free.
• You can use multiple chains to assess convergence.
  – Gelman and Rubin diagnostic.
• More samples is always better, even if they are spread over a few chains.

Chain-level parallelization in R
• For small to moderate sized problems, it is very easy to do this in R with minor changes to existing code.
  – Small to moderate size ≈ 100-1,000 observations, 10-100 predictors.
• Revolution Analytics has developed several packages to parallelize independent for loops.
  – Windows – doSNOW
  – Linux/OS X – doMC
• Uses the foreach paradigm to split the work across several cores and then collect the results.
• Use the special %dopar% operator to indicate parallel tasks.
• This is more of a "hack" than a fully concurrent library.
• This is a good approach if your simulation currently takes a reasonable amount of time and doesn't use too much memory.

doMC/doSNOW Paradigm
For N parallel tasks: create N copies of the entire R workspace, run N R sessions in parallel (Task 1, Task 2, …, Task N), then collect the results and return control to the original R session.
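The "Parallel Template (Windows)" that follows is only a code screenshot in the original deck; below is a minimal sketch of that doSNOW/foreach pattern. The choice of 4 workers, the rnorm() task body, and the cbind combine are illustrative stand-ins, not the deck's code.

library(doSNOW)
library(foreach)

# Register a parallel backend with (for example) 4 worker R sessions.
cl <- makeCluster(4)
registerDoSNOW(cl)

# foreach + %dopar% runs each iteration in its own worker and
# combines the results (here, by column-binding them).
results <- foreach(i = 1:4, .combine = cbind) %dopar% {
  rnorm(10)   # stand-in for the real per-task computation
}

stopCluster(cl)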
Parallel Template (Windows)
[Code screenshot in the original deck; see the sketch above.]
• Load the doSNOW package.
• Register the parallel backend.
• Indicate how the results should be collected.
• Indicate that this should be done in parallel (%dopar%).
• Assign the return value of foreach to a variable.

Example: Penalized Regression via MCMC
[Figure: prior density for β at τ = 1, τ = 0.5, τ = 0.1]

Simulation Description
• N = 100, p = 10, Xij ~ N(0, 1)
• True model: Y = X2 + X6, i.e. Yi = Xi2 + Xi6 + ei, where ei ~ N(0, 1)
• Full penalty -> τ = 1
• Sample using RW-MH such that βi = βi-1 + εi, where εi ~ N(0, (1/(2p))^2)
• Simulate initially for Nc = 1 chain, Ns = 100,000 iterations; keep 10,000 for inference
• MacBook Pro, quad-core i7 with hyperthreading, 8GB RAM
• Sampler code available in chain-parallel-lasso.R; demonstration code in mcmc-lasso.R
• N and p are adjustable, so you can see how many chains your system can handle for a given problem size

Simulation Results
[Figure: posterior summaries for β1 through β10]
• β2 ≈ 1 and β6 ≈ 1; all other βj ≈ 0

Simulation Results
• The sampler appears to be working – it finds the correct variables.
• Execution time: 3.02s
• RAM used: 30MB

Sampler Code
[Code screenshot in the original deck.]
• Returns an Nk × p matrix of samples.
• Structure: propose, evaluate, accept/reject.
• Avoid using rbind at all costs – pre-allocate the sample matrix.

Multiple Chain Penalized Regression
• Wrap the single-chain sampler into a function.
• Call it multiple times from within the foreach loop.
[Code screenshot in the original deck: register the parallel backend, call foreach with %dopar% (note a Windows quirk), call the single-chain sampler, and return the single-chain results to be collected.]

Multiple Chain Penalized Regression
[Code screenshot in the original deck: example call using 2 chains.]

Multiple Chain Execution Times
[Figure: multi-chain execution times, with the single-chain execution time for reference.]

doSNOW/doMC not just for MCMC
• Worth noting that this paradigm is useful for a lot of other tasks in R.
• Any task that has "embarrassingly" parallel portions: bootstrap, permutations, cross-validation, bagging, etc.
• Be aware that every parallel R session will use the same amount of memory as the parent.
  – Computation happens 5x faster, but uses 5x as much memory.
  – Application-level parallelism.

Back to the MH Algorithm
Can we speed up the inner loop?
1. For each chain in 1 to number of chains (Nc)
2.   For i in 1 to number of simulation iterations (Ns)
3.     Draw θnew from the proposal density
4.     Evaluate the posterior kernel for this point, K(θnew)
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold))
6.     Accept θnew with probability ρ(K(θnew), K(θold))
7.   Repeat Ns times
8. Repeat Nc times
• Unfortunately, this loop is inherently serial and we can't speed it up through parallelization.

What is the most expensive part of each iteration?
(The algorithm repeated, with step 4 – evaluating the posterior kernel K(θnew) – highlighted.)
• Each iteration we have to evaluate the kernel over N samples and p parameters – the most expensive step.
• As N gets bigger, we can use parallel devices to speed this up.
• Communication penalties prevent multicore CPUs from speeding this up.

Kernel Evaluations
• For the penalized regression problem we have 3 main kernel computations:
  – Evaluate Xβ – could be expensive for large N or large p.
  – Evaluate the likelihood of Y given Xβ – N evaluations of the normal pdf.
  – Evaluate the prior for β – cheap for small p, potentially expensive for large p.
• For demonstration, let's assume we are in a scenario with large N and reasonable p:
  – Xβ – not expensive.
  – Likelihood – expensive.
  – Prior – very cheap.
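To make the cost breakdown above concrete, here is a minimal R sketch of a log posterior kernel for the penalized regression example. The double-exponential (lasso-type) prior and its parameterization in terms of τ are assumptions, since the deck only shows a plot of the prior; the function and argument names are illustrative.

# Minimal sketch of a log posterior kernel: Gaussian likelihood plus a
# double-exponential (lasso-type) penalty on beta. The dependence on tau
# is an assumed parameterization, not taken from the deck's code.
log_kernel <- function(beta, X, y, tau = 1) {
  eta      <- as.vector(X %*% beta)                          # evaluate X beta
  loglik   <- sum(dnorm(y, mean = eta, sd = 1, log = TRUE))  # N normal pdf evaluations
  logprior <- -sum(abs(beta)) / tau                          # prior for beta
  loglik + logprior
}

A kernel like this would plug directly into the rw_mh() sketch given earlier, e.g. rw_mh(function(b) log_kernel(b, X, y), rep(0, ncol(X)), 1e5).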
Kernel Evaluations in Java and C
• If evaluating our kernel is too slow in R, we can drop down to another language from within R and evaluate it there.
  – This is actually what dnorm() in R does.
• R has native support for calling C functions.
• The rJava package provides an interface for calling Java from R.
  – For small problems (e.g. N ~ 100), C will be faster.
  – Reason: there is also a communication penalty, since R can't call Java natively – R has to transfer the data to Java, start execution, then retrieve the result.
• Java has a rich set of external libraries for scientific computing.
• Java is portable – write once, run anywhere.
• No pointers! Built-in memory management – no malloc.
• Unified concurrency library: you can write multithreaded programs that will run on any platform.
  – Thread-level parallelism saves memory.

Example Java code for kernel evaluation
[Code screenshot in the original deck: the function called from R evaluates the likelihood and the prior, with helper functions to evaluate the normal PDF and the prior.]

Calling Java from R
• The rJava package has R-Java bindings that allow R to use Java objects.
• The Java-based kernel for penalized regression is available in java_based_sampler.R, with Java source code in dnorm.java.
• Compiled Java code is in dnorm.class.
  – To compile the source, use the javac command: javac dnorm.java
  – Not necessary, but if you modify the source you will have to recompile.
• Performance is slightly worse than the C code used by dnorm() in R.
• Abstracting out the kernel evaluation is a good idea if you have complicated kernels not available in R, or if you have large values of N or p.

What if you have so much data that Java or C still isn't enough?

GPU Computing
• GPU – Graphics Processing Unit
• Designed to do floating point math very quickly for image processing.
• Cheap – usually a few hundred dollars.
• Highly parallel – 100s to 1000s of processors per unit.
• Very few control units – designed for floating point operations.
  – Good for doing a lot of simultaneous calculations (e.g. +, -, /, *).
• CUDA – Compute Unified Device Architecture
  – SDK and driver set that gives access to NVIDIA floating point units.
  – Provides a C-level interface and compiler.
  – Write the kernel evaluation in CUDA C.
[Figure: CPU vs. GPU architecture]

CUDA Programming Paradigm
• CUDA is based on the notions of the host (i.e. the CPU) and the device (i.e. the GPU).
• Create data objects (arrays, matrices, etc.) on the host, transfer them to the device, then retrieve the results.
• Computation is accomplished via "kernels" – small computational programs.
• GPU threads are each given a block of data, which they use to execute the kernel.
• Compile C programs using NVIDIA's nvcc compiler.
• We can use this to do the expensive normal PDF evaluation.

GPU Simulation Setup
• Do N normal PDF evaluations, for N ranging from 100 to 100,000,000.
• Compare to R's native dnorm().
• GeForce GTX 650 – 2GB RAM, 384 CUDA processors.
• Requires CUDA SDK 3.0, a compatible GPU, and CUDA drivers.
• Code available in kernel.cu.
• Compile on Linux using: nvcc kernel.cu -o kernel

Execution Times (seconds)

  N             GPU        dnorm
  100           0.000348   0.0001
  1,000         0.000524   0.0004
  10,000        0.000733   0.001
  100,000       0.00124    0.007
  1,000,000     0.006677   0.097
  5,000,000     0.0235     0.385
  10,000,000    0.053936   0.772
  50,000,000    0.22657    3.75
  100,000,000   0.542329   7.567

Speed up
• With a sample size of 5 million, if you ran a simulation for 100,000 iterations it would take 10.7 hours using dnorm(), but just over half an hour on a GPU.
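A quick back-of-the-envelope check of the speed-up claim, using the N = 5,000,000 row of the table above:

iters   <- 1e5      # simulation iterations
t_dnorm <- 0.385    # seconds per kernel evaluation with dnorm() at N = 5e6
t_gpu   <- 0.0235   # seconds per kernel evaluation on the GPU at N = 5e6
iters * t_dnorm / 3600   # ~10.7 hours with dnorm()
iters * t_gpu / 60       # ~39 minutes on the GPU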
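Returning to the "Calling Java from R" slide above: the R side of that call appears only as a screenshot, so the following is a hypothetical sketch of what it might look like with rJava. The class name dnorm comes from the deck, but the method name logLik and its signature (a double[] of residuals in, a double out) are assumptions about the unseen Java source.

library(rJava)

.jinit()              # start the JVM
.jaddClassPath(".")   # directory containing the compiled dnorm.class

# Hypothetical static method: double logLik(double[] resid).
# The actual method names in dnorm.java are not shown in the deck.
resid  <- rnorm(1000)
loglik <- .jcall("dnorm", "D", "logLik", resid)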
GPUs and MCMC – Not just about speed

Using parallel computation to improve Independent Metropolis-Hastings based estimation
Pierre Jacob, Christian P. Robert, Murray H. Smith

"In this paper, we consider the implications of the fact that parallel raw-power can be exploited by a generic Metropolis-Hastings algorithm if the proposed values are independent. In particular, we present improvements to the independent Metropolis-Hastings algorithm that significantly decrease the variance of any estimator derived from the MCMC output, for a null computing cost since those improvements are based on a fixed number of target density evaluations. Furthermore, the techniques developed in this paper do not jeopardize the Markovian convergence properties of the algorithm, since they are based on the Rao-Blackwell principles of Gelfand and Smith (1990), already exploited in Casella and Robert (1996), Atchade and Perron (2005) and Douc and Robert (2010). We illustrate those improvements both on a toy normal example and on a classical probit regression model, but stress the fact that they are applicable in any case where the independent Metropolis-Hastings is applicable."

http://arxiv.org/abs/1010.1595

Non-R based solutions
• If you want your simulation to just "go faster", you will probably need to abandon R altogether.
• Implement everything in another language like C/C++ or Java and it will (most likely) be orders of magnitude faster.
  – Compiled code is in general much faster.
  – You can stop having wacky R memory issues.
  – Control the level of parallelization (thread vs. process vs. application).
  – Interface more easily with devices like GPUs.
• Pain now will be rewarded later.

BAJA – Bayesian Analysis in JAva
• ST 790 project.
• Pure Java, general MH-MCMC sampler.
  – Adaptive sampling options available.
• Model specification via a directed acyclic graph (DAG).
• Parallel chain execution – multithreaded.
• Stochastic thinning via resolvant.
• Memory efficient (relatively).
• Automatic generation of trace plots, histograms, and ACF plots.
• Export samples to .csv or R.

BAJA Parallel Performance
[Figure: time (secs, 0-35) vs. number of chains (1-6), comparing N independent chains of length L against a single chain of length N*L, with L = 1,000,000.]

Speed comparison with several platforms
[Figure in the original deck.]

Thank you!