MCMC using Parallel Computation

Andrew Beam
ST 790 – Advanced Bayesian Inference
02/21/2013
Outline
Talk focus: a grab bag of ways to speed up MCMC in R, plus methods for using
alternative platforms
• Serial vs. Parallel Computation
• Chain level parallelization in MCMC
• R code and example
• Kernel speed-up and parallelization for larger data sets
• Calling C/Java from R
• GPU example
• MCMC in pure java
• Conclusions
MCMC Methods
• MCMC methods are great!
• They are general.
• They work for essentially any family of distributions.
• They are exact (in the limit).
• Unlike approximate methods, e.g. variational methods
• Flexible.
• They do have some drawbacks…
• Can be painfully slow for complex problems
• Complex problems are exactly when we would like to use MCMC
[Cartoon: the grad student is still "Simulating!" while Dr. Advisor announces "Your funding has been revoked."]
Serial vs. Parallel Computation
• Computers have used roughly the same architecture for nearly 70
years
– Proposed by John von Neumann in 1945
• Programs are stored on disk, loaded into system memory (RAM) by a
control unit
• Instructions are executed one at a time by an arithmetic logic unit
(ALU).
• Central Processing Unit (CPU) consists
of one ALU, control units,
communication lines, and devices for
I/O.
– ALU is what does instruction execution.
Moore's law
• "The number of transistors on integrated circuits doubles approximately every two years."
  – Intel co-founder Gordon E. Moore
• Around 2002, chip manufacturers began to shift from increasing the speed of
single processors to providing multiple computational units on each processor.
• This is consistent with Moore's law, but it changes the way programmers (i.e. us) have
to use them.
• The CPU still consists of the same components, but now has multiple ALUs
capable of executing instructions at the same time.
Multicore architectures
• Each central processing unit (CPU) now consists of multiple ALUs.
• Can execute multiple instructions simultaneously.
• Communication between cores happens via shared memory.
• Programs can create multiple execution “threads” to take advantage of
this.
• Speed-up requires explicit parallel programming.
• Running a program designed for a single core processor on a multicore
processor will not make it faster.
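As a quick aside (not from the original slides), base R's parallel package will report how many cores the current machine exposes:

## How many cores does R see on this machine? (parallel ships with base R)
library(parallel)
detectCores()                   # logical cores, e.g. 8 on a quad-core i7 with hyperthreading
detectCores(logical = FALSE)    # physical cores only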
Program execution hierarchy
• Atomic unit of execution is a thread
– A serial set of instructions
– E.g. a = 2+2; b = 3*4; c = (b/a)^2
– Threads communicate with each other using a variety of system
dependent protocols
• A collection of threads is known as a process.
• A process usually represents a common computational task.
• A collection of processes is known as an application.
• Example: the Google Chrome browser:
• The entire browser is an application.
• Each tab is a process controlling multiple threads.
• Individual threads handle different elements of each page (e.g. music, chat,
etc.)
Decomposing the MCMC problem
• Given that our computers are capable of doing multiple things at once, we
should use all the resources available to us.
• If a problem has independent parts, we should try to execute them in
parallel. This is often called "embarrassingly" parallel.
• Let's go through the general Metropolis-Hastings algorithm and look
for parts we can speed up using parallel computation.
General MH MCMC Algorithm
1. For each chain in 1 to number of chains (Nc)
2.   For i in 1 to number of sim iterations (Ns)
3.     Draw θnew from the proposal density
4.     Evaluate the posterior kernel for this point, K(θnew)
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold))
6.     Accept θnew with probability ρ(K(θnew), K(θold))
7.   Repeat Ns times
8. Repeat Nc times
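For reference, a minimal R sketch of steps 2-6 for a single chain, assuming a symmetric random-walk proposal so that the acceptance ratio reduces to K(θnew)/K(θold); the function and argument names are illustrative, not the talk's code:

## Minimal single-chain random-walk MH sketch (illustrative, not the talk's code).
## log_kernel: function returning log K(theta); prop_sd: proposal standard deviation.
mh_chain <- function(log_kernel, theta_init, n_iter, prop_sd = 0.1) {
  p <- length(theta_init)
  draws <- matrix(NA_real_, nrow = n_iter, ncol = p)   # preallocated sample matrix
  theta <- theta_init
  log_k_old <- log_kernel(theta)
  for (i in seq_len(n_iter)) {
    theta_new <- theta + rnorm(p, sd = prop_sd)        # step 3: draw from the proposal
    log_k_new <- log_kernel(theta_new)                 # step 4: evaluate the kernel
    rho <- min(1, exp(log_k_new - log_k_old))          # step 5: acceptance ratio
    if (runif(1) < rho) {                              # step 6: accept with probability rho
      theta <- theta_new
      log_k_old <- log_k_new
    }
    draws[i, ] <- theta
  }
  draws
}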
How can we make this faster?
1. For each chain in 1 to number of chains (Nc)
2.   For i in 1 to number of sim iterations (Ns)
3.     Draw θnew from the proposal density
4.     Evaluate the posterior kernel for this point, K(θnew)
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold))
6.     Accept θnew with probability ρ(K(θnew), K(θold))
7.   Repeat Ns times
8. Repeat Nc times
The outer loop over chains consists of Nc independent operations – we can run them
concurrently on a multicore machine.
Speeding up multi-chain simulations
• Without parallel computing, chains must be completed one at a time.
[Diagram: three chains queued on a machine with 4 available cores; no chains completed yet.]
Speeding up multi-chain simulations
• The first chain finishes and the second one starts…
Speeding up multi-chain simulations
• The second chain is done and the third one starts…
Speeding up multi-chain simulations
• Now all three are done.
• If one chain takes T seconds, the entire simulation takes 3*T seconds.
Speeding up multi-chain simulations
• Now consider a parallel version.
• Each chain is assigned to its own core.
Speeding up multi-chain simulations
• The entire simulation takes T + ε, where ε is the thread communication penalty
and is usually << T.
Multiple Chains vs. One Long Chain
• It has been claimed that a single chain of length N*L is preferable to N
chains of length L, due to the better mixing properties of the longer chain.
• Robert and Casella agree (text, pg. 464).
• However, if you have a multicore machine you should NEVER just use one chain
for moderately sized problems, because you can get additional samples for
essentially no extra time.
• Buy one, get two free.
• You can use multiple chains to assess convergence, e.g. with the
Gelman and Rubin diagnostic (a sketch follows below).
• More samples are always better, even if they are spread over a few
chains.
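A minimal sketch of that convergence check using the coda package; the three "chains" below are placeholder matrices of random draws standing in for real MCMC output:

## Gelman-Rubin diagnostic across multiple chains with the coda package.
## The chains here are placeholders (standard-normal draws), not real samples.
library(coda)
fake_chains <- lapply(1:3, function(i) mcmc(matrix(rnorm(1000 * 2), ncol = 2)))
chains <- do.call(mcmc.list, fake_chains)
gelman.diag(chains)   # potential scale reduction factors near 1 suggest convergence
gelman.plot(chains)   # shows how the diagnostic evolves with iteration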
Chain-level parallelization in R
• For small to moderate sized problems, it's very easy to do
this in R with minor changes to existing code.
• Small to moderate size = 100-1000 observations, 10-100 predictors.
• Revolution Analytics has developed several packages to
parallelize independent for loops.
• Windows – doSNOW
• Linux/OS X – doMC
• Uses the foreach paradigm to split up the work across several
cores and then collect the results.
• Use the special %dopar% operator to indicate parallel tasks.
• This is more of a "hack" than a fully concurrent library.
• This is a good idea if your simulation currently takes a
reasonable amount of time and doesn't use too much
memory.
doMC/doSNOW Paradigm
• For N parallel tasks, create N copies of the entire R workspace.
• Run N R sessions in parallel, one per task (Task 1, Task 2, …, Task N).
• Collect the results and return control to the original R session.
Parallel Template (Windows):
• Load the doSNOW package.
• Register the parallel backend.
• Indicate how the results should be collected.
• Indicate (with %dopar%) that the work should be done in parallel.
• Assign the return value of foreach to a variable.
(A sketch of this template follows below.)
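Since the template screenshot is not reproduced here, the following is a minimal sketch of what it describes; the loop body is a stand-in for one task's worth of real work:

## Parallel template for Windows with doSNOW (a sketch, not the original code).
library(doSNOW)                                  # load doSNOW package (also attaches foreach and snow)

cl <- makeCluster(4)                             # create a cluster and
registerDoSNOW(cl)                               #   register the parallel backend

results <- foreach(i = 1:4,                      # assign foreach's return value to a variable
                   .combine = rbind) %dopar% {   # .combine: how the results are collected;
  c(task = i, value = mean(rnorm(1e6)))          #   %dopar% marks the block as parallel
}

stopCluster(cl)
results                                          # a 4 x 2 matrix, one row per task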
Example: Penalized Regression via MCMC
[Plot: the prior density for β at τ = 1, τ = 0.5, and τ = 0.1.]
Simulation Description
• N = 100, p = 10
• Xij ~ N(0,1)
• True model: Y = X2 + X6
• Yi = Xi2 + Xi6 + ei, where ei ~ N(0,1)
• Full penalty -> τ = 1
• Sample using RW-MH with proposal β(i) = β(i-1) + ε(i), where ε(i) ~ N(0, (1/(2p))^2)
• Simulate initially for Nc = 1 chain, Ns = 100,000 iterations; keep 10,000
for inference
• MacBook Pro, quad-core i7 with hyperthreading, 8 GB RAM.
• Sampler code available in chain-parallel-lasso.R; demonstration code
in mcmc-lasso.R.
• N and p are adjustable so you can see how many chains your system
can handle for a given problem size.
Simulation Results
[Trace plots for the 10 coefficients: β2 ≈ 1 and β6 ≈ 1; β1, β3, β4, β5, β7, β8, β9, β10 ≈ 0.]
Simulation Results
• Sampler appears to be working – it finds the correct variables.
• Execution time: 3.02 s
• RAM used: 30 MB
Sampler Code
• Returns an Nk x p matrix of samples.
• Structure: propose, evaluate, accept/reject.
• Avoid using rbind at all costs – preallocate the sample matrix instead.
(A sketch of the sampler follows below; the original code is in chain-parallel-lasso.R.)
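A minimal sketch of the single-chain sampler described above, reusing the generic mh_chain() sketch from earlier. It assumes a Laplace (lasso-type) prior on β and error variance fixed at 1, consistent with the simulation description, but it is not the original chain-parallel-lasso.R code:

## Single-chain sampler for the penalized regression example (illustrative).
## Returns the last n_keep draws as an n_keep x p matrix; mh_chain() preallocates,
## so no rbind is needed.
lasso_chain <- function(X, y, tau = 1, n_iter = 1e5, n_keep = 1e4) {
  p <- ncol(X)
  log_kernel <- function(beta) {
    resid <- y - X %*% beta
    -0.5 * sum(resid^2) - tau * sum(abs(beta))    # Gaussian likelihood + Laplace prior
  }
  draws <- mh_chain(log_kernel, theta_init = rep(0, p),
                    n_iter = n_iter, prop_sd = 1 / (2 * p))
  draws[(n_iter - n_keep + 1):n_iter, , drop = FALSE]
}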
Multiple Chain Penalized Regression
• Wrap the single-chain sampler into a function.
• Call it multiple times from within the foreach loop.
• The code: registers the parallel backend, calls foreach with %dopar% (handling a
Windows quirk), calls the single-chain sampler inside the loop, and returns each
chain's results to be collected.
(A sketch follows below.)
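A minimal sketch of that wrapper (the original lives in mcmc-lasso.R), assuming doSNOW on Windows and the lasso_chain()/mh_chain() sketches above:

## Multi-chain wrapper: one parallel worker per chain (illustrative sketch).
library(doSNOW)
library(foreach)

run_chains <- function(n_chains, X, y, tau = 1, n_iter = 1e5, n_keep = 1e4) {
  cl <- makeCluster(n_chains)                   # register the parallel backend
  registerDoSNOW(cl)
  on.exit(stopCluster(cl))
  ## Worker sessions are fresh R processes and do not see the parent workspace,
  ## so the sampler functions must be exported explicitly (this kind of issue
  ## may be the "Windows quirk" the slide refers to).
  clusterExport(cl, c("lasso_chain", "mh_chain"))
  foreach(chain = seq_len(n_chains)) %dopar% {
    lasso_chain(X, y, tau, n_iter, n_keep)      # run one chain per worker
  }                                             # results are collected as a list
}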
Multiple Chain Penalized Regression
Example call using 2 chains:
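The call shown in the original screenshot is not reproduced; with the run_chains() sketch above, a hypothetical equivalent is:

chains <- run_chains(2, X, y)   # a list of two n_keep x p sample matrices
str(chains)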
Multiple Chain Execution Times
[Plot: execution time versus number of chains, with the single-chain execution time marked for reference.]
doSNOW/doMC not just for MCMC
• Worth noting that this paradigm is useful for a lot of other tasks in R.
• Any task that has "embarrassingly" parallel portions (a bootstrap sketch follows
after this list):
• Bootstrap
• Permutations
• Cross-validation
• Bagging
• etc.
• Be aware that every parallel R session will use the same amount of memory
as the parent.
• Computation happens 5x faster, but uses 5x as much memory.
• Application-level parallelism.
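As one non-MCMC example (a sketch, not from the slides), a bootstrap of a regression slope parallelizes in exactly the same way:

## Parallel bootstrap of a regression slope with foreach/doSNOW (illustrative).
library(doSNOW)
library(foreach)

set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

cl <- makeCluster(4)
registerDoSNOW(cl)
boot_slopes <- foreach(b = 1:1000, .combine = c) %dopar% {
  idx <- sample(length(y), replace = TRUE)      # resample rows with replacement
  coef(lm(y[idx] ~ x[idx]))[2]                  # refit and keep the slope
}
stopCluster(cl)

quantile(boot_slopes, c(0.025, 0.975))          # simple percentile bootstrap interval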
Back to the MH Algorithm
Can we speed up the inner loop?
1. For each chain in 1 to number of chains (Nc)
2.   For i in 1 to number of sim iterations (Ns)
3.     Draw θnew from the proposal density
4.     Evaluate the posterior kernel for this point, K(θnew)
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold))
6.     Accept θnew with probability ρ(K(θnew), K(θold))
7.   Repeat Ns times
8. Repeat Nc times
Unfortunately, the inner loop over iterations is inherently serial, and we can't
speed it up through parallelization.
What is the most expensive part of each
iteration?
1. For each chain in 1 to number of chains (Nc)
2.   For i in 1 to number of sim iterations (Ns)
3.     Draw θnew from the proposal density
4.     Evaluate the posterior kernel for this point, K(θnew)
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold))
6.     Accept θnew with probability ρ(K(θnew), K(θold))
7.   Repeat Ns times
8. Repeat Nc times
• At each iteration we have to evaluate the kernel for N samples and p
parameters – this is the most expensive step.
• As N gets bigger, we can use parallel devices to speed this up.
• Communication penalties preclude multicore CPUs from speeding this up.
Kernel Evaluations
• For the penalized regression problem we have 3 main kernel
computations:
• Evaluate Xβ – Could be expensive for large N or large p
• Evaluate the likelihood for Xβ, Y – N evaluations of the normal
pdf.
• Evaluate the prior for β – cheap for small p, potentially
expensive for large p.
• For demonstration, let's assume we are in a scenario with large N
and reasonable p (see the sketch after this list):
• Xβ – not expensive
• Likelihood – expensive
• Prior – very cheap
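To make the cost profile concrete, here are the three pieces written out in R. This assumes a Laplace (lasso-type) prior on β and error variance fixed at 1, as in the simulation description; the names are illustrative:

## The three kernel pieces for the penalized regression example (illustrative).
log_kernel_pieces <- function(beta, X, y, tau) {
  xb       <- X %*% beta                                    # X*beta: O(N*p), cheap for moderate p
  loglik   <- sum(dnorm(y, mean = xb, sd = 1, log = TRUE))  # N normal-pdf evaluations: dominant for large N
  logprior <- -tau * sum(abs(beta))                         # prior on beta: O(p), very cheap
  loglik + logprior
}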
Kernel Evaluations in Java and C
• If evaluating our kernel is too slow in R, we can drop down to another language from
within R and evaluate it there.
• This is actually what dnorm() in R does.
• R has native support for calling C functions.
• The rJava package provides an interface for calling Java from R.
• For small problems (e.g. N ~ 100) C will be faster.
• Reason – there is also a communication penalty, since R can't call Java natively:
R has to transfer data to Java, then start execution, then retrieve the results.
• Java has a rich set of external libraries for scientific computing.
• Java is portable – write once, run anywhere.
• No pointers!
• Built-in memory management – no malloc.
• Unified concurrency library: you can write multithreaded programs that will run on any
platform.
• Thread-level parallelism saves memory.
Example Java code for kernel evaluation
[Java code screenshot: the function called from R evaluates the likelihood and the prior; helper functions evaluate the normal PDF and the prior.]
Calling Java from R
• The rJava package has R-Java bindings that allow R to use Java objects (see the
calling sketch after this list).
• The Java-based kernel for penalized regression is available in
java_based_sampler.R, and the Java source code is in dnorm.java.
• Compiled Java code is in dnorm.class.
• To compile the source, use the javac command: javac dnorm.java
• Not necessary, but if you modify the source you will have to recompile.
• Performance is slightly worse than the C code used by dnorm() in R.
• Abstracting out the kernel evaluation is a good idea if you have complicated
kernels not available in R, or you have large values of N or p.
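A minimal sketch of the R side of the call via rJava; the class name "dnorm" matches dnorm.class above, but the method name, its signature, and the toy data are assumptions for illustration only:

## Calling a compiled Java kernel from R with rJava (hypothetical method name).
library(rJava)
.jinit()                       # start the JVM inside the R session
.jaddClassPath(".")            # directory containing dnorm.class

y_toy    <- rnorm(100)         # toy stand-ins for y, X*beta, and beta
xb_toy   <- rnorm(100)
beta_toy <- rnorm(10)

## Hypothetical static method: double logKernel(double[] y, double[] xb, double[] beta, double tau)
log_k <- .jcall("dnorm", "D", "logKernel",
                .jarray(y_toy), .jarray(xb_toy), .jarray(beta_toy), 1.0)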
What if you have so much data that Java or C still isn't enough?
GPU Computing
• GPU – Graphics Processing Unit.
• Designed to do floating point math very quickly for image processing.
• Cheap – usually a few hundred dollars.
• Highly parallel – 100s to 1000s of processors per unit.
• Very few control units – designed for floating point operations.
• Good for doing a lot of simultaneous calculations (e.g. + - / *).
• CUDA – Compute Unified Device Architecture.
• An SDK and driver set that gives access to NVIDIA floating point units.
• Provides a C-level interface and compiler.
• Write the kernel evaluation in CUDA C.
[Diagram: CPU vs. GPU architecture.]
CUDA Programming Paradigm
• CUDA is based on the notions of a host (i.e. the CPU) and a device (i.e. the
GPU).
• Create data objects (arrays, matrices, etc) on the host, transfer them to
the device, and then retrieve the results.
• Computation is accomplished via “kernels” – small computational
programs.
• GPU threads are given a block of data, which they use to execute the
kernel.
• Compile C programs using NVIDIA’s nvcc compiler
• We can use this to do the expensive normal PDF evaluation.
GPU Simulation Setup
• Do N normal PDF evaluations for values ranging from 100 to
100,000,000.
• Compare to R’s native dnorm()
• GEFORCE GTX 650 – 2GB RAM, 384 CUDA processors
• Requires CUDA SDK 3.0, compatible GPU and cuda drivers.
• Code available in kernel.cu.
• Compile on Linux using nvcc kernel.cu –o kernel
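The dnorm() half of the comparison can be reproduced directly in R (the GPU half requires kernel.cu, the CUDA toolkit, and a compatible NVIDIA card):

## Time R's native dnorm() for one of the problem sizes in the table below
x <- rnorm(1e7)
system.time(dnorm(x))   # compare the elapsed time with the GPU column for N = 1e7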
Execution Times
(all times in seconds)

N              GPU         dnorm
100            0.000348    0.0001
1,000          0.000524    0.0004
10,000         0.000733    0.001
100,000        0.00124     0.007
1,000,000      0.006677    0.097
5,000,000      0.0235      0.385
10,000,000     0.053936    0.772
50,000,000     0.22657     3.75
100,000,000    0.542329    7.567
Speed up
With a sample size of 5 million, if you ran a simulation for 100,000 iterations, the
kernel evaluations would take about 10.7 hours using dnorm(), but roughly 40 minutes on a GPU.
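The arithmetic behind that claim, using the N = 5,000,000 row of the table:

## Projected total time for 100,000 kernel evaluations at N = 5e6
1e5 * 0.385  / 3600   # dnorm(): roughly 10.7 hours
1e5 * 0.0235 / 60     # GPU:     roughly 39 minutes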
GPUs and MCMC – Not just about speed
Using parallel computation to improve Independent Metropolis--Hastings
based estimation
Pierre Jacob, Christian P. Robert, Murray H. Smith
In this paper, we consider the implications of the fact that parallel raw-power can be exploited by a generic Metropolis-Hastings algorithm if the proposed values are independent. In particular, we present improvements to the independent
Metropolis--Hastings algorithm that significantly decrease the variance of any estimator derived from the MCMC output, for a
null computing cost since those improvements are based on a fixed number of target density evaluations. Furthermore, the
techniques developed in this paper do not jeopardize the Markovian convergence properties of the algorithm, since they are
based on the Rao--Blackwell principles of Gelfand and Smith (1990), already exploited in Casella and Robert (1996), Atchade
and Perron (2005) and Douc and Robert (2010). We illustrate those improvements both on a toy normal example and on a
classical probit regression model, but stress the fact that they are applicable in any case where the independent Metropolis-Hastings is applicable.
http://arxiv.org/abs/1010.1595
Non-R based solutions
• If you want your simulation to just "go faster", you will probably need to
abandon R altogether.
• Implement everything in another language like C/C++ or Java and it will
(most likely) be orders of magnitude faster
• Compiled code is in general much faster.
• You can stop having wacky R memory issues.
• Control the level of parallelization (thread vs. process vs. application).
• Interface more easily with devices like GPUs.
• Pain now will be rewarded later.
BAJA – Bayesian Analysis in JAva
• ST 790 project.
• Pure Java, general MH-MCMC sampler.
• Adaptive sampling options available.
• Model specification via a graphical directed acyclic graph (DAG).
• Parallel chain execution – multithreaded.
• Stochastic thinning via resolvant.
• Memory efficient (relatively).
• Automatic generation of trace plots, histograms, and ACF plots.
• Export samples to .csv or R.
BAJA Parallel Performance
[Plot: execution time in seconds (0–35) versus number of chains (1–6), comparing N independent chains of length L against a single chain of length N*L, with L = 1,000,000.]
Speed comparison with several platforms
Thank you!