Bayesian statistics – MCMC techniques How to sample from the posterior distribution Bayes formula – the problem term Bayes theorem: f ( | D) f ( D | ) f ( ) f ( D) f ( D | ) f ( ) f ( D | ' ) f ( ' )d ' Seems straight forward. You multiply the likelihood (which you know) with your prior (which you have specified). So far, so analytic. But, the integral may not be analytically tractable. Solutions: Numerical integration or Monte Carlo-integration of the integral ... or sampling from the posterior distribution using Markov chain Monte Carlo (MCMC). Normalization Bayes theorem: f ( D | ) f ( ) f ( | D) f ( D) f ( D | ) f ( ) f ( D | ' ) f ( ' )d ' • Note that the problematic integral is simply a normalization factor. The result of that integral contains no function dependency on the parameters, . Normalization is that constant in a probability density function which makes the probabilities integrate to one. • The functional form of the posterior distribution depends only on the two terms in the denominator, the likelihood f(D| ) and the prior f(). • If we can find a way of sampling from a distribution without knowing it’s normalization constant, we can use those samples to tell us about all properties of the posterior distribution (mean, variance, covariance, quantiles etc.) • How to sample from a distribution f(x)=g(x)/A without knowing A? Markov chains Organize a series of stochastic variables, X1,...,Xn like this: X1 X2 X3 … Xi-1 Xi Xi+1 … Xn Each variable depends on it’s past only through the nearest past. f(xt|xt-1,...,x1)=f(xt|xt-1). This simplifies the model for the combined series a lot. f(xt,xt-1,...,x1)= f(xt|xt-1,...,x1) f(xt|xt-1,...,x1)...f(x1). Markov chains can be very practical for modelling data with some time-dependecy. Example: The autoregressive model in the “water temperature” series. Markov chains –stationary distribution Each element in the chain will have a distribution. By simulating the chain multiple times and plotting the histograms, you get a sense of this. X1 X2 X3 X4 X5 X6 Sometimes, a Markov chain will converge towards a fixed distribution, the stationary distribution. Ex: Random walk with reflective boundaries (-1,1). Xi=Xi-1+ i, iN(0,2), if Xi<-1 then set Xi to -1-(Xi+1), if Xi>1 then set Xi to 1-(Xi-1). While we sampled from the normal distribution, we get something that is uniformly distributed. Even if we did not know how to sample from the uniform directly, we could do it indirectly using this Markov chain. Idea: Make a Markov chain that converges towards the posterior, f(|D). MCMC – why? • Make a Markov chains timeseries of parameter samples, (1, 2, 3,…), which has stationary distribution f(|D), in the following way (Metropolis-Hastings): i-2 i=i* Propose a new value from the previous values, i*,f(i*| i-1) i-1 i=i-1 • Acceptance with probability: h(i 1 | i ) f (i | D) h(i 1 | i ) f ( D | i ) f (i ) / f ( D) * * h(i | i 1 ) f (i 1 | D) h(i | i 1 ) f ( D | i 1 ) f (i ) / f ( D) * * * * * • We don’t need the troublesome normalization constant, f(D). We can drop it! • We may even drop further normalization constants in the likelihood, prior and proposal density. • Solves the normalization problem in Bayesian statistics! Everything solved? What about the proposal distribution? • Almost every proposal distribution, i*,f(i*| i-1), will do. There’s a lot of (too much?) freedom here. • Can restrict ourselves to go through each the parameters, one at a time. For =((1), (2)), propose a change in (1), accept/reject, then do the same for (2). (1) f((1), (2)|D) (2) • Pros: Simpler. Cons: Great parameter dependency means little change. • Special case: A Gibbs sample of a parameter comes from it’s distribution conditioned on the data and all the rest of the parameters. The proposal is now specified by the model, so no need to choose anything. Automatic methods possible, where we only need to specify the model (WinBUGS). MCMC - convergence Sampling using MCMC means that if we keep doing it for enough time, we get a sample from the posterior distribution. Question 1: How much is “enough time”? Markov chains will not converge immediately. We need ways of testing whether convergence has occurred or not. • Start many chains (starting with different initial positions in parameter space). When do they start to mix? Compare plots of the parameter values or of the likelihoods. • Gelman’s ANOVA test for MCMC samples. • Check if we are getting the sample results if we rerun the analysis from different starting points. After determining the point of convergence, we discard the samples from before that (burn-in) and only keep samples from later than this point. MCMC - dependency Question 2: How many samples do we need? In a Markov chain, one sample depends on the previous one. How will that affect the precision of the total sample and how should we relate to that? • Look at the timeseries plot of a single chain. Do we see dependency? • Estimate auto-correlation -> effective number of samples. • Keep only each k’th sample, so that the dependency is almost gone. We can then get a target number of (almost) independent samples. h0 0.7 0.6 0.5 0.4 23000 23500 24000 24500 25000 iteration Autocorrelation plot. Keep each sample vs keep each 600th sample. a a 1.0 0.5 0.0 -0.5 -1.0 1.0 0.5 0.0 -0.5 -1.0 0 20 40 lag 0 20 40 lag If we have a set of independent samples, we can then use standard sampling theory to say something about the precision of means and covariances, for instance. MCMC – causes for concern h0 • High posterior dependency between parameters: Gives high dependency between samples and thus low efficiency. Look at scatterplots and to see if this is the case. • Multimodality – posterior density has many “peaks”. Chains starting from different places “converge” to different places. Greatly reduced mixing. Usual solution within WinBUGS: Re-parametrize or sample a lot! 0.8 0.7 0.6 0.5 0.4 0.3 2.25 2.75 a 3.25