Bayesian statistics – MCMC techniques
How to sample from the posterior distribution
Bayes formula – the problem term
Bayes theorem:
f(θ | D) = f(D | θ) f(θ) / f(D) = f(D | θ) f(θ) / ∫ f(D | θ′) f(θ′) dθ′
Seems straightforward. You multiply the likelihood (which you know) with your prior (which you have specified). So far, so analytic.
But the integral may not be analytically tractable.
Solutions: numerical integration or Monte Carlo integration of the integral (sketched below) ... or sampling from the posterior distribution using Markov chain Monte Carlo (MCMC).
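For instance, the Monte Carlo option can be as simple as averaging the likelihood over draws from the prior, since f(D) = ∫ f(D | θ) f(θ) dθ is a prior expectation of the likelihood. A minimal Python sketch (the binomial data and Beta(2, 2) prior are purely illustrative assumptions, and the constant binomial coefficient is dropped):

    import numpy as np

    # Hypothetical example: y successes out of n binomial trials,
    # Beta(2, 2) prior on the success probability theta.
    # Estimate f(D) = E_prior[ f(D | theta) ] by Monte Carlo.
    rng = np.random.default_rng(1)
    y, n = 7, 10

    def likelihood(theta):
        # binomial likelihood kernel (constant binomial coefficient omitted)
        return theta**y * (1.0 - theta)**(n - y)

    theta_prior = rng.beta(2.0, 2.0, size=200_000)   # draws from the prior
    f_D_hat = likelihood(theta_prior).mean()         # Monte Carlo estimate of f(D)
    print(f_D_hat)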
Normalization
Bayes theorem:
f(θ | D) = f(D | θ) f(θ) / f(D) = f(D | θ) f(θ) / ∫ f(D | θ′) f(θ′) dθ′
• Note that the problematic integral is simply a normalization factor. The result of that integral contains no functional dependency on the parameters, θ. The normalization is the constant in a probability density function which makes it integrate to one.
• The functional form of the posterior distribution therefore depends only on the two terms in the numerator, the likelihood f(D | θ) and the prior f(θ) (see the example below).
• If we can find a way of sampling from a distribution without knowing its normalization constant, we can use those samples to tell us about all properties of the posterior distribution (mean, variance, covariance, quantiles etc.).
• How to sample from a distribution f(x) = g(x)/A without knowing A?
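As a concrete illustration of why the numerator alone pins down the posterior's shape (a standard conjugate example, not taken from these slides): with y successes in n binomial trials and a Beta(a, b) prior,
f(θ | D) ∝ θ^y (1-θ)^(n-y) · θ^(a-1) (1-θ)^(b-1) = θ^(y+a-1) (1-θ)^(n-y+b-1),
which we recognize as the kernel of a Beta(y+a, n-y+b) distribution without ever evaluating f(D). In most models no such closed form exists, and that is where MCMC comes in.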
Markov chains
Organize a series of stochastic variables, X1,...,Xn like this:
X1 → X2 → X3 → … → Xi-1 → Xi → Xi+1 → … → Xn
Each variable depends on its past only through the nearest past:
f(xt | xt-1, ..., x1) = f(xt | xt-1).
This simplifies the model for the combined series a lot:
f(xt, xt-1, ..., x1) = f(xt | xt-1) f(xt-1 | xt-2) ... f(x2 | x1) f(x1).
Markov chains can be very practical for modelling data with some time-dependency. Example: the autoregressive model in the “water temperature” series.
Markov chains – stationary distribution
Each element in the chain will have a distribution. By simulating the chain
multiple times and plotting the histograms, you get a sense of this.
[Figure: histograms of X1, X2, X3, X4, X5 and X6 from repeated simulations of the chain.]
Sometimes, a Markov chain will converge towards a fixed distribution, the
stationary distribution.
Ex: Random walk with reflective boundaries (-1, 1): Xi = Xi-1 + εi, εi ~ N(0, σ²); if Xi < -1 then set Xi to -1 - (Xi + 1), if Xi > 1 then set Xi to 1 - (Xi - 1).
While we sampled from the normal distribution, we get something that is uniformly distributed. Even if we did not know how to sample from the uniform directly, we could do it indirectly using this Markov chain (see the sketch below).
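A minimal Python simulation of this reflected random walk (σ = 0.3 is an arbitrary choice for the illustration); the histogram of the chain comes out close to the uniform density 0.5 on (-1, 1):

    import numpy as np

    # Random walk on (-1, 1) with reflection at the boundaries, as above.
    rng = np.random.default_rng(2)
    sigma, n = 0.3, 100_000
    x = np.empty(n)
    x[0] = 0.0
    for i in range(1, n):
        x[i] = x[i - 1] + rng.normal(0.0, sigma)
        if x[i] < -1:                  # reflect at the lower boundary
            x[i] = -1 - (x[i] + 1)
        elif x[i] > 1:                 # reflect at the upper boundary
            x[i] = 1 - (x[i] - 1)

    counts, _ = np.histogram(x, bins=20, range=(-1, 1), density=True)
    print(counts.round(2))             # all values close to 0.5 = uniform density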
Idea: Make a Markov chain that converges towards the posterior, f( |D).
MCMC – why?
• Make a Markov chain time series of parameter samples, (θ1, θ2, θ3, …), which has stationary distribution f(θ | D), in the following way (Metropolis-Hastings):
• Given the previous value θi-1, propose a new value θ* from a proposal density, θ* ~ h(θ* | θi-1).
• If the proposal is accepted, set θi = θ*; if it is rejected, set θi = θi-1.
• Acceptance with probability (capped at 1):
r = [ h(θi-1 | θ*) f(θ* | D) ] / [ h(θ* | θi-1) f(θi-1 | D) ]
  = [ h(θi-1 | θ*) f(D | θ*) f(θ*) / f(D) ] / [ h(θ* | θi-1) f(D | θi-1) f(θi-1) / f(D) ]
• We don’t need the troublesome normalization constant, f(D). We can drop it!
• We may even drop further normalization constants in the likelihood, prior
and proposal density.
• Solves the normalization problem in Bayesian statistics!
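A minimal Metropolis-Hastings sketch in Python (the binomial/Beta model and the tuning constants are illustrative assumptions, not from the slides). The random-walk proposal is symmetric, so the h terms cancel, and only the unnormalized posterior is ever evaluated; f(D) never appears:

    import numpy as np

    # Random-walk Metropolis for a hypothetical model: y successes out of n,
    # Beta(2, 2) prior on theta.  Unnormalized log-posterior only.
    rng = np.random.default_rng(3)
    y, n = 7, 10

    def log_post_unnorm(theta):
        if theta <= 0.0 or theta >= 1.0:
            return -np.inf                        # outside the prior's support
        return (y + 1) * np.log(theta) + (n - y + 1) * np.log(1 - theta)

    n_iter, step = 20_000, 0.2                    # arbitrary tuning constants
    theta = np.empty(n_iter)
    theta[0] = 0.5
    for i in range(1, n_iter):
        prop = theta[i - 1] + rng.normal(0.0, step)        # symmetric proposal
        log_r = log_post_unnorm(prop) - log_post_unnorm(theta[i - 1])
        if np.log(rng.uniform()) < log_r:         # accept with prob. min(1, r)
            theta[i] = prop                       # accept: move to the proposal
        else:
            theta[i] = theta[i - 1]               # reject: keep the old value

    print(theta[5000:].mean())                    # posterior mean after burn-in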
Everything solved? What about the proposal distribution?
• Almost every proposal distribution, θ* ~ h(θ* | θi-1), will do. There’s a lot of (too much?) freedom here.
• We can restrict ourselves to going through the parameters one at a time. For θ = (θ(1), θ(2)), propose a change in θ(1), accept/reject, then do the same for θ(2) (see the figure below).
[Figure: contour plot of the posterior f(θ(1), θ(2) | D), with component-wise proposal moves along the θ(1) and θ(2) axes.]
• Pros: simpler. Cons: strong dependency between parameters means only small changes per update.
• Special case: a Gibbs sample of a parameter comes from its distribution conditioned on the data and all the rest of the parameters (see the sketch below).
– The proposal is now specified by the model, so there is no need to choose anything. Automatic methods are possible, where we only need to specify the model (WinBUGS).
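A minimal Gibbs-sampling sketch in Python (the bivariate normal target with correlation 0.8 is an illustrative assumption; in a real analysis the full conditionals come from your model, or WinBUGS derives them for you). Each parameter is drawn in turn from its distribution given the other:

    import numpy as np

    # Gibbs sampler for a standard bivariate normal with correlation rho:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2) and x2 | x1 ~ N(rho * x1, 1 - rho^2).
    rng = np.random.default_rng(4)
    rho, n_iter = 0.8, 20_000
    cond_sd = np.sqrt(1.0 - rho**2)       # sd of each full conditional
    x = np.zeros((n_iter, 2))
    for i in range(1, n_iter):
        x1 = rng.normal(rho * x[i - 1, 1], cond_sd)    # update x1 given x2
        x2 = rng.normal(rho * x1, cond_sd)             # update x2 given new x1
        x[i] = x1, x2

    print(np.corrcoef(x[1000:, 0], x[1000:, 1])[0, 1]) # should be close to rho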
MCMC - convergence
Sampling using MCMC means that if we keep doing it for enough time, we
get a sample from the posterior distribution.
Question 1: How much is “enough time”? Markov chains will not converge
immediately. We need ways of testing whether convergence has
occurred or not.
• Start many chains (starting with
different initial positions in parameter
space). When do they start to mix?
Compare plots of the parameter
values or of the likelihoods.
• Gelman’s ANOVA test for MCMC samples (sketched below).
• Check if we are getting the same results if we rerun the analysis from different starting points.
After determining the point of convergence, we discard the samples from
before that (burn-in) and only keep samples from later than this point.
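The Gelman-Rubin shrink factor is presumably what “Gelman’s ANOVA test” refers to; a rough Python sketch comparing between-chain and within-chain variance (the chains at the end are fake, only to show the call):

    import numpy as np

    def gelman_rubin(chains):
        # chains: (m, n) array, m parallel chains of length n for one
        # parameter, burn-in already discarded.
        m, n = chains.shape
        B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
        W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
        var_hat = (n - 1) / n * W + B / n           # pooled variance estimate
        return np.sqrt(var_hat / W)                 # close to 1 => good mixing

    rng = np.random.default_rng(5)
    fake_chains = rng.normal(size=(4, 1000))        # illustrative chains only
    print(gelman_rubin(fake_chains))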
MCMC - dependency
Question 2: How many samples do we need? In a Markov chain, one sample depends on the previous one. How will that affect the precision of the total sample, and how should we deal with it?
• Look at the time-series plot of a single chain. Do we see dependency?
• Estimate the autocorrelation -> effective number of samples (sketched after the figure below).
• Keep only every k-th sample, so that the dependency is almost gone. We can then get a target number of (almost) independent samples.
[Figure: trace plot of the parameter h0 over iterations 23000-25000; autocorrelation plots (lags 0-40) when keeping every sample vs. keeping only every 600th sample.]
If we have a set of independent samples, we can then use standard sampling
theory to say something about the precision of means and covariances, for
instance.
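A rough Python sketch of the “autocorrelation -> effective number of samples” step (the 0.05 cut-off and the AR(1)-style test chain are arbitrary illustrations, not from the slides):

    import numpy as np

    def autocorr(chain, max_lag=50):
        # sample autocorrelation of a 1-D chain at lags 0..max_lag
        x = chain - chain.mean()
        var = x.var()
        return np.array([1.0] + [np.mean(x[:-k] * x[k:]) / var
                                 for k in range(1, max_lag + 1)])

    def effective_sample_size(chain, max_lag=50):
        # ESS ~ n / (1 + 2 * sum of the positive autocorrelations)
        rho = autocorr(chain, max_lag)[1:]
        return len(chain) / (1.0 + 2.0 * rho[rho > 0.05].sum())

    # Strongly autocorrelated AR(1)-style chain as a test case:
    rng = np.random.default_rng(6)
    z = rng.normal(size=10_000)
    chain = np.empty_like(z)
    chain[0] = z[0]
    for t in range(1, len(z)):
        chain[t] = 0.9 * chain[t - 1] + z[t]
    print(effective_sample_size(chain))             # far fewer than 10 000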
MCMC – causes for concern
• High posterior dependency between parameters: gives high dependency between samples and thus low efficiency. Look at scatterplots to see if this is the case (see the figure below).
• Multimodality – posterior density has
many “peaks”. Chains starting from
different places “converge” to different
places. Greatly reduced mixing.
Usual solution within WinBUGS:
Re-parametrize or sample a lot!
[Figure: scatterplot of posterior samples of h0 against a.]