Remember that our objective is for some density y are vectors

Remember that our objective is for some density
f(y|) for observations where y and  are vectors
of data and parameters,  being sampled from a
prior () then the posterior is
p( | y) 
f (y |  ) ( )
f (y |  ) ( )d
If we can get a formula for this (e.g. have this as
an explicit function of y and ) then we are done.
 But usually we can’t compute this particularly
when y and  are multidimensional. So instead
what we will do is estimate the distribution of the
posterior by taking lots of samples from it. From
this we can then do Bayesian estimation.
Idea of MCMC:
We are going to set up a Markov Chain for which we
can compute samples and which has the posterior
density as it’s stationary distribution. There are several
different ways to do this, but in all cases there are two
main issues:
(i)Once we can be sure the MC indeed does have the
posterior as its stationary distribution, we need a
stopping criteria to ensure we have indeed reached
stationarity. (convergence diagnosis)
(ii)The samples from the MC are correlated, rather than
being iid draws from the posterior, so we need some
way to relate how the variance in the estimators (e.g. of
moments of the posterior) compares to that of the
actual posterior (variance estimation)
How do we set up an appropriate MC?
Gibbs Sampling: This is useful in the situation which we
can sample from conditional distributions but we don’t
have the full or marginal distribution for the posterior.
The assumption is (usually justified) that the full joint
posterior distribution is determined uniquely if we know
all conditional distributions, and from this we can find all
marginal distributions as well.
So for a posterior p(|y) where  = (1, …,k) the
conditional distributions are p(i| j≠i,y) i = 1, …,k
and if we know all these conditional distributions,
e.g. we can sample from them, then we can use
this to obtain samples from the full joint prior p(|y)
and sample from the marginal posteriors p(i|y)
The Gibbs sampler: to find this we need a way to get
a MC, meaning finding an update of the current state
of the MC to the next state – this will not be
deterministic since we are sampling from
distributions, so what we are doing is specifying the
transition function for the MC using the conditional
distributions of the posterior.
1.Start with initial values of (2(0), …,k(0)) chosen
from the prior density
2.Draw 1(1) from p(1| 2(0),…, k(0),y)
3.Draw 2(1) from p(2| 1(1), 3(0),…, k(0),y)
And proceed in similar way to have a sample
(1(1),2(1), …,k(1))
So the above procedure provides a way to get a
new update of the parameters from an initial set –
the “state” of the MC is the vector (1(1),2(1),
…,k(1)) and we continue this same procedure to
update the state for t = 2, …, T
Then we have a stopping criteria to say that when
we have t = t0, …, T then we are near stationary
distribution so these vectors ( (t) , t = t0, …, T ) are
a sample (they are correlated though since they are
taken as a particular sample trajectory of the MC)
of the stationary distribution of the MC and thus are
a sample from the posterior. So for example, we
could use (1 (t) , t = t0, …, T ) to get a histogram of
values that are a sample of the marginal posterior
density for 1 , p(1 | y ) and from this get estimate
of the posterior mean
(t )
Eˆ (1 | y) 
T  t 0 t t0 1
There’s no reason to limit this to a single chain
though – it would be typical to start several
trajectories for the MC from different initial values.
This is a typical approach to try to determine and
appropriate burn-in period (e.g. the value t0 ). You
look at some estimate from the MC trajectory (such
as the above mean) and then compare the
variation within a single trajectory of this parameter
varies as compared to the variation across the
trajectories. Due to ergodicity of the MC, if we are
near stationary distribution, variation across
trajectories should be identical to variation within
trajectories (see the book for scale reduction factor)
How many calculations does the above need?
We need k samples from various conditional
distributions for each transition step of the MC. We
do T of these, so we need a total of kT samples
from the conditional distributions and if we do a
total of m trajectories, and sample for each k values
from the prior then we have mk(T+1) total samples
to take, each of which might require several
pseudo-random number draws.
Bivariate example – case with k=2 parameters a,b
1.Start with initial values of (a,b) chosen from the
prior density
2.Draw a’ from p(a|b,y)
3.Draw b’ from p(b|a’,y)
4.Then the new state of the MC is (a’,b’) and note
that as we continue this, if we are at stationary
distribution, the probability of drawing a particular b’
given a’ will be the same across iterations.
The Gibbs sampler has a variety of variations (see
the handout) but all assume that we have a
reasonably feasible way to sample from all
conditional distributions. When this is not feasible,
there are other approaches an the MetropolisHastings sampler is one: The idea here is to add a
rejection criteria in which we reject some samples
under certain criteria. This is a standard approach in
using pseudo-random number generators to develop
a sample from a probability distribution where we
cannot easily sample from the distribution. This goes
back to von-Neumann. A simple example is if you
want to sample points uniformly in the unit circle,
then you independently choose two pseudo-random
numbers x, y uniform on [-1,1], and accept them if
x2+y2 ≤1 and reject them otherwise.
The Metropolis-Hasting algorithm: we wish to
generate a MC for the posterior p() and suppose
there is the transition kernel q(,φ) which we know
how to sample from.
1.Start with initial value of (0) chosen from the prior
2.Generate proposed value φ from q((0),φ)
3.Evaluate acceptance probability α((0),φ)
p( )q(, )
 (, )  min{ 1,
p( )q(, )
4.Put (1)=φ with probability α((0),φ) and otherwise
put (1)=(0)
The above procedure provides a MC and the
handout shows that it has p() as a stationary
distribution. The trick is to choose the transition
kernel –there are several simplified cases:
1.If we choose q(,φ)=q(φ, ) then this is Metropolis
sampler and acceptance depends only on ratio –
accept if step is to higher density, accept step to
lower density depending upon ratio of densities.
2.Random walks – φ = (j) + wj where wj are iid with
some density.
3.Independent chains - q(,φ)=f(φ) for some
density f( ) so is independent of state of chain and if
choose prior for proposal density then get ratio of
likelihoods in acceptance criteria