Statistics 580 Introduction to Markov Chain Monte Carlo

Introduction to Markov Chains
A stochastic process is a sequence of random variables {X(t), t ∈ T } indexed by a parameter t in
an index set T. X(t) is called the state of the process at time t and the set of possible realizations of
X(t) defines the state space, denoted by S. The time or parameter space may be discrete, e.g.,
T = {0, 1, . . .}, or continuous, e.g., T = (0, ∞). The state space may also be discrete, e.g.,
S = {0, 1, . . .}, or continuous, e.g., S = (−∞, ∞).
A discrete parameter stochastic process {X(t), t = 0, 1, 2, . . .} or a continuous parameter process
{X(t), t ≥ 0} is said to be a Markov process if, for any set of n time points t1 < t2 < . . . < tn in
the index set T,
    Pr[X(tn) ≤ xn | X(tn−1) = xn−1, . . . , X(t1) = x1] = Pr[X(tn) ≤ xn | X(tn−1) = xn−1].
It says that the probability distribution of future states in a Markov process depends only on the present
(or most recently observed) state and not on the past states. Markov processes are classified according
to the nature of the state space or the parameter space. A Markov process whose state space is discrete
is called a Markov chain. We shall use the set {0, 1, . . .} to denote the state space of a Markov chain
and first consider discrete parameter Markov chains where, without loss of generality, we shall use
the notation X0 , X1 , . . . , Xn , . . . to denote the states.
The transition probability function of a discrete parameter Markov chain, which defines the probability distribution of the next state given the present state, is given by
    pij(m, n) = Pr(Xn = j | Xm = i)
for all states i and j and n ≥ m. The matrix of transition probabilities is denoted by P(m, n) =
(pij(m, n)). In order to give the probability law of a discrete parameter Markov chain {Xn}, it is sufficient
to specify, for all times n ≥ m, pj(n) = P(Xn = j) and pij(m, n) for all states i and j. A fundamental
relation satisfied by the transition probability function of a Markov chain is the Chapman-Kolmogorov
equation: for any times n > u > m and states i and j,
    pij(m, n) = Σ_k pik(m, u) pkj(u, n),
or, in terms of transition probability matrices,
P (m, n) = P (m, u) P (u, n).
If the pij(m, n) depend only on the difference n − m, they are said to be stationary transition
probabilities and the Markov chain is said to be stationary or homogeneous. In a stationary Markov
chain, the k-step transition probability function is denoted by
    pij^(k) = Pr(Xn+k = j | Xn = i)
and the k-step transition probability matrix by P(k), where P(k) = (pij^(k)). The one-step transition
probabilities of a stationary Markov chain are denoted by
    pij ≡ pij^(1) = Pr(Xn = j | Xn−1 = i),
where by definition Σ_j pij = 1, and the matrix of these transition probabilities is the square matrix

        [ p11  p12  p13  ··· ]
    P = [ p21  p22  p23  ··· ]
        [  ·    ·    ·       ]
        [  ·    ·    ·       ]
and is called the one-step transition matrix. We see that the rows of P sum to one and hence it
is a stochastic matrix or a matrix of probabilities. As an example, in image analysis a binary image is
represented in pixels, where black or white pixels are indicated by θi = 1 or 0, respectively. The posterior
joint density of the true image θ given the observed noisy image x (i.e., observed data) is
    p(θ|x) ∝ f(x|θ) g(θ),
where f(x|θ) is the model for how the true image is corrupted by noise, and g(θ) is the prior that incorporates
knowledge about the types of images under consideration. The number of rows (= number of columns)
of P in this example may be 2^262,144 > 10^7,857.
Examples of discrete parameter Markov chains:
Example 1 - Two-state Markov chains:
Consider a two-state weather model with the two states being “rain” or “no rain” on successive days. The one-step
transition probability matrix is

                    rain  no rain
    P =    rain  (   α     1−α   )
        no rain  (   β     1−β   )

where the elements represent the probabilities that it will rain or not on a given day conditional on whether it rained
or not the previous day. Here the Markov chain is homogeneous since the transition probabilities
are stationary: they are unaffected by what day it is.
Example 2:
Toss a coin with Pr(head) = p repeatedly. After the nth toss, let Xn represent the number of heads
that have appeared so far.
Then the one-step transition probability matrix is:

        [ 1−p   p    0    0   ···  0 ]
    P = [  0   1−p   p    0   ···  0 ]
        [  0    0   1−p   p   ···  0 ]
        [  ·    ·    ·    ·        · ]

where the elements of the 1st row are
    p11 = P(Xn = 1 | Xn−1 = 1),  p12 = P(Xn = 2 | Xn−1 = 1),  p13 = P(Xn = 3 | Xn−1 = 1), . . . , etc.
and the elements of the 2nd row are
    p21 = P(Xn = 1 | Xn−1 = 2),  p22 = P(Xn = 2 | Xn−1 = 2),  p23 = P(Xn = 3 | Xn−1 = 2),
    p24 = P(Xn = 4 | Xn−1 = 2), . . . , etc.
We see that Xn = Xn−1 + S, where S ∼ Bernoulli(p) and Xn−1 and S are independent, which implies that
{Xn} is a Markov chain.
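For example, a realization of this chain is easily simulated in R (p = .3 and 10 tosses are arbitrary illustrative choices):

p = .3
x = cumsum(rbinom(10, 1, p))   # X_n = X_{n-1} + S_n with S_n ~ Bernoulli(p)
x                              # one realization of the head-count chain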
Some definitions and results about discrete parameter Markov chains
Recall that the k-step transition probability function was defined as
    pij^(k) = Pr(Xn+k = j | Xn = i)
for any integer n. The notation Xn+k = j | Xn = i says that X goes from state i to state j in k steps in time.
P(k) denotes the k-step transition matrix, whose elements are pij^(k), and P is the one-step transition
matrix whose elements are pij. For a stationary Markov chain
    P(k) = P^k    and    p(k) = p(0) P^k,
where
    p(k) = ( p1^(k), p2^(k), . . . , pj^(k), . . . )′,   with pj^(k) = Pr(Xn+k = j),
and
    p(0) = ( p1, p2, . . . , pj, . . . )′,   with pj = Pr(Xn = j),
are the k-step unconditional probabilities. Note that pj ≡ pj^(0) is the pmf of the random variable Xn, pj^(1)
is the pmf of the random variable Xn+1, etc. These results are immediate from the Chapman-Kolmogorov
equations. Consequently, the probability law of a homogeneous Markov chain is completely determined
once one knows the one-step transition probability matrix P and the unconditional probability vector
p(0) at time 0.
Example:
The two-state weather model with α = 0.7, β = 0.4 is stationary. Thus it follows that

    P  = [ 0.7  0.3 ]      P² = [ 0.61  0.39 ]      P⁴ = [ 0.575  0.425 ]
         [ 0.4  0.6 ]           [ 0.52  0.48 ]           [ 0.567  0.433 ]
Note that the rows of P 4 are almost identical showing that the probability of rain or no rain on a certain
day does not depend on whether it rained or not four days earlier.
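These powers are easy to verify numerically; a minimal base-R sketch:

P  = matrix(c(0.7, 0.3,
              0.4, 0.6), 2, 2, byrow = TRUE)
P2 = P %*% P     # two-step transition matrix
P4 = P2 %*% P2   # four-step transition matrix
P4               # rows are nearly identical, approaching (4/7, 3/7)
# (4/7, 3/7) = (0.571..., 0.428...) is the invariant distribution: it is the
# unique probability vector solving pi = pi P.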
Some properties of Markov chains:
The following definitions, properties and results are useful, in general, and for defining certain useful
classes of Markov chains, in particular:
1. A Markov chain is irreducible if every state can be reached from every other state. That is, for all
   i, j there exists some k0 such that pij^(k0) > 0. We say that all states communicate with each other.
   Example: The chain described by P on the left is irreducible and the chain described by P on the
   right is not:

       P = [ 0.5  0.5  0.0 ]          P = [ 0.5  0.5  0.0  0.0 ]
           [ 0.5  0.3  0.2 ]              [ 0.5  0.5  0.0  0.0 ]
           [ 0.0  0.3  0.7 ]              [ 0.2  0.3  0.3  0.2 ]
                                          [ 0.0  0.0  0.0  1.0 ]
2. Define fij^(k) ≡ Pr(Xn+k = j for the 1st time | Xn = i). That is, fij^(k) is the probability that Xn+k is
   in the jth state for the first time given that Xn was in the ith state. Thus fjj^(k) is the probability of first
   passage from state j back to state j in k steps. The jth state is said to be persistent (i.e., not transient) if
       Σ_{k=1}^∞ fjj^(k) = 1.
   That is, having started at state j, the probability that the chain will eventually return to j is one.
   Note that in the literature the term recurrent is sometimes used in place of the term persistent to
   describe states that satisfy this condition.
3. The jth state is periodic of period tj if pjj^(k) > 0 only when k = ν tj, where ν is an integer.
4. A Markov chain is aperiodic if no states are periodic.
5. A Markov chain is persistent (not transient) if all of its states are persistent.
Theorem
Consider an irreducible, aperiodic, and persistent Markov chain whose mean recurrence time is finite,
i.e.,
    mjj = Σ_{k=1}^∞ k fjj^(k) < ∞.
An invariant (or limiting) distribution for the Markov chain is said to exist if there exists a probability
distribution {πj} such that
    lim_{k→∞} pij^(k) = πj
for all j = 1, 2, . . .. If the invariant distribution {πj} exists, then it is the unique solution to the equation
    πj = Σ_i πi pij                                                    (1)
that satisfies Σ_j πj = 1.
Equation (1) follows since

    P^k = (pij^(k))  →  [ π1  π2  ··· ]
                        [ π1  π2  ··· ]   as k → ∞,
                        [  ·   ·      ]

and clearly P^{k+1} = P^k P. Therefore, as k → ∞,

    [ π′ ]     [ π′ ]
    [ π′ ]  =  [ π′ ]  P ,
    [  ·  ]    [  ·  ]

where π′ = (π1, π2, . . .), which implies (1). P is said to satisfy global balance if pij satisfies (1). π is also
known as the equilibrium distribution or the stationary distribution.
The idea behind Markov chain Monte Carlo is to find an appropriate Markov chain (i.e., P) whose
invariant distribution {πj} is the distribution from which we wish to draw samples. From a starting
realization X0, simulate X1 according to the transition matrix P; subsequently simulate X2 from X1 and
P, and so forth. After a burn-in period,
    pij^(k) = Pr(Xk = j | X0 = i) ≈ πj,
and so Xk is a realization from the distribution {πj, j = 1, 2, . . .}. How can we find such a Markov chain?
We need at least one more result.
Reversibility of Markov chains
Let {Xn : −∞ < n < ∞} be a Markov chain with invariant distribution {πj}. Then Pr(Xn = j) = πj.
Consider the time-reversed process
    Zn = X−n .
Now {Zn} is a Markov chain with transition probabilities
    qij = Pr(Zn = j | Zn−1 = i)
        = Pr(Zn = j, Zn−1 = i) / Pr(Zn−1 = i)
        = Pr(X−n+1 = i | X−n = j) Pr(X−n = j) / Pr(X−n+1 = i)
        = pji πj / πi .
The Markov chain is time reversible if {Zn} has the same distribution as {Xn}, which implies qij = pij,
which in turn implies
    πj pji = πi pij                                                    (2)
for all i, j ∈ S.
Proposition
Equation (2) implies Equation (1).
Proof:
    R.H.S. of (1) = Σ_i πi pij = Σ_i πj pji   (by (2))
                  = πj Σ_i pji = πj = L.H.S. of (1).
Condition (2) is obviously the stronger condition, and the πj are then said to satisfy detailed balance. To simulate
samples from a distribution {πj}, it is sufficient that a Markov chain is defined via transition probabilities
{pij} that satisfy the relation πi pij = πj pji. This relation is called the reversibility condition. Markov
Chain Monte Carlo turns the theory around: the invariant density is known, perhaps up to a constant
multiple (it is in fact the target density from which samples are desired), but the transition matrix is
unknown. To generate samples from π(·), Markov Chain Monte Carlo methods use a transition matrix
called a nominating matrix and employ an acceptance-rejection algorithm whose nth iterate converges
to π(·) for large n.
Metropolis algorithm (Metropolis, et al., 1953)
Suppose the nominating matrix Q is any symmetric matrix of probabilities, i.e., qij = qji. We would like
to obtain a sample from a distribution {πj}, where πj = Pr(Xn = j), by generating observations from a
Markov chain that has {πj} as its invariant distribution. The Metropolis algorithm starts from the current
state i and decides whether to move to a new state j based on a Bernoulli trial:
Step
0. Set xn−1 = i, where i is any initial state.
1. Generate j from the probability distribution {qij ; j = 1, 2, . . .}.
2. Set r = πj /πi .
3. If r ≥ 1 set xn = j.
   Otherwise generate u from U(0, 1):
       if u < r set xn = j;
       else set xn = xn−1 .
4. Set n = n + 1, go to Step 1.
In the above algorithm, the value j is accepted with probability αij = min{πj /πi , 1}. To prove that the
equilibrium distribution of the chain constructed by the above algorithm is indeed {πj}, it is sufficient to
check that the detailed balance condition holds.
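A quick verification of detailed balance: for states i ≠ j with πj ≤ πi (so that αij = πj/πi and αji = 1; the case πj > πi is symmetric),
    πi pij = πi qij αij = πi qij (πj /πi) = πj qij = πj qji = πj qji αji = πj pji ,
where the middle step uses the symmetry qij = qji.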
Example: As an application of the Metropolis algorithm, suppose that we want to generate from the
Poisson distribution with mean λ, i.e.,
    πj = P(Xn = j) = (1/j!) λ^j e^{−λ},   j = 0, 1, . . .
We will use the nominating probability matrix

        [ 1/2  1/2   0    0    0   ··· ]           q00 = 1/2
        [ 1/2   0   1/2   0    0   ··· ]           qij = 1/2  for j = i − 1
    Q = [  0   1/2   0   1/2   0   ··· ]   i.e.,       = 1/2  for j = i + 1
        [  0    0   1/2   0   1/2  ··· ]               = 0,   otherwise
        [  ·    ·    ·    ·    ·       ]

which is symmetric (and is a one-step transition matrix).
The Metropolis algorithm for generating samples from Poisson(6) is as follows:
Step 0. Start with xn−1 = i.
Step 1. Generate j from {qij }, i.e., generate u1 from U(0, 1):
        if i ≠ 0:  if u1 < 1/2, set j = i − 1; else set j = i + 1.
        if i = 0:  if u1 < 1/2, set j = 0; else set j = 1.
Step 2. Set r = πj /πi = (i! λ^j)/(j! λ^i),
        i.e., set r = 1,     if i = 0, j = 0,
                    = i/λ,   if j = i − 1,
                    = λ/j,   if j = i + 1.
Step 3. If r ≥ 1, set xn = j.
        Otherwise, generate u2 from U(0, 1):
        if u2 < r, set xn = j; else, set xn = xn−1 .
Step 4. Set n = n + 1, go to Step 1.
The table below displays values of relevant quantities computed in the first 15 iterations of this algorithm for Poisson with λ = 6, starting with x0 = 2:

    --------------------------------------------------
      n    i      u1        j      r          u2
    --------------------------------------------------
      1    2   0.71889082   3   2.0000000   0.83568994
      2    3   0.92144722   4   1.5000000   0.67244221
      3    4   0.48347869   3   0.6666667   0.23677552
      4    3   0.38764000   2   0.5000000   0.70580029
      5    3   0.66973964   4   1.5000000   0.47446056
      6    4   0.51325076   5   1.2000000   0.44375696
      7    5   0.22118260   4   0.8333333   0.79923561
      8    4   0.32724500   3   0.6666667   0.55147710
      9    3   0.32624403   2   0.5000000   0.88511680
     10    3   0.32752058   2   0.5000000   0.82785282
     11    3   0.51644296   4   1.5000000   0.30783601
     12    4   0.53919790   5   1.2000000   0.40234452
     13    5   0.95002276   6   1.0000000   0.07881027
     14    6   0.07521049   5   1.0000000   0.25551719
     15    5   0.78899123   6   1.0000000   0.59512748
    --------------------------------------------------

The R code that was used to generate these iterates is as follows:
poisson.metro = function(lamda, i, n)
{
  # Metropolis sampler for Poisson(lamda) using the nominating matrix Q above;
  # i is the starting state and n is the number of iterations.
  y = seq(n)
  for (k in 1:n)
  {
    u1 = runif(1)
    # Step 1: propose j (j = i when i = 0 and u1 < .5, since j cannot be -1)
    j = if (u1 < .5) ifelse(i == 0, i, i - 1) else i + 1
    # Step 2: r = lamda/j if j = i + 1, 1 if j = i, i/lamda if j = i - 1
    r = switch(i + 2 - j, lamda/j, 1, i/lamda)
    u2 = runif(1)
    # Step 3: accept j with probability min(r, 1)
    new = if (r >= 1) j else {if (u2 < r) j else i}
    i = new
    y[k] = i
  }
  return(y)
}
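As an illustration of how these iterates might be used (the burn-in of 500 values is an arbitrary choice):

x = poisson.metro(6, 2, 5000)   # lamda = 6, starting state 2, 5000 iterations
x = x[-(1:500)]                 # discard a burn-in period
# compare relative frequencies of the retained iterates with the Poisson(6) pmf
round(rbind(sampler = table(factor(x, levels = 0:14)) / length(x),
            poisson = dpois(0:14, 6)), 3)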
[Figure: relative frequency barcharts (Rel.Freq. against j) of samples from the Metropolis sampler for Poisson(6), with panels for iterates 101-200, 901-1000, 4500-5000 in steps of 3, and 4505-5000 in steps of 5; titled "Metropolis Algorithm for generating Poisson(6) Samples".]
The figure above shows plots of relative frequency barcharts constructed from 100 values obtained from this
Metropolis sampler (each sample plotted as labelled on the panels), superimposed with the actual probability
mass function of the Poisson(6), shown on the plots with connected line segments for comparison purposes.
A more general form of the Metropolis algorithm was given by Hastings (1970) and is usually referred to as
the Metropolis-Hastings algorithm. In this case the nominating probabilities qij are allowed to be general
rather than symmetric. The acceptance probability of j in this case is given by
    αij = min{ πj qji /(πi qij ) , 1 }.
Metropolis-Hastings Algorithm (discrete state space case)
Step
0. Set xn−1 = i, where i is any initial state.
1. Generate j from the probability distribution {qij ; j = 1, 2, . . .}.
2. Set r = πj qji /(πi qij ).
3. If r ≥ 1 set xn = j.
   Otherwise generate u from U(0, 1):
       if u < r set xn = j;
       else set xn = xn−1 .
4. Set n = n + 1, go to Step 1.
It is easily shown that the detailed balance condition holds for this algorithm as well, thus proving that
its equilibrium distribution is {πj}.
In both Metropolis and Metropolis-Hastings, the resulting chain has transition probabilities defined by
    pij = qij αij ,               for i ≠ j,
    pii = 1 − Σ_{j≠i} qij αij .
The theory on discrete parameter Markov chains carries over to the continuous time, continuous state
space case with some theoretical generalizations. In particular, the transition matrix P becomes a
transition kernel p(x, y) for x, y ∈ ℜ, which can be used to compute probabilities as usual:
    P(y ∈ A | X = x) = ∫_A p(x, y) dy .
Other properties need to be defined accordingly; e.g., recurrence is defined in terms of sets with
positive probability of being visited infinitely often. The stationary or equilibrium distribution π(y)
of a continuous Markov chain then satisfies
    π(y) = ∫ p(x, y) π(x) dx .
Now we generalize the Metropolis-Hastings algorithm to the case when the state space is continuous instead
of discrete. In this case, let π(x) denote the invariant distribution of a Markov chain, which is the target
density from which samples are desired. Let q(x, y) denote the candidate-generating density, or the
proposal density, meaning that when the process is at the point x, a value y is generated from this density.
The Metropolis-Hastings algorithm is described in terms of the acceptance probability (or probability of
a move) α(x, y):
    α(x, y) = min{ π(y) q(y, x) / [π(x) q(x, y)] , 1 },   if π(x) q(x, y) > 0,
            = 1,                                          otherwise.
The idea is that at a current state X(t) = x, a candidate value for the next state y is generated from
q(x, y); this value is accepted as the next state with probability α(x, y). Transition probabilities for the
chain are then given by the density
    p(x, y) = q(x, y) α(x, y),                 if y ≠ x,
    p(x, x) = 1 − ∫ q(x, t) α(x, t) dt.
The reversibility condition is then
    π(x) p(x, y) = π(y) p(y, x),
and if it is satisfied and p(x, y) leads to an irreducible, aperiodic chain, then π(·) will be the invariant
distribution. These conditions are usually satisfied if q(x, y) is positive on the same support as that of
π(·).
Metropolis-Hastings Algorithm
Step 0. Set n = 0 and start with xn .
Step 1. Generate y from q(xn , ·) and u from U(0, 1).
Step 2. If u ≤ α(xn , y)
            Set xn+1 = y
        Else
            Set xn+1 = xn
Step 3. Set n = n + 1, go to Step 1.
Step 4. Return {x0 , x1 , . . . , xN }.
Example 1: Implement a Metropolis-Hastings algorithm to simulate from the mixture
    .7 N(7, 0.5²) + .3 N(10, 0.5²)
using N(x, 0.1²) as the proposal distribution. For starting values x0 = 0, 7, and 15, run the chain for
10,000 iterations. Plot the sample path of the output for each chain. Change the proposal distribution
to improve the convergence properties of the chain.
It is clear that the target π(x) is the density of the mixture of the two normals above. The proposal
density q(x, y) is the density of N(x, 0.1²), given by
    q(x, y) = (1/(√(2π) (.1))) exp{ −(1/2) ((y − x)/.1)² },
which is symmetric in x and y, implying q(x, y) = q(y, x). Thus the acceptance probability is given by
    α(x, y) = min{ π(y)/π(x) , 1 }.
The following R code was used to generate two paths of the chain for starting values x0 = 0.0 and x0 = 7.0,
respectively; the graphs appear in Figure 1 below.
normal.metro = function(x0, n)
{
  # Metropolis sampler for the mixture .7 N(7, .5^2) + .3 N(10, .5^2)
  # with a N(x, .1^2) proposal; x0 is the starting value, n the chain length
  set.seed(1234, "Mersenne-Twister")
  r = rep(0, n)
  x = x0
  for (k in 1:n)
  {
    u = runif(1)
    y = rnorm(1, x, .1)      # propose y ~ N(x, .1^2)
    if (u < alpha(x, y)) {
      x = y                  # accept the proposal
    }
    # on rejection the chain stays at the current value x
    r[k] = x
  }
  return(r)
}
alpha = function(x, y) {
  # Acceptance probability: min(1, pi(y)/pi(x)) for the normal mixture target
  return( min( 1, (.7*dnorm(y,7,.5) + .3*dnorm(y,10,.5)) /
                  (.7*dnorm(x,7,.5) + .3*dnorm(x,10,.5)) ) )
}
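A sketch of the runs asked for in the exercise (the starting values 0, 7, and 15 come from the problem statement; since normal.metro sets the seed internally, the three chains share the same random inputs):

path0  = normal.metro(0,  10000)
path7  = normal.metro(7,  10000)
path15 = normal.metro(15, 10000)
par(mfrow = c(3, 1))
plot(path0,  type = "l", xlab = "t", ylab = "r(t)")
plot(path7,  type = "l", xlab = "t", ylab = "r(t)")
plot(path15, type = "l", xlab = "t", ylab = "r(t)")
# With proposal sd .1 the chain started at 7 rarely crosses to the mode at 10;
# increasing the sd in rnorm(1, x, .1) (e.g., to .4) improves mixing.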
[Figure 1: Paths of Random Samples from the Normal Mixture MCMC. Two panels of the sample path r(t) against t = 0, . . . , 10000: "Proposal sigma=.1; Starting value x=7" and "Proposal sigma=.4; Starting value x=7".]
Example 2: (Chib and Greenberg, The American Statistician, (1995))
To illustrate the Metropolis algorithm we consider sampling from the bivariate normal distribution N2(µ, Σ),
where
    µ = ( 1 )      and      Σ = ( 1   .9 )
        ( 2 )                   ( .9   1 ).
Note that random variates from the multivariate normal distribution are usually obtained using the Cholesky
factorization Σ = T′T, where T is a unique upper triangular matrix. Generally, a random vector z is generated
from Np(0, I) and is transformed to Np(µ, Σ) using y = µ + T′z.
For applying the Metropolis algorithm to this problem, we will re-state the problem as follows: Suppose that
we want to generate from x ∼ N2(µ, Σ), i.e., π(x) will be the density
    π(x) = (1/(2π |Σ|^{1/2})) exp{ −(1/2)(x − µ)′ Σ⁻¹ (x − µ) },   x ∈ ℜ².
Choose the candidate generating density to be the pdf of y ∼ N2(x, D), where D = diag(0.6, 0.4). Notice that
    q(x, y) = (1/(2π |D|^{1/2})) exp{ −(1/2)(y − x)′ D⁻¹ (y − x) }
is symmetric in x and y, so that the acceptance probability is given by
    α(x, y) = min{ exp[−(1/2)(y − µ)′ Σ⁻¹ (y − µ)] / exp[−(1/2)(x − µ)′ Σ⁻¹ (x − µ)] , 1 },   x, y ∈ ℜ².
Thus the Metropolis algorithm for generating from π(x) can be described as follows:
Step 0. Set n = 0 and start with xn , say (1.2, 1.8)′.
Step 1. Generate y from q(xn , ·) and u from U(0, 1).
Step 2. If u ≤ α(xn , y)
            Set xn+1 = y
        Else
            Set xn+1 = xn
Step 3. Set n = n + 1, go to Step 1.
Step 4. Return {x0 , x1 , x2 , . . .}.
As an exercise, we shall implement R functions to generate from π(·) using both the standard algorithm and the
Metropolis algorithm given above and obtain scatterplots as shown on p.334 of Chib and Greenberg.
Usually, the choice of a proposal density (candidate generating density) is problem specific. However, the
simplest choice for q(x, y), in general, is a random walk. That is, given x, y is generated simply using y = x + z,
where z is independently ∼ U(−k, k) and k is a small value, say k = .1, chosen depending on the starting
value. If the random walk is the choice, q(x, y) is symmetric, so the Metropolis algorithm can be used (instead
of the Metropolis-Hastings version).
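A minimal sketch of such a random-walk Metropolis sampler, applied to the bivariate normal target of Example 2 (the half-width k = .75 is an arbitrary tuning choice, not a value from the notes):

rw.metro = function(n, x0, k)
{
  # Random-walk Metropolis for N2(mu, Sigma) with mu = (1, 2)', Sigma as above
  mu = c(1, 2)
  Sigma.inv = solve(matrix(c(1, .9, .9, 1), 2, 2))
  log.pi = function(x) -0.5 * t(x - mu) %*% Sigma.inv %*% (x - mu)
  out = matrix(0, n, 2)
  x = x0
  for (it in 1:n) {
    y = x + runif(2, -k, k)                            # y = x + z, z ~ U(-k, k)
    if (log(runif(1)) < log.pi(y) - log.pi(x)) x = y   # accept with prob alpha
    out[it, ] = x
  }
  out
}
samples = rw.metro(5000, c(1.2, 1.8), .75)
plot(samples, xlab = "x1", ylab = "x2")   # compare with Chib and Greenberg, p.334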
Monte Carlo Sampling from a Posterior Distribution using the Metropolis-Hastings Algorithm
Recall that in Bayesian applications, we would like to sample from the posterior p(θ|y), where
    p(θ|y) ∝ f(y|θ) π(θ),
i.e., “posterior” is proportional to “data model” × “prior”. This allows us to study the posterior distribution,
or just estimate the posterior mean E(θ|y) empirically, and thus avoid the computation of a complicated
integral. The data model is usually the joint density of the observations (i.e., the likelihood function).
Note that if π(·) is a conjugate prior, then p(θ|y) can be obtained in closed form.
To sample from p(θ|y) we will restate the Metropolis-Hastings algorithm in the following form. Let the
acceptance probability of moving from θn to θ* be α(θn , θ*), where
    α(θn , θ*) = min{ p(θ*|y) q(θ*, θn) / [p(θn|y) q(θn , θ*)] , 1 },
where p(θ|y) is the posterior and corresponds to π(·) in the original description of the M-H algorithm, and
q(θn , θ) corresponds to the candidate-generating density.
Metropolis-Hastings algorithm for sampling from a posterior
Initialize n = 0 and θn
Repeat {
    Sample θ* from q(θn , θ)
    Sample u from U(0, 1)
    If u ≤ α(θn , θ*) then
        set θn+1 = θ*
    Else
        set θn+1 = θn
    Set n = n + 1
}
The implementation of this algorithm for a real problem of sampling from the posterior distribution of a parameter
is discussed next. The data set used consists of measurements of the weight of a block of metal called NB10, used
as a standard for 10 g, made by the National Bureau of Standards (now NIST) annually to a high degree of
accuracy. See the pages attached at the end of this note. The problem is to estimate the variance of these data using
a Gaussian model, i.e., yi | µ, σ² ∼ N(µ, σ²), and a prior (µ, σ²) ∼ h(µ, σ²). First, some implementation concerns.
Practical Issues
1. Choosing Initial Values
Metropolis-Hastings requires you to pick just a single initial value θ0, and in many cases this one value may
suffice. It is recommended that you select a value near the center of the posterior from which you are trying
to simulate. This will increase the possibility of the Markov chain reaching the invariant distribution
reasonably quickly. This value could be obtained from any information you have about the posterior, such as
a good estimate of θ like the maximum likelihood estimate. From a practical viewpoint, a problem with
starting from just a single value is that we will not know in advance whether the chain is mixing well,
i.e., reaching all areas of positive probability of the posterior distribution. For example, if the posterior is
multimodal and the chain is started near one of the modes, it may not find the other modes. One strategy to
overcome this is to use several different initial values.
2. Choosing a Convergence Monitoring Strategy
We have two issues to deal with:
• how to decide if the chain has reached equilibrium.
• how to monitor the output from that point onwards to obtain the posterior summaries.
If you start from quite a bit away from the true posterior, then the output will be similar to the one
shown in the time series plot of the Gaussian model for the NB10 data. It is clear from that graphic that
the chain is not mixing well: there are long periods where it does not move at all. This is caused by the
large first-order autocorrelations.
A solution for this problem is to allow a burn-in period of nB iterations, i.e., discard, say, the first 1000 (or 5000)
values output and then start observing the time series plot. After burn-in, monitor the output for a larger
number of iterations, up to, say, 25,000 to 100,000 iterations. One could also use thinning, i.e., retaining
only every 100th or 200th value of the chain, thus reducing the autocorrelation to virtually zero.
From this part one could estimate posterior means and standard errors, obtain plots of histograms and
density traces of the marginal posteriors, or estimate the posterior covariance matrix, etc.
3. Choosing a Candidate Generating Density (CGD)
This is a very difficult problem since Metropolis-Hastings will work for many choices of CGDs. However,
one may want to select a q(x, y) that results in a chain that mixes well.
One strategy is to pick a CGD such that, on the average, a move to the left or the right is equally likely.
That is, E(θ*|θt) = θt, where θ* represents a new move and θt is the current value. The use of this strategy
is illustrated below for sampling from the posterior variance of the Gaussian model for the NB10 data.
For the NB10 data, pretend that µ for the data distribution is known (assume that it is equal to the sample
mean 404.59), so that the problem reduces to one of studying the posterior distribution of a single parameter,
σ².
    Prior:        σ² ∼ SI-χ²(νp , σp²)
    Data Model:   yi | σ² ∼ N(µ, σ²)  i.i.d.,   i = 1, . . . , n
The problem then is to sample from the posterior p(σ²|y) using MC.
Note 1: For this problem, we shall ignore the fact that the exact posterior distribution can be derived
theoretically:
    σ² | y ∼ SI-χ²( νp + n , (νp σp² + n s*²)/(νp + n) ),
where s*² = (1/n) Σ_{i=1}^n (yi − µ)².
Note 2: We use the short-hand notation SI-χ²(νp , σp²) for the “Scaled Inverse-χ²” distribution that is often
used as the conjugate prior for the variance parameter.
To use the M-H algorithm to get MC samples from the posterior distribution p(σ²|y), we need to consider the
following implementation details for writing the needed R functions.
Implementation Details of Sampling from the Posterior Distribution of the Variance for the NB10 Data
1. Selection of an appropriate candidate generating density (CGD). Since we know that the prior is
   SI-χ²(νp , σp²), we might consider the CGD q(σn², σ²) to be the density of SI-χ²(ν1, σ1²) for some ν1, σ1², where
   ν1 is the degrees of freedom parameter and σ1² is the scale parameter. Note that the density function for
   this distribution is
       π(σ² | ν1, σ1²) = c (σ1²)^{ν1/2} (σ²)^{−(ν1/2+1)} exp( −ν1 σ1²/(2σ²) ),
   with mean
       ν1 σ1²/(ν1 − 2)   for ν1 > 2.
   If the strategy of choosing a CGD such that E(θ|θn) = θn is adopted (as discussed earlier; this implies that
   the average of the moves is the current value), then σ1² needs to be selected so that
       E(σ² | σn²) = σn².
   This can be done by selecting σ1² = ((ν1 − 2)/ν1) σn², since in that case
       E(σ² | σn²) = (ν1/(ν1 − 2)) σ1² = σn².
   Thus, the distribution
       q(σn², σ²) ≡ SI-χ²( ν1 , ((ν1 − 2)/ν1) σn² )
   is the CGD chosen, with ν1 being a “tuning” constant that can be varied to improve mixing of the chain.
2. If X ∼ χ²(ν), then Y = νσ²/X ∼ SI-χ²(ν, σ²). To generate a random variate σ² from the scaled-inverse
   chi-squared distribution SI-χ²(ν, σ²), generate x from χ²(ν) and set σ² = νσ²/x.
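A one-line numerical check of this recipe (ν = 10 and σ² = 4 are arbitrary illustrative values):

# SI-chisq(nu, sigma2) variates via the chi-squared recipe; their sample mean
# should be near nu * sigma2 / (nu - 2) = 10 * 4 / 8 = 5.
draws = 10 * 4 / rchisq(10000, 10)
mean(draws)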
3. Because of the form of α(σn², σ*²), it is convenient to compute it as exp(log(α)). This involves computing
   the log posterior and log CGD densities each time through the loop in the M-H algorithm:
       log(α) = log(posterior(σ*²)) + log(CGD(σ*², σn²)) − log(posterior(σn²)) − log(CGD(σn², σ*²)).
4. Note that log(posterior) = log(prior) + log(likelihood), where
       log(prior) = log[h(σ² | νp , σp²)]
                  = c1 − (νp/2 + 1) log(σ²) − νp σp²/(2σ²),
       log(likelihood) = log[ℓ(σ² | y)]
                       = c2 − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (yi − µ)²,
   since ℓ(σ² | y) = Π_{i=1}^n (2πσ²)^{−1/2} exp[ −(yi − µ)²/(2σ²) ]. Note that the constants c1 and c2 cancel
   out in computing log(α) above, so they need not be exactly determined.
5. Also note that
       log(CGD(σn², σ²)) = log[q(σn², σ²)]
                         = c3 + (ν1/2) log(σn²) − (ν1/2 + 1) log(σ²) − (ν1 − 2) σn²/(2σ²)
   after some simplification. Again, c3 cancels out in computing the log acceptance ratio log(α), although it
   depends on ν1.
6. In the R functions supplied, the arguments are in the order shown below:
       generate.CGD(ν, σ²): generate from SI-χ²(ν, σ²)
       log.prior(σ², νp , σp²): compute log π(σ²)
       log.lik(σ², y, µ): compute the log likelihood
       log.post(σ², y, µ, νp , σp²): compute the log posterior
       log.CGD(σn², ν1, σ²): compute the log proposal density
       alpha(σn², σ*², y, µ, νp , σp², ν1): compute the acceptance probability
       MH.normal.variance(y, µ, νp , σp², σ0², ν1, nB , nM , nT , seed, output.file.prefix)
#------------------------------------------------------------
# R functions to do Metropolis-Hastings sampling
# for the NB10 data
#
# prior:       sigma2 ~ SI-chisq( nu.p, sigma2.p )
# data model:  ( y_i | sigma2 ) ~IID N( mu, sigma2 ), i = 1, ..., n
#------------------------------------------------------------
MH.normal.variance = function( y, mu, nu.p, sigma2.p, sigma2.0,
                               nu.star, n.burnin, n.monitor, n.thin, seed ) {
  # Main routine: writes the thinned post-burn-in iterates to "nb10.output"
  # and returns the post-burn-in acceptance rate
  sigma2.old = sigma2.0
  R = 0
  set.seed( seed )
  for ( i in 1:n.monitor ) {
    # propose sigma2.star from the CGD centered at the current value
    sigma2.star = generate.CGD( nu.star, ( nu.star - 2 ) * sigma2.old / nu.star )
    u = runif( 1 )
    b = ( u <= alpha( sigma2.old, sigma2.star, y, mu, nu.p, sigma2.p, nu.star ) )
    sigma2.new = sigma2.star * b + sigma2.old * ( 1 - b )
    if ( i > n.burnin ) R = R + b
    if ( ( i > n.burnin ) & ( ( i - n.burnin ) %% n.thin == 0 ) )
      write( c( ( i - n.burnin ) / n.thin, signif( sigma2.new, digits = 5 ) ),
             file = "nb10.output", ncol = 2, append = T )
    sigma2.old = sigma2.new   # update the current state of the chain
  }
  return( R / ( n.monitor - n.burnin ) )
}
#------------------------------------------------------------
generate.CGD = function( nu, sigma2 ) {
  # candidate generating distribution: one SI-chisq( nu, sigma2 ) variate
  return( nu * sigma2 / rchisq( 1, nu ) )
}
#------------------------------------------------------------
alpha = function( sigma2.old, sigma2.star, y, mu, nu.p, sigma2.p, nu.1 ) {
  # Acceptance probability calculation (on the log scale)
  return( min( 1, exp( log.post( sigma2.star, y, mu, nu.p, sigma2.p ) +
                       log.CGD( sigma2.star, nu.1, sigma2.old ) -
                       log.CGD( sigma2.old, nu.1, sigma2.star ) -
                       log.post( sigma2.old, y, mu, nu.p, sigma2.p ) ) ) )
}
#------------------------------------------------------------
log.post = function( sigma2, y, mu, nu.p, sigma2.p ) {
  # log( posterior ) calculation
  return( log.lik( sigma2, y, mu ) + log.prior( sigma2, nu.p, sigma2.p ) )
}
#------------------------------------------------------------
log.lik = function( sigma2, y, mu ) {
  # log( likelihood ) calculation
  n = length( y )
  return( ( - n / 2 ) * log( sigma2 ) - sum( ( y - mu )^2 ) / ( 2 * sigma2 ) )
}
#------------------------------------------------------------
log.prior = function( sigma2, nu.p, sigma2.p ) {
  # log( prior ) calculation
  return( ( -1 - nu.p / 2 ) * log( sigma2 ) - nu.p * sigma2.p / ( 2 * sigma2 ) )
}
#------------------------------------------------------------
log.CGD = function( sigma2.old, nu.1, sigma2 ) {
  # log( candidate generating density ) calculation
  return( ( nu.1 / 2 ) * log( sigma2.old ) - ( 1 + nu.1 / 2 ) * log( sigma2 ) -
          ( nu.1 - 2 ) * sigma2.old / ( 2 * sigma2 ) )
}
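A hypothetical call (the data vector nb10 and all hyperparameter and tuning values below are placeholders for illustration, not values from the notes):

# assume nb10 holds the NB10 measurements
acc = MH.normal.variance( y = nb10, mu = 404.59, nu.p = 2.5, sigma2.p = 40,
                          sigma2.0 = 40, nu.star = 5, n.burnin = 1000,
                          n.monitor = 26000, n.thin = 100, seed = 4711 )
acc                                         # post-burn-in acceptance rate
post = read.table( "nb10.output" )[ , 2 ]   # thinned draws of sigma2
c( mean( post ), sd( post ) )               # posterior summaries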
Gibbs Sampler
If π(·) is a multivariate target distribution, e.g., π(x), then the entire vector x may be updated all at once by
generating y from a proposal density q(x, y) using the Metropolis-Hastings algorithm. Instead, the updating
may be done componentwise, where the components of x may be of any dimension. For the purpose of discussion,
consider all components of x to be of single dimension, i.e., x = (x1, x2, . . . , xk). Each of these components is
then updated one by one sequentially in separate Metropolis-Hastings steps. For example, at the ith step, yi is
generated from the proposal density qi(xi, yi), where qi depends on the current value of xi and may depend on
any of the other components of x, namely x−i = (x1, . . . , xi−1, xi+1, . . . , xk), as well. The candidate yi is
accepted with acceptance probability
    αi(xi, yi) = min{ πi(yi) qi(yi, xi) / [πi(xi) qi(xi, yi)] , 1 }.
If yi is accepted, set the ith component of xn to xn,i = yi; otherwise xn,i retains its current value. The remaining
components of xn are not changed in step i. This is repeated for i = 1, . . . , k, at the end of which the entire
vector xn has been updated.
The above is called a single-component Metropolis-Hastings algorithm. Here πi(xi), called the full conditional
distribution of xi, is the distribution of the ith component of x conditioning on all remaining components of x:
    πi(xi) = π(x) / ∫ π(x) dxi .
Here we are using the result that a joint density (i.e., π(x)) is uniquely determined by the set of full conditionals
πi(xi), i = 1, . . . , k.
A special single-component Metropolis-Hastings algorithm is the Gibbs sampler. For the Gibbs sampler, the
proposal distribution for updating the ith component of x is
    qi(xi, yi) = πi(yi),
where πi(yi) is the full conditional distribution of yi with respect to π(·). That is, yi is generated from πi(yi). If
qi(xi, yi) above is substituted in the expression for αi(xi, yi), it turns out to be equal to 1; i.e., Gibbs sampler
candidates are always accepted. Thus Gibbs sampling consists of sampling from the full conditionals of the target
distribution.
Example 1:
Consider generating bivariate random variables from the density
    f(x, y) = (n choose x) y^{x+α−1} (1 − y)^{n−x+β−1},   for x = 0, 1, . . . , n and 0 < y ≤ 1.
It can be shown that
    f(x|y) ∝ (n choose x) y^x (1 − y)^{n−x},
i.e., X | (Y = y) ∼ Bin(n, y). Similarly,
    f(y|x) ∝ y^{x+α−1} (1 − y)^{n−x+β−1},
i.e., Y | (X = x) ∼ Beta(x + α, n − x + β). The Gibbs sampler for generating bivariate samples from f(x, y) is then:
    for i = 1, . . . , N repeat
        1. generate yi from Beta(xi−1 + α, n − xi−1 + β)
        2. generate xi from Bin(n, yi)
        3. return (xi, yi)
The stationary or equilibrium distribution of the pairs (xi, yi) is f(x, y) given above. It can be shown that the
global balance condition ∫ p(x, y) π(x) dx = π(y) holds in this case.
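A minimal R sketch of this Gibbs sampler (the values N = 5000, n = 16, α = 2, and β = 4 are illustrative choices, not from the notes):

beta.binomial.gibbs = function(N, n, alpha, beta, x0 = 0)
{
  # Gibbs sampler for f(x, y) = choose(n, x) y^(x+alpha-1) (1-y)^(n-x+beta-1)
  x = numeric(N); y = numeric(N)
  x.curr = x0
  for (i in 1:N) {
    y[i] = rbeta(1, x.curr + alpha, n - x.curr + beta)  # y | x full conditional
    x[i] = rbinom(1, n, y[i])                           # x | y full conditional
    x.curr = x[i]
  }
  cbind(x = x, y = y)
}
draws = beta.binomial.gibbs(N = 5000, n = 16, alpha = 2, beta = 4)
table(draws[, "x"]) / nrow(draws)   # marginal of x: beta-binomial(n, alpha, beta)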
Hierarchical Models
Suppose we have a data model f(y|θ) and a prior distribution of θ with density g(θ|λ), which depends on a
parameter λ that is itself an unknown random variable. Let the distribution, called the hyperprior, of the
hyperparameter λ have density π(λ). We wish to obtain the posterior p(θ|y). But f(y|θ) g(θ|λ) ∝ posterior
of θ given y and λ. Thus
    p(θ|y) = f(y, θ) / ∫ f(y|θ) g(θ) dθ = f(y, θ)/h(y),
where h(y) is the marginal distribution of y. Since the joint density
    f(y, θ, λ) = f(y|θ, λ) f(θ, λ) = f(y|θ) g(θ|λ) π(λ)
and
    f(y, θ) = ∫ f(y, θ, λ) dλ,
we have that
    p(θ|y) = ∫ f(y, θ, λ) dλ / h(y)                                        (3)
           ∝ f(y|θ) × ∫ g(θ|λ) π(λ) dλ                                     (4)
           = model × (marginal) prior on θ.
If the posterior of λ, p(λ|y), is needed,
    p(λ|y) = f(y, λ)/h(y) = ∫ f(y, θ, λ) dθ / h(y)
           ∝ [ ∫ f(y|θ) g(θ|λ) dθ ] · π(λ)                                 (5)
           = mixed model × prior on λ.
The prior g(θ|λ) just “mixes” the model f(y|θ) over the values of θ, giving a mixed model independent of θ.
In summary, once you have determined which posterior distribution you need for inference, the rest is easy.
Go back to the joint density f(y, θ, λ) and integrate out the appropriate variables. When the posterior can
be obtained in closed form, the prior is said to be a conjugate prior. Thus if it is known that a prior is
conjugate, the posterior can be obtained by inspection. The same applies for obtaining full conditionals. The
harder problem is to evaluate the integrals in closed form when the priors are not conjugate. See below for an
application of Gibbs sampler methods to solve a problem in hierarchical modelling.
Example 2:
Consider a random sample y = (y1, . . . , yn) from a normal density with mean µ and variance τ⁻¹. Suppose that
µ is subject to a normal prior with mean 0 and variance ω⁻¹, and τ is subject to a gamma prior with shape
parameter α and scale parameter β. Given that the two priors are independent, the joint density of data and
parameters is
    (2π)^{−(n+1)/2} τ^{n/2} e^{−(τ/2) Σ_{i=1}^n (yi−µ)²} · ω^{1/2} e^{−(ω/2)µ²} · (1/(Γ(α) β^α)) τ^{α−1} e^{−τ/β}.
Gibbs sampling from the joint posterior distribution of µ and τ requires the conditional density of µ given y and
τ, say h(µ|y, τ), and the conditional density of τ given y and µ, say g(τ|y, µ).
Because of the choice of conjugate priors the conditional densities are easily obtained. Note then that
    h(µ|y, τ) ∝ e^{−(τ/2) Σ_{i=1}^n (yi−µ)² − (ω/2)µ²}
              ∝ e^{−(τ/2)(−2µ Σ_{i=1}^n yi + nµ²) − (ω/2)µ²}
              ∝ e^{−((nτ+ω)/2)µ² + nȳτµ}
              ∝ e^{−(µ − nȳτ/(nτ+ω))² / (2(nτ+ω)⁻¹)}.
This implies that µ given y and τ is normally distributed with mean nτȳ/(ω + nτ) and variance 1/(ω + nτ),
where ȳ is the sample mean (1/n) Σ_{i=1}^n yi. Further,
    g(τ|y, µ) ∝ τ^{n/2} e^{−(τ/2) Σ_{i=1}^n (yi−µ)²} · τ^{α−1} e^{−τ/β}
              ∝ τ^{n/2+α−1} e^{−τ( Σ_{i=1}^n (yi−µ)²/2 + 1/β )}.
Thus, τ given y and µ is gamma distributed with shape parameter n/2 + α and scale parameter 1/(n sn²/2 + 1/β),
where sn² is the sample variance (1/n) Σ_{i=1}^n (yi − µ)². The Gibbs sampling scheme is easily constructed using
specified values of ω, α, and β.
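A minimal R sketch of this Gibbs sampler (the simulated data and the values of ω, α, and β are illustrative choices):

normal.gibbs = function(y, omega, alpha, beta, N, tau0 = 1)
{
  # Gibbs sampler for (mu, tau) under the normal and gamma priors above
  n = length(y); ybar = mean(y)
  mu = numeric(N); tau = numeric(N)
  tau.curr = tau0
  for (i in 1:N) {
    # mu | y, tau ~ N( n tau ybar/(omega + n tau), 1/(omega + n tau) )
    mu[i] = rnorm(1, n * tau.curr * ybar / (omega + n * tau.curr),
                  sqrt(1 / (omega + n * tau.curr)))
    # tau | y, mu: gamma with shape n/2 + alpha, rate sum((y - mu)^2)/2 + 1/beta
    tau.curr = rgamma(1, shape = n/2 + alpha,
                      rate = sum((y - mu[i])^2)/2 + 1/beta)
    tau[i] = tau.curr
  }
  cbind(mu = mu, tau = tau)
}
y = rnorm(50, 5, 2)                        # illustrative data
draws = normal.gibbs(y, omega = .01, alpha = 2, beta = 1, N = 5000)
colMeans(draws[-(1:500), ])                # posterior means after a burn-in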
References
Chib, S., and Greenberg, E. (1995), “Understanding the Metropolis-Hastings Algorithm,” The American Statistician, 49, 327–335.
Gelfand, A. E., and Smith, A. F. M. (1990), “Sampling-Based Approaches to Calculating Marginal Densities,” Journal of the American Statistical Association, 85, 398–409.
Gelman, A. (1992), “Iterative and Non-Iterative Simulation Algorithms,” in Computing Science and Statistics (Interface Proceedings), 24, 433–438.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman & Hall.
Gelman, A., and Rubin, D. B. (1992), “Inference from Iterative Simulation Using Multiple Sequences” (with discussion), Statistical Science, 7, 457–511.
Geman, S., and Geman, D. (1984), “Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Geweke, J. (1989), “Bayesian Inference in Econometric Models Using Monte Carlo Integration,” Econometrica, 57, 1317–1340.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (Eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Hastings, W. K. (1970), “Monte Carlo Sampling Methods Using Markov Chains and Their Applications,” Biometrika, 57, 97–109.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953), “Equations of State Calculations by Fast Computing Machines,” Journal of Chemical Physics, 21, 1087–1092.
Smith, A. F. M., and Roberts, G. O. (1993), “Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods,” Journal of the Royal Statistical Society, Ser. B, 55, 3–24.
Tanner, M. A., and Wong, W. H. (1987), “The Calculation of Posterior Distributions by Data Augmentation,” Journal of the American Statistical Association, 82, 528–549.
Tierney, L. (1994), “Markov Chains for Exploring Posterior Distributions” (with discussion), Annals of Statistics, 22, 1701–1762.