Stat 543 A VERY Brief Introduction to Markov Chain Monte Carlo

The object here is to describe algorithms for generating a sequence of parameter vectors $\theta^1, \theta^2, \theta^3, \ldots$ whose long run empirical (relative frequency) distribution can approximate a (posterior) distribution specified by a density proportional to the product of a likelihood and a prior, $L(\theta)\,g(\theta)$ (for $\theta$ a $k$-dimensional vector). (Of course, the product $L(\theta)\,g(\theta)$ could be replaced by a general $h(\theta)$ specifying a distribution up to a normalizer in contexts other than Bayes analysis.)
This is intended as an "introductory operational" presentation. It does
not address the theoretical questions of why or when these algorithms work
or what versions of them might be best (in terms of quickly producing a set
of iterates representative of the target distribution). Neither does it address
practical/applied issues of detecting potential problems in MCMC simulations,
determining when any "transient"/"start-up" effects have been "washed out"
and it makes sense to begin accumulating iterates for subsequent analysis, or
deciding how many iterations to use. These all are matters for other courses.
Successive Substitution Sampling (Gibbs Sampling)
Abbreviate (for iterates $\theta^i$), for each $j = 1, 2, \ldots, k$,
$$\theta^{i,<j} = \left(\theta_{i,1}, \theta_{i,2}, \ldots, \theta_{i,j-1}\right)$$
and
$$\theta^{i,>j} = \left(\theta_{i,j+1}, \theta_{i,j+2}, \ldots, \theta_{i,k}\right)$$
Begin with $\theta^0$ (possibly generated from an approximation to $L(\theta)\,g(\theta)$ or from the prior $g(\theta)$). With $\theta^i$ in hand, generate $\theta^{i+1}$ as follows. For each $j = 1, 2, \ldots, k$, generate $\theta_{i+1,j}$ from the density proportional to
$$L\left(\theta^{i+1,<j}, \theta_j, \theta^{i,>j}\right) g\left(\theta^{i+1,<j}, \theta_j, \theta^{i,>j}\right)$$
(that is, hold all entries of $\theta$ except the $j$th at their current iterate values, and generate a replacement for the current value, $\theta_{i,j}$, from the resulting density).
Sometimes one can see by inspection how to do this directly. Sometimes one can use the rejection algorithm to do this. Other times clever new tricks are needed. But when this all can be done, under appropriate conditions this can produce a sequence with the ergodicity property.
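To make the update loop concrete, here is a minimal sketch (not from the notes; the function name and the target are illustrative) for the simplest nontrivial case, a bivariate normal target with correlation $\rho$, where each full conditional is $\mathrm{N}(\rho\,\theta_{\text{other}},\, 1 - \rho^2)$ and can be sampled directly:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, rng):
    """Successive substitution sampling for a bivariate normal target
    with zero means, unit variances, and correlation rho.  Each full
    conditional is N(rho * other coordinate, 1 - rho^2)."""
    theta = np.zeros(2)            # theta^0
    draws = np.empty((n_iter, 2))
    sd = np.sqrt(1.0 - rho ** 2)   # conditional standard deviation
    for i in range(n_iter):
        # update coordinate 1 given the current coordinate 2
        theta[0] = rng.normal(rho * theta[1], sd)
        # update coordinate 2 given the *new* coordinate 1
        theta[1] = rng.normal(rho * theta[0], sd)
        draws[i] = theta
    return draws

rng = np.random.default_rng(0)
draws = gibbs_bivariate_normal(0.8, 20000, rng)
```

Discarding some initial iterates, the empirical means and correlation of the `draws` should be close to the target's ($0$, $0$, and $0.8$).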
Example 1 Suppose parameters $\mu_1, \mu_2$ both have values in $\mathbb{R}$ and $c \in (0,1)$. Given these parameters suppose further that $(X, Y)$ has joint distribution specified by
$$X \sim \mathrm{U}(0,1)$$
and conditioned on the value of $X$,
$$Y \sim \mathrm{N}(\mu_1, 1) \quad \text{if } X < c$$
$$Y \sim \mathrm{N}(\mu_2, 1) \quad \text{if } X \geq c$$
The conditional mean of $Y$ given $X = x$ is the step function
$$E[Y \mid X = x] = \begin{cases} \mu_1 & \text{if } x < c \\ \mu_2 & \text{if } x \geq c \end{cases}$$
and based on $n$ iid observations $(X_l, Y_l)$, the likelihood is
$$L(\mu_1, \mu_2, c) = \left(\frac{1}{\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2}\left(\sum_{x_l < c} (y_l - \mu_1)^2 + \sum_{x_l \geq c} (y_l - \mu_2)^2\right)\right) I[0 < c < 1]$$
Consider then the use of a prior distribution of independence specified by
$$c \sim \mathrm{U}(0,1), \quad \mu_1 \sim \mathrm{N}\left(0, \sigma^2\right), \quad \mu_2 \sim \mathrm{N}\left(0, \sigma^2\right)$$
so that
$$g(\mu_1, \mu_2, c) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{\mu_1^2 + \mu_2^2}{2\sigma^2}\right) I[0 < c < 1]$$
One "Gibbs" (successive substitution sampling) algorithm is then as follows. With iterate $(\mu_{1,i}, \mu_{2,i}, c_i)$ in hand, one may

1. update $\mu_{1,i}$ to $\mu_{1,i+1}$ sampling from the
$$\mathrm{N}\left(\bar{y}_1 \frac{n_1 \sigma^2}{n_1 \sigma^2 + 1}, \frac{\sigma^2}{n_1 \sigma^2 + 1}\right)$$
distribution, where $\bar{y}_1$ is the sample mean of the $y_l$ for $x_l$'s that are less than $c_i$ and $n_1 = \#[x_l < c_i]$,

2. update $\mu_{2,i}$ to $\mu_{2,i+1}$ sampling from the
$$\mathrm{N}\left(\bar{y}_2 \frac{n_2 \sigma^2}{n_2 \sigma^2 + 1}, \frac{\sigma^2}{n_2 \sigma^2 + 1}\right)$$
distribution, where $\bar{y}_2$ is the sample mean of the $y_l$ for $x_l$'s that are at least as large as $c_i$ and $n_2 = \#[x_l \geq c_i]$, and

3. update $c_i$ to $c_{i+1}$ sampling from a density on $(0,1)$ that is constant between ordered $x$ values $x_{(l)}$, with value proportional to
$$h_0 = \exp\left(-\frac{1}{2} \sum_l \left(y_l - \mu_{2,i+1}\right)^2\right)$$
on $\left(0, x_{(1)}\right)$, with value proportional to
$$h_m = \exp\left(-\frac{1}{2}\left(\sum_{x_l \leq x_{(m)}} \left(y_l - \mu_{1,i+1}\right)^2 + \sum_{x_l \geq x_{(m+1)}} \left(y_l - \mu_{2,i+1}\right)^2\right)\right)$$
on $\left(x_{(m)}, x_{(m+1)}\right)$ for $1 \leq m \leq n - 1$, and with value proportional to
$$h_n = \exp\left(-\frac{1}{2} \sum_l \left(y_l - \mu_{1,i+1}\right)^2\right)$$
on $\left(x_{(n)}, 1\right)$. That is, with
$$H = h_0 x_{(1)} + \sum_{m=1}^{n-1} h_m \left(x_{(m+1)} - x_{(m)}\right) + h_n \left(1 - x_{(n)}\right)$$
update $c_i$ to $c_{i+1}$ sampling from a density that is $h_0/H$ on $\left(0, x_{(1)}\right)$, $h_m/H$ on $\left(x_{(m)}, x_{(m+1)}\right)$ for $1 \leq m \leq n - 1$, and $h_n/H$ on $\left(x_{(n)}, 1\right)$.
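The three updates above can be sketched in Python as follows. The function name, the simulated data, and the choice `sigma2 = 10` are illustrative assumptions; the third step is implemented by picking an interval between ordered $x$'s with probability proportional to its height times its width, then drawing a uniform value within it:

```python
import numpy as np

def gibbs_changepoint(x, y, sigma2, n_iter, rng):
    """Sketch of the Example 1 SSS algorithm.  Priors: mu1, mu2 iid
    N(0, sigma2) and c ~ U(0,1); data y_l are N(mu1, 1) when x_l < c
    and N(mu2, 1) otherwise."""
    n = len(x)
    order = np.argsort(x)
    ys = y[order]
    knots = np.concatenate(([0.0], x[order], [1.0]))   # 0, x_(1..n), 1
    widths = np.diff(knots)
    mu1, mu2, c = 0.0, 0.0, 0.5
    out = np.empty((n_iter, 3))
    for i in range(n_iter):
        # steps 1 and 2: conjugate normal draws given the current c
        below = x < c
        n1 = below.sum()
        mu1 = rng.normal(y[below].sum() * sigma2 / (n1 * sigma2 + 1),
                         np.sqrt(sigma2 / (n1 * sigma2 + 1)))
        n2 = n - n1
        mu2 = rng.normal(y[~below].sum() * sigma2 / (n2 * sigma2 + 1),
                         np.sqrt(sigma2 / (n2 * sigma2 + 1)))
        # step 3: piecewise-constant density for c; log h_m adds the
        # mu1 sum over x <= x_(m) and the mu2 sum over x >= x_(m+1)
        cs1 = np.concatenate(([0.0], np.cumsum((ys - mu1) ** 2)))
        cs2 = np.concatenate(([0.0], np.cumsum((ys - mu2) ** 2)))
        logh = -0.5 * (cs1 + (cs2[-1] - cs2))
        mass = np.exp(logh - logh.max()) * widths
        m = rng.choice(n + 1, p=mass / mass.sum())
        c = rng.uniform(knots[m], knots[m + 1])
        out[i] = (mu1, mu2, c)
    return out

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(size=n)
y = np.where(x < 0.4, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))
draws = gibbs_changepoint(x, y, sigma2=10.0, n_iter=3000, rng=rng)
```

With well-separated means, the iterates settle quickly near the data-generating values ($\mu_1 \approx -2$, $\mu_2 \approx 2$, $c \approx 0.4$ here).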
Example 2 For $f(y \mid \mu)$ the $\mathrm{N}(\mu, 1)$ density, consider Bayes analysis for a sample $Y_1, Y_2, \ldots, Y_n$ iid from the mixture distribution specified by density
$$(1 - \alpha) f(y \mid \mu_0) + \alpha f(y \mid \mu_1)$$
for $\alpha \in (0,1)$ and real $\mu_0 < \mu_1$. The likelihood here is nasty, but (in a way related to how one approaches maximization of the likelihood through the EM algorithm) it's possible to apply a kind of "latent" (unobserved) or "auxiliary" variable argument to do MCMC from a posterior.
That is, consider a model for $X = (W, Y)$ where
$$W \sim \mathrm{Ber}(\alpha)$$
$$Y \mid W \sim \mathrm{N}(\mu_W, 1)$$
In this model for $X$, the marginal distribution for $Y$ is exactly the mixture distribution of interest. An iid sample $X_1, X_2, \ldots, X_n$ produces likelihood
$$L_x(\mu_0, \mu_1, \alpha) = \prod_{l=1}^n \left((1 - \alpha) f(y_l \mid \mu_0)\right)^{1 - w_l} \left(\alpha f(y_l \mid \mu_1)\right)^{w_l}$$
Then with, e.g., independent priors for $(\mu_0, \mu_1)$ and $\alpha$, with
$$\alpha \sim \mathrm{U}(0,1)$$
and $(\mu_0, \mu_1)$ bivariate normal with mean $0$ and covariance matrix $\sigma^2 I$, conditioned on $\mu_0 < \mu_1$, so that the prior pdf is
$$2\, I[\mu_0 < \mu_1]\, \frac{1}{2\pi\sigma^2} \exp\left(-\frac{\mu_0^2 + \mu_1^2}{2\sigma^2}\right) I[0 < \alpha < 1]$$
an SSS algorithm for
$$\theta = (\alpha, W_1, W_2, \ldots, W_n, \mu_0, \mu_1)$$
(the parameters and latent variables) can be produced fairly easily.
Suppose that $\theta^0$ is some starting value. With $\theta^i = \left(\alpha_i, w_{1,i}, w_{2,i}, \ldots, w_{n,i}, \mu_{0,i}, \mu_{1,i}\right)$ in hand, one may
1. update $\alpha_i$ to $\alpha_{i+1}$ sampling from the
$$\mathrm{Beta}\left(\sum_{l=1}^n w_{l,i} + 1,\; n - \sum_{l=1}^n w_{l,i} + 1\right)$$
distribution,

2. update each $w_{l,i}$ to $w_{l,i+1}$ sampling from
$$\mathrm{Ber}\left(\frac{\alpha_{i+1} f(y_l \mid \mu_{1,i})}{\alpha_{i+1} f(y_l \mid \mu_{1,i}) + (1 - \alpha_{i+1}) f(y_l \mid \mu_{0,i})}\right)$$

3. update $\mu_{0,i}$ to $\mu_{0,i+1}$ sampling from a density proportional to
$$I[\mu_0 < \mu_{1,i}] \exp\left(-\frac{1}{2}\left(\frac{\mu_0^2}{\sigma^2} + \sum_{l \text{ such that } w_{l,i+1} = 0} (y_l - \mu_0)^2\right)\right)$$
which (for $\bar{y}_0$ the sample mean of the $y_l$ for $w_{l,i+1}$'s that are $0$ and $n_0 = \#\left[w_{l,i+1} = 0\right]$) is a truncated (above at $\mu_{1,i}$)
$$\mathrm{N}\left(\bar{y}_0 \frac{n_0 \sigma^2}{n_0 \sigma^2 + 1}, \frac{\sigma^2}{n_0 \sigma^2 + 1}\right)$$
density, and

4. update $\mu_{1,i}$ to $\mu_{1,i+1}$ sampling from a density proportional to
$$I[\mu_1 > \mu_{0,i+1}] \exp\left(-\frac{1}{2}\left(\frac{\mu_1^2}{\sigma^2} + \sum_{l \text{ such that } w_{l,i+1} = 1} (y_l - \mu_1)^2\right)\right)$$
which (for $\bar{y}_1$ the sample mean of the $y_l$ for $w_{l,i+1}$'s that are $1$ and $n_1 = \#\left[w_{l,i+1} = 1\right]$) is a truncated (below at $\mu_{0,i+1}$)
$$\mathrm{N}\left(\bar{y}_1 \frac{n_1 \sigma^2}{n_1 \sigma^2 + 1}, \frac{\sigma^2}{n_1 \sigma^2 + 1}\right)$$
density.

Upon generating a sequence of realizations $\left(\alpha_i, w_{1,i}, w_{2,i}, \ldots, w_{n,i}, \mu_{0,i}, \mu_{1,i}\right)$, the sub-vectors $\left(\alpha_i, \mu_{0,i}, \mu_{1,i}\right)$ can be used to approximate the posterior for $(\alpha, \mu_0, \mu_1)$.
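The four updates above can be sketched as follows. The function names and the simulated data are illustrative assumptions, and a naive rejection draw stands in for exact truncated normal sampling (adequate here, since once the components separate the ordering constraint is rarely binding):

```python
import numpy as np

def f(y, mu):
    """N(mu, 1) density."""
    return np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2.0 * np.pi)

def trunc_normal(mean, var, lo, hi, rng):
    """Naive rejection draw from a normal restricted to (lo, hi)."""
    while True:
        z = rng.normal(mean, np.sqrt(var))
        if lo < z < hi:
            return z

def gibbs_mixture(y, sigma2, n_iter, rng):
    n = len(y)
    w = (y > np.median(y)).astype(float)   # crude starting labels w^0
    mu0, mu1 = y.min(), y.max()            # crude start respecting mu0 < mu1
    out = np.empty((n_iter, 3))
    for i in range(n_iter):
        # step 1: Beta(sum w + 1, n - sum w + 1) update for alpha
        alpha = rng.beta(w.sum() + 1.0, n - w.sum() + 1.0)
        # step 2: Bernoulli updates for the latent w's
        p1 = alpha * f(y, mu1)
        p0 = (1.0 - alpha) * f(y, mu0)
        w = (rng.uniform(size=n) < p1 / (p0 + p1)).astype(float)
        # step 3: truncated (above at the current mu1) conjugate update
        n0 = int((w == 0.0).sum())
        mu0 = trunc_normal(y[w == 0.0].sum() * sigma2 / (n0 * sigma2 + 1),
                           sigma2 / (n0 * sigma2 + 1), -np.inf, mu1, rng)
        # step 4: truncated (below at the new mu0) conjugate update
        n1 = n - n0
        mu1 = trunc_normal(y[w == 1.0].sum() * sigma2 / (n1 * sigma2 + 1),
                           sigma2 / (n1 * sigma2 + 1), mu0, np.inf, rng)
        out[i] = (alpha, mu0, mu1)
    return out

rng = np.random.default_rng(2)
n = 200
lab = rng.uniform(size=n) < 0.3
y = np.where(lab, rng.normal(2.0, 1.0, n), rng.normal(-2.0, 1.0, n))
draws = gibbs_mixture(y, sigma2=10.0, n_iter=2000, rng=rng)
```

After a short burn-in, the $\left(\alpha_i, \mu_{0,i}, \mu_{1,i}\right)$ iterates concentrate near the data-generating values ($0.3$, $-2$, $2$ here).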
The Metropolis-Hastings Algorithm
An alternative to the Gibbs algorithm (or an alternative to a "sample from a conditional density" step in an SSS algorithm) is the so-called Metropolis-Hastings algorithm. We begin with the "M-H alone" version.
Begin with $\theta^0$ (possibly generated from an approximation to $L(\theta)\,g(\theta)$ or from the prior $g(\theta)$). With $\theta^i$ in hand, generate $\theta^{i+1}$ as follows. For each $\theta$, let $J_{i+1}(\theta' \mid \theta)$ specify a distribution for $\theta'$ from which it is possible to sample. Use it to generate a "proposal"/"candidate" replacement for $\theta^i$ as
$$\theta^{i+1*} \sim J_{i+1}\left(\cdot \mid \theta^i\right)$$
and accept this proposal based on
$$r_{i+1} = \frac{L\left(\theta^{i+1*}\right) g\left(\theta^{i+1*}\right) / J_{i+1}\left(\theta^{i+1*} \mid \theta^i\right)}{L\left(\theta^i\right) g\left(\theta^i\right) / J_{i+1}\left(\theta^i \mid \theta^{i+1*}\right)}$$
i.e. with probability $\min(1, r_{i+1})$, otherwise setting $\theta^{i+1} = \theta^i$. That is, with
$$Y_{i+1} \sim \mathrm{Ber}\left(\min(1, r_{i+1})\right)$$
$$\theta^{i+1} = Y_{i+1}\, \theta^{i+1*} + (1 - Y_{i+1})\, \theta^i$$
This is a kind of "adaptive rejection sampling" methodology. One usually chooses $J_{i+1}(\theta' \mid \theta)$ to specify $\theta'$ as a small random perturbation of $\theta$.
A very nice special instance of this is one where $J_{i+1}$ is symmetric, i.e. $J_{i+1}(\theta' \mid \theta) = J_{i+1}(\theta \mid \theta')$. In this special case the jumping ratio is
$$r_{i+1} = \frac{L\left(\theta^{i+1*}\right) g\left(\theta^{i+1*}\right)}{L\left(\theta^i\right) g\left(\theta^i\right)}$$
(and one always jumps to the proposal if it takes one up-hill on $L(\theta)\,g(\theta)$) and this is a so-called "Metropolis" algorithm.
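A minimal random-walk Metropolis sketch (the target and tuning choices are illustrative, not from the notes) makes the symmetric case concrete; the normal jumping distribution cancels from the ratio, so only $h(\text{proposal})/h(\text{current})$ is needed:

```python
import numpy as np

def metropolis(log_h, theta0, step_sd, n_iter, rng):
    """Random-walk Metropolis with a symmetric N(theta, step_sd^2)
    jumping distribution.  The jumping ratio reduces to
    h(proposal) / h(current); we work on the log scale for stability."""
    theta = theta0
    draws = np.empty(n_iter)
    for i in range(n_iter):
        proposal = rng.normal(theta, step_sd)
        # accept with probability min(1, r)
        if np.log(rng.uniform()) < log_h(proposal) - log_h(theta):
            theta = proposal
        draws[i] = theta
    return draws

rng = np.random.default_rng(3)
# target density proportional to exp(-theta^2 / 2), i.e. standard normal
draws = metropolis(lambda t: -0.5 * t * t, 0.0, 2.0, 20000, rng)
```

Only an unnormalized $h(\theta)$ (here $L(\theta)\,g(\theta)$ in a Bayes problem) is ever evaluated; after burn-in the empirical mean and variance of the chain approximate the target's.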
Also (in a very important development) one can use the M-H idea to replace straight Gibbs updates in an SSS algorithm. That is, when updating $\theta_{i,j}$ (having updated $\theta_{i,l}$ to $\theta_{i+1,l}$ for all $l < j$) one specifies $J_{i+1,j}(\theta'_j \mid \theta_j)$ (that can actually depend not only on $\theta_j$ but also on the current values $\theta^{i+1,<j}$ and $\theta^{i,>j}$) and generates a proposal
$$\theta^*_{i+1,j} \sim J_{i+1,j}\left(\cdot \mid \theta_{i,j}\right)$$
and accepts it based on
$$r_{i+1,j} = \frac{L\left(\theta^{i+1,<j}, \theta^*_{i+1,j}, \theta^{i,>j}\right) g\left(\theta^{i+1,<j}, \theta^*_{i+1,j}, \theta^{i,>j}\right) / J_{i+1,j}\left(\theta^*_{i+1,j} \mid \theta_{i,j}\right)}{L\left(\theta^{i+1,<j}, \theta_{i,j}, \theta^{i,>j}\right) g\left(\theta^{i+1,<j}, \theta_{i,j}, \theta^{i,>j}\right) / J_{i+1,j}\left(\theta_{i,j} \mid \theta^*_{i+1,j}\right)}$$
otherwise setting $\theta_{i+1,j} = \theta_{i,j}$.
Example 3 For Example 1, consider replacing the "Gibbs step" for updating $c_i$ to $c_{i+1}$ in an SSS algorithm with an "M-H step" based on a symmetric jumping kernel. It's not clear how to do this based directly on $c \in (0,1)$. So instead we do the following.
Defining
$$d = \ln\left(\frac{c}{1 - c}\right)$$
we have
$$c = \frac{1}{1 + \exp(-d)}$$
The uniform prior for $c$ used before implies that a priori
$$P[d \leq t] = P\left[c \leq \frac{1}{1 + \exp(-t)}\right] = \frac{1}{1 + \exp(-t)}$$
so that a corresponding prior pdf for $d$ is, for $t \in \mathbb{R}$,
$$\frac{d}{dt}\left(\frac{1}{1 + \exp(-t)}\right) = \frac{\exp(-t)}{\left(1 + \exp(-t)\right)^2}$$
So one way to replace the Gibbs step with a Metropolis step is to operate on $d$ rather than $c$ and propose
$$d^*_{i+1} \sim \mathrm{N}\left(d_i, \gamma^2\right)$$
(where $\gamma^2$ is a tuning parameter for the algorithm). Since
$$J_{i+1}(d' \mid d) = \frac{1}{\sqrt{2\pi\gamma^2}} \exp\left(-\frac{1}{2\gamma^2}\left(d - d'\right)^2\right)$$
is symmetric, $d^*_{i+1}$ (and the corresponding $c^*_{i+1} = 1/\left(1 + \exp\left(-d^*_{i+1}\right)\right)$) is accepted based on the jumping ratio involving the likelihood times prior
$$r_{i+1} = \frac{L\left(\mu_{1,i+1}, \mu_{2,i+1}, c^*_{i+1}\right) \dfrac{\exp\left(-d^*_{i+1}\right)}{\left(1 + \exp\left(-d^*_{i+1}\right)\right)^2}}{L\left(\mu_{1,i+1}, \mu_{2,i+1}, c_i\right) \dfrac{\exp\left(-d_i\right)}{\left(1 + \exp\left(-d_i\right)\right)^2}} = \frac{L\left(\mu_{1,i+1}, \mu_{2,i+1}, c^*_{i+1}\right)}{L\left(\mu_{1,i+1}, \mu_{2,i+1}, c_i\right)} \cdot \frac{c^*_{i+1}\left(1 - c^*_{i+1}\right)}{c_i\left(1 - c_i\right)}$$
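This Metropolis step on $d$ can be sketched as follows. The function name, the simulated data, and the tuning value `gamma = 0.5` are illustrative; the log prior for $d$ is $\ln(c(1 - c))$ up to an additive constant, which is what the code adds to the log likelihood:

```python
import numpy as np

def mh_step_for_c(c, mu1, mu2, x, y, gamma, rng):
    """One Metropolis update of c through d = ln(c / (1 - c)).
    The symmetric N(d, gamma^2) proposal cancels from the jumping
    ratio, leaving likelihood-times-prior evaluated on the d scale."""
    def log_post(cc):
        # Example 1 log likelihood plus log prior for d, ln(cc (1 - cc))
        sse = (np.sum((y[x < cc] - mu1) ** 2)
               + np.sum((y[x >= cc] - mu2) ** 2))
        return -0.5 * sse + np.log(cc) + np.log1p(-cc)
    d = np.log(c / (1.0 - c))
    d_star = rng.normal(d, gamma)                  # proposal on the d scale
    c_star = 1.0 / (1.0 + np.exp(-d_star))
    if np.log(rng.uniform()) < log_post(c_star) - log_post(c):
        return c_star
    return c

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(size=n)
y = np.where(x < 0.4, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))
c, cs = 0.5, []
for _ in range(3000):
    # holding mu1, mu2 fixed at the data-generating values for illustration
    c = mh_step_for_c(c, -2.0, 2.0, x, y, gamma=0.5, rng=rng)
    cs.append(c)
```

In a full SSS run this step would simply replace step 3 of Example 1, with $\mu_{1,i+1}$ and $\mu_{2,i+1}$ supplied by the preceding Gibbs updates.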