Markov Chain Monte Carlo

Let’s introduce this subject with an overview of what it does and how it does it. The purpose of Markov Chain Monte Carlo (MCMC) is to estimate $E(f(X))$, where $X$ ranges over the points in a finite probability space $S = \{s_1, s_2, \dots, s_k\}$ with a distribution $\pi$. You might say, oh, $S$ is finite so this is easy: $E(f(X)) = \sum_{i=1}^{k} f(s_i)\,\pi(s_i)$. Indeed, for a small, or even moderately sized, space this would be all you would need to do. To fix the idea, suppose that $S = \{0, 1, 2\}$, $\pi(0) = \frac{1}{2}$, $\pi(1) = \frac{1}{3}$, $\pi(2) = \frac{1}{6}$, and $f(X) = X^2$; then
$$E(f(X)) = 0 \cdot \tfrac{1}{2} + 1 \cdot \tfrac{1}{3} + 4 \cdot \tfrac{1}{6} = 1.$$
However, let’s consider the following canonical example of a finite probability space that really needs a technique such as MCMC.
Consider a finite lattice, $L$, consisting of the points of the form $(i, j)$, $0 \le i \le 7$, $0 \le j \le 7$. A configuration on $L$ is an assignment of 0s and 1s to the points of $L$; a configuration is called feasible if no two neighboring points are both assigned the value 1. Points of $L$ are considered to be neighbors if they differ by exactly 1 in exactly one coordinate. Thus, $(1, 0)$ and $(0, 1)$ are neighbors of $(0, 0)$, but $(1, 1)$ is not. The set $S = \{$feasible configurations on $L\}$, and the probability $\pi_L$ is the uniform probability on $S$, i.e., $\pi_L(\eta) = \frac{1}{|S|}$, $\eta \in S$. The challenge is to estimate $E(n(\eta))$, where $n(\eta) =$ the number of 1s in $\eta$, $\eta \in S$. A little thought indicates that even the size of $S$ is difficult to determine, and it would not be possible to write down all the feasible configurations. Indeed, $\eta(i, j) = 1$ if $i, j$ are both even or both odd, and $0$ otherwise, is a feasible configuration with 32 ones, and every configuration obtained from it by changing some of its 1s to 0s is also feasible, so we see that $|S| \ge 2^{32}$; in fact, it is somewhat larger than that.
We will have to develop a number of topics before this statement makes sense, but a somewhat concise statement of how MCMC helps with this problem and others like it is:
The MCMC approach to this problem is to develop a Markov Chain (MC) with state space $S$ and transition probabilities $P_{\sigma,\sigma'}$ that involve a relatively small number of possible transitions from any given state $\sigma$ in $S$. Then, if the MC has been designed properly, and we can simulate the MC well enough, the distribution of the $n$th iterate $\xi_n$ (starting from any initial distribution) will, for $n$ large enough, be close to $\pi_L$, and we can estimate $E(n(\eta))$ by the average of the $n(\xi_n)$.
Designed properly means that the MC satisfies certain conditions, namely, it is:
1. Irreducible
2. Aperiodic
3. $\pi_L$ is a stationary distribution for the MC (actually, we will usually have that $\pi_L$ is a reversible distribution for the MC).
With these conditions satisfied, it is possible to show that, for large enough $n$, the distribution of $\xi_n$ will indeed be close to $\pi_L$, no matter what the initial distribution is, and it will even be possible to obtain useful estimates for convergence rates.
Before beginning to discuss the topics highlighted above, let’s indicate how a suitable Markov chain can be defined on the feasible configurations in $S$.
At each integer time $n + 1$, we do the following:
1. Pick a point $v \in L$ at random (uniformly).
2. Toss a fair coin.
3. If the coin comes up heads, and all neighbors of $v$ take the value 0 in $\xi_n$, then let $\xi_{n+1}(v) = 1$; otherwise, let $\xi_{n+1}(v) = 0$.
4. For all other points $w$ in $L$, leave the value at $w$ unchanged, i.e., $\xi_{n+1}(w) = \xi_n(w)$.
Note: even though the state space is complicated, given a $\xi \in S$ there are only relatively few other feasible states that can be transitioned to; basically, the states accessible in one step can differ from $\xi$ at only one point. A simulation sketch of this chain follows.
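To make the mechanism concrete, here is a minimal Python sketch of this chain (the lattice size, number of steps, and burn-in below are illustrative choices, not part of the original example); it runs the update above and averages $n(\xi_n)$ to estimate $E(n(\eta))$:

```python
import random

N = 8  # lattice is {0,...,N-1} x {0,...,N-1}

def neighbors(v):
    """Points differing from v by 1 in exactly one coordinate."""
    i, j = v
    cand = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for (a, b) in cand if 0 <= a < N and 0 <= b < N]

def step(xi):
    """One transition: pick a point, toss a fair coin, update."""
    v = (random.randrange(N), random.randrange(N))
    heads = random.random() < 0.5
    if heads and all(xi[w] == 0 for w in neighbors(v)):
        xi[v] = 1
    else:
        xi[v] = 0
    return xi

# Start from the all-zeros configuration (feasible), run the chain,
# and average the number of 1s after a burn-in period.
xi = {(i, j): 0 for i in range(N) for j in range(N)}
burn_in, samples, total = 10_000, 100_000, 0
for t in range(burn_in + samples):
    xi = step(xi)
    if t >= burn_in:
        total += sum(xi.values())
print("estimated E(n(eta)) ~", total / samples)
```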
Markov Chains:
Let $S = \{s_1, s_2, \dots, s_k\}$ be a finite set. We will think of $S$ as the set of possible states for some process $X$ that assigns to each non-negative integer an element, or state, in $S$. A realization of the process then will be a sequence of states. The state associated with 0 is the initial state, and it must be specified. After that, the process tells us how to “determine” the next state, and then the next one, etc. If the process is deterministic, then the sequence of states is uniquely determined by the choice of the initial state. We are interested in processes for which that may not be the case. Here is an example:
A man lives on a block with four corners labeled $v_1, v_2, v_3, v_4$ (maybe there are coffee houses there). At time 0 he is at the corner with index 2. A fair coin is flipped: if the coin comes up heads, he will move at time 1 to the corner with index 3; otherwise, he will move to the corner with index 1. Another coin is flipped: if heads, he will move at time 2 clockwise to the next corner; if tails, he will move counterclockwise to the next corner. We can represent the states of this process by a graph with four vertices (for the corners) and arrows indicating possible transitions. All Markov chains can be represented this way. We can describe the dynamics of this chain by a matrix, called the transition matrix, which specifies the probabilities of each transition from a corner with index $i$ to a corner with index $j$. The transition matrix for this chain is
$$P = \begin{pmatrix} 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 \end{pmatrix}.$$
The entries of the transition matrix are denoted $P_{ij}$, where $P_{ij}$ represents the probability that the process will transition to state $j$ from state $i$. For example, $P_{12} = \tfrac{1}{2}$ for the process above. Notice that the transition probability only depends on the current state and not on any earlier ones.
Definition: Let $P$ be a $k \times k$ matrix with entries $P_{ij}$, $i, j = 1, 2, \dots, k$. A random process $(X_0, X_1, \dots)$ with finite state space $S = \{s_1, s_2, \dots, s_k\}$ is said to be a homogeneous Markov chain with transition matrix $P$ if for all $n$, all $i, j \in \{1, 2, \dots, k\}$, and all $i_0, \dots, i_{n-1} \in \{1, \dots, k\}$ we have
$$P(X_{n+1} = s_j \mid X_0 = s_{i_0}, X_1 = s_{i_1}, \dots, X_{n-1} = s_{i_{n-1}}, X_n = s_i) = P(X_{n+1} = s_j \mid X_n = s_i) = P_{ij}.$$
In words, the future is independent of the past given the present.
Conditions that the $P_{ij}$ must satisfy:
$$P_{ij} \ge 0, \qquad \sum_{j=1}^{k} P_{ij} = 1 \text{ for all } i.$$
If we start with a point in the state space, then we cannot say with certainty what the next state will be, and certainly not the one after that, and so on. However, if we start with a distribution on the points of the state space, an initial distribution, then the distribution of the next states is determined.
If we specify an initial distribution $\mu_0$, $\mu_0(s_i) = P(X_0 = s_i)$, $1 \le i \le k$, then the distribution of $X_1$ is determined by
$$P(X_1 = s_j) = \sum_i P(X_1 = s_j, X_0 = s_i) = \sum_i P(X_0 = s_i)\,P(X_1 = s_j \mid X_0 = s_i) = \sum_i \mu_0(s_i)\,P_{ij},$$
i.e., $\mu_1 = \mu_0 P$, the product of the row vector $\mu_0$ with the matrix $P$:
$$\mu_1 = (\mu_0(s_1), \mu_0(s_2), \dots, \mu_0(s_k)) \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1k} \\ P_{21} & P_{22} & \cdots & P_{2k} \\ \vdots & \vdots & & \vdots \\ P_{k1} & P_{k2} & \cdots & P_{kk} \end{pmatrix}.$$
In general, it is easy to see that the distribution of $X_n$ is given by $\mu_n = \mu_0 P^n$, where we understand $\mu_j$ to be a row vector with $k$ components.
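As a quick numerical illustration of $\mu_n = \mu_0 P^n$ (a sketch using the four-corners chain above, with the initial distribution concentrated on corner 2 as in the example):

```python
import numpy as np

# Transition matrix of the four-corners chain.
P = np.array([[0, .5, 0, .5],
              [.5, 0, .5, 0],
              [0, .5, 0, .5],
              [.5, 0, .5, 0]])

mu0 = np.array([0.0, 1.0, 0.0, 0.0])  # start at the corner with index 2

# mu_n = mu_0 P^n, computed by repeated row-vector/matrix products.
# (The distributions oscillate: this chain turns out to be periodic.)
mu = mu0
for n in range(5):
    print(f"mu_{n} =", mu)
    mu = mu @ P
```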
Simulating Markov Chains
We assume that we have access to a source of uniformly distributed random variables. This is a big assumption, and since I don’t actually know of a bona fide source of truly random numbers, uniform or otherwise, we might say that this assumption is for the purpose of being able to proceed with our discussion. In practice, we will substitute a simulation of such randomness by using a pseudo-random number generator. Thus we assume that any time we want, we can extract from our “source” a string of any desired length of independent uniform $[0, 1]$ random variables $(U_0, U_1, \dots, U_N)$. Having such strings available and having a Markov chain, we can simulate the Markov chain for as many “steps” as we wish, in a very straightforward manner. So, we have a state space $S = \{s_1, s_2, \dots, s_k\}$ and a transition matrix $P$. To simulate the process $X$ for $n$ steps, we need to do three things:
1. We have to define an initial distribution, which is completely up to us.
2. We have to define how to determine the next state given the present state.
3. We need to do this so that once a sequence of uniform $[0, 1]$ random variables $(U_0, U_1, \dots, U_N)$ is inputted, we can determine the states of $X$ out to the same number of places, i.e., $(X_0, X_1, \dots, X_N)$, and verify that the simulation is really acting like $X$.
Specifying the initial distribution:
Suppose we have decided on an initial distribution $\mu_0 = (\mu_0(1), \dots, \mu_0(k))$. Simulating such a distribution can be accomplished by any function $\psi : [0, 1] \to S$, called an initiation function, such that for each $i$ the measure of the set $\{u \mid \psi(u) = s_i\}$ is $\mu_0(i)$. A straightforward way to do this is to take $\psi(u) = s_1$ on the interval $[0, \mu_0(1))$ and, for $i \ge 2$, $\psi(u) = s_i$ on the interval
$$\left[\sum_{j=1}^{i-1} \mu_0(j),\; \sum_{j=1}^{i} \mu_0(j)\right).$$
Then, given $(U_0, U_1, \dots, U_N)$, taking $X_0 = \psi(U_0)$ will assign to $X_0$ values in $S$ distributed according to the distribution $\mu_0$. Note: the assignment of an initiation function is in no way unique.
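A minimal sketch of such an initiation function in Python (the name `psi` and the cumulative-sum approach below are just one of the many valid choices noted above):

```python
import random

def psi(mu0, u):
    """Initiation function: map u in [0, 1] to a state index,
    using the subintervals of cumulative probabilities."""
    cum = 0.0
    for i, p in enumerate(mu0):
        cum += p
        if u < cum:
            return i
    return len(mu0) - 1  # guard against floating-point round-off

mu0 = [0.5, 1/3, 1/6]
X0 = psi(mu0, random.random())  # X0 is distributed according to mu0
```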
Updating:
If we are in a state $s_i$ at time $n$, we want to know how to determine the next state. But this is no harder than simulating the initial distribution, except we will have a different update function for each state:
$$\phi(s_i, u) = \begin{cases} s_1 & \text{if } u \in [0, P_{i1}) \\ s_j & \text{if } u \in \left[\sum_{l=1}^{j-1} P_{il},\; \sum_{l=1}^{j} P_{il}\right), \quad 2 \le j \le k - 1 \\ s_k & \text{if } u \in \left[\sum_{l=1}^{k-1} P_{il},\; 1\right]. \end{cases}$$
Clearly, for each $s_i, s_j \in S$, $P(\phi(s_i, U) = s_j) = P_{ij}$. Update functions are not unique either.
To simulate the chain now, given $(U_0, U_1, \dots, U_N)$, we take
$$X_0 = \psi(U_0), \quad X_1 = \phi(X_0, U_1), \quad X_2 = \phi(X_1, U_2), \quad \dots$$
Assignment: simulate the four-corners Markov chain. A minimal sketch of one possible solution follows.
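Here is one way the assignment could be carried out in Python (a sketch: `phi` implements the update function of this section for the four-corners transition matrix; the starting corner and the number of steps are illustrative choices):

```python
import random

P = [[0, .5, 0, .5],
     [.5, 0, .5, 0],
     [0, .5, 0, .5],
     [.5, 0, .5, 0]]

def phi(i, u):
    """Update function: next state from state i using uniform u."""
    cum = 0.0
    for j, p in enumerate(P[i]):
        cum += p
        if u < cum:
            return j
    return len(P) - 1

# Start deterministically at the corner labeled v2 (0-based index 1),
# i.e., the initial distribution puts all its mass there.
X = 1
path = [X]
for n in range(10):
    X = phi(X, random.random())
    path.append(X)
print("visited corners:", [v + 1 for v in path])  # back to 1-based labels
```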
Irreducible Markov Chains:
Basically, an irreducible Markov chain is one such that you can get from any given state to any other state, eventually.
Let $s_i, s_j \in S$. We say that $s_i$ communicates with $s_j$, written $s_i \to s_j$, if the chain has positive probability of ever reaching $s_j$ from $s_i$. If it is also true that $s_j$ communicates with $s_i$, then these states are said to intercommunicate, and we write $s_i \leftrightarrow s_j$.
Definition: A Markov chain $(X_0, X_1, \dots)$ with state space $S = \{s_1, s_2, \dots, s_k\}$ and transition matrix $P$ is said to be irreducible if for all $s_i, s_j \in S$ we have $s_i \leftrightarrow s_j$. Otherwise the chain is said to be reducible.
Example of a reducible MC: (just draw the graph)
Aperiodic Markov Chains:
For a finite or infinite set $\{a_1, a_2, \dots\}$ of positive integers, we write $\gcd\{a_1, a_2, \dots\}$ for the greatest common divisor of $a_1, a_2, \dots$. The period $d(s_i)$ of a state $s_i$ is defined as
$$d(s_i) = \gcd\{n \ge 1 \mid (P^n)_{ii} > 0\}.$$
The period of $s_i$ is the greatest common divisor of the set of times at which the chain can return with positive probability to the state $s_i$, given that $X_0 = s_i$. If $d(s_i) = 1$, we say that the state $s_i$ is aperiodic.
Definition: A Markov chain is said to be aperiodic if all of its states are aperiodic.
Otherwise, the chain is said to be periodic.
Exercise: Show that the four corners MC is periodic.
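This is not a proof, but here is a quick numerical check of the claim (a sketch: it collects the return times $n$ with $(P^n)_{11} > 0$ up to an arbitrary cutoff and takes their gcd):

```python
from math import gcd
import numpy as np

P = np.array([[0, .5, 0, .5],
              [.5, 0, .5, 0],
              [0, .5, 0, .5],
              [.5, 0, .5, 0]])

# Collect the times n <= 20 with (P^n)_{11} > 0 and take their gcd.
Pn, return_times = np.eye(4), []
for n in range(1, 21):
    Pn = Pn @ P
    if Pn[0, 0] > 0:
        return_times.append(n)

print("return times:", return_times)          # only even times appear
print("period of s_1:", gcd(*return_times))   # 2, so the chain is periodic
```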
Theorem (on aperiodicity): Suppose that $(X_0, X_1, \dots)$ is an aperiodic MC with state space $S = \{s_1, s_2, \dots, s_k\}$ and transition matrix $P$. Then there exists an $N < \infty$ such that $(P^n)_{ii} > 0$ for all $i \in \{1, \dots, k\}$ and all $n \ge N$.
Lemma (from number theory): Let $A = \{a_1, a_2, \dots\}$ be a set of positive integers which is
i) nonlattice, meaning that $\gcd\{a_1, a_2, \dots\} = 1$, and
ii) closed under addition.
Then there exists an integer $N < \infty$ such that $n \in A$ for all $n \ge N$.
Proof of theorem on aperiodicity: For $s_i \in S$, let $A_i = \{n \ge 1 \mid (P^n)_{ii} > 0\}$; $A_i$ is the set of possible return times to the state $s_i$ starting from $s_i$. Since the chain is aperiodic, $A_i$ is nonlattice. Let’s see that $A_i$ is also closed under addition: Suppose that $a, a' \in A_i$. Then
$$P(X_{a+a'} = s_i \mid X_0 = s_i) \ge P(X_a = s_i, X_{a+a'} = s_i \mid X_0 = s_i)$$
$$= P(X_a = s_i \mid X_0 = s_i)\,P(X_{a+a'} = s_i \mid X_a = s_i) = P(X_a = s_i \mid X_0 = s_i)\,P(X_{a'} = s_i \mid X_0 = s_i) > 0.$$
Corollary: Suppose that $(X_0, X_1, \dots)$ is an irreducible and aperiodic MC with state space $S = \{s_1, s_2, \dots, s_k\}$ and transition matrix $P$. Then there exists a finite positive integer $M$ such that $(P^n)_{ij} > 0$ for all $i, j \in \{1, \dots, k\}$ and all $n \ge M$.
Proof: By aperiodicity, there is an $N < \infty$ such that $(P^n)_{ii} > 0$ for all $i \in \{1, \dots, k\}$ and all $n \ge N$. For $s_i, s_j \in S$, by irreducibility, there is an $n_{i,j}$ such that $(P^{n_{i,j}})_{ij} > 0$. Let $M_{i,j} = N + n_{i,j}$. Then, for $m \ge M_{i,j}$, we have
$$P(X_m = s_j \mid X_0 = s_i) \ge P(X_{m - n_{i,j}} = s_i, X_m = s_j \mid X_0 = s_i)$$
$$= P(X_{m - n_{i,j}} = s_i \mid X_0 = s_i)\,P(X_m = s_j \mid X_{m - n_{i,j}} = s_i) > 0.$$
Take $M = \max\{M_{1,1}, M_{1,2}, \dots, M_{1,k}, M_{2,1}, \dots, M_{k,k}\}$.
Stationary Distributions:
A distribution $\pi$ is said to be a stationary distribution for the Markov chain $X$ with transition matrix $P$ if $\pi P = \pi$, where $\pi$ is regarded as a row vector.
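As a computational aside (not from the lecture), a stationary distribution of a small chain can be found numerically by solving $\pi P = \pi$ together with $\sum_i \pi_i = 1$. The matrix below is a hypothetical “lazy” variant of the four-corners walk, chosen because it is irreducible and aperiodic:

```python
import numpy as np

# Lazy four-corners walk: stay put with probability 1/2.
P = np.array([[.5, .25, 0, .25],
              [.25, .5, .25, 0],
              [0, .25, .5, .25],
              [.25, 0, .25, .5]])

# Solve pi P = pi, sum(pi) = 1 as a linear least-squares system:
# stack (P^T - I) with a row of ones for the normalization.
k = P.shape[0]
A = np.vstack([P.T - np.eye(k), np.ones(k)])
b = np.concatenate([np.zeros(k), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print("stationary distribution:", pi)  # uniform: [0.25 0.25 0.25 0.25]
```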
We are going to prove some results concerning existence, uniqueness, and convergence to stationarity for irreducible, aperiodic Markov chains. Some of these hypotheses can be relaxed.
Theorem (Existence): For any irreducible and aperiodic MC, there exists at least one stationary distribution.
Hitting times: $T_{ij} = \min\{n \ge 1 \mid X_n = s_j\}$, given that $X_0 = s_i$, with $T_{ij} = \infty$ if the chain never visits $s_j$. The mean hitting time is $\tau_{ij} = E(T_{ij})$. The mean return time then is $\tau_{ii}$.
Lemma (finiteness of hitting times): Suppose that $(X_0, X_1, \dots)$ is an irreducible and aperiodic MC with state space $S = \{s_1, s_2, \dots, s_k\}$ and transition matrix $P$, and let $s_i, s_j \in S$. Then $P(T_{ij} < \infty) = 1$. Moreover, the mean hitting time $\tau_{ij}$ is finite.
Proof: By the corollary in the previous section, there is an $M$ such that $(P^n)_{ij} > 0$ for all $i, j \in \{1, \dots, k\}$ and all $n \ge M$. Let $\alpha = \min\{(P^M)_{ij} \mid i, j \in \{1, 2, \dots, k\}\} > 0$. For states $s_i, s_j$,
$$P(T_{ij} > M) \le P(X_M \ne s_j) \le 1 - \alpha.$$
Furthermore,
$$P(T_{ij} > 2M) = P(T_{ij} > M)\,P(T_{ij} > 2M \mid T_{ij} > M) \le P(T_{ij} > M)\,P(X_{2M} \ne s_j \mid T_{ij} > M) \le (1 - \alpha)^2,$$
since, conditioning on the value of $X_M$ and using the Markov property, $P(X_{2M} \ne s_j \mid T_{ij} > M) \le 1 - \alpha$.
By continuing this process, we have that, for each $l \ge 1$, $P(T_{ij} > lM) \le (1 - \alpha)^l$. But since $(1 - \alpha)^l \to 0$, $P(T_{ij} < \infty) = 1$.
Recall that
$$\tau_{ij} = E(T_{ij}) = \sum_{n=1}^{\infty} P(T_{ij} \ge n) = \sum_{n=0}^{\infty} P(T_{ij} > n) = \sum_{l=0}^{\infty} \sum_{n=lM}^{(l+1)M - 1} P(T_{ij} > n) \le M \sum_{l=0}^{\infty} P(T_{ij} > lM) \le M \sum_{l=0}^{\infty} (1 - \alpha)^l = \frac{M}{1 - (1 - \alpha)} = \frac{M}{\alpha} < \infty.$$
The lemma is established.
Proof of the existence theorem: Let $\rho_i = \sum_{n=0}^{\infty} P(X_n = s_i, T_{11} > n)$, assuming $X_0 = s_1$. That is, $\rho_i$ is the expected number of visits to state $s_i$ up to time $T_{11} - 1$.
Theorem (The Markov Chain Convergence Theorem): Let $(X_0, X_1, \dots)$ be an irreducible, aperiodic Markov chain with state space $S = \{s_1, s_2, \dots, s_k\}$, transition matrix $P$, and arbitrary initial distribution $\mu_0$. Then for any distribution $\pi$ that is stationary for the transition matrix $P$, we have $\mu_n \to \pi$ in total variation (TV).
Proof: (Coupling argument) Assume that $(X_0, X_1, \dots)$ has been obtained by simulation:
$$X_0 = \psi_{\mu_0}(U_0), \qquad X_{n+1} = \phi(X_n, U_{n+1}), \quad n \ge 0,$$
where $\psi_{\mu_0}$ is a valid initiation function for $\mu_0$, $\phi$ is a valid update function for $P$, and $(U_0, U_1, \dots)$ is an i.i.d. sequence of uniform $[0, 1]$ random variables. Next, let $(X_0', X_1', \dots)$ be a second Markov chain obtained by letting $\psi_{\pi}$ be a valid initiation function for the distribution $\pi$, letting $(U_0', U_1', \dots)$ be a second i.i.d. sequence of uniform $[0, 1]$ random variables independent of $(U_0, U_1, \dots)$, and setting
$$X_0' = \psi_{\pi}(U_0'), \qquad X_{n+1}' = \phi(X_n', U_{n+1}'), \quad n \ge 0.$$
Since $\pi$ is a stationary distribution, we have that $X_n'$ has distribution $\pi$ for any $n$. Also, the chains $(X_0, X_1, \dots)$ and $(X_0', X_1', \dots)$ are independent of each other, since $(U_0, U_1, \dots)$ and $(U_0', U_1', \dots)$ are.
We want to show that these two chains must meet, i.e., there must exist a time $n$ such that $X_n = X_n'$. To this end, let $T = \min\{n : X_n = X_n'\}$, with $T = \infty$ if the chains never meet. Since the Markov chain $(X_0, X_1, \dots)$ is irreducible and aperiodic, we can find, using the corollary of the previous section, a finite $M$ such that $(P^M)_{ij} > 0$ for all $i, j \in \{1, \dots, k\}$. Let $\alpha = \min\{(P^M)_{ij} \mid i, j \in \{1, \dots, k\}\} > 0$. Using the independence of the two chains, we get that
$$P(T \le M) \ge P(X_M = X_M') \ge P(X_M = s_1, X_M' = s_1) = P(X_M = s_1)\,P(X_M' = s_1)$$
$$= \left(\sum_{i=1}^{k} P(X_0 = s_i)\,P(X_M = s_1 \mid X_0 = s_i)\right)\left(\sum_{i=1}^{k} P(X_0' = s_i)\,P(X_M' = s_1 \mid X_0' = s_i)\right)$$
$$\ge \left(\alpha \sum_{i=1}^{k} P(X_0 = s_i)\right)\left(\alpha \sum_{i=1}^{k} P(X_0' = s_i)\right) = \alpha^2.$$
So, $P(T > M) \le 1 - \alpha^2$.
Similarly, given everything that has happened up to time $M$, we have conditional probability of at least $\alpha^2$ of $X_{2M} = X_{2M}' = s_1$, so $P(X_{2M} \ne X_{2M}' \mid T > M) \le 1 - \alpha^2$. Hence,
$$P(T > 2M) = P(T > M)\,P(T > 2M \mid T > M) \le (1 - \alpha^2)\,P(T > 2M \mid T > M)$$
$$\le (1 - \alpha^2)\,P(X_{2M} \ne X_{2M}' \mid T > M) \le (1 - \alpha^2)^2.$$
Iteration shows that $P(T > lM) \le (1 - \alpha^2)^l$, which tends to 0 as $l$ tends to infinity. Hence,
$$\lim_{n \to \infty} P(T > n) = 0.$$
We now construct a third Markov chain $(X_0'', X_1'', \dots)$ by setting $X_0'' = X_0$ and, for each $n \ge 0$,
$$X_{n+1}'' = \begin{cases} \phi(X_n'', U_{n+1}) & \text{if } X_n'' \ne X_n' \\ \phi(X_n'', U_{n+1}') & \text{if } X_n'' = X_n'. \end{cases}$$
So, the chain $(X_0'', X_1'', \dots)$ evolves exactly like the chain $(X_0, X_1, \dots)$ until the time $T$ when it first meets the chain $(X_0', X_1', \dots)$. It then switches to evolve exactly like the chain $(X_0', X_1', \dots)$.
Because the initial distribution of $(X_0'', X_1'', \dots)$ is $\mu_0$, for any $n$, $X_n''$ has distribution $\mu_n$. For any $i \in \{1, 2, \dots, k\}$, we have
$$\mu_n^{(i)} - \pi_i = P(X_n'' = s_i) - P(X_n' = s_i) \le P(X_n'' = s_i, X_n' \ne s_i) \le P(X_n'' \ne X_n') \le P(T > n).$$
Using the same argument with the roles of $X_n''$ and $X_n'$ interchanged, we get $\pi_i - \mu_n^{(i)} \le P(T > n)$ as well, so
$$|\mu_n^{(i)} - \pi_i| \le P(T > n) \to 0.$$
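A numerical illustration of the theorem (a sketch, reusing the hypothetical lazy four-corners chain from the earlier aside; on a finite space, $d_{TV}(\mu_n, \pi) = \frac{1}{2}\sum_i |\mu_n^{(i)} - \pi_i|$):

```python
import numpy as np

P = np.array([[.5, .25, 0, .25],
              [.25, .5, .25, 0],
              [0, .25, .5, .25],
              [.25, 0, .25, .5]])
pi = np.full(4, 0.25)            # stationary for this P
mu = np.array([0., 1., 0., 0.])  # arbitrary initial distribution

# Print the total variation distance every four steps.
for n in range(0, 25, 4):
    tv = 0.5 * np.abs(mu - pi).sum()
    print(f"n={n:2d}  d_TV(mu_n, pi) = {tv:.6f}")
    mu = mu @ np.linalg.matrix_power(P, 4)  # advance four steps
```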
Theorem (Uniqueness): Any irreducible and aperiodic MC has exactly one stationary
distribution.
The proofs of these results will be the main task of the next meeting.
Right now we want to go on to one more topic and then return briefly to the opening
example.
Reversible Markov Chains:
Definition: Let $(X_0, X_1, \dots)$ be an irreducible, aperiodic Markov chain with state space $S = \{s_1, s_2, \dots, s_k\}$ and transition matrix $P$. A probability distribution $\pi$ on $S$ is said to be reversible for the chain (or for the transition matrix $P$) if for all $i, j \in \{1, \dots, k\}$ we have
$$\pi_i P_{ij} = \pi_j P_{ji}.$$
The Markov chain is said to be reversible if there exists a reversible distribution for it.
Theorem (reversible distributions are stationary): Let $(X_0, X_1, \dots)$ be an irreducible, aperiodic Markov chain with state space $S = \{s_1, s_2, \dots, s_k\}$ and transition matrix $P$. If $\pi$ is a reversible distribution for $P$, then it is also a stationary distribution for $P$.
Proof: We have to show that, for any $j \in \{1, \dots, k\}$, we have $\pi_j = \sum_{i=1}^{k} \pi_i P_{ij}$. Using $\sum_{i=1}^{k} P_{ji} = 1$ and then reversibility, we get
$$\pi_j = \pi_j \sum_{i=1}^{k} P_{ji} = \sum_{i=1}^{k} \pi_j P_{ji} = \sum_{i=1}^{k} \pi_i P_{ij} \quad \text{(reversibility used here).}$$
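A small numerical check of the theorem on a concrete example (again the hypothetical lazy four-corners chain; any symmetric transition matrix is reversible with respect to the uniform distribution):

```python
import numpy as np

P = np.array([[.5, .25, 0, .25],
              [.25, .5, .25, 0],
              [0, .25, .5, .25],
              [.25, 0, .25, .5]])
pi = np.full(4, 0.25)

# Detailed balance: pi_i P_ij == pi_j P_ji for all i, j.
flows = pi[:, None] * P            # matrix of pi_i * P_ij
reversible = np.allclose(flows, flows.T)
# Stationarity: pi P == pi.
stationary = np.allclose(pi @ P, pi)
print("reversible:", reversible, " stationary:", stationary)  # True True
```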
Returning to the opening example:
We would like to see that the uniform distribution on the feasible states of the opening example is reversible for the transition mechanism defined there.
Let $P_{\sigma, \sigma'}$ denote the transition probability from state $\sigma$ to state $\sigma'$. We want to see that
$$\pi_L(\sigma)\,P_{\sigma, \sigma'} = \pi_L(\sigma')\,P_{\sigma', \sigma}.$$
Let $d = d(\sigma, \sigma')$ be the number of vertices at which $\sigma$ and $\sigma'$ differ. If $d = 0$, then the equality is obvious. If $d \ge 2$, then $P_{\sigma, \sigma'} = P_{\sigma', \sigma} = 0$ (these transitions don’t happen). So, suppose that $d = 1$, i.e., $\sigma$ and $\sigma'$ differ at exactly one vertex, say $v$. Then all neighbors of $v$ must take the value 0 in both configurations; otherwise, one of the configurations would not be feasible. Writing $k$ for the number of points of $L$, in either direction the transition requires picking $v$ (probability $\frac{1}{k}$) and the appropriate coin outcome (probability $\frac{1}{2}$). We therefore have
$$\pi_L(\sigma)\,P_{\sigma, \sigma'} = \frac{1}{|S|} \cdot \frac{1}{2k} = \pi_L(\sigma')\,P_{\sigma', \sigma}.$$
Assignment: Consider a Bose–Einstein experiment of placing $n$ identical particles in $N$ boxes. The configurations are characterized by the occupation statistics of the boxes: $x_1$ = number of particles in box 1, etc. All configurations have the same probability, given by
$$\frac{1}{\binom{N + n - 1}{N - 1}}.$$
Define a transition mechanism on the states of this system and show that this distribution is a reversible distribution for the mechanism.
q-colorings: Let $G = (V, E)$ be a graph and let $q \ge 2$ be an integer. A q-coloring of the graph $G$ is an assignment to the vertices of values from $\{1, \dots, q\}$, thought of as different colors, with the property that no two adjacent vertices have the same color. A random q-coloring is a coloring chosen from the set of all possible q-colorings with the uniform distribution.
A Gibbs sampler for random q-colorings is a $\{1, \dots, q\}^V$-valued Markov chain where, at each time $n + 1$, transitions take place as follows:
1. Pick a vertex $v \in V$ at random (uniformly).
2. Pick $X_{n+1}(v)$ according to the uniform distribution over the set of colors that are not attained at any neighboring vertex of $v$.
3. Leave the color unchanged at all other vertices.
Whether this chain is irreducible depends on $G$ and $q$, and it is not a trivial problem to determine this. A sketch of one transition step appears below.
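A minimal Python sketch of a single Gibbs-sampler transition (the adjacency-dictionary representation, the function name, and the example graph are illustrative choices, not from the lecture):

```python
import random

def gibbs_step(coloring, adj, q):
    """One transition: pick a random vertex, recolor it uniformly
    among the colors not attained at any of its neighbors."""
    v = random.choice(list(coloring))
    forbidden = {coloring[w] for w in adj[v]}
    allowed = [c for c in range(1, q + 1) if c not in forbidden]
    coloring[v] = random.choice(allowed)
    return coloring

# Example: a 4-cycle with q = 3 colors, started from a proper coloring.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
coloring = {0: 1, 1: 2, 2: 1, 3: 2}
for _ in range(100):
    coloring = gibbs_step(coloring, adj, 3)
print(coloring)
```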
The systematic sweep Gibbs sampler: fix an enumeration $v_1, \dots, v_k$ of the vertices and update
$$v_1 \text{ at times } 1,\ k + 1,\ 2k + 1,\ \dots$$
$$v_2 \text{ at times } 2,\ k + 2,\ 2k + 2,\ \dots$$
$$\vdots$$
$$v_i \text{ at times } i,\ k + i,\ 2k + i,\ \dots$$
$$\vdots$$
Theorem (bound on the number of iterates needed): Let $G = (V, E)$ be a graph. Let $k$ be the number of vertices in $G$, and suppose that every vertex $v$ in $V$ has at most $d$ neighbors. Suppose furthermore that $q > 2d^2$. Then for any fixed $\varepsilon > 0$, the number of iterations needed for the systematic sweep Gibbs sampler described above (starting from any fixed q-coloring $\xi$) to come within total variation distance $\varepsilon$ of the target distribution $\rho_{G,q}$ is at most
$$k\left(\frac{\log k + \log\frac{1}{\varepsilon} - \log d}{\log\frac{q}{2d^2}} + 1\right).$$
Proof: We use a coupling argument. We run two $\{1, \dots, q\}^V$-valued chains $(X_0, X_1, \dots)$ and $(X_0', X_1', \dots)$ at the same time. The first chain starts with the fixed coloring $X_0 = \xi$, while the second chain is started at a random state $\xi'$ chosen according to the stationary distribution $\rho_{G,q}$. Then $X_n'$ has distribution $\rho_{G,q}$, and $X_n$ has distribution $\mu_n$. We will see that $d_{TV}(\rho_{G,q}, \mu_n)$ is close to 0 if $P(X_n' = X_n)$ is close to 1.
For each update of both chains, we need to choose a vertex to update, which we are doing systematically, and then we need to choose a random color to assign to the vertex from the set of colors that are not assigned to its neighbors. We do this by choosing a random permutation of the colors (a fresh one for each update) and then, for each chain, taking the first color in the random permutation that is not one of those taken by a neighbor of the vertex being updated. More formally, let $R_0, R_1, \dots$ be an i.i.d. sequence of random permutations, each of them uniformly distributed on the set of permutations of $\{1, \dots, q\}$. At each time $n$, the updates of the two chains use the permutation $R_n = (R_n^1, \dots, R_n^q)$, and the vertex $v$ to be updated is assigned the new value
$$X_{n+1}(v) = R_n^i, \quad \text{where } i = \min\{j : X_n(w) \ne R_n^j \text{ for all neighbors } w \text{ of } v\}$$
in the first chain, while in the second chain we set
$$X_{n+1}'(v) = R_n^{i'}, \quad \text{where } i' = \min\{j' : X_n'(w) \ne R_n^{j'} \text{ for all neighbors } w \text{ of } v\}.$$
Now, let’s consider the probability that the two chains agree at a particular vertex. Consider the update of the two chains at a vertex $v$ at time $n \le k$ (we are in the first sweep of the Gibbs sampler). We call the update successful if it results in having $X_{n+1}(v) = X_{n+1}'(v)$; otherwise the update has failed.
Define $B_2$ = the number of colors $r \in \{1, \dots, q\}$ that are attained in the neighborhood of $v$ in both $X_n$ and $X_n'$, $B_1$ = the number of colors $r \in \{1, \dots, q\}$ that are attained in the neighborhood of $v$ in exactly one of $X_n$ and $X_n'$, and $B_0$ = the number of colors $r \in \{1, \dots, q\}$ that are attained in the neighborhood of $v$ in neither of $X_n$ and $X_n'$, and note that $B_0 + B_1 + B_2 = q$.
Note that if a color of $R_n$ is among the $B_2$ colors, then it is discarded by both chains. The update is successful exactly when the first color in $R_n$ from $B_0$ occurs before the first one from $B_1$, so the probability of a successful update is $\frac{B_0}{B_0 + B_1}$, and the probability of a failed update is
$$P(\text{failed update}) = \frac{B_1}{B_0 + B_1}.$$
Clearly, $0 \le B_2 \le d$. Furthermore, $B_1 \le 2d - 2B_2$, because, counting the neighbors of $v$ in both configurations, there are in all at most $2d$ of them, and each color contributing to $B_2$ uses up two of them. Since $B_0 + B_1 = q - B_2$, we get
$$P(\text{failed update}) = \frac{B_1}{B_0 + B_1} = \frac{B_1}{q - B_2} \le \frac{2d - 2B_2}{q - B_2} = \frac{2d\left(1 - \frac{B_2}{d}\right)}{q\left(1 - \frac{B_2}{q}\right)} \le \frac{2d}{q},$$
where the last inequality holds because $1 - \frac{B_2}{d} \le 1 - \frac{B_2}{q}$.
Thus, after the first sweep of the Gibbs sampler, we have, for each vertex $v$,
$$P(X_k(v) \ne X_k'(v)) \le \frac{2d}{q}.$$
Now, consider updates during the second sweep of the Gibbs sampler, i.e., between times $k$ and $2k$. For an update at time $n$ during the second sweep to fail at $v$, the configurations $X_n$ and $X_n'$ must differ in at least one neighbor of $v$. Each neighbor $w$ has $X_n(w) \ne X_n'(w)$ with probability at most $\frac{2d}{q}$, and summing over the at most $d$ neighbors, we get that
$$P(\text{discrepancy}) \le \frac{2d^2}{q},$$
where discrepancy means there is a neighbor $w$ of $v$ with $X_n(w) \ne X_n'(w)$. Furthermore, by repeating the arguments above, it follows that the conditional probability satisfies
$$P(\text{failed update} \mid \text{discrepancy}) \le \frac{2d}{q},$$
and so, at the end of the second sweep,
$$P(\text{failed update}) \le P(\text{discrepancy})\,P(\text{failed update} \mid \text{discrepancy}) \le \frac{2d}{q} \cdot \frac{2d^2}{q}.$$
So, after two sweeps,
$$P(X_{2k}(v) \ne X_{2k}'(v)) \le \frac{2d}{q} \cdot \frac{2d^2}{q}.$$
After $m$ sweeps,
$$P(X_{mk}(v) \ne X_{mk}'(v)) \le \frac{2d}{q}\left(\frac{2d^2}{q}\right)^{m-1}.$$
Now, let $A \subseteq \{1, \dots, q\}^V$ be any subset. We have that
$$d_{TV}(\mu_{mk}, \rho_{G,q}) = \max_{A \subseteq \{1,\dots,q\}^V} |\mu_{mk}(A) - \rho_{G,q}(A)| = \max_{A \subseteq \{1,\dots,q\}^V} |P(X_{mk} \in A) - P(X_{mk}' \in A)|.$$
For any such $A$, we have
$$P(X_{mk} \in A) - P(X_{mk}' \in A)$$
$$= P(X_{mk} \in A, X_{mk}' \in A) + P(X_{mk} \in A, X_{mk}' \notin A) - P(X_{mk}' \in A, X_{mk} \in A) - P(X_{mk}' \in A, X_{mk} \notin A)$$
$$\le P(X_{mk} \in A, X_{mk}' \notin A) \le P(X_{mk} \ne X_{mk}') \le k \cdot \frac{2d}{q}\left(\frac{2d^2}{q}\right)^{m-1} = \frac{k}{d}\left(\frac{2d^2}{q}\right)^m,$$
using a union bound over the $k$ vertices in the last step. Similarly, we get that
$$P(X_{mk}' \in A) - P(X_{mk} \in A) \le \frac{k}{d}\left(\frac{2d^2}{q}\right)^m.$$
Thus,
$$|P(X_{mk} \in A) - P(X_{mk}' \in A)| \le \frac{k}{d}\left(\frac{2d^2}{q}\right)^m,$$
and, thus,
$$d_{TV}(\mu_{mk}, \rho_{G,q}) \le \frac{k}{d}\left(\frac{2d^2}{q}\right)^m.$$
Now, how large does $m$ need to be in order to make this less than $\varepsilon$? We want
$$\frac{k}{d}\left(\frac{2d^2}{q}\right)^m \le \varepsilon.$$
Solving for $m$, we get
$$m \ge \frac{\log k + \log\frac{1}{\varepsilon} - \log d}{\log\frac{q}{2d^2}}.$$
So, running the Gibbs sampler long enough to complete at least this many sweeps will guarantee that $d_{TV}(\mu_{mk}, \rho_{G,q}) \le \varepsilon$. So, how many steps do we need? We must multiply $m$ by $k$. But wanting both $n$ and $m$ to be integers implies that we should take $n$ to be
$$n = k\left(\frac{\log k + \log\frac{1}{\varepsilon} - \log d}{\log\frac{q}{2d^2}} + 1\right).$$
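A tiny sketch that evaluates this bound (the function name and the parameter values in the example call are made up for illustration):

```python
from math import log, ceil

def gibbs_iterations(k, d, q, eps):
    """Upper bound k*(m + 1) on the number of Gibbs sampler steps,
    where m solves (k/d)(2d^2/q)^m = eps (requires q > 2*d**2)."""
    assert q > 2 * d * d
    m = (log(k) + log(1 / eps) - log(d)) / log(q / (2 * d * d))
    return ceil(k * (m + 1))

# Example: 100 vertices, maximum degree 4, q = 40 colors, eps = 0.01.
print(gibbs_iterations(100, 4, 40, 0.01))
```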