Markov Chain Monte Carlo

Let's introduce this subject with an overview of what it does and how it does it. The purpose of Markov Chain Monte Carlo (MCMC) is to estimate $E(f(X))$, where $X$ ranges over the points of a finite probability space $S = \{s_1, s_2, \ldots, s_k\}$ with a distribution $\pi$. You might say, oh, $S$ is finite so this is easy:
$$E(f(X)) = \sum_{i=1}^{k} f(s_i)\,\pi(s_i).$$
Indeed, for a small, or even moderately sized, space this would be all you would need to do. To fix the idea, suppose that $S = \{0, 1, 2\}$, $\pi(0) = \tfrac{1}{2}$, $\pi(1) = \tfrac{1}{3}$, $\pi(2) = \tfrac{1}{6}$, and $f(X) = X^2$; then
$$E(f(X)) = 0\cdot\tfrac{1}{2} + 1\cdot\tfrac{1}{3} + 4\cdot\tfrac{1}{6} = 1.$$
However, let's consider the following canonical example of a finite probability space that really needs a technique such as MCMC. Consider a finite lattice $L$ consisting of the points $(i, j)$, $0 \le i \le 8$, $0 \le j \le 8$. A configuration on $L$ is an assignment of 0s and 1s to the points of $L$. Points of $L$ are considered to be neighbors if they differ by exactly 1 in exactly one coordinate. Thus $(1, 0)$ and $(0, 1)$ are neighbors of $(0, 0)$, but $(1, 1)$ is not. A configuration is feasible if no two neighboring points both take the value 1. The set $S = \{$feasible configurations on $L\}$, and the probability $\pi_L$ is the uniform probability on $S$, i.e., $\pi_L(\xi) = \frac{1}{|S|}$ for $\xi \in S$. The challenge is to estimate $E(n(\xi))$, where $n(\xi)$ is the number of 1s in $\xi$, $\xi \in S$. A little thought indicates that even the size of $S$ is difficult to determine, and it would not be possible to write down all the feasible configurations. Indeed, the configuration $\xi(i, j) = 1$ if $i, j$ are both even or both odd, and $\xi(i, j) = 0$ otherwise, is feasible, and so is every configuration obtained from it by changing some of its 1s to 0s; hence $|S| \ge 2^{32}$, and, in fact, it is somewhat larger than that.

We will have to develop a number of topics before this statement makes sense, but a somewhat concise statement of how MCMC helps with this problem, and others like it, is the following. The MCMC approach is to construct a Markov Chain (MC) with state space $S$ and transition probabilities $P_{\xi, \xi'}$ that allow only a relatively small number of possible transitions from any given state in $S$. Then, if the MC has been designed properly, and we can simulate it well enough, the distribution of the $n$th iterate (starting from any initial distribution) will, for $n$ large enough, be close to $\pi_L$, and we can estimate $E(n(\xi))$ by the average of the values $n(\xi_n)$ along the simulation. "Designed properly" means that the MC satisfies certain conditions, namely, it is:
1. Irreducible
2. Aperiodic
3. $\pi_L$ is a stationary distribution for the MC (actually, we will usually have that $\pi_L$ is a reversible distribution for the MC).
With these conditions satisfied, it is possible to show that, for large enough $n$, the distribution of $\xi_n$ will indeed be close to $\pi_L$, no matter what the initial distribution is, and it will even be possible to obtain useful estimates of convergence rates.

Before beginning to discuss the topics highlighted above, let's indicate how a suitable Markov chain can be defined on the feasible configurations in $S$. At each integer time $n + 1$, we do the following:
1. Pick a point $v \in L$ uniformly at random.
2. Toss a fair coin.
3. If the coin comes up heads and all neighbors of $v$ take the value 0 in $\xi_n$, then let $\xi_{n+1}(v) = 1$; otherwise, let $\xi_{n+1}(v) = 0$.
4. For all other points $w$ in $L$, leave the value at $w$ unchanged, i.e., $\xi_{n+1}(w) = \xi_n(w)$.
Note that even though the state space is complicated, from a given $\xi \in S$ there are only relatively few other feasible states that can be reached in one step: the states accessible in one step can differ from $\xi$ at only one point.
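To make the transition mechanism concrete, here is a minimal simulation sketch of this chain, together with the resulting estimate of $E(n(\xi))$. It is only an illustration: the grid size, step counts, burn-in length, and function names are arbitrary choices of mine, not part of the construction above.

```python
# A minimal sketch of the single-site chain on the lattice described above.
import random

def neighbors(i, j, n):
    """Lattice neighbors of (i, j): points differing by exactly 1 in one coordinate."""
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < n and 0 <= j + dj < n:
            yield (i + di, j + dj)

def step(config, n):
    """One transition: pick a point uniformly, toss a fair coin, update as in steps 1-4."""
    i, j = random.randrange(n), random.randrange(n)
    heads = random.random() < 0.5
    if heads and all(config[w] == 0 for w in neighbors(i, j, n)):
        config[(i, j)] = 1
    else:
        config[(i, j)] = 0

def estimate_mean_ones(n=9, steps=200_000, burn_in=50_000):
    """Estimate E(n(xi)) by averaging the number of 1s along the chain."""
    config = {(i, j): 0 for i in range(n) for j in range(n)}  # all-zero start is feasible
    total, count = 0, 0
    for t in range(steps):
        step(config, n)
        if t >= burn_in:
            total += sum(config.values())
            count += 1
    return total / count

if __name__ == "__main__":
    print(estimate_mean_ones())
```

Starting from the all-zero configuration (which is feasible) is convenient, and the averaging is deferred until after a burn-in period so that the early, atypical states carry less weight.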
Markov Chains: Let $S = \{s_1, s_2, \ldots, s_k\}$ be a finite set. We will think of $S$ as the set of possible states for some process $X$ that assigns to each non-negative integer an element, or state, in $S$. A realization of the process is then a sequence of states. The state associated with time 0 is the initial state, and it must be specified. After that, the process tells us how to "determine" the next state, and then the next one, and so on. If the process is deterministic, the sequence of states is uniquely determined by the choice of the initial state. We are interested in processes for which that may not be the case.

Here is an example. A man lives on a block with four corners labeled $v_1, v_2, v_3, v_4$ (maybe there are coffee houses there). At time 0 he is at the corner with index 2. A fair coin is flipped; if it comes up heads, he moves at time 1 to the corner with index 3, otherwise he moves to the corner with index 1. Another coin is flipped; if heads, he moves at time 2 clockwise to the next corner, and if tails, he moves counterclockwise to the next corner. We can represent the states of this process by a graph with four vertices (for the corners) and arrows indicating the possible transitions. All Markov chains can be represented this way. We can describe the dynamics of this chain by a matrix, called the transition matrix, which specifies the probability of each transition from a corner with index $i$ to a corner with index $j$. The transition matrix for this chain is
$$P = \begin{pmatrix} 0 & \tfrac12 & 0 & \tfrac12 \\ \tfrac12 & 0 & \tfrac12 & 0 \\ 0 & \tfrac12 & 0 & \tfrac12 \\ \tfrac12 & 0 & \tfrac12 & 0 \end{pmatrix}.$$
The coordinates of the transition matrix are denoted $P_{ij}$, and $P_{ij}$ represents the probability that the process will transition to state $j$ from state $i$. For example, $P_{12} = \tfrac12$ for the process above. Notice that the transition probability depends only on the current state and not on any earlier ones.

Definition: Let $P$ be a $k \times k$ matrix with coordinates $P_{ij}$, $i, j = 1, 2, \ldots, k$. A random process $X_0, X_1, \ldots$ with finite state space $S = \{s_1, s_2, \ldots, s_k\}$ is said to be a homogeneous Markov chain with transition matrix $P$ if, for all $n$, all $i, j \in \{1, 2, \ldots, k\}$, and all $i_0, \ldots, i_{n-1} \in \{1, \ldots, k\}$, we have
$$P(X_{n+1} = s_j \mid X_0 = s_{i_0}, X_1 = s_{i_1}, \ldots, X_{n-1} = s_{i_{n-1}}, X_n = s_i) = P(X_{n+1} = s_j \mid X_n = s_i) = P_{ij}.$$
In words: the future is independent of the past, given the present. The entries $P_{ij}$ must satisfy $P_{ij} \ge 0$ and $\sum_{j=1}^{k} P_{ij} = 1$ for all $i$.

If we start with a point in the state space, then we cannot say with certainty what the next state will be, and certainly not the one after that, and so on. However, if we start with a distribution on the points of the state space, an initial distribution, then the distribution of the subsequent states is determined. If we specify an initial distribution $\mu_0$, with $\mu_0(s_i) = P(X_0 = s_i)$, $1 \le i \le k$, then the distribution of $X_1$ is determined by
$$P(X_1 = s_j) = \sum_i P(X_1 = s_j, X_0 = s_i) = \sum_i P(X_0 = s_i)\,P(X_1 = s_j \mid X_0 = s_i) = \sum_i \mu_0(s_i)\,P_{ij} = (\mu_0 P)_j,$$
where $\mu_0 = (\mu_0(s_1), \ldots, \mu_0(s_k))$ is treated as a row vector, so that $\mu_1 = \mu_0 P$. In general, it is easy to see that the distribution of $X_n$ is given by $\mu_n = \mu_0 P^n$, where we understand each $\mu_j$ to be a row vector with $k$ components.
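As a quick numerical illustration of $\mu_n = \mu_0 P^n$ (my own sketch; numpy and the variable names are incidental), the following propagates the four-corners chain started at corner $v_2$:

```python
# A quick check of mu_n = mu_0 P^n for the four-corners chain.
import numpy as np

P = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])

mu = np.array([0.0, 1.0, 0.0, 0.0])   # mu_0: start at corner v2 with probability 1

for n in range(1, 7):
    mu = mu @ P                        # mu_n = mu_{n-1} P
    print(n, mu)
```

Note that the printed distributions alternate between being supported on $\{v_1, v_3\}$ and on $\{v_2, v_4\}$ rather than settling down; this is the periodicity phenomenon taken up below.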
Simulating Markov Chains: We assume that we have access to a source of uniformly distributed random variables. This is a big assumption, and since I don't actually know of a bona fide source of truly random numbers, uniform or otherwise, we might say that this assumption is made so that we can proceed with our discussion. In practice we will substitute a simulation of such randomness using a pseudo-random number generator. Thus we assume that, anytime we want, we can extract from our "source" a string of any desired length of independent uniform $[0, 1]$ random variables $U_0, U_1, \ldots, U_N$.

Having such strings available, and having a Markov chain, we can simulate the chain for as many "steps" as we wish in a very straightforward manner. So, we have a state space $S = \{s_1, s_2, \ldots, s_k\}$ and a transition matrix $P$. To simulate the process $X$ for $N$ steps, we need to do three things:
1. Define an initial distribution, which is completely up to us.
2. Define how to determine the next state given the present state.
3. Do this so that, once a sequence of uniform $[0, 1]$ random variables $U_0, U_1, \ldots, U_N$ is inputted, we can determine the states of $X$ out to the same number of places, i.e., $X_0, X_1, \ldots, X_N$, and verify that the simulation really behaves like $X$.

Specifying the initial distribution: Suppose we have decided on an initial distribution $\mu_0 = (\mu_0(s_1), \ldots, \mu_0(s_k))$. Simulating such a distribution can be accomplished by any function $\psi : [0, 1] \to S$, called an initiation function, such that for each $i$ the measure of the set $\{u \mid \psi(u) = s_i\}$ is $\mu_0(s_i)$. A straightforward way to do this is to take $\psi(u) = s_1$ on the interval $[0, \mu_0(s_1))$ and, for $i \ge 2$, $\psi(u) = s_i$ on the interval $\left[\sum_{j=1}^{i-1} \mu_0(s_j),\ \sum_{j=1}^{i} \mu_0(s_j)\right)$. Then, given $U_0, U_1, \ldots, U_N$, taking $X_0 = \psi(U_0)$ assigns to $X_0$ values in $S$ distributed according to $\mu_0$. Note: the choice of an initiation function is in no way unique.

Updating: If we are in state $s_i$ at time $n$, we want to know how to determine the next state. But this is no harder than simulating the initial distribution, except that we have a different update rule for each state. Define the update function $\varphi : S \times [0, 1] \to S$ by
$$\varphi(s_i, u) = \begin{cases} s_1, & u \in \left[0,\ P_{i1}\right), \\[4pt] s_j, & u \in \left[\sum_{l=1}^{j-1} P_{il},\ \sum_{l=1}^{j} P_{il}\right), \quad 2 \le j \le k-1, \\[4pt] s_k, & u \in \left[\sum_{l=1}^{k-1} P_{il},\ 1\right]. \end{cases}$$
Clearly, for each $s_i, s_j \in S$, $P(\varphi(s_i, U) = s_j) = P_{ij}$. Update functions are not unique either. To simulate the chain now, given $U_0, U_1, \ldots, U_N$, we take $X_0 = \psi(U_0)$, $X_1 = \varphi(X_0, U_1) = \varphi(\psi(U_0), U_1)$, and so on.

Assignment: simulate the four corners Markov chain.
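For what it is worth, here is one way the initiation and update functions could be coded. It is a sketch under my own naming conventions, demonstrated on a small made-up chain rather than on the assigned one.

```python
# Generic initiation (psi) and update (phi) functions built from cumulative sums.
import random

def categorical(u, probs):
    """Map u in [0, 1) to an index using consecutive subintervals of [0, 1]."""
    c = 0.0
    for i, p in enumerate(probs):
        c += p
        if u < c:
            return i
    return len(probs) - 1          # guards against floating-point rounding at the right end

def psi(u, mu0):
    """Initiation function: X_0 = psi(U_0) is distributed according to mu0."""
    return categorical(u, mu0)

def phi(i, u, P):
    """Update function: from state i, P(phi(i, U) = j) = P[i][j]."""
    return categorical(u, P[i])

def simulate(P, mu0, N):
    """Simulate X_0, ..., X_N using fresh uniforms U_0, ..., U_N."""
    X = [psi(random.random(), mu0)]
    for _ in range(N):
        X.append(phi(X[-1], random.random(), P))
    return X

# Demonstration on a small 3-state chain (an arbitrary example, not the four corners).
P_demo = [[0.9, 0.1, 0.0],
          [0.2, 0.5, 0.3],
          [0.0, 0.4, 0.6]]
print(simulate(P_demo, [1.0, 0.0, 0.0], 15))
```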
Irreducible Markov Chains: Roughly, an irreducible Markov chain is one in which you can eventually get from any given state to any other state. For $s_i, s_j \in S$, we say that $s_i$ communicates with $s_j$, written $s_i \to s_j$, if the chain has positive probability of ever reaching $s_j$ from $s_i$, i.e., $(P^n)_{ij} > 0$ for some $n$. If it is also true that $s_j$ communicates with $s_i$, then these states are said to intercommunicate, and we write $s_i \leftrightarrow s_j$.

Definition: A Markov chain $X_0, X_1, \ldots$ with state space $S = \{s_1, s_2, \ldots, s_k\}$ and transition matrix $P$ is said to be irreducible if $s_i \leftrightarrow s_j$ for all $s_i, s_j \in S$. Otherwise the chain is said to be reducible.

Example of a reducible MC: (just draw the graph)

Aperiodic Markov Chains: For a finite or infinite set $\{a_1, a_2, \ldots\}$ of positive integers, we write $\gcd\{a_1, a_2, \ldots\}$ for the greatest common divisor of $a_1, a_2, \ldots$. The period $d(s_i)$ of a state $s_i$ is defined as
$$d(s_i) = \gcd\{n \ge 1 \mid (P^n)_{ii} > 0\}.$$
The period of $s_i$ is the greatest common divisor of the set of times at which the chain can return, with positive probability, to the state $s_i$ given that $X_0 = s_i$. If $d(s_i) = 1$, we say that the state $s_i$ is aperiodic.

Definition: A Markov chain is said to be aperiodic if all of its states are aperiodic. Otherwise, the chain is said to be periodic.

Exercise: Show that the four corners MC is periodic.

Theorem (on aperiodicity): Suppose that $X_0, X_1, \ldots$ is an aperiodic MC with state space $S = \{s_1, \ldots, s_k\}$ and transition matrix $P$. Then there exists an $N < \infty$ such that $(P^n)_{ii} > 0$ for all $i \in \{1, \ldots, k\}$ and all $n \ge N$.

Lemma (from number theory): Let $A = \{a_1, a_2, \ldots\}$ be a set of positive integers which is
i) nonlattice, meaning that $\gcd\{a_1, a_2, \ldots\} = 1$, and
ii) closed under addition.
Then there exists an integer $N$ such that $n \in A$ for all $n \ge N$.

Proof of the theorem on aperiodicity: For $s_i \in S$, let $A_i = \{n \ge 1 \mid (P^n)_{ii} > 0\}$; $A_i$ is the set of possible return times to the state $s_i$ starting from $s_i$. Since the chain is aperiodic, $A_i$ is nonlattice. Let's see that $A_i$ is also closed under addition. Suppose that $a, a' \in A_i$; then
$$P(X_{a+a'} = s_i \mid X_0 = s_i) \ge P(X_a = s_i, X_{a+a'} = s_i \mid X_0 = s_i) = P(X_a = s_i \mid X_0 = s_i)\,P(X_{a+a'} = s_i \mid X_a = s_i) = P(X_a = s_i \mid X_0 = s_i)\,P(X_{a'} = s_i \mid X_0 = s_i) > 0,$$
so $a + a' \in A_i$. By the lemma, there is an $N_i$ such that $(P^n)_{ii} > 0$ for all $n \ge N_i$; taking $N = \max_i N_i$ completes the proof.

Corollary: Suppose that $X_0, X_1, \ldots$ is an irreducible and aperiodic MC with state space $S = \{s_1, \ldots, s_k\}$ and transition matrix $P$. Then there exists a finite positive integer $M$ such that $(P^n)_{ij} > 0$ for all $i, j \in \{1, \ldots, k\}$ and all $n \ge M$.

Proof: By aperiodicity, there is an $N$ such that $(P^n)_{ii} > 0$ for all $i \in \{1, \ldots, k\}$ and all $n \ge N$. For $s_i, s_j \in S$, by irreducibility, there is an $n_{i,j}$ such that $(P^{n_{i,j}})_{ij} > 0$. Let $M_{i,j} = N + n_{i,j}$. Then, for $m \ge M_{i,j}$, we have
$$P(X_m = s_j \mid X_0 = s_i) \ge P(X_{m - n_{i,j}} = s_i, X_m = s_j \mid X_0 = s_i) = P(X_{m - n_{i,j}} = s_i \mid X_0 = s_i)\,P(X_m = s_j \mid X_{m - n_{i,j}} = s_i) > 0.$$
Take $M = \max\{M_{1,1}, M_{1,2}, \ldots, M_{1,k}, M_{2,1}, \ldots, M_{k,k}\}$.

Stationary Distributions: A distribution $\pi$ is said to be a stationary distribution for the Markov chain $X$ with transition matrix $P$ if $\pi P = \pi$. We are going to prove some results concerning existence, uniqueness, and convergence to stationarity for irreducible, aperiodic Markov chains. Some of the hypotheses in these results can be relaxed.

Theorem (Existence): For any irreducible and aperiodic MC, there exists at least one stationary distribution.

Hitting times: $T_{ij} = \min\{n \ge 1 \mid X_n = s_j\}$ given that $X_0 = s_i$, with $T_{ij} = \infty$ if the chain never visits $s_j$. The mean hitting time is $\tau_{ij} = E[T_{ij}]$. The mean return time is then $\tau_{ii}$.

Lemma (finiteness of hitting times): Suppose that $X_0, X_1, \ldots$ is an irreducible and aperiodic MC with state space $S = \{s_1, \ldots, s_k\}$ and transition matrix $P$, and let $s_i, s_j \in S$. Then $P(T_{ij} < \infty) = 1$. Moreover, the mean hitting time $\tau_{ij}$ is finite.

Proof: By the corollary in the previous section, there is an $M$ such that $(P^M)_{ij} > 0$ for all $i, j \in \{1, \ldots, k\}$. Let $\alpha = \min\{(P^M)_{ij} \mid i, j \in \{1, 2, \ldots, k\}\} > 0$. For states $s_i, s_j$,
$$P(T_{ij} > M) \le P(X_M \ne s_j) \le 1 - \alpha.$$
Furthermore,
$$P(T_{ij} > 2M) = P(T_{ij} > M)\,P(T_{ij} > 2M \mid T_{ij} > M) \le P(T_{ij} > M)\,P(X_{2M} \ne s_j \mid T_{ij} > M) \le (1 - \alpha)^2,$$
since, given the history up to time $M$ (in particular, given $X_1 \ne s_j, \ldots, X_M \ne s_j$), the conditional probability that $X_{2M} = s_j$ is at least $\alpha$. Continuing this process, we have, for each $l \ge 1$, $P(T_{ij} > lM) \le (1 - \alpha)^l$; since $1 - \alpha < 1$, it follows that $P(T_{ij} < \infty) = 1$. Recall that
$$\tau_{ij} = E[T_{ij}] = \sum_{n \ge 1} P(T_{ij} \ge n) = \sum_{n \ge 0} P(T_{ij} > n) \le \sum_{l \ge 0}\ \sum_{n = lM}^{(l+1)M - 1} P(T_{ij} > lM) \le M \sum_{l \ge 0} (1 - \alpha)^l = \frac{M}{1 - (1 - \alpha)} = \frac{M}{\alpha} < \infty.$$
The Lemma is established.

Proof of the existence theorem (setup): Assume $X_0 = s_1$ and let
$$\rho_i = \sum_{n = 0}^{\infty} P(X_n = s_i,\ T_{11} > n).$$
That is, $\rho_i$ is the expected number of visits to state $s_i$ up to time $T_{11} - 1$. (The idea is that $\rho_i / \tau_{11}$ defines a stationary distribution; the details will be taken up at the next meeting.)
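Numerically, a stationary distribution of a small chain can be found by solving $\pi P = \pi$, for example as a left eigenvector of $P$ for the eigenvalue 1. The sketch below (mine, with an arbitrary example matrix) does this and also prints a row of $P^n$ for moderately large $n$, anticipating the convergence theorem that comes next.

```python
# Find a stationary distribution as a left eigenvector of P, then compare with a row of P^n.
import numpy as np

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])   # an arbitrary irreducible, aperiodic example

# Left eigenvectors of P are right eigenvectors of P transpose.
eigvals, eigvecs = np.linalg.eig(P.T)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = v / v.sum()                  # normalize so the entries sum to 1

print("stationary pi:", pi)
print("pi P         :", pi @ P)                              # reproduces pi
print("row of P^50  :", np.linalg.matrix_power(P, 50)[0])    # close to pi for this chain
```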
Theorem (The Markov Chain Convergence Theorem): Let $X_0, X_1, \ldots$ be an irreducible, aperiodic Markov chain with state space $S = \{s_1, \ldots, s_k\}$, transition matrix $P$, and arbitrary initial distribution $\mu_0$. Then, for any distribution $\pi$ that is stationary for the transition matrix $P$, we have $\mu_n \to \pi$ in total variation (TV).

Proof (coupling argument): Assume that $X_0, X_1, \ldots$ has been obtained by simulation,
$$X_0 = \psi(U_0), \qquad X_{n+1} = \varphi(X_n, U_{n+1}),$$
where $\psi$ is a valid initiation function for $\mu_0$, $\varphi$ is a valid update function for $P$, and $U_0, U_1, \ldots$ is an i.i.d. sequence of uniform $[0, 1]$ random variables.

Next, construct a second Markov chain $X'_0, X'_1, \ldots$ by letting $\psi'$ be a valid initiation function for the distribution $\pi$, letting $U'_0, U'_1, \ldots$ be a second i.i.d. sequence of uniform $[0, 1]$ random variables independent of $U_0, U_1, \ldots$, and setting
$$X'_0 = \psi'(U'_0), \qquad X'_{n+1} = \varphi(X'_n, U'_{n+1}).$$
Since $\pi$ is a stationary distribution, $X'_n$ has distribution $\pi$ for every $n$. Also, the chains $X_0, X_1, \ldots$ and $X'_0, X'_1, \ldots$ are independent of each other, since $U_0, U_1, \ldots$ and $U'_0, U'_1, \ldots$ are.

We want to show that these two chains must meet, i.e., that there exists a time $n$ such that $X_n = X'_n$. To this end, let $T = \min\{n : X_n = X'_n\}$, with $T = \infty$ if the chains never meet. Since the Markov chain $X_0, X_1, \ldots$ is irreducible and aperiodic, we can find, using the corollary of the previous section, a finite $M$ such that $(P^M)_{ij} > 0$ for all $i, j \in \{1, \ldots, k\}$. Let $\alpha = \min\{(P^M)_{ij} \mid i, j \in \{1, \ldots, k\}\} > 0$. We get
$$P(T \le M) \ge P(X_M = X'_M) \ge P(X_M = s_1, X'_M = s_1) = P(X_M = s_1)\,P(X'_M = s_1)$$
$$= \left(\sum_{i=1}^{k} P(X_0 = s_i)\,P(X_M = s_1 \mid X_0 = s_i)\right)\left(\sum_{i=1}^{k} P(X'_0 = s_i)\,P(X'_M = s_1 \mid X'_0 = s_i)\right) \ge \left(\sum_{i=1}^{k} P(X_0 = s_i)\,\alpha\right)\left(\sum_{i=1}^{k} P(X'_0 = s_i)\,\alpha\right) = \alpha^2.$$
So $P(T > M) \le 1 - \alpha^2$. Similarly, given everything that has happened up to time $M$, the conditional probability that $X_{2M} = X'_{2M} = s_1$ is at least $\alpha^2$, so $P(X_{2M} \ne X'_{2M} \mid T > M) \le 1 - \alpha^2$. Hence
$$P(T > 2M) = P(T > M)\,P(T > 2M \mid T > M) \le (1 - \alpha^2)\,P(X_{2M} \ne X'_{2M} \mid T > M) \le (1 - \alpha^2)^2.$$
Iteration shows that $P(T > lM) \le (1 - \alpha^2)^l$, which tends to 0 as $l$ tends to infinity. Hence $\lim_{n \to \infty} P(T > n) = 0$.

We now construct a third Markov chain $X''_0, X''_1, \ldots$ by setting $X''_0 = X_0$ and, for each $n \ge 0$,
$$X''_{n+1} = \begin{cases} \varphi(X''_n, U_{n+1}) & \text{if } X''_n \ne X'_n, \\ \varphi(X''_n, U'_{n+1}) & \text{if } X''_n = X'_n. \end{cases}$$
So the chain $X''_0, X''_1, \ldots$ evolves exactly like the chain $X_0, X_1, \ldots$ until the time $T$ when it first meets the chain $X'_0, X'_1, \ldots$; it then switches to evolve exactly like the chain $X'_0, X'_1, \ldots$. Since at each step $X''$ is updated by a valid update function with a fresh uniform variable, it is again a Markov chain with transition matrix $P$, and because its initial distribution is $\mu_0$, $X''_n$ has distribution $\mu_n$ for every $n$. For any $i \in \{1, 2, \ldots, k\}$ we have
$$\mu_n(s_i) - \pi(s_i) = P(X''_n = s_i) - P(X'_n = s_i) \le P(X''_n = s_i, X'_n \ne s_i) \le P(X''_n \ne X'_n) \le P(T > n).$$
Using the same argument with the roles of $X''_n$ and $X'_n$ interchanged, we get $\pi(s_i) - \mu_n(s_i) \le P(T > n)$ as well, so $|\mu_n(s_i) - \pi(s_i)| \le P(T > n) \to 0$ for every $i$, and hence $\mu_n \to \pi$ in total variation.

Theorem (Uniqueness): Any irreducible and aperiodic MC has exactly one stationary distribution.

The proofs of these results will be the main task of the next meeting. Right now we want to go on to one more topic and then return briefly to the opening example.

Reversible Markov Chains:

Definition: Let $X_0, X_1, \ldots$ be an irreducible, aperiodic Markov chain with state space $S = \{s_1, \ldots, s_k\}$ and transition matrix $P$. A probability distribution $\pi$ on $S$ is said to be reversible for the chain (or for the transition matrix $P$) if for all $i, j \in \{1, \ldots, k\}$ we have
$$\pi_i P_{ij} = \pi_j P_{ji}.$$
The Markov chain is said to be reversible if there exists a reversible distribution for it.

Theorem (reversible distributions are stationary): Let $X_0, X_1, \ldots$ be an irreducible, aperiodic Markov chain with state space $S = \{s_1, \ldots, s_k\}$ and transition matrix $P$. If $\pi$ is a reversible distribution for $P$, then it is also a stationary distribution for $P$.

Proof: We have to show that $\pi_j = \sum_{i=1}^{k} \pi_i P_{ij}$ for every $j \in \{1, \ldots, k\}$. Since the rows of $P$ sum to 1,
$$\pi_j = \pi_j \sum_{i=1}^{k} P_{ji} = \sum_{i=1}^{k} \pi_j P_{ji} = \sum_{i=1}^{k} \pi_i P_{ij},$$
where reversibility is used in the last step.
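Detailed balance is easy to check by machine. The sketch below (my own example, not part of the notes) verifies $\pi_i P_{ij} = \pi_j P_{ji}$ for simple random walk on a small graph, where the distribution proportional to the vertex degrees is a standard example of a reversible distribution.

```python
# Check detailed balance pi_i P_ij = pi_j P_ji for simple random walk on a small graph.
import numpy as np

# Adjacency matrix of a 4-vertex graph (a square with one diagonal).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)

deg = A.sum(axis=1)
P = A / deg[:, None]          # random walk: move to a uniformly chosen neighbor
pi = deg / deg.sum()          # candidate reversible distribution, proportional to degree

balance = pi[:, None] * P     # matrix with entries pi_i P_ij
print("detailed balance holds:", np.allclose(balance, balance.T))
print("pi is stationary      :", np.allclose(pi @ P, pi))
```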
Returning to the opening example: We would like to see that the uniform distribution on the feasible states of the opening example is reversible for the transition mechanism defined there. Let $P_{\xi, \xi'}$ denote the transition probability from state $\xi$ to state $\xi'$. We want to see that
$$\pi_L(\xi)\,P_{\xi,\xi'} = \pi_L(\xi')\,P_{\xi',\xi}.$$
Let $d = d(\xi, \xi')$ be the number of vertices at which $\xi$ and $\xi'$ differ. If $d = 0$, the equality is obvious. If $d \ge 2$, then $P_{\xi,\xi'} = P_{\xi',\xi} = 0$ (these transitions don't happen), and the equality again holds. So suppose $d = 1$, i.e., $\xi$ and $\xi'$ differ at exactly one vertex, say $v$. Then all neighbors of $v$ must take the value 0 in both configurations; otherwise one of the configurations would not be feasible. Writing $k$ for the number of points of $L$, a transition that flips the value at $v$ requires picking $v$ (probability $1/k$) and the appropriate outcome of the coin toss (probability $1/2$), so
$$\pi_L(\xi)\,P_{\xi,\xi'} = \frac{1}{|S|}\cdot\frac{1}{2k} = \pi_L(\xi')\,P_{\xi',\xi}.$$

Assignment: Consider a Bose–Einstein experiment of placing $n$ identical particles in $N$ boxes. The configurations are characterized by the occupation numbers of the boxes: $x_1$ = the number of particles in box 1, and so on. All configurations have the same probability, namely $1/\binom{N + n - 1}{N - 1}$. Define a transition mechanism on the states of this system and show that this distribution is a reversible distribution for the mechanism.

q-colorings: Let $G = (V, E)$ be a graph and let $q \ge 2$ be an integer. A $q$-coloring of the graph $G$ is an assignment to each vertex of a value from $\{1, \ldots, q\}$, thought of as colors, with the property that no two adjacent vertices have the same color. A random $q$-coloring is a coloring chosen from the set of all $q$-colorings with the uniform distribution, which we denote $\rho_{G,q}$. A Gibbs sampler for random $q$-colorings is a $\{1, \ldots, q\}^V$-valued Markov chain in which, at each time $n + 1$, transitions take place as follows:
1. Pick a vertex $v \in V$ at random (uniformly).
2. Pick $X_{n+1}(v)$ according to the uniform distribution over the set of colors that are not attained at any neighboring vertex of $v$.
3. Leave the color unchanged at all other vertices.
Whether this chain is irreducible depends on $G$ and $q$, and it is not a trivial problem to determine this.
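Here is a rough sketch, with my own choice of data structures and example graph, of the random-vertex Gibbs sampler just described:

```python
# One step of the Gibbs sampler for random q-colorings described above.
import random

def gibbs_step(coloring, adj, q):
    """Pick a uniform vertex and recolor it uniformly among colors unused by its neighbors."""
    v = random.choice(list(adj))
    used = {coloring[w] for w in adj[v]}
    allowed = [c for c in range(1, q + 1) if c not in used]
    if allowed:                       # always nonempty when q exceeds the maximum degree
        coloring[v] = random.choice(allowed)

# Example: a 4-cycle with q = 3 colors, started from a proper coloring.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
coloring = {0: 1, 1: 2, 2: 1, 3: 2}
for _ in range(1000):
    gibbs_step(coloring, adj, 3)
print(coloring)
```

Started from a proper coloring, the chain stays among proper colorings, since the new color at $v$ is never one attained at a neighbor.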
The systematic sweep Gibbs sampler: Instead of picking the vertex at random, update the vertices in a fixed cyclic order: update $v_1$ at times $1, k+1, 2k+1, \ldots$, update $v_2$ at times $2, k+2, \ldots$, and in general update $v_i$ at times $i, k+i, 2k+i, \ldots$.

Theorem (bound on the number of iterates needed): Let $G = (V, E)$ be a graph, let $k$ be the number of vertices of $G$, and suppose that every vertex of $G$ has at most $d$ neighbors. Suppose furthermore that $q > 2d^2$. Then, for any fixed $\varepsilon > 0$, the number of iterations needed for the systematic sweep Gibbs sampler described above (starting from any fixed $q$-coloring $\xi$) to come within total variation distance $\varepsilon$ of the target distribution $\rho_{G,q}$ is at most
$$k\left(\left\lceil \frac{\log k + \log(1/\varepsilon) - \log d}{\log\!\left(\frac{q}{2d^2}\right)} \right\rceil + 1\right).$$

Proof: We use a coupling argument. We run two $\{1, \ldots, q\}^V$-valued chains $X_0, X_1, \ldots$ and $X'_0, X'_1, \ldots$ at the same time. The first chain starts from the fixed coloring, $X_0 = \xi$, while the second chain is started from a random state chosen according to the stationary distribution $\rho_{G,q}$. Then $X'_n$ has distribution $\rho_{G,q}$ and $X_n$ has distribution $\mu_n$. We will see that $d_{TV}(\mu_n, \rho_{G,q})$ is close to 0 if $P(X_n = X'_n)$ is close to 1.

For each update of the two chains, we need to choose a vertex to update, which we are doing systematically, and then we need to choose a random color for that vertex from the set of colors not attained at its neighbors. We do this by choosing a random permutation of the colors (a fresh one for each update) and then, for each chain, taking the first color in the permutation that is not attained at any neighbor of the vertex being updated. More formally, let $R_0, R_1, \ldots$ be an i.i.d. sequence of random permutations, each uniformly distributed on the set of permutations of $\{1, \ldots, q\}$. At each time $n$, the updates of the two chains use the permutation $R_n = (R_n(1), \ldots, R_n(q))$, and the vertex $v$ to be updated is assigned the new value
$$X_{n+1}(v) = R_n(i), \qquad i = \min\{j : X_n(w) \ne R_n(j) \text{ for all neighbors } w \text{ of } v\}$$
in the first chain, while in the second chain we set
$$X'_{n+1}(v) = R_n(i'), \qquad i' = \min\{j : X'_n(w) \ne R_n(j) \text{ for all neighbors } w \text{ of } v\}.$$

Now let's consider the probability that the two chains agree at a particular vertex. Consider the update of the two chains at a vertex $v$ at a time $n \le k$ (i.e., during the first sweep of the Gibbs sampler). We call the update successful if it results in $X_{n+1}(v) = X'_{n+1}(v)$; otherwise the update has failed. Define
$B_2$ = the number of colors $r \in \{1, \ldots, q\}$ attained in the neighborhood of $v$ in both $X_n$ and $X'_n$,
$B_1$ = the number of colors attained in the neighborhood of $v$ in exactly one of $X_n$ and $X'_n$,
$B_0$ = the number of colors attained in the neighborhood of $v$ in neither of $X_n$ and $X'_n$,
and note that $B_0 + B_1 + B_2 = q$. If $R_n(j)$ is a color counted in $B_2$, it is discarded by both chains. The update is successful precisely when the first color of $R_n$ that is not in $B_2$ belongs to $B_0$, since such a color is accepted by both chains; if instead that color belongs to $B_1$, the two chains pick different colors. So the probability of a failed update is
$$P(\text{failed update}) = \frac{B_1}{B_0 + B_1}.$$
Clearly $0 \le B_2 \le d$. Furthermore, $B_1 \le 2d - 2B_2$, because, counting the colored neighbors of $v$ in the two configurations, there are in all at most $2d$ of them, and each color contributing to $B_2$ uses up two of them. We get
$$P(\text{failed update}) \le \frac{B_1}{q - B_2} \le \frac{2d - 2B_2}{q - B_2} \le \frac{2d}{q},$$
the last step because the middle expression is largest at $B_2 = 0$ (using $d \le q$). Thus, after the first sweep of the Gibbs sampler, we have, for each vertex $v$,
$$P(X_k(v) \ne X'_k(v)) \le \frac{2d}{q}.$$

Now consider updates during the second sweep of the Gibbs sampler, i.e., at times between $k$ and $2k$. For an update at time $n$ during the second sweep to fail at $v$, the configurations $X_n$ and $X'_n$ must differ at at least one neighbor of $v$; call this event a discrepancy. Each neighbor $w$ has $X_n(w) \ne X'_n(w)$ with probability at most $2d/q$, and summing over the at most $d$ neighbors we get
$$P(\text{discrepancy}) \le \frac{2d^2}{q}.$$
Furthermore, by repeating the arguments above, the conditional probability of a failed update given a discrepancy satisfies $P(\text{failed update} \mid \text{discrepancy}) \le 2d/q$, and so, at the end of the second sweep,
$$P(\text{failed update}) \le P(\text{discrepancy})\,P(\text{failed update} \mid \text{discrepancy}) \le \frac{2d^2}{q}\cdot\frac{2d}{q}.$$
So, after two sweeps,
$$P(X_{2k}(v) \ne X'_{2k}(v)) \le \frac{2d}{q}\cdot\frac{2d^2}{q},$$
and, after $m$ sweeps,
$$P(X_{mk}(v) \ne X'_{mk}(v)) \le \frac{2d}{q}\left(\frac{2d^2}{q}\right)^{m-1}.$$

Now let $A \subseteq \{1, \ldots, q\}^V$ be any subset. We have
$$d_{TV}(\mu_{mk}, \rho_{G,q}) = \max_{A} |\mu_{mk}(A) - \rho_{G,q}(A)| = \max_{A} |P(X_{mk} \in A) - P(X'_{mk} \in A)|.$$
For any such $A$,
$$P(X_{mk} \in A) - P(X'_{mk} \in A) \le P(X_{mk} \in A,\ X'_{mk} \notin A) \le P(X_{mk} \ne X'_{mk}) \le k\cdot\frac{2d}{q}\left(\frac{2d^2}{q}\right)^{m-1} = \frac{k}{d}\left(\frac{2d^2}{q}\right)^{m},$$
where we used the union bound over the $k$ vertices. Similarly,
$$P(X'_{mk} \in A) - P(X_{mk} \in A) \le \frac{k}{d}\left(\frac{2d^2}{q}\right)^{m}.$$
Thus
$$|P(X_{mk} \in A) - P(X'_{mk} \in A)| \le \frac{k}{d}\left(\frac{2d^2}{q}\right)^{m}, \qquad \text{and hence} \qquad d_{TV}(\mu_{mk}, \rho_{G,q}) \le \frac{k}{d}\left(\frac{2d^2}{q}\right)^{m}.$$

Now, how large does $m$ need to be in order to make this less than $\varepsilon$? We want
$$\frac{k}{d}\left(\frac{2d^2}{q}\right)^{m} \le \varepsilon.$$
Solving for $m$ (and using $q > 2d^2$, so that $\log\frac{q}{2d^2} > 0$), we get
$$m \ge \frac{\log k + \log(1/\varepsilon) - \log d}{\log\!\left(\frac{q}{2d^2}\right)}.$$
So running the Gibbs sampler for at least this many sweeps will guarantee that $d_{TV}(\mu_{mk}, \rho_{G,q}) \le \varepsilon$. So how many steps do we need? We must multiply $m$ by $k$. Wanting both $n$ and $m$ to be integers, we take
$$n = k\left(\left\lceil \frac{\log k + \log(1/\varepsilon) - \log d}{\log\!\left(\frac{q}{2d^2}\right)} \right\rceil + 1\right).$$
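As a sanity check on the final formula (a sketch with arbitrarily chosen values of $k$, $d$, $q$, and $\varepsilon$), the bound can be evaluated directly:

```python
# Evaluate the bound on the number of sweeps m and total updates n = k(ceil(m) + 1).
import math

def gibbs_bound(k, d, q, eps):
    """Number of systematic-sweep updates sufficient for total variation distance eps,
    per the bound derived above (requires q > 2*d*d)."""
    assert q > 2 * d * d, "the bound requires q > 2d^2"
    m = (math.log(k) + math.log(1 / eps) - math.log(d)) / math.log(q / (2 * d * d))
    return k * (math.ceil(m) + 1)

# Example: 100 vertices, maximum degree 4, q = 40 colors, accuracy 0.01.
print(gibbs_bound(k=100, d=4, q=40, eps=0.01))
```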