MATH 3/5680: MARKOV CHAINS

WINFRIED JUST, OHIO UNIVERSITY

1. Basic examples

A stochastic process is a sequence ξ_t of random variables, where the index variable t stands for time. If t ∈ [0, ∞) or t ∈ R, then we talk about a continuous stochastic process; if t ∈ {0, 1, 2, . . .} = N or t ∈ Z, then we talk about a discrete stochastic process. What makes stochastic processes different from merely indexed sets of random variables is that there is some dependence between the ξ_t's at different times t. In this course we will focus on only one kind of discrete stochastic process, called a Markov Chain. In this type of process, each variable can take values in a (usually finite) set S of states. Let us look at some examples:

Example 1. Consider the time course of an infection in a population of N humans. At any given time, a person can be healthy (but susceptible to infection), infectious, or removed. Let us assume that recovery confers permanent immunity to subsequent infections. One can model the state ξ_t of the population at a given time as a stochastic process, where ξ_t = (S_t, I_t, R_t) and S_t, I_t, R_t stand for the number of susceptible, infected, and recovered individuals at time t.

Example 2. A drunkard walks along a path. At each time step, he makes either one step forward or one step backward. At position 1 of the path, there is a steep cliff; reaching position 1 is equivalent to falling off the cliff. At the other end of the path, at position n, there is a haystack. If the drunkard reaches position n, he will stumble into the haystack and fall asleep.

Example 3. Consider a DNA locus. Here the state space is S = {a, c, g, t}. Assume evolution of this locus proceeds in discrete steps (let us say: generations); at each step the nucleotide at this locus may either be faithfully copied or it may be mutated into another nucleotide (ignore indels for the moment).

What do all the stochastic processes in these examples have in common?

• We can conceptualize time t ∈ N as progressing in discrete increments.
• We can conveniently treat S = [n] = {1, 2, . . . , n}.
• At every step, the system may transition into another state (with a certain probability).
• The transition probabilities at any given time can be expressed by a square matrix M(t) = [m_ij(t)], 1 ≤ i, j ≤ n, of state transition probabilities.

2. Properties of discrete finite-state stochastic processes

A stochastic process as above is said to have the Markov property, or to be a Markov Chain, if the conditional distribution of ξ_{t+1} depends only on the current state ξ_t, not on the prior history.

• The process in Example 1 is a Markov Chain iff it takes at most one time step to recover from the disease. If it takes exactly k time steps to recover, this process can be modeled by a k-th order Markov Chain.
• The process in Example 2 is a Markov Chain iff the probability of making a step forward does not depend on whether the previous step was made forward or backward.
• The process in Example 3 is a Markov Chain iff we assume that mutation probabilities do not depend on the prior history of mutations at this locus. In all models of molecular evolution that we will consider in this course we will make this assumption.
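The transition mechanism just described is easy to express in code. Here is a minimal Python sketch (an added illustration, not part of the original notes; the function name simulate_chain and the toy matrix are my own choices) that simulates a trajectory using a single transition matrix M for every step, which is the time-homogeneous case defined in the next paragraph. The next state is sampled from the row of M indexed by the current state, which is exactly the Markov property.

```python
import numpy as np

def simulate_chain(M, start, steps, rng=None):
    """Simulate a finite-state Markov chain that uses the same
    transition matrix M at every step.  M[i][j] is the probability
    of moving from state i to state j; each row must sum to 1."""
    rng = rng or np.random.default_rng()
    M = np.asarray(M)
    path = [start]
    for _ in range(steps):
        # Markov property: the next state is drawn from the row of M
        # indexed by the current state; earlier history is irrelevant.
        path.append(int(rng.choice(len(M), p=M[path[-1]])))
    return path

# A toy 2-state chain (illustrative numbers, not from the notes):
M = [[0.9, 0.1],
     [0.4, 0.6]]
print(simulate_chain(M, start=0, steps=10))
```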
A Markov Chain is said to be stationary or time-homogeneous if the transition probability matrix does not depend on time, that is, M(t) = M for all times t.

• The process in Example 1 is stationary in the absence of medical interventions or behavior modifications, but is non-stationary if, for example, an immunization program is initiated or if people start avoiding contact after the outbreak of the disease.
• The process in Example 2 is stationary for a while, but as the person sobers up the transition probability m_21 may diminish.
• The process in Example 3 is presumably not stationary due to varying evolutionary pressures over different periods of evolution (think of the benefits of a higher gc-content while the organism evolves near a thermal vent). However, we usually have no such specific information, and we model evolution of a locus as a stationary process in the absence of specific evidence that tells us otherwise.

A state i of a Markov Chain is said to be an absorbing state iff m_ii = 1.

• If N denotes the total population size, then the process in Example 1 has N + 1 absorbing states: (N, 0, 0), (N − 1, 0, 1), (N − 2, 0, 2), . . . , (1, 0, N − 1), (0, 0, N).
• The process in Example 2 has two absorbing states: 1 and n.
• The process in Example 3 has no absorbing states.

A Markov Chain is said to be irreducible iff every state can be reached from every state with positive probability.

• The process in Example 1 is not irreducible. More generally, no nontrivial Markov Chain with an absorbing state can be irreducible.
• Thus the process in Example 2 is not irreducible. However, if we replace the cliff and the haystack with two vertical walls, the process becomes irreducible.
• The process in Example 3 is usually assumed irreducible. However, if we assume that all transversions have probability 0 and only transitions happen with positive probability, then we get a reducible Markov Chain without absorbing states.

3. More on the SIR model

The model of Example 1 is an SIR-model. As in our previous work on disease transmission models, let a denote the probability that a given infectious host recovers after one time step and let b denote the probability of an effective contact between any given pair (i, j) of distinct hosts until the next time step. Also, let N be the total population size. Suppose i is the number of the state (S, I, R), and j is the number of the state (S − s, I + s − r, R + r), where s, r are nonnegative integers with s ≤ S and r ≤ I. The calculations that we did in class generalize as follows:

(1) m_ij = C(S, s) (1 − b)^{I(S−s)} (1 − (1 − b)^I)^s C(I, r) a^r (1 − a)^{I−r},

where C(n, k) denotes the binomial coefficient "n choose k". If s = r = 0, then (1) gives the transition probability m_ii if we interpret 0^0 = 1; this matters when I = 0, where (1 − (1 − b)^I)^s = (1 − 1)^0. Note that since (1 − b)^0 = (1 − a)^0 = a^0 = C(S, 0) = C(0, 0) = 1, for each absorbing state i the right-hand side of (1) gives m_ii = 1.
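Formula (1) can be sanity-checked numerically: summed over all admissible (s, r), the probabilities leading out of a fixed state (S, I, R) must equal 1. Below is a short Python sketch (an added illustration, not part of the notes; the function name m_ij and the parameter values are my own choices):

```python
from math import comb

def m_ij(S, I, a, b, s, r):
    """Transition probability from (S, I, R) to (S-s, I+s-r, R+r) as in (1).

    s new infections are drawn from S susceptibles, each of whom escapes
    infection with probability (1-b)^I; r recoveries are drawn from the
    I infecteds, each of whom recovers with probability a."""
    escape = (1 - b) ** I   # probability one susceptible escapes infection
    return (comb(S, s) * escape ** (S - s) * (1 - escape) ** s
            * comb(I, r) * a ** r * (1 - a) ** (I - r))

# Probabilities out of the state with S = 6, I = 3 sum to 1:
S, I, a, b = 6, 3, 0.3, 0.1
total = sum(m_ij(S, I, a, b, s, r)
            for s in range(S + 1) for r in range(I + 1))
print(total)  # ~1.0 up to floating-point error
```

Note that Python's convention 0 ** 0 == 1 matches exactly the interpretation we needed above for the absorbing states.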
4. More on drunkards, their wives, and hungry bacteria

4.1. Action of the transition matrix. Let us assume for now that the story of Example 2 can be modeled with a stationary Markov Chain, and that in each of the non-absorbing states our drunkard makes a step forward (towards the cliff) with probability p and a step backward (toward the haystack) with probability q = 1 − p. For n = 5 this gives the following transition probability matrix:

(2) M_{d,p,5} =
[ 1  0  0  0  0 ]
[ p  0  q  0  0 ]
[ 0  p  0  q  0 ]
[ 0  0  p  0  q ]
[ 0  0  0  0  1 ]

Let us have a look at this matrix. Note that each row sums up to 1; it is a (right) stochastic matrix. Every transition (probability) matrix of any Markov Chain must be right stochastic. Not all columns sum up to 1, though; this matrix is not left stochastic, and thus in particular not doubly stochastic. We will see later that some Markov Chains do have doubly stochastic transition matrices.

Suppose our drunkard starts at position i. If i ∈ {1, n}, we know for sure where he will be one time step later, but if i ∈ {2, . . . , n − 1} he might be in one of two positions. In general, given the starting position, we will know, for every t ≥ 0, a vector π(t) = [π_1(t), . . . , π_n(t)] of probabilities π_i(t) that he is in position i at time t. These vectors are probability distributions (aka probability vectors). Given π(t), we can determine π(t + 1), in any Markov Chain, by:

(3) π(t + 1) = π(t)M(t).

More generally, if the Markov Chain is stationary, we get for every k ∈ N:

(4) π(t + k) = π(t)M^k.

Exercise 1. Characterize the property of irreducibility of a stationary Markov Chain that was introduced in the previous section in terms of powers M^k of the transition matrix.

Notice that if the drunkard is never going to sober up, then with probability 1 he will eventually reach one of the two absorbing states. Suppose n is given and the drunkard starts in position i. What is the probability that he will fall off the cliff? This question is known, in a different interpretation, as the gambler's ruin problem. We won't ruin a perfectly fine lecture by working out the mathematics of this problem; let us consider a different one instead:

Exercise 2. Suppose the drunkard's wife calls the inn (which is located at position i) and learns that he left at time t = 0 facing in the direction of the cliff. She reckons that by now he has made about T steps.
(a) Suppose 1 ≪ i ≪ n. Where should she go looking for him, and what length of the path does she need to inspect to have a chance of at least 0.95 of finding him? Hint: You may assume for simplicity that she can move at the speed of light but tires too quickly to inspect the whole path.
(b) Suppose i = √T and p = q = 0.5. Give a lower bound for the probability that it is already too late to prevent disaster.
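Equations (3) and (4) are easy to explore numerically. The following Python sketch (an added illustration, not part of the notes; the helper name drunkard_matrix is my own choice) builds the matrix of (2) for general n, evolves an initial distribution concentrated at the starting position, and shows the probability mass eventually concentrating in the two absorbing states, in the spirit of the gambler's ruin question and Exercise 2:

```python
import numpy as np

def drunkard_matrix(n, p):
    """Transition matrix (2): states 1..n, cliff at 1, haystack at n."""
    q = 1 - p
    M = np.zeros((n, n))
    M[0, 0] = M[n - 1, n - 1] = 1.0   # the two absorbing states
    for i in range(1, n - 1):
        M[i, i - 1] = p               # step toward the cliff
        M[i, i + 1] = q               # step toward the haystack
    return M

# Evolve pi(t) = pi(0) M^t as in (4), starting at position 3 of n = 5:
M = drunkard_matrix(5, p=0.4)
pi = np.zeros(5)
pi[2] = 1.0                           # state 3 corresponds to index 2
for t in range(200):
    pi = pi @ M                       # one application of (3)
print(pi)  # essentially all mass sits in the two absorbing states
```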
4.2. An aside: hungry bacteria, the CLT, and physicists. Now back to biology: Let's mutate our drunkard into a colony of sober but hungry bacteria that randomly move along a thin capillary in search of food. Instead of a probability distribution of the position of an individual, we can describe the state of the system at time t as a function B(x, t) that signifies the number of bacteria (say, in thousands) at a cross section x of the capillary. Bacteria are small, so let us assume that we have partitioned the capillary into neighboring compartments of length dx each, so that, with probability p < 0.5, each individual bacterium moves into the next compartment to the left or to the right over a time interval (t, t + dt), independently of the others and with no preference of direction; the probability of moving across more than one compartment over a time interval of length dt is negligible. Then the number of bacteria at position x (the compartment of x) at time t + dt will approximately satisfy:

B(x, t + dt) − B(x, t) = pB(x + dx, t) + pB(x − dx, t) − 2pB(x, t).

Equivalently,

B(x, t + dt) − B(x, t) = p(B(x + dx, t) − B(x, t) + B(x − dx, t) − B(x, t)).

Dividing both sides by dx we get:

(5) (B(x, t + dt) − B(x, t))/dx = p[(B(x + dx, t) − B(x, t))/dx + (B(x − dx, t) − B(x, t))/dx].

Taking the limit of (5) as dx → 0 gives

(6) lim_{dx→0} (B(x, t + dt) − B(x, t))/dx = p[∂B/∂x(x, t) − ∂B/∂x(x − dx, t)].

In turn, dividing both sides of (6) by dx and taking the limit as dx → 0 gives

(7) lim_{dx→0} (B(x, t + dt) − B(x, t))/(dx)² = lim_{dx→0} (p/dx)[∂B/∂x(x, t) − ∂B/∂x(x − dx, t)].

The rhs of (7) is clearly p ∂²B(x, t)/∂x², but the lhs is nothing easily recognizable and mathematically meaningful. Unless (dx)² = dt, that is. In this case, the lhs of (7) simply becomes ∂B(x, t)/∂t. So let us choose the units for x and t so that numerically dx ∼ √|dt|, and bingo! we get (a version of) the linear heat equation

(8) ∂B(x, t)/∂t = p ∂²B(x, t)/∂x².

But wait! We can certainly fix the unit of time any way we want, pick p, and choose some dt. Then our set-up determines dx; we can certainly choose a unit of length to make dx = √|dt| numerically at the chosen time scale. But going from (7) to (8) requires that dx ∼ √|dt| at all smaller time scales, and we have already made all choices of units over which we had control. So, either dx scales like the square root of dt, or it does not.

Why can we be confident of the scaling law dx ∼ √|dt|?

Physicist's Answer: Because Nature says so. Our little thought experiment describes processes of diffusion. We can study those in the lab. We observe in the lab that (8) holds within reasonable bounds of measurement error. For this to be true, under the choices made in the above argument, dx must scale like the square root of dt.

Probabilist's Answer: Because the Central Limit Theorem says so. Think about the movement of the bacterium at the level of resolution of dt and the corresponding dx as the drunkard's walk over many, many (k ≫ 0) independent steps, where we may as well assume that dt ∼ k. Then by the CLT, the standard deviation of the drunkard's distance from his starting point scales like √k, and for the probability of ending up in the compartment to the immediate left to be a fixed p, the length dx of individual compartments must scale like the standard deviation of the drunkard's distance from his starting point.
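The Probabilist's Answer is easy to check by simulation. The sketch below (an added illustration with arbitrarily chosen sample sizes, not part of the notes) estimates the standard deviation of the displacement of an unbiased ±1 walk after k steps; the ratio to √k should stay near 1:

```python
import numpy as np

rng = np.random.default_rng(0)
walkers = 20_000
for k in [100, 400, 1600]:
    # Displacements after k independent +/-1 steps, many walkers at once.
    steps = 2 * rng.integers(0, 2, size=(walkers, k), dtype=np.int8) - 1
    sd = steps.sum(axis=1, dtype=np.int64).std()
    print(k, sd, sd / np.sqrt(k))   # the last column should hover near 1
```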
4.3. A Markovian model of a non-Markovian drunkard. So far, we have modeled the state space for our drunkard as a set of possible positions. This set-up is an unlikely candidate for a Markov process, though, since a forward step is more likely followed by another forward step than by a step back. But how about this: let us ignore position (together with any cliffs, haystacks, prison cell walls, etc.) and define the state ξ_t as the direction of movement at time t: a value of 1 or F means taking a step forward, a value of 2 or B means taking a step backward. The stochastic process thus described appears to have the Markov property within a very good approximation. The transition matrix takes the form

(9) M_{2,p,q} =
[ p      1 − p ]
[ 1 − q  q     ]

Note that this matrix is doubly stochastic iff it is symmetric iff p = q.

If we observe the drunkard moving forward in step 0, he will move forward in step 1 with probability p and backward with probability 1 − p. That much is clear. But what about step t when t is large? What can we say about the probability distribution π(t) = [π_1(t), π_2(t)] of the direction of movement when t is large? Can we still say anything meaningful about π(t) even if we did not observe the actual value of ξ_0? Numerical explorations lead to the following empirical observations:

• For any initial distribution π(0), the vectors π(t) approach a limiting distribution π∗; formally, lim_{t→∞} π(t) = π∗.
• In general, π∗ depends on π(0).
• If the Markov chain is stationary and irreducible, π∗ does not depend on π(0).

The probability vectors π∗ that represent limiting distributions are called stationary vectors or stationary distributions of the Markov chain.

Proposition 1. (a) A vector π∗ is stationary for a Markov chain with transition matrix M iff π∗ is a left eigenvector of M, that is, iff

(10) π∗M = π∗.

(b) The eigenvalue for this left eigenvector must be 1.

Theorem 2. [Perron-Frobenius Theorem] Suppose M = [m_ij] is a real square matrix with all m_ij > 0. Then there exists a positive real eigenvalue λ∗ such that
(i) |λ∗| > |λ| for every (real or complex) eigenvalue λ ≠ λ∗ of M.
(ii) There exists an eigenvector v of M with eigenvalue λ∗ with all coordinates v_i > 0.
(iii) λ∗ has multiplicity 1.

Corollary 3. There always exists a stationary distribution, and it is unique if the Markov Chain is irreducible (this actually requires a slight generalization of the Perron-Frobenius Theorem). In particular, if M^k has only positive entries for all sufficiently large k (such Markov chains are called ergodic), then π(t) will always converge to the unique stationary distribution.

For example, the characteristic polynomial of M_{2,p,q} in (9) is (p − λ)(q − λ) − (1 − p)(1 − q), with roots λ∗ = 1 and λ∗∗ = p + q − 1. Now |λ∗| = |λ∗∗| iff p = q = 1 or p = q = 0. In both cases, some entries of M_{2,p,q} evaluate to 0, and we may get multiple stationary distributions. In all other cases, the stationary distribution will be unique, although we still may have some zero entries in M_{2,p,q}; the condition that all m_ij > 0 is sufficient but not necessary for the conclusions of Theorem 2.

We can derive a formula for a stationary distribution of the Markov chain of (9) by solving

(11) [π_1∗, π_2∗] ·
[ p − 1   1 − p ]
[ 1 − q   q − 1 ]
= [0, 0]

subject to

(12) π_1∗ + π_2∗ = 1.

We get

(13) [π_1∗, π_2∗] = [ (1 − q)/(2 − p − q), (1 − p)/(2 − p − q) ].

Note that if p = q = 1, then the Markov chain defined by (9) is reducible and formula (13) does not make sense. In this case, λ∗ has multiplicity 2 and every distribution is stationary. If p = q = 0, then the Markov chain is irreducible, but not ergodic. Each of the states will be periodic with period 2. We still get a unique stationary distribution π∗ = [0.5, 0.5] from (13), but it will no longer be the case that lim_{t→∞} π(t) = π∗. Instead, the interpretation of π∗ is that it gives the long-term average proportions of time steps during which the system resides in each of the states. Note that the "empirical observations" made near the beginning of this section do not apply to this somewhat pathological example.
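Proposition 1 also gives a practical recipe: a stationary distribution can be computed as a normalized left eigenvector of M for eigenvalue 1. Here is a minimal Python sketch (an added illustration, not part of the notes; the values p = 0.8, q = 0.6 anticipate the example in the next section) that cross-checks formula (13):

```python
import numpy as np

p, q = 0.8, 0.6
M = np.array([[p, 1 - p],
              [1 - q, q]])

# Left eigenvectors of M are right eigenvectors of M.T (Proposition 1).
vals, vecs = np.linalg.eig(M.T)
v = vecs[:, np.argmax(np.isclose(vals, 1.0))].real
pi_star = v / v.sum()       # normalize so the entries sum to 1, as in (12)

print(pi_star)                                        # [0.666..., 0.333...]
print((1 - q) / (2 - p - q), (1 - p) / (2 - p - q))   # formula (13) agrees
```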
5. Time-reversible Markov Chains

Suppose you watch the drunkard of Subsection 4.3 from afar. You can make out whether he is taking steps towards the cliff (C) or towards the haystack (H), but you cannot make out which direction he is facing. But since this is your good friend, you know that his p = 0.8 and his q = 0.6. Let's watch some movies taken from this distance:

Movie 1
CCHHHHHHHCCCHCHHCCHHCHHHHHHCCCCHHHHHHHHHHHHHHHHHHHCHHCCCCCHHHCC

Movie 2
CCCCCCCCCCHCCHCHCCCHCCHCCCCHHHCCCCHHCHHHCHCCCCCCCCCCCCCCHCCCCHH

Movie 3
HHHHHCCHHCHHCCCHHCCCCCHHHHHCCCCCCCCHHHHHCCCHHHHHHHHCCHHHHCCCHHC

Movie 4
HCHHHHHHCCHHHHCCHCCCCCCCCCCHHHHHCCHCCCHHHHCCCCCCCCCHHHHHHHHHCCC

Exercise 3. Can you make out, for each of these movies, in which direction your friend is facing? Which prediction(s) are you most confident in? Which prediction(s) are you least confident in? Hint: Don't quantify your level of confidence at this time.

We will return to that problem later. Now suppose you observe a second person who leaves the inn. You can make out who that one is, and you know that in his case p = q = 0.7. Let's look at some movies for Drunkard Number II:

Movie 5
CCCCHHHHHHCCCCHHHHCHCCCCHHHHHHHHHHHHHHHHHHCCHHCCCHCCCHHHHHCCCCC

Movie 6
CHHHHHHHHHHHHHCCCCHHHHHHHHHHCCCCCCHCCHCCCCCCCCCHHHCCCCCHCCCHCHH

Movie 7
HHCCHHHHCHHHHHHCHCCCCCHHCCCCHHHCCCHHHHCCCCCCCCCCCHHCCCCHCCCCHHH

Movie 8
HHHHHCHCHHCCCHHCHHCCCCCCCHCHHHHCCCCCCHHCCCCCCCCCHHHCCCHHCHHHHHC

Exercise 4. Can you make out, for each of these movies, in which direction Drunkard Number II is facing? Which prediction(s) are you most confident in? Which prediction(s) are you least confident in?

Now let us turn to models of molecular evolution. Consider the following matrix of mutation probabilities for a DNA locus:

(14) M_ir =
        a    c    g    t
  a [ 0.4  0.3  0.2  0.1 ]
  c [ 0.1  0.4  0.3  0.2 ]
  g [ 0.2  0.1  0.4  0.3 ]
  t [ 0.3  0.2  0.1  0.4 ]

The vector of stationary probabilities for the corresponding Markov Chain is π∗ = [0.25, 0.25, 0.25, 0.25]. I created four time-lapse movies of evolution of this locus, but, accidentally, one of them runs backwards.

Exercise 5. Can you figure out which of the four movies below runs backwards? Explain your reasoning.

Movie 9
gaccccacgttacgtcatactttttccgggttcgtaggttccattttcctaccccgtaggtaa

Movie 10
cgcagagacgggcgcggggagattacgttggggacccctcgtcgaggtcgaacatccccaagt

Movie 11
taaatgaggctcaggatgaatcccattggcaaaaaataatcccatggaaaatggggcatcttg

Movie 12
gttgaaccggataacccttcagaaccaagaccgggctaagtatttgtccaaacaaactacgaa

Now consider another matrix of mutation probabilities for a DNA locus:

(15) M_K =
        a     c     g     t
  a [ 0.72  0.06  0.18  0.04 ]
  c [ 0.04  0.78  0.06  0.12 ]
  g [ 0.12  0.06  0.78  0.04 ]
  t [ 0.04  0.18  0.06  0.72 ]

The vector of stationary probabilities for the corresponding Markov Chain is π∗ = [0.2, 0.3, 0.3, 0.2]. I created four time-lapse movies of evolution of this locus, but, accidentally, one of them runs backwards.

Exercise 6. Can you figure out which of the four movies below runs backwards? Explain your reasoning.

Movie 13
cccccccctgccccccctttcccccagggggggggggttttggggggggtcccccttcggggaagcccggggggaactttccggggggatttttttgcct

Movie 14
ggaacccccccgggaaaaatttttttggaagggggggggggttttttttttaggggcgtccttttaattaaaccccccccttcccccttttttcctccct

Movie 15
cctttttggccttctttttgattaaaaaattttctccccctttccccccaggggcccctaattttctttttcccccccccccccccggggggggggcccc

Movie 16
tttcccccttttttcccgggaaagggggggccccccttttttccctaaccttttttttttttttgaaggggggttttgcccaaaaaactttcttaaaaat

Intuitively speaking, a stochastic process is time-reversible if you get the same process by running time backwards.

Exercise 7. Based on your work for Exercises 3 through 6, formulate a conjecture on how one can characterize time-reversible Markov chains in terms of their transition matrices M.
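For readers who want to experiment with Exercises 5 through 7, the following Python sketch (an added illustration, not part of the notes; the function name movie is my own choice) generates time-lapse "movies" like the ones above from a transition matrix and an initial distribution; reversing the resulting string lets you compare forward and backward runs yourself:

```python
import numpy as np

def movie(M, pi0, states, length, rng=None):
    """Generate a time-lapse 'movie' of a Markov chain, one symbol per step.

    The initial state is drawn from the distribution pi0; thereafter the
    chain follows the rows of the transition matrix M."""
    rng = rng or np.random.default_rng()
    M = np.asarray(M)
    i = rng.choice(len(states), p=pi0)
    out = [states[i]]
    for _ in range(length - 1):
        i = rng.choice(len(states), p=M[i])
        out.append(states[i])
    return "".join(out)

# Matrix M_K of (15), rows/columns ordered a, c, g, t, started from its
# stationary distribution [0.2, 0.3, 0.3, 0.2]:
MK = [[0.72, 0.06, 0.18, 0.04],
      [0.04, 0.78, 0.06, 0.12],
      [0.12, 0.06, 0.78, 0.04],
      [0.04, 0.18, 0.06, 0.72]]
m = movie(MK, [0.2, 0.3, 0.3, 0.2], "acgt", 100)
print(m)        # a forward movie
print(m[::-1])  # the same movie run backwards
```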
Is it reasonable to assume that molecular evolution really is a time-reversible stochastic process? Obviously not. However, we almost never are in a position to actually observe molecular evolution. All we usually can observe are the outcomes of two independent runs of molecular evolution from an unknown common ancestor to two extant descendant species. Thus any model of the evolutionary changes that we observe by comparing the molecular make-up of these two observable species needs to be explained in terms of a model where evolution runs backward in time to the common ancestor and then forward to the other species. Since we usually have no reason to treat one of these species as the end of the backward branch and the other as the end of the forward branch, the standard assumption in models of molecular evolution is that it is a time-reversible stochastic process.