MATH 3/5680: MARKOV CHAINS
WINFRIED JUST, OHIO UNIVERSITY
1. Basic examples
A stochastic process is a sequence ξt of random variables, where the index variable t stands for time. If t ∈ [0, ∞) or t ∈ R, then we talk about a continuous
stochastic process; if t ∈ {0, 1, 2, . . .} = N or t ∈ Z, then we talk about a discrete stochastic process. What makes stochastic processes different from considering merely
indexed sets of random variables is that there is some dependence between the
ξt ’s at different times t. In this course we will focus only on one kind of discrete
stochastic process called a Markov Chain.
In this type of process, each variable can take values in a (usually finite) set S
of states. Let us look at some examples:
Example 1. Consider the time course of an infection in a population of N humans.
At any given time, a person can be healthy (but susceptible to infection), infectious,
or removed. Let us assume that recovery confers permanent immunity to subsequent
infections. One can model the state ξt of the population at a given time as a
stochastic process, where ξt = (St , It , Rt ) and St , It , Rt stand for the number of
susceptible, infected, and recovered individuals at time t.
Example 2. A drunkard walks along a path. At each time step, he makes either
one step forward or one step backward. At position 1 of the path, there is a steep
cliff; reaching position 1 is equivalent to falling off the cliff. At the other end of the
path, at position n, there is a haystack. If the drunkard reaches position n, he will
stumble into the haystack and fall asleep.
Example 3. Consider a DNA locus. Here the state space S = {a, c, g, t}. Assume
evolution of this locus proceeds in discrete steps (let us say: generations); at each
step the nucleotide at this locus may either be faithfully copied or it may be mutated
into another nucleotide (ignore indels for the moment).
What do all the stochastic processes in these examples have in common?
• We can conceptualize time t ∈ N as progressing in discrete increments.
• We can conveniently treat S = [n] = {1, 2, . . . , n}.
• At every step, the system may transition into another state (with a certain
probability).
• The transition probabilities at any given time can be expressed by a square
matrix M(t) = [mij(t)], 1 ≤ i, j ≤ n, of state transition probabilities.
2. Properties of discrete finite-state stochastic processes
A stochastic process as above is said to have the Markov property or to be a
Markov Chain if the conditional distribution of ξt+1 given ξ0 , . . . , ξt depends only
on ξt , not on the prior history.
• The process in Example 1 is a Markov Chain iff it takes at most one time
step to recover from the disease. If it takes exactly k time steps to recover,
this process can be modeled by a k-th order Markov Chain.
• The process in Example 2 is a Markov Chain iff the probability of making
a step forward is not dependent on whether the previous step was made
forward or backward.
• The process in Example 3 is a Markov Chain iff we assume that mutation
probabilities do not depend on the prior history of mutations at this locus.
In all models of molecular evolution that we will consider in this course we
will make this assumption.
A Markov Chain is said to be stationary or time-homogeneous if the transition
probability matrix does not depend on time, that is, M (t) = M for all times t.
• The process in Example 1 is stationary in the absence of medical interventions or behavior modifications, but is non-stationary if, for example, an
immunization program is initiated or if people start avoiding contact after
the outbreak of the disease.
• The process in Example 2 is stationary for a while, but as the person sobers
up the transition probability m21 may diminish.
• The process in Example 3 is presumably not stationary due to varying evolutionary pressures over different periods of evolution (think benefits of a
higher GC-content while the organism evolves near a thermal vent). However, we usually have no specific such information and we model evolution
of a locus as a stationary process in the absence of specific evidence that
tells us otherwise.
A state i of a Markov Chain is said to be an absorbing state iff mii = 1.
• If N denotes the total population size, then the process in Example 1 has
N + 1 absorbing states: (N, 0, 0), (N − 1, 0, 1), (N − 2, 0, 2), . . . , (1, 0, N −
1), (0, 0, N ).
• The process in Example 2 has two absorbing states: 1 and n.
• The process in Example 3 has no absorbing states.
A Markov Chain is said to be irreducible iff every state can be reached from
every state with positive probability.
• The process in Example 1 is not irreducible. More generally, no nontrivial
Markov Chain with an absorbing state can be irreducible.
• Thus the process in Example 2 is not irreducible. However, if we replace
the cliff and the haystack with two vertical walls, the process becomes
irreducible.
• The process in Example 3 is usually assumed irreducible. However, if we
assume that all transversions have probability 0 and only transitions happen
with positive probability, then we get a reducible Markov Chain without
absorbing states.
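Before we move on, here is a minimal computational sketch (ours, not from the lecture; all function names are made up) of the reachability test behind this definition: a stationary chain is irreducible iff, in the directed graph with an edge i → j whenever mij > 0, every state can be reached from every state.

# A sketch: test irreducibility by breadth-first search on the directed
# graph that has an edge i -> j whenever m_ij > 0.
from collections import deque

def reachable_from(M, i):
    """Return the set of states reachable from state i (0-indexed)."""
    n = len(M)
    seen, queue = {i}, deque([i])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if M[u][v] > 0 and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def is_irreducible(M):
    n = len(M)
    return all(len(reachable_from(M, i)) == n for i in range(n))

# Drunkard's walk with a cliff and a haystack (two absorbing states) ...
M_absorbing = [[1, 0, 0], [0.5, 0, 0.5], [0, 0, 1]]
# ... versus the same walk with two reflecting walls.
M_walls = [[0, 1, 0], [0.5, 0, 0.5], [0, 1, 0]]
print(is_irreducible(M_absorbing))  # False
print(is_irreducible(M_walls))      # True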
3. More on the SIR model
The model of Example 1 is an SIR model. As in our previous work on disease
transmission models, let a denote the probability that a given infectious host recovers after one time step and let b denote the probability of an effective contact
between any given pair of distinct hosts during the next time step. Also, let N
be the total population size. Suppose i is the number of the state (S, I, R), and j is
the number of the state (S − s, I + s − r, R + r), where s, r are nonnegative integers
with s ≤ S and r ≤ I. The calculations that we did in class generalize as follows:
(1)   m_ij = \binom{S}{s} (1 − b)^{I(S−s)} (1 − (1 − b)^I)^s \binom{I}{r} a^r (1 − a)^{I−r}.
If s = r = 0, then (1) gives the transition probability mii, provided we interpret
(1 − (1 − b)^I)^0 = (1 − 1)^0 = 0^0 = 1 when I = 0. Note that since
(1 − b)^0 = 1 = (1 − a)^0 = a^0 = \binom{S}{0} = \binom{I}{0}, for each absorbing state i the right-hand side of (1) gives mii = 1.
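For readers who want to experiment, the following sketch (ours, not from the lecture; the function name is made up) evaluates (1) directly; Python's convention 0**0 = 1 matches the interpretation above:

# A sketch of the transition probability (1) for the stochastic SIR chain
# of Example 1: from state (S, I, R) to (S - s, I + s - r, R + r), with
# effective-contact probability b and recovery probability a per time step.
from math import comb

def sir_transition_prob(S, I, R, s, r, a, b):
    if not (0 <= s <= S and 0 <= r <= I):
        return 0.0
    p_escape = (1 - b) ** I          # one susceptible avoids all I infectious hosts
    p_infect = 1 - p_escape
    # Python evaluates 0**0 as 1, matching the convention used in the notes.
    return (comb(S, s) * p_escape ** (S - s) * p_infect ** s
            * comb(I, r) * a ** r * (1 - a) ** (I - r))

# Absorbing states (I = 0) stay put with probability 1:
print(sir_transition_prob(7, 0, 3, 0, 0, a=0.3, b=0.05))  # 1.0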
4. More on drunkards, their wives, and hungry bacteria
4.1. Action of the transition matrix. Let us assume for now that the story of
Example 2 can be modeled with a stationary Markov Chain, and that in each of the
non-absorbing states our drunkard makes a step forward (towards the cliff) with
probability p and a step backward (toward the haystack) with probability q = 1−p.
For n = 5 this gives the following transition probability matrix:

(2)   Md,p,5 =
      \begin{pmatrix}
      1 & 0 & 0 & 0 & 0 \\
      p & 0 & q & 0 & 0 \\
      0 & p & 0 & q & 0 \\
      0 & 0 & p & 0 & q \\
      0 & 0 & 0 & 0 & 1
      \end{pmatrix}.
Let us have a look at this matrix. Note that each row sums up to 1; it is a (right)
stochastic matrix. Every transition (probability) matrix of any Markov Chain must
be right stochastic. Not all columns sum up to 1 though; this matrix is not left
stochastic, thus it is in particular not doubly stochastic. We will see later that some
Markov Chains do have doubly stochastic transition matrices.
Suppose our drunkard starts at position i. If i ∈ {1, n} we know for sure where
he will be one time step later, but if i ∈ {2, . . . , n − 1} he might be in one of two
positions. In general, given the starting position, we will know, for every t ≥ 0,
a vector π(t) = [π1 (t), . . . , πn (t)] of probabilities πi (t) that he is in position i at
time t. These vectors are probability distributions (aka probability vectors). Given
π(t), we can determine π(t + 1), in any Markov Chain, by:
(3)   π(t + 1) = π(t)M(t).
More generally, if the Markov Chain is stationary, we get for every k ∈ N:
(4)   π(t + k) = π(t)M^k.
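Here is a minimal sketch (ours) that builds Md,p,5 from (2) and propagates a distribution vector according to (3) and (4):

# Build the drunkard's matrix M_{d,p,5} and propagate a distribution
# row vector by pi(t+k) = pi(t) M^k.
import numpy as np

def drunkard_matrix(n, p):
    q = 1 - p
    M = np.zeros((n, n))
    M[0, 0] = M[n - 1, n - 1] = 1          # cliff and haystack are absorbing
    for i in range(1, n - 1):
        M[i, i - 1] = p                     # step toward the cliff
        M[i, i + 1] = q                     # step toward the haystack
    return M

M = drunkard_matrix(5, p=0.5)
pi = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # start in the middle, position 3
print(pi @ np.linalg.matrix_power(M, 100))

Starting in the middle with p = q = 0.5, after 100 steps virtually all of the probability mass sits in the two absorbing positions, which anticipates the observation below that the drunkard eventually ends up in an absorbing state with probability 1.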
Exercise 1. Characterize the property of irreducibility of a stationary Markov
Chain that was introduced in the previous section in terms of powers M^k of the
transition matrix.
Notice that if the drunkard is never going to sober up, then with probability 1
he will eventually reach one of the two absorbing states. Suppose n is given and
the drunkard starts in position i. What is the probability that he will fall off the
cliff? This question is known, in a different interpretation, as the gambler’s ruin
problem.
We won’t ruin a perfectly fine lecture by working out the mathematics of this
problem; let us consider a different one instead:
Exercise 2. Suppose the drunkard’s wife calls the inn (which is located at position i)
and learns that he left at time t = 0 facing in the direction of the cliff. She reckons
that by now he has made about T steps.
(a) Suppose 1 ≪ i ≪ n. Where should she go looking for him, and what length of
the path does she need to inspect to have a chance of at least 0.95 of finding him?
Hint: You may assume for simplicity that she can move at the speed of light but
tires too quickly to inspect the whole path.
(b) Suppose i = √T and p = q = 0.5. Give a lower bound for the probability that it
is already too late to prevent disaster.
4.2. An aside: hungry bacteria, the CLT, and physicists. Now back to
biology: Let's mutate our drunkard into a colony of sober but hungry bacteria that
randomly move along a thin capillary in search of food. Instead of a probability
distribution of the position of an individual, we can describe the state of the
system at time t as a function B(x, t) that signifies the number of bacteria (say,
in thousands) at a cross section x of the capillary. Bacteria are small, so let us
assume that we have partitioned the capillary into neighboring compartments of
length dx each, so that over a time interval (t, t + dt) each individual bacterium
moves, independently of all others and with no preference of direction, into the
next compartment to the left or to the right with probability p < 0.5 each, while
the probability of moving across more than one compartment is negligible.
Then the change in the number of bacteria at position x (the compartment of x)
between times t and t + dt will approximately be:

B(x, t + dt) − B(x, t) = pB(x + dx, t) + pB(x − dx, t) − 2pB(x, t).
Equivalently,
B(x, t + dt) − B(x, t) = p(B(x + dx, t) − B(x, t) + B(x − dx, t) − B(x, t)).
Dividing both sides by dx we get:
(5)   [B(x, t + dt) − B(x, t)]/dx = p ( [B(x + dx, t) − B(x, t)]/dx + [B(x − dx, t) − B(x, t)]/dx ).

Taking the limit of (5) as dx → 0 gives

(6)   lim_{dx→0} [B(x, t + dt) − B(x, t)]/dx = p ( ∂B/∂x (x, t) − ∂B/∂x (x − dx, t) ).
In turn, dividing both sides of (6) by dx and taking the limit as dx → 0 gives
(7)   lim_{dx→0} [B(x, t + dt) − B(x, t)]/(dx)² = lim_{dx→0} (p/dx) ( ∂B/∂x (x, t) − ∂B/∂x (x − dx, t) ).
The rhs of (7) is clearly p ∂²B(x, t)/∂x², but the lhs is nothing easily recognizable
and mathematically meaningful. Unless (dx)² = dt, that is. In this case, the lhs
of (7) simply becomes ∂B(x, t)/∂t. So let us choose the units for x and t so that
numerically dx ∼ √|dt|, and bingo! we get (a version of) the linear heat equation

(8)   ∂B(x, t)/∂t = p ∂²B(x, t)/∂x².

But wait! We can certainly fix the unit of time any way we want, pick p, choose
some dt. Then our set-up determines dx; we can certainly choose a unit of length
to make dx = √|dt| numerically at the chosen time scale. But going from (7) to (8)
requires that dx ∼ √|dt| at all smaller time scales, and we have already made all
choices of units over which we had control. So, either dx scales like the square root
of dt, or it does not.

Why can we be confident of the scaling law dx ∼ √|dt|?
Physicist’s Answer: Because Nature says so.
Our little thought experiment describes processes of diffusion. We can study
those in the lab. We observe in the lab that (8) holds within reasonable bounds
of measurement error. For this to be true, under the choices made in the above
argument, dx must scale like the square root of dt.
Probabilist’s Answer: Because the Central Limit Theorem says so.
Think about the movement of the bacterium at the level of resolution of dt and
the corresponding dx as the drunkard's walk over many, many (k ≫ 0) independent
steps, where we may as well assume that dt ∼ k. Then by the CLT, the standard
deviation of the drunkard's distance from his starting point scales like √k, and
for the probability of ending up in the compartment to the immediate left to be
a fixed p, the length dx of individual compartments must scale like the standard
deviation of the drunkard's distance from his starting point.
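This scaling is easy to check numerically. The following sketch (ours; the sample size and step counts are arbitrary) compares the empirical standard deviation of an unbiased ±1 walk after k steps with √k:

# A quick Monte Carlo check that the standard deviation of an unbiased
# +-1 random walk after k steps scales like sqrt(k).
import numpy as np

rng = np.random.default_rng(0)
for k in [100, 400, 1600]:
    steps = rng.choice([-1, 1], size=(20000, k))   # 20000 walkers, k steps each
    positions = steps.sum(axis=1)                  # endpoint of each walk
    print(k, positions.std(), np.sqrt(k))          # last two columns nearly agree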
4.3. A Markovian model of a non-Markovian drunkard. So far, we have
modeled the state space for our drunkard as a set of possible positions. This setup
is an unlikely candidate for a Markov process though, since a forward step is more
likely to be followed by another forward step than by a step back. But how about this:
let us ignore position (together with any cliffs, haystacks, prison cell walls, etc.)
and define the state ξt as the direction of movement at time t: a value of 1 or F
means taking a step forward, a value of 2 or B means taking a step backward. The
stochastic process thus described appears to have the Markov Property within a
very good approximation. The transition matrix takes the form
(9)   M2,p,q =
      \begin{pmatrix}
      p & 1 − p \\
      1 − q & q
      \end{pmatrix}.
Note that this matrix is doubly stochastic iff it is symmetric iff p = q. If we
observe the drunkard moving forward in step 0, he will move forward in step 1 with
probability p and backward with probability q. That much is clear. But what about
step t when t is large? What can we say about the probability distribution π(t) =
[π1 (t), π2 (t)] of the direction of movement when t is large? Can we still say anything
meaningful about π(t) even if we did not observe the actual value of ξ0 ?
Numerical explorations lead to the following empirical observations:
• For any initial distribution π(0), the vectors π(t) approach a limiting distribution π ∗ ; formally limt→∞ π(t) = π ∗ .
• In general, π ∗ depends on π(0).
• If the Markov chain is stationary and irreducible, π ∗ does not depend on
π(0).
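These observations are easy to reproduce; the following sketch (ours, not from the notes; the values p = 0.8 and q = 0.6 are the ones used for the first drunkard in Section 5) iterates (3) for the chain (9) from several starting distributions:

# Reproducing the "numerical explorations": iterate pi(t+1) = pi(t) M
# and watch pi(t) settle down, regardless of the starting distribution.
import numpy as np

p, q = 0.8, 0.6
M = np.array([[p, 1 - p],
              [1 - q, q]])
for pi0 in ([1.0, 0.0], [0.0, 1.0], [0.3, 0.7]):
    pi = np.array(pi0)
    for _ in range(50):
        pi = pi @ M
    print(pi0, "->", pi)   # all three runs approach the same limit

All three runs settle on [2/3, 1/3], in agreement with formula (13) derived below.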
The probability vectors π ∗ that represent limiting distributions are called stationary vectors or stationary distributions of the Markov chain.
Proposition 1. (a) A vector π∗ is stationary for a Markov chain with transition
matrix M iff π∗ is a left eigenvector of M , that is, iff

(10)   π∗M = π∗.

(b) The eigenvalue for this left eigenvector must be 1.
Theorem 2. [Perron-Frobenius Theorem] Suppose M = [mij ] is a real square
matrix with all mij > 0. Then there exists a positive real eigenvalue λ∗ such that
(i) |λ∗| > |λ| for every (real or complex) eigenvalue λ ≠ λ∗ of M .
(ii) There exists an eigenvector v of M with eigenvalue λ∗ with all coordinates
vi > 0.
(iii) λ∗ has multiplicity 1.
Corollary 3. There always exists a stationary distribution, and it is unique if the
Markov Chain is irreducible (this actually requires a slight generalization of the
Perron-Frobenius Theorem). In particular, if M^k has only positive entries for all
sufficiently large k (such Markov chains are called ergodic), then π(t) will always
converge to the unique stationary distribution.
For example, the characteristic polynomial of M2,p,q in (9) is
(p − λ)(q − λ) − (1 − p)(1 − q),
with roots λ∗ = 1 and λ∗∗ = p + q − 1. Now |λ∗ | = |λ∗∗ | iff p = q = 1 or
p = q = 0. In both cases, some entries of M2,p,q evaluate to 0, and we may get
multiple stationary distributions. In all other cases, the stationary distribution will
be unique, although we still may have some zero entries in M2,p,q ; the condition of
all mij > 0 is sufficient but not necessary for Theorem 2.
We can derive a formula for a stationary distribution of the Markov chain of (9)
by solving
(11)   [π1∗, π2∗]
       \begin{pmatrix}
       p − 1 & 1 − p \\
       1 − q & q − 1
       \end{pmatrix}
       = [0, 0]

subject to

(12)   π1∗ + π2∗ = 1.
We get
(13)   [π1∗, π2∗] = [ (1 − q)/(2 − p − q), (1 − p)/(2 − p − q) ].
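As a numerical cross-check (a sketch of ours, not part of the notes), the vector (13) can also be recovered as the left eigenvector of M2,p,q for the eigenvalue 1, that is, as an ordinary eigenvector of the transpose:

# Recover the stationary distribution as the left eigenvector for eigenvalue 1.
import numpy as np

p, q = 0.8, 0.6
M = np.array([[p, 1 - p], [1 - q, q]])
w, V = np.linalg.eig(M.T)              # eigenvectors of M^T = left eigenvectors of M
v = np.real(V[:, np.argmin(np.abs(w - 1))])
print(v / v.sum())                     # [2/3, 1/3]
print((1 - q) / (2 - p - q), (1 - p) / (2 - p - q))   # formula (13) agrees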
Note that if p = q = 1, then the Markov chain defined by (9) is reducible and
formula (13) does not make sense. In this case, λ∗ has multiplicity 2 and every
distribution is stationary. If p = q = 0, then the Markov chain is irreducible, but
not ergodic. Each of the states will be periodic with period 2. We still get a unique
stationary distribution π∗ = [0.5, 0.5] from (13), but it will no longer be the case
that lim_{t→∞} π(t) = π∗. Instead, the interpretation of π∗ is that it gives the long-term average proportions of time steps during which the system resides in each of
the states. Note that the “empirical observations” made near the beginning of this
section do not apply to this somewhat pathological example.
5. Time-reversible Markov Chains
Suppose you watch the drunkard of Subsection 4.3 from afar. You can make
out whether he is taking steps towards the cliff (C) or towards the haystack (H),
but you cannot make out which direction he is facing. But since this is your good
friend, you know that his p = 0.8 and his q = 0.6. Let’s watch some movies taken
from this distance:
Movie 1
CCHHHHHHHCCCHCHHCCHHCHHHHHHCCCCHHHHHHHHHHHHHHHHHHHCHHCCCCCHHHCC
Movie 2
CCCCCCCCCCHCCHCHCCCHCCHCCCCHHHCCCCHHCHHHCHCCCCCCCCCCCCCCHCCCCHH
Movie 3
HHHHHCCHHCHHCCCHHCCCCCHHHHHCCCCCCCCHHHHHCCCHHHHHHHHCCHHHHCCCHHC
Movie 4
HCHHHHHHCCHHHHCCHCCCCCCCCCCHHHHHCCHCCCHHHHCCCCCCCCCHHHHHHHHHCCC
Exercise 3. Can you make out, for each of these movies, in which direction your
friend is facing? Which prediction(s) are you most confident in? Which prediction(s) are you least confident in? Hint: Don’t quantify your level of confidence at
this time. We will return to that problem later.
Now suppose you observe a second person who leaves the inn. You can make
out who that one is, and you know that in his case p = q = 0.7. Let’s look at some
movies for Drunkard Number II:
Movie 5
CCCCHHHHHHCCCCHHHHCHCCCCHHHHHHHHHHHHHHHHHHCCHHCCCHCCCHHHHHCCCCC
Movie 6
CHHHHHHHHHHHHHCCCCHHHHHHHHHHCCCCCCHCCHCCCCCCCCCHHHCCCCCHCCCHCHH
Movie 7
HHCCHHHHCHHHHHHCHCCCCCHHCCCCHHHCCCHHHHCCCCCCCCCCCHHCCCCHCCCCHHH
Movie 8
HHHHHCHCHHCCCHHCHHCCCCCCCHCHHHHCCCCCCHHCCCCCCCCCHHHCCCHHCHHHHHC
Exercise 4. Can you make out, for each of these movies, in which direction Drunkard Number II is facing? Which prediction(s) are you most confident in? Which
prediction(s) are you least confident in?
Now let us turn to models of molecular evolution. Consider the following matrix
of mutation probabilities for a DNA locus:

(14)   Mir =
       \begin{pmatrix}
       0.4 & 0.3 & 0.2 & 0.1 \\
       0.1 & 0.4 & 0.3 & 0.2 \\
       0.2 & 0.1 & 0.4 & 0.3 \\
       0.3 & 0.2 & 0.1 & 0.4
       \end{pmatrix},

where both the rows and the columns are indexed by the states in the order a, c, g, t.
The vector of stationary probabilities for the corresponding Markov Chain is
π ∗ = [0.25, 0.25, 0.25, 0.25].
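In case you wonder how such time-lapse movies can be produced: one simulates the chain by repeatedly sampling the next nucleotide from the row of the mutation matrix indexed by the current one. A minimal sketch (ours; the function name and the seed are arbitrary):

# Simulate a time-lapse "movie" of a DNA locus evolving under a
# mutation probability matrix M with states ordered a, c, g, t.
import numpy as np

STATES = "acgt"
M_ir = np.array([[0.4, 0.3, 0.2, 0.1],
                 [0.1, 0.4, 0.3, 0.2],
                 [0.2, 0.1, 0.4, 0.3],
                 [0.3, 0.2, 0.1, 0.4]])

def movie(M, start, length, seed=0):
    """Return `length` successive states of the chain as a string."""
    rng = np.random.default_rng(seed)
    i = STATES.index(start)
    frames = [start]
    for _ in range(length - 1):
        i = rng.choice(4, p=M[i])       # sample next state from current row
        frames.append(STATES[i])
    return "".join(frames)

print(movie(M_ir, "g", 63))   # one forward-running movie for the chain (14)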
I created four time-lapse movies of evolution of this locus, but, accidentally, one
of them runs backwards.
Exercise 5. Can you figure out which of the four movies below runs backwards?
Explain your reasoning.
Movie 9
gaccccacgttacgtcatactttttccgggttcgtaggttccattttcctaccccgtaggtaa
Movie 10
cgcagagacgggcgcggggagattacgttggggacccctcgtcgaggtcgaacatccccaagt
Movie 11
taaatgaggctcaggatgaatcccattggcaaaaaataatcccatggaaaatggggcatcttg
Movie 12
gttgaaccggataacccttcagaaccaagaccgggctaagtatttgtccaaacaaactacgaa
Now consider another matrix of mutation probabilities for a DNA locus:

(15)   MK =
       \begin{pmatrix}
       0.72 & 0.06 & 0.18 & 0.04 \\
       0.04 & 0.78 & 0.06 & 0.12 \\
       0.12 & 0.06 & 0.78 & 0.04 \\
       0.04 & 0.18 & 0.06 & 0.72
       \end{pmatrix},

with rows and columns again indexed by the states in the order a, c, g, t.
The vector of stationary probabilities for the corresponding Markov Chain is
π ∗ = [0.2, 0.3, 0.3, 0.2].
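Again, this claim is easy to verify numerically (a quick check of ours):

# Check that pi* = [0.2, 0.3, 0.3, 0.2] satisfies pi* M_K = pi*.
import numpy as np

MK = np.array([[0.72, 0.06, 0.18, 0.04],
               [0.04, 0.78, 0.06, 0.12],
               [0.12, 0.06, 0.78, 0.04],
               [0.04, 0.18, 0.06, 0.72]])
pi = np.array([0.2, 0.3, 0.3, 0.2])
print(pi @ MK)   # [0.2 0.3 0.3 0.2], so pi is indeed stationary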
I created four time-lapse movies of evolution of this locus, but, accidentally, one
of them runs backwards.
Exercise 6. Can you figure out which of the four movies below runs backwards?
Explain your reasoning.
Movie 13
cccccccctgccccccctttcccccagggggggggggttttggggggggtcccccttcggggaagcccggggggaactttccggggggatttttttgcct
Movie 14
ggaacccccccgggaaaaatttttttggaagggggggggggttttttttttaggggcgtccttttaattaaaccccccccttcccccttttttcctccct
Movie 15
cctttttggccttctttttgattaaaaaattttctccccctttccccccaggggcccctaattttctttttcccccccccccccccggggggggggcccc
Movie 16
tttcccccttttttcccgggaaagggggggccccccttttttccctaaccttttttttttttttgaaggggggttttgcccaaaaaactttcttaaaaat
Intuitively speaking, a stochastic process is time-reversible if you get the same
process by running time backwards.
Exercise 7. Based on your work for Exercises 3 through 6, formulate a conjecture on how one can characterize time-reversible Markov chains in terms of their
transition matrices M .
Is it reasonable to assume that molecular evolution really is a
time-reversible stochastic process?
Obviously not. However, we almost never are in a position to actually observe
molecular evolution. All we usually can observe are the outcomes of two independent runs of molecular evolution from an unknown common ancestor to two extant
descendant species. Thus any model of the evolutionary changes that we observe
by comparing the molecular make-up of these two observable species needs to be explained in terms of a model where evolution runs backward in time to the common
ancestor and then forward to the other species. Since we usually have no reason
to treat one of these species as the end of the backward branch and the other as
the end of the forward branch, the standard assumption in models of molecular
evolution is that it is a time-reversible stochastic process.