Miranda Holmes-Cerfon — Applied Stochastic Analysis, Spring 2015

Lecture 3: Markov Chains (II)

Readings
• Grimmett and Stirzaker [2001], Sections 6.8, 6.9.

Many random processes have a discrete state space, but change their values at any instant of time rather than at fixed intervals: radioactive atoms decaying, molecules in a chemical reaction, populations with birth/death/immigration/emigration, etc. In this case we may still have Markovian dynamics, which depend only on the current state, but they evolve continuously in time. In the first half of the class we'll talk about continuous-time Markov chains: the tools to describe them, the link with discrete-time chains, and an important example called the Poisson process. In the second half of the class we'll show two applications of Markov chains (discrete or continuous): first, an application to clustering and data science, and then, the connection between MCs, electrical networks, and flows in porous media.

3.1 Continuous-time Markov chains

Let $X = \{X(t), t \ge 0\}$ be a family of random variables taking values in a finite or countable state space $S$, which we can take to be a subset of the integers.

Definition. $X$ satisfies the Markov property if

$$P(X(t_n) = i_n \mid X(t_1) = i_1, \ldots, X(t_{n-1}) = i_{n-1}) = P(X(t_n) = i_n \mid X(t_{n-1}) = i_{n-1})$$

for all $i_1, \ldots, i_n \in S$ and any sequence $t_1 < t_2 < \cdots < t_n$ of times.

Remark. This property can also be formulated in terms of $\sigma$-algebras. Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{\mathcal{F}_t\}_{t \ge 0}$ be a filtration: an increasing family of $\sigma$-algebras such that $\mathcal{F}_t \subseteq \mathcal{F}$ for each $t$, and $t_1 \le t_2 \Rightarrow \mathcal{F}_{t_1} \subseteq \mathcal{F}_{t_2}$. We suppose the process $X_t$ is adapted to the filtration $\{\mathcal{F}_t\}_{t\ge0}$: each $X_t$ is measurable with respect to $\mathcal{F}_t$. For example, this will be true automatically if we let $\mathcal{F}_t$ be the $\sigma$-algebra generated by $(X_s)_{0 \le s \le t}$, i.e. generated by the pre-images $X_s^{-1}(B)$ for Borel sets $B \subset \mathbb{R}$. Then $X_t$ has the Markov property if

$$E(f(X_t) \mid \mathcal{F}_s) = E(f(X_t) \mid \sigma(X_s))$$

for all $0 \le s \le t$ and all bounded, measurable functions $f$. Another way to say this is $P(X_t \in A \mid \mathcal{F}_s) = P(X_t \in A \mid \sigma(X_s))$, where $P(\cdot \mid \cdot)$ is a regular conditional probability (see Koralov and Sinai [2010], p. 184).

Remark. The difference from last class is that now the set of times $t$ is continuous: the chain can jump between states at any time, not just at integer times. However, we will see that we can actually say quite a lot about the distribution of the times at which it jumps.

To describe a continuous-time Markov chain there is no exact analogue of the transition matrix $P$, since there is no implicit unit of time. There are two ways around this:

(i) Work with the generator $Q$, which is like an infinitesimal version of $P$.
(ii) Consider the sequence of states $X_1, X_2, \ldots$ visited at the jumps to be a discrete-time MC, and describe the distributions of the jump times $\tau_1, \tau_2, \ldots$.

We will consider both approaches. Let's start with the first one.

3.1.1 Generator, and forward and backward equations

Definition. The transition probability for a homogeneous chain is

$$p_{ij}(t) = P(X(s+t) = j \mid X(s) = i),$$

which by homogeneity does not depend on $s \ge 0$. (For a non-homogeneous chain, we would have $p_{ij}(s,t)$.) We write $P(t) = (p_{ij}(t))$ for the matrix of transition probabilities. This is a stochastic matrix that gives the transition probabilities for the MC after a fixed time $t$. In a similar way to the discrete case, we can show the Chapman-Kolmogorov equations hold for $P(t)$:

Chapman-Kolmogorov Equation.

$$p_{ij}(t+s) = \sum_k p_{ik}(t)\, p_{kj}(s) \quad\Longleftrightarrow\quad P(t+s) = P(t)\,P(s). \qquad (1)$$

Proof.

$$p_{ij}(s+t) = P(X(s+t) = j \mid X(0) = i) = \sum_k P(X(s+t) = j \mid X(s) = k,\, X(0) = i)\, P(X(s) = k \mid X(0) = i) = \sum_k p_{ik}(s)\, p_{kj}(t),$$

using the Markov property in the second step.
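As a quick numerical sanity check of (1), here is a minimal sketch using the two-state chain with jump rates $a$ (state 0 to 1) and $b$ (state 1 to 0), whose transition matrix has the classical closed form written below; the particular rates $a = 2$, $b = 3$ are arbitrary.

```python
import numpy as np

def P(t, a=2.0, b=3.0):
    """Transition matrix of the two-state chain with rates a (0->1), b (1->0)."""
    e = np.exp(-(a + b) * t)
    return np.array([[b + a * e, a - a * e],
                     [b - b * e, a + b * e]]) / (a + b)

s, t = 0.7, 1.3
assert np.allclose(P(s + t), P(s) @ P(t))   # Chapman-Kolmogorov (1)
assert np.allclose(P(s).sum(axis=1), 1.0)   # rows sum to 1
assert np.allclose(P(0.0), np.eye(2))       # P(0) = I
```

This closed form anticipates the result $P(t) = e^{Qt}$ derived below, applied to the generator $Q = \begin{pmatrix} -a & a \\ b & -b \end{pmatrix}$.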
This implies the following:

Theorem. The family $\{P_t : t \ge 0\}$ with $P_t \equiv P(t)$ is a stochastic semigroup; that is, it satisfies the following:

(i) $P_0 = I$, the identity matrix;
(ii) $P_t$ is stochastic, i.e. it has non-negative entries and its rows sum to 1;
(iii) $P_{s+t} = P_s P_t$ for $s, t \ge 0$.

Remark. One way to study Markov processes in general is through the semigroups they generate. See e.g. Grimmett and Stirzaker [2001], p. 256, for more details related to continuous-time MCs, and see Koralov and Sinai [2010], Pavliotis [2014] for a discussion of general Markov processes.

We would like to study the transition probability in the limit as $t \to 0$, because this will lead to differential equations governing the evolution of the probability and observables. Let's assume $P(t)$ is "nice enough" as $t \to 0$, so that $P(t) \to I$ in the following way:

(i) $p_{ij}(h) = \lambda_{ij} h + o(h)$ for $j \ne i$;
(ii) $p_{ii}(h) = 1 - \lambda_i h + o(h)$, with $\lambda_i > 0$.

This means that (i) the probability of moving to a new state $j$ in a small time $h$ is approximately a linear function of $h$, and (ii) the probability of staying in the same state approaches 1 linearly in $h$, where the $o(h)$ error accounts for the possibility of jumping out and back in. Together, these mean the probability of two or more transitions in the interval $(t, t+h)$ is $o(h)$; this is something that can be proved. The numbers $\lambda_{ij}, \lambda_i > 0$ are constants that measure how the probability behaves for small $h$. Note we expect to have $\lambda_i = \sum_{j \ne i} \lambda_{ij}$. (There are some chains for which this does not hold, but we will not consider them in this class.)

Definition. The generator of the Markov chain is

$$Q = \lim_{h \to 0^+} \frac{P(h) - I}{h}. \qquad (2)$$

Its entries are $q_{ii} = -\lambda_i$, $q_{ij} = \lambda_{ij}$ ($j \ne i$), with $\sum_j q_{ij} = 0$.

The generator is the fundamental mathematical object used to study continuous-time Markov chains. We have written it above as a matrix, but if the state space is infinite similar ideas apply. Physically, the elements of the generator represent the "rate" of jumping from one state to another. Imagine an experiment where we start in state $i$, and every time we jump out of it we return immediately. Then $q_{ij}$ would be the number of jumps to state $j$ we observe per unit time, on average. The diagonal element $q_{ii}$ is negative; its absolute value represents the total rate of "leaving" the $i$th state. The rate of leaving must be the sum of the rates of going somewhere else, so the rows of the generator must sum to 0.

How does $P(t)$ evolve in time? Let's calculate:

$$P'(t) = \lim_{h \to 0^+} \frac{P(t+h) - P(t)}{h} = \lim_{h \to 0^+} \frac{P(h) - I}{h}\, P(t) = Q\,P(t).$$

We were able to factor $P(t+h) = P(h)P(t)$ using the Chapman-Kolmogorov equations. We can also pull the factor out on the other side, in which case we get

$$P'(t) = \lim_{h \to 0^+} \frac{P(t+h) - P(t)}{h} = P(t)\, \lim_{h \to 0^+} \frac{P(h) - I}{h} = P(t)\,Q.$$

This shows that $QP(t) = P(t)Q$, i.e. the transition probability matrix commutes with the generator. We also obtain

$$P'(t) = P(t)\,Q = Q\,P(t). \qquad (3)$$

The first equality is a version of the forward equation, and the second equality is a version of the backward equation. The solution to (3) is $P(t) = e^{Qt} P(0) = e^{Qt}$, since $P(0) = I$. (Recall that $e^{Qt} \equiv \sum_{n=0}^{\infty} \frac{1}{n!} Q^n t^n$ for any square matrix $Q$.) Therefore, the transition matrix is just the exponential of the generator.
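To illustrate, here is a small numerical sketch, assuming an arbitrary 3-state generator and using SciPy's matrix exponential; it checks that $P(t) = e^{Qt}$ has stochastic rows, satisfies the semigroup property (1), and recovers $Q$ through the limit in (2).

```python
import numpy as np
from scipy.linalg import expm

# An arbitrary 3-state generator: non-negative off-diagonal rates, rows sum to 0.
Q = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -1.5,  0.5],
              [ 0.5,  0.5, -1.0]])

t, s, h = 0.8, 0.3, 1e-6
Pt = expm(Q * t)                                         # P(t) = e^{Qt}
assert np.allclose(Pt.sum(axis=1), 1.0)                  # rows sum to 1
assert np.all(Pt >= 0)                                   # non-negative entries
assert np.allclose(expm(Q * (t + s)), Pt @ expm(Q * s))  # semigroup property (1)
assert np.allclose((expm(Q * h) - np.eye(3)) / h, Q, atol=1e-4)  # definition (2)
```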
Now let's think about how the probability density evolves in time. Let $\mu(t)$ be the probability density, which is a row vector, as for a discrete-time chain. Notice that this is computed from the transition matrix as $\mu(s) = \mu(0)P(s)$, or, more generally, $\mu(t+s) = \mu(t)P(s)$. Let's calculate the derivative:

$$\frac{d\mu}{dt} = \lim_{h \to 0^+} \frac{\mu(t+h) - \mu(t)}{h} = \lim_{h \to 0^+} \mu(t)\, \frac{P(h) - I}{h} = \mu(t)\, Q.$$

We obtain the following:

Forward Kolmogorov Equation.

$$\frac{d\mu}{dt} = \mu Q. \qquad (4)$$

This has solution $\mu(t) = \mu(0)\, e^{Qt}$.

We can also look at expectations of functions of a random variable. Let $u(k,t) = E_k f(X(t)) = E(f(X(t)) \mid X(0) = k)$, as in the previous lecture. Again, we think of $u(\cdot, t)$ as a column vector, but here it varies in continuous time. We can write $u(k,t) = \sum_j P(X(t) = j \mid X(0) = k)\, f(j) = \sum_j p_{kj}(t)\, f(j)$, so we can write $u = P f$, if we abuse notation and let $f = (f(1), f(2), \ldots)^T$. Therefore

$$\frac{du}{dt} = \lim_{h \to 0^+} \frac{P(t+h)f - P(t)f}{h} = \lim_{h \to 0^+} \frac{P(h) - I}{h}\, P(t)\,f = Q u.$$

This leads to:

Backward Kolmogorov Equation.

$$\frac{du}{dt} = Q u, \qquad u(k, 0) = f(k) \quad \forall k \in S. \qquad (5)$$

For the initial condition, just notice that $u(k,0) = E(f(X(0)) \mid X(0) = k) = f(k)$.

Mean first-passage time. The mean first-passage time (mfpt) can also be found using the generator. Let $A \subset S$ be a subset of the state space, let $T_A = \inf\{t \ge 0 : X(t) \in A\}$ be the time it takes for $X$ to hit $A$ for the first time, and let $\tau_j = E(T_A \mid X(0) = j)$ be the expected value of $T_A$, given the chain starts at $j$. Then the mfpt solves the system

$$\tau_j = 0, \quad j \in A; \qquad 1 + \sum_k q_{jk}\, \tau_k = 0, \quad j \notin A.$$

This is sometimes written as $Q\tau = -1$, $\tau|_A = 0$. Therefore the mfpt solves a non-homogeneous backward equation with a particular boundary condition. The proof is left as an exercise.

Stationary distribution. How do we find the stationary distribution? We need $\pi$ such that $\pi'(t) = 0$. From the forward equation (4), this implies

$$\pi Q = 0, \qquad \sum_i \pi_i = 1. \qquad (6)$$

Therefore $\pi$ is a left eigenvector of $Q$ corresponding to eigenvalue 0.

There are corresponding ergodic theorems for certain generators $Q$. A continuous-time chain is irreducible if $p_{ij}(t) > 0$ for all $i, j$ and any time $t > 0$. Note this is different from a discrete-time chain: because there is no notion of periodicity, we do not distinguish between primitive and irreducible. There are uniqueness, convergence, and ergodic theorems for irreducible chains that satisfy the smoothness assumptions near 0:

Theorem. If $Q$ is irreducible, then

(i) If there exists a stationary distribution $\pi$, then it is unique, and $p_{ij}(t) \to \pi_j$ as $t \to \infty$ for all $i, j$.
(ii) If there is no stationary distribution, then $p_{ij}(t) \to 0$ as $t \to \infty$. (Such chains must necessarily have $|S| = \infty$.)
(iii) (Ergodic theorem) For any bounded function $f$ we have

$$\frac{1}{t} \int_0^t f(X(s))\, ds \to \langle f \rangle_\pi \quad \text{a.s.}$$

See e.g. Grimmett and Stirzaker [2001], Norris [1997] for a proof. If the state space is finite, then statement (i) always holds, not (ii).
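Both (6) and the mfpt system reduce to small linear-algebra problems. Here is a sketch, reusing the arbitrary toy generator from the snippet above: since $\tau = 0$ on $A$, the equations for $j \notin A$ close among the unknowns $\tau_k$, $k \notin A$, so it suffices to solve the restriction of $Q$ to the complement of $A$.

```python
import numpy as np

Q = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -1.5,  0.5],
              [ 0.5,  0.5, -1.0]])   # same arbitrary toy generator as above

# Stationary distribution (6): left null vector of Q, normalized to sum to 1.
w, V = np.linalg.eig(Q.T)
pi = np.real(V[:, np.argmin(np.abs(w))])
pi /= pi.sum()
assert np.allclose(pi @ Q, 0.0, atol=1e-10)

# Mean first-passage time to A = {2}: solve sum_k q_jk tau_k = -1 for j not in A,
# with tau = 0 on A, i.e. solve the system restricted to B = complement of A.
B = [0, 1]
tau_B = np.linalg.solve(Q[np.ix_(B, B)], -np.ones(len(B)))
print("mfpt from states 0, 1 to state 2:", tau_B)   # [1.4, 1.6] for this Q
```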
3.1.2 Example: Poisson process

An example of a continuous-time Markov chain that arises a lot in applications is the Poisson process. This is a process $N(t)$ that counts the number of events that have happened by time $t$, if the events occur independently with rate $\lambda$. For example, it could count the number of buses that pass a bus stop in a certain time $t$, if the drivers have absolutely no idea where the other drivers are, and are so delayed by bad traffic that their arrival times are completely random. It could count the number of radioactive atoms that have decayed, or the number of telephone calls that arrive at a call center, or the number of search requests received by Google's servers. It is used to model the growth of a population, say of people, bacteria, or rabbits, if we don't count deaths. (We could include deaths by modelling them as a Poisson process too; this would be a birth-death process.) It could be used to model the number of large waves that arrive at an oil platform, or the locations of breaking waves in the ocean. (We have only defined the Poisson process on a line, but it can be generalized to a function of multiple spatial coordinates.)

[Figure: some realizations of the Poisson process with rate $\lambda = 1$, each simulated for 10 time units.]

There are (at least) two ways to define the Poisson process:

Definition (#1). The Poisson process with rate $\lambda$ is a Markov chain on $S = \{0, 1, 2, \ldots\}$ with generator

$$Q = \begin{pmatrix} -\lambda & \lambda & 0 & 0 & \cdots \\ 0 & -\lambda & \lambda & 0 & \cdots \\ 0 & 0 & -\lambda & \lambda & \cdots \\ \vdots & \vdots & \vdots & \ddots & \ddots \end{pmatrix}$$

Definition (#2). The Poisson process with rate $\lambda$ is a process $N = \{N(t) : t \ge 0\}$ taking values in $S = \{0, 1, 2, \ldots\}$ such that:

(i) $N(0) = 0$;
(ii) $N(t)$ has stationary, independent increments: for any $0 \le t_1 < t_2 < \cdots < t_n$, the random variables $N(t_2) - N(t_1), N(t_3) - N(t_2), \ldots, N(t_n) - N(t_{n-1})$ are independent, and for any $t \ge 0$, $s \ge 0$, the distribution of $N(t+s) - N(t)$ is independent of $t$;
(iii) For any $t \ge 0$, $h > 0$, we have

$$P(N(t+h) = n + m \mid N(t) = n) = \begin{cases} \lambda h + o(h) & \text{if } m = 1 \\ o(h) & \text{if } m > 1 \\ 1 - \lambda h + o(h) & \text{if } m = 0. \end{cases}$$

It can be shown that the second definition implies the first, and vice versa.

Let's try to calculate the probability distribution of the Poisson process at each point in time. Let $p_j(t) = P(N(t) = j)$, $j = 0, 1, 2, \ldots$. We solve for $p_j(t)$ using the forward Kolmogorov equation (4). When $j = 0$ we have

$$\frac{dp_0}{dt} = -\lambda p_0, \qquad p_0(0) = 1.$$

The solution is $p_0(t) = e^{-\lambda t}$. The next equation is

$$\frac{dp_1}{dt} = \lambda p_0 - \lambda p_1, \qquad p_1(0) = 0.$$

Substituting for $p_0(t)$ and solving gives $p_1(t) = \lambda t e^{-\lambda t}$. In general, we have

$$\frac{dp_j}{dt} = -\lambda p_j + \lambda p_{j-1}, \qquad p_j(0) = 0 \quad (j > 0).$$

We can solve these by induction to find that

$$p_j(t) = \frac{\lambda^j t^j}{j!}\, e^{-\lambda t}.$$

This shows that at fixed time $t$, $N(t)$ is a Poisson random variable with parameter $\lambda t$. It is also possible to show that the events counted by $N(t)$ are uniformly distributed in time, with exponential waiting times in between (see Section 3.1.3 for the latter).
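This distribution is easy to check by simulation, using the exponential waiting times just mentioned (they are derived in Section 3.1.3). Here is a sketch with $\lambda = 1$ and $T = 10$, as in the figure above; the number of sample paths is arbitrary.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
lam, T, n_paths = 1.0, 10.0, 20_000

# Each path: add exponential(rate lam) inter-event times until we pass T;
# N(T) is the number of events that fit in [0, T].
counts = np.empty(n_paths, dtype=int)
for p in range(n_paths):
    t, n = 0.0, 0
    while True:
        t += rng.exponential(1.0 / lam)   # numpy parametrizes by the mean 1/lam
        if t > T:
            break
        n += 1
    counts[p] = n

# Empirical distribution of N(T) vs. the Poisson pmf with parameter lam*T.
for j in range(15):
    print(j, round((counts == j).mean(), 4), round(poisson.pmf(j, lam * T), 4))
```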
3.1.3 Waiting times and jump probabilities

Another way to think about a continuous-time Markov chain is to consider the times at which it jumps, and the distributions of the states it jumps to.

First, let's consider the probability distribution of the jump times. Suppose we are in state $i$ at time $s$, and we wish to know how long we will wait before we jump to another state. This is a random variable, so it is described by its probability distribution. Let's define

$$\tau_i = \inf\{t > 0 : X(s+t) \ne i\}, \quad \text{given } X(s) = i,$$

to be the time of the first jump after time $s$, starting in state $i$.

Proposition. The waiting time $\tau_i$ is an exponentially distributed random variable, with rate $-q_{ii} = \sum_{j \ne i} q_{ij}$.

Proof. Recall that a random variable $Y$ is exponentially distributed with parameter $\lambda > 0$ if it has probability density function $p(y) = \lambda e^{-\lambda y}$, $y \ge 0$. Exponential random variables have the "lack-of-memory" property: $P(Y > y + x \mid Y > y) = P(Y > x)$, and in fact, one can show they are the only continuous random variables with this property (Grimmett and Stirzaker [2001], pp. 140, 259). Here are two different proof sketches; the ideas behind them are similar.

1. Our strategy will be to show that $\tau_i$ has the lack-of-memory property. Let's calculate:

$$P(\tau_i > x + y \mid \tau_i > x) = P(\tau_i > x + y \mid X(s+x) = i) = P(\tau_i > y).$$

The first step follows because if $\tau_i > x$, then we necessarily have $X(s+x) = i$. The second step follows by the Markov property and homogeneity. Therefore $P(\tau_i > x + y) = P(\tau_i > x)\, P(\tau_i > y)$, and the only continuous solution to this equation is $P(\tau_i > x) = e^{-\lambda x}$. To find $\lambda$, note that $\lambda = \frac{d}{dh} P(\tau_i < h)\big|_{h=0}$, but $P(\tau_i < h) = \sum_{j \ne i} q_{ij} h + o(h) = -q_{ii} h + o(h)$, so $\lambda = -q_{ii}$.

2. Let $\mu(t) = P(\tau_i \ge t)$. We will proceed by deriving a differential equation for $\mu(t)$. We have

$$\mu(t) - \mu(t+h) = P(\tau_i \in [t, t+h)) = P(\tau_i < t + h \mid \tau_i \ge t)\, P(\tau_i \ge t) = \Big( \sum_{j \ne i} q_{ij}\, h + o(h) \Big)\, \mu(t).$$

Therefore $\mu(t) - \mu(t+h) = \mu(t)\,(-q_{ii} h + o(h))$, so $\mu'(t) = q_{ii}\, \mu(t)$. The solution $\mu(t) = e^{q_{ii} t}$ follows after recognizing that $\mu(0) = 1$.

Because the exponential distribution is memoryless, the waiting-time distribution does not depend on the time at which we start counting.

Next, let's consider the probability distribution of the states the chain jumps to. If we ignore the jump times and record only the sequence of states visited, we obtain a discrete-time Markov chain, the embedded chain. Therefore, we need to know the transition matrix of this embedded discrete-time Markov chain.

Proposition. The transition matrix of the embedded chain has elements $p_{ij} = -q_{ij}/q_{ii}$ ($j \ne i$).

Proof. (Sketch) We have

$$p_{ij} = P(X(t+h) = j \mid X(t) = i,\ t < \tau_i \le t + h) = \frac{P(X(t+h) = j \mid X(t) = i)}{P(t < \tau_i \le t + h \mid X(t) = i)} = \frac{q_{ij} h + o(h)}{-q_{ii} h + O(h)} \to -\frac{q_{ij}}{q_{ii}} \quad \text{as } h \searrow 0.$$

In words: suppose that $X(0) = i$, and that $t < \tau_i \le t + h$, and suppose that $h$ is small enough that the chain jumps only once in $(t, t+h]$. Then

$$P(\text{jumps to } j \mid \text{it jumps}) \approx \frac{p_{ij}(h)}{1 - p_{ii}(h)} \to -\frac{q_{ij}}{q_{ii}} \quad \text{as } h \searrow 0.$$

These two results are very useful for simulating continuous-time Markov chains, and form the basis of the kinetic Monte Carlo (KMC) algorithm, also known as the stochastic simulation algorithm (SSA) or Gillespie algorithm. These names all refer to essentially the same algorithm, which is used to simulate a wide range of problems in chemical kinetics and materials science (e.g. surface diffusion, surface growth, defect mobility, etc.). We will see an example of this on the homework.
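The algorithm itself is short: starting from state $i$, wait an Exponential$(-q_{ii})$ time, jump to $j \ne i$ with probability $-q_{ij}/q_{ii}$, and repeat. Here is a minimal sketch, reusing the arbitrary 3-state generator from the earlier snippets.

```python
import numpy as np

def gillespie(Q, i0, T, seed=0):
    """Simulate a CTMC with generator Q on [0, T], starting from state i0.

    Waiting times in state i are Exponential(-Q[i,i]); the next state j != i
    is chosen with probability -Q[i,j]/Q[i,i] (the embedded chain).
    Returns the jump times and the sequence of states visited.
    """
    rng = np.random.default_rng(seed)
    times, states = [0.0], [i0]
    t, i = 0.0, i0
    while True:
        rate = -Q[i, i]
        if rate <= 0.0:                      # absorbing state: no more jumps
            break
        t += rng.exponential(1.0 / rate)     # exponential holding time
        if t > T:
            break
        jump_probs = np.maximum(Q[i], 0.0) / rate   # -q_ij/q_ii for j != i
        i = rng.choice(len(jump_probs), p=jump_probs)
        times.append(t)
        states.append(i)
    return np.array(times), np.array(states)

Q = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -1.5,  0.5],
              [ 0.5,  0.5, -1.0]])
times, states = gillespie(Q, i0=0, T=1000.0)
```

By the ergodic theorem of Section 3.1.1, the fraction of time such a trajectory spends in each state (weighting each visit by its holding time) converges to the stationary distribution $\pi$.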
3.2 Data analysis and Markov chains

A lot of data takes the form of a network: a graph, with a set of nodes connected by edges, which may or may not have weights associated with them. Some examples include:

– facebook users and friends
– websites and links
– people and their contacts in disease modelling
– interacting biological organisms
– chemical reactions between multiple different species
– languages
– image recognition
– states in a dynamical system (e.g. weather prediction)
– etc.

The nodes may or may not be embedded in a space, typically high-dimensional, and if they are, they may or may not be restricted to a lower-dimensional submanifold. One highly relevant question is how to define a "distance" between the nodes. Even when the nodes are embedded in a Euclidean space, the natural idea, to define the distance to be the length of the shortest path, is not always a good one. Here is an example to show why, from Lafon and Lee [2006]: points lie on a noisy spiral, so we would like the shortest path to lie along the spiral; however, paths can take shortcuts through the noise that dramatically reduce the distance.

Is there a better notion of "distance" between the nodes of our data? Can this be used to produce a lower-dimensional representation of the data set? And can this in turn be used to "cluster" the data into groups of similar nodes? It turns out that one can answer all of these questions by defining a Markov chain on the graph. (Much of this discussion follows that in the beautiful paper Lafon and Lee [2006]; beware that the notation for left and right eigenvectors here is inverted from that paper. The diffusion distance was introduced in the classic paper Coifman et al. [2005], although there are many other references, both earlier and later, which discuss such spectral properties of graphs.)

Let the set of nodes be $V = \{1, 2, \ldots, n\}$, and the set of edges be $E = \{e_{ij}\}_{i,j=1}^n$. Let's suppose that each edge has a weight $w_{ij}$ associated with it, and the weights are symmetric: $w_{ij} = w_{ji}$. The weights are entirely application-driven and there are several ways to construct them:

• There may be a natural notion of weight — e.g. the number of facebook messages sent between facebook friends, the number of times one person made skin-to-skin contact with another in Liberia during the Ebola crisis, etc.
• If there is no natural notion of weight, then we can set all edges to have weight 1, i.e. we set $w_{ij} = 1$ if nodes $i, j$ have an edge connecting them, and $w_{ij} = 0$ otherwise.
• If the nodes have positions $x_i$ in a high-dimensional space, we could let the weights be a decaying function of Euclidean distance, e.g. $e^{-|x_i - x_j|^2/2\varepsilon}$, where $\varepsilon$ sets a lengthscale. This gives high weight to points that are nearby (in Euclidean space), but essentially zero weight to points beyond a cutoff scale. One has a cutoff because the data may lie on a nonlinear submanifold, so the Euclidean distance is only a good approximation to the manifold distance for very small distances.

A Markov chain can be constructed by defining the transition probabilities to be

$$p_{ij} = \frac{w_{ij}}{d_i}, \qquad d_i = \sum_j w_{ij}. \qquad (7)$$

Here $d_i$ is called the "degree" of node $i$. The chain chooses edges out of each node $i$ in proportion to their weight. We showed on the homework that the stationary distribution is

$$\pi_i = \frac{d_i}{\sum_k d_k}, \qquad (8)$$

and that the Markov chain satisfies detailed balance. Because of this, the transition matrix $P$ is self-adjoint with respect to the inner product $\langle \cdot, \cdot \rangle_\pi$ introduced in Lecture 2, and so it has a complete set of real eigenvalues $\{\lambda_k\}_{k=0}^{n-1}$, and corresponding right and left eigenvectors $\{\phi_k\}_{k=0}^{n-1}$, $\{\psi_k\}_{k=0}^{n-1}$. Let's order the eigenvalues so that $1 = \lambda_0 \ge |\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_{n-1}|$. We have $\psi_0 = \pi$, and $\phi_0 \propto (1, 1, \ldots, 1)^T$ is a constant vector. We assume the eigenvectors are normalized so that $\|\phi_j\|_\pi^2 = 1$, $\|\psi_j\|_{\pi^{-1}}^2 = 1$. The spectral decomposition of $P^t$ is given by

$$p_{ij}^t = \sum_{k=0}^{n-1} \lambda_k^t\, \phi_k(i)\, \phi_k(j)\, \pi_j. \qquad (9)$$

Now, we would like to introduce a "distance" between nodes, in a way that respects the dynamics of the Markov chain. One way is via the diffusion distance $D_t$:

$$D_t^2(i,j) = \big\| p_{i,\cdot}^t - p_{j,\cdot}^t \big\|_{1/\pi}^2 = \sum_k \frac{(p_{ik}^t - p_{jk}^t)^2}{\pi_k}. \qquad (10)$$

The distance between nodes $i, j$ is given by starting the Markov chain at each of nodes $i, j$, finding the probability distributions after a fixed time $t$, and then computing the $L^2$-norm, or the mean-square distance, between these two distributions. The mean is computed with respect to a weight $1/\pi$, which penalizes discrepancies on domains of low density more than on those of high density.

Inserting (9) into (10) shows the diffusion distance can be written as

$$D_t^2(i,j) = \sum_{k=0}^{n-1} \lambda_k^{2t}\, (\phi_k(i) - \phi_k(j))^2. \qquad (11)$$

Note that since $\phi_0$ is constant, it does not enter the sum above, so the largest term comes from $\lambda_1$.
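Here is a sketch of the whole construction on an arbitrary random point cloud, using the Gaussian weights above. It relies on a standard trick for reversible chains: conjugating $P$ by $\mathrm{diag}(\sqrt{\pi})$ gives a symmetric matrix with the same eigenvalues, whose orthonormal eigenvectors yield right eigenvectors $\phi_k$ with the normalization $\|\phi_k\|_\pi = 1$. The final assertion checks that (11) (with all terms kept) agrees with a direct evaluation of (10).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))     # toy point cloud in R^3
eps = 1.0                        # kernel lengthscale (arbitrary)

# Gaussian weights and the row-normalized transition matrix (7).
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-D2 / (2 * eps))
d = W.sum(axis=1)
P = W / d[:, None]
pi = d / d.sum()                 # stationary distribution (8)

# Symmetrize: S = diag(sqrt(pi)) P diag(1/sqrt(pi)) shares P's eigenvalues;
# its orthonormal eigenvectors v_k give phi_k = v_k / sqrt(pi), ||phi_k||_pi = 1.
S = np.sqrt(pi)[:, None] * P / np.sqrt(pi)[None, :]
lam, V = np.linalg.eigh(S)
order = np.argsort(-np.abs(lam))             # sort so |lam_0| >= |lam_1| >= ...
lam, V = lam[order], V[:, order]
phi = V / np.sqrt(pi)[:, None]

t = 2
Psi = lam[1:] ** t * phi[:, 1:]              # diffusion map (13), all coordinates

# Euclidean distance in diffusion space equals the diffusion distance (10).
i, j = 0, 1
Pt = np.linalg.matrix_power(P, t)
Dt2 = ((Pt[i] - Pt[j]) ** 2 / pi).sum()
assert np.isclose(Dt2, ((Psi[i] - Psi[j]) ** 2).sum())
```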
The first key observation is that in many graphs the norms of the eigenvalues decay (for a fully connected graph, the first eigenvalue is 1 and all others are 0; for a totally disconnected graph, all eigenvalues are 1; most graphs lie somewhere in between), so that for large enough $t$ only a few terms in (11) contribute. Suppose we want to calculate the diffusion distance to relative precision $\delta$. Let $q(t)$ be the largest index $j$ such that $|\lambda_j|^t > \delta |\lambda_1|^t$. Then we can truncate the sum (11) at $q(t)$ to obtain

$$D_t^2(i,j) \approx \sum_{k=1}^{q(t)} \lambda_k^{2t}\, (\phi_k(i) - \phi_k(j))^2. \qquad (12)$$

The second key observation is that (12) can be interpreted as the Euclidean distance in $\mathbb{R}^{q(t)}$, if the coordinates of each node $i$ are given by the values of the right eigenvectors, weighted by $\lambda_k^t$. That is, we can define a mapping $\Psi_t : V \to \mathbb{R}^{q(t)}$ by

$$\Psi_t(i) = \begin{pmatrix} \lambda_1^t\, \phi_1(i) \\ \lambda_2^t\, \phi_2(i) \\ \vdots \\ \lambda_{q(t)}^t\, \phi_{q(t)}(i) \end{pmatrix}. \qquad (13)$$

Then we have that $D_t^2(i,j) \approx \|\Psi_t(i) - \Psi_t(j)\|^2$, where the approximation is good to relative precision $\delta$.

The map $\Psi_t$ provides a way to take a high-dimensional, or even unstructured, dataset and represent it in a lower-dimensional space, in such a way that the Euclidean distance in the lower-dimensional space is a good approximation to the diffusion distance: points that are close in the mapping tend to have similar dynamics in the Markov chain on the graph. This is helpful for many reasons: to visualize the data, to compress the data by storing fewer numbers, and, perhaps most importantly, for clustering or otherwise analyzing the data. Many clustering algorithms (e.g. k-means) work by finding groups of points that are close together in Euclidean space, and can be implemented extremely efficiently, but are hard to adapt to other metrics. However, once the map $\Psi_t$ has been applied to the data, these algorithms can be easily applied.

Example. This example is from Lafon and Lee [2006]. Data is generated on a Swiss roll: a spiral, embedded in 3-dimensional space. (For an example of data analysis techniques applied to the Swiss roll that do not use Markov chains, see http://www.sciencemag.org/content/290/5500/2319.full and http://www.sciencemag.org/content/290/5500/2323.full.) Clustering the data into 4 clusters using its embedding in 3d space leads to the partition on the left, which clearly does not respect the geometry of the manifold. Clustering the data after first applying the map $\Psi_t$ leads to the partition on the right, which captures the intrinsic geometry of the data much better.

References

R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431, 2005.

G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 2001.

L. B. Koralov and Y. G. Sinai. Theory of Probability and Random Processes. Springer, 2010.

S. Lafon and A. B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1393–1403, 2006.

J. R. Norris. Markov Chains. Cambridge University Press, 1997.

G. A. Pavliotis. Stochastic Processes and Applications. Springer, 2014.