Lecture 3: Markov Chains (II)

Miranda Holmes-Cerfon
Applied Stochastic Analysis, Spring 2015
Readings
• Grimmett and Stirzaker [2001] 6.8, 6.9.
Many random processes have a discrete state space, but change their values at any instant of time rather
than at fixed intervals: radioactive atoms decaying, molecules in a chemical reaction, populations with
birth/death/immigration/emigration, etc. In this case we may still have Markovian dynamics, which only
depend on the current state, but they evolve continuously in time. In the first half of the class we’ll talk
about continuous-time Markov chains: the tools to describe them, the link with discrete-time chains, and an
important example called the Poisson process.
In the second half of the class we’ll show two applications of Markov chains (discrete or continuous): first,
an application to clustering and data science, and then, the connection between MCs, electrical networks,
and flows in porous media.
3.1 Continuous-time Markov chains
Let X = {X(t),t ≥ 0} be a family of random variables taking values in a finite or countable state space S,
which we can take to be a subset of the integers.
Definition. X satisfies the Markov property if

P(X(t_n) = i_n | X(t_1) = i_1, . . . , X(t_{n−1}) = i_{n−1}) = P(X(t_n) = i_n | X(t_{n−1}) = i_{n−1})

for all i_1, . . . , i_n ∈ S and any sequence t_1 < t_2 < · · · < t_n of times.
Remark. This property can also be formulated in terms of σ-algebras. Let (Ω, F, P) be a probability space and let {F_t}_{t≥0} be a filtration: an increasing family of σ-algebras such that F_t ⊆ F for each t, and t_1 ≤ t_2 ⇒ F_{t_1} ⊆ F_{t_2}. We suppose the process X_t is adapted to the filtration {F_t}_{t≥0}: each X_t is measurable with respect to F_t. For example, this holds automatically if we let F_t be the σ-algebra generated by (X_s)_{0≤s≤t}, i.e. generated by the pre-images X_s^{−1}(B) for Borel sets B ⊂ R. Then X_t has the Markov property if

E(f(X_t) | F_s) = E(f(X_t) | σ(X_s))

for all 0 ≤ s ≤ t and all bounded, measurable functions f. Another way to say this is P(X_t ∈ A | F_s) = P(X_t ∈ A | σ(X_s)), where P(·|·) is a regular conditional probability (see Koralov and Sinai [2010], p. 184).
Remark. The difference from last class is that now, the set of times t is continuous – the chain can jump
between states at any time, not just at integer times. However, we will see that we can actually say quite a
lot about the distribution of the times at which it jumps.
To describe a continuous-time Markov chain, there is no exact analogue of P since there is no implicit unit
of time. There are two ways around this:
(i) Through the generator Q, which is like an infinitesimal version of P.
(ii) Consider the sequence of states X_1, X_2, . . . visited at the jumps to be a discrete-time MC, and describe the distributions of the jump times τ_1, τ_2, . . ..
We will consider both approaches. Let’s start with the first one.
3.1.1 Generator, and forward and backward equations
Definition. The transition probability for a homogeneous chain is p_ij(t) = P(X(s+t) = j | X(s) = i), which by homogeneity does not depend on s ≥ 0. (For a non-homogeneous chain, we would have p_ij(s,t).) We write P(t) = (p_ij(t)) for the matrix of transition probabilities. This is a stochastic matrix that gives the transition probabilities for the MC after a fixed time t.
In a similar way to the discrete case, we can show the Chapman-Kolmogorov equations hold for P(t):
Chapman-Kolmogorov Equation.

p_ij(t+s) = ∑_k p_ik(t) p_kj(s)   ⟺   P(t+s) = P(t)P(s).    (1)
Proof.

p_ij(s+t) = P(X(s+t) = j | X(0) = i)
          = ∑_k P(X(s+t) = j | X(s) = k, X(0) = i) P(X(s) = k | X(0) = i)
          = ∑_k p_ik(s) p_kj(t),

where the last step uses the Markov property and homogeneity.
This implies the following:
Theorem. The family {Pt : t ≥ 0} with Pt ≡ P(t) is a stochastic semigroup; that is, it satisfies the following:
(i) P0 = I, the identity matrix,
(ii) Pt is stochastic, that is, it has non-negative entries and its rows sum to 1,
(iii) Ps+t = Ps Pt if s,t ≥ 0.
Remark. One way to study Markov processes in general is by the semigroup they generate. See e.g. Grimmett and Stirzaker [2001], p.256, for more details related to continuous-time MCs, and see Koralov and
Sinai [2010], Pavliotis [2014] for a discussion of general Markov processes.
We would like to study the transition probability in the limit as t → 0, because this will lead to differential
equations governing the evolution of the probability and observables. Let’s assume P(t) is “nice enough” as
t → 0, so that P(t) → I in the following way:
(i) p_ij(h) = λ_ij h + o(h), j ≠ i,
(ii) p_ii(h) = 1 − λ_i h + o(h), λ_i > 0.
This means that (i) the probability of moving to a new state j in small time h is approximately a linear
function of h, and (ii) the probability of staying in the same state approaches 1 linearly in h, where the error
is o(h) and accounts for the possibility of jumping out and back in. Together, these mean the probability of
two or more transitions in the interval (t,t + h) is o(h); this is something that can be proved. The numbers
λ_ij, λ_i > 0 are constants that measure how the probability behaves for small h. Note we expect to have λ_i = ∑_{j≠i} λ_ij. [1]
Definition. The generator of the Markov chain is

Q = lim_{h→0+} (P(h) − I)/h.    (2)

Its entries are q_ii = −λ_i and q_ij = λ_ij (j ≠ i), with ∑_j q_ij = 0.
The generator is the fundamental mathematical object used to study continuous-time Markov chains. We
have written it above as a matrix, but if the state space is infinite similar ideas apply.
Physically, the elements of the generator represent the “rate” of jumping from one state to another. Imagine
an experiment where we start in state i, and every time we jump out of it we return immediately. Then qi j
would be the number of jumps to state j we observe per unit time, on average. The diagonal element q_ii is negative; its absolute value represents the total rate of “leaving” the ith state. The rate of leaving must be
the sum of the rates to go somewhere else, so the rows of the generator must sum to 0.
How does P(t) evolve in time? Let’s calculate:

P′(t) = lim_{h→0+} (P(t+h) − P(t))/h = lim_{h→0+} ((P(h) − I)/h) P(t) = QP(t).
We were able to factor P(t+h) = P(h)P(t), using the Chapman-Kolmogorov equations. We can also pull the factor out on the other side, in which case we get

P′(t) = lim_{h→0+} (P(t+h) − P(t))/h = P(t) lim_{h→0+} (P(h) − I)/h = P(t)Q.
This shows that QP(t) = P(t)Q, i.e. the transition probability matrix commutes with the generator. We also
obtain that
P′(t) = P(t)Q = QP(t).    (3)
The first equality is a version of the forward equation, and the second equality is a version of the backward
equation.
The solution to (3) is

P(t) = e^{Qt} P(0) = e^{Qt},

since P(0) = I. [2] Therefore, the transition matrix is just the exponential of the generator.
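As a quick numerical illustration, here is a minimal sketch (with a made-up 3-state generator; any matrix with non-negative off-diagonal entries and zero row sums would do) of computing P(t) = e^{Qt} and checking the semigroup property:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical 3-state generator: non-negative off-diagonal rates,
# rows summing to zero.
Q = np.array([[-1.0,  0.7,  0.3],
              [ 0.5, -0.8,  0.3],
              [ 0.2,  0.6, -0.8]])

t, s = 2.0, 0.5
P_t = expm(Q * t)              # transition matrix P(t) = e^{Qt}
print(P_t.sum(axis=1))         # rows sum to 1: P(t) is stochastic

# Chapman-Kolmogorov / semigroup property: P(s+t) = P(s) P(t)
assert np.allclose(expm(Q * (s + t)), expm(Q * s) @ expm(Q * t))
```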
Now let’s think about how the probability density evolves in time. Let µ(t) be the probability density, which
is a row vector, as for a discrete-time chain. Notice that this is computed from the transition matrix as
µ(s) = µ(0)P(s), or, more generally, µ(t + s) = µ(t)P(s). Let’s calculate the derivative:
dµ/dt = lim_{h→0+} (µ(t+h) − µ(t))/h = lim_{h→0+} µ(t) (P(h) − I)/h = µ(t)Q.
[1] There are some chains for which this does not hold, but we will not consider them in this class. They must necessarily have |S| = ∞.
[2] Recall that e^{Qt} ≡ ∑_{n=0}^∞ (1/n!) Q^n t^n for any square matrix Q.
We obtain the following:
Forward Kolmogorov Equation.

dµ/dt = µQ.    (4)
This has solution µ(t) = µ(0)e^{Qt}.
We can also look at expectations of functions of the process. Let u(k,t) = E_k f(X(t)) = E(f(X(t)) | X(0) = k), as in the previous lecture. Again, we think of u(·,t) as a column vector, but here it varies in continuous time. We can write u(k,t) = ∑_j P(X(t) = j | X(0) = k) f(j) = ∑_j p_kj(t) f(j), so we can write u = P f, if we abuse notation and let f = (f(1), f(2), . . .). Therefore

du/dt = lim_{h→0+} (P(t+h)f − P(t)f)/h = lim_{h→0+} ((P(h) − I)/h) P(t)f = Qu.
This leads to:
Backward Kolmogorov Equation.

du/dt = Qu,    u(k, 0) = f(k) ∀k ∈ S.    (5)
For the initial condition, just notice that u(k, 0) = E( f (X(0))|X(0) = k) = f (k).
Mean first-passage time. The mean first-passage time (mfpt) can also be found using the generator. Let A ⊂ S be a subset of the state space, let T_A = inf{t ≥ 0 : X(t) ∈ A} be the time it takes for X to hit A for the first time, and let τ_j = E(T_A | X(0) = j) be the expected value of T_A, given the chain starts at j. Then the mfpt solves the system

τ_j = 0,                j ∈ A,
1 + ∑_k q_jk τ_k = 0,   j ∉ A.

This is sometimes written as

Qτ = −1,    τ(A) = 0.

Therefore the mfpt solves the non-homogeneous backward equation with a particular boundary condition.
The proof is left as an exercise to the reader.
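To make the linear system concrete, here is a small sketch (reusing the hypothetical 3-state generator from above, with a made-up target set A = {2}) that solves Qτ = −1 on the complement of A, with τ = 0 on A:

```python
import numpy as np

Q = np.array([[-1.0,  0.7,  0.3],
              [ 0.5, -0.8,  0.3],
              [ 0.2,  0.6, -0.8]])
A = [2]                                        # hypothetical target set
B = [j for j in range(len(Q)) if j not in A]   # states outside A

tau = np.zeros(len(Q))
# Restrict Q to rows/columns outside A and solve Q_BB tau_B = -1;
# the boundary condition tau = 0 on A is already built into tau.
tau[B] = np.linalg.solve(Q[np.ix_(B, B)], -np.ones(len(B)))
print(tau)                                     # mean first-passage times to A
```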
Stationary distribution. How do we find the stationary distribution? We need π such that π′ = 0. From the forward equation (4), this implies

πQ = 0,    ∑_i π_i = 1.    (6)

Therefore π is a left eigenvector of Q corresponding to eigenvalue 0.
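Numerically, π can be found as the suitably normalized null vector of Q^T; a minimal sketch with the same hypothetical generator:

```python
import numpy as np
from scipy.linalg import null_space

Q = np.array([[-1.0,  0.7,  0.3],
              [ 0.5, -0.8,  0.3],
              [ 0.2,  0.6, -0.8]])

# pi Q = 0 is equivalent to Q^T pi^T = 0, so take the null space of Q^T
# and normalize it to sum to 1.
v = null_space(Q.T)[:, 0]
pi = v / v.sum()
print(pi, pi @ Q)    # pi @ Q should be numerically zero
```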
There are corresponding ergodic theorems for certain generators Q. A continuous-time chain is irreducible
if pi j (t) > 0 for all i, j, and any time t > 0. Note this is different from a discrete-time chain, because there is
no notion of periodicity, so we do not distinguish between primitive and irreducible.
There are uniqueness, convergence, and ergodic theorems for irreducible chains that satisfy the smoothness
assumptions near 0:
Theorem. If Q is irreducible, then

(i) If there exists a stationary distribution π, then it is unique, and p_ij(t) → π_j as t → ∞ for all i, j.
(ii) If there is no stationary distribution, then p_ij(t) → 0 as t → ∞ for all i, j.
(iii) (Ergodic theorem) For any bounded function f we have

(1/t) ∫_0^t f(X(s)) ds → ⟨f⟩_π,    a.s.
See e.g. Grimmett and Stirzaker [2001], Norris [1997] for a proof. If the state space is finite, then a stationary distribution always exists, so statement (i) always applies and (ii) cannot occur.
3.1.2 Example: Poisson process
An example of a continuous-time Markov chain that arises a lot in applications is the Poisson process.
This is a process N(t) that counts the number of events that have happened at time t, if the events occur
independently with rate λ .
For example, it could count the number of busses that pass a bus stop in a certain time t, if the drivers have
absolutely no idea where the other drivers are, and are so delayed by bad traffic that their arrival times are
completely random. It could count the number of radioactive atoms that have decayed, or the number of
telephone calls that arrive at a call center, or the number of search requests received by Google’s servers. It
is used to model the growth of a population, say of people, bacteria, or rabbits (if we don’t count deaths [3]). It could be used to model the number of large waves that arrive at an oil platform, or the locations of breaking waves in the ocean. [4]
Figure: Some realizations of the Poisson process with rate λ = 1, each simulated for 10 time units.
There are (at least) two ways to define the Poisson process:
Definition (#1). The Poisson process with rate λ is a Markov chain on S = {0, 1, 2, . . .} with generator

    ⎛ −λ    λ    0    0   ··· ⎞
    ⎜  0   −λ    λ    0   ··· ⎟
Q = ⎜  0    0   −λ    λ   ··· ⎟
    ⎝  ⋮    ⋮    ⋮    ⋮    ⋱ ⎠
Definition (#2). The Poisson process with rate λ is a process N = {N(t) : t ≥ 0} taking values in S =
{0, 1, 2, . . .} such that:
[3] We could include deaths by modelling them as a Poisson process too, and this would be a birth-death process.
[4] We have only defined the Poisson process on a line, but it can be generalized to a function of multiple spatial coordinates.
(i) N(0) = 0;
(ii) N(t) has stationary independent increments: for any 0 ≤ t1 < t2 < · · · < tn , the random variables
N(t2 ) − N(t1 ), N(t3 ) − N(t2 ), . . . , N(tn ) − N(tn−1 )
are independent, and for any t ≥ 0, s ≥ 0, the distribution of N(t + s) − N(t) is independent of t;
(iii) For any t ≥ 0, h > 0, we have

P(N(t+h) = n + m | N(t) = n) =
    λh + o(h)        if m = 1,
    o(h)             if m > 1,
    1 − λh + o(h)    if m = 0.
It can be shown that the second definition implies the first, and vice-versa.
Let’s try to calculate the probability distribution of the Poisson process at each point in time. Let p_j(t) = P(N(t) = j), j = 0, 1, 2, . . .. We solve for p_j(t) using the forward Kolmogorov equation (4).
When j = 0 we have

dp_0/dt = −λ p_0,    p_0(0) = 1.

The solution is p_0(t) = e^{−λt}. The next equation is

dp_1/dt = λ p_0 − λ p_1,    p_1(0) = 0.

Substituting for p_0(t) and solving gives p_1(t) = λt e^{−λt}.
In general, we have

dp_j/dt = −λ p_j + λ p_{j−1},    p_j(0) = 0    (j > 0).

We can solve these by induction to find that

p_j(t) = ((λt)^j / j!) e^{−λt}.
This shows that at each fixed time t, N(t) is a Poisson random variable with parameter λt.
It is also possible to show that the events counted by N(t) are uniformly distributed in time, with exponential
waiting times in between (see section 3.1.3 for the latter.)
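As a sanity check, here is a small simulation sketch: it generates N(t) by accumulating exponential waiting times (anticipating section 3.1.3) and compares the empirical mean and variance of the counts to λt, both of which equal λt for a Poisson(λt) random variable:

```python
import numpy as np
rng = np.random.default_rng(0)

lam, t, trials = 1.0, 10.0, 100_000
counts = np.empty(trials, dtype=int)
for k in range(trials):
    s, n = 0.0, 0
    # Add up exponential(lam) waiting times until we pass time t;
    # the number of events that occurred before t is N(t).
    while True:
        s += rng.exponential(1.0 / lam)
        if s > t:
            break
        n += 1
    counts[k] = n

print(counts.mean(), counts.var())   # both should be close to lam*t = 10
```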
3.1.3 Waiting times and jump probabilities
Another way to think about continuous-time Markov chains is to consider the times at which the chain jumps, and the distributions of the states it jumps to.
First, let’s consider the probability distribution of the jump times. Suppose we are in state i at time s, and we wish to know how long we will wait before we jump to another state. This is a random variable, so it is described by its probability distribution. Let’s define τ_i = inf{t > 0 : X(t+s) ≠ i}, given X(s) = i, to be the time of the first jump after time s, starting at state i.
Proposition. The waiting time τ_i is an exponentially distributed random variable, with rate −q_ii = ∑_{j≠i} q_ij.
Proof. Recall that a random variable Y is exponentially distributed with parameter λ > 0 if it has probability density function p(y) = λ e^{−λy}, y ≥ 0. Exponential random variables have the “lack-of-memory” property: P(Y > y + x | Y > y) = P(Y > x), and in fact, one can show they are the only continuous random variables with this property (Grimmett and Stirzaker [2001], pp. 259, 140).
Here are two different proof sketches. The ideas behind them are similar.
1. Our strategy will be to show that τ_i has the lack-of-memory property. Let’s calculate:

P(τ_i > x + y | τ_i > x) = P(τ_i > x + y | X(s+x) = i) = P(τ_i > y).

The first step follows because if τ_i > x, then we necessarily have X(s+x) = i. The second step follows by the Markov property and homogeneity.

Therefore P(τ_i > x + y) = P(τ_i > x)P(τ_i > y), and the only continuous solution to this equation is P(τ_i > x) = e^{−λx}. To find λ, note that λ = (d/dh) P(τ_i < h)|_{h=0}, but P(τ_i < h) = ∑_{j≠i} q_ij h + o(h) = −q_ii h + o(h), so λ = −q_ii.
2. Let µ(t) = P(τ_i ≥ t). We will proceed by deriving a differential equation for µ(t). We have

µ(t) − µ(t+h) = P(τ_i ∈ [t, t+h)) = P(τ_i < t+h | τ_i ≥ t) P(τ_i ≥ t) = (∑_{j≠i} q_ij h + o(h)) µ(t).

Therefore µ(t) − µ(t+h) = µ(t)(−q_ii h + o(h)), so

µ′(t) = q_ii µ(t).

The solution µ(t) = e^{q_ii t} follows after recognizing that µ(0) = 1.
Because the exponential distribution is memoryless, the waiting time distribution does not depend on the
time at which we start counting.
Next, let’s consider the probability distribution of the states the chain jumps to. If we ignore the jump times and record only the sequence of states visited, we obtain a discrete-time Markov chain, the embedded chain. Therefore, we need to know the transition matrix of this embedded discrete-time Markov chain.
Proposition. The transition matrix of the embedded chain has elements p_ij = −q_ij/q_ii for j ≠ i (and p_ii = 0).
Proof. (Sketch) We have

p_ij = P(X(t+h) = j | X(t) = i, t < τ_i ≤ t+h)
     = P(X(t+h) = j | X(t) = i) / P(t < τ_i ≤ t+h | X(t) = i)
     = (q_ij h + o(h)) / (−q_ii h + o(h)) → −q_ij/q_ii   as h ↘ 0.

In words: suppose that X(t) = i, that t < τ_i ≤ t+h, and that h is small enough that the chain jumps only once in (t, t+h]. Then

P(jumps to j | it jumps) ≈ p_ij(h) / (1 − p_ii(h)) → −q_ij/q_ii   as h ↘ 0.
These two results are very useful for simulating continuous-time Markov chains, and form the basis of the kinetic Monte Carlo (KMC) algorithm, stochastic simulation algorithm (SSA), or Gillespie algorithm. These all refer to essentially the same algorithm, which is used to simulate a wide range of problems in chemical kinetics and materials science (e.g. surface diffusion, surface growth, defect mobility, etc.) We will see an example of this on the homework.
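Here is a minimal sketch of that algorithm for a finite-state chain (again using the hypothetical 3-state generator from earlier): wait an exponential time with rate −q_ii, then jump to j ≠ i with probability −q_ij/q_ii.

```python
import numpy as np
rng = np.random.default_rng(1)

def gillespie(Q, x0, t_max):
    """Simulate a continuous-time Markov chain with generator Q,
    starting from state x0, up to time t_max."""
    times, states = [0.0], [x0]
    t, x = 0.0, x0
    while True:
        rate = -Q[x, x]                         # total rate of leaving x
        t += rng.exponential(1.0 / rate)        # exponential waiting time
        if t > t_max:
            break
        p = Q[x].copy(); p[x] = 0.0; p /= rate  # jump probs -q_ij / q_ii
        x = rng.choice(len(p), p=p)
        times.append(t); states.append(x)
    return np.array(times), np.array(states)

Q = np.array([[-1.0,  0.7,  0.3],
              [ 0.5, -0.8,  0.3],
              [ 0.2,  0.6, -0.8]])
times, states = gillespie(Q, 0, 50.0)
```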
3.2 Data analysis and Markov chains
A lot of data takes the form of a network – a graph, with a set of nodes connected by edges, which may or may not have weights associated with them. Some examples include:
– facebook users and friends
– websites and links
– people and their contacts in disease modeling
– interacting biological organisms
– chemical reactions between multiple different species
– languages
– image recognition
– states in a dynamical system (e.g. weather prediction)
– etc.
The nodes may or may not be embedded in a space, typically high-dimensional, and if they are, they may or
may not be restricted to a lower-dimensional submanifold. One highly relevant question is how to define a
“distance” between the nodes. Even when the nodes are embedded in a Euclidean space, the natural idea, to
define the distance to be the length of the shortest path, is not always a good one.
Here is an example to show why, from Lafon and Lee [2006]. In this example, points lie on a noisy spiral,
so we would like the shortest path to lie along the spiral. However, paths can take shortcuts that dramatically
reduce the distance.
Is there a better notion of “distance” between the nodes of our data? And can this be used to produce a
lower-dimensional representation of the data set? And can this in turn be used to “cluster” the data, into
groups of similar nodes?
It turns out that one can answer all of these questions by defining a Markov chain on the graph. [5] Let the set of nodes be V = {1, 2, . . . , n}, and the set of edges be E = {e_ij}_{i,j=1}^n. Let’s suppose that each edge has a weight w_ij associated with it, and the weights are symmetric: w_ij = w_ji. The weights are entirely application-driven and there are several ways to construct them:
• There may be a natural notion of weight – e.g. the number of facebook messages sent between
facebook friends, the number of times one person made skin-to-skin contact with another in Liberia
during the Ebola crisis, etc.
• If there is no natural notion of weight, then we can set all edges to have weight 1, i.e. we set wi j = 1 if nodes i, j have an edge connecting them, and wi j = 0 otherwise.
• If the nodes have positions x_i in a high-dimensional space, we could let the weights be a decaying function of Euclidean distance, e.g. e^{−|x_i − x_j|²/2ε}, where ε is a lengthscale. This gives high weight to points that are nearby (in Euclidean space), but basically zero weight to points beyond a cutoff scale ε. One has a cutoff because the data may lie on a nonlinear submanifold, so the Euclidean distance is only a good approximation to the manifold distance for very small distances.

[5] Much of this discussion follows that in the beautiful paper Lafon and Lee [2006]. Beware that the notation for left and right eigenvectors has been inverted from that paper. The diffusion distance was introduced in the classic paper Coifman et al. [2005], although there are many other references, both earlier and later, which discuss such spectral properties of graphs.
A Markov chain can be constructed by defining the transition probabilities to be

p_ij = w_ij / d_i,    d_i = ∑_j w_ij.    (7)
Here di is called the “degree” of node i. The chain chooses edges out of each node i in proportion to their
weight.
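A minimal sketch of this construction, for a small hypothetical weight matrix W:

```python
import numpy as np

# Hypothetical symmetric weight matrix of a 4-node graph.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

d = W.sum(axis=1)         # degrees d_i = sum_j w_ij
P = W / d[:, None]        # transition probabilities p_ij = w_ij / d_i
pi = d / d.sum()          # stationary distribution, see (8) below
assert np.allclose(pi @ P, pi)
```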
We showed on the homework that the stationary distribution is

π_i = d_i / ∑_k d_k,    (8)

and that the Markov chain satisfies detailed balance. Because of this, the transition matrix P is self-adjoint with respect to the inner product ⟨·,·⟩_π introduced in Lecture 2, and so it has a complete set of real eigenvalues {λ_k}_{k=0}^{n−1}, and corresponding right and left eigenvectors {φ_k}_{k=0}^{n−1}, {ψ_k}_{k=0}^{n−1}. Let’s order the eigenvalues so that 1 = λ_0 ≥ |λ_1| ≥ |λ_2| ≥ · · · ≥ |λ_{n−1}|. We have ψ_0 = π, and φ_0 ∝ (1, 1, . . . , 1)^T is a constant vector. We assume the eigenvectors are normalized so that ‖φ_j‖²_π = 1, ‖ψ_j‖²_{π^{−1}} = 1. The spectral decomposition of P^t is given by
p^t_{ij} = ∑_{k=0}^{n−1} λ_k^t φ_k(i) φ_k(j) π_j.    (9)
Now, we would like to introduce a “distance” between nodes, in a way that respects the dynamics of the Markov chain. One way is via the diffusion distance D_t:

D_t²(i, j) = ‖P^t_{i,·} − P^t_{j,·}‖²_{1/π} = ∑_k (P^t_{i,k} − P^t_{j,k})² / π_k.    (10)

The distance between nodes i, j is given by starting the Markov chain at each of nodes i, j, finding the probability distributions after a fixed time t, and then computing the L²-norm, or the mean-square distance, between these two distributions. The mean is computed with respect to a weight 1/π, which penalizes discrepancies on domains of low density more than on those of high density.
Inserting (9) into (10) shows the diffusion distance can be written as

D_t²(i, j) = ∑_{k=0}^{n−1} λ_k^{2t} (φ_k(i) − φ_k(j))².    (11)
Note that since φ0 is constant, it does not enter the sum above, so the largest term comes from λ1 .
The first key observation is that in many graphs, the norm of the eigenvalues decays, [6] so that for large enough t only a few terms in (11) contribute. Suppose we want to calculate the diffusion distance to relative precision δ. Let q(t) be the largest index j such that |λ_j|^t > δ|λ_1|^t. Then we can truncate the sum (11) at q(t) to obtain

D_t²(i, j) ≈ ∑_{k=1}^{q(t)} λ_k^{2t} (φ_k(i) − φ_k(j))².    (12)
The second key observation is that (12) can be interpreted as the Euclidean distance in R^{q(t)}, if the coordinates of each node i are given by the values of the right eigenvectors, weighted by λ_j^t. That is, we can define a mapping Ψ_t : V → R^{q(t)} by

Ψ_t(i) = (λ_1^t φ_1(i), λ_2^t φ_2(i), . . . , λ_{q(t)}^t φ_{q(t)}(i))^T.    (13)

Then we have that

D_t²(i, j) ≈ ‖Ψ_t(i) − Ψ_t(j)‖²,
where the approximation is good to relative precision δ .
The map Ψt provides a way to take a high-dimensional, or even unstructured, dataset, and represent it in a
lower-dimensional space, in such a way that the Euclidean distance in the lower-dimensional space is a good
approximation to the diffusion distance: points that are close in the mapping tend to have similar dynamics
in the Markov chain on the graph. This is helpful for many reasons: to visualize the data, to compress the
data by storing fewer numbers, and perhaps most importantly, for clustering or otherwise analyzing the data.
Many clustering algorithms (e.g. k-means, etc.) work by finding groups of points that are close together in Euclidean space, and can be implemented extremely efficiently, but are hard to adapt to other metrics. However, once the map Ψ_t has been applied to the data, these algorithms can be easily applied.
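Here is a sketch of the whole pipeline, computing the coordinates (13) from a weight matrix W. It diagonalizes the symmetric matrix D^{−1/2} W D^{−1/2}, which shares its eigenvalues with P = D^{−1}W and whose eigenvectors give the right eigenvectors of P after rescaling (the normalization differs from ‖φ_k‖²_π = 1 by a constant factor, which rescales all distances uniformly):

```python
import numpy as np

def diffusion_map(W, t=1, q=2):
    """Sketch: diffusion-map coordinates Psi_t (13) from a symmetric,
    non-negative weight matrix W; returns an n x q coordinate array."""
    d = W.sum(axis=1)
    Dinv_sqrt = 1.0 / np.sqrt(d)
    # S = D^{-1/2} W D^{-1/2} is symmetric and similar to P = D^{-1} W.
    S = Dinv_sqrt[:, None] * W * Dinv_sqrt[None, :]
    lam, v = np.linalg.eigh(S)
    order = np.argsort(-np.abs(lam))       # 1 = lam_0 >= |lam_1| >= ...
    lam, v = lam[order], v[:, order]
    phi = Dinv_sqrt[:, None] * v           # right eigenvectors of P
    # Coordinates lam_k^t phi_k(i) for k = 1,...,q (phi_0 is constant).
    return (lam[1:q+1] ** t) * phi[:, 1:q+1]

W = np.array([[0., 1., 1., 0.],            # hypothetical small graph
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
coords = diffusion_map(W, t=2, q=2)        # feed these to k-means, etc.
```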
Example. This example is from Lafon and Lee [2006]. Data is generated on a Swiss roll: a spiral, embedded in 3-dimensional space. [7] Clustering the data into 4 clusters using its embedding in 3d-space leads to the partition on the left, which clearly does not respect the geometry of the manifold. Clustering the data after first applying the map Ψ_t leads to the partition on the right, which captures the intrinsic geometry of the data much better.
[6] For a fully connected graph, the first eigenvalue is 1 and all others are 0. For a totally disconnected graph, all eigenvalues are 1. Most graphs lie somewhere in between.
[7] For an example of data analysis techniques applied to the Swiss roll that do not use Markov chains, see http://www.sciencemag.org/content/290/5500/2319.full and http://www.sciencemag.org/content/290/5500/2323.full
References
R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric
diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings
of the National Academy of Sciences, 102(21):7426–7431, 2005.
G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 2001.
L. B. Koralov and Y. G. Sinai. Theory of Probability and Random Processes. Springer, 2010.
S. Lafon and A. B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality
reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 28(9):1393–1403, 2006.
J. R. Norris. Markov Chains. Cambridge University Press, 1997.
G. A. Pavliotis. Stochastic Processes and Applications. Springer, 2014.