Probability 2, 2021-22
University of Bristol
6 March 2023
A. E. Holroyd
Probability 2 – Lecture Notes
Contents

1 Introduction
  1.1 A tricky example
  1.2 Axioms of probability
  1.3 Continuity of probability
  1.4 Conditioning, independence, random variables
  1.5 Expectation

2 Random Walks
  2.1 Stochastic processes
  2.2 Random walk basics
  2.3 Random walk with absorbing barriers (Gambler's ruin)
  2.4 Duration of the game
  2.5 One barrier, no barrier
  2.6 Stopping times, Wald's identity

3 Markov Chains
  3.1 Introduction
  3.2 Multi-step transition probabilities
  3.3 Hitting probabilities and times
  3.4 Strong Markov property
  3.5 Recurrence and transience
  3.6 Random walks in d dimensions
  3.7 Communicating classes
  3.8 Mean return times
  3.9 Convergence examples
  3.10 Stationary distributions
  3.11 Existence and uniqueness of stationary distributions
  3.12 Periodicity and the convergence theorem
  3.13 Reversibility

4 Poisson Process
  4.1 Introduction
  4.2 Increment distribution
  4.3 Arrival times
  4.4 Superposition and thinning
  4.5 Conditioning
  4.6 Inhomogeneous and Multidimensional Poisson processes

5 Continuous-time Markov chains and queues
  5.1 Introduction
  5.2 The jump chain and explosion
  5.3 Transition probabilities
  5.4 Classification of states
  5.5 Stationary distributions
  5.6 Reversibility and queueing theory

6 Martingales
  6.1 Introduction
  6.2 New martingales from old
  6.3 Stopped martingales
  6.4 Martingale convergence (not required for the course)

7 Brownian Motion
  7.1 Introduction
  7.2 Symmetries
  7.3 Reflection principle
1 Introduction
This course is a follow-on to Probability 1 / Probability and Statistics. The main object of
interest is a stochastic process, i.e. a random process that evolves over time. Mathematically, it is simply a family of random variables X0 , X1 , X2 , . . . or (Xt )t∈[0,∞) ; Xt represents the
state of the system at time t. Applications include population models, industrial processes,
chemical reactions, and financial models. The course will focus on mathematical models
rather than details of their applications. A typical question will be to understand the long
term behaviour of a process as t → ∞. The course will be partially rigorous - we’ll aim to
understand why things are true, but some of the more difficult proofs will be omitted. The
course is not primarily about memorizing proofs. It is more important to understand the
models and techniques and learn to use them.
Two of the most important general classes of stochastic processes will be Markov chains
(which can be seen as memoryless processes), and martingales (roughly speaking, fair
gambling games). Key special cases include random walks (which are Markov chains and
martingales simultaneously), Poisson processes (the simplest model for random events
happening in continuous time) and Brownian motion (random motion in continuous space
and continuous time).
Material with a left bar (like this) is not required for the course. It will not be
covered in problem sheets or exams.
1.1 A tricky example
The first few lectures will be largely a review of some basics of probability theory, but at
a slightly deeper level than before. To whet our appetite before that, we look at a somewhat
tricky example that illustrates some important themes, and that we’ll come back to several
times.
Example 1.1. A fair coin is tossed repeatedly. Find the expected number of tosses until the
first occurrence of HT H (at 3 consecutive tosses).
We’ll eventually see 3 different approaches to this problem: an ad hoc method (now); a
more systematic method using Markov chains; and finally a very neat method using martingales.
A first guess might be: the probability of getting HTH immediately in the first 3 tosses is (1/2)^3 = 1/8. So by analogy with the Geometric distribution we might guess that the expected time until the first occurrence is 1/(1/8) = 8. But occurrences are not independent: if HTH has just occurred then it's impossible for it to occur again one step later (starting at the T), while two steps later we have a head-start because we already have an H, so we might get HTHTH. It's not so obvious whether these effects should make the expected waiting time longer, or shorter, or make no difference.
Solution. Let X be the number of tosses until the first HTH. So we want its expectation, EX. We could try to compute

P(X = 3) = P(HTH) = 1/8,
P(X = 4) = P(HHTH) + P(THTH) = 2 · (1/16) = 1/8,

etc., but it quickly gets complicated. We need another approach.
First let's show that EX < ∞. Split the tosses into disjoint blocks of 3, e.g.

THT | HTT | HTH | · · ·

and let Z be the number of blocks until the first HTH block. So here Z = 3 (even though HTH occurred earlier across two blocks). The advantage is that different blocks are independent, so Z has a Geometric(1/8) distribution, hence EZ = 8. Now we have X ≤ 3Z, so

EX ≤ 3EZ = 24 < ∞.
Before tackling X, let Y be the number of tosses until the first occurrence of HT . By the
same reasoning as above EY < ∞. Now compute EY by conditioning on the first toss F :
EY = E(Y | F = H)P(F = H) + E(Y | F = T)P(F = T) = (1 + 2)·(1/2) + (1 + EY)·(1/2),
because if the first toss is H we have used one toss and we just have to wait for the first T ,
which comes after Geometric(1/2) time, while if it is T then we have wasted one toss and
it is like starting again from the beginning. Now we can solve this equation for EY (it is
important that we know EY < ∞!) to get EY = 4.
Now we’ll compute EX by conditioning on V (say), the next toss after the first occurrence
of HT , and also on the value of Y :
E(X | Y = y) = E(X | Y = y, V = H)·(1/2) + E(X | Y = y, V = T)·(1/2)
            = (y + 1)·(1/2) + (y + 1 + EX)·(1/2)
            = 1 + y + (1/2)EX,
because in the first case we are done, and in the second it is like starting again from scratch.
It may not be so clear why we are allowed to do all this - after all, V is not like the first toss
– it’s a randomly chosen toss whose location depends on the other tosses (it is toss number
Y + 1, in fact)! We’ll learn exactly how to justify such things later.
Recall that E(X | Y = y) is always a function of y, say h(y), and the conditional
expectation E(X | Y) is defined to be the random variable h(Y). So in this case, E(X | Y) = 1 + Y + (1/2)EX. By the double expectation formula or tower property:

EX = E(E(X | Y)) = E(1 + Y + (1/2)EX) = 1 + EY + (1/2)EX.
So we can solve:
EX = 2(1 + EY ) = 2(1 + 4) = 10.
♢
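For readers who like to check such calculations numerically, here is a short simulation sketch (in Python; it is not part of the printed notes, and the function name is ours) estimating EX by averaging the waiting time for HTH over many runs. The estimate should be close to the exact answer 10; replacing the pattern by HT should give approximately 4, in line with the intermediate computation above.

```python
import random

def waiting_time(pattern="HTH"):
    """Toss a fair coin until `pattern` first appears; return the number of tosses."""
    recent, tosses = "", 0
    while True:
        recent = (recent + random.choice("HT"))[-len(pattern):]
        tosses += 1
        if recent == pattern:
            return tosses

random.seed(0)
trials = 200_000
print(sum(waiting_time() for _ in range(trials)) / trials)  # approximately 10
```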
A more complicated example of the same kind is: a monkey types random letters on a
keyboard; what is the expected time until it types ABRACADABRA? We’ll return to this
later. (This word is chosen to be a particularly interesting example – what features do you
think are important here?) Another thing to think about: what is the probability that one
pattern (say HT H) occurs before another (say T T H)?
1.2 Axioms of probability
A probability space is a set Ω (where the elements ω ∈ Ω represent all possible ‘outcomes’).
An event is a subset A ⊆ Ω. A probability measure is a function P from events to R
satisfying the axioms:
1. P(A) ≥ 0 for all events A
2. P(Ω) = 1
3. If A1, A2, . . . are pairwise disjoint events (i.e. Ai ∩ Aj = ∅ for i ≠ j) then

P( ⋃_{i=1}^∞ Ai ) = Σ_{i=1}^∞ P(Ai).
A technical complication (not needed for this course) is that in many important settings
(primarily those where Ω is uncountable), it is not possible to assign probabilities in this
way to every subset of Ω. We have to restrict to sufficiently ‘well-behaved’ sets A, and these
are the ones that we call ‘events’. Determining what counts as well-behaved is the job of
measure theory, which is beyond the scope of this course. The good news is that any set that
one would normally want to think about is indeed well-behaved, so in practice one does not
typically have to worry about this!
It is important that in Axiom 3 (and things we’ll derive from it) the sequence of events is
countable. The same statement does not in general hold for uncountable families of events.
For instance, let the random variable X be uniformly random on the real interval [0, 1] (so it has probability density function 1 on [0, 1] and 0 elsewhere). Let Ar be the event {X = r}, where r ∈ [0, 1]. Then P(Ar) = ∫_r^r 1 dx = 0, and the (Ar) are pairwise disjoint. But

P( ⋃_{r∈[0,1]} Ar ) = P(X ∈ [0, 1]) = 1,

while

Σ_{r∈[0,1]} P(Ar) = Σ_{r∈[0,1]} 0 = 0,
which are not equal. Uncountable sets also arise when we consider infinite sequences (even
when the elements in the sequence take discrete values, e.g. coin tosses), so we need to be
careful.
Simple examples of probability spaces include:
1. Roll two dice. Ω = {1, 2, . . . , 6}² = {(a, b) : a, b ∈ {1, . . . , 6}} (finite). P({ω}) = 1/6² for each ω.

2. Uniform[0, 1] random variable as above. Ω = [0, 1] (uncountable); X(ω) = ω. P({ω}) = 0 for each ω, so this information is not enough to specify P.

3. Toss a coin repeatedly (forever).

Ω = {H, T}^{Z+} = {x = (x1, x2, . . .) : xi ∈ {H, T} for all i}

– all sequences of Hs and Ts indexed by the positive integers Z+ = {1, 2, . . .}. Again, Ω is uncountable. The first toss is the random variable X1 given by X1(x) = x1 for x = (x1, x2, . . .) ∈ Ω.
Usually we don’t talk explicitly about Ω much, but instead work with events and random
variables.
1.3 Continuity of probability
Continuity of a function f means (among other definitions) that if xn → x then f (xn ) → f (x).
A probability measure P satisfies a similar condition, but to state it we need to make sense
of what it means for events to converge to a limit. There are two settings where we can do
this: decreasing and increasing sequences.
Let A1, A2, . . . be a sequence of events. Suppose that Ai ⊆ Ai+1 (i.e. if Ai occurs then Ai+1 occurs) for all i. We call such a sequence increasing. Since in that case An = ⋃_{i=1}^n Ai, it makes sense to consider A = ⋃_{i=1}^∞ Ai, the event that at least one of the events occurs. We call A the limit of the sequence and write An ↑ A as n → ∞.
Theorem 1.2 (Continuity for increasing limits). Let An be an increasing sequence of events and let A = ⋃_{i=1}^∞ Ai. Then

P(A) = lim_{n→∞} P(An).

(Of course P(An) is an increasing sequence of numbers, so this is an increasing limit: P(An) ↑ P(A).)
Proof. We want to use Axiom 3, so we need a sequence of disjoint events. Define D1 = A1, and Dn = An \ An−1 for n ≥ 2. It is straightforward to check that the Di are pairwise disjoint, and that An = ⋃_{i=1}^n Di and A = ⋃_{i=1}^∞ Di. Then, using Axiom 3 and the definition of an infinite sum,

P(A) = Σ_{i=1}^∞ P(Di) = lim_{n→∞} Σ_{i=1}^n P(Di) = lim_{n→∞} P(An).
Similarly, if Ai ⊇ Ai+1 for all i then we call the sequence of events decreasing, interpret A = ⋂_{i=1}^∞ Ai as its limit, and write An ↓ A.

Theorem 1.3 (Continuity for decreasing limits). Let An be a decreasing sequence of events and let A = ⋂_{i=1}^∞ Ai. Then

P(A) = lim_{n→∞} P(An).
Proof. This is really the last result in disguise, if we take complements. Let Bi = Ai^C. Then it is straightforward to check that Bi is an increasing sequence and that Bn ↑ B where B = A^C (this is De Morgan's law on sets). Then using the last theorem,

P(A) = 1 − P(B) = 1 − lim_{n→∞} P(Bn) = lim_{n→∞} (1 − P(Bn)) = lim_{n→∞} P(An).
Let’s use these results to formally justify some facts that should seem pretty obvious. . .
Example 1.4. A fair coin is tossed repeatedly. What is the probability that every toss is
Heads?
Solution. The idea is to reduce to “finitely defined” events that we know how to deal with.
Let An = {tosses 1, . . . , n are all H} and A = {all tosses are H}. Then An ↓ A, so by
continuity of probability,
P(A) = lim_{n→∞} P(An) = lim_{n→∞} (1/2)^n = 0. ♢
Similarly, the probability of any fixed sequence of Hs and T s is 0, so the situation is very
similar to the Uniform[0, 1] random variable case.
Example 1.5. A fair coin is tossed repeatedly. What is the probability that Heads appears
infinitely many times?
Solution. If there are only finitely many Heads then the sequence is all Tails after some
point. So let Bn = {tosses n, n + 1, . . . are all T}. Then P(Bn) = P(B1) = 0, from the last example. The sequence Bn is increasing; let B = ⋃_n Bn. Then B is the event that the sequence is all T from some point on, which is the same as the event that there are only finitely many Hs. We have P(B) = lim_{n→∞} 0 = 0. So
P(infinitely many Hs) = P(B C ) = 1 − P(B) = 1.
♢
If an event has probability 1 we sometimes say that it occurs almost surely (or a.s.).
(The philosophy behind this terminology is that there can still be elements of Ω where
the event does not occur, so saying “surely” would be too strong).
Example 1.6. Show that almost surely there are infinitely many Heads and infinitely many
Tails.
Solution. Let A = {infinitely many Hs} and B = {infinitely many Ts}. From the previous
example we know P(A) = P(B) = 1, and we want to conclude P(A ∩ B) = 1. It’s often a bit
easier to argue that probabilities are 0 rather than 1, so we take complements:
1 − P(A ∩ B) = P((A ∩ B)C ) = P(AC ∪ B C ) = P(AC ) + P(B C ) − P(AC ∩ B C )
≤ P(AC ) + P(B C ) = 0 + 0 = 0.
So P(A ∩ B) = 1 − 0 = 1.
♢
Above we used the inequality
P(A ∪ B) ≤ P(A) + P(B),
(but applied to the complements AC , B C ). This is often useful. As usual, it works for
countable families too:
Theorem 1.7 (Union bound). Let A1 , A2 , . . . be events. Then
P( ⋃_{i=1}^∞ Ai ) ≤ Σ_{i=1}^∞ P(Ai).
This should seem very natural – Axiom 3 says that we have equality when the events are
disjoint; if they overlap then the left side can only become smaller. It can be proved directly
from Axiom 3 in a similar manner to before. We’ll see a different proof using indicators soon.
1.4 Conditioning, independence, random variables
The conditional probability of event A given event B is defined to be

P(A | B) = P(A ∩ B) / P(B),

provided P(B) > 0. If B is fixed and P(B) > 0, then it's an exercise to show that the function P̃ defined by P̃(A) := P(A | B) for all events A is itself a probability measure. The interpretation is that P̃ describes the state of the universe if you know that B has happened.
Events A and B are said to be independent if
P(A ∩ B) = P(A)P(B).
This is equivalent to P(A | B) = P(A) and P(B | A) = P(B) provided the relevant conditional
probability is defined (i.e. its denominator is non-zero). The interpretation is that events are
unrelated – knowing one occurred tells us nothing about the other.
An arbitrary collection of events (Ai)_{i∈I} is said to be independent if

P( ⋂_{i∈J} Ai ) = ∏_{i∈J} P(Ai)   for every finite subset J ⊆ I.
Note that it’s not enough to check this just for the full set J = I.
For example, in our standard setting of a fair coin tossed repeatedly, if we define the event
Hi = {toss i is Heads} then (Hi)_{i=1}^∞ are independent.
A random variable is simply a function X from the probability space Ω to a set B
(often Z or R). We think of X as a random number (or object) whose value depends on the
‘outcome’. X(ω) is the value when the outcome is ω. We use the shorthand
P(X ∈ S) = P({ω ∈ Ω : X(ω) ∈ S}),
P(X = x) = P({ω ∈ Ω : X(ω) = x}),
etc. Jointly distributed random variables are simply different functions X1 , X2 , . . . defined
on the same probability space Ω. Note that random variables are allowed to take values in
any set B. For example, if X and Y are real-valued random variables (as usual) then we
can think of the pair (X, Y ) as a random vector, i.e. a random variable ω 7→ (X(ω), Y (ω))
that takes values in the plane R2 = {(x, y) : x, y ∈ R}. It is also sometimes useful to allow
random variables to take the value infinite values. E.g. a discrete random variable X might
take values in the set N ∪ {∞} (where N = {0, 1, 2 . . .}).
Again there is a technical complication involving measure theory that we won’t need to
worry about in this course: we are only allowed to take ‘well-behaved’ functions X, which
are those for which the sets {ω ∈ Ω : · · · } that arise above are well-behaved enough to be
events.
1.5 Expectation
The expectation or expected value or mean EX of a real-valued random variable X is defined to be

EX = Σ_x x p(x)

if X is discrete with probability mass function p, or

EX = ∫ x f(x) dx

if it is continuous with probability density function f, provided the sum or integral exists.
Simple properties include: if c is a deterministic constant then Ec = c and E(cX) = cEX; if
X ≥ 0 then EX ≥ 0.
As with other sums or integrals, expectations can be infinite, but we need to be careful
with infinite values in combination with signs - the general rule is that we never allow ∞−∞.
E.g. consider a continuous random variable X with probability density function f. If X is non-negative, i.e. X ≥ 0 almost surely, then f(x) = 0 for x < 0, and so (by the definition of the integral)

EX = ∫_0^∞ x f(x) dx := lim_{K→∞} ∫_0^K x f(x) dx

(which is an increasing limit in K because f ≥ 0). This would be equal to ∞ if the integral diverges. On the other hand if X takes positive and negative values then (again by definition)

EX = ∫_{−∞}^∞ x f(x) dx = ∫_{−∞}^0 x f(x) dx + ∫_0^∞ x f(x) dx.

If the integral ∫_0^∞ is ∞ and the integral ∫_{−∞}^0 is −∞ then we say that EX is undefined. Similar remarks apply to discrete random variables.
If X is a random variable taking values in N ∪ {∞} with probability mass function p(x) = P(X = x) then we can also define its expectation:

EX = Σ_x x p(x) = ∞ · p(∞) + Σ_{x∈N} x p(x).

Here p(∞) = P(X = ∞), and we use the convention ∞ · 0 = 0 for the first term. So if P(X = ∞) > 0 then EX = ∞, while if P(X = ∞) = 0 then EX = Σ_{x∈N} x p(x).
A random variable X is bounded if there is a finite constant K ∈ R for which |X| ≤ K
almost surely. It is a.s. finite if |X| < ∞ a.s. Bounded implies finite mean, which implies
a.s. finite, but none of the three conditions are equivalent.
A random variable X is deterministic if there is a fixed constant c for which X = c a.s.
That’s a very boring random variable, but occasionally it’s useful to have a name for it.
For random variables X and Y we have
E(X + Y ) = EX + EY,
provided EX and EY are finite, or alternatively if X and Y are non-negative (in which case
one or more expectation could be ∞). Once again, we can’t say anything about ∞ − ∞. It
follows that under similar conditions E(X1 + · · · + Xn ) = EX1 + · · · + EXn for any finite sum.
Note that there is no assumption on the joint distribution (like independence).
Does this work also for infinite (countable) sums? Unfortunately, not always. The problem
is that expectation is itself a sum (or integral), so it amounts to rearranging an infinite sum,
which is not always justified. The most general conditions where it is possible are beyond
the scope of this course (“dominated convergence”), but one case where it is allowed is for
non-negative random variables:
Theorem 1.8. Let X1 , X2 , . . . be random variables with Xi ≥ 0 for all i. Then
E( Σ_{i=1}^∞ Xi ) = Σ_{i=1}^∞ EXi.
Proof. Not included in this course - see measure theory.
We can use the expectation of a sum to give a neat derivation of the Union Bound from earlier. The indicator of an event A is the random variable

I_A = 1 if A occurs, and 0 otherwise.

Note that E I_A = P(A).
Proof of Theorem 1.7. Let N = Σ_{i=1}^∞ I_{Ai}, which is the number of the events A1, A2, . . . that occur. It is a (discrete) random variable taking values in N ∪ {∞} (where N = {0, 1, 2, . . .}). Note that

I_{N>0} ≤ N

(why? check it carefully!). Therefore,

P( ⋃_{i=1}^∞ Ai ) = P(N > 0) = E I_{N>0} ≤ E N = Σ_{i=1}^∞ E I_{Ai} = Σ_{i=1}^∞ P(Ai).
Random variables X and Y are independent if
P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)
for all sets A and B. (The comma on the left means “and” – in other words the intersection
of the two events). A family of random variables (Xi)_{i∈I} are independent if

P( ⋂_{i∈J} {Xi ∈ Ai} ) = ∏_{i∈J} P(Xi ∈ Ai)
for all finite subsets J ⊆ I and all sets (Ai )i∈J .
The definition applies to random variables taking values in arbitrary sets (e.g. random
vectors), so for example to say that (X, Y ) and Z are independent means
P((X, Y) ∈ A, Z ∈ B) = P((X, Y) ∈ A) P(Z ∈ B)
for all sets A and B.
If X and Y are independent random variables with finite expectations then
E(XY ) = (EX)(EY ).
Once again we must be more careful with infinite expectations. For instance if the right side
is ∞ × 0 then we cannot conclude anything about the left side.
The opposite extreme from independence is one random variable being a function of
another, e.g. X = f (Z) where f is a fixed function (perhaps X = Z 2 or X = log Z). In other
words, knowing Z completely determines X. Note that if X and Y are independent then
f (X) and g(Y ) are independent, for any functions f and g.
Random variables X and Y have the same distribution, or are identically distributed,
or are equal in distribution, if P(X ∈ A) = P(Y ∈ A) for every A. (Sometimes this is written as X =ᵈ Y, or variants thereof, but this is risky as it's too easy to leave out the "d",
so we’ll try to avoid it.) For real-valued random variables this is equivalent to having the
same distribution function, i.e. FX = FY , where
FX (x) = P(X ≤ x).
It is also equivalent to having the same probability mass function for discrete random variables, or the same probability density function, for continuous random variables. As usual,
the definition applies to random variables that take values in arbitrary sets. So for example
to say that (X, Y ) has the same distribution as (X ′ , Y ′ ) means that
P((X, Y) ∈ A) = P((X′, Y′) ∈ A)
for all sets A.
2 Random Walks
Random walks are the central examples of stochastic processes. As we'll see, they come in
several flavours. We’ll start with some general remarks about stochastic processes.
2.1 Stochastic processes
A stochastic process or random process is simply an infinite family of random variables such as X0, X1, X2, . . . or (Xt)_{t∈[0,∞)}, all on the same probability space Ω. Usually we think of the index n or t as time. If n ∈ Z the process is in discrete time; if t ∈ R, continuous time. We might denote the entire process by any of X = (Xn)_{n=0}^∞ = (Xn)_{n≥0} = (Xn) = (X0, X1, . . .). (Sometimes people use {Xn}_{n=0}^∞, but that can be confusing because it looks like a set, whereas we care about the indexing.)
The simplest stochastic process is a sequence of independent, identically distributed
(i.i.d.) random variables X1 , X2 , . . .. Identically distributed means they all have the same
distribution. Independent means they are all independent(!) We can specify the distribution
of an i.i.d. process just by specifying the distribution of X1 . E.g. X1 takes value 1 with
probability p and 0 otherwise; or X1 is uniform on [0, 1]; or X1 is a standard Normal.
We won’t have much more to say about i.i.d. processes themselves, but we will often use
them as an ingredient to build more interesting processes.
2.2 Random walk basics
Let 0 < p < 1 and let X1, X2, . . . be i.i.d. with

Xi = +1 with probability p,  and  Xi = −1 with probability q := 1 − p,

and let

Sn = Σ_{i=1}^n Xi
for n ≥ 1. We also define S0 = 0 – in fact this is a special case of a convention that we’ll
always assume - a sum of zero terms is zero. We consider S0 , S1 , S2 , . . . as a stochastic process.
It is called a random walk.
Possible interpretations include:
1. You play a sequence of gambling games (roulette, etc.). In each game you win £1 with
probability p or lose £1 with probability q = 1 − p. Different games are independent.
Then Sn is the change in your wealth after you have played n games.
2. A particle or an organism moves about randomly in a linear environment, taking steps
randomly left or right (perhaps in response to collisions with other particles, or biological impulses). Sn represents its position at time n.
We have already studied the distribution of Sn for a fixed n in earlier courses. (It is a
variant of the binomial distribution; and we have laws of large numbers and the central limit
theorem). The focus now is instead to consider properties of the whole random sequence
S = (S0 , S1 , . . .).
We can imagine a graph of a typical trajectory:
[Figure: a typical trajectory of Sn plotted against n.]
We can also draw the transition diagram – nodes represent the possible values for Sn ,
and arrows indicate the probabilities of moving between them:
[Transition diagram: states . . . , −2, −1, 0, 1, 2, . . . , with arrows i → i+1 labelled p and i → i−1 labelled q.]
Let’s do some warm-up computations.
Example 2.1. Find an expression for P(Sn = 0).
Solution. Since each Xi is odd, Sn is the sum of n odd numbers so it must have the same
parity as n, therefore if n is odd then P(Sn = 0) = 0. If n = 2m is even then
P(S_2m = 0) = P(X1, . . . , Xn comprise m +1s and m −1s) = (2m choose m) p^m q^m.  ♢
In fact, the distribution of Sn is a variant of a Binomial, so really there’s not much new
here. Things get more interesting when we consider the values at several times simultaneously.
Example 2.2. Find an expression for P(S2a = S2b = 0), where 0 < a < b.
Solution.

P(S_2a = S_2b = 0) = P(S_2a = 0, S_2b − S_2a = 0)
  = P( Σ_{i=1}^{2a} Xi = 0, Σ_{i=2a+1}^{2b} Xi = 0 )
  = P( Σ_{i=1}^{2a} Xi = 0 ) · P( Σ_{i=2a+1}^{2b} Xi = 0 )
  = (2a choose a) p^a q^a · (2(b−a) choose b−a) p^{b−a} q^{b−a}
  = (2a choose a) (2(b−a) choose b−a) (pq)^b.  ♢
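As a sanity check on this formula (a sketch only; the parameter values a = 2, b = 5, p = 1/2 and the function names are ours), the exact expression can be compared with a Monte Carlo estimate:

```python
import random
from math import comb

def exact(a, b, p):
    """P(S_2a = S_2b = 0) from Example 2.2."""
    q = 1 - p
    return comb(2 * a, a) * comb(2 * (b - a), b - a) * (p * q) ** b

def estimate(a, b, p, trials=200_000):
    """Monte Carlo estimate of the same probability."""
    hits = 0
    for _ in range(trials):
        s, zero_at_2a = 0, False
        for n in range(1, 2 * b + 1):
            s += 1 if random.random() < p else -1
            if n == 2 * a:
                zero_at_2a = (s == 0)
        hits += zero_at_2a and s == 0
    return hits / trials

random.seed(1)
print(exact(2, 5, 0.5), estimate(2, 5, 0.5))
```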
The random walk we defined above is called simple random walk (SRW) on Z; “on Z”
because Sn takes values in Z; “simple” because it only takes steps to neighbouring integers.
When p = 1/2 it is called a simple symmetric random walk (SSRW).
Many variants are possible. The steps Xi can have a different distribution. We can
consider random walks in d dimensions, where Xi and Si take values in the set of integer
vectors Zd .
The random walk we have considered is sometimes called unrestricted, because Sn is
allowed to take any integer value. On the other hand, we can restrict the range in various
ways, which is our next topic.
2.3 Random walk with absorbing barriers (Gambler's ruin)
A gambler plays a sequence of independent games. In each game, she either wins £1 with
probability p, or loses £1 with probability q = 1 − p. Let 0 < k < m be integers. The
gambler’s initial wealth is £k, and she plays until her wealth reaches either £0 or £m for
the first time; at that point the process is stopped.
(Reaching £0 is bankruptcy or "ruin" - at which point the gambler is not allowed to play
further. The reason for stopping at £m could be that the gambler needs this amount to
retire or buy a ticket home, or alternatively that the game is played against an opponent who
starts with £(m − k), and who must also stop if they go bankrupt).
We can interpret the process as a simple random walk started at k with absorbing
barriers at 0 and m. That is, let (Xi ) be i.i.d. taking value +1 with probability p and −1
otherwise, and let

Sn = k + Σ_{i=1}^n Xi,

but stop the process at the random time T given by

T = min{n : Sn ∈ {0, m}}.
Much of the course will concern random times like this, defined in terms of stochastic processes. It is possible that Sn never hits 0 or m; in that case the set {n : Sn ∈ {0, m}} in the
definition above is empty, and it makes sense to take T = min ∅ = ∞. Thus T is a random
variable taking values in N ∪ {∞}. (Actually we will show later that T < ∞ almost surely).
We can think of horizontal barriers at heights 0 and m that absorb the trajectory if it
runs into one of them:
[Figure: a trajectory of Sn started at k, absorbed when it first reaches the barrier at height 0 or at height m.]
The transition diagram has absorbing states at 0 and m:
[Transition diagram: states 0, 1, . . . , m; for 0 < i < m there are arrows i → i+1 labelled p and i → i−1 labelled q; states 0 and m are absorbing.]
Here is the central calculation.
Example 2.3. Find the probability that the gambler finishes with £m.
Solution. The two key ideas are: (1) condition on the first step, and (2) compute what
we want simultaneously for all starting states. Let W = {ST = m} be the event that the
gambler finishes with m. Define
hk = Pk (W ),
where the subscript k means that we consider the model in which the gambler’s initial wealth
is k.
Then for all 0 < k < m,
hk = Pk (W ) = Pk (W | X1 = 1)P(X1 = 1) + Pk (W | X1 = −1)P(X1 = −1)
= hk+1 p + hk−1 q,   (1)
because conditioned on X1 = 1, i.e. on winning the first game, the gambler’s wealth S1 is now
k + 1, and the probability of ending up at m is the same as the unconditioned probability if
we had started at k + 1; similarly for conditioning on X1 = −1. The above is a difference
equation or recurrence for the family of probabilities (hk)_{k=0}^m. We also need the boundary conditions:

h0 = 0;   hm = 1
(which follow immediately from the definition of W ).
Difference equations can be solved by methods analogous to those for differential equations. Here we just focus on the ones that arise. (To get a feel for the situation, you might
try solving the problem for a small case, say m = 3 – after substituting for h0 and hm this is
just a matter of solving two simultaneous equations.)
Putting (1) in standard form:
phk+1 − hk + qhk−1 = 0.
We look for a solution of the form hk = θ^k. Substituting:

pθ^{k+1} − θ^k + qθ^{k−1} = 0;
pθ² − θ + q = 0;
(pθ − q)(θ − 1) = 0,
so θ = 1 or θ = q/p. We need to divide into two cases according to whether these two solutions are equal.

Case p ≠ 1/2. The two solutions 1 and q/p are distinct, so the general solution to the difference equation is

hk = A·1^k + B·(q/p)^k = A + B(q/p)^k.

We find the constants A, B by substituting the boundary conditions:

0 = h0 = A + B;   1 = hm = A + B(q/p)^m.

So A = −B and 1 = A[1 − (q/p)^m], thus A = 1/[1 − (q/p)^m]. Therefore,

hk = (1 − (q/p)^k) / (1 − (q/p)^m)   for p ≠ 1/2.   (2)
Case p = 1/2. We have 1 = q/p so the two roots coincide, and we need another independent solution. The general rule is to multiply the solution θ^k by integer powers of k, i.e. θ^k k, θ^k k², etc. So in our case the general solution is

hk = A·1^k + B·1^k·k = A + Bk.

The boundary conditions give 0 = A and 1 = A + Bm = Bm, so B = 1/m. Therefore,

hk = k/m   for p = 1/2.   (3)   ♢
It is well worth getting a feel for the solutions (2),(3) via examples.
Example 2.4. Suppose k = 20 and m = 100. What is the probability of ending with m
for each of p = 1/2 (fair game), p = 18/38 (slightly unfavourable game; betting on red in
roulette), and p = 20/38 (slightly favourable game; roulette from the casino’s perspective)?
Repeat for k = 200 and m = 1000.
Solution. From (2),(3) the values of hk are:

p       | k = 20, m = 100   | k = 200, m = 1000
1/2     | 0.2               | 0.2
18/38   | 0.00019 · · ·     | 2.5 · · · × 10^{−37}
20/38   | 0.88 · · ·        | 0.9999999993 · · ·
Note how large an effect small changes in p can have, and how much more extreme the
effects are for large n.
♢
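The table is easy to reproduce by machine; the following short script (illustrative only, with the helper name h chosen to match the notation above) evaluates (2) and (3) directly:

```python
def h(k, m, p):
    """Probability of reaching m before 0, starting from k (equations (2) and (3))."""
    if p == 0.5:
        return k / m
    r = (1 - p) / p          # r = q/p
    return (1 - r ** k) / (1 - r ** m)

for p in (1/2, 18/38, 20/38):
    print(f"p = {p:.4f}:  h(20, 100) = {h(20, 100, p):.3g},"
          f"  h(200, 1000) = {h(200, 1000, p):.3g}")
```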
Example 2.5. Prove that with probability 1 the gambler’s wealth eventually reaches either 0
or m, so that the process stops; i.e. show that T < ∞ almost surely.
Solution. Since 0 < p < 1, the infinite sequence of i.i.d. steps (Xi ) almost surely contains
m consecutive +1s (e.g. by the “parrot question”). If the process has not already stopped
before this point then this will cause it to stop at m.
♢
Consequently we have

P(S_T = 0) = P(ruin) = 1 − hk = [ (q/p)^k − (q/p)^m ] / [ 1 − (q/p)^m ]   for p ≠ 1/2,
                              = (m − k)/m                                  for p = 1/2.
Note that in the fair case, p = 1/2, the entire process is also "fair", in the sense that the expectation of the gambler's final wealth is

E S_T = m·(k/m) + 0·(1 − k/m) = k,

which is what she started with. More on this later!
2.4 Duration of the game
Example 2.6. In the gambler’s ruin problem, find ET , the expected number of games played
until the process finishes, in the symmetric case p = 1/2.
Solution. Let tk = Ek T . By dividing the steps into disjoint blocks of m and comparing
with a geometric distribution we have tk < ∞ for each k. Now condition on the first step,
for 0 < k < m:
tk = E(T | X1 = 1)·(1/2) + E(T | X1 = −1)·(1/2) = (1 + tk+1)·(1/2) + (1 + tk−1)·(1/2),
because if X1 = 1 then we have used one step, and then it is as if the process starts again
from k + 1. The boundary conditions are
t0 = 0,
tm = 0.
Putting the difference equation in standard form:
tk+1 − 2tk + tk−1 = −2.   (4)
We first solve the homogeneous version:

tk+1 − 2tk + tk−1 = 0.

Trying tk = θ^k gives θ² − 2θ + 1 = 0, i.e. (θ − 1)² = 0, which has a repeated root θ = 1, so the general solution to the homogeneous equation is 1^k(A + Bk) = A + Bk.
We now need a particular solution to the original equation (4). Since the right side is a constant (or more generally if it is a polynomial) we try powers of k. Since 1 and k are already taken as solutions to the homogeneous equation we need to go to tk = ck². This gives

c[(k + 1)² − 2k² + (k − 1)²] = −2,

i.e. c · 2 = −2 so c = −1. Thus the general solution to (4) is the particular solution plus the general solution of the homogeneous equation:

tk = A + Bk − k².

The boundary conditions give 0 = A and 0 = A + Bm − m², so B = m²/m = m. Thus

tk = k(m − k).
♢
In our example k = 20, m = 100 we have tk = 20 × 80 = 1600, while for k = 200 and
m = 1000 we have tk = 160,000 (perhaps surprisingly large).
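A quick simulation sketch (ours, not from the notes; the smaller values k = 5, m = 20 keep the run time short) confirms the formula tk = k(m − k), which gives 75 in this case:

```python
import random

def average_duration(k, m, trials=20_000):
    """Average number of steps for a symmetric walk from k to hit 0 or m."""
    total = 0
    for _ in range(trials):
        s, steps = k, 0
        while 0 < s < m:
            s += random.choice((-1, 1))
            steps += 1
        total += steps
    return total / trials

random.seed(2)
print(average_duration(5, 20), "vs exact", 5 * (20 - 5))
```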
Similarly, note that if m is even then t_{m/2} = m²/4. The quadratic behaviour makes sense if we remember that Sn is approximately normal with mean 0 and standard deviation const·√n for large n. So it is natural to expect to wait until time approximately n = const·m² for it to deviate by m/2 from its starting point.
The difference equation can also be solved to get a (messier) expression for tk when p ≠ 1/2. But we will see a much neater way to do this at the end of this section.

2.5 One barrier, no barrier
Suppose that the gambler is playing against an infinitely rich opponent (think: a casino).
She again starts with £k > 0, but we stop the process only if her wealth reaches £0. In
other words, we stop at the random time T = min{n : Sn = 0}, where by definition we take
T = ∞ if Sn > 0 for all n. This is simple random walk with one absorbing barrier at 0.
Example 2.7. Find the ruin probability, i.e. the probability that the gambler’s wealth ever
reaches 0.
One possible approach would be to again condition on the first step to get a difference
equation for the infinite family of probabilities rk = Pk (ruin) for k = 0, 1, . . .. One problem
with this is that we only seem to have one boundary condition, r0 = 1. We’ll see some possible
solutions to this difficulty later, but for now we will instead use the previous two-barriers
case together with continuity of probability.
Solution. Fix k and let r = Pk(ruin) be the required probability. The idea is that the situation with one barrier at 0 can be interpreted as the limit as m → ∞ of two barriers at 0 and m.

To justify this, consider the unrestricted random walk S = (Sn)_{n=0}^∞ started at k, given by Sn = k + Σ_{i=1}^n Xi, where (Xi)_{i=1}^∞ are the usual i.i.d. steps. For a ∈ Z let

Ta = min{n ≥ 0 : Sn = a}
be the first time the walk hits a (where we take Ta = ∞ if it never hits a). Let
Am = {T0 < Tm }
be the event that S hits 0 before it hits m, where 0 < k < m. Note that Ak+1 , Ak+2 , . . . is an
increasing sequence of events, because the random walk cannot get to m + 1 without visiting
m. Let
A = ⋃_{m=k+1}^∞ Am
be the limit. This is the event that the walk hits 0 before m for some m, which is precisely
the event that it hits 0. So P(A) = P(T0 < ∞) = r, since it does not matter whether or
not we stop once we hit 0. On the other hand, P(Am ) is simply the ruin probability for the
random walk with barriers at 0 and m, which is 1 − hk , where hk = hk (m) is the probability
we computed before.
We conclude

r = lim_{m→∞} (1 − hk(m)) =
  1 − lim_{m→∞} k/m = 1,                                   p = 1/2;
  1 − lim_{m→∞} [1 − (q/p)^k] / [1 − (q/p)^m] = 1,          p < 1/2;
  1 − lim_{m→∞} [1 − (q/p)^k] / [1 − (q/p)^m] = (q/p)^k,    p > 1/2.

In summary,

r = 1 for p ≤ 1/2,   and   r = (q/p)^k for p > 1/2.

Thus, ruin is possible in a favourable game, but (almost surely) certain in an unfavourable or fair game! ♢
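The limiting argument can also be seen numerically (a sketch with illustrative values p = 0.6 and k = 3): the two-barrier ruin probability 1 − hk(m) settles down to the one-barrier value (q/p)^k as m grows.

```python
def ruin_two_barriers(k, m, p):
    """1 - h_k(m): probability of hitting 0 before m, starting from k."""
    if p == 0.5:
        return 1 - k / m
    r = (1 - p) / p
    return 1 - (1 - r ** k) / (1 - r ** m)

k, p = 3, 0.6                              # favourable game
for m in (5, 10, 20, 50, 100):
    print(m, ruin_two_barriers(k, m, p))
print("limit (q/p)^k =", ((1 - p) / p) ** k)
```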
Later we’ll compute the expected time to hit 0 in the one-barrier case.
Now we turn our attention back to the unrestricted simple random walk started at 0, as defined in Subsection 2.2. Let the steps (Xi) be i.i.d. as usual and let Sn = Σ_{i=1}^n Xi. Let
defined in Subsection 2.2. Let the steps (Xi ) be i.i.d. as usual and let Sn = ni=1 Xi . Let
R0 = min{n > 0 : Sn = 0}
be the first time the walk returns to 0 (i.e. visits 0 at a strictly positive time).
Example 2.8. For each p, find the probability that the unrestricted simple random walk ever
returns to 0, i.e. P(R0 < ∞).
Solution. Let E = {R0 < ∞} and let Pk denote the probability measure for the random
walk started at k. Conditioning on the first step,
P0 (E) = P0 (E | X1 = 1)p + P0 (E | X1 = −1)q
= P1 (T0 < ∞)p + P−1 (T0 < ∞)q.
Now P1 (T0 < ∞) is the ruin probability for the random walk with an absorbing barrier at 0
started at 1, which was computed earlier. On the other hand, if (Sn ) is a simple random walk
started at −1 then (−Sn ) is a simple random walk started at 1, but with the probabilities p
and q of an up-step and a down-step reversed. So P−1 (T0 < ∞) is the same as ruin probability
P1 (T0 < ∞) in the model with p replaced with q.
Thus, by Example 2.7,

P0(E) = 1·p + 1·q = 1,          p = 1/2;
P0(E) = 1·p + (p/q)·q = 2p,     p < 1/2;
P0(E) = (q/p)·p + 1·q = 2q,     p > 1/2.

We can express this as

P0(E) = 1 − |p − q|.
In particular, note that return to 0 is certain if and only if p = 1/2.
♢
[Graph: the return probability 1 − |p − q| plotted against p ∈ [0, 1]; it equals 1 at p = 1/2 and decreases linearly to 0 at p = 0 and p = 1.]
We can also consider random walks with a reflecting barrier, which roughly speaking
means that the walk is prevented from moving beyond some point, but the process is not
stopped and still continues. See Problem Sheet 2.
2.6 Stopping times, Wald's identity
Wald’s identity is a valuable tool for random walks and elsewhere. By way of motivation,
recall our observation in the original gambler’s ruin problem that “a fair game remains fair”:
in the case of winning probability p = 1/2, starting from wealth k, and stopping at the first
time T that it hits 0 or m, the gambler’s expected final wealth EST is also k. Is there a
general method that can predict phenomena like this?
Suppose that (Xi) are i.i.d. and N is a non-negative integer-valued random variable. We would like to compute the mean

E( Σ_{i=1}^N Xi )

of the sum of the first N random variables, where N is itself random. The obvious guess is that it should equal simply

E(N) E(X1).

This indeed holds if N is independent of X, but not in general when N depends on X (see Problem Sheet 1). Wald's identity provides another condition under which it does hold.
Here is the key definition. Let X = (Xi)_{i=1}^∞ be a stochastic process and let N be a random variable taking values in Z+ ∪ {∞} (where Z+ = {1, 2, . . .}). We say that N is a stopping time of (or "for") X if for all n ∈ Z+:

we can determine whether or not the event {N ≤ n} occurs from X1, . . . , Xn.

(Or equivalently, the indicator I_{N≤n} is a function of X1, . . . , Xn.)
For a real-world analogy, consider giving directions. “Take the second road on the left
after the pub” is a helpful instruction, because we can wait for the pub and then count roads.
“Take the left turn two roads before the pub” is not helpful, because we don’t know when to
turn until it’s too late. A stopping time is a helpful instruction.
Example 2.9. Let X = (X1 , X2 , . . .) be a stochastic process whose elements Xi take integer
values, and let A be a proper subset of Z. Which of the following are stopping times for X?
1. N = min{n : Xn ∈ A} (the first time we hit the set A);
2. M = min{n > 1 : Xn ∈ A} − 1 (the time step immediately before the first entry to A);
3. L = sup{n : Xn ∈ A} (the last time we hit A, or ∞ if we hit A infinitely many times
so that there is no last time).
Solution.
1. Yes. N ≤ n if and only if at least one of X1 , . . . , Xn lies in A.
2. No. M ≤ m if and only if at least one of X1, . . . , Xm+1 lies in A, which cannot be determined from X1, . . . , Xm alone.

3. No. If X1 ∈ A then L ≤ 1 if and only if X2, X3, . . . ∉ A, which cannot be determined from X1.
(Silly complication: strictly speaking the answers can depend on the choice of the stochastic
process X, so the “No”s should really be “No in general”. E.g. if each Xi only takes values
in {0, 1} and A = {2} then we can never hit A, so M = ∞ − 1 = ∞, and so we can
determine whether or not M ≤ m without looking at the values (Xi ) at all – we know it
never occurs!)
♢
Theorem 2.10 (Wald's identity). Let X1, X2, . . . be i.i.d. random variables with finite mean, and let N be a stopping time of X with finite mean. Then

E( Σ_{i=1}^N Xi ) = E(N) E(X1).
Proof. We give a proof under the extra assumption that X1 ≥ 0. (The proof of the general version uses dominated convergence, which is not covered in this course.) Note that

Σ_{i=1}^N Xi = Σ_{i=1}^∞ Xi I_{i≤N}.

Therefore,

E( Σ_{i=1}^N Xi ) = Σ_{i=1}^∞ E(Xi I_{i≤N}),   (5)

where the exchange of expectation and sum is justified because the random variables are non-negative (Theorem 1.8). (This is where the extra assumption is used.)

Now {i ≤ N} = {N ≤ i − 1}^C, so I_{i≤N} = 1 − I_{N≤i−1}. Therefore, since N is a stopping time, I_{i≤N} is a function of X1, . . . , Xi−1. Hence I_{i≤N} is independent of Xi. Hence, the right side of (5) equals

Σ_{i=1}^∞ E(Xi) E(I_{i≤N}) = E(X1) Σ_{i=1}^∞ P(N ≥ i) = E(X1) E(N),

where the last step is the expectation formula from Problem Sheet 1.
We can use Wald’s identity to give quick answers to several random walk questions.
Example 2.11. Find the expected time to hit 0 or m in the gambler's ruin problem with p ≠ 1/2, starting from k.

Solution. As usual let X = (Xi) be the i.i.d. steps, and Sn = k + Σ_{i=1}^n Xi the random walk. Let T be the first time Sn hits 0 or m. We argued earlier that ET < ∞. Let h = hk be the probability it hits m before 0, as computed earlier. Then

E S_T = h·m + (1 − h)·0 = hm.

On the other hand, T is a stopping time for X, so Wald's identity gives

E S_T = k + E( Σ_{i=1}^T Xi ) = k + E(X1) E(T) = k + (p − q) ET.

Equating the two expressions for E S_T we obtain hm = k + (p − q)ET, so

ET = (hm − k)/(p − q) = [1/(p − q)] · ( m·(1 − (q/p)^k)/(1 − (q/p)^m) − k ).  ♢
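As with the hitting probabilities, this expression is easy to sanity-check by simulation (a sketch; the values k = 4, m = 10, p = 0.55 are ours):

```python
import random

def expected_duration(k, m, p):
    """ET from Example 2.11, valid for p != 1/2."""
    q = 1 - p
    r = q / p
    h = (1 - r ** k) / (1 - r ** m)
    return (h * m - k) / (p - q)

def simulated_duration(k, m, p, trials=20_000):
    total = 0
    for _ in range(trials):
        s, steps = k, 0
        while 0 < s < m:
            s += 1 if random.random() < p else -1
            steps += 1
        total += steps
    return total / trials

random.seed(3)
print(expected_duration(4, 10, 0.55), simulated_duration(4, 10, 0.55))
```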
Example 2.12. Find the expected time for simple symmetric random walk started at 0 to
hit 1.
Solution. Let T be the time to hit 1, and note that it is a stopping time for X. Note that
T < ∞ a.s. by Example 2.7, and ST = 1 a.s. Note also that EX1 = 0 since we are in the
symmetric case. If ET < ∞ then Wald’s identity would imply
1 = EST = E(X1 )E(T ) = 0ET = 0,
which is a contradiction. So the only possibility is that ET = ∞.
♢
If you have an unlimited bankroll then you can guarantee to make money even on a fair
game: just keep playing until the first time you are up by 1. Example 2.7 guarantees that
this will happen eventually. But Example 2.12 says that it could take a very long time – the
expected time is infinite. And if your bankroll is not really unlimited then you could end up
losing everything, by Example 2.3.
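A small experiment (ours; necessarily a truncated simulation, since the true expectation is infinite) illustrates what infinite expected time looks like in practice: the fraction of walks that have still not reached 1 decays only very slowly as the allowed number of steps grows, so the sample mean of the hitting time never settles down.

```python
import random

def fraction_not_yet_hit(cap, trials=5_000):
    """Fraction of symmetric walks from 0 that have not hit 1 within `cap` steps."""
    misses = 0
    for _ in range(trials):
        s = 0
        for _ in range(cap):
            s += random.choice((-1, 1))
            if s == 1:
                break
        else:
            misses += 1
    return misses / trials

random.seed(4)
for cap in (10, 100, 1000, 10_000):
    print(cap, fraction_not_yet_hit(cap))
```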
3 Markov Chains

3.1 Introduction
Markov chains (sometimes called Markov processes) are a broad generalization of random
walks. Before giving the formal definition, we look at a simple example, given in terms of its
transition diagram.
[Transition diagram: three states a, b, c; arrows a → b with probability 1, b → b with probability 2/3, b → c with probability 1/3, c → a with probability 1/2, c → b with probability 1/2.]
At each (discrete) time 0, 1, 2, . . ., the process is in one of the three states a, b, c. At
each step, we choose one of the arrows coming out of the current state, according to the
probabilities assigned to them, and move to the new state that the chosen arrow points to.
The labels of all the arrows coming out of a state must sum to 1.
We can try to compute various things. As before, Pi means the probability starting from
state i:
1. Pa(hit c). We go from a to b and then stay there until we choose the 1/3 arrow. This almost surely happens eventually, so the answer is 1.

2. Ea(time to hit c). Getting to b takes 1 step, and then we wait on average 1/(1/3) = 3 steps (the mean of a geometric variable) to get to c. So 4.

3. Starting from a, the long-term proportion of time spent at c is 2/9. Can you find a way to see this?
We will learn how to answer such questions systematically.
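Before developing that machinery, here is a simulation sketch (ours; the dictionary encoding of the transition diagram is just one convenient choice) that checks items 2 and 3 above numerically:

```python
import random

# Transition probabilities of the three-state chain: a -> b; b -> b or c; c -> a or b.
P = {
    "a": [("b", 1.0)],
    "b": [("b", 2 / 3), ("c", 1 / 3)],
    "c": [("a", 1 / 2), ("b", 1 / 2)],
}

def step(state):
    """Take one step of the chain from `state`."""
    u, acc = random.random(), 0.0
    for nxt, prob in P[state]:
        acc += prob
        if u < acc:
            return nxt
    return P[state][-1][0]

random.seed(5)

# Item 2: expected time to hit c starting from a (should be about 4).
trials, total = 50_000, 0
for _ in range(trials):
    state, t = "a", 0
    while state != "c":
        state, t = step(state), t + 1
    total += t
print("E_a(time to hit c) ~", total / trials)

# Item 3: long-run proportion of time spent at c (should be about 2/9).
state, visits, steps = "a", 0, 200_000
for _ in range(steps):
    state = step(state)
    visits += (state == "c")
print("fraction of time at c ~", visits / steps)
```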
Here is the formal definition. Let S be a countable set, called the state space, whose
elements are called states. Let X = (X0 , X1 , X2 , . . .) be a stochastic process, where each Xi
takes values in S. We say that X is a Markov chain if
P(Xn+1 = j | X0 = i0, X1 = i1, . . . , Xn = in) = P(Xn+1 = j | Xn = in),   (6)
for all n ≥ 0 and i0 , . . . , in , j ∈ S for which the conditional probabilities exist. The idea of
the definition is that the distribution of the next state Xn+1 depends on the past history
X0 , . . . , Xn only through the current state Xn . Moreover, we say that the Markov chain is
(time-)homogeneous if in addition
P(Xn+1 = j | Xn = i) = pi,j   (7)
for all n ≥ 0, where (pi,j )i,j∈S are fixed numbers – in other words, the probability of moving
from i to j is the same regardless of the time step n. This probability pi,j is called the
transition probability. (Sometimes it can be written pij or p(i, j) for reasons of visual
clarity).
All the Markov chains that we consider will be time-homogenous, and we usually won’t
even bother to say so explicitly. To check that a stochastic process is a time-homogeneous
Markov chain it is enough to check the following single condition that in effect combines (6)
and (7).
Theorem 3.1. The stochastic process X = (Xi )i≥0 is a time homogeneous Markov chain
with transition probabilities pi,j if and only if it satisfies
P(Xn+1 = j | X0 = i0, . . . , Xn−1 = in−1, Xn = i) = pi,j,   (8)

for all n ≥ 0 and all i0, . . . , in−1, i, j for which the conditional probability exists.

Proof. Clearly (6) and (7) imply (8). For the other direction, assuming (8), we have

P(Xn+1 = j, Xn = i) = Σ_{i0,...,in−1} P(Xn+1 = j, X0 = i0, . . . , Xn−1 = in−1, Xn = i)
                    = Σ_{i0,...,in−1} pi,j P(X0 = i0, . . . , Xn−1 = in−1, Xn = i)
                    = pi,j P(Xn = i),

which implies (7). Now equate the left sides of (8) and (7) to get (6).
The transition matrix of a Markov chain is the |S| by |S| matrix P = (pi,j )i,j∈S of
transition probabilities, where pi,j is the entry in position (i, j), i.e. in row i and column j.
So row i tells us the probability mass function of where we go from state i. For example, the
transition matrix of the example at the beginning of the section is
          a     b     c
    a  (  0     1     0  )
P = b  (  0    2/3   1/3 )
    c  ( 1/2   1/2    0  )
The matrix formulation will turn out to be very convenient when we consider n-step transition
probabilities. Often the states are integers, in which case we don’t need to label the rows
and columns. It is immediate from (7) that the entries of a transition matrix satisfy

pi,j ≥ 0 for all i, j ∈ S,   and   Σ_{j∈S} pi,j = 1 for all i ∈ S.
Conversely, any square matrix P satisfying these conditions is a transition matrix of a Markov
chain. As we have already seen, we can also represent the transition probabilities by the
transition diagram, which is the directed graph with a node for each state i ∈ S, and a
directed edge from i to j labelled with pi,j . Normally we omit edges that would have label 0,
and we can optionally omit self-loops (such as the one from b to itself with label 2/3 above), because the fact that Σ_j pi,j = 1 enables us to recover the looping probability pi,i from the
other probabilities out of i.
Now we look at lots of examples.
Example 3.2. Check that the simple random walk on Z is a Markov chain with state space
Z and transition probabilities
pi,i+1 = p,   pi,i−1 = q = 1 − p,   for i ∈ Z,

and pi,j = 0 if |i − j| ≠ 1.
We saw the transition diagram earlier. The transition matrix is infinite:

    ( ⋱  ⋱  ⋱            )
    (    q  0  p         )
    (       q  0  p      )
    (          q  0  p   )
    (             ⋱  ⋱  ⋱ )

(Blank entries in a matrix represent 0s.)
Solution. For the random walk (Sn) with steps (Xn):

P(Sn+1 = j | S0 = i0, . . . , Sn−1 = in−1, Sn = i)
  = P(Xn+1 = j − i | S0 = i0, . . . , Sn−1 = in−1, Sn = i)
  = P(Xn+1 = j − i)
  = p if j = i + 1,  q if j = i − 1,  0 otherwise,

because Xn+1 is independent of (S0, . . . , Sn). Thus (8) holds. ♢
Example 3.3. The simple random walk with absorbing barriers at 0 and m (or gambler’s
ruin process) is a Markov chain with state space {0, 1, . . . , m} and transition matrix

    ( 1  0              )
    ( q  0  p           )
    (    ⋱  ⋱  ⋱        )
    (       q  0  p     )
    (          0  1     )
Here we have made a minor change to the definition of the process. Rather than “stopping” when we reach 0 or m (which doesn’t really make sense in our current formalism), we
make 0 and m into absorbing states by setting p0,0 = pm,m = 1. Thus, after the chain
reaches 0 or m it just stays there for all subsequent steps.
Example 3.4. A microorganism can exist in m different strains. At each generation, it
either stays the same, or with probability p it mutates to a randomly chosen different strain.
If Xn is the strain after n generations then (Xn ) is a Markov chain with m states and
transition matrix

    ( 1−p       p/(m−1)   · · ·   p/(m−1) )
    ( p/(m−1)   1−p       · · ·   p/(m−1) )
    (    ⋮                  ⋱       ⋮     )
    ( p/(m−1)   p/(m−1)   · · ·   1−p     )
If we are only interested in whether or not the strain is the same as the original one (strain
1, say), it might be simpler to let
Yn = 1 if Xn = 1,   and   Yn = 2 if Xn ≠ 1,
in which case (Yn ) is a two-state Markov chain with
P = ( 1−p          p           )
    ( p/(m−1)      1 − p/(m−1) )

[Transition diagram: states 1 and 2, with arrow 1 → 2 labelled p and arrow 2 → 1 labelled p/(m−1).]
For the next few examples, let X1 , X2 , . . . be i.i.d. taking value 1 with probability p and
0 with probability q := 1 − p.
Example 3.5. (Xn ) itself is a rather boring Markov chain (with time indexed starting from
1 rather than 0) with transition matrix
    ( q  p )
    ( q  p )

Indeed,

P(Xn+1 = j | X1 = i1, . . . , Xn−1 = in−1, Xn = i) = P(Xn+1 = j) = q if j = 0, and p if j = 1.
Example 3.6. Sn = Σ_{i=1}^n Xi is a Markov chain that can only move in one direction:

[Transition diagram: states 0, 1, 2, . . . , with an arrow i → i+1 labelled p for each i (self-loops with probability q omitted).]
Example 3.7. Define the ordered pair Yn := (Xn , Xn+1 ). Then (Yn ) is a Markov chain with
transition matrix
         00    01    10    11
  00  (   q     p     0     0 )
  01  (   0     0     q     p )
  10  (   q     p     0     0 )
  11  (   0     0     q     p )
(Exercise: check this, and draw the transition diagram also.)
Example 3.8. Let Zn = Xn +Xn+1 . Then (Zn ) is NOT a Markov chain. The basic problem
is that if Zn = 1 then we do not know whether (Xn , Xn+1 ) is (0, 1) or (1, 0); this affects the
distribution of Zn+1 , and the prior history (e.g. Zn−1 ) may give more information about it.
To turn this into a proof, we can say e.g.
P(Z3 = 2 | Z1 = 0, Z2 = 1) = P(X4 = 1 | (X1, X2, X3) = (0, 0, 1)) = 1/2
but P(Z3 = 2 | Z1 = 2, Z2 = 1) = P(X4 = 2 | (X1, X2, X3) = (1, 1, 0)) = 0,
which are not equal so (8) cannot be satisfied. (Intuitively, we need to know Z1 as well as Z2
to decide on Z3 , so it’s impossible to say what p1,2 should be.)
Example 3.9. If the last two days were rainy then the next day will be dry. In all other
cases, the next day is equally likely to be rainy or dry.
Let Xn be the weather on day n, taking values in {r, d} (rainy or dry). Then (Xn ) is a
stochastic process, but it is NOT a Markov chain. For example,
P(X2 = d | X0 = r, X1 = r) = 1   but   P(X2 = d | X0 = d, X1 = r) = 1/2,
which are not equal, so (8) cannot be satisfied.
However, the pair (Xn , Xn+1 ) DOES form a Markov chain, with transition matrix:
         rr    rd    dr    dd
  rr  (   0     1     0     0  )
  rd  (   0     0    1/2   1/2 )
  dr  (  1/2   1/2    0     0  )
  dd  (   0     0    1/2   1/2 )

3.2 Multi-step transition probabilities
To answer questions about the evolution of a Markov chain X with state space S we need
to know the transition probabilities (pi,j )i,j ∈ S and also the initial distribution, i.e. the
distribution of the initial state X0 . We can specify the initial distribution by giving its
probability mass function λ = (λi )i∈S :
λi = P(X0 = i),   i ∈ S.

An important special case is when the initial distribution is deterministic (i.e. not random)
– that is, P(X0 = i) = λi = 1 for some fixed state i ∈ S (and λj = 0 for all j ̸= i), in
which case we can just say the Markov chain is started at i. When the transition matrix is
understood from context, we sometimes write Pi for the probability measure associated with
the Markov chain started at i.
Theorem 3.10. For a Markov chain X with initial distribution λ and transition matrix
P = (pi,j )i,j∈S ,
P(X0 = i0, X1 = i1, . . . , Xn = in) = λi0 pi0,i1 pi1,i2 · · · pin−1,in.
Proof. The left side equals
P(X0 = i0 ) P(X1 = i1 | X0 = i0 ) P(X2 = i2 | X0 = i0 , X1 = i1 ) · · ·
× P(Xn = in | X0 = i0 , . . . , Xn−1 = in−1 ),
which by (8) equals the right side.
Example 3.11. In Example 3.4, suppose that at generation 0 the microorganism is strain 1.
What is the probability that it is not strain 1 for generations 1, 2 and 3, but mutates back to
strain 1 at generation 4?
Solution. Using the formulation as a two-state chain, the probability is
1 · p12 p22 p22 p21 = p · (1 − p/(m−1))² · p/(m−1).  ♢
Dividing the formula in Theorem 3.10 by λi0 we get an even simpler formula for the
conditional probability:
P(X1 = i1, . . . , Xn = in | X0 = i0) = pi0,i1 pi1,i2 · · · pin−1,in,   (9)
provided the left side exists. Not surprisingly, this is the same as the unconditional probability
when we start the Markov chain from i0 (by applying Theorem 3.10 to the initial distribution
with λi0 = 1):
Pi0(X1 = i1, . . . , Xn = in) = pi0,i1 pi1,i2 · · · pin−1,in.   (10)
In light of this it is sometimes useful to define the conditional probability on the left side
of (9) to equal the right side (i.e. (10)) even when P(X0 = i0 ) = 0 (so that the conditional
probability is not defined according to the usual formula). (There is no harm in this, because
if P(X0 = i0 ) = 0 then in computations, conditional probabilities P(· | X0 = i0 ) will typically
get multiplied by 0, so we can define them to be anything we want.)
We next observe that our assumption of time-homogeneity of transition probabilities
extends to the whole Markov chain, provided we condition on the starting state.
Theorem 3.12 (Markov property). For a time-homogeneous Markov chain X, the conditional distribution of (Xm+n )n≥0 given Xm = i is the same as the distribution of X given
X0 = i.
Proof. It is a standard fact (not covered in detail in this course) that the distribution of a
discrete-time stochastic process X is determined by the joint distributions of (X0 , X1 , . . . , Xn )
for each n. So we just need to consider
P(Xm+1 = i1 , Xm+2 = i2 , . . . , Xm+n = in | Xm = i0 ).
We can compute this in the same way as in the proof of Theorem 3.10, to conclude that it
equals the right side of (9).
We now define the n-step transition probability:

p^(n)_{i,j} = Pi(Xn = j) = P(Xn = j | X0 = i).

This is the probability of going from state i to state j in n time steps. By Theorem 3.12 we have, for any m ≥ 0,

P(Xn+m = j | Xm = i) = p^(n)_{i,j}.

In a few cases we can just compute the n-step transition probabilities directly.
Example 3.13. In Example 3.6, compute p^(n)_{i,j}.

Solution. If j < i then it is impossible to get from i to j, while if j − i > n then n steps are not enough. Otherwise, we need exactly j − i of the n steps X1, . . . , Xn to be +1 and the rest to be 0, so the probability is

p^(n)_{i,j} = (n choose j−i) p^{j−i} q^{n−j+i}   if 0 ≤ j − i ≤ n,   and 0 otherwise.  ♢

For the general case, the following is a key fact.
Theorem 3.14 (Chapman-Kolmogorov equations). For any time-homogeneous Markov
chain, the transition probabilities satisfy
p^(m+n)_{i,j} = Σ_{k∈S} p^(m)_{i,k} p^(n)_{k,j}.

Proof.

p^(m+n)_{i,j} = Pi(Xm+n = j)
             = Σ_{k∈S} Pi(Xm+n = j, Xm = k)
             = Σ_{k∈S} Pi(Xm = k) Pi(Xm+n = j | Xm = k)
             = Σ_{k∈S} p^(m)_{i,k} p^(n)_{k,j}.
Finally, we can deduce an elegant representation of the n-step transition probabilities.
Theorem 3.15. The n-step transition probability p^{(n)}_{i,j} equals P^n_{i,j} = (P^n)_{i,j}, the (i, j) entry in P^n, the nth power of the transition matrix P.
Proof. This follows from the Chapman-Kolmogorov equations by induction on n. For n = 1 it is the definition of the transition matrix P. For n > 1, assuming the result holds for n − 1,

    p^{(n)}_{i,j} = Σ_{k∈S} p^{(n−1)}_{i,k} p_{k,j} = Σ_{k∈S} P^{n−1}_{i,k} P_{k,j} = (P^{n−1} · P)_{i,j} = P^n_{i,j}.

In the light of this, we can drop the notation p^{(n)}_{i,j} and just write P^n_{i,j}! We can also optionally write P_{i,j} instead of p_{i,j}, because it is the (i, j) entry of the matrix P. (When writing by hand it can also be convenient to drop the comma and write Pij when there is no risk of confusion.)
Example 3.16. Compute all the 2-step transition probabilities for the 3-state Markov chain
at the beginning of this section.
Solution. With rows and columns always labelled a, b, c:

          ( 0    1    0  ) ( 0    1    0  )     ( 0     2/3    1/3 )
    P² =  ( 0   2/3  1/3 ) ( 0   2/3  1/3 )  =  ( 1/6  11/18   2/9 ),
          ( 1/2  1/2  0  ) ( 1/2  1/2  0  )     ( 0     5/6    1/6 )

so P²_{a,b} = 2/3, etc. (Exercise: check that each row of P² sums to 1, and that the numbers make sense in the context of the transition diagram.)
♢
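(Aside, not part of the course: such matrix calculations are easy to check numerically. A minimal sketch, assuming Python with numpy is available:)

    import numpy as np

    # Transition matrix of the 3-state chain, rows and columns ordered a, b, c.
    P = np.array([[0.0, 1.0, 0.0],
                  [0.0, 2/3, 1/3],
                  [0.5, 0.5, 0.0]])

    P2 = P @ P                            # 2-step transition probabilities
    print(P2)                             # row a is (0, 2/3, 1/3), etc.
    print(P2.sum(axis=1))                 # each row should sum to 1
    print(np.linalg.matrix_power(P, 16))  # higher powers such as P^16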
Denote the probability mass function of the state Xn at time n by

    λ^{(n)}_i = P(Xn = i),    i ∈ S;

then we can consider λ^{(n)} = (λ^{(n)}_i : i ∈ S) as a row vector of length |S| (with nonnegative entries that sum to 1).
Theorem 3.17. For a Markov chain with transition matrix P we have
    λ^{(m+n)} = λ^{(m)} P^n.

Proof.

    λ^{(m+n)}_j = P(Xm+n = j) = Σ_{i∈S} P(Xm+n = j | Xm = i) P(Xm = i)
                = Σ_{i∈S} λ^{(m)}_i P^n_{i,j} = (λ^{(m)} P^n)_j.
Example 3.18. In the 3-state example at the beginning of the section, suppose the initial state is uniformly distributed over {a, b, c}, so λ^{(0)} = (1/3, 1/3, 1/3). Then the distribution after 1 step is

                              ( 0    1    0  )
    λ^{(1)} = (1/3  1/3  1/3) ( 0   2/3  1/3 )  =  (1/6  13/18  1/9).
                              ( 1/2  1/2  0  )
3.3    Hitting probabilities and times
Let X be a Markov chain and let A ⊆ S be any set of states. We define the hitting time
TA of A to be the first time the Markov chain is in A:
TA = min{n ≥ 0 : Xn ∈ A},
where we take the minimum of the empty set to be ∞. So if Xn is never in A then TA =
min ∅ = ∞. Thus TA is an N ∪ {∞}-valued random variable (where N = {0, 1, 2, . . .}). For a
singleton set A = {a} we can also write Ta = T{a} .
We define the hitting probability hi (A) of A to be the probability X ever hits A,
starting from state i:
hi (A) = Pi (TA < ∞) = Pi (hit A) = Pi (Xn ∈ A for some n).
The equations we derived for random walks generalize directly to Markov chains:
Theorem 3.19. For a fixed set of states A, the hitting probabilities hi = hi(A) satisfy

    hi = Σ_j pi,j hj ,   i ∉ A,
    hi = 1,              i ∈ A.        (11)

Moreover, (hi)_{i∈S} is the minimal nonnegative solution to the equations (11). That is, if (zi)_{i∈S} is any solution to (11) satisfying zi ≥ 0 for all i then hi ≤ zi for all i.
We separate the proof into two parts.
Proof of (11). For i ∈ A we have hi = 1 by definition. For i ∉ A we condition on the first step:

    hi = Σ_{j∈S} Pi(X1 = j) Pi(TA < ∞ | X1 = j) = Σ_{j∈S} pi,j hj ,

where we used the Markov property (Theorem 3.12) in the second equality.
Proof of minimality. (Optional – not required for the course). Suppose (zi)_{i∈S} is a nonnegative solution to (11). For i ∈ A we certainly have hi = 1 ≤ 1 = zi. For i ∉ A, by repeatedly substituting for the z's from (11),

    zi = Σ_{j∈S} pi,j zj = Σ_{j∈A} pi,j + Σ_{j∉A} pi,j zj
       = Pi(X1 ∈ A) + Σ_{j∉A} pi,j ( Σ_{k∈A} pj,k + Σ_{k∉A} pj,k zk )
       = Pi(X1 ∈ A) + Pi(X1 ∉ A, X2 ∈ A) + Σ_{j,k∉A} pi,j pj,k zk
       ⋮
       = Pi(X1 ∈ A) + · · · + Pi(X1 , . . . , Xn−1 ∉ A, Xn ∈ A) + Σ_{k1,...,kn∉A} pi,k1 pk1,k2 · · · pkn−1,kn zkn .

The last term is nonnegative, and the sum of the preceding n terms is precisely Pi(TA ≤ n). Thus we have zi ≥ Pi(TA ≤ n). Now taking the limit as n → ∞ and using continuity of probability, zi ≥ Pi(TA < ∞) = hi.
We have a similar result for the expected hitting time of a set A:
ti (A) = Ei TA = Ei (time to hit A).
Theorem 3.20. For a fixed set of states A, the expected hitting times ti = ti(A) satisfy

    ti = 1 + Σ_j pi,j tj ,   i ∉ A,
    ti = 0,                  i ∈ A.        (12)

Moreover, (ti)_{i∈S} is the minimal nonnegative solution to the equations (12). That is, if (zi)_{i∈S} is any solution to (12) satisfying 0 ≤ zi ≤ ∞ for all i then ti ≤ zi for all i.
Proof. Similar to the proof of Theorem 3.19.
The minimality condition in Theorems 3.19 and 3.20 is not usually relevant for finite state
spaces, but can enable us to choose between multiple solutions in the case of an infinite state
space. In the latter case, one solution is often ti = ∞ for all i ∉ A. This will be the minimal
solution if there is no finite solution!
The last result gives us a systematic way to solve the HTH problem.
Example 3.21. A fair coin is tossed repeatedly. Find the expected number of tosses until the first occurrence of HTH.
Solution. Let Xn be the number of consecutive “correct” tosses we have just had, ending
at the nth toss, i.e., more precisely, the length of the longest final segment of tosses 1, . . . , n
that is an initial segment of HTH. So for example Xn = 2 means that tosses n − 1 and n are HT. Then Xn is a Markov chain with transition diagram:
[Transition diagram: states 0 = (), 1 = (H), 2 = (HT), 3 = (HTH); each arrow shown has probability 1/2 (self-loops omitted).]

(Recall that we omit self-loops from the diagram. We can assume p3,3 = 1 since it doesn't matter what happens after we reach our goal).
Let ti = Ei(time to hit 3). Then

    t0 = 1 + (1/2)t0 + (1/2)t1;
    t1 = 1 + (1/2)t1 + (1/2)t2;
    t2 = 1 + (1/2)t0 + (1/2)t3;
    t3 = 0.

Solving these equations:

    t0 = 2 + t1 = 2 + 2 + t2 = 5 + (1/2)t0,

so t0 = 10 (as before).
♢
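(Aside: the same numbers can be obtained by solving the linear system mechanically. A small sketch, assuming Python with numpy:)

    import numpy as np

    # Transition probabilities restricted to the states {0, 1, 2} that have not yet
    # reached the goal; state 3 is dropped because t_3 = 0.
    Q = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.0]])

    # The equations t_i = 1 + sum_j p_{i,j} t_j become (I - Q) t = 1.
    t = np.linalg.solve(np.eye(3) - Q, np.ones(3))
    print(t)   # approximately [10., 8., 6.], so t_0 = 10 as above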
We could handle the ABRACADABRA problem in the same way, although it would be
somewhat tedious. Later in the course we’ll see a better way.
3.4    Strong Markov property
Theorem 3.12 says that if we observe a Markov chain starting from a fixed time m then the
distribution is the same as the original chain. Does this hold for a random time also? The
answer is no in general but yes for a stopping time.
This time our stochastic process X = (Xi)_{i=0}^∞ is indexed from 0, so the definition gets a
trivial adjustment: a random variable T taking values in N ∪ {∞} is a stopping time for
X if for each n ≥ 0 we can determine whether or not T ≤ n from X0 , . . . , Xn .
Theorem 3.22 (Strong Markov property). Let (Xi )i≥0 be a time-homogeneous Markov
chain, and let T be a stopping time for X. Conditional on T < ∞ and XT = i and
any information about X0 , . . . , XT , the distribution of (XT +n )n≥0 is the same as the
conditional distribution of X given X0 = i.
Proof. (Optional; not required for the course.) Let A be any event determined by
X0 , . . . , XT . Since T is a stopping time, {T = m} ∩ A is determined by X0 , . . . , Xm for
each m ∈ N. Therefore,
P(XT +1 = j1 , . . . , XT +n = jn , XT = i, T = m, A)
= P(Xm+1 = j1 , . . . , Xm+n = jn , Xm = i, T = m, A)
= P(Xm+1 = j1 , . . . , Xm+n = jn | Xm = i, T = m, A) P(Xm = i, T = m, A)
= P(Xm+1 = j1 , . . . , Xm+n = jn | Xm = i) P(Xm = i, T = m, A)
= Pi (X1 = j1 , . . . , Xn = jn ) P(Xm = i, T = m, A).
Now sum over m to get
P(XT +1 = j1 , . . . , XT +n = jn , XT = i, T < ∞, A)
= Pi (X1 = j1 , . . . , Xn = jn ) P(XT = i, T < ∞, A),
and divide both sides by the last factor to give
P(XT +1 = j1 , . . . , XT +n = jn | XT = i, T < ∞, A) = Pi (X1 = j1 , . . . , Xn = jn ).
Example 3.23. (Yet another approach to random walk). Let Sn be the simple random walk
on Z with pi,i+1 = p and pi,i−1 = q = 1 − p for all i. Find the probability that the walk started
at 1 ever hits 0.
We did this before by taking a limit of the gambler’s ruin model. Here is another way.
Solution. Let hi = Pi (hit 0). Then
h0 = 1;
    hi = p hi+1 + q hi−1 ,    i > 0.        (13)
Rather than trying to solve this infinite system of equations directly, we can get a shortcut
using the strong Markov property. Suppose the walk starts at 2. In order to hit 0 it must
first hit 1. But the first hitting time of 1 is a stopping time, so after hitting 1 the walk
behaves like a walk started from 1. Moreover, the probability to hit 1 from 2 is the same as
the probability to hit 0 from 1, by space-homogeneity of the random walk. So
    h2 = P2(hit 0) = P2(hit 1) P1(hit 0) = h1².

Combining this with the equations for h0 and h1 in (13) we have h1 = p h1² + q. Solving this quadratic gives h1 = 1 or h1 = q/p.
Since h1 ≤ 1 this implies that h1 = 1 for all p ≤ 1/2. For p > 1/2 we still need some way
to decide between the two solutions. One way is as follows. Extending the above stopping
time argument gives that hi = h1^i for all i > 0. Either hi = 1 or hi = (q/p)^i gives a solution to (13), but Theorem 3.19 tells us to take the minimal non-negative solution, which is h1 = q/p.
♢
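(Aside: the answer h1 = q/p can also be checked by simulation. A rough sketch, assuming Python; since only finitely many steps are simulated, the estimate slightly undercounts the true hitting probability:)

    import random

    def hits_zero(p, start=1, max_steps=10_000):
        # Run the walk from `start`; report whether it reaches 0 within max_steps.
        x = start
        for _ in range(max_steps):
            x += 1 if random.random() < p else -1
            if x == 0:
                return True
        return False

    p, trials = 0.6, 20_000
    estimate = sum(hits_zero(p) for _ in range(trials)) / trials
    print(estimate, "vs q/p =", (1 - p) / p)   # both close to 0.667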
3.5    Recurrence and transience
Let i ∈ S be a state of a Markov chain X. Let
    Vi = |{n ≥ 0 : Xn = i}| = Σ_{n=0}^∞ I[Xn = i]
be the total number of visits to i ever by X. We say that state i is
recurrent if Pi (Vi = ∞) = 1;
transient if Pi (Vi < ∞) = 1.
Starting from a recurrent state, the chain keeps returning to it forever (almost surely). From
a transient state, the chain eventually wanders off never to return (almost surely).
Example 3.24. Consider the chain with transition diagram:

[Transition diagram omitted: a four-state chain on {1, 2, 3, 4} with arrows labelled 1 and 1/2.]
It is not difficult to check that 1 and 2 are transient while 3 and 4 are recurrent.
Example 3.25. In this chain, every state is transient:

[Transition diagram: states 0, 1, 2, . . . with each arrow i → i + 1 labelled p (self-loops omitted).]
Obvious questions (to be answered) include: is it possible for a state to be neither recurrent
nor transient, i.e. to have infinitely many visits with probability strictly between zero and one?
How can we determine whether a state is recurrent or transient for a simple random walk
(for various p, with or without a barrier)?
Theorem 3.26. Every state is either recurrent or transient. Moreover, we have the following dichotomy. For each state i, either:
(i) i is recurrent, Pi(Vi > 1) = 1, Ei Vi = ∞, and Σ_{n=0}^∞ P^n_{i,i} = ∞;
or (ii) i is transient, Pi(Vi > 1) < 1, Ei Vi < ∞, and Σ_{n=0}^∞ P^n_{i,i} < ∞.
Proof. Let
    Ri = min{n > 0 : Xn = i}
be the first return time to i, and note that Ri is a stopping time. (The only difference
compared with the hitting time Ti = T{i} is that we don’t include n = 0). Consider the two
possibilities: either (i) Pi (Ri < ∞) = 1 or (ii) Pi (Ri = ∞) > 0.
In case (i), clearly Pi (Vi > 1) = 1 (the first visit is at time 0, the second is at time
Ri ). Now apply the strong Markov property at time Ri ; the chain is back at i, so its future
evolution is distributed identically to the original chain. In particular it will again return to
i almost surely. Then we can apply the strong Markov property again at the next return
time, and so on. We conclude that Vi = ∞ almost surely, i.e. i is recurrent. In particular
this implies Ei Vi = ∞.
In case (ii), let q = Pi (Ri < ∞) < 1. Clearly Pi (Vi > 1) = q < 1. If Ri < ∞ then we again
apply the strong Markov property at Ri . Conditional on Ri < ∞, the chain returns to i again
with probability q, and we can apply the strong Markov property again, etc. This continues
until the first time the chain does not return to i. The conclusion is that the number of visits
Vi to i has Geometric distribution with parameter 1 − q. In particular Vi < ∞ almost surely,
so i is transient, and indeed Ei Vi = 1/(1 − q) < ∞.
For the final criterion, in both cases

    Ei Vi = Ei Σ_{n=0}^∞ I[Xn = i] = Σ_{n=0}^∞ Pi(Xn = i) = Σ_{n=0}^∞ P^n_{i,i}.
Example 3.27. Consider simple random walk with parameter p on Z. Clearly every state is
recurrent or every state is transient. From the earlier computations, if p = 1/2 then starting
from 0 the walk returns to 0 a.s. So 0 is recurrent. If p ̸= 1/2 then P0 (return to 0) < 1, so
0 is transient.
3.6    Random walks in d dimensions
We now discuss one of the most important and striking stochastic processes. Recall that the
unrestricted random walk on Z could serve as a model for a particle or organism wandering in a linear
environment. What happens if it is allowed to wander on a plane, or in space?
Let d ≥ 1 be an integer, the number of dimensions. Our state space is
    Z^d = {(x1 , . . . , xd) : xi ∈ Z for all i},

the set of all d-vectors of integers. Let ei = (0, . . . , 0, 1, 0, . . . , 0) be the ith unit vector (with a 1 in coordinate i).
The steps X1 , X2 , . . . of the random walk are i.i.d. random vectors, with Xn taking values

    −e1, +e1, −e2, +e2, . . . , −ed, +ed,

each with probability 1/(2d). Let Sn = Σ_{j=1}^n Xj. Thus the walk starts at the origin 0 = (0, . . . , 0), and moves to a uniformly random nearest neighbour of the current state at each step; (Sn)_{n=0}^∞ is a Markov chain with state space Z^d and transition probabilities p_{x,x±ei} = 1/(2d) and p_{x,y} = 0 otherwise. It is called the simple symmetric random walk in Z^d (or in d dimensions).
For d = 1 this is just the usual simple symmetric walk on Z that we have already seen.
Here are partial pictures for d = 2 and d = 3. The full transition diagram has arrows labelled
1/(2d) pointing in both directions between each pair of nearest neighbours in Z^d.

[Figures omitted: partial transition diagrams for d = 2 (arrows labelled 1/4) and d = 3 (arrows labelled 1/6).]
Theorem 3.28. For the simple symmetric random walk in Zd , every state is recurrent
if d ≤ 2, and every state is transient if d ≥ 3.
Proved by Pólya in 1921, this is one of the most celebrated results of probability theory.
As Kakutani described it, “A drunk man will find his way home, but a drunk bird may get
lost forever.”
An easier variant – diagonal steps
We will prove a slightly easier variant of Theorem 3.28 that nevertheless retains the main
point. To define this variant, let the steps X1 , X2 , . . . be i.i.d. and instead uniformly distributed on the set of 2^d vectors

    {−1, +1}^d = {(x1 , . . . , xd) : xj ∈ {−1, +1} for all j}

(i.e. taking each value in {−1, +1}^d with probability 2^{−d}). Then consider the Markov chain (a non-simple random walk) (Sn) given by Sn = Σ_{i=1}^n Xi. Thus, the steps of this walk are along the main diagonals of the integer grid Z^d. E.g. for d = 2:
[Figure omitted: the diagonal-step walk for d = 2, with each of the four diagonal arrows labelled 1/4.]
The great advantage now is that the d coordinates of a single step X1 :
(X1 )1 , . . . , (X1 )d
are simply i.i.d. random variables taking values −1 and +1 each with probability 1/2.
Theorem 3.29. For the random walk defined above, every state is recurrent if d ≤ 2, and
every state is transient if d ≥ 3.
In the proof we will use Stirling's formula for the factorial:

    n! ∼ √(2πn) e^{−n} n^n    as n → ∞.

Here an ∼ bn ("an is asymptotic to bn") means that an/bn → 1 as n → ∞. It is easy to check that if an ∼ bn and a′n ∼ b′n then an a′n ∼ bn b′n; while if an ∼ bn and an and bn are nonnegative and finite then Σ_n an < ∞ if and only if Σ_n bn < ∞.
Proof of Theorem 3.29. All states are clearly equivalent, so it suffices to establish recurrence or transience of 0. We will use the criterion involving Σ_n P^n_{0,0} from Theorem 3.26.
We first consider the case d = 1. This is just simple symmetric walk on Z. We already argued that this is recurrent by other methods, but now we need more information, to be used to analyse higher d. Since the steps Xi take values ±1, note that Sn has the same parity as n, so P^n_{0,0} = 0 for all odd n. For an even time 2n, the walk is back at 0 if it has taken equal numbers of +1 and −1 steps, so, using Stirling's formula,

    P^{2n}_{0,0} = \binom{2n}{n} (1/2)^{2n} = (2n)! / (n!)² · 2^{−2n}
                 ∼ [√(2π·2n) e^{−2n} (2n)^{2n} 2^{−2n}] / [√(2πn) e^{−n} n^n]² = 1/√(πn)    as n → ∞.

Now Σ_{n=1}^∞ 1/√n = ∞, so

    Σ_{n=0}^∞ P^n_{0,0} = Σ_{n=0}^∞ P^{2n}_{0,0} = ∞,

so 0 is recurrent.
Now we turn to the case d ≥ 2. As observed above, the coordinates of each step are i.i.d., with the same distribution as simple symmetric random walk steps in Z. Therefore, the coordinates of Sn,

    (Sn)1, (Sn)2, . . . , (Sn)d,

are independent, and each has the same distribution as Sn had in the d = 1 case above. Therefore,

    P^{2n}_{0,0} = P0(S2n = 0) = P0((S2n)1 = · · · = (S2n)d = 0) = [\binom{2n}{n} (1/2)^{2n}]^d ∼ 1/(πn)^{d/2}.

(And P^n_{0,0} = 0 for n odd.) Since

    Σ_{n=1}^∞ 1/n^α = ∞ if 0 < α ≤ 1,  and < ∞ if α > 1

(e.g. by the integral test), we conclude that 0 is recurrent if d/2 ≤ 1 and transient if d/2 > 1.
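(Aside: the dichotomy in the proof can be seen numerically. A sketch, assuming Python: the partial sums of P^{2n}_{0,0} = (\binom{2n}{n} 2^{−2n})^d keep growing for d = 1, 2 but level off for d = 3.)

    for d in (1, 2, 3):
        total, r = 0.0, 1.0                  # r = C(2n, n) / 4^n, equal to 1 at n = 0
        for n in range(1, 2001):
            r *= (2 * n - 1) / (2 * n)       # update r from n - 1 to n
            total += r ** d                  # add P^(2n)_{0,0} for the diagonal-step walk
        print(f"d = {d}: partial sum up to n = 2000 is {total:.3f}")
    # Increasing the cutoff makes the d = 1 and d = 2 totals grow without bound
    # (recurrence), while the d = 3 total approaches a finite limit (transience).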
We make a few remarks about the proof for the original simple random walk case.
Proof of Theorem 3.28 (sketch) – not required for the course. The d = 1 case is the
same as the diagonal steps case, so it is recurrent.
The d = 2 case is also the same: from the transition diagram one can see that the
random walk with diagonal steps performs a simple random walk on the even sublattice
{(x, y) ∈ Z² : x + y is even}, which is just a copy of Z² rotated by 45 degrees and enlarged by a factor of √2. More formally, for the simple random walk Sn = (Un , Vn ), the two
processes (Un + Vn )/2 and (Un − Vn )/2 are independent simple symmetric random walks
on Z. Therefore 0 is recurrent.
The case d = 3 is a little harder. We have

    P^{2n}_{0,0} = Σ_{i,j,k≥0: i+j+k=n} (2n)! / (i! j! k!)² · (1/6)^{2n}.

Using some combinatorial tricks and Stirling's formula, it can be shown this is asymptotic to c/n^{3/2} for some c. (See e.g. Norris' book). So Σ_n P^n_{0,0} < ∞, so 0 is transient.
For d ≥ 4 one can compare with d = 3. (See the Problem sheet).
3.7    Communicating classes
We now consider how the states of a Markov chain are related to each other. Consider a
Markov chain with state space S and transition matrix P = (pi,j ). Let i, j ∈ S be states.
We say i leads to j and write i → j if P^n_{i,j} > 0 for some n ≥ 0. I.e., if it is possible to get from i to j in some number of steps. We say that i communicates with j and
to get from i to j in some number of steps. We say that i communicates with j and
write i ↔ j if both i → j and j → i.
Note that i → j is NOT the same thing as the existence of an arrow (with positive label
pi,j ) from i to j in the transition diagram. The latter means it is possible to get from i to j
in ONE step.
Recall that a relation ∼ on a set S is called an equivalence relation if it is reflexive
(i.e. i ∼ i for all i ∈ S), symmetric (i.e. i ∼ j implies j ∼ i), and transitive (i.e. i ∼ j and
j ∼ k imply i ∼ k).
Theorem 3.30. The relation ↔ is an equivalence relation on S.
Proof. Reflexivity follows since P^0_{i,i} = 1 for all i. Symmetry is immediate from the definition of ↔. For transitivity, if i → j and j → k then P^m_{i,j} > 0 and P^n_{j,k} > 0 for some m and n. By the Chapman-Kolmogorov equations, P^{m+n}_{i,k} ≥ P^m_{i,j} P^n_{j,k} > 0, so i → k. By similarly considering the reverse direction, ↔ is transitive.
It follows from the general theory of equivalence relations that the state space is partitioned into equivalence classes, called communicating classes here. I.e. S = C1 ∪ C2 ∪ · · · ,
where the classes Cr are pairwise disjoint, and
i ↔ j if and only if i and j lie in the same class.
Example 3.31. Find the communicating classes, if

         ( 1/2  1/2   0    0    0    0 )
         (  0    0    1    0    0    0 )
    P =  ( 1/3   0    0   1/3  1/3   0 )
         (  0    0    0   1/2  1/2   0 )
         (  0    0    0    0    0    1 )
         (  0    0    0    0    1    0 ).

Solution. This becomes easy when we look at the transition diagram. The actual transition probabilities don't matter here – only whether or not they are positive.

[Transition diagram omitted: the six states 1, . . . , 6 with an arrow i → j wherever pi,j > 0.]
The classes are {1, 2, 3}, {4} and {5, 6}.
♢
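(Aside: for a finite chain the communicating classes can be computed mechanically from reachability. An illustrative sketch, assuming Python, applied to this example:)

    P = [[1/2, 1/2, 0,   0,   0,   0],
         [0,   0,   1,   0,   0,   0],
         [1/3, 0,   0,   1/3, 1/3, 0],
         [0,   0,   0,   1/2, 1/2, 0],
         [0,   0,   0,   0,   0,   1],
         [0,   0,   0,   0,   1,   0]]
    n = len(P)

    # reach[i][j] is True iff i leads to j in zero or more steps (Warshall's algorithm).
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])

    classes = []
    for i in range(n):
        cls = {j for j in range(n) if reach[i][j] and reach[j][i]}
        if cls not in classes:
            classes.append(cls)
    print([{s + 1 for s in c} for c in classes])   # [{1, 2, 3}, {4}, {5, 6}] in 1-based labels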
Theorem 3.32. Recurrence and transience are class properties; that is, in each class,
either all states are recurrent or all states are transient.
Proof. Suppose that i, j are in the same class and j is recurrent. It is enough to show that i is recurrent. Since i ↔ j there exist m, n such that P^m_{i,j}, P^n_{j,i} > 0, so a := P^m_{i,j} P^n_{j,i} > 0. Then

    P^{m+k+n}_{i,i} ≥ P^m_{i,j} P^k_{j,j} P^n_{j,i} = a P^k_{j,j},

so Σ_k P^k_{j,j} = ∞ implies Σ_ℓ P^ℓ_{i,i} = ∞.
A few more definitions. We call a Markov chain irreducible if it has only one communicating class, C = S. We call a class C closed if i ∈ C and i → j imply j ∈ C. I.e., if C is
impossible to escape from. A state i is absorbing if {i} is a closed class (i.e. if pi,i = 1).
We can call a class recurrent or transient according to whether its states are recurrent or
transient. In the case of an irreducible chain we can call the chain itself recurrent or transient.
Example 3.33. In Example 3.31, only {5, 6} is a closed class.
Example 3.34. In the standard gambler’s ruin chain with absorbing barriers at 0 and m,
the classes are {0}, {1, . . . , m − 1}, {m}, with {0} and {m} being closed; states 0 and m are
absorbing.
Example 3.35. The simple random walk on Zd is irreducible.
Theorem 3.36. (a) Every recurrent class is closed. (b) Every finite closed class is recurrent.
Proof. Recall that Vi denotes the total number of visits to i.
(a) Suppose class C is not closed. Then there exist i ∈ C and j ∈
/ C with i → j, i.e. there
exists m with Pi (Xm = j) > 0. But j ̸→ i, so once we visit j we can never get back to i,
therefore
Pi (Xm = j) ≤ Pi (Vi ≤ m) ≤ Pi (Vi < ∞),
hence the last probability is positive, and so i is not recurrent, i.e. C is not recurrent.
(b) Suppose C is finite and closed, and fix i ∈ C. Since the chain must go somewhere,
for some j ∈ C,
0 < Pi (Vj = ∞)
= Pi (hit j)Pj (Vj = ∞),
by the strong Markov property. Therefore Pj (Vj = ∞) > 0, so j is not transient, so it is
recurrent, so C is recurrent.
So in a finite state space, recurrence and transience are easy: a class is recurrent if and
only if it is closed.
Example 3.37. In Example 3.31, 1, 2, 3, 4 are transient and 5, 6 are recurrent.
Example 3.38. Consider the simple random walk with one absorbing barrier on N = {0, 1, 2, . . .}:
p0,0 = 1 and pi,i+1 = p and pi,i−1 = q = 1 − p for all i ≥ 1. Then {0} is a finite closed class,
and therefore recurrent, and the other class {1, 2, . . .} is not closed, therefore it is transient.
Note that this holds regardless of p, even though the behaviour varies drastically: for
p ≤ 1/2 the walk a.s. gets absorbed into 0, while for p > 1/2 it may wander off to infinity –
but in both cases the positive states are transient.
However, these methods cannot resolve every case. Recall:
Example 3.39. The simple symmetric random walk on Zd is recurrent if d ≤ 2 but transient
if d ≥ 3.
Recall that this required a detailed calculation. The simple arguments of this subsection
cannot determine recurrence and transience for an infinite closed class.
We will mostly focus on irreducible Markov chains from now on. If a chain is not irreducible then one may split it into classes and try to consider each class separately.
3.8    Mean return times
We now further distinguish two types of recurrence – this will play an important role in limit
behaviour.
For a Markov chain X, define the return time
Ri = min{n ≥ 1 : Xn = i};
this is the first time the chain visits i except that time 0 is excluded (because of the ≥ 1 in
the definition). Let
µi = Ei (Ri )
be the mean return time to i starting at i. Recall that if i is transient then Pi (Ri = ∞) > 0
so µi = ∞.
If i is recurrent then we call i positive recurrent if µi < ∞, or null recurrent if µi = ∞.
The names will make more sense later. We can think of a null recurrent state as “only
just recurrent” – the chain does return, but it is very slow to do so.
Theorem 3.40. Positive recurrence and null recurrence are class properties.
Proof. See later.
We'll see a systematic way to determine null and positive recurrence later. For now, we
can deal with some examples:
Example 3.41. Show that simple symmetric random walk on Z is null recurrent.
Solution. The chain is irreducible, so all states have the same classification. We saw earlier
that it is recurrent. We also saw that E0 T1 = ∞, so similarly E1 T0 = E−1 T0 = ∞. Therefore,
conditioning on the first step, E0 R0 = 1 + (1/2) E1 T0 + (1/2) E−1 T0 = ∞, so the chain is null
recurrent.
♢
Example 3.42. Show that simple random walk on N with parameter p and a reflecting barrier
at 0 (so that p0,1 = p and p0,0 = 1 − p) is positive recurrent if p < 1/2.
Solution. The chain is irreducible. We saw in Problem sheet 4 that E1 T0 = 1/(q − p) < ∞,
so E0 R0 = 1 + q·0 + p·1/(q − p) = q/(q − p) < ∞, so it is positive recurrent.
♢
3.9    Convergence examples
Our goal is to understand the long-term behaviour of a Markov chain. How does the n-step transition probability P^n_{i,j} behave when n is large?
Example 3.43. Let's look at the chain:

[Transition diagram: states 1, 2, 3 with arrows 1 → 2 (probability 1), 2 → 3 (probability 1/2) and 3 → 1 (probability 1/2); self-loops at 2 and 3 omitted.]
We can compute some large powers by squaring: P² = P P, P⁴ = P² P², etc. This gives, to 3 decimal places,

          ( 0    1    0  )        ( 0    .5   .5  )        ( .25   .375  .375 )
    P  =  ( 0   .5   .5  ),  P² = ( .25  .25  .5  ),  P⁴ = ( .188  .438  .375 ),
          ( .5   0   .5  )        ( .25  .5   .25 )        ( .188  .375  .438 )

          ( .203  .398  .398 )         ( .200  .400  .400 )
    P⁸ =  ( .199  .402  .398 ),  P¹⁶ = ( .200  .400  .400 ).
          ( .199  .398  .402 )         ( .200  .400  .400 )

It appears that P^n_{i,j} → πj as n → ∞, where π = (π1 , π2 , π3 ) = (1/5, 2/5, 2/5). In other words,
the distribution of the state Xn at time n converges to the distribution π on S, regardless
of the initial state i. This indeed holds, as we shall see. Moreover there is a simple way to
determine the limiting distribution π.
However, this kind of convergence does not happen in every case, as the following three
examples show.
Example 3.44.

[Transition diagram: the deterministic 3-cycle 1 → 2 → 3 → 1, each arrow with probability 1.]

Here it is clear that P^n_{1,1} = 1 if n is a multiple of 3, while P^n_{1,1} = 0 for all other n. So P^n_{1,1} does not converge as n → ∞. The obstacle here is called periodicity.
Example 3.45.

[Transition diagram: from state 1, arrows to states 2 and 3 each with probability 1/2; states 2 and 3 are absorbing.]

We can easily see that

    P^n = P = ( 0  1/2  1/2 )
              ( 0   1    0  )
              ( 0   0    1  )

for all n. So P^n_{i,j} does converge, but the limit depends on i as well as j. The limiting distribution depends on the starting state. The issue is that the chain is not irreducible.
Example 3.46. Simple symmetric random walk on Z. We know that P^{2n}_{i,i} ∼ 1/√(πn), so P^{2n}_{i,i} → 0. Also note that P^{2n}_{i,i} ≥ P^n_{i,j} P^n_{j,i} = (P^n_{i,j})², so P^n_{i,j} ≤ √(P^{2n}_{i,i}) → 0 as n → ∞ for all i, j. In other words, P^n_{i,j} → πj where π = (. . . , 0, 0, 0, . . .). This is not a probability distribution on Z, because it sums to 0 rather than 1. Where does the walk end up “at time infinity”? Nowhere – it “diffuses away”, even though it is recurrent! In fact, the issue is that it is null recurrent.
3.10    Stationary distributions
Our goal will be to understand when convergence to a distribution happens, as in Example 3.43, as well as to determine the limiting distribution. We start with the latter
question. The key idea is that any potential limiting distribution needs to remain the same
over time.
A distribution (on the state space S) is any row vector λ = (λi)_{i∈S} such that λi ≥ 0 for all i, and Σ_{i∈S} λi = 1. Recall that if a Markov chain has initial distribution λ then
the distribution after one step is λP (and after n steps, λP n ). A stationary distribution
(sometimes called an invariant or equilibrium distribution) for a Markov chain with transition
matrix P is a distribution π such that
πP = π.
If a Markov chain starts in a stationary distribution, it stays there. Stationary distributions will be candidate limiting distributions.
Example 3.47. Find all stationary distributions in Example 3.43.
Solution. We want to find π = (π1 , π2 , π3 ) such that πP = π, i.e. Σ_i πi pi,j = πj for each j. That is,

                     ( 0    1    0  )
    (π1  π2  π3)     ( 0   .5   .5  )  =  (π1  π2  π3),
                     ( .5   0   .5  )

or

    (1/2)π3 = π1
    π1 + (1/2)π2 = π2
    (1/2)π2 + (1/2)π3 = π3.

This is three equations in three variables, but one equation is redundant. (This will always happen because if π is a solution to πP = π then cπ is another solution for any constant c). We get π2 = π3 = 2π1. But for π to be a distribution we also need π1 + π2 + π3 = 1, which gives π1 + 2π1 + 2π1 = 1, so π1 = 1/5, and thus

    π = (1/5, 2/5, 2/5)
is the unique stationary distribution. (Note that this was the apparent limiting distribution
that we observed).
♢
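(Aside: numerically, one convenient way to find π is to replace one of the redundant equations in πP = π by the normalisation Σ_i πi = 1 and solve the resulting linear system. A sketch, assuming Python with numpy:)

    import numpy as np

    P = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5]])
    n = P.shape[0]

    # pi P = pi is equivalent to (P^T - I) pi^T = 0; swap the last equation
    # for the normalisation sum(pi) = 1 to pin down the solution.
    A = P.T - np.eye(n)
    A[-1, :] = 1.0
    b = np.zeros(n)
    b[-1] = 1.0
    print(np.linalg.solve(A, b))   # [0.2, 0.4, 0.4], i.e. (1/5, 2/5, 2/5)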
Note that in each of the equations

    Σ_{i∈S} πi pi,j = πj ,    j ∈ S,
that make up the identity πP = π, we sum along a column of P , rather than a row as in the
hitting probability and hitting time equations from earlier. In terms of the diagram, we sum
over arrows leading into a state. We can think of it as: πj is the equilibrium probability of
being at j, which can be expressed as a sum over how we got there at the previous step.
It is an exercise to find the stationary distributions for Examples 3.44 and 3.45 – the
former has a unique stationary distribution but the probabilities do not converge to it; the
latter has multiple stationary distributions. We now consider our third “problem example”.
Example 3.48. Find all stationary distributions for Example 3.46, the simple symmetric
random walk on Z.
Solution. The equation πP = π gives

    (1/2)πi−1 + (1/2)πi+1 = πi ,    i ∈ Z,
I.e. πi+1 − 2πi + πi−1 = 0. This is a difference equation we have seen before: the general
solution is πi = A + Bi. We must have B = 0, otherwise πi would be negative for some i. So πi = A, a constant. But we need Σ_{i∈Z} πi = 1, which is impossible: if A = 0 then the sum is
0; if A ̸= 0 then it is infinite. So there is no stationary distribution.
♢
The equations πP = π are linear, but it can be hard or tedious to solve them. Sometimes
we can use the structure of the Markov chain to guess a solution, and check it.
Example 3.49. A microorganism can exist in m different strains. At each generation, it
either stays the same, or with probability p it mutates to a randomly chosen different strain.
Find a stationary distribution.
Solution. Recall the transition matrix

         ( 1 − p      p/(m−1)   · · ·   p/(m−1) )
         ( p/(m−1)    1 − p     · · ·   p/(m−1) )
    P =  (   ⋮                    ⋱        ⋮    )
         ( p/(m−1)    p/(m−1)   · · ·   1 − p   ).

Because of the symmetry between the m states, we can guess that the uniform distribution

    π = (1/m, . . . , 1/m)        (14)

should be stationary. To check this, for any state j,

    Σ_{i∈S} πi pi,j = (m − 1) × (1/m)(p/(m−1)) + (1/m)(1 − p) = 1/m = πj ,
as required.
♢
It is not so obvious from this whether the stationary distribution (14) is unique. One
way to see that it is unique in this case is to consider the two-state chain associated with
the problem, as defined earlier, and solve the equations to get the unique solution π =
(1/m, (m − 1)/m). This tells us that π1 must be 1/m, but the same applies to every state
by symmetry. As we see next, uniqueness is in fact guaranteed for more general reasons.
3.11    Existence and uniqueness of stationary distributions
Here is the key result about stationary distributions.
Theorem 3.50. An irreducible Markov chain has a stationary distribution π if and only
if it is positive recurrent. In that case π is unique, and it is given by

    πi = 1/µi ,    i ∈ S,
where µi is the mean return time of state i.
The formula πi = 1/µi is an unexpected bonus. It should make intuitive sense. If the
initial distribution is π so that the chain is “in equilibrium”, then we expect πi to be the long
term proportion of time spent in state i. This is one over the average time spent between
visits to i.
We will give most but not all details of the proof. As suggested by the examples discussed
above, there are two issues in play when looking for a stationary distribution: finding a
solution to the equation xP = x, and normalizing it to be a probability distribution.
The first step is a straightforward one.
Theorem 3.51. Suppose that the chain is irreducible and that the row vector x = (xi )i∈S
satisfies xi ≥ 0 for all i ∈ S and is a solution to xP = x. If xj > 0 for some j then xi > 0
for all i. If xj < ∞ for some j then xi < ∞ for all i.
Proof. Since xP = x we also have xP^n = x for all n ≥ 1, by induction.
Suppose xj > 0. For any i, by irreducibility, P^m_{j,i} > 0 for some m. Then xi = (xP^m)_i ≥ xj P^m_{j,i} > 0.
Suppose xj < ∞. For any i, by irreducibility, P^n_{i,j} > 0 for some n. Then xj = (xP^n)_j ≥ xi P^n_{i,j}, so xi ≤ xj / P^n_{i,j} < ∞.
We call x positive and finite if 0 < xi < ∞ for all i ∈ S.
Theorem 3.52. If the Markov chain is irreducible and recurrent then any positive finite
solution to the equation xP = x is unique up to a multiplicative constant. I.e. if x and
x′ are two positive finite solutions then x′ = ax for some a ∈ (0, ∞).
Proof. Omitted.
The key idea for the proof of Theorem 3.50 is the following clever construction. Suppose
that the Markov chain X with transition matrix P is recurrent. Fix a state k ∈ S and as
usual let Rk = min{n ≥ 1 : Xn = k} be the return time to k. Define for each i ∈ S,
    ρi(k) = Ek Σ_{n=0}^{Rk−1} I[Xn = i],
and consider the row vector ρ(k) = (ρi (k))i∈S . Thus ρi (k) is the expected number of visits
to i between visits to k (counting one but not both of the visits to k at the ends, in the case
when i = k). Note that ρk (k) = Ek 1 = 1.
Theorem 3.53. If the Markov chain X with transition matrix P is irreducible and recurrent
then, for any fixed k ∈ S, we have 0 < ρi (k) < ∞ for all i ∈ S, and
ρ(k)P = ρ(k).
Proof. (not required for the course). Since the chain is recurrent, under Pk we have
Rk < ∞ and X0 = XRk = k. Note that the event {n ≤ Rk } is the complement of
{Rk ≤ n − 1}, and, since Rk is a stopping time, whether n ≤ Rk can be determined from
X0 , . . . , Xn−1. Now

    ρj(k) = Ek Σ_{n=1}^∞ I[Xn = j, n ≤ Rk]
          = Σ_{n=1}^∞ Pk(Xn = j, n ≤ Rk)
          = Σ_{i∈S} Σ_{n=1}^∞ Pk(Xn−1 = i, Xn = j, n ≤ Rk)
          = Σ_{i∈S} Σ_{n=1}^∞ Pk(Xn−1 = i, n ≤ Rk) pi,j
          = Σ_{i∈S} pi,j Σ_{m=0}^∞ Pk(Xm = i, m ≤ Rk − 1)
          = Σ_{i∈S} pi,j Ek Σ_{m=0}^{Rk−1} I[Xm = i]
          = Σ_{i∈S} pi,j ρi(k),
as required, where the various interchanges of sums and expectations are justified because
the terms are nonnegative, in the 4th equality we used the Markov property, and in the
last equality we used that we may count either the initial or the final visit to k in the
definition of ρi (k).
Since ρk (k) = 1, Theorem 3.51 implies that ρ(k) is positive and finite.
So to get a stationary distribution, we need to normalize ρ(k) to have sum 1:
    πi = ρi(k) / Σ_{j∈S} ρj(k).

This will only work if the sum in the denominator is finite. But we have

    Σ_{j∈S} ρj(k) = Σ_{j∈S} Ek Σ_{n=0}^{Rk−1} I[Xn = j] = Ek Σ_{n=0}^{Rk−1} Σ_{j∈S} I[Xn = j] = Ek Σ_{n=0}^{Rk−1} 1 = Ek Rk = µk ,

the mean return time! So we can normalize if and only if µk < ∞, i.e. if and only if k is
positive recurrent.
Theorem 3.54. For an irreducible, recurrent Markov chain, the following are equivalent:
(i) every state is positive recurrent;
(ii) some state is positive recurrent;
(iii) there is a stationary distribution.
Proof. (not required for the course) Clearly (i) implies (ii).
To show (ii) implies (iii), suppose k is positive recurrent, so µk = Σ_i ρi(k) < ∞. Then πi = ρi(k)/µk gives a stationary distribution.
To show (iii) implies (i), suppose π is an invariant distribution and k is any state. Since π and ρ(k) are positive solutions to xP = x, we have ρ(k) = aπ for some a ∈ (0, ∞). Therefore µk = Σ_i ρi(k) = a Σ_i πi = a < ∞, so k is positive recurrent.
In particular the last result shows that positive and null recurrence are class properties.
(To make this explicit: a recurrent class C ⊆ S is closed, and so in particular pi,j = 0 for all
i ∈ C and j ∉ C. Therefore in C the Markov chain simply behaves like the irreducible chain with transition matrix (pi,j)_{i,j∈C}, but the theorem says that this has all states positive recurrent or all states null recurrent).
Moreover, for an irreducible recurrent chain, the unique stationary distribution π satisfies

    πk = ρk(k)/µk = 1/µk ,
as claimed earlier.
Proof of Theorem 3.50. (not required for the course) If the chain is positive recurrent then we saw above that the unique stationary distribution indeed satisfies πi = 1/µi. If it is null recurrent then the only positive solutions to xP = x are of the form x = aρ(k) with a ∈ (0, ∞), but these have Σ_i xi = aµk = ∞.
Finally suppose the chain is transient. We have Σ_n P^n_{i,i} < ∞, so P^n_{i,i} → 0 as n → ∞ for each i. For any i, j we have P^m_{j,i} > 0 for some m, so P^n_{i,j} ≤ P^{m+n}_{i,i} / P^m_{j,i} → 0 as n → ∞. Therefore if π were a stationary distribution, πj = Σ_i πi P^n_{i,j} → 0, so π = 0.
3.12    Periodicity and the convergence theorem
The one remaining obstacle to convergence is periodicity, which is what went wrong in Example 3.44. For a state i of a Markov chain with transition matrix P we define

    Ji = {n ≥ 1 : P^n_{i,i} > 0},
i.e. the set of times at which it is possible for the chain to return to i.
The period of state i is gcd(Ji ), the greatest common divisor of the set Ji , i.e. the
largest positive integer that is a factor (divisor) of all elements of Ji . The state i is called
aperiodic if it has period 1.
Example 3.55.

[Transition diagram: the deterministic 3-cycle 1 → 2 → 3 → 1, each arrow with probability 1.]
Here J1 = {3, 6, 9, 12, . . .}, so state 1 has period 3.
Example 3.56.
[Transition diagram omitted: a four-state chain on {1, 2, 3, 4} with transitions of probability 1 and 1/2 between neighbouring states.]
Here J1 = {2, 4, 6, 8, . . .}, so 1 has period 2.
Example 3.57.

[Transition diagram omitted: a four-state chain on {1, 2, 3, 4} with transitions of probability 1 and 1/2 and a self-loop at state 4.]
Because of the self-loop at 4, J1 = {2, 4, 6, 7, 8, . . .}, so 1 is aperiodic.
Theorem 3.58. If m, n ∈ Ji then m + n ∈ Ji.
Proof. If P^m_{i,i}, P^n_{i,i} > 0 then P^{m+n}_{i,i} ≥ P^m_{i,i} P^n_{i,i} > 0.
Theorem 3.59. State i is aperiodic if and only if Ji ⊇ {m, m + 1, m + 2, . . .} for some m.
Proof. (not required for the course) This depends on a fact from number theory (not
covered here): if gcd(r1 , . . . , rk) = 1 then there exists m such that any n ≥ m can be expressed as n = Σ_j a_j r_j where the coefficients a_j are nonnegative integers. Then we use
the last theorem.
Theorem 3.60. Aperiodicity is a class property.
Proof. Suppose i is aperiodic and i ↔ j. There exists m such that P^n_{i,i} > 0 for all n ≥ m. Moreover there exist k, ℓ such that P^k_{i,j}, P^ℓ_{j,i} > 0. Then for all n ≥ m,

    P^{ℓ+n+k}_{j,j} ≥ P^ℓ_{j,i} P^n_{i,i} P^k_{i,j} > 0,
which implies that j is aperiodic.
(It can be shown similarly that period is a class property).
Finally we are ready for (arguably) the main result about Markov chains!
Theorem 3.61 (Convergence Theorem). For any irreducible, aperiodic, positive recurrent Markov chain X with any initial distribution, and with stationary distribution π, for
every state j,
P(Xn = j) → πj
as n → ∞.
In particular, taking the initial distribution to be deterministically i, we deduce that

    P^n_{i,j} → πj    as n → ∞
for all states i, j.
Proof of Theorem 3.61. (sketch – not required for the course) We do not give full
details, but we mention the main idea: coupling. We run two copies of the Markov chain
simultaneously. The first, X, starts from an arbitrary initial distribution. The second, Y ,
starts from the stationary distribution π, so that at every time step it is in distribution π.
We let X and Y evolve independently of each other until the first time they meet, i.e. until
the first time T that XT = YT . After time T , X and Y stick together, so that Xn = Yn for
all n ≥ T , both still evolving according to the transition matrix. Under this construction,
the pair (Xn , Yn ) is itself a Markov chain with state space S × S, while (Xn ) and (Yn )
individually are Markov chains with the original transition matrix.
The central calculation is the following comparison between Xn and Yn :
|P(Xn = j) − P(Yn = j)|
= |P(Xn = j, n < T ) + P(Xn = j, n ≥ T ) − P(Yn = j, n < T ) − P(Yn = j, n ≥ T )|
= |P(Xn = j, n < T ) − P(Yn = j, n < T )|
≤ P(T > n).
Since P(Yn = j) = πj for all n, we just need to show that P(T > n) → 0, which by
continuity of probability is equivalent to P(T = ∞) = 0. In other words we must show
that X and Y meet (and hence couple) almost surely.
In fact it is enough to fix one state k ∈ S and show that (Xn , Yn ) hits (k, k) almost
surely. This is done by showing that the chain (Xn , Yn ) is irreducible and recurrent. In
fact it is positive recurrent, which follows from the same fact about the original chain (Xn )
by constructing a stationary distribution. Irreducibility can be deduced from irreducibility
and aperiodicity of (Xn ).
We also have another type of convergence to the stationary distribution, and for this we
do not need aperiodicity.
Theorem 3.62 (Ergodic Theorem). For any irreducible, positive recurrent Markov chain
X with any initial distribution, and with stationary distribution π, for every state j,

    (1/n) |{t = 0, . . . , n − 1 : Xt = j}| → πj    as n → ∞, almost surely.
The left side above is the proportion of time that the chain spends in state j before time n.
Thus the theorem states that the long-term proportion of time spent at j is πj . The proof is
not covered in this course.
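(Aside: the ergodic theorem is easy to see in a simulation. A sketch, assuming Python, for the 3-state chain of Example 3.43, whose stationary distribution is (1/5, 2/5, 2/5):)

    import random

    P = [[0.0, 1.0, 0.0],
         [0.0, 0.5, 0.5],
         [0.5, 0.0, 0.5]]

    def step(i):
        # Sample the next state from row i of P.
        u, c = random.random(), 0.0
        for j, p in enumerate(P[i]):
            c += p
            if u < c:
                return j
        return len(P) - 1

    n_steps, counts, state = 200_000, [0, 0, 0], 0
    for _ in range(n_steps):
        counts[state] += 1
        state = step(state)
    print([c / n_steps for c in counts])   # roughly [0.2, 0.4, 0.4]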
3.13    Reversibility
Finally, we look at an important and useful special class of Markov chains. Suppose P is a
transition matrix and π is a distribution. We say that P and π satisfy the detailed balance
equations if
    πi pi,j = πj pj,i    for all i, j ∈ S.        (15)
If we think of π as an equilibrium distribution then this is a very natural condition. The
term πi pi,j represents the long-term frequency of transitions from i to j, or probabilistic
“traffic” from i to j, and detailed balance says this equals the traffic from j to i. In fact,
detailed balance implies stationarity:
Theorem 3.63. If the transition matrix P and a distribution π satisfy the detailed balance equations then π is a stationary distribution for P .
Proof. This is an exercise on the problem sheet. In fact we do not even need that π is a
distribution – any vector x that satisfies detailed balance will satisfy xP = x.
Note that (15) is a much stronger condition than πP = π; if S is finite then it is |S|²
equations rather than |S| equations. The condition is important for several reasons. When
they hold, the detailed balance equations are often much easier to solve than the stationarity
equation πP = π, because each equation only has two terms. For a general P there is no
guarantee that the detailed balance equations have a solution, but we will see some important
cases where they do. Finally, they imply the following very interesting property.
Theorem 3.64. Suppose π is a distribution that satisfies the detailed balance equations
for P . Let the Markov chain (Xn ) have initial distribution π and transition matrix P .
Then (X0 , . . . , Xn ) has the same distribution as (Xn , . . . , X0 ).
In other words, if we make a video of the Markov chain in equilibrium and play it backwards, it is impossible to tell the difference. In light of this, a Markov chain that satisfies the
conditions of Theorem 3.64 is called reversible.
Proof. By applying the detailed balance condition successively,
    P((X0 , . . . , Xn ) = (i0 , . . . , in )) = πi0 pi0,i1 pi1,i2 · · · pin−1,in
        = pi1,i0 πi1 pi1,i2 · · · pin−1,in
        = pi1,i0 pi2,i1 πi2 · · · pin−1,in
        ⋮
        = pi1,i0 pi2,i1 · · · pin,in−1 πin
        = πin pin,in−1 · · · pi2,i1 pi1,i0
        = P((X0 , . . . , Xn ) = (in , . . . , i0 )).
We discuss two main classes of reversible chains.
Example 3.65 (Birth-death chains). Let S = {0, 1, . . . , m} or S = {0, 1, . . .}, and let
pi,j = 0 for all |i − j| ≥ 2. Thus, the state space is a finite or infinite line, and only
nearest-neighbour transitions are allowed. We can denote the transition probabilities
pi,i+1 = bi
pi,i = si
pi,i−1 = di
for i ∈ S, where bi + si + di = 1 for all i. If we think of the chain as representing the size of
a population model, bi and di are probabilities of a birth and a death when the population is
i.
Provided bi > 0 for i < m and di > 0 for i > 0 the chain is irreducible. In that case it
satisfies the detailed balance equations. We can just check this: the equation πi pi,j = πj pj,i
holds trivially when i = j, and when |i − j| ≥ 2 it is just 0 = 0. Exchanging i and j gives the same equation. So we only have to satisfy the cases j = i + 1:

    π0 b0 = π1 d1
    π1 b1 = π2 d2
    π2 b2 = π3 d3
    ⋮
    πi−1 bi−1 = πi di
    ⋮

The general solution is

    πi = π0 (b0 b1 · · · bi−1) / (d1 d2 · · · di),    i ∈ S.

Of course this may or may not be summable, depending on the parameters, which will determine whether or not the chain is positive recurrent.
Example 3.66. Consider (yet again) simple random walk on N with a reflecting barrier at
0. So pi,i+1 = p for i ≥ 0 and pi,i−1 = q = 1 − p for i ≥ 1. The chain is irreducible, and it
is aperiodic because p0,0 > 0. We know that it is transient when p > 1/2 and recurrent for
p ≤ 1/2. Let us focus on p ≤ 1/2.
Since this is a birth-death chain, we can try to find a stationary distribution using the
detailed balance equations:

    π0 p = π1 q
    π1 p = π2 q
    π2 p = π3 q
    ⋮

giving

    πi = (p/q)^i π0 ,    i ≥ 0.
Normalizing to get a distribution:

    1 = Σ_{i=0}^∞ πi = π0 Σ_{i=0}^∞ (p/q)^i ,

but

    Σ_{i=0}^∞ (p/q)^i = 1/(1 − p/q) if p < 1/2,  and = ∞ if p = 1/2.
So (as seen before), if p = 1/2 there is no stationary distribution so the chain must be
null recurrent.
If p < 1/2 then the chain is positive recurrent with stationary distribution

    πi = (1 − p/q)(p/q)^i ,    i ≥ 0.
(In fact this is the probability mass function of a geometric random variable minus 1). So we
can deduce for instance:

    lim_{n→∞} P(Xn = 0) = π0 = 1 − p/q.
    µ0 = E0(return time to 0) = 1/π0 = 1/(1 − p/q) = q/(q − p).

Also letting t = E1(time to hit 0) we have µ0 = 1 + q·0 + p·t, so

    t = (µ0 − 1)/p = 1/(q − p),
(as we got before by more complicated methods).
Moreover, for p < 1/2 the walk is reversible. This is quite startling. Since p < 1/2 the
walk tends to drift leftwards – it is more likely to decrease than increase. We might expect
that running the video backwards would result in it tending instead to drift to the right. But
this is incorrect because the stationary distribution is concentrated near 0. The leftward drift
is only visible after there has been an unusual rightward excursion. If we focus attention
on such excursions then in the time-reversed picture the drift back toward zero becomes an
unusual rightward excursion, and the excursion becomes drift back to zero.
Even if you buy this explanation it is still startling!
Example 3.67 (Random walk on a graph). Consider a finite graph G = (V, E). V is the
set of vertices, and E is the set of edges, which are unordered pairs of distinct vertices. Two vertices
are called neighbours if there is an edge between them, and the degree deg(v) of a vertex v
is the number of its neighbours. The (simple symmetric) random walk on G is the Markov
chain with state space S = V in which, at each step, we move from the current vertex to a
uniformly random neighbour. Thus,

    pv,w = 1/deg(v) if w is a neighbour of v, and 0 otherwise.
We claim that the chain is reversible with stationary distribution given by

    πv = deg(v) / Σ_{u∈V} deg(u).
The denominator (which incidentally equals 2|E|) is just a normalizing factor to ensure
that π is a distribution – the stationary distribution is proportional to degree. (One way
to understand this is that the walk spends equal time on each edge).
This is almost trivial to check. Writing D = Σ_{u∈V} deg(u), we only need to check the
detailed balance equations when v, w are neighbours:
    πv pv,w = (deg(v)/D) · (1/deg(v)) = (deg(w)/D) · (1/deg(w)) = πw pw,v .
Example 3.68. A chess King is moved about at random on an otherwise empty chess board.
At each step, independently, it is moved to a uniformly random square of those that it can
legally move to.
A chess board is an 8-by-8 grid of squares, and a King may move by one place horizontally,
vertically or diagonally at 45 degrees, so long as it does not leave the board. The King’s
position Xn is a random walk on a graph. The vertex set is V = {1, 2, . . . , 8}2 = {(x, y) :
x, y = 1, . . . , 8}, and two vertices u, v are adjacent if u − v is a non-zero vector with both
coordinates in {−1, 0, +1}.
The degree of a vertex is 3 if it is a corner, 5 if it is on an edge, and 8 in the interior, so
the sum of all degrees is
    4 × 3 + (4 × 6) × 5 + 6² × 8 = 420.
The chain is irreducible and aperiodic (check this!), so the convergence theorem and ergodic theorem apply. So e.g. for a corner c, as n → ∞,
P(Xn = c) → πc = deg(c)/420 = 3/420 = 1/140;
and the expected time starting from c to return to c is µc = 1/πc = 140; and the long-term
proportion of time spent in corners is 4 × 1/140 = 1/35.
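(Aside: the degree bookkeeping is easily verified by a short computation. A sketch, assuming Python:)

    squares = [(x, y) for x in range(1, 9) for y in range(1, 9)]

    def degree(v):
        # Number of squares a King can move to from v on an 8-by-8 board.
        x, y = v
        return sum(1 for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                   if (dx, dy) != (0, 0) and 1 <= x + dx <= 8 and 1 <= y + dy <= 8)

    D = sum(degree(v) for v in squares)
    print(D)                       # 420
    print(degree((1, 1)) / D)      # 3/420 = 1/140, stationary probability of one corner
    print(4 * degree((1, 1)) / D)  # 1/35, long-run proportion of time in corners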
4    Poisson Process
4.1    Introduction
This will be our first continuous-time stochastic process. Time t will be a real parameter with
0 ≤ t < ∞. We want to model a series of occurrences (often called arrivals) that happen
at random times, such as: emissions from a radioactive source, meteor impacts, financial
crashes, rain storms, rain drops, road accidents. We don’t distinguish between different
arrivals (for now) – we are just interested in their times. The Poisson process is the most
basic such model, in which arrivals are equally common at all times, and past arrivals provide
no information about future ones.
Rather than the arrival times themselves (which would be a random set), we focus on a
random function (stochastic process) that encapsulates the same information: the number of
arrivals up to and including time t, which we denote N (t). For s < t, the number of arrivals in
the time interval (s, t] is then the increment N (t) − N (s) of N over the interval.
Here is what a graph of the function N might look like. It jumps up by 1 every time
there is an arrival. Because of the convention to count arrivals up to and including time t,
it takes the upper of the two values at the time of the jump itself; this is the meaning of the
solid (versus open) discs in the picture.
[Figure omitted: a sample path of N(t), a right-continuous step function that jumps by 1 at each arrival time.]
We will end up with three equivalent definitions.
Definition 1. A stochastic process (N (t))t≥0 is a Poisson process of rate λ > 0 if:
1. N (0) = 0, N (t) is integer-valued, and the function N is right-continuous;
2. N has independent increments (over disjoint intervals); i.e. for any 0 ≤ t0 ≤
t1 ≤ · · · ≤ tn , the random variables
N (t1 ) − N (t0 ), N (t2 ) − N (t1 ), . . . , N (tn ) − N (tn−1 )
are independent;
3. for any t ≥ 0,

    P(N(t + h) − N(t) = n) = 1 − λh + o(h)   if n = 0;
                           = λh + o(h)       if n = 1;
                           = o(h)            if n ≥ 2,

as h ↓ 0.
The process N jumps up by 1 whenever there is an arrival. Right-continuity just means
that we choose the upper value at the time of the jump. Independent increments means that
the numbers of arrivals in disjoint time intervals are independent. The “little-oh” notation
o(h) means a function f (h) (perhaps a different one at each place where the notation appears)
such that f (h)/h → 0 as h → 0, in other words a term of strictly lower order than h (such
as h2 or h3 ). (Not to be confused with “big-oh” notation – O(h) means a function g(h) such
that g(h)/h ≤ const.) Condition 3 says that the probability of an arrival in a short interval
is approximately the rate times its length, with two or more arrivals being much less likely.
Our first observation is a version of time-homogeneity: the process of arrivals started from
some later time s looks like the original process.
Theorem 4.1. Let (N (t))t≥0 be a Poisson process of rate λ and fix s ≥ 0. The process
(N (s + t) − N (s))t≥0 is a Poisson process of rate λ.
Proof. Let M (t) = N (s + t) − N (s). We check that it satisfies the conditions 1–3.
Since N is integer valued and right continuous, the same holds for M . We have M (0) =
N (s + 0) − N (s) = 0, so condition 1 holds.
The increment of M over an interval (ti , ti+1 ] is M (ti+1 ) − M (ti ) = N (s + ti+1 ) − N (s) −
N (s + ti ) + N (s) = N (s + ti+1 ) − N (s + ti ), which is the increment of N over (s + ti , s + ti+1 ].
Therefore, since N has independent increments, so does M – this is condition 2.
By similar reasoning,

    P(M(t + h) − M(t) = n) = P(N(s + t + h) − N(s + t) = n) = 1 − λh + o(h)   if n = 0;
                                                            = λh + o(h)       if n = 1;
                                                            = o(h)            if n ≥ 2,

so condition 3 is satisfied.
4.2    Increment distribution
A Poisson process is a family of random variables N (t) indexed by t. What are their distributions?
Theorem 4.2. If (N (t))t≥0 is a Poisson process of rate λ then N (t) is a Poisson random
variable with parameter (mean) λt.
Proof. Write pj (t) = P(N (t) = j) for the probability mass function of N (t). We will derive
an expression for the derivative of pj (t) with respect to t. Consider a (small) parameter h > 0
and write Y = N (t + h) − N (t) for the number of arrivals from t to t + h. Then
    pj(t + h) = P(N(t + h) = j) = Σ_{i=0}^j P(Y = i, N(t) = j − i)
              = Σ_{i=0}^j P(Y = i) pj−i(t)
              = (1 − λh) pj(t) + λh pj−1(t) + o(h)

as h ↓ 0, for j ≥ 1. So

    [pj(t + h) − pj(t)] / h = −λ pj(t) + λ pj−1(t) + o(h)/h.

Taking the limit h ↓ 0 gives

    p′j(t) = −λ pj(t) + λ pj−1(t),    j ≥ 1.        (16)
In the case j = 0 we have P(N(t) = j − 1) = 0, so we obtain by a similar argument

    p′0(t) = −λ p0(t).        (17)

Since N(0) = 0 a.s., we have the boundary condition

    pj(0) = 1 if j = 0,  and 0 if j ≥ 1.        (18)
We need to solve the family of linked differential equations (16),(17). One way is to solve
(17) for p0 , then substitute into the j = 1 equation and find p1 , then find p2 , etc. Another
way is to introduce the probability generating function of N(t):

    G(s, t) = E(s^{N(t)}) = Σ_{j≥0} pj(t) s^j.
Multiplying (16) by s^j and summing over j, together with (17) for j = 0, gives

    Σ_{j≥0} s^j p′j(t) = −λ Σ_{j≥0} s^j pj(t) + λ Σ_{j≥1} s^j pj−1(t),

i.e.

    ∂G/∂t = −λG + λsG = λ(s − 1)G.
(We omit the details needed to make this fully rigorous, including differentiating the power
series term-by-term).
This differential equation can be solved by multiplying by the integrating factor e−λ(s−1)t :
    e^{−λ(s−1)t} ∂G/∂t − e^{−λ(s−1)t} λ(s − 1)G = 0;

    ∂/∂t [ e^{−λ(s−1)t} G ] = 0.
After using the boundary condition (18) to get G(s, 0) = 1, we deduce that
G = eλ(s−1)t ,
which we recognize as the probability generating function of a Poisson(λt) variable.
Since (N (t + s) − N (s))t≥0 is a Poisson process of rate λ (Theorem 4.1) we immediately
deduce from Theorem 4.2 that N (t + s) − N (s) is also Poisson(λt).
Example 4.3. If (N (t))t≥0 is a Poisson process of rate λ, find the covariance Cov(N (s), N (t)).
Solution. Recall that Cov(X, Y ) = E(XY ) − (EX)(EY ). Suppose that s ≤ t. Since N (s)
and N (t) are not independent, we split N (t) up into the increments over (0, s] and (s, t]:
    E(N(s)N(t)) = E[N(s)(N(s) + N(t) − N(s))] = E(N(s)²) + E[N(s)(N(t) − N(s))].
Now N (s) is Poisson(λs), and its second moment E(N (s)2 ) is the variance plus the mean
squared; and the increments N (s) and N (t) − N (s) are independent, so the expectation
factorizes:
    E(N(s)N(t)) = λs + (λs)² + λs·λ(t − s) = λs + λ²st.

So Cov(N(s), N(t)) = λs + λ²st − λs·λt = λs.
Since this was under the assumption s ≤ t, in general we have
Cov(N (s), N (t)) = λ min(s, t).
♢
We can now give an alternative definition of the Poisson process.
Definition 2. A stochastic process (N (t))t≥0 is a Poisson process of rate λ > 0 if:
1. N (0) = 0, and N is integer-valued and right-continuous;
2. N has independent increments.
3. for any s ≥ 0 and t > 0, N (s + t) − N (s) has Poisson(λt) distribution.
Only condition 3 has changed. This version is cleaner, but the previous infinitesimal definition is in a sense more fundamental.
We have seen that Definition 1 implies Definition 2. On the other hand if N (s+h)−N (s) ∼
Poi(λh) then

    P(N(s + h) − N(s) = n) = e^{−λh} = 1 − λh + o(h)            if n = 0;
                           = e^{−λh} λh = λh + o(h)             if n = 1;
                           = e^{−λh} (λh)^n / n! = o(h)         if n ≥ 2,

as h ↓ 0. So the two definitions are equivalent.
4.3    Arrival times
Next we consider the arrival times themselves as random variables. Let
Tj = inf{t ≥ 0 : N (t) = j}
for j ≥ 1; this is the jth arrival time. We could also have written it as min{t ≥ 0 : N (t) = j}
or inf{t ≥ 0 : N (t) ≥ j} – convince yourself that these are all equal. Also for convenience
set T0 = 0, and define
Xj = Tj − Tj−1
for j ≥ 1; this is the jth inter-arrival time, i.e. the time between the (j − 1)th and jth
arrivals (or just the 1st arrival time in the case j = 1).
Here is an illustration of these definitions.
[Figure omitted: a sample path of N(t) with the arrival times T1 , T2 , T3 , T4 and the inter-arrival times X1 , X2 , X3 , X4 marked along the time axis.]
Theorem 4.4. The inter-arrival times X1 , X2 , . . . of a Poisson process of rate λ are i.i.d.
Exponential random variables with parameter λ.
Proof. First consider X1 :
P(X1 > t) = P(N (t) = 0) = e−λt ,
so X1 ∼ Exp(λ).
Moving on to X2 :
P(X2 > t | X1 = t1 ) = P(no arrival in (t1 , t1 + t] | X1 = t1 ).
Since the event X1 = t1 depends only on arrivals in [0, t1 ], while the other event depends on
arrivals after t1 , the independent increments property gives that this equals the unconditional
probability
P(no arrival in (t1 , t1 + t]) = e−λt ,
so X2 is Exp(λ) and independent of X1 . (To make this fully rigorous actually needs a bit
more care, because the event X1 = t1 has probability zero. One way to do this is to consider
the limit of a small interval; we omit the details).
Similarly for the general case,
P(Xn > t | X1 = t1 , . . . , Xn−1 = tn−1 )
= P(no arrival in (t1 + · · · + tn−1 , t1 + · · · + tn−1 + t] | X1 = t1 , . . . , Xn−1 = tn−1 )
= P(no arrival in (t1 + · · · + tn−1 , t1 + · · · + tn−1 + t])
= e−λt ,
so Xn is Exp(λ) and independent of X1 , . . . , Xn−1 .
Unpacking the definitions back to N gives another equivalent definition, this time a constructive one (meaning that it gives a way to generate the process rather than just a description of its properties).
Definition 3. Let X1 , X2 , . . . be i.i.d. Exponential random variables with parameter λ. Let Tn = Σ_{i=1}^n Xi and let

    N(t) = max{n : Tn ≤ t},    t ≥ 0.

The stochastic process (N(t))t≥0 is a Poisson process of rate λ > 0.
We have already seen that Definitions 1 and 2 are equivalent and they imply that N has
the properties in Definition 3. Because Definition 3 clearly uniquely specifies the distribution
of the (N (t))t≥0 , the three definitions are equivalent.
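(Aside: Definition 3 translates directly into a way of simulating the process. A sketch, assuming Python; the function name below is ours, purely for illustration:)

    import random

    def poisson_arrival_times(rate, t_max):
        # Return the arrival times T_1 < T_2 < ... that fall in [0, t_max],
        # built from i.i.d. Exp(rate) inter-arrival times.
        arrivals, t = [], 0.0
        while True:
            t += random.expovariate(rate)
            if t > t_max:
                return arrivals
            arrivals.append(t)

    arrivals = poisson_arrival_times(rate=2.0, t_max=10.0)
    print(len(arrivals))   # N(10), which has the Poisson distribution with mean 20
    print(arrivals[:5])    # the first few arrival times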
One simple consequence is a scaling property.
Theorem 4.5. If (N (t))t≥0 is a Poisson process of rate λ, and a ∈ (0, ∞) is a constant, then
(N (at))t≥0 is a Poisson process of rate aλ.
Proof. If (Tj )j≥0 and (Xj )j≥1 are the arrival and inter-arrival times of N , then the jth arrival
time of the scaled process (N (at))t≥0 is
inf{t ≥ 0 : N(at) = j} = (1/a) inf{at ≥ 0 : N(at) = j} = Tj / a,
and so the inter-arrival times of the scaled process are (Xj/a), which are i.i.d. To find their distributions, note that if X ∼ Exp(λ) then X/a ∼ Exp(aλ), because P(X/a > t) = P(X > at) = e^{−λat}.
We also note a connection with the well-known memoryless property of the Exponential
distribution:
Theorem 4.6 (Memoryless property). If X is an Exponential random variable then for any
s, t > 0,
P(X > s + t | X > s) = P(X > t).
If X is interpreted as a waiting time, this says that given that you have already waited time
s, the probability of waiting a further time t is exactly the same as the original probability
of waiting t. The distribution is memoryless in the sense that knowing we have already waited a
certain time doesn’t affect the distribution of the remaining time. This result is very easy to
prove directly, but it is also instructive to give an argument in terms of the Poisson process.
Proof. Consider a Poisson process N of rate λ and let T1 be the first arrival time, which has
Exp(λ) distribution. Then, using the independent increments property and time homogeneity,
P(T1 > t + s | T1 > s) = P(N (s + t) = 0 | N (s) = 0)
= P(N (s + t) − N (s) = 0 | N (s) = 0)
= P(N (s + t) − N (s) = 0)
= P(N (t) = 0) = P(T1 > t).
Time homogeneity combined with i.i.d. inter-arrival times has the following quite surprising consequence. If we start observing a Poisson process at a fixed time s, then we just see
a Poisson process of arrivals. In particular, the time until the first arrival after s is Exp(λ),
as well as the subsequent inter-arrival times. But viewed in terms of the original process, s
is in between two arrivals, so we might expect the time until the first arrival after s to be
typically shorter than Exp(λ), because it is only part of the inter-arrival time that contains s.
The memoryless property indicates the solution to the apparent paradox: the inter-arrival time containing s is special precisely because we require it to contain s, and this makes it larger than a typical inter-arrival time. If we condition on how much of it happened before s, the rest is
still just Exponential.
4.4 Superposition and thinning
We next consider several Poisson processes together.
Theorem 4.7 (Superposition). If (M (t))t≥0 and (N (t))t≥0 are independent Poisson processes of rates λ and µ respectively then (M (t) + N (t))t≥0 is a Poisson process of rate
λ + µ.
Proof. See the problem sheet.
In other words, if we have two different types of arrivals, occurring as independent Poisson
processes at the same time, then the process of both types of arrival together is also Poisson.
Theorem 4.8 (Thinning). Let (N(t))t≥0 be a Poisson process of rate λ and let Z1, Z2, . . . be i.i.d. taking values 1 with probability p and 0 with probability 1 − p, independent of N.
Let
M(t) = ∑_{i=1}^{N(t)} Zi.
Then (M (t))t≥0 is a Poisson process of rate pλ.
Proof. See the problem sheet.
In this case the interpretation is that we “thin” the stream of arrivals by independently
keeping or deleting each one according to a (biased) coin flip. The remaining arrival process
is Poisson, with the rate being reduced by a factor of the retention probability p.
We can take this further: let
L(t) = N(t) − M(t) = ∑_{i=1}^{N(t)} (1 − Zi);
then (L(t))t≥0 is the process of deleted arrivals of the original process. By applying Theorem 4.8 to (1 − Zi ) we see that (L(t))t≥0 is a Poisson process of rate qλ, where q = 1 − p.
Perhaps more surprisingly, we have the following.
Theorem 4.9. The processes (M (t))t≥0 and (L(t))t≥0 are independent.
Proof. (partial) We first establish that for any fixed t, the random variables M (t) and L(t)
are independent. (This is a much weaker statement than what we need, but it turns out to
be the key step). We have
P(M(t) = m, L(t) = ℓ) = P(M(t) = m, L(t) = ℓ | N(t) = m + ℓ) P(N(t) = m + ℓ)
= ((m + ℓ)! / (m! ℓ!)) p^m q^ℓ × e^{−λt} (λt)^{m+ℓ} / (m + ℓ)!
= e^{−λpt} (λpt)^m / m!  ×  e^{−λqt} (λqt)^ℓ / ℓ!,
as required.
Using this and independent increments, one can deduce that, for t1 ≤ t2 ≤ · · · ≤ tn , the
2n random variables
M (t1 ) − M (0), M (t2 ) − M (t1 ), . . . , M (tn ) − M (tn−1 ),
L(t1 ) − L(0), L(t2 ) − L(t1 ), . . . , L(tn ) − L(tn−1 )
are all independent (we omit the details of the argument). This implies that the two vectors
(M (t1 ), . . . , M (tn )) and (L(t1 ), . . . , L(tn ))
are independent. Using general stochastic process technology (not covered in this course),
this is enough to deduce that the two processes are independent.
In summary, the superposition and thinning pictures are really equivalent. In scenario 1
we consider a Poisson process of “red” arrivals of rate pλ and an independent Poisson process
of “blue” arrivals of rate (1−p)λ together. In scenario 2 we consider a Poisson process of rate
λ, and independently colour each arrival red with probability p and blue with probability
1 − p. The two resulting processes of arrivals with colours have the same distribution.
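The equivalence of the two scenarios can be checked empirically. The following sketch (Python with numpy; the variable names and parameter values are illustrative, not from the notes) colours each of the N(t) arrivals red with probability p and verifies that the red and blue counts have means pλt and (1 − p)λt and are essentially uncorrelated, as Theorem 4.9 asserts for a fixed t.

import numpy as np

rng = np.random.default_rng(1)
lam, p, t, trials = 3.0, 0.4, 5.0, 20000

red_counts, blue_counts = [], []
for _ in range(trials):
    n = rng.poisson(lam * t)            # N(t) ~ Poisson(lambda * t)
    colours = rng.random(n) < p         # colour each arrival red with probability p
    red_counts.append(colours.sum())
    blue_counts.append(n - colours.sum())

red, blue = np.array(red_counts), np.array(blue_counts)
print("mean red :", red.mean(),  " vs p*lambda*t     =", p * lam * t)
print("mean blue:", blue.mean(), " vs (1-p)*lambda*t =", (1 - p) * lam * t)
print("sample correlation red/blue:", np.corrcoef(red, blue)[0, 1], "(should be near 0)")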
Focussing on the first arrival, we get the following useful fact.
Theorem 4.10 (Exponential races). Let T1, . . . , Tn be independent Exponential random variables with parameters λ1, . . . , λn. Let
M = min(T1 , . . . , Tn ),
and let J ∈ {1, . . . , n} be such that TJ = M (i.e. the index of the minimal Tj ). Then
(i) M ∼ Exp(∑_i λi);
(ii) P(J = j) = λj / ∑_i λi;
(iii) J and M are independent.
It may help to think of T1 , . . . , Tn as the finishing times of competitors in a race. Again,
this is quite straightforward to prove directly, but it is illuminating to do it using Poisson
processes.
Proof. Consider a Poisson process of rate λ := ∑_i λi, and independently label each arrival
“type j” with probability pj := λj /λ. By thinning (applied iteratively), the type j arrivals
form a Poisson process of rate λj , and different types form independent arrival processes. Let
Tj be the time of the 1st type j arrival, which is Exp(λj); the (Tj) are independent. Then M
is the time of the first arrival, which is Exp(λ). From the definition of the thinning process
we have P(J = j) = pj , and J and M are independent.
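Here is a quick empirical check of Theorem 4.10 (a sketch in Python with numpy; the rates chosen are arbitrary): the minimum should be Exp(∑ λi), the winner should be index j with probability λj/∑ λi, and the winning time should not depend on who wins.

import numpy as np

rng = np.random.default_rng(2)
lams = np.array([1.0, 2.0, 3.0])
trials = 50000

T = rng.exponential(1 / lams, size=(trials, len(lams)))  # T_j ~ Exp(lambda_j)
M = T.min(axis=1)                                        # the winning time
J = T.argmin(axis=1)                                     # index of the winner

print("mean of M:", M.mean(), " vs 1/sum(lambda) =", 1 / lams.sum())
print("P(J = j) :", np.bincount(J) / trials, " vs ", lams / lams.sum())
# independence check: the winning time should have the same mean whoever wins
print("mean of M given J=0:", M[J == 0].mean(), " given J=2:", M[J == 2].mean())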
4.5 Conditioning
Various conditional distributions involving the Poisson process have very neat forms.
Theorem 4.11. Let (N (t)) be a Poisson process. Given N (t) = 1, the conditional distribution of the first arrival time T1 is Uniform[0, t].
Note that the rate does not appear in this statement.
Proof. We compute the conditional distribution function. For 0 ≤ u ≤ t,
P(T1 ≤ u | N(t) = 1) = P(T1 ≤ u, N(t) = 1) / P(N(t) = 1)
= P(N(u) = 1, N(t) − N(u) = 0) / P(N(t) = 1)
= e^{−λu} λu e^{−λ(t−u)} / (e^{−λt} λt)
= u / t.
More generally, what is the distribution of the first n arrivals given that N (t) = n? The
answer is basically that they are i.i.d. uniform, but there is a complication that they need to
be ordered.
Let U1 , . . . , Un be i.i.d. Uniform[0, t]. Let U(1) < U(2) < · · · < U(n) be U1 , . . . , Un rearranged into increasing order. They are called the order statistics of U1 , . . . , Un . So in
particular, U(1) = min(U1 , . . . , Un ), U(n) = max(U1 , . . . , Un ), and U(k) is the kth smallest of
U1 , . . . , Un .
Theorem 4.12. If U1 , . . . , Un are i.i.d. Uniform[0, t], their order statistics U(1) , . . . , U(n) have
joint probability density function
f(x1, . . . , xn) = n! / t^n   for 0 < x1 < x2 < · · · < xn < t,
and f(x1, . . . , xn) = 0 otherwise.
Proof. Fix 0 < x1 < · · · < xn < t and take h1, . . . , hn small enough that the intervals Ii = (xi, xi + hi) are pairwise disjoint. Then
P(U(1) ∈ I1 , . . . , U(n) ∈ In ) = P(U1 ∈ I1 , U2 ∈ I2 , . . . , Un ∈ In )
+ P(U2 ∈ I1 , U1 ∈ I2 , . . . , Un ∈ In )
+ ···
+ P(Un ∈ I1 , Un−1 ∈ I2 , . . . , U1 ∈ In ),
where we sum over all n! possible permutations of 1, . . . , n. This equals
n! (h1/t) · · · (hn/t) = (n!/t^n) h1 · · · hn;
now divide by h1 · · · hn and let the hi ’s tend to 0 to get the density function. Finally, the
density is clearly 0 outside the given range.
Theorem 4.13. Let (N (t))t≥0 be a Poisson process. Given N (t) = n, the conditional distribution of the arrival times T1 , . . . , Tn is the distribution of the order statistics
U(1) , . . . , U(n) of n i.i.d. Uniform[0, t] random variables.
Proof. We take small intervals Ij as in the last proof, and let h = h1 + · · · + hn . Then
P(T1 ∈ I1, . . . , Tn ∈ In | N(t) = n)
= P(T1 ∈ I1, . . . , Tn ∈ In, N(t) = n) / P(N(t) = n)
= P(1 arrival in each of I1, . . . , In, none in (0, t] \ ∪i Ii) / P(N(t) = n)
= (λh1 + o(h1)) · · · (λhn + o(hn)) e^{−λ(t−h)} / (e^{−λt} (λt)^n / n!)
= (n!/t^n) h1 · · · hn + o(h1 · · · hn)
as h1, . . . , hn → 0, where in the last step we used that e^h = 1 + h + o(h) (imagine multiplying
out the product).
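As an illustration of Theorem 4.13, here is a simulation sketch (Python with numpy; the parameter choices are mine, not from the notes): generate arrivals on [0, t], keep only the runs with N(t) = 3, and compare the average first arrival time with the mean of the minimum of three Uniform[0, t] variables, which is t/4.

import numpy as np

rng = np.random.default_rng(3)
lam, t, n_target, trials = 1.5, 4.0, 3, 50000

first_arrivals = []
for _ in range(trials):
    arrivals, s = [], 0.0
    while True:                          # arrivals in [0, t] via Exp(lam) gaps
        s += rng.exponential(1 / lam)
        if s > t:
            break
        arrivals.append(s)
    if len(arrivals) == n_target:        # condition on N(t) = 3
        first_arrivals.append(arrivals[0])

print("E(T1 | N(t)=3):", np.mean(first_arrivals),
      " vs E(min of 3 uniforms on [0,t]) = t/4 =", t / 4)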
Example 4.14. For a Poisson process (N (t))t≥0 of rate λ, find E(N (t) | N (s)).
Solution. First suppose s ≤ t. Then, using the independent increments property,
E(N (t) | N (s) = n) = E(N (t) − N (s) | N (s) = n) + E(N (s) | N (s) = n)
= E(N (t) − N (s)) + n
= λ(t − s) + n.
Now suppose s ≥ t. Conditional on N (s) = n, T1 , . . . , Tn are distributed as the order
statistics of U1 , . . . , Un , say, which are i.i.d. Uniform on [0, s]. Each Ui lies in [0, t] with
probability t/s, so the number of them that lie in [0, t] is Binomial(n, t/s). So
E(N(t) | N(s) = n) = n t/s.
In conclusion,
E(N(t) | N(s)) = N(s) + λ(t − s) for s ≤ t, and E(N(t) | N(s)) = N(s) t/s for s ≥ t.
♢
4.6 Inhomogeneous and Multidimensional Poisson processes
We briefly mention two extensions.
Inhomogeneous process. Let λ(t) be a function (assumed continuous for simplicity).
A process (N (t))t≥0 is an inhomogeneous Poisson process with rate function λ(t) if
it satisfies conditions 1 (basic properties) and 2 (independent increments) in Definition 1,
together with
P(N (t + h) − N (t) = 1) = λ(t)h + o(h),
and analogously for the cases 0 and n ≥ 2.
One can show that N (t) has a Poisson distribution with parameter
∫_0^t λ(s) ds.
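A standard way to simulate such a process (not covered in the notes; the sketch below assumes the rate function is bounded by some λmax) is to thin a homogeneous process of rate λmax, keeping an arrival at time s with probability λ(s)/λmax. The empirical mean count over [0, t] should then match ∫_0^t λ(s) ds.

import numpy as np

rng = np.random.default_rng(4)

def inhomogeneous_poisson(rate_fn, lam_max, t_end, rng):
    """Thinning: simulate a homogeneous process of rate lam_max on [0, t_end],
    keeping an arrival at time s with probability rate_fn(s) / lam_max."""
    times, s = [], 0.0
    while True:
        s += rng.exponential(1 / lam_max)
        if s > t_end:
            return np.array(times)
        if rng.random() < rate_fn(s) / lam_max:
            times.append(s)

rate = lambda s: 1 + np.sin(s)            # example rate function, bounded by 2
counts = [len(inhomogeneous_poisson(rate, 2.0, 6.0, rng)) for _ in range(5000)]
# the integral of (1 + sin s) over [0, 6] is 6 + 1 - cos(6)
print("mean count:", np.mean(counts), " vs integral of the rate =", 6 + 1 - np.cos(6))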
Higher dimensions. Here we want to model random points in Rd . Imagine the locations
of meteor impacts, seedlings, or rain drops.
For every (well behaved) set A ⊂ Rd we have a random variable N (A) which represents
the number of points that fall in A. The collection (N (A))A⊂Rd of all these variables is a
Poisson point process in Rd with rate (or intensity) λ if it satisfies:
1. N is {0, 1, 2, . . .} ∪ {∞}-valued, N(∅) = 0, and
N(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ N(Ai)
for any pairwise disjoint A1, A2, . . .;
2. N (A1 ), . . . , N (An ) are independent for any pairwise disjoint A1 , . . . , An ;
3. N (A) has Poisson distribution with parameter λ vol(A) for any A with 0 < vol(A) < ∞.
Here vol(A) is the volume of the set A.
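The definition suggests a direct way to simulate the process on a bounded region: draw a Poisson(λ vol(A)) number of points and scatter them uniformly over A. The uniform-scattering step is the standard multidimensional analogue of the conditioning result of Section 4.5; the sketch below (Python with numpy; the function name is my own) is one way to do it for a rectangle.

import numpy as np

rng = np.random.default_rng(5)

def poisson_points_2d(lam, width, height, rng):
    """Sample a rate-lam Poisson point process on [0, width] x [0, height]:
    the number of points is Poisson(lam * area), and given the count the
    points are placed i.i.d. uniformly over the rectangle."""
    n = rng.poisson(lam * width * height)
    xs = rng.uniform(0, width, n)
    ys = rng.uniform(0, height, n)
    return np.column_stack([xs, ys])

pts = poisson_points_2d(lam=2.0, width=3.0, height=2.0, rng=rng)
print("number of points:", len(pts), " (mean should be lam * area = 12)")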
5 Continuous-time Markov chains and queues
5.1 Introduction
We introduce Markov chains in continuous time. We will not give a fully comprehensive
treatment of this subject; rather we will focus on important highlights. Much of the theory
will mirror, and piggy-back on, the discrete-time case. In particular we will address special
cases that arise in queueing theory, a central topic of applied probability.
Here is a transition diagram for a continuous-time Markov chain.
(Transition diagram on states a, b, c, d: a → b at rate 1/2; b → a at rate 1 and b → c at rate 1; c → a at rate 1 and c → d at rate 2; no transitions out of d.)
The state space is a countable set, in this case S = {a, b, c, d}. Time is now a real variable
t ∈ [0, ∞), and the Markov chain will be denoted (X(t))t≥0 , where X(t) is the state at time
t.
The edge labels represent rates of moving from one state to another. The meaning is,
for example: when the current state is a, the process waits a random time with Exp(1/2)
distribution, then moves to b. At c, we simultaneously start independent Exp(1) and Exp(2)
clocks corresponding to the edges leading to a and d respectively. We jump along the edge
whose clock goes off first. In other words, (by the exponential races theorem) we wait a
random time with Exp(3) distribution, and then jump to a with probability 1/3 or d with
probability 2/3, independently of the time we waited. At d the chain stays in place forever,
and never jumps.
Rather than a transition matrix, the parameters of a continuous-time Markov chain with
state space S are summarized by its generator matrix. This is an |S| by |S| matrix
G = (gi,j )i,j∈S with entries:
gi,j = rate of going from i to j, for i ≠ j,
and gi,i = − ∑_{j≠i} gi,j.
In other words, the diagonal entry gi,i is minus the total rate of leaving i, i.e. minus the rate
of the exponential time until the first jump out of i. So each row sums to 0. (The reason for
this choice of diagonal entries will become clearer later). For the above example,
G:
        a      b     c    d
   a ( −1/2   1/2    0    0 )
   b (   1    −2     1    0 )
   c (   1     0    −3    2 )
   d (   0     0     0    0 )
Example 5.1 (Poisson process). The Poisson process of rate λ is itself a continuous-time
Markov chain! As before, the state X(t) at time t (previously called N (t)) represents the
number of arrivals by time t. The state space is N = {0, 1, . . .}, the initial state is X(0) = 0,
and the transition diagram is:
(Transition diagram: 0 → 1 → 2 → · · ·, each arrow at rate λ.)
When there have been n arrivals, we wait an Exp(λ) time for the next arrival, at which point
the number of arrivals jumps to n + 1. The generator is
G =
( −λ    λ    0    0   · · ·
   0   −λ    λ    0
   0    0   −λ    λ
   0    0    0   −λ
   ⋱                     )
Example 5.2 (A simple queue). Five customers arrive at a bank branch that has three
servers. Each customer requires a random service time with Exp(µ) distribution to complete
their business. The service times of different customers are independent. A customer can
only be served when there is a server available to help them. When all servers are busy helping
customers, any further customers wait in a queue until a server is available. Once a customer
has been served they leave.
This situation sounds very complicated to model, but in fact, much of what we might want
to know can be described by a simple continuous-time Markov chain. Let X(t) be the total
number of customers present in the bank at time t (including both those being served and
those waiting). Then X(0) = 5, and the state space is S = {0, 1, . . . , 5}.
If there is 1 customer present, they are served by one of the servers, so they leave after
an Exp(µ) time and then there are 0 customers. If there are 2 customers, they are each being
served by a server (while the other server is idle), so the time until the next customer finishes
is the minimum of two independent Exp(µ) service times, which is Exp(2µ). If there are 3
or more customers, then exactly 3 of them are served simultaneously, so the next customer
leaves at rate 3µ. By the memoryless property of the Exponential, it doesn’t matter how we
got to a state or how long we have been there – the same rules apply for the jumps out of it.
So (X(t))t≥0 is a continuous-time Markov chain with transition diagram:
(Transition diagram: 5 → 4, 4 → 3 and 3 → 2 each at rate 3µ, 2 → 1 at rate 2µ, and 1 → 0 at rate µ; there are no other transitions.)
Note that we did not need to consider the queueing discipline – i.e. how we decide
what order customers are served in, and which server serves which customer. Such choices
might be of interest, but they do not affect the process (X(t))t≥0 . Our only requirement is
that servers are not idle while there are customers waiting to be served.
The generator matrix is
G =
(  0    0     0     0     0     0
   µ   −µ     0     0     0     0
   0   2µ   −2µ     0     0     0
   0    0    3µ   −3µ     0     0
   0    0     0    3µ   −3µ     0
   0    0     0     0    3µ   −3µ )
Example 5.3. In the above example, what is the expected time until all customers have been
served?
Solution. Looking at the transition diagram, the time starting from 5 to hit 4 has Exp(3µ)
distribution, which has mean 1/(3µ), then the time starting from 4 to hit 3 also has mean
1/(3µ), and so on. So the expected time to hit 0 is
1/(3µ) + 1/(3µ) + 1/(3µ) + 1/(2µ) + 1/µ = 5/(2µ).
♢
Example 5.4 (M/M/3 queue). Consider the bank branch with 3 servers again, but now
suppose that customers arrive at the times of a Poisson process of rate λ, independent of the
service process, which is as before. Again let X(t) be the number of customers in the bank at
time t. Now the state space is N, because there could be any number of customers. From any
state n, a new customer arrives (so that X increases to n + 1) at rate λ, or a customer is
served (so X decreases) at rate min(3µ, nµ), because each server can serve one customer at
a time. So the transition diagram is
(Transition diagram: n → n+1 at rate λ for every n ≥ 0; 1 → 0 at rate µ, 2 → 1 at rate 2µ, and n → n−1 at rate 3µ for every n ≥ 3.)
The name “M/M/3 queue” belongs to the standard naming system (Kendall’s notation)
for queueing models. The first M means that the arrival process is “Markovian” (or “memoryless”), meaning that arrivals occur as a Poisson process. The second M means that service
times are also “Markovian” (or “memoryless”), meaning that they are Exponentially distributed. The 3 means that there are 3 servers.
Natural questions (to be addressed) include: can the servers cope with the demand, or
does the queue grow longer and longer? How long do servers typically spend idle? Or busy?
What can we say about the distribution of the stream of customers leaving?
5.2 The jump chain and explosion
We now give a formal construction of a continuous-time Markov chain, based on the informal
description above. Let G = (gi,j )i,j∈S be a generator matrix on state space S. Write
gi := −gi,i
for the total jump rate out of state i. The jump chain is a discrete-time Markov chain
(Yn)n=0,1,... with state space S and transition matrix R = (ri,j)i,j∈S given by:
If gi > 0:  ri,j = gi,j / gi for j ≠ i, and ri,i = 0.
If gi = 0:  ri,j = 0 for j ≠ i, and ri,i = 1.
For example, for the chain at the beginning of the chapter we have
G:
        a      b     c    d
   a ( −1/2   1/2    0    0 )
   b (   1    −2     1    0 )
   c (   1     0    −3    2 )
   d (   0     0     0    0 )

R:
        a     b     c     d
   a (  0     1     0     0  )
   b ( 1/2    0    1/2    0  )
   c ( 1/3    0     0    2/3 )
   d (  0     0     0     1  )
We want the continuous-time chain (X(t)) to perform the same jumps as (Yn ), staying a
random time with distribution Exp(gi ) in state i. It is convenient to get these waiting times
by scaling standard exponentials. Therefore, let Z1 , Z2 , . . . be i.i.d. Exp(1) random variables,
independent of the jump chain (Yn ). Then define the holding times
Sn = Zn / g_{Y_{n−1}},   n ≥ 1,
(recall that if Z ∼ Exp(1) then Z/a ∼ Exp(a)). Define the jump times
Tn = ∑_{i=1}^n Si,
(and T0 = 0). Finally let
X(t) = Yn   if n ≥ 0 is such that Tn ≤ t < Tn+1.    (19)
We then say that the stochastic process (X(t))t≥0 is a continuous-time Markov chain
with generator G.
As in the discrete case, we may also specify the initial distribution µ = (µi )i∈S , where
µi = P(X(0) = i). Again, we sometimes indicate the case of a deterministic starting state j
(so µj = 1) by writing Pj for the probability measure.
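The construction above translates directly into a simulation algorithm. Here is a sketch in Python (numpy assumed; the function name is mine, not from the notes), applied to the four-state generator from the start of the chapter: repeatedly draw an Exp(gi) holding time and then a jump according to the jump-chain probabilities gi,j/gi.

import numpy as np

rng = np.random.default_rng(6)

def simulate_ctmc(G, states, start, t_end, rng):
    """Simulate a path of the chain with generator G up to time t_end,
    following the jump chain plus Exp(g_i) holding times construction.
    Returns the list of (jump time, new state) pairs."""
    G = np.asarray(G, dtype=float)
    i = states.index(start)
    t, path = 0.0, [(0.0, start)]
    while True:
        g_i = -G[i, i]                      # total jump rate out of state i
        if g_i == 0:                        # absorbing state: stay forever
            return path
        t += rng.exponential(1 / g_i)       # holding time ~ Exp(g_i)
        if t > t_end:
            return path
        probs = G[i].copy()
        probs[i] = 0.0
        probs /= g_i                        # jump chain probabilities r_{i,j} = g_{i,j}/g_i
        i = rng.choice(len(states), p=probs)
        path.append((t, states[i]))

# the four-state example from the start of the chapter
G = [[-0.5, 0.5, 0, 0], [1, -2, 1, 0], [1, 0, -3, 2], [0, 0, 0, 0]]
print(simulate_ctmc(G, ["a", "b", "c", "d"], "a", 20.0, rng))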
Explosion. Because transition rates can be arbitrarily large, we need to deal with a new
complication not seen in the discrete case. We say that a chain explodes if it makes infinitely
many jumps in finite time.
Example 5.5. Consider the continuous-time chain with initial state 0 and diagram:
(Transition diagram: 0 → 1 at rate 1, 1 → 2 at rate 2, 2 → 3 at rate 2², 3 → 4 at rate 2³, 4 → 5 at rate 2⁴, and so on.)
Note the rapidly increasing rates.
Let Tn be the time of the nth jump. Then Tn is a sum of independent exponentials, and
E(Tn) = 1/1 + 1/2 + 1/2² + · · · + 1/2^{n−1}.
Moreover, let T∞ = limn→∞ Tn be the time until the chain makes infinitely many jumps. Then
E(T∞) = 1/1 + 1/2 + 1/2² + · · · = 2.
In particular T∞ is a.s. finite. So the chain almost surely makes infinitely many jumps in
finite time, i.e. it explodes.
A trajectory might look something like this:
(Sketch of a trajectory X(t) against t: infinitely many jumps accumulate before the finite explosion time T∞.)
Note that our construction above is silent on what X(t) should be after an explosion time
T∞ . For t > T∞ there is no n satisfying the condition in (19).
Explosion is an interesting subject to pure mathematicians, but we will focus on cases
where it can be ruled out. A Markov chain X is called non-explosive if Pi (X explodes) = 0
for all i ∈ S.
Theorem 5.6. Any of the following conditions on a continuous-time Markov chain guarantees that it is non-explosive:
(i) finite state space S;
(ii) bounded jump rates: −gi,i ≤ C for all i ∈ S, for some C < ∞;
(iii) deterministic initial state X(0) = j that is recurrent for the jump chain.
Proof. (not required for the course.) Let the holding times (Sn ), jump times (Tn ) and
jump chain (Yn ), and the Exp(1) variables (Zn ) be as in the construction of the chain, and
let T∞ = limn→∞ Tn . Case (i) is a special case of (ii). In case (ii),
T∞ = ∑_{n=1}^∞ Sn = ∑_{n=1}^∞ Zn / g_{Y_{n−1}} ≥ (1/C) ∑_{n=1}^∞ Zn = ∞  a.s.
by the strong law of large numbers.
In case (iii), the jump chain returns to j infinitely many times, say at (jump-chain) times N1, N2, . . .. Then
T∞ = ∑_{n=1}^∞ Sn ≥ ∑_k S_{N_k} ≥ (1/gj) ∑_k Z_{N_k} = ∞  a.s.
by the strong law of large numbers.
5.3 Transition probabilities
For a continuous-time Markov chain X, we define the transition probability from i to j
in time t to be
pi,j(t) = P(X(t) = j | X(0) = i) = Pi(X(t) = j).
(This is the analogue of the multistep transition probability in the discrete case; the different
notation reminds us that it is not quite the same. Of course there is no analogue of the
one-step transition probability, because there is no smallest unit of time).
Theorem 5.7 (Markov property). Let X be a continuous-time Markov chain. For any times
0 ≤ t0 ≤ t1 ≤ · · · ≤ tn+1 and states i0 , . . . , in−1 , i, j, we have
P(X(tn+1) = j | X(t0) = i0, . . . , X(tn−1) = in−1, X(tn) = i) = pi,j(tn+1 − tn).
In the discrete case we took a similar statement as the definition of a Markov chain. In the continuous case it can be shown, subject to certain conditions, that Markov chains as we constructed them are the only processes satisfying the above property.
Proof. (sketch, not required for the course). Consider the following alternative construction of a Markov chain. To each pair of distinct states i, j we assign an independent
Poisson process (Ni,j (t))t≥0 of rate gi,j . At each arrival time of Ni,j we imagine that a
“gate” opens allowing the chain to move from i to j if it is at i at that time. It is not
difficult to show using the properties of Poisson processes that this is equivalent to the
earlier definition. Now, given X(tn ) = i, the future movement of the chain depends only
on the future (t > tn ) arrivals of the Poisson processes, which are independent of the past,
and distributed identically to the original Poisson processes started from time 0. Now the
claimed result follows.
One can also define a stopping time in continuous time, and there is also a strong Markov
property similar to the discrete case. We omit the details. One can also compute hitting
probabilities and expected hitting times similarly to the discrete case. We omit this as well,
except for a few very simple cases.
Theorem 5.8 (Chapman-Kolmogorov equations).
pi,j(s + t) = ∑_{k∈S} pi,k(s) pk,j(t).
Proof. Similar to the discrete case.
For each t we can form a matrix
P (t) = (pi,j (t))i,j∈S
of the transition probabilities. It is a transition matrix (i.e. the entries are non-negative and
each row sums to 1). We take pi,j (0) to be 1 if i = j and 0 otherwise, so P (0) is the identity
matrix I. The Chapman-Kolmogorov equations can be written in matrix form as
P (s + t) = P (s)P (t).
In the discrete case we could in principle express everything in terms of the 1-step transition matrix P . In continuous time it is not quite so simple, because there is no smallest unit
of time. Instead we can consider the limit as the time interval goes to zero. If we do that we
essentially get back the generator matrix.
Theorem 5.9. For any i, j ∈ S we have
p′i,j (0) = gi,j .
Here p′i,j(t) means the derivative (d/dt) pi,j(t), and when we evaluate it at 0 it is to be understood as a right-derivative, since pi,j(t) does not make sense for negative t.
Proof. (sketch; not required for the course). Let (Yn ) be the jump chain and let T1
be the first jump time. Also write gi = −gi,i as usual.
For i ̸= j, as h ↓ 0,
pi,j(h) = Pi(X(h) = j)
= Pi(T1 < h, Y1 = j) + o(h)
= (1 − e^{−gi h}) (gi,j / gi) + o(h)
= gi h (gi,j / gi) + o(h)
= h gi,j + o(h).
The second equality is because any other route from i to j other than a single direct jump
involves two or more jumps in a short time h, which is much less likely; we omit the full
rigorous justification of this.
Similarly,
pi,i (h) = Pi (X(h) = i)
= Pi (T1 > h) + o(h)
= e−gi h + o(h)
= 1 − gi h + o(h)
= 1 + hgi,i + o(h).
Since pi,j (0) is 1 if i = j and 0 otherwise, in both cases we have
(pi,j(h) − pi,j(0)) / h = gi,j + o(h)/h,
and so taking h ↓ 0 gives the result.
We can again write the last result in matrix form:
P ′ (0) = G,
where P ′ (t) is the matrix of derivatives (p′i,j (t))i,j∈S .
Combining the last two results, we can say how P (t) and G are related in general.
Theorem 5.10. The transition matrices P (t) and generator matrix G satisfy both the
differential equations
(i) P ′ (t) = P (t) G (“forward equation”);
(ii) P ′ (t) = G P (t) (“backward equation”).
Proof. (sketch) (i) Using the Chapman-Kolmogorov equations, we have
P′(t) = (d/dt) P(t) = (∂/∂h) P(t + h) |_{h=0} = (∂/∂h) [P(t)P(h)] |_{h=0} = P(t) P′(h) |_{h=0} = P(t) G.
(For an infinite state space, some justification is needed to make this fully rigorous, because the matrix multiplication hides an infinite sum, which we are differentiating term by term.)
(ii) is similar, using the Chapman-Kolmogorov equations with t and h in the other order.
The neat form of these matrix equations is the reason for defining the generator the way
we did, with entries −gi on the diagonal.
Matrix exponential formula – not required for the course. In the case of a finite
state space, it is possible to solve the forward or backward equations (together with the
initial condition P (0) = I), to get the remarkable formula
P (t) = exp(tG),
where the exponential of a matrix M is defined as
exp(M) = ∑_{n=0}^∞ M^n / n!.
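For a small chain this formula is easy to use numerically. The sketch below (assuming scipy is available; the code is mine, not part of the notes) computes P(t) = exp(tG) for the four-state example of this chapter and checks that each row sums to 1.

import numpy as np
from scipy.linalg import expm

# generator of the four-state example from the start of the chapter
G = np.array([[-0.5, 0.5, 0, 0],
              [1, -2, 1, 0],
              [1, 0, -3, 2],
              [0, 0, 0, 0]], dtype=float)

P = expm(2.0 * G)                  # transition matrix P(t) for t = 2
print(np.round(P, 4))
print("row sums:", P.sum(axis=1))  # each row should sum to 1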
5.4 Classification of states
For a continuous time Markov chain, the definitions of i → j, i ↔ j, communicating
classes and irreducibility, recurrence and transience are all simply inherited from
the jump chain.
(Not positive and null recurrence, however). We next explore briefly what these definitions
mean.
Example 5.11.
(Transition diagram: the four-state chain from the start of the chapter, with a → b at rate 1/2, b → a and b → c at rate 1, c → a at rate 1, c → d at rate 2, and d absorbing.)
In this example, {a, b, c} is a transient class and {d} is a recurrent class.
The following fact makes many things simpler than the discrete case. In particular there
is no periodicity.
Theorem 5.12. For two states i and j of a continuous-time chain, the following are equivalent.
(i) i → j;
(ii) pi,j (t) > 0 for all t > 0;
(iii) pi,j (t) > 0 for some t > 0.
Proof. It is obvious that (ii) implies (iii).
To show that (iii) implies (i), assuming (iii), there must exist some finite sequence of
states such that with positive probability the chain can get from i to j via this sequence.
But this implies that the jump chain can follow the same sequence with positive probability,
which gives (i).
We now show that (i) implies (ii). Assuming (i), there exist i1 , . . . , in−1 such that
ri,i1 ri1 ,i2 · · · rin−1 ,j > 0, where R = (ri,j ) is the jump chain matrix. By the definition of
the jump chain this implies that gi,i1 gi1 ,i2 · · · gin−1 ,j > 0, where G = (gi,j ) is the generator
matrix. Note that if gk,ℓ > 0 then
pk,ℓ(t) ≥ Pk(T1 < t, T2 − T1 > t, Y1 = ℓ) ≥ (1 − e^{−gk t}) e^{−gℓ t} rk,ℓ > 0.
Therefore,
pi,j (t) ≥ pi,i1 (t/n) · · · pin−1 ,j (t/n) > 0,
giving (ii).
A set A ⊆ [0, ∞) is bounded if A ⊆ [0, C) for some C < ∞, and unbounded otherwise.
Theorem 5.13. Let i be a state of a continuous-time Markov chain X.
(i) If i is recurrent then Pi ({t : X(t) = i} is unbounded) = 1.
(ii) If i is transient then Pi ({t : X(t) = i} is bounded) = 1.
Proof. (i) Since i is recurrent the chain does not explode, so the jump times Tn satisfy Tn → ∞ as n → ∞. But X(Tn) = Yn, which equals i for infinitely many n, so {t : X(t) = i} is unbounded.
(ii) Since i is transient, the jump chain has only finitely many visits to i, so the last visit
is at time N < ∞ say. So {t : X(t) = i} is bounded by TN < ∞.
5.5 Stationary distributions
As usual, a distribution is a row vector γ = (γi : i ∈ S) that satisfies γi ≥ 0 and ∑_{i∈S} γi = 1. As before, if the initial distribution is γ then the distribution of X(t) is γP(t). A distribution
π is stationary for a Markov chain with transition matrices P (t) if
π P (t) = π
for all t > 0.
Since P (t) is usually not easy to calculate, this definition is not so useful in practice. Fortunately, there is an equivalent condition in terms of the generator.
Theorem 5.14. For a continuous-time Markov chain with generator G and transition
matrices P (t), the distribution π is stationary if and only if πG = 0.
Here the 0 on the right side of the equation means the row vector (0, 0, . . . , 0).
Proof. We will give the proof assuming that the state space is finite; in the general case the
various interchanges of limits require additional justification.
If π is stationary then πP (t) = π, so differentiating gives πP ′ (t) = 0. Taking t = 0 gives
πG = 0.
If πG = 0 then by the backward equation,
d
(πP (t)) = πP ′ (t) = π(GP (t)) = (πG)P (t) = 0,
dt
so πP (t) is constant in t (i.e. each entry of the row vector is constant). But πP (0) = πI = π,
so the constant is π, i.e. π is stationary.
Writing gi = −gi,i as usual, the equation πG = 0 can be written in the form
∑_{i≠j} πi gi,j = πj gj,   for j ∈ S.
Example 5.15. Find all stationary distributions of the continuous-time Markov chain with
transition diagram
(Transition diagram: a → b at rate 1/2; b → a at rate 1 and b → c at rate 1; c → a at rate 1.)
Solution. πG = 0 gives πb + πc = (1/2)πa; (1/2)πa = 2πb; πb = πc. One equation is redundant, and we get π = (4πc, πc, πc). Using ∑ πi = 1 gives π = (4/6, 1/6, 1/6). ♢
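For a finite chain the equation πG = 0, together with ∑ πi = 1, is a linear system that a computer can solve directly. Here is a sketch (Python with numpy; my own code, not part of the notes) for the three-state generator of Example 5.15, replacing one redundant balance equation by the normalisation condition.

import numpy as np

# generator of the three-state chain in Example 5.15
G = np.array([[-0.5, 0.5, 0.0],
              [1.0, -2.0, 1.0],
              [1.0, 0.0, -1.0]])

# solve pi G = 0 together with sum(pi) = 1:
# keep two columns of G (the third balance equation is redundant) plus the normalisation row
A = np.vstack([G.T[:-1], np.ones(3)])
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)
print(pi)   # should be close to (4/6, 1/6, 1/6)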
We omit the proofs of the remaining results in this section. They can generally be deduced
from their discrete-time counterparts – some such deductions are straightforward, others are
a bit harder.
Theorem 5.16. For an irreducible, recurrent continuous-time Markov chain, the equation xG = 0 has at most one positive finite solution up to multiplication by a constant.
Define the mean return time to state i:
mi = Ei Ri
where Ri = inf{t > T1 : X(t) = i},
where T1 is the first jump time. (So, we measure the time to first leave i and then return to
it.) A recurrent state i is called positive if mi < ∞ and null if mi = ∞.
Theorem 5.17. Positive and null recurrence are class properties.
Theorem 5.18. An irreducible continuous-time Markov chain is positive recurrent if
and only if it is non-explosive and has a stationary distribution. Moreover, in that case
the stationary distribution π is given by
πi = 1 / (mi gi)
(where gi = −gi,i is the jump rate out of i and mi is the mean return time of i).
The formula πi = 1/(mi gi ) makes sense because it takes time mi to return to i, of which
(in expectation) time 1/gi was spent in i before leaving, so the average proportion of time in
i is (1/gi )/mi .
Note that, unlike the discrete case, existence of a stationary distribution is not enough to
deduce positive recurrence - we need to check non-explosion as well. See the problem sheet
for an example that illustrates this. Here is the continuous-time version of the convergence
theorem and ergodic theorem.
Theorem 5.19. For an irreducible, positive recurrent continuous-time Markov chain X
with stationary distribution π and any initial distribution, as t → ∞,
P(X(t) = j) → πj,   and   (1/t) ∫_0^t I_{X(s)=j} ds → πj.
In the second convergence statement, the left side is the proportion of time spent at j up to
time t.
Example 5.20. In Example 5.15, find (a) the expected return time to a, starting from a; (b)
the expected hitting time of a, starting from b; (c) limt→∞ pa,b (t).
Solution. (a) ma = 1/(πa ga ) = 1/(4/6 · 1/2) = 3. (b) From a, the first jump is to b, in
expected time 1/(1/2) = 2, so the expected time to hit a from b is 3 − 2 = 1. (c) The chain
is non-explosive since it is finite, and has a stationary distribution, so it is positive recurrent.
Hence limt→∞ pa,b (t) = πb = 1/6.
♢
5.6 Reversibility and queueing theory
The theory of reversibility in continuous time closely mirrors the discrete case. A main
application is to the theory of queues, one of the central topics of applied probability.
Theorem 5.21. Let G be a generator matrix. If x = (xi : i ∈ S) satisfies the detailed
balance equations
xi gi,j = xj gj,i
i, j ∈ S
then xG = 0.
Proof. See problem sheet.
Theorem 5.22. If X is an irreducible positive recurrent continuous-time Markov chain
whose stationary distribution π and generator G satisfy the detailed balance equations,
then, if the initial distribution is π, then for any fixed u > 0, the process (X(u − t))t∈[0,u]
has the same distribution as (X(t−))t∈[0,u] .
(The meaning of X(t−) is lims↑t X(s) – the value immediately before time t. We need this
because of the technical convention (19) that at a jump time the chain takes the new value
– this gets reversed when we reverse time. There’s no need to worry about any of this!)
In the above situation, we say that the Markov chain is reversible. As in the discrete
case, a Markov chain with state space {0, 1, . . .} or {0, 1, . . . , n} is a birth-death chain if
gi,j = 0 for all i, j ∈ S with |i − j| ≥ 2. It is easy to check that a birth-death chain is
reversible provided it is positive recurrent.
Example 5.23. M/M/1 queue. Customers arrive as a Poisson process of rate λ, and
require independent Exponential(µ) service times. There is one server, and if they are busy
then customers wait to be served in a queue, which is allowed to grow arbitrarily long.
This is the canonical queue model. Let X(t) be the total number of customers present
(waiting or being served) at time t. This is a continuous-time Markov chain with transition
diagram
(Transition diagram: n → n+1 at rate λ for every n ≥ 0, and n → n−1 at rate µ for every n ≥ 1.)
To analyse it, first note that the jump chain is a random walk with a reflecting barrier:
(Jump chain diagram: from 0 the walk moves to 1 with probability 1; from each n ≥ 1 it moves to n+1 with probability λ/(λ+µ) and to n−1 with probability µ/(λ+µ).)
This is recurrent if and only if λ/(λ + µ) ≤ 1/2, i.e. λ ≤ µ. (The reflecting barrier is slightly
different from usual: the transition probability from 0 to 1 is 1 rather than λ/(λ + µ). But
we can see that this does not affect P1(hit 0), and from this we can deduce that the criteria for recurrence and transience are also not affected.)
This conclusion makes sense from a queueing perspective. If λ > µ then the arrival rate λ
exceeds the maximum possible service rate µ, so in the long term the server will not be able
to cope with demand, and the queue length will go to infinity. If λ ≤ µ then the server has
sufficient capacity, and the server will have idle periods (times when X(t) = 0) indefinitely.
Let λ ≤ µ. To determine positive or null recurrence we look for stationary distributions.
The chain is irreducible, and it is non-explosive because the rates are bounded, or alternatively
because the jump chain is recurrent. Moreover it is a birth-death chain, so we can use the
detailed balance equations:
πi λ = πi+1 µ,
i ≥ 0.
The general solution is
πi = π0 (λ/µ)^i,   i ≥ 0.
If λ = µ then ∑_i (λ/µ)^i = ∞, so there is no stationary distribution, hence the chain is null
recurrent. In terms of the queue, the server just has enough capacity to handle the load, so
the queue length does not go to infinity, but idle periods are very rare – the expected return
time to 0 is infinite.
If λ < µ then ∑_i (λ/µ)^i = 1/(1 − λ/µ), so there is a stationary distribution given by
πi = (1 − λ/µ)(λ/µ)^i,   i ≥ 0.
Here the maximum possible service rate µ exceeds the arrival rate λ, so the queue length has
an equilibrium distribution, and the server is idle for a positive fraction π0 of the time. In
fact π0 = 1 − λ/µ.
We can calculate many more things:
Example 5.24. For an M/M/1 queue in the positive recurrent regime λ < µ, in equilibrium,
find
(a) the mean number of customers present;
(b) the mean duration of a typical idle period;
(c) the mean duration of a typical busy period;
(d) the mean waiting time (queueing plus service) for a typical customer.
Solution. “In equilibrium” means that we start in the stationary distribution π.
(a) The stationary distribution is a Geometric distribution with parameter 1−λ/µ shifted
by 1. So the mean is 1/(1 − λ/µ) − 1 = λ/(µ − λ).
(b) An idle period is the time from when the server becomes idle (when X hits 0) until
the first jump out of 0. This has mean 1/g0 = 1/λ.
(c) A busy period means an interval of time during which the server is working, i.e.
X(t) > 0. The mean return time of 0 is 1/(π0 g0 ). This comprises an idle period followed by a
busy period. So the mean duration of a typical busy period is the difference 1/(π0 g0 )−1/g0 =
µ/[λ(µ − λ)] − 1/λ = 1/(µ − λ).
(d) Suppose that a new customer arrives when there are Q customers already present.
Then they must wait for Q + 1 service times – those of customers already present plus their
own. Since the service times are i.i.d. Exp(µ) (and using the memoryless property for the
customer currently being served, if any), the new customer’s waiting time W satisfies
E(W | Q = i) = (i + 1)/µ,
so
E(W | Q) = (Q + 1)/µ.
Since the queue is in equilibrium prior to the customer’s arrival, using (a),
EW = (EQ + 1)/µ = (1/µ)(λ/(µ − λ) + 1) = 1/(µ − λ).
♢
(Strictly speaking, the last step needs some more justification, which we won’t go into – it is
an example of something called the PASTA principle – Poisson Arrivals See Time Averages.)
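These equilibrium formulas are easy to check against a long simulated run of the chain, using the ergodic theorem (Theorem 5.19). The sketch below (illustrative parameter values; my own code, not part of the notes) simulates the M/M/1 queue-length chain and compares the long-run average number present with λ/(µ − λ) and the idle fraction with π0 = 1 − λ/µ.

import numpy as np

rng = np.random.default_rng(7)
lam, mu, t_end = 1.0, 1.5, 50000.0

t, x = 0.0, 0
time_weighted_sum, idle_time = 0.0, 0.0
while t < t_end:
    rate = lam + (mu if x > 0 else 0.0)          # total jump rate out of state x
    hold = rng.exponential(1 / rate)             # holding time in state x
    time_weighted_sum += x * hold
    if x == 0:
        idle_time += hold
    t += hold
    x += 1 if rng.random() < lam / rate else -1  # arrival or departure

print("mean number present:", time_weighted_sum / t, " vs lam/(mu-lam) =", lam / (mu - lam))
print("fraction of time idle:", idle_time / t, " vs 1 - lam/mu =", 1 - lam / mu)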
Example 5.25. (M/M/s queue) Now suppose that customers arrive as a Poisson process
of rate λ and require Exponential service times with parameter µ, and there are s servers.
When all s servers are busy, customers wait in the queue. When there are s or fewer customers, each is served simultaneously by a different server, and any remaining servers are
idle. When is the chain positive recurrent?
Solution. The number of customers present is a Markov chain with rates
gi,i+1 = λ for i ≥ 0,   and   gi,i−1 = µ min(i, s) for i ≥ 1.
The chain is non-explosive since the rates are bounded. It is a birth-death chain, so we can
solve the detailed balance equations:
π 0 λ = π1 µ
π1 λ = π2 2µ
π2 λ = π3 3µ
···
πs−1 λ = πs sµ
πs λ = πs+1 sµ
πs+1 λ = πs+2 sµ
···
to give
πi = π0 (λ/µ)^i / i!   for i ≤ s,   and   πi = π0 (λ/µ)^i / (s! s^{i−s})   for i > s.
The chain is positive recurrent if this can be normalised, i.e. if
∑_{i=0}^s (λ/µ)^i / i!  +  ∑_{i=s+1}^∞ (λ/µ)^i / (s! s^{i−s})  <  ∞,
or equivalently when
∑_{i=s+1}^∞ (λ/µ)^i / (s! s^{i−s})  =  (s^s / s!) ∑_{i=s+1}^∞ (λ/(sµ))^i  <  ∞.
That is, when λ/(sµ) < 1, i.e. λ < sµ. (This makes sense: it is when the service capacity
exceeds the arrival rate).
♢
Example 5.26. (M/M/∞ queue) Suppose that now there are an unlimited number of
servers, and everything else is as in the last example. Then gi,i+1 = λ for all i ≥ 0, and
gi,i−1 = µi for all i ≥ 1. The rates are not bounded, but we can conclude that the chain is
non-explosive as follows. Since the arrival process is a homogeneous Poisson process, after
any finite time, the number of arrivals is finite, therefore the number of jumps made by the
chain is also finite.
The detailed balance equations are
πi−1 λ = πi iµ,   i ≥ 1,
which give
πi = π0 (λ/µ)^i / i!,   i ≥ 0.
We have
∑_{i≥0} (λ/µ)^i / i! = e^{λ/µ} < ∞,
so the chain is positive recurrent for all λ and µ, and π0 = e−λ/µ , so the stationary distribution
is Poisson with parameter λ/µ.
Output of a queue. Suppose you see only the stream of customers coming out of the
shop. What do the times of their departures look like as a stochastic process? Remarkably,
reversibility can give us the answer.
Consider an M/M/1 or M/M/s or M/M/∞ queue in the positive recurrent regime, in
stationarity. Since the detailed balance equations are satisfied, the process is reversible. The
process of departure times is the same as the process of times when the number of customers
X(t) decreases by 1. By reversibility, this has the same distribution as the process of times
when X(t) increases by 1. But this is simply the arrivals process. So we have established:
Theorem 5.27. For a positive recurrent stationary M/M/s queue (where s ≥ 1 or
s = ∞) in equilibrium, the process of departure times is a Poisson process of rate λ.
Like some earlier results involving reversibility, this is quite surprising. One might expect
that departures would be more “clustered together”. There is a higher departure rate µ
during busy periods, and no departures during idle periods; but, remarkably, the random
switching between the two is such that the result is just a Poisson process.
6 Martingales
6.1 Introduction
Besides Markov chains, martingales are the other main class of stochastic processes. Whereas
Markov chains are “memoryless processes”, martingales are “fair gambling games”. The
unusual name “martingale” is an 18th-century term for a betting strategy – this in turn
appears to derive from a word for a kind of harness.
By way of motivation, consider a simple gambling game: a fair coin is tossed repeatedly.
Just before each toss, you are allowed to bet any amount y of your choice that the next toss
will be Heads. If the next toss is Heads you win y; if it is Tails you lose y. We allow y to be
negative (in which case one would normally call it betting on Tails). For example, you might
choose to:
1. Bet £10 on the first toss and then stop. Then your expected profit is (1/2)(10) + (1/2)(−10) = 0.
2. Bet £1 on each of the first 10 tosses. It is clear that the distribution of your profit at
the end is symmetric around 0, so your expected profit is again 0.
Is there any reasonable betting strategy that will give a positive expected profit? Well,
we might hope to try:
3. Bet £1 on the first toss that is Heads(!) That certainly guarantees a profit of £1, but
it is not really a reasonable strategy. We have to place the bet before the toss (good luck
finding a casino that lets you place it afterwards), so the only way to do this is if we can see
into the future!
However, what about:
4. Bet £1 on every toss until the first time your profit is 1, and then stop. Perhaps
surprisingly, this works. Your profit is a simple symmetric random walk. We know that it
hits 1 almost surely, so you will definitely walk away with £1! Two disadvantages are: (a)
it may take a very long time – we know that the expected time to hit 1 is infinite; (b) you
need to be infinitely rich (or have an unlimited line of credit!) – the random walk might get
arbitrarily negative before it hits 1.
Another approach:
5. Bet 1 on H, then 2 on H, then 2², then 2³, etc. until the first time you win a bet. This
also works – we know there will be a H eventually, and if it happens on the N th toss then
your profit is
−1 − 2 − 2² − 2³ − · · · − 2^{N−1} + 2^N = 1,
so we are guaranteed a £1 profit again. This seems like a much quicker way to get rich:
N has Geometric distribution, so EN = 2. But the problem again is that you need to be
infinitely rich to start with. If not, things could go very badly. Just before winning, you’ll
be 2N − 1 in debt!
In fact we’ll see that for any strategy that doesn’t involve seeing into the future and
also keeps your wealth above some constant −c, your expected profit cannot be positive.
This turns out to be a very powerful fact with implications far beyond gambling. We’ll
see some nice applications to random walks, and a simple elegant solution to the HTH and
ABRACADABRA problems.
Here is the main definition. Let X = (Xn )n≥0 be a discrete-time stochastic process, where
each Xn is real-valued. We say that X is a martingale if for each n:
(i) Xn has finite mean, and
(ii) E(Xn+1 | X0 , X1 , . . . , Xn ) = Xn .
Let’s pause to understand the meaning of condition (ii). Recall the definition of conditional expectation: for two random variables X and Y , E(X | Y = y) is simply the mean
of the conditional distribution of X given that Y = y. This is necessarily a function of y,
say E(X | Y = y) = g(y). Then we define E(X | Y ) = g(Y ). We are used to Y being a
real-valued random variable, but in fact it could be any random variable. In (ii) above it
is the random vector (X0 , X1 , . . . , Xn ). So E(Xn+1 | X0 , X1 , . . . , Xn ) = g(X0 , X1 , . . . , Xn )
where the function g is given by g(x0 , . . . , xn ) = E(Xn+1 | (X0 , . . . , Xn ) = (x0 , . . . , xn )). We
don’t want to write all that out every time, which is why the conditional expectation notation
is so useful.
More importantly, how should we interpret the definition? The idea is that Xn represents your wealth at time n in any fair gambling game. (i) is just a technicality to make sure
all expectations make sense. (ii) says that given complete knowledge of everything that has
happened to your wealth up till now, X0 , . . . , Xn , the best guess (i.e. mean) for your wealth
Xn+1 after one more step is your current wealth Xn .
To make use of the concept we need a few basic properties of conditional expectation.
Theorem 6.1. For random variables X, Y, Z and constants a, b,
(i) E(aX + bY | Z) = aE(X | Z) + bE(Y | Z);
(ii) E(g(Y )X | Y ) = g(Y )E(X | Y ) for any function g;
(iii) E(X | Y ) = E(X) if X and Y are independent.
It is important to appreciate the difference between (ii) and (iii). If we know Y , then we
know g(Y), so it behaves like a deterministic constant and comes out of the expectation. On the other hand, if X and Y are independent then knowing Y tells us nothing about X,
so the conditional expectation just becomes an ordinary expectation.
Proof. These follow from straightforward manipulations, first showing the corresponding
statements involving conditioning on events. For instance, for (ii) we show that E(g(Y )X |
Y = y) = g(y)E(X | Y = y). We omit the details.
Recall also the tower property or double expectation formula:
Theorem 6.2. E(E(X | Y )) = E(X).
Example 6.3. Let (Xi)i≥0 be i.i.d. taking values +1 and −1 each with probability 1/2, and let Sn = ∑_{i=1}^n Xi be the simple symmetric random walk. Show that (Sn) is a martingale.
Solution. First note that EXi = 0, so ESn = 0 which is finite. Now,
E(Sn+1 | S0 , . . . , Sn ) = E(Sn + Xn+1 | S0 , . . . , Sn )
= E(Sn | S0 , . . . , Sn ) + E(Xn+1 | S0 , . . . , Sn )
= Sn + E(Xn+1 )
= Sn ,
because Sn is a function of S0 , . . . , Sn while Xn+1 is independent of S0 , . . . , Sn .
♢
Our first property of martingales is natural given the fair game interpretation, although
not all that useful on its own.
Theorem 6.4. If (Xn ) is a martingale then EXn = EX0 for every n.
Proof. Using the tower property,
EXn+1 = E(E(Xn+1 | X0 , . . . , Xn )) = EXn .
Now use induction.
Sometimes the following slightly more general definition of a martingale is useful. Let
(Xn )n≥0 and (Yn )n≥0 be stochastic processes. Then we say that (Xn ) is a martingale with
respect to (Yn ) if for each n:
1. Xn has finite mean, and
2. E(Xn+1 | Y0 , . . . , Yn ) = Xn .
We can imagine that Yn represents all new information learned at time n, for instance the
state of the stock market, and so (Y0 , . . . , Yn ) represents all information known up to time
n, while Xn is the quantity we are actually interested in, perhaps the value of your stock
portfolio. It follows
from condition 2 that Xn must be a function of Y0 , . . . , Yn .
Let Sn = ∑_{i=1}^n Xi be the simple symmetric random walk started at 0 as usual. Also take X0 = 0, since the definitions assume that both processes are indexed from 0.
Example 6.5. Show that (Sn ) is a martingale with respect to (Xn ).
Solution. Sn is a function of X0 , . . . Xn and has finite mean, and
E(Sn+1 | X0 , . . . , Xn ) = E(Sn | X0 , . . . , Xn ) + E(Xn+1 | X0 , . . . , Xn )
= Sn + E(Xn+1 )
= Sn .
♢
Example 6.6. Is (Sn²) a martingale with respect to (Xn)?
Solution. Since −n ≤ Sn ≤ n, it follows that Sn² has finite mean. But
E(S_{n+1}² | X0, . . . , Xn) = E((Sn + Xn+1)² | X0, . . . , Xn)
= E(Sn² + 2 Sn Xn+1 + X_{n+1}² | X0, . . . , Xn)
= Sn² + 2 Sn E(Xn+1) + E(X_{n+1}²)
= Sn² + 2 Sn · 0 + 1 = Sn² + 1.
So (Sn²) is not a martingale.
♢
However, it is very close: we can fix it by adding a deterministic function:
Example 6.7. Show that (Sn² − n) is a martingale with respect to (Xn).
Solution. Let Mn = Sn² − n. From above, Mn has finite mean, and
E(Mn+1 | X0, . . . , Xn) = E(S_{n+1}² − n − 1 | X0, . . . , Xn)
= Sn² + 1 − n − 1 = Sn² − n = Mn.
♢
6.2 New martingales from old
Let (Xn ) be a martingale with respect to (Yn ), and think of Xn as representing your wealth
at time n in a fair gambling game – perhaps betting 1 unit repeatedly on a sequence of fair
coin tosses. Now suppose you are allowed to vary your betting strategy. You can bet any
amount Hn you want to on the outcome of game n. For this to be sensible, we should insist
that you cannot see into the future, so you must base your decision on what has happened
before game n.
To formalize this, we say that a stochastic process (Hn )n≥1 is predictable (sometimes
called previsible) with respect to (Yn )n≥0 if for each n, Hn is a function of Y0 , . . . , Yn−1 . In
other words, if Y0 , . . . , Yn represents the information known at time n, then Hn is known
strictly before time n.
For stochastic processes X = (Xn)n≥0 and H = (Hn)n≥1 we define the process H · X = ((H · X)n)n≥0 by
(H · X)n = ∑_{i=1}^n Hi (Xi − Xi−1).
This is precisely the betting process referred to above. If Xn is the wealth after n games
of a player who bets $1 on each game, then Xi − Xi−1 is the profit from betting $1 on game
i, so Hi (Xi − Xi−1 ) is the profit from betting $Hi on game i, so (H · X)n is the wealth after
n games of a player who starts with 0 and bets Hi on game i for each i.
Example 6.8. Let Z1, Z2, . . . be i.i.d. taking values +1 and −1 with equal probability, and let Sn = ∑_{i=1}^n Zi be the simple symmetric random walk. Let
N = min{n ≥ 1 : Zn = +1},
and let
Hn = 2^{n−1} I_{n≤N},   n ≥ 1.
In other words, we double our bet (on Heads, say) at each step until the first win, and then
stop.
Since {n ≤ N } = {N < n}C = {Z1 = · · · = Zn−1 = −1}, the process (Hn ) is predictable
with respect to (Zn ).
We have
(H · S)n = ∑_{i=1}^n Hi Zi = −1 − 2 − 2² − · · · − 2^{n−1} = −2^n + 1   if n < N,
and
(H · S)n = −1 − 2 − 2² − · · · − 2^{N−2} + 2^{N−1} + 0 + 0 + · · · = 1   if n ≥ N.
So using this betting strategy we eventually end up with a profit of 1, although we might
go a long way down first.
The next result says that betting strategies cannot “beat the system” - a fair game remains
a fair game. The traditional but slightly confusing name “optional sampling” conveys the
idea that you are allowed to choose when to bet on, or “sample” from X.
Theorem 6.9 (Optional sampling). Let X be a martingale with respect to Y and let H
be predictable with respect to Y . Then H · X is a martingale with respect to Y provided
(H · X)n has finite expectation.
Example 6.10. In the last example, note that P(N > n) = (1/2)^n, so
E(H · S)n = (−2^n + 1)(1/2)^n + (1)(1 − (1/2)^n) = 0 = E(H · S)0,
consistent with the theorem.
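A quick simulation of the doubling strategy stopped after n games illustrates this (a sketch with illustrative parameters, not from the notes): almost every run ends with a profit of 1, but the rare runs of n consecutive Tails lose 2^n − 1, and the average works out to 0.

import numpy as np

rng = np.random.default_rng(8)
n, trials = 6, 200000

profits = []
for _ in range(trials):
    wealth, stake = 0, 1
    for _ in range(n):
        if rng.random() < 0.5:     # Heads: win the current stake and stop betting
            wealth += stake
            break
        wealth -= stake            # Tails: lose the stake and double it
        stake *= 2
    profits.append(wealth)

profits = np.array(profits)
print("estimate of E(H.S)_n:", profits.mean(), "(theory: 0)")
print("estimate of P(profit = 1):", (profits == 1).mean(), "(theory:", 1 - 0.5 ** n, ")")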
Proof of Theorem 6.9. Since H is predictable, (H · X)n is a function of Y0 , . . . , Yn , as are
Hn+1 and Xn . So
E((H · X)n+1 | Y0, . . . , Yn) = E((H · X)n + Hn+1 (Xn+1 − Xn) | Y0, . . . , Yn)
= (H · X)n + Hn+1 E(Xn+1 − Xn | Y0 , . . . , Yn )
= (H · X)n + Hn+1 (E(Xn+1 | Y0 , . . . , Yn ) − Xn )
= (H · X)n + Hn+1 (Xn − Xn )
= (H · X)n .
6.3 Stopped martingales
One example of a “betting strategy” is simply to stop playing the game at some point. We
can decide when that is based on what has happened so far. In other words, we stop at a
stopping time.
Recall that a random variable T taking values in N ∪ {∞} is a stopping time with
respect to (Yn )n≥0 if we can determine whether or not T ≤ n from Y0 , . . . , Yn . We use the
following notation for the minimum:
a ∧ b = min(a, b).
If X = (Xn )n≥0 is a stochastic process and T is a random time, then the process
(Xn∧T )n≥0
is X stopped at T .
(Illustration: a sample path of (Xn) plotted against n, and next to it the corresponding path of the stopped process (Xn∧T), which follows (Xn) up to time T and is constant thereafter.)
Given the gambling interpretation, the following should not be a surprise. A gambler can
walk away whenever he wants.
Theorem 6.11. If X is a martingale with respect to Y and T is a stopping time with
respect to Y then (Xn∧T )n≥0 is a martingale with respect to Y .
Proof. We bet 1 unit on each game until T , and then stop. In other words, let Hn = IT ≥n .
Then {Hn = 1} = {T ≥ n} = {T ≤ n − 1}C , which can be determined from Y0 , . . . , Yn−1 . So
H is predictable. Hence H · X is a martingale, by Theorem 6.9. But
(H · X)n = ∑_{i=1}^n Hi (Xi − Xi−1) = ∑_{i=1}^{n∧T} 1 · (Xi − Xi−1) = Xn∧T − X0.
So Xn∧T = (H · X)n + X0 . Now the constant sequence (X0 )n≥0 is trivially a martingale,
and the sum of two martingales is a martingale (see the problem sheet), so (Xn∧T)n≥0 is a
martingale.
Using Theorem 6.4 we can immediately deduce
Theorem 6.12. Under the conditions of the last theorem, EXn∧T = EX0 .
Similarly to our earlier investigations, a natural and useful question is: when is it the case that EXT = EX0 for a random time T? Here is an answer.
Theorem 6.13 (Optional stopping). Let X be a martingale and T a stopping time with
respect to Y . Any of the following conditions ensures that EXT = EX0 .
(i) T is bounded (i.e. T < c a.s. for some constant c < ∞);
(ii) T < ∞ a.s. and X is bounded (i.e. |Xn | ≤ c for all n and some fixed c < ∞);
(iii) ET < ∞ and X has bounded increments (i.e. |Xn+1 − Xn | < c for all n and some
fixed c < ∞).
Proof. (partial) (i) We have EXn∧T = EX0 . We have T < c, so T ∧ c = T . So taking n = c
gives EXT = EXT ∧c = EX0 .
For (ii) and (iii) it can be shown using tools beyond the scope of this course (dominated
convergence) that EXn∧T → EXT as n → ∞. The result then follows because EXn∧T =
EX0 .
Example 6.14. Let S = (Sn ) be the simple symmetric random walk started at 0, and let T
be the hitting time of 1. S is a martingale and T is a stopping time, but
1 = EST ̸= ES0 = 0.
Since S has bounded increments (bounded by 1), but the conclusion of the optional stopping
theorem fails, we must have ET = ∞ (as we found before).
Example 6.15. (Gambler’s ruin yet again) Let S = (Sn ) be the simple symmetric random
walk started at k and let T be the first hitting time of {0, m}, where 0 < k < m. The
martingale S has bounded increments, and by the disjoint blocks argument we know that the
stopping time T has finite mean. Therefore by part (iii) of the optional stopping theorem,
EST = ES0 , i.e.
hm + (1 − h)0 = k,
where h = P(S hits m before 0). Hence
h = k/m
(as we found before).
Example 6.16. (Gambler’s ruin duration) In the set-up of the last example, let Mn =
Sn2 − n. As we have seen, this is a martingale, but it does not have bounded increments, so
we cannot apply optional stopping directly. However, the stopped martingale (Mn∧T )n≥0 does
have bounded increments, because Sn∧T is bounded between 0 and m, while n has increments
of 1. Therefore, we can apply optional stopping to (Mn∧T ) to give EMT ∧T = EM0∧T . But
EM_{0∧T} = EM0 = k² − 0 = k², and S_T² is m² with probability h and 0² with probability 1 − h, so
EM_{T∧T} = EMT = E(S_T² − T) = hm² + (1 − h)0² − ET.
We conclude that hm² − ET = k², so
ET = hm² − k² = km − k² = k(m − k)
(as we found before).
The method used above applies in general: if M is a martingale and T is a stopping time
then we can apply the optional stopping theorem to the stopped martingale (Mn∧T ) rather
than to M (provided one of the conditions (i)–(iii) holds for the stopped martingale).
The conclusion EM0 = EMT is the same, since M0∧T = M0 and MT ∧T = MT .
Finally, we can give the promised elegant solution to this question:
Example 6.17. A monkey types random letters on a keyboard. Assume that each press is
a uniformly random choice from A, . . . , Z, and different presses are independent. Find the
expected time until it types ABRACADABRA.
Solution. Suppose that just before each key press, a new gambler enters the casino. The
gambler arrives with £1, and bets that the next letter will be A. If she loses the bet then she
forfeits her stake and leaves. If she wins, her wealth increases to £26. In that case she then
bets her entire wealth that the next letter will be B, and either loses her stake and leaves, or
wins and increases her wealth to £26². In that case she stakes her entire wealth on R, and
so on. If the gambler is lucky enough to win all 11 bets until ABRACADABRA appears, she
retires with the extremely comfortable sum of £26^11.
Let Mn be the total profit of the casino after n key presses. Because all the bets are fair,
M is a martingale. Let T be the number of presses until the end of the first appearance of
ABRACADABRA, which is a stopping time. By dividing into disjoint blocks we know that
ET < ∞. And M has bounded increments, because at any given time there are at most 11
gamblers in the casino, and the total of all their bets is at most 1 + 26 + 262 + · · · + 2611 .
Hence we can apply optional stopping to deduce
EMT = EM0 .
We have M0 = 0 so EM0 = 0. What is MT? At time T there have been T presses, so T gamblers have entered and each paid in £1, so from this the casino has gained T. But ABRACADABRA has just come up, so the casino has just paid out 26^11 to one gambler. In addition, another gambler has just won 4 bets in a row because of the last ABRA, so has been paid 26^4, and the last gambler to enter has just won 1 bet because of the final A, so has been paid 26. Hence
MT = T − 26^11 − 26^4 − 26^1,
so
0 = EM0 = EMT = ET − 26^11 − 26^4 − 26^1.
Therefore
ET = 26^11 + 26^4 + 26^1.
♢
We can see from this solution that what matters is possible overlaps between different
occurrences of ABRACADABRA. One way to think of it is that each possible overlap increases the degree of clustering of occurrences, which makes the mean waiting time longer. It
is interesting to compare this with the problem sheet questions about arrival times of buses
and taxis. Equally spaced arrivals (buses) are more regular and less clustered than Poisson
arrivals (taxis) of the same rate, resulting in a shorter mean waiting time.
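The full ABRACADABRA answer, 26^11 + 26^4 + 26, is far too large to check by simulation, but the same martingale argument applies to any pattern over any alphabet. For instance, for the pattern ABA over the two-letter alphabet {A, B}, the overlaps are ABA itself and the single letter A, so the predicted mean waiting time is 2^3 + 2^1 = 10. Here is a small Python sketch (not part of the notes) estimating this:

```python
import random

def waiting_time(pattern, alphabet):
    """Number of i.i.d. uniform key presses until `pattern` first appears."""
    window, n = "", 0
    while True:
        window = (window + random.choice(alphabet))[-len(pattern):]
        n += 1
        if window == pattern:
            return n

if __name__ == "__main__":
    trials = 100000
    mean = sum(waiting_time("ABA", "AB") for _ in range(trials)) / trials
    # Martingale prediction: ET = 2^3 + 2^1 = 10 (overlaps ABA and A).
    print(f"estimated ET = {mean:.2f}, theory = {2**3 + 2**1}")
```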
6.4 Martingale convergence (not required for the course)
We give a very brief look at this important topic.
Theorem 6.18. Let X = (Xn ) be a non-negative martingale (i.e. satisfying Xn ≥ 0 for
all n). Then there is a random variable X such that
Xn → X
almost surely as n → ∞.
(The non-negativity assumption of Theorem 6.18 can be weakened in various ways.)
Proof. (sketch) We do not give the full proof, but we mention the main idea, which is
beautiful. If Xn is bounded but does not converge, then it must oscillate forever.
Imagine Xn is a stock price. Suppose there are two fixed values a < b for which Xn
oscillates between being < a and > b forever. Then there would be a guaranteed way to
make money with limited risk. Every time Xn drops below a we buy the stock, and every
time it goes above b we sell it. This guarantees a profit of at least b − a for each completed oscillation, but
because the martingale is non-negative our liability is limited. We can formalize all this
by choosing a suitable predictable H so that our wealth is H · X. It can be shown that
this contradicts the martingale property for H · X.
Example 6.19. (Polya urn) An urn initially contains one red and one blue ball. At each
step, a ball is drawn from the urn at random, and replaced together with another new ball
of the same colour.
Let Rn be the number of red balls in the urn after n steps (when there are n + 2 balls in
total). So R0 = 1, and (Rn ) is an inhomogeneous Markov chain; (Rn , n) is a homogeneous
Markov chain. We have
P(Rn+1 = j | R0 = i0, . . . , Rn = in) = P(Rn+1 = j | Rn = in).
Let Xn = Rn/(n + 2), the proportion of red balls. How does this behave as n → ∞? We have
P(Rn+1 = r + 1 | Rn = r) = P(red chosen at step n + 1) = r/(n + 2),
P(Rn+1 = r | Rn = r) = P(blue chosen at step n + 1) = 1 − r/(n + 2).
Therefore
E(Rn+1 | Rn = r) = (r + 1) · r/(n + 2) + r · (1 − r/(n + 2)) = r · (n + 3)/(n + 2),
so
E(Rn+1 | R0, . . . , Rn) = Rn · (n + 3)/(n + 2),
and therefore
E(Xn+1 | R0, . . . , Rn) = (1/(n + 3)) E(Rn+1 | R0, . . . , Rn) = (1/(n + 3)) · Rn · (n + 3)/(n + 2) = Rn/(n + 2) = Xn.
Thus (Xn ) is a martingale with respect to (Rn ).
Note that 0 ≤ Xn ≤ 1 for all n. So by the convergence theorem,
Xn → X a.s.    (20)
for some random variable X.
What is the distribution of X? Let
a_{r,n} = P(Rn = r)
(which is the probability that the Markov chain (Rn, n) hits (r, n)). Then, by conditioning on the colour drawn at step n + 1, we get a difference equation:
a_{r,n+1} = a_{r,n} (1 − r/(n + 2)) + a_{r−1,n} · (r − 1)/(n + 2).
Experimenting with small values, we can guess the solution
a_{r,n} = 1/(n + 1),    r = 1, . . . , n + 1,
which can then be proved by induction on n.
This means that Xn is uniformly random on the set
{ 1/(n + 2), 2/(n + 2), . . . , (n + 1)/(n + 2) }.
From this we can deduce that for any u ∈ [0, 1],
P(Xn < u) → u
as n → ∞.
In other words,
P(Xn < u) → P(U < u)
as n → ∞,
where U ∼ Unif[0, 1]. We say that Xn converges in distribution to U , sometimes
written Xn ⇒ U . (This is the same type of convergence that appears in the central limit
theorem). This is a much weaker statement than almost sure convergence. However, we
know that Xn → X a.s. for some X, and it can be shown that this implies Xn ⇒ X. It
follows that X has the same distribution as U , i.e. it is Uniform on [0, 1].
This is quite a striking result. In the long term, the proportion of red balls settles down
to a limit, but the limit is random, and in fact uniformly distributed. (We might imagine
that the limiting proportion is somehow chosen ahead of time, and gradually revealed
during the experiment, although of course that’s not what is really happening.)
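A short simulation illustrates the limiting behaviour: for large n, the value of Xn across many independent runs of the urn looks uniform on [0, 1]. The following minimal Python sketch (not part of the notes) uses n = 1000 steps as a stand-in for the limit.

```python
import random

def polya_proportion(steps=1000):
    """Run the Polya urn for `steps` draws and return the proportion of red balls Xn."""
    red, total = 1, 2          # one red and one blue ball initially
    for _ in range(steps):
        if random.random() < red / total:   # a red ball is drawn...
            red += 1                        # ...and a new red ball is added
        total += 1
    return red / total

if __name__ == "__main__":
    runs = 2000
    limits = [polya_proportion() for _ in range(runs)]
    # If the limit is Unif[0,1], about 20% of runs should land in each fifth of [0,1].
    for a in (0.0, 0.2, 0.4, 0.6, 0.8):
        frac = sum(a <= x < a + 0.2 for x in limits) / runs
        print(f"fraction of runs with X in [{a:.1f},{a + 0.2:.1f}) ≈ {frac:.3f}")
```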
There are several other notions of convergence of random variables. Understanding
how they relate to each other is a key topic in more advanced probability theory.
7 Brownian Motion
7.1 Introduction
Brownian motion (also called the Wiener process) is one of the most important and well-studied stochastic processes. Like the Poisson process it is just one process, rather than a
broad family like Markov chains or martingales. Unlike everything else we have looked at so
far, it has continuous time and continuous space.
What does the simple symmetric random walk (Sn ) on Z look like if we run it for a large
number of steps? We can try plotting a typical trajectory, here for the first 10, 100, 1000, 10000
steps:
[Figure: plots of Sn against n for the first 10, 100, 1000 and 10000 steps of the walk, with the vertical axis drawn on the same scale of order n as the horizontal axis.]
The results are a bit disappointing. But that's because we didn't scale the axes appropriately. In fact we could have predicted this. A single step Xi of the random walk has mean 0 and variance 1, so Sn = X1 + · · · + Xn has mean 0 and variance n, and therefore standard deviation √n. (In fact the central limit theorem tells us that it is approximately Normal with these parameters for large n.) So if both axes have scale n, the typical √n deviation of Sn from 0 will be too small to see when n is large. The correct thing to do is stretch the picture vertically to have scale of order √n:
[Figure: the same random walk trajectories re-plotted with the vertical axis running from −√n to √n, for n = 10, 100, 1000, 10000.]
That looks much more interesting! Note that the picture looks similar in character for
n = 1000 and n = 10000. The idea is that Brownian motion is the limit of these pictures as
n → ∞. It’s not so clear what a limit of a random function means, nor why it should exist.
These issues can be resolved, but that is beyond the scope of this course. Instead we will
define Brownian motion via its properties, as we have done with other processes.
We already remarked that the distribution of Sn is approximately Normal, by the central
limit theorem. So it’s no surprise that the Normal distribution will feature in the definition.
Recall that a standard Normal distribution has probability density function
ϕ(z) = (1/√(2π)) e^{−z^2/2},    −∞ < z < ∞,
and has mean 0 and variance 1. If Z has standard Normal distribution, denoted Z ∼ N(0, 1), then X = σZ + µ has Normal distribution with mean µ and variance σ^2, denoted X ∼ N(µ, σ^2). (It's important to remember that the second parameter σ^2 is the variance, not the standard deviation.) More generally, if X ∼ N(µ, σ^2) then aX + b ∼ N(aµ + b, a^2 σ^2). Recall also that if X1 and X2 are independent with Xi ∼ N(µi, σi^2), then X1 + X2 ∼ N(µ1 + µ2, σ1^2 + σ2^2).
Now we can give the main definition. Consider a stochastic process B = (B(t))t≥0 , where
t takes non-negative real values and each B(t) is a real-valued random variable. We say that
B is a (standard) Brownian motion if:
1. B(0) = 0, and B is a continuous function almost surely;
2. B has independent increments (over disjoint intervals);
3. B(s + t) − B(s) ∼ N (0, t) for all s ≥ 0 and t > 0.
As before, “independent increments” means that for any 0 ≤ t0 ≤ t1 ≤ · · · ≤ tn , the
random variables
B(t1) − B(t0), B(t2) − B(t1), . . . , B(tn) − B(tn−1)
are independent. Sometimes it is convenient to define a N (0, 0) random variable to be
deterministically equal to 0. Then property 3 is true for t = 0 as well.
Brownian motion typically looks something like this:
[Figure: a typical Brownian motion path B(t) for 0 ≤ t ≤ 1, fluctuating roughly between −1 and 1.]
(In fact this is just the last random walk picture from before with different axis labels, which
is a good approximation. If we zoom in far enough on this picture we’ll see discrete steps.
Genuine Brownian motion would continue to look rough and spiky no matter how far we
zoom in.)
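The defining properties also tell us how to simulate an approximation: chop [0, T] into small intervals of length dt and add up independent N(0, dt) increments. Here is a minimal Python sketch (not part of the notes; the step size is an arbitrary choice):

```python
import math
import random

def brownian_path(T=1.0, n=10000):
    """Approximate a standard Brownian motion on [0, T] by summing
    independent N(0, dt) increments, dt = T/n (properties 1-3 of the definition)."""
    dt = T / n
    times, values = [0.0], [0.0]
    b = 0.0
    for i in range(1, n + 1):
        b += random.gauss(0.0, math.sqrt(dt))   # increment over [(i-1)dt, i*dt]
        times.append(i * dt)
        values.append(b)
    return times, values

if __name__ == "__main__":
    t, b = brownian_path()
    print(f"B(1) ≈ {b[-1]:.3f}   (a sample from roughly N(0, 1))")
```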
Example 7.1. For a standard Brownian motion B, compute Cov(B(s), B(t)).
Solution. Assume s ≤ t. Using independent increments,
E(B(s)B(t)) = E[B(s)((B(t) − B(s)) + B(s))] = E(B(s)) E(B(t) − B(s)) + E(B(s)^2) = 0 + s,
because B(s) ∼ N(0, s), so E(B(s)^2) = Var B(s) = s. So
Cov(B(s), B(t)) = E(B(s)B(t)) − EB(s)EB(t) = s − 0 = s.
Therefore, in general Cov(B(s), B(t)) = min(s, t).
♢
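The covariance formula Cov(B(s), B(t)) = min(s, t) is easy to confirm numerically, sampling B(s) and B(t) via two independent increments. A small Python sketch (not part of the notes; the values s = 0.3, t = 0.8 are arbitrary):

```python
import math
import random

def bm_at(times):
    """Sample a Brownian path at the given increasing times, using independent increments."""
    b, prev, values = 0.0, 0.0, []
    for t in times:
        b += random.gauss(0.0, math.sqrt(t - prev))   # increment ~ N(0, t - prev)
        prev = t
        values.append(b)
    return values

if __name__ == "__main__":
    s, t, trials = 0.3, 0.8, 100000
    mean_prod = sum(bs * bt for bs, bt in (bm_at([s, t]) for _ in range(trials))) / trials
    # E B(s) = E B(t) = 0, so the covariance is just E[B(s)B(t)].
    print(f"estimated Cov(B(s), B(t)) = {mean_prod:.3f},  min(s, t) = {min(s, t)}")
```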
7.2 Symmetries
Many transformations of Brownian motion yield Brownian motion back again.
Theorem 7.2. Let B = (B(t))t≥0 be a standard Brownian motion. Then each of the
following is also a standard Brownian motion:
(i) (−B(t))t≥0 ;
(ii) (B(t + s) − B(s))t≥0 , for any constant s ≥ 0;
(iii) ((1/√c) B(ct))_{t≥0}, for any constant c > 0.
Properties (i) (reflection symmetry) and (ii) (time-homogeneity) are not particularly surprising. (ii) says that Brownian motion viewed from time s onwards is just like the original,
provided we put our origin at the point it has got to, (s, B(s)) (see the picture below). (iii) is
more interesting: it says that Brownian motion is invariant under scaling, but we must scale
the horizontal and vertical axes differently – this reflects the square-root scaling that we saw
for the random walk pictures.
[Figure: a Brownian motion path on [0, 1] with the point (s, B(s)) marked as the new origin for the shifted process.]
Proof. We just need to check the properties in each case.
(i) We have −B(0) = B(0) = 0, and −B is continuous. Since B has independent
increments, so does −B. Since B(t + s) − B(s) ∼ N (0, t) and the N (0, t) distribution is
symmetric about 0, we have −B(s + t) − (−B(s)) = −(B(s + t) − B(s)) ∼ N (0, t).
(ii) Let X(t) = B(s + t) − B(s). Then X(0) = B(s) − B(s) = 0, and X is continuous since
B is. The increment of X over an interval [u, v] is X(v) − X(u) = B(s + v) − B(s + u), so X
has independent increments. Finally, X(t + u) − X(u) = B(t + u + s) − B(u + s) ∼ N (0, t),
as required.
(iii) Let Y(t) = (1/√c) B(ct). Then Y(0) = 0 and Y is continuous and has independent
increments. Moreover, we have Y(s + t) − Y(s) = (1/√c)(B(cs + ct) − B(cs)). Since
B(cs + ct) − B(cs) ∼ N(0, ct), we get Y(s + t) − Y(s) ∼ N(0, ct · (1/√c)^2) = N(0, t).
We can also reverse time.
Theorem 7.3 (reversal). Let B be a Brownian motion and fix u > 0. Then the time-reversed process (B(u − t) − B(u))_{0≤t≤u} has the same distribution as (B(t))_{0≤t≤u}.
In other words, running time backwards from u to 0 simply gives a Brownian motion run
up to time u (provided we subtract the starting value B(u) to make the process start at 0).
Proof. Let X(t) = B(u−t)−B(u). Then X(0) = B(u)−B(u) = 0, and X is continuous since
B is. For 0 ≤ s ≤ t ≤ u, the increment of X over [s, t] is X(t) − X(s) = B(u − t) − B(u) −
B(u − s) + B(u) = −(B(u − s) − B(u − t)), i.e. minus the increment of B over [u − t, u − s].
Therefore X has independent increments, since B does. And finally, for 0 ≤ s ≤ s + t ≤ u
we have X(s + t) − X(s) = −(B(u − s) − B(u − s − t)) ∼ N (0, t).
The next property is considerably more surprising – we can invert time, meaning we
replace t with 1/t, so that in particular the behaviour of B near ∞ gets mapped to near 0
and vice versa.
Theorem 7.4 (inversion). If B is a standard Brownian motion then so is the process
X = (X(t))t≥0 defined by
X(t) = t B(1/t) for t > 0, and X(0) = 0.
We won’t prove this, although in the problem sheet we check some of the ingredients. It
has some striking consequences. For example, it can be shown that almost surely B takes
both positive and negative values at arbitrarily large times. From the inversion formula it
follows that also it takes both positive and negative values at arbitrarily small times. By
time-homogeneity it then also follows that, for any fixed s, B takes values both greater
than B(s) and less than B(s) at times t arbitrarily close to s. Brownian motion is very
wiggly!
7.3 Reflection principle
Let B be a standard Brownian motion and let
Ta = min{t : B(t) = a}
be the first hitting time of a. It can be shown (see below) that Ta < ∞ almost surely. This
should not come as a surprise, by analogy with the recurrence of simple symmetric random
walk.
Here is a beautiful fact that allows us to compute many things.
Theorem 7.5 (reflection principle). If B is a standard Brownian motion then the process
W = (W(t))_{t≥0}, where W(t) = B(t) for t ≤ Ta and W(t) = 2a − B(t) for t > Ta,
is a standard Brownian motion.
91
[Figure: a Brownian path B(t), the level a, the hitting time Ta, and the reflected path W(t) obtained by flipping B about the line at height a after time Ta.]
See the picture. The process W (t) follows the original Brownian motion B(t) until the first
time Ta that it meets the horizontal line at height a, and thereafter it follows the trajectory of
B(t) reflected about that horizontal line. (The formula arises because B(t) − a is the height
of B(t) relative to a, so negating this relative height gives a − (B(t) − a) = 2a − B(t).)
Proof. (sketch) There is a strong Markov property for Brownian motion, and Ta is a stopping
time. We omit the general statement and definitions; in our case the conclusion is that the
process
X(t) = B(Ta + t) − B(Ta ), t ≥ 0
started at time Ta is a Brownian motion, and it is independent of the process (B(t))0≤t≤Ta
before Ta . Since B is continuous we have B(Ta ) = a. So we can obviously express the original
Brownian motion as
B(t) = { B(t), t ≤ Ta;  a + X(t − Ta), t > Ta }.
To get W we just replace X with its negative:
W(t) = { B(t), t ≤ Ta;  a − X(t − Ta), t > Ta }.
This must give a process with the same distribution, because if X is a Brownian motion then
so is −X, and both are independent of the process before Ta .
Now we look at some applications. Let
M (t) = max{B(s) : 0 ≤ s ≤ t}
be the running maximum of Brownian motion. It might appear hard to compute the distribution of M (t) – it is the maximum of infinitely many random variables, which are highly
dependent. But amazingly we have the following.
Theorem 7.6. M (t) has the same distribution as |B(t)|.
Of course B(t) ∼ N (0, t), so there is no particular mystery about the distribution of |B(t)|
itself.
Proof. Fix a > 0. We have
P(M (t) ≥ a) = P(M (t) ≥ a, B(t) ≥ a) + P(M (t) ≥ a, B(t) < a)
= P(B(t) ≥ a) + P(M(t) ≥ a, B(t) < a),    (21)
because M (t) ≥ B(t). To deal with the second term, note the equivalence of the events
(exercise: check this carefully)
{M(t) ≥ a} = {Ta ≤ t},    (22)
and also note that the hitting time of a is the same for B as for the process W defined in
Theorem 7.5 (because they agree up to the hitting time). Hence
P(M (t) ≥ a, B(t) < a) = P(Ta ≤ t, B(t) < a)
= P(Ta ≤ t, W (t) < a)
= P(Ta ≤ t, 2a − B(t) < a)
= P(Ta ≤ t, B(t) > a)
= P(M (t) ≥ a, B(t) > a)
= P(B(t) > a).
Therefore, from (21), and since B(t) is a continuous symmetric random variable,
P(M (t) ≥ a) = P(B(t) ≥ a) + P(B(t) > a) = 2P(B(t) ≥ a) = P(|B(t)| ≥ a).
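Theorem 7.6 can also be checked by simulation, with the caveat that a discrete grid slightly underestimates the true running maximum. The following Python sketch (not part of the notes; the grid size and the level a = 0.5 are arbitrary) compares P(M(1) ≥ a) with P(|B(1)| ≥ a):

```python
import math
import random

def max_and_end(t=1.0, n=1000):
    """Simulate B on a grid of n steps over [0, t]; return (running max, B(t)).
    The grid maximum slightly underestimates the true M(t)."""
    dt = t / n
    b = m = 0.0
    for _ in range(n):
        b += random.gauss(0.0, math.sqrt(dt))
        m = max(m, b)
    return m, b

if __name__ == "__main__":
    a, trials = 0.5, 5000
    hits = ends = 0
    for _ in range(trials):
        m, b = max_and_end()
        hits += (m >= a)
        ends += (abs(b) >= a)
    print(f"P(M(1) >= {a})  ≈ {hits / trials:.3f}")
    print(f"P(|B(1)| >= {a}) ≈ {ends / trials:.3f}   (Theorem 7.6: these agree)")
```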
We can use the last result, together with (22), to compute the distribution of the hitting
time.
Theorem 7.7. For a > 0, the random variable Ta has probability density function
f(t) = (a / √(2π t^3)) e^{−a^2/(2t)},    t > 0.
Proof. See the problem sheet.
In particular the proof of Theorem 7.7 implies that Ta < ∞ almost surely. (This fact was
not needed for the proof of Theorem 7.6.) It is also easy to deduce that ETa = ∞, which
should not be surprising by analogy with simple symmetric random walk.
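The infinite mean can be illustrated with the random-walk analogue: if T is the number of steps for a simple symmetric random walk to first reach +1, then the truncated means E[min(T, cap)] keep growing (roughly like √cap) instead of levelling off. A rough Python sketch (not part of the notes; the caps and trial counts are arbitrary):

```python
import random

def first_passage_capped(cap):
    """Steps for a simple symmetric random walk started at 0 to first reach +1,
    truncated at `cap` steps."""
    s = 0
    for n in range(1, cap + 1):
        s += random.choice((-1, 1))
        if s == 1:
            return n
    return cap

if __name__ == "__main__":
    trials = 2000
    for cap in (10**3, 10**4, 10**5, 10**6):
        mean = sum(first_passage_capped(cap) for _ in range(trials)) / trials
        print(f"cap = {cap:>7}:  E[min(T, cap)] ≈ {mean:.0f}")
```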
It can be shown that almost surely there is a unique time at which Brownian motion
attains its maximum value in the interval [0, 1]. We let Y be this time, so that 0 ≤ Y ≤ 1
and
B(Y ) = max{B(t) : t ∈ [0, 1]}.
It turns out that we can compute the distribution of this as well!
Remark – not required for the course. There is a subtle point here. We saw earlier
that for any fixed time s, almost surely Brownian motion B takes values both greater
than and less than B(s) arbitrarily close to s on both the left and the right. But that
is not the case at Y ! By definition of a maximum, B takes only values less than B(Y )
immediately to the left and right of Y . This is not a contradiction, because the time
interval [0, 1] is uncountable. We could conclude (by the union bound) that the earlier
almost sure statement about s holds simultaneously for a fixed countable set of times, such
as all s ∈ [0, 1] ∩ Q, but not for all of [0, 1]. The random variable Y is continuous, so the
probability that it takes any particular value is zero.
It turns out that certain other properties do hold for all times simultaneously. For
instance, it can be shown that Brownian motion is nowhere differentiable almost surely.
One must be careful with such issues when dealing with continuous time processes!
Theorem 7.8. The location Y of the maximum of Brownian motion on [0, 1] has cumulative distribution function
P(Y < t) = (2/π) arcsin √t,    0 ≤ t ≤ 1.
Proof. For any t ∈ (0, 1),
P(Y < t) = P( max_{s∈[0,t]} B(s) > max_{s∈[t,1]} B(s) )
= P( max_{s∈[0,t]} (B(s) − B(t)) > max_{s∈[t,1]} (B(s) − B(t)) )
= P( max_{u∈[0,t]} (B(t − u) − B(t)) > max_{v∈[0,1−t]} (B(t + v) − B(t)) ).
The purpose of re-writing like this is to consider the two parts of the process before and
after t as processes started from t. By time homogeneity, (B(t + v) − B(t))v≥0 is a Brownian
motion. By reversibility, (B(t − u) − B(t))0≤u≤t is also a Brownian motion. Moreover, by
independent increments, these two processes are independent. Therefore, by Theorem 7.6, the
two maxima in the last line above are simply absolute values of independent Normal random
variables. More precisely, since B(r) can be expressed as Z√r for a standard Normal Z, the
above probability equals
P( |Z1| √t > |Z2| √(1 − t) ),
where Z1 , Z2 are independent standard Normal random variables.
The joint probability density function of Z1 , Z2 is
f(z1, z2) = (1/(2π)) e^{−(z1^2 + z2^2)/2},
which is circularly symmetric. Therefore if (R, Θ) are the polar coordinates of (Z1, Z2) then
Θ is uniformly distributed on [0, 2π] (and independent of R). The region {(z1, z2) : |z1|√t >
|z2|√(1 − t)} is simply the two shaded sectors:
[Figure: the (z1, z2) plane with the two shaded sectors around the z1-axis, bounded by the rays through (√(1 − t), √t) and its reflections, each making angle α with the z1-axis.]
where the angle α is given by tan α = √t / √(1 − t), so sin α = √t. Therefore the probability
that the random point (Z1 , Z2 ) lies in the region is
4α/(2π) = (2/π) arcsin √t.
The distribution of Y is called the arcsine distribution. By differentiating the distribution function in Theorem 7.8 (using d/dt arcsin √t = 1/(2√t √(1 − t))), its probability density function is
1/(π √(t(1 − t))),    0 < t < 1;
see below.
[Figure: the arcsine density 1/(π√(t(1 − t))) plotted on (0, 1); it is U-shaped, blowing up near t = 0 and t = 1.]
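The arcsine law for the location of the maximum can also be checked by simulating discretized paths and recording where the grid maximum occurs. A Python sketch (not part of the notes; the grid size and trial count are arbitrary):

```python
import math
import random

def argmax_time(n=1000):
    """Approximate the time in [0, 1] at which a discretized Brownian path is largest."""
    dt = 1.0 / n
    b, best, best_i = 0.0, 0.0, 0
    for i in range(1, n + 1):
        b += random.gauss(0.0, math.sqrt(dt))
        if b > best:
            best, best_i = b, i
    return best_i * dt

if __name__ == "__main__":
    trials = 4000
    ys = [argmax_time() for _ in range(trials)]
    for t in (0.1, 0.25, 0.5, 0.75, 0.9):
        empirical = sum(y < t for y in ys) / trials
        theory = 2 / math.pi * math.asin(math.sqrt(t))
        print(f"P(Y < {t}):  empirical {empirical:.3f},  arcsine {theory:.3f}")
```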
Further remarks – not required for the course. Remarkably, it can be shown that
both of the following random variables also have arcsine distribution: the last zero of
Brownian motion in [0, 1]:
L = sup{t ∈ [0, 1] : B(t) = 0};
and the amount of time in [0, 1] for which it is positive:
H = ∫_0^1 I{B(t)>0} dt.
Brownian motion has many applications, including to physics and finance. It is named
after the botanist Robert Brown, who observed random motion of pollen grains under a
microscope. Albert Einstein concluded that this motion was due to random collisions with
molecules. The mathematical definition of Brownian motion is due to Norbert Wiener.
Brownian motion forms the underpinning of Itô calculus, which in turn is the basis of
modern mathematical finance.
There are thousands of mathematical papers written on Brownian motion. Of particular interest is d-dimensional Brownian motion, which is simply the vector-valued process
B^(d)(t) = (B1(t), . . . , Bd(t)), where the components (Bi(t))_{t≥0} are independent Brownian
motions. A few celebrated results are:
1. If B is a 1-dimensional Brownian motion and M is its running maximum as defined
before, then the process (2M(t) − B(t))_{t≥0} has the same distribution as (∥B^(3)(t)∥)_{t≥0},
the Euclidean norm of a 3-dimensional Brownian motion.
2. If S = {B^(2)(t) : t ∈ [0, 1]} is the image of 2-dimensional Brownian motion up to
time 1, then conditioned on B^(2)(1) = 0, the area enclosed by S has mean π/5.
3. The outer boundary of S has fractal dimension (Hausdorff dimension) 4/3.