Probability 2, 2021-22
University of Bristol
6 March 2023
A. E. Holroyd
Probability 2 – Lecture Notes
Contents

1 Introduction
  1.1 A tricky example
  1.2 Axioms of probability
  1.3 Continuity of probability
  1.4 Conditioning, independence, random variables
  1.5 Expectation

2 Random Walks
  2.1 Stochastic processes
  2.2 Random walk basics
  2.3 Random walk with absorbing barriers (Gambler's ruin)
  2.4 Duration of the game
  2.5 One barrier, no barrier
  2.6 Stopping times, Wald's identity

3 Markov Chains
  3.1 Introduction
  3.2 Multi-step transition probabilities
  3.3 Hitting probabilities and times
  3.4 Strong Markov property
  3.5 Recurrence and transience
  3.6 Random walks in d dimensions
  3.7 Communicating classes
  3.8 Mean return times
  3.9 Convergence examples
  3.10 Stationary distributions
  3.11 Existence and uniqueness of stationary distributions
  3.12 Periodicity and the convergence theorem
  3.13 Reversibility

4 Poisson Process
  4.1 Introduction
  4.2 Increment distribution
  4.3 Arrival times
  4.4 Superposition and thinning
  4.5 Conditioning
  4.6 Inhomogeneous and Multidimensional Poisson processes

5 Continuous-time Markov chains and queues
  5.1 Introduction
  5.2 The jump chain and explosion
  5.3 Transition probabilities
  5.4 Classification of states
  5.5 Stationary distributions
  5.6 Reversibility and queueing theory

6 Martingales
  6.1 Introduction
  6.2 New martingales from old
  6.3 Stopped martingales
  6.4 Martingale convergence (not required for the course)

7 Brownian Motion
  7.1 Introduction
  7.2 Symmetries
  7.3 Reflection principle
1 Introduction
This course is a follow-on to Probability 1 / Probability and Statistics. The main object of
interest is a stochastic process, i.e. a random process that evolves over time. Mathematically, it is simply a family of random variables X0 , X1 , X2 , . . . or (Xt )t∈[0,∞) ; Xt represents the
state of the system at time t. Applications include population models, industrial processes,
chemical reactions, and financial models. The course will focus on mathematical models
rather than details of their applications. A typical question will be to understand the long
term behaviour of a process as t → ∞. The course will be partially rigorous - we’ll aim to
understand why things are true, but some of the more difficult proofs will be omitted. The
course is not primarily about memorizing proofs. It is more important to understand the
models and techniques and learn to use them.
Two of the most important general classes of stochastic processes will be Markov chains
(which can be seen as memoryless processes), and martingales (roughly speaking, fair
gambling games). Key special cases include random walks (which are Markov chains and
martingales simultaneously), Poisson processes (the simplest model for random events
happening in continuous time) and Brownian motion (random motion in continuous space
and continuous time).
Material with a left bar (like this) is not required for the course. It will not be
covered in problem sheets or exams.
1.1 A tricky example
The first few lectures will be largely a review of some basics of probability theory, but at
a slightly deeper level than before. To whet our appetite before that, we look at a somewhat
tricky example that illustrates some important themes, and that we’ll come back to several
times.
Example 1.1. A fair coin is tossed repeatedly. Find the expected number of tosses until the
first occurrence of HT H (at 3 consecutive tosses).
We’ll eventually see 3 different approaches to this problem: an ad hoc method (now); a
more systematic method using Markov chains; and finally a very neat method using martingales.
A first guess might be: the probability of getting HTH immediately in the first 3 tosses is (1/2)^3 = 1/8. So by analogy with the Geometric distribution we might guess that the expected time until the first occurrence is 1/(1/8) = 8. But occurrences are not independent: if HTH has just occurred then it's impossible for it to occur again one step later (starting at the T), while two steps later we have a head-start because we already have an H, so we might get HTHTH. It's not so obvious whether these effects should make the expected waiting time longer, or shorter, or make no difference.
Solution. Let X be the number of tosses until the first HTH. So we want its expectation, EX. We could try to compute

P(X = 3) = P(HTH) = 1/8,
P(X = 4) = P(HHTH) + P(THTH) = 2 · (1/16) = 1/8,

etc., but it quickly gets complicated. We need another approach.
First let's show that EX < ∞. Split the tosses into disjoint blocks of 3, e.g.

THT | HTT | HTH | · · ·

and let Z be the number of blocks until the first HTH block. So here Z = 3 (even though HTH occurred earlier across two blocks). The advantage is that different blocks are independent, so Z has a Geometric(1/8) distribution, hence EZ = 8. Now we have X ≤ 3Z, so

EX ≤ 3EZ = 24 < ∞.
Before tackling X, let Y be the number of tosses until the first occurrence of HT . By the
same reasoning as above EY < ∞. Now compute EY by conditioning on the first toss F :
EY = E(Y | F = H)P(F = H) + E(Y | F = T)P(F = T) = (1 + 2)·(1/2) + (1 + EY)·(1/2),
because if the first toss is H we have used one toss and we just have to wait for the first T ,
which comes after Geometric(1/2) time, while if it is T then we have wasted one toss and
it is like starting again from the beginning. Now we can solve this equation for EY (it is
important that we know EY < ∞!) to get EY = 4.
Now we’ll compute EX by conditioning on V (say), the next toss after the first occurrence
of HT , and also on the value of Y :
E(X | Y = y) = E(X | Y = y, V = H)·(1/2) + E(X | Y = y, V = T)·(1/2)
            = (y + 1)·(1/2) + (y + 1 + EX)·(1/2)
            = 1 + y + (1/2)EX,
because in the first case we are done, and in the second it is like starting again from scratch.
It may not be so clear why we are allowed to do all this - after all, V is not like the first toss
– it’s a randomly chosen toss whose location depends on the other tosses (it is toss number
Y + 1, in fact)! We’ll learn exactly how to justify such things later.
Recall that E(X | Y = y) is always a function of y, say h(y), and the conditional
expectation E(X | Y) is defined to be the random variable h(Y). So in this case, E(X | Y) = 1 + Y + (1/2)EX. By the double expectation formula or tower property:

EX = E(E(X | Y)) = E(1 + Y + (1/2)EX) = 1 + EY + (1/2)EX.
So we can solve:
EX = 2(1 + EY ) = 2(1 + 4) = 10.
♢
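For readers who like to check such calculations numerically, here is a short simulation sketch (in Python; it is not part of the printed notes, and the function name is ours) estimating EX by averaging the waiting time for HTH over many runs. The estimate should be close to the exact answer 10; replacing the pattern by HT should give approximately 4, in line with the intermediate computation above.

```python
import random

def waiting_time(pattern="HTH"):
    """Toss a fair coin until `pattern` first appears; return the number of tosses."""
    recent, tosses = "", 0
    while True:
        recent = (recent + random.choice("HT"))[-len(pattern):]
        tosses += 1
        if recent == pattern:
            return tosses

random.seed(0)
trials = 200_000
print(sum(waiting_time() for _ in range(trials)) / trials)  # approximately 10
```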
A more complicated example of the same kind is: a monkey types random letters on a
keyboard; what is the expected time until it types ABRACADABRA? We’ll return to this
later. (This word is chosen to be a particularly interesting example – what features do you
think are important here?) Another thing to think about: what is the probability that one
pattern (say HT H) occurs before another (say T T H)?
1.2 Axioms of probability
A probability space is a set Ω (where the elements ω ∈ Ω represent all possible ‘outcomes’).
An event is a subset A ⊆ Ω. A probability measure is a function P from events to R
satisfying the axioms:
1. P(A) ≥ 0 for all events A
2. P(Ω) = 1
3. If A1, A2, . . . are pairwise disjoint events (i.e. Ai ∩ Aj = ∅ for i ≠ j) then

P( ⋃_{i=1}^∞ Ai ) = Σ_{i=1}^∞ P(Ai).
A technical complication (not needed for this course) is that in many important settings
(primarily those where Ω is uncountable), it is not possible to assign probabilities in this
way to every subset of Ω. We have to restrict to sufficiently ‘well-behaved’ sets A, and these
are the ones that we call ‘events’. Determining what counts as well-behaved is the job of
measure theory, which is beyond the scope of this course. The good news is that any set that
one would normally want to think about is indeed well-behaved, so in practice one does not
typically have to worry about this!
It is important that in Axiom 3 (and things we’ll derive from it) the sequence of events is
countable. The same statement does not in general hold for uncountable families of events.
For instance, let the random variable X be uniformly random on the real interval [0, 1] (so it has probability density function 1 on [0, 1] and 0 elsewhere). Let Ar be the event {X = r}, where r ∈ [0, 1]. Then P(Ar) = ∫_r^r 1 dx = 0, and the (Ar) are pairwise disjoint. But

P( ⋃_{r∈[0,1]} Ar ) = P(X ∈ [0, 1]) = 1,

while

Σ_{r∈[0,1]} P(Ar) = Σ_{r∈[0,1]} 0 = 0,
which are not equal. Uncountable sets also arise when we consider infinite sequences (even
when the elements in the sequence take discrete values, e.g. coin tosses), so we need to be
careful.
Simple examples of probability spaces include:
1. Roll two dice. Ω = {1, 2, . . . , 6}² = {(a, b) : a, b ∈ {1, . . . , 6}} (finite). P({ω}) = 1/6² for each ω.

2. Uniform[0, 1] random variable as above. Ω = [0, 1] (uncountable); X(ω) = ω. P({ω}) = 0 for each ω, so this information is not enough to specify P.

3. Toss a coin repeatedly (forever).

Ω = {H, T}^{Z+} = {x = (x1, x2, . . .) : xi ∈ {H, T} for all i}

– all sequences of Hs and Ts indexed by the positive integers Z+ = {1, 2, . . .}. Again, Ω is uncountable. The first toss is the random variable X1 given by X1(x) = x1 for x = (x1, x2, . . .) ∈ Ω.
Usually we don’t talk explicitly about Ω much, but instead work with events and random
variables.
1.3 Continuity of probability
Continuity of a function f means (among other definitions) that if xn → x then f (xn ) → f (x).
A probability measure P satisfies a similar condition, but to state it we need to make sense
of what it means for events to converge to a limit. There are two settings where we can do
this: decreasing and increasing sequences.
Let A1, A2, . . . be a sequence of events. Suppose that Ai ⊆ Ai+1 (i.e. if Ai occurs then Ai+1 occurs) for all i. We call such a sequence increasing. Since in that case An = ⋃_{i=1}^n Ai, it makes sense to consider A = ⋃_{i=1}^∞ Ai, the event that at least one of the events occurs. We call A the limit of the sequence and write An ↑ A as n → ∞.
Theorem 1.2 (Continuity for increasing limits). Let An be an increasing sequence of events and let A = ⋃_{i=1}^∞ Ai. Then

P(A) = lim_{n→∞} P(An).

(Of course P(An) is an increasing sequence of numbers, so this is an increasing limit: P(An) ↑ P(A).)
Proof. We want to use Axiom 3, so we need a sequence of disjoint events. Define D1 = A1, and Dn = An \ An−1 for n ≥ 2. It is straightforward to check that the Di are pairwise disjoint, and that An = ⋃_{i=1}^n Di and A = ⋃_{i=1}^∞ Di. Then, using Axiom 3 and the definition of an infinite sum,

P(A) = Σ_{i=1}^∞ P(Di) = lim_{n→∞} Σ_{i=1}^n P(Di) = lim_{n→∞} P(An).
Similarly, if Ai ⊇ Ai+1 for all i then we call the sequence of events decreasing, interpret A = ⋂_{i=1}^∞ Ai as its limit, and write An ↓ A.

Theorem 1.3 (Continuity for decreasing limits). Let An be a decreasing sequence of events and let A = ⋂_{i=1}^∞ Ai. Then

P(A) = lim_{n→∞} P(An).
Proof. This is really the last result in disguise, if we take complements. Let Bi = Ai^C. Then it is straightforward to check that Bi is an increasing sequence and that Bn ↑ B where B = A^C (this is De Morgan's law on sets). Then using the last theorem,

P(A) = 1 − P(B) = 1 − lim_{n→∞} P(Bn) = lim_{n→∞} (1 − P(Bn)) = lim_{n→∞} P(An).
Let’s use these results to formally justify some facts that should seem pretty obvious. . .
Example 1.4. A fair coin is tossed repeatedly. What is the probability that every toss is
Heads?
Solution. The idea is to reduce to “finitely defined” events that we know how to deal with.
Let An = {tosses 1, . . . , n are all H} and A = {all tosses are H}. Then An ↓ A, so by
continuity of probability,
P(A) = lim_{n→∞} P(An) = lim_{n→∞} (1/2)^n = 0. ♢
Similarly, the probability of any fixed sequence of Hs and T s is 0, so the situation is very
similar to the Uniform[0, 1] random variable case.
Example 1.5. A fair coin is tossed repeatedly. What is the probability that Heads appears
infinitely many times?
Solution. If there are only finitely many Heads then the sequence is all Tails after some
point. So let Bn = {tosses n, n + 1, . . . are all T}. Then P(Bn) = P(B1) = 0, from the last example. The sequence Bn is increasing; let B = ⋃_n Bn. Then B is the event that the sequence is all T from some point on, which is the same as the event that there are only finitely many Hs. We have P(B) = lim_{n→∞} 0 = 0. So
P(infinitely many Hs) = P(B C ) = 1 − P(B) = 1.
♢
If an event has probability 1 we sometimes say that it occurs almost surely (or a.s.).
(The philosophy behind this terminology is that there can still be elements of Ω where
the event does not occur, so saying “surely” would be too strong).
Example 1.6. Show that almost surely there are infinitely many Heads and infinitely many
Tails.
Solution. Let A = {infinitely many Hs} and B = {infinitely many Ts}. From the previous
example we know P(A) = P(B) = 1, and we want to conclude P(A ∩ B) = 1. It’s often a bit
easier to argue that probabilities are 0 rather than 1, so we take complements:
1 − P(A ∩ B) = P((A ∩ B)C ) = P(AC ∪ B C ) = P(AC ) + P(B C ) − P(AC ∩ B C )
≤ P(AC ) + P(B C ) = 0 + 0 = 0.
So P(A ∩ B) = 1 − 0 = 1.
♢
Above we used the inequality
P(A ∪ B) ≤ P(A) + P(B),
(but applied to the complements AC , B C ). This is often useful. As usual, it works for
countable families too:
Theorem 1.7 (Union bound). Let A1 , A2 , . . . be events. Then
P( ⋃_{i=1}^∞ Ai ) ≤ Σ_{i=1}^∞ P(Ai).
This should seem very natural – Axiom 3 says that we have equality when the events are
disjoint; if they overlap then the left side can only become smaller. It can be proved directly
from Axiom 3 in a similar manner to before. We’ll see a different proof using indicators soon.
1.4 Conditioning, independence, random variables
The conditional probability of event A given event B is defined to be

P(A | B) = P(A ∩ B) / P(B),

provided P(B) > 0. If B is fixed and P(B) > 0, then it's an exercise to show that the function P̃ defined by P̃(A) := P(A | B) for all events A is itself a probability measure. The interpretation is that P̃ describes the state of the universe if you know that B has happened.
Events A and B are said to be independent if
P(A ∩ B) = P(A)P(B).
This is equivalent to P(A | B) = P(A) and P(B | A) = P(B) provided the relevant conditional
probability is defined (i.e. its denominator is non-zero). The interpretation is that events are
unrelated – knowing one occurred tells us nothing about the other.
An arbitrary collection of events (Ai)_{i∈I} is said to be independent if

P( ⋂_{i∈J} Ai ) = ∏_{i∈J} P(Ai)   for every finite subset J ⊆ I.
Note that it’s not enough to check this just for the full set J = I.
For example, in our standard setting of a fair coin tossed repeatedly, if we define the event
Hi = {toss i is Heads} then (Hi)_{i=1}^∞ are independent.
A random variable is simply a function X from the probability space Ω to a set B
(often Z or R). We think of X as a random number (or object) whose value depends on the
‘outcome’. X(ω) is the value when the outcome is ω. We use the shorthand
P(X ∈ S) = P({ω ∈ Ω : X(ω) ∈ S}),
P(X = x) = P({ω ∈ Ω : X(ω) = x}),
etc. Jointly distributed random variables are simply different functions X1 , X2 , . . . defined
on the same probability space Ω. Note that random variables are allowed to take values in
any set B. For example, if X and Y are real-valued random variables (as usual) then we
can think of the pair (X, Y ) as a random vector, i.e. a random variable ω 7→ (X(ω), Y (ω))
that takes values in the plane R2 = {(x, y) : x, y ∈ R}. It is also sometimes useful to allow
random variables to take the value infinite values. E.g. a discrete random variable X might
take values in the set N ∪ {∞} (where N = {0, 1, 2 . . .}).
Again there is a technical complication involving measure theory that we won’t need to
worry about in this course: we are only allowed to take ‘well-behaved’ functions X, which
are those for which the sets {ω ∈ Ω : · · · } that arise above are well-behaved enough to be
events.
1.5 Expectation
The expectation or expected value or mean EX of a real-valued random variable X is defined to be

EX = Σ_x x p(x)

if X is discrete with probability mass function p, or

EX = ∫ x f(x) dx

if it is continuous with probability density function f, provided the sum or integral exists.
Simple properties include: if c is a deterministic constant then Ec = c and E(cX) = cEX; if
X ≥ 0 then EX ≥ 0.
As with other sums or integrals, expectations can be infinite, but we need to be careful
with infinite values in combination with signs - the general rule is that we never allow ∞−∞.
E.g. consider a continuous random variable X with probability density function f. If X is non-negative, i.e. X ≥ 0 almost surely, then f(x) = 0 for x < 0, and so (by the definition of the integral)

EX = ∫_0^∞ x f(x) dx := lim_{K→∞} ∫_0^K x f(x) dx

(which is an increasing limit in K because f ≥ 0). This would be equal to ∞ if the integral diverges. On the other hand if X takes positive and negative values then (again by definition)

EX = ∫_{−∞}^∞ x f(x) dx = ∫_{−∞}^0 x f(x) dx + ∫_0^∞ x f(x) dx.

If the integral ∫_0^∞ is ∞ and the integral ∫_{−∞}^0 is −∞ then we say that EX is undefined. Similar remarks apply to discrete random variables.
If X is a random variable taking values in N ∪ {∞} with probability mass function p(x) = P(X = x) then we can also define its expectation:

EX = Σ_x x p(x) = ∞ · p(∞) + Σ_{x∈N} x p(x).

Here p(∞) = P(X = ∞), and we use the convention ∞ · 0 = 0 for the first term. So if P(X = ∞) > 0 then EX = ∞, while if P(X = ∞) = 0 then EX = Σ_{x∈N} x p(x).
A random variable X is bounded if there is a finite constant K ∈ R for which |X| ≤ K
almost surely. It is a.s. finite if |X| < ∞ a.s. Bounded implies finite mean, which implies
a.s. finite, but none of the three conditions are equivalent.
A random variable X is deterministic if there is a fixed constant c for which X = c a.s.
That’s a very boring random variable, but occasionally it’s useful to have a name for it.
For random variables X and Y we have
E(X + Y ) = EX + EY,
provided EX and EY are finite, or alternatively if X and Y are non-negative (in which case
one or more expectation could be ∞). Once again, we can’t say anything about ∞ − ∞. It
follows that under similar conditions E(X1 + · · · + Xn ) = EX1 + · · · + EXn for any finite sum.
Note that there is no assumption on the joint distribution (like independence).
Does this work also for infinite (countable) sums? Unfortunately, not always. The problem
is that expectation is itself a sum (or integral), so it amounts to rearranging an infinite sum,
which is not always justified. The most general conditions where it is possible are beyond
the scope of this course (“dominated convergence”), but one case where it is allowed is for
non-negative random variables:
Theorem 1.8. Let X1 , X2 , . . . be random variables with Xi ≥ 0 for all i. Then
E( Σ_{i=1}^∞ Xi ) = Σ_{i=1}^∞ EXi.
Proof. Not included in this course - see measure theory.
We can use the expectation of a sum to give a neat derivation of the Union Bound from earlier. The indicator of an event A is the random variable

I_A = 1 if A occurs, and 0 otherwise.

Note that E I_A = P(A).
Proof of Theorem 1.7. Let N = Σ_{i=1}^∞ I_{Ai}, which is the number of the events A1, A2, . . . that occur. It is a (discrete) random variable taking values in N ∪ {∞} (where N = {0, 1, 2, . . .}). Note that

I_{N>0} ≤ N

(why? check it carefully!). Therefore,

P( ⋃_{i=1}^∞ Ai ) = P(N > 0) = E I_{N>0} ≤ E N = Σ_{i=1}^∞ E I_{Ai} = Σ_{i=1}^∞ P(Ai).
Random variables X and Y are independent if
P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)
for all sets A and B. (The comma on the left means “and” – in other words the intersection
of the two events). A family of random variables (Xi)_{i∈I} are independent if

P( ⋂_{i∈J} {Xi ∈ Ai} ) = ∏_{i∈J} P(Xi ∈ Ai)
for all finite subsets J ⊆ I and all sets (Ai )i∈J .
The definition applies to random variables taking values in arbitrary sets (e.g. random
vectors), so for example to say that (X, Y ) and Z are independent means
P((X, Y) ∈ A, Z ∈ B) = P((X, Y) ∈ A) P(Z ∈ B)
for all sets A and B.
If X and Y are independent random variables with finite expectations then
E(XY ) = (EX)(EY ).
Once again we must be more careful with infinite expectations. For instance if the right side
is ∞ × 0 then we cannot conclude anything about the left side.
The opposite extreme from independence is one random variable being a function of
another, e.g. X = f (Z) where f is a fixed function (perhaps X = Z 2 or X = log Z). In other
words, knowing Z completely determines X. Note that if X and Y are independent then
f (X) and g(Y ) are independent, for any functions f and g.
Random variables X and Y have the same distribution, or are identically distributed,
or are equal in distribution, if P(X ∈ A) = P(Y ∈ A) for every A. (Sometimes this is written as X =ᵈ Y, or variants thereof, but this is risky as it's too easy to leave out the "d",
so we’ll try to avoid it.) For real-valued random variables this is equivalent to having the
same distribution function, i.e. FX = FY , where
FX (x) = P(X ≤ x).
It is also equivalent to having the same probability mass function for discrete random variables, or the same probability density function, for continuous random variables. As usual,
the definition applies to random variables that take values in arbitrary sets. So for example
to say that (X, Y ) has the same distribution as (X ′ , Y ′ ) means that
P((X, Y) ∈ A) = P((X′, Y′) ∈ A)
for all sets A.
2 Random Walks
Random walks are the central examples of stochastic processes. As we'll see, they come in
several flavours. We’ll start with some general remarks about stochastic processes.
2.1 Stochastic processes
A stochastic process or random process is simply an infinite family of random variables such as X0, X1, X2, . . . or (Xt)_{t∈[0,∞)}, all on the same probability space Ω. Usually we think of the index n or t as time. If n ∈ Z the process is in discrete time; if t ∈ R, continuous time. We might denote the entire process by any of X = (Xn)_{n=0}^∞ = (Xn)_{n≥0} = (Xn) = (X0, X1, . . .). (Sometimes people use {Xn}_{n=0}^∞, but that can be confusing because it looks like a set, whereas we care about the indexing.)
The simplest stochastic process is a sequence of independent, identically distributed
(i.i.d.) random variables X1 , X2 , . . .. Identically distributed means they all have the same
distribution. Independent means they are all independent(!) We can specify the distribution
of an i.i.d. process just by specifying the distribution of X1 . E.g. X1 takes value 1 with
probability p and 0 otherwise; or X1 is uniform on [0, 1]; or X1 is a standard Normal.
We won’t have much more to say about i.i.d. processes themselves, but we will often use
them as an ingredient to build more interesting processes.
2.2 Random walk basics
Let 0 < p < 1 and let X1, X2, . . . be i.i.d. with

Xi = +1 with probability p,  and  Xi = −1 with probability q := 1 − p,

and let

Sn = Σ_{i=1}^n Xi
for n ≥ 1. We also define S0 = 0 – in fact this is a special case of a convention that we’ll
always assume - a sum of zero terms is zero. We consider S0 , S1 , S2 , . . . as a stochastic process.
It is called a random walk.
Possible interpretations include:
1. You play a sequence of gambling games (roulette, etc.). In each game you win £1 with
probability p or lose £1 with probability q = 1 − p. Different games are independent.
Then Sn is the change in your wealth after you have played n games.
2. A particle or an organism moves about randomly in a linear environment, taking steps
randomly left or right (perhaps in response to collisions with other particles, or biological impulses). Sn represents its position at time n.
We have already studied the distribution of Sn for a fixed n in earlier courses. (It is a
variant of the binomial distribution; and we have laws of large numbers and the central limit
theorem). The focus now is instead to consider properties of the whole random sequence
S = (S0 , S1 , . . .).
We can imagine a graph of a typical trajectory:
[Figure: a typical trajectory of Sn plotted against n.]
We can also draw the transition diagram – nodes represent the possible values for Sn ,
and arrows indicate the probabilities of moving between them:
[Transition diagram: states . . . , −2, −1, 0, 1, 2, . . . , with arrows i → i+1 labelled p and i → i−1 labelled q.]
Let’s do some warm-up computations.
Example 2.1. Find an expression for P(Sn = 0).
Solution. Since each Xi is odd, Sn is the sum of n odd numbers so it must have the same
parity as n, therefore if n is odd then P(Sn = 0) = 0. If n = 2m is even then
P(S_2m = 0) = P(X1, . . . , Xn comprise m +1s and m −1s) = (2m choose m) p^m q^m.  ♢
In fact, the distribution of Sn is a variant of a Binomial, so really there’s not much new
here. Things get more interesting when we consider the values at several times simultaneously.
Example 2.2. Find an expression for P(S2a = S2b = 0), where 0 < a < b.
Solution.

P(S_2a = S_2b = 0) = P(S_2a = 0, S_2b − S_2a = 0)
  = P( Σ_{i=1}^{2a} Xi = 0, Σ_{i=2a+1}^{2b} Xi = 0 )
  = P( Σ_{i=1}^{2a} Xi = 0 ) · P( Σ_{i=2a+1}^{2b} Xi = 0 )
  = (2a choose a) p^a q^a · (2(b−a) choose b−a) p^{b−a} q^{b−a}
  = (2a choose a) (2(b−a) choose b−a) (pq)^b.  ♢
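As a sanity check on this formula (a sketch only; the parameter values a = 2, b = 5, p = 1/2 and the function names are ours), the exact expression can be compared with a Monte Carlo estimate:

```python
import random
from math import comb

def exact(a, b, p):
    """P(S_2a = S_2b = 0) from Example 2.2."""
    q = 1 - p
    return comb(2 * a, a) * comb(2 * (b - a), b - a) * (p * q) ** b

def estimate(a, b, p, trials=200_000):
    """Monte Carlo estimate of the same probability."""
    hits = 0
    for _ in range(trials):
        s, zero_at_2a = 0, False
        for n in range(1, 2 * b + 1):
            s += 1 if random.random() < p else -1
            if n == 2 * a:
                zero_at_2a = (s == 0)
        hits += zero_at_2a and s == 0
    return hits / trials

random.seed(1)
print(exact(2, 5, 0.5), estimate(2, 5, 0.5))
```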
The random walk we defined above is called simple random walk (SRW) on Z; “on Z”
because Sn takes values in Z; “simple” because it only takes steps to neighbouring integers.
When p = 1/2 it is called a simple symmetric random walk (SSRW).
Many variants are possible. The steps Xi can have a different distribution. We can
consider random walks in d dimensions, where Xi and Si take values in the set of integer
vectors Zd .
The random walk we have considered is sometimes called unrestricted, because Sn is
allowed to take any integer value. On the other hand, we can restrict the range in various
ways, which is our next topic.
2.3 Random walk with absorbing barriers (Gambler's ruin)
A gambler plays a sequence of independent games. In each game, she either wins £1 with
probability p, or loses £1 with probability q = 1 − p. Let 0 < k < m be integers. The
gambler’s initial wealth is £k, and she plays until her wealth reaches either £0 or £m for
the first time; at that point the process is stopped.
(Reaching £0 is bankruptcy or "ruin" - at which point the gambler is not allowed to play
further. The reason for stopping at £m could be that the gambler needs this amount to
retire or buy a ticket home, or alternatively that the game is played against an opponent who
starts with £(m − k), and who must also stop if they go bankrupt).
We can interpret the process as a simple random walk started at k with absorbing
barriers at 0 and m. That is, let (Xi ) be i.i.d. taking value +1 with probability p and −1
otherwise, and let

Sn = k + Σ_{i=1}^n Xi,

but stop the process at the random time T given by

T = min{n : Sn ∈ {0, m}}.
Much of the course will concern random times like this, defined in terms of stochastic processes. It is possible that Sn never hits 0 or m; in that case the set {n : Sn ∈ {0, m}} in the
definition above is empty, and it makes sense to take T = min ∅ = ∞. Thus T is a random
variable taking values in N ∪ {∞}. (Actually we will show later that T < ∞ almost surely).
We can think of horizontal barriers at heights 0 and m that absorb the trajectory if it
runs into one of them:
[Figure: a trajectory of Sn started at k, absorbed when it first reaches the barrier at height 0 or at height m.]
The transition diagram has absorbing states at 0 and m:
[Transition diagram: states 0, 1, . . . , m; for 0 < i < m there are arrows i → i+1 labelled p and i → i−1 labelled q; states 0 and m are absorbing.]
Here is the central calculation.
Example 2.3. Find the probability that the gambler finishes with £m.
Solution. The two key ideas are: (1) condition on the first step, and (2) compute what
we want simultaneously for all starting states. Let W = {ST = m} be the event that the
gambler finishes with m. Define
hk = Pk (W ),
where the subscript k means that we consider the model in which the gambler’s initial wealth
is k.
Then for all 0 < k < m,
hk = Pk (W ) = Pk (W | X1 = 1)P(X1 = 1) + Pk (W | X1 = −1)P(X1 = −1)
= hk+1 p + hk−1 q,   (1)
because conditioned on X1 = 1, i.e. on winning the first game, the gambler’s wealth S1 is now
k + 1, and the probability of ending up at m is the same as the unconditioned probability if
we had started at k + 1; similarly for conditioning on X1 = −1. The above is a difference
equation or recurrence for the family of probabilities (hk)_{k=0}^m. We also need the boundary conditions:

h0 = 0;   hm = 1
(which follow immediately from the definition of W ).
Difference equations can be solved by methods analogous to those for differential equations. Here we just focus on the ones that arise. (To get a feel for the situation, you might
try solving the problem for a small case, say m = 3 – after substituting for h0 and hm this is
just a matter of solving two simultaneous equations.)
Putting (1) in standard form:
phk+1 − hk + qhk−1 = 0.
We look for a solution of the form hk = θ^k. Substituting:

pθ^{k+1} − θ^k + qθ^{k−1} = 0;
pθ² − θ + q = 0;
(pθ − q)(θ − 1) = 0,
so θ = 1 or θ = q/p. We need to divide into two cases according to whether these two solutions are equal.

Case p ≠ 1/2. The two solutions 1 and q/p are distinct, so the general solution to the difference equation is

hk = A·1^k + B·(q/p)^k = A + B(q/p)^k.

We find the constants A, B by substituting the boundary conditions:

0 = h0 = A + B;   1 = hm = A + B(q/p)^m.

So A = −B and 1 = A[1 − (q/p)^m], thus A = 1/[1 − (q/p)^m]. Therefore,

hk = (1 − (q/p)^k) / (1 − (q/p)^m)   for p ≠ 1/2.   (2)
Case p = 1/2. We have 1 = q/p so the two roots coincide, and we need another independent solution. The general rule is to multiply the solution θ^k by integer powers of k, i.e. θ^k k, θ^k k², etc. So in our case the general solution is

hk = A·1^k + B·1^k·k = A + Bk.

The boundary conditions give 0 = A and 1 = A + Bm = Bm, so B = 1/m. Therefore,

hk = k/m   for p = 1/2.   (3)   ♢
It is well worth getting a feel for the solutions (2),(3) via examples.
Example 2.4. Suppose k = 20 and m = 100. What is the probability of ending with m
for each of p = 1/2 (fair game), p = 18/38 (slightly unfavourable game; betting on red in
roulette), and p = 20/38 (slightly favourable game; roulette from the casino’s perspective)?
Repeat for k = 200 and m = 1000.
Solution. From (2),(3) the values of hk are:

p       | k = 20, m = 100   | k = 200, m = 1000
1/2     | 0.2               | 0.2
18/38   | 0.00019 · · ·     | 2.5 · · · × 10^{−37}
20/38   | 0.88 · · ·        | 0.9999999993 · · ·
Note how large an effect small changes in p can have, and how much more extreme the
effects are for large n.
♢
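The table is easy to reproduce by machine; the following short script (illustrative only, with the helper name h chosen to match the notation above) evaluates (2) and (3) directly:

```python
def h(k, m, p):
    """Probability of reaching m before 0, starting from k (equations (2) and (3))."""
    if p == 0.5:
        return k / m
    r = (1 - p) / p          # r = q/p
    return (1 - r ** k) / (1 - r ** m)

for p in (1/2, 18/38, 20/38):
    print(f"p = {p:.4f}:  h(20, 100) = {h(20, 100, p):.3g},"
          f"  h(200, 1000) = {h(200, 1000, p):.3g}")
```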
Example 2.5. Prove that with probability 1 the gambler’s wealth eventually reaches either 0
or m, so that the process stops; i.e. show that T < ∞ almost surely.
Solution. Since 0 < p < 1, the infinite sequence of i.i.d. steps (Xi ) almost surely contains
m consecutive +1s (e.g. by the “parrot question”). If the process has not already stopped
before this point then this will cause it to stop at m.
♢
Consequently we have

P(S_T = 0) = P(ruin) = 1 − hk = [ (q/p)^k − (q/p)^m ] / [ 1 − (q/p)^m ]   for p ≠ 1/2,
                              = (m − k)/m                                  for p = 1/2.
Note that in the fair case, p = 1/2, the entire process is also "fair", in the sense that the expectation of the gambler's final wealth is

E S_T = m·(k/m) + 0·(1 − k/m) = k,

which is what she started with. More on this later!
2.4 Duration of the game
Example 2.6. In the gambler’s ruin problem, find ET , the expected number of games played
until the process finishes, in the symmetric case p = 1/2.
Solution. Let tk = Ek T . By dividing the steps into disjoint blocks of m and comparing
with a geometric distribution we have tk < ∞ for each k. Now condition on the first step,
for 0 < k < m:
tk = E(T | X1 = 1)·(1/2) + E(T | X1 = −1)·(1/2) = (1 + tk+1)·(1/2) + (1 + tk−1)·(1/2),
because if X1 = 1 then we have used one step, and then it is as if the process starts again
from k + 1. The boundary conditions are
t0 = 0,
tm = 0.
Putting the difference equation in standard form:
tk+1 − 2tk + tk−1 = −2.   (4)
We first solve the homogeneous version:

tk+1 − 2tk + tk−1 = 0.

Trying tk = θ^k gives θ² − 2θ + 1 = 0, i.e. (θ − 1)² = 0, which has a repeated root θ = 1, so the general solution to the homogeneous equation is 1^k(A + Bk) = A + Bk.
We now need a particular solution to the original equation (4). Since the right side is a constant (or more generally if it is a polynomial) we try powers of k. Since 1 and k are already taken as solutions to the homogeneous equation we need to go to tk = ck². This gives

c[(k + 1)² − 2k² + (k − 1)²] = −2,

i.e. c · 2 = −2 so c = −1. Thus the general solution to (4) is the particular solution plus the general solution of the homogeneous equation:

tk = A + Bk − k².

The boundary conditions give 0 = A and 0 = A + Bm − m², so B = m²/m = m. Thus

tk = k(m − k).
♢
In our example k = 20, m = 100 we have tk = 20 × 80 = 1600, while for k = 200 and
m = 1000 we have tk = 160,000 (perhaps surprisingly large).
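A quick simulation sketch (ours, not from the notes; the smaller values k = 5, m = 20 keep the run time short) confirms the formula tk = k(m − k), which gives 75 in this case:

```python
import random

def average_duration(k, m, trials=20_000):
    """Average number of steps for a symmetric walk from k to hit 0 or m."""
    total = 0
    for _ in range(trials):
        s, steps = k, 0
        while 0 < s < m:
            s += random.choice((-1, 1))
            steps += 1
        total += steps
    return total / trials

random.seed(2)
print(average_duration(5, 20), "vs exact", 5 * (20 - 5))
```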
Similarly, note that if m is even then t_{m/2} = m²/4. The quadratic behaviour makes sense if we remember that Sn is approximately normal with mean 0 and standard deviation const·√n for large n. So it is natural to expect to wait until time approximately n = const·m² for it to deviate by m/2 from its starting point.
The difference equation can also be solved to get a (messier) expression for tk when p ≠ 1/2. But we will see a much neater way to do this at the end of this section.

2.5 One barrier, no barrier
Suppose that the gambler is playing against an infinitely rich opponent (think: a casino).
She again starts with £k > 0, but we stop the process only if her wealth reaches £0. In
other words, we stop at the random time T = min{n : Sn = 0}, where by definition we take
T = ∞ if Sn > 0 for all n. This is simple random walk with one absorbing barrier at 0.
Example 2.7. Find the ruin probability, i.e. the probability that the gambler’s wealth ever
reaches 0.
One possible approach would be to again condition on the first step to get a difference
equation for the infinite family of probabilities rk = Pk (ruin) for k = 0, 1, . . .. One problem
with this is that we only seem to have one boundary condition, r0 = 1. We’ll see some possible
solutions to this difficulty later, but for now we will instead use the previous two-barriers
case together with continuity of probability.
Solution. Fix k and let r = Pk(ruin) be the required probability. The idea is that the situation with one barrier at 0 can be interpreted as the limit as m → ∞ of two barriers at 0 and m.

To justify this, consider the unrestricted random walk S = (Sn)_{n=0}^∞ started at k, given by Sn = k + Σ_{i=1}^n Xi, where (Xi)_{i=1}^∞ are the usual i.i.d. steps. For a ∈ Z let

Ta = min{n ≥ 0 : Sn = a}
be the first time the walk hits a (where we take Ta = ∞ if it never hits a). Let
Am = {T0 < Tm }
be the event that S hits 0 before it hits m, where 0 < k < m. Note that Ak+1 , Ak+2 , . . . is an
increasing sequence of events, because the random walk cannot get to m + 1 without visiting
m. Let
A = ⋃_{m=k+1}^∞ Am
be the limit. This is the event that the walk hits 0 before m for some m, which is precisely
the event that it hits 0. So P(A) = P(T0 < ∞) = r, since it does not matter whether or
not we stop once we hit 0. On the other hand, P(Am ) is simply the ruin probability for the
random walk with barriers at 0 and m, which is 1 − hk , where hk = hk (m) is the probability
we computed before.
We conclude

r = lim_{m→∞} (1 − hk(m)) =
  1 − lim_{m→∞} k/m = 1,                                   p = 1/2;
  1 − lim_{m→∞} [1 − (q/p)^k] / [1 − (q/p)^m] = 1,          p < 1/2;
  1 − lim_{m→∞} [1 − (q/p)^k] / [1 − (q/p)^m] = (q/p)^k,    p > 1/2.

In summary,

r = 1 for p ≤ 1/2,   and   r = (q/p)^k for p > 1/2.

Thus, ruin is possible in a favourable game, but (almost surely) certain in an unfavourable or fair game! ♢
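The limiting argument can also be seen numerically (a sketch with illustrative values p = 0.6 and k = 3): the two-barrier ruin probability 1 − hk(m) settles down to the one-barrier value (q/p)^k as m grows.

```python
def ruin_two_barriers(k, m, p):
    """1 - h_k(m): probability of hitting 0 before m, starting from k."""
    if p == 0.5:
        return 1 - k / m
    r = (1 - p) / p
    return 1 - (1 - r ** k) / (1 - r ** m)

k, p = 3, 0.6                              # favourable game
for m in (5, 10, 20, 50, 100):
    print(m, ruin_two_barriers(k, m, p))
print("limit (q/p)^k =", ((1 - p) / p) ** k)
```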
Later we’ll compute the expected time to hit 0 in the one-barrier case.
Now we turn our attention back to the unrestricted simple random walk started at 0, as defined in Subsection 2.2. Let the steps (Xi) be i.i.d. as usual and let Sn = Σ_{i=1}^n Xi. Let
defined in Subsection 2.2. Let the steps (Xi ) be i.i.d. as usual and let Sn = ni=1 Xi . Let
R0 = min{n > 0 : Sn = 0}
be the first time the walk returns to 0 (i.e. visits 0 at a strictly positive time).
Example 2.8. For each p, find the probability that the unrestricted simple random walk ever
returns to 0, i.e. P(R0 < ∞).
Solution. Let E = {R0 < ∞} and let Pk denote the probability measure for the random
walk started at k. Conditioning on the first step,
P0 (E) = P0 (E | X1 = 1)p + P0 (E | X1 = −1)q
= P1 (T0 < ∞)p + P−1 (T0 < ∞)q.
Now P1 (T0 < ∞) is the ruin probability for the random walk with an absorbing barrier at 0
started at 1, which was computed earlier. On the other hand, if (Sn ) is a simple random walk
started at −1 then (−Sn ) is a simple random walk started at 1, but with the probabilities p
and q of an up-step and a down-step reversed. So P−1 (T0 < ∞) is the same as ruin probability
P1 (T0 < ∞) in the model with p replaced with q.
Thus, by Example 2.7,

P0(E) = 1·p + 1·q = 1,          p = 1/2;
P0(E) = 1·p + (p/q)·q = 2p,     p < 1/2;
P0(E) = (q/p)·p + 1·q = 2q,     p > 1/2.

We can express this as

P0(E) = 1 − |p − q|.
In particular, note that return to 0 is certain if and only if p = 1/2.
♢
[Graph: the return probability 1 − |p − q| plotted against p ∈ [0, 1]; it equals 1 at p = 1/2 and decreases linearly to 0 at p = 0 and p = 1.]
We can also consider random walks with a reflecting barrier, which roughly speaking
means that the walk is prevented from moving beyond some point, but the process is not
stopped and still continues. See Problem Sheet 2.
2.6 Stopping times, Wald's identity
Wald’s identity is a valuable tool for random walks and elsewhere. By way of motivation,
recall our observation in the original gambler’s ruin problem that “a fair game remains fair”:
in the case of winning probability p = 1/2, starting from wealth k, and stopping at the first
time T that it hits 0 or m, the gambler’s expected final wealth EST is also k. Is there a
general method that can predict phenomena like this?
Suppose that (Xi) are i.i.d. and N is a non-negative integer-valued random variable. We would like to compute the mean

E( Σ_{i=1}^N Xi )

of the sum of the first N random variables, where N is itself random. The obvious guess is that it should equal simply

E(N) E(X1).

This indeed holds if N is independent of X, but not in general when N depends on X (see Problem Sheet 1). Wald's identity provides another condition under which it does hold.
Here is the key definition. Let X = (Xi)_{i=1}^∞ be a stochastic process and let N be a random variable taking values in Z+ ∪ {∞} (where Z+ = {1, 2, . . .}). We say that N is a stopping time of (or "for") X if for all n ∈ Z+:

we can determine whether or not the event {N ≤ n} occurs from X1, . . . , Xn.

(Or equivalently, the indicator I_{N≤n} is a function of X1, . . . , Xn.)
For a real-world analogy, consider giving directions. “Take the second road on the left
after the pub” is a helpful instruction, because we can wait for the pub and then count roads.
“Take the left turn two roads before the pub” is not helpful, because we don’t know when to
turn until it’s too late. A stopping time is a helpful instruction.
Example 2.9. Let X = (X1 , X2 , . . .) be a stochastic process whose elements Xi take integer
values, and let A be a proper subset of Z. Which of the following are stopping times for X?
1. N = min{n : Xn ∈ A} (the first time we hit the set A);
2. M = min{n > 1 : Xn ∈ A} − 1 (the time step immediately before the first entry to A);
3. L = sup{n : Xn ∈ A} (the last time we hit A, or ∞ if we hit A infinitely many times
so that there is no last time).
Solution.
1. Yes. N ≤ n if and only if at least one of X1 , . . . , Xn lies in A.
2. No. M ≤ m if and only if at least one of X1, . . . , Xm+1 lies in A, which cannot be determined from X1, . . . , Xm alone.

3. No. If X1 ∈ A then L ≤ 1 if and only if X2, X3, . . . ∉ A, which cannot be determined from X1.
(Silly complication: strictly speaking the answers can depend on the choice of the stochastic
process X, so the “No”s should really be “No in general”. E.g. if each Xi only takes values
in {0, 1} and A = {2} then we can never hit A, so M = ∞ − 1 = ∞, and so we can
determine whether or not M ≤ m without looking at the values (Xi ) at all – we know it
never occurs!)
♢
Theorem 2.10 (Wald's identity). Let X1, X2, . . . be i.i.d. random variables with finite mean, and let N be a stopping time of X with finite mean. Then

E( Σ_{i=1}^N Xi ) = E(N) E(X1).
Proof. We give a proof under the extra assumption that X1 ≥ 0. (The proof of the general version uses dominated convergence, which is not covered in this course.) Note that

Σ_{i=1}^N Xi = Σ_{i=1}^∞ Xi I_{i≤N}.

Therefore,

E( Σ_{i=1}^N Xi ) = Σ_{i=1}^∞ E(Xi I_{i≤N}),   (5)

where the exchange of expectation and sum is justified because the random variables are non-negative (Theorem 1.8). (This is where the extra assumption is used.)

Now {i ≤ N} = {N ≤ i − 1}^C, so I_{i≤N} = 1 − I_{N≤i−1}. Therefore, since N is a stopping time, I_{i≤N} is a function of X1, . . . , Xi−1. Hence I_{i≤N} is independent of Xi. Hence, the right side of (5) equals

Σ_{i=1}^∞ E(Xi) E(I_{i≤N}) = E(X1) Σ_{i=1}^∞ P(N ≥ i) = E(X1) E(N),

where the last step is the expectation formula from Problem Sheet 1.
We can use Wald’s identity to give quick answers to several random walk questions.
Example 2.11. Find the expected time to hit 0 or m in the gambler's ruin problem with p ≠ 1/2, starting from k.

Solution. As usual let X = (Xi) be the i.i.d. steps, and Sn = k + Σ_{i=1}^n Xi the random walk. Let T be the first time Sn hits 0 or m. We argued earlier that ET < ∞. Let h = hk be the probability it hits m before 0, as computed earlier. Then

E S_T = h·m + (1 − h)·0 = hm.

On the other hand, T is a stopping time for X, so Wald's identity gives

E S_T = k + E( Σ_{i=1}^T Xi ) = k + E(X1) E(T) = k + (p − q) ET.

Equating the two expressions for E S_T we obtain hm = k + (p − q)ET, so

ET = (hm − k)/(p − q) = [1/(p − q)] · ( m·(1 − (q/p)^k)/(1 − (q/p)^m) − k ).  ♢
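As with the hitting probabilities, this expression is easy to sanity-check by simulation (a sketch; the values k = 4, m = 10, p = 0.55 are ours):

```python
import random

def expected_duration(k, m, p):
    """ET from Example 2.11, valid for p != 1/2."""
    q = 1 - p
    r = q / p
    h = (1 - r ** k) / (1 - r ** m)
    return (h * m - k) / (p - q)

def simulated_duration(k, m, p, trials=20_000):
    total = 0
    for _ in range(trials):
        s, steps = k, 0
        while 0 < s < m:
            s += 1 if random.random() < p else -1
            steps += 1
        total += steps
    return total / trials

random.seed(3)
print(expected_duration(4, 10, 0.55), simulated_duration(4, 10, 0.55))
```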
Example 2.12. Find the expected time for simple symmetric random walk started at 0 to
hit 1.
Solution. Let T be the time to hit 1, and note that it is a stopping time for X. Note that
T < ∞ a.s. by Example 2.7, and ST = 1 a.s. Note also that EX1 = 0 since we are in the
symmetric case. If ET < ∞ then Wald’s identity would imply
1 = EST = E(X1 )E(T ) = 0ET = 0,
which is a contradiction. So the only possibility is that ET = ∞.
♢
If you have an unlimited bankroll then you can guarantee to make money even on a fair
game: just keep playing until the first time you are up by 1. Example 2.7 guarantees that
this will happen eventually. But Example 2.12 says that it could take a very long time – the
expected time is infinite. And if your bankroll is not really unlimited then you could end up
losing everything, by Example 2.3.
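A small experiment (ours; necessarily a truncated simulation, since the true expectation is infinite) illustrates what infinite expected time looks like in practice: the fraction of walks that have still not reached 1 decays only very slowly as the allowed number of steps grows, so the sample mean of the hitting time never settles down.

```python
import random

def fraction_not_yet_hit(cap, trials=5_000):
    """Fraction of symmetric walks from 0 that have not hit 1 within `cap` steps."""
    misses = 0
    for _ in range(trials):
        s = 0
        for _ in range(cap):
            s += random.choice((-1, 1))
            if s == 1:
                break
        else:
            misses += 1
    return misses / trials

random.seed(4)
for cap in (10, 100, 1000, 10_000):
    print(cap, fraction_not_yet_hit(cap))
```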
3 Markov Chains

3.1 Introduction
Markov chains (sometimes called Markov processes) are a broad generalization of random
walks. Before giving the formal definition, we look at a simple example, given in terms of its
transition diagram.
[Transition diagram: three states a, b, c; arrows a → b with probability 1, b → b with probability 2/3, b → c with probability 1/3, c → a with probability 1/2, c → b with probability 1/2.]
At each (discrete) time 0, 1, 2, . . ., the process is in one of the three states a, b, c. At
each step, we choose one of the arrows coming out of the current state, according to the
probabilities assigned to them, and move to the new state that the chosen arrow points to.
The labels of all the arrows coming out of a state must sum to 1.
We can try to compute various things. As before, Pi means the probability starting from
state i:
1. Pa(hit c). We go from a to b and then stay there until we choose the 1/3 arrow. This almost surely happens eventually, so the answer is 1.

2. Ea(time to hit c). Getting to b takes 1 step, and then we wait on average 1/(1/3) = 3 steps (the mean of a geometric variable) to get to c. So 4.

3. Starting from a, the long-term proportion of time spent at c is 2/9. Can you find a way to see this?
We will learn how to answer such questions systematically.
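Before developing that machinery, here is a simulation sketch (ours; the dictionary encoding of the transition diagram is just one convenient choice) that checks items 2 and 3 above numerically:

```python
import random

# Transition probabilities of the three-state chain: a -> b; b -> b or c; c -> a or b.
P = {
    "a": [("b", 1.0)],
    "b": [("b", 2 / 3), ("c", 1 / 3)],
    "c": [("a", 1 / 2), ("b", 1 / 2)],
}

def step(state):
    """Take one step of the chain from `state`."""
    u, acc = random.random(), 0.0
    for nxt, prob in P[state]:
        acc += prob
        if u < acc:
            return nxt
    return P[state][-1][0]

random.seed(5)

# Item 2: expected time to hit c starting from a (should be about 4).
trials, total = 50_000, 0
for _ in range(trials):
    state, t = "a", 0
    while state != "c":
        state, t = step(state), t + 1
    total += t
print("E_a(time to hit c) ~", total / trials)

# Item 3: long-run proportion of time spent at c (should be about 2/9).
state, visits, steps = "a", 0, 200_000
for _ in range(steps):
    state = step(state)
    visits += (state == "c")
print("fraction of time at c ~", visits / steps)
```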
Here is the formal definition. Let S be a countable set, called the state space, whose
elements are called states. Let X = (X0 , X1 , X2 , . . .) be a stochastic process, where each Xi
takes values in S. We say that X is a Markov chain if
P(Xn+1 = j | X0 = i0, X1 = i1, . . . , Xn = in) = P(Xn+1 = j | Xn = in),   (6)
for all n ≥ 0 and i0 , . . . , in , j ∈ S for which the conditional probabilities exist. The idea of
the definition is that the distribution of the next state Xn+1 depends on the past history
X0 , . . . , Xn only through the current state Xn . Moreover, we say that the Markov chain is
(time-)homogeneous if in addition
P(Xn+1 = j | Xn = i) = pi,j   (7)
for all n ≥ 0, where (pi,j )i,j∈S are fixed numbers – in other words, the probability of moving
from i to j is the same regardless of the time step n. This probability pi,j is called the
transition probability. (Sometimes it can be written pij or p(i, j) for reasons of visual
clarity).
All the Markov chains that we consider will be time-homogenous, and we usually won’t
even bother to say so explicitly. To check that a stochastic process is a time-homogeneous
Markov chain it is enough to check the following single condition that in effect combines (6)
and (7).
Theorem 3.1. The stochastic process X = (Xi )i≥0 is a time homogeneous Markov chain
with transition probabilities pi,j if and only if it satisfies
P(Xn+1 = j | X0 = i0, . . . , Xn−1 = in−1, Xn = i) = pi,j,   (8)

for all n ≥ 0 and all i0, . . . , in−1, i, j for which the conditional probability exists.

Proof. Clearly (6) and (7) imply (8). For the other direction, assuming (8), we have

P(Xn+1 = j, Xn = i) = Σ_{i0,...,in−1} P(Xn+1 = j, X0 = i0, . . . , Xn−1 = in−1, Xn = i)
                    = Σ_{i0,...,in−1} pi,j P(X0 = i0, . . . , Xn−1 = in−1, Xn = i)
                    = pi,j P(Xn = i),

which implies (7). Now equate the left sides of (8) and (7) to get (6).
The transition matrix of a Markov chain is the |S| by |S| matrix P = (pi,j )i,j∈S of
transition probabilities, where pi,j is the entry in position (i, j), i.e. in row i and column j.
So row i tells us the probability mass function of where we go from state i. For example, the
transition matrix of the example at the beginning of the section is
          a     b     c
    a  (  0     1     0  )
P = b  (  0    2/3   1/3 )
    c  ( 1/2   1/2    0  )
The matrix formulation will turn out to be very convenient when we consider n-step transition
probabilities. Often the states are integers, in which case we don’t need to label the rows
and columns. It is immediate from (7) that the entries of a transition matrix satisfy

pi,j ≥ 0 for all i, j ∈ S,   and   Σ_{j∈S} pi,j = 1 for all i ∈ S.
Conversely, any square matrix P satisfying these conditions is a transition matrix of a Markov
chain. As we have already seen, we can also represent the transition probabilities by the
transition diagram, which is the directed graph with a node for each state i ∈ S, and a
directed edge from i to j labelled with pi,j . Normally we omit edges that would have label 0,
and we can optionally omit self-loops (such as the one from b to itself with label 2/3 above), because the fact that Σ_j pi,j = 1 enables us to recover the looping probability pi,i from the
other probabilities out of i.
Now we look at lots of examples.
Example 3.2. Check that the simple random walk on Z is a Markov chain with state space
Z and transition probabilities
pi,i+1 = p,   pi,i−1 = q = 1 − p,   for i ∈ Z,

and pi,j = 0 if |i − j| ≠ 1.
We saw the transition diagram earlier. The transition matrix is infinite:

    ( ⋱  ⋱  ⋱            )
    (    q  0  p         )
    (       q  0  p      )
    (          q  0  p   )
    (             ⋱  ⋱  ⋱ )

(Blank entries in a matrix represent 0s.)
Solution. For the random walk (Sn) with steps (Xn):

P(Sn+1 = j | S0 = i0, . . . , Sn−1 = in−1, Sn = i)
  = P(Xn+1 = j − i | S0 = i0, . . . , Sn−1 = in−1, Sn = i)
  = P(Xn+1 = j − i)
  = p if j = i + 1,  q if j = i − 1,  0 otherwise,

because Xn+1 is independent of (S0, . . . , Sn). Thus (8) holds. ♢
Example 3.3. The simple random walk with absorbing barriers at 0 and m (or gambler’s
ruin process) is a Markov chain with state space {0, 1, . . . , m} and transition matrix

    ( 1  0              )
    ( q  0  p           )
    (    ⋱  ⋱  ⋱        )
    (       q  0  p     )
    (          0  1     )
Here we have made a minor change to the definition of the process. Rather than “stopping” when we reach 0 or m (which doesn’t really make sense in our current formalism), we
make 0 and m into absorbing states by setting p0,0 = pm,m = 1. Thus, after the chain
reaches 0 or m it just stays there for all subsequent steps.
Example 3.4. A microorganism can exist in m different strains. At each generation, it
either stays the same, or with probability p it mutates to a randomly chosen different strain.
If Xn is the strain after n generations then (Xn ) is a Markov chain with m states and
transition matrix

    ( 1−p       p/(m−1)   · · ·   p/(m−1) )
    ( p/(m−1)   1−p       · · ·   p/(m−1) )
    (    ⋮                  ⋱       ⋮     )
    ( p/(m−1)   p/(m−1)   · · ·   1−p     )
If we are only interested in whether or not the strain is the same as the original one (strain
1, say), it might be simpler to let
Yn = 1 if Xn = 1,   and   Yn = 2 if Xn ≠ 1,
in which case (Yn ) is a two-state Markov chain with
P = ( 1−p          p           )
    ( p/(m−1)      1 − p/(m−1) )

[Transition diagram: states 1 and 2, with arrow 1 → 2 labelled p and arrow 2 → 1 labelled p/(m−1).]
For the next few examples, let X1 , X2 , . . . be i.i.d. taking value 1 with probability p and
0 with probability q := 1 − p.
Example 3.5. (Xn ) itself is a rather boring Markov chain (with time indexed starting from
1 rather than 0) with transition matrix
    ( q  p )
    ( q  p )

Indeed,

P(Xn+1 = j | X1 = i1, . . . , Xn−1 = in−1, Xn = i) = P(Xn+1 = j) = q if j = 0, and p if j = 1.
Example 3.6. Sn = Σ_{i=1}^n Xi is a Markov chain that can only move in one direction:

[Transition diagram: states 0, 1, 2, . . . , with an arrow i → i+1 labelled p for each i (self-loops with probability q omitted).]
Example 3.7. Define the ordered pair Yn := (Xn , Xn+1 ). Then (Yn ) is a Markov chain with
transition matrix
         00    01    10    11
  00  (   q     p     0     0 )
  01  (   0     0     q     p )
  10  (   q     p     0     0 )
  11  (   0     0     q     p )
(Exercise: check this, and draw the transition diagram also.)
Example 3.8. Let Zn = Xn +Xn+1 . Then (Zn ) is NOT a Markov chain. The basic problem
is that if Zn = 1 then we do not know whether (Xn , Xn+1 ) is (0, 1) or (1, 0); this affects the
distribution of Zn+1 , and the prior history (e.g. Zn−1 ) may give more information about it.
To turn this into a proof, we can say e.g.
P(Z3 = 2 | Z1 = 0, Z2 = 1) = P(X4 = 1 | (X1, X2, X3) = (0, 0, 1)) = 1/2
but P(Z3 = 2 | Z1 = 2, Z2 = 1) = P(X4 = 2 | (X1, X2, X3) = (1, 1, 0)) = 0,
which are not equal so (8) cannot be satisfied. (Intuitively, we need to know Z1 as well as Z2
to decide on Z3 , so it’s impossible to say what p1,2 should be.)
Example 3.9. If the last two days were rainy then the next day will be dry. In all other
cases, the next day is equally likely to be rainy or dry.
Let Xn be the weather on day n, taking values in {r, d} (rainy or dry). Then (Xn ) is a
stochastic process, but it is NOT a Markov chain. For example,
P(X2 = d | X0 = r, X1 = r) = 1   but   P(X2 = d | X0 = d, X1 = r) = 1/2,
which are not equal, so (8) cannot be satisfied.
However, the pair (Xn , Xn+1 ) DOES form a Markov chain, with transition matrix:
         rr    rd    dr    dd
  rr  (   0     1     0     0  )
  rd  (   0     0    1/2   1/2 )
  dr  (  1/2   1/2    0     0  )
  dd  (   0     0    1/2   1/2 )

3.2 Multi-step transition probabilities
To answer questions about the evolution of a Markov chain X with state space S we need
to know the transition probabilities (pi,j )i,j ∈ S and also the initial distribution, i.e. the
distribution of the initial state X0 . We can specify the initial distribution by giving its
probability mass function λ = (λi )i∈S :
λi = P(X0 = i),   i ∈ S.

An important special case is when the initial distribution is deterministic (i.e. not random)
– that is, P(X0 = i) = λi = 1 for some fixed state i ∈ S (and λj = 0 for all j ̸= i), in
which case we can just say the Markov chain is started at i. When the transition matrix is
understood from context, we sometimes write Pi for the probability measure associated with
the Markov chain started at i.
Theorem 3.10. For a Markov chain X with initial distribution λ and transition matrix
P = (pi,j )i,j∈S ,
P(X0 = i0, X1 = i1, . . . , Xn = in) = λi0 pi0,i1 pi1,i2 · · · pin−1,in.
Proof. The left side equals
P(X0 = i0 ) P(X1 = i1 | X0 = i0 ) P(X2 = i2 | X0 = i0 , X1 = i1 ) · · ·
× P(Xn = in | X0 = i0 , . . . , Xn−1 = in−1 ),
which by (8) equals the right side.
Example 3.11. In Example 3.4, suppose that at generation 0 the microorganism is strain 1.
What is the probability that it is not strain 1 for generations 1, 2 and 3, but mutates back to
strain 1 at generation 4?
Solution. Using the formulation as a two-state chain, the probability is
1 · p12 p22 p22 p21 = p · (1 − p/(m−1))² · p/(m−1).  ♢
Dividing the formula in Theorem 3.10 by λi0 we get an even simpler formula for the
conditional probability:
P(X1 = i1, . . . , Xn = in | X0 = i0) = pi0,i1 pi1,i2 · · · pin−1,in,   (9)
provided the left side exists. Not surprisingly, this is the same as the unconditional probability
when we start the Markov chain from i0 (by applying Theorem 3.10 to the initial distribution
with λi0 = 1):
Pi0(X1 = i1, . . . , Xn = in) = pi0,i1 pi1,i2 · · · pin−1,in.   (10)
In light of this it is sometimes useful to define the conditional probability on the left side
of (9) to equal the right side (i.e. (10)) even when P(X0 = i0 ) = 0 (so that the conditional
probability is not defined according to the usual formula). (There is no harm in this, because
if P(X0 = i0 ) = 0 then in computations, conditional probabilities P(· | X0 = i0 ) will typically
get multiplied by 0, so we can define them to be anything we want.)
We next observe that our assumption of time-homogeneity of transition probabilities
extends to the whole Markov chain, provided we condition on the starting state.
Theorem 3.12 (Markov property). For a time-homogeneous Markov chain X, the conditional distribution of (Xm+n )n≥0 given Xm = i is the same as the distribution of X given
X0 = i.
Proof. It is a standard fact (not covered in detail in this course) that the distribution of a
discrete-time stochastic process X is determined by the joint distributions of (X0 , X1 , . . . , Xn )
for each n. So we just need to consider
P(Xm+1 = i1 , Xm+2 = i2 , . . . , Xm+n = in | Xm = i0 ).
We can compute this in the same way as in the proof of Theorem 3.10, to conclude that it
equals the right side of (9).
We now define the n-step transition probability:

p^(n)_{i,j} = Pi(Xn = j) = P(Xn = j | X0 = i).

This is the probability of going from state i to state j in n time steps. By Theorem 3.12 we have, for any m ≥ 0,

P(Xn+m = j | Xm = i) = p^(n)_{i,j}.

In a few cases we can just compute the n-step transition probabilities directly.
Example 3.13. In Example 3.6, compute p^(n)_{i,j}.

Solution. If j < i then it is impossible to get from i to j, while if j − i > n then n steps are not enough. Otherwise, we need exactly j − i of the n steps X1, . . . , Xn to be +1 and the rest to be 0, so the probability is

p^(n)_{i,j} = (n choose j−i) p^{j−i} q^{n−j+i}   if 0 ≤ j − i ≤ n,   and 0 otherwise.  ♢

For the general case, the following is a key fact.
Theorem 3.14 (Chapman-Kolmogorov equations). For any time-homogeneous Markov
chain, the transition probabilities satisfy
p^(m+n)_{i,j} = Σ_{k∈S} p^(m)_{i,k} p^(n)_{k,j}.

Proof.

p^(m+n)_{i,j} = Pi(Xm+n = j)
             = Σ_{k∈S} Pi(Xm+n = j, Xm = k)
             = Σ_{k∈S} Pi(Xm = k) Pi(Xm+n = j | Xm = k)
             = Σ_{k∈S} p^(m)_{i,k} p^(n)_{k,j}.
Finally, we can deduce an elegant representation of the n-step transition probabilities.
Theorem 3.15. The n-step transition probability p^{(n)}_{i,j} equals P^n_{i,j} = (P^n)_{i,j}, the (i, j) entry in P^n, the nth power of the transition matrix P.
Proof. This follows from the Chapman-Kolmogorov equations by induction on n. For n = 1 it is the definition of the transition matrix P. For n > 1, assuming the result holds for n − 1,

    p^{(n)}_{i,j} = Σ_{k∈S} p^{(n−1)}_{i,k} p_{k,j} = Σ_{k∈S} P^{n−1}_{i,k} P_{k,j} = (P^{n−1} · P)_{i,j} = P^n_{i,j}.

In the light of this, we can drop the notation p^{(n)}_{i,j} and just write P^n_{i,j}! We can also optionally write P_{i,j} instead of p_{i,j}, because it is the (i, j) entry of the matrix P. (When writing by hand it can also be convenient to drop the comma and write Pij when there is no risk of confusion.)
Example 3.16. Compute all the 2-step transition probabilities for the 3-state Markov chain
at the beginning of this section.
Solution. With rows and columns always labelled a, b, c:

          ( 0    1    0  ) ( 0    1    0  )     ( 0     2/3    1/3 )
    P² =  ( 0   2/3  1/3 ) ( 0   2/3  1/3 )  =  ( 1/6  11/18   2/9 ),
          ( 1/2  1/2  0  ) ( 1/2  1/2  0  )     ( 0     5/6    1/6 )

so P²_{a,b} = 2/3, etc. (Exercise: check that each row of P² sums to 1, and that the numbers make sense in the context of the transition diagram.)
♢
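(Aside, not part of the course: such matrix calculations are easy to check numerically. A minimal sketch, assuming Python with numpy is available:)

    import numpy as np

    # Transition matrix of the 3-state chain, rows and columns ordered a, b, c.
    P = np.array([[0.0, 1.0, 0.0],
                  [0.0, 2/3, 1/3],
                  [0.5, 0.5, 0.0]])

    P2 = P @ P                            # 2-step transition probabilities
    print(P2)                             # row a is (0, 2/3, 1/3), etc.
    print(P2.sum(axis=1))                 # each row should sum to 1
    print(np.linalg.matrix_power(P, 16))  # higher powers such as P^16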
Denote the probability mass function of the state Xn at time n by

    λ^{(n)}_i = P(Xn = i),    i ∈ S;

then we can consider λ^{(n)} = (λ^{(n)}_i : i ∈ S) as a row vector of length |S| (with nonnegative entries that sum to 1).
Theorem 3.17. For a Markov chain with transition matrix P we have
    λ^{(m+n)} = λ^{(m)} P^n.

Proof.

    λ^{(m+n)}_j = P(Xm+n = j) = Σ_{i∈S} P(Xm+n = j | Xm = i) P(Xm = i)
                = Σ_{i∈S} λ^{(m)}_i P^n_{i,j} = (λ^{(m)} P^n)_j.
Example 3.18. In the 3-state example at the beginning of the section, suppose the initial state is uniformly distributed over {a, b, c}, so λ^{(0)} = (1/3, 1/3, 1/3). Then the distribution after 1 step is

                              ( 0    1    0  )
    λ^{(1)} = (1/3  1/3  1/3) ( 0   2/3  1/3 )  =  (1/6  13/18  1/9).
                              ( 1/2  1/2  0  )
3.3    Hitting probabilities and times
Let X be a Markov chain and let A ⊆ S be any set of states. We define the hitting time
TA of A to be the first time the Markov chain is in A:
TA = min{n ≥ 0 : Xn ∈ A},
where we take the minimum of the empty set to be ∞. So if Xn is never in A then TA =
min ∅ = ∞. Thus TA is an N ∪ {∞}-valued random variable (where N = {0, 1, 2, . . .}). For a
singleton set A = {a} we can also write Ta = T{a} .
We define the hitting probability hi (A) of A to be the probability X ever hits A,
starting from state i:
hi (A) = Pi (TA < ∞) = Pi (hit A) = Pi (Xn ∈ A for some n).
The equations we derived for random walks generalize directly to Markov chains:
Theorem 3.19. For a fixed set of states A, the hitting probabilities hi = hi(A) satisfy

    hi = Σ_j pi,j hj ,   i ∉ A,
    hi = 1,              i ∈ A.        (11)

Moreover, (hi)_{i∈S} is the minimal nonnegative solution to the equations (11). That is, if (zi)_{i∈S} is any solution to (11) satisfying zi ≥ 0 for all i then hi ≤ zi for all i.
We separate the proof into two parts.
Proof of (11). For i ∈ A we have hi = 1 by definition. For i ∉ A we condition on the first step:

    hi = Σ_{j∈S} Pi(X1 = j) Pi(TA < ∞ | X1 = j) = Σ_{j∈S} pi,j hj ,

where we used the Markov property (Theorem 3.12) in the second equality.
Proof of minimality. (Optional – not required for the course). Suppose (zi)_{i∈S} is a nonnegative solution to (11). For i ∈ A we certainly have hi = 1 ≤ 1 = zi. For i ∉ A, by repeatedly substituting for the z's from (11),

    zi = Σ_{j∈S} pi,j zj = Σ_{j∈A} pi,j + Σ_{j∉A} pi,j zj
       = Pi(X1 ∈ A) + Σ_{j∉A} pi,j ( Σ_{k∈A} pj,k + Σ_{k∉A} pj,k zk )
       = Pi(X1 ∈ A) + Pi(X1 ∉ A, X2 ∈ A) + Σ_{j,k∉A} pi,j pj,k zk
       ⋮
       = Pi(X1 ∈ A) + · · · + Pi(X1 , . . . , Xn−1 ∉ A, Xn ∈ A) + Σ_{k1,...,kn∉A} pi,k1 pk1,k2 · · · pkn−1,kn zkn .

The last term is nonnegative, and the sum of the preceding n terms is precisely Pi(TA ≤ n). Thus we have zi ≥ Pi(TA ≤ n). Now taking the limit as n → ∞ and using continuity of probability, zi ≥ Pi(TA < ∞) = hi.
We have a similar result for the expected hitting time of a set A:
ti (A) = Ei TA = Ei (time to hit A).
Theorem 3.20. For a fixed set of states A, the expected hitting times ti = ti(A) satisfy

    ti = 1 + Σ_j pi,j tj ,   i ∉ A,
    ti = 0,                  i ∈ A.        (12)

Moreover, (ti)_{i∈S} is the minimal nonnegative solution to the equations (12). That is, if (zi)_{i∈S} is any solution to (12) satisfying 0 ≤ zi ≤ ∞ for all i then ti ≤ zi for all i.
Proof. Similar to the proof of Theorem 3.19.
The minimality condition in Theorems 3.19 and 3.20 is not usually relevant for finite state
spaces, but can enable us to choose between multiple solutions in the case of an infinite state
space. In the latter case, one solution is often ti = ∞ for all i ∉ A. This will be the minimal
solution if there is no finite solution!
The last result gives us a systematic way to solve the HTH problem.
Example 3.21. A fair coin is tossed repeatedly. Find the expected number of tosses until the first occurrence of HTH.
Solution. Let Xn be the number of consecutive “correct” tosses we have just had, ending
at the nth toss, i.e., more precisely, the length of the longest final segment of tosses 1, . . . , n
that is an initial segment of HTH. So for example Xn = 2 means that tosses n − 1 and n are HT. Then Xn is a Markov chain with transition diagram:
[Transition diagram: states 0 = (), 1 = (H), 2 = (HT), 3 = (HTH); each arrow shown has probability 1/2 (self-loops omitted).]

(Recall that we omit self-loops from the diagram. We can assume p3,3 = 1 since it doesn't matter what happens after we reach our goal).
Let ti = Ei(time to hit 3). Then

    t0 = 1 + (1/2)t0 + (1/2)t1;
    t1 = 1 + (1/2)t1 + (1/2)t2;
    t2 = 1 + (1/2)t0 + (1/2)t3;
    t3 = 0.

Solving these equations:

    t0 = 2 + t1 = 2 + 2 + t2 = 5 + (1/2)t0,

so t0 = 10 (as before).
♢
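(Aside: the same numbers can be obtained by solving the linear system mechanically. A small sketch, assuming Python with numpy:)

    import numpy as np

    # Transition probabilities restricted to the states {0, 1, 2} that have not yet
    # reached the goal; state 3 is dropped because t_3 = 0.
    Q = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.0]])

    # The equations t_i = 1 + sum_j p_{i,j} t_j become (I - Q) t = 1.
    t = np.linalg.solve(np.eye(3) - Q, np.ones(3))
    print(t)   # approximately [10., 8., 6.], so t_0 = 10 as above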
We could handle the ABRACADABRA problem in the same way, although it would be
somewhat tedious. Later in the course we’ll see a better way.
3.4    Strong Markov property
Theorem 3.12 says that if we observe a Markov chain starting from a fixed time m then the
distribution is the same as the original chain. Does this hold for a random time also? The
answer is no in general but yes for a stopping time.
This time our stochastic process X = (Xi)_{i=0}^∞ is indexed from 0, so the definition gets a
trivial adjustment: a random variable T taking values in N ∪ {∞} is a stopping time for
X if for each n ≥ 0 we can determine whether or not T ≤ n from X0 , . . . , Xn .
Theorem 3.22 (Strong Markov property). Let (Xi )i≥0 be a time-homogeneous Markov
chain, and let T be a stopping time for X. Conditional on T < ∞ and XT = i and
any information about X0 , . . . , XT , the distribution of (XT +n )n≥0 is the same as the
conditional distribution of X given X0 = i.
Proof. (Optional; not required for the course.) Let A be any event determined by
X0 , . . . , XT . Since T is a stopping time, {T = m} ∩ A is determined by X0 , . . . , Xm for
each m ∈ N. Therefore,
P(XT +1 = j1 , . . . , XT +n = jn , XT = i, T = m, A)
= P(Xm+1 = j1 , . . . , Xm+n = jn , Xm = i, T = m, A)
= P(Xm+1 = j1 , . . . , Xm+n = jn | Xm = i, T = m, A) P(Xm = i, T = m, A)
= P(Xm+1 = j1 , . . . , Xm+n = jn | Xm = i) P(Xm = i, T = m, A)
= Pi (X1 = j1 , . . . , Xn = jn ) P(Xm = i, T = m, A).
Now sum over m to get
P(XT +1 = j1 , . . . , XT +n = jn , XT = i, T < ∞, A)
= Pi (X1 = j1 , . . . , Xn = jn ) P(XT = i, T < ∞, A),
and divide both sides by the last factor to give
P(XT +1 = j1 , . . . , XT +n = jn | XT = i, T < ∞, A) = Pi (X1 = j1 , . . . , Xn = jn ).
Example 3.23. (Yet another approach to random walk). Let Sn be the simple random walk
on Z with pi,i+1 = p and pi,i−1 = q = 1 − p for all i. Find the probability that the walk started
at 1 ever hits 0.
We did this before by taking a limit of the gambler’s ruin model. Here is another way.
Solution. Let hi = Pi (hit 0). Then
h0 = 1;
    hi = p hi+1 + q hi−1 ,    i > 0.        (13)
Rather than trying to solve this infinite system of equations directly, we can get a shortcut
using the strong Markov property. Suppose the walk starts at 2. In order to hit 0 it must
first hit 1. But the first hitting time of 1 is a stopping time, so after hitting 1 the walk
behaves like a walk started from 1. Moreover, the probability to hit 1 from 2 is the same as
the probability to hit 0 from 1, by space-homogeneity of the random walk. So
    h2 = P2(hit 0) = P2(hit 1) P1(hit 0) = h1².

Combining this with the equations for h0 and h1 in (13) we have h1 = p h1² + q. Solving this quadratic gives h1 = 1 or h1 = q/p.
Since h1 ≤ 1 this implies that h1 = 1 for all p ≤ 1/2. For p > 1/2 we still need some way
to decide between the two solutions. One way is as follows. Extending the above stopping
time argument gives that hi = h1^i for all i > 0. Either hi = 1 or hi = (q/p)^i gives a solution to (13), but Theorem 3.19 tells us to take the minimal non-negative solution, which is h1 = q/p.
♢
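(Aside: the answer h1 = q/p can also be checked by simulation. A rough sketch, assuming Python; since only finitely many steps are simulated, the estimate slightly undercounts the true hitting probability:)

    import random

    def hits_zero(p, start=1, max_steps=10_000):
        # Run the walk from `start`; report whether it reaches 0 within max_steps.
        x = start
        for _ in range(max_steps):
            x += 1 if random.random() < p else -1
            if x == 0:
                return True
        return False

    p, trials = 0.6, 20_000
    estimate = sum(hits_zero(p) for _ in range(trials)) / trials
    print(estimate, "vs q/p =", (1 - p) / p)   # both close to 0.667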
3.5    Recurrence and transience
Let i ∈ S be a state of a Markov chain X. Let
    Vi = |{n ≥ 0 : Xn = i}| = Σ_{n=0}^∞ I[Xn = i]
be the total number of visits to i ever by X. We say that state i is
recurrent if Pi (Vi = ∞) = 1;
transient if Pi (Vi < ∞) = 1.
Starting from a recurrent state, the chain keeps returning to it forever (almost surely). From
a transient state, the chain eventually wanders off never to return (almost surely).
Example 3.24. Consider the chain with transition diagram:

[Transition diagram omitted: a four-state chain on {1, 2, 3, 4} with arrows labelled 1 and 1/2.]
It is not difficult to check that 1 and 2 are transient while 3 and 4 are recurrent.
Example 3.25. In this chain, every state is transient:

[Transition diagram: states 0, 1, 2, . . . with each arrow i → i + 1 labelled p (self-loops omitted).]
Obvious questions (to be answered) include: is it possible for a state to be neither recurrent
nor transient, i.e. to have infinitely many visits with probability strictly between zero and one?
How can we determine whether a state is recurrent or transient for a simple random walk
(for various p, with or without a barrier)?
Theorem 3.26. Every state is either recurrent or transient. Moreover, we have the following dichotomy. For each state i, either:
(i) i is recurrent, Pi(Vi > 1) = 1, Ei Vi = ∞, and Σ_{n=0}^∞ P^n_{i,i} = ∞;
or (ii) i is transient, Pi(Vi > 1) < 1, Ei Vi < ∞, and Σ_{n=0}^∞ P^n_{i,i} < ∞.
Proof. Let
    Ri = min{n > 0 : Xn = i}
be the first return time to i, and note that Ri is a stopping time. (The only difference
compared with the hitting time Ti = T{i} is that we don’t include n = 0). Consider the two
possibilities: either (i) Pi (Ri < ∞) = 1 or (ii) Pi (Ri = ∞) > 0.
In case (i), clearly Pi (Vi > 1) = 1 (the first visit is at time 0, the second is at time
Ri ). Now apply the strong Markov property at time Ri ; the chain is back at i, so its future
evolution is distributed identically to the original chain. In particular it will again return to
i almost surely. Then we can apply the strong Markov property again at the next return
time, and so on. We conclude that Vi = ∞ almost surely, i.e. i is recurrent. In particular
this implies Ei Vi = ∞.
In case (ii), let q = Pi (Ri < ∞) < 1. Clearly Pi (Vi > 1) = q < 1. If Ri < ∞ then we again
apply the strong Markov property at Ri . Conditional on Ri < ∞, the chain returns to i again
with probability q, and we can apply the strong Markov property again, etc. This continues
until the first time the chain does not return to i. The conclusion is that the number of visits
Vi to i has Geometric distribution with parameter 1 − q. In particular Vi < ∞ almost surely,
so i is transient, and indeed Ei Vi = 1/(1 − q) < ∞.
For the final criterion, in both cases

    Ei Vi = Ei Σ_{n=0}^∞ I[Xn = i] = Σ_{n=0}^∞ Pi(Xn = i) = Σ_{n=0}^∞ P^n_{i,i}.
Example 3.27. Consider simple random walk with parameter p on Z. Clearly every state is
recurrent or every state is transient. From the earlier computations, if p = 1/2 then starting
from 0 the walk returns to 0 a.s. So 0 is recurrent. If p ̸= 1/2 then P0 (return to 0) < 1, so
0 is transient.
3.6    Random walks in d dimensions
We now discuss one of the most important and striking stochastic processes. Recall that the
unrestricted random walk on Z could serve as a model for a particle or organism wandering in a linear
environment. What happens if it is allowed to wander on a plane, or in space?
Let d ≥ 1 be an integer, the number of dimensions. Our state space is
    Z^d = {(x1 , . . . , xd) : xi ∈ Z for all i},

the set of all d-vectors of integers. Let ei = (0, . . . , 0, 1, 0, . . . , 0) be the ith unit vector (with a 1 in coordinate i).
The steps X1 , X2 , . . . of the random walk are i.i.d. random vectors, with Xn taking values

    −e1, +e1, −e2, +e2, . . . , −ed, +ed,

each with probability 1/(2d). Let Sn = Σ_{j=1}^n Xj. Thus the walk starts at the origin 0 = (0, . . . , 0), and moves to a uniformly random nearest neighbour of the current state at each step; (Sn)_{n=0}^∞ is a Markov chain with state space Z^d and transition probabilities p_{x,x±ei} = 1/(2d) and p_{x,y} = 0 otherwise. It is called the simple symmetric random walk in Z^d (or in d dimensions).
For d = 1 this is just the usual simple symmetric walk on Z that we have already seen.
Here are partial pictures for d = 2 and d = 3. The full transition diagram has arrows labelled
1/(2d) pointing in both directions between each pair of nearest neighbours in Z^d.

[Figures omitted: partial transition diagrams for d = 2 (arrows labelled 1/4) and d = 3 (arrows labelled 1/6).]
Theorem 3.28. For the simple symmetric random walk in Zd , every state is recurrent
if d ≤ 2, and every state is transient if d ≥ 3.
Proved by Pólya in 1921, this is one of the most celebrated results of probability theory.
As Kakutani described it, “A drunk man will find his way home, but a drunk bird may get
lost forever.”
An easier variant – diagonal steps
We will prove a slightly easier variant of Theorem 3.28 that nevertheless retains the main
point. To define this variant, let the steps X1 , X2 , . . . be i.i.d. and instead uniformly distributed on the set of 2^d vectors

    {−1, +1}^d = {(x1 , . . . , xd) : xj ∈ {−1, +1} for all j}

(i.e. taking each value in {−1, +1}^d with probability 2^{−d}). Then consider the Markov chain (a non-simple random walk) (Sn) given by Sn = Σ_{i=1}^n Xi. Thus, the steps of this walk are along the main diagonals of the integer grid Z^d. E.g. for d = 2:
[Figure omitted: the diagonal-step walk for d = 2, with each of the four diagonal arrows labelled 1/4.]
The great advantage now is that the d coordinates of a single step X1 :
(X1 )1 , . . . , (X1 )d
are simply i.i.d. random variables taking values −1 and +1 each with probability 1/2.
Theorem 3.29. For the random walk defined above, every state is recurrent if d ≤ 2, and
every state is transient if d ≥ 3.
In the proof we will use Stirling's formula for the factorial:

    n! ∼ √(2πn) e^{−n} n^n    as n → ∞.

Here an ∼ bn ("an is asymptotic to bn") means that an/bn → 1 as n → ∞. It is easy to check that if an ∼ bn and a′n ∼ b′n then an a′n ∼ bn b′n; while if an ∼ bn and an and bn are nonnegative and finite then Σ_n an < ∞ if and only if Σ_n bn < ∞.
Proof of Theorem 3.29. All states are clearly equivalent, so it suffices to establish recurrence or transience of 0. We will use the criterion involving Σ_n P^n_{0,0} from Theorem 3.26.
We first consider the case d = 1. This is just simple symmetric walk on Z. We already argued that this is recurrent by other methods, but now we need more information, to be used to analyse higher d. Since the steps Xi take values ±1, note that Sn has the same parity as n, so P^n_{0,0} = 0 for all odd n. For an even time 2n, the walk is back at 0 if it has taken equal numbers of +1 and −1 steps, so, using Stirling's formula,

    P^{2n}_{0,0} = \binom{2n}{n} (1/2)^{2n} = (2n)! / (n!)² · 2^{−2n}
                 ∼ [√(2π·2n) e^{−2n} (2n)^{2n} 2^{−2n}] / [√(2πn) e^{−n} n^n]² = 1/√(πn)    as n → ∞.

Now Σ_{n=1}^∞ 1/√n = ∞, so

    Σ_{n=0}^∞ P^n_{0,0} = Σ_{n=0}^∞ P^{2n}_{0,0} = ∞,

so 0 is recurrent.
Now we turn to the case d ≥ 2. As observed above, the coordinates of each step are i.i.d., with the same distribution as simple symmetric random walk steps in Z. Therefore, the coordinates of Sn,

    (Sn)1, (Sn)2, . . . , (Sn)d,

are independent, and each has the same distribution as Sn had in the d = 1 case above. Therefore,

    P^{2n}_{0,0} = P0(S2n = 0) = P0((S2n)1 = · · · = (S2n)d = 0) = [\binom{2n}{n} (1/2)^{2n}]^d ∼ 1/(πn)^{d/2}.

(And P^n_{0,0} = 0 for n odd.) Since

    Σ_{n=1}^∞ 1/n^α = ∞ if 0 < α ≤ 1,  and < ∞ if α > 1

(e.g. by the integral test), we conclude that 0 is recurrent if d/2 ≤ 1 and transient if d/2 > 1.
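(Aside: the dichotomy in the proof can be seen numerically. A sketch, assuming Python: the partial sums of P^{2n}_{0,0} = (\binom{2n}{n} 2^{−2n})^d keep growing for d = 1, 2 but level off for d = 3.)

    for d in (1, 2, 3):
        total, r = 0.0, 1.0                  # r = C(2n, n) / 4^n, equal to 1 at n = 0
        for n in range(1, 2001):
            r *= (2 * n - 1) / (2 * n)       # update r from n - 1 to n
            total += r ** d                  # add P^(2n)_{0,0} for the diagonal-step walk
        print(f"d = {d}: partial sum up to n = 2000 is {total:.3f}")
    # Increasing the cutoff makes the d = 1 and d = 2 totals grow without bound
    # (recurrence), while the d = 3 total approaches a finite limit (transience).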
We make a few remarks about the proof for the original simple random walk case.
Proof of Theorem 3.28 (sketch) – not required for the course. The d = 1 case is the
same as the diagonal steps case, so it is recurrent.
The d = 2 case is also the same: from the transition diagram one can see that the
random walk with diagonal steps performs a simple random walk on the even sublattice
{(x, y) ∈ Z² : x + y is even}, which is just a copy of Z² rotated by 45 degrees and enlarged by a factor of √2. More formally, for the simple random walk Sn = (Un , Vn ), the two
processes (Un + Vn )/2 and (Un − Vn )/2 are independent simple symmetric random walks
on Z. Therefore 0 is recurrent.
The case d = 3 is a little harder. We have

    P^{2n}_{0,0} = Σ_{i,j,k≥0: i+j+k=n} (2n)! / (i! j! k!)² · (1/6)^{2n}.

Using some combinatorial tricks and Stirling's formula, it can be shown this is asymptotic to c/n^{3/2} for some c. (See e.g. Norris' book). So Σ_n P^n_{0,0} < ∞, so 0 is transient.
For d ≥ 4 one can compare with d = 3. (See the Problem sheet).
3.7    Communicating classes
We now consider how the states of a Markov chain are related to each other. Consider a
Markov chain with state space S and transition matrix P = (pi,j ). Let i, j ∈ S be states.
We say i leads to j and write i → j if P^n_{i,j} > 0 for some n ≥ 0. I.e., if it is possible to get from i to j in some number of steps. We say that i communicates with j and
to get from i to j in some number of steps. We say that i communicates with j and
write i ↔ j if both i → j and j → i.
Note that i → j is NOT the same thing as the existence of an arrow (with positive label
pi,j ) from i to j in the transition diagram. The latter means it is possible to get from i to j
in ONE step.
Recall that a relation ∼ on a set S is called an equivalence relation if it is reflexive
(i.e. i ∼ i for all i ∈ S), symmetric (i.e. i ∼ j implies j ∼ i), and transitive (i.e. i ∼ j and
j ∼ k imply i ∼ k).
Theorem 3.30. The relation ↔ is an equivalence relation on S.
Proof. Reflexivity follows since P^0_{i,i} = 1 for all i. Symmetry is immediate from the definition of ↔. For transitivity, if i → j and j → k then P^m_{i,j} > 0 and P^n_{j,k} > 0 for some m and n. By the Chapman-Kolmogorov equations, P^{m+n}_{i,k} ≥ P^m_{i,j} P^n_{j,k} > 0, so i → k. By similarly considering the reverse direction, ↔ is transitive.
It follows from the general theory of equivalence relations that the state space is partitioned into equivalence classes, called communicating classes here. I.e. S = C1 ∪ C2 ∪ · · · ,
where the classes Cr are pairwise disjoint, and
i ↔ j if and only if i and j lie in the same class.
Example 3.31. Find the communicating classes, if

         ( 1/2  1/2   0    0    0    0 )
         (  0    0    1    0    0    0 )
    P =  ( 1/3   0    0   1/3  1/3   0 )
         (  0    0    0   1/2  1/2   0 )
         (  0    0    0    0    0    1 )
         (  0    0    0    0    1    0 ).

Solution. This becomes easy when we look at the transition diagram. The actual transition probabilities don't matter here – only whether or not they are positive.

[Transition diagram omitted: the six states 1, . . . , 6 with an arrow i → j wherever pi,j > 0.]
The classes are {1, 2, 3}, {4} and {5, 6}.
♢
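(Aside: for a finite chain the communicating classes can be computed mechanically from reachability. An illustrative sketch, assuming Python, applied to this example:)

    P = [[1/2, 1/2, 0,   0,   0,   0],
         [0,   0,   1,   0,   0,   0],
         [1/3, 0,   0,   1/3, 1/3, 0],
         [0,   0,   0,   1/2, 1/2, 0],
         [0,   0,   0,   0,   0,   1],
         [0,   0,   0,   0,   1,   0]]
    n = len(P)

    # reach[i][j] is True iff i leads to j in zero or more steps (Warshall's algorithm).
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])

    classes = []
    for i in range(n):
        cls = {j for j in range(n) if reach[i][j] and reach[j][i]}
        if cls not in classes:
            classes.append(cls)
    print([{s + 1 for s in c} for c in classes])   # [{1, 2, 3}, {4}, {5, 6}] in 1-based labels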
Theorem 3.32. Recurrence and transience are class properties; that is, in each class,
either all states are recurrent or all states are transient.
Proof. Suppose that i, j are in the same class and j is recurrent. It is enough to show that i is recurrent. Since i ↔ j there exist m, n such that P^m_{i,j}, P^n_{j,i} > 0, so a := P^m_{i,j} P^n_{j,i} > 0. Then

    P^{m+k+n}_{i,i} ≥ P^m_{i,j} P^k_{j,j} P^n_{j,i} = a P^k_{j,j},

so Σ_k P^k_{j,j} = ∞ implies Σ_ℓ P^ℓ_{i,i} = ∞.
A few more definitions. We call a Markov chain irreducible if it has only one communicating class, C = S. We call a class C closed if i ∈ C and i → j imply j ∈ C. I.e., if C is
impossible to escape from. A state i is absorbing if {i} is a closed class (i.e. if pi,i = 1).
We can call a class recurrent or transient according to whether its states are recurrent or
transient. In the case of an irreducible chain we can call the chain itself recurrent or transient.
Example 3.33. In Example 3.31, only {5, 6} is a closed class.
Example 3.34. In the standard gambler’s ruin chain with absorbing barriers at 0 and m,
the classes are {0}, {1, . . . , m − 1}, {m}, with {0} and {m} being closed; states 0 and m are
absorbing.
Example 3.35. The simple random walk on Zd is irreducible.
Theorem 3.36. (a) Every recurrent class is closed. (b) Every finite closed class is recurrent.
Proof. Recall that Vi denotes the total number of visits to i.
(a) Suppose class C is not closed. Then there exist i ∈ C and j ∈
/ C with i → j, i.e. there
exists m with Pi (Xm = j) > 0. But j ̸→ i, so once we visit j we can never get back to i,
therefore
Pi (Xm = j) ≤ Pi (Vi ≤ m) ≤ Pi (Vi < ∞),
hence the last probability is positive, and so i is not recurrent, i.e. C is not recurrent.
(b) Suppose C is finite and closed, and fix i ∈ C. Since the chain must go somewhere,
for some j ∈ C,
0 < Pi (Vj = ∞)
= Pi (hit j)Pj (Vj = ∞),
by the strong Markov property. Therefore Pj (Vj = ∞) > 0, so j is not transient, so it is
recurrent, so C is recurrent.
So in a finite state space, recurrence and transience are easy: a class is recurrent if and
only if it is closed.
Example 3.37. In Example 3.31, 1, 2, 3, 4 are transient and 5, 6 are recurrent.
Example 3.38. Consider the simple random walk with one absorbing barrier on N = {0, 1, 2, . . .}:
p0,0 = 1 and pi,i+1 = p and pi,i−1 = q = 1 − p for all i ≥ 1. Then {0} is a finite closed class,
and therefore recurrent, and the other class {1, 2, . . .} is not closed, therefore it is transient.
Note that this holds regardless of p, even though the behaviour varies drastically: for
p ≤ 1/2 the walk a.s. gets absorbed into 0, while for p > 1/2 it may wander off to infinity –
but in both cases the positive states are transient.
However, these methods cannot resolve every case. Recall:
Example 3.39. The simple symmetric random walk on Zd is recurrent if d ≤ 2 but transient
if d ≥ 3.
Recall that this required a detailed calculation. The simple arguments of this subsection
cannot determine recurrence and transience for an infinite closed class.
We will mostly focus on irreducible Markov chains from now on. If a chain is not irreducible then one may split it into classes and try to consider each class separately.
3.8    Mean return times
We now further distinguish two types of recurrence – this will play an important role in limit
behaviour.
For a Markov chain X, define the return time
Ri = min{n ≥ 1 : Xn = i};
this is the first time the chain visits i except that time 0 is excluded (because of the ≥ 1 in
the definition). Let
µi = Ei (Ri )
be the mean return time to i starting at i. Recall that if i is transient then Pi (Ri = ∞) > 0
so µi = ∞.
If i is recurrent then we call i positive recurrent if µi < ∞, or null recurrent if µi = ∞.
The names will make more sense later. We can think of a null recurrent state as “only
just recurrent” – the chain does return, but it is very slow to do so.
Theorem 3.40. Positive recurrence and null recurrence are class properties.
Proof. See later.
We'll see a systematic way to determine null and positive recurrence later. For now, we
can deal with some examples:
Example 3.41. Show that simple symmetric random walk on Z is null recurrent.
Solution. The chain is irreducible, so all states have the same classification. We saw earlier
that it is recurrent. We also saw that E0 T1 = ∞, so similarly E1 T0 = E−1 T0 = ∞. Therefore,
conditioning on the first step, E0 R0 = 1 + (1/2) E1 T0 + (1/2) E−1 T0 = ∞, so the chain is null
recurrent.
♢
Example 3.42. Show that simple random walk on N with parameter p and a reflecting barrier
at 0 (so that p0,1 = p and p0,0 = 1 − p) is positive recurrent if p < 1/2.
Solution. The chain is irreducible. We saw in Problem sheet 4 that E1 T0 = 1/(q − p) < ∞,
so E0 R0 = 1 + q·0 + p·1/(q − p) = q/(q − p) < ∞, so it is positive recurrent.
♢
3.9    Convergence examples
Our goal is to understand the long-term behaviour of a Markov chain. How does the n-step transition probability P^n_{i,j} behave when n is large?
Example 3.43. Let's look at the chain:

[Transition diagram: states 1, 2, 3 with arrows 1 → 2 (probability 1), 2 → 3 (probability 1/2) and 3 → 1 (probability 1/2); self-loops at 2 and 3 omitted.]
We can compute some large powers by squaring: P² = P P, P⁴ = P² P², etc. This gives, to 3 decimal places,

          ( 0    1    0  )        ( 0    .5   .5  )        ( .25   .375  .375 )
    P  =  ( 0   .5   .5  ),  P² = ( .25  .25  .5  ),  P⁴ = ( .188  .438  .375 ),
          ( .5   0   .5  )        ( .25  .5   .25 )        ( .188  .375  .438 )

          ( .203  .398  .398 )         ( .200  .400  .400 )
    P⁸ =  ( .199  .402  .398 ),  P¹⁶ = ( .200  .400  .400 ).
          ( .199  .398  .402 )         ( .200  .400  .400 )

It appears that P^n_{i,j} → πj as n → ∞, where π = (π1 , π2 , π3 ) = (1/5, 2/5, 2/5). In other words,
the distribution of the state Xn at time n converges to the distribution π on S, regardless
of the initial state i. This indeed holds, as we shall see. Moreover there is a simple way to
determine the limiting distribution π.
However, this kind of convergence does not happen in every case, as the following three
examples show.
Example 3.44.

[Transition diagram: the deterministic 3-cycle 1 → 2 → 3 → 1, each arrow with probability 1.]

Here it is clear that P^n_{1,1} = 1 if n is a multiple of 3, while P^n_{1,1} = 0 for all other n. So P^n_{1,1} does not converge as n → ∞. The obstacle here is called periodicity.
Example 3.45.

[Transition diagram: from state 1, arrows to states 2 and 3 each with probability 1/2; states 2 and 3 are absorbing.]

We can easily see that

    P^n = P = ( 0  1/2  1/2 )
              ( 0   1    0  )
              ( 0   0    1  )

for all n. So P^n_{i,j} does converge, but the limit depends on i as well as j. The limiting distribution depends on the starting state. The issue is that the chain is not irreducible.
Example 3.46. Simple symmetric random walk on Z. We know that P^{2n}_{i,i} ∼ 1/√(πn), so P^{2n}_{i,i} → 0. Also note that P^{2n}_{i,i} ≥ P^n_{i,j} P^n_{j,i} = (P^n_{i,j})², so P^n_{i,j} ≤ √(P^{2n}_{i,i}) → 0 as n → ∞ for all i, j. In other words, P^n_{i,j} → πj where π = (. . . , 0, 0, 0, . . .). This is not a probability distribution on Z, because it sums to 0 rather than 1. Where does the walk end up “at time infinity”? Nowhere – it “diffuses away”, even though it is recurrent! In fact, the issue is that it is null recurrent.
3.10    Stationary distributions
Our goal will be to understand when convergence to a distribution happens, as in Example 3.43, as well as to determine the limiting distribution. We start with the latter
question. The key idea is that any potential limiting distribution needs to remain the same
over time.
A distribution (on the state space S) is any row vector λ = (λi)_{i∈S} such that λi ≥ 0 for all i, and Σ_{i∈S} λi = 1. Recall that if a Markov chain has initial distribution λ then
the distribution after one step is λP (and after n steps, λP n ). A stationary distribution
(sometimes called an invariant or equilibrium distribution) for a Markov chain with transition
matrix P is a distribution π such that
πP = π.
If a Markov chain starts in a stationary distribution, it stays there. Stationary distributions will be candidate limiting distributions.
Example 3.47. Find all stationary distributions in Example 3.43.
Solution. We want to find π = (π1 , π2 , π3 ) such that πP = π, i.e. Σ_i πi pi,j = πj for each j. That is,

                     ( 0    1    0  )
    (π1  π2  π3)     ( 0   .5   .5  )  =  (π1  π2  π3),
                     ( .5   0   .5  )

or

    (1/2)π3 = π1
    π1 + (1/2)π2 = π2
    (1/2)π2 + (1/2)π3 = π3.

This is three equations in three variables, but one equation is redundant. (This will always happen because if π is a solution to πP = π then cπ is another solution for any constant c). We get π2 = π3 = 2π1. But for π to be a distribution we also need π1 + π2 + π3 = 1, which gives π1 + 2π1 + 2π1 = 1, so π1 = 1/5, and thus

    π = (1/5, 2/5, 2/5)
is the unique stationary distribution. (Note that this was the apparent limiting distribution
that we observed).
♢
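(Aside: numerically, one convenient way to find π is to replace one of the redundant equations in πP = π by the normalisation Σ_i πi = 1 and solve the resulting linear system. A sketch, assuming Python with numpy:)

    import numpy as np

    P = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5]])
    n = P.shape[0]

    # pi P = pi is equivalent to (P^T - I) pi^T = 0; swap the last equation
    # for the normalisation sum(pi) = 1 to pin down the solution.
    A = P.T - np.eye(n)
    A[-1, :] = 1.0
    b = np.zeros(n)
    b[-1] = 1.0
    print(np.linalg.solve(A, b))   # [0.2, 0.4, 0.4], i.e. (1/5, 2/5, 2/5)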
Note that in each of the equations

    Σ_{i∈S} πi pi,j = πj ,    j ∈ S,
that make up the identity πP = π, we sum along a column of P , rather than a row as in the
hitting probability and hitting time equations from earlier. In terms of the diagram, we sum
over arrows leading into a state. We can think of it as: πj is the equilibrium probability of
being at j, which can be expressed as a sum over how we got there at the previous step.
It is an exercise to find the stationary distributions for Examples 3.44 and 3.45 – the
former has a unique stationary distribution but the probabilities do not converge to it; the
latter has multiple stationary distributions. We now consider our third “problem example”.
Example 3.48. Find all stationary distributions for Example 3.46, the simple symmetric
random walk on Z.
Solution. The equation πP = π gives

    (1/2)πi−1 + (1/2)πi+1 = πi ,    i ∈ Z,
I.e. πi+1 − 2πi + πi−1 = 0. This is a difference equation we have seen before: the general
solution is πi = A + Bi. We must have B = 0, otherwise πi would be negative for some i. So πi = A, a constant. But we need Σ_{i∈Z} πi = 1, which is impossible: if A = 0 then the sum is
0; if A ̸= 0 then it is infinite. So there is no stationary distribution.
♢
The equations πP = π are linear, but it can be hard or tedious to solve them. Sometimes
we can use the structure of the Markov chain to guess a solution, and check it.
Example 3.49. A microorganism can exist in m different strains. At each generation, it
either stays the same, or with probability p it mutates to a randomly chosen different strain.
Find a stationary distribution.
Solution. Recall the transition matrix

         ( 1 − p      p/(m−1)   · · ·   p/(m−1) )
         ( p/(m−1)    1 − p     · · ·   p/(m−1) )
    P =  (   ⋮                    ⋱        ⋮    )
         ( p/(m−1)    p/(m−1)   · · ·   1 − p   ).

Because of the symmetry between the m states, we can guess that the uniform distribution

    π = (1/m, . . . , 1/m)        (14)

should be stationary. To check this, for any state j,

    Σ_{i∈S} πi pi,j = (m − 1) × (1/m)(p/(m−1)) + (1/m)(1 − p) = 1/m = πj ,
as required.
♢
It is not so obvious from this whether the stationary distribution (14) is unique. One
way to see that it is unique in this case is to consider the two-state chain associated with
the problem, as defined earlier, and solve the equations to get the unique solution π =
(1/m, (m − 1)/m). This tells us that π1 must be 1/m, but the same applies to every state
by symmetry. As we see next, uniqueness is in fact guaranteed for more general reasons.
3.11    Existence and uniqueness of stationary distributions
Here is the key result about stationary distributions.
Theorem 3.50. An irreducible Markov chain has a stationary distribution π if and only
if it is positive recurrent. In that case π is unique, and it is given by

    πi = 1/µi ,    i ∈ S,
where µi is the mean return time of state i.
The formula πi = 1/µi is an unexpected bonus. It should make intuitive sense. If the
initial distribution is π so that the chain is “in equilibrium”, then we expect πi to be the long
term proportion of time spent in state i. This is one over the average time spent between
visits to i.
We will give most but not all details of the proof. As suggested by the examples discussed
above, there are two issues in play when looking for a stationary distribution: finding a
solution to the equation xP = x, and normalizing it to be a probability distribution.
The first step is a straightforward one.
Theorem 3.51. Suppose that the chain is irreducible and that the row vector x = (xi )i∈S
satisfies xi ≥ 0 for all i ∈ S and is a solution to xP = x. If xj > 0 for some j then xi > 0
for all i. If xj < ∞ for some j then xi < ∞ for all i.
Proof. Since xP = x we also have xP^n = x for all n ≥ 1, by induction.
Suppose xj > 0. For any i, by irreducibility, P^m_{j,i} > 0 for some m. Then xi = (xP^m)_i ≥ xj P^m_{j,i} > 0.
Suppose xj < ∞. For any i, by irreducibility, P^n_{i,j} > 0 for some n. Then xj = (xP^n)_j ≥ xi P^n_{i,j}, so xi ≤ xj / P^n_{i,j} < ∞.
We call x positive and finite if 0 < xi < ∞ for all i ∈ S.
Theorem 3.52. If the Markov chain is irreducible and recurrent then any positive finite
solution to the equation xP = x is unique up to a multiplicative constant. I.e. if x and
x′ are two positive finite solutions then x′ = ax for some a ∈ (0, ∞).
Proof. Omitted.
The key idea for the proof of Theorem 3.50 is the following clever construction. Suppose
that the Markov chain X with transition matrix P is recurrent. Fix a state k ∈ S and as
usual let Rk = min{n ≥ 1 : Xn = k} be the return time to k. Define for each i ∈ S,
    ρi(k) = Ek Σ_{n=0}^{Rk−1} I[Xn = i],
and consider the row vector ρ(k) = (ρi (k))i∈S . Thus ρi (k) is the expected number of visits
to i between visits to k (counting one but not both of the visits to k at the ends, in the case
when i = k). Note that ρk (k) = Ek 1 = 1.
Theorem 3.53. If the Markov chain X with transition matrix P is irreducible and recurrent
then, for any fixed k ∈ S, we have 0 < ρi (k) < ∞ for all i ∈ S, and
ρ(k)P = ρ(k).
Proof. (not required for the course). Since the chain is recurrent, under Pk we have
Rk < ∞ and X0 = XRk = k. Note that the event {n ≤ Rk } is the complement of
{Rk ≤ n − 1}, and, since Rk is a stopping time, whether n ≤ Rk can be determined from
X0 , . . . , Xn−1. Now

    ρj(k) = Ek Σ_{n=1}^∞ I[Xn = j, n ≤ Rk]
          = Σ_{n=1}^∞ Pk(Xn = j, n ≤ Rk)
          = Σ_{i∈S} Σ_{n=1}^∞ Pk(Xn−1 = i, Xn = j, n ≤ Rk)
          = Σ_{i∈S} Σ_{n=1}^∞ Pk(Xn−1 = i, n ≤ Rk) pi,j
          = Σ_{i∈S} pi,j Σ_{m=0}^∞ Pk(Xm = i, m ≤ Rk − 1)
          = Σ_{i∈S} pi,j Ek Σ_{m=0}^{Rk−1} I[Xm = i]
          = Σ_{i∈S} pi,j ρi(k),
as required, where the various interchanges of sums and expectations are justified because
the terms are nonnegative, in the 4th equality we used the Markov property, and in the
last equality we used that we may count either the initial or the final visit to k in the
definition of ρi (k).
Since ρk (k) = 1, Theorem 3.51 implies that ρ(k) is positive and finite.
So to get a stationary distribution, we need to normalize ρ(k) to have sum 1:
    πi = ρi(k) / Σ_{j∈S} ρj(k).

This will only work if the sum in the denominator is finite. But we have

    Σ_{j∈S} ρj(k) = Σ_{j∈S} Ek Σ_{n=0}^{Rk−1} I[Xn = j] = Ek Σ_{n=0}^{Rk−1} Σ_{j∈S} I[Xn = j] = Ek Σ_{n=0}^{Rk−1} 1 = Ek Rk = µk ,

the mean return time! So we can normalize if and only if µk < ∞, i.e. if and only if k is
positive recurrent.
Theorem 3.54. For an irreducible, recurrent Markov chain, the following are equivalent:
(i) every state is positive recurrent;
(ii) some state is positive recurrent;
(iii) there is a stationary distribution.
Proof. (not required for the course) Clearly (i) implies (ii).
To show (ii) implies (iii), suppose k is positive recurrent, so µk = Σ_i ρi(k) < ∞. Then πi = ρi(k)/µk gives a stationary distribution.
To show (iii) implies (i), suppose π is an invariant distribution and k is any state. Since π and ρ(k) are positive solutions to xP = x, we have ρ(k) = aπ for some a ∈ (0, ∞). Therefore µk = Σ_i ρi(k) = a Σ_i πi = a < ∞, so k is positive recurrent.
In particular the last result shows that positive and null recurrence are class properties.
(To make this explicit: a recurrent class C ⊆ S is closed, and so in particular pi,j = 0 for all
i ∈ C and j ∉ C. Therefore in C the Markov chain simply behaves like the irreducible chain with transition matrix (pi,j)_{i,j∈C}, but the theorem says that this has all states positive recurrent or all states null recurrent).
Moreover, for an irreducible recurrent chain, the unique stationary distribution π satisfies

    πk = ρk(k)/µk = 1/µk ,
as claimed earlier.
Proof of Theorem 3.50. (not required for the course) If the chain is positive recurrent then we saw above that the unique stationary distribution indeed satisfies πi = 1/µi. If it is null recurrent then the only positive solutions to xP = x are of the form x = aρ(k) with a ∈ (0, ∞), but these have Σ_i xi = aµk = ∞.
Finally suppose the chain is transient. We have Σ_n P^n_{i,i} < ∞, so P^n_{i,i} → 0 as n → ∞ for each i. For any i, j we have P^m_{j,i} > 0 for some m, so P^n_{i,j} ≤ P^{m+n}_{i,i} / P^m_{j,i} → 0 as n → ∞. Therefore if π were a stationary distribution, πj = Σ_i πi P^n_{i,j} → 0, so π = 0.
3.12    Periodicity and the convergence theorem
The one remaining obstacle to convergence is periodicity, which is what went wrong in Example 3.44. For a state i of a Markov chain with transition matrix P we define

    Ji = {n ≥ 1 : P^n_{i,i} > 0},
i.e. the set of times at which it is possible for the chain to return to i.
The period of state i is gcd(Ji ), the greatest common divisor of the set Ji , i.e. the
largest positive integer that is a factor (divisor) of all elements of Ji . The state i is called
aperiodic if it has period 1.
Example 3.55.

[Transition diagram: the deterministic 3-cycle 1 → 2 → 3 → 1, each arrow with probability 1.]
Here J1 = {3, 6, 9, 12, . . .}, so state 1 has period 3.
Example 3.56.
[Transition diagram omitted: a four-state chain on {1, 2, 3, 4} with transitions of probability 1 and 1/2 between neighbouring states.]
Here J1 = {2, 4, 6, 8, . . .}, so 1 has period 2.
Example 3.57.

[Transition diagram omitted: a four-state chain on {1, 2, 3, 4} with transitions of probability 1 and 1/2 and a self-loop at state 4.]
Because of the self-loop at 4, J1 = {2, 4, 6, 7, 8, . . .}, so 1 is aperiodic.
Theorem 3.58. If m, n ∈ Ji then m + n ∈ Ji.
Proof. If P^m_{i,i}, P^n_{i,i} > 0 then P^{m+n}_{i,i} ≥ P^m_{i,i} P^n_{i,i} > 0.
Theorem 3.59. State i is aperiodic if and only if Ji ⊇ {m, m + 1, m + 2, . . .} for some m.
Proof. (not required for the course) This depends on a fact from number theory (not
covered here): if gcd(r1 , . . . , rk) = 1 then there exists m such that any n ≥ m can be expressed as n = Σ_j a_j r_j where the coefficients a_j are nonnegative integers. Then we use
the last theorem.
Theorem 3.60. Aperiodicity is a class property.
Proof. Suppose i is aperiodic and i ↔ j. There exists m such that P^n_{i,i} > 0 for all n ≥ m. Moreover there exist k, ℓ such that P^k_{i,j}, P^ℓ_{j,i} > 0. Then for all n ≥ m,

    P^{ℓ+n+k}_{j,j} ≥ P^ℓ_{j,i} P^n_{i,i} P^k_{i,j} > 0,
which implies that j is aperiodic.
(It can be shown similarly that period is a class property).
Finally we are ready for (arguably) the main result about Markov chains!
Theorem 3.61 (Convergence Theorem). For any irreducible, aperiodic, positive recurrent Markov chain X with any initial distribution, and with stationary distribution π, for
every state j,
P(Xn = j) → πj
as n → ∞.
In particular, taking the initial distribution to be deterministically i, we deduce that

    P^n_{i,j} → πj    as n → ∞
for all states i, j.
Proof of Theorem 3.61. (sketch – not required for the course) We do not give full
details, but we mention the main idea: coupling. We run two copies of the Markov chain
simultaneously. The first, X, starts from an arbitrary initial distribution. The second, Y ,
starts from the stationary distribution π, so that at every time step it is in distribution π.
We let X and Y evolve independently of each other until the first time they meet, i.e. until
the first time T that XT = YT . After time T , X and Y stick together, so that Xn = Yn for
all n ≥ T , both still evolving according to the transition matrix. Under this construction,
the pair (Xn , Yn ) is itself a Markov chain with state space S × S, while (Xn ) and (Yn )
individually are Markov chains with the original transition matrix.
The central calculation is the following comparison between Xn and Yn :
|P(Xn = j) − P(Yn = j)|
= |P(Xn = j, n < T ) + P(Xn = j, n ≥ T ) − P(Yn = j, n < T ) − P(Yn = j, n ≥ T )|
= |P(Xn = j, n < T ) − P(Yn = j, n < T )|
≤ P(T > n).
Since P(Yn = j) = πj for all n, we just need to show that P(T > n) → 0, which by
continuity of probability is equivalent to P(T = ∞) = 0. In other words we must show
that X and Y meet (and hence couple) almost surely.
In fact it is enough to fix one state k ∈ S and show that (Xn , Yn ) hits (k, k) almost
surely. This is done by showing that the chain (Xn , Yn ) is irreducible and recurrent. In
fact it is positive recurrent, which follows from the same fact about the original chain (Xn )
by constructing a stationary distribution. Irreducibility can be deduced from irreducibility
and aperiodicity of (Xn ).
We also have another type of convergence to the stationary distribution, and for this we
do not need aperiodicity.
Theorem 3.62 (Ergodic Theorem). For any irreducible, positive recurrent Markov chain
X with any initial distribution, and with stationary distribution π, for every state j,

    (1/n) |{t = 0, . . . , n − 1 : Xt = j}| → πj    as n → ∞, almost surely.
The left side above is the proportion of time that the chain spends in state j before time n.
Thus the theorem states that the long-term proportion of time spent at j is πj . The proof is
not covered in this course.
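(Aside: the ergodic theorem is easy to see in a simulation. A sketch, assuming Python, for the 3-state chain of Example 3.43, whose stationary distribution is (1/5, 2/5, 2/5):)

    import random

    P = [[0.0, 1.0, 0.0],
         [0.0, 0.5, 0.5],
         [0.5, 0.0, 0.5]]

    def step(i):
        # Sample the next state from row i of P.
        u, c = random.random(), 0.0
        for j, p in enumerate(P[i]):
            c += p
            if u < c:
                return j
        return len(P) - 1

    n_steps, counts, state = 200_000, [0, 0, 0], 0
    for _ in range(n_steps):
        counts[state] += 1
        state = step(state)
    print([c / n_steps for c in counts])   # roughly [0.2, 0.4, 0.4]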
3.13    Reversibility
Finally, we look at an important and useful special class of Markov chains. Suppose P is a
transition matrix and π is a distribution. We say that P and π satisfy the detailed balance
equations if
    πi pi,j = πj pj,i    for all i, j ∈ S.        (15)
If we think of π as an equilibrium distribution then this is a very natural condition. The
term πi pi,j represents the long-term frequency of transitions from i to j, or probabilistic
“traffic” from i to j, and detailed balance says this equals the traffic from j to i. In fact,
detailed balance implies stationarity:
Theorem 3.63. If the transition matrix P and a distribution π satisfy the detailed balance equations then π is a stationary distribution for P .
Proof. This is an exercise on the problem sheet. In fact we do not even need that π is a
distribution – any vector x that satisfies detailed balance will satisfy xP = x.
Note that (15) is a much stronger condition than πP = π; if S is finite then it is |S|²
equations rather than |S| equations. The condition is important for several reasons. When
they hold, the detailed balance equations are often much easier to solve than the stationarity
equation πP = π, because each equation only has two terms. For a general P there is no
guarantee that the detailed balance equations have a solution, but we will see some important
cases where they do. Finally, they imply the following very interesting property.
Theorem 3.64. Suppose π is a distribution that satisfies the detailed balance equations
for P . Let the Markov chain (Xn ) have initial distribution π and transition matrix P .
Then (X0 , . . . , Xn ) has the same distribution as (Xn , . . . , X0 ).
In other words, if we make a video of the Markov chain in equilibrium and play it backwards, it is impossible to tell the difference. In light of this, a Markov chain that satisfies the
conditions of Theorem 3.64 is called reversible.
Proof. By applying the detailed balance condition successively,
    P((X0 , . . . , Xn ) = (i0 , . . . , in )) = πi0 pi0,i1 pi1,i2 · · · pin−1,in
        = pi1,i0 πi1 pi1,i2 · · · pin−1,in
        = pi1,i0 pi2,i1 πi2 · · · pin−1,in
        ⋮
        = pi1,i0 pi2,i1 · · · pin,in−1 πin
        = πin pin,in−1 · · · pi2,i1 pi1,i0
        = P((X0 , . . . , Xn ) = (in , . . . , i0 )).
We discuss two main classes of reversible chains.
Example 3.65 (Birth-death chains). Let S = {0, 1, . . . , m} or S = {0, 1, . . .}, and let
pi,j = 0 for all |i − j| ≥ 2. Thus, the state space is a finite or infinite line, and only
nearest-neighbour transitions are allowed. We can denote the transition probabilities
pi,i+1 = bi
pi,i = si
pi,i−1 = di
for i ∈ S, where bi + si + di = 1 for all i. If we think of the chain as representing the size of
a population model, bi and di are probabilities of a birth and a death when the population is
i.
Provided bi > 0 for i < m and di > 0 for i > 0 the chain is irreducible. In that case it
satisfies the detailed balance equations. We can just check this: the equation πi pi,j = πj pj,i
holds trivially when i = j, and when |i − j| ≥ 2 it is just 0 = 0. Exchanging i and j gives the same equation. So we only have to satisfy the cases j = i + 1:

    π0 b0 = π1 d1
    π1 b1 = π2 d2
    π2 b2 = π3 d3
    ⋮
    πi−1 bi−1 = πi di
    ⋮

The general solution is

    πi = π0 (b0 b1 · · · bi−1) / (d1 d2 · · · di),    i ∈ S.

Of course this may or may not be summable, depending on the parameters, which will determine whether or not the chain is positive recurrent.
Example 3.66. Consider (yet again) simple random walk on N with a reflecting barrier at
0. So pi,i+1 = p for i ≥ 0 and pi,i−1 = q = 1 − p for i ≥ 1. The chain is irreducible, and it
is aperiodic because p0,0 > 0. We know that it is transient when p > 1/2 and recurrent for
p ≤ 1/2. Let us focus on p ≤ 1/2.
Since this is a birth-death chain, we can try to find a stationary distribution using the
detailed balance equations:

    π0 p = π1 q
    π1 p = π2 q
    π2 p = π3 q
    ⋮

giving

    πi = (p/q)^i π0 ,    i ≥ 0.
Normalizing to get a distribution:

    1 = Σ_{i=0}^∞ πi = π0 Σ_{i=0}^∞ (p/q)^i ,

but

    Σ_{i=0}^∞ (p/q)^i = 1/(1 − p/q) if p < 1/2,  and = ∞ if p = 1/2.
So (as seen before), if p = 1/2 there is no stationary distribution so the chain must be
null recurrent.
If p < 1/2 then the chain is positive recurrent with stationary distribution

    πi = (1 − p/q)(p/q)^i ,    i ≥ 0.
(In fact this is the probability mass function of a geometric random variable minus 1). So we
can deduce for instance:

    lim_{n→∞} P(Xn = 0) = π0 = 1 − p/q.
    µ0 = E0(return time to 0) = 1/π0 = 1/(1 − p/q) = q/(q − p).

Also letting t = E1(time to hit 0) we have µ0 = 1 + q·0 + p·t, so

    t = (µ0 − 1)/p = 1/(q − p),
(as we got before by more complicated methods).
Moreover, for p < 1/2 the walk is reversible. This is quite startling. Since p < 1/2 the
walk tends to drift leftwards – it is more likely to decrease than increase. We might expect
that running the video backwards would result in it tending instead to drift to the right. But
this is incorrect because the stationary distribution is concentrated near 0. The leftward drift
is only visible after there has been an unusual rightward excursion. If we focus attention
on such excursions then in the time-reversed picture the drift back toward zero becomes an
unusual rightward excursion, and the excursion becomes drift back to zero.
Even if you buy this explanation it is still startling!
Example 3.67 (Random walk on a graph). Consider a finite graph G = (V, E). V is the
set of vertices, and E is the set of edges, which are unordered pairs of distinct vertices. Two vertices
are called neighbours if there is an edge between them, and the degree deg(v) of a vertex v
is the number of its neighbours. The (simple symmetric) random walk on G is the Markov
chain with state space S = V in which, at each step, we move from the current vertex to a
uniformly random neighbour. Thus,

    pv,w = 1/deg(v) if w is a neighbour of v, and 0 otherwise.
We claim that the chain is reversible with stationary distribution given by

    πv = deg(v) / Σ_{u∈V} deg(u).
The denominator (which incidentally equals 2|E|) is just a normalizing factor to ensure
that π is a distribution – the stationary distribution is proportional to degree. (One way
to understand this is that the walk spends equal time on each edge).
This is almost trivial to check. Writing D = Σ_{u∈V} deg(u), we only need to check the
detailed balance equations when v, w are neighbours:
    πv pv,w = (deg(v)/D) · (1/deg(v)) = (deg(w)/D) · (1/deg(w)) = πw pw,v .
Example 3.68. A chess King is moved about at random on an otherwise empty chess board.
At each step, independently, it is moved to a uniformly random square of those that it can
legally move to.
A chess board is an 8-by-8 grid of squares, and a King may move by one place horizontally,
vertically or diagonally at 45 degrees, so long as it does not leave the board. The King’s
position Xn is a random walk on a graph. The vertex set is V = {1, 2, . . . , 8}2 = {(x, y) :
x, y = 1, . . . , 8}, and two vertices u, v are adjacent if u − v is a non-zero vector with both
coordinates in {−1, 0, +1}.
The degree of a vertex is 3 if it is a corner, 5 if it is on an edge, and 8 in the interior, so
the sum of all degrees is
    4 × 3 + (4 × 6) × 5 + 6² × 8 = 420.
The chain is irreducible and aperiodic (check this!), so the convergence theorem and ergodic theorem apply. So e.g. for a corner c, as n → ∞,
P(Xn = c) → πc = deg(c)/420 = 3/420 = 1/140;
and the expected time starting from c to return to c is µc = 1/πc = 140; and the long-term
proportion of time spent in corners is 4 × 1/140 = 1/35.
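(Aside: the degree bookkeeping is easily verified by a short computation. A sketch, assuming Python:)

    squares = [(x, y) for x in range(1, 9) for y in range(1, 9)]

    def degree(v):
        # Number of squares a King can move to from v on an 8-by-8 board.
        x, y = v
        return sum(1 for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                   if (dx, dy) != (0, 0) and 1 <= x + dx <= 8 and 1 <= y + dy <= 8)

    D = sum(degree(v) for v in squares)
    print(D)                       # 420
    print(degree((1, 1)) / D)      # 3/420 = 1/140, stationary probability of one corner
    print(4 * degree((1, 1)) / D)  # 1/35, long-run proportion of time in corners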
4    Poisson Process
4.1    Introduction
This will be our first continuous-time stochastic process. Time t will be a real parameter with
0 ≤ t < ∞. We want to model a series of occurrences (often called arrivals) that happen
at random times, such as: emissions from a radioactive source, meteor impacts, financial
crashes, rain storms, rain drops, road accidents. We don’t distinguish between different
arrivals (for now) – we are just interested in their times. The Poisson process is the most
basic such model, in which arrivals are equally common at all times, and past arrivals provide
no information about future ones.
Rather than the arrival times themselves (which would be a random set), we focus on a
random function (stochastic process) that encapsulates the same information: the number of
arrivals up to and including time t, which we denote N (t). For s < t, the number of arrivals in
the time interval (s, t] is then the increment N (t) − N (s) of N over the interval.
Here is what a graph of the function N might look like. It jumps up by 1 every time
there is an arrival. Because of the convention to count arrivals up to and including time t,
it takes the upper of the two values at the time of the jump itself; this is the meaning of the
solid (versus open) discs in the picture.
[Figure omitted: a sample path of N(t), a right-continuous step function that jumps by 1 at each arrival time.]
We will end up with three equivalent definitions.
Definition 1. A stochastic process (N (t))t≥0 is a Poisson process of rate λ > 0 if:
1. N (0) = 0, N (t) is integer-valued, and the function N is right-continuous;
2. N has independent increments (over disjoint intervals); i.e. for any 0 ≤ t0 ≤
t1 ≤ · · · ≤ tn , the random variables
N (t1 ) − N (t0 ), N (t2 ) − N (t1 ), . . . , N (tn ) − N (tn−1 )
are independent;
3. for any t ≥ 0,

    P(N(t + h) − N(t) = n) = 1 − λh + o(h)   if n = 0;
                           = λh + o(h)       if n = 1;
                           = o(h)            if n ≥ 2,

as h ↓ 0.
The process N jumps up by 1 whenever there is an arrival. Right-continuity just means
that we choose the upper value at the time of the jump. Independent increments means that
the numbers of arrivals in disjoint time intervals are independent. The “little-oh” notation
o(h) means a function f (h) (perhaps a different one at each place where the notation appears)
such that f (h)/h → 0 as h → 0, in other words a term of strictly lower order than h (such
as h2 or h3 ). (Not to be confused with “big-oh” notation – O(h) means a function g(h) such
that g(h)/h ≤ const.) Condition 3 says that the probability of an arrival in a short interval
is approximately the rate times its length, with two or more arrivals being much less likely.
Our first observation is a version of time-homogeneity: the process of arrivals started from
some later time s looks like the original process.
Theorem 4.1. Let (N (t))t≥0 be a Poisson process of rate λ and fix s ≥ 0. The process
(N (s + t) − N (s))t≥0 is a Poisson process of rate λ.
Proof. Let M (t) = N (s + t) − N (s). We check that it satisfies the conditions 1–3.
Since N is integer valued and right continuous, the same holds for M . We have M (0) =
N (s + 0) − N (s) = 0, so condition 1 holds.
The increment of M over an interval (ti , ti+1 ] is M (ti+1 ) − M (ti ) = N (s + ti+1 ) − N (s) −
N (s + ti ) + N (s) = N (s + ti+1 ) − N (s + ti ), which is the increment of N over (s + ti , s + ti+1 ].
Therefore, since N has independent increments, so does M – this is condition 2.
By similar reasoning,

    P(M(t + h) − M(t) = n) = P(N(s + t + h) − N(s + t) = n) = 1 − λh + o(h)   if n = 0;
                                                            = λh + o(h)       if n = 1;
                                                            = o(h)            if n ≥ 2,

so condition 3 is satisfied.
4.2    Increment distribution
A Poisson process is a family of random variables N (t) indexed by t. What are their distributions?
Theorem 4.2. If (N (t))t≥0 is a Poisson process of rate λ then N (t) is a Poisson random
variable with parameter (mean) λt.
Proof. Write pj (t) = P(N (t) = j) for the probability mass function of N (t). We will derive
an expression for the derivative of pj (t) with respect to t. Consider a (small) parameter h > 0
and write Y = N (t + h) − N (t) for the number of arrivals from t to t + h. Then
    pj(t + h) = P(N(t + h) = j) = Σ_{i=0}^j P(Y = i, N(t) = j − i)
              = Σ_{i=0}^j P(Y = i) pj−i(t)
              = (1 − λh) pj(t) + λh pj−1(t) + o(h)

as h ↓ 0, for j ≥ 1. So

    [pj(t + h) − pj(t)] / h = −λ pj(t) + λ pj−1(t) + o(h)/h.

Taking the limit h ↓ 0 gives

    p′j(t) = −λ pj(t) + λ pj−1(t),    j ≥ 1.        (16)
In the case j = 0 we have P(N(t) = j − 1) = 0, so we obtain by a similar argument

    p′0(t) = −λ p0(t).        (17)

Since N(0) = 0 a.s., we have the boundary condition

    pj(0) = 1 if j = 0,  and 0 if j ≥ 1.        (18)
We need to solve the family of linked differential equations (16),(17). One way is to solve
(17) for p0 , then substitute into the j = 1 equation and find p1 , then find p2 , etc. Another
way is to introduce the probability generating function of N(t):

    G(s, t) = E(s^{N(t)}) = Σ_{j≥0} pj(t) s^j.
Multiplying (16) by s^j and summing over j, together with (17) for j = 0, gives

    Σ_{j≥0} s^j p′j(t) = −λ Σ_{j≥0} s^j pj(t) + λ Σ_{j≥1} s^j pj−1(t),

i.e.

    ∂G/∂t = −λG + λsG = λ(s − 1)G.
(We omit the details needed to make this fully rigorous, including differentiating the power
series term-by-term).
This differential equation can be solved by multiplying by the integrating factor e−λ(s−1)t :
    e^{−λ(s−1)t} ∂G/∂t − e^{−λ(s−1)t} λ(s − 1)G = 0;

    ∂/∂t [ e^{−λ(s−1)t} G ] = 0.
After using the boundary condition (18) to get G(s, 0) = 1, we deduce that
G = eλ(s−1)t ,
which we recognize as the probability generating function of a Poisson(λt) variable.
Since (N (t + s) − N (s))t≥0 is a Poisson process of rate λ (Theorem 4.1) we immediately
deduce from Theorem 4.2 that N (t + s) − N (s) is also Poisson(λt).
Example 4.3. If (N (t))t≥0 is a Poisson process of rate λ, find the covariance Cov(N (s), N (t)).
Solution. Recall that Cov(X, Y ) = E(XY ) − (EX)(EY ). Suppose that s ≤ t. Since N (s)
and N (t) are not independent, we split N (t) up into the increments over (0, s] and (s, t]:
    E(N(s)N(t)) = E[N(s)(N(s) + N(t) − N(s))] = E(N(s)²) + E[N(s)(N(t) − N(s))].
Now N (s) is Poisson(λs), and its second moment E(N (s)2 ) is the variance plus the mean
squared; and the increments N (s) and N (t) − N (s) are independent, so the expectation
factorizes:
    E(N(s)N(t)) = λs + (λs)² + λs·λ(t − s) = λs + λ²st.

So Cov(N(s), N(t)) = λs + λ²st − λs·λt = λs.
Since this was under the assumption s ≤ t, in general we have
Cov(N (s), N (t)) = λ min(s, t).
♢
We can now give an alternative definition of the Poisson process.
Definition 2. A stochastic process (N (t))t≥0 is a Poisson process of rate λ > 0 if:
1. N (0) = 0, and N is integer-valued and right-continuous;
2. N has independent increments.
3. for any s ≥ 0 and t > 0, N (s + t) − N (s) has Poisson(λt) distribution.
Only condition 3 has changed. This version is cleaner, but the previous infinitesimal definition is in a sense more fundamental.
We have seen that Definition 1 implies Definition 2. On the other hand if N (s+h)−N (s) ∼
Poi(λh) then

    P(N(s + h) − N(s) = n) = e^{−λh} = 1 − λh + o(h)            if n = 0;
                           = e^{−λh} λh = λh + o(h)             if n = 1;
                           = e^{−λh} (λh)^n / n! = o(h)         if n ≥ 2,

as h ↓ 0. So the two definitions are equivalent.
4.3    Arrival times
Next we consider the arrival times themselves as random variables. Let
Tj = inf{t ≥ 0 : N (t) = j}
for j ≥ 1; this is the jth arrival time. We could also have written it as min{t ≥ 0 : N (t) = j}
or inf{t ≥ 0 : N (t) ≥ j} – convince yourself that these are all equal. Also for convenience
set T0 = 0, and define
Xj = Tj − Tj−1
for j ≥ 1; this is the jth inter-arrival time, i.e. the time between the (j − 1)th and jth
arrivals (or just the 1st arrival time in the case j = 1).
Here is an illustration of these definitions.
[Figure omitted: a sample path of N(t) with the arrival times T1 , T2 , T3 , T4 and the inter-arrival times X1 , X2 , X3 , X4 marked along the time axis.]
Theorem 4.4. The inter-arrival times X1 , X2 , . . . of a Poisson process of rate λ are i.i.d.
Exponential random variables with parameter λ.
Proof. First consider X1 :
P(X1 > t) = P(N (t) = 0) = e−λt ,
so X1 ∼ Exp(λ).
Moving on to X2 :
P(X2 > t | X1 = t1 ) = P(no arrival in (t1 , t1 + t] | X1 = t1 ).
Since the event X1 = t1 depends only on arrivals in [0, t1 ], while the other event depends on
arrivals after t1 , the independent increments property gives that this equals the unconditional
probability
P(no arrival in (t1 , t1 + t]) = e−λt ,
so X2 is Exp(λ) and independent of X1 . (To make this fully rigorous actually needs a bit
more care, because the event X1 = t1 has probability zero. One way to do this is to consider
the limit of a small interval; we omit the details).
Similarly for the general case,
P(Xn > t | X1 = t1 , . . . , Xn−1 = tn−1 )
= P(no arrival in (t1 + · · · + tn−1 , t1 + · · · + tn−1 + t] | X1 = t1 , . . . , Xn−1 = tn−1 )
= P(no arrival in (t1 + · · · + tn−1 , t1 + · · · + tn−1 + t])
= e−λt ,
so Xn is Exp(λ) and independent of X1 , . . . , Xn−1 .
Unpacking the definitions back to N gives another equivalent definition, this time a constructive one (meaning that it gives a way to generate the process rather than just a description of its properties).
Definition 3. Let X1 , X2 , . . . be i.i.d. Exponential random variables with parameter λ. Let Tn = Σ_{i=1}^n Xi and let

    N(t) = max{n : Tn ≤ t},    t ≥ 0.

The stochastic process (N(t))t≥0 is a Poisson process of rate λ > 0.
We have already seen that Definitions 1 and 2 are equivalent and they imply that N has
the properties in Definition 3. Because Definition 3 clearly uniquely specifies the distribution
of the (N (t))t≥0 , the three definitions are equivalent.
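(Aside: Definition 3 translates directly into a way of simulating the process. A sketch, assuming Python; the function name below is ours, purely for illustration:)

    import random

    def poisson_arrival_times(rate, t_max):
        # Return the arrival times T_1 < T_2 < ... that fall in [0, t_max],
        # built from i.i.d. Exp(rate) inter-arrival times.
        arrivals, t = [], 0.0
        while True:
            t += random.expovariate(rate)
            if t > t_max:
                return arrivals
            arrivals.append(t)

    arrivals = poisson_arrival_times(rate=2.0, t_max=10.0)
    print(len(arrivals))   # N(10), which has the Poisson distribution with mean 20
    print(arrivals[:5])    # the first few arrival times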
One simple consequence is a scaling property.
Theorem 4.5. If (N (t))t≥0 is a Poisson process of rate λ, and a ∈ (0, ∞) is a constant, then
(N (at))t≥0 is a Poisson process of rate aλ.
Proof. If (Tj )j≥0 and (Xj )j≥1 are the arrival and inter-arrival times of N , then the jth arrival
time of the scaled process (N (at))t≥0 is
inf{t ≥ 0 : N(at) = j} = (1/a) inf{at ≥ 0 : N(at) = j} = Tj / a,
and so the inter-arrival times of the scaled process are (Xj/a), which are i.i.d. To find their distributions, note that if X ∼ Exp(λ) then X/a ∼ Exp(aλ), because P(X/a > t) = P(X > at) = e^{−λat}.
We also note a connection with the well-known memoryless property of the Exponential
distribution:
Theorem 4.6 (Memoryless property). If X is an Exponential random variable then for any
s, t > 0,
P(X > s + t | X > s) = P(X > t).
If X is interpreted as a waiting time, this says that given that you have already waited time
s, the probability of waiting a further time t is exactly the same as the original probability
of waiting t. The distribution is memoryless in the sense that knowing we have already waited a
certain time doesn’t affect the distribution of the remaining time. This result is very easy to
prove directly, but it is also instructive to give an argument in terms of the Poisson process.
Proof. Consider a Poisson process N of rate λ and let T1 be the first arrival time, which has
Exp(λ) distribution. Then, using the independent increments property and time homogeneity,
P(T1 > t + s | T1 > s) = P(N (s + t) = 0 | N (s) = 0)
= P(N (s + t) − N (s) = 0 | N (s) = 0)
= P(N (s + t) − N (s) = 0)
= P(N (t) = 0) = P(T1 > t).
Time homogeneity combined with i.i.d. inter-arrival times has the following quite surprising consequence. If we start observing a Poisson process at a fixed time s, then we just see
a Poisson process of arrivals. In particular, the time until the first arrival after s is Exp(λ),
as well as the subsequent inter-arrival times. But viewed in terms of the original process, s
is in between two arrivals, so we might expect the time until the first arrival after s to be
typically shorter than Exp(λ), because it is only part of the inter-arrival time that contains s.
The memoryless property indicates the solution to the apparent paradox: the inter-arrival time containing s is special precisely because we require it to contain s, and this makes it larger than a typical inter-arrival time. If we condition on how much of it happened before s, the rest is
still just Exponential.
4.4 Superposition and thinning
We next consider several Poisson processes together.
Theorem 4.7 (Superposition). If (M (t))t≥0 and (N (t))t≥0 are independent Poisson processes of rates λ and µ respectively then (M (t) + N (t))t≥0 is a Poisson process of rate
λ + µ.
Proof. See the problem sheet.
In other words, if we have two different types of arrivals, occurring as independent Poisson
processes at the same time, then the process of both types of arrival together is also Poisson.
Theorem 4.8 (Thinning). Let (N(t))t≥0 be a Poisson process of rate λ and let Z1, Z2, . . . be i.i.d. taking values 1 with probability p and 0 with probability 1 − p, independent of N.
Let
M(t) = ∑_{i=1}^{N(t)} Zi.
Then (M (t))t≥0 is a Poisson process of rate pλ.
Proof. See the problem sheet.
In this case the interpretation is that we “thin” the stream of arrivals by independently
keeping or deleting each one according to a (biased) coin flip. The remaining arrival process
is Poisson, with the rate being reduced by a factor of the retention probability p.
We can take this further: let
L(t) = N(t) − M(t) = ∑_{i=1}^{N(t)} (1 − Zi);
then (L(t))t≥0 is the process of deleted arrivals of the original process. By applying Theorem 4.8 to (1 − Zi ) we see that (L(t))t≥0 is a Poisson process of rate qλ, where q = 1 − p.
Perhaps more surprisingly, we have the following.
Theorem 4.9. The processes (M (t))t≥0 and (L(t))t≥0 are independent.
Proof. (partial) We first establish that for any fixed t, the random variables M (t) and L(t)
are independent. (This is a much weaker statement than what we need, but it turns out to
be the key step). We have
P(M(t) = m, L(t) = ℓ) = P(M(t) = m, L(t) = ℓ | N(t) = m + ℓ) P(N(t) = m + ℓ)
= ((m + ℓ)! / (m! ℓ!)) p^m q^ℓ × e^{−λt} (λt)^{m+ℓ} / (m + ℓ)!
= e^{−λpt} (λpt)^m / m!  ×  e^{−λqt} (λqt)^ℓ / ℓ!,
as required.
Using this and independent increments, one can deduce that, for t1 ≤ t2 ≤ · · · ≤ tn , the
2n random variables
M (t1 ) − M (0), M (t2 ) − M (t1 ), . . . , M (tn ) − M (tn−1 ),
L(t1 ) − L(0), L(t2 ) − L(t1 ), . . . , L(tn ) − L(tn−1 )
are all independent (we omit the details of the argument). This implies that the two vectors
(M (t1 ), . . . , M (tn )) and (L(t1 ), . . . , L(tn ))
are independent. Using general stochastic process technology (not covered in this course),
this is enough to deduce that the two processes are independent.
In summary, the superposition and thinning pictures are really equivalent. In scenario 1
we consider a Poisson process of “red” arrivals of rate pλ and an independent Poisson process
of “blue” arrivals of rate (1−p)λ together. In scenario 2 we consider a Poisson process of rate
λ, and independently colour each arrival red with probability p and blue with probability
1 − p. The two resulting processes of arrivals with colours have the same distribution.
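The equivalence of the two scenarios can be checked empirically. The following sketch (Python with numpy; the variable names and parameter values are illustrative, not from the notes) colours each of the N(t) arrivals red with probability p and verifies that the red and blue counts have means pλt and (1 − p)λt and are essentially uncorrelated, as Theorem 4.9 asserts for a fixed t.

import numpy as np

rng = np.random.default_rng(1)
lam, p, t, trials = 3.0, 0.4, 5.0, 20000

red_counts, blue_counts = [], []
for _ in range(trials):
    n = rng.poisson(lam * t)            # N(t) ~ Poisson(lambda * t)
    colours = rng.random(n) < p         # colour each arrival red with probability p
    red_counts.append(colours.sum())
    blue_counts.append(n - colours.sum())

red, blue = np.array(red_counts), np.array(blue_counts)
print("mean red :", red.mean(),  " vs p*lambda*t     =", p * lam * t)
print("mean blue:", blue.mean(), " vs (1-p)*lambda*t =", (1 - p) * lam * t)
print("sample correlation red/blue:", np.corrcoef(red, blue)[0, 1], "(should be near 0)")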
Focussing on the first arrival, we get the following useful fact.
Theorem 4.10 (Exponential races). Let T1, . . . , Tn be independent Exponential random variables with parameters λ1, . . . , λn. Let
M = min(T1 , . . . , Tn ),
and let J ∈ {1, . . . , n} be such that TJ = M (i.e. the index of the minimal Tj ). Then
(i) M ∼ Exp(∑_i λi);
(ii) P(J = j) = λj / ∑_i λi;
(iii) J and M are independent.
It may help to think of T1 , . . . , Tn as the finishing times of competitors in a race. Again,
this is quite straightforward to prove directly, but it is illuminating to do it using Poisson
processes.
Proof. Consider a Poisson process of rate λ := ∑_i λi, and independently label each arrival
“type j” with probability pj := λj /λ. By thinning (applied iteratively), the type j arrivals
form a Poisson process of rate λj , and different types form independent arrival processes. Let
Tj be the time of the 1st type j arrival, which is Exp(λj); the (Tj) are independent. Then M
is the time of the first arrival, which is Exp(λ). From the definition of the thinning process
we have P(J = j) = pj , and J and M are independent.
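Here is a quick empirical check of Theorem 4.10 (a sketch in Python with numpy; the rates chosen are arbitrary): the minimum should be Exp(∑ λi), the winner should be index j with probability λj/∑ λi, and the winning time should not depend on who wins.

import numpy as np

rng = np.random.default_rng(2)
lams = np.array([1.0, 2.0, 3.0])
trials = 50000

T = rng.exponential(1 / lams, size=(trials, len(lams)))  # T_j ~ Exp(lambda_j)
M = T.min(axis=1)                                        # the winning time
J = T.argmin(axis=1)                                     # index of the winner

print("mean of M:", M.mean(), " vs 1/sum(lambda) =", 1 / lams.sum())
print("P(J = j) :", np.bincount(J) / trials, " vs ", lams / lams.sum())
# independence check: the winning time should have the same mean whoever wins
print("mean of M given J=0:", M[J == 0].mean(), " given J=2:", M[J == 2].mean())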
4.5 Conditioning
Various conditional distributions involving the Poisson process have very neat forms.
Theorem 4.11. Let (N (t)) be a Poisson process. Given N (t) = 1, the conditional distribution of the first arrival time T1 is Uniform[0, t].
Note that the rate does not appear in this statement.
Proof. We compute the conditional distribution function. For 0 ≤ u ≤ t,
P(T1 ≤ u | N(t) = 1) = P(T1 ≤ u, N(t) = 1) / P(N(t) = 1)
= P(N(u) = 1, N(t) − N(u) = 0) / P(N(t) = 1)
= e^{−λu} λu e^{−λ(t−u)} / (e^{−λt} λt)
= u / t.
More generally, what is the distribution of the first n arrivals given that N (t) = n? The
answer is basically that they are i.i.d. uniform, but there is a complication that they need to
be ordered.
Let U1 , . . . , Un be i.i.d. Uniform[0, t]. Let U(1) < U(2) < · · · < U(n) be U1 , . . . , Un rearranged into increasing order. They are called the order statistics of U1 , . . . , Un . So in
particular, U(1) = min(U1 , . . . , Un ), U(n) = max(U1 , . . . , Un ), and U(k) is the kth smallest of
U1 , . . . , Un .
Theorem 4.12. If U1 , . . . , Un are i.i.d. Uniform[0, t], their order statistics U(1) , . . . , U(n) have
joint probability density function
f(x1, . . . , xn) = n! / t^n   for 0 < x1 < x2 < · · · < xn < t,
and f(x1, . . . , xn) = 0 otherwise.
Proof. Fix 0 < x1 < · · · < xn < t and take h1, . . . , hn small enough that the intervals Ii = (xi, xi + hi) are pairwise disjoint. Then
P(U(1) ∈ I1 , . . . , U(n) ∈ In ) = P(U1 ∈ I1 , U2 ∈ I2 , . . . , Un ∈ In )
+ P(U2 ∈ I1 , U1 ∈ I2 , . . . , Un ∈ In )
+ ···
+ P(Un ∈ I1 , Un−1 ∈ I2 , . . . , U1 ∈ In ),
where we sum over all n! possible permutations of 1, . . . , n. This equals
n! (h1/t) · · · (hn/t) = (n!/t^n) h1 · · · hn;
now divide by h1 · · · hn and let the hi ’s tend to 0 to get the density function. Finally, the
density is clearly 0 outside the given range.
Theorem 4.13. Let (N (t))t≥0 be a Poisson process. Given N (t) = n, the conditional distribution of the arrival times T1 , . . . , Tn is the distribution of the order statistics
U(1) , . . . , U(n) of n i.i.d. Uniform[0, t] random variables.
Proof. We take small intervals Ij as in the last proof, and let h = h1 + · · · + hn . Then
P(T1 ∈ I1, . . . , Tn ∈ In | N(t) = n)
= P(T1 ∈ I1, . . . , Tn ∈ In, N(t) = n) / P(N(t) = n)
= P(1 arrival in each of I1, . . . , In, none in (0, t] \ ∪i Ii) / P(N(t) = n)
= (λh1 + o(h1)) · · · (λhn + o(hn)) e^{−λ(t−h)} / (e^{−λt} (λt)^n / n!)
= (n!/t^n) h1 · · · hn + o(h1 · · · hn)
as h1, . . . , hn → 0, where in the last step we used that e^h = 1 + h + o(h) (imagine multiplying
out the product).
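As an illustration of Theorem 4.13, here is a simulation sketch (Python with numpy; the parameter choices are mine, not from the notes): generate arrivals on [0, t], keep only the runs with N(t) = 3, and compare the average first arrival time with the mean of the minimum of three Uniform[0, t] variables, which is t/4.

import numpy as np

rng = np.random.default_rng(3)
lam, t, n_target, trials = 1.5, 4.0, 3, 50000

first_arrivals = []
for _ in range(trials):
    arrivals, s = [], 0.0
    while True:                          # arrivals in [0, t] via Exp(lam) gaps
        s += rng.exponential(1 / lam)
        if s > t:
            break
        arrivals.append(s)
    if len(arrivals) == n_target:        # condition on N(t) = 3
        first_arrivals.append(arrivals[0])

print("E(T1 | N(t)=3):", np.mean(first_arrivals),
      " vs E(min of 3 uniforms on [0,t]) = t/4 =", t / 4)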
Example 4.14. For a Poisson process (N (t))t≥0 of rate λ, find E(N (t) | N (s)).
Solution. First suppose s ≤ t. Then, using the independent increments property,
E(N (t) | N (s) = n) = E(N (t) − N (s) | N (s) = n) + E(N (s) | N (s) = n)
= E(N (t) − N (s)) + n
= λ(t − s) + n.
Now suppose s ≥ t. Conditional on N (s) = n, T1 , . . . , Tn are distributed as the order
statistics of U1 , . . . , Un , say, which are i.i.d. Uniform on [0, s]. Each Ui lies in [0, t] with
probability t/s, so the number of them that lie in [0, t] is Binomial(n, t/s). So
E(N(t) | N(s) = n) = n t/s.
In conclusion,
E(N(t) | N(s)) = N(s) + λ(t − s) for s ≤ t, and E(N(t) | N(s)) = N(s) t/s for s ≥ t.
♢
4.6 Inhomogeneous and Multidimensional Poisson processes
We briefly mention two extensions.
Inhomogeneous process. Let λ(t) be a function (assumed continuous for simplicity).
A process (N (t))t≥0 is an inhomogeneous Poisson process with rate function λ(t) if
it satisfies conditions 1 (basic properties) and 2 (independent increments) in Definition 1,
together with
P(N (t + h) − N (t) = 1) = λ(t)h + o(h),
and analogously for the cases 0 and n ≥ 2.
One can show that N (t) has a Poisson distribution with parameter
∫_0^t λ(s) ds.
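A standard way to simulate such a process (not covered in the notes; the sketch below assumes the rate function is bounded by some λmax) is to thin a homogeneous process of rate λmax, keeping an arrival at time s with probability λ(s)/λmax. The empirical mean count over [0, t] should then match ∫_0^t λ(s) ds.

import numpy as np

rng = np.random.default_rng(4)

def inhomogeneous_poisson(rate_fn, lam_max, t_end, rng):
    """Thinning: simulate a homogeneous process of rate lam_max on [0, t_end],
    keeping an arrival at time s with probability rate_fn(s) / lam_max."""
    times, s = [], 0.0
    while True:
        s += rng.exponential(1 / lam_max)
        if s > t_end:
            return np.array(times)
        if rng.random() < rate_fn(s) / lam_max:
            times.append(s)

rate = lambda s: 1 + np.sin(s)            # example rate function, bounded by 2
counts = [len(inhomogeneous_poisson(rate, 2.0, 6.0, rng)) for _ in range(5000)]
# the integral of (1 + sin s) over [0, 6] is 6 + 1 - cos(6)
print("mean count:", np.mean(counts), " vs integral of the rate =", 6 + 1 - np.cos(6))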
Higher dimensions. Here we want to model random points in Rd . Imagine the locations
of meteor impacts, seedlings, or rain drops.
For every (well behaved) set A ⊂ Rd we have a random variable N (A) which represents
the number of points that fall in A. The collection (N (A))A⊂Rd of all these variables is a
Poisson point process in Rd with rate (or intensity) λ if it satisfies:
1. N is {0, 1, 2, . . .} ∪ {∞}-valued, N(∅) = 0, and
N(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ N(Ai)
for any pairwise disjoint A1, A2, . . .;
2. N (A1 ), . . . , N (An ) are independent for any pairwise disjoint A1 , . . . , An ;
3. N (A) has Poisson distribution with parameter λ vol(A) for any A with 0 < vol(A) < ∞.
Here vol(A) is the volume of the set A.
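The definition suggests a direct way to simulate the process on a bounded region: draw a Poisson(λ vol(A)) number of points and scatter them uniformly over A. The uniform-scattering step is the standard multidimensional analogue of the conditioning result of Section 4.5; the sketch below (Python with numpy; the function name is my own) is one way to do it for a rectangle.

import numpy as np

rng = np.random.default_rng(5)

def poisson_points_2d(lam, width, height, rng):
    """Sample a rate-lam Poisson point process on [0, width] x [0, height]:
    the number of points is Poisson(lam * area), and given the count the
    points are placed i.i.d. uniformly over the rectangle."""
    n = rng.poisson(lam * width * height)
    xs = rng.uniform(0, width, n)
    ys = rng.uniform(0, height, n)
    return np.column_stack([xs, ys])

pts = poisson_points_2d(lam=2.0, width=3.0, height=2.0, rng=rng)
print("number of points:", len(pts), " (mean should be lam * area = 12)")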
5 Continuous-time Markov chains and queues
5.1 Introduction
We introduce Markov chains in continuous time. We will not give a fully comprehensive
treatment of this subject; rather we will focus on important highlights. Much of the theory
will mirror, and piggy-back on, the discrete-time case. In particular we will address special
cases that arise in queueing theory, a central topic of applied probability.
Here is a transition diagram for a continuous-time Markov chain.
(Transition diagram on states a, b, c, d: a → b at rate 1/2; b → a at rate 1 and b → c at rate 1; c → a at rate 1 and c → d at rate 2; no transitions out of d.)
The state space is a countable set, in this case S = {a, b, c, d}. Time is now a real variable
t ∈ [0, ∞), and the Markov chain will be denoted (X(t))t≥0 , where X(t) is the state at time
t.
The edge labels represent rates of moving from one state to another. The meaning is,
for example: when the current state is a, the process waits a random time with Exp(1/2)
distribution, then moves to b. At c, we simultaneously start independent Exp(1) and Exp(2)
clocks corresponding to the edges leading to a and d respectively. We jump along the edge
whose clock goes off first. In other words, (by the exponential races theorem) we wait a
random time with Exp(3) distribution, and then jump to a with probability 1/3 or d with
probability 2/3, independently of the time we waited. At d the chain stays in place forever,
and never jumps.
Rather than a transition matrix, the parameters of a continuous-time Markov chain with
state space S are summarized by its generator matrix. This is an |S| by |S| matrix
G = (gi,j )i,j∈S with entries:
gi,j = rate of going from i to j, for i ≠ j,
and gi,i = − ∑_{j≠i} gi,j.
In other words, the diagonal entry gi,i is minus the total rate of leaving i, i.e. minus the rate
of the exponential time until the first jump out of i. So each row sums to 0. (The reason for
this choice of diagonal entries will become clearer later). For the above example,
G:
        a      b     c    d
   a ( −1/2   1/2    0    0 )
   b (   1    −2     1    0 )
   c (   1     0    −3    2 )
   d (   0     0     0    0 )
Example 5.1 (Poisson process). The Poisson process of rate λ is itself a continuous-time
Markov chain! As before, the state X(t) at time t (previously called N (t)) represents the
number of arrivals by time t. The state space is N = {0, 1, . . .}, the initial state is X(0) = 0,
and the transition diagram is:
(Transition diagram: 0 → 1 → 2 → · · ·, each arrow at rate λ.)
When there have been n arrivals, we wait an Exp(λ) time for the next arrival, at which point
the number of arrivals jumps to n + 1. The generator is
G =
( −λ    λ    0    0   · · ·
   0   −λ    λ    0
   0    0   −λ    λ
   0    0    0   −λ
   ⋱                     )
Example 5.2 (A simple queue). Five customers arrive at a bank branch that has three
servers. Each customer requires a random service time with Exp(µ) distribution to complete
their business. The service times of different customers are independent. A customer can
only be served when there is a server available to help them. When all servers are busy helping
customers, any further customers wait in a queue until a server is available. Once a customer
has been served they leave.
This situation sounds very complicated to model, but in fact, much of what we might want
to know can be described by a simple continuous-time Markov chain. Let X(t) be the total
number of customers present in the bank at time t (including both those being served and
those waiting). Then X(0) = 5, and the state space is S = {0, 1, . . . , 5}.
If there is 1 customer present, they are served by one of the servers, so they leave after
an Exp(µ) time and then there are 0 customers. If there are 2 customers, they are each being
served by a server (while the other server is idle), so the time until the next customer finishes
is the minimum of two independent Exp(µ) service times, which is Exp(2µ). If there are 3
or more customers, then exactly 3 of them are served simultaneously, so the next customer
leaves at rate 3µ. By the memoryless property of the Exponential, it doesn’t matter how we
got to a state or how long we have been there – the same rules apply for the jumps out of it.
So (X(t))t≥0 is a continuous-time Markov chain with transition diagram:
(Transition diagram: 5 → 4, 4 → 3 and 3 → 2 each at rate 3µ, 2 → 1 at rate 2µ, and 1 → 0 at rate µ; there are no other transitions.)
Note that we did not need to consider the queueing discipline – i.e. how we decide
what order customers are served in, and which server serves which customer. Such choices
might be of interest, but they do not affect the process (X(t))t≥0 . Our only requirement is
that servers are not idle while there are customers waiting to be served.
The generator matrix is
G =
(  0    0     0     0     0     0
   µ   −µ     0     0     0     0
   0   2µ   −2µ     0     0     0
   0    0    3µ   −3µ     0     0
   0    0     0    3µ   −3µ     0
   0    0     0     0    3µ   −3µ )
Example 5.3. In the above example, what is the expected time until all customers have been
served?
Solution. Looking at the transition diagram, the time starting from 5 to hit 4 has Exp(3µ)
distribution, which has mean 1/(3µ), then the time starting from 4 to hit 3 also has mean
1/(3µ), and so on. So the expected time to hit 0 is
1/(3µ) + 1/(3µ) + 1/(3µ) + 1/(2µ) + 1/µ = 5/(2µ).
♢
Example 5.4 (M/M/3 queue). Consider the bank branch with 3 servers again, but now
suppose that customers arrive at the times of a Poisson process of rate λ, independent of the
service process, which is as before. Again let X(t) be the number of customers in the bank at
time t. Now the state space is N, because there could be any number of customers. From any
state n, a new customer arrives (so that X increases to n + 1) at rate λ, or a customer is
served (so X decreases) at rate min(3µ, nµ), because each server can serve one customer at
a time. So the transition diagram is
(Transition diagram: n → n+1 at rate λ for every n ≥ 0; 1 → 0 at rate µ, 2 → 1 at rate 2µ, and n → n−1 at rate 3µ for every n ≥ 3.)
The name “M/M/3 queue” belongs to the standard naming system (Kendall’s notation)
for queueing models. The first M means that the arrival process is “Markovian” (or “memoryless”), meaning that arrivals occur as a Poisson process. The second M means that service
times are also “Markovian” (or “memoryless”), meaning that they are Exponentially distributed. The 3 means that there are 3 servers.
Natural questions (to be addressed) include: can the servers cope with the demand, or
does the queue grow longer and longer? How long do servers typically spend idle? Or busy?
What can we say about the distribution of the stream of customers leaving?
5.2 The jump chain and explosion
We now give a formal construction of a continuous-time Markov chain, based on the informal
description above. Let G = (gi,j )i,j∈S be a generator matrix on state space S. Write
gi := −gi,i
for the total jump rate out of state i. The jump chain is a discrete-time Markov chain
(Yn)n=0,1,... with state space S and transition matrix R = (ri,j)i,j∈S given by:
If gi > 0:  ri,j = gi,j / gi for j ≠ i, and ri,i = 0.
If gi = 0:  ri,j = 0 for j ≠ i, and ri,i = 1.
For example, for the chain at the beginning of the chapter we have
G:
        a      b     c    d
   a ( −1/2   1/2    0    0 )
   b (   1    −2     1    0 )
   c (   1     0    −3    2 )
   d (   0     0     0    0 )

R:
        a     b     c     d
   a (  0     1     0     0  )
   b ( 1/2    0    1/2    0  )
   c ( 1/3    0     0    2/3 )
   d (  0     0     0     1  )
We want the continuous-time chain (X(t)) to perform the same jumps as (Yn ), staying a
random time with distribution Exp(gi ) in state i. It is convenient to get these waiting times
by scaling standard exponentials. Therefore, let Z1 , Z2 , . . . be i.i.d. Exp(1) random variables,
independent of the jump chain (Yn ). Then define the holding times
Sn = Zn / g_{Y_{n−1}},   n ≥ 1,
(recall that if Z ∼ Exp(1) then Z/a ∼ Exp(a)). Define the jump times
Tn = ∑_{i=1}^n Si,
(and T0 = 0). Finally let
X(t) = Yn   if n ≥ 0 is such that Tn ≤ t < Tn+1.    (19)
We then say that the stochastic process (X(t))t≥0 is a continuous-time Markov chain
with generator G.
As in the discrete case, we may also specify the initial distribution µ = (µi )i∈S , where
µi = P(X(0) = i). Again, we sometimes indicate the case of a deterministic starting state j
(so µj = 1) by writing Pj for the probability measure.
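The construction above translates directly into a simulation algorithm. Here is a sketch in Python (numpy assumed; the function name is mine, not from the notes), applied to the four-state generator from the start of the chapter: repeatedly draw an Exp(gi) holding time and then a jump according to the jump-chain probabilities gi,j/gi.

import numpy as np

rng = np.random.default_rng(6)

def simulate_ctmc(G, states, start, t_end, rng):
    """Simulate a path of the chain with generator G up to time t_end,
    following the jump chain plus Exp(g_i) holding times construction.
    Returns the list of (jump time, new state) pairs."""
    G = np.asarray(G, dtype=float)
    i = states.index(start)
    t, path = 0.0, [(0.0, start)]
    while True:
        g_i = -G[i, i]                      # total jump rate out of state i
        if g_i == 0:                        # absorbing state: stay forever
            return path
        t += rng.exponential(1 / g_i)       # holding time ~ Exp(g_i)
        if t > t_end:
            return path
        probs = G[i].copy()
        probs[i] = 0.0
        probs /= g_i                        # jump chain probabilities r_{i,j} = g_{i,j}/g_i
        i = rng.choice(len(states), p=probs)
        path.append((t, states[i]))

# the four-state example from the start of the chapter
G = [[-0.5, 0.5, 0, 0], [1, -2, 1, 0], [1, 0, -3, 2], [0, 0, 0, 0]]
print(simulate_ctmc(G, ["a", "b", "c", "d"], "a", 20.0, rng))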
Explosion. Because transition rates can be arbitrarily large, we need to deal with a new
complication not seen in the discrete case. We say that a chain explodes if it makes infinitely
many jumps in finite time.
Example 5.5. Consider the continuous-time chain with initial state 0 and diagram:
(Transition diagram: 0 → 1 at rate 1, 1 → 2 at rate 2, 2 → 3 at rate 2², 3 → 4 at rate 2³, 4 → 5 at rate 2⁴, and so on.)
Note the rapidly increasing rates.
Let Tn be the time of the nth jump. Then Tn is a sum of independent exponentials, and
E(Tn) = 1/1 + 1/2 + 1/2² + · · · + 1/2^{n−1}.
Moreover, let T∞ = limn→∞ Tn be the time until the chain makes infinitely many jumps. Then
E(T∞) = 1/1 + 1/2 + 1/2² + · · · = 2.
In particular T∞ is a.s. finite. So the chain almost surely makes infinitely many jumps in
finite time, i.e. it explodes.
A trajectory might look something like this:
(Sketch of a trajectory X(t) against t: infinitely many jumps accumulate before the finite explosion time T∞.)
Note that our construction above is silent on what X(t) should be after an explosion time
T∞ . For t > T∞ there is no n satisfying the condition in (19).
Explosion is an interesting subject to pure mathematicians, but we will focus on cases
where it can be ruled out. A Markov chain X is called non-explosive if Pi (X explodes) = 0
for all i ∈ S.
Theorem 5.6. Any of the following conditions on a continuous-time Markov chain guarantees that it is non-explosive:
(i) finite state space S;
(ii) bounded jump rates: −gi,i ≤ C for all i ∈ S, for some C < ∞;
(iii) deterministic initial state X(0) = j that is recurrent for the jump chain.
Proof. (not required for the course.) Let the holding times (Sn ), jump times (Tn ) and
jump chain (Yn ), and the Exp(1) variables (Zn ) be as in the construction of the chain, and
let T∞ = limn→∞ Tn . Case (i) is a special case of (ii). In case (ii),
T∞ = ∑_{n=1}^∞ Sn = ∑_{n=1}^∞ Zn / g_{Y_{n−1}} ≥ (1/C) ∑_{n=1}^∞ Zn = ∞  a.s.
by the strong law of large numbers.
In case (iii), the jump chain returns to j infinitely many times, say at (jump-chain) times N1, N2, . . .. Then
T∞ = ∑_{n=1}^∞ Sn ≥ ∑_k S_{N_k} ≥ (1/gj) ∑_k Z_{N_k} = ∞  a.s.
by the strong law of large numbers.
5.3 Transition probabilities
For a continuous-time Markov chain X, we define the transition probability from i to j
in time t to be
pi,j(t) = P(X(t) = j | X(0) = i) = Pi(X(t) = j).
(This is the analogue of the multistep transition probability in the discrete case; the different
notation reminds us that it is not quite the same. Of course there is no analogue of the
one-step transition probability, because there is no smallest unit of time).
Theorem 5.7 (Markov property). Let X be a continuous-time Markov chain. For any times
0 ≤ t0 ≤ t1 ≤ · · · ≤ tn+1 and states i0 , . . . , in−1 , i, j, we have
P(X(tn+1) = j | X(t0) = i0, . . . , X(tn−1) = in−1, X(tn) = i) = pi,j(tn+1 − tn).
In the discrete case we took a similar statement as the definition of a Markov chain. In the continuous case it can be shown, subject to certain conditions, that Markov chains as we constructed them are the only processes satisfying the above property.
Proof. (sketch, not required for the course). Consider the following alternative construction of a Markov chain. To each pair of distinct states i, j we assign an independent
Poisson process (Ni,j (t))t≥0 of rate gi,j . At each arrival time of Ni,j we imagine that a
“gate” opens allowing the chain to move from i to j if it is at i at that time. It is not
difficult to show using the properties of Poisson processes that this is equivalent to the
earlier definition. Now, given X(tn ) = i, the future movement of the chain depends only
on the future (t > tn ) arrivals of the Poisson processes, which are independent of the past,
and distributed identically to the original Poisson processes started from time 0. Now the
claimed result follows.
One can also define a stopping time in continuous time, and there is also a strong Markov
property similar to the discrete case. We omit the details. One can also compute hitting
probabilities and expected hitting times similarly to the discrete case. We omit this as well,
except for a few very simple cases.
Theorem 5.8 (Chapman-Kolmogorov equations).
pi,j(s + t) = ∑_{k∈S} pi,k(s) pk,j(t).
Proof. Similar to the discrete case.
For each t we can form a matrix
P (t) = (pi,j (t))i,j∈S
of the transition probabilities. It is a transition matrix (i.e. the entries are non-negative and
each row sums to 1). We take pi,j (0) to be 1 if i = j and 0 otherwise, so P (0) is the identity
matrix I. The Chapman-Kolmogorov equations can be written in matrix form as
P (s + t) = P (s)P (t).
In the discrete case we could in principle express everything in terms of the 1-step transition matrix P . In continuous time it is not quite so simple, because there is no smallest unit
of time. Instead we can consider the limit as the time interval goes to zero. If we do that we
essentially get back the generator matrix.
Theorem 5.9. For any i, j ∈ S we have
p′i,j (0) = gi,j .
Here p′i,j(t) means the derivative (d/dt) pi,j(t), and when we evaluate it at 0 it is to be understood as a right-derivative, since pi,j(t) does not make sense for negative t.
Proof. (sketch; not required for the course). Let (Yn ) be the jump chain and let T1
be the first jump time. Also write gi = −gi,i as usual.
For i ̸= j, as h ↓ 0,
pi,j(h) = Pi(X(h) = j)
= Pi(T1 < h, Y1 = j) + o(h)
= (1 − e^{−gi h}) (gi,j / gi) + o(h)
= gi h (gi,j / gi) + o(h)
= h gi,j + o(h).
The second equality is because any other route from i to j other than a single direct jump
involves two or more jumps in a short time h, which is much less likely; we omit the full
rigorous justification of this.
Similarly,
pi,i (h) = Pi (X(h) = i)
= Pi (T1 > h) + o(h)
= e−gi h + o(h)
= 1 − gi h + o(h)
= 1 + hgi,i + o(h).
Since pi,j (0) is 1 if i = j and 0 otherwise, in both cases we have
(pi,j(h) − pi,j(0)) / h = gi,j + o(h)/h,
and so taking h ↓ 0 gives the result.
We can again write the last result in matrix form:
P ′ (0) = G,
where P ′ (t) is the matrix of derivatives (p′i,j (t))i,j∈S .
Combining the last two results, we can say how P (t) and G are related in general.
Theorem 5.10. The transition matrices P (t) and generator matrix G satisfy both the
differential equations
(i) P ′ (t) = P (t) G (“forward equation”);
(ii) P ′ (t) = G P (t) (“backward equation”).
Proof. (sketch) (i) Using the Chapman-Kolmogorov equations, we have
P′(t) = (d/dt) P(t) = (∂/∂h) P(t + h) |_{h=0} = (∂/∂h) [P(t)P(h)] |_{h=0} = P(t) P′(h) |_{h=0} = P(t) G.
(For an infinite state space, some justification is needed to make this fully rigorous, because the matrix multiplication hides an infinite sum, which we are differentiating term by term.)
(ii) is similar, using the Chapman-Kolmogorov equations with t and h in the other order.
The neat form of these matrix equations is the reason for defining the generator the way
we did, with entries −gi on the diagonal.
Matrix exponential formula – not required for the course. In the case of a finite
state space, it is possible to solve the forward or backward equations (together with the
initial condition P (0) = I), to get the remarkable formula
P (t) = exp(tG),
where the exponential of a matrix M is defined as
exp(M) = ∑_{n=0}^∞ M^n / n!.
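For a small chain this formula is easy to use numerically. The sketch below (assuming scipy is available; the code is mine, not part of the notes) computes P(t) = exp(tG) for the four-state example of this chapter and checks that each row sums to 1.

import numpy as np
from scipy.linalg import expm

# generator of the four-state example from the start of the chapter
G = np.array([[-0.5, 0.5, 0, 0],
              [1, -2, 1, 0],
              [1, 0, -3, 2],
              [0, 0, 0, 0]], dtype=float)

P = expm(2.0 * G)                  # transition matrix P(t) for t = 2
print(np.round(P, 4))
print("row sums:", P.sum(axis=1))  # each row should sum to 1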
5.4 Classification of states
For a continuous time Markov chain, the definitions of i → j, i ↔ j, communicating
classes and irreducibility, recurrence and transience are all simply inherited from
the jump chain.
(Not positive and null recurrence, however). We next explore briefly what these definitions
mean.
Example 5.11.
(Transition diagram: the four-state chain from the start of the chapter, with a → b at rate 1/2, b → a and b → c at rate 1, c → a at rate 1, c → d at rate 2, and d absorbing.)
In this example, {a, b, c} is a transient class and {d} is a recurrent class.
The following fact makes many things simpler than the discrete case. In particular there
is no periodicity.
Theorem 5.12. For two states i and j of a continuous-time chain, the following are equivalent.
(i) i → j;
(ii) pi,j (t) > 0 for all t > 0;
(iii) pi,j (t) > 0 for some t > 0.
Proof. It is obvious that (ii) implies (iii).
To show that (iii) implies (i), assuming (iii), there must exist some finite sequence of
states such that with positive probability the chain can get from i to j via this sequence.
But this implies that the jump chain can follow the same sequence with positive probability,
which gives (i).
We now show that (i) implies (ii). Assuming (i), there exist i1 , . . . , in−1 such that
ri,i1 ri1 ,i2 · · · rin−1 ,j > 0, where R = (ri,j ) is the jump chain matrix. By the definition of
the jump chain this implies that gi,i1 gi1 ,i2 · · · gin−1 ,j > 0, where G = (gi,j ) is the generator
matrix. Note that if gk,ℓ > 0 then
pk,ℓ(t) ≥ Pk(T1 < t, T2 − T1 > t, Y1 = ℓ) ≥ (1 − e^{−gk t}) e^{−gℓ t} rk,ℓ > 0.
Therefore,
pi,j (t) ≥ pi,i1 (t/n) · · · pin−1 ,j (t/n) > 0,
giving (ii).
A set A ⊆ [0, ∞) is bounded if A ⊆ [0, C) for some C < ∞, and unbounded otherwise.
Theorem 5.13. Let i be a state of a continuous-time Markov chain X.
(i) If i is recurrent then Pi ({t : X(t) = i} is unbounded) = 1.
(ii) If i is transient then Pi ({t : X(t) = i} is bounded) = 1.
Proof. (i) Since i is recurrent the chain does not explode, so the jump times Tn satisfy Tn → ∞ as n → ∞. But X(Tn) = Yn, which equals i for infinitely many n, so {t : X(t) = i} is unbounded.
(ii) Since i is transient, the jump chain has only finitely many visits to i, so the last visit
is at time N < ∞ say. So {t : X(t) = i} is bounded by TN < ∞.
5.5 Stationary distributions
As usual, a distribution is a row vector γ = (γi : i ∈ S) that satisfies γi ≥ 0 and ∑_{i∈S} γi = 1. As before, if the initial distribution is γ then the distribution of X(t) is γP(t). A distribution
π is stationary for a Markov chain with transition matrices P (t) if
π P (t) = π
for all t > 0.
Since P (t) is usually not easy to calculate, this definition is not so useful in practice. Fortunately, there is an equivalent condition in terms of the generator.
Theorem 5.14. For a continuous-time Markov chain with generator G and transition
matrices P (t), the distribution π is stationary if and only if πG = 0.
Here the 0 on the right side of the equation means the row vector (0, 0, . . . , 0).
Proof. We will give the proof assuming that the state space is finite; in the general case the
various interchanges of limits require additional justification.
If π is stationary then πP (t) = π, so differentiating gives πP ′ (t) = 0. Taking t = 0 gives
πG = 0.
If πG = 0 then by the backward equation,
d
(πP (t)) = πP ′ (t) = π(GP (t)) = (πG)P (t) = 0,
dt
so πP (t) is constant in t (i.e. each entry of the row vector is constant). But πP (0) = πI = π,
so the constant is π, i.e. π is stationary.
Writing gi = −gi,i as usual, the equation πG = 0 can be written in the form
∑_{i≠j} πi gi,j = πj gj,   for j ∈ S.
Example 5.15. Find all stationary distributions of the continuous-time Markov chain with
transition diagram
(Transition diagram: a → b at rate 1/2; b → a at rate 1 and b → c at rate 1; c → a at rate 1.)
Solution. πG = 0 gives πb + πc = (1/2)πa; (1/2)πa = 2πb; πb = πc. One equation is redundant, and we get π = (4πc, πc, πc). Using ∑ πi = 1 gives π = (4/6, 1/6, 1/6). ♢
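For a finite chain the equation πG = 0, together with ∑ πi = 1, is a linear system that a computer can solve directly. Here is a sketch (Python with numpy; my own code, not part of the notes) for the three-state generator of Example 5.15, replacing one redundant balance equation by the normalisation condition.

import numpy as np

# generator of the three-state chain in Example 5.15
G = np.array([[-0.5, 0.5, 0.0],
              [1.0, -2.0, 1.0],
              [1.0, 0.0, -1.0]])

# solve pi G = 0 together with sum(pi) = 1:
# keep two columns of G (the third balance equation is redundant) plus the normalisation row
A = np.vstack([G.T[:-1], np.ones(3)])
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)
print(pi)   # should be close to (4/6, 1/6, 1/6)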
We omit the proofs of the remaining results in this section. They can generally be deduced
from their discrete-time counterparts – some such deductions are straightforward, others are
a bit harder.
Theorem 5.16. For an irreducible, recurrent continuous-time Markov chain, the equation xG = 0 has at most one positive finite solution up to multiplication by a constant.
Define the mean return time to state i:
mi = Ei Ri
where Ri = inf{t > T1 : X(t) = i},
where T1 is the first jump time. (So, we measure the time to first leave i and then return to
it.) A recurrent state i is called positive if mi < ∞ and null if mi = ∞.
Theorem 5.17. Positive and null recurrence are class properties.
Theorem 5.18. An irreducible continuous-time Markov chain is positive recurrent if
and only if it is non-explosive and has a stationary distribution. Moreover, in that case
the stationary distribution π is given by
πi = 1 / (mi gi)
(where gi = −gi,i is the jump rate out of i and mi is the mean return time of i).
The formula πi = 1/(mi gi ) makes sense because it takes time mi to return to i, of which
(in expectation) time 1/gi was spent in i before leaving, so the average proportion of time in
i is (1/gi )/mi .
Note that, unlike the discrete case, existence of a stationary distribution is not enough to
deduce positive recurrence - we need to check non-explosion as well. See the problem sheet
for an example that illustrates this. Here is the continuous-time version of the convergence
theorem and ergodic theorem.
Theorem 5.19. For an irreducible, positive recurrent continuous-time Markov chain X
with stationary distribution π and any initial distribution, as t → ∞,
P(X(t) = j) → πj,   and   (1/t) ∫_0^t I_{X(s)=j} ds → πj.
In the second convergence statement, the left side is the proportion of time spent at j up to
time t.
Example 5.20. In Example 5.15, find (a) the expected return time to a, starting from a; (b)
the expected hitting time of a, starting from b; (c) limt→∞ pa,b (t).
Solution. (a) ma = 1/(πa ga ) = 1/(4/6 · 1/2) = 3. (b) From a, the first jump is to b, in
expected time 1/(1/2) = 2, so the expected time to hit a from b is 3 − 2 = 1. (c) The chain
is non-explosive since it is finite, and has a stationary distribution, so it is positive recurrent.
Hence limt→∞ pa,b (t) = πb = 1/6.
♢
5.6 Reversibility and queueing theory
The theory of reversibility in continuous time closely mirrors the discrete case. A main
application is to the theory of queues, one of the central topics of applied probability.
Theorem 5.21. Let G be a generator matrix. If x = (xi : i ∈ S) satisfies the detailed
balance equations
xi gi,j = xj gj,i
i, j ∈ S
then xG = 0.
Proof. See problem sheet.
Theorem 5.22. If X is an irreducible positive recurrent continuous-time Markov chain
whose stationary distribution π and generator G satisfy the detailed balance equations,
then, if the initial distribution is π, then for any fixed u > 0, the process (X(u − t))t∈[0,u]
has the same distribution as (X(t−))t∈[0,u] .
(The meaning of X(t−) is lims↑t X(s) – the value immediately before time t. We need this
because of the technical convention (19) that at a jump time the chain takes the new value
– this gets reversed when we reverse time. There’s no need to worry about any of this!)
In the above situation, we say that the Markov chain is reversible. As in the discrete
case, a Markov chain with state space {0, 1, . . .} or {0, 1, . . . , n} is a birth-death chain if
gi,j = 0 for all i, j ∈ S with |i − j| ≥ 2. It is easy to check that a birth-death chain is
reversible provided it is positive recurrent.
Example 5.23. M/M/1 queue. Customers arrive as a Poisson process of rate λ, and
require independent Exponential(µ) service times. There is one server, and if they are busy
then customers wait to be served in a queue, which is allowed to grow arbitrarily long.
This is the canonical queue model. Let X(t) be the total number of customers present
(waiting or being served) at time t. This is a continuous-time Markov chain with transition
diagram
(Transition diagram: n → n+1 at rate λ for every n ≥ 0, and n → n−1 at rate µ for every n ≥ 1.)
To analyse it, first note that the jump chain is a random walk with a reflecting barrier:
(Jump chain diagram: from 0 the walk moves to 1 with probability 1; from each n ≥ 1 it moves to n+1 with probability λ/(λ+µ) and to n−1 with probability µ/(λ+µ).)
This is recurrent if and only if λ/(λ + µ) ≤ 1/2, i.e. λ ≤ µ. (The reflecting barrier is slightly
different from usual: the transition probability from 0 to 1 is 1 rather than λ/(λ + µ). But
we can see that this does not affect P1(hit 0), and from this we can deduce that the criteria for recurrence and transience are also not affected.)
This conclusion makes sense from a queueing perspective. If λ > µ then the arrival rate λ
exceeds the maximum possible service rate µ, so in the long term the server will not be able
to cope with demand, and the queue length will go to infinity. If λ ≤ µ then the server has
sufficient capacity, and the server will have idle periods (times when X(t) = 0) indefinitely.
Let λ ≤ µ. To determine positive or null recurrence we look for stationary distributions.
The chain is irreducible, and it is non-explosive because the rates are bounded, or alternatively
because the jump chain is recurrent. Moreover it is a birth-death chain, so we can use the
detailed balance equations:
πi λ = πi+1 µ,
i ≥ 0.
The general solution is
πi = π0 (λ/µ)^i,   i ≥ 0.
If λ = µ then ∑_i (λ/µ)^i = ∞, so there is no stationary distribution, hence the chain is null
recurrent. In terms of the queue, the server just has enough capacity to handle the load, so
the queue length does not go to infinity, but idle periods are very rare – the expected return
time to 0 is infinite.
If λ < µ then ∑_i (λ/µ)^i = 1/(1 − λ/µ), so there is a stationary distribution given by
πi = (1 − λ/µ)(λ/µ)^i,   i ≥ 0.
Here the maximum possible service rate µ exceeds the arrival rate λ, so the queue length has
an equilibrium distribution, and the server is idle for a positive fraction π0 of the time. In
fact π0 = 1 − λ/µ.
We can calculate many more things:
Example 5.24. For an M/M/1 queue in the positive recurrent regime λ < µ, in equilibrium,
find
(a) the mean number of customers present;
(b) the mean duration of a typical idle period;
(c) the mean duration of a typical busy period;
(d) the mean waiting time (queueing plus service) for a typical customer.
Solution. “In equilibrium” means that we start in the stationary distribution π.
(a) The stationary distribution is a Geometric distribution with parameter 1−λ/µ shifted
by 1. So the mean is 1/(1 − λ/µ) − 1 = λ/(µ − λ).
(b) An idle period is the time from when the server becomes idle (when X hits 0) until
the first jump out of 0. This has mean 1/g0 = 1/λ.
(c) A busy period means an interval of time during which the server is working, i.e.
X(t) > 0. The mean return time of 0 is 1/(π0 g0 ). This comprises an idle period followed by a
busy period. So the mean duration of a typical busy period is the difference 1/(π0 g0 )−1/g0 =
µ/[λ(µ − λ)] − 1/λ = 1/(µ − λ).
(d) Suppose that a new customer arrives when there are Q customers already present.
Then they must wait for Q + 1 service times – those of customers already present plus their
own. Since the service times are i.i.d. Exp(µ) (and using the memoryless property for the
customer currently being served, if any), the new customer’s waiting time W satisfies
E(W | Q = i) = (i + 1)/µ,
so
E(W | Q) = (Q + 1)/µ.
Since the queue is in equilibrium prior to the customer’s arrival, using (a),
EW = (EQ + 1)/µ = (1/µ)(λ/(µ − λ) + 1) = 1/(µ − λ).
♢
(Strictly speaking, the last step needs some more justification, which we won’t go into – it is
an example of something called the PASTA principle – Poisson Arrivals See Time Averages.)
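These equilibrium formulas are easy to check against a long simulated run of the chain, using the ergodic theorem (Theorem 5.19). The sketch below (illustrative parameter values; my own code, not part of the notes) simulates the M/M/1 queue-length chain and compares the long-run average number present with λ/(µ − λ) and the idle fraction with π0 = 1 − λ/µ.

import numpy as np

rng = np.random.default_rng(7)
lam, mu, t_end = 1.0, 1.5, 50000.0

t, x = 0.0, 0
time_weighted_sum, idle_time = 0.0, 0.0
while t < t_end:
    rate = lam + (mu if x > 0 else 0.0)          # total jump rate out of state x
    hold = rng.exponential(1 / rate)             # holding time in state x
    time_weighted_sum += x * hold
    if x == 0:
        idle_time += hold
    t += hold
    x += 1 if rng.random() < lam / rate else -1  # arrival or departure

print("mean number present:", time_weighted_sum / t, " vs lam/(mu-lam) =", lam / (mu - lam))
print("fraction of time idle:", idle_time / t, " vs 1 - lam/mu =", 1 - lam / mu)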
Example 5.25. (M/M/s queue) Now suppose that customers arrive as a Poisson process
of rate λ and require Exponential service times with parameter µ, and there are s servers.
When all s servers are busy, customers wait in the queue. When there are s or fewer customers, each is served simultaneously by a different server, and any remaining servers are
idle. When is the chain positive recurrent?
Solution. The number of customers present is a Markov chain with rates
gi,i+1 = λ for i ≥ 0,   and   gi,i−1 = µ min(i, s) for i ≥ 1.
The chain is non-explosive since the rates are bounded. It is a birth-death chain, so we can
solve the detailed balance equations:
π 0 λ = π1 µ
π1 λ = π2 2µ
π2 λ = π3 3µ
···
πs−1 λ = πs sµ
πs λ = πs+1 sµ
πs+1 λ = πs+2 sµ
···
to give
πi = π0 (λ/µ)^i / i!   for i ≤ s,   and   πi = π0 (λ/µ)^i / (s! s^{i−s})   for i > s.
The chain is positive recurrent if this can be normalised, i.e. if
∑_{i=0}^s (λ/µ)^i / i!  +  ∑_{i=s+1}^∞ (λ/µ)^i / (s! s^{i−s})  <  ∞,
or equivalently when
∑_{i=s+1}^∞ (λ/µ)^i / (s! s^{i−s})  =  (s^s / s!) ∑_{i=s+1}^∞ (λ/(sµ))^i  <  ∞.
That is, when λ/(sµ) < 1, i.e. λ < sµ. (This makes sense: it is when the service capacity
exceeds the arrival rate).
♢
Example 5.26. (M/M/∞ queue) Suppose that now there are an unlimited number of
servers, and everything else is as in the last example. Then gi,i+1 = λ for all i ≥ 0, and
gi,i−1 = µi for all i ≥ 1. The rates are not bounded, but we can conclude that the chain is
non-explosive as follows. Since the arrival process is a homogeneous Poisson process, after
any finite time, the number of arrivals is finite, therefore the number of jumps made by the
chain is also finite.
The detailed balance equations are
πi−1 λ = πi iµ,   i ≥ 1,
which give
πi = π0 (λ/µ)^i / i!,   i ≥ 0.
We have
∑_{i≥0} (λ/µ)^i / i! = e^{λ/µ} < ∞,
so the chain is positive recurrent for all λ and µ, and π0 = e−λ/µ , so the stationary distribution
is Poisson with parameter λ/µ.
Output of a queue. Suppose you see only the stream of customers coming out of the
shop. What do the times of their departures look like as a stochastic process? Remarkably,
reversibility can give us the answer.
Consider an M/M/1 or M/M/s or M/M/∞ queue in the positive recurrent regime, in
stationarity. Since the detailed balance equations are satisfied, the process is reversible. The
process of departure times is the same as the process of times when the number of customers
X(t) decreases by 1. By reversibility, this has the same distribution as the process of times
when X(t) increases by 1. But this is simply the arrivals process. So we have established:
Theorem 5.27. For a positive recurrent stationary M/M/s queue (where s ≥ 1 or
s = ∞) in equilibrium, the process of departure times is a Poisson process of rate λ.
Like some earlier results involving reversibility, this is quite surprising. One might expect
that departures would be more “clustered together”. There is a higher departure rate µ
during busy periods, and no departures during idle periods; but, remarkably, the random
switching between the two is such that the result is just a Poisson process.
6 Martingales
6.1 Introduction
Besides Markov chains, martingales are the other main class of stochastic processes. Whereas
Markov chains are “memoryless processes”, martingales are “fair gambling games”. The
unusual name “martingale” is an 18th-century term for a betting strategy – this in turn
appears to derive from a word for a kind of harness.
By way of motivation, consider a simple gambling game: a fair coin is tossed repeatedly.
Just before each toss, you are allowed to bet any amount y of your choice that the next toss
will be Heads. If the next toss is Heads you win y; if it is Tails you lose y. We allow y to be
negative (in which case one would normally call it betting on Tails). For example, you might
choose to:
1. Bet £10 on the first toss and then stop. Then your expected profit is (1/2)(10) + (1/2)(−10) = 0.
2. Bet £1 on each of the first 10 tosses. It is clear that the distribution of your profit at
the end is symmetric around 0, so your expected profit is again 0.
Is there any reasonable betting strategy that will give a positive expected profit? Well,
we might hope to try:
3. Bet £1 on the first toss that is Heads(!) That certainly guarantees a profit of £1, but
it is not really a reasonable strategy. We have to place the bet before the toss (good luck
finding a casino that lets you place it afterwards), so the only way to do this is if we can see
into the future!
However, what about:
4. Bet £1 on every toss until the first time your profit is 1, and then stop. Perhaps
surprisingly, this works. Your profit is a simple symmetric random walk. We know that it
hits 1 almost surely, so you will definitely walk away with £1! Two disadvantages are: (a)
it may take a very long time – we know that the expected time to hit 1 is infinite; (b) you
need to be infinitely rich (or have an unlimited line of credit!) – the random walk might get
arbitrarily negative before it hits 1.
Another approach:
5. Bet 1 on H, then 2 on H, then 2², then 2³, etc. until the first time you win a bet. This
also works – we know there will be a H eventually, and if it happens on the N th toss then
your profit is
−1 − 2 − 2² − 2³ − · · · − 2^{N−1} + 2^N = 1,
so we are guaranteed a £1 profit again. This seems like a much quicker way to get rich:
N has Geometric distribution, so EN = 2. But the problem again is that you need to be
infinitely rich to start with. If not, things could go very badly. Just before winning, you’ll
be 2N − 1 in debt!
In fact we’ll see that for any strategy that doesn’t involve seeing into the future and
also keeps your wealth above some constant −c, your expected profit cannot be positive.
This turns out to be a very powerful fact with implications far beyond gambling. We’ll
see some nice applications to random walks, and a simple elegant solution to the HTH and
ABRACADABRA problems.
Here is the main definition. Let X = (Xn )n≥0 be a discrete-time stochastic process, where
each Xn is real-valued. We say that X is a martingale if for each n:
(i) Xn has finite mean, and
(ii) E(Xn+1 | X0 , X1 , . . . , Xn ) = Xn .
Let’s pause to understand the meaning of condition (ii). Recall the definition of conditional expectation: for two random variables X and Y , E(X | Y = y) is simply the mean
of the conditional distribution of X given that Y = y. This is necessarily a function of y,
say E(X | Y = y) = g(y). Then we define E(X | Y ) = g(Y ). We are used to Y being a
real-valued random variable, but in fact it could be any random variable. In (ii) above it
is the random vector (X0 , X1 , . . . , Xn ). So E(Xn+1 | X0 , X1 , . . . , Xn ) = g(X0 , X1 , . . . , Xn )
where the function g is given by g(x0 , . . . , xn ) = E(Xn+1 | (X0 , . . . , Xn ) = (x0 , . . . , xn )). We
don’t want to write all that out every time, which is why the conditional expectation notation
is so useful.
More importantly, how should we interpret the definition? The idea is that Xn represents your wealth at time n in any fair gambling game. (i) is just a technicality to make sure
all expectations make sense. (ii) says that given complete knowledge of everything that has
happened to your wealth up till now, X0 , . . . , Xn , the best guess (i.e. mean) for your wealth
Xn+1 after one more step is your current wealth Xn .
To make use of the concept we need a few basic properties of conditional expectation.
Theorem 6.1. For random variables X, Y, Z and constants a, b,
(i) E(aX + bY | Z) = aE(X | Z) + bE(Y | Z);
(ii) E(g(Y )X | Y ) = g(Y )E(X | Y ) for any function g;
(iii) E(X | Y ) = E(X) if X and Y are independent.
It is important to appreciate the difference between (ii) and (iii). If we know Y , then we
know g(Y), so it behaves like a deterministic constant and comes out of the expectation. On the other hand, if X and Y are independent then knowing Y tells us nothing about X,
so the conditional expectation just becomes an ordinary expectation.
Proof. These follow from straightforward manipulations, first showing the corresponding
statements involving conditioning on events. For instance, for (ii) we show that E(g(Y )X |
Y = y) = g(y)E(X | Y = y). We omit the details.
Recall also the tower property or double expectation formula:
Theorem 6.2. E(E(X | Y )) = E(X).
Example 6.3. Let (Xi)i≥0 be i.i.d. taking values +1 and −1 each with probability 1/2, and let Sn = ∑_{i=1}^n Xi be the simple symmetric random walk. Show that (Sn) is a martingale.
Solution. First note that EXi = 0, so ESn = 0 which is finite. Now,
E(Sn+1 | S0 , . . . , Sn ) = E(Sn + Xn+1 | S0 , . . . , Sn )
= E(Sn | S0 , . . . , Sn ) + E(Xn+1 | S0 , . . . , Sn )
= Sn + E(Xn+1 )
= Sn ,
because Sn is a function of S0 , . . . , Sn while Xn+1 is independent of S0 , . . . , Sn .
♢
Our first property of martingales is natural given the fair game interpretation, although
not all that useful on its own.
Theorem 6.4. If (Xn ) is a martingale then EXn = EX0 for every n.
Proof. Using the tower property,
EXn+1 = E(E(Xn+1 | X0 , . . . , Xn )) = EXn .
Now use induction.
Sometimes the following slightly more general definition of a martingale is useful. Let
(Xn )n≥0 and (Yn )n≥0 be stochastic processes. Then we say that (Xn ) is a martingale with
respect to (Yn ) if for each n:
1. Xn has finite mean, and
2. E(Xn+1 | Y0 , . . . , Yn ) = Xn .
We can imagine that Yn represents all new information learned at time n, for instance the
state of the stock market, and so (Y0 , . . . , Yn ) represents all information known up to time
n, while Xn is the quantity we are actually interested in, perhaps the value of your stock
portfolio. It follows
from condition 2 that Xn must be a function of Y0 , . . . , Yn .
Let Sn = ∑_{i=1}^n Xi be the simple symmetric random walk started at 0 as usual. Also take X0 = 0, since the definitions assume that both processes are indexed from 0.
Example 6.5. Show that (Sn ) is a martingale with respect to (Xn ).
Solution. Sn is a function of X0 , . . . Xn and has finite mean, and
E(Sn+1 | X0 , . . . , Xn ) = E(Sn | X0 , . . . , Xn ) + E(Xn+1 | X0 , . . . , Xn )
= Sn + E(Xn+1 )
= Sn .
♢
Example 6.6. Is (Sn²) a martingale with respect to (Xn)?
Solution. Since −n ≤ Sn ≤ n, it follows that Sn² has finite mean. But
E(S_{n+1}² | X0, . . . , Xn) = E((Sn + Xn+1)² | X0, . . . , Xn)
= E(Sn² + 2 Sn Xn+1 + X_{n+1}² | X0, . . . , Xn)
= Sn² + 2 Sn E(Xn+1) + E(X_{n+1}²)
= Sn² + 2 Sn · 0 + 1 = Sn² + 1.
So (Sn²) is not a martingale.
♢
However, it is very close: we can fix it by adding a deterministic function:
Example 6.7. Show that (Sn² − n) is a martingale with respect to (Xn).
Solution. Let Mn = Sn² − n. From above, Mn has finite mean, and
E(Mn+1 | X0, . . . , Xn) = E(S_{n+1}² − n − 1 | X0, . . . , Xn)
= Sn² + 1 − n − 1 = Sn² − n = Mn.
♢
6.2 New martingales from old
Let (Xn ) be a martingale with respect to (Yn ), and think of Xn as representing your wealth
at time n in a fair gambling game – perhaps betting 1 unit repeatedly on a sequence of fair
coin tosses. Now suppose you are allowed to vary your betting strategy. You can bet any
amount Hn you want to on the outcome of game n. For this to be sensible, we should insist
that you cannot see into the future, so you must base your decision on what has happened
before game n.
To formalize this, we say that a stochastic process (Hn )n≥1 is predictable (sometimes
called previsible) with respect to (Yn )n≥0 if for each n, Hn is a function of Y0 , . . . , Yn−1 . In
other words, if Y0 , . . . , Yn represents the information known at time n, then Hn is known
strictly before time n.
For stochastic processes X = (Xn)n≥0 and H = (Hn)n≥1 we define the process H · X = ((H · X)n)n≥0 by
(H · X)n = ∑_{i=1}^n Hi (Xi − Xi−1).
This is precisely the betting process referred to above. If Xn is the wealth after n games
of a player who bets $1 on each game, then Xi − Xi−1 is the profit from betting $1 on game
i, so Hi (Xi − Xi−1 ) is the profit from betting $Hi on game i, so (H · X)n is the wealth after
n games of a player who starts with 0 and bets Hi on game i for each i.
Example 6.8. Let Z1, Z2, . . . be i.i.d. taking values +1 and −1 with equal probability, and let Sn = ∑_{i=1}^n Zi be the simple symmetric random walk. Let
N = min{n ≥ 1 : Zn = +1},
and let
Hn = 2^{n−1} I_{n≤N},   n ≥ 1.
In other words, we double our bet (on Heads, say) at each step until the first win, and then
stop.
Since {n ≤ N } = {N < n}C = {Z1 = · · · = Zn−1 = −1}, the process (Hn ) is predictable
with respect to (Zn ).
We have
(H · S)n = ∑_{i=1}^n Hi Zi = −1 − 2 − 2² − · · · − 2^{n−1} = −2^n + 1   if n < N,
and
(H · S)n = −1 − 2 − 2² − · · · − 2^{N−2} + 2^{N−1} + 0 + 0 + · · · = 1   if n ≥ N.
So using this betting strategy we eventually end up with a profit of 1, although we might
go a long way down first.
The next result says that betting strategies cannot “beat the system” - a fair game remains
a fair game. The traditional but slightly confusing name “optional sampling” conveys the
idea that you are allowed to choose when to bet on, or “sample” from X.
Theorem 6.9 (Optional sampling). Let X be a martingale with respect to Y and let H
be predictable with respect to Y . Then H · X is a martingale with respect to Y provided
(H · X)n has finite expectation.
Example 6.10. In the last example, note that P(N > n) = (1/2)^n, so
E(H · S)n = (−2^n + 1)(1/2)^n + (1)(1 − (1/2)^n) = 0 = E(H · S)0,
consistent with the theorem.
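A quick simulation of the doubling strategy stopped after n games illustrates this (a sketch with illustrative parameters, not from the notes): almost every run ends with a profit of 1, but the rare runs of n consecutive Tails lose 2^n − 1, and the average works out to 0.

import numpy as np

rng = np.random.default_rng(8)
n, trials = 6, 200000

profits = []
for _ in range(trials):
    wealth, stake = 0, 1
    for _ in range(n):
        if rng.random() < 0.5:     # Heads: win the current stake and stop betting
            wealth += stake
            break
        wealth -= stake            # Tails: lose the stake and double it
        stake *= 2
    profits.append(wealth)

profits = np.array(profits)
print("estimate of E(H.S)_n:", profits.mean(), "(theory: 0)")
print("estimate of P(profit = 1):", (profits == 1).mean(), "(theory:", 1 - 0.5 ** n, ")")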
Proof of Theorem 6.9. Since H is predictable, (H · X)n is a function of Y0 , . . . , Yn , as are
Hn+1 and Xn . So
E((H · X)n+1 | Y0, . . . , Yn) = E((H · X)n + Hn+1 (Xn+1 − Xn) | Y0, . . . , Yn)
= (H · X)n + Hn+1 E(Xn+1 − Xn | Y0 , . . . , Yn )
= (H · X)n + Hn+1 (E(Xn+1 | Y0 , . . . , Yn ) − Xn )
= (H · X)n + Hn+1 (Xn − Xn )
= (H · X)n .
6.3 Stopped martingales
One example of a “betting strategy” is simply to stop playing the game at some point. We
can decide when that is based on what has happened so far. In other words, we stop at a
stopping time.
Recall that a random variable T taking values in N ∪ {∞} is a stopping time with
respect to (Yn )n≥0 if we can determine whether or not T ≤ n from Y0 , . . . , Yn . We use the
following notation for the minimum:
a ∧ b = min(a, b).
If X = (Xn )n≥0 is a stochastic process and T is a random time, then the process
(Xn∧T )n≥0
is X stopped at T .
(Illustration: a sample path of (Xn) plotted against n, and next to it the corresponding path of the stopped process (Xn∧T), which follows (Xn) up to time T and is constant thereafter.)
Given the gambling interpretation, the following should not be a surprise. A gambler can
walk away whenever he wants.
Theorem 6.11. If X is a martingale with respect to Y and T is a stopping time with
respect to Y then (Xn∧T )n≥0 is a martingale with respect to Y .
Proof. We bet 1 unit on each game until T , and then stop. In other words, let Hn = IT ≥n .
Then {Hn = 1} = {T ≥ n} = {T ≤ n − 1}C , which can be determined from Y0 , . . . , Yn−1 . So
H is predictable. Hence H · X is a martingale, by Theorem 6.9. But
(H · X)n = ∑_{i=1}^n Hi (Xi − Xi−1) = ∑_{i=1}^{n∧T} 1 · (Xi − Xi−1) = Xn∧T − X0.
So Xn∧T = (H · X)n + X0 . Now the constant sequence (X0 )n≥0 is trivially a martingale,
and the sum of two martingales is a martingale (see the problem sheet), so (Xn∧T)n≥0 is a
martingale.
Using Theorem 6.4 we can immediately deduce
Theorem 6.12. Under the conditions of the last theorem, EXn∧T = EX0 .
Similarly to our earlier investigations, a natural and useful question is: when is it the case that EXT = EX0 for a random time T? Here is an answer.
Theorem 6.13 (Optional stopping). Let X be a martingale and T a stopping time with
respect to Y . Any of the following conditions ensures that EXT = EX0 .
(i) T is bounded (i.e. T < c a.s. for some constant c < ∞);
(ii) T < ∞ a.s. and X is bounded (i.e. |Xn | ≤ c for all n and some fixed c < ∞);
(iii) ET < ∞ and X has bounded increments (i.e. |Xn+1 − Xn | < c for all n and some
fixed c < ∞).
Proof. (partial) (i) We have EXn∧T = EX0 . We have T < c, so T ∧ c = T . So taking n = c
gives EXT = EXT ∧c = EX0 .
For (ii) and (iii) it can be shown using tools beyond the scope of this course (dominated
convergence) that EXn∧T → EXT as n → ∞. The result then follows because EXn∧T =
EX0 .
Example 6.14. Let S = (Sn ) be the simple symmetric random walk started at 0, and let T
be the hitting time of 1. S is a martingale and T is a stopping time, but
1 = EST ̸= ES0 = 0.
Since S has bounded increments (bounded by 1), but the conclusion of the optional stopping
theorem fails, we must have ET = ∞ (as we found before).
Example 6.15. (Gambler’s ruin yet again) Let S = (Sn ) be the simple symmetric random
walk started at k and let T be the first hitting time of {0, m}, where 0 < k < m. The
martingale S has bounded increments, and by the disjoint blocks argument we know that the
stopping time T has finite mean. Therefore by part (iii) of the optional stopping theorem,
EST = ES0 , i.e.
hm + (1 − h)0 = k,
where h = P(S hits m before 0). Hence
h = k/m
(as we found before).
Example 6.16. (Gambler’s ruin duration) In the set-up of the last example, let Mn =
Sn2 − n. As we have seen, this is a martingale, but it does not have bounded increments, so
we cannot apply optional stopping directly. However, the stopped martingale (Mn∧T )n≥0 does
have bounded increments, because Sn∧T is bounded between 0 and m, while n has increments
of 1. Therefore, we can apply optional stopping to (Mn∧T ) to give EMT ∧T = EM0∧T . But
EM_{0∧T} = EM0 = k² − 0 = k², and S_T² is m² with probability h and 0² with probability 1 − h, so
EM_{T∧T} = EMT = E(S_T² − T) = hm² + (1 − h)0² − ET.
We conclude that hm² − ET = k², so
ET = hm² − k² = km − k² = k(m − k)
(as we found before).
The method used above applies in general: if M is a martingale and T is a stopping time
then we can apply the optional stopping theorem to the stopped martingale (Mn∧T ) rather
than to M (provided one of the conditions (i)–(iii) holds for the stopped martingale).
The conclusion EM0 = EMT is the same, since M0∧T = M0 and MT ∧T = MT .
Finally, we can give the promised elegant solution to this question:
Example 6.17. A monkey types random letters on a keyboard. Assume that each press is
a uniformly random choice from A, . . . , Z, and different presses are independent. Find the
expected time until it types ABRACADABRA.
Solution. Suppose that just before each key press, a new gambler enters the casino. The
gambler arrives with £1, and bets that the next letter will be A. If she loses the bet then she
forfeits her stake and leaves. If she wins, her wealth increases to £26. In that case she then
bets her entire wealth that the next letter will be B, and either loses her stake and leaves, or
wins and increases her wealth to £26². In that case she stakes her entire wealth on R, and
so on. If the gambler is lucky enough to win all 11 bets until ABRACADABRA appears, she
retires with the extremely comfortable sum of £26^11.
Let Mn be the total profit of the casino after n key presses. Because all the bets are fair,
M is a martingale. Let T be the number of presses until the end of the first appearance of
ABRACADABRA, which is a stopping time. By dividing into disjoint blocks we know that
ET < ∞. And M has bounded increments, because at any given time there are at most 11
gamblers in the casino, and the total of all their bets is at most 1 + 26 + 262 + · · · + 2611 .
Hence we can apply optional stopping to deduce
EMT = EM0 .
We have M0 = 0 so EM0 = 0. What is MT? At time T there have been T presses, so T gamblers have entered and each paid in £1, so from this the casino has gained T. But ABRACADABRA has just come up, so the casino has just paid out 26^11 to one gambler. In addition, another gambler has just won 4 bets in a row because of the last ABRA, so has been paid 26^4, and the last gambler to enter has just won 1 bet because of the final A, so has been paid 26. Hence
MT = T − 26^11 − 26^4 − 26^1,
so
0 = EM0 = EMT = ET − 26^11 − 26^4 − 26^1.
Therefore
ET = 26^11 + 26^4 + 26^1.
♢
We can see from this solution that what matters is possible overlaps between different
occurrences of ABRACADABRA. One way to think of it is that each possible overlap increases the degree of clustering of occurrences, which makes the mean waiting time longer. It
is interesting to compare this with the problem sheet questions about arrival times of buses
and taxis. Equally spaced arrivals (buses) are more regular and less clustered than Poisson
arrivals (taxis) of the same rate, resulting in a shorter mean waiting time.
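The full ABRACADABRA answer, 26^11 + 26^4 + 26, is far too large to check by simulation, but the same martingale argument applies to any pattern over any alphabet. For instance, for the pattern ABA over the two-letter alphabet {A, B}, the overlaps are ABA itself and the single letter A, so the predicted mean waiting time is 2^3 + 2^1 = 10. Here is a small Python sketch (not part of the notes) estimating this:

```python
import random

def waiting_time(pattern, alphabet):
    """Number of i.i.d. uniform key presses until `pattern` first appears."""
    window, n = "", 0
    while True:
        window = (window + random.choice(alphabet))[-len(pattern):]
        n += 1
        if window == pattern:
            return n

if __name__ == "__main__":
    trials = 100000
    mean = sum(waiting_time("ABA", "AB") for _ in range(trials)) / trials
    # Martingale prediction: ET = 2^3 + 2^1 = 10 (overlaps ABA and A).
    print(f"estimated ET = {mean:.2f}, theory = {2**3 + 2**1}")
```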
6.4 Martingale convergence (not required for the course)
We give a very brief look at this important topic.
Theorem 6.18. Let X = (Xn ) be a non-negative martingale (i.e. satisfying Xn ≥ 0 for
all n). Then there is a random variable X such that
Xn → X
almost surely as n → ∞.
(The non-negativity assumption of Theorem 6.18 can be weakened in various ways.)
Proof. (sketch) We do not give the full proof, but we mention the main idea, which is
beautiful. If Xn is bounded but does not converge, then it must oscillate forever.
Imagine Xn is a stock price. Suppose there are two fixed values a < b for which Xn
oscillates between being < a and > b forever. Then there would be a guaranteed way to
make money with limited risk. Every time Xn drops below a we buy the stock, and every
time it goes above b we sell it. This guarantees a profit of at least b − a for each completed oscillation, but
because the martingale is non-negative our liability is limited. We can formalize all this
by choosing a suitable predictable H so that our wealth is H · X. It can be shown that
this contradicts the martingale property for H · X.
Example 6.19. (Polya urn) An urn initially contains one red and one blue ball. At each
step, a ball is drawn from the urn at random, and replaced together with another new ball
of the same colour.
Let Rn be the number of red balls in the urn after n steps (when there are n + 2 balls in
total). So R0 = 1, and (Rn ) is an inhomogeneous Markov chain; (Rn , n) is a homogeneous
Markov chain. We have
P(Rn+1 = j | R0 = i0, . . . , Rn = in) = P(Rn+1 = j | Rn = in).
Let Xn = Rn/(n + 2), the proportion of red balls. How does this behave as n → ∞? We have
P(Rn+1 = r + 1 | Rn = r) = P(red chosen at step n + 1) = r/(n + 2),
P(Rn+1 = r | Rn = r) = P(blue chosen at step n + 1) = 1 − r/(n + 2).
Therefore
E(Rn+1 | Rn = r) = (r + 1) · r/(n + 2) + r · (1 − r/(n + 2)) = r · (n + 3)/(n + 2),
so
E(Rn+1 | R0, . . . , Rn) = Rn · (n + 3)/(n + 2),
and therefore
E(Xn+1 | R0, . . . , Rn) = (1/(n + 3)) E(Rn+1 | R0, . . . , Rn) = (1/(n + 3)) · Rn · (n + 3)/(n + 2) = Rn/(n + 2) = Xn.
Thus (Xn ) is a martingale with respect to (Rn ).
Note that 0 ≤ Xn ≤ 1 for all n. So by the convergence theorem,
Xn → X a.s.    (20)
for some random variable X.
What is the distribution of X? Let
a_{r,n} = P(Rn = r)
(which is the probability that the Markov chain (Rn, n) hits (r, n)). Then, by conditioning on the colour drawn at step n + 1, we get a difference equation:
a_{r,n+1} = a_{r,n} (1 − r/(n + 2)) + a_{r−1,n} · (r − 1)/(n + 2).
Experimenting with small values, we can guess the solution
a_{r,n} = 1/(n + 1),    r = 1, . . . , n + 1,
which can then be proved by induction on n.
This means that Xn is uniformly random on the set
{ 1/(n + 2), 2/(n + 2), . . . , (n + 1)/(n + 2) }.
From this we can deduce that for any u ∈ [0, 1],
P(Xn < u) → u
as n → ∞.
In other words,
P(Xn < u) → P(U < u)
as n → ∞,
where U ∼ Unif[0, 1]. We say that Xn converges in distribution to U , sometimes
written Xn ⇒ U . (This is the same type of convergence that appears in the central limit
theorem). This is a much weaker statement than almost sure convergence. However, we
know that Xn → X a.s. for some X, and it can be shown that this implies Xn ⇒ X. It
follows that X has the same distribution as U , i.e. it is Uniform on [0, 1].
This is quite a striking result. In the long term, the proportion of red balls settles down
to a limit, but the limit is random, and in fact uniformly distributed. (We might imagine
that the limiting proportion is somehow chosen ahead of time, and gradually revealed
during the experiment, although of course that’s not what is really happening.)
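A short simulation illustrates the limiting behaviour: for large n, the value of Xn across many independent runs of the urn looks uniform on [0, 1]. The following minimal Python sketch (not part of the notes) uses n = 1000 steps as a stand-in for the limit.

```python
import random

def polya_proportion(steps=1000):
    """Run the Polya urn for `steps` draws and return the proportion of red balls Xn."""
    red, total = 1, 2          # one red and one blue ball initially
    for _ in range(steps):
        if random.random() < red / total:   # a red ball is drawn...
            red += 1                        # ...and a new red ball is added
        total += 1
    return red / total

if __name__ == "__main__":
    runs = 2000
    limits = [polya_proportion() for _ in range(runs)]
    # If the limit is Unif[0,1], about 20% of runs should land in each fifth of [0,1].
    for a in (0.0, 0.2, 0.4, 0.6, 0.8):
        frac = sum(a <= x < a + 0.2 for x in limits) / runs
        print(f"fraction of runs with X in [{a:.1f},{a + 0.2:.1f}) ≈ {frac:.3f}")
```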
There are several other notions of convergence of random variables. Understanding
how they relate to each other is a key topic in more advanced probability theory.
7 Brownian Motion
7.1 Introduction
Brownian motion (also called the Wiener process) is one of the most important and well-studied stochastic processes. Like the Poisson process it is just one process, rather than a
broad family like Markov chains or martingales. Unlike everything else we have looked at so
far, it has continuous time and continuous space.
What does the simple symmetric random walk (Sn ) on Z look like if we run it for a large
number of steps? We can try plotting a typical trajectory, here for the first 10, 100, 1000, 10000
steps:
[Figure: plots of Sn against n for the first 10, 100, 1000 and 10000 steps of the walk, with the vertical axis drawn on the same scale of order n as the horizontal axis.]
The results are a bit disappointing. But that's because we didn't scale the axes appropriately. In fact we could have predicted this. A single step Xi of the random walk has mean 0 and variance 1, so Sn = X1 + · · · + Xn has mean 0 and variance n, and therefore standard deviation √n. (In fact the central limit theorem tells us that it is approximately Normal with these parameters for large n.) So if both axes have scale n, the typical √n deviation of Sn from 0 will be too small to see when n is large. The correct thing to do is stretch the picture vertically to have scale of order √n:
[Figure: the same random walk trajectories re-plotted with the vertical axis running from −√n to √n, for n = 10, 100, 1000, 10000.]
That looks much more interesting! Note that the picture looks similar in character for
n = 1000 and n = 10000. The idea is that Brownian motion is the limit of these pictures as
n → ∞. It’s not so clear what a limit of a random function means, nor why it should exist.
These issues can be resolved, but that is beyond the scope of this course. Instead we will
define Brownian motion via its properties, as we have done with other processes.
We already remarked that the distribution of Sn is approximately Normal, by the central
limit theorem. So it’s no surprise that the Normal distribution will feature in the definition.
Recall that a standard Normal distribution has probability density function
ϕ(z) = (1/√(2π)) e^{−z^2/2},    −∞ < z < ∞,
and has mean 0 and variance 1. If Z has standard Normal distribution, denoted Z ∼ N(0, 1), then X = σZ + µ has Normal distribution with mean µ and variance σ^2, denoted X ∼ N(µ, σ^2). (It's important to remember that the second parameter σ^2 is the variance, not the standard deviation.) More generally, if X ∼ N(µ, σ^2) then aX + b ∼ N(aµ + b, a^2 σ^2). Recall also that if X1 and X2 are independent with Xi ∼ N(µi, σi^2), then X1 + X2 ∼ N(µ1 + µ2, σ1^2 + σ2^2).
Now we can give the main definition. Consider a stochastic process B = (B(t))t≥0 , where
t takes non-negative real values and each B(t) is a real-valued random variable. We say that
B is a (standard) Brownian motion if:
1. B(0) = 0, and B is a continuous function almost surely;
2. B has independent increments (over disjoint intervals);
3. B(s + t) − B(s) ∼ N (0, t) for all s ≥ 0 and t > 0.
As before, “independent increments” means that for any 0 ≤ t0 ≤ t1 ≤ · · · ≤ tn , the
random variables
B(t1) − B(t0), B(t2) − B(t1), . . . , B(tn) − B(tn−1)
are independent. Sometimes it is convenient to define a N (0, 0) random variable to be
deterministically equal to 0. Then property 3 is true for t = 0 as well.
Brownian motion typically looks something like this:
[Figure: a typical Brownian motion path B(t) for 0 ≤ t ≤ 1, fluctuating roughly between −1 and 1.]
(In fact this is just the last random walk picture from before with different axis labels, which
is a good approximation. If we zoom in far enough on this picture we’ll see discrete steps.
Genuine Brownian motion would continue to look rough and spiky no matter how far we
zoom in.)
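The defining properties also tell us how to simulate an approximation: chop [0, T] into small intervals of length dt and add up independent N(0, dt) increments. Here is a minimal Python sketch (not part of the notes; the step size is an arbitrary choice):

```python
import math
import random

def brownian_path(T=1.0, n=10000):
    """Approximate a standard Brownian motion on [0, T] by summing
    independent N(0, dt) increments, dt = T/n (properties 1-3 of the definition)."""
    dt = T / n
    times, values = [0.0], [0.0]
    b = 0.0
    for i in range(1, n + 1):
        b += random.gauss(0.0, math.sqrt(dt))   # increment over [(i-1)dt, i*dt]
        times.append(i * dt)
        values.append(b)
    return times, values

if __name__ == "__main__":
    t, b = brownian_path()
    print(f"B(1) ≈ {b[-1]:.3f}   (a sample from roughly N(0, 1))")
```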
Example 7.1. For a standard Brownian motion B, compute Cov(B(s), B(t)).
Solution. Assume s ≤ t. Using independent increments,
E(B(s)B(t)) = E[B(s)((B(t) − B(s)) + B(s))] = E(B(s)) E(B(t) − B(s)) + E(B(s)^2) = 0 + s,
because B(s) ∼ N(0, s), so E(B(s)^2) = Var B(s) = s. So
Cov(B(s), B(t)) = E(B(s)B(t)) − EB(s)EB(t) = s − 0 = s.
Therefore, in general Cov(B(s), B(t)) = min(s, t).
♢
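The covariance formula Cov(B(s), B(t)) = min(s, t) is easy to confirm numerically, sampling B(s) and B(t) via two independent increments. A small Python sketch (not part of the notes; the values s = 0.3, t = 0.8 are arbitrary):

```python
import math
import random

def bm_at(times):
    """Sample a Brownian path at the given increasing times, using independent increments."""
    b, prev, values = 0.0, 0.0, []
    for t in times:
        b += random.gauss(0.0, math.sqrt(t - prev))   # increment ~ N(0, t - prev)
        prev = t
        values.append(b)
    return values

if __name__ == "__main__":
    s, t, trials = 0.3, 0.8, 100000
    mean_prod = sum(bs * bt for bs, bt in (bm_at([s, t]) for _ in range(trials))) / trials
    # E B(s) = E B(t) = 0, so the covariance is just E[B(s)B(t)].
    print(f"estimated Cov(B(s), B(t)) = {mean_prod:.3f},  min(s, t) = {min(s, t)}")
```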
7.2 Symmetries
Many transformations of Brownian motion yield Brownian motion back again.
Theorem 7.2. Let B = (B(t))t≥0 be a standard Brownian motion. Then each of the
following is also a standard Brownian motion:
(i) (−B(t))t≥0 ;
(ii) (B(t + s) − B(s))t≥0 , for any constant s ≥ 0;
(iii) ((1/√c) B(ct))_{t≥0}, for any constant c > 0.
Properties (i) (reflection symmetry) and (ii) (time-homogeneity) are not particularly surprising. (ii) says that Brownian motion viewed from time s onwards is just like the original,
provided we put our origin at the point it has got to, (s, B(s)) (see the picture below). (iii) is
more interesting: it says that Brownian motion is invariant under scaling, but we must scale
the horizontal and vertical axes differently – this reflects the square-root scaling that we saw
for the random walk pictures.
[Figure: a Brownian motion path on [0, 1] with the point (s, B(s)) marked as the new origin for the shifted process.]
Proof. We just need to check the properties in each case.
(i) We have −B(0) = B(0) = 0, and −B is continuous. Since B has independent
increments, so does −B. Since B(t + s) − B(s) ∼ N (0, t) and the N (0, t) distribution is
symmetric about 0, we have −B(s + t) − (−B(s)) = −(B(s + t) − B(s)) ∼ N (0, t).
(ii) Let X(t) = B(s + t) − B(s). Then X(0) = B(s) − B(s) = 0, and X is continuous since
B is. The increment of X over an interval [u, v] is X(v) − X(u) = B(s + v) − B(s + u), so X
has independent increments. Finally, X(t + u) − X(u) = B(t + u + s) − B(u + s) ∼ N (0, t),
as required.
(iii) Let Y(t) = (1/√c) B(ct). Then Y(0) = 0 and Y is continuous and has independent
increments. Moreover, we have Y(s + t) − Y(s) = (1/√c)(B(cs + ct) − B(cs)). Since
B(cs + ct) − B(cs) ∼ N(0, ct), we get Y(s + t) − Y(s) ∼ N(0, ct · (1/√c)^2) = N(0, t).
We can also reverse time.
Theorem 7.3 (reversal). Let B be a Brownian motion and fix u > 0. Then the time-reversed process (B(u − t) − B(u))_{0≤t≤u} has the same distribution as (B(t))_{0≤t≤u}.
In other words, running time backwards from u to 0 simply gives a Brownian motion run
up to time u (provided we subtract the starting value B(u) to make the process start at 0).
Proof. Let X(t) = B(u−t)−B(u). Then X(0) = B(u)−B(u) = 0, and X is continuous since
B is. For 0 ≤ s ≤ t ≤ u, the increment of X over [s, t] is X(t) − X(s) = B(u − t) − B(u) −
B(u − s) + B(u) = −(B(u − s) − B(u − t)), i.e. minus the increment of B over [u − t, u − s].
Therefore X has independent increments, since B does. And finally, for 0 ≤ s ≤ s + t ≤ u
we have X(s + t) − X(s) = −(B(u − s) − B(u − s − t)) ∼ N (0, t).
The next property is considerably more surprising – we can invert time, meaning we
replace t with 1/t, so that in particular the behaviour of B near ∞ gets mapped to near 0
and vice versa.
Theorem 7.4 (inversion). If B is a standard Brownian motion then so is the process
X = (X(t))t≥0 defined by
X(t) = t B(1/t) for t > 0, and X(0) = 0.
We won’t prove this, although in the problem sheet we check some of the ingredients. It
has some striking consequences. For example, it can be shown that almost surely B takes
both positive and negative values at arbitrarily large times. From the inversion formula it
follows that also it takes both positive and negative values at arbitrarily small times. By
time-homogeneity it then also follows that, for any fixed s, B takes values both greater
than B(s) and less than B(s) at times t arbitrarily close to s. Brownian motion is very
wiggly!
7.3 Reflection principle
Let B be a standard Brownian motion and let
Ta = min{t : B(t) = a}
be the first hitting time of a. It can be shown (see below) that Ta < ∞ almost surely. This
should not come as a surprise, by analogy with the recurrence of simple symmetric random
walk.
Here is a beautiful fact that allows us to compute many things.
Theorem 7.5 (reflection principle). If B is a standard Brownian motion then the process
W = (W(t))_{t≥0}, where W(t) = B(t) for t ≤ Ta and W(t) = 2a − B(t) for t > Ta,
is a standard Brownian motion.
91
[Figure: a Brownian path B(t), the level a, the hitting time Ta, and the reflected path W(t) obtained by flipping B about the line at height a after time Ta.]
See the picture. The process W (t) follows the original Brownian motion B(t) until the first
time Ta that it meets the horizontal line at height a, and thereafter it follows the trajectory of
B(t) reflected about that horizontal line. (The formula arises because B(t) − a is the height
of B(t) relative to a, so negating this relative height gives a − (B(t) − a) = 2a − B(t).)
Proof. (sketch) There is a strong Markov property for Brownian motion, and Ta is a stopping
time. We omit the general statement and definitions; in our case the conclusion is that the
process
X(t) = B(Ta + t) − B(Ta ), t ≥ 0
started at time Ta is a Brownian motion, and it is independent of the process (B(t))0≤t≤Ta
before Ta . Since B is continuous we have B(Ta ) = a. So we can obviously express the original
Brownian motion as
B(t) = { B(t), t ≤ Ta;  a + X(t − Ta), t > Ta }.
To get W we just replace X with its negative:
W(t) = { B(t), t ≤ Ta;  a − X(t − Ta), t > Ta }.
This must give a process with the same distribution, because if X is a Brownian motion then
so is −X, and both are independent of the process before Ta .
Now we look at some applications. Let
M (t) = max{B(s) : 0 ≤ s ≤ t}
be the running maximum of Brownian motion. It might appear hard to compute the distribution of M (t) – it is the maximum of infinitely many random variables, which are highly
dependent. But amazingly we have the following.
Theorem 7.6. M (t) has the same distribution as |B(t)|.
Of course B(t) ∼ N (0, t), so there is no particular mystery about the distribution of |B(t)|
itself.
Proof. Fix a > 0. We have
P(M (t) ≥ a) = P(M (t) ≥ a, B(t) ≥ a) + P(M (t) ≥ a, B(t) < a)
= P(B(t) ≥ a) + P(M(t) ≥ a, B(t) < a),    (21)
because M (t) ≥ B(t). To deal with the second term, note the equivalence of the events
(exercise: check this carefully)
{M(t) ≥ a} = {Ta ≤ t},    (22)
and also note that the hitting time of a is the same for B as for the process W defined in
Theorem 7.5 (because they agree up to the hitting time). Hence
P(M (t) ≥ a, B(t) < a) = P(Ta ≤ t, B(t) < a)
= P(Ta ≤ t, W (t) < a)
= P(Ta ≤ t, 2a − B(t) < a)
= P(Ta ≤ t, B(t) > a)
= P(M (t) ≥ a, B(t) > a)
= P(B(t) > a).
Therefore, from (21), and since B(t) is a continuous symmetric random variable,
P(M (t) ≥ a) = P(B(t) ≥ a) + P(B(t) > a) = 2P(B(t) ≥ a) = P(|B(t)| ≥ a).
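Theorem 7.6 can also be checked by simulation, with the caveat that a discrete grid slightly underestimates the true running maximum. The following Python sketch (not part of the notes; the grid size and the level a = 0.5 are arbitrary) compares P(M(1) ≥ a) with P(|B(1)| ≥ a):

```python
import math
import random

def max_and_end(t=1.0, n=1000):
    """Simulate B on a grid of n steps over [0, t]; return (running max, B(t)).
    The grid maximum slightly underestimates the true M(t)."""
    dt = t / n
    b = m = 0.0
    for _ in range(n):
        b += random.gauss(0.0, math.sqrt(dt))
        m = max(m, b)
    return m, b

if __name__ == "__main__":
    a, trials = 0.5, 5000
    hits = ends = 0
    for _ in range(trials):
        m, b = max_and_end()
        hits += (m >= a)
        ends += (abs(b) >= a)
    print(f"P(M(1) >= {a})  ≈ {hits / trials:.3f}")
    print(f"P(|B(1)| >= {a}) ≈ {ends / trials:.3f}   (Theorem 7.6: these agree)")
```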
We can use the last result, together with (22), to compute the distribution of the hitting
time.
Theorem 7.7. For a > 0, the random variable Ta has probability density function
f(t) = (a / √(2π t^3)) e^{−a^2/(2t)},    t > 0.
Proof. See the problem sheet.
In particular the proof of Theorem 7.7 implies that Ta < ∞ almost surely. (This fact was
not needed for the proof of Theorem 7.6.) It is also easy to deduce that ETa = ∞, which
should not be surprising by analogy with simple symmetric random walk.
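The infinite mean can be illustrated with the random-walk analogue: if T is the number of steps for a simple symmetric random walk to first reach +1, then the truncated means E[min(T, cap)] keep growing (roughly like √cap) instead of levelling off. A rough Python sketch (not part of the notes; the caps and trial counts are arbitrary):

```python
import random

def first_passage_capped(cap):
    """Steps for a simple symmetric random walk started at 0 to first reach +1,
    truncated at `cap` steps."""
    s = 0
    for n in range(1, cap + 1):
        s += random.choice((-1, 1))
        if s == 1:
            return n
    return cap

if __name__ == "__main__":
    trials = 2000
    for cap in (10**3, 10**4, 10**5, 10**6):
        mean = sum(first_passage_capped(cap) for _ in range(trials)) / trials
        print(f"cap = {cap:>7}:  E[min(T, cap)] ≈ {mean:.0f}")
```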
It can be shown that almost surely there is a unique time at which Brownian motion
attains its maximum value in the interval [0, 1]. We let Y be this time, so that 0 ≤ Y ≤ 1
and
B(Y ) = max{B(t) : t ∈ [0, 1]}.
It turns out that we can compute the distribution of this as well!
Remark – not required for the course. There is a subtle point here. We saw earlier
that for any fixed time s, almost surely Brownian motion B takes values both greater
than and less than B(s) arbitrarily close to s on both the left and the right. But that
is not the case at Y ! By definition of a maximum, B takes only values less than B(Y )
immediately to the left and right of Y . This is not a contradiction, because the time
interval [0, 1] is uncountable. We could conclude (by the union bound) that the earlier
almost sure statement about s holds simultaneously for a fixed countable set of times, such
as all s ∈ [0, 1] ∩ Q, but not for all of [0, 1]. The random variable Y is continuous, so the
probability that it takes any particular value is zero.
It turns out that certain other properties do hold for all times simultaneously. For
instance, it can be shown that Brownian motion is nowhere differentiable almost surely.
One must be careful with such issues when dealing with continuous time processes!
Theorem 7.8. The location Y of the maximum of Brownian motion on [0, 1] has cumulative distribution function
P(Y < t) = (2/π) arcsin √t,    0 ≤ t ≤ 1.
Proof. For any t ∈ (0, 1),
P(Y < t) = P( max_{s∈[0,t]} B(s) > max_{s∈[t,1]} B(s) )
= P( max_{s∈[0,t]} (B(s) − B(t)) > max_{s∈[t,1]} (B(s) − B(t)) )
= P( max_{u∈[0,t]} (B(t − u) − B(t)) > max_{v∈[0,1−t]} (B(t + v) − B(t)) ).
The purpose of re-writing like this is to consider the two parts of the process before and
after t as processes started from t. By time homogeneity, (B(t + v) − B(t))v≥0 is a Brownian
motion. By reversibility, (B(t − u) − B(t))0≤u≤t is also a Brownian motion. Moreover, by
independent increments, these two processes are independent. Therefore, by Theorem 7.6, the
two maxima in the last line above are simply absolute values of independent Normal random
variables. More precisely, since B(r) can be expressed as Z√r for a standard Normal Z, the
above probability equals
P( |Z1| √t > |Z2| √(1 − t) ),
where Z1 , Z2 are independent standard Normal random variables.
The joint probability density function of Z1 , Z2 is
f(z1, z2) = (1/(2π)) e^{−(z1^2 + z2^2)/2},
which is circularly symmetric. Therefore if (R, Θ) are the polar coordinates of (Z1, Z2) then
Θ is uniformly distributed on [0, 2π] (and independent of R). The region {(z1, z2) : |z1|√t >
|z2|√(1 − t)} is simply the two shaded sectors:
[Figure: the (z1, z2) plane with the two shaded sectors around the z1-axis, bounded by the rays through (√(1 − t), √t) and its reflections, each making angle α with the z1-axis.]
where the angle α is given by tan α = √t / √(1 − t), so sin α = √t. Therefore the probability
that the random point (Z1 , Z2 ) lies in the region is
4α/(2π) = (2/π) arcsin √t.
The distribution of Y is called the arcsine distribution. By differentiating the distribution function in Theorem 7.8 (using d/dt arcsin √t = 1/(2√t √(1 − t))), its probability density function is
1/(π √(t(1 − t))),    0 < t < 1;
see below.
[Figure: the arcsine density 1/(π√(t(1 − t))) plotted on (0, 1); it is U-shaped, blowing up near t = 0 and t = 1.]
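The arcsine law for the location of the maximum can also be checked by simulating discretized paths and recording where the grid maximum occurs. A Python sketch (not part of the notes; the grid size and trial count are arbitrary):

```python
import math
import random

def argmax_time(n=1000):
    """Approximate the time in [0, 1] at which a discretized Brownian path is largest."""
    dt = 1.0 / n
    b, best, best_i = 0.0, 0.0, 0
    for i in range(1, n + 1):
        b += random.gauss(0.0, math.sqrt(dt))
        if b > best:
            best, best_i = b, i
    return best_i * dt

if __name__ == "__main__":
    trials = 4000
    ys = [argmax_time() for _ in range(trials)]
    for t in (0.1, 0.25, 0.5, 0.75, 0.9):
        empirical = sum(y < t for y in ys) / trials
        theory = 2 / math.pi * math.asin(math.sqrt(t))
        print(f"P(Y < {t}):  empirical {empirical:.3f},  arcsine {theory:.3f}")
```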
Further remarks – not required for the course. Remarkably, it can be shown that
both of the following random variables also have arcsine distribution: the last zero of
Brownian motion in [0, 1]:
L = sup{t ∈ [0, 1] : B(t) = 0};
and the amount of time in [0, 1] for which it is positive:
H = ∫_0^1 I{B(t)>0} dt.
Brownian motion has many applications, including to physics and finance. It is named
after the botanist Robert Brown, who observed random motion of pollen grains under a
microscope. Albert Einstein concluded that this motion was due to random collisions with
molecules. The mathematical definition of Brownian motion is due to Norbert Wiener.
Brownian motion forms the underpinning of Itô calculus, which in turn is the basis of
modern mathematical finance.
There are thousands of mathematical papers written on Brownian motion. Of particular interest is d-dimensional Brownian motion, which is simply the vector-valued process
B^(d)(t) = (B1(t), . . . , Bd(t)), where the components (Bi(t))_{t≥0} are independent Brownian
motions. A few celebrated results are:
1. If B is a 1-dimensional Brownian motion and M is its running maximum as defined
before, then the process (2M(t) − B(t))_{t≥0} has the same distribution as (∥B^(3)(t)∥)_{t≥0},
the Euclidean norm of a 3-dimensional Brownian motion.
2. If S = {B^(2)(t) : t ∈ [0, 1]} is the image of 2-dimensional Brownian motion up to
time 1, then conditioned on B^(2)(1) = 0, the area enclosed by S has mean π/5.
3. The outer boundary of S has fractal dimension (Hausdorff dimension) 4/3.