MSc MT15. Further Statistical Methods: MCMC
Lecturer: Geoff Nicholls
Lectures 1-2: concepts and motivation; inversion; transformation; rejection.
Notes and Practicals will be made available at www.stats.ox.ac.uk/~nicholls/MScMCMC15

"Monte Carlo Simulation: An analytical technique for solving a problem by performing a large number of trial runs, called simulations, and inferring a solution from the collective results of the trial runs.", glossary at www.nasdaq.com.

"Anyone who can do solid statistical programming will never miss a meal.", Prof David Banks, 2008.

Why Monte Carlo?

We fit complex realistic models and analyze large data sets. Taking expectations is a fundamental operation in Statistics. Expectations are integrals, and integrals are hard. We use simulation to do hard integrals. Monte Carlo theory is applied probability; doing Monte Carlo on a computer is statistical computing. Monte Carlo methods empower you to do a large chunk of statistical inference.

Simulation: the basic idea

In many settings we wish to estimate expectations. Suppose X ∈ Ω is a random variable (rv) with density X ∼ p(x), f : Ω → ℜ is a function, and we want to evaluate the expectation
\[ E_p(f) = \int_\Omega f(x)\,p(x)\,dx. \]
If we have X_i, i = 1, 2, ..., n with X_i ∼ p then
\[ \bar{f}_n = n^{-1} \sum_{i=1}^n f(X_i) \]
is an unbiased estimator for E_p(f). If the f(X_i) are iid with 0 < var(f(X)) < ∞ and |E_p(f)| < ∞ then
\[ \frac{\bar{f}_n - E_p(f)}{\sqrt{\mathrm{var}(f)/n}} \to N(0, 1) \]
by the CLT. If Z_{α/2} is the 1 − α/2 quantile of a standard normal and
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^n \left(f(X_i) - \bar{f}_n\right)^2 \]
is an unbiased estimator for var(f), we can report the level-(1 − α) CI
\[ \bar{f}_n \pm Z_{\alpha/2} \frac{s}{\sqrt{n}} \]
for E_p(f).

Example: X ∼ N(0, 1); what is a = P(sin(πX) > 0.5)? Here P(sin(πX) > 0.5) = E(I_{sin(πX)>0.5}), where
\[ I_{\sin(\pi x) > 0.5} = \begin{cases} 1 & \text{if } \sin(\pi x) > 0.5 \\ 0 & \text{otherwise.} \end{cases} \]
We fix n and draw X_i ∼ N(0, 1) iid, then form
\[ \hat{a} = \frac{1}{n} \sum_{i=1}^n I_{\sin(\pi X_i) > 0.5} \]
(an R sketch of this estimator appears at the end of this overview).

Exercise: show there is a CLT for â.

Exercise: suppose 0 ≤ f(x) ≤ M for all a ≤ x ≤ b. The hit-and-miss method estimates
\[ F = \int_a^b f(x)\,dx \]
by dropping points uniformly in the box [a, b] × [0, M] and estimating F using \hat{F}_{hm} = M(b − a)\hat{p}, where \hat{p} is the proportion of points in the box falling under the curve of f(x). We could alternatively note that F = (b − a)E(f(X)) when X ∼ U[a, b], so we can use \hat{F}_{av} = (b − a)\bar{f}_n with \bar{f}_n as above. Show that var(\hat{F}_{av}) ≤ var(\hat{F}_{hm}). Why is this desirable?

Example: In Bayesian inference for θ|y with posterior p(θ|y), likelihood L(θ; y) and prior p(θ),
\[ p(\theta|y) = \frac{L(\theta; y)\,p(\theta)}{m(y)}, \]
where m(y) = ∫ L(θ; y) p(θ) dθ is the marginal likelihood. The posterior mean, μ_f = E(f(θ)|y), of a function f(θ),
\[ \mu_f = \int f(\theta)\, \frac{L(\theta; y)\,p(\theta)}{m(y)}\, d\theta, \]
is usually hopelessly intractable. We can use Monte Carlo to sample θ^{(i)} ∼ p(·|y), i = 1, 2, ..., N and estimate μ_f using
\[ \bar{f}_N = \frac{1}{N} \sum_{i=1}^N f(\theta^{(i)}). \]
We will be interested in MC methods that do not require evaluation of m(y) (as that is hard too).

MCMC module overview

Our first task is to design an algorithm for generating the X_i's: given a probability density or mass function p(x) (the target distribution), give an algorithm for simulating X ∼ p. There are several main classes of Monte Carlo algorithms. They all have their strengths and weaknesses. We will use
• the inversion and transformation methods (approximately 1 lecture),
• rejection sampling (1L),
• importance sampling (1L),
• Markov chain Monte Carlo (5L).
You will learn why they work, and what their relative advantages are in different settings. You will learn to adapt these algorithms to simulation for different target distributions.
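Aside (not in the original notes): a minimal R sketch of the basic Monte Carlo recipe above, applied to the sin(πX) example. The sample size n, the seed and the variable names are arbitrary choices for illustration.

# Monte Carlo estimate of a = P(sin(pi*X) > 0.5) for X ~ N(0,1),
# with a CLT-based 95% confidence interval (alpha = 0.05).
set.seed(1)                           # arbitrary seed, for reproducibility
n  <- 1e5                             # arbitrary sample size
x  <- rnorm(n)                        # X_i ~ N(0,1) iid
fx <- as.numeric(sin(pi * x) > 0.5)   # f(X_i) = I{sin(pi X_i) > 0.5}
a.hat <- mean(fx)                     # the Monte Carlo estimate \hat a
s     <- sd(fx)                       # sample sd of the f(X_i)
z     <- qnorm(0.975)                 # Z_{alpha/2}
c(estimate = a.hat,
  lower = a.hat - z * s / sqrt(n),
  upper = a.hat + z * s / sqrt(n))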
I won't be covering Sequential Monte Carlo. This is a method related to importance sampling. It is often very effective for sampling distributions with some kind of time-series structure.

Our second task is to implement these algorithms (in R, say). I won't say much about implementation, just a few comments.

Our third task is to interpret the Monte Carlo output. We will look at output analysis, especially when we discuss importance sampling and MCMC.

Inversion

Say X ∈ Ω is a scalar rv with cdf F(x) = Pr(X ≤ x) at X = x and we want to simulate X ∼ F.

Claim: suppose F(x) is continuous and strictly increasing in x. If U ∼ U(0, 1) and X = F^{-1}(U) then X ∼ F.

Proof: Pr(X ≤ x) = Pr(F^{-1}(U) ≤ x), so applying the strictly increasing function F to both sides of F^{-1}(U) ≤ x we have Pr(X ≤ x) = Pr(U ≤ F(x)), which is F(x), since U is uniform.

Remark: if we define F^{-1}(u) = min{z : F(z) ≥ u} then F^{-1}(U) ∼ F without F being continuous or strictly increasing.

Example: if we want X ∼ Exp(r), i.e. X ∼ p(x) with p(x) = r exp(−rx), then the cdf is F(x) = 1 − exp(−rx) and its inverse is F^{-1}(u) = −(1/r) log(1 − u). The algorithm is
U ∼ U(0, 1)
X ← −log(U)/r
and note I replaced 1 − U with U since U ∼ 1 − U.

Inversion: discrete random variables

Suppose X ∼ F with X a discrete rv with pmf p(x), x = 0, 1, 2, .... Consider the following algorithm:
U ∼ U(0, 1)
set X equal to the unique x satisfying \(\sum_{i=0}^{x-1} p(i) < U \le \sum_{i=0}^{x} p(i)\),
with \(\sum_{i=0}^{x-1} p(i) \equiv 0\) if x = 0.

Proof: It is easy to check that X ∼ p. (board presentation)

Remark: the cdf here is F(x) = \(\sum_{i=0}^{x} p(i)\), and the above algorithm is actually just x = F^{-1}(u) again, with the more general definition of F^{-1} above.

Exercise: Suppose 0 < p < 1 and q = 1 − p, and we want to simulate X ∼ Geometric(p) with pmf p(x) = p q^{x−1} and cdf F(x) = 1 − q^x for x = 1, 2, 3, .... Show that the following algorithm simulates X:
Step 1. Simulate U ∼ U[0, 1].
Step 2. Set
\[ x = \left\lceil \frac{\log(1 − u)}{\log(q)} \right\rceil, \]
where ⌈x⌉ rounds up. (Hint: you need to show that this formula gives the smallest x such that 1 − q^x ≥ u.)

Transformation Methods

Say Y ∼ Q, Y ∈ Ω_Q, is a rv we can simulate, and X ∼ P, X ∈ Ω_P, is a rv we want to simulate. If we can find a function f : Ω_Q → Ω_P with the property that f(Y) ∼ P, then we can simulate X by simulating Y ∼ Q and setting X = f(Y).

Example: Suppose we want to simulate X ∼ Exp(1) and we can simulate Y ∼ U(0, 1). Try setting X = −log(Y). Does that work? Ans: recall that if Y has density q(y) and X = f(Y) then the density of X is p(x) = q(y(x))|dy/dx|. Now q(y) = 1 (the density of U(0, 1)) and y(x) = exp(−x), so p(x) = exp(−x) and so X has the density of an Exp(1) rv.

Example: Inversion is a transformation method: Q is U(0, 1); Y is U; and X = f(Y) with f(y) = F^{-1}(y) and F the cdf of the target distribution P.

Transforming several variables

We can generalize the idea: we can take functions of collections of variables.

Example: Suppose we want to simulate X ∼ Γ(a, β) with a ∈ {1, 2, 3, ...} and we can simulate Y ∼ Exp(1). Simulate Y_i ∼ Exp(1), i = 1, 2, ..., a and set X = \(\sum_{i=1}^a Y_i / \beta\). Then X ∼ Γ(a, β).

Proof: Use moment generating functions. The MGF of an Exp(1) rv Y is
\[ E\left(e^{tY}\right) = (1 − t)^{-1}, \]
so the MGF of X is
\[ E\left(e^{tX}\right) = \prod_{i=1}^a E\left(e^{tY_i/\beta}\right) = (1 − t/\beta)^{-a}, \]
which is the MGF of a Γ(a, β) variate.

Rejection sampling

Suppose I want to sample X ∼ p(x). If I throw darts uniformly at random 'under the curve' of p(x) then the x-values are distributed like p(x). We do this by throwing darts UAR under an envelope function that sits everywhere above p(x) (below we take the envelope to be Mq(x), with q(x) a density we can sample from), and keeping the darts that fall under p(x).
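Before working through rejection sampling in detail, here is a minimal R sketch (not from the notes; the function names r.exp.inv, r.geom.inv and r.gamma.sum are mine) of the inversion and transformation samplers above, with quick checks against the known means.

set.seed(2)

# Inversion for X ~ Exp(r): X = -log(U)/r, using U in place of 1 - U.
r.exp.inv <- function(n, r) -log(runif(n)) / r

# Inversion for X ~ Geometric(p) on {1,2,3,...}: X = ceiling(log(1-U)/log(q)).
r.geom.inv <- function(n, p) ceiling(log(1 - runif(n)) / log(1 - p))

# Transformation for X ~ Gamma(a, beta), a = 1, 2, 3, ...:
# sum a independent Exp(1) draws and divide by beta.
r.gamma.sum <- function(n, a, beta) {
  y <- matrix(rexp(n * a, rate = 1), nrow = n)   # n rows of a Exp(1) draws
  rowSums(y) / beta
}

mean(r.exp.inv(1e5, r = 2))               # should be close to 1/r = 0.5
mean(r.geom.inv(1e5, p = 0.3))            # should be close to 1/p = 3.33
mean(r.gamma.sum(1e5, a = 3, beta = 2))   # should be close to a/beta = 1.5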
Example: suppose I want to sample X ∼ p with p_X(x) = 2x, 0 ≤ x ≤ 1. If I sample Y ∼ U(0, 1) and U ∼ U(0, 1) then (Y, 2U) is uniform in the box [0, 1] × [0, 2]. The points under the curve are the ones we want, since Y | {2U < 2Y} ∼ p_X.

[Figure: two panels of simulated points in the box [0, 1] × [0, 2], with y on the horizontal axis and u on the vertical axis; one panel shows all points, the other the points retained under the curve p(x) = 2x.]

The Rejection algorithm

Say Y ∼ q(y), y ∈ Ω, is a rv we can simulate, and X ∼ p(x), x ∈ Ω, is a rv we want to simulate. Suppose there exists M such that M q(x) ≥ p(x) for all x ∈ Ω.

Rejection Algorithm simulating X ∼ p(x):
[1] Simulate y ∼ q(y) and u ∼ U(0, 1).
[2] If u < p(y)/(M q(y)) return X = y ('accept y') and stop; otherwise go to [1] ('reject y').

This algorithm works because (y, u M q(y)) is simulated UAR 'under' M q(y). We keep trying until we get a point with V = u M q(y) 'under' p(y). This point is UAR under p(y), so its y-component is distributed according to p.

Proposition: The rejection algorithm simulates X ∼ p(x).

Consider Ω = {1, 2, 3, ..., m}. For x ∈ Ω let p(x) be the probability mass function for X. If p = (p_1, p_2, p_3, ..., p_m) then p(x) = p_x. As before we suppose we have M so that M ≥ p(x)/q(x) for each x ∈ Ω.

Rejection Algorithm simulating X ∼ p:
[1] Simulate y ∼ q(y) and u ∼ U(0, 1).
[2] If u < p(y)/(M q(y)) return X = y and stop; otherwise go to [1] and try again.

Let Pr(X = x) be the probability this algorithm returns X = x. We want to show that Pr(X = x) = p(x).

Claim: The number of trials in the rejection algorithm is geometric with mean M.

Proof: each trial at [2] succeeds independently with probability ξ, say. In order to succeed we must draw a y and accept it. The first event (draw y) happens with probability q(y). The second event (accept y) happens with probability p(y)/(M q(y)). We don't care which y it is, so
\[ \xi = \sum_{y \in \Omega} q(y) \times \frac{p(y)}{M q(y)} \quad (1) \]
\[ \phantom{\xi} = 1/M, \quad (2) \]
because \(\sum_y p(y) = 1\). If the number of trials is N then E(N) = 1/ξ = M.

Comment: Each time we execute steps [1-2] we fail with probability 1 − 1/M.

Claim: The probability the algorithm returns X = x is p(x).

How do we end up with X = x? When we enter the loop [1-2], the probability we draw x and then accept it is
\[ q(x) \times \frac{p(x)}{M q(x)} = \frac{p(x)}{M}. \]
However, we might pass through [1-2] any number of times. We can draw x and then accept it on the first pass, or on the second pass, or the third, and so on:
\[ \Pr(X = x) = \frac{p(x)}{M} + (1 − 1/M)\frac{p(x)}{M} + (1 − 1/M)^2\frac{p(x)}{M} + \cdots = \frac{p(x)}{M} \sum_{n=1}^{\infty} (1 − 1/M)^{n-1} = p(x), \]
as claimed.

First comment: Where did we use M q(x) ≥ p(x)? If for some y, p(y)/q(y) > M, then the probability that u ≤ p(y)/(M q(y)) at step [2] is one, not p(y)/(M q(y)). Exercise: show that rejection then simulates X ∼ min(p(x), M q(x)) (normalized to be a density), and hence X ∼ p(x) if p(x) ≤ M q(x) for all x.

Second comment: The mean number of trials is M. If we are conservative and choose a very large M, in order to ensure the bound M ≥ p(y)/q(y) holds for all y, then the number of trials is very large and the implementation runs slowly.

Example: Use Y ∼ U(0, 1) (i.e. q(x) = 1) to simulate X ∼ p(x) with p(x) = 2x for 0 ≤ x ≤ 1. Calculate the mean number of trials per simulated X-value.

Step 1. Find M so that p(x)/q(x) ≤ M for all 0 ≤ x ≤ 1. Here p(x)/q(x) = 2x. We can use M = 2 since 2x ≤ 2 there. We have p(y)/(M q(y)) = y for the second step.

Step 2. Write down the algorithm.

Rejection Algorithm simulating X ∼ 2x:
[1] Simulate y ∼ U[0, 1] and u ∼ U[0, 1].
[2] If u < y return X = y; otherwise go to [1] and try again.

The mean number of trials is M = 2.
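Before the one-line R version given next, here is a more general sketch of the rejection algorithm as an R function (a sketch under my own naming, not from the notes: rp draws one proposal from q, dq and dp evaluate the proposal and target densities). Run on the p(x) = 2x example it also checks that the mean number of trials per accepted sample is close to M = 2.

# Generic rejection sampler: accept y ~ q with probability p(y)/(M*q(y)).
rejection.sample <- function(n, rp, dq, dp, M) {
  x <- numeric(n)
  trials <- 0
  for (i in 1:n) {
    repeat {
      trials <- trials + 1
      y <- rp()                          # step [1]: propose y ~ q
      u <- runif(1)
      if (u < dp(y) / (M * dq(y))) {     # step [2]: accept, else try again
        x[i] <- y
        break
      }
    }
  }
  list(x = x, trials.per.sample = trials / n)
}

# Target p(x) = 2x on [0,1], proposal q = U(0,1), bound M = 2.
set.seed(3)
out <- rejection.sample(1e4,
                        rp = function() runif(1),
                        dq = function(y) 1,
                        dp = function(y) 2 * y,
                        M  = 2)
out$trials.per.sample                    # mean number of trials, close to M = 2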
Implement this algorithm in R:

> while (runif(1) > (x <- runif(1))) {}; x
[1] 0.8843435

Compare the histogram of 10K samples with the target density.

[Figure: histogram of the samples (Density against X) overlaid with the target density p(x) = 2x on [0, 1].]
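A small sketch of that comparison (sample size, seed and plotting choices are mine, not part of the notes): repeat the one-line sampler 10,000 times and overlay the target density p(x) = 2x on the histogram.

set.seed(4)
n <- 1e4
samples <- replicate(n, { while (runif(1) > (x <- runif(1))) {}; x })
hist(samples, breaks = 50, freq = FALSE, xlab = "X", main = "")  # density scale
curve(2 * x, from = 0, to = 1, add = TRUE, lwd = 2)              # target p(x) = 2x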