Efficient (Exact/Approximate) Monte Carlo Inference for Stochastic Epidemic Models
Theo Kypraios, http://www.maths.nott.ac.uk/∼tk
School of Mathematical Sciences, University of Nottingham
Warwick, Infer 2011 Workshop

What this talk is about
Most of the content is (joint) work in progress . . .
Interested in exploring cases (i.e. models/data) where exact "likelihood-free" Monte Carlo inference can be drawn. If exact inference is not possible, interested in efficient, but approximate, methods.
Discuss how one can take advantage of the model's structure, and how some, if any, of the available (theoretical) results can aid the construction of efficient simulation-based algorithms.

Exact Bayesian Computation (EBC)
Suppose we have discrete data D and a prior π(θ) for the parameter(s) θ.
Aim: draw samples from the posterior distribution of the parameters, π(θ|D).
Bayes' theorem gives
π(θ|D) = π(D|θ)π(θ)/π(D),
where π(D) is a normalising constant,
π(D) = ∫ π(D|θ)π(θ) dθ,
and therefore π(θ|D) ∝ π(D|θ)π(θ).

Rejection Sampling
Consider the following algorithm:
ER1
1. Draw θ ∼ π(θ);
2. Accept θ with probability equal to π(D|θ);
3. Repeat until the required number of samples is drawn.
* It is trivial to show that the accepted values will be distributed according to the posterior distribution π(θ|D).
* Note that this algorithm requires that the likelihood, π(D|θ), can be computed for any θ (Step 2).
We can do better: if π(D|θ) < c for all θ, substitute Step 2 with:
2. Accept θ with probability equal to π(D|θ)/c;
i.e. an increased acceptance rate at zero cost, provided that we can find an upper bound c for the likelihood.

Rejection Sampling (cont.)
Consider the following algorithm:
ER2
1. Draw θ ∼ π(θ);
2. Draw D′ ∼ π(·|θ), i.e. simulate data D′ from the model;
3. Accept θ if D = D′;
4. Repeat until the required number of samples is drawn.
* Algorithm ER2 is equivalent to algorithm ER1.
* The main difference is that the computation of the likelihood π(D|θ) (ER1, Step 2) is substituted by the simulation of an event which occurs with that probability (ER2, Steps 2 and 3).
* That means that the calculation of the likelihood is unnecessary, as long as we can simulate from our stochastic model.

EBC and Monte Carlo Maximum Likelihood
Suppose that we are interested in finding the MLE, say θ̂_MLE. One needs to be able to compute the likelihood π(D|θ) for any value θ. If this cannot be done directly, one can approximate the likelihood function using Monte Carlo (Diggle and Gratton, 1984):
ML1
1. Choose a grid of values for θ, i.e. (θ_1, . . . , θ_K);
2. Estimate the value of the likelihood at each θ_j, j = 1, . . . , K, by
π̂(D|θ_j) = (1/M) Σ_{i=1}^{M} 1(D′_{θ_j,i} = D),
where D′_{θ_j,i} is the i-th dataset simulated from the model with parameter θ_j.
Instead of maximising the likelihood function π(D|θ), we are maximising a Monte Carlo estimator of it. One could maximise it either in a brute-force manner over the grid (expensive) or by numerical optimisation (much cheaper). Although such an MCMLE scheme enjoys desirable properties (such as unbiasedness etc.), we have to be careful, since we might get misleading results due to large Monte Carlo error → increase the number of Monte Carlo samples M. One can also fit a kernel density estimate (KDE) to the Monte Carlo likelihood function and carry out the optimisation on the KDE.
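As a concrete illustration of ML1, here is a minimal sketch (not code from the talk): the simulator `simulate`, the grid, and the exact-match criterion are placeholders to be supplied for the model at hand.

```python
import numpy as np

def mc_likelihood(theta_grid, observed, simulate, M=1000, rng=None):
    """ML1: for each theta_j on the grid, estimate pi(D | theta_j) by the
    proportion of M simulated datasets that exactly match the observed D."""
    rng = rng or np.random.default_rng()
    return np.array([
        np.mean([simulate(theta, rng) == observed for _ in range(M)])
        for theta in theta_grid
    ])

# Hypothetical usage: brute-force maximisation over the grid.
# theta_grid = np.linspace(0.001, 0.02, 50)
# lhat = mc_likelihood(theta_grid, observed=40, simulate=final_size_simulator)
# theta_mle = theta_grid[np.argmax(lhat)]
```

A KDE could then be fitted to the estimated likelihood values and optimised instead, as suggested above.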
Some Remarks
Both algorithms ER1 and ER2 are very easy to implement, but both suffer in terms of efficiency when the prior π(θ) is very different from the posterior π(θ|D); while algorithm ML1 can be problematic if there is huge Monte Carlo variation, in which case one needs a very large M to approximate the likelihood function accurately.

How does all this relate to epidemic modelling?

EBC for Epidemic Models
Consider some final size data D (i.e. the total number of infectives out of an initial susceptible population of size N) from a standard SIR model. Having assumed a distribution for the infectious period, we wish to make inference for the infection rate λ (or R_0), i.e. derive (or draw samples from) π(λ|D) (or π(R_0|D)).
When using standard (double-precision) arithmetic, numerical problems appear when computing the likelihood π(D|θ) (using, for example, the triangular equations), due to rounding errors, even for N of the order 50–100. Therefore, algorithm ER1 cannot be implemented unless Multiple Precision Arithmetic (MPA) is employed (Demiris and O'Neill, Statistics and Computing).
On the other hand, algorithm ER2 can be used instead: simulating final size data from an SIR model is trivial using either the Gillespie algorithm or the Sellke construction. The Gillespie algorithm can be computationally very expensive for large N, and therefore we prefer the Sellke construction. Note that ER2 is very straightforward to implement (i.e. a few lines of code), and it results in exact draws from the posterior distribution of interest. MCMC approaches are also available: e.g. Demiris' thesis (2004), O'Neill (2009), Ball (201?), White's thesis (2010).

Accounting for observational error
It is very easy to allow for error in the observation of the final size, e.g. to account for underreporting. In other words, we are interested in the posterior distribution π(λ|D ± k) for some k. This can be done either retrospectively, i.e. if we have stored all the pairs (θ, D′) we can look at those that satisfy the criterion, or by modifying the existing ER2 algorithm, i.e. only storing on the fly the values of θ which satisfy the criterion.

An Exact Algorithm for Sample Data
Consider the case where we have sample final size data, i.e. we only observe the final size of a proportion of the population, but wish to make inference for the whole population.
Total population size of initial susceptibles: N.
(Random) sampled population size of initial susceptibles: m.
Observed final size: d (out of m).
ER3
1. Draw θ ∼ π(θ);
2. Draw D′ ∼ π(·|θ), where D′ is the total final size;
3. Accept θ with probability p(m, N, d, D′);
4. Repeat until the required number of samples is drawn.
The probability p(m, N, d, D′) in Step 3 can be derived from a hypergeometric distribution and can be computed explicitly: given D′ infected among the N individuals, it is the probability that a uniform sample of size m contains exactly d infected. Instead of accepting θ with probability p(m, N, d, D′), one could do the following:
AR3
1. Draw θ ∼ π(θ);
2. Draw D′ ∼ π(·|θ), where D′ is the total final size;
3. Store the pair {θ, p(m, N, d, D′)};
4. Repeat until the required number of samples is drawn.
One can think of p(m, N, d, D′) as a weight for each drawn θ, and then approximate π(θ|D) in an importance sampling manner.

Do these algorithms really work (well) in practice?

Some Examples – Full Data
Suppose we observe a final size of 40 out of 100 initially susceptible individuals.
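Before looking at the resulting posteriors, here is a minimal sketch of ER2 with the Sellke construction for exactly this setting. It is a sketch, not the code behind the plots that follow: it assumes one initial infective, a per-pair infection rate λ, mean-one exponential infectious periods, and an illustrative exponential prior on λ.

```python
import numpy as np

def final_size_sellke(lam, n_susc, n_init_inf=1, rng=None):
    """Final size of an SIR epidemic via the Sellke construction:
    susceptible j becomes infected once the cumulative infection pressure
    lam * (sum of infectious periods so far) exceeds its Exp(1) threshold."""
    rng = rng or np.random.default_rng()
    thresholds = np.sort(rng.exponential(1.0, n_susc))
    periods = rng.exponential(1.0, n_susc + n_init_inf)  # mean-1 infectious periods
    pressure = lam * np.cumsum(periods)                  # pressure after a + k infections
    z = 0
    while z < n_susc and thresholds[z] <= pressure[n_init_inf + z - 1]:
        z += 1
    return z

def er2(n_samples, observed=40, n_susc=100, rng=None):
    """ER2: draw lambda from the prior and keep it only if the simulated
    final size hits the observed final size exactly."""
    rng = rng or np.random.default_rng()
    kept = []
    while len(kept) < n_samples:
        lam = rng.exponential(0.01)  # illustrative Exp prior on lambda
        if final_size_sellke(lam, n_susc, rng=rng) == observed:
            kept.append(lam)
    return np.array(kept)  # exact draws from pi(lambda | D = 40)
```

Under this parameterisation, posterior samples of R_0 follow by multiplying each accepted λ by N × E[I].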
[Figure: posterior distribution of R_0 (full data), under Gamma(10, 5), Gamma(4, 2), Gamma(2, 1) and Exp(0.5) priors.]

Some Examples – MCMLE
[Figure: posterior distribution of λ.]

Some Examples – Sample Data
Suppose we observe a final size of 10 out of a sample population of 40 initially susceptible individuals, while the size of the total population is 100 individuals.
[Figure: posterior distribution of R_0 (sample data), under Gamma(10, 5), Gamma(4, 2), Gamma(2, 1) and Exp(0.5) priors.]

Some Examples – Error in Observations
Suppose we observe 60 infected individuals out of 100, but we want to allow for some error.
[Figure: posterior distribution of R_0, conditioning on D exactly versus on [D − 2, D + 2].]

Challenges
If the prior is very vague, e.g. λ ∼ Exp(10⁻⁶), then ER2 and ER3 will not be efficient (we rarely hit the target). The problem can be overcome by obtaining a Monte Carlo estimate of the likelihood to get an idea of where the most likely values of λ lie, and then using this information to draw from a constrained "prior". (Is this a problem when we have a pretty informative prior?) This is some sort of adaptive algorithm → strictly speaking we are not doing exact inference anymore, but in practice we lose very little!
When the size of the initially susceptible population is large, the EBC algorithm will be very inefficient, i.e. it will take ages to hit the target.

How Can we Overcome These Problems?
1. Sequential Monte Carlo – Exact Bayesian Computation: one "solution" to the problem might be to apply an SMC-ABC type algorithm . . . but is there an alternative way?
2. "Likelihood-Free Importance Sampling"? Suppose that, somehow, we have some sort of approximation to our posterior of interest π(θ|D), say q(θ). Can we make use of q(θ) and draw honest Bayesian inference?

"Likelihood-Free" Importance Sampling
Recall the ER1 algorithm:
1. Draw θ ∼ π(θ);
2. Accept θ with probability equal to π(D|θ), or π(D|θ)/c if c is known;
3. Repeat until the required number of samples is drawn.
and the ER2 algorithm:
1. Draw θ ∼ π(θ);
2. Draw D′ ∼ π(·|θ), i.e. simulate data D′ from the model;
3. Accept θ if D = D′;
4. Repeat until the required number of samples is drawn.

Approach I: Use ideas from the "Bernoulli Factory"
Although the two algorithms are equivalent, ER2 relies on the fact that we can simulate events with probability equal to the likelihood (Steps 2 and 3 of ER2). If c is known (see ER1), then the equivalent ER2 algorithm should take into account the fact that we want to simulate events with probability π(D|θ)/c, i.e. k · π(D|θ) for k = 1/c. But we only know how to simulate events with probability equal to the likelihood. The question then is: given that we can simulate events with some probability p, how can we simulate events with probability f(p)? → the Bernoulli Factory! It is a (very old) and hard problem! See Nacu and Peres (2005), Latuszynski et al. (2010), etc.
Briefly, there is no exact algorithm to simulate events with probability f(p) for arbitrary f(·); nevertheless, there is an algorithm (see the previous references) for f(p) = min(kp, 1 − ε) for some ε > 0. The smaller the ε, the more inefficient this algorithm becomes!
Question: if we don't want to do exact inference, are there any fast, but approximate, solutions to the Bernoulli factory problem?
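One deliberately naive, approximate answer, offered here only as a sketch (it is not from the talk): estimate p by repeated simulation and then flip a coin with the plug-in probability. The bias vanishes only as the number of pilot simulations m grows, in contrast to the exact (but costly) constructions of Nacu and Peres (2005) and Latuszynski et al. (2010).

```python
import numpy as np

def approx_factory(coin, k, m=1000, rng=None):
    """Naive approximate 'Bernoulli factory': given coin(rng) returning
    True with unknown probability p, simulate an event with probability
    approximately min(k * p, 1) via a plug-in estimate of p."""
    rng = rng or np.random.default_rng()
    p_hat = np.mean([coin(rng) for _ in range(m)])  # crude estimate of p
    return rng.random() < min(k * p_hat, 1.0)
```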
Question: If don’t want to do exact inference, are the there any fast, but approximate solutions to the Bernoulli factory problem? 27 / 57 Sequential Monte Carlo 28 / 57 Approach II: Sequential Monte Carlo SMC-E/ABC type algorithms (eg. Sisson et al (2010), Toni et al. (2010)) can offer a good alternative. The idea is: 1. We use samples from the priors (called particles) . . . 2. . . ., which are propagated through a sequence of intermediate distributions, π(θ|∆(D, D∗ ) ≤ εi ), where the tolerances i for i = 1, . . . , T are chosen such that 1 > . . . T ≥ 0. 3. In other words, these intermediate distributions evolve towards π(θ|D). 29 / 57 Approach II: Sequential Monte Carlo In general: Such algorithms can work very well; although their efficiency largely depends on certain tunable aspects, such as the choice of the tolerances i ,i = 1, . . . , T and the kernel via which the particles are propagated. Moreover, without care the approach can even fail altogether due to “particle degeneracy”. In the epidemic modelling context: At least in the case of a single-type one level mixing model, one could start the series of the ’s with large values and drag it down to zero. Effectively, this will lead to exact Monte Carlo samples from the posterior of interest. 30 / 57 Approach II: Sequential Monte Carlo Abakaliki Data: Posterior distribution of R0 : 0.0 0.2 0.4 Density 0.6 0.8 1.0 1.2 Histogram of lambda.sampled * 4.1 * 119 0.5 1.0 1.5 lambda.sampled * 4.1 * 119 2.0 2.5 31 / 57 What about models with more population structure? 32 / 57 A Model with Two Levels of Mixing We consider epidemic models in populations which mix at two levels, “global” and “local”. More precisely, we consider models in which each infectious individual i has a ‘global’ rate λG for infecting each other individual in the population . . . . . . plus a ‘local’ rate λL of infecting each other individual among a set of neighbours, N(i). Our main concern is the case where the population is partitioned into local groups or households (or farms/yards etc). 33 / 57 EBC for the 2-level Mixing Model ER2 1. Draw θ ∼ π(θ); 2. Draw D0 ∼ π(·|θ), i.e. simulate data D0 from the model; 3. Accept θ if D = D0 ; 4. Repeat until the required number of samples are drawn. Step 3 now is more ambitious than before. We need to hit the target of the total final size of the epidemic as well as the the final size in each group (households/farms/yards). If the households are all of the same size then it is not too bad → permutations or groupings. Otherwise the above algorithm is (or can be) very inefficient and an EBC-SMC type algorithm may not offer that great advantages. 34 / 57 ABC for the 2-level Mixing Model Doing exact inference seems very challenging and therefore, we seek for alternative, but approximate ways to draw inference for θ = (λL , λG ). ABC 1. Draw θ ∼ π(θ); 2. Draw D0 ∼ π(·|θ), i.e. simulate data D0 from the model; 3. Accept θ if ∆(D, D0 ) < ; 4. Repeat until the required number of samples are drawn. Questions: What do we mean by D? How small to choose ? What distance metric ∆(·) should we use? How good the approximate posteriors are to the true ones? 35 / 57 Types of data At least, two different types of data: 1. Suppose that the data D consist of the final size at each group (household/farm/yard), i.e. a series of (d1 , d2 , . . . dn ) as well as the number of susceptibles in each group (N1 , N2 , . . . Nn ). 2. 
Types of data
There are at least two different types of data:
1. The data D consist of the final size in each group (household/farm/yard), i.e. a series (d_1, d_2, . . . , d_n), together with the number of susceptibles in each group, (N_1, N_2, . . . , N_n).
2. Alternatively, if the population consists of various groups of similar size, the data often appear in the following form:

Susceptibles      Number infected
per household     0     1     2    3    4    5    Total
1                 40    93                         133
2                 49    137   3                    189
3                 22    83    2    1               108
4                 20    80    4    1    1          106
5                 3     23    2    1    1    1     31

Various Choices of Metrics
Just compare the total final size and ignore the group structure (not very clever):
∆(D, D′) = | Σ_{i=1}^{n} d_i − Σ_{i=1}^{n} d′_i |
Compare the final sizes in each group (a bit cleverer):
∆(D, D′) = Σ_{i=1}^{n} | d_i − d′_i |
Look at proportions instead (quite clever, especially when the N_i are very different):
∆(D, D′) = Σ_{i=1}^{n} | d_i − d′_i | / N_i
A combination of the above? (See Baguelin et al. (2009).) . . .

Practicalities – Challenges
Our experience reveals that the proportions metric, ∆(D, D′) = Σ_{i=1}^{n} |d_i − d′_i| / N_i, works pretty well when compared with the results from a gold-standard approach, i.e. MCMC. Adopting an SMC-ABC algorithm, one can approximate the true posteriors very well (see the graphs later).
In theory, one could drag the ε_i in the sequence of tolerances close to zero, but sometimes this can be problematic, i.e. the particles degenerate. When do we stop?
In this case there are only two parameters, and it is possible to move the particles in a block (e.g. using a random-walk kernel). What happens when there are more model parameters? Use a Gibbs kernel?
One needs to be very careful in choosing the different values ε_1, . . . , ε_k; a bad selection can lead to a very inefficient algorithm. Of course, the larger k is, the lower the chance of the particles becoming degenerate (making slick moves) . . . but the more costly the algorithm becomes. Borrow results from Simulated Annealing/Parallel Tempering so that we place ε_1, . . . , ε_k efficiently, i.e. in an optimal way? See Behrens, Friel and Hurn (Statistics and Computing, to appear).

Some Results
[Figure: histogram of posterior samples of one of the two parameters (res[, 1]).]

What about temporal data?

Temporal Data
So far we have only considered final outcome data, which allowed us to draw exact Monte Carlo inference for the parameters. When temporal data are available, it is often the case that a model defined in continuous time will be fitted, although this is not always necessary. In this setting, due to the continuity, there are many more things to worry about, e.g. which summary statistics to use. However, in practice, real data are only available in discrete time, i.e. the number of removals per day etc. An alternative approach to the one presented in Cook et al. (2010) would be to involve some sort of data discretization.

Temporal Data – Hagelloch Dataset
[Figure: Hagelloch dataset – number of cases by day of symptom appearance, split into pre-school, class 1 and class 2.]

Discretizing the data
The idea is that, in the step where data are simulated in continuous time (i.e. removal times), we perform some sort of discretization. Obviously, this has to be in accordance with the observed data (days ↔ days, months ↔ months). Distance metrics similar to the ones used for final size data could then be used, and they enjoy similar properties.
Challenges: Bins of equal width? Weighted bins? The effect of the amount of discretization?
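As a sketch of the simplest choice (equal-width daily bins and an L1 distance on the binned counts, mirroring the group-wise final-size metric above; the names are illustrative, not from the talk):

```python
import numpy as np

def daily_counts(removal_times, n_days):
    """Bin continuous removal times into daily counts (equal-width bins),
    so simulated data match the resolution of the observed data."""
    return np.histogram(removal_times, bins=np.arange(n_days + 1))[0]

def temporal_distance(obs_counts, sim_removal_times):
    """L1 distance between observed and simulated daily removal counts."""
    sim_counts = daily_counts(sim_removal_times, len(obs_counts))
    return np.abs(obs_counts - sim_counts).sum()
```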
Piecewise Exact (Approximate) Bayesian Computation: PW-EBC (PW-ABC)
White, K., Preston and Crowe (2011)
For simulation-based methods, computing resources are always a limiting factor, and a guiding principle is to take every opportunity to exploit model structure to minimise computational costs.
For a particular (but fairly broad) class of models, namely those that have the Markov property and whose vector of state variables is observable at discrete time points, we would like to have a (likelihood-free) simulation-based method which takes advantage of the model's Markov structure, requires a minimal amount of tuning, and enables less approximate, or even exact, Bayesian inference.

PW-EBC & PW-ABC
Suppose our data are a set of observations denoted X = {x_1, . . . , x_n} = {x(t_1), . . . , x(t_n)} of the state variable x ∈ R^m at time points t_1, . . . , t_n. The posterior density is
π(θ|X) = π(X|θ)π(θ) / ∫ π(X|θ)π(θ) dθ ∝ π(X|θ)π(θ).
EBC (ER1/ER2) or ABC can be applied in the usual way:
1. Draw θ* from π(θ);
2. Simulate a dataset X* = {x*(t_1), . . . , x*(t_n)} from the model using parameters θ*;
3. Accept θ* if X* = X, otherwise reject; or accept θ* if d(X, X*) ≤ ε, otherwise reject.

Exploiting the Markovian Structure
The Markov property enables the likelihood to be written as
π(X|θ) = [ ∏_{i=2}^{n} π(x_i | x_{i−1}, . . . , x_1, θ) ] π(x_1|θ) = [ ∏_{i=2}^{n} π(x_i | x_{i−1}, θ) ] π(x_1|θ),   (1)
and hence the posterior as
π(θ|X) ∝ π(X|θ)π(θ) ∝ π(θ)^{1−n} π(θ|x_1) ∏_{i=2}^{n} π(θ|x_i, x_{i−1}),   (2)
since π(θ|x_1) ∝ π(x_1|θ)π(θ) and π(θ|x_i, x_{i−1}) ∝ π(x_i|x_{i−1}, θ)π(θ), so the product of these n factors contributes π(X|θ)π(θ)^n. Essentially, the density of the posterior distribution of interest has been decomposed into a product of the densities of the posteriors θ|(x_i, x_{i−1}), for i = 2, . . . , n.
We can estimate π(θ|X) ∝ π(θ)^{1−n} π(θ|x_1) ∏_{i=2}^{n} π(θ|x_i, x_{i−1}) as follows:
PW-EBC (PW-ABC)
1. For i = 2 to n:
(a) apply the EBC (ABC) algorithm to draw exact (approximate) samples from π(θ|x_i, x_{i−1});
(b) calculate the kernel density estimate (KDE) π̂(θ|x_i, x_{i−1}).
2. Calculate an estimate π̂(θ|X) of π(θ|X) by replacing, in the equation above, the densities π(θ|x_i, x_{i−1}) with their corresponding KDEs π̂(θ|x_i, x_{i−1}).

The good news: for a given tolerance ε, the rejection-sampling acceptance probability is substantially higher when drawing samples from π(θ|x_i, x_{i−1}) than from π(θ|X). The method also benefits from being trivial to parallelise: factors can be calculated in parallel, and so can samples within factors.
The not-so-good news: PW-EBC and PW-ABC involve calculating kernel density estimates (KDEs). Sophisticated methods exist for fast calculation (e.g. the Fast Fourier Transform) but can be problematic. For parameters with bounded support (e.g. parameters that can only be positive), to avoid "edge effects" we calculated KDEs for simple transformed versions of the parameters.
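A minimal PW-ABC sketch for a scalar parameter, evaluating the estimated posterior pointwise on a grid. It assumes we condition on x_1 (so π(θ|x_1) = π(θ) and the prior power becomes 2 − n), uses a plain Gaussian KDE, and omits the edge-effect transformation mentioned above; all names are illustrative, and `log_prior` is assumed vectorised over the grid.

```python
import numpy as np
from scipy.stats import gaussian_kde

def pw_abc(x, simulate_step, prior_rvs, log_prior, theta_grid,
           eps, n_acc=500, rng=None):
    """PW-ABC for Markov observations x = (x_1, ..., x_n):
    pi(theta|X) ∝ pi(theta)^(2-n) * prod_{i=2}^n pi(theta | x_i, x_{i-1}),
    with each pairwise posterior sampled by ABC rejection and replaced
    by its kernel density estimate."""
    rng = rng or np.random.default_rng()
    n = len(x)
    log_post = (2 - n) * log_prior(theta_grid)
    for i in range(1, n):                 # one ABC run per transition
        samples = []
        while len(samples) < n_acc:
            th = prior_rvs(rng)
            if abs(simulate_step(x[i - 1], th, rng) - x[i]) <= eps:
                samples.append(th)
        log_post += np.log(gaussian_kde(samples)(theta_grid))
    log_post -= log_post.max()            # stabilise before exponentiating
    post = np.exp(log_post)
    return post / np.trapz(post, theta_grid)  # normalised on the grid
```

The n − 1 factors are mutually independent, so the loop parallelises trivially, as noted above.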
Application to a Stochastic Lotka-Volterra Model
[Figure: simulated prey and predator population trajectories together with their discrete-time observations.]

LV Model: An Example
[Figure: estimated posterior densities of the scaled, logit-transformed parameters: (a) prey rate r_1, (b) interaction rate r_2, (c) predator rate r_3.]

Connection with Epidemic Models
Some of the most commonly used epidemic models exhibit a Markovian structure (e.g. SI, SIR etc.), and therefore such an approach seems suitable → see TJ's talk later in the week. Note that we are not drawing samples from the posterior; instead, we calculate it pointwise. One can incorporate a sequential sampling-based framework similar to those proposed by Chopin (2002) and Del Moral et al. (2006).

Conclusions and Some Thoughts
MCMC methods for stochastic epidemic models are considered the "gold standard". Nevertheless, they can be computationally expensive, and non-standard algorithms may be needed for efficiency (see, for example, Neal and Roberts (2005), K (2007)).
On the other hand, exact Monte Carlo (non-MCMC) inference for (some) final size data from stochastic epidemic models is possible.
The greatest advantage of EBC/ABC-type methods is that they are much, much easier to implement; on the other hand, careful tuning and care are required in most cases.
It would be great if there were a way to quantify the error of approximation in ABC, or in other words to know how far we are from the true posterior . . .

Acknowledgments
Final size data: Frank Ball (Nottingham), Nikos Demiris (Athens).
Temporal data: Phil O'Neill (Nottingham), Eleni Elia (Nottingham).
PW-EBC/ABC: Simon White (BSU-MRC, Cambridge), Simon Preston (Nottingham).