More of the SAME? Sequential and Pseudomarginal Monte Carlo for Point Estimation in Latent Variable Models

Adam M. Johansen
Collaborators: Manuel Davy, Arnaud Doucet and Axel Finke
a.m.johansen@warwick.ac.uk
Imperial College — 6th February, 2014

Outline
- Background: Marginal MLEs; SAME: An MCMC Scheme
- Sequential Monte Carlo: The SMC Method; A Population-Based SAME Method; Examples
- Pseudomarginal Methods: The Pseudomarginal Method; More of the SAME: multiple extensions of the space; Example; Even more of the SAME: complex extensions of the space; Examples

Background

Maximum {Likelihood|a Posteriori} Estimation
Consider a model with: parameters, θ; latent variables, x; and observed data, y. Aim to maximise the marginal likelihood
    p(y|θ) = ∫ p(x, y|θ) dx
or the posterior
    p(θ|y) ∝ ∫ p(x, y|θ) p(θ) dx.
The traditional approach is Expectation-Maximisation (EM):
- Requires the objective function in closed form.
- Susceptible to trapping in local optima.

A Probabilistic Approach
Optimisation and probability: simulated annealing. A distribution of the form
    π_γ(θ|y) ∝ p(θ) p(y|θ)^γ
will become concentrated, as γ → ∞, on the maximisers of p(y|θ) under weak conditions. Why not target π_γ(θ|y) using MCMC?

Adapted from (Hwang, 1980; Theorem 2.1). Assume:
- p(θ) and p(y|θ) are α-Lipschitz continuous in θ;
- log p(θ) ∈ C³(Rⁿ) and log p(y|θ) ∈ C³(Rⁿ);
- Θ_ML is a non-empty, countable set which is nowhere dense;
- p(θ) ≤ M < ∞ and p(θ) > 0 for all θ ∈ Θ_ML;
- p(y|θ) ≤ M′ < ∞;
- for some k < sup_θ p(y|θ), the set {θ : p(y|θ) ≥ k} is compact.
Then:
    lim_{γ→∞} π_γ(dθ) ∝ Σ_{θ_ml ∈ Θ_ML} α(θ_ml) δ_{θ_ml}(dθ),    (1)
    α(θ_ml) = det[ −∂² log p(y|θ) / ∂θ_m ∂θ_n |_{θ=θ_ml} ]^{−1/2}.    (2)

State Augmentation for Maximisation of Expectations
Data augmentation: synthetic distributions of the form
    π̄_γ(θ, x_{1:γ}|y) ∝ p(θ) ∏_{i=1}^{γ} p(x_i, y|θ)
admit the marginals π̄_γ(θ|y) ∝ p(θ) p(y|θ)^γ.
SAME Algorithm (Doucet, Godsill and Robert, 2002):
- t = 0: initialise (θ₀, X_{0,1}) arbitrarily.
- For t = 1, …, T:
  - If γ(t) > γ(t−1): set (X_{t−1,γ(t−1)+1}, …, X_{t−1,γ(t)}) arbitrarily.
  - Sample (θ_t, X_{t,1}, …, X_{t,γ(t)}) ∼ K_{γ(t)}((θ_{t−1}, X_{t−1,1}, …, X_{t−1,γ(t)}), ·), where K_γ is π̄_γ-invariant.
NB: this is an inhomogeneous Markov chain.

Sequential Monte Carlo

SMC: A Motivating Example — Filtering
Let X₁, X₂, … denote the position of an object which follows Markovian dynamics, and let Y₁, Y₂, … denote a collection of observations: Y_i | {X_i = x_i} ∼ g(·|x_i). We wish to estimate, as observations arrive, p(x_{1:t}|y_{1:t}). A recursion obtained from Bayes' rule exists but is intractable in most cases.

[Figure: dependency structure of the state-space model, x_{1:6} and y_{1:6}.]

More Generally
We are really tracking a sequence of distributions, p_t, …, on increasing state spaces. Other problems with the same structure exist, and any problem of sequentially approximating such a sequence of distributions, p_t, can be addressed in the same way.

Sequential Importance Resampling
At time t, t ≥ 2 (given {X^(i)_{1:t−1}}_{i=1}^{N} approximating p_{t−1}(x_{1:t−1})):
Sampling step: for i = 1, …, N, sample X^(i)_t ∼ q_t(· | X^(i)_{1:t−1}).
Resampling step: for i = 1, …, N, compute
    w_t(X^(i)_{1:t}) = p_t(X^(i)_{1:t}) / [ p_{t−1}(X^(i)_{1:t−1}) q_t(X^(i)_t | X^(i)_{1:t−1}) ]
and
    W^(i)_t = w_t(X^(i)_{1:t}) / Σ_{j=1}^{N} w_t(X^(j)_{1:t}).
Then, for i = 1, …, N, sample A^(i)_t ∼ Σ_{j=1}^{N} W^(j)_t δ_j and retain {X^(A^(i)_t)_{1:t}}_{i=1}^{N}.

[Figures: evolution of the particle approximation over iterations 2–10.]

SMC Samplers (Del Moral et al., 2006)
These can be used to sample from any sequence of distributions. Given a sequence of target distributions, η_n, on spaces E_n,
construct a synthetic sequence η̃_n on the product spaces E_1 × ⋯ × E_n, n = 1, …, N, by introducing Markov kernels L_p from E_{p+1} to E_p:
    η̃_n(x_{1:n}) = η_n(x_n) ∏_{p=1}^{n−1} L_p(x_{p+1}, x_p).
These distributions have the target distributions as final-time marginals, and have the correct structure to employ SMC techniques.

SMC Outline
Given a sample {X^(i)_{1:n−1}}_{i=1}^{N} targeting η̃_{n−1}:
- sample X^(i)_n ∼ K_n(X^(i)_{n−1}, ·);
- calculate
    W_n(X^(i)_{1:n}) = η_n(X^(i)_n) L_{n−1}(X^(i)_n, X^(i)_{n−1}) / [ η_{n−1}(X^(i)_{n−1}) K_n(X^(i)_{n−1}, X^(i)_n) ];
- resample, yielding {X^(i)_{1:n}}_{i=1}^{N} targeting η̃_n.
This hints that we would like to use
    L_{n−1}(x_n, x_{n−1}) = η_{n−1}(x_{n−1}) K_n(x_{n−1}, x_n) / ∫ η_{n−1}(x′_{n−1}) K_n(x′_{n−1}, x_n) dx′_{n−1}.

Recall: Maximum {Likelihood|a Posteriori} Estimation
A model with parameters, θ, latent variables, x, and observed data, y. Aim to maximise the marginal likelihood
    p(y|θ) = ∫ p(x, y|θ) dx
or the posterior
    p(θ|y) ∝ ∫ p(x, y|θ) p(θ) dx,
using
    π̄_γ(θ, x_{1:γ}|y) ∝ p(θ) ∏_{i=1}^{γ} p(x_i, y|θ).

Maximum Likelihood via SMC
Use a sequence of distributions η_n = π̄_{γ_n} for some schedule {γ_n}.
The MCMC approach (Doucet et al., 2002):
- requires slow "annealing", as the separation between successive distributions is large;
- mixes poorly as γ increases.
Using SMC has some substantial advantages:
- introducing bridging distributions, for γ = ⌊γ⌋ + ⟨γ⟩, of the form
    π̄_γ(θ, x_{1:⌊γ⌋+1}|y) ∝ p(θ) p(x_{⌊γ⌋+1}, y|θ)^⟨γ⟩ ∏_{i=1}^{⌊γ⌋} p(x_i, y|θ)
  is straightforward;
- the population of samples improves robustness;
- it is less dependent upon the mixing of K_γ.

Algorithms
A generic SMC sampler can be written down directly.
An easy case:
- Sample from p(x_t|y, θ_{t−1}) and p(θ_t|x_t, y).
- Weight according to p(y|θ_{t−1})^{γ_t − γ_{t−1}}.
The general case:
- Sample existing variables from a π_t-invariant kernel: (θ_t, X_{t,1:γ_{t−1}}) ∼ K_t((θ_{t−1}, X_{t−1}), ·).
- Sample new variables from an arbitrary proposal: X_{t,⌈γ_{t−1}⌉+1:⌈γ_t⌉} ∼ q(·|θ_t).
- Use a combination of the time-reversal and optimal auxiliary kernels.
- The weight expression does not involve the marginal likelihood.

An SMC-Based SAME Algorithm (sorry!)
Initialisation, t = 1:
- sample {(θ₁^(i), X₁^(i))}_{i=1}^{N} iid ∼ ν;
- calculate W₁^(i) ∝ π_{γ₁}(θ₁^(i), X₁^(i)) / ν(θ₁^(i), X₁^(i)), normalised so that Σ_{i=1}^{N} W₁^(i) = 1.
For t = 2 to T:
- resample;
- sample (θ_t^(i), X^(i)_{t,1:⌈γ_{t−1}⌉}) ∼ K_{t−1}((θ^(i)_{t−1}, X^(i)_{t−1}); ·);
- sample X^(i)_{t,j} ∼ q(·|θ_t^(i)) for j = ⌈γ_{t−1}⌉+1, …, ⌊γ_t⌋, if ⌈γ_{t−1}⌉ < ⌊γ_t⌋;
- sample X^(i)_{t,⌈γ_t⌉} ∼ q_⟨γ_t⟩(·|θ_t^(i)) if ⌈γ_{t−1}⌉ < ⌈γ_t⌉ ≠ γ_t;
- calculate, for i = 1, …, N,
    W_t^(i) ∝ [ p(y, X_{t,⌈γ_{t−1}⌉}|θ_t)^{1∧(γ_t−⌊γ_{t−1}⌋)} / p(y, X_{t,⌈γ_{t−1}⌉}|θ_t)^{⟨γ_{t−1}⟩} ]
              × ∏_{j=⌈γ_{t−1}⌉+1}^{⌊γ_t⌋} [ p(y, X_{t,j}|θ_t) / q(X_{t,j}|θ_t) ]
              × [ p(y, X_{t,⌈γ_t⌉}|θ_t)^{⟨γ_t⟩} / q_⟨γ_t⟩(X_{t,⌈γ_t⌉}|θ_t) ]^I,
  with I = I(⌈γ_t⌉ > ⌊γ_t⌋ ≥ ⌈γ_{t−1}⌉).

Toy Example (using known marginal likelihood)
Student t-distribution of unknown location parameter θ, with ν = 0.05. Four observations are available: y = (−20, 1, 2, 3). The log-likelihood is
    log p(y|θ) = −0.525 Σ_{i=1}^{4} log(0.05 + (y_i − θ)²).
The global maximum is at 1.997; local maxima lie at {−19.993, 1.086, 2.906}. The complete log-likelihood (with X_i ∼ Gamma) is
    log p(y, x|θ) = −Σ_{i=1}^{4} [ 0.475 log x_i + 0.025 x_i + 0.5 x_i (y_i − θ)² ].

[Figure: log marginal likelihood of the toy example as a function of θ.]

[Figure: SMC method using Gibbs kernels on the toy example, for T = 15, 30, 60.]

Example: Gaussian Mixture Model – MAP Estimation
Likelihood: p(y|x, ω, µ, σ) = N(y|µ_x, σ_x²).
Marginal likelihood: p(y|ω, µ, σ) = Σ_{j=1}^{3} ω_j N(y|µ_j, σ_j²).
Diffuse conjugate priors were employed:
    ω ∼ Dir(δ),  σ_i² ∼ IG((λ_i + 3)/2, β_i/2),  µ_i|σ_i² ∼ N(α_i, σ_i²/λ_i).
All full conditional distributions of interest are available, and the marginal posterior can be calculated.
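Since the marginal likelihood of the Student-t toy example above is available in closed form, the quoted maximisers can be checked directly. A minimal Python sketch (the brute-force grid search is purely illustrative and is not part of the SMC scheme):

```python
# Closed-form marginal log-likelihood of the Student-t toy example:
#   log p(y|theta) = -0.525 * sum_i log(0.05 + (y_i - theta)^2)
import math

Y = (-20.0, 1.0, 2.0, 3.0)

def log_marginal(theta):
    return -0.525 * sum(math.log(0.05 + (y - theta) ** 2) for y in Y)

# A fine grid over theta locates the global maximiser; the three smaller
# modes can be seen by evaluating log_marginal near the quoted points.
grid = [-30.0 + 0.001 * k for k in range(60001)]
best = max(grid, key=log_marginal)
print(best)  # the global maximiser, close to 1.997
```

Evaluating `log_marginal` at the quoted local maxima confirms that the mode near −20, induced by the outlying observation, is well separated from and much lower than the global one, which is what makes this a useful test of trapping.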
3-Component GMM (Roeder's galaxy data set)

[Figure: log marginal posterior attained against computational cost for the SMC, SAME and EM algorithms.]

Pseudomarginal Monte Carlo

The Pseudomarginal Method
Marginal MH acceptance probability:
    1 ∧ [ π(θ′) Q(θ′, θ) ] / [ π(θ) Q(θ, θ′) ].
But π(θ) isn't tractable; how about using
    1 ∧ [ π̂(θ′) Q(θ′, θ) ] / [ π̂(θ) Q(θ, θ′) ],
where
    π̂(θ) = (1/m) Σ_{i=1}^{m} π(θ, X_i)/q(X_i),  X_i iid ∼ q?
This suggests two algorithms (Beaumont, 2003):
- Monte Carlo within Metropolis (MCWM);
- Grouped Independence Metropolis-Hastings (GIMH).

Pseudomarginal Methods: GIMH is "Exact"
Extended target (Andrieu & Roberts, 2009):
    π̃(θ, x_1, …, x_m) = (1/m) Σ_{j=1}^{m} π(θ, x_j) ∏_{k≠j} q(x_k)
                       = (1/m) Σ_{j=1}^{m} [ π(θ, x_j)/q(x_j) ] ∏_{k=1}^{m} q(x_k)
                       = π̂(θ) ∏_{k=1}^{m} q(x_k).
With the x′_j proposed iid from q, the acceptance probability becomes
    1 ∧ [ π̃(θ′, x′_1, …, x′_m) Q(θ′, θ) ∏_{j=1}^{m} q(x_j) ] / [ π̃(θ, x_1, …, x_m) Q(θ, θ′) ∏_{j=1}^{m} q(x′_j) ]
      = 1 ∧ [ π̂(θ′) Q(θ′, θ) ] / [ π̂(θ) Q(θ, θ′) ].
NB: MCWM is not exact… but perhaps we don't care.

A Pseudomarginal SAME Algorithm
We'd like to target π_γ(θ|y) ∝ p(θ) p(y|θ)^γ. Why not use the pseudomarginal approach, considering instead
    π̃_γ(θ, x¹_{1:m}, …, x^γ_{1:m}) = p(θ) ∏_{i=1}^{γ} [ (1/m) Σ_{j=1}^{m} p(x^i_j, y|θ)/q(x^i_j|θ) ] ∏_{k=1}^{m} q(x^i_k|θ)?
We expect behaviour like simulated annealing for large m.

The Student t Model Revisited

[Figure: boxplots of θ summarising 200 runs of simulated annealing, SAME, GIMH and MCWM (with and without the prior), for N = 5, 50, 100.]

What about complicated latent variable structures?
Actually, pseudomarginal algorithms are more flexible.
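The property underpinning the pseudomarginal construction is that π̂(θ) is an unbiased importance-sampling estimate of the marginal, so integrating out the auxiliary variables from π̃ recovers the intended target. This can be checked numerically on a toy Gaussian latent-variable model (illustrative only; the model, proposal and constants below are assumptions, not taken from the slides):

```python
# Check that the pseudomarginal estimator p_hat is unbiased:
# with latent X ~ N(theta, 1) and observation y | X ~ N(X, 1),
# the exact marginal is p(y|theta) = N(y; theta, 2).
import math
import random

random.seed(1)

def norm_pdf(z, mean, var):
    return math.exp(-(z - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def p_hat(theta, y, m, q_mean=0.0, q_var=4.0):
    """(1/m) * sum_j p(x_j, y | theta) / q(x_j), with x_j drawn iid from q."""
    total = 0.0
    for _ in range(m):
        x = random.gauss(q_mean, math.sqrt(q_var))
        total += norm_pdf(x, theta, 1.0) * norm_pdf(y, x, 1.0) / norm_pdf(x, q_mean, q_var)
    return total / m

theta, y = 0.5, 1.0
exact = norm_pdf(y, theta, 2.0)
# Averaging many noisy low-m estimates recovers the exact marginal;
# this unbiasedness is precisely what makes GIMH exact even for small m.
estimate = sum(p_hat(theta, y, 10) for _ in range(20000)) / 20000
print(exact, estimate)
```

Increasing m reduces the variance of each estimate but never changes its expectation, which is why the GIMH chain targets the correct marginal for any m ≥ 1.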
We're especially interested in particle MCMC implementations (Andrieu et al., 2010):
- Particle Marginal Metropolis-Hastings (PMMH);
- the MCWM variant of PMMH;
- Particle Gibbs (with ancestor sampling).
State-space models are the real motivation for this methodology, but many other complex models could be addressed using this technique.

Linear Gaussian Hidden Markov Model
Model:
    X_t = A X_{t−1} + B U_t,
    Y_t = X_t + D V_t.
Data: 50 observations simulated using A = 0.9, B = 1 and D = 1.
Algorithms:
- The PMMH/MCWM algorithms use N = 250 particles.
- The PG algorithm (with ancestor sampling) uses N = 50 particles but attempts 100 static parameter updates per iteration.
- The inverse temperature increases linearly from 0.1 to 10, with γ_t = 10 for the final 1000 iterations.
- Compare with an exact marginal simulated annealing algorithm.

[Figure: traces of A, B and D over 3000 iterations for PMMH (N = 250), MCWM (N = 250), Particle Gibbs (N = 50) and simulated annealing, with the true parameters and MLEs marked.]

Summary of 100 Runs

[Figure: boxplots of the estimates of A, B and D over 100 runs of PMMH (N = 250), MCWM (N = 250), Particle Gibbs (N = 50) and simulated annealing.]
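PMMH replaces π̂(θ) with the unbiased likelihood estimate produced by a particle filter. For the linear Gaussian model above, that estimate can be checked against the exact likelihood from a Kalman filter. A sketch assuming standard normal noise U_t, V_t and a deterministic initial state X₀ = 0 (neither is stated explicitly on the slide):

```python
# Bootstrap particle filter likelihood estimate for
#   X_t = A X_{t-1} + B U_t,  Y_t = X_t + D V_t,
# compared against the exact Kalman-filter log-likelihood.
import math
import random

random.seed(0)
A, B, D = 0.9, 1.0, 1.0

# Simulate 50 observations from the model, starting from X_0 = 0.
x, ys = 0.0, []
for _ in range(50):
    x = A * x + B * random.gauss(0, 1)
    ys.append(x + D * random.gauss(0, 1))

def kalman_loglik(ys):
    """Exact log p(y_{1:T}) via the Kalman filter."""
    m, P, ll = 0.0, 0.0, 0.0
    for y in ys:
        m_pred, P_pred = A * m, A * A * P + B * B        # predict
        S = P_pred + D * D                               # innovation variance
        ll += -0.5 * (math.log(2 * math.pi * S) + (y - m_pred) ** 2 / S)
        K = P_pred / S                                   # update
        m, P = m_pred + K * (y - m_pred), (1 - K) * P_pred
    return ll

def pf_loglik(ys, N=2000):
    """Unbiased likelihood estimate from a bootstrap particle filter."""
    particles = [0.0] * N
    ll = 0.0
    for y in ys:
        particles = [A * p + B * random.gauss(0, 1) for p in particles]
        ws = [math.exp(-0.5 * (y - p) ** 2 / D ** 2) / math.sqrt(2 * math.pi * D ** 2)
              for p in particles]
        ll += math.log(sum(ws) / N)                      # incremental likelihood
        particles = random.choices(particles, weights=ws, k=N)  # multinomial resampling
    return ll

print(kalman_loglik(ys), pf_loglik(ys))
```

Plugging `pf_loglik` into a Metropolis-Hastings update over (A, B, D), in place of the intractable marginal, gives a basic PMMH chain; the slides' algorithms additionally anneal the target as γ_t increases.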
A Simple Stochastic Volatility Model
Model:
    X_i = α + δ X_{i−1} + σ_u u_i,
    Y_i = ε_i exp(X_i/2),
    X_1 ∼ N(µ₀, σ₀²),
where u_i and ε_i are uncorrelated standard normal random variables, and θ = (α, δ, σ_u).
200 observations were simulated with δ = 0.95, α = −0.363 and σ_u = 0.26.
Diffuse instrumental prior distributions,
    δ ∼ U(−1, 1),  α ∼ N(0, 1),  σ⁻² ∼ Ga(1, 0.1),
are quickly forgotten. The inverse temperature increases linearly from 0.1 to 10, with γ_t = 10 for the final 500 iterations. A more complex multi-factor model is also under investigation.

[Figure: traces of α, δ and σ² over 3000 iterations for PMMH (N = 250), MCWM (N = 250) and Particle Gibbs (N = 50), with the true parameter values marked.]

In Conclusion
- Monte Carlo isn't just for calculating posterior expectations.
- SMC and pseudomarginal methods are effective for ML and MAP estimation.
- Still work in progress: there is scope for embedding the pseudomarginal target within an SMC algorithm, and for adaptation.

References
[1] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics, 37(2):697–725, 2009.
[2] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo. Journal of the Royal Statistical Society B, 72(3):269–342, 2010.
[3] M. Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160, 2003.
[4] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers.
Journal of the Royal Statistical Society B, 68(3):411–436, 2006.
[5] A. Doucet, S. J. Godsill, and C. P. Robert. Marginal maximum a posteriori estimation using Markov chain Monte Carlo. Statistics and Computing, 12:77–84, 2002.
[6] A. Finke. On Extended State-Space Constructions for Monte Carlo Methods. Ph.D. thesis, University of Warwick, 2015. In preparation.
[7] A. Finke and A. M. Johansen. More of the SAME? Pseudomarginal methods for point estimation in latent variable models. In preparation, 2015.
[8] C.-R. Hwang. Laplace's method revisited: Weak convergence of probability measures. Annals of Probability, 8(6):1177–1182, December 1980.
[9] A. M. Johansen, A. Doucet, and M. Davy. Maximum likelihood parameter estimation for latent variable models using sequential Monte Carlo. In Proceedings of ICASSP, volume III, pages 640–643. IEEE, May 2006.
[10] A. M. Johansen, A. Doucet, and M. Davy. Particle methods for maximum likelihood parameter estimation in latent variable models. Statistics and Computing, 18(1):47–57, March 2008.