Beyond MCMC: two case studies and a few thoughts

Nicolas Chopin (joint work with Tony Lelièvre, Gabriel Stoltz, Judith Rousseau, Brunero Liseo)
nicolas.chopin@ensae.fr
ENSAE-CREST, Malakoff, FRANCE
EPSRC MCMC Symp'09

Limitations of MCMC

- slow: error is $O(N^{-1/2})$, with correlation between successive steps.
- local exploration: how to assess convergence if the posterior is not unimodal?
- tedious to tune ⇒ these algorithms are not robust.
- bugs are difficult to detect.

Alternatives

- fast approximations: variational Bayes, Expectation-Propagation, INLA (see Håvard's talk).
- adaptive MCMC; adaptive biasing, e.g. Wang-Landau.
- Sequential Monte Carlo, see e.g. Doucet et al. (2006).

First case study

Gaussian mixture model:
\[
p(y_i \mid \theta) = \sum_{k=1}^{K} q_k \sqrt{\frac{v_k}{2\pi}} \exp\left\{-\frac{v_k}{2}(y_i - \mu_k)^2\right\}
\]
\[
\mu_1, \dots, \mu_K \sim N(m, s^2), \quad v_1, \dots, v_K \sim G(a, b), \quad q_1, \dots, q_K \sim \mathrm{Dir}(1, \dots, 1)
\]
or equivalently
\[
q_k = \omega_k / (\omega_1 + \dots + \omega_K), \quad \omega_1, \dots, \omega_K \sim \mathrm{Exp}(1).
\]

Hidalgo stamps dataset

[Figure: histogram of the Hidalgo stamps data.]

Quotes

Jasra et al. (2005): "We feel that the Gibbs sampler run with completion is often not worth programming [...] since the chance of it failing to converge is too high."

Celeux et al. (2000): "Although somewhat presumptuous, we consider that almost the entirety of MCMC implemented for mixture models has failed to converge."

Molecular simulation

Concerns the simulation of some continuous-time dynamics, e.g.
\[
dX_t = -\nabla V(X_t)\, dt + (\beta/2)^{-1/2}\, dW_t
\]
with associated Boltzmann-Gibbs density:
\[
p(x) \propto \exp\{-\beta V(x)\}.
\]
(In practice, approximated by long runs of random-walk Hastings-Metropolis.)

Metastability

[Figure: a metastable potential, plotted against the reaction coordinate ξ.]

Adaptive biased sampling

Aim is to sample from the biased potential
\[
\widetilde{V} = V - F(\xi),
\]
where $F$ is the free energy, i.e. the density
\[
\widetilde{p}(x) \propto \exp\{-\beta \widetilde{V}(x)\}
\]
is such that the marginal distribution of $\xi$ is uniform.

At iteration $t$, perform a Hastings-Metropolis move w.r.t.
\[
\widetilde{V}_t = V - F_t(\xi)
\]
and adjust $F_t$ on the fly.

Wang-Landau (ABP)

At each iteration $t$, compute, for $\xi$ in a small bin $[e_k, e_{k+1}]$,
\[
F_t(\xi) = \frac{1}{N} \sum_{i=1}^{n} \mathbb{1}\{\xi(x_i) \in [e_k, e_{k+1}]\},
\]
i.e. progressively penalise already visited regions; see e.g. Atchadé and Liu (2004).

Adaptive Biasing Force (ABF)

Progressively adapt the 'force' $\partial F / \partial \xi$, using
\[
E_{\widetilde{V}}\left[\frac{\partial V}{\partial \xi} - \frac{\partial F}{\partial \xi}\right] = 0,
\]
i.e. for $\xi \in [e_k, e_{k+1}]$,
\[
\frac{\partial F_t}{\partial \xi}(\xi) = \frac{1}{N} \sum_{i=1}^{n} \frac{\partial V}{\partial \xi}(x_i)\, \mathbb{1}\{\xi(x_i) \in [e_k, e_{k+1}]\}.
\]
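To make the ABP update concrete, below is a minimal sketch (Python, not the authors' code) of Wang-Landau-style adaptive biasing on a toy double-well potential, with $\xi(x) = x$. The potential, inverse temperature, bin grid, proposal scale and bias increment are all illustrative assumptions; the sign convention for $F_t$ is chosen so that, as stated above, already visited regions are progressively penalised.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy double-well potential; here the reaction coordinate is simply
# xi(x) = x. Both the potential and all tuning constants below are
# illustrative assumptions, not the settings used in the talk.
def V(x):
    return (x**2 - 1.0)**2

beta = 5.0                            # inverse temperature
step = 0.5                            # random-walk proposal scale
gamma = 0.01                          # bias added per visit to a bin
edges = np.linspace(-2.0, 2.0, 41)    # bins [e_k, e_{k+1}] for xi
penalty = np.zeros(len(edges) - 1)    # accumulated occupation penalty
visits = np.zeros_like(penalty)

def bin_of(x):
    # Index k such that xi(x) falls in [e_k, e_{k+1}] (clipped at the ends).
    return int(np.clip(np.searchsorted(edges, x) - 1, 0, len(penalty) - 1))

def V_tilde(x):
    # Biased potential V_t = V - F_t(xi); F_t is stored as minus the
    # occupation penalty, so already visited bins are pushed down.
    return V(x) + penalty[bin_of(x)]

x = 0.0
for t in range(100_000):
    y = x + step * rng.standard_normal()
    # Hastings-Metropolis accept/reject w.r.t. exp(-beta * V_t)
    if np.log(rng.random()) < -beta * (V_tilde(y) - V_tilde(x)):
        x = y
    k = bin_of(x)
    penalty[k] += gamma               # adjust F_t on the fly
    visits[k] += 1

F_hat = penalty.max() - penalty       # free-energy estimate, up to a constant
print("occupation ratio (min/max):", visits.min() / visits.max())
```

As the bias builds up, the min/max bin-occupation ratio should move towards 1: the chain crosses the energy barrier and the marginal of ξ flattens, which is the metastability cure the biasing is designed to provide.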
Back to mixture problems

Our findings:
1. ABF outperforms ABP.
2. 'temperature' is not necessarily the best ξ; certainly not the easiest to interpret.
3. it seems much easier to find 'stable' directions, i.e. all symmetric functions of θ are 'stable'.
4. but be careful with metaphors.

ξ = q_1

1. not symmetric;
2. constrained to [0, 1];
3. forcing $q_1$ close to one empties the other components, and helps component switching, in a symmetric way;
4. forcing $q_1$ close to zero: more later.

Hidalgo, K = 7 (I)

[Figure: trace of the μ_i over 10^7 iterations.]

Hidalgo, K = 7 (II)

[Figure: trace of the q_i over 10^7 iterations.]

Hidalgo, K = 7 (III)

[Figure: trace of the biased log-density over 10^7 iterations.]

Hidalgo, K = 7 (IV)

[Figure: estimated free energy (bias) as a function of ξ.]

Hidalgo, K = 3 (I)

[Figure: trace of the μ_i over 10^8 iterations.]

Hidalgo, K = 3 (II)

[Figure: trace of the biased log-density over 10^8 iterations.]

Hidalgo, K = 3 (III)

[Figure: estimated bias as a function of ξ.]

ξ = β

In the prior for the inverse-variances:
\[
v_i \sim G(a, \beta), \quad \beta \sim G(g, h)
\]
1. symmetric;
2. constrained to $[0, 10b]$;
3. large values of β penalise small variances.

Hidalgo, K = 3, ξ = β

[Figure: trace of β over 10^9 iterations.]

Visiting q_i ≈ 0 regions

This allows us to estimate $p(y|K)/p(y|K-1)$:
\[
\frac{p(y|K)}{\widetilde{p}(y)} = E_{\widetilde{V}}\left[\exp\{F(\xi)\}\right],
\]
\[
\frac{p(y|K-1)}{\widetilde{p}(y)} = \frac{1}{K} \sum_{i=1}^{K} E_{\widetilde{V}}\left[\frac{p(y \mid \theta[q_i = 0])}{p(y \mid \theta)} \exp\{F(\xi)\}\right],
\]
using importance sampling, or reversible jump (with birth-death steps).

Second case study

Nonparametric inference from long-memory Gaussian processes:
\[
Y \mid f \sim N(0, T(f)), \quad T(f)[k, l] = \gamma(k - l) = \int_{-\pi}^{\pi} e^{i(k-l)\omega} f(\omega)\, d\omega
\]
with a nonparametric prior for $f$, e.g. (Rousseau et al., 2009)
\[
f(\omega) = \left|1 - e^{i\omega}\right|^{-2d} \exp\left\{\sum_{j=0}^{K} \theta_j \cos(j\omega)\right\}
\]
with $d \sim U(0, 1/2)$, $\theta_j \sim N(0, 1/j)$, $K \sim \mathrm{Poisson}(1)$.

Computational difficulties
\[
p(y \mid f) = (2\pi)^{-n/2}\, |T(f)|^{-1/2} \exp\left\{-\frac{1}{2} y^T T(f)^{-1} y\right\}
\]
where $|T(f)|$ is easy to approximate, and
\[
T(f)^{-1} \approx T(g), \quad g = \frac{1}{4\pi^2 f},
\]
but we still have $n$ integrals to compute, for $k = 1, \dots, n$:
\[
\int_{-\pi}^{\pi} e^{ik\omega} g(\omega)\, d\omega.
\]

Interpolation

If $g$ is replaced by a linear interpolation, all the integrals can be computed exactly using one FFT of the points $g(\omega_k)$, $k = 1, \dots, M$; the number of bins $M$ must be ≥ 4N. Cost is $O(M)$.

Sequential Monte Carlo

Consider the 'intermediate' likelihood functions $p_m(y|f)$ corresponding to
\[
T_m[k, l] = \begin{cases} T(g)[k, l] & \text{if } |k - l| \le 2^m \\ 0 & \text{otherwise} \end{cases}
\]
then apply SMC to the sequence
\[
p_t(f) \propto p(f)\, p_m(y|f)^{1 - r/a}\, p_{m+1}(y|f)^{r/a}
\]
for $t = am + r$, $r \in \{0, \dots, a-1\}$.
(Additional benefit: use the divide-and-conquer structure of the FFT.)

SMC steps

0. Draw $f_j \sim p(f)$, $j = 1, \dots, N$. Set $t = 0$, $w_j = 1$.
1. Re-weight:
\[
w_j \leftarrow w_j\, \frac{p_{t+1}(f_j)}{p_t(f_j)}.
\]
2. If Variance(weights) > γ:
   (a) resample;
   (b) apply an MCMC step w.r.t. $p_{t+1}$.
3. $t \leftarrow t + 1$. Go to 1.
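As a concrete illustration of steps 0-3, here is a minimal generic SMC sketch (Python, illustrative only): it substitutes a toy Gaussian prior and a hypothetical tempered likelihood for the banded-covariance sequence $p_m(y|f)$, uses multinomial resampling, and takes a random-walk move whose scale is tuned to the current particle cloud (anticipating the next slide). The threshold γ and all numerical settings are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for the slide's ingredients: a N(0,1) prior p(f) and a
# hypothetical Gaussian likelihood, bridged by a tempering schedule.
def log_prior(f):
    return -0.5 * f**2

def log_lik(f):
    return -0.5 * ((f - 3.0) / 0.5)**2

T = 50
lam = np.linspace(0.0, 1.0, T + 1)    # likelihood exponents along the path

def log_pt(f, l):
    return log_prior(f) + l * log_lik(f)

N = 1000
f = rng.standard_normal(N)            # step 0: f_j ~ p(f), w_j = 1
logw = np.zeros(N)
gamma = 1.0                           # weight-variance threshold

for t in range(T):
    # step 1: re-weight by p_{t+1}(f_j) / p_t(f_j)
    logw += log_pt(f, lam[t + 1]) - log_pt(f, lam[t])
    w = np.exp(logw - logw.max()); w /= w.sum()
    # step 2: resample-move when Variance(weights) exceeds gamma
    if np.var(N * w) > gamma:
        f = f[rng.choice(N, size=N, p=w)]   # (a) multinomial resampling
        logw[:] = 0.0
        # (b) one random-walk Metropolis step w.r.t. p_{t+1}, with the
        # proposal scale tuned to the current particle system
        prop = f + np.std(f) * rng.standard_normal(N)
        accept = (np.log(rng.random(N))
                  < log_pt(prop, lam[t + 1]) - log_pt(f, lam[t + 1]))
        f = np.where(accept, prop, f)

w = np.exp(logw - logw.max()); w /= w.sum()
print("estimated posterior mean:", np.sum(w * f))
```

Note how the adaptation comes for free: the proposal scale is read off the weighted particle cloud at each step, which is exactly why tuning proposals is so much easier in SMC than in a single MCMC run.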
Automatic tuning of MCMC

- Conditional on $K$, random-walk HM steps, with proposal covariance tuned to the current particle system.
- To move $K$, use reversible jump with Green (2003, HSS book)'s proposal.
- Or just use positive discrimination (Chopin, 2007), i.e. Wang-Landau biasing w.r.t. $K$.

A simulated example

[Figure: the simulated series y, and the posterior of d.]

A simulated example (II)

[Figure: Effective Sample Size along the iterations, for a = 4.]

A simulated example (III)

[Figure: posterior of K, without (left) and with (right) reversible jump.]

Conclusions

- Striving to provide automatic/robust algorithms.
- SMC, together with biasing methods, seems an excellent candidate for this purpose:
1. biasing (a) makes the target easier to explore, and (b) facilitates the construction of proposals;
2. in SMC, tuning proposals is trivial.
- Currently working on an SMC version of ABF.