On the flexibility of acceptance probabilities in auxiliary variable Metropolis-Hastings algorithms

Geir Storvik
Dept. of Mathematics, University of Oslo
SFI2, Statistics for Innovation, Norwegian Computing Center
SFF-CEES, Centre for Ecological and Evolutionary Synthesis

Warwick, March 16-20, 2009

Outline
◮ MC methods in complex situations
◮ Simulation using auxiliary variables
  ◮ Common framework
  ◮ Importance sampling
  ◮ Metropolis-Hastings
◮ Sequentially generated proposals
◮ Non-sequential algorithms
◮ Summary/discussion

Monte Carlo estimation
◮ Interest in $\theta = \int f(y)\pi(y)\,dy = E_\pi[f(y)]$
◮ Markov chain Monte Carlo
  ◮ Estimated through $\hat\theta = \frac{1}{n}\sum_{i=1}^{n} f(y_i^*)$
  ◮ $y_1^*, y_2^*, \ldots$ generated through a Markov chain with $\pi$ as invariant distribution
◮ Importance sampling
  ◮ Estimated through $\hat\theta = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(y_i^*)}{q(y_i^*)} f(y_i^*)$, with $y_i^*$ iid from $q(\cdot)$

Complex situations: simple methods are not enough
◮ Multiple modes
◮ Jumps between models
◮ Densities with long, narrow contours

Combining importance sampling and MCMC
◮ MCMC gives samples close to $\pi(y)$
◮ Importance sampling corrects for the remaining discrepancy
◮ Possible to combine?
◮ Using MCMC within importance sampling:
  ◮ Generate $y^*$ through an MCMC algorithm $x_1^*, x_2^*, \ldots, x_{t+1}^* = y^*$
  ◮ Importance weight $\pi(y^*)/q_y(y^*)$ with
      $q_y(y^*) = q_y(x_{t+1}) = \int_{x_{1:t}} q_1(x_1) \prod_{j=2}^{t+1} q(x_j|x_{j-1})\, dx_{1:t}$
◮ Problem: $q_y(y)$ is not easy to compute
◮ Other proposal schemes are also of interest
◮ Many "proper" weights are possible through an extended space formulation

Use of auxiliary variables within MH
◮ Current value $y$, simulate $y^* \sim q_y(\cdot|y)$
◮ New value
    $y = \begin{cases} y^* & \text{with prob. } \alpha(y;y^*) = \min\big\{1, \frac{\pi(y^*)q_y(y|y^*)}{\pi(y)q_y(y^*|y)}\big\} \\ y & \text{otherwise} \end{cases}$
◮ Simulation through auxiliary variables:
    $x_1^* \sim q_1(x_1^*|y), \quad x_j^* \sim q_j(x_j^*|x_{j-1}^*),\ j = 2, \ldots, t, \quad y^* \sim q_{y|x}(y^*|x_t^*)$
◮ What acceptance probability should now be used?
◮ The ideal $\alpha(y;y^*)$ is usually not possible to evaluate
◮ Many "proper" choices are available through working in an extended space (a minimal sketch of this proposal structure follows below)
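To make the auxiliary-chain proposal structure concrete, here is a minimal Python sketch. The target, step sizes and helper names are illustrative assumptions, not the talk's construction; the particular weight used below reduces the scheme to ordinary symmetric random-walk MH (valid only because the composite proposal happens to be symmetric) and is only meant to show where a "proper" weight $w(y; x^*, y^*)$ enters the acceptance ratio.

import numpy as np

rng = np.random.default_rng(0)

def log_pi(y):
    # Assumed toy target: standard normal (unnormalised, log scale).
    return -0.5 * y**2

def propose_aux_chain(y, t=5, step=1.0):
    """Auxiliary variables x_1*,...,x_t* from a random-walk chain started at the
    current value y, followed by a proposal y* ~ q_{y|x}(.|x_t*)."""
    x = [y + step * rng.normal()]                # x_1* ~ q_1(.|y)
    for _ in range(2, t + 1):
        x.append(x[-1] + step * rng.normal())    # x_j* ~ q_j(.|x_{j-1}*)
    y_star = x[-1] + step * rng.normal()         # y* ~ q_{y|x}(.|x_t*)
    return np.array(x), y_star

def mh_step(y, log_weight):
    """One MH step on the extended space; log_weight(y, x, y_new) plays the role
    of log w(y; x*, y*), and the ratio w(y; x*, y*)/w(y*; x, y) is used."""
    x_fwd, y_star = propose_aux_chain(y)
    x_back, _ = propose_aux_chain(y_star)        # backward auxiliaries ("generate x as well")
    log_r = log_weight(y, x_fwd, y_star) - log_weight(y_star, x_back, y)
    return y_star if np.log(rng.uniform()) < min(0.0, log_r) else y

# Simplest weight choice: w(y; x*, y*) = pi(y*), ignoring the auxiliaries.
# Valid here only because the composite proposal q_y(y*|y) is symmetric.
simple_log_w = lambda y, x, y_new: log_pi(y_new)

y, samples = 0.0, []
for _ in range(5000):
    y = mh_step(y, simple_log_w)
    samples.append(y)
print("sample mean:", round(np.mean(samples), 3), " sample var:", round(np.var(samples), 3))

More interesting weight choices, which do use the auxiliary variables, are the topic of the following slides.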
Examples from the literature
◮ Annealed importance sampling (Neal, 2001)
◮ Mode jumping (Tjelmeland and Hegstad, 2001; Jennison and Sharp, 2007)
◮ Model selection and reversible jump MCMC (Al-Awadhi et al., 2004)
◮ Delayed rejection sampling (Tierney and Mira, 1999; Green and Mira, 2001)
◮ Pseudo-marginal algorithms (Beaumont, 2003)
◮ Likelihoods with intractable normalising constants (Møller et al., 2006)
◮ Directional MCMC (Eidsvik and Tjelmeland, 2006)
◮ Multiple-try methods and particle proposals (Liu et al., 2000; Andrieu et al., 2008)

Common framework
◮ Different ways of motivating the algorithms
◮ Different ways of proving their validity
◮ Possible to put all of them into a common framework: "standard" use of importance sampling/MCMC on extended spaces
  ◮ Easier to understand
  ◮ Easier to construct alternative acceptance probabilities
  ◮ Toolbox for constructing new algorithms and proving their validity

Auxiliary variables and importance sampling
◮ Simulate $y^*$ through $x^* \sim q_x(\cdot)$ and $y^* \sim q_{y|x}(\cdot|x^*)$
◮ For any conditional distribution $h(x|y)$,
    $\theta = \int_y f(y)\pi(y)\,dy = \int_y \int_x f(y)\pi(y)h(x|y)\,dx\,dy
            = \int_y \int_x f(y)\,\frac{\pi(y)h(x|y)}{q_x(x)q_{y|x}(y|x)}\, q_x(x)q_{y|x}(y|x)\,dx\,dy$
◮ Importance sampling:
    $\hat\theta = \frac{1}{n}\sum_{i=1}^{n} w(x^i, y^i) f(y^i), \qquad
     w(x, y) = \frac{\pi(y)h(x|y)}{q_x(x)q_{y|x}(y|x)}$
◮ Many possible weights for different choices of $h$! (see the numerical sketch after the example slide below)

Properties
◮ Define $q_y(y) = \int_x q_x(x)q_{y|x}(y|x)\,dx$. Then
    $E[w(x,y)|y] = \frac{\pi(y)}{q_y(y)}, \qquad
     \mathrm{Var}[w(x,y)] \ge \mathrm{Var}\Big[\frac{\pi(y)}{q_y(y)}\Big]$

Combination of h's
◮ Combination of $h$'s: if $h_s(x|y)$, $s = 1, 2, \ldots$ are conditional distributions, then for $a_s \ge 0$ with $\sum_s a_s = 1$,
    $h(x|y) = \sum_s a_s h_s(x|y)$
  is also a conditional distribution
◮ Combination of $w$'s: if each
    $w_s(x, y) = \frac{\pi(y)h_s(x|y)}{q_x(x)q_{y|x}(y|x)}$
  is a proper weight function, then $\sum_s a_s w_s(x, y)$ is also a proper weight function

Proposal through MCMC steps
◮ Assume $x_1^* \sim q_1(\cdot)$, $x_j^* \sim q(\cdot|x_{j-1}^*)$, $j = 2, \ldots, t+1$, with $\pi(x)q(y|x) = \pi(y)q(x|y)$
◮ Proposal $y^* = x_{t+1}^*$
◮ Note: $\frac{\pi(y)}{q_y(y)} \approx 1$ for $t$ large
◮ Possible to choose $h(x|y)$ such that $w(x, y) \to 1$ as $t \to \infty$?

Proposal through MCMC (cont.)
◮ For $s = 2, \ldots, t+1$, the choice
    $h_s(x_{1:t}|y = x_{t+1}) = q_1(x_1) \prod_{j=2}^{s-1} q(x_j|x_{j-1}) \prod_{j=s}^{t} q(x_j|x_{j+1})$
  implies
    $w_s(x, y) = \frac{\pi(x_s)}{q(x_s|x_{s-1})}$
◮ Special cases
  ◮ $s = 1$: $w_1(x, y) = \frac{\pi(x_1)}{q_1(x_1)}$
  ◮ $s = t+1$: $w_{t+1}(x, y) = \frac{\pi(x_{t+1})}{q(x_{t+1}|x_t)}$
  ◮ Combination: $\bar w(x, y) = \frac{1}{t}\sum_{s=2}^{t+1} \frac{\pi(x_s)}{q(x_s|x_{s-1})}$
◮ Note: $\bar w(x, y) \to E[w(x, y)] = 1$ as $t \to \infty$

MH and auxiliary variables
◮ The ideas from importance sampling can be transferred to MCMC
◮ Current value $y$; simulate $x^* \sim q_x(\cdot|y)$, $y^* \sim q_{y|x}(\cdot|x^*)$
◮ MH algorithm constructed on the extended space $(x, y)$:
    $r(x, y; x^*, y^*) = \frac{w(y; x^*, y^*)}{w(y^*; x, y)}$
◮ Different strategies possible
  ◮ Store the previous $x$
  ◮ Generate $x^*$, $y^*$ and $x$

Proposal through "inner" MCMC steps
◮ Assume $x_1^* \sim q_1(\cdot|y)$, $x_j^* \sim q(\cdot|x_{j-1}^*)$, $j = 2, \ldots, t+1$, with $\pi(x)q(y|x) = \pi(y)q(x|y)$
◮ Proposal $y^* = x_{t+1}^*$
◮ Acceptance ratio
    $r(y; x^*, y^*, x) = \frac{w(y; x^*, y^*)}{w(y^*; x, y)}$
◮ Special cases
  ◮ $w_1(x, y) = \frac{\pi(x_1)}{q_1(x_1|y)}$
  ◮ $w_{t+1}(x, y) = \frac{\pi(x_{t+1})}{q(x_{t+1}|x_t)}$
  ◮ $\bar w(x, y) = \frac{1}{t}\sum_{s=2}^{t+1} \frac{\pi(x_s)}{q(x_s|x_{s-1})}$
◮ Note: $\bar w(x, y) \to E[w(x, y)] = 1$ as $t \to \infty$

Example
◮ $\pi(y) = \beta N(y; \mu_1, I) + (1-\beta) N(y; \mu_2, I)$ where $\mu_{1,1} = -\mu_{2,1} = -10$ while $\mu_{i,j} = 0$, $i = 1, 2$, $j = 2, \ldots, p$
◮ $x_1^* \sim N(0, \sigma_{\text{large}}^2 I)$
◮ $t + 1 = 10$ "inner" MH steps (discrete Langevin diffusion)
◮ Previous $x$ stored
◮ $M = 20$ outer MCMC steps
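The claim that different choices of $h$ give different proper weights can be checked numerically. Below is a small Python sketch; the Gaussian proposal pair, the target and the parameter values are assumptions chosen for illustration, not the mixture example above. Two choices of $h$ yield two valid weight functions, both giving unbiased estimates of $\theta$, with the "ideal" choice $h(x|y) \propto q_x(x)q_{y|x}(y|x)$ having the smaller variance, and their convex combination is again proper.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 200_000
sig_x, sig_yx = 1.5, 2.0                 # q_x = N(0, sig_x^2), q_{y|x}(.|x) = N(x, sig_yx^2)

pi = lambda y: norm.pdf(y, loc=1.0, scale=1.0)   # normalised target, E_pi[y] = 1
f = lambda y: y                                  # estimate theta = E_pi[f(y)]

x = sig_x * rng.normal(size=n)           # x* ~ q_x
y = x + sig_yx * rng.normal(size=n)      # y* ~ q_{y|x}(.|x*)

# Choice 1: h(x|y) = q_x(x)                      =>  w = pi(y) / q_{y|x}(y|x)
w1 = pi(y) / norm.pdf(y, loc=x, scale=sig_yx)

# Choice 2: h(x|y) = q_x(x) q_{y|x}(y|x) / q_y(y) =>  w = pi(y) / q_y(y)
# (the "ideal" weight; q_y is available in closed form for this Gaussian pair)
w2 = pi(y) / norm.pdf(y, loc=0.0, scale=np.sqrt(sig_x**2 + sig_yx**2))

# A convex combination of proper weights is again a proper weight.
w_bar = 0.5 * w1 + 0.5 * w2

for name, w in [("h = q_x ", w1), ("ideal h ", w2), ("mixture ", w_bar)]:
    print(name, "theta_hat =", round(np.mean(w * f(y)), 3), " Var(w) =", round(np.var(w), 3))

All three estimates should come out close to 1, with the variance of the weights ordered as in the "Properties" slide.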
[Figure: distance to the truth for the different weight functions $w_1(x,y) = \pi(x_1)/q_1(x_1)$, $w_{t+1}(x,y) = \pi(x_{t+1})/q(x_{t+1}|x_t)$ and $\bar w(x,y)$; panels (a) p=1, (b) p=2, (c) p=3, (d) p=5, (e) p=10, (f) p=10]

More general schemes
◮ $x^* \sim q_x(\cdot|y)$, $y^* \sim q_{y|x}(\cdot|x^*)$
◮ Some sequential structure in the generation of $x^* = (x_1^*, \ldots, x_t^*)$
◮ Examples
  ◮ Annealed importance sampling (Neal, 2001)
  ◮ Mode jumping (Tjelmeland and Hegstad, 2001; Jennison and Sharp, 2007)
  ◮ Model selection and reversible jump MCMC (Al-Awadhi et al., 2004)
  ◮ Particle proposals (Andrieu et al., 2008)

RJMCMC proposals (Al-Awadhi et al., 2004)
◮ $x_1^*$: jump between models
◮ $x_j^*$, $j = 2, \ldots, t+1$: jumps within the model
◮ Al-Awadhi et al. (2004):
    $w_1(y; x^*, y^*) = \frac{\pi(n)\pi_n(x_1^*)}{q_M(n|m, y)\, q_n(x_1^*|y)}$
◮ Alternatives:
    $w_s(y; x^*, y^*) = \frac{\pi(n)\pi_n(x_s^*)}{q_M(n|m, y)\, q_n(x_s^*|x_{s-1}^*)}$
  giving
    $r = \frac{\pi(n) q_M(m|n, y^*) \sum_{s=2}^{t+1} \pi_n(x_s^*)/q_n(x_s^*|x_{s-1}^*)}
              {\pi(m) q_M(n|m, y) \sum_{s=2}^{t+1} \pi_m(x_s)/q_m(x_s|x_{s-1})}$
◮ Converges towards $\pi(n) q_M(m|n, y^*)\,/\,\pi(m) q_M(n|m, y)$

Particle proposals
◮ Generate $N$ parallel sequences $\{x_{j,1:t+1}^*,\ j = 1, \ldots, N\}$
◮ Each sequence is generated independently: $x_{j,i}^* \sim q_i(x_{j,i}^*|x_{j,i-1}^*)$
◮ Choose $K \in \{1, \ldots, N\}$ with
    $\Pr(K) \propto \frac{\pi(x_{K,t+1}^*)}{q_{t+1}(x_{K,t+1}^*|x_{K,t}^*)}$
  and put $y^* = x_{K,t+1}^*$
◮ The choice
    $h(x|y, x^*, y^*) = N^{-1}\,\delta(x_{K,t+1} - y) \prod_{i=1}^{t} q_i(x_{K,i}|x_{K,i-1}) \prod_{j \ne K} \prod_{i=1}^{t+1} q_i(x_{j,i}|x_{j,i-1})$
  gives
    $r(y; x^*, y^*, x) = \frac{\sum_{j=1}^{N} \pi(x_{j,t+1}^*)/q_{t+1}(x_{j,t+1}^*|x_{j,t}^*)}
                              {\sum_{j=1}^{N} \pi(x_{j,t+1})/q_{t+1}(x_{j,t+1}|x_{j,t})}$

Non-sequential examples
◮ Pseudo-marginal algorithms (Beaumont, 2003; Andrieu and Roberts, 2009)
◮ Likelihoods with intractable normalising constants (Møller et al., 2006)
◮ Delayed rejection sampling (Tierney and Mira, 1999; Green and Mira, 2001)

Delayed rejection sampling
◮ $x_1^* \sim q_1(\cdot|y)$, accepted with prob. $\alpha_1(y; x_1^*)$
◮ If rejected, $x_2^* \sim q_2(\cdot|x_1^*, y)$, accepted with prob. $\alpha_2(y; x_1^*, x_2^*)$
◮ $y^* = x_1^*$ with prob. $\alpha_1$, otherwise $y^* = x_2^*$ (a toy sketch of this two-stage scheme is given after the example below)
◮ Tierney and Mira (1999):
    $\alpha_2(y; x_1^*, x_2^*) = \min\Big\{1, \frac{\pi(x_2^*) q_1(x_1^*|x_2^*) q_2(y|x_1^*, x_2^*)[1 - \alpha_1(x_2^*; x_1^*)]}
                                               {\pi(y) q_1(x_1^*|y) q_2(x_2^*|x_1^*, y)[1 - \alpha_1(y; x_1^*)]}\Big\}$
◮ Alternative: generate $x_1 \sim q_1(x_1|y^*)$ and use
    $\alpha_2(y; x_1^*, x_2^*) = \min\Big\{1, \frac{\pi(y^*) q_2(y|x_1, y^*)[1 - \alpha_1(y^*; x_1)]}
                                               {\pi(y) q_2(y^*|x_1^*, y)[1 - \alpha_1(y; x_1^*)]}\Big\}$
◮ $q_2 = \pi \;\Rightarrow\; \alpha_2(y; x_1^*, x_2^*) = \min\Big\{1, \frac{1 - \alpha_1(y^*; x_1)}{1 - \alpha_1(y; x_1^*)}\Big\}$

Example
◮ Model:
    $z_i = \mu + \eta_i + \varepsilon_i, \quad i = 1, \ldots, n$
    $\varepsilon_i \sim N(0, \tau_1^{-1})$ iid
    $\eta_i | \eta_{-i} \sim N\big(\beta n_i^{-1} \textstyle\sum_{j \sim i} \eta_j,\ n_i^{-1}\tau_2^{-1}\big)$ (CAR prior)
◮ Interest in the posterior for $(\tau_1, \tau_2)$

[Figure: posterior for $(\tau_1, \tau_2)$; both axes range from 0 to 5]

Delayed rejection sampling for the example
◮ First stage: $\tau_{j1} = \tau_j f_j$ with $q(f_j) \propto (1 + f_j^{-1})\,I(f_j \in [F^{-1}, F])$, $j = 1, 2$
◮ Second stage: re-parametrise to
    $\tau = \frac{\tau_1 \tau_2}{\tau_1 + \tau_2}, \qquad r = \frac{\tau_2}{\tau_1 + \tau_2}$
  and propose $r \sim U[0, 1]$, $\tau \sim \pi(\tau|r)$
◮ Compare the two choices for $\alpha_2$
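As a toy illustration of the two-stage scheme with the Tierney-Mira second-stage acceptance probability, here is a minimal Python sketch; the bimodal target, the Gaussian proposal scales and the helper names are assumptions made for illustration, not the CAR example. A bold first-stage proposal is followed, upon rejection, by a cautious second-stage proposal.

import numpy as np

rng = np.random.default_rng(2)

def log_pi(y):
    # Assumed toy target: equal mixture of N(-3, 1) and N(3, 1), unnormalised log density.
    return np.logaddexp(-0.5 * (y - 3.0)**2, -0.5 * (y + 3.0)**2)

def norm_logpdf(x, mean, sd):
    return -0.5 * ((x - mean) / sd)**2 - np.log(sd) - 0.5 * np.log(2.0 * np.pi)

def alpha1(y, x1):
    # First-stage acceptance probability for a symmetric random-walk proposal.
    return np.exp(min(0.0, log_pi(x1) - log_pi(y)))

def dr_step(y, s1=5.0, s2=0.5):
    # Stage 1: bold symmetric proposal.
    x1 = y + s1 * rng.normal()
    if rng.uniform() < alpha1(y, x1):
        return x1
    # Stage 2: cautious symmetric proposal; the q2 densities cancel by symmetry
    # and are therefore omitted from the Tierney-Mira ratio below.
    x2 = y + s2 * rng.normal()
    log_num = (log_pi(x2) + norm_logpdf(x1, x2, s1)
               + np.log(max(1e-300, 1.0 - alpha1(x2, x1))))
    log_den = (log_pi(y) + norm_logpdf(x1, y, s1)
               + np.log(max(1e-300, 1.0 - alpha1(y, x1))))
    return x2 if np.log(rng.uniform()) < min(0.0, log_num - log_den) else y

y, chain = 0.0, []
for _ in range(20000):
    y = dr_step(y)
    chain.append(y)
chain = np.array(chain)
print("fraction of moves accepted:", round(np.mean(chain[1:] != chain[:-1]), 3))
print("fraction of time in right mode:", round(np.mean(chain > 0), 3))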
[Figure: Kolmogorov-Smirnov distance to the posterior for $(\tau_1, \tau_2)$ as a function of running time, for the two delayed rejection variants DR1 and DR2]

Pseudo-marginal algorithms
◮ Target given through $\pi(y) = \int_x \bar\pi(x, y)\,dx$
◮ Simulate $y^* \sim q_y(y^*|y)$
◮ Simulate $x_1^*, \ldots, x_t^* \sim q_{x|y}(x^*|y^*)$
◮ The choice $h_s(x_{1:t}|y) = \bar\pi(x_s|y) \prod_{j \ne s} q_{x|y}(x_j|y)$ gives the acceptance ratio (Beaumont, 2003)
    $r_{x,y}(x^*, y^*) = \frac{w(y, x^*, y^*)}{w(y^*, x, y)}$
  where
    $w(y, x^*, y^*) = \frac{1}{t}\sum_{i=1}^{t} \frac{\bar\pi(x_i^*, y^*)}{q_{x|y}(x_i^*|y^*)\, q_y(y^*|y)} \approx \frac{\pi(y^*)}{q_y(y^*|y)}$
◮ Variant: generate new $x$'s each time
◮ (A toy sketch of this scheme is given after the references.)

Intractable normalising constants (Møller et al., 2006)
◮ $\pi(y) = p(y|z) = C^{-1} p(y) p(z|y)$ with $p(z|y) = Z^{-1}(y)\tilde p(z|y)$
◮ $(x, y)$ is the current state
◮ $q(x^*, y^*|x, y) = q_y(y^*|x, y)\, q_{x|y}(x^*|x, y, y^*)$ with $q_{x|y}(x^*|x, y, y^*) = Z^{-1}(y^*)\tilde p(x^*|y^*)$
◮ $h(x|y)$ arbitrary, but on a state space similar to that of $z$:
    $r(x, y; x^*, y^*) = \frac{p(y^*)\tilde p(z|y^*)\, h(x^*|y^*)\, q_y(y|x^*, y^*)\, \tilde p(x|y)}
                              {p(y)\tilde p(z|y)\, h(x|y)\, q_y(y^*|x, y)\, \tilde p(x^*|y^*)}$
◮ $h(x|y, x^*, y^*) = \delta(x - x^*)$ gives
    $r(x, y; x^*, y^*) = \frac{p(y^*)\tilde p(z|y^*)\, q_y(y|y^*)\, \tilde p(x^*|y)}
                              {p(y)\tilde p(z|y)\, q_y(y^*|y)\, \tilde p(x^*|y^*)}$

Summary/discussion
◮ General framework: $x^* \sim q_x(\cdot|y)$, $y^* \sim q_{y|x}(\cdot|x^*)$
◮ Many existing algorithms fit within this framework
◮ Many different weight functions/acceptance probabilities are possible
◮ The general framework is
  ◮ a tool for constructing new algorithms
  ◮ a tool for constructing alternative weight functions
  ◮ a common way of understanding many different algorithms
◮ Further work: experimental/theoretical results on the properties of different weight functions

References
Al-Awadhi, F., M. Hurn, and C. Jennison (2004). Improving the acceptance rate of reversible jump MCMC proposals. Statistics & Probability Letters 69, 189–198.
Andrieu, C., A. Doucet, and R. Holenstein (2008). Particle Markov chain Monte Carlo. Preprint.
Andrieu, C. and G. O. Roberts (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics. To appear.
Beaumont, M. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics 164(3), 1139–1160.
Eidsvik, J. and H. Tjelmeland (2006). On directional Metropolis-Hastings algorithms. Statistics and Computing 16, 93–106.
Green, P. and A. Mira (2001). Delayed rejection in reversible jump Metropolis-Hastings. Biometrika 88(4), 1035–1053.
Jennison, C. and R. Sharp (2007). Mode jumping in MCMC: Adapting proposals to the local environment. Talk at conference in honour of Allan Seheult, Durham, March 2007. Available at http://people.bath.ac.uk/mascj/.
Liu, J., F. Liang, and W. Wong (2000). The multiple-try method and local optimization in Metropolis sampling. Journal of the American Statistical Association 95(449), 121–134.
Møller, J., A. Pettitt, R. Reeves, and K. Berthelsen (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93(2), 451–458.
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing 11, 125–139.
Tierney, L. and A. Mira (1999). Some adaptive Monte Carlo methods for Bayesian inference. Statistics in Medicine 18(17-18), 2507–2515.
Tjelmeland, H. and B. Hegstad (2001). Mode jumping proposals in MCMC. Scandinavian Journal of Statistics 28(1), 205–223.
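Finally, a toy Python sketch of the pseudo-marginal scheme discussed above; the joint density, proposal scales and the number of auxiliary samples are illustrative assumptions. An unbiased importance-sampling estimate of $\pi(y)$ replaces the intractable marginal, and the estimate attached to the current state is reused until a proposal is accepted (the "store previous x" strategy).

import numpy as np

rng = np.random.default_rng(3)

def log_pibar(x, y):
    # Assumed joint: y ~ N(0, 1), x | y ~ N(y, 1), so pi(y) = N(y; 0, 1).
    return -0.5 * y**2 - 0.5 * (x - y)**2 - np.log(2 * np.pi)

def pi_hat(y, t=5, s_x=2.0):
    """Unbiased estimate of pi(y): average of pibar(x_i, y)/q(x_i|y), x_i ~ N(y, s_x^2)."""
    x = y + s_x * rng.normal(size=t)
    log_q = -0.5 * ((x - y) / s_x)**2 - np.log(s_x) - 0.5 * np.log(2 * np.pi)
    return np.mean(np.exp(log_pibar(x, y) - log_q))

y, est = 0.0, pi_hat(0.0)
chain = []
for _ in range(20000):
    y_prop = y + 1.0 * rng.normal()        # symmetric random-walk proposal q_y
    est_prop = pi_hat(y_prop)              # fresh x's only for the proposed value
    if rng.uniform() < min(1.0, est_prop / est):
        y, est = y_prop, est_prop          # keep the accepted estimate ("store previous x")
    chain.append(y)
print("mean, var of chain:", round(np.mean(chain), 3), round(np.var(chain), 3))  # approx (0, 1)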