Use of Importance Sampling with MCMC Algorithms
J. Rousseau, CEREMADE, Université Paris-Dauphine (Warwick)
Joint work with C. Guihenneuc, R. McVinish, K. Mengersen, D. Nur

Outline
1 General ideas: Motivation; Framework
2 Theoretical properties: Limit results; Behaviour of the weights; Stabilization by recentering
3 Some simulated examples: test in regression

Motivation
- Repeated MCMC under different samples:
• Bayesian p-values. X^o = observed sample. Let H_π(X) = E^π[h(θ) | X] be a test statistic and P[H(X) > H(X^o)] = p(X^o) a p-value. To evaluate p(X^o): for j = 1, ..., J, simulate X^(j) ∼ P, compute H(X^(j)), and set
  p̂(X^o) = (1/J) Σ_{j=1}^J 1I{H(X^(j)) > H(X^o)}.
  BUT H(X^(j)) is evaluated by MCMC for every j → time consuming.
• Bayesian cross-validation: need to compute H(X^(−l)), where X^(−l) ⊂ X, for many subsets (−l).
• Evaluation of procedures by simulations.
- Prior sensitivity analysis: need to compute H_{π_j}(X) for different priors π_j.

Framework
- Bayesian model: X ∼ f_θ, θ ∈ Θ, θ ∼ π.
- Object of interest: H_π(X) = E^π[h(θ) | X].
- Evaluation with MCMC: (θ^t)_{t=1,...,T} = MC(π(· | X)) and
  Ĥ_π(X) = (1/T) Σ_t h(θ^t).
- New sample Y with Y =^d X:
  H_π(Y) = ∫_Θ h(θ) [f(Y | θ)/f(X | θ)] dπ(θ | X) / ∫_Θ [f(Y | θ)/f(X | θ)] dπ(θ | X).

So there is no need to run a new MCMC: use importance sampling (IS) on (θ^t)_t:
  Ĥ = Σ_{t=1}^T h(θ^t) w(θ^t, y, x) / Σ_{t=1}^T w(θ^t, y, x) = hw̄ / w̄,
where w(θ, y, x) = f_θ(y)/f_θ(x), or w(θ, π′, π) = π′(θ)/π(θ) for a change of prior.
- Much quicker.
- How good/bad is it?

Convergence
- Back to the theory of MCMC convergence:
• Consistency (in T): OK if the Markov chain is ergodic.
• Rate: in the original chain this is simply estimation of the function h̃(θ) = h(θ) f(Y | θ)/f(X | θ), so the usual tools apply.
• Asymptotic variance:
  γ² = [m_π(x)² / m_{π′}(y)²] [ var(w̄) H_π² + var(hw̄) − 2 cov(w̄, hw̄) H_π ],
  where w̄ and hw̄ denote the Monte Carlo averages of w(θ^t) and h(θ^t) w(θ^t).
• Variance estimation: Ĥ has the same asymptotic variance as the average of
  Z(θ^t) = h(θ^t) w(θ^t) / E_π[w(θ)] − (E_π[h(θ) w(θ)] / E_π[w(θ)]) · w(θ^t) / E_π[w(θ)],
  so it can be estimated from the same chain.

Behaviour of the weights
With w(θ, y, x) = f_θ(y)/f_θ(x) and w̃(θ) = π(θ | x)/π(θ | y):
if x = (x_1, ..., x_n) =^d y = (y_1, ..., y_n), then under regularity conditions
  w̃(θ) = exp{ n (θ̂_x − θ̂_y)′ I(θ_0) (θ − θ̂_x) } (1 + O(n^{−1/2})),
  var_as(w̃(θ) | x, y) = exp{ n (θ̂_x − θ̂_y)′ I(θ_0) (θ̂_x − θ̂_y) } − 1.
- Stability: θ̂_x ≠ θ̂_y → instability.
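The slides contain no code, so the following is only a minimal Python sketch of the self-normalised IS estimator above; the helpers loglik and h are placeholders (not from the talk) for the model's log-likelihood and the function of interest. It also returns the variance of the normalised weights, the instability diagnostic discussed on the last slide.

```python
import numpy as np

def is_reestimate(theta_chain, h, loglik, x, y):
    """Reuse draws theta_chain ~ pi(. | x) from one MCMC run to estimate
    H_pi(y) = E^pi[h(theta) | y] for a new data set y, without a new MCMC."""
    # log-weights: log w(theta^t, y, x) = log f_theta(y) - log f_theta(x)
    logw = np.array([loglik(t, y) - loglik(t, x) for t in theta_chain])
    logw -= logw.max()                    # stabilise the exponentiation
    w = np.exp(logw)
    w_norm = w / w.sum()                  # self-normalised weights
    h_vals = np.array([h(t) for t in theta_chain])
    H_hat = np.sum(w_norm * h_vals)       # sum_t h(theta^t) w_t / sum_t w_t
    # variance of T * normalised weights: explodes when hat(theta)_x != hat(theta)_y
    var_w = np.var(len(w) * w_norm)
    return H_hat, var_w
```

When the returned weight variance is large, the recentering described on the next slides is advisable.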
Stabilization by recentering
- Simple stabilization: θ′^t = θ^t + θ̂_y − θ̂_x and
  w′(θ^t) = π(θ′^t) f(y^n | θ′^t) / [π(θ^t) f(x^n | θ^t)] = 1 + O_P(n^{−1/2}).
• Very effective if the posterior is not too strongly multimodal.
• θ̂_x is often complicated to calculate: two-step procedure.
- Compute the centering: θ̃_y = Σ_t w_t θ^t / Σ_t w_t.
- Apply the centering: θ′^t = θ^t + θ̃_y − θ̃_x.
- Conditions: x and y have marginally the same distribution but need not be independent. See also MacEachern + Peruggia.

Regression example (explicit calculations)
- Problem: test
  H0: Y_i ∼ N(β_0, σ²) versus H1: Y_i ∼ N(β_0 + β_1 x_i, σ²).
- Test procedure: |β_1| < ε versus |β_1| ≥ ε, i.e. accept H0 iff H(Y) = E^π[β_1² | Y] < ε.
- Problem: choice of ε? ⇒ use a p-value. Under H0, θ = (β_0, σ) is unknown ⇒ use the conditional predictive p-value (Bayarri + Berger, Robins et al., Robert + Rousseau, Fraser + Rousseau):
  p(Y^o) = ∫_Θ P_θ[H(Y) > H(Y^o) | θ̂] dπ_0(θ | θ̂) = P_θ[H(Y) > H(Y^o) | θ̂].
  Special case here: π(β_0, β_1, σ) ∝ 1/σ and n = 250.
- Here H is explicit:
  H(y) = E^π[β_1² | y, x] = β̂_1² + Σ_i (y_i − ŷ_i)² / [(n − 4) Σ_i (x_i − x̄)²].
- Algorithm for each p-value (a code sketch follows the remarks below):
• Simulate [Y^1 | β̂_0, σ̂] ∼ f(Y | θ̂) and run one MCMC for π(ψ | Y^1), ψ = (β_0, β_1, σ), giving ψ^t, t = 1, ..., T.
• For j = 2, ..., J, simulate [Y^j | β̂_0, σ̂] ∼ f(Y | θ̂) iid.
• For each j = 2, ..., J, compute the weights w_t(ψ^t, Y^j, Y^1) and ψ̂^j, and
  H_S(Y^j) = Σ_t w_t (β_1^t)² / Σ_t w_t (and also the exact E^π[β_1² | Y^j] for comparison),
  H_CS1(Y^j) = Σ_t w′_t ((β_1^t)′)² / Σ_t w′_t, with (ψ^t)′ = ψ^t + ψ̂^j − ψ̂^1 (MLE centering),
  H_CS2(Y^j) = Σ_t w″_t ((β_1^t)″)² / Σ_t w″_t, with (ψ^t)″ = ψ^t + E^π[ψ | Y^j] − E^π[ψ | Y^1] (posterior-mean centering).

Results
250 p-values; for each, J = 1000 Monte Carlo samples and T = 10^5 MCMC samples with burn-in = 1000.
[Figure: scatter plots of the MCMC, IS, MLE-centered IS and posterior-mean-centered IS p-values against the "exact" p-value.]

Weights
[Figure] FIG.: Blue line is a kernel density estimate of the calculated variance of the normalised IS weights. Red line is the density of a χ²_1 distribution.

Remarks
- Weights: close to their asymptotic behaviour.
- Plain IS: OK but not marvelous.
- Recentered IS (MLE or posterior mean): much better.
- Posterior distribution: close to Gaussian ⇒ perfect for recentering.
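A rough Python sketch of the posterior-mean-centred estimator H_CS2 for this regression example (my own illustration, not code from the talk): it assumes the flat prior π ∝ 1/σ used above and that the shifted σ stays positive; the helper names are invented here.

```python
import numpy as np

def reg_loglik(psi, y, x):
    """Log-likelihood (up to a constant) of y_i ~ N(beta0 + beta1*x_i, sigma^2)."""
    beta0, beta1, sigma = psi
    resid = y - beta0 - beta1 * x
    return -len(y) * np.log(sigma) - 0.5 * np.sum(resid**2) / sigma**2

def H_CS2(psi_chain, y1, yj, x):
    """Posterior-mean-centred IS estimate of H(Y^j) = E^pi[beta1^2 | Y^j],
    reusing draws psi_chain ~ pi(psi | Y^1) from a single MCMC run."""
    chain = np.asarray(psi_chain)                  # columns: beta0, beta1, sigma
    # step 1: plain IS weights give the shift E^pi[psi | Y^j] - E^pi[psi | Y^1]
    logw = np.array([reg_loglik(p, yj, x) - reg_loglik(p, y1, x) for p in chain])
    w = np.exp(logw - logw.max()); w /= w.sum()
    shift = w @ chain - chain.mean(axis=0)
    # step 2: recentre the draws and recompute the weights, including the
    # prior ratio pi(psi'') / pi(psi) = sigma / sigma'' (prior proportional to 1/sigma)
    chain_c = chain + shift
    logw_c = np.array([np.log(p[2] / pc[2]) + reg_loglik(pc, yj, x) - reg_loglik(p, y1, x)
                       for p, pc in zip(chain, chain_c)])
    w_c = np.exp(logw_c - logw_c.max()); w_c /= w_c.sum()
    return np.sum(w_c * chain_c[:, 1]**2)

# p-value sketch: fraction of replicates Y^j whose H exceeds H(Y^obs)
# p_hat = np.mean([H_CS2(psi_chain, y1, yj, x) > H_obs for yj in replicates])
```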
GOF: Nonparametric example
- Problem: H0: f∗ ∈ F against H1: f∗ ∉ F, where F = {f_θ, θ ∈ Θ ⊂ R^d}.
- Nonparametric model for F^c (VW + RR + R): F(y | θ) ∼ G_ψ on [0, 1], ψ ∈ S, with some ψ_0 such that g_{ψ_0} ≡ 1. If Y ∼ f_θ then F(Y | θ) ∼ U(0, 1).
- Model: f∗(y | θ, ψ) = f(y | θ) g(F(y | θ) | ψ), θ ∈ Θ, ψ ∈ S.
- Prior on H1: dπ_1(θ, ψ) = dπ_0(θ) dπ(ψ).

Test
- Test statistic (Bayesian): H(x) = E^{π_1}[d(1, g(· | ψ)) | x].
- p-value:
  p(x^o) = ∫_Θ P_θ[H(y) > H(x^o) | θ̂] π_0(θ) dθ.
- Set-up here: f_θ ≡ Exp(θ), π_0 = Γ(γ_1, γ_2); under π_1, g is a mixture of triangular distributions (fixed partition, random weights):
  ψ = (k, ω), k ∈ N, ω ∈ S_k = {z ∈ [0, 1]^{k+1}; Σ_{i=0}^k z_i = 1},
  g(y | ω, k) = Σ_{i=0}^k ω_i h_i(y; k),
  π(k) = C(ρ) ρ^k, k ≥ 1, π(ω | k) = Dirichlet(α_{0,k}, ..., α_{k,k}).
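The triangular basis h_i(·; k) is not spelled out in the slides; the sketch below uses one standard construction (tent functions at the knots i/k, halved at the two boundaries) purely as an illustration of how g(· | ω, k) and a uniform null g_{ψ_0} ≡ 1 can arise. The specific weights recovering the uniform density are tied to this assumed construction.

```python
import numpy as np

def tri_basis(y, i, k):
    """Assumed triangular density h_i(y; k) on [0, 1], peaked at the knot i/k:
    interior tents have support [(i-1)/k, (i+1)/k] and height k, boundary
    half-tents (i = 0 or k) have support of length 1/k and height 2k,
    so each integrates to one."""
    y = np.asarray(y, dtype=float)
    dist = np.abs(y - i / k)
    height = 2 * k if i in (0, k) else k
    return np.where(dist <= 1 / k, height * (1 - k * dist), 0.0)

def g_mix(y, omega, k):
    """Alternative density on [0, 1]: g(y | omega, k) = sum_i omega_i h_i(y; k)."""
    return sum(w * tri_basis(y, i, k) for i, w in enumerate(omega))

# Under this construction, omega_0 = omega_k = 1/(2k) and omega_i = 1/k otherwise
# recover the uniform density, i.e. a psi_0 with g_{psi_0} == 1.
k = 4
omega0 = np.array([0.5 / k] + [1.0 / k] * (k - 1) + [0.5 / k])
assert np.allclose(g_mix(np.linspace(0, 1, 101), omega0, k), 1.0)
```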
Algorithm
- Simulate y^1 | θ̂_x ∼ f(y | θ̂_y = θ̂_x).
- MCMC = RJMCMC on η^t = (θ^t, k^t, ω^t) targeting π(η | y^1); compute H(y^1).
- For j = 2, ..., J, simulate y^j =^d y^1 (iid), compute the weights w_t(η^t, y^j, y^1) ⇒ H(y^j), and
  p̂(x^o) = (1/J) Σ_j 1I{H(y^j) > H(x^o)}.
- Recentering: only on {k = 2}, using the MLE or the posterior mean of (θ, ω_1), because P^π[k = 2 | y_1, ..., y_n] = 1 + o_P(1) if y = (y_1, ..., y_n) ∈ H0.

Results
n = 250, x^o ∼ Exp(1/4), J = 1000, T = 100000 with burn-in = 5000, and 250 p-values.
[Figure: cdfs of the IS, MLE-centered IS and posterior-mean-centered IS p-values, and the estimated power Pr(reject H0).]

Weights
[Figure: histograms of 0.5 log(Var(weights IS raw) + 1), 0.5 log(Var(weights IS center MLE) + 1) and 0.5 log(Var(weights IS center MLE | k = 2) + 1).]

Conclusion
- IS within MCMC for repeated sampling: promising.
- The curse of dimensionality is not so severe.
- Much quicker.
- Possible simple improvements: recentering (+ rescaling).
- Other improvements: if a data set y^j is too different from y^1, IS is not so good. Consider a small set (y^1, ..., y^{J_0}) (Guihenneuc et al.):
• run one MCMC for each of them;
• for a new y^j, choose the best y^l, l = 1, ..., J_0, and compute the IS estimate with it.
- Excellent for prior sensitivity analysis.
- Non-stationarity ⇒ bad.
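To illustrate the prior-sensitivity point above (my own sketch, not from the talk; the prior log-density functions are placeholders): the same chain is simply reweighted by π_j(θ)/π(θ), since the likelihood cancels in the weight.

```python
import numpy as np

def prior_sensitivity(theta_chain, h, log_prior_old, log_prior_new):
    """Re-estimate E^{pi_j}[h(theta) | X] from draws theta_chain ~ pi(theta | X)
    obtained under the original prior, using weights pi_j(theta) / pi(theta);
    no new MCMC run is needed."""
    chain = list(theta_chain)
    logw = np.array([log_prior_new(t) - log_prior_old(t) for t in chain])
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return float(np.sum(w * np.array([h(t) for t in chain])))
```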