Use of Importance sampling with MCMC algorithms
J. Rousseau
CEREMADE, Université Paris-Dauphine
Warwick
Joint work with C. Guihenneuc, R. McVinish, K. Mengersen, D. Nur
Outline
1. General ideas
   • Motivation
   • Framework
2. Theoretical properties
   • Limit results
   • Behaviour of the weights
   • Stabilization by recentering
3. Some simulated examples
   • Test in regression
Motivation
• Repeated MCMC under different samples:
  • Bayesian p-values. X^o = observed sample.
    Let H_π(X) = E^π[h(θ) | X] be a test statistic and P[H(X) > H(X^o)] = p(X^o) the associated p-value. To evaluate p(X^o), simulate X^(j) ~ P for j = 1, ..., J, compute H(X^(j)) and set

      \hat p(X^o) = \frac{1}{J} \sum_{j=1}^J \mathbf{1}\{H(X^{(j)}) > H(X^o)\}.

    BUT... H(X^(j)) has to be evaluated by MCMC for every j → time consuming.
  • Bayesian cross-validation: need to compute H(X^(−l)), where X^(−l) ⊂ X, for many subsets (−l).
  • Evaluation of procedures by simulation.
• Prior sensitivity analysis: need to compute H_{π_j}(X) for different priors π_j.
Framework
• Bayesian model:
    X ~ f_θ,  θ ∈ Θ,  θ ~ π.
• Object of interest: H_π(X) = E^π[h(θ) | X].
• Evaluation with MCMC: run (θ^t)_{t=1,...,T} = MC(π(·|X)) and use

      \hat H_\pi(X) = \frac{1}{T} \sum_t h(\theta^t).

• New sample Y =^d X:

      H_\pi(Y) = \frac{\int_\Theta h(\theta)\, [f(Y \mid \theta)/f(X \mid \theta)]\, d\pi(\theta \mid X)}{\int_\Theta [f(Y \mid \theta)/f(X \mid \theta)]\, d\pi(\theta \mid X)}.
So there is no need to run a new MCMC: use importance sampling on the draws (θ^t)_t:

      \hat H = \frac{\sum_{t=1}^T h(\theta^t)\, w(\theta^t, y, x)}{\sum_{t=1}^T w(\theta^t, y, x)} = \frac{\overline{hw}}{\overline{w}},

where

      w(\theta, y, x) = \frac{f_\theta(y)}{f_\theta(x)} \qquad \text{or} \qquad w(\theta, \pi', \pi) = \frac{\pi'(\theta)}{\pi(\theta)}.

• Much quicker.
• How good/bad is it?
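To make the reweighting concrete, here is a minimal Python sketch of the self-normalized IS estimator above. The function names (`loglik`, `h`) and the toy normal-mean model are illustrative assumptions, not part of the slides; the draws stand in for the output of a single MCMC run on x.

```python
import numpy as np

def is_estimate(theta_draws, h, loglik, x, y):
    """Self-normalized IS estimate of E[h(theta) | y] from draws that target
    pi(theta | x); weights f(y|theta)/f(x|theta) computed on the log scale."""
    logw = np.array([loglik(t, y) - loglik(t, x) for t in theta_draws])
    w = np.exp(logw - logw.max())           # stabilise before exponentiating
    hvals = np.array([h(t) for t in theta_draws])
    return np.sum(w * hvals) / np.sum(w)

# Toy check: normal mean, sigma = 1, flat prior, so pi(theta | x) = N(xbar, 1/n)
rng = np.random.default_rng(0)
n = 100
x, y = rng.normal(0.0, 1.0, n), rng.normal(0.0, 1.0, n)
loglik = lambda t, d: -0.5 * np.sum((d - t) ** 2)
theta_draws = rng.normal(x.mean(), 1.0 / np.sqrt(n), 10_000)  # stand-in for MCMC output
print(is_estimate(theta_draws, lambda t: t, loglik, x, y))    # should be close to y.mean()
```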
Convergence
• Back to the theory of MCMC convergence:
  • Consistency (in T): OK if the Markov chain is ergodic.
  • Rate: in the original chain, this amounts to estimating the function

      \tilde h(\theta) = h(\theta)\, \frac{f(Y \mid \theta)}{f(X \mid \theta)}.

Usual tools:
  • Asymptotic variance:

      \gamma^2 = \frac{m_\pi(x)^2}{m_{\pi'}(y)^2} \left[ H_\pi^2\, \mathrm{var}(\overline{w}) + \mathrm{var}(\overline{hw}) - 2 H_\pi\, \mathrm{cov}(\overline{w}, \overline{hw}) \right].

  • Variance estimation: \hat H has the same asymptotic variance as the sample mean of

      Z(\theta^t) = \frac{h(\theta^t)\, w(\theta^t)}{E_\pi(w(\theta))} - \frac{E_\pi(h(\theta) w(\theta))}{E_\pi(w(\theta))}\, \frac{w(\theta^t)}{E_\pi(w(\theta))}.
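A hedged sketch of how the Z(θ^t) representation can be turned into a plug-in standard error, replacing E_π(w) and E_π(hw) by their sample counterparts. The independence approximation in the last line is an assumption of this sketch; for correlated MCMC output one would estimate var(Z) by batch means or a spectral method instead.

```python
import numpy as np

def is_estimate_with_se(theta_draws, h, logw):
    """Self-normalized IS estimate and a plug-in variance based on
    Z_t = w_t * (h_t - H_hat) / mean(w)  (delta method for the ratio)."""
    w = np.exp(np.asarray(logw) - np.max(logw))
    hvals = np.array([h(t) for t in theta_draws])
    H_hat = np.sum(w * hvals) / np.sum(w)
    z = w * (hvals - H_hat) / w.mean()
    return H_hat, z.var(ddof=1) / len(z)    # variance of H_hat if the draws were iid
```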
Behaviour of the weights

      w(\theta, y, x) = \frac{f_\theta(y)}{f_\theta(x)}, \qquad \tilde w(\theta) = \frac{\pi(\theta \mid x)}{\pi(\theta \mid y)}.

If x = (x_1, ..., x_n) =^d y = (y_1, ..., y_n), then under regularity conditions

      \tilde w(\theta) = e^{\, n (\hat\theta_x - \hat\theta_y)' I(\theta_0) (\theta - \hat\theta_x)} \left(1 + O(n^{-1/2})\right),

      \mathrm{var}_{as}(\tilde w(\theta) \mid x, y) = \exp\left\{ n (\hat\theta_x - \hat\theta_y)' I(\theta_0) (\hat\theta_x - \hat\theta_y) \right\} - 1.

• Stability: \hat\theta_x \neq \hat\theta_y \longrightarrow instability.
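A small numerical illustration of the asymptotic weight variance above, under the simplifying assumptions (mine, not the slide's) of a scalar θ and I(θ_0) = 1: as long as the two MLEs differ by O(n^{-1/2}) the variance stays moderate, but it blows up very quickly beyond that.

```python
import numpy as np

n = 250
# scalar case with I(theta_0) = 1: var_as = exp(n * (mle_x - mle_y)^2) - 1
for c in (0.5, 1.0, 2.0, 4.0):
    delta = c / np.sqrt(n)                  # |mle_x - mle_y| = c / sqrt(n)
    var_as = np.exp(n * delta ** 2) - 1.0
    print(f"c = {c}: |mle_x - mle_y| = {delta:.3f}, asymptotic weight variance = {var_as:.2f}")
```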
Stabilization by recentering
• Simple stabilization:

      \theta^{t\,\prime} = \theta^t + \hat\theta_y - \hat\theta_x, \qquad
      w'(\theta^t) = \frac{\pi(\theta^{t\,\prime})\, f(y^n \mid \theta^{t\,\prime})}{\pi(\theta^t)\, f(x^n \mid \theta^t)} = 1 + O_P(n^{-1/2}).

  • Very effective if the posterior is not too strongly multimodal.
  • Often \hat\theta_x is complicated to compute: two-step procedure (see the sketch after this slide).
• Compute the centering:

      \tilde\theta^y = \frac{\sum_t \theta^t w_t}{\sum_t w_t}.

• Apply the centering:

      \theta^{t\,\prime} = \theta^t + \tilde\theta^y - \tilde\theta^x.

• Conditions: x and y have marginally the same distribution but are not necessarily independent.
See also MacEachern and Peruggia.
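A sketch of the two-step recentered estimator in Python, under the assumptions that the parameter is continuous (so the draws can be shifted), that `theta_draws` is a (T, d) array of MCMC output for π(·|x), and that `loglik` and `logprior` are user-supplied; none of these names come from the slides.

```python
import numpy as np

def recentered_is(theta_draws, h, loglik, logprior, x, y):
    """Two-step recentering: estimate the posterior means under x and y,
    shift the draws by their difference, then reweight with
    w'(t) = pi(theta') f(y|theta') / (pi(theta) f(x|theta))."""
    # Step 1: raw IS weights and the two centerings
    logw = np.array([loglik(t, y) - loglik(t, x) for t in theta_draws])
    w = np.exp(logw - logw.max())
    center_x = theta_draws.mean(axis=0)                           # posterior mean under x
    center_y = (w[:, None] * theta_draws).sum(axis=0) / w.sum()   # IS estimate of the mean under y
    # Step 2: shift the draws and recompute the weights
    shifted = theta_draws + (center_y - center_x)
    logw2 = (np.array([logprior(t) + loglik(t, y) for t in shifted])
             - np.array([logprior(t) + loglik(t, x) for t in theta_draws]))
    w2 = np.exp(logw2 - logw2.max())
    hvals = np.array([h(t) for t in shifted])
    return np.sum(w2 * hvals) / np.sum(w2)
```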
Regression example (explicit calculations)
• Problem: test
    H_0: Y_i ~ N(β_0, σ²)  versus  H_1: Y_i ~ N(β_0 + β_1 x_i, σ²).
• Test procedure: |β_1| < ε versus |β_1| ≥ ε, i.e. accept H_0 iff H(Y) = E^π[β_1² | Y] < ε.
• Problem: how to choose ε? ⇒ use a p-value.
Under H_0, θ = (β_0, σ) is unknown ⇒ use the conditional predictive p-value (Bayarri and Berger; Robins et al.; Robert and Rousseau; Fraser and Rousseau):

      p(Y^o) = \int_\Theta P_\theta\left[ H(Y) > H(Y^o) \mid \hat\theta \right] d\pi_0(\theta \mid \hat\theta) = P_\theta\left[ H(Y) > H(Y^o) \mid \hat\theta \right].

Special case here: π(β_0, β_1, σ) ∝ 1/σ and n = 250.
      H(y) = E^\pi\left[\beta_1^2 \mid y, x\right] = \hat\beta_1^2 + \frac{\sum_i (y_i - \hat y_i)^2}{(n-4)\sum_i (x_i - \bar x)^2}.

• Algorithm for each p-value:
  • Simulate [Y^1 | β̂_0, σ̂] ~ f(Y | θ̂) and run an MCMC for π(ψ | Y^1), ψ = (β_0, β_1, σ), giving ψ^t, t = 1, ..., T.
  • For j = 2, ..., J simulate [Y^j | β̂_0, σ̂] ~ f(Y | θ̂), iid.
  • For every j = 2, ..., J compute w_t(ψ^t, Y^j, Y^1) and ψ̂^j, and

      H_S(Y^j) = \frac{\sum_t w_t\, (\beta_1^t)^2}{\sum_t w_t}  (and also E^π[β_1² | Y^j]),

      H_{CS1} = \frac{\sum_t w_t'\, ((\beta_1^t)')^2}{\sum_t w_t'}, \qquad (\psi^t)' = \psi^t + \hat\psi^j - \hat\psi^1,

      H_{CS2} = \frac{\sum_t w_t''\, ((\beta_1^t)'')^2}{\sum_t w_t''}, \qquad (\psi^t)'' = \psi^t + E^\pi[\psi \mid Y^j] - E^\pi[\psi \mid Y^1].
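A sketch in Python of the reweighting step for one replicate Y^j in this regression example, computing the plain IS estimate H_S and the MLE-recentered estimate H_CS1. The Gaussian log-likelihood and the handling of the improper prior π(ψ) ∝ 1/σ follow the slide; the function and argument names are illustrative assumptions.

```python
import numpy as np

def gauss_loglik(psi, y, x):
    """Log-likelihood of y_i ~ N(beta0 + beta1 * x_i, sigma^2), psi = (beta0, beta1, sigma)."""
    b0, b1, sigma = psi
    return -len(y) * np.log(sigma) - 0.5 * np.sum((y - b0 - b1 * x) ** 2) / sigma ** 2

def H_S_and_CS1(psi_draws, x, y1, yj, psi_hat_1, psi_hat_j):
    """psi_draws: (T, 3) array of MCMC output for pi(psi | Y^1)."""
    # Plain IS: weights f(Y^j | psi^t) / f(Y^1 | psi^t)
    logw = np.array([gauss_loglik(p, yj, x) - gauss_loglik(p, y1, x) for p in psi_draws])
    w = np.exp(logw - logw.max())
    H_S = np.sum(w * psi_draws[:, 1] ** 2) / np.sum(w)
    # MLE-recentered IS: shift every component of psi by psi_hat_j - psi_hat_1
    # (assumes the shift keeps the sigma component positive)
    shifted = psi_draws + (np.asarray(psi_hat_j) - np.asarray(psi_hat_1))
    logw2 = (np.array([gauss_loglik(p, yj, x) for p in shifted])
             - np.array([gauss_loglik(p, y1, x) for p in psi_draws]))
    logw2 += np.log(psi_draws[:, 2]) - np.log(shifted[:, 2])   # prior ratio for pi(psi) prop. to 1/sigma
    w2 = np.exp(logw2 - logw2.max())
    H_CS1 = np.sum(w2 * shifted[:, 1] ** 2) / np.sum(w2)
    return H_S, H_CS1
```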
Results
250 p-values; for each, J = 1000 Monte Carlo samples and T = 10^5 MCMC samples with a burn-in of 1000.
[Figure: scatter plots of the MCMC p-value, the IS p-value, the MLE-centered IS p-value and the posterior-mean-centered IS p-value against the "exact" p-value.]
Weights
[Figure: blue line is a kernel density estimate of the calculated variance of the normalised IS weights; red line is the density of a χ²₁ distribution.]
Remarks
• Weights: close to their asymptotic behaviour.
• Simple IS: OK but not marvelous.
• Recentered IS (MLE or posterior mean): much better.
• Posterior distribution: close to Gaussian ⇒ perfect for recentering.
GOF: nonparametric example
• Problem:
    H_0: f_* ∈ F  against  H_1: f_* ∉ F,   F = {f_θ, θ ∈ Θ ⊂ R^d}.
• Nonparametric model for F^c (VW + RR + R): F(y | θ) ~ G_ψ, ψ ∈ S, a distribution on [0, 1] with ∃ψ_0 such that g_{ψ_0} ≡ 1.
  If Y ~ f_θ then F(Y | θ) ~ U(0, 1). Model:

      f_*(y \mid \theta, \psi) = f(y \mid \theta)\, g(F(y \mid \theta) \mid \psi), \qquad \theta \in \Theta, \ \psi \in S.

• Prior on H_1: dπ_1(θ, ψ) = dπ_0(θ) dπ(ψ).
Test
• Test statistic (Bayesian): H(x) = E^{π_1}[d(1, g(· | ψ)) | x].
• p-value:

      p(x^o) = \int_\Theta P_\theta\left[ H(y) > H(x^o) \mid \hat\theta \right] \pi_0(\theta)\, d\theta.

• Set-up here: f_θ ≡ Exp(θ), π_0 = Γ(γ_1, γ_2), π_1 = mixture of triangular distributions (fixed partition, random weights):

      \psi = (k, \omega), \quad k \in \mathbb{N}, \quad \omega \in S_k = \{ z \in [0,1]^{k};\ \textstyle\sum_{i=0}^k z_i = 1 \},

      g(y \mid \omega, k) = \sum_{i=0}^k \omega_i h_i(y; k), \qquad \pi(k) = C(\rho)\rho^k, \ k \geq 1, \qquad \pi(\omega \mid k) = \mathcal{D}(\alpha_{0,k}, \ldots, \alpha_{k,k}).
Algorithm
• Simulate y^1 | θ̂_x ~ f(y | θ̂_y = θ̂_x).
• MCMC = RJMCMC: η^t = (θ^t, k^t, ω^t) targeting π(η | y^1).
• Compute H(y^1).
• For all j = 2, ..., J: y^j =^d y^1 (iid).
• Compute w_t(η^t, y^j, y^1) ⇒ H(y^j) ⇒

      p(x^o) = \frac{1}{J} \sum_j \mathbf{1}\{ H(y^j) > H(x^o) \}.

• Recentering: only on k = 2, using the MLE or the posterior mean of (θ, ω_1), because

      P^\pi[k = 2 \mid y_1, \ldots, y_n] = 1 + o_P(1) \quad \text{if } y = (y_1, \ldots, y_n) \in H_0.
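Schematically, the whole p-value computation reuses the single RJMCMC run; each replicate only costs a reweighting. A hedged Python outline, where the callables `simulate_rep` and `H_from_is` are placeholders for the problem-specific data generation under θ̂ and the IS evaluation of H, not functions from the slides:

```python
def is_pvalue(eta_draws, H_x_obs, H_from_is, simulate_rep, J, rng):
    """p-value loop of the slide: one RJMCMC run (eta_draws, from y^1) is reused,
    every further replicate y^j only requires importance-sampling weights."""
    exceed = 0
    n_rep = J - 1
    for _ in range(n_rep):
        y_j = simulate_rep(rng)              # y^j ~ f(. | theta_hat), iid across j
        H_j = H_from_is(eta_draws, y_j)      # IS estimate of H(y^j), no new MCMC
        exceed += int(H_j > H_x_obs)
    return exceed / n_rep
```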
Results
n = 250, x^o ~ Exp(1/4), J = 1000, T = 100000 with a burn-in of 5000, and 250 p-values.
[Figure: empirical cdfs of the IS p-values, the MLE-centered IS p-values and the posterior-mean-centered IS p-values, together with the estimated power Pr(reject H_0) as a function of the alternative.]
Weights
[Figure: histograms of 0.5 log(Var(IS weights, raw) + 1), 0.5 log(Var(IS weights, centered at the MLE) + 1) and 0.5 log(Var(IS weights, centered at the MLE | k = 2) + 1).]
Conclusion
• IS within MCMC for repeated sampling: promising.
• The curse of dimensionality is not so severe.
• Much quicker.
• Possible simple improvements: recentering (+ rescaling).
• Other improvements: if a data set y^j is too different from y^1, IS does not work well. Consider several anchor data sets (y^1, ..., y^{J_0}) (Guihenneuc et al.):
  • for each anchor, run one MCMC;
  • for a new y^j, choose the best y^l, l = 1, ..., J_0, and compute the IS estimate with it (see the sketch after this slide).
• Excellent for prior sensitivity analysis.
• Non-stationarity ⇒ bad.
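On the multiple-anchor idea, the slide does not spell out how the "best" y^l is chosen. One natural proxy, suggested by the weight-variance formula earlier (and only an assumption here), is to pick the anchor whose MLE is closest to the MLE of the new data set:

```python
import numpy as np

def pick_anchor(mle_new, anchor_mles):
    """Return the index l of the anchor data set y^l whose MLE is closest to that of
    the new data set; this minimises the leading term n * ||mle_l - mle_new||^2
    driving the asymptotic variance of the IS weights (identity Fisher information assumed)."""
    d2 = [np.sum((np.asarray(m) - np.asarray(mle_new)) ** 2) for m in anchor_mles]
    return int(np.argmin(d2))
```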