More of the SAME?

Sequential and Pseudomarginal Monte Carlo for
Point Estimation in Latent Variable Models
Adam M. Johansen
Collaborators: Manuel Davy, Arnaud Doucet and Axel Finke
a.m.johansen@warwick.ac.uk
Imperial College — 6th February, 2014
Outline
Background:
Marginal MLEs
SAME: An MCMC Scheme
Sequential Monte Carlo
The SMC Method
A Population-Based SAME Method
Examples
Pseudomarginal Methods
The Pseudomarginal Method
More of the SAME: multiple extensions of the space
Example
Even more of the SAME: complex extensions of the space
Examples
Background
Marginal MLEs
SAME: An MCMC Scheme
Maximum {Likelihood|a Posteriori} Estimation
Consider a model with:
parameters, θ,
latent variables, x, and
observed data, y.
Aim to maximise the marginal likelihood

p(y|θ) = ∫ p(x, y|θ) dx

or the posterior

p(θ|y) ∝ ∫ p(x, y|θ)p(θ) dx.

Traditional approach is Expectation-Maximisation (EM):
Requires objective function in closed form.
Susceptible to trapping in local optima.
A Probabilistic Approach
Optimization via probability: simulated annealing.
A distribution of the form

π_γ(θ|y) ∝ p(θ)p(y|θ)^γ

will become concentrated, as γ → ∞, on the maximisers of p(y|θ), under weak conditions.
Why not target π_γ(θ|y) using MCMC?
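A quick numerical illustration of this concentration, using a hypothetical bimodal likelihood on a grid (not the talk's model; the likelihood and flat prior here are illustrative assumptions):

```python
import numpy as np

# Hypothetical bimodal "likelihood" with global maximiser at theta = 2
# and a strictly lower local maximiser at theta = -2; flat prior on the grid.
def likelihood(theta):
    return np.exp(-(theta - 2.0) ** 2) + 0.5 * np.exp(-(theta + 2.0) ** 2)

thetas = np.linspace(-6.0, 6.0, 4001)

def mass_near_global(gamma, radius=0.5):
    """Probability mass of pi_gamma within `radius` of the global maximiser."""
    w = likelihood(thetas) ** gamma   # pi_gamma up to normalisation
    w /= w.sum()                      # normalise on the grid
    return w[np.abs(thetas - 2.0) < radius].sum()

masses = {g: mass_near_global(g) for g in (1, 10, 100)}
```

As γ grows, the mass near the global maximiser approaches 1 while the secondary mode, whose likelihood value is strictly smaller, is exponentially suppressed.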
Adapted from Hwang (1980), Theorem 2.1.
Assume:
p(θ) and p(y|θ) are α-Lipschitz continuous in θ;
log p(θ) ∈ C³(ℝⁿ) and log p(y|θ) ∈ C³(ℝⁿ);
Θ_ML is a non-empty, countable set which is nowhere dense;
p(θ) ≤ M < ∞; p(θ) > 0 ∀θ ∈ Θ_ML;
p(y|θ) ≤ M′ < ∞;
for some k < sup p(y|θ), the set {θ : p(y|θ) ≥ k} is compact.
Then:

lim_{γ→∞} π_γ(dθ) ∝ Σ_{θ_ml ∈ Θ_ML} α(θ_ml) δ_{θ_ml}(dθ),    (1)

where

α(θ_ml) = det[ −∂² log p(y|θ) / ∂θ_m ∂θ_n |_{θ=θ_ml} ]^{−1/2}.    (2)
State Augmentation for Maximisation of Expectations
Data Augmentation: synthetic distributions of the form

π̄_γ(θ, x_{1:γ} | y) ∝ p(θ) ∏_{i=1}^{γ} p(x_i, y|θ)

admit the marginal

π̄_γ(θ|y) ∝ p(θ)p(y|θ)^γ.

SAME Algorithm (Doucet, Godsill and Robert, 2002):
t = 0: Initialise (θ_0, X_{0,1}) arbitrarily.
For t = 1, …, T:
If γ(t) > γ(t−1): set (X_{t−1,γ(t−1)+1}, …, X_{t−1,γ(t)}) arbitrarily.
Sample (θ_t, X_{t,1}, …, X_{t,γ(t)}) ∼ K_{γ(t)}((θ_{t−1}, X_{t−1,1}, …, X_{t−1,γ(t)}), ·),
where K_γ is π̄_γ-invariant.
NB: an inhomogeneous Markov chain.
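The marginalisation identity behind the augmented target can be checked numerically; a minimal sketch with a hypothetical binary latent variable (the joint density table is an illustrative assumption, not the talk's model):

```python
import itertools
import numpy as np

# Hypothetical joint p(x, y | theta) for a binary latent x, with y fixed;
# any positive table works for checking the marginalisation identity.
def joint(x, theta):
    return np.exp(-0.5 * (theta - x) ** 2) * (0.3 if x == 0 else 0.7)

gamma = 3
for theta in np.linspace(-2.0, 2.0, 21):
    # Sum the augmented density prod_i p(x_i, y | theta) over all latent
    # configurations (x_1, ..., x_gamma); the prior p(theta) cancels.
    augmented = sum(
        np.prod([joint(x, theta) for x in xs])
        for xs in itertools.product([0, 1], repeat=gamma)
    )
    # Marginalising the replicated latents recovers p(y | theta)^gamma.
    marginal = (joint(0, theta) + joint(1, theta)) ** gamma
    assert np.isclose(augmented, marginal)
```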
Sequential Monte Carlo
The SMC Method
A Population-Based SAME Method
Examples
SMC: A Motivating Example — Filtering
Let X1 , . . . denote the position of an object which follows
Markovian dynamics.
Let Y1 , . . . denote a collection of observations:
Yi |{Xi = xi } ∼ g(·|xi ).
We wish to estimate, as observations arrive, p(x1:t |y1:t ).
A recursion obtained from Bayes' rule exists but is intractable in
most cases.
[Figure: graphical model of the hidden Markov model — hidden states x₁, …, x₆ with observations y₁, …, y₆.]
More Generally
Really tracking a sequence of distributions, pt . . .
on increasing state spaces.
Other problems with the same structure exist.
Any problem of sequentially approximating a sequence of such
distributions, pt , can be addressed in the same way.
Sequential Importance Resampling
At time t, t ≥ 2, given {X_{1:t−1}^{(i)}}_{i=1}^N approximating p_{t−1}(x_{1:t−1}):

Sampling Step
For i = 1:N: sample X_t^{(i)} ∼ q_t(· | X_{1:t−1}^{(i)}).

Resampling Step
For i = 1:N: compute

w_t(X_{1:t}^{(i)}) = p_t(X_{1:t}^{(i)}) / [ p_{t−1}(X_{1:t−1}^{(i)}) q_t(X_t^{(i)} | X_{1:t−1}^{(i)}) ]

and

W_t^{(i)} = w_t(X_{1:t}^{(i)}) / Σ_{j=1}^N w_t(X_{1:t}^{(j)}).

For i = 1:N: sample A_t^{(i)} ∼ Σ_{j=1}^N W_t^{(j)} δ_j; retain {X_{1:t}^{(A_t^{(i)})}}_{i=1}^N.
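A minimal bootstrap-flavoured SIR sketch for a hypothetical scalar linear-Gaussian model, chosen so the exact filtering means are available from a Kalman filter for comparison (all model settings here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
a, T, N = 0.9, 30, 2000          # AR coefficient, time steps, particles

# Simulate X_t = a X_{t-1} + N(0,1), Y_t = X_t + N(0,1).
x = np.zeros(T)
x[0] = rng.normal()
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal()
y = x + rng.normal(size=T)

# Bootstrap SIR: propose from the dynamics, weight by the likelihood,
# resample multinomially at every step.
particles = rng.normal(size=N)
pf_means = []
for t in range(T):
    if t > 0:
        particles = a * particles + rng.normal(size=N)
    logw = -0.5 * (y[t] - particles) ** 2     # Gaussian obs density (up to const)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    pf_means.append(np.sum(w * particles))    # weighted filtering mean
    particles = particles[rng.choice(N, size=N, p=w)]

# Exact filtering means via the Kalman filter, for comparison.
m, P = 0.0, 1.0
kf_means = []
for t in range(T):
    if t > 0:
        m, P = a * m, a * a * P + 1.0
    K = P / (P + 1.0)
    m, P = m + K * (y[t] - m), (1.0 - K) * P
    kf_means.append(m)

max_err = max(abs(p - k) for p, k in zip(pf_means, kf_means))
```

With N = 2000 particles the particle estimates track the exact Kalman means closely.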
[Figures: iterations 2–10 of sequential importance resampling, one panel per iteration showing the evolving particle approximation.]
SMC Samplers (Del Moral et al., 2006)
Can be used to sample from any sequence of distributions:
Given a sequence of target distributions, η_n, on E_n, construct a synthetic sequence, η̃_n, on the product spaces ∏_{p=1}^{n} E_p by introducing Markov kernels, L_p, from E_{p+1} to E_p:

η̃_n(x_{1:n}) = η_n(x_n) ∏_{p=1}^{n−1} L_p(x_{p+1}, x_p).

These distributions
have the target distributions as final-time marginals,
have the correct structure to employ SMC techniques.
SMC Outline
Given a sample {X_{1:n−1}^{(i)}}_{i=1}^N targeting η̃_{n−1}:
sample X_n^{(i)} ∼ K_n(X_{n−1}^{(i)}, ·),
calculate

W_n(X_{1:n}^{(i)}) = [ η_n(X_n^{(i)}) L_{n−1}(X_n^{(i)}, X_{n−1}^{(i)}) ] / [ η_{n−1}(X_{n−1}^{(i)}) K_n(X_{n−1}^{(i)}, X_n^{(i)}) ].

Resample, yielding {X_{1:n}^{(i)}}_{i=1}^N targeting η̃_n.

Hints that we'd like to use

L_{n−1}(x_n, x_{n−1}) = η_{n−1}(x_{n−1}) K_n(x_{n−1}, x_n) / ∫ η_{n−1}(x′_{n−1}) K_n(x′_{n−1}, x_n) dx′_{n−1}.
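With η_n-invariant MCMC kernels K_n and the time-reversal choice of L_{n−1}, the incremental weight reduces to η_n/η_{n−1} evaluated at the pre-move position. A minimal sketch on a hypothetical geometric annealing sequence from a broad Gaussian to N(3, 1) (all settings here are illustrative assumptions, not the talk's targets):

```python
import numpy as np

rng = np.random.default_rng(1)

def logtarget(x):                  # final target: N(3, 1), up to a constant
    return -0.5 * (x - 3.0) ** 2

def loginit(x):                    # initial distribution: N(0, 10^2)
    return -0.5 * x ** 2 / 100.0

def logeta(x, b):                  # geometric bridge eta_b ∝ init^(1-b) target^b
    return (1.0 - b) * loginit(x) + b * logtarget(x)

betas = np.linspace(0.0, 1.0, 21)
N = 2000
x = rng.normal(scale=10.0, size=N)           # exact draw from eta_0
for b_prev, b in zip(betas[:-1], betas[1:]):
    # Incremental weight eta_b / eta_{b_prev} at the current positions.
    logw = logeta(x, b) - logeta(x, b_prev)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    x = x[rng.choice(N, size=N, p=w)]        # resample
    # One eta_b-invariant random-walk Metropolis move per particle.
    prop = x + rng.normal(scale=1.0, size=N)
    accept = np.log(rng.uniform(size=N)) < logeta(prop, b) - logeta(x, b)
    x = np.where(accept, prop, x)

final_mean = x.mean()                        # should be close to 3
```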
Recall: Maximum {Likelihood|a Posteriori} Estimation
A model with:
parameters, θ,
latent variables, x, and
observed data, y.
Aim to maximise the marginal likelihood

p(y|θ) = ∫ p(x, y|θ) dx

or the posterior

p(θ|y) ∝ ∫ p(x, y|θ)p(θ) dx,

using

π̄_γ(θ, x_{1:γ} | y) ∝ p(θ) ∏_{i=1}^{γ} p(x_i, y|θ).
Maximum Likelihood via SMC
Use a sequence of distributions ηn = π̄γn for some {γn }.
The MCMC approach (Doucet et al., 2002).
Requires slow “annealing”.
Separation between distributions is large.
Mixes poorly as γ increases.
Using SMC has some substantial advantages:
Introducing bridging distributions, for γ = ⌊γ⌋ + ⟨γ⟩, of

π̄_γ(θ, x_{1:⌊γ⌋+1} | y) ∝ p(θ) p(x_{⌊γ⌋+1}, y|θ)^{⟨γ⟩} ∏_{i=1}^{⌊γ⌋} p(x_i, y|θ)

is straightforward.
Population of samples improves robustness.
It is less dependent upon mixing of K_γ.
Algorithms
A generic SMC sampler can be written down directly. . .
An easy case:
Sample from p(x_t | y, θ_{t−1}) and p(θ_t | x_t, y).
Weight according to p(y|θ_{t−1})^{γ_t − γ_{t−1}}.
The general case:
Sample existing variables from a π_t-invariant kernel: (θ_t, X_{t,1:γ_{t−1}}) ∼ K_t((θ_{t−1}, X_{t−1}), ·).
Sample new variables from an arbitrary proposal: X_{t,⌈γ_{t−1}⌉+1:⌈γ_t⌉} ∼ q(·|θ_t).
Use a combination of time-reversal and optimal auxiliary kernels.
Weight expression does not involve the marginal likelihood.
An SMC-Based SAME Algorithm
(sorry!)
t = 1 (initialisation):
  sample {(θ_1^{(i)}, X_1^{(i)})}_{i=1}^N iid ∼ ν
  calculate W_1^{(i)} ∝ π_{γ_1}(θ_1^{(i)}, X_1^{(i)}) / ν(θ_1^{(i)}, X_1^{(i)}), normalised so that Σ_{i=1}^N W_1^{(i)} = 1
for t = 2 to T do
  resample
  sample (θ_t^{(i)}, X_{t,1:⌈γ_{t−1}⌉}^{(i)}) ∼ K_{t−1}((θ_{t−1}^{(i)}, X_{t−1}^{(i)}); ·)
  sample {X_{t,j}^{(i)} ∼ q(·|θ_t^{(i)})}_{j=⌈γ_{t−1}⌉+1}^{⌊γ_t⌋} if ⌈γ_{t−1}⌉ < ⌊γ_t⌋
  sample X_{t,⌈γ_t⌉}^{(i)} ∼ q_{⟨γ_t⟩}(·|θ_t^{(i)}) if ⌈γ_{t−1}⌉ < ⌈γ_t⌉ ≠ γ_t
  calculate

  W_t^{(i)} ∝ [ p(y, X_{t,⌈γ_{t−1}⌉}|θ_t)^{1∧(γ_t−⌊γ_{t−1}⌋)} / p(y, X_{t,⌈γ_{t−1}⌉}|θ_t)^{⟨γ_{t−1}⟩} ] × ∏_{j=⌈γ_{t−1}⌉+1}^{⌊γ_t⌋} [ p(y, X_{t,j}|θ_t) / q(X_{t,j}|θ_t) ] × [ p(y, X_{t,⌈γ_t⌉}|θ_t)^{⟨γ_t⟩} / q_{⟨γ_t⟩}(X_{t,⌈γ_t⌉}|θ_t) ]^I

  with I = I(⌈γ_t⌉ > ⌊γ_t⌋ ≥ ⌈γ_{t−1}⌉).
end for
Toy Example (using known marginal likelihood)
Student t-distribution of unknown location parameter θ with
ν = 0.05.
Four observations are available, y = (−20, 1, 2, 3).
Log likelihood is:

log p(y|θ) = −0.525 Σ_{i=1}^{4} log(0.05 + (y_i − θ)²).

Global maximum is at θ ≈ 1.997.
Local maxima at θ ≈ −19.993, 1.086 and 2.906.
Complete log likelihood (latent X_i gamma-distributed) is, up to an additive constant:

log p(y, x|θ) = −Σ_{i=1}^{4} [ 0.475 log x_i + 0.025 x_i + 0.5 x_i (y_i − θ)² ].
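The stated maxima can be sanity-checked directly from the log-likelihood formula:

```python
import numpy as np

y = np.array([-20.0, 1.0, 2.0, 3.0])

def loglik(theta):
    """Marginal log-likelihood of the t model (up to an additive constant)."""
    return -0.525 * np.sum(np.log(0.05 + (y - theta) ** 2))

global_max = 1.997
local_maxima = [-19.993, 1.086, 2.906]
values = {theta: loglik(theta) for theta in [global_max] + local_maxima}
```

The value at θ = 1.997 exceeds the value at each of the local maxima, consistent with it being the global maximiser.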
Toy Example: Marginal Likelihood
[Figure: log marginal likelihood as a function of θ over (−30, 20), showing the global maximum near θ = 2 and the local maxima.]
Toy Example: SMC Method using Gibbs Kernels
[Figure: results of the SMC method with Gibbs kernels for annealing schedules of length T = 15, 30 and 60.]
Example: Gaussian Mixture Model – MAP Estimation
Likelihood p(y|x, ω, µ, σ) = N(y|µ_x, σ_x²).
Marginal likelihood p(y|ω, µ, σ) = Σ_{j=1}^{3} ω_j N(y|µ_j, σ_j²).
Diffuse conjugate priors were employed:

ω ∼ Di(δ),    σ_i² ∼ IG((λ_i + 3)/2, β_i/2),    µ_i | σ_i² ∼ N(α_i, σ_i²/λ_i).

All full conditional distributions of interest are available.
Marginal posterior can be calculated.
3 Component GMM (Roeder's Galaxy Data Set)
[Figure: log marginal posterior attained (roughly −168 to −152) against computational cost from 10² to 10⁶ on a log scale, for smc, same and em.]
Pseudomarginal Monte Carlo
The Pseudomarginal Method
More of the SAME: multiple extensions of the space
Example
Even more of the SAME: complex extensions of the space
Examples
The Pseudomarginal Method
Marginal MH acceptance probability:

1 ∧ [ π(θ′)Q(θ′, θ) ] / [ π(θ)Q(θ, θ′) ].

But π(θ) isn't tractable: how about using

1 ∧ [ π̂(θ′)Q(θ′, θ) ] / [ π̂(θ)Q(θ, θ′) ],

where

π̂(θ) = (1/m) Σ_{i=1}^{m} π(θ, X_i)/q(X_i),    X_i iid ∼ q.

Suggests two algorithms (Beaumont, 2003):
Monte Carlo within Metropolis (MCWM)
Grouped Independence Metropolis–Hastings (GIMH)
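A minimal GIMH sketch for a hypothetical conjugate toy problem whose exact marginal π(θ) is standard normal, so the chain's output can be checked (the densities, proposal and settings here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def logphi(z, mu, sd):
    """Log density of N(mu, sd^2)."""
    return -0.5 * ((z - mu) / sd) ** 2 - np.log(sd) - 0.5 * np.log(2 * np.pi)

def pi_hat(theta, m=10):
    """Unbiased importance-sampling estimate of pi(theta) = N(theta; 0, 1),
    built from the joint pi(theta, x) = N(theta; 0, 1) N(x; theta, 1)
    with latent proposal q(x) = N(x; 0, 2^2)."""
    x = rng.normal(0.0, 2.0, size=m)
    w = np.exp(logphi(theta, 0.0, 1.0) + logphi(x, theta, 1.0) - logphi(x, 0.0, 2.0))
    return w.mean()

# GIMH: recycle the current state's estimate; refresh it only on acceptance.
theta, est = 0.0, pi_hat(0.0)
samples = []
for _ in range(20000):
    prop = theta + rng.normal(scale=1.0)     # symmetric random-walk proposal
    prop_est = pi_hat(prop)
    if rng.uniform() < prop_est / est:       # Q terms cancel for a symmetric Q
        theta, est = prop, prop_est
    samples.append(theta)

samples = np.array(samples[2000:])           # discard burn-in
```

Because the extended target is exact, the chain's marginal for θ is N(0, 1) despite the noisy density estimates.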
Pseudomarginal Methods: GIMH is “Exact”
Extended target (Andrieu & Roberts, 2009):

π̃(θ, x_1, …, x_m) = (1/m) Σ_{j=1}^{m} π(θ, x_j) ∏_{k≠j} q(x_k)
                  = [ (1/m) Σ_{j=1}^{m} π(θ, x_j)/q(x_j) ] ∏_{k=1}^{m} q(x_k) = π̂(θ) ∏_{k=1}^{m} q(x_k).

The acceptance probability becomes:

1 ∧ [ π̃(θ′, x′_1, …, x′_m) Q(θ′, θ) ∏_{j=1}^{m} q(x_j) ] / [ π̃(θ, x_1, …, x_m) Q(θ, θ′) ∏_{j=1}^{m} q(x′_j) ] = 1 ∧ [ π̂(θ′)Q(θ′, θ) ] / [ π̂(θ)Q(θ, θ′) ].

NB: MCWM is not exact… but perhaps we don't care.
A Pseudomarginal SAME Algorithm
We'd like to target π_γ(θ|y) ∝ p(θ)p(y|θ)^γ.
Why not use the pseudomarginal approach, considering instead:

π̃_γ(θ, x_{1:m}^1, …, x_{1:m}^γ) = p(θ) ∏_{i=1}^{γ} [ (1/m) Σ_{j=1}^{m} p(x_j^i, y|θ)/q(x_j^i|θ) ] ∏_{k=1}^{m} q(x_k^i|θ).

Expect behaviour like simulated annealing for large m.
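Why large m matters: the target multiplies γ independent unbiased estimates, and the relative noise of such a product grows with γ but shrinks with m. A small illustration with hypothetical lognormal weights standing in for the ratios p(x, y|θ)/q(x|θ) (an illustrative assumption, not the talk's model):

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, reps = 20, 2000

def product_estimates(m):
    """reps independent realisations of prod_{i=1}^gamma pi_hat_i, where each
    pi_hat_i is an average of m iid lognormal(0, 1) weights."""
    w = rng.lognormal(mean=0.0, sigma=1.0, size=(reps, gamma, m))
    return w.mean(axis=2).prod(axis=1)

rel_sd = {}
for m in (1, 100):
    est = product_estimates(m)
    rel_sd[m] = est.std() / est.mean()   # relative standard deviation
```

With m = 1 the relative noise of the γ-fold product is enormous; with m = 100 it is modest, so the extended target behaves much like the exact annealed target.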
The Student t Model Revisited
[Figure: summary of 200 runs — box plots of θ estimates for simulated annealing, SAME, and the GIMH and MCWM variants (with and without prior), at N = 5, 50 and 100.]
What about complicated latent variable structures?
Actually, pseudomarginal algorithms are more flexible.
We're especially interested in particle MCMC implementations (Andrieu et al., 2010):
Particle Marginal Metropolis–Hastings (PMMH)
MCWM variant of PMMH
Particle Gibbs (with ancestor sampling)
State-space models are the real motivation for this methodology.
Many other complex models could be addressed using this technique.
Linear Gaussian Hidden Markov Model
Model:

X_t = A X_{t−1} + B U_t
Y_t = X_t + D V_t

Data: 50 observations simulated using A = 0.9, B = 1, and D = 1.
Algorithms:
The PMMH/MCWM algorithms use N = 250 particles;
The PG algorithm (with ancestor sampling) uses N = 50 particles but attempts 100 static parameter updates per iteration.
Inverse temperature increases linearly from 0.1 to 10.
Final 1000 iterations: γ_t = 10.
Compare with an exact marginal simulated annealing algorithm.
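The "exact marginal" baseline is available here because the marginal likelihood of a linear Gaussian model can be computed with a Kalman filter; a sketch using the slide's simulation settings, with X_1 ∼ N(0, 1) as an assumed initial distribution:

```python
import numpy as np

rng = np.random.default_rng(4)
A, B, D, T = 0.9, 1.0, 1.0, 50

# Simulate X_t = A X_{t-1} + B U_t, Y_t = X_t + D V_t.
x = rng.normal()                        # X_1 ~ N(0, 1), an assumption
ys = []
for _ in range(T):
    ys.append(x + D * rng.normal())
    x = A * x + B * rng.normal()

def loglik(a, b=1.0, d=1.0):
    """Exact log p(y_{1:T} | A=a, B=b, D=d) via the Kalman filter."""
    m, P, ll = 0.0, 1.0, 0.0
    for t, y in enumerate(ys):
        if t > 0:                       # predict
            m, P = a * m, a * a * P + b * b
        S = P + d * d                   # predictive variance of Y_t
        ll += -0.5 * (np.log(2 * np.pi * S) + (y - m) ** 2 / S)
        K = P / S                       # update
        m, P = m + K * (y - m), (1.0 - K) * P
    return ll
```

With 50 observations, the marginal likelihood clearly prefers the true A = 0.9 over badly misspecified values, which is what the annealed targets exploit.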
[Figures: traces of A, B and D over 3000 iterations for PMMH (N = 250), MCWM (N = 250), Particle Gibbs (N = 50) and simulated annealing, with the true parameter and MLE marked in each panel.]
Summary of 100 Runs
[Figure: box plots of the final estimates of A, B and D across 100 runs for PMMH (N = 250), MCWM (N = 250), Particle Gibbs (N = 50) and simulated annealing.]
A Simple Stochastic Volatility Model
Model:

X_i = α + δX_{i−1} + σ_u u_i
Y_i = exp(X_i/2) ε_i
X_1 ∼ N(µ_0, σ_0²)

where u_i and ε_i are uncorrelated standard normal random variables, and θ = (α, δ, σ_u).
200 observations; simulated with δ = 0.95, α = −0.363 and σ_u = 0.26.
Diffuse instrumental prior distributions:

δ ∼ U(−1, 1),    α ∼ N(0, 1),    σ_u^{−2} ∼ Ga(1, 0.1)

are quickly forgotten.
Inverse temperature increases linearly from 0.1 to 10.
Final 500 iterations: γ_t = 10.
A more complex multi-factor model is also under investigation.
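A bootstrap particle filter supplies the unbiased likelihood estimates that PMMH needs for this model; a minimal sketch (the simulation, the stationary initialisation of X_1, and the particle settings are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, delta, sigma_u, T, N = -0.363, 0.95, 0.26, 200, 1000

# Simulate X_i = alpha + delta X_{i-1} + sigma_u u_i, Y_i = exp(X_i/2) eps_i,
# starting X_1 from the stationary distribution (an assumption).
x = alpha / (1.0 - delta) + (sigma_u / np.sqrt(1.0 - delta ** 2)) * rng.normal()
ys = []
for _ in range(T):
    ys.append(np.exp(x / 2.0) * rng.normal())
    x = alpha + delta * x + sigma_u * rng.normal()

def pf_loglik(a, d, s):
    """Bootstrap particle filter estimate of log p(y_{1:T} | alpha, delta, sigma_u)."""
    xs = a / (1.0 - d) + (s / np.sqrt(1.0 - d ** 2)) * rng.normal(size=N)
    ll = 0.0
    for t, y in enumerate(ys):
        if t > 0:
            xs = a + d * xs + s * rng.normal(size=N)
        # log N(y; 0, exp(x)) observation density.
        logw = -0.5 * (xs + np.log(2 * np.pi) + y ** 2 * np.exp(-xs))
        mx = logw.max()
        w = np.exp(logw - mx)
        ll += mx + np.log(w.mean())
        xs = xs[rng.choice(N, size=N, p=w / w.sum())]
    return ll

ll_true = pf_loglik(alpha, delta, sigma_u)       # at the data-generating values
ll_bad = pf_loglik(alpha, delta, 2.0)            # with badly misspecified sigma_u
```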
[Figures: traces of α, δ and σ² over 3000 iterations for PMMH (N = 250), MCWM (N = 250) and Particle Gibbs (N = 50), with the true parameter values marked in each panel.]
In Conclusion
Monte Carlo isn’t just for calculating posterior expectations.
SMC and Pseudomarginal methods are effective for ML and MAP
estimation.
Still work in progress. . .
Scope for embedding Pseudomarginal target within SMC
algorithm. . .
and adaptation.
References
[1] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte
Carlo computations. Annals of Statistics, 37(2):697–725, 2009.
[2] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo.
Journal of the Royal Statistical Society B, 72(3):269–342, 2010.
[3] M. Beaumont. Estimation of population growth or decline in genetically monitored
populations. Genetics, 164(3):1139–1160, 2003.
[4] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of
the Royal Statistical Society B, 68(3):411–436, 2006.
[5] A. Doucet, S. J. Godsill, and C. P. Robert. Marginal maximum a posteriori estimation
using Markov chain Monte Carlo. Statistics and Computing, 12:77–84, 2002.
[6] A. Finke. On Extended State-Space Constructions for Monte Carlo Methods.
Ph.D. thesis, University of Warwick, 2015. In preparation.
[7] A. Finke and A. M. Johansen. More of the SAME? Pseudomarginal
methods for point estimation in latent variable models. In preparation,
2015.
[8] C.-R. Hwang. Laplace’s method revisited: Weak convergence of probability measures.
Annals of Probability, 8(6):1177–1182, December 1980.
[9] A. M. Johansen, A. Doucet, and M. Davy. Maximum likelihood parameter estimation
for latent models using sequential Monte Carlo. In Proceedings of ICASSP, volume III,
pages 640–643. IEEE, May 2006.
[10] A. M. Johansen, A. Doucet, and M. Davy. Particle methods for maximum
likelihood parameter estimation in latent variable models. Statistics and
Computing, 18(1):47–57, March 2008.