Towards scalable Monte Carlo algorithms for some models involving latent variables

Christophe Andrieu
joint with Arnaud Doucet (Oxford) and Sinan Yıldırım (Bristol / Istanbul)
April 22, 2016
Overview
- Assume we are interested in sampling from a probability distribution with density π(θ), defined on some space (Θ, E).
- Most general purpose algorithms to do so require one to be able to evaluate π(θ) for θ ∈ Θ. This is not always possible.
- Some progress has been made in recent years on this problem, but the solutions do not always scale with the complexity of some problems.
- The presentation is about how some of these algorithms can be made to scale.
The Metropolis-Hastings algorithm
- Assume we are interested in sampling from a probability distribution with density π(θ), defined on some space (Θ, E).
- The Metropolis-Hastings (MH) algorithm generates a Markov chain {θn, n ≥ 0} which leaves π invariant.

The MH update proceeds as follows. Given θn = θ:
- Propose θ′ ∼ q(θ, ·).
- Calculate the acceptance ratio
  r(θ, θ′) := [π(θ′) q(θ′, θ)] / [π(θ) q(θ, θ′)].
- Set θn+1 = θ′ with probability α(θ, θ′) := min{1, r(θ, θ′)}, otherwise set θn+1 = θ.
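For concreteness, a minimal random-walk MH sketch in Python; the standard-normal target, step size and chain length are illustrative assumptions rather than anything prescribed by the slides.

```python
import numpy as np

def metropolis_hastings(log_pi, theta0, n_iter, step=2.4, rng=None):
    """Random-walk MH: propose theta' ~ N(theta, step^2); since this proposal
    is symmetric, the q-terms in r(theta, theta') cancel."""
    rng = np.random.default_rng() if rng is None else rng
    theta, log_p = theta0, log_pi(theta0)
    chain = np.empty(n_iter)
    for n in range(n_iter):
        prop = theta + step * rng.standard_normal()
        log_p_prop = log_pi(prop)
        # accept with probability min{1, r(theta, theta')}
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = prop, log_p_prop
        chain[n] = theta
    return chain

# Example: sample from pi = N(0, 1)
chain = metropolis_hastings(lambda t: -0.5 * t ** 2, theta0=0.0, n_iter=10_000)
```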
Intractable acceptance ratio
- Being able to implement the MH update requires one to evaluate
  r(θ, θ′) = [π(θ′) q(θ′, θ)] / [π(θ) q(θ, θ′)].
- In some situations r(θ, θ′) is intractable (impossible or expensive to compute).
- Example: there is a latent variable z such that π(θ) = ∫ π(θ, z) dz, which is intractable.
- Intractability of r(θ, θ′) is the motivation for exact approximations of MCMC algorithms:
  - exact: the transition kernel leaves π(θ) invariant;
  - approximate: they use an approximation of r(θ, θ′).
- But they may not scale.
Latent variables and pseudo-marginals
- Assume interest is in a posterior distribution
  π(θ) = p(θ|y) ∝ η(θ) p(y|θ) = η(θ) ∫ p(y, x|θ) dx,
  where the integral cannot be computed analytically.
- Then with x(i) iid ∼ Qθ and p(y, x|θ)/Qθ(x) well defined, consider an IS approximation of the likelihood
  p̂(y|θ) = (1/N) Σ_{i=1..N} p(y, x(i)|θ) / Qθ(x(i)).
  This is a noisy measurement of the intractable "likelihood" p(y|θ).
- One could define the "pseudo-marginal"
  π̂(θ) ∝ η(θ) p̂(y|θ).
Pseudo-marginal approach
- Idea: replace π(θ) with non-negative "unbiased" estimators π̂(θ), i.e. such that for some C > 0,
  E[π̂(θ)] = C × π(θ),  θ ∈ Θ.

Pseudo-marginal MH [Andrieu and Roberts, 2009]
Given θn = θ and π̂(θ):
- Propose θ′ ∼ q(θ, ·), calculate π̂(θ′).
- Calculate the acceptance ratio
  r̂(θ, θ′) := [π̂(θ′) q(θ′, θ)] / [π̂(θ) q(θ, θ′)].
- Set θn+1 = θ′ with probability min{1, r̂(θ, θ′)}, otherwise set θn+1 = θ.
- Pseudo-marginal algorithms are exact approximations of the marginal MH.
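A sketch of this update, assuming a user-supplied routine log_pi_hat(theta, rng) that returns the log of a fresh non-negative unbiased estimate of π(θ) (the name and interface are hypothetical). The one essential point is that the estimate at the current state is recycled until the next acceptance:

```python
import numpy as np

def pseudo_marginal_mh(log_pi_hat, theta0, n_iter, step=1.0, rng=None):
    """Pseudo-marginal MH with a symmetric random-walk proposal: the chain
    state is effectively the pair (theta, pi_hat(theta)); pi_hat(theta) is
    NOT refreshed at the current point, only computed anew at proposals."""
    rng = np.random.default_rng() if rng is None else rng
    theta, log_p = theta0, log_pi_hat(theta0, rng)
    chain = np.empty(n_iter)
    for n in range(n_iter):
        prop = theta + step * rng.standard_normal()
        log_p_prop = log_pi_hat(prop, rng)   # fresh estimate at theta'
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = prop, log_p_prop  # the estimate travels with the state
        chain[n] = theta
    return chain
```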
Toy latent variables example
- We consider here a simple example where the target distribution is
  π(θ, x) = N( (θ, x); (0, 0), [[1, −0.9], [−0.9, 1]] ).
- Marginal is π(θ) = N(θ; 0, 1).
- Sample with random walk Metropolis algorithm
  - with q(θ, θ′) = N(θ′; θ, 2.4²) and Qθ(x) = Π_{i=1..N} N(xi; 0, 1) for IS;
  - q(θ, θ′) = N(θ′; θ, 2.4²) is known to be optimal in terms of asymptotic variance.
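A sketch of this toy experiment under the stated choices, reusing the pseudo_marginal_mh sketch above; the conditional factorisation π(θ, x) = N(θ; 0, 1) N(x; −0.9 θ, 1 − 0.9²) and the log-sum-exp evaluation are implementation choices, not part of the slides.

```python
import numpy as np

RHO, S2 = -0.9, 1.0 - 0.9 ** 2          # x | theta ~ N(RHO * theta, S2)

def log_norm(x, m, v):
    return -0.5 * (np.log(2 * np.pi * v) + (x - m) ** 2 / v)

def make_log_pi_hat(N):
    """IS estimate pi_hat(theta) = (1/N) sum_i pi(theta, x_i) / N(x_i; 0, 1),
    with x_i iid N(0, 1); unbiased for pi(theta) = N(theta; 0, 1)."""
    def log_pi_hat(theta, rng):
        x = rng.standard_normal(N)
        log_w = log_norm(x, RHO * theta, S2) - log_norm(x, 0.0, 1.0)
        return (log_norm(theta, 0.0, 1.0)
                + np.logaddexp.reduce(log_w) - np.log(N))
    return log_pi_hat

# e.g. the N = 5 experiment, with the optimal random-walk step 2.4
chain = pseudo_marginal_mh(make_log_pi_hat(5), theta0=0.0, n_iter=1000, step=2.4)
```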
Standard AV
[Figure: marginal density estimate (top) and trace of θ over 1000 iterations (bottom) for Beaumont's algorithm with N = 1.]

N = 5
[Figure: the same plots for Beaumont's algorithm with N = 5.]

N = 10
[Figure: the same plots for Beaumont's algorithm with N = 10.]

N = 20
[Figure: the same plots for Beaumont's algorithm with N = 20.]
Intuition
- The acceptance probability of the algorithm is
  min{1, [π̂(θ′) q(θ′, θ)] / [π̂(θ) q(θ, θ′)]}.
- The probability of escaping (θ, π̂(θ)) can be made arbitrarily small by increasing π̂(θ)...
- The Markov chain becomes "sticky".
Asymptotic variance and expected acceptance probability
- With Π a Markov transition kernel with invariant distribution π, letting θ1 ∼ π and θn ∼ Π(θn−1, ·), define
  τ := lim_{T→∞} T E[ ( (1/T) Σ_{k=1..T} f(θk) − π(f) )² ] ∈ [0, ∞].
  In other words, τ is such that
  var( (1/T) Σ_{k=1..T} f(θk) ) ≈ τ/T.
- The expected acceptance probability of an MH algorithm with invariant distribution π is
  ∫ α(θ, θ′) π(dθ) q(θ, dθ′).
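τ is typically estimated from chain output. A crude sketch, with a fixed truncation lag (an assumption; windowed or batch-means estimators are preferable in practice):

```python
import numpy as np

def iact(chain, max_lag=200):
    """Integrated autocorrelation time 1 + 2 * sum_k rho_k, estimated by
    truncating the empirical autocorrelation sum; requires len(chain) >> max_lag.
    The slide's tau equals var_pi(f) times this quantity."""
    x = np.asarray(chain) - np.mean(chain)
    var = np.mean(x ** 2)
    rho = [np.mean(x[:-k] * x[k:]) / var for k in range(1, max_lag)]
    return 1.0 + 2.0 * np.sum(rho)
```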
Performance as a function of N
[Figure: expected acceptance probability (left) and estimated autocorrelation time (right) as functions of N, for N = 20 to 200.]
Large datasets?
- In order to fix ideas we are going to consider a scenario where the target distribution is a posterior distribution,
- that is, a posterior distribution obtained from Bayes' rule, given a prior density η(θ):
  π(θ) = p(θ | y1:T) ∝ η(θ) p(y1:T | θ),
  where y1:T are observations and p(y1:T | θ) is the assumed probability density of the observations for the value θ of the model's parameter.
- We are going to assume that
  pθ(y1:T) := p(y1:T | θ) = ∫_{X^T} pθ(x1:T, y1:T) dx1:T,
- where pθ(y1:T) is hard or impossible to evaluate, while pθ(x1:T, y1:T) is relatively easy to evaluate for any x1:T ∈ X^T, y1:T ∈ Y^T.
- Some progress has been made in recent years for large classes of models, but it is unclear how these algorithms behave for large T.
Motivating example: state-space models
- Let {Xn, n ≥ 1} be a Markov chain defined by
  X1 ∼ μθ(·) and Xn | (Xn−1 = xn−1) ∼ fθ(· | xn−1).
- We only have access to a process {Yn, n ≥ 1} such that, conditional upon {Xn, n ≥ 1}, the observations are statistically independent and
  Yn | (Xn = xn) ∼ gθ(· | xn).
- θ is an unknown parameter with prior density η(θ). In general
  pθ(y1:T) = ∫_{X^T} μθ(x1) gθ(y1 | x1) Π_{i=2..T} gθ(yi | xi) fθ(xi | xi−1) dx1:T
  is intractable.
Motivating example: state-space models
- Then the MH acceptance ratio is
  r(θ, θ′) = [η(θ′) pθ′(y1:T) q(θ′, θ)] / [η(θ) pθ(y1:T) q(θ, θ′)].
- The intractable quantity is the "likelihood" ratio
  LT(θ, θ′) := pθ′(y1:T) / pθ(y1:T).
- There are efficient methods to estimate pθ(y1:T) unbiasedly, based on particle filters (or sequential Monte Carlo methods).
SMC methods
- Particle filters, or SMC, were initially developed to approximate {p(xt|y1:t), t = 1, . . .} recursively (or p(x1:T|y1:T)).
- The main idea is to propagate through time a cloud of (possibly weighted) samples {xt(i), i = 1, . . . , N} such that their empirical distribution is a good approximation of p(xt|y1:t).
- An unbiased and non-negative estimator of pθ(y1:T) can easily be obtained as a by-product.
Sequential Monte Carlo
Description
for i = 1, . . . , N do
    Sample x1(i) ∼ mθ(·);
    Compute w1(i) ∝ μθ(x1(i)) gθ(x1(i), y1) / mθ(x1(i))
end
for t = 2, . . . , T do
    for i = 1, . . . , N do
        Sample at−1(i) ∼ P(wt−1(1), . . . , wt−1(N)) and xt(i) ∼ Mθ(xt−1(at−1(i)), ·);
        Compute wt(i) = fθ(xt−1(at−1(i)), xt(i)) gθ(xt(i), yt) / Mθ(xt−1(at−1(i)), xt(i))
    end
end
Algorithm 1: SMC(N, Mθ, Aθ)
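A sketch of Algorithm 1 in Python, in the bootstrap case Mθ = fθ (so the weight reduces to the observation density gθ); sample_mu, sample_f and log_g are user-supplied model routines (hypothetical names), and the function returns log p̂θ(y1:T), where p̂θ(y1:T) = Π_t (1/N) Σ_i wt(i) is the standard unbiased by-product under multinomial resampling.

```python
import numpy as np

def bootstrap_pf(y, N, sample_mu, sample_f, log_g, rng=None):
    """Bootstrap SMC: returns log p_hat(y_{1:T}), the log of the unbiased
    likelihood estimate prod_t (1/N) sum_i w_t(i)."""
    rng = np.random.default_rng() if rng is None else rng
    x = sample_mu(N, rng)                   # x_1(i) ~ mu_theta (= m_theta here)
    log_like = 0.0
    for t in range(len(y)):
        if t > 0:
            a = rng.choice(N, size=N, p=w)  # a_{t-1}(i) ~ P(w_{t-1}(1..N))
            x = sample_f(x[a], t, rng)      # x_t(i) ~ f_theta(. | ancestor), t 0-based
        log_w = log_g(x, y[t])              # bootstrap weights g_theta(x_t(i), y_t)
        m = np.max(log_w)
        w = np.exp(log_w - m)
        log_like += m + np.log(np.mean(w))  # log (1/N) sum_i w_t(i)
        w /= np.sum(w)                      # normalise for the next resampling
    return log_like
```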
Sequential Monte Carlo
[Figure: evolution of the particle cloud (state vs. time index, up to T = 25) over the successive sampling, weighting and resampling steps of the algorithm. Picture created by Olivier Cappé.]
Pseudo-marginal approach
- Idea: replace pθ(y1:T) with non-negative "unbiased" estimators p̂θ(y1:T), i.e. such that for some C > 0,
  E[p̂θ(y1:T)] = C pθ(y1:T),  θ ∈ Θ.

Pseudo-marginal MH [Andrieu and Roberts, 2009]
Given θn = θ and p̂θ(y1:T):
- Propose θ′ ∼ q(θ, ·), calculate p̂θ′(y1:T).
- Calculate the acceptance ratio
  r̂(θ, θ′) := [η(θ′) p̂θ′(y1:T) q(θ′, θ)] / [η(θ) p̂θ(y1:T) q(θ, θ′)].
- Set θn+1 = θ′ with probability min{1, r̂(θ, θ′)}, otherwise set θn+1 = θ.
- Pseudo-marginal algorithms are exact approximations of the marginal MH.
Numerical experiments
- Stochastic volatility (toyish) model
  Xt = Xt−1/2 + 25 Xt−1/(1 + Xt−1²) + 8 cos(1.2t) + Vt,  t ≥ 2,
  Yt = Xt²/20 + Wt,  t ≥ 1,
  where X1 ∼ N(0, 10), Vt iid ∼ N(0, σv²), Wt iid ∼ N(0, σw²).
- The static parameter of the model is then θ = (σv², σw²) and was ascribed a prior distribution.
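A sketch of this benchmark, simulating the model and estimating log pθ(y1:T) with the bootstrap_pf sketch above; the seed and the true parameter values are illustrative assumptions.

```python
import numpy as np

def simulate(T, sv2, sw2, rng):
    """Simulate the model above; python index t corresponds to model time t + 1."""
    x = np.empty(T)
    x[0] = rng.normal(0.0, np.sqrt(10.0))
    for t in range(1, T):
        x[t] = (x[t - 1] / 2 + 25 * x[t - 1] / (1 + x[t - 1] ** 2)
                + 8 * np.cos(1.2 * (t + 1)) + rng.normal(0.0, np.sqrt(sv2)))
    y = x ** 2 / 20 + rng.normal(0.0, np.sqrt(sw2), size=T)
    return x, y

def log_like_hat(theta, y, N, rng):
    """log p_hat_theta(y_{1:T}) via the bootstrap filter; theta = (sv2, sw2)."""
    sv2, sw2 = theta
    return bootstrap_pf(
        y, N,
        sample_mu=lambda n, r: r.normal(0.0, np.sqrt(10.0), size=n),
        sample_f=lambda xp, t, r: (xp / 2 + 25 * xp / (1 + xp ** 2)
                                   + 8 * np.cos(1.2 * (t + 1))
                                   + r.normal(0.0, np.sqrt(sv2), size=xp.size)),
        log_g=lambda x, yt: -0.5 * (np.log(2 * np.pi * sw2)
                                    + (yt - x ** 2 / 20) ** 2 / sw2),
        rng=rng)

rng = np.random.default_rng(1)
_, y = simulate(500, sv2=10.0, sw2=1.0, rng=rng)
print(log_like_hat((10.0, 1.0), y, N=500, rng=rng))
```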
I For a generic Markov transition probability Π of invariant distribution π,
I For any f for which the limit exist, we define the asymptotic variance
P
τ := lim Mvarπ M −1 M
i=1 f (θi ) .
M→∞
Results

            AIS MCMC - PGBS      MwG - PGBS           PMMH
            σv²       σw²        σv²       σw²        σv²        σw²
T = 500     16.583    21.424     18.880    32.686     19.433     19.742
T = 1000    15.294    19.787     20.895    24.633     43.657     54.389
T = 2000    17.431    21.680     22.798    26.487     279.009    554.928
T = 3000    18.336    21.689     20.056    32.555     1202.928   1856.514
T = 4000    17.239    19.782     24.127    32.678     2602.677   1916.475

Table: Estimated IAC times for σv² and σw² calculated for the algorithms being compared. All algorithms use N = 500 particles and 1 intermediate step.
What’s wrong with pseudo-marginal algorithms?
I In pseudo-marginal algorithms, one estimates pθ (y1:T ) and pθ0 (y1:T )
independently
rˆ(θ, θ0 ) :=
η(θ0 ) p̂θ0 (y1:T ) q(θ0 , θ)
.
η(θ) p̂θ (y1:T ) q(θ, θ0 )
What’s wrong with pseudo-marginal algorithms?
I In pseudo-marginal algorithms, one estimates pθ (y1:T ) and pθ0 (y1:T )
independently
rˆ(θ, θ0 ) :=
η(θ0 ) p̂θ0 (y1:T ) q(θ0 , θ)
.
η(θ) p̂θ (y1:T ) q(θ, θ0 )
I Variability of the acceptance ratio has a big impact on the performance of
the algorithm [CA & M. Vihola 2014,2015].
What’s wrong with pseudo-marginal algorithms?
I In pseudo-marginal algorithms, one estimates pθ (y1:T ) and pθ0 (y1:T )
independently
rˆ(θ, θ0 ) :=
η(θ0 ) p̂θ0 (y1:T ) q(θ0 , θ)
.
η(θ) p̂θ (y1:T ) q(θ, θ0 )
I Variability of the acceptance ratio has a big impact on the performance of
the algorithm [CA & M. Vihola 2014,2015].
I What is happening here?
I
we are in a situation where for a fixed number of particles the
variability of p̂θ (y1:T ) increases with T ,
What’s wrong with pseudo-marginal algorithms?
I In pseudo-marginal algorithms, one estimates pθ (y1:T ) and pθ0 (y1:T )
independently
rˆ(θ, θ0 ) :=
η(θ0 ) p̂θ0 (y1:T ) q(θ0 , θ)
.
η(θ) p̂θ (y1:T ) q(θ, θ0 )
I Variability of the acceptance ratio has a big impact on the performance of
the algorithm [CA & M. Vihola 2014,2015].
I What is happening here?
I
we are in a situation where for a fixed number of particles the
variability of p̂θ (y1:T ) increases with T ,
I
we notice that due to the independence as θ0 → θ the variability in
p̂θ0 (y1:T )/p̂θ (y1:T ) does not vanish.
What’s wrong with pseudo-marginal algorithms?
I In pseudo-marginal algorithms, one estimates pθ (y1:T ) and pθ0 (y1:T )
independently
rˆ(θ, θ0 ) :=
η(θ0 ) p̂θ0 (y1:T ) q(θ0 , θ)
.
η(θ) p̂θ (y1:T ) q(θ, θ0 )
I Variability of the acceptance ratio has a big impact on the performance of
the algorithm [CA & M. Vihola 2014,2015].
I What is happening here?
I
we are in a situation where for a fixed number of particles the
variability of p̂θ (y1:T ) increases with T ,
I
we notice that due to the independence as θ0 → θ the variability in
p̂θ0 (y1:T )/p̂θ (y1:T ) does not vanish.
I In what follows we motivate the need to introduce dependence between
estimates, in order to ensure that as θ0 → θ the ratio converges to one.
What’s wrong with pseudo-marginal algorithms?
I In pseudo-marginal algorithms, one estimates pθ (y1:T ) and pθ0 (y1:T )
independently
rˆ(θ, θ0 ) :=
η(θ0 ) p̂θ0 (y1:T ) q(θ0 , θ)
.
η(θ) p̂θ (y1:T ) q(θ, θ0 )
I Variability of the acceptance ratio has a big impact on the performance of
the algorithm [CA & M. Vihola 2014,2015].
I What is happening here?
I
we are in a situation where for a fixed number of particles the
variability of p̂θ (y1:T ) increases with T ,
I
we notice that due to the independence as θ0 → θ the variability in
p̂θ0 (y1:T )/p̂θ (y1:T ) does not vanish.
I In what follows we motivate the need to introduce dependence between
estimates, in order to ensure that as θ0 → θ the ratio converges to one.
I We then explain how to estimate LT (θ, θ 0 ) = pθ0 (y1:T )/pθ (y1:T ) directly
and design a correct algorithm.
What’s wrong with pseudo-marginal algorithms?
I In order to simplify discussion, consider the scenario where
pθ (y1:T ) :=
T
Y
t=1
pθ (yt ) =
T ˆ
Y
t=1
X
pθ (xt , yt )dxt
What’s wrong with pseudo-marginal algorithms?
I In order to simplify discussion, consider the scenario where
pθ (y1:T ) :=
T
Y
pθ (yt ) =
t=1
T ˆ
Y
t=1
pθ (xt , yt )dxt
X
I Let P0 be the actual distribution of the observations Y1 , Y2 , . . . and
consider the tractable scenario.
What’s wrong with pseudo-marginal algorithms?
I In order to simplify discussion, consider the scenario where
pθ (y1:T ) :=
T
Y
pθ (yt ) =
t=1
T ˆ
Y
t=1
pθ (xt , yt )dxt
X
I Let P0 be the actual distribution of the observations Y1 , Y2 , . . . and
consider the tractable scenario.
I For θ, θ 0 ∈ Θ given, if
P0 −a.s. as T → ∞
´
log
pθ0 (y )
P (y )dy
P0 (y ) 0
log LT (θ, θ0 ) =
T
X
t=1
log
−
´
log
pθ (y )
P (y )dy
P0 (y ) 0
pθ0 (Yt )
→ ±∞
pθ (Yt )
6= 0 then
What’s wrong with pseudo-marginal algorithms?
I In order to simplify discussion, consider the scenario where
pθ (y1:T ) :=
T
Y
pθ (yt ) =
t=1
T ˆ
Y
t=1
pθ (xt , yt )dxt
X
I Let P0 be the actual distribution of the observations Y1 , Y2 , . . . and
consider the tractable scenario.
I For θ, θ 0 ∈ Θ given, if
P0 −a.s. as T → ∞
´
log
pθ0 (y )
P (y )dy
P0 (y ) 0
log LT (θ, θ0 ) =
T
X
t=1
log
−
´
log
pθ (y )
P (y )dy
P0 (y ) 0
6= 0 then
pθ0 (Yt )
→ ±∞
pθ (Yt )
I As a result the acceptance ratio of the exact algorithm we would want to
implement vanishes or diverges to infinity.
rT (θ, θ0 ) :=
η(θ0 )
q(θ0 , θ)
LT (θ, θ0 )
η(θ)
q(θ, θ0 )
What’s wrong with pseudo-marginal algorithms?
√
I However if θ 0 = θ + / T for some (which corresponds to a random
walk Metropolis)
What’s wrong with pseudo-marginal algorithms?
√
I However if θ 0 = θ + / T for some (which corresponds to a random
walk Metropolis)
I Under some regularity assumptions the likelihood ratio is (with P0 the law
of the observations)
T
√
1
X˙
`θ Yi − 2 V (θ) + oP0 (1)
log LT (θ, θ + / T ) = √
2
T i=1
What’s wrong with pseudo-marginal algorithms?
√
I However if θ 0 = θ + / T for some (which corresponds to a random
walk Metropolis)
I Under some regularity assumptions the likelihood ratio is (with P0 the law
of the observations)
T
√
1
X˙
`θ Yi − 2 V (θ) + oP0 (1)
log LT (θ, θ + / T ) = √
2
T i=1
I We expect a central limit theorem to hold, and the acceptance ratio to
have a “non-degenerate” limit.
What’s wrong with pseudo-marginal algorithms?
√
I However if θ 0 = θ + / T for some (which corresponds to a random
walk Metropolis)
I Under some regularity assumptions the likelihood ratio is (with P0 the law
of the observations)
T
√
1
X˙
`θ Yi − 2 V (θ) + oP0 (1)
log LT (θ, θ + / T ) = √
2
T i=1
I We expect a central limit theorem to hold, and the acceptance ratio to
have a “non-degenerate” limit.
I This type of strategy exploits the smoothness, for any θ ∈ Θ of
θ0 7→
pθ0 (y1:T )
pθ (y1:T )
and the fact that pθ0 (y1:T )/pθ (y1:T ) →θ0 →θ 1 in order to control the
asymptotic behaviour of the ratio.
What’s wrong with pseudo-marginal algorithms?
√
I However if θ 0 = θ + / T for some (which corresponds to a random
walk Metropolis)
I Under some regularity assumptions the likelihood ratio is (with P0 the law
of the observations)
T
√
1
X˙
`θ Yi − 2 V (θ) + oP0 (1)
log LT (θ, θ + / T ) = √
2
T i=1
I We expect a central limit theorem to hold, and the acceptance ratio to
have a “non-degenerate” limit.
I This type of strategy exploits the smoothness, for any θ ∈ Θ of
θ0 7→
pθ0 (y1:T )
pθ (y1:T )
and the fact that pθ0 (y1:T )/pθ (y1:T ) →θ0 →θ 1 in order to control the
asymptotic behaviour of the ratio.
I This does not work if we use independent estimators, p̂θ(1) (y1:T ) and
(2)
p̂θ0 (y1:T ).
Estimate of the likelihood ratio
- A classical way of estimating the likelihood ratio consists of exploiting the identity
  pθ′(y)/pθ(y) = ∫ [pθ′(x, y)/pθ(x, y)] pθ(x | y)dx = ∫ {[pθ′(x | y) pθ′(y)] / [pθ(x | y) pθ(y)]} pθ(x | y)dx.
- Assuming we can sample from X ∼ pθ(· | y), then
  pθ′(X, y)/pθ(X, y) estimates pθ′(y)/pθ(y),
  and further, with sufficient smoothness, the estimator goes to one as θ′ → θ.
- Sampling from pθ(x | y) is an issue, but we could design an algorithm targetting the joint distribution π(θ, x | y) ∝ η(θ)pθ(x, y).
Estimate of the likelihood ratio
- Target π(θ, x | y) ∝ η(θ)pθ(x, y).
- One can show reversibility of the algorithm
  1. Given (θ, x), sample θ′ ∼ q(θ, ·).
  2. With probability
     min{1, [η(θ′)pθ′(x, y)q(θ′, θ)] / [η(θ)pθ(x, y)q(θ, θ′)]}
     jump to (θ′, x).
- x is not updated and we must combine this with an update of x (Metropolis-within-Gibbs idea).
- But we can exploit the continuity of θ′ ↦ pθ′(x, y), and there is hope the algorithm scales.
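A sketch of the θ-update of this Metropolis-within-Gibbs scheme; log_eta and log_p_joint are user-supplied (hypothetical) log densities, and the symmetric Gaussian proposal makes the q-terms cancel.

```python
import numpy as np

def theta_update(theta, x, y, log_eta, log_p_joint, step, rng):
    """One theta-update targeting pi(theta, x | y) ∝ eta(theta) p_theta(x, y),
    holding x fixed; to be alternated with a p_theta(x | y)-invariant x-update."""
    prop = theta + step * rng.standard_normal(np.shape(theta))  # symmetric q
    log_r = (log_eta(prop) + log_p_joint(prop, x, y)
             - log_eta(theta) - log_p_joint(theta, x, y))
    return prop if np.log(rng.uniform()) < log_r else theta
```

The companion x-update can be any pθ(x | y)-invariant move, e.g. the conditional SMC kernel discussed later.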
Estimate of the likelihood ratio
- However for θ̃ ∈ Θ, using the identity twice,
  [pθ̃(y)/pθ(y)] [pθ′(y)/pθ̃(y)] = ∬ [pθ̃(x, y)/pθ(x, y)] [pθ′(x′, y)/pθ̃(x′, y)] pθ(x | y) pθ̃(x′ | y) dx dx′.
  This was originally proposed as a variance reduction technique (think of θ̃ = (θ + θ′)/2) [Crooks, 1998, Jarzynski, 1997, Neal, 2001, Neal, 1996].
- Assume for a moment that sampling from pθ̃(x | y) is possible and let us target π(θ, x | y) ∝ η(θ)pθ(x, y).
  1. Given (θ, x), sample θ′ ∼ q(θ, ·), let θ̃ = (θ + θ′)/2 and sample x′ ∼ pθ̃(· | y).
  2. With probability
     min{1, [η(θ′)pθ′(x′, y)q(θ′, θ)pθ̃(x | y)] / [η(θ)pθ(x, y)q(θ, θ′)pθ̃(x′ | y)]}
     jump to (θ′, x′), otherwise stay at (θ, x).
- The algorithm requires exact sampling from pθ̃(x′ | y)...
Estimate of the likelihood ratio
- Crucially one can show that if Rθ̃ is a Markov transition probability leaving pθ̃(· | y) invariant then also
  [pθ̃(y)/pθ(y)] [pθ′(y)/pθ̃(y)] = ∬ [pθ̃(x, y)/pθ(x, y)] [pθ′(x′, y)/pθ̃(x′, y)] pθ(x | y) Rθ̃(x, x′) dx dx′.
- If further we have reversibility, pθ̃(x | y)Rθ̃(x, x′) = pθ̃(x′ | y)Rθ̃(x′, x), then the following is a valid algorithm to sample from π(θ, x | y) ∝ η(θ)pθ(x, y):
  1. Given (θ, x), sample θ′ ∼ q(θ, ·), let θ̃ = (θ + θ′)/2 and sample x′ ∼ Rθ̃(x, ·).
  2. With probability
     min{1, [η(θ′)pθ′(x′, y)q(θ′, θ)pθ̃(x, y)] / [η(θ)pθ(x, y)q(θ, θ′)pθ̃(x′, y)]}
     jump to (θ′, x′), otherwise stay at (θ, x).
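A sketch of the move just described; R_tilde(θ̃, x, rng) is a user-supplied kernel assumed reversible with respect to pθ̃(· | y) (e.g. a few MH steps on x), and the symmetric Gaussian proposal on θ again makes the q-terms cancel.

```python
import numpy as np

def ais_step(theta, x, y, log_eta, log_p_joint, R_tilde, step, rng):
    """One move targeting pi(theta, x | y) ∝ eta(theta) p_theta(x, y) with a
    single intermediate distribution at theta_tilde = (theta + theta')/2."""
    prop = theta + step * rng.standard_normal(np.shape(theta))
    mid = 0.5 * (theta + prop)
    x_new = R_tilde(mid, x, rng)                  # p_mid(. | y)-reversible move
    # acceptance ratio with the p_tilde joint densities, as on the slide
    log_r = (log_eta(prop) + log_p_joint(prop, x_new, y) + log_p_joint(mid, x, y)
             - log_eta(theta) - log_p_joint(theta, x, y) - log_p_joint(mid, x_new, y))
    if np.log(rng.uniform()) < log_r:
        return prop, x_new
    return theta, x
```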
More general AIS...
- It is possible to generalise this and introduce K intermediate steps,
  - e.g. consider Pθ,θ′,K := {πθ,θ′,k(x), k = 0, . . . , K + 1} with
    πθ,θ′,k(x) := p_{θ×k/(K+1) + θ′×[1−k/(K+1)]}(· | y),
  - and the associated K Markov kernels
    Rθ,θ′,K := {Rθ,θ′,k(·, ·) : X^n × X^{⊗n} → [0, 1], k = 1, . . . , K}.
- What is interesting is that in this case the estimator is of the form
  Π_{k=0..K} πθ,θ′,k+1(xk) / πθ,θ′,k(xk)
  (see the sketch below).
- And with some conditions on Pθ,θ′,K and Rθ,θ′,K the estimator is consistent.
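A sketch of the resulting K-step estimator of log LT; here the path is oriented from θ to θ′ (the slide indexes the same linear path in the opposite direction), R(ϑ, x, rng) is a user-supplied pϑ(· | y)-invariant kernel, and x0 is assumed distributed according to pθ(· | y).

```python
import numpy as np

def ais_log_ratio(theta, theta_p, x0, y, log_p_joint, R, K, rng):
    """AIS estimate of log[p_{theta'}(y) / p_theta(y)] with K intermediate
    distributions on the linear path theta_k = theta + k/(K+1) (theta'-theta)."""
    grid = [theta + (k / (K + 1)) * (theta_p - theta) for k in range(K + 2)]
    x, log_L = x0, 0.0
    for k in range(K + 1):
        # accumulate log pi_{k+1}(x_k) - log pi_k(x_k)
        log_L += log_p_joint(grid[k + 1], x, y) - log_p_joint(grid[k], x, y)
        if k < K:
            x = R(grid[k + 1], x, rng)  # move x_k -> x_{k+1}, pi_{k+1}-invariant
    return log_L
```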
The conditional SMC
- Let πθ(x1:T) be the conditional distribution pθ(x1:T | y1:T).
- The conditional Sequential Monte Carlo algorithm is a πθ−invariant Markov kernel (in fact πθ−reversible).
- As suggested by its name it is a particle based algorithm (say N particles, call it Rθ,N).
- It can be shown to have very good mixing properties [Andrieu et al., 2013] and [Lindsten et al., 2015].
Sequential Monte Carlo
Description
for i = 1, . . . , N do
    If i ≠ 1, sample x1(i) ∼ mθ(·);
    Compute w1(i) ∝ μθ(x1(i)) gθ(x1(i), y1) / mθ(x1(i))
end
for t = 2, . . . , T do
    for i = 1, . . . , N do
        If i ≠ 1, sample at−1(i) ∼ P(wt−1(1), . . . , wt−1(N)) and xt(i) ∼ Mθ(xt−1(at−1(i)), ·);
        Compute wt(i) = fθ(xt−1(at−1(i)), xt(i)) gθ(xt(i), yt) / Mθ(xt−1(at−1(i)), xt(i))
    end
end
Sample kT ∼ P(wT(1), . . . , wT(N)) and set x′T = xT(kT);
for t = T − 1, . . . , 1 do
    Set kt = at(kt+1) and x′t = xt(kt)
end
return x′1:T
Algorithm 2: cSMC(N, x1:T, Mθ, Aθ)
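A compact bootstrap sketch of Algorithm 2; the first particle is clamped to the reference path, resampling preserves its ancestry, and the returned path is drawn by ancestral tracing. Model routines are as in the bootstrap_pf sketch above.

```python
import numpy as np

def norm_w(log_w):
    """Self-normalised weights from log-weights."""
    w = np.exp(log_w - np.max(log_w))
    return w / np.sum(w)

def csmc(x_ref, y, N, sample_mu, sample_f, log_g, rng):
    """Conditional SMC with bootstrap proposals: particle 0 follows the
    reference path x_ref; the move leaves p_theta(x_{1:T} | y_{1:T}) invariant."""
    T = len(y)
    xs = np.empty((T, N))
    anc = np.zeros((T, N), dtype=int)
    xs[0] = sample_mu(N, rng)
    xs[0, 0] = x_ref[0]                    # clamp the reference particle
    w = norm_w(log_g(xs[0], y[0]))
    for t in range(1, T):
        a = rng.choice(N, size=N, p=w)
        a[0] = 0                           # the reference keeps its own ancestor
        xs[t] = sample_f(xs[t - 1][a], t, rng)
        xs[t, 0] = x_ref[t]
        anc[t] = a
        w = norm_w(log_g(xs[t], y[t]))
    k = rng.choice(N, p=w)                 # k_T ~ P(w_T(1), ..., w_T(N))
    path = np.empty(T)
    for t in range(T - 1, -1, -1):         # trace the ancestral line back
        path[t] = xs[t, k]
        k = anc[t, k]
    return path
```

Alternating this x1:T-update with a θ-update such as theta_update above yields a particle Gibbs sampler of the kind compared in the table that follows.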
Numerical experiments
- Stochastic volatility (toyish) model
  Xt = Xt−1/2 + 25 Xt−1/(1 + Xt−1²) + 8 cos(1.2t) + Vt,  t ≥ 2,
  Yt = Xt²/20 + Wt,  t ≥ 1,
  where X1 ∼ N(0, 10), Vt iid ∼ N(0, σv²), Wt iid ∼ N(0, σw²).
- The static parameter of the model is then θ = (σv², σw²) and was ascribed a prior distribution.
- We implemented exact approximations of the random walk Metropolis with q(θ, θ′) = N(θ, σ²/n).
Results

            AIS MCMC - PGBS      MwG - PGBS           PMMH
            σv²       σw²        σv²       σw²        σv²        σw²
T = 500     16.583    21.424     18.880    32.686     19.433     19.742
T = 1000    15.294    19.787     20.895    24.633     43.657     54.389
T = 2000    17.431    21.680     22.798    26.487     279.009    554.928
T = 3000    18.336    21.689     20.056    32.555     1202.928   1856.514
T = 4000    17.239    19.782     24.127    32.678     2602.677   1916.475

Table: Estimated IAC times for σv² and σw² calculated for the algorithms being compared. All algorithms use N = 500 particles and 1 intermediate step.
Preliminary results
- Considered the situation
  pθ(y1:T) := Π_{t=1..T} pθ(yt) = Π_{t=1..T} ∫_X pθ(xt, yt) dxt.
- We make some assumptions including
  1. Θ ⊂ R is compact,
  2. for any (x, y) ∈ X × Y, θ ↦ log pθ(x, y) is three times differentiable with derivatives uniformly bounded in θ, x, y...
- The Markov kernel Rθ,N is a product of Markov kernels each targetting pθ(xt | yt), and N is not necessarily a number of particles.
Some asymptotics
Theorem
Under some assumptions... P−a.s., for any ε0 > 0 there exist T0, N0 ∈ N such that for any T ≥ T0 and any sequence {NT} ∈ N^N with NT ≥ N0 for T ≥ T0,
  sup_{T≥T0} | E^ω_T[min{1, r̃T(θ, ε; ω, ξ)}] − Ě^ω_T[min{1, rT(θ, ε; ω) exp(Z)}] | ≤ ε0
and
  sup_{T≥T0} | E^ω_T[min{1, r̃T(θ, ε; ω, ξ)}²] − Ě^ω_T[min{1, rT(θ, ε; ω) exp(Z)}²] | ≤ ε0,
where
  Z | (θ, ε, ω) ∼ N( −ς²T(θ, ε)/2, ς²T(θ, ε) )
for some ς²T(θ, ε) such that lim_{T→∞} ς²T(θ, ε) exists and is finite!
The end
Thank you!
Andrieu, C., Lee, A., and Vihola, M. (2013). Uniform ergodicity of the iterated conditional SMC and geometric ergodicity of particle Gibbs samplers. arXiv:1312.6432.

Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics, 37(2):697–725.

Crooks, G. (1998). Nonequilibrium measurements of free energy differences for microscopically reversible Markovian systems. Journal of Statistical Physics, 90(5–6):1481–1487.

Jarzynski, C. (1997). Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach. Phys. Rev. E, 56:5018–5035.

Neal, R. (1996). Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4):353–366.

Neal, R. (2001). Annealed importance sampling. Statistics and Computing, 11(2):125–139.